How do I download the source code for a webpage with tags created by JavaScript? [closed]

How can I use R to download the source code of a webpage whose tags were created by JavaScript?

When I use the Firefox ‘Inspect Element’ function, I sometimes see tags that do not appear in the HTML source file. In other words, the information I need is generated by the JavaScript code. Is there a way to read this information into R?

Related question: How to view webpage source code using R?



Solution 1:[1]

You can use getURL from RCurl to get the body of the HTTP response as a single string.

library(RCurl)
address <- "https://discussions.apple.com/thread/4356115?tstart=0"
txt <- getURL(address)

Now you can split the string on the opening tag, then split the result on the closing tag:

ss <- strsplit(txt, "<strong class=\"jive-thread-reply-message-correct-label\">")[[1]]
strsplit(ss[2], "</strong>")[[1]][1]

Which gives:

[1] "This solved my question"

It turns out that the tag you wanted appears more than once in the page, and the above gets the wrong one. I don't know how to do it purely in R, but I followed the post by VitoshKa that you referenced and got it to work.
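To check how many times the tag occurs before picking one, something like the following sketch may help. The `txt` string here is a made-up stand-in for the downloaded page, not the real response, and the label texts are invented for illustration.

```r
# Hypothetical stand-in for the downloaded page; the real `txt` comes from getURL().
txt <- paste0(
  "<div><strong class=\"jive-thread-reply-message-correct-label\">First label</strong></div>",
  "<div><strong class=\"jive-thread-reply-message-correct-label\">Second label</strong></div>"
)

open_tag <- "<strong class=\"jive-thread-reply-message-correct-label\">"

# Splitting on the opening tag gives one leading piece plus one piece per occurrence.
pieces <- strsplit(txt, open_tag, fixed = TRUE)[[1]]
n_occurrences <- length(pieces) - 1

# In every piece after the first, keep the text up to the closing tag.
labels <- vapply(
  pieces[-1],
  function(p) strsplit(p, "</strong>", fixed = TRUE)[[1]][1],
  character(1),
  USE.NAMES = FALSE
)

print(n_occurrences)
print(labels)
```

Using `fixed = TRUE` treats the tag as a literal string rather than a regular expression, which avoids surprises if the tag contains regex metacharacters.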

First, in Firefox, go to Tools -> Add-ons. Search for and install MozRepl. Then, in Firefox, click Tools -> MozRepl -> Start.

Now, in R:

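# Connect to MozRepl (it must already be started in Firefox)
mz <- socketConnection("localhost", port = 4242)
# Open the page in a new browser window so that its JavaScript runs
writeLines("var w=window.open(\"https://discussions.apple.com/thread/4356115?tstart=0\")\n", mz)
out <- readLines(mz) # empty the buffer
# Ask the browser for the rendered HTML, including tags created by JavaScript
writeLines("w.document.getElementsByTagName('html')[0].innerHTML\n", mz)
out <- readLines(mz)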

(loc <- grep("jive-thread-reply-message-correct-label", out))
#[1] 1150 2845

Now, out is a character vector with one element per line of the page, and loc holds the positions of the strings that contain your tag. It appears twice. The first one is the one you're interested in.

out[loc[1]]

You can extract the information from this the same way I showed above with strsplit, or with a regular expression and gsub.
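As a rough sketch of the regular-expression route, assuming the line looks like the made-up `line` below (it stands in for out[loc[1]] and is not the real page content):

```r
# Hypothetical line standing in for out[loc[1]].
line <- "  <strong class=\"jive-thread-reply-message-correct-label\">Example label</strong>  "

# Capture whatever sits between the opening and closing tag.
# (.*?) is a non-greedy group, written in PCRE syntax and enabled with perl = TRUE.
pattern <- ".*<strong class=\"jive-thread-reply-message-correct-label\">(.*?)</strong>.*"

# Replace the whole line with just the captured group.
label <- gsub(pattern, "\\1", line, perl = TRUE)
print(label)
```

If the pattern does not match, gsub returns the line unchanged, so it is worth checking the result before relying on it.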


You can close the window that opens with writeLines("w.window.close()", mz)

Solution 2:[2]

You would have to run a full JavaScript interpreter on the HTML.

You can use Rhino. It will be slow.

Otherwise you will need to drive a browser, as Selenium RC does. (You can use the Selenium .NET libraries.)

You would be better off figuring out what the JavaScript does by inspection than scraping naively.

Also, learn XPath queries if you are serious about scraping.
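For a sense of what an XPath query looks like from R, here is a minimal sketch. It assumes the xml2 package is installed (the answer above does not name a package), and the HTML string is invented for illustration.

```r
library(xml2)  # assumption: xml2 is installed; the original answer names no package

# Made-up HTML standing in for a page that has already been rendered.
html <- "<html><body>
  <div><strong class=\"jive-thread-reply-message-correct-label\">Example label</strong></div>
  <div><strong class=\"other\">Ignore me</strong></div>
</body></html>"

doc <- read_html(html)

# XPath: every <strong> element whose class attribute equals the label class.
nodes <- xml_find_all(doc, "//strong[@class='jive-thread-reply-message-correct-label']")
labels <- xml_text(nodes)
print(labels)
```

Note that this only parses the HTML it is given and does not run any JavaScript, so for tags created by JavaScript it would have to be applied to HTML taken from a browser or interpreter after rendering.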

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2 Byron Whitlock