```{r setup, include=FALSE}
opts_chunk$set(cache=TRUE)
```

Web Scraping part 3: Scaling up
========================================================
width: 1200
author: Rolf Fredheim and Aiora Zabala
date: University of Cambridge
font-family: 'Rockwell'
04/03/2014

Catch up
==================
Slides from week 1: http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html

Slides from week 2: http://quantifyingmemory.blogspot.com/2014/02/web-scraping-part2-digging-deeper.html

Today we will:
========================================================
- Revision: loop over a scraper
- Collect relevant URLs
- Download files
- Look at copyright issues
- Basic text manipulation in R
- Introduce APIs (subject of the final session)

Get the docs:

http://fredheir.github.io/WebScraping/Lecture3/p3.html

http://fredheir.github.io/WebScraping/Lecture3/p3.Rpres

http://fredheir.github.io/WebScraping/Lecture3/p3.r

Digital data collection
=======================
- **Devise a means of accessing data**
- Retrieve that data
- **Tabulate and store the data**

Today we focus less on interacting with JSON or HTML, and more on automating and repeating.

Using a scraper
===============

In code
===================
Remember this from last week?

```{r}
bbcScraper <- function(url){
  SOURCE <- getURL(url,encoding="UTF-8")
  PARSED <- htmlParse(SOURCE,encoding="UTF-8")
  title=xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue)
  date=as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  if (is.null(date)) date <- NA
  if (is.null(title)) title <- NA
  return(c(title,date))
}
```

We want to scale up
======================

How?
============
- Loop and rbind
- sapply
- sink into database

sqlite in R: http://sandymuspratt.blogspot.co.uk/2012/11/r-and-sqlite-part-1.html

Loop
==============
type:sq
```{r}
require(RCurl)
require(XML)

urls <- c("http://www.bbc.co.uk/news/business-26414285",
          "http://www.bbc.co.uk/news/uk-26407840",
          "http://www.bbc.co.uk/news/world-asia-26413101",
          "http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")

results=NULL
for (url in urls){
  newEntry <- bbcScraper(url)
  results <- rbind(results,newEntry)
}
data.frame(results) #ignore the warning
```

Still to do: fix the column names.

Disadvantage of loop
=====================
type:sq2
left:80
Copying data this way is inefficient:

```{r}
temp=NULL
#1st loop
temp <- rbind(temp,results[1,])
temp
#2nd loop
temp <- rbind(temp,results[2,])
#3rd loop
temp <- rbind(temp,results[3,])
#4th loop
temp <- rbind(temp,results[4,])
temp
```

=============
In each case we are copying the whole table in order to add a single line.

- this is slow
- we need to keep two copies in memory (meaning we can only ever use at most half of the computer's RAM)

sapply
===========
A bit more efficient. Takes a vector and applies a function to each item in the vector:

```{r}
dat <- c(1,2,3,4,5)
sapply(dat,function(x) x*2)
```

syntax: sapply(data,function)

function: can be your own function, or a standard one:

```{r}
sapply(dat,sqrt)
```

In our case:
========================
type:sq
```{r}
urls <- c("http://www.bbc.co.uk/news/business-26414285",
          "http://www.bbc.co.uk/news/uk-26407840",
          "http://www.bbc.co.uk/news/world-asia-26413101",
          "http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")
sapply(urls,bbcScraper)
```

We don't really want the data in this format. We can reshape it (see the sketch below), or use a related function (next slide).
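If you only want to tidy the sapply output, a minimal sketch (reusing the `urls` vector and `bbcScraper` from above; left as `eval=F` so it is not run while knitting) is to transpose the matrix and name the columns. This also deals with the column names left over from the loop earlier.

```{r eval=F}
# sapply returns a character matrix with one column per article;
# transpose it so each article becomes a row, then name the columns
res <- sapply(urls,bbcScraper)
df <- data.frame(t(res),stringsAsFactors=FALSE)
colnames(df) <- c("title","date")
df
```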
=================
ldply: for each element of a list, apply a function, then combine the results into a data frame.

```{r}
require(plyr)
dat <- ldply(urls,bbcScraper)
dat
```

Task
===============
type:section
- Scrape ten BBC news articles
- Put them in a data frame

Link harvesting
====================
Entering URLs by hand is tedious. We can speed this up by automating the collection of links.

How can we find relevant links?

- ?
- Scraping URLs in a search result
- those on the front page

Search result
============
Go to the page, make a search, and repeat that query in R. Press 'next page' to get the pagination pattern, e.g.:

http://www.bbc.co.uk/search/news/?page=2&q=Russia

Task
==========
type:section
Can you collect the links from that page?

Can you restrict it to the search results (find the right div)?

One solution
================
type:sq1
All links:

```{r results="hide"}
url="http://www.bbc.co.uk/search/news/?page=3&q=Russia"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
xpathSApply(PARSED, "//a/@href")
```

Filtered links:

```{r results="hide"}
unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
#OR
#xpathSApply(PARSED, "//div[@id='news content']/@href")
```

Why 'unique'?

Create a table
=====================
type:sq1
```{r}
require(plyr)
targets <- unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
results <- ldply(targets[1:5],bbcScraper) #limiting it to the first five links
results
```

Scale up further
==============
Account for pagination by writing a scraper that searches:

http://www.bbc.co.uk/search/news/?page=3&q=Russia

http://www.bbc.co.uk/search/news/?page=4&q=Russia

http://www.bbc.co.uk/search/news/?page=5&q=Russia

etc.

hint: use paste or paste0 (a sketch of building the URLs follows).

Let's not do this in class though, and give the BBC's servers some peace.
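Purely as a sketch of the URL-building step (nothing is requested or downloaded here), paste0 vectorises over the page numbers:

```{r}
# Build the paginated search URLs; no pages are fetched at this point
pages <- 1:5
searchUrls <- paste0("http://www.bbc.co.uk/search/news/?page=",pages,"&q=Russia")
searchUrls
```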
Copyright
======================
type:sq
Reading
- http://about.bloomberglaw.com/practitioner-contributions/legal-issues-raised-by-the-use-of-web-crawling-and-scraping-tools-for-analytics-purposes/
- http://www.theguardian.com/media-tech-law/tangled-web-of-copyright-law
- http://matthewsag.com/googlebooks-decision-fair-use/
- http://www.bbc.co.uk/news/technology-26187730
- http://matthewsag.com/anotherbestpracitcescode/
- http://www.ipo.gov.uk//response-2011-copyright-final.pdf
- http://www.arl.org/storage/documents/publications/code-of-best-practices-fair-use.pdf

> One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles. This guy is now free on bond, but faces dozens of years in prison and a $1 million fine if convicted.

===================
> PubMed Central UK has strong provisions against automated and systematic download of articles:

> Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the UKPMC web site. Bulk downloading of articles from the main UKPMC web site, in any way, is prohibited because of copyright restrictions.

--http://www.technollama.co.uk/wp-content/uploads/2013/04/Data-Mining-Paper.pdf

Rough guidelines
=======================
type:sq
I don't know IP law, so all of the below are no more than guidelines.

Most legal cases relate to the copying and republishing of information for profit (see article 1 above).

- Don't cause damage (through republishing, taking down servers, or preventing profit)
- Make sure your use is 'transformative'
- Read the site's terms and conditions

In 2013 Google Books was found to be 'fair use'
- The judgement is good news for scholars and libraries

Archiving and downloading
- databases can be used to 'facilitate non-consumptive research'

A distinction is made between freely available content and content behind a paywall, requiring users to accept conditions, etc.

Downloading
===========
type:sq
Don't download loads of journal articles. Just don't.

If you need/want to download something requiring authentication, be careful.
- attach a cookie
- Packages: httr

There are many resources on how to do this, and many situations in which it is totally OK. It is not hard, but often quite shady. If you need this for your work, consider the explanations below, and exercise restraint.

- http://stackoverflow.com/questions/13204503
- http://stackoverflow.com/questions/10213194
- http://stackoverflow.com/questions/16118140
- http://stackoverflow.com/questions/15853204
- http://stackoverflow.com/questions/19074359
- http://stackoverflow.com/questions/8510528
- http://stackoverflow.com/questions/9638451
- http://stackoverflow.com/questions/2388974

Principles of downloading
=====================
Function: download.file(url,destfile)

destfile = filename on disk

option: mode="wb"

Often downloaded files are corrupted. Setting mode="wb" ("write binary") stops "\n" conversions causing havoc, e.g. with image files.

Example
===============
http://lib.ru is a library of copyright-free Russian-language works (like Project Gutenberg).

This is a 1747 translation of Hamlet into Russian. Run this in your R console:

```{r, eval=F}
url <- "http://lib.ru/SHAKESPEARE/hamlet8.pdf"
download.file(url,"hamlet.pdf",mode="wb")
```

Navigate to your working directory and you should find the PDF there (or check from within R, as sketched below).
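To confirm the download worked without leaving R (assuming the hamlet.pdf example above has been run), one quick sanity check is:

```{r eval=F}
file.exists("hamlet.pdf")      # TRUE if the file was created in the working directory
file.info("hamlet.pdf")$size   # size in bytes; a very small file suggests a failed download
```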
Automating downloads
=================
type:sq1
As with the newspaper articles, downloads are facilitated by getting the right links. Let's search for pdf files and download the first ten results:

```{r}
url <- "http://lib.ru/GrepSearch?Search=pdf"
SOURCE <- getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-Latin characters
PARSED <- htmlParse(SOURCE)
links <- xpathSApply(PARSED, "//a/@href")
links[grep("pdf",links)][1]
links <- paste0("http://lib.ru",links[grep("pdf",links)])
links[1]
```

Can you write a loop to download the first ten links?

Solutions
========================
```{r eval=F}
require(stringr)
for (i in 1:10){
  parts <- unlist(str_split(links[i],"/"))
  outName <- parts[length(parts)]
  print(outName)
  download.file(links[i],outName,mode="wb")
}
```

String manipulation in R
==============
type:sq2
The hardest part of the task above: meaningful file names. Done using str_split.

Top string manipulation functions:
- grep
- gsub
- str_split (library: stringr)
- paste
- nchar
- tolower (also toupper, capitalize)
- str_trim (library: stringr)
- Encoding (see the reading below)

Reading:
- http://en.wikibooks.org/wiki/R_Programming/Text_Processing
- http://chemicalstatistician.wordpress.com/2014/02/27/useful-functions-in-r-for-manipulating-text-data/

What do they do: grep
=====================
type:sq1
Grep + regex: find stuff

```{r}
grep("SHAKESPEARE",links)
links[grep("SHAKESPEARE",links)]
#or:
grep("SHAKESPEARE",links,value=T)
```

Grep 2
============
type:sq
Useful options:
- invert=T : get all non-matches
- ignore.case=T : what it says on the box
- value=T : return values rather than positions

Especially good with regex for partial matches:

```{r}
grep("hamlet*",links,value=T)[1]
```

Regex
========
Check out
- ?regex
- http://www.rexegg.com/regex-quickstart.html

Can match the beginning or end of a word, e.g.:

```{r}
grep("stalin",c("stalin","stalingrad"),value=T)
grep("stalin\\b",c("stalin","stalingrad"),value=T)
```

What do they do: gsub
=====================
```{r}
author <- "By Rolf Fredheim"
gsub("By ","",author)
gsub("Rolf Fredheim","Tom",author)
```

Gsub can also use regex.

str_split
==============
type:sq
- Manipulating URLs
- Editing time stamps, etc.

syntax: str_split(inputString,pattern)

Returns a list:

```{r}
require(stringr)
str_split(links[1],"/")
unlist(str_split(links[1],"/"))
```

We wanted the last element (the file name):

```{r}
parts <- unlist(str_split(links[1],"/"))
length(parts)
parts[length(parts)]
```

The rest
============
type:sq1
- nchar
- tolower (also toupper)
- str_trim (library: stringr)

```{r}
annoyingString <- "\n something HERE \t\t\t"
```

***

```{r}
nchar(annoyingString)
str_trim(annoyingString)
tolower(str_trim(annoyingString))
nchar(str_trim(annoyingString))
```

Formatting dates
============
type:sq1
Easiest way to read in dates:

```{r}
require(lubridate)
as.Date("2014-01-31")
date <- as.Date("2014-01-31")
str(date)
```

Correctly formatted dates can be added and subtracted:

***

```{r}
date <- as.Date("2014-01-31")
str(date)
date+years(1)
date-months(6)
date-days(1110)
```

============
type:sq1
The full way to enter dates:

```{r}
as.Date("2014-01-31","%Y-%m-%d")
```

The funny characters preceded by percentage signs denote date formatting:

- %Y = 4 digits, 2004
- %y = 2 digits, 04
- %m = month
- %d = day
- %b = month in characters

==============
Most of the time you won't need this. But what about:

```{r}
date <- "04 March 2014"
as.Date(date,"%d %b %Y")
```

Probably worth spelling out, as Brits tend to write dates d-m-y, while Americans prefer m-d-y. Confusion is possible for the first 12 days of every month (see the sketch below).
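To make that ambiguity concrete, here is a minimal sketch: the same string parses to two different dates depending on the format supplied.

```{r}
ambiguous <- "04/03/2014"
as.Date(ambiguous, "%d/%m/%Y")  # read as 4 March 2014 (British convention)
as.Date(ambiguous, "%m/%d/%Y")  # read as 3 April 2014 (American convention)
```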
Lubridate
========
type:sq
Reading: http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/

Lubridate makes this much, much easier:

```{r}
require(lubridate)
date <- "04 March 2014"
dmy(date)

time <- "04 March 2014 16:10:00"
dmy_hms(time,tz="GMT")

time2 <- "2014/03/01 07:44:22"
ymd_hms(time2,tz="GMT")
```

Task
==================
type:sq1
Last week we wrote a scraper for the Telegraph:

```{r eval=F}
url <- 'http://www.telegraph.co.uk/news/uknews/terrorism-in-the-uk/10659904/Former-Guantanamo-detainee-Moazzam-Begg-one-of-four-arrested-on-suspicion-of-terrorism.html'
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1[@itemprop='headline name']",xmlValue)
author <- xpathSApply(PARSED, "//p[@class='bylineBody']",xmlValue)
```

author: "\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter"

time: "1:30PM GMT 25 Feb 2014"

With the functions above, rewrite the scraper to correctly format the dates.

Blank slide
==============

Possible solution
===================
type:sq1
```{r}
author <- "\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter"
a1 <- str_trim(author)
a2 <- gsub("By ","",a1)
a3 <- unlist(str_split(a2,","))[1]
a3

time <- "1:30PM GMT 25 Feb 2014"
t <- unlist(str_split(time,"GMT "))[2]
dmy(t)
```
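To reuse these steps inside a scraper, they can be wrapped into small helpers. The function names below are made up for illustration; the bodies simply repeat the cleanup above (stringr and lubridate are assumed to be loaded already).

```{r}
# Hypothetical helpers wrapping the cleanup steps above
cleanTelegraphAuthor <- function(author){
  a <- str_trim(author)            # strip the \r\n\t padding
  a <- gsub("By ","",a)            # drop the leading 'By '
  unlist(str_split(a,","))[1]      # keep only the name before the comma
}

cleanTelegraphDate <- function(time){
  dmy(unlist(str_split(time,"GMT "))[2])  # keep the part after 'GMT ' and parse it
}

cleanTelegraphAuthor("\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter")
cleanTelegraphDate("1:30PM GMT 25 Feb 2014")
```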
Intro to APIs
=======================
type:sq1
Last week we looked at the Guardian's social media buttons. How do these work?

http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky

Go to the source by right-clicking, select 'view page source', and go to line 368.

Open this JavaScript in a new browser window. About twenty lines down there is the helpful comment:

> Look for facebook share buttons and get share count from Facebook GraphAPI

This is followed by some jQuery statements and an Ajax call.

Ajax
==================
From Wikipedia:

> With Ajax, web applications can send data to, and retrieve data from, a server asynchronously (in the background) without interfering with the display and behavior of the existing page. Data can be retrieved using the XMLHttpRequest object. Despite the name, the use of XML is not required (JSON is often used instead).

How does this work?
================
- You load the webpage
- There is a script in the code
- As well as placeholders: empty fields
- The script runs, and executes the Ajax call
- This connects, in this case, with the Facebook API
- The API returns data about the page from Facebook's servers
- The jQuery syntax interprets the JSON and fills in the blanks in the HTML
- The user sees the number of shares

Problem: when we download the page, we see only the first three steps.

Solution: intercept the Ajax call, or go straight to the source.

APIs
================
type:sq1
> When used in the context of web development, an API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.

> The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data between communities and applications. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web

--Wikipedia

All the social buttons script does is access, in turn, the Facebook, Twitter, Google, Pinterest and LinkedIn APIs, collect the data, and paste it into the website.

How does this work
==================
Here is the JavaScript code:

> var fqlQuery = 'select share_count,like_count from link_stat where url="' + url + '"'

> queryUrl = 'http://graph.facebook.com/fql?q='+fqlQuery+'&callback=?';

Can we work with that? You bet.

Code translated to R:
=================
type:sq1
```{r}
fqlQuery='select share_count,like_count,comment_count from link_stat where url="'
url="http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"') #ignoring the callback part
lookUp <- URLencode(queryUrl) #What do you think this does?
lookUp
```

Paste that into your browser (lose the quotation marks!), and what do we find...?

Retrieving data
==============
type:sq
Our old pal JSON. This should look familiar by now.

Why not try it for a few other articles: this works for any URL. Here's how to check the stats for the slides from our first session:

```{r}
require(rjson)
url="http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"') #ignoring the callback part
lookUp <- URLencode(queryUrl)
rd <- readLines(lookUp, warn=FALSE)
dat <- fromJSON(rd)
dat
```

Accessing the numbers
=========
type:sq1
Here's how we grab the numbers from the list:

```{r}
dat$data[[1]]$like_count
dat$data[[1]]$share_count
dat$data[[1]]$comment_count
```

Pretty modest.

Task
======================
type:section
If there's any time left in class, why not:
- write a scraper to find which of the articles from the BBC earlier have been shared the most (a possible sketch follows the final slide)

Finally
==================
type:sq
APIs are really useful: we don't (normally!) have to worry about terms of use.

Next week I'll bring a few along, and we'll spend most of class looking at writing scrapers for them. Examples:
- Maps
- Cricket scores
- YouTube
- Lyrics
- Weather
- Stock market (ticker) info
- etc. etc.
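For reference, one possible sketch for the in-class task above. It reuses the `fqlQuery` string and the BBC `urls` vector from earlier; the Graph API endpoint and response format are as shown in the slides and may change or be rate-limited, so the chunk is left unevaluated.

```{r eval=F}
# Look up the Facebook share count for each BBC article collected earlier
shareCount <- function(u){
  queryUrl <- paste0('http://graph.facebook.com/fql?q=',fqlQuery,u,'"')
  dat <- fromJSON(readLines(URLencode(queryUrl),warn=FALSE))
  dat$data[[1]]$share_count
}

shares <- sapply(urls,shareCount)
sort(shares,decreasing=TRUE)  # most-shared articles first
```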