Web Scraping part 3: Scaling up

Rolf Fredheim and Aiora Zabala
University of Cambridge

04/03/2014

Today we will:

  • Revision: loop over a scraper
  • Collect relevant URLs
  • Download files
  • Look at copyright issues
  • Basic text manipulation in R
  • Introduce APIs (subject of the final session)

Get the docs: http://fredheir.github.io/WebScraping/Lecture3/p3.html

http://fredheir.github.io/WebScraping/Lecture3/p3.Rpres

http://fredheir.github.io/WebScraping/Lecture3/p3.r

Digital data collection

  • Devise a means of accessing data
  • Retrieve that data
  • Tabulate and store the data

Today we focus less on interacting with JSON or HTML; more on automating and repeating

Using a scraper

In code

Remember this from last week?

bbcScraper <- function(url){
  SOURCE <- getURL(url, encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  title <- xpathSApply(PARSED, "//h1[@class='story-header']", xmlValue)
  date <- as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  if (length(date) == 0)  date <- NA   # as.character(NULL) is character(0), so test length rather than is.null
  if (length(title) == 0) title <- NA
  return(c(title, date))
}

We want to scale up

How?

  • Loop and rbind
  • sapply
  • sink into database

sqlite in R: http://sandymuspratt.blogspot.co.uk/2012/11/r-and-sqlite-part-1.html

Loop

require(RCurl)
require(XML)
urls <- c("http://www.bbc.co.uk/news/business-26414285",
          "http://www.bbc.co.uk/news/uk-26407840",
          "http://www.bbc.co.uk/news/world-asia-26413101",
          "http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")
results <- NULL
for (url in urls){
  newEntry <- bbcScraper(url)
  results <- rbind(results,newEntry)
}
data.frame(results) #ignore the warning
                                                       X1
1 Russian rouble hits new low against the dollar and euro
2                    'Two Together Railcard' goes on sale
3            Australia: Snake eats crocodile after battle
4   Missing Megan Roberts: Police find body in River Ouse
                   X2
1 2014/03/03 16:06:54
2 2014/03/03 02:02:11
3 2014/03/03 08:34:05
4 2014/03/03 06:59:50

Still to do: fix column names
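
One way to do that (a minimal sketch; the names "title" and "date" are my own choice):

dat <- data.frame(results, stringsAsFactors=FALSE)
colnames(dat) <- c("title", "date")   # give the columns meaningful names
dat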

Disadvantage of loop

Copying data this way is inefficient:

temp=NULL
#1st loop
temp <- rbind(temp,results[1,])
temp
     [,1]                                                     
[1,] "Russian rouble hits new low against the dollar and euro"
     [,2]                 
[1,] "2014/03/03 16:06:54"

#2nd loop
temp <- rbind(temp,results[2,])

#3rd loop
temp <- rbind(temp,results[3,])

#4th loop
temp <- rbind(temp,results[4,])
temp
     [,1]                                                     
[1,] "Russian rouble hits new low against the dollar and euro"
[2,] "'Two Together Railcard' goes on sale"                   
[3,] "Australia: Snake eats crocodile after battle"           
[4,] "Missing Megan Roberts: Police find body in River Ouse"  
     [,2]                 
[1,] "2014/03/03 16:06:54"
[2,] "2014/03/03 02:02:11"
[3,] "2014/03/03 08:34:05"
[4,] "2014/03/03 06:59:50"

In each case we are copying the whole table in order to add a single line.

  • This is slow
  • We need to keep two copies in memory (so we can only ever use at most half of the computer's RAM)
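
One common workaround (a sketch, not from the original slides) is to collect each result in a pre-allocated list and bind everything together once at the end, so the full table is copied only once; the sapply/ldply approaches below get around the problem in a similar way:

results_list <- vector("list", length(urls))   # one empty slot per URL
for (i in seq_along(urls)){
  results_list[[i]] <- bbcScraper(urls[i])     # fill the slot; no copying of earlier rows
}
results <- do.call(rbind, results_list)        # a single rbind at the very end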

sapply

A bit more efficient.

sapply takes a vector and applies a function to each item in the vector:

dat <- c(1,2,3,4,5)
sapply(dat,function(x) x*2)
[1]  2  4  6  8 10

Syntax: sapply(data, function). The function can be your own, or a standard one:

sapply(dat,sqrt)
[1] 1.000 1.414 1.732 2.000 2.236

In our case:

urls <- c("http://www.bbc.co.uk/news/business-26414285",
          "http://www.bbc.co.uk/news/uk-26407840",
          "http://www.bbc.co.uk/news/world-asia-26413101",
          "http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")

sapply(urls,bbcScraper)
     http://www.bbc.co.uk/news/business-26414285              
[1,] "Russian rouble hits new low against the dollar and euro"
[2,] "2014/03/03 16:06:54"                                    
     http://www.bbc.co.uk/news/uk-26407840 
[1,] "'Two Together Railcard' goes on sale"
[2,] "2014/03/03 02:02:11"                 
     http://www.bbc.co.uk/news/world-asia-26413101 
[1,] "Australia: Snake eats crocodile after battle"
[2,] "2014/03/03 08:34:05"                         
     http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963
[1,] "Missing Megan Roberts: Police find body in River Ouse"           
[2,] "2014/03/03 06:59:50"                                             

We don't really want the data in this format. We can reshape it, or use a related function:

ldply: for each element of a list (or vector), apply a function, then combine the results into a data frame.

require(plyr)
dat <- ldply(urls,bbcScraper)
dat
                                                       V1
1 Russian rouble hits new low against the dollar and euro
2                    'Two Together Railcard' goes on sale
3            Australia: Snake eats crocodile after battle
4   Missing Megan Roberts: Police find body in River Ouse
                   V2
1 2014/03/03 16:06:54
2 2014/03/03 02:02:11
3 2014/03/03 08:34:05
4 2014/03/03 06:59:50

Task

  • Scrape ten BBC news articles
  • Put them in a data frame
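
A minimal sketch of one possible solution, assuming bbcUrls is a placeholder name for a character vector holding the ten article URLs you have chosen:

require(plyr)
# bbcUrls: ten BBC article URLs of your choosing (placeholder name)
dat <- ldply(bbcUrls, bbcScraper)
colnames(dat) <- c("title", "date")
dat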

Link harvesting

Entering URLs by hand is tedious

We can speed this up by automating the collection of links

How can we find relevant links?

  • ?
  • Scraping the URLs in a search result
  • Scraping those on the front page

Search result

Go to the page, make a search, then repeat that query in R.

Press 'next page' to see the pagination pattern. E.g.: http://www.bbc.co.uk/search/news/?page=2&q=Russia

Task

Can you collect the links from that page? Can you restrict it to the search results (find the right div)?

One solution

All links

url="http://www.bbc.co.uk/search/news/?page=3&q=Russia"
SOURCE <-  getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
xpathSApply(PARSED, "//a/@href")

Filtered links:

unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
#OR xpathSApply(PARSED, "//div[@id='news content']/@href")

Why 'unique'?


Create a table

require(plyr)
targets <- unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
results <- ldply(targets[1:5],bbcScraper) #limit to the first five links
results
                                                               V1
1         Ukraine crisis: Obama warns Russia against intervention
2    Russian 'invasion', plea to Boris and Susanna Reid in papers
3                             The day I enraged Viktor Yanukovych
4            Ukraine accuses Russia of deploying troops in Crimea
5 Cygnet left in Gloucestershire after parents migrate without it
                   V2
1 2014/03/01 07:44:22
2 2014/03/01 05:49:12
3 2014/03/01 01:17:16
4 2014/02/28 22:47:30
5 2014/02/28 21:14:41

Scale up further

Account for pagination by writing a scraper that searches:

http://www.bbc.co.uk/search/news/?page=3&q=Russia
http://www.bbc.co.uk/search/news/?page=4&q=Russia
http://www.bbc.co.uk/search/news/?page=5&q=Russia

etc.

Hint: use paste or paste0 (a sketch of just the URL construction follows below).

Let's not actually run this in class though, and give the BBC's servers some peace.
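
A sketch of how the search URLs might be built with paste0 (we only construct the strings here and don't fetch anything, to keep the promise above; this snippet is not from the original slides):

pages <- 3:5
searchUrls <- paste0("http://www.bbc.co.uk/search/news/?page=", pages, "&q=Russia")
searchUrls
# each of these could then be passed through getURL/htmlParse and the
# link-harvesting xpathSApply call shown earlier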

Copyright

PubMed Central UK has strong provisions against automated and systematic download of articles: "Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the UKPMC web site. Bulk downloading of articles from the main UKPMC web site, in any way, is prohibited because of copyright restrictions."

http://www.technollama.co.uk/wp-content/uploads/2013/04/Data-Mining-Paper.pdf

Rough guidelines

I don't know IP law, so all of the below are no more than guidelines.

Most legal cases relate to copying and republishing information for profit (see the article linked above)

  • Don't cause damage (through republishing, taking down servers, prevention of profit)
  • Make sure your use is 'transformative'
  • Read the terms and conditions

In 2013 Google Books was found to be 'fair use'

  • The judgement is good news for scholars and libraries

Archiving and downloading

  • Databases can be used to 'facilitate non-consumptive research'

A distinction is made between freely available content and content that is behind a paywall, requires you to accept conditions, etc.

Downloading

Don't download loads of journal articles. Just don't.

If you need/want to download something requiring authentication, be careful.

  • Attach a cookie to your request (package: httr)

There are many resources on how to do this, and many situations in which it is totally OK. It is not hard, but often quite shady. If you need it for your work, read up on it and exercise restraint.

Principles of downloading

Function: download.file(url,destfile)

destfile = filename on disk

Option: mode="wb"

Downloaded files are often corrupted. Setting mode="wb" (write binary) stops "\n" conversions from causing havoc, e.g. with image files.

Example

http://lib.ru

Library of Russian language copyright-free works. (like Gutenberg)

This is a 1747 translation of Hamlet into Russian

Run this in your R console:

url <- "http://lib.ru/SHAKESPEARE/hamlet8.pdf"
download.file(url,"hamlet.pdf",mode="wb")

Navigate to your working directory and you should find the PDF there.
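
You can also confirm from within R that the file arrived (a quick check, not part of the original slides):

file.exists("hamlet.pdf")      # TRUE if the download created the file
file.info("hamlet.pdf")$size   # a non-trivial size suggests it worked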

Automating downloads

As with the newspaper articles, downloads are facilitated by getting the right links. Let's search for pdf files and download the first ten results:

url <- "http://lib.ru/GrepSearch?Search=pdf"

SOURCE <-  getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
PARSED <- htmlParse(SOURCE)
links <- (xpathSApply(PARSED, "//a/@href"))
links[grep("pdf",links)][1]
                           href 
"/ANEKDOTY/REZNIK/perewody.pdf" 
links <- paste0("http://lib.ru",links[grep("pdf",links)])
links[1]
[1] "http://lib.ru/ANEKDOTY/REZNIK/perewody.pdf"

Can you write a loop to download the first ten links?

Solutions

require(stringr)
for (i in 1:10){
  parts <- unlist(str_split(links[i],"/"))      # split the URL on "/"
  outName <- parts[length(parts)]               # the last part is the file name
  print(outName)
  download.file(links[i],outName,mode="wb")     # "wb" because PDFs are binary files
}

String manipulation in R

The hardest part of the task above: meaningful filenames.

This was done using str_split.

Top string manipulation functions:

  • grep
  • gsub
  • str_split (library: stringr)
  • paste
  • nchar
  • tolower (also toupper, capitalize)
  • str_trim (library: stringr)
  • Encoding


What do they do: grep

Grep + regex: find stuff

grep("SHAKESPEARE",links)
 [1]  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
[18] 116 117 118
links[grep("SHAKESPEARE",links)] #or: grep("SHAKESPEARE",links,value=T)
 [1] "http://lib.ru/SHAKESPEARE/shks_romeo7.pdf"        
 [2] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be.pdf" 
 [3] "http://lib.ru/SHAKESPEARE/gamlet_mihalowskij.pdf" 
 [4] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be3.pdf"
 [5] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be4.pdf"
 [6] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be5.pdf"
 [7] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be6.pdf"
 [8] "http://lib.ru/SHAKESPEARE/hamlet8.pdf"            
 [9] "http://lib.ru/SHAKESPEARE/hamlet9.pdf"            
[10] "http://lib.ru/SHAKESPEARE/hamlet10.pdf"           
[11] "http://lib.ru/SHAKESPEARE/hamlet11.pdf"           
[12] "http://lib.ru/SHAKESPEARE/shks_hamlet13.pdf"      
[13] "http://lib.ru/SHAKESPEARE/shks_hamlet14.pdf"      
[14] "http://lib.ru/SHAKESPEARE/shks_hamlet17.pdf"      
[15] "http://lib.ru/SHAKESPEARE/shks_hamlet22.pdf"      
[16] "http://lib.ru/SHAKESPEARE/shks_hamlet24.pdf"      
[17] "http://lib.ru/SHAKESPEARE/shks_henry_IV_5.pdf"    
[18] "http://lib.ru/SHAKESPEARE/shks_henry_V_2.pdf"     
[19] "http://lib.ru/SHAKESPEARE/shks_henryVI_3_2.pdf"   
[20] "http://lib.ru/SHAKESPEARE/henry8_2.pdf"           

Grep 2

Useful options:

  • invert=T : get all non-matches
  • ignore.case=T : what it says on the box
  • value=T : return values rather than positions

Especially good with regex for partial matches:

grep("hamlet*",links,value=T)[1]
[1] "http://lib.ru/LITRA/SUMAROKOW/hamlet8.pdf"

Regex


Can match beginning or end of word, e.g.:

grep("stalin",c("stalin","stalingrad"),value=T)
[1] "stalin"     "stalingrad"
grep("stalin\\b",c("stalin","stalingrad"),value=T)
[1] "stalin"

What do they do: gsub

author <- "By Rolf Fredheim"
gsub("By ","",author)
[1] "Rolf Fredheim"
gsub("Rolf Fredheim","Tom",author)
[1] "By Tom"

Gsub can also use regex
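
For example, using one of the BBC timestamps from earlier (an illustrative sketch):

stamp <- "2014/03/03 16:06:54"
gsub("[0-9]", "#", stamp)   # replace every digit
gsub(" .*", "", stamp)      # drop everything after the first space, leaving just the date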

str_split

  • Manipulating URLs
  • Editing time stamps, etc

Syntax: str_split(inputString, pattern). It returns a list:

str_split(links[1],"/")
[[1]]
[1] "http:"        ""             "lib.ru"       "ANEKDOTY"    
[5] "REZNIK"       "perewody.pdf"
unlist(str_split(links[1],"/"))
[1] "http:"        ""             "lib.ru"       "ANEKDOTY"    
[5] "REZNIK"       "perewody.pdf"

We wanted the last element (perewody.pdf):

parts <- unlist(str_split(links[1],"/"))
length(parts)
[1] 6
parts[length(parts)]
[1] "perewody.pdf"

The rest

  • nchar
  • tolower (also toupper)
  • str_trim (library: stringr)
annoyingString <- "\n    something HERE  \t\t\t"
nchar(annoyingString)
[1] 24
str_trim(annoyingString)
[1] "something HERE"
tolower(str_trim(annoyingString))
[1] "something here"
nchar(str_trim(annoyingString))
[1] 14

Formatting dates

Easiest way to read in dates:

require(lubridate)
as.Date("2014-01-31")
[1] "2014-01-31"

date <- as.Date("2014-01-31")
str(date)
 Date[1:1], format: "2014-01-31"

Correctly formatting dates is useful:

date <- as.Date("2014-01-31")
str(date)
 Date[1:1], format: "2014-01-31"
date+years(1)
[1] "2015-01-31"
date-months(6)
[1] "2013-07-31"
date-days(1110)
[1] "2011-01-17"

The full way to enter dates:

as.Date("2014-01-31","%Y-%m-%d")
[1] "2014-01-31"

The letters preceded by percentage signs are date formatting codes:

  • %Y = 4 digits, 2004
  • %y = 2 digits, 04
  • %m = month
  • %d = day
  • %b = month in characters

Most of the time you won't need this. But what about:

date  <- "04 March 2014"
as.Date(date,"%d %b %Y")
[1] "2014-03-04"

It is probably worth spelling the format out, as Brits tend to write dates d-m-y while Americans prefer m-d-y; confusion is possible for the first 12 days of every month.
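
For example, the same string parses to two different dates depending on which convention you tell R to assume (an illustrative sketch):

as.Date("03/04/2014", "%d/%m/%Y")   # read as British: 3 April
[1] "2014-04-03"
as.Date("03/04/2014", "%m/%d/%Y")   # read as American: 4 March
[1] "2014-03-04"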

Lubridate

Reading: http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/

Lubridate makes this much, much easier:

require(lubridate)
date  <- "04 March 2014"
dmy(date)
[1] "2014-03-04 UTC"
time <- "04 March 2014 16:10:00"
dmy_hms(time,tz="GMT")
[1] "2014-03-04 16:10:00 GMT"
time2 <- "2014/03/01 07:44:22"
ymd_hms(time2,tz="GMT")
[1] "2014-03-01 07:44:22 GMT"

Task

Last week we wrote a scraper for the Telegraph:

url <- 'http://www.telegraph.co.uk/news/uknews/terrorism-in-the-uk/10659904/Former-Guantanamo-detainee-Moazzam-Begg-one-of-four-arrested-on-suspicion-of-terrorism.html'
SOURCE <-  getURL(url,encoding="UTF-8") 
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1[@itemprop='headline name']",xmlValue)
author <- xpathSApply(PARSED, "//p[@class='bylineBody']",xmlValue)

author: "\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter"

time: "1:30PM GMT 25 Feb 2014"

With the string functions above, rewrite the scraper to clean up the author field and correctly format the date.


Possible solution

author <-  "\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter"
a1 <- str_trim(author)
a2 <- gsub("By ","",a1)
a3 <- unlist(str_split(a2,","))[1]
a3
[1] "Miranda Prynne"

time <-  "1:30PM GMT 25 Feb 2014"
t <- unlist(str_split(time,"GMT "))[2]
dmy(t)
[1] "2014-02-25 UTC"

Intro to APIs

Last week we looked at the Guardian's social media buttons:

How do these work?

http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky

Go to source by right-clicking, select view page source, and go to line 368

Open that JavaScript file in a new browser window.

About twenty lines down there is the helpful comment:

Look for facebook share buttons and get share count from Facebook GraphAPI

This is followed by some jQuery statements and an Ajax call.

Ajax

From Wikipedia:

With Ajax, web applications can send data to, and retrieve data from, a server asynchronously (in the background) without interfering with the display and behavior of the existing page. Data can be retrieved using the XMLHttpRequest object. Despite the name, the use of XML is not required (JSON is often used instead).

How does this work?

  • You load the webpage
  • There is a script in the code
  • As well as placeholders (empty fields)
  • The script runs, and executes the Ajax call
  • This connects, in this case, with the Facebook API
  • The API returns data about the page from Facebook's servers
  • The jQuery syntax interprets the JSON and fills in the blanks in the HTML
  • The user sees the number of shares

Problem: when we download the page, we see only the first three steps. Solution: intercept the Ajax call, or go straight to the source.

APIs

When used in the context of web development, an API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.

The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data between communities and applications. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web

-Wikipedia

All the social buttons script does is access, in turn, the Facebook, Twitter, Google, Pinterest and LinkedIn APIs, collect the data, and paste it into the website.

How does this work?

Here is the JavaScript code:

var fqlQuery = 'select share_count,like_count from link_stat where url="' + url + '"'

queryUrl = 'http://graph.facebook.com/fql?q='+fqlQuery+'&callback=?';

Can we work with that? You bet.

Code translated to R:

fqlQuery='select share_count,like_count,comment_count from link_stat where url="'
url="http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"')  #ignoring the callback part
lookUp <- URLencode(queryUrl) #What do you think this does?
lookUp
[1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"

Paste that into your browser (lose the quotation marks!), and what do we find…?

Retrieving data

Our old pal JSON. This should look familiar by now.

Why not try it for a few other articles? This works for any URL. Here's how to check the stats for the slides from our first session:

require(rjson)
url="http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"')  #ignoring the callback part
lookUp <- URLencode(queryUrl)
rd <- readLines(lookUp, warn=FALSE)
dat <- fromJSON(rd)
dat
$data
$data[[1]]
$data[[1]]$share_count
[1] 3

$data[[1]]$like_count
[1] 5

$data[[1]]$comment_count
[1] 1

Accessing the numbers

Here's how we grab the numbers from the list

dat$data[[1]]$like_count
[1] 5
dat$data[[1]]$share_count
[1] 3
dat$data[[1]]$comment_count
[1] 1

Pretty modest.

Task

If there's any time left in class, why not:

  • Write a scraper to find which of the BBC articles from earlier have been shared the most (a possible sketch follows below)
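
A minimal sketch of one possible approach, assuming targets (the harvested BBC links) and fqlQuery are still in your workspace and that the Graph API responds as in the example above:

require(plyr)
require(rjson)
fbStats <- function(url){
  # reuse the query-building pattern from above; fqlQuery is assumed to be defined already
  queryUrl <- paste0('http://graph.facebook.com/fql?q=', fqlQuery, url, '"')
  dat <- fromJSON(readLines(URLencode(queryUrl), warn=FALSE))
  c(url, dat$data[[1]]$share_count)
}
shares <- ldply(targets[1:5], fbStats)                     # first five harvested links
shares[order(as.numeric(shares$V2), decreasing=TRUE), ]    # most shared first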

Finally

APIs are really useful: we don't (normally!) have to worry about terms of use

Next week I'll bring a few along, and we'll spend most of class looking at writing scrapers for them.

Examples:

  • Maps
  • Cricket scores
  • YouTube
  • Lyrics
  • Weather
  • Stock market (ticker) info
  • etc. etc.