Web Scraping part 3: Scaling up

Rolf Fredheim and Aiora Zabala
University of Cambridge

04/03/2014

Today we will:

Revision: loop over a scraper
Collect relevant URLs
Download files
Look at copyright issues
Basic text manipulation in R
Introduce APIs (subject of the final session)

Get the docs: http://fredheir.github.io/WebScraping/Lecture3/p3.html

http://fredheir.github.io/WebScraping/Lecture3/p3.Rpres

http://fredheir.github.io/WebScraping/Lecture3/p3.r

Digital data collection

Devise a means of accessing data
Retrieve that data
tabulate and store the data

Today we focus less on interacting with JSON or HTML; more on automating and repeating

Using a scraper

In code

Remember this from last week?

bbcScraper <- function(url){
  SOURCE <-  getURL(url,encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  title=xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue)
  date=as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  if (is.null(date))    date <- NA
  if (is.null(title))    title <- NA
  return(c(title,date))
}

We want to scale up

How?

Loop and rbind
sapply
sink into database

sqlite in R: http://sandymuspratt.blogspot.co.uk/2012/11/r-and-sqlite-part-1.html

Loop

require(RCurl)
require(XML)
urls <- c("http://www.bbc.co.uk/news/business-26414285","http://www.bbc.co.uk/news/uk-26407840","http://www.bbc.co.uk/news/world-asia-26413101","http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")
results=NULL
for (url in urls){
  newEntry <- bbcScraper(url)
  results <- rbind(results,newEntry)
}
data.frame(results) #ignore the warning

                                                       X1
1 Russian rouble hits new low against the dollar and euro
2                    'Two Together Railcard' goes on sale
3            Australia: Snake eats crocodile after battle
4   Missing Megan Roberts: Police find body in River Ouse
                   X2
1 2014/03/03 16:06:54
2 2014/03/03 02:02:11
3 2014/03/03 08:34:05
4 2014/03/03 06:59:50

Still to do: fix column names

Disadvantage of loop

copying data this way is inefficient:

temp=NULL
#1st loop
temp <- rbind(temp,results[1,])
temp

     [,1]                                                     
[1,] "Russian rouble hits new low against the dollar and euro"
     [,2]                 
[1,] "2014/03/03 16:06:54"


#2nd loop
temp <- rbind(temp,results[2,])

#3d loop
temp <- rbind(temp,results[3,])

#4th loop
temp <- rbind(temp,results[4,])
temp

     [,1]                                                     
[1,] "Russian rouble hits new low against the dollar and euro"
[2,] "'Two Together Railcard' goes on sale"                   
[3,] "Australia: Snake eats crocodile after battle"           
[4,] "Missing Megan Roberts: Police find body in River Ouse"  
     [,2]                 
[1,] "2014/03/03 16:06:54"
[2,] "2014/03/03 02:02:11"
[3,] "2014/03/03 08:34:05"
[4,] "2014/03/03 06:59:50"

In each case we are copying the whole table in order to add a single line.

this is slow
need to keep two copied in memory (means we can only ever use at most half of computer's RAM)

sapply

A bit more efficient.

Takes a vector. Applies a formula to each item in the vector:

dat <- c(1,2,3,4,5)
sapply(dat,function(x) x*2)

[1]  2  4  6  8 10

syntax: sapply(data,function) function: can be your own function, or a standard one:

sapply(dat,sqrt)

[1] 1.000 1.414 1.732 2.000 2.236

in our case:

urls <- c("http://www.bbc.co.uk/news/business-26414285","http://www.bbc.co.uk/news/uk-26407840","http://www.bbc.co.uk/news/world-asia-26413101","http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")

sapply(urls,bbcScraper)

     http://www.bbc.co.uk/news/business-26414285              
[1,] "Russian rouble hits new low against the dollar and euro"
[2,] "2014/03/03 16:06:54"                                    
     http://www.bbc.co.uk/news/uk-26407840 
[1,] "'Two Together Railcard' goes on sale"
[2,] "2014/03/03 02:02:11"                 
     http://www.bbc.co.uk/news/world-asia-26413101 
[1,] "Australia: Snake eats crocodile after battle"
[2,] "2014/03/03 08:34:05"                         
     http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963
[1,] "Missing Megan Roberts: Police find body in River Ouse"           
[2,] "2014/03/03 06:59:50"

we don't really want the data in this format. We can reshape it, or use a related function:

ldply: For each element of a list, apply function then combine results into a data frame.

require(plyr)
dat <- ldply(urls,bbcScraper)
dat

                                                       V1
1 Russian rouble hits new low against the dollar and euro
2                    'Two Together Railcard' goes on sale
3            Australia: Snake eats crocodile after battle
4   Missing Megan Roberts: Police find body in River Ouse
                   V2
1 2014/03/03 16:06:54
2 2014/03/03 02:02:11
3 2014/03/03 08:34:05
4 2014/03/03 06:59:50

Task

Scrape ten BBC news articles
Put them in a data frame

Link harvesting

Entering URLs by hand is tedious

We can speed this up by automating the collection of links

How can we find relevant links?

?
Scraping URLs in a search result
those on the front page

Search result

Go to page, make a search, repeat that query in R

Press next page to get pagination pattern. E.g: http://www.bbc.co.uk/search/news/?page=2&q=Russia

Task

Can you collect the links from that page? Can you restrict it to the search results (find the right div)

One solution

All links

url="http://www.bbc.co.uk/search/news/?page=3&q=Russia"
SOURCE <-  getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
xpathSApply(PARSED, "//a/@href")

filtered links:

unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
#OR xpathSApply(PARSED, "//div[@id='news content']/@href")

Why 'unique'?

Another option: xpathSApply(PARSED, “//div[@id='news content']/@href”)

Create a table

require(plyr)
targets <- unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
results <- ldply(targets[1:5],bbcScraper) #limiting it to first five pages
results

                                                               V1
1         Ukraine crisis: Obama warns Russia against intervention
2    Russian 'invasion', plea to Boris and Susanna Reid in papers
3                             The day I enraged Viktor Yanukovych
4            Ukraine accuses Russia of deploying troops in Crimea
5 Cygnet left in Gloucestershire after parents migrate without it
                   V2
1 2014/03/01 07:44:22
2 2014/03/01 05:49:12
3 2014/03/01 01:17:16
4 2014/02/28 22:47:30
5 2014/02/28 21:14:41

Scale up further

Account for pagination by writing a scraper that searches:

http://www.bbc.co.uk/search/news/?page=3&q=Russia http://www.bbc.co.uk/search/news/?page=4&q=Russia http://www.bbc.co.uk/search/news/?page=5&q=Russia

etc.

hint: use paste or paste0

Let's not do this in class though, and give the BBC's servers some peace

Copyright

Reading

One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles. This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.

PubMed Central UK has strong provisions against automated and systematic download of articles: Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the UKPMC web site. Bulk downloading of articles from the main UKPMC web site, in any way, is prohibited because of copyright restrictions.

–http://www.technollama.co.uk/wp-content/uploads/2013/04/Data-Mining-Paper.pdf

Rough guidelines

I don't know IP law. So all the below are no more than guidelines

Most legal cases relate to copying and republishing of information for profit (see article1 above)

Don't cause damage (through republishing, taking-down servers, prevention of profit)
Make sure your use is 'transformative'
Read the TOCs

In 2013 Google Books was found to be 'fair use'

Judgement good news for scholars and libraries

Archiving and downloading

databases can be used to 'facilitate non-consumptive research'

Distinction made between freely available content, and that behind a pay-wall, requiring to accept conditions, etc.

Downloading

Don't download loads of journal articles. Just don't

If you need/want to download something requiring authentication, be careful.

-attach cookie Packages: httr

Many resources on how to do this, many situations in which this is totally OK. It is not hard, but often quite shady. If you need this for your work, consider the explanations below, and exercise restraint.

http://stackoverflow.com/questions/13204503

http://stackoverflow.com/questions/10213194

http://stackoverflow.com/questions/16118140

http://stackoverflow.com/questions/15853204

http://stackoverflow.com/questions/19074359

http://stackoverflow.com/questions/8510528

http://stackoverflow.com/questions/9638451

http://stackoverflow.com/questions/2388974

Principles of downloading

Function: download.file(url,destfile)

destfile = filename on disk

option: mode=“wb”

Often files downloaded are corrupted. Setting mode =“wb” prevents “\n” causing havoc, e.g. with image files

Example

http://lib.ru

Library of Russian language copyright-free works. (like Gutenberg)

This is a 1747 translation of Hamlet into Russian

Run this in your terminal:

url <- "http://lib.ru/SHAKESPEARE/hamlet8.pdf"
download.file(url,"hamlet.pdf",mode="wb")

navigate to your working directory, and you should have the pdf there

Automating downloads

As with the newspaper articles, downloads are facilitated by getting the right links. Let's search for pdf files and download the first ten results:

url <- "http://lib.ru/GrepSearch?Search=pdf"

SOURCE <-  getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
PARSED <- htmlParse(SOURCE)
links <- (xpathSApply(PARSED, "//a/@href"))
links[grep("pdf",links)][1]

                           href 
"/ANEKDOTY/REZNIK/perewody.pdf"

links <- paste0("http://lib.ru",links[grep("pdf",links)])
links[1]

[1] "http://lib.ru/ANEKDOTY/REZNIK/perewody.pdf"

Can you write a loop to download the first ten links?

Solutions

for (i in 1:10){
  parts <- unlist(str_split(links[i],"/"))
  outName <- parts[length(parts)]
  print(outName)
  download.file(links[i],outName)
}

String manipulation in R

Hardest part of task above: meaningful filenames

Done using str_split

Top string manipulation functions:

grep

gsub

str_split (library: stringr)

paste

nchar

tolower (also toupper, capitalize)

str_trim (library: stringr)

Encoding read here

Reading:

http://en.wikibooks.org/wiki/R_Programming/Text_Processing

http://chemicalstatistician.wordpress.com/2014/02/27/useful-functions-in-r-for-manipulating-text-data/

What do they do: grep

Grep + regex: find stuff

grep("SHAKESPEARE",links)

 [1]  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
[18] 116 117 118

links[grep("SHAKESPEARE",links)] #or: grep("SHAKESPEARE",links,value=T)

 [1] "http://lib.ru/SHAKESPEARE/shks_romeo7.pdf"        
 [2] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be.pdf" 
 [3] "http://lib.ru/SHAKESPEARE/gamlet_mihalowskij.pdf" 
 [4] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be3.pdf"
 [5] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be4.pdf"
 [6] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be5.pdf"
 [7] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be6.pdf"
 [8] "http://lib.ru/SHAKESPEARE/hamlet8.pdf"            
 [9] "http://lib.ru/SHAKESPEARE/hamlet9.pdf"            
[10] "http://lib.ru/SHAKESPEARE/hamlet10.pdf"           
[11] "http://lib.ru/SHAKESPEARE/hamlet11.pdf"           
[12] "http://lib.ru/SHAKESPEARE/shks_hamlet13.pdf"      
[13] "http://lib.ru/SHAKESPEARE/shks_hamlet14.pdf"      
[14] "http://lib.ru/SHAKESPEARE/shks_hamlet17.pdf"      
[15] "http://lib.ru/SHAKESPEARE/shks_hamlet22.pdf"      
[16] "http://lib.ru/SHAKESPEARE/shks_hamlet24.pdf"      
[17] "http://lib.ru/SHAKESPEARE/shks_henry_IV_5.pdf"    
[18] "http://lib.ru/SHAKESPEARE/shks_henry_V_2.pdf"     
[19] "http://lib.ru/SHAKESPEARE/shks_henryVI_3_2.pdf"   
[20] "http://lib.ru/SHAKESPEARE/henry8_2.pdf"

Grep 2

useful options: invert=T : get all non-matches ignore.case=T : what it says on the box value = T : return values rather than positions

Especially good with regex for partial matches:

grep("hamlet*",links,value=T)[1]

[1] "http://lib.ru/LITRA/SUMAROKOW/hamlet8.pdf"

Regex

Check out

?regex
http://www.rexegg.com/regex-quickstart.html

Can match beginning or end of word, e.g.:

grep("stalin",c("stalin","stalingrad"),value=T)

[1] "stalin"     "stalingrad"

grep("stalin\\b",c("stalin","stalingrad"),value=T)

[1] "stalin"

What do they do: gsub

author <- "By Rolf Fredheim"
gsub("By ","",author)

[1] "Rolf Fredheim"

gsub("Rolf Fredheim","Tom",author)

[1] "By Tom"

Gsub can also use regex

str_split

Manipulating URLs
Editing time stamps, etc

syntax: str_split(inputString,pattern) returns a list

str_split(links[1],"/")

[[1]]
[1] "http:"        ""             "lib.ru"       "ANEKDOTY"    
[5] "REZNIK"       "perewody.pdf"

unlist(str_split(links[1],"/"))

[1] "http:"        ""             "lib.ru"       "ANEKDOTY"    
[5] "REZNIK"       "perewody.pdf"

we wanted the last element (perewody.df):

parts <- unlist(str_split(links[1],"/"))
length(parts)

[1] 6

parts[length(parts)]

[1] "perewody.pdf"

The rest

nchar
tolower (also toupper)
str_trim (library: stringr)

annoyingString <- "\n    something HERE  \t\t\t"

nchar(annoyingString)

[1] 24

str_trim(annoyingString)

[1] "something HERE"

tolower(str_trim(annoyingString))

[1] "something here"

nchar(str_trim(annoyingString))

[1] 14

Formatting dates

Easiest way to read in dates:

require(lubridate)
as.Date("2014-01-31")

[1] "2014-01-31"


date <- as.Date("2014-01-31")
str(date)

 Date[1:1], format: "2014-01-31"

Correctly formatting dates is useful:

date <- as.Date("2014-01-31")
str(date)

 Date[1:1], format: "2014-01-31"

date+years(1)

[1] "2015-01-31"

date-months(6)

[1] "2013-07-31"

date-days(1110)

[1] "2011-01-17"

the full way to enter dates:

as.Date("2014-01-31","%Y-%m-%d")

[1] "2014-01-31"

The funny characters preceded by percentage characters denote date formatting

%Y = 4 digits, 2004
%y = 2 digits, 04
%m = month
%d = day
%b = month in characters

Most of the time you won't need this. But what about:

date  <- "04 March 2014"
as.Date(date,"%d %b %Y")

[1] "2014-03-04"

Probably worth spelling out, as Brits tend to write dates d-m-y, while Americans prefer m-d-y. Confusions possible for the first 12 days of every month.

Lubridate

Reading: http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/

Lubridate makes this much, much easier:

require(lubridate)
date  <- "04 March 2014"
dmy(date)

[1] "2014-03-04 UTC"

time <- "04 March 2014 16:10:00"
dmy_hms(time,tz="GMT")

[1] "2014-03-04 16:10:00 GMT"

time2 <- "2014/03/01 07:44:22"
ymd_hms(time2,tz="GMT")

[1] "2014-03-01 07:44:22 GMT"

Task

Last week we wrote a scraper for the telegraph:

url <- 'http://www.telegraph.co.uk/news/uknews/terrorism-in-the-uk/10659904/Former-Guantanamo-detainee-Moazzam-Begg-one-of-four-arrested-on-suspicion-of-terrorism.html'
SOURCE <-  getURL(url,encoding="UTF-8") 
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1[@itemprop='headline name']",xmlValue)
author <- xpathSApply(PARSED, "//p[@class='bylineBody']",xmlValue)

author: “\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter”

time: “1:30PM GMT 25 Feb 2014”

With the functions above, rewrite the scraper to correctly format the dates

Blank slide

Possible solution

author <-  "\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter"
a1 <- str_trim(author)
a2 <- gsub("By ","",a1)
a3 <- unlist(str_split(a2,","))[1]
a3

[1] "Miranda Prynne"


time <-  "1:30PM GMT 25 Feb 2014"
t <- unlist(str_split(time,"GMT "))[2]
dmy(t)

[1] "2014-02-25 UTC"

Intro to APIs

Last week we looked at the Guardian's social media buttons:

How do these work?

http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky

Go to source by right-clicking, select view page source, and go to line 368

Open this javascript in a new browser window

About twenty lines down there is the helpful comment:

Look for facebook share buttons and get share count from Facebook GraphAPI

This is followed by some JQuery statements and an ajax call

Ajax

From Wikipedia:

With Ajax, web applications can send data to, and retrieve data from, a server asynchronously (in the background) without interfering with the display and behavior of the existing page. Data can be retrieved using the XMLHttpRequest object. Despite the name, the use of XML is not required (JSON is often used instead.

How does this work?

You load the webpage
There is a script in the code.
As well as placeholders, empty fields
The script runs, and executes the Ajax call.
This connects, in this case, with the Facebook API
the API returns data about the page from Facebook's servers
The JQuery syntax interprets the JSON and fills the blanks in the html
The user sees the number of shares.

Problem: when we download the page, we see only the first three steps. Solution: intercept the Ajax call, or, go straight to the source

APIs

When used in the context of web development, an API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.

The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data between communities and applications. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web

-Wikipedia

All the social buttons script is doing is accessing, in turn, the Facebook, the Twitter, Google, Pinterest and Linkedin APIs, collecting data, and pasting that into the website

How does this work

Here is the javascript code:

var fqlQuery = 'select share_count,like_count from link_stat where url=“' + url + '”'

queryUrl = 'http://graph.facebook.com/fql?q='+fqlQuery+'&callback=?';

Can we work with that? You bet.

Code translated to R:

fqlQuery='select share_count,like_count,comment_count from link_stat where url="'
url="http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"')  #ignoring the callback part
lookUp <- URLencode(queryUrl) #What do you think this does?
lookUp

[1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"

Paste that into your browser (lose the quotation marks!), and what do we find….?

Retrieving data

Our old pal JSON. This should look familiar by now

Why not try it for a few other articles: this works for any url. Here's how to check the stats for the slides from our first session:

require(rjson)
url="http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"')  #ignoring the callback part
lookUp <- URLencode(queryUrl)
rd <- readLines(lookUp, warn="F") 
dat <- fromJSON(rd)
dat

$data
$data[[1]]
$data[[1]]$share_count
[1] 3

$data[[1]]$like_count
[1] 5

$data[[1]]$comment_count
[1] 1

Accessing the numbers

Here's how we grab the numbers from the list

dat$data[[1]]$like_count

[1] 5

dat$data[[1]]$share_count

[1] 3

dat$data[[1]]$comment_count

[1] 1

Pretty modest.

task

If there's any time left in class, why not:

write a scraper to find which of the articles from the BBC earlier have been shared the most.

Finally

APIs are really useful: we don't (normally!) have to worry about terms of use

Next week I'll bring a few along, and we'll spend most of class looking at writing scrapers for them.

Examples:

Maps
Cricket scores
YouTube
Lyrics
Weather
Stock market (ticker) info
etc. etc.