Rolf Fredheim and Aiora Zabala
University of Cambridge
04/03/2014
Get the docs: http://fredheir.github.io/WebScraping/Lecture3/p3.html
Today we focus less on interacting with JSON or HTML, and more on automating and repeating
Remember this from last week?
bbcScraper <- function(url){
  SOURCE <- getURL(url, encoding = "UTF-8")   # fetch the raw html
  PARSED <- htmlParse(SOURCE)                 # parse it into a searchable tree
  title  <- xpathSApply(PARSED, "//h1[@class='story-header']", xmlValue)
  date   <- as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  if (is.null(date))  date  <- NA
  if (is.null(title)) title <- NA
  return(c(title, date))
}
SQLite in R: http://sandymuspratt.blogspot.co.uk/2012/11/r-and-sqlite-part-1.html
require(RCurl)
require(XML)

urls <- c("http://www.bbc.co.uk/news/business-26414285",
          "http://www.bbc.co.uk/news/uk-26407840",
          "http://www.bbc.co.uk/news/world-asia-26413101",
          "http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")

results <- NULL
for (url in urls){
  newEntry <- bbcScraper(url)
  results  <- rbind(results, newEntry)
}
data.frame(results) #ignore the warning
                                                        X1                  X2
1 Russian rouble hits new low against the dollar and euro 2014/03/03 16:06:54
2                     'Two Together Railcard' goes on sale 2014/03/03 02:02:11
3             Australia: Snake eats crocodile after battle 2014/03/03 08:34:05
4    Missing Megan Roberts: Police find body in River Ouse 2014/03/03 06:59:50
Still to do: fix column names
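A quick fix (a minimal sketch; the column names here are my own choice):
results <- data.frame(results, stringsAsFactors = FALSE)  # convert the matrix into a data frame
colnames(results) <- c("title", "date")                   # give the two scraped columns meaningful names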
Copying data this way is inefficient:
temp=NULL
#1st loop
temp <- rbind(temp,results[1,])
temp
     [,1]                                                       [,2]
[1,] "Russian rouble hits new low against the dollar and euro" "2014/03/03 16:06:54"
#2nd loop
temp <- rbind(temp,results[2,])
#3d loop
temp <- rbind(temp,results[3,])
#4th loop
temp <- rbind(temp,results[4,])
temp
     [,1]                                                       [,2]
[1,] "Russian rouble hits new low against the dollar and euro" "2014/03/03 16:06:54"
[2,] "'Two Together Railcard' goes on sale"                     "2014/03/03 02:02:11"
[3,] "Australia: Snake eats crocodile after battle"             "2014/03/03 08:34:05"
[4,] "Missing Megan Roberts: Police find body in River Ouse"    "2014/03/03 06:59:50"
In each case we are copying the whole table in order to add a single line.
sapply is a bit more efficient.
It takes a vector and applies a function to each item in the vector:
dat <- c(1,2,3,4,5)
sapply(dat,function(x) x*2)
[1] 2 4 6 8 10
Syntax: sapply(data, function). The function can be one you have written yourself, or a standard one:
sapply(dat,sqrt)
[1] 1.000 1.414 1.732 2.000 2.236
urls <- c("http://www.bbc.co.uk/news/business-26414285","http://www.bbc.co.uk/news/uk-26407840","http://www.bbc.co.uk/news/world-asia-26413101","http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963")
sapply(urls,bbcScraper)
http://www.bbc.co.uk/news/business-26414285
[1,] "Russian rouble hits new low against the dollar and euro"
[2,] "2014/03/03 16:06:54"
http://www.bbc.co.uk/news/uk-26407840
[1,] "'Two Together Railcard' goes on sale"
[2,] "2014/03/03 02:02:11"
http://www.bbc.co.uk/news/world-asia-26413101
[1,] "Australia: Snake eats crocodile after battle"
[2,] "2014/03/03 08:34:05"
http://www.bbc.co.uk/news/uk-england-york-north-yorkshire-26413963
[1,] "Missing Megan Roberts: Police find body in River Ouse"
[2,] "2014/03/03 06:59:50"
We don't really want the data in this format. We can reshape it, or use a related function:
ldply: for each element of a list, apply a function, then combine the results into a data frame.
require(plyr)
dat <- ldply(urls,bbcScraper)
dat
                                                        V1                  V2
1 Russian rouble hits new low against the dollar and euro 2014/03/03 16:06:54
2                     'Two Together Railcard' goes on sale 2014/03/03 02:02:11
3             Australia: Snake eats crocodile after battle 2014/03/03 08:34:05
4    Missing Megan Roberts: Police find body in River Ouse 2014/03/03 06:59:50
Entering URLs by hand is tedious
We can speed this up by automating the collection of links
How can we find relevant links?
Go to the page, make a search, then repeat that query in R
Press next page to get pagination pattern. E.g: http://www.bbc.co.uk/search/news/?page=2&q=Russia
Can you collect the links from that page? Can you restrict them to just the search results (by finding the right div)?
All links:
url="http://www.bbc.co.uk/search/news/?page=3&q=Russia"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
xpathSApply(PARSED, "//a/@href")
Filtered links:
unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
#OR xpathSApply(PARSED, "//div[@id='news content']/@href")
Why 'unique'?
require(plyr)
targets <- unique(xpathSApply(PARSED, "//a[@class='title linktrack-title']/@href"))
results <- ldply(targets[1:5], bbcScraper)  # limiting it to the first five links
results
                                                               V1                  V2
1         Ukraine crisis: Obama warns Russia against intervention 2014/03/01 07:44:22
2    Russian 'invasion', plea to Boris and Susanna Reid in papers 2014/03/01 05:49:12
3                             The day I enraged Viktor Yanukovych 2014/03/01 01:17:16
4            Ukraine accuses Russia of deploying troops in Crimea 2014/02/28 22:47:30
5 Cygnet left in Gloucestershire after parents migrate without it 2014/02/28 21:14:41
Account for pagination by writing a scraper that searches:
http://www.bbc.co.uk/search/news/?page=3&q=Russia
http://www.bbc.co.uk/search/news/?page=4&q=Russia
http://www.bbc.co.uk/search/news/?page=5&q=Russia
etc.
hint: use paste or paste0
Let's not do this in class though, and give the BBC's servers some peace
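For reference, a minimal sketch of how the paginated search URLs could be built with paste0 (building the strings only, without fetching anything):
pages <- 3:5
searchUrls <- paste0("http://www.bbc.co.uk/search/news/?page=", pages, "&q=Russia")
searchUrls
# each of these pages could then be parsed as above, its article links collected,
# and those links passed to bbcScraper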
Reading
One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles. He is now free on bond, but faces dozens of years in prison and up to $1 million in fines if convicted.
PubMed Central UK has strong provisions against automated and systematic download of articles: "Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the UKPMC web site. Bulk downloading of articles from the main UKPMC web site, in any way, is prohibited because of copyright restrictions."
–http://www.technollama.co.uk/wp-content/uploads/2013/04/Data-Mining-Paper.pdf
I don't know IP law, so everything below is no more than a set of guidelines.
Most legal cases relate to copying and republishing information for profit (see the first article above)
In 2013 Google Books was found to be 'fair use'
Archiving and downloading
A distinction is made between freely available content and content behind a paywall, which requires accepting terms and conditions, etc.
Don't download loads of journal articles. Just don't
If you need/want to download something requiring authentication, be careful.
- attach a cookie to your request (package: httr)
Many resources explain how to do this, and there are many situations in which it is totally OK. It is not hard, but it is often quite shady. If you need it for your work, consider the explanations below, and exercise restraint.
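For illustration only, a minimal sketch using httr; the URL and cookie name/value below are placeholders, not a real service:
require(httr)
# attach a session cookie copied from your browser after logging in
resp <- GET("http://example.com/protected/report.pdf",
            set_cookies(sessionid = "PASTE_YOUR_COOKIE_HERE"))
writeBin(content(resp, "raw"), "report.pdf")   # write the raw bytes to disk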
Function: download.file(url,destfile)
destfile = filename on disk
option: mode="wb"
Downloaded files are often corrupted if you forget this. Setting mode="wb" ("write binary") stops "\n" characters being translated and causing havoc, e.g. with image files
lib.ru is a library of copyright-free Russian-language works (like Project Gutenberg)
This is a 1747 translation of Hamlet into Russian
Run this in your R console:
url <- "http://lib.ru/SHAKESPEARE/hamlet8.pdf"
download.file(url,"hamlet.pdf",mode="wb")
Navigate to your working directory, and you should find the pdf there
As with the newspaper articles, downloads are facilitated by getting the right links. Let's search for pdf files and download the first ten results:
url <- "http://lib.ru/GrepSearch?Search=pdf"
SOURCE <- getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
PARSED <- htmlParse(SOURCE)
links <- (xpathSApply(PARSED, "//a/@href"))
links[grep("pdf",links)][1]
href
"/ANEKDOTY/REZNIK/perewody.pdf"
links <- paste0("http://lib.ru",links[grep("pdf",links)])
links[1]
[1] "http://lib.ru/ANEKDOTY/REZNIK/perewody.pdf"
Can you write a loop to download the first ten links?
require(stringr)  # str_split lives in the stringr package

for (i in 1:10){
  parts   <- unlist(str_split(links[i], "/"))
  outName <- parts[length(parts)]      # last part of the url = a meaningful filename
  print(outName)
  download.file(links[i], outName, mode = "wb")   # "wb" so the pdfs are not corrupted
}
The hardest part of the task above is generating meaningful filenames.
This was done using str_split.
Top string manipulation functions:
Reading:
Grep + regex: find stuff
grep("SHAKESPEARE",links)
[1] 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
[18] 116 117 118
links[grep("SHAKESPEARE",links)] #or: grep("SHAKESPEARE",links,value=T)
[1] "http://lib.ru/SHAKESPEARE/shks_romeo7.pdf"
[2] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be.pdf"
[3] "http://lib.ru/SHAKESPEARE/gamlet_mihalowskij.pdf"
[4] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be3.pdf"
[5] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be4.pdf"
[6] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be5.pdf"
[7] "http://lib.ru/SHAKESPEARE/to_be_or_not_to_be6.pdf"
[8] "http://lib.ru/SHAKESPEARE/hamlet8.pdf"
[9] "http://lib.ru/SHAKESPEARE/hamlet9.pdf"
[10] "http://lib.ru/SHAKESPEARE/hamlet10.pdf"
[11] "http://lib.ru/SHAKESPEARE/hamlet11.pdf"
[12] "http://lib.ru/SHAKESPEARE/shks_hamlet13.pdf"
[13] "http://lib.ru/SHAKESPEARE/shks_hamlet14.pdf"
[14] "http://lib.ru/SHAKESPEARE/shks_hamlet17.pdf"
[15] "http://lib.ru/SHAKESPEARE/shks_hamlet22.pdf"
[16] "http://lib.ru/SHAKESPEARE/shks_hamlet24.pdf"
[17] "http://lib.ru/SHAKESPEARE/shks_henry_IV_5.pdf"
[18] "http://lib.ru/SHAKESPEARE/shks_henry_V_2.pdf"
[19] "http://lib.ru/SHAKESPEARE/shks_henryVI_3_2.pdf"
[20] "http://lib.ru/SHAKESPEARE/henry8_2.pdf"
Useful options:
- invert=T : return all non-matches
- ignore.case=T : what it says on the box
- value=T : return the matching values rather than their positions
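For instance (a quick sketch reusing the links vector from above):
grep("shakespeare", links, ignore.case = TRUE, value = TRUE)[1]   # case-insensitive match
grep("SHAKESPEARE", links, invert = TRUE, value = TRUE)[1]        # first link that does NOT match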
Especially good with regex for partial matches:
grep("hamlet*",links,value=T)[1]
[1] "http://lib.ru/LITRA/SUMAROKOW/hamlet8.pdf"
Check out regex word boundaries: they can match the beginning or end of a word, e.g.:
grep("stalin",c("stalin","stalingrad"),value=T)
[1] "stalin" "stalingrad"
grep("stalin\\b",c("stalin","stalingrad"),value=T)
[1] "stalin"
author <- "By Rolf Fredheim"
gsub("By ","",author)
[1] "Rolf Fredheim"
gsub("Rolf Fredheim","Tom",author)
[1] "By Tom"
gsub can also use regex:
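For example, a character-class pattern strips out all the digits:
gsub("[0-9]+", "", "Article 123, published 2014")
[1] "Article , published "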
Syntax: str_split(inputString, pattern). It returns a list:
str_split(links[1],"/")
[[1]]
[1] "http:" "" "lib.ru" "ANEKDOTY"
[5] "REZNIK" "perewody.pdf"
unlist(str_split(links[1],"/"))
[1] "http:" "" "lib.ru" "ANEKDOTY"
[5] "REZNIK" "perewody.pdf"
We wanted the last element (perewody.pdf):
parts <- unlist(str_split(links[1],"/"))
length(parts)
[1] 6
parts[length(parts)]
[1] "perewody.pdf"
annoyingString <- "\n something HERE \t\t\t"
nchar(annoyingString)
[1] 24
str_trim(annoyingString)
[1] "something HERE"
tolower(str_trim(annoyingString))
[1] "something here"
nchar(str_trim(annoyingString))
[1] 14
Easiest way to read in dates:
require(lubridate)
as.Date("2014-01-31")
[1] "2014-01-31"
date <- as.Date("2014-01-31")
str(date)
Date[1:1], format: "2014-01-31"
Correctly formatting dates is useful:
date <- as.Date("2014-01-31")
str(date)
Date[1:1], format: "2014-01-31"
date+years(1)
[1] "2015-01-31"
date-months(6)
[1] "2013-07-31"
date-days(1110)
[1] "2011-01-17"
The full way to enter dates:
as.Date("2014-01-31","%Y-%m-%d")
[1] "2014-01-31"
The characters preceded by percent signs are date formatting codes
Most of the time you won't need this. But what about:
date <- "04 March 2014"
as.Date(date,"%d %b %Y")
[1] "2014-03-04"
It is probably worth spelling the format out, as Brits tend to write dates d-m-y while Americans prefer m-d-y. Confusion is possible for the first 12 days of every month.
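A quick illustration of the ambiguity:
as.Date("04/03/2014", "%d/%m/%Y")   # read as day/month/year
[1] "2014-03-04"
as.Date("04/03/2014", "%m/%d/%Y")   # read as month/day/year
[1] "2014-04-03"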
Reading: http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
Lubridate makes this much, much easier:
require(lubridate)
date <- "04 March 2014"
dmy(date)
[1] "2014-03-04 UTC"
time <- "04 March 2014 16:10:00"
dmy_hms(time,tz="GMT")
[1] "2014-03-04 16:10:00 GMT"
time2 <- "2014/03/01 07:44:22"
ymd_hms(time2,tz="GMT")
[1] "2014-03-01 07:44:22 GMT"
Last week we wrote a scraper for the Telegraph:
url <- 'http://www.telegraph.co.uk/news/uknews/terrorism-in-the-uk/10659904/Former-Guantanamo-detainee-Moazzam-Begg-one-of-four-arrested-on-suspicion-of-terrorism.html'
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1[@itemprop='headline name']",xmlValue)
author <- xpathSApply(PARSED, "//p[@class='bylineBody']",xmlValue)
author: "\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter"
time: "1:30PM GMT 25 Feb 2014"
With the functions above, rewrite the scraper to correctly format the dates
author <- "\r\n\t\t\t\t\t\t\tBy Miranda Prynne, News Reporter"
a1 <- str_trim(author)
a2 <- gsub("By ","",a1)
a3 <- unlist(str_split(a2,","))[1]
a3
[1] "Miranda Prynne"
time <- "1:30PM GMT 25 Feb 2014"
t <- unlist(str_split(time,"GMT "))[2]
dmy(t)
[1] "2014-02-25 UTC"
Last week we looked at the Guardian's social media buttons:
How do these work?
http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky
Go to source by right-clicking, select view page source, and go to line 368
Open this javascript in a new browser window
About twenty lines down there is the helpful comment:
Look for facebook share buttons and get share count from Facebook GraphAPI
This is followed by some JQuery statements and an ajax call
From Wikipedia:
With Ajax, web applications can send data to, and retrieve data from, a server asynchronously (in the background) without interfering with the display and behavior of the existing page. Data can be retrieved using the XMLHttpRequest object. Despite the name, the use of XML is not required (JSON is often used instead).
The html we download contains these scripts, as well as placeholders: empty fields.
The script runs, and executes the Ajax call.
This connects, in this case, with the Facebook API.
The API returns data about the page from Facebook's servers.
The jQuery syntax interprets the JSON and fills in the blanks in the html.
The user sees the number of shares.
Problem: when we download the page, we see only the first three steps. Solution: intercept the Ajax call, or go straight to the source.
When used in the context of web development, an API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.
The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data between communities and applications. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web
-Wikipedia
All the social buttons script does is access, in turn, the Facebook, Twitter, Google, Pinterest and LinkedIn APIs, collect the data, and paste it into the website
Here is the javascript code:
var fqlQuery = 'select share_count,like_count from link_stat where url="' + url + '"'
queryUrl = 'http://graph.facebook.com/fql?q='+fqlQuery+'&callback=?';
Can we work with that? You bet.
fqlQuery='select share_count,like_count,comment_count from link_stat where url="'
url="http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"') #ignoring the callback part
lookUp <- URLencode(queryUrl) #What do you think this does?
lookUp
[1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"
Paste that into your browser (lose the quotation marks!), and what do we find?
Our old pal JSON. This should look familiar by now
Why not try it for a few other articles? This works for any url. Here's how to check the stats for the slides from our first session:
require(rjson)
url="http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"') #ignoring the callback part
lookUp <- URLencode(queryUrl)
rd <- readLines(lookUp, warn = FALSE)
dat <- fromJSON(rd)
dat
$data
$data[[1]]
$data[[1]]$share_count
[1] 3
$data[[1]]$like_count
[1] 5
$data[[1]]$comment_count
[1] 1
Here's how we grab the numbers from the list
dat$data[[1]]$like_count
[1] 5
dat$data[[1]]$share_count
[1] 3
dat$data[[1]]$comment_count
[1] 1
Pretty modest.
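The whole lookup can be wrapped into a small function (a sketch only; fbStats is a made-up name, and the Graph API endpoint is the one used above, which Facebook may change or rate-limit):
require(rjson)

fbStats <- function(url){
  fqlQuery <- 'select share_count,like_count,comment_count from link_stat where url="'
  queryUrl <- paste0('http://graph.facebook.com/fql?q=', fqlQuery, url, '"')
  rd  <- readLines(URLencode(queryUrl), warn = FALSE)
  dat <- fromJSON(rd)
  c(shares   = dat$data[[1]]$share_count,
    likes    = dat$data[[1]]$like_count,
    comments = dat$data[[1]]$comment_count)
}

fbStats("http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html")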
If there's any time left in class, why not:
APIs are really useful: we don't (normally!) have to worry about terms of use
Next week I'll bring a few along, and we'll spend most of class looking at writing scrapers for them.
Examples: