Rolf Fredheim and Aiora Zabala
University of Cambridge
11/03/2014
Slides from week 1: http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html
Slides from week 2: http://quantifyingmemory.blogspot.com/2014/02/web-scraping-part2-digging-deeper.html
Slides from week 3: http://quantifyingmemory.blogspot.com/2014/03/web-scraping-scaling-up-digital-data.html
Get the docs: http://fredheir.github.io/WebScraping/Lecture4/p4.html
Today is all about accessing diverse data sources
The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data between communities and applications. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web
-Wikipedia
APIs allow applications to communicate with each other
E.g?
When used in the context of web development, an API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. -Wikipedia
In: JSON or XML
We can handle both processes through R
See week 1 for working with JSON, week 2 for XML
Last week we explored the Facebook API:
fqlQuery='select share_count,like_count,comment_count from link_stat where url="'
url="http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"') #ignoring the callback part
lookUp <- URLencode(queryUrl) #What do you think this does?
lookUp
[1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"
require(rjson)
rd <- readLines(lookUp, warn=FALSE)
dat <- fromJSON(rd)
dat
$data
$data[[1]]
$data[[1]]$share_count
[1] 387
$data[[1]]$like_count
[1] 428
$data[[1]]$comment_count
[1] 231
All very well, but how to cobble together that code?
First, find an API:
http://www.programmableweb.com/apis/directory
Google Maps is the most popular.
reading: http://www.jose-gonzalez.org/using-google-maps-api-r/#.Ux2LEflgy3M
Wrapper functions in package dismo (also explore packages ggmap, maptools)
Two APIs of interest:
What will these APIs give us?
Docs:
https://developers.google.com/maps/documentation/geocoding/ https://developers.google.com/maps/documentation/staticmaps/
https://developers.google.com/maps/documentation/geocoding/
Required parameters: address, sensor
Optional parameters: bounds, key, language, region
What will these options do? How would you go about using them?
Query:
separate with '&'
https://maps.googleapis.com/maps/api/geocode/json?address=Kremlin,Moscow&sensor=false
Which of the JSON fields might we be interested in?
One variable: address
getUrl <- function(address, sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/json?"
  u <- paste0(root, "address=", address, "&sensor=", sensor)  #use the argument, don't hard-code it
  return(URLencode(u))
}
getUrl("Kremlin, Moscow")
[1] "http://maps.google.com/maps/api/geocode/json?address=Kremlin,%20Moscow&sensor=false"
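The optional parameters above can be bolted onto the same pattern: each extra parameter is appended with '&'. A sketch (getUrl2 is a made-up name; the parameter names come from the geocoding docs):

```r
# Sketch: extend the URL builder with the optional 'language' and
# 'region' parameters from the geocoding documentation
getUrl2 <- function(address, language = NULL, region = NULL){
  u <- paste0("http://maps.google.com/maps/api/geocode/json?",
              "address=", address, "&sensor=false")
  if (!is.null(language)) u <- paste0(u, "&language=", language)
  if (!is.null(region))   u <- paste0(u, "&region=", region)
  URLencode(u)
}
getUrl2("Kremlin, Moscow", language = "ru")
```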
require(RJSONIO)
target <- getUrl("Kremlin, Moscow")
dat <- fromJSON(target)
latitude <- dat$results[[1]]$geometry$location["lat"]
longitude <- dat$results[[1]]$geometry$location["lng"]
place <- dat$results[[1]]$formatted_address
latitude
lat
55.75
longitude
lng
37.62
place
[1] "The Moscow Kremlin, Moscow, Russia, 103073"
https://developers.google.com/maps/documentation/staticmaps/
base="http://maps.googleapis.com/maps/api/staticmap?center="
center= latitude (e.g. 55.75), longitude (e.g. 37.62)
OR: center=place (Kremlin, Moscow)
zoom (1 = zoomed right out, 18 = zoomed right in)
maptype="hybrid" #satellite, hybrid, terrain, roadmap
suffix = "&size=800x800&sensor=false&format=png"
be careful about commas and &s
base="http://maps.googleapis.com/maps/api/staticmap?center="
latitude=55.75
longitude=37.62
zoom=13
maptype="hybrid"
suffix ="&size=800x800&sensor=false&format=png"
target <- paste0(base,latitude,",",longitude,
"&zoom=",zoom,"&maptype=",maptype,suffix)
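The assembly above can be wrapped in a small helper so different coordinates, zoom levels, and map types can be tried without re-pasting the pieces (staticMapUrl is a made-up name, not part of any package):

```r
# Sketch: helper that assembles a static map URL from its parts
staticMapUrl <- function(lat, lon, zoom = 13, maptype = "hybrid",
                         size = "800x800"){
  paste0("http://maps.googleapis.com/maps/api/staticmap?center=",
         lat, ",", lon, "&zoom=", zoom, "&maptype=", maptype,
         "&size=", size, "&sensor=false&format=png")
}
target <- staticMapUrl(55.75, 37.62)
```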
Download the map:
download.file(target,"test.png", mode = "wb")
Use it as a background image for plots, e.g. geo location as here http://4.bp.blogspot.com/-gcmmsncQriY/UWvBOmwa8-I/AAAAAAAAAFY/LXRf8SXkzZc/s1600/gdelt4.png
Non-Latin strings in scraper output:
PARSED <- htmlParse(SOURCE, encoding="UTF-8")
require(RCurl)
require(XML)
bbcScraper <- function(url){
  SOURCE <- getURL(url, encoding="UTF-8")
  PARSED <- htmlParse(SOURCE, encoding="UTF-8")
  title <- xpathSApply(PARSED, "//h1[@class='story-header']", xmlValue)
  date <- as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  #a failed XPath lookup is zero-length, not NULL, so test length()
  if (length(date) == 0) date <- NA
  if (length(title) == 0) title <- NA
  return(c(title, date))
}
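A subtlety worth noting: when an XPath lookup finds nothing, the result is zero-length rather than NULL, so a length() check is a safer guard than is.null(). A quick base-R illustration:

```r
# What a failed XPath lookup typically yields: a zero-length vector
title <- character(0)
is.null(title)           # FALSE - an is.null() check would miss this
if (length(title) == 0) title <- NA
title
```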
The following slides have bits of code to get you started with APIs
Select from four tasks:
1) More social stats
2) Use Yandex maps instead of Google
3) YouTube data
4) See what the other APIs can do, or find your own.
Share counts. These work similarly to the Facebook API.
Twitter's undocumented API:
url="http://www.theguardian.com/uk-news/2014/mar/10/rise-zero-hours-contracts"
target=paste0("http://urls.api.twitter.com/1/urls/count.json?url=",url)
rd <- readLines(target, warn=FALSE)
dat <- fromJSON(rd)
dat
$count
[1] 940
$url
[1] "http://www.theguardian.com/uk-news/2014/mar/10/rise-zero-hours-contracts/"
shares <- dat$count
Using the official documentation, get the analogous stats from LinkedIn and StumbleUpon
LinkedIn: https://developer.linkedin.com/retrieving-share-counts-custom-buttons
StumbleUpon: http://help.stumbleupon.com/customer/portal/articles/665227-badges
Yandex maps have a very similar syntax to Google maps.
To get comfortable scripting with APIs, replicate the code above (slides 13:20) for Yandex maps, based on information here:
http://api.yandex.com/maps/doc/staticapi/1.x/dg/concepts/input_params.xml
https://developers.google.com/youtube/2.0/developers_guide_protocol_audience
V2 deprecated. (what does this tell us about using APIs?)
To get you started:
Video stats (id = video ID, e.g. "Ya2elsR5s5s"): url=paste0("https://gdata.youtube.com/feeds/api/videos/",id,"?v=2&alt=json")
Comments (id = video ID, e.g. "Ya2elsR5s5s"): url=paste0("http://gdata.youtube.com/feeds/api/videos/",id,"/comments?v=2&alt=json")
Search (ukrainian protests): url="https://gdata.youtube.com/feeds/api/videos?q=ukrainian+protests&alt=json"
Figure out what the JSON structure is, and how to extract the data you need
I have not used this one. But:
Documentation: https://developers.google.com/books/docs/v1/getting_started
Example query: https://www.googleapis.com/books/v1/volumes?q=harry+potter&callback=handleResponse
Rotten tomatoes http://developer.rottentomatoes.com/docs
Bing translator http://blogs.msdn.com/b/translation/p/gettingstarted1.aspx
Weather http://www.worldweatheronline.com/free-weather-feed.aspx https://developer.forecast.io/
Last.fm
http://www.last.fm/api
Also reading: http://rcrastinate.blogspot.co.uk/2013/03/peace-through-music-country-clustering.html
#Linkedin
url="http://www.theguardian.com/uk-news/2014/mar/10/rise-zero-hours-contracts"
target=paste0("http://www.linkedin.com/countserv/count/share?url=",url,"&format=json") #note: no stray '$' before the url
rd <- readLines(target, warn=FALSE)
dat <- fromJSON(rd)
#StumbleUpon
url="http://www.theguardian.com/uk-news/2014/mar/10/rise-zero-hours-contracts"
target=paste0("http://www.stumbleupon.com/services/1.01/badge.getinfo?url=",url)
rd <- readLines(target, warn=FALSE)
dat <- fromJSON(rd)
Geocoding
query="cambridge university"
target=URLencode(paste0("http://geocode-maps.yandex.ru/1.x/?format=json&lang=en-BR&geocode=",query)) #encode the space in the query
rd <- readLines(target, warn=FALSE)
dat <- fromJSON(rd)
#Extract address and location data
address <- dat$response$GeoObjectCollection$featureMember[[1]]$
GeoObject$metaDataProperty$GeocoderMetaData$AddressDetails$Country$AddressLine
pos <- dat$response$GeoObjectCollection$featureMember[[1]]$
GeoObject$Point
require(stringr)
temp <- unlist(str_split(pos," "))
#Yandex lists coordinates as "longitude latitude"
longitude=as.numeric(temp)[1]
latitude=as.numeric(temp)[2]
Download the map
zoom=13
lang="en-US"
maptype="map" #pmap,map,sat,trf (traffic!) Note: if using sat, file is in JPG format, not PNG
target <- paste0("http://static-maps.yandex.ru/1.x/?ll=",longitude,",",latitude,"&size=450,450&z=",zoom,"&lang=",lang,"&l=",maptype)
download.file(target,"test.png", mode = "wb")
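The coordinate string Yandex returns can also be split without stringr. A base-R sketch with an illustrative pos value (not a live response):

```r
# Base-R alternative to str_split() for Yandex's "longitude latitude" string
pos <- "37.617778 55.751667"   # illustrative value only
temp <- as.numeric(unlist(strsplit(pos, " ")))
longitude <- temp[1]
latitude  <- temp[2]
```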
Function to return stats about a single video
getStats <- function(id){
url=paste0("https://gdata.youtube.com/feeds/api/videos/",id,"?v=2&alt=json")
raw.data <- readLines(url, warn=FALSE)
rd <- fromJSON(raw.data)
dop <- as.character(rd$entry$published)
term <- rd$entry$category[[2]]["term"]
label <- rd$entry$category[[2]]["label"]
title <- rd$entry$title
author <- rd$entry$author[[1]]$name
duration <- rd$entry$`media$group`$`media$content`[[1]]$duration
favs <- rd$entry$`yt$statistics`["favoriteCount"]
views <- rd$entry$`yt$statistics`["viewCount"]
dislikes <- rd$entry$`yt$rating`["numDislikes"]
likes <- rd$entry$`yt$rating`["numLikes"]
return(data.frame(id,dop,term,label,title,author,duration,favs,views,dislikes,likes))
}
Function to return comments about a video
getComments <- function(id){
url=paste0("http://gdata.youtube.com/feeds/api/videos/",id,"/comments?v=2&alt=json")
raw.data <- readLines(url, warn=FALSE)
rd <- fromJSON(raw.data)
comments <- as.character(sapply(1:length(rd$feed$entry), function(x) (rd$feed$entry[[x]]$content)))
return(comments)
}
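Since getStats() returns a one-row data frame per video, results for several IDs can be stacked with do.call(rbind, ...). Sketched here with stand-in rows rather than live API calls (the second ID is made up):

```r
# Stand-in for lapply(ids, getStats): one-row data frames per video
ids <- c("Ya2elsR5s5s", "abc123")   # second ID is hypothetical
rows <- lapply(ids, function(id) data.frame(id = id, views = NA))
allStats <- do.call(rbind, rows)    # one data frame, one row per video
nrow(allStats)
```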
Share your skills - not too many people in Cambridge do digital data collection
Most online resources assume Python. It is worth learning, and most of the knowledge is applicable to R
http://chimera.labs.oreilly.com/books/1234000001583
Links in first week's 'reading' section
Good luck!