```{r setup, include=FALSE}
opts_chunk$set(cache=TRUE)
```

Web Scraping part 4: APIs
========================================================
width: 1200
author: Rolf Fredheim and Aiora Zabala
date: University of Cambridge
font-family: 'Rockwell'
11/03/2014

Catch up
==================
Slides from week 1:
http://quantifyingmemory.blogspot.com/2014/02/web-scraping-basics.html

Slides from week 2:
http://quantifyingmemory.blogspot.com/2014/02/web-scraping-part2-digging-deeper.html

Slides from week 3:
http://quantifyingmemory.blogspot.com/2014/03/web-scraping-scaling-up-digital-data.html

Today we will:
========================================================
- Learn how to use an API
- Get data from YouTube
- Use Google and Yandex maps
- ...and other APIs

Get the docs:
http://fredheir.github.io/WebScraping/Lecture4/p4.html
http://fredheir.github.io/WebScraping/Lecture4/p4.Rpres
http://fredheir.github.io/WebScraping/Lecture4/p4.r

Digital data collection
=======================
- Devise a means of accessing data
- **Retrieve that data**
- Tabulate and store the data

Today is all about accessing diverse data sources

APIs
================
type:sq1
> The practice of publishing APIs has allowed web communities to create an open architecture for sharing content and data between communities and applications. In this way, content that is created in one place can be dynamically posted and updated in multiple locations on the web

-Wikipedia

- e.g. Facebook releases its API so third parties can develop software drawing on Facebook's data.
- Why might Facebook want to do that?

Examples
==================
APIs allow applications to communicate with each other

E.g.?

Examples
================
APIs allow applications to communicate with each other
- The Amazon API allows websites to link directly to products
  - up-to-date prices, option to buy
- Buying stuff online: verification of credit-card data
- Smartphone apps, e.g. for accessing Twitter
- Maps with location data, e.g. Yelp
- Sharing content between social networking sites
- Embedding videos
- Logging in via Facebook

==============
> When used in the context of web development, an API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.

-Wikipedia

- Out: HTTP request
- In: JSON or XML
- We can handle both processes through R
- See week 1 for working with JSON, week 2 for XML

Facebook API
==================
type:sq
Last week we explored the Facebook API:

```{r}
fqlQuery='select share_count,like_count,comment_count from link_stat where url="'
url="http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0('http://graph.facebook.com/fql?q=',fqlQuery,url,'"') #ignoring the callback part
lookUp <- URLencode(queryUrl) #What do you think this does?
lookUp
```

Read it in:
===============
```{r}
require(rjson)
rd <- readLines(lookUp, warn=FALSE)
dat <- fromJSON(rd)
dat
```

Finding this information
==================
All very well, but how do you cobble together that code?

First, find an API:
http://www.programmableweb.com/apis/directory

Google Maps is the most popular.

Google maps API
===============
Reading: http://www.jose-gonzalez.org/using-google-maps-api-r/#.Ux2LEflgy3M

Wrapper functions in package dismo (also explore packages ggmap and maptools) - see the aside on the next slide

Two APIs of interest:
- geocoding
- static maps

What will these APIs give us? Docs:

https://developers.google.com/maps/documentation/geocoding/
https://developers.google.com/maps/documentation/staticmaps/
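Aside: wrapper functions
===============
A minimal sketch of the wrapper route, assuming dismo is installed and that its geocode() function still talks to the Google geocoding service (untested here):

```{r eval=F}
require(dismo)
#geocode() takes a vector of place names and returns a data frame of matches,
#with longitude/latitude among the columns
geocode("Kremlin, Moscow")
```

Building the same request by hand, as on the next slides, shows what such wrappers do under the bonnet.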
Geocoding
============
https://developers.google.com/maps/documentation/geocoding/

Required parameters:
- address [place]
- sensor [=false]

Optional parameters: bounds, key, language, region

What will these options do? How would you go about using them?

Example query
===================
Query:
- https://maps.googleapis.com/maps/api/geocode/json?
- address=Kremlin,Moscow
- sensor=false

Separate the parameters with '&':

https://maps.googleapis.com/maps/api/geocode/json?address=Kremlin,Moscow&sensor=false

Which of the JSON fields might we be interested in?

Write a function
============
One variable: address

```{r}
getUrl <- function(address, sensor="false"){
  root <- "http://maps.google.com/maps/api/geocode/json?"
  u <- paste0(root, "address=", address, "&sensor=false")
  return(URLencode(u))
}

getUrl("Kremlin, Moscow")
```

In use
=================
type:sq
```{r}
require(RJSONIO)
target <- getUrl("Kremlin, Moscow")
dat <- fromJSON(target)
latitude <- dat$results[[1]]$geometry$location["lat"]
longitude <- dat$results[[1]]$geometry$location["lng"]
place <- dat$results[[1]]$formatted_address
latitude
longitude
place
```

Getting a static map
===================
type:sq2
https://developers.google.com/maps/documentation/staticmaps/

base = "http://maps.googleapis.com/maps/api/staticmap?center="

center = latitude (e.g. 55.75), longitude (e.g. 37.62)
OR: center = place ("Kremlin, Moscow")

zoom (1 = zoomed right out, 18 = zoomed right in)

maptype = "hybrid" #satellite, hybrid, terrain, roadmap

suffix = "&size=800x800&sensor=false&format=png"

http://maps.googleapis.com/maps/api/staticmap?center=55.75,37.62&zoom=13&maptype=hybrid&size=800x800&sensor=false&format=png

Construct that URL in R using paste
=============
type:section
Be careful about commas and &s

```{r}
base="http://maps.googleapis.com/maps/api/staticmap?center="
latitude=55.75
longitude=37.62
zoom=13
maptype="hybrid"
suffix="&size=800x800&sensor=false&format=png"
```

Possible solution
====================
type:sq1
```{r}
base="http://maps.googleapis.com/maps/api/staticmap?center="
latitude=55.75
longitude=37.62
zoom=13
maptype="hybrid"
suffix="&size=800x800&sensor=false&format=png"

target <- paste0(base,latitude,",",longitude,"&zoom=",zoom,"&maptype=",maptype,suffix)
```

What to do next...?
===============
Download the map:
```{r}
download.file(target,"test.png", mode="wb")
```

Use it as a background image for plots, e.g. geolocated data as here:
http://4.bp.blogspot.com/-gcmmsncQriY/UWvBOmwa8-I/AAAAAAAAAFY/LXRf8SXkzZc/s1600/gdelt4.png

Leftovers
===========
Non-Latin strings in scraper output: parse with an explicit encoding

PARSED <- htmlParse(SOURCE, **encoding="UTF-8"**)

```{r}
require(RCurl)
require(XML)
bbcScraper <- function(url){
  SOURCE <- getURL(url, encoding="UTF-8")
  PARSED <- htmlParse(SOURCE, encoding="UTF-8")
  title <- xpathSApply(PARSED, "//h1[@class='story-header']", xmlValue)
  date <- as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  if (is.null(date)) date <- NA
  if (is.null(title)) title <- NA
  return(c(title, date))
}
```

Rest of the class is yours!
=======================
The following slides have bits of code to get you started with APIs; the helper function on the next slide may save you some typing.

Select from four tasks:

1) More social stats
2) Use Yandex maps instead of Google
3) YouTube data
4) See what the other APIs can do, or find your own
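A helper for the tasks
=======================
Almost every task below boils down to: build a URL, fetch it, parse the JSON. This small convenience function (getJSON is my own name, not part of any package) wraps the readLines()/fromJSON() pattern used throughout these slides:

```{r}
require(rjson)
getJSON <- function(url){
  #encode the URL, fetch the response, collapse it to a single string, parse it
  rd <- readLines(URLencode(url), warn=FALSE)
  fromJSON(paste(rd, collapse=""))
}
```

e.g. dat <- getJSON(target) replaces the readLines()/fromJSON() pairs above.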
Social APIs
===============
type:sq
Share counts. These work similarly to the Facebook API.

Twitter's undocumented API:
```{r}
url="http://www.theguardian.com/uk-news/2014/mar/10/rise-zero-hours-contracts"
target=paste0("http://urls.api.twitter.com/1/urls/count.json?url=",url)
rd <- readLines(target, warn=FALSE)
dat <- fromJSON(rd)
dat
shares <- dat$count
```

=================
type:section
Using the official documentation, get the analogous stats from LinkedIn and StumbleUpon

LinkedIn: https://developer.linkedin.com/retrieving-share-counts-custom-buttons

StumbleUpon: http://help.stumbleupon.com/customer/portal/articles/665227-badges

Map making 2
==================
type:section
Yandex maps have a very similar syntax to Google maps. To get comfortable scripting with APIs, replicate the code above (slides 13:20) for Yandex maps, based on the information here:

http://api.yandex.com/maps/doc/staticapi/1.x/dg/concepts/input_params.xml

YouTube API
===============
type:sq
https://developers.google.com/youtube/2.0/developers_guide_protocol_audience

V2 is deprecated (what does this tell us about relying on APIs?)

To get you started:

Video stats (id = video ID, e.g. "Ya2elsR5s5s"):
url=paste0("https://gdata.youtube.com/feeds/api/videos/",id,"?v=2&alt=json")

Comments (id = video ID, e.g. "Ya2elsR5s5s"):
url=paste0("http://gdata.youtube.com/feeds/api/videos/",id,"/comments?v=2&alt=json")

Search (ukrainian protests):
url="https://gdata.youtube.com/feeds/api/videos?q=ukrainian+protests&alt=json"

Figure out what the JSON structure is, and how to extract the data you need

Google books
============
I have not used this one. But:

Documentation: https://developers.google.com/books/docs/v1/getting_started

Example query: https://www.googleapis.com/books/v1/volumes?q=harry+potter&callback=handleResponse
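Google books: a possible start
============
A minimal sketch, untested (as noted, I have not used this API). Drop the callback parameter so the reply is plain JSON; the totalItems/items/volumeInfo fields are assumptions taken from the documentation:

```{r eval=F}
url <- "https://www.googleapis.com/books/v1/volumes?q=harry+potter"
dat <- getJSON(url) #helper function from a few slides back
dat$totalItems
#titles on the first page of results
sapply(dat$items, function(x) x$volumeInfo$title)
```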
Song lyrics
=========
- Documentation: http://api.wikia.com/wiki/LyricWiki_API
- Example: http://lyrics.wikia.com/api.php?artist=Smashing Pumpkins&song=1979&fmt=json

Cricket score
=================
- http://cricscore-api.appspot.com/
- http://cricscore-api.appspot.com/csa (current matches)
- http://cricscore-api.appspot.com/csa?id=MATCH_ID_HERE

Some APIs require setup:
==============
Rotten Tomatoes
http://developer.rottentomatoes.com/docs

Bing Translator
http://blogs.msdn.com/b/translation/p/gettingstarted1.aspx

Weather
http://www.worldweatheronline.com/free-weather-feed.aspx
https://developer.forecast.io/

Last.fm
http://www.last.fm/api
Also reading: http://rcrastinate.blogspot.co.uk/2013/03/peace-through-music-country-clustering.html

Flickr
https://www.flickr.com/services/api/

Find your own API
========================
http://www.programmableweb.com/apis/directory

Social APIs, my solutions
======================
type:sq
```{r}
#LinkedIn
url="http://www.theguardian.com/uk-news/2014/mar/10/rise-zero-hours-contracts"
target=paste0("http://www.linkedin.com/countserv/count/share?url=",url,"&format=json")
rd <- readLines(target, warn=FALSE)
dat <- fromJSON(rd)

#StumbleUpon
url="http://www.theguardian.com/uk-news/2014/mar/10/rise-zero-hours-contracts"
target=paste0("http://www.stumbleupon.com/services/1.01/badge.getinfo?url=",url)
rd <- readLines(target, warn=FALSE)
dat <- fromJSON(rd)
```

Map making 2: my approach
====================
type:sq1
Geocoding
```{r eval=F}
query="cambridge university"
target=paste0("http://geocode-maps.yandex.ru/1.x/?format=json&lang=en-BR&geocode=",query)
rd <- readLines(URLencode(target), warn=FALSE)
dat <- fromJSON(rd)

#Extract address and location data
address <- dat$response$GeoObjectCollection$featureMember[[1]]$
  GeoObject$metaDataProperty$GeocoderMetaData$AddressDetails$Country$AddressLine
pos <- dat$response$GeoObjectCollection$featureMember[[1]]$
  GeoObject$Point

#Yandex returns the point as "longitude latitude"
require(stringr)
temp <- unlist(str_split(pos," "))
longitude=as.numeric(temp)[1]
latitude=as.numeric(temp)[2]
```

Map making 2: my approach 2
==================
Download the map
```{r eval=F}
zoom=13
lang="en-US"
maptype="map" #pmap, map, sat, trf (traffic!). NB: sat returns a JPG, not a PNG

#Yandex expects ll=longitude,latitude
target <- paste0("http://static-maps.yandex.ru/1.x/?ll=",longitude,",",latitude,
                 "&size=450,450&z=",zoom,"&lang=",lang,"&l=",maptype)
download.file(target,"test.png", mode="wb")
```

YouTube stats
=================
type:sq1
Function to return stats about a single video

```{r}
getStats <- function(id){
  url <- paste0("https://gdata.youtube.com/feeds/api/videos/",id,"?v=2&alt=json")
  raw.data <- readLines(url, warn=FALSE)
  rd <- fromJSON(raw.data)
  dop <- as.character(rd$entry$published)
  term <- rd$entry$category[[2]]["term"]
  label <- rd$entry$category[[2]]["label"]
  title <- rd$entry$title
  author <- rd$entry$author[[1]]$name
  duration <- rd$entry$`media$group`$`media$content`[[1]]$duration
  favs <- rd$entry$`yt$statistics`["favoriteCount"]
  views <- rd$entry$`yt$statistics`["viewCount"]
  dislikes <- rd$entry$`yt$rating`["numDislikes"]
  likes <- rd$entry$`yt$rating`["numLikes"]
  return(data.frame(id,dop,term,label,title,author,duration,favs,views,dislikes,likes))
}
```

YouTube comments
===============
type:sq1
Function to return comments on a video

```{r}
getComments <- function(id){
  url <- paste0("http://gdata.youtube.com/feeds/api/videos/",id,"/comments?v=2&alt=json")
  raw.data <- readLines(url, warn=FALSE)
  rd <- fromJSON(raw.data)
  comments <- as.character(sapply(1:length(rd$feed$entry),
                                  function(x) rd$feed$entry[[x]]$content))
  return(comments)
}
```

What next
===========
Share your skills - not many people in Cambridge do digital data collection.

Most online resources assume Python. It is worth learning, and most of the knowledge carries over to R:
http://chimera.labs.oreilly.com/books/1234000001583

See also the links in the first week's 'reading' section.

Good luck!