Rolf Fredheim and Aiora Zabala
University of Cambridge
18/02/2014
Before you sit down:
download.file('https://raw.github.com/fredheir/WebScraping/master/Lecture1/p1.Rpres', destfile='p1.Rpres')
OR: download.file('http://goo.gl/ulTloi', destfile='p1.Rpres')
Older versions of RStudio (markdown):
download.file('https://raw.github.com/fredheir/WebScraping/master/Lecture1/p1.Rpres', destfile='p1.Rmd')
To view the slides in a browser, open: http://fredheir.github.io/WebScraping/Lecture1/p1.html
Open the file p1.Rpres or p1.Rmd in RStudio, and press preview to view the slides
for more info about editing these slides, go here
Computer scientists
Anyone with some minimal background in coding and good computer literacy
Created a system to extract text and numbers from a large number of web pages
Learnt to harvest links
Worked with an API to gather data, e.g. from YouTube
Converted messy data into tabular form
A Windows computer
A modern browser - Chrome or Firefox
An up-to-date version of RStudio
Theory
Practice
From Wikipedia
Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
When might this be useful? (your examples)
Again, from Wikipedia
… Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.
a<-fetchData(1,c("Web_scraping","Data_mining"))
ggplot(a,aes(date,rd.views,colour=target))+geom_line()
1) manipulating URLs
2) downloading stuff from within R
3) automate collection through loops and functions
Approximate number of web pages
require(ggplot2)
clubs <- c("Tottenham","Arsenal","Liverpool",
"Everton","ManU","ManC","Chelsea")
nPages <- c(23.3,68.4,78.9,35.5,102,90.5,110)
df <- data.frame(clubs,nPages)
df
clubs nPages
1 Tottenham 23.3
2 Arsenal 68.4
3 Liverpool 78.9
4 Everton 35.5
5 ManU 102.0
6 ManC 90.5
7 Chelsea 110.0
ggplot(df,aes(clubs,nPages,fill=clubs))+
  geom_bar(stat="identity")+
  coord_flip()+theme_bw(base_size=70)
Programming with Humanists: Reflections on Raising an Army of Hacker-Scholars in the Digital Humanities http://openbookpublishers.com/htmlreader/DHP/chap09.html#ch09
the agent machines (slave zombies) begin to send a large volume of packets to the victim, flooding its system with useless load and exhausting its resources.
source: cisco.com
We will not:
We will:
http://stats.grok.se/en/201401/web_scraping
en.wikipedia.org/wiki/Web_scraping
http://stats.grok.se/json/en/201401/web_scraping
{"daily_views": {"2013-01-12": 542, "2013-01-13": 593, "2013-01-10": 941, "2013-01-11": 798, "2013-01-16": 1119, "2013-01-17": 1124, "2013-01-14": 908, "2013-01-15": 1040, "2013-01-30": 1367, "2013-01-18": 1027, "2013-01-19": 743, "2013-01-31": 1151, "2013-01-29": 1210, "2013-01-28": 1130, "2013-01-23": 1275, "2013-01-22": 1131, "2013-01-21": 1008, "2013-01-20": 707, "2013-01-27": 789, "2013-01-26": 747, "2013-01-25": 1073, "2013-01-24": 1204, "2013-01-01": 379, "2013-01-03": 851, "2013-01-02": 807, "2013-01-05": 511, "2013-01-04": 818, "2013-01-07": 745, "2013-01-06": 469, "2013-01-09": 946, "2013-01-08": 912}, "project": "en", "month": "201301", "rank": -1, "title": "web_scraping"}
Theory
Practice
1) how do we read the data from this page
2) how do we generate a list of links, say for the period 2012-2013?
Check out ?paste if you are unsure about this
Bonus: check out ?paste0
var=123
paste("url",var,sep="")
[1] "url123"
paste("url",var,sep=" ")
[1] "url 123"
var=123
paste("url",rep(var,3),sep="_")
[1] "url_123" "url_123" "url_123"
paste(rep("url",3),var,sep="_")
[1] "url_123" "url_123" "url_123"
var=c(123,421)
paste(var,collapse="_")
[1] "123_421"
var=201401
paste("http://stats.grok.se/json/en/",var,"/web_scraping")
[1] "http://stats.grok.se/json/en/ 201401 /web_scraping"
paste("http://stats.grok.se/json/en/",var,"/web_scraping",sep="")
[1] "http://stats.grok.se/json/en/201401/web_scraping"
a="test"
b="scrape"
c=94
merge variables a, b, and c into a single string, separated by an underscore ("_")
"test_scrape_94"
merge variables a, b, and c into a single string without any separating character
"testscrape94"
print the letter 'a' followed by the numbers 1:10, without a separating character
"a1" "a2" "a3" "a4" "a5" "a6" "a7" "a8" "a9" "a10"
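One possible set of answers, sketched with paste() and the variables defined above (check ?paste if any of these are unclear):

```r
a <- "test"
b <- "scrape"
c <- 94

paste(a, b, c, sep="_")   # merged with underscores
paste(a, b, c, sep="")    # merged with no separator
paste("a", 1:10, sep="")  # "a" pasted to each number, vectorised
```

Note that paste() recycles the shorter argument: "a" is reused for each element of 1:10.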
Run this in your R console:
var=201401
url=paste("http://stats.grok.se/json/en/",var,"/web_scraping",sep="")
url
browseURL(url)
var=201401
url=paste("http://stats.grok.se/json/en/",var,"/web_scraping",sep="")
raw.data <- readLines(url, warn=FALSE)
raw.data
[1] "{\"daily_views\": {\"2014-01-15\": 779, \"2014-01-14\": 806, \"2014-01-17\": 827, \"2014-01-16\": 981, \"2014-01-11\": 489, \"2014-01-10\": 782, \"2014-01-13\": 756, \"2014-01-12\": 476, \"2014-01-19\": 507, \"2014-01-18\": 473, \"2014-01-28\": 789, \"2014-01-29\": 799, \"2014-01-20\": 816, \"2014-01-21\": 857, \"2014-01-22\": 899, \"2014-01-23\": 792, \"2014-01-24\": 749, \"2014-01-25\": 508, \"2014-01-26\": 488, \"2014-01-27\": 769, \"2014-01-07\": 786, \"2014-01-04\": 456, \"2014-01-05\": 77, \"2014-01-02\": 674, \"2014-01-03\": 586, \"2014-01-01\": 348, \"2014-01-08\": 765, \"2014-01-09\": 787, \"2014-01-31\": 874, \"2014-01-30\": 1159}, \"project\": \"en\", \"month\": \"201401\", \"rank\": -1, \"title\": \"web_scraping\"}"
#install.packages("rjson")
require(rjson)
rd <- fromJSON(raw.data)
rd
$daily_views
$daily_views$`2014-01-15`
[1] 779
$daily_views$`2014-01-14`
[1] 806
$daily_views$`2014-01-17`
[1] 827
$daily_views$`2014-01-16`
[1] 981
$daily_views$`2014-01-11`
[1] 489
$daily_views$`2014-01-10`
[1] 782
$daily_views$`2014-01-13`
[1] 756
$daily_views$`2014-01-12`
[1] 476
$daily_views$`2014-01-19`
[1] 507
$daily_views$`2014-01-18`
[1] 473
$daily_views$`2014-01-28`
[1] 789
$daily_views$`2014-01-29`
[1] 799
$daily_views$`2014-01-20`
[1] 816
$daily_views$`2014-01-21`
[1] 857
$daily_views$`2014-01-22`
[1] 899
$daily_views$`2014-01-23`
[1] 792
$daily_views$`2014-01-24`
[1] 749
$daily_views$`2014-01-25`
[1] 508
$daily_views$`2014-01-26`
[1] 488
$daily_views$`2014-01-27`
[1] 769
$daily_views$`2014-01-07`
[1] 786
$daily_views$`2014-01-04`
[1] 456
$daily_views$`2014-01-05`
[1] 77
$daily_views$`2014-01-02`
[1] 674
$daily_views$`2014-01-03`
[1] 586
$daily_views$`2014-01-01`
[1] 348
$daily_views$`2014-01-08`
[1] 765
$daily_views$`2014-01-09`
[1] 787
$daily_views$`2014-01-31`
[1] 874
$daily_views$`2014-01-30`
[1] 1159
$project
[1] "en"
$month
[1] "201401"
$rank
[1] -1
$title
[1] "web_scraping"
rd.views <- rd$daily_views
rd.views
$`2014-01-15`
[1] 779
$`2014-01-14`
[1] 806
$`2014-01-17`
[1] 827
$`2014-01-16`
[1] 981
$`2014-01-11`
[1] 489
$`2014-01-10`
[1] 782
$`2014-01-13`
[1] 756
$`2014-01-12`
[1] 476
$`2014-01-19`
[1] 507
$`2014-01-18`
[1] 473
$`2014-01-28`
[1] 789
$`2014-01-29`
[1] 799
$`2014-01-20`
[1] 816
$`2014-01-21`
[1] 857
$`2014-01-22`
[1] 899
$`2014-01-23`
[1] 792
$`2014-01-24`
[1] 749
$`2014-01-25`
[1] 508
$`2014-01-26`
[1] 488
$`2014-01-27`
[1] 769
$`2014-01-07`
[1] 786
$`2014-01-04`
[1] 456
$`2014-01-05`
[1] 77
$`2014-01-02`
[1] 674
$`2014-01-03`
[1] 586
$`2014-01-01`
[1] 348
$`2014-01-08`
[1] 765
$`2014-01-09`
[1] 787
$`2014-01-31`
[1] 874
$`2014-01-30`
[1] 1159
rd.views <- unlist(rd.views)
df <- as.data.frame(rd.views)
df
rd.views
2014-01-15 779
2014-01-14 806
2014-01-17 827
2014-01-16 981
2014-01-11 489
2014-01-10 782
2014-01-13 756
2014-01-12 476
2014-01-19 507
2014-01-18 473
2014-01-28 789
2014-01-29 799
2014-01-20 816
2014-01-21 857
2014-01-22 899
2014-01-23 792
2014-01-24 749
2014-01-25 508
2014-01-26 488
2014-01-27 769
2014-01-07 786
2014-01-04 456
2014-01-05 77
2014-01-02 674
2014-01-03 586
2014-01-01 348
2014-01-08 765
2014-01-09 787
2014-01-31 874
2014-01-30 1159
rd <- fromJSON(readLines(url, warn=FALSE))
rd.views <- rd$daily_views
df <- as.data.frame(unlist(rd.views))
require(ggplot2)
require(lubridate)
df$date <- as.Date(rownames(df))
colnames(df) <- c("views","date")
ggplot(df,aes(date,views))+
  geom_line()+
  geom_smooth()+
  theme_bw(base_size=20)
Plot Wikipedia page views in February 2014 for the Sochi games
How does this compare to the number of views on the Russian language wikipedia page?
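A possible starting point for the exercise, built from the pieces above. The article title used here is an assumption: check Wikipedia for the exact page name (the Russian-language title in particular will differ).

```r
# Hypothetical page title "2014_Winter_Olympics" -- verify on Wikipedia first
var <- 201402
url.en <- paste("http://stats.grok.se/json/en/", var, "/2014_Winter_Olympics", sep="")
url.ru <- paste("http://stats.grok.se/json/ru/", var, "/2014_Winter_Olympics", sep="")
# Then fetch and plot exactly as on the previous slides, e.g.:
# rd <- fromJSON(readLines(url.en, warn=FALSE))
url.en
```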
If you are comfortable with loops and functions, go ahead and write a great application.
The rest of us are going to learn about loops
plusOne <- function(x){
  return(x+1)
}
plusOne2 <- function(num){
  return(num+1)
}
plusOne(8)
[1] 9
plusOne2(10)
[1] 11
plusOne2(num=5)
[1] 6
#plusOne2(wrongVar=2) #this would fail: unused argument (wrongVar = 2)
for (number in 1:5){
  print(number)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
a <- c(1,2,3,4,5)
for (value in a){
  print(
    plusOne(value)
  )
}
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
listOfNumbers <- c(1,2,3,4,5)
for (number in listOfNumbers){
  print(
    number+1
  )
}
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
a <- c(1,2,3,4,5)
a[1] #The first number in the vector
[1] 1
a[4] #The fourth number in the vector
[1] 4
for (i in 1:length(a)){
  print(
    plusOne(a[i])
  )
}
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
In each case we are performing an operation on the vector in brackets.
See:
#for (i in 1:length(a))
print(1:length(a))
[1] 1 2 3 4 5
#for (i in length(a))
print (length(a))
[1] 5
#for (i in a)
print (a)
[1] 1 2 3 4 5
a <- c(1,2,3,4,5) #Least flexible, fastest
a+1
[1] 2 3 4 5 6
plusOne(a) #Quite flexible
[1] 2 3 4 5 6
sapply(a,plusOne) #Can be used in all sorts of situations; slower - similar to a loop, but better if you are collecting an output
[1] 2 3 4 5 6
stats.grok.se/json/en/201401/web_scraping
for (month in 1:12){
  print(paste(2014,month,sep=""))
}
[1] "20141"
[1] "20142"
[1] "20143"
[1] "20144"
[1] "20145"
[1] "20146"
[1] "20147"
[1] "20148"
[1] "20149"
[1] "201410"
[1] "201411"
[1] "201412"
We need the variable 'month' to have two digits:
201401
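One way to do this in a single loop, as an alternative to splitting it in two: sprintf() can zero-pad a number to a fixed width.

```r
# sprintf("%02d", n) formats n to two digits, padding with a leading zero
for (month in 1:12){
  print(paste(2014, sprintf("%02d", month), sep=""))
}
```

This prints "201401" through "201412" without any special handling for months 1-9.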
for (month in 1:9){
  print(paste(2012,0,month,sep=""))
}
[1] "201201"
[1] "201202"
[1] "201203"
[1] "201204"
[1] "201205"
[1] "201206"
[1] "201207"
[1] "201208"
[1] "201209"
for (month in 10:12){
  print(paste(2012,month,sep=""))
}
[1] "201210"
[1] "201211"
[1] "201212"
dates=NULL
for (month in 1:9){
  date=paste(2012,0,month,sep="")
  dates=c(dates,date)
}
for (month in 10:12){
  date=paste(2012,month,sep="")
  dates=c(dates,date)
}
print (as.numeric(dates))
[1] 201201 201202 201203 201204 201205 201206 201207 201208 201209 201210
[11] 201211 201212
here we concatenated the values:
dates <- c(c(201201,201202),201203)
print (dates)
[1] 201201 201202 201203
!! To do this with a data.frame, use rbind()
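A minimal sketch of the data.frame version, using made-up example rows in the shape of our page-view data:

```r
# rbind() stacks data.frames row by row; the column names must match
df1 <- data.frame(date=c("2014-01-01","2014-01-02"), views=c(348, 674))
df2 <- data.frame(date="2014-01-03", views=586)
df  <- rbind(df1, df2)
nrow(df)  # one row per date
```

This is the standard pattern for accumulating results inside a scraping loop: start with NULL and rbind() each new chunk onto the running total.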
We might want the year to be a variable too. This shows two nested loops:
for (year in 2012:2013){
  for (month in 1:9){
    print(paste(year,0,month,sep=""))
  }
  for (month in 10:12){
    print(paste(year,month,sep=""))
  }
}
[1] "201201"
[1] "201202"
[1] "201203"
[1] "201204"
[1] "201205"
[1] "201206"
[1] "201207"
[1] "201208"
[1] "201209"
[1] "201210"
[1] "201211"
[1] "201212"
[1] "201301"
[1] "201302"
[1] "201303"
[1] "201304"
[1] "201305"
[1] "201306"
[1] "201307"
[1] "201308"
[1] "201309"
[1] "201310"
[1] "201311"
[1] "201312"
for (year in 2012:2013){
  for (month in 1:9){
    print(paste("http://stats.grok.se/json/en/",year,0,month,"/web_scraping",sep=""))
  }
  for (month in 10:12){
    print(paste("http://stats.grok.se/json/en/",year,month,"/web_scraping",sep=""))
  }
}
[1] "http://stats.grok.se/json/en/201201/web_scraping"
[1] "http://stats.grok.se/json/en/201202/web_scraping"
[1] "http://stats.grok.se/json/en/201203/web_scraping"
[1] "http://stats.grok.se/json/en/201204/web_scraping"
[1] "http://stats.grok.se/json/en/201205/web_scraping"
[1] "http://stats.grok.se/json/en/201206/web_scraping"
[1] "http://stats.grok.se/json/en/201207/web_scraping"
[1] "http://stats.grok.se/json/en/201208/web_scraping"
[1] "http://stats.grok.se/json/en/201209/web_scraping"
[1] "http://stats.grok.se/json/en/201210/web_scraping"
[1] "http://stats.grok.se/json/en/201211/web_scraping"
[1] "http://stats.grok.se/json/en/201212/web_scraping"
[1] "http://stats.grok.se/json/en/201301/web_scraping"
[1] "http://stats.grok.se/json/en/201302/web_scraping"
[1] "http://stats.grok.se/json/en/201303/web_scraping"
[1] "http://stats.grok.se/json/en/201304/web_scraping"
[1] "http://stats.grok.se/json/en/201305/web_scraping"
[1] "http://stats.grok.se/json/en/201306/web_scraping"
[1] "http://stats.grok.se/json/en/201307/web_scraping"
[1] "http://stats.grok.se/json/en/201308/web_scraping"
[1] "http://stats.grok.se/json/en/201309/web_scraping"
[1] "http://stats.grok.se/json/en/201310/web_scraping"
[1] "http://stats.grok.se/json/en/201311/web_scraping"
[1] "http://stats.grok.se/json/en/201312/web_scraping"
getData <- function(url){
  # assumes rjson is already loaded: require(rjson)
  raw.data <- readLines(url, warn=FALSE)
  rd <- fromJSON(raw.data)
  rd.views <- rd$daily_views
  rd.views <- unlist(rd.views)
  rd <- as.data.frame(rd.views)
  rd$date <- rownames(rd)
  rownames(rd) <- NULL
  return(rd)
}
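A hypothetical usage sketch, putting the pieces together: build one URL per month and stack the results with rbind(). The fetch itself needs a network connection, so it is shown commented out.

```r
# Build the month strings and the corresponding stats.grok.se URLs
dates <- c("201201", "201202", "201203")
urls  <- paste("http://stats.grok.se/json/en/", dates, "/web_scraping", sep="")

# Uncomment to actually fetch and combine (requires an internet connection):
# results <- NULL
# for (u in urls){
#   results <- rbind(results, getData(u))
# }
urls
```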