How to scrape all subreddit posts in a given time period (R)

I have a function to scrape all the posts in the Bitcoin subreddit between 2014-11-01 and 2015-10-31.
However, I'm only able to extract about 990 posts, going back only to October 25, and I don't understand why. Following https://github.com/reddit/reddit/wiki/API, I added a Sys.sleep of 15 seconds between each request, to no avail.
I also experimented with another subreddit (fitness), but it likewise returned only around 900 posts.
require(jsonlite)
require(dplyr)

getAllPosts <- function() {
  url <- "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100"
  extract <- fromJSON(url)
  posts <- extract$data$children$data %>%
    dplyr::select(name, author, num_comments, created_utc, title, selftext)
  after <- posts[nrow(posts), 1]
  url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=", after, "&limit=100")
  extract.next <- fromJSON(url.next)
  posts.next <- extract.next$data$children$data
  # keep paging as long as the listing returns any rows
  while (!is.null(nrow(posts.next))) {
    posts.next <- posts.next %>%
      dplyr::select(name, author, num_comments, created_utc, title, selftext)
    posts <- rbind(posts, posts.next)
    after <- posts[nrow(posts), 1]
    url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=", after, "&limit=100")
    Sys.sleep(15)
    extract <- fromJSON(url.next)
    posts.next <- extract$data$children$data
  }
  posts$created_utc <- as.POSIXct(posts$created_utc, origin = "1970-01-01")
  return(posts)
}

posts <- getAllPosts()
Does reddit have some kind of limit that I'm hitting?

Yes, all reddit listings (posts, comments, etc.) are capped at 1000 items; they're essentially just cached lists, rather than queries, for performance reasons.
To get around this, you'll need to do some clever searching based on timestamps.
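One way to act on that, sketched here without testing against the live API: parameterise the question's getAllPosts() by its two epoch timestamps (the getPostsBetween() helper below stands in for that hypothetical variant and is not defined here), then walk the whole period in windows small enough that each one stays under the 1000-item cap.
library(jsonlite)

# getPostsBetween(from, to) is assumed to be the question's getAllPosts()
# with the two epoch timestamps spliced into the search URL instead of
# being hard-coded; it is not defined here.
scrapePeriod <- function(from, to, window = 7 * 24 * 3600) {
  starts <- seq(from, to, by = window)
  ends   <- pmin(starts + window - 1, to)
  chunks <- Map(function(f, t) {
    Sys.sleep(2)                    # stay well inside the API rate limit
    getPostsBetween(f, t)
  }, starts, ends)
  do.call(rbind, chunks)            # one data frame for the whole period
}

# posts <- scrapePeriod(1414800000, 1446335999)  # 2014-11-01 .. 2015-10-31 UTC
If a single window still contains more than 1000 posts, shrink the window argument until it doesn't.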

Related

Dynamically scrape content with Rvest where 3 parts of URL are dynamically changing

I am trying to scrape player data from transfermarkt, where the URLs for individual teams are almost identical but three parts of the URL change dynamically.
I have already scraped 5 years of data, but only for one team, and I want to do it for all of them:
library(rvest)
library(purrr)

# make a target url with the relevant year
url_base <- 'https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=%d'

map_df(2017:2021, function(i) {
  # simple but effective progress indicator
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(name          = html_text(html_nodes(pg, ".hauptlink a , #yw1_c1")),
             date_of_birth = html_text(html_nodes(pg, ".posrela+ .zentriert , .sort-link")),
             market_value  = html_text(html_nodes(pg, ".rechts")),
             season        = i,
             stringsAsFactors = FALSE)
}) -> asSquad
Example of URLs per team:
https://www.transfermarkt.com/**as-trencin**/kader/verein/**7918**/plus/1/galerie/0?saison_id=**2017**
https://www.transfermarkt.com/**slovan-bratislava**/kader/verein/**540**/plus/1/galerie/0?saison_id=**2019**
So far I have been able to scrape one team across 5 years, but how can I do this for all teams at once when three parts of the URL change?
Any advice is welcome, thank you!
Something like:
library(rvest)
teams <- c("as-trencin", "slovan-bratislava")
var2 <- c("7918", "540")
years <- 2017:2018
all <- data.frame()
for (i in seq_along(teams)) {
  for (year in years) {
    url <- paste0("https://www.transfermarkt.com/", teams[i], "/kader/verein/", var2[i],
                  "/plus/1/galerie/0?saison_id=", year)
    print(url)
    # scrape the page, build a data frame, and bind it onto the main one
    pg <- read_html(url)
    asSquad <- data.frame(
      name          = stringi::stri_trim(html_text(html_nodes(pg, ".hauptlink a , #yw1_c1"))),
      date_of_birth = html_text(html_nodes(pg, ".posrela+ .zentriert , .sort-link")),
      market_value  = html_text(html_nodes(pg, ".rechts")),
      season        = year,
      stringsAsFactors = FALSE)
    asSquad <- asSquad[-1, ]   # drop the header row picked up by the selectors
    all <- rbind(all, asSquad)
  }
}
#> [1] "https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=2017"
#> [1] "https://www.transfermarkt.com/as-trencin/kader/verein/7918/plus/1/galerie/0?saison_id=2018"
#> [1] "https://www.transfermarkt.com/slovan-bratislava/kader/verein/540/plus/1/galerie/0?saison_id=2017"
#> [1] "https://www.transfermarkt.com/slovan-bratislava/kader/verein/540/plus/1/galerie/0?saison_id=2018"
If var2 differs for the same team, then add another loop, or keep each team's slug and numeric id together in a lookup table as sketched below.
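A possible sketch of that lookup-table version, untested against the live site (the slug/id pairs are just the two from the question):
library(rvest)
library(purrr)

teams <- data.frame(slug = c("as-trencin", "slovan-bratislava"),
                    id   = c(7918L, 540L),
                    stringsAsFactors = FALSE)
grid <- merge(teams, data.frame(season = 2017:2021))   # cross join: every team x every season

all <- pmap_dfr(grid, function(slug, id, season) {
  url <- sprintf("https://www.transfermarkt.com/%s/kader/verein/%d/plus/1/galerie/0?saison_id=%d",
                 slug, id, season)
  pg <- read_html(url)
  data.frame(name          = stringi::stri_trim(html_text(html_nodes(pg, ".hauptlink a , #yw1_c1"))),
             date_of_birth = html_text(html_nodes(pg, ".posrela+ .zentriert , .sort-link")),
             market_value  = html_text(html_nodes(pg, ".rechts")),
             team          = slug,
             season        = season,
             stringsAsFactors = FALSE)[-1, ]            # drop the header row, as above
})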
Grzegorz

Web scraping of key stats in Yahoo! Finance with R

Is anyone experienced in scraping data from the Yahoo! Finance key statistics page with R? I am familiar with scraping data directly from HTML using read_html(), html_nodes(), and html_text() from the rvest package. However, this page (MSFT key stats) is more complicated: I am not sure whether all the stats are kept in XHR, JS, or the document itself, though I am guessing the data is stored as JSON. If anyone knows a good way to extract and parse the data from this page with R, kindly answer my question, great thanks in advance!
Or, if there is a more convenient way to extract these metrics via quantmod or Quandl, kindly let me know; that would be an extremely good solution!
I know this is an older thread, but I used it to scrape the Yahoo analyst tables, so I figured I would share.
# Yahoo web scrape: analyst tables
library(XML)

symbol <- "HD"
url <- paste0("https://finance.yahoo.com/quote/", symbol, "/analysts?p=", symbol)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")

earningEstimates <- readHTMLTable(tableNodes[[1]])
revenueEstimates <- readHTMLTable(tableNodes[[2]])
earningHistory   <- readHTMLTable(tableNodes[[3]])
epsTrend         <- readHTMLTable(tableNodes[[4]])
epsRevisions     <- readHTMLTable(tableNodes[[5]])
growthEst        <- readHTMLTable(tableNodes[[6]])
Cheers,
Sody
I gave up on Excel a long time ago. R is definitely the way to go for things like this.
library(XML)

stocks <- c("AXP", "BA", "CAT", "CSCO")

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")

  # assign the snapshot table to a data frame named after the ticker
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))

  # add a column to identify the stock
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}

# combine all stock data
stockdatalist <- mget(stocks)
stockdata <- do.call(rbind, stockdatalist)

# move the stock id to the first column
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]

# save to CSV
write.table(stockdata, "C:/Users/your_path_here/Desktop/MyData.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)

# remove temporary objects
rm(df, stockdatalist)
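If you prefer to avoid assign()/get(), roughly the same result can be built with rvest and a list; this is an untested sketch that assumes the snapshot table is still the ninth <table> on each finviz quote page, as in the XML version above:
library(rvest)
library(dplyr)

stocks <- c("AXP", "BA", "CAT", "CSCO")

stockdata <- bind_rows(lapply(stocks, function(s) {
  pg  <- read_html(paste0("http://finviz.com/quote.ashx?t=", s))
  tbl <- html_table(html_nodes(pg, "table")[[9]])       # snapshot table, by position
  names(tbl) <- paste0("data", seq_along(tbl))
  data.frame(stock = s, tbl, stringsAsFactors = FALSE)  # stock id in the first column
}))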
When I use the methods shown here with XML library, I get a Warning
Warning in readLines(page) : incomplete final line found on
'https://finance.yahoo.com/quote/DIS/key-statistics?p=DIS'
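That particular warning is harmless (the page simply lacks a trailing newline) and can be silenced with readLines()'s warn argument:
webpage <- readLines(url, warn = FALSE)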
We can use rvest and xml2 for a cleaner approach. This example demonstrates how to pull a key statistic from the key-statistics Yahoo! Finance page. Here I want to obtain the float of an equity. I don't believe float is available from quantmod, but some of the key stats values are. You'll have to reference the list.
library(xml2)
library(rvest)
getFloat <- function(stock) {
  url <- paste0("https://finance.yahoo.com/quote/", stock, "/key-statistics?p=", stock)
  tables <- read_html(url) %>%
    html_nodes("table") %>%
    html_table()
  float <- as.vector(tables[[3]][4, 2])
  last  <- substr(float, nchar(float), nchar(float))   # unit suffix: k, M or B
  float <- gsub("[a-zA-Z]", "", float)
  float <- as.numeric(as.character(float))
  if (last == "k") {
    float <- float * 1000
  } else if (last == "M") {
    float <- float * 1000000
  } else if (last == "B") {
    float <- float * 1000000000
  }
  return(float)
}
getFloat("DIS")
[1] 1.81e+09
That's a lot of shares of Disney available.
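As noted above, float itself may not be exposed by quantmod, but for the key stats it does cover, getQuote() with a yahooQF() field list is a lighter-weight option than scraping. The two field names below are assumptions on my part; they must match quantmod's own list exactly (see ?yahooQF), and Yahoo changes the underlying API from time to time:
library(quantmod)
# field names must match quantmod's yahooQF() list; check ?yahooQF for what is available
getQuote("DIS", what = yahooQF(c("Market Capitalization", "Earnings/Share")))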

Scraping data from a site with multiple urls

I've been trying to scrape a list of companies from the Fortune 500 archive site (the company-list pages, e.g. .../401.html). I can scrape the single table off of one such page with this code:
library(rvest)

fileurl <- read_html("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/1")
content <- fileurl %>%
  html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
  html_table()
contentframe <- data.frame(content)
View(contentframe)
However, I need all of the data going back to 1955 as well as 2005, and the full list of companies 1 through 500, whereas each page only shows 100 companies for a single year. I've noticed that the only parts of the URL that change are "...fortune500_archive/full/" YEAR "/" and the starting rank (1, 201, 301, or 401, depending on the range of companies shown).
I also understand that I have to create a loop that will collect this data for me automatically, rather than manually replacing the URL and saving each table. I've tried a few variations of sapply from reading other posts and watching videos, but none have worked for me and I'm lost.
A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.
library(rvest)

getData <- function(year, start) {
  url <- sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html",
                 year, start)
  fileurl <- read_html(url)
  content <- fileurl %>%
    html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
    html_table()
  data.frame(content)
}
We can then loop through the years and pages using lapply (as well as do.call(rbind, ...) to rbind all 5 dataframes from each year together). E.g.:
D <- lapply(2000:2005, function(year) {
  do.call(rbind, lapply(seq(1, 500, 100), function(start) {
    cat(paste("Retrieving", year, ":", start, "\n"))
    getData(year, start)
  }))
})
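D ends up as a list with one combined data frame per year. Assuming every page parses into the same set of columns, the years can then be stacked into a single table, tagged with the year:
fortune500 <- do.call(rbind, Map(cbind, D, year = 2000:2005))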

Loop for a string

This code will be used to count the number of links in my collection of tweets. The collection was gathered from 10 accounts. The question is: how can I loop through the ten accounts in one go and put the output into a table or graph? "Unames" represents the name of the account. Thanks in advance.
mydata <- read.csv("tweets.csv",sep=",", header=TRUE)
head(mydata)
dim(mydata)
colnames(mydata)
# tweets for each university
table(mydata$University)
Unames <- unique(mydata$University)

mystring <- function(Uname, string) {
  mydata_temp <- subset(mydata, University == Uname)
  mymatch <- rep(NA, dim(mydata_temp)[1])
  for (i in 1:dim(mydata_temp)[1]) {
    mymatch[i] <- length(grep(string, mydata_temp[i, 2]))
  }
  return(mymatch)
}
# web links, e.g.: here I would like to see the total links for all universities in a table or graph; the code below only gives me the output one account at a time!
mylink <- mystring(Unames[1],"http://")
So my suspicions are wrong and you do have a body of data for which this command produces the desired result for a single account:
mylink <- mystring(Unames[1],"http://")
In that case, you should just do this:
links_list <- lapply(Unames, mystring, "http://")
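To then summarise the result as a table or a quick bar chart (mystring() returns one 0/1 value per tweet, so summing gives the number of link-containing tweets per account):
names(links_list) <- Unames
link_counts <- sapply(links_list, sum)   # table: total tweets with a link, per university
link_counts
barplot(link_counts, las = 2, ylab = "tweets containing a link")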

Importing data into R from google spreadsheet

There seems to be a change in the Google spreadsheet publishing options. It is no longer possible to publish to the web as a CSV or tab-separated file (see this recent post). Thus the usual way of using RCurl to import data into R from a Google spreadsheet no longer works:
require(RCurl)
u <- "https://docs.google.com/spreadsheet/pub?hl=en_GB&hl=en_GB&key=0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE&single=true&gid=4&output=csv"
tc <- getURL(u, ssl.verifypeer=FALSE)
net <- read.csv(textConnection(tc))
Does anyone have a work-around?
I just wrote a simple package to solve exactly this problem: downloading a Google sheet using just the URL.
install.packages('gsheet')
library(gsheet)
gsheet2tbl('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo')
More detail is here: https://github.com/maxconway/gsheet
Use the googlesheets4 package, a Google Sheets R API by Jenny Bryan. It is the best way to analyze and edit Google Sheets data in R. Not only can it pull data from Google Sheets, but you can edit the data in Google Sheets, create new sheets, etc.
The package can be installed with install.packages("googlesheets4").
There's a vignette for getting started; see her GitHub repository for more. And you also can install the latest development version of the package from that GitHub page, if desired.
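A minimal sketch of reading a sheet with googlesheets4 (the URL is a placeholder for your own sheet; gs4_deauth() skips the OAuth prompt and is only appropriate for sheets shared publicly):
library(googlesheets4)
gs4_deauth()                 # no login needed for a publicly shared sheet
dat <- read_sheet("https://docs.google.com/spreadsheets/d/<sheet-id>")
head(dat)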
I am working on a solution for this. Here is a function that works on your data as well as a few of my own Google Spreadsheets.
First, we need a function to read from Google sheets. readGoogleSheet() will return a list of data frames, one for each table found on the Google sheet:
readGoogleSheet <- function(url, na.string = "", header = TRUE) {
  stopifnot(require(XML))
  # Suppress warnings because Google Docs seems to have an incomplete final line
  suppressWarnings({
    doc <- paste(readLines(url), collapse = " ")
  })
  if (nchar(doc) == 0) stop("No content found")
  htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc)
  ret <- readHTMLTable(htmlTable, header = header, stringsAsFactors = FALSE, as.data.frame = TRUE)
  lapply(ret, function(x) { x[x == na.string] <- NA; x })
}
Next, we need a function to clean the individual tables. cleanGoogleTable() removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:
cleanGoogleTable <- function(dat, table = 1, skip = 0, ncols = NA, nrows = -1,
                             header = TRUE, dropFirstCol = NA) {
  if (!is.data.frame(dat)) {
    dat <- dat[[table]]
  }
  if (is.na(dropFirstCol)) {
    firstCol <- na.omit(dat[[1]])
    if (all(firstCol == ".") || all(firstCol == as.character(seq_along(firstCol)))) {
      dat <- dat[, -1]
    }
  } else if (dropFirstCol) {
    dat <- dat[, -1]
  }
  if (skip > 0) {
    dat <- dat[-seq_len(skip), ]
  }
  if (nrow(dat) == 1) return(dat)
  if (nrow(dat) >= 2) {
    if (all(is.na(dat[2, ]))) dat <- dat[-2, ]
  }
  if (header && nrow(dat) > 1) {
    header <- as.character(dat[1, ])
    names(dat) <- header
    dat <- dat[-1, ]
  }
  # Keep only the desired columns
  if (!is.na(ncols)) {
    ncols <- min(ncols, ncol(dat))
    dat <- dat[, seq_len(ncols)]
  }
  # Keep only the desired rows
  if (nrows > 0) {
    nrows <- min(nrows, nrow(dat))
    dat <- dat[seq_len(nrows), ]
  }
  # Renumber rows
  rownames(dat) <- seq_len(nrow(dat))
  dat
}
Now we are ready to read your Google sheet:
> u <- "https://docs.google.com/spreadsheets/d/0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE/pubhtml"
> g <- readGoogleSheet(u)
> cleanGoogleTable(g, table=1)
2012-Jan Mobile internet Tanzania
1 Airtel Zantel Vodacom Tigo TTCL Combined
> cleanGoogleTable(g, table=2, skip=1)
BUNDLE FEE VALIDITY MB Cost Sh/MB
1 Daily Bundle (20MB) 500/= 1 day 20 500 25.0
2 1 Day bundle (300MB) 3,000/= 1 day 300 3,000 10.0
3 Weekly bundle (3GB) 15,000/= 7 days 3,000 15,000 5.0
4 Monthly bundle (8GB) 70,000/= 30 days 8,000 70,000 8.8
5 Quarterly Bundle (24GB) 200,000/= 90 days 24,000 200,000 8.3
6 Yearly Bundle (96GB) 750,000/= 365 days 96,000 750,000 7.8
7 Handset Browsing Bundle(400 MB) 2,500/= 30 days 400 2,500 6.3
8 STANDARD <NA> <NA> 1 <NA> <NA>
Not sure if other use cases have a higher complexity or if something changed in the meantime. After publishing the spreadsheet in CSV format this simple 1-liner worked for me:
myCSV<-read.csv("http://docs.google.com/spreadsheets/d/1XKeAajiH47jAP0bPkCtS4OdOGTSsjleOXImDrFzxxZQ/pub?output=csv")
R version 3.3.2 (2016-10-31)
Here is an easy way to fetch Google Sheets, even if you're behind a proxy:
require(RCurl)
fileUrl <- "https://docs.google.com/spreadsheets/d/[ID]/export?format=csv"
fileCSV <- getURL(fileUrl,.opts=list(ssl.verifypeer=FALSE))
fileCSVDF <- read.csv(textConnection(fileCSV))
A simpler way.
Be sure to match your URL carefully to the format of the example one here. You can get all but the /export?format=csv piece from the Google Spreadsheets edit page. Then, just manually add this piece to the URL and then use as shown here.
library(RCurl)
library(mosaic)
mydat2 <- fetchGoogle(paste0("https://docs.google.com/spreadsheets/d/",
"1mAxpSTrjdFv1UrpxwDTpieVJP16R9vkSQrpHV8lVTA8/export?format=csv"))
mydat2
Scrape the html table using httr and XML packages.
library(XML)
library(httr)
url <- "https://docs.google.com/spreadsheets/d/12MK9EFmPww4Vw9P6BShmhOolH1C45Irz0jdzE0QR3hs/pubhtml"
readSpreadsheet <- function(url, sheet = 1) {
  r <- GET(url)
  html <- content(r)
  sheets <- readHTMLTable(html, header = FALSE, stringsAsFactors = FALSE)
  df <- sheets[[sheet]]
  dfClean <- function(df) {
    nms <- t(df[1, ])
    names(df) <- nms
    df <- df[-1, -1]
    row.names(df) <- seq(1, nrow(df))
    df
  }
  dfClean(df)
}
df <- readSpreadsheet(url)
df
Publish as CSV doesn't seem to be supported (or at least isn't currently supported) in the new Google Sheets, which is the default for any new sheet you create. You can, though, create a sheet in the old Google Sheets format, which does support publish as CSV, through this link... https://g.co/oldsheets.
More details on the new vs. old Sheets is here... https://support.google.com/drive/answer/3541068?p=help_new_sheets&rd=1
Thanks for this solution! It works as well as the old one. I used another fix to get rid of the blank first line: if you simply exclude it, you might accidentally delete a valid observation once the line is 'unfrozen'. The extra instruction in the function deletes any rows that have no time stamp.
readSpreadsheet <- function(url, sheet = 1) {
  library(httr)
  r <- GET(url)
  html <- content(r)
  sheets <- readHTMLTable(html, header = FALSE, stringsAsFactors = FALSE)
  df <- sheets[[sheet]]
  dfClean <- function(df) {
    nms <- t(df[1, ])
    names(df) <- nms
    df <- df[-1, -1]
    df <- df[df[, 1] != "", ]   # only keep rows with time stamps
    row.names(df) <- seq(1, nrow(df))
    df
  }
  dfClean(df)
}
It is still (as of May 2015) possible to get a CSV file out of Google Spreadsheets, using the hidden URL <sheeturl>/export?format=csv trick.
However, after solving this problem, one encounters another: numbers are formatted according to the locale of the sheet, e.g. you may get 1,234.15 in a "US" sheet or 1.234,15 in a "German" sheet. To check a sheet's locale, go to File > Spreadsheet Settings in Google Docs.
Now you need to remove the grouping (thousands) separator from the numeric columns so that R can parse them; depending on how large your numbers are, this may need to be done several times per column. A simple function I wrote to accomplish this:
# helper function to load a Google sheet and adjust for the thousands separator (,)
getGoogleDataset <- function(id) {
  download.file(paste0('https://docs.google.com/spreadsheets/d/', id, '/export?format=csv'),
                'google-ds.csv', 'curl')
  lines <- scan('google-ds.csv', character(0), sep = "\n")
  # strip the comma between digit groups inside quoted numbers, repeatedly
  pattern <- "\"([0-9]+),([0-9]+)"
  for (i in seq_along(lines)) {
    while (length(grep(pattern, lines[i])) > 0) {
      lines[i] <- gsub(pattern, "\"\\1\\2", lines[i])
    }
  }
  return(read.csv(textConnection(lines)))
}
You will need to require(utils) and have curl installed, but no other extra packages.
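If you'd rather not hand-roll the regex, readr's parse_number() does the same normalisation once the CSV is read, with the locale declared explicitly; a quick illustration for a German-locale value:
library(readr)
parse_number("1.234,15", locale = locale(grouping_mark = ".", decimal_mark = ","))
#> [1] 1234.15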
