Skipping errors in a loop of Google Trends requests - R

For my bachelor thesis I need to pull Google Trends data for several brands in different countries.
As I am totally new to R, a friend of mine helped me create the code for a loop which does this automatically.
After a while the error
data must be a data frame, or other object coercible by fortify(), not a list
appears and the loop stops. When checking the Google Trends page itself, I found out that there is not enough data to support the request.
My question is whether it is possible to continue the loop regardless of the error and simply "skip" the request responsible for it.
I already looked around in other threads, but try() appears not to work here, or I did it wrong.
I also changed low_search_volume = FALSE (the default) to TRUE, but that didn't change anything.
library(gtrendsR)

for (row in 1:nrow(my_data)) {
  country_code <- as.character(my_data[row, "Country_Code"])
  query <- as.character(my_data[row, "Brand"])
  trend <- gtrends(
    c(query),
    geo = country_code,
    category = 68,
    low_search_volume = TRUE,
    time = "all"
  )
  plot(trend)
  export <- trend[["interest_over_time"]]
  filepath <- paste(
    "C:\\Users\\konst\\Desktop\\Uni\\Bachelorarbeit\\R\\Ganzer Datensatz\\",
    query, "_", country_code,
    ".csv",
    sep = ""
  )
  write.csv(export, filepath)
}
To reproduce the error, use the following list:
Brand      Country Code
Gucci      MA
Gucci      US
allsaints  MA
allsaints  US
The allsaints MA request should produce the error; therefore, the allsaints US request will not be processed.
Thank you all in advance for your assistance.
Best wishes from Hamburg, Germany
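
For what it's worth, one possible way to skip failing requests (a sketch only, assuming the same my_data and gtrendsR call as above; the skipping logic is not part of the original code) is to wrap the loop body in tryCatch() and move on when any step errors:

for (row in 1:nrow(my_data)) {
  country_code <- as.character(my_data[row, "Country_Code"])
  query <- as.character(my_data[row, "Brand"])
  tryCatch({
    trend <- gtrends(
      c(query),
      geo = country_code,
      category = 68,
      low_search_volume = TRUE,
      time = "all"
    )
    plot(trend)
    export <- trend[["interest_over_time"]]
    write.csv(export, paste0("C:\\Users\\konst\\Desktop\\Uni\\Bachelorarbeit\\R\\Ganzer Datensatz\\",
                             query, "_", country_code, ".csv"))
  }, error = function(e) {
    # Not enough data (or any other error): report the combination and move on
    message("Skipping ", query, " / ", country_code, ": ", conditionMessage(e))
  })
}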

Related

Downloading and storing multiple files from URLs in R; skipping URLs that are empty

Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:
- Each document I want to scrape has a document number. However, the numbers don't always go up in order. For example, one document number is 2022, but the next one is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried to wrap download.file in purrr::safely(), but once it hits a document that does not exist it stops.
- Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. Indexing the path for where to store the downloaded data ends up with the first document stored in the named place and the next document as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
temp.doc.name <- paste0(base.url,
document.name.1,
document.numbers[i],
document.extension)
print(temp.doc.name)
#download and save data
safely <- purrr::safely(download.file(temp.doc.name,
destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to more efficient ways to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier way to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL and see whether it returns a "good" status (usually 200), and don't download when it returns a "bad" status (like 404). The following code block does that.
Note that purrr::safely doesn't run a function -- it creates another function that is safe and which you can then call. The created function returns a list with two slots: result and error.
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333,2552,2321)
sHEAD = purrr::safely(httr::HEAD)
sdownload = purrr::safely(download.file)
for (i in seq_along(document.numbers)) {
file_name = paste0(document.name.1,document.numbers[i],document.extension)
temp.doc.name <- paste0(base.url,file_name)
print(temp.doc.name)
print(sHEAD(temp.doc.name)$result$status)
if(sHEAD(temp.doc.name)$result$status %in% 200:299){
sdownload(temp.doc.name,destfile=file_name)
}
}
It might not be as simple as all of the valid URLs returning a '200' status; I think URLs in the 200:299 range are generally OK (I edited the answer to reflect this).
I used parts of this answer in my answer.
If the file does not exist, tryCatch simply skips it:
library(tidyverse)

get_data <- function(index) {
  paste0(
    "https://www.europarl.europa.eu/doceo/document/",
    "P-9-2022-00",
    index,
    "_EN.docx"
  ) %>%
    download.file(url = .,
                  destfile = paste0(index, ".docx"),
                  mode = "wb",
                  quiet = TRUE) %>%
    tryCatch(.,
             error = function(e) print(paste(index, "does not exist - SKIPS")))
}

map(2000:5000, get_data)

Loop googleanalyticsR - Error in if (nrow(out) < all_rows) { : argument is of length zero

I am using googleanalyticsR to download all the data I can from Google Analytics. My objective is to build a small data frame to analyze.
To download all the data I created a loop:
for (i in 1:length(metricsarray)) {
  print(paste(i))
  tryCatch( google_analytics_4(my_id,
                               date_range = c(start_date, end_date),
                               metrics = metricsarray[i],
                               dimensions = c('transactionId'),
                               max = -1)) %>%
    assign(gsub(" ", "", paste("metricsarray", i, sep = "")), ., inherits = TRUE)
}
The loop runs from 1 to 11 with no problems, i.e. it prints the number i and gives me the message:
Downloaded [3537] rows from a total of [3537]
But I got this error when it reached i = 12 in metricsarray[i]:
2017-10-04 10:37:56> Downloaded [0] rows from a total of [].
Error in if (nrow(out) < all_rows) { : argument is of length zero
I used tryCatch, but with no effect. My objective was for the loop to continue testing each metricsarray[i] until the end.
It should also continue when it finds the error:
JSON fetch error: Selected dimensions and metrics cannot be queried
together.
I am new to using the Google Analytics API in R, so feel free to suggest solutions, articles or anything you think will help me gain more knowledge about this.
Thank you,
JSON fetch error: Selected dimensions and metrics cannot be queried
together.
Not all Google Analytics dimensions and metrics can be queried together. The main reason is that either the data doesn't exist or the combination would make no sense.
The best way to test which metadata can be queried together is to check the dimensions and metrics reference; invalid items will be grayed out.
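
As a sketch only (this code is not from the original answer): the loop from the question could skip the incompatible combinations by giving tryCatch an explicit error handler and only assigning a result when the call succeeds. The object names and google_analytics_4() arguments are reused from the question; the skip logic is my assumption:

for (i in 1:length(metricsarray)) {
  print(i)
  result <- tryCatch(
    google_analytics_4(my_id,
                       date_range = c(start_date, end_date),
                       metrics = metricsarray[i],
                       dimensions = c('transactionId'),
                       max = -1),
    error = function(e) {
      # Incompatible metric/dimension pair (or any other failure): skip it
      message("Skipping metricsarray[", i, "]: ", conditionMessage(e))
      NULL
    }
  )
  if (!is.null(result)) {
    assign(paste0("metricsarray", i), result)
  }
}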

Why is my Rfacebook loop script not working when there is a post with zero comments?

I've edited my question to be more relevant.
It's been less than a month since I started to learn R, and I'm trying to use it to get rid of the tedious work related to Facebook (extracting comments) that we do for our reports.
Using the Rfacebook package, I made this script which extracts (1) the posts of the page for a given period and (2) the comments on those posts. It worked well for the page I'm doing the report for, but when I tried it on other pages with posts that had zero comments, it reported an error.
Here's the script:
Loading libraries
library(Rfacebook)
library(lubridate)
library(tibble)
Setting time period. Change time as you please.
current_date <-Sys.Date()
past30days<-current_date-30
Assigning a page. Edit this to the page you are monitoring.
brand<-'bpi'
Authenticating Facebook. Use your own app ID and secret.
app_id <- "xxxxxxxxxxxxxxxx"
app_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
token <- fbOAuth(app_id,app_secret,extended_permissions = FALSE,legacy_permissions = FALSE)
Extract all posts from a page
listofposts <- getPage(brand, token, n = 5000, since = past30days, until = current_date, feed = FALSE, reactions = FALSE, verbose=TRUE)
write.csv(listofposts,file = paste0('AsOf',current_date,brand,'Posts','.csv'))
Convert to a data frame
df<-as_tibble(listofposts)
Convert to a vector
postidvector<-df[["id"]]
Get the number of posts in the period
n<-length(postidvector)
Produce all comments via loop
reactions<-vector("list",n)
for (i in 1:n) {
  reactions[[i]] <- assign(paste(brand, 'Comments', i, sep = ""),
                           getPost(postidvector[i], token, comments = T, likes = F,
                                   n.likes = 5000, n.comments = 10000))
}
Extract all comments per post to CSV
for (j in 1:n) {
  write.csv(reactions[[j]], file = paste0('AsOf', current_date, brand, 'Comments', j, '.csv'))
}
Here's the error I get when exporting the comments to CSV on pages with posts that had ZERO comments:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
I tried it on a heavy traffic page, and it worked fine too. One post had 10,000 comments and it extracted just fine. :(
Thanks in advance! :D
Pages can be restricted by age or location. You can't use an App Access Token for those, because it does not include a user session, so Facebook does not know if you are allowed to see the Page content. You will have to use a User Token or Page Token for those.
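
Separately from the token question, here is a minimal sketch of how the export loop could skip posts with no comments. It is an assumption on my part that getPost() returns a list whose comments element is NULL or has zero rows for such posts, and this version writes only the comments element rather than the whole list:

for (j in 1:n) {
  post_comments <- reactions[[j]]$comments
  if (is.null(post_comments) || nrow(post_comments) == 0) {
    message("Post ", j, " has no comments - skipping")
    next
  }
  write.csv(post_comments, file = paste0('AsOf', current_date, brand, 'Comments', j, '.csv'))
}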

requesting data from the Center for Disease Control using RSocrata or XML in R

My goal is to obtain a time series, from week 1 of 1996 to week 46 of 2016, of legionellosis cases from this website supported by the Centers for Disease Control and Prevention (CDC) of the United States. A coworker attempted to scrape only the tables that contain legionellosis cases with the code below:
#install.packages('rvest')
library(rvest)

## Code to get all URLs
getUrls <- function(y1, y2, clist) {
  root  <- "https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
  root1 <- "&mmwr_week="
  root2 <- "&mmwr_table=2"
  root3 <- "&request=Submit&mmwr_location="
  urls <- NULL
  for (year in y1:y2) {
    for (week in 1:53) {
      for (part in clist) {
        urls <- c(urls, paste(root, year, root1, week, root2, part, root3, sep = ""))
      }
    }
  }
  return(urls)
}
TabList<-c("A","B") ## can change to get not just 2 parts of the table but as many as needed.
WEB <- as.data.frame(getUrls(1996,2014,TabList)) # Only applies from 1996-2014. After 2014, the root url changes.
head(WEB)
#Example of how to extract data from a single webpage.
url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table, fill = TRUE)[[2]]
#test if Legionellosis is in the table. Returns a vector showing the columns index if the text is found.
#Can use this command to filter only pages that you need and select only those columns.
test <- grep("Leg", sb)
sb <- sb[,c(1,test)]
### This code only works if you have 3 columns for headings. Need to adapt to be more general for all tables.
#Get Column names
colnames(sb) <- paste(sb[2,], sb[3,], sep="_")
colnames(sb)[1] <- "Area"
sb <- sb[-c(1:3),]
#Remove commas from numbers so that you can then convert columns to numerical values. Only important if numbers above 1000
Dat <- sapply(sb, FUN= function(x)
as.character(gsub(",", "", as.character(x), fixed = TRUE)))
Dat<-as.data.frame(Dat, stringsAsFactors = FALSE)
However, the code is not finished, and I thought it might be best to use the API instead, since the structure and layout of the tables in the webpages change over time. This way we wouldn't have to comb through the tables to figure out when the layout changes and how to adjust the web-scraping code accordingly. Thus I attempted to pull the data from the API.
Now, I found two help documents from the CDC that provide the data. One appears to provide data from 2014 onward, using RSocrata, which can be seen here, while the other instruction appears to be more generalized and uses an XML-format request over HTTP, which can be seen here. The XML request over HTTP required a database ID which I could not find. I then stumbled onto RSocrata and decided to try that instead. But the code snippet provided, along with the token ID I set up, did not work.
install.packages("RSocrata")
library("RSocrata")
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")
How can I fix this? My end goal is a table of legionellosis cases from 1996 to 2016 on a weekly basis by state.
I'd recommend checking out this issue thread in the RSocrata GitHub repo where they're discussing a similar issue with passing tokens into the RSocrata library.
In the meantime, you can actually leave off the $$app_token parameter, and as long as you're not flooding us with requests, it'll work just fine. There's a throttling limit you can sneak under without using an app token.
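
As a minimal sketch of that suggestion (simply the question's call with the token removed; I am assuming the endpoint accepts the resource URL as-is without a token):

library(RSocrata)
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au")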

Check if table within website exists R

For a little project for myself I'm trying to get the results from some races.
I can access the pages with the results and download the data from the table on each page. However, there are only 20 results per page. Luckily, the web addresses are built logically, so I can create them and, in a loop, access these pages and download the data. However, each category has a different number of racers and thus can have a different number of pages. I want to avoid having to manually check how many racers there are in each category.
My first thought was to just generate a lot of links, making sure there are enough (based on the total number of racers) to get all the data.
library(XML)  # readHTMLTable() comes from the XML package

nrs <- rep(seq(1, 5, 1), 2)
sex <- c("M", "M", "M", "M", "M", "F", "F", "F", "F", "F")
links <- NULL

# Loop to create 10 links: 5 for the male age group 18-24, 5 for the female age group 18-24.
# However, there are only 3 pages in the male age group with a table.
for (i in 1:length(nrs)) {
  links[i] <- paste("http://www.ironman.com/triathlon/events/americas/ironman/texas/results.aspx?p=",
                    nrs[i], "&race=texas&rd=20160514&sex=", sex[i], "&agegroup=18-24&loc=", sep = "")
}

resultlist <- list()  # create empty list to store results

for (i in 1:length(links)) {
  results <- readHTMLTable(links[i],
                           as.data.frame = TRUE,
                           which = 1,
                           stringsAsFactors = FALSE,
                           header = TRUE)  # get data
  resultlist[[i]] <- results  # combine results in one big list
}

results <- do.call(rbind, resultlist)  # combine results into data frame
As you can see in this code, readHTMLTable throws an error as soon as it encounters a page with no table, and then stops.
I thought of two possible solutions.
1) Somehow check whether all the links exist. I tried url.exists from the RCurl package, but this doesn't work: it returns TRUE for all pages, because the page does exist, it just doesn't have a table in it (so for me it would be a false positive). I would need some code to check whether a table exists in the page, but I don't know how to go about that.
2) Suppress the error message from readHTMLTable so the loop continues, but I'm not sure if that's possible.
Any suggestions for these two methods, or any other suggestions?
I think that method #2 is easier. I modified your code with tryCatch, one of R's built-in exception-handling mechanisms. It works for me.
PS I would recommend using rvest for web scraping like this.
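
The modified code isn't shown above, so here is a hedged reconstruction of what the tryCatch version might look like, reusing the links vector and readHTMLTable call from the question (the return-NULL-on-error handling is my assumption, not necessarily the original answer's code):

resultlist <- list()  # create empty list to store results

for (i in 1:length(links)) {
  results <- tryCatch(
    readHTMLTable(links[i],
                  as.data.frame = TRUE,
                  which = 1,
                  stringsAsFactors = FALSE,
                  header = TRUE),
    error = function(e) NULL  # no table on this page: return NULL instead of stopping
  )
  if (!is.null(results)) {
    resultlist[[length(resultlist) + 1]] <- results
  }
}

results <- do.call(rbind, resultlist)  # combine results into data frame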
