Grabbing Tweets from Multiple Timelines (twitteR userTimeline function)

I'm trying to grab the most recent tweets from multiple users. I have already registered my application and have the requisite keys and tokens.
I know that for a single user, the command is:
recent <- twListToDF(userTimeline("twitterID", n = 15))
However, I'm unsure how to grab the Tweets for multiple IDs, and how to combine them into a data frame.
I tried:
targets <- c("a","b","c")
recent <- twListToDF(userTimeline("targets",n=15))
where a, b, and c are IDs, but I get the error message:
Error in twInterfaceObj$doAPICall(cmd, params, method, ...) :
Not Found (HTTP 404).
It doesn't seem to matter whether targets is surrounded by quotes or not. Is there a simple way to grab tweets from multiple IDs, or do I need to build a vector and iterate through it?

I figured it out, so I thought I'd share my solution to the problem.
Assume I have the same vector of screen names to pull Tweets from, called targets.
I put the output into a list, called output, with each entry corresponding to the tweets from the same index value in targets. I.e., output[[1]] contains all the tweets from targets[1], one tweet per row.
num <- length(targets)
output <- vector("list", num)
for (i in 1:num) {
  output[[i]] <- getTweets(targets[i])  # use [[ ]] so each slot holds a full data frame
}
getTweets is a small wrapper around twListToDF(userTimeline(handle, n = 15)).
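For completeness, here is a minimal sketch of that helper as described above (the name getTweets and the n = 15 limit come from the text; nothing else is assumed):
library(twitteR)
# Pull the 15 most recent tweets for one screen name and return them as a data frame
getTweets <- function(handle) {
  twListToDF(userTimeline(handle, n = 15))
}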
To put all the Tweets into a single data frame, complete with the other info:
masterFrame <- data.frame()
for (i in 1:num) {
  tempFrame <- getTweets(targets[i])
  masterFrame <- rbind(masterFrame, tempFrame)  # append this handle's tweets
}

Downloading and storing multiple files from URLs on R; skipping URLs that are empty

Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:
- Each document I want to scrape has a document number. However, the numbers don't always go up in order: one document number is 2022, but the next one is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried wrapping download.file in purrr::safely(), but once it hits a document that does not exist it stops.
- Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. When I index the path where the downloaded data should be stored, the first document ends up in the named place and the next document comes out as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
  temp.doc.name <- paste0(base.url,
                          document.name.1,
                          document.numbers[i],
                          document.extension)
  print(temp.doc.name)
  #download and save data
  safely <- purrr::safely(download.file(temp.doc.name,
                                        destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to more efficient ways to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier way to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL, download it when it returns a "good" status (usually 200), and skip it when it returns a "bad" status (like 404). The code block below does that.
Note that purrr::safely doesn't run a function -- it creates another function that is "safe" and which you can then call. The created function returns a list with two slots: result and error.
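For instance (a minimal illustration with a base R function, not part of the original answer):
slog <- purrr::safely(log)
slog(100)$result     # 4.60517
slog(100)$error      # NULL
slog("oops")$result  # NULL -- the error was captured rather than thrown
slog("oops")$error   # the captured error condition object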
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333, 2552, 2321)
# "safe" versions that return a list with $result and $error instead of throwing
sHEAD <- purrr::safely(httr::HEAD)
sdownload <- purrr::safely(download.file)
for (i in seq_along(document.numbers)) {
  file_name <- paste0(document.name.1, document.numbers[i], document.extension)
  temp.doc.name <- paste0(base.url, file_name)
  print(temp.doc.name)
  # poll the URL first; only download when the HEAD request succeeded with a 2xx status
  head_result <- sHEAD(temp.doc.name)$result
  print(head_result$status_code)
  if (!is.null(head_result) && head_result$status_code %in% 200:299) {
    sdownload(temp.doc.name, destfile = file_name)
  }
}
It might not be as simple as all of the valid URLs returning a '200' status; in general, statuses in the range 200:299 are OK (I've edited the answer to reflect this).
I used parts of this answer in my answer.
If the file does not exist, tryCatch simply skips it:
library(tidyverse)
get_data <- function(index) {
  # build the document URL, download it, and catch the error thrown
  # when a document with this number does not exist
  paste0(
    "https://www.europarl.europa.eu/doceo/document/",
    "P-9-2022-00",
    index,
    "_EN.docx"
  ) %>%
    download.file(url = .,
                  destfile = paste0(index, ".docx"),
                  mode = "wb",
                  quiet = TRUE) %>%
    tryCatch(.,
             error = function(e) print(paste(index, "does not exist - SKIPPING")))
}
map(2000:5000, get_data)

Using a for loop with the spotifyr get_artist_audio_features function in R, skipping errors in the loop

I downloaded my personal Spotify data from the Spotify website.
I converted these data from JSON to a regular R dataframe for further analysis. This personal dataframe has 4 columns:
Endtime, artistName, trackName, Msplayed
However, Spotify has many variables attached to an artist's songs that you can only retrieve using the get_artist_audio_features function from the spotifyr package. I want to join these variables to my personal data frame. The package only allows data retrieval for one artist at a time, and it would be very time-consuming to write a line of code for each of the 3000+ artists in my data frame.
I used a for loop to try to collect the metadata for the artists:
empty_list <- vector(mode = "list")
for (i in df$artistName) {
  empty_list[[i]] <- get_artist_audio_features(i)
}
My data frame also has podcasts, for which none of this metadata is available. When I try using the function on a podcast I get the error message:
Error in get_artist_audio_features(i) :
No artist found with artist_id=''.
In addition: Warning messages:
1: Unknown or uninitialised column: `id`.
2: Unknown or uninitialised column: `name`.
When I use the for loop, it stops as soon as the first error (podcast) in the data frame occurs. When I feed it a vector of only artists and no podcasts, it works perfectly.
I checked Stack Overflow for possible answers (most notably: Skipping error in for-loop) but I can't get the loop to work.
My question: how can I use the function spotifyr::get_artist_audio_features in a for loop, skip the errors, and store the results in a list? Unfortunately, it is very difficult to post a reproducible example, since you need to activate a developer account on Spotify to use the spotifyr package.
It looks like your issue is the artist_id = '' part of the error, so try the code below to see if it gets you started (since I don't have reproducible data, I'm not sure whether it will help). In this case it should just skip the podcasts, but I'm sure some more codesmithing will let you put the relevant data in the given list position.
for (i in df$artistName) {
  if (i == '') {  # no artist name to look up, so skip the API call
    empty_list[[i]] <- NA
  } else {
    empty_list[[i]] <- get_artist_audio_features(i)
  }
}
You could also use a while loop conditioning on an incremental i to restart the loop, but I can't do that without the data.
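Another option (a sketch rather than a tested solution; it assumes spotifyr authentication is already configured) is to wrap the call in tryCatch, mirroring the approach used elsewhere on this page, so that any name the API cannot resolve is stored as NA and the loop carries on:
library(spotifyr)
empty_list <- vector(mode = "list")
for (i in df$artistName) {
  # if the lookup fails (e.g. for a podcast), record NA instead of stopping the loop
  empty_list[[i]] <- tryCatch(get_artist_audio_features(i),
                              error = function(e) NA)
}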

tryCatch function works on most non-existent URLs, but it does not work in (at least) one case

Dear Stack Overflow users,
I am using R to scrape the profiles of a few psychotherapists from Psychology Today; this is done for practice and to learn more about web scraping.
I am new to R and I have to get through this intense training that will help me with future projects. This means I might not know precisely what I am doing at the moment (e.g. I might misinterpret either the script or the error messages from R), but I have to get it done. Therefore, I beg your pardon for possible misunderstandings or inaccuracies.
In short, the situation is the following.
I have created a function through which I scrape information from 2 nodes of psychotherapists' profiles; the function is shown in this Stack Overflow post.
Then I create a loop in which that function is used on a few psychotherapists' profiles; the loop is in the above post as well, but I report it below because it is the part of the script that generates some problems (in addition to what I solved in the above-mentioned post).
j <- 1
MHP_codes <- c(150140:150180) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  #Reading the HTML code from the website
  URL <- read_html(URL)
  df_list[[j]] <- tryCatch(getProfile(URL),
                           error = function(e) NA)
  j <- j + 1
}
When the loop is done, I bind the information from the different profiles into one data frame and save it.
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
The function (getProfile) works well on individual profiles.
It also works on a small range of profiles (c(150100:150150)).
Please note that I do not know which psychotherapist IDs are actually assigned, so many URLs within the range do not exist.
However, generally speaking, tryCatch should handle this. When a URL is non-existent (and thus the ID is not associated with any psychotherapist), each of the 2 nodes (and thus each of the 2 corresponding variables in my data frame) is empty (i.e. the data frame shows NAs in the corresponding cells).
However, in some ID ranges, two problems can happen.
First, I get an error message such as the following:
Error in open.connection(x, "rb") : HTTP error 404.
This happens despite the fact that I am using tryCatch, and despite the fact that it generally appears to work (at least until the error message appears).
Moreover, after the loop has stopped and R runs the line:
final_df <- rbind.fill(df_list)
A second error message appears:
Warning message:
In df[[var]] :
closing unused connection 3 (https://www.psychologytoday.com/us/therapists/illinois/150152)
It seems like there is a specific problem with that one empty URL.
In fact, when I change the ID range, the loop works well despite non-existent URLs: when the URL exists, the information is scraped from the website; when the URL does not exist, the 2 variables associated with that URL (and thus with that psychotherapist ID) get an NA.
Is it possible, perhaps, to tell R to skip a URL if it is empty, without recording anything?
That solution would be excellent, since it would shrink the data frame to the existing URLs, but I do not know how to do it and I do not know whether it would solve my problem.
Is anyone able to help me sort out this issue?
Yes, you need to wrap a tryCatch around the read_html call. This is where R tries to connect to the website, so it will throw an error (as opposed to returning an empty object) there if it fails to connect. You can catch that error and then use next to tell R to skip to the next iteration of the loop.
library(rvest)
##Valid URL, works fine
URL <- "https://news.bbc.co.uk"
read_html(URL)
##Invalid URL, error raised
URL <- "https://news.bbc.co.uk/not_exist"
read_html(URL)
##Leads to error
## Error in open.connection(x, "rb") : HTTP error 404.
##Invalid URL, catch and skip to next iteration of the loop
URL <- "https://news.bbc.co.uk/not_exist"
tryCatch({
  URL <- read_html(URL)},
  error = function(e) {print("URL Not Found, skipping")
    next})
I would like to thank @Jul for the answer.
Here I post my updated loop:
j <- 1
MHP_codes <- c(150000:150200) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for (code1 in MHP_codes) {
  # promise that, when forced inside the error handler, jumps to the next iteration
  delayedAssign("do.next", {next})
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  #Reading the HTML code from the website
  URL <- tryCatch(read_html(URL),
                  error = function(e) force(do.next))
  df_list[[j]] <- getProfile(URL)
  j <- j + 1
}
final_df <- rbind.fill(df_list)
As you can see, something had to be changed: although the answer from @Jul was close to solving the problem, the loop still stopped, so I had to slightly change the original suggestion.
In particular, I introduced the following line in the loop, but outside of the tryCatch function:
delayedAssign("do.next", {next})
And in the tryCatch function the following argument:
force(do.next)
This is based on this other Stack Overflow post.
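An equivalent pattern that avoids the delayedAssign trick (a sketch under the same getProfile assumption, not taken from the original answers) is to have the error handler return NULL and test for that in the loop body:
library(rvest)
library(plyr)  # for rbind.fill
j <- 1
MHP_codes <- c(150000:150200) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  page <- tryCatch(read_html(URL), error = function(e) NULL)
  if (is.null(page)) next  # non-existent profile: record nothing and move on
  df_list[[j]] <- getProfile(page)
  j <- j + 1
}
final_df <- rbind.fill(df_list)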

Check if a table within a website exists in R

For a little project for myself I'm trying to get the results from some races.
I can access the pages with the results and download the data from the table on each page. However, there are only 20 results per page; luckily the web addresses are built logically, so I can generate them and, in a loop, access these pages and download the data. However, each category has a different number of racers and thus can have a different number of pages. I want to avoid manually checking how many racers there are in each category.
My first thought was to just generate a lot of links, making sure there are enough (based on the total amount of racers) to get all the data.
nrs <- rep(seq(1,5,1),2)
sex <- c("M","M","M","M","M","F","F","F","F","F")
links <- NULL
#Loop to create 10 links, 5 for the male age group 18-24, 5 for the female age group 18-24.
#However, there are only 3 pages in the male age group with a table.
for (i in 1:length(nrs)) {
  links[i] = paste("http://www.ironman.com/triathlon/events/americas/ironman/texas/results.aspx?p=",nrs[i],"&race=texas&rd=20160514&sex=",sex[i],"&agegroup=18-24&loc=",sep="")
}
resultlist <- list() #create empty list to store results
for (i in 1:length(links)) {
  results = readHTMLTable(links[i],
                          as.data.frame = TRUE,
                          which = 1,
                          stringsAsFactors = FALSE,
                          header = TRUE) #get data
  resultlist[[i]] <- results #combine results in one big list
}
results = do.call(rbind, resultlist) #combine results into dataframe
As you can see, with this code readHTMLTable throws an error as soon as it encounters a page with no table, and then stops.
I thought of two possible solutions.
1) Somehow check whether each link exists. I tried url.exists from the RCurl package, but this doesn't work: it returns TRUE for all pages, because the page does exist, it just doesn't have a table in it (so for me that is a false positive). I would need some code to check whether a table exists in the page, but I don't know how to go about that.
2) Suppress the error from readHTMLTable so the loop continues, but I'm not sure whether that's possible.
Any suggestions for these two methods, or any other suggestions?
I think that method #2 is easier. I modified your code with tryCatch, one of R's built-in exception-handling mechanisms. It works for me.
PS I would recommend using rvest for web scraping like this.
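The modified code itself is not reproduced above; a minimal sketch of what the tryCatch wrapper might look like (assuming the XML package for readHTMLTable and the links vector built earlier) is:
library(XML)
resultlist <- list()
for (i in 1:length(links)) {
  # a page without a table raises an error; in that case store nothing and keep going
  results <- tryCatch(readHTMLTable(links[i],
                                    as.data.frame = TRUE,
                                    which = 1,
                                    stringsAsFactors = FALSE,
                                    header = TRUE),
                      error = function(e) NULL)
  if (!is.null(results)) {
    resultlist[[length(resultlist) + 1]] <- results
  }
}
results <- do.call(rbind, resultlist)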

How to retrieve multiple tweets from tweet_id using R

I am using the twitteR package in R to extract tweets based on their IDs.
But I am unable to do this for multiple tweet IDs without hitting either a rate limit or a 404 error.
This is because I am using showStatus(), which handles one tweet ID at a time.
I am looking for a function similar to getStatuses(), which handles multiple tweet IDs per request.
Is there an efficient way to perform this action?
I suppose only 60 requests can be made in a 15-minute window using OAuth.
So, how do I ensure that:
1. Multiple tweet IDs are retrieved per request, with these requests then repeated as needed;
2. The rate limit is respected;
3. Errors are handled for tweets that are not found?
P.S.: This activity is not user-based.
Thanks
I have come across the same issue recently. For retrieving tweets in bulk, Twitter recommends using the lookup method provided by its API. That way you can get up to 100 tweets per request.
Unfortunately, this has not been implemented in the twitteR package yet, so I've tried to hack together a quick function (reusing lots of code from the twitteR package) that uses that API method:
lookupStatus <- function(ids, ...) {
  lapply(ids, twitteR:::check_id)
  batches <- split(ids, ceiling(seq_along(ids) / 100))  # up to 100 IDs per API call
  results <- lapply(batches, function(batch) {
    params <- parseIDs(batch)
    statuses <- twitteR:::twInterfaceObj$doAPICall(paste("statuses", "lookup", sep = "/"),
                                                   params = params, ...)
    twitteR:::import_statuses(statuses)
  })
  return(unlist(results))
}
parseIDs <- function(ids) {
  id_list <- list()
  if (length(ids) > 0) {
    id_list$id <- paste(ids, collapse = ",")
  }
  return(id_list)
}
Make sure that your vector of ids is of class character (otherwise there can be some problems with very large IDs).
Use the function like this:
ids <- c("432656548536401920", "332526548546401821")
tweets <- lookupStatus(ids, retryOnRateLimit=100)
Setting a high retryOnRateLimit ensures you get all your tweets, even if your vector of IDs has more than 18,000 entries (100 IDs per request, 180 requests per 15-minute window).
As usual, you can turn the tweets into a data frame with twListToDF(tweets).
