I have large TSV files containing the tweet IDs of millions of tweets whose content I would like to analyze in R. How do I get the metadata of the tweets (message, user, date, etc.) into a dataset without looking up every individual tweet?
I know this is possible in Python, but is it also possible in R, since I do not know Python well? Is there an R package for this purpose?
If you use the rtweet package (which is usually preferred over twitteR, as the latter is no longer maintained), you can use the lookup_statuses() function to get the metadata for large batches of Tweets.
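A minimal sketch of what that could look like, assuming the IDs sit in a column called tweet_id of a file called tweet_ids.tsv (both names are placeholders) and that rtweet authentication has already been set up:
library(rtweet)
# Read the IDs as character so the long ID numbers are not mangled
ids <- read.delim("tweet_ids.tsv", colClasses = "character")$tweet_id
# lookup_statuses() takes a vector of status IDs and returns a data frame of metadata
tweets <- lookup_statuses(ids)
head(tweets[, c("status_id", "created_at", "screen_name", "text")])
lookup_statuses() should handle chunking the IDs into API-sized batches for you, so there is no need to loop over individual tweets.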
Using R, I know how to get tweets between two dates with the academictwitteR package, but the "start_tweets" and "end_tweets" arguments do not seem to work in the get_liked_tweets() function. Is there a way to get the liked tweets of a user between two dates using this R package?
Thanks!
No, this is not possible; there are no dates assigned to liked Tweets, so there is nothing to filter on.
Hi everyone!
I'm doing research using COVID-19 tweets. I've downloaded some COVID-19 tweet IDs from https://zenodo.org/record/3970127#.Xy12rChKiUk. However, the data only include the tweet IDs. Does anyone know how to hydrate the data in RStudio and get the JSON file with the text? It seems I could use the twarc package, but I'd like to do the whole process in the R environment, not in Python.
I realize this is a tad late, but here goes: twarc's package description mentions a similar package for R, which answers OP's question.
"For R there is academictwitteR. Unlike twarc, it focuses solely on querying the Twitter Academic Research Product Track v2 API endpoint. Data gathered in twarc can be imported into R for analysis as a dataframe if you export the data into CSV using twarc-csv."
Here is the source.
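If you want to stay entirely in R, recent versions of academictwitteR also provide a hydrate_tweets() function that takes a vector of tweet IDs. A rough sketch, assuming the IDs from Zenodo are in a plain-text file called covid_tweet_ids.txt (the file name is a placeholder) and that a bearer token has been configured with set_bearer():
library(academictwitteR)
# Read the tweet IDs, one per line, as character strings
ids <- readLines("covid_tweet_ids.txt")
# hydrate_tweets() looks the IDs up against the v2 API and returns the tweet metadata
covid_tweets <- hydrate_tweets(ids)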
The following code extracts the tables from a PDF and counts them.
install.packages("tabulizer"); install.packages("tidyverse")
library(tabulizer); library(tidyverse)
n_tables <- extract_tables("filename.pdf") %>% length()
However, it takes forever to do this. Can we bypass the actual table-extraction step, presumably a very time-consuming process, and get the count of tables from PDFs directly using tabulizer or any other R package?
Original tabulizer developer here: Nope. The algorithm works page-by-page, identifying the tables and extracting them. The extraction per se is not expensive - it's the identification that's time-consuming.
The reason the package - and the underlying Tabula Java library - exists at all is because there is no internal representation of a "table" in the PDF specification, unlike in say HTML or docx. Tables in a PDF are just arrangements of glyphs in something that looks to the human eye like a table. Thus there's no way to quickly query for the presence of a table or a list of all tables as no such list exists in the file.
So short, disappointing answer: nope.
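If the main concern is seeing progress (or spreading the work across cores), you can run the same identification page by page; a rough sketch using tabulizer's get_n_pages() and the pages argument of extract_tables(), with the caveat that this does not avoid the identification cost described above:
library(tabulizer)
n_pages <- get_n_pages("filename.pdf")
# Count tables one page at a time; the total cost is the same, but progress is visible
tables_per_page <- vapply(seq_len(n_pages), function(p) {
  length(extract_tables("filename.pdf", pages = p))
}, integer(1))
n_tables <- sum(tables_per_page)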
I am new to R and have just started to use it. I am currently experimenting with the quantmod, rugarch and rmgarch packages.
In particular, I'm using the last package to run a multivariate portfolio analysis for the European markets. For this I need the 3-month German treasury bill rate to use as the risk-free rate. However, as far as I know, I can't download that data series from the Yahoo, Google or FRED databases, so I have downloaded it from investing.com and I want to load it into R.
The issue is that my data differ from the data downloaded by the getSymbols() function from Yahoo: in this case I only have two columns, a date column and a closing-price column. To sum up, the question is: is there any way to load this type of data into R for rmgarch purposes?
Thanks in advance!
Not sure if this is the issue, but this is how you might go about getting the data from a csv file.
data <- read.csv(file="file/path/data.csv")
head(data) # Take a look at your data
# To pull out a single column, replace ColumnName with the actual column name
data_only <- data$ColumnName
It looks like the input data for rugarch needs to be an xts object, so you might want to take a look at this. You might also want to look at ?read.csv.
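A rough sketch of that conversion, assuming your CSV has columns named Date and Close with ISO-formatted dates (both the column names and the date format are assumptions about your file):
library(xts)
data <- read.csv("file/path/data.csv", stringsAsFactors = FALSE)
# Build an xts series indexed by the parsed dates; adjust the format string to match your file
bills_xts <- xts(data$Close, order.by = as.Date(data$Date, format = "%Y-%m-%d"))
# Give the column a label (any descriptive name works)
colnames(bills_xts) <- "DE3M"
The resulting series can then be merged with your return series before it goes into the rugarch/rmgarch specification.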
I am attempting to download all of the followers and their information (location, date of account creation, etc.) from the Haaretz Twitter feed (@haaretzcom) using the twitteR package in R. The Twitter feed has over 90,000 followers. I was able to download the full list of followers with no problem using the code below.
require(twitteR)
require(ROAuth)
# Load the saved Twitter OAuth credentials
load("~/Dropbox/Twitter/my_oauth")
# Register the OAuth credentials
registerTwitterOAuth(my_oauth)
# Download the full list of follower IDs
haaretz_followers <- getUser("haaretzcom")$getFollowerIDs(retryOnRateLimit=9999999)
However, when I try to extract their information using the lookupUsers function, I run into the rate limit. The trick of using retryOnRateLimit does not seem to work here:)
#Extracting user information for each of Haaretz followers
haaretz_followers_info<-lookupUsers(haaretz_followers)
haaretz_followers_full<-twListToDF(haaretz_followers_info)
#Export data to csv
write.table(haaretz_followers_full, file = "haaretz_twitter_followers.csv", sep=",")
I believe I need to write a for loop that iterates over subsamples of the list of followers (haaretz_followers) to avoid the rate limit. In this loop, I need to include some kind of rest/pause, as in "Keep downloading tweets within the limits using twitteR package". The twitteR package documentation is a bit opaque on how to go about this, and I am a bit of a novice at writing for loops in R. Finally, I know that the way you write your loops in R greatly affects the run time. Any help you could give would be much appreciated!
Something like this will likely get the job done:
# Look the followers up in batches of 100 IDs (the per-call maximum), pausing between calls
batches <- split(haaretz_followers, ceiling(seq_along(haaretz_followers) / 100))
haaretz_followers_info <- list()
for (batch in batches){
  Sys.sleep(5)
  haaretz_followers_info <- c(haaretz_followers_info, lookupUsers(batch))
}
haaretz_followers_full <- twListToDF(haaretz_followers_info)
#Export data to csv
write.table(haaretz_followers_full, file = "haaretz_twitter_followers.csv", sep=",")
Here you're sleeping for 5 seconds between each call. I don't know what your rate limit is; you may need more or less time to comply with Twitter's policies.
You're correct that the way you structure loops in R will affect performance, but in this case, you're intentionally inserting a pause which will be orders of magnitude longer than any wasted CPU time from a poorly-designed loop, so you don't really need to worry about that.
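If you want to base the sleep interval on what Twitter actually allows, twitteR can report your current rate-limit status; a quick check might look like this (it's worth verifying the exact resource name against the output):
# See how many users/lookup calls remain in the current 15-minute window
limits <- getCurRateLimitInfo()
limits[limits$resource == "/users/lookup", ]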