I'm trying to get all tweets and retweets mentioning Obama or Trump from a certain period of time with the academictwitteR package. The problem I'm facing is that every retweet comes with "..." instead of the full text (the desired output). This is the code I'm using:
variable <-
  get_all_tweets(
    query = c("Obama", "Trump"),
    start_tweets = "2010-01-01T00:00:00Z",
    end_tweets = "2022-05-11T00:00:00Z",
    n = 100000)
I've done some research and found this post (When I use the package AcademicTwitterR and function 'get_all_tweets' it seems to return the shortened version of the original tweet), which asks how to avoid shortened tweets with the academictwitteR package, but I didn't understand how to implement the answers. For example, this code is shown as a solution: bind_tweets(data_path = "tweetdata") %>% as_tibble
I don't know where to put it in my code. Can anyone show me a full code example to deal with this problem?
The idea is to download the tweets as JSON instead of immediately binding them into a data frame, which apparently truncates the retweets.
variable <-
  get_all_tweets(
    query = c("Obama", "Trump"),
    start_tweets = "2010-01-01T00:00:00Z",
    end_tweets = "2022-05-11T00:00:00Z",
    n = 100000,
    data_path = "tweetdata",
    bind_tweets = FALSE)
Then the raw data can be pulled in with:
tweet_data <- bind_tweets(data_path = "tweetdata", output_format = "raw")
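If you prefer a single flat data frame instead of the raw list, the "tidy" output format also keeps the full text; for retweets the untruncated original should be in the sourcetweet_text column (a sketch based on the documented tidy format, so adjust the column names if yours differ):

library(academictwitteR)

# Bind the stored JSON files into one tidy tibble
tweet_tidy <- bind_tweets(data_path = "tweetdata", output_format = "tidy")

# `text` holds the (possibly truncated) retweet text; `sourcetweet_text`
# should hold the full text of the original tweet
head(tweet_tidy[, c("text", "sourcetweet_text")])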
About my project: I am using the Academic Twitter API and the academictwitteR package to first scrape all tweets of Amnesty International UK. This has worked fine.
The next step is to use the conversation ids of those ~30,000 tweets to get the entire threads behind them, which is where my problem lies.
This is the code I am running:
ai_t <-
  get_all_tweets(
    users = "AmnestyUK",
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf
  )
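The conversation ids from that first pull are stored in a vector called list, roughly like this (a sketch, assuming the returned data frame includes a conversation_id column):

# (hypothetical) collect the unique conversation ids from the first pull
list <- unique(ai_t$conversation_id)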
conversations <- c()

for (i in list) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i))
  conversations <- c(conversations, x)
}
The problem is that this amounts to a huge number of individual queries: the package only allows one conversation id per call, and passing the whole list directly instead of looping produces an error, hence the for loop.
Apart from the rate-limit sleep timer, individual queries already take anywhere from ~3 seconds, when few tweets are retrieved, to considerably longer when there are, for example, 2,000 tweets with that conversation_id. A rough calculation (~30,000 queries at a few seconds each, before rate-limit pauses) already puts this at multiple days of running this code, if I am not making a mistake.
The code itself seems to be working fine; I have tried it with a short sample of the conversation ids:
list2 <- list[c(1:3)]

for (i in list2) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i))
  conversations <- c(conversations, x)
}
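One small R-side tweak would be to collect the results in a list and bind them once at the end, instead of growing conversations with c(); it does not change the time spent on the API calls, just the bookkeeping (a sketch, assuming each call returns a data frame):

library(dplyr)

results <- vector("list", length(list))

for (i in seq_along(list)) {
  results[[i]] <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = list[[i]])
}

# combine all conversations into one data frame at the end
conversations <- bind_rows(results)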
Does anybody have a solution for this, or is this the most efficient way and it will just take forever?
I am unfortunately not experienced in Python at all, but if there is an easier way in that language I would also be interested.
Cheers
I am trying to pull tweets that use a particular term for topic analyses.
I am able to successfully extract tweets using the R package academictwitteR and the function get_all_tweets. However, the tweet text seems to be shortened from the original.
For example a tweet text might look like this:
"Not exactly, although invasive species can become a problem as well
(talk to Australians about rabbits for..."
I would like to pull the whole tweet.
Example code I used:
df <- get_all_tweets(query = "invasive species",
                     start_tweets = "2006-10-01T00:00:00Z",
                     end_tweets = "2021-10-01T00:00:00Z")
Christopher Barrie, who made the package, replied. The code does pull all the tweets, but the way I was binding the tweet rows was the problem.
An alternative option for binding the rows, which converts the JSON files to various data frame formats:
The “vanilla” format: the direct output from jsonlite::read_json. It can display columns such as text just fine.
bind_tweets(data_path = "tweetdata") %>% as_tibble
The “raw” format: a list of data frames containing all of the data extracted in the API call.
bind_tweets(data_path = "tweetdata", output_format = "raw") %>% names
The “tidy” format.
bind_tweets(data_path = "tweetdata", output_format = "tidy")
More information here:
https://cran.r-project.org/web/packages/academictwitteR/vignettes/academictwitteR-tidy.html
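Note that bind_tweets() reads the JSON files that get_all_tweets() stores on disk, so the original call needs a data_path (and bind_tweets = FALSE), as in this sketch mirroring the example above:

# store the JSON files so bind_tweets() can re-read them in any format
get_all_tweets(query = "invasive species",
               start_tweets = "2006-10-01T00:00:00Z",
               end_tweets = "2021-10-01T00:00:00Z",
               data_path = "tweetdata",
               bind_tweets = FALSE)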
I have started using the rtweet package and so far I have had good results for my queries, languages, and geocode parameters. However, I still do not know how I can collect Twitter data from within the last 7 days.
For example, in the next code chunk I want to extract some data for 7 days, but I am not sure whether the collected tweets will cover the 7 days starting on 2017-06-29 or the 7 days from 2017-06-22 until 2017-06-29:
# Stream all tweets mentioning AMLO or lopezobrador for 7 days
stream_tweets("AMLO,lopezobrador",
              timeout = 60*60*24*7,
              file_name = "tweetsaboutAMLO.json",
              parse = FALSE)

# Read in the data as a tidy tbl data frame
AMLO <- parse_stream("tweetsaboutAMLO.json")
Do you know if there are any commands in rtweet to specify the time frame to use when using the search_tweets() or stream_tweets() functions?
So, to answer your question about how to write it more efficiently, you could try a for loop or a list apply. Here I show the for loop.
First, create a vector with the 4 dates you are calling.
fechas <- seq.Date(from = as.Date("2018-06-24"), to = as.Date("2018-06-27"), by = 1)
Then create an empty data.frame to store your tweets.
df_tweets <- data.frame()
Now, loop along your list and populate the empty data.frame.
for (i in seq_along(fechas)) {
  df_temp <- search_tweets("lang:es",
                           geocode = mexico_coord,
                           until = fechas[i],
                           n = 100)
  df_tweets <- rbind(df_tweets, df_temp)
}
summary(df_tweets)
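The same idea as a list apply (a sketch, reusing fechas and mexico_coord from above):

# one search per date; each element of df_list holds one day's tweets
df_list <- lapply(fechas, function(fecha) {
  search_tweets("lang:es",
                geocode = mexico_coord,
                until = fecha,
                n = 100)
})

# bind the daily results into a single data frame
df_tweets <- do.call(rbind, df_list)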
On the other hand, the following solution might be more convenient and efficient altogether:
library(tidyverse)
df_tweets2 <- search_tweets("lang:es",
                            geocode = mexico_coord,
                            until = "2018-06-29",  ## or latest date
                            n = 10000)
df_tweets2 %>%
  group_by(as.Date(created_at)) %>%  ## group (or set apart) the tweets by date of creation
  sample_n(100)                      ## obtain 100 random tweets for each group, i.e. for each date
I already found a way to collect tweets within the past seven days. However, it is not efficient.
rt_24 <- search_tweets("lang:es",
                       geocode = mexico_coord,
                       until = "2018-06-24",
                       n = 100)
rt_25 <- search_tweets("lang:es",
                       geocode = mexico_coord,
                       until = "2018-06-25",
                       n = 100)
rt_26 <- search_tweets("lang:es",
                       geocode = mexico_coord,
                       until = "2018-06-26",
                       n = 100)
rt_27 <- search_tweets("lang:es",
                       geocode = mexico_coord,
                       until = "2018-06-27",
                       n = 100)
Then, append the data frames:
rbind(rt_24, rt_25, rt_26, rt_27)
Do you know if there is a more efficient way to write this? Maybe using max_id in combination with until?
I need as many tweets as possible for a given hashtag over a two-day time period. The problem is that there are too many of them (I guess ~1 million) to extract using just a time-period specification:
It would definitely take a lot of time if I specify something like retryOnRateLimit = 120
I'll get blocked soon if I don't, and will only get tweets for half a day
The obvious answer for me is to extract a random sample by given parameters but I can't figure out how to do it.
My code is here:
a = searchTwitteR('hashtag', since="2017-01-13", n = 1000000, resultType = "mixed", retryOnRateLimit = 10)
The last try stopped at 17.5 thousand tweets, which covers only the past 12 hours.
P.S. It may be useful not to extract retweets, but I still don't know how to specify that within searchTwitteR().
The twitteR package is deprecated in favor of the rtweet package. If I were you, I would use rtweet to get every last one of those tweets.
Technically, you could specify 1 million straight away using search_tweets() from the rtweet package. I recommend breaking it up into pieces, however, since collecting 200,000 tweets will take several hours.
library(rtweet)
maxid <- NULL
rt <- vector("list", 5)
for (i in seq_len(5)) {
  rt[[i]] <- search_tweets("hashtag", n = 200000,
                           retryonratelimit = TRUE,
                           max_id = maxid)
  maxid <- rt[[i]]$status_id[nrow(rt[[i]])]
}
## extract users data and combine into data frame
users <- do.call("rbind", users_data(rt))
## collapse tweets data into data frame
rt <- do.call("rbind", rt)
## add users data as attribute
attr(rt, "users") <- users
## preview data
head(rt)
## preview users data (rtweet exports magrittr's `%>%` pipe operator)
users_data(rt) %>% head()
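As for the P.S. about excluding retweets: rtweet's search_tweets() has an include_rts argument, so the same paged loop can simply drop retweets at collection time (a sketch, not part of the original answer):

## same search as above, but without retweets
rt_no_rts <- search_tweets("hashtag", n = 200000,
                           include_rts = FALSE,
                           retryonratelimit = TRUE)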
I'm working with Twitter data and I'm currently trying to find frequencies of bigrams in which the first word is "the". I've written a function which seems to do what I want but is extremely slow (originally I wanted to see frequencies of all bigrams, but I gave up because of the speed). Is there a faster way of solving this problem? I've heard about the RWeka package, but I have trouble installing it; I get the error: ERROR: dependencies 'RWekajars', 'rJava' are not available for package 'RWeka'.
Required libraries: tau and tcltk
library(tau)
library(tcltk)

bigramThe <- function(dataset, column) {
  bidata <- data.frame(x = character(0), y = numeric(0))
  pb <- tkProgressBar(title = "progress bar", min = 0, max = nrow(dataset), width = 300)
  for (i in 1:nrow(dataset)) {
    a <- column[i]
    # count all bigrams in this tweet
    bi <- textcnt(a, n = 2, method = "string")
    tweetbi <- data.frame(V1 = as.vector(names(bi)), V2 = as.numeric(bi))
    # keep only the bigrams containing "the "
    tweetbi$grepl <- grepl("the ", tweetbi$V1)
    tweetbi <- tweetbi[which(tweetbi$grepl == TRUE), ]
    bidata <- rbind(bidata, tweetbi)
    setTkProgressBar(pb, i, label = paste(round(i / nrow(dataset) * 100, 0), "% done"))
  }
  # sum the counts of each bigram across all tweets
  aggbi <- aggregate(bidata$V2, by = list(bidata$V1), FUN = sum)
  close(pb)
  return(aggbi)
}
I have almost 500,000 rows of tweets stored in a column that I pass to the function. An example dataset would look like this:
text               userid
tweet text 1       1
tweets text 2      2
the tweet text 3   3
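With data in that shape, the function is called as, for example (a usage sketch):

result <- bigramThe(dataset, dataset$text)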
To use RWeka, first run sudo apt-get install openjdk-6-jdk (or install/re-install your JDK in Windows GUI) then try re-installing the package.
Should that fail, use download.file to download the source .zip file and install from source, i.e. install.packages("RWeka.zip", type = "source", repos = NULL).
If you want to speed things up without using a different package, consider using multicore (now part of the parallel package) and re-writing the code to use an apply function that can take advantage of parallelism.
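A rough sketch of that idea, using the parallel package and the same textcnt() call as in the question; the text column name and core count are placeholders:

library(parallel)
library(tau)

# count bigrams in one tweet and keep only those starting with "the "
count_the_bigrams <- function(txt) {
  bi <- textcnt(txt, n = 2, method = "string")
  bi[grepl("^the ", names(bi))]
}

# process the tweets in parallel (on Windows use mc.cores = 1 or parLapply)
counts <- mclapply(dataset$text, count_the_bigrams, mc.cores = 4)

# collapse the per-tweet counts into one total per bigram
all_counts <- unlist(counts)
aggbi <- tapply(all_counts, names(all_counts), sum)
head(sort(aggbi, decreasing = TRUE))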
You can get rid of the evil loop structure by collapsing the text column into one long string:
paste(dataset[[column]], collapse=" *** ")
bi <- textcnt(a, n = 2, method = "string")
I expected to also need
subset(bi, !grepl("*", names(bi), fixed = TRUE))
But it turns out that textcnt doesn't include bigrams with * in them, so you're good to go.
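Putting that together, a sketch of the whole collapsed-string approach with the data frame and column names from the question:

library(tau)

# collapse all tweets into one string; the " *** " separator marks tweet
# boundaries, and (as noted above) textcnt drops bigrams containing *,
# so no spurious cross-tweet bigrams are counted
a <- paste(dataset$text, collapse = " *** ")

# count every bigram once, then keep those whose first word is "the"
bi <- textcnt(a, n = 2, method = "string")
the_bigrams <- bi[grepl("^the ", names(bi))]

head(sort(the_bigrams, decreasing = TRUE))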