I need as many tweets as possible for a given hashtag over a two-day period. The problem is that there are too many of them (I guess ~1 million) to extract by just specifying the time period:
It would take a very long time if I specify something like retryOnRateLimit = 120.
If I don't, I'll hit the rate limit quickly and only get tweets for about half a day.
The obvious answer for me is to extract a random sample matching the given parameters, but I can't figure out how to do it.
My code is here:
a = searchTwitteR('hashtag', since="2017-01-13", n = 1000000, resultType = "mixed", retryOnRateLimit = 10)
The last attempt stopped at about 17,500 tweets, which covers only the past 12 hours.
P.S. It would also help not to extract retweets, but I don't know how to specify that within searchTwitteR().
The twitteR package is deprecated in favor of the rtweet package. If I were you, I would use rtweet to get every last one of those tweets.
Technically, you could request 1 million straight away using search_tweets() from the rtweet package. I recommend breaking it up into pieces, though, since collecting 200,000 tweets alone will take several hours.
library(rtweet)

## page backwards through the results: max_id starts as NULL (most recent
## tweets) and is updated to the oldest status_id returned by each batch
maxid <- NULL
rt <- vector("list", 5)
for (i in seq_len(5)) {
  rt[[i]] <- search_tweets("hashtag", n = 200000,
                           retryonratelimit = TRUE,
                           max_id = maxid)
  maxid <- rt[[i]]$status_id[nrow(rt[[i]])]
}
## extract users data and combine into data frame
users <- do.call("rbind", users_data(rt))
## collapse tweets data into data frame
rt <- do.call("rbind", rt)
## add users data as attribute
attr(rt, "users") <- users
## preview data
head(rt)
## preview users data (rtweet exports magrittr's `%>%` pipe operator)
users_data(rt) %>% head()
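For the P.S. about retweets: a minimal sketch of excluding them, assuming rtweet's include_rts argument and, if you stay with twitteR, the standard -filter:retweets search operator:
## drop retweets directly in the rtweet call
rt_no_rts <- search_tweets("hashtag", n = 200000,
                           include_rts = FALSE,
                           retryonratelimit = TRUE)
## or append the search operator to the twitteR query
a <- searchTwitteR('hashtag -filter:retweets', since = "2017-01-13",
                   n = 1000000, resultType = "mixed", retryOnRateLimit = 120)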
Starting from a data frame containing the full list of the US Senators' Twitter usernames, I'm trying to retrieve the full list of their followers using rtweet 0.7.0 and the get_followers() function with the retryonratelimit option set to TRUE. If I pass the whole column of the data frame, R gives me an error:
Error in get_followers_(user = c("LisaMurkowski", "SenDanSullivan", "SenTuberville", :
isTRUE(length(user) == 1) is not TRUE
If instead I use a single username with the same option, it gives me only 5,000 followers and not the full list.
Using the latest version of rtweet gives me the same problem in both cases.
library(rtweet)
data <- read.csv("data.csv", sep=";")
follower <- get_followers(data$twitter, retryonratelimit = T)
library(rtweet)
data <- read.csv("data.csv", sep=";")
follower <- get_followers("LisaMurkowski", retryonratelimit = T)
I tried those two chunks with both versions of rtweet but I failed to find a solution.
According to lookup_users(), there are 57,271,024 accounts in total (the sum of the follower counts of each US Senator). I would like to have this full list to build a network.
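A minimal sketch of one possible workaround, assuming get_followers() has to be called once per screen name (it only accepts a single user) and that n can be raised well above the 5,000 default so retryonratelimit can keep paging:
library(rtweet)

data <- read.csv("data.csv", sep = ";")
followers <- vector("list", length(data$twitter))

for (i in seq_along(data$twitter)) {
  followers[[i]] <- get_followers(data$twitter[i],
                                  n = 10000000,            # generous upper bound per senator
                                  retryonratelimit = TRUE)
  followers[[i]]$senator <- data$twitter[i]                # remember whose followers these are
}

follower_df <- do.call("rbind", followers)
Note that the followers endpoint returns at most 5,000 ids per request and 15 requests per 15-minute window, so 57 million followers works out to roughly a week of continuous collection regardless of how the calls are written.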
About my project: I am using the academic Twitter API and the academictwitteR package to first scrape all tweets from Amnesty International UK. This has worked fine.
The next step is to use the conversation ids of those ~30,000 tweets to get the entire threads behind them, which is where my problem lies.
This is the code I am running:
ai_t <- get_all_tweets(
  users = "AmnestyUK",
  start_tweets = "2008-01-01T00:00:00Z",
  end_tweets = "2022-11-14T00:00:00Z",
  bearer_token = BearerToken,
  n = Inf
)
conversations <- c()

for (i in list) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i))
  conversations <- c(conversations, x)
}
The problem is that this amounts to a huge number of individual queries, but the package only allows running one conversation id at a time; passing the list in directly instead of using the for loop produces an error, which is why I am using a loop.
Apart from the rate-limit sleep timer, individual queries already take anywhere from ~3 seconds, when few tweets are retrieved, to considerably longer when there are, for example, 2,000 tweets with that conversation_id. A rough calculation already puts this at multiple days of running this code, if I am not making a mistake.
The code itself seems to be working fine; I have tried it with a short sample of the conversation ids:
list2 <- list[c(1:3)]

for (i in list2) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i))
  conversations <- c(conversations, x)
}
Does anybody have a solution for this, or is this already the most efficient way and it will just take forever?
I am unfortunately not experienced in Python at all, but if there is an easier way in that language I would also be interested.
Cheers
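One possible speed-up, sketched here under the assumption that the v2 full-archive search accepts the conversation_id: operator combined with OR inside a single query (academic-track queries are capped at 1,024 characters, so roughly 25 ids fit per request):
library(academictwitteR)

## `list` is the vector of conversation ids from the code above
batches <- split(list, ceiling(seq_along(list) / 25))

conversations <- vector("list", length(batches))
for (b in seq_along(batches)) {
  q <- paste0("conversation_id:", batches[[b]], collapse = " OR ")
  conversations[[b]] <- get_all_tweets(
    query = q,
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf)
}
conversations <- dplyr::bind_rows(conversations)
This cuts the number of requests by a factor of ~25, although the time spent downloading the tweets themselves stays the same.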
I am having trouble calculating the average sentiment of each row in a relatively big dataset (N = 36,140).
My dataset contains review data from an app on the Google Play Store (each row represents one review), and I would like to calculate the sentiment of each review using the sentiment_by() function.
The problem is that this function takes a very long time to run.
Here is the link to my dataset in .csv format:
https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing
I have tried using this code:
library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment=sentiment_by(e_data$review)
Then I get the following warning message (after I cancel the process once 10+ minutes have passed):
Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
I have also tried running the text through the get_sentences() function first, with the following code, but sentiment_by() still takes a very long time to finish:
e_sentences = e_data$review %>%
get_sentences()
e_sentiment = sentiment_by(e_sentences)
I have been working with Google Play Store review datasets and have used the sentiment_by() function for the past month, and it always calculated sentiment very quickly... the calculations have only been taking this long since yesterday.
Is there a way to quickly calculate sentiment for each row of a big dataset?
The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it's suddenly taking a lot longer when you upped the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?
I glanced through the help file (?sentiment) and it doesn't seem to do anything which depends on pairs of reviews so that's a bit odd.
library(data.table)
reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10],sentiment_by))
x1[,element_id:=.I]
x2 <- sentiment_by(reviews[1:10])
Both produce effectively the same output, which means that the sentimentr package has a bug in it causing it to be unnecessarily slow.
One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by, but I think you should be able to group them yourself before you send them in (or after, as it doesn't seem to matter).
batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
  review_batches <- split(reviews, ceiling(seq_along(reviews) / batch_size))
  x <- rbindlist(lapply(review_batches, sentiment_by, ...))
  x[, element_id := .I]
  x[]
}

batch_sentiment_by(reviews)
Takes about 45 seconds on my machine (and should be O(N) for bigger datasets).
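If you then want the scores next to the original rows, a small sketch (assuming one output row per review, which is what sentiment_by() gives when no grouping variable is supplied):
sent <- batch_sentiment_by(reviews)
e_data$ave_sentiment <- sent$ave_sentiment   # one score per review, in the original order
head(e_data[, c("review", "ave_sentiment")])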
I have started using the rtweet package and so far I have had good results with my query, language, and geocode parameters. However, I still do not know how I can collect Twitter data from within the last 7 days.
For example, in the next code chunk I want to extract data for 7 days, but I am not sure whether the collected tweets will cover the 7 days after 2017-06-29 or the 7 days before it (i.e., since 2017-06-22 until 2017-06-29):
Stream all tweets mentioning AMLO or lopezobrador for 7 days
stream_tweets("AMLO,lopezobrador",
timeout = 60*60*24*7,
file_name = "tweetsaboutAMLO.json",
parse = FALSE)
Read in the data as a tidy tbl data frame
AMLO <- parse_stream("tweetsaboutAMLO.json")
Do you know if there are any commands in rtweet to specify the time frame to use when using the search_tweets() or stream_tweets() functions?
So, to answer your question about how to write it more efficiently, you could try a for loop or an lapply. Here I show the for loop.
First, create a vector with the 4 dates you are querying.
fechas <- seq.Date(from = as.Date("2018-06-24"), to = as.Date("2018-06-27"), by = 1)
Then create an empty data.frame to store your tweets.
df_tweets <- data.frame()
Now, loop along the dates and populate the empty data.frame.
for (i in seq_along(fechas)) {
  df_temp <- search_tweets("lang:es",
                           geocode = mexico_coord,
                           until = fechas[i],
                           n = 100)
  df_tweets <- rbind(df_tweets, df_temp)
}
summary(df_tweets)
On the other hand, the following solution might be more convenient and efficient altogether:
library(tidyverse)
df_tweets2 <- search_tweets("lang:es",
                            geocode = mexico_coord,
                            until = "2018-06-29", ## or latest date
                            n = 10000)
df_tweets2 %>%
  group_by(as.Date(created_at)) %>% ## group (or set apart) the tweets by date of creation
  sample_n(100) ## obtain 100 random tweets for each group, in this case, for each date
I already found a way to collect tweets within the past seven days. However, it is not efficient.
rt_24 <- search_tweets("lang:es",
geocode = mexico_coord,
until="2018-06-24",
n = 100)
rt_25 <- search_tweets("lang:es",
geocode = mexico_coord,
until="2018-06-25",
n = 100)
rt_26 <- search_tweets("lang:es",
geocode = mexico_coord,
until="2018-06-26",
n = 100)
rt_27 <- search_tweets("lang:es",
geocode = mexico_coord,
until="2018-06-27",
n = 100)
Then, append the data frames:
rbind(rt_24,rt_25,rt_26,rt_27)
Do you know if there is a more efficient way to write this? Maybe using the max_id argument in combination with until?
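A minimal sketch of that idea, assuming search_tweets() accepts a max_id argument and returns a status_id column (as in the rtweet 0.x releases): page backwards from the latest date, feeding the oldest id of each batch into the next call.
library(rtweet)

pages <- vector("list", 7)
maxid <- NULL
for (i in seq_along(pages)) {
  pages[[i]] <- search_tweets("lang:es",
                              geocode = mexico_coord,
                              until = "2018-06-29",  # latest date of interest
                              max_id = maxid,
                              n = 100)
  maxid <- pages[[i]]$status_id[nrow(pages[[i]])]    # oldest id so far; next call starts below it
}
df_tweets <- do.call("rbind", pages)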
I'm using "TwitteR" package and R program to retrieve tweets information. Even though Twitter API provides
retweet_count’ function(https://dev.twitter.com/docs/faq#6899)
I couldn't figure out how to utilize it within R. ( Maybe using 'getURL' function in 'RCurl' package?)
Basically, I'm looking for ways to get:
1. the number of times a specific tweet has been retweeted, and
2. real-time information via the Streaming API in R, such as:
a. when new followers join those users,
b. when they post tweets or retweets, and
c. when the tweets they have posted are retweeted by someone else.
I would appreciate it if anyone could help me find leads to get any of this information.
I can't help with the streaming API question, but how about this for working with retweets, based on this helpful tutorial? You could probably adapt it to focus on specific tweets instead of the number of retweets per user. Some of the posts here may be more useful.
# get package with functions for interacting with Twitter.com
require(twitteR)
# get 1500 tweets with #BBC tag, note that 1500 is the max, and it's subject to mysterious filtering and other restrictions by Twitter
s <- searchTwitter('#BBC', n=1500)
#
# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))
#
# Clean text of tweets
df$text <- sapply(df$text,function(row) iconv(row,to='UTF-8')) #remove odd characters
trim <- function (x) sub('@','',x) # remove @ symbol from user names
#
# Extract retweets
library(stringr)
df$to <- sapply(df$to,function(name) trim(name)) # pull out who msg is to
df$rt <- sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#
# basic analysis and visualisation of RT'd messages
sum(!is.na(df$rt)) # see how many tweets are retweets
sum(!is.na(df$rt))/length(df$rt) # the ratio of retweets to tweets
countRT <- table(df$rt)
countRT <- sort(countRT)
countRT.subset <- subset(countRT,countRT >2) # subset those RTd at least twice
barplot(countRT.subset,las=2,cex.names = 0.75) # plot them
#
# basic social network analysis using RT
# (not requested by OP, but may be of interest...)
rt <- data.frame(user=df$screenName, rt=df$rt) # tweeter-retweeted pairs
rt.u <- na.omit(unique(rt)) # omit pairs with NA, get only unique pairs
#
# begin sna
library(igraph)
g <- graph.data.frame(rt.u, directed = T)
ecount(g) # edges (connections)
vcount(g) # vertices (nodes)
diameter(g) # network diameter
farthest.nodes(g) # show the farthest nodes
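Regarding the first point (how many times a specific tweet has been retweeted): more recent versions of twitteR also expose a retweetCount field on each status, so after the data-frame conversion above you can read it off directly; a sketch, assuming that column is present in your twitteR version:
# the ten most-retweeted tweets in the search results
df[order(df$retweetCount, decreasing = TRUE), c("screenName", "retweetCount", "text")][1:10, ]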