About my project: I am using the academic Twitter API and the academictwitteR package to first scrape all tweets of Amnesty International UK. This has worked fine.
The next step is to use the conversation ids of those ~30,000 tweets to retrieve the entire threads behind them, which is where my problem lies.
This is the code I am running:
ai_t <- get_all_tweets(
  users = "AmnestyUK",
  start_tweets = "2008-01-01T00:00:00Z",
  end_tweets = "2022-11-14T00:00:00Z",
  bearer_token = BearerToken,
  n = Inf
)
conversations <- c()

for (i in list) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i)
  )
  conversations <- c(conversations, x)
}
The problem is that this amounts to an abundance of individual queries: the package only accepts one conversation id at a time, and passing the whole list directly instead of looping produces an error, hence the loop.
Apart from the rate-limit sleep timer, individual queries already take anywhere from ~3 seconds, when few tweets are retrieved, to considerably longer when there are, for example, 2,000 tweets with that conversation_id. Even at 3 seconds each, ~30,000 conversations already come to roughly 25 hours before any rate-limit sleeps, so a rough calculation puts this at multiple days of running this code, if I am not making a mistake.
The code itself seems to be working fine, I have tried with a short sample of the conversation ids:
list2 <- list[c(1:3)]

for (i in list2) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i)
  )
  conversations <- c(conversations, x)
}
Does anybody have a solution for this, or is this the most efficient way and it will just take forever?
I am unfortunately not experienced in Python at all, but if there is an easier way in that language I would also be interested.
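One direction I have been wondering about is batching several ids into a single query string instead of sending one request per id. Below is only an untested sketch: it assumes the v2 full-archive search accepts OR-combined conversation_id: operators and a 1024-character query limit for academic access, and conversation_ids is a hypothetical name standing in for my vector of ~30,000 ids.

library(academictwitteR)
library(dplyr)

ids <- conversation_ids                               # hypothetical name for my vector of conversation ids
batches <- split(ids, ceiling(seq_along(ids) / 25))   # ~25 ids per query keeps each query under 1024 characters

conversations <- vector("list", length(batches))
for (b in seq_along(batches)) {
  # join the ids of this batch into one "conversation_id:x OR conversation_id:y ..." query
  q <- paste0("conversation_id:", batches[[b]], collapse = " OR ")
  conversations[[b]] <- get_all_tweets(
    query        = q,
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets   = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n            = Inf
  )
}
conversations <- bind_rows(conversations)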
Cheers
For species niche modeling I am trying to fetch building heights from the brilliant 3D BAG data of TU Delft (https://3dbag.nl/nl/download). I want to do this for the city of Haarlem. It is possible to select and download tiles manually, but this is quite labor-intensive and prone to errors (a missing tile), and I want to repeat this action for more cities. So I am trying to use the WFS service to download the features. I created a bounding box of Haarlem with a 1.2 extent for the WFS request. However, the maximum number of records the server delivers is 5000, and despite many alternative attempts I have so far failed to get past that number. This is partly caused by my confusion about the WFS semantics: when I check with GetCapabilities it is hard to find out the namespace, feature types and individual attributes (or properties). What I have tried:
Add pagination. But all the tutorials I have read so far need the actual/maximum number of features besides the server maximum (resultType = "hits"), and I was not able to easily retrieve this total for the limits of the bounding box (see the sketch after this list).
Select tiles. I figured it should be possible to extract the tile ids that match the bounding box, using tile_id, an attribute of the BAG3D_v2:bag_tiles_3k layer, and then somehow build an apply or loop to extract the features per tile. But I already failed to create a cql_filter that selects an individual tile.
Create tiles. Since I am not entirely sure whether the individual tiles from the 3D BAG service already exceed the 5000-feature limit, an alternative approach could be to split the bounding box into many small tiles using the R package slippymath and then extract the features per tile. But then the challenge of filtering remains the same.
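For the pagination route, this is the kind of request I imagine could return the total count up front. It is an untested sketch and assumes the server honours resultType = "hits" and reports the total in the numberMatched attribute of the WFS 2.0.0 response:

library(httr)
library(xml2)

url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
url$query <- list(service = "WFS",
                  version = "2.0.0",
                  request = "GetFeature",
                  typename = "BAG3D_v2:lod22",
                  bbox = "100768.4,482708.5,107923.1,494670.4",
                  resultType = "hits")
hits <- read_xml(build_url(url))
as.numeric(xml_attr(hits, "numberMatched"))   # total number of features in the bounding box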
Any help with this would be appreciated. The basic code I used in many different ways:
library(httr)
library(sf)    # for st_read()
library(tmap)  # for qtm()

url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
url$query <- list(service = "WFS",
                  version = "2.0.0",
                  request = "GetFeature",
                  typename = "BAG3D_v2:lod22",
                  #cql_filter = "BAG3D_v2:tile_id ='4199'",
                  bbox = "100768.4,482708.5,107923.1,494670.4",
                  startindex = 10000,
                  sortBy = "gid")
request <- build_url(url)
test <- st_read(request)
qtm(test)
One solution is to loop over startindex, 5000 by 5000, and stop when the returned shape contains fewer than 5000 features, which means you are done (unless the total number of features is a multiple of 5000...).
Below is a piece of code adapted from the happign package.
library(httr)
library(sf)
# function for building url
build_3DBAG_url <- function(startindex){
url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
url$query <- list(service = "WFS",
version = "2.0.0",
request = "GetFeature",
typename = "BAG3D_v2:lod22",
#cql_filter = "BAG3D_v2:tile_id ='4199'",
bbox = "100768.4,482708.5,107923.1,494670.4",
startindex = startindex,
count = 5000,
sortBy = "gid")
url <- build_url(url)
return(url)
}
# initialize first request
resp <- read_sf(build_3DBAG_url(startindex = 0))
message("Features downloaded : ", nrow(resp), appendLF = F)
# loop until returned shape is less than 5000
i <- 5000
temp <- resp
while(nrow(temp) == 5000){
message("...", appendLF = F)
temp <- read_sf(build_3DBAG_url(startindex = i))
resp <- rbind(resp, temp)
message(nrow(resp), appendLF = F)
i <- i + 5000
}
I'm trying to get all tweets and retweets mentioning Obama or Trump from a certain time period with the academictwitteR package. The problem I'm facing is that every retweet comes with "..." instead of the full text (the desired output). This is the code I'm using:
variable <-
get_all_tweets(
query = c("Obama", "Trump"),
start_tweets = "2010-01-01T00:00:00Z",
end_tweets = "2022-05-11T00:00:00Z",
n = 100000)
I've done some research and found this post (When I use the package AcademicTwitterR and function 'get_all_tweets' it seems to return the shortened version of the original tweet), which asks how to avoid shortened tweets with the academictwitteR package, but I didn't understand the answers or how to implement the solution. For example, this code is shown as a solution: bind_tweets(data_path = "tweetdata") %>% as_tibble
I don't know where to put it in my code. Can anyone show me a full code example to deal with this problem?
The idea is to download the tweets as json instead of immediately binding them into a dataframe, which apparently truncates the retweets.
variable <-
get_all_tweets(
query = c("Obama", "Trump"),
start_tweets = "2010-01-01T00:00:00Z",
end_tweets = "2022-05-11T00:00:00Z",
n = 100000,
data_path = "tweetdata",
bind_tweets = FALSE)
Then the raw data can be pulled in with:
tweet_data <- bind_tweets(data_path = "tweetdata", output_format = "raw")
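If a flat data frame is easier to work with than the raw list, bind_tweets() also offers a "tidy" output format; I have not verified whether the retweet text is untruncated there as well, so treat this as a sketch rather than a confirmed fix:

tweet_tidy <- bind_tweets(data_path = "tweetdata", output_format = "tidy")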
I am scraping NBA play-by-play data using the play_by_play function in the nbastatR package. The problem is that this function only collects data for one game ID at a time, and there are 1230 game IDs in a complete season. When I enter more than 15 game IDs in the play_by_play function, R just keeps loading and showing the wheel of death forever.
I tried to get around this by making a for loop which binds each game ID's data to one cumulative dataframe. However, I run into the same issue where R will endlessly load around the 16th game, which is very peculiar. I could clean the data inside the loop and try that out (I do not need all the play-by-play data, just every shot from the season), but does anyone know why this is happening and how/if I could get around it?
Thanks
full<- play_by_play(game_ids = 21400001, nest_data = F, return_message = T)
for(i in 21400002:21400040){
data <- play_by_play(game_ids = c(i), nest_data = F, return_message = F)
full <- bind_rows(full,data)
cat(i)
}
This code will stop working at around the 16th game ID. I tried using bind_rows from dplyr but that did not help at all.
Try this [untested code as I don't have your data]:
library(dplyr)   # for bind_rows() and the %>% pipe

full <- lapply(
21400001:21400040,
function(i) {
play_by_play(game_ids = c(i), nest_data = F, return_message = F)
}
) %>%
bind_rows()
You can get more information on lazy evaluation here.
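If the hang is caused by the stats endpoint throttling rapid requests rather than by lazy evaluation, a variant that pauses between games and skips failures might help. This is an untested sketch; the one-second pause is a guess:

library(dplyr)
library(nbastatR)

full <- lapply(
  21400001:21400040,
  function(i) {
    Sys.sleep(1)  # small pause between requests in case the endpoint throttles rapid calls
    tryCatch(
      play_by_play(game_ids = i, nest_data = F, return_message = F),
      error = function(e) NULL  # skip a game that errors instead of stopping the whole run
    )
  }
) %>%
  bind_rows()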
I am using the gtrendsR package for extracting Google Trends data. I understand that this package has a limit of a maximum of 5 keywords at a time; therefore I used a loop to extract more than 5 keywords.
Now I would like to repeat this exercise for multiple countries as well, and would like the result to show all possible Google Trends outputs for every combination of keyword and country.
This is the code I am using:
library(gtrendsR)

Country <- readLines("country_list.csv")
Keyword <- readLines("keyword_list.csv")

results <- list()

for (i in Keyword) {
  for (j in Country) {
    time <- "today 3-m"
    channel <- "web"
    trends <- gtrends(keyword = i, gprop = channel, geo = j, time = time)
    results[[j]] <- trends$interest_over_time
  }
}

Out <- as.data.frame(do.call("rbind", results))
I keep getting the error:
Error in get_widget(comparison_item, category, gprop, hl, cookie_url, :
widget$status_code == 200 is not TRUE
I have around 60 countries and 300 keywords on the lists. Is it down to this much data extraction not being possible from Google Trends, or some fundamental error?
By the way, I am a basic user of R, so many thanks for the help.
The status codes the server returns (here, widget$status_code == 200 is not TRUE) are usually pretty descriptive of the problem if you just put them into a Google search. In your case: too many requests in a short period of time. For every i in Keyword you are calling the server length(Country) times, which is a lot of requests in a short period, and you are going to get blocked. Either set some kind of timeout in between your calls or look into hacky scraper methods like rotating headers/cookies, etc.
I solved it by installing the development version of gtrendsR:
install.packages("remotes")
remotes::install_github("PMassicotte/gtrendsR")
(see https://github.com/PMassicotte/gtrendsR/issues/166)
and using the following
Country = readLines("states.csv")
Keyword = readLines("celebs.csv")
results <- list()   # initialise the results list, as in the question
for (i in Keyword)
{
for (j in Country) {
time=("2018-01-01 2018-06-30")
channel='web'
trends = gtrends(keyword=i, gprop =channel,geo=j, time = time, onlyInterest = TRUE,low_search_volume = FALSE)
Sys.sleep(5)
results [[j]][[i]] <- trends$interest_over_time
}
}
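Since results is now nested (country first, then keyword), collapsing it into a single data frame takes two binds rather than one, for example:

Out <- do.call("rbind", lapply(results, function(x) do.call("rbind", x)))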
I need as many tweets as possible for a given hashtag over a two-day time period. The problem is that there are too many of them (I guess ~1 million) to extract using just a time period specification:
It would definitely take a lot of time if I specify something like retryOnRateLimit = 120
I'll get blocked soon if I don't, and will only get tweets for half a day
The obvious answer for me is to extract a random sample with the given parameters, but I can't figure out how to do it.
My code is here:
a = searchTwitteR('hashtag', since="2017-01-13", n = 1000000, resultType = "mixed", retryOnRateLimit = 10)
The last try stopped at 17.5 thousand tweets, which covers only the past 12 hours.
P.S. It may be useful not to extract retweets, but I still don't know how to specify that within searchTwitteR().
The twitteR package is deprecated in favor of the rtweet package. If I were you, I would use rtweet to get every last one of those tweets.
Technically, you could specify 1 million straight away using search_tweets() from the rtweet package. I recommend breaking it up into pieces, however, since collecting 200,000 tweets will take several hours.
library(rtweet)

maxid <- NULL
rt <- vector("list", 5)

for (i in seq_len(5)) {
  rt[[i]] <- search_tweets("hashtag", n = 200000,
                           retryonratelimit = TRUE,
                           max_id = maxid)
  maxid <- rt[[i]]$status_id[nrow(rt[[i]])]
}
## extract users data and combine into data frame
users <- do.call("rbind", users_data(rt))
## collapse tweets data into data frame
rt <- do.call("rbind", rt)
## add users data as attribute
attr(rt, "users") <- users
## preview data
head(rt)
## preview users data (rtweet exports magrittr's `%>%` pipe operator)
users_data(rt) %>% head()
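Regarding the P.S. about excluding retweets: search_tweets() also has an include_rts argument, so something along these lines should drop them (untested sketch):

rt_no_rts <- search_tweets("hashtag", n = 200000,
                           retryonratelimit = TRUE,
                           include_rts = FALSE)   # exclude retweets from the results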