I'm using "TwitteR" package and R program to retrieve tweets information. Even though Twitter API provides
retweet_count’ function(https://dev.twitter.com/docs/faq#6899)
I couldn't figure out how to utilize it within R. ( Maybe using 'getURL' function in 'RCurl' package?)
Basically, I'm looking for ways to
the number of times specific tweet has been retweeted
Using Streaming API in R for getting real time information such as
a. new followers join those users, and
b. when they post tweets or retweets, and
c. when the tweets they have posted are re-tweeted by someone else
I would appreciate if anyone could help me out finding leads to get any of these information.
I can't help with the streaming API question, but how about this for working with retweets, based on this helpful tutorial. You could probably work with it to focus on specific tweets instead of numbers of retweets per user. Some of the posts here may be more useful.
# get package with functions for interacting with Twitter.com
require(twitteR)
# get 1500 tweets with #BBC tag, note that 1500 is the max, and it's subject to mysterious filtering and other restrictions by Twitter
s <- searchTwitter('#BBC', n=1500)
#
# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))
#
# Clean text of tweets
df$text <- sapply(df$text,function(row) iconv(row,to='UTF-8')) #remove odd characters
trim <- function (x) sub('@', '', x) # remove @ symbol from user names
#
# Extract retweets
library(stringr)
df$to <- sapply(df$to,function(name) trim(name)) # pull out who msg is to
df$rt <- sapply(df$text, function(tweet) trim(str_match(tweet, "^RT (@[[:alnum:]_]*)")[2])) # extract who each retweet credits
#
# basic analysis and visualisation of RT'd messages
sum(!is.na(df$rt)) # see how many tweets are retweets
sum(!is.na(df$rt))/length(df$rt) # the ratio of retweets to tweets
countRT <- table(df$rt)
countRT <- sort(countRT)
countRT.subset <- subset(countRT, countRT > 2) # subset those RT'd more than twice
barplot(countRT.subset,las=2,cex.names = 0.75) # plot them
#
# basic social network analysis using RT
# (not requested by OP, but may be of interest...)
rt <- data.frame(user=df$screenName, rt=df$rt) # tweeter-retweeted pairs
rt.u <- na.omit(unique(rt)) # omit pairs with NA, get only unique pairs
#
# begin sna
library(igraph)
g <- graph.data.frame(rt.u, directed = T)
ecount(g) # edges (connections)
vcount(g) # vertices (nodes)
diameter(g) # network diameter
farthest.nodes(g) # show the farthest nodes
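As for the original retweet_count question: if I remember correctly, newer versions of twitteR expose a retweetCount field on status objects, which also shows up as a column in the data frame built above, so something like the following (an untested sketch) may be enough.
# sort the searched tweets by their retweet counts (assumes the df built above
# has a retweetCount column; check names(df) if this errors)
head(df[order(df$retweetCount, decreasing = TRUE), c("screenName", "retweetCount", "text")])
# or look up one specific tweet by id and read its count
# (the id below is a made-up placeholder)
# s1 <- showStatus("123456789012345678")
# s1$retweetCount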
I am unable even to run the example code given in the botrnot documentation, and I'm unsure what's happening.
# libraries
library(rtweet)
library(tweetbotornot)
# authentication for twitter API
auth <- rtweet_app()
auth_setup_default()
users <- c("kearneymw", "geoffjentry", "p_barbera",
"tidyversetweets", "rstatsbot1234", "RStatsStExBot")
## get most recent 10 tweets from each user
tmls <- get_timeline(users, n = 10)
## pass the returned data to botornot()
data <- botornot(tmls)
I expect a data frame called data to be created, with an additional column giving the probability that each user is a bot. Instead I get this error:
Error in botornot.data.frame(tmls) : "user_id" %in% names(x) is not TRUE
The table at the bottom of the documentation is what I'm hoping to achieve.
https://www.rdocumentation.org/packages/botrnot/versions/0.0.2
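From the error message, it seems botornot() stops because the data returned by get_timeline() has no user_id column; running the failing check by hand confirms this.
# the exact condition from the error message, checked on my data
"user_id" %in% names(tmls)  # FALSE, which is what triggers the error
names(tmls)                 # inspect which columns get_timeline() actually returned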
I am a novice with R and a total newbie with the NHL API. I wrote an R program to extract all of the goals recorded in the NHL's data repository accessed through the NHL API using the R "nhlapi" package. I have code that works, but it's ugly and slow, and I wanted to see if anyone has suggestions for improving it. I am using the nhl_games_feed function provided by nhlapi to pull all events, from which I select the goals. This function returns a JSON blob (list of lists of lists of lists ...) in R, which I want to convert into a proper data.table.
I pasted a stripped-down version of my code below. I understand that normal practice here would be to include a sample data blob with the code so that other users can recreate my problems, but the data blob is the problem.
When I ran the full version of my code last night, the "Loop through games" portion took about 11 hours, and the "Convert players list to columns" took about 2 hours. Unless I can find a way to push the column or row filtering into the NHL's system, I don't think I am likely to find a way to speed up the "Loop through games" portion. So my first question: Does anyone have any thoughts about how to extract a subset of columns or rows using the NHL API, or do I need to pull everything and process it on my end?
My other question relates to the second chunk of code ("Convert players..."), which converts the resulting event data into a single row of scalar elements per event. The event data shows up in lblob_feed[[1]]$liveData$plays$allPlays, which contains one row of scalar elements per event, except that one of the elements, ..$allPlays$players, is itself a 4x5 data frame. As a result, the only way I could find to extract that data into scalar elements is the "Convert players..." loop. Is there a better way to convert this into a simple data.table?
Finally, any tips on other ways to end up with a comprehensive database of NHL events?
require("nhlapi")
require("data.table")
require("tidyverse")
require("hms")
assign("last.warning", NULL, envir = baseenv())
# create small list of selected games, using NHL API game code format
cSelGames <- c(2021020001, 2021020002, 2021020003, 2021020004)
liNumGames <- length(cSelGames)
print(liNumGames)
# 34370 games in the full database
# =============================================================================
# Loop through games
# Pull data for one game per call
Sys.time()
dtGoals <- data.table()
for (liGameNum in 1:liNumGames) {
  # Pull the NHL feed blob for one selected game
  # 11 hours in the full version
  lblob_feed <- nhl_games_feed(gameId = cSelGames[liGameNum])
  # Select only the play (event) portion of the feed blob
  ldtFeed <- as.data.table(c(lblob_feed[[1]]$gamePk, lblob_feed[[1]]$liveData$plays$allPlays))
  setnames(ldtFeed, 1, "gamePk")
  # Check for games with no play data - 1995020006 has none and would kill execution
  if ('result.eventCode' %in% colnames(ldtFeed)) {
    # Check for missing elements in allPlays list
    # team.triCode is missing for at least one game, probably for all-star games
    if (!('team.triCode' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (team.triCode = NA)]}
    if (!('result.strength.code' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.strength.code = NA)]}
    if (!('result.emptyNet' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.emptyNet = NA)]}
    # Select the events and columns for the output data table
    ldtGoals_new <- ldtFeed[(result.eventTypeId == 'GOAL')
                            , list(gamePk, result.eventCode, players, result.description
                            , team.triCode, about.period, about.periodTime
                            , about.goals.away, about.goals.home, result.strength.code, result.emptyNet)]
    # Append the incremental data table to the aggregate data table
    dtGoals <- rbindlist(list(dtGoals, ldtGoals_new), use.names=TRUE, fill=TRUE)
  }
}
# =============================================================================
# Convert players list to columns
# 2 hours in the full version
# 190686 goals in full table
# For each goal, the player dataframe is 4x5
Sys.time()
dtGoal_player <- data.table()
for (i in 1:dtGoals[,.N]) {
  # convert rows with embedded dataframes into multiple rows with scalar elements
  s_result.eventCode <- dtGoals[i, result.eventCode]
  dtGoal_player_new <- as.data.table(dtGoals[i, players[[1]]])
  dtGoal_player_new[, ':=' (result.eventCode = s_result.eventCode)]
  dtGoal_player <- rbindlist(list(dtGoal_player, dtGoal_player_new), use.names=TRUE, fill=TRUE)
}
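# (Possibly faster alternative to the loop above, untested against the real NHL
# blob: rbindlist() can flatten the whole players list-column in one grouped
# call, assuming result.eventCode uniquely identifies each goal row.)
# dtGoal_player <- dtGoals[, rbindlist(players, fill = TRUE), by = result.eventCode]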
# drop players element
dtGoals[, players:=NULL]
# clean up problem with duplicated rows with playerType=Assist
dtGoal_player[, lag.playerType:=c('nomatch', playerType[-.N]), by=result.eventCode]
dtGoal_player[, playerType2:=ifelse((playerType==lag.playerType),'Assist2',playerType)]
# transpose multiple rows per event into single row with multiple columns for playerType
dtGoal_player_t <- dcast.data.table(dtGoal_player, result.eventCode ~ playerType2
, value.var='player.id', fun.aggregate=max)
# =============================================================================
# Merge players data into dtGoals
Sys.time()
dtGoals <- merge(dtGoals, dtGoal_player_t, by="result.eventCode")
Sys.time()
I am doing research on U.S. lobbying, which publishes its data as an open API that is poorly integrated and only seems to allow 250 observations to be downloaded at a time. I would like to compile the whole data set into one data table, but I am struggling with the last step. This is what I have thus far:
library(httr) # for GET()
base_url <- sample("https://lda.senate.gov/api/v1/contributions/?page=", 10, rep = TRUE) #Set the number between the commas as how many pages you want
numbers <- 1:10 #Set the second number as how many pages you want
pagesize <- sample("&page_size=250", 10, rep = TRUE) #Set the number between the commas as how many pages you want
pages <- data.frame(base_url, numbers, pagesize)
pages$numbers <- as.character(pages$numbers)
pages$url <- with(pages, paste0(base_url, numbers, pagesize)) # creates list of pages you want. the list is titled pages$url
for (i in 1:length(pages$url)) assign(pages$url[i], GET(pages$url[i])) # Creates all the base lists in need of extraction
The last two things I need to do are extract the data table from each of the created lists and then full join all of them. I know how to join them, but extracting the data frames is proving to be challenging. Basically, I need to apply the function fromJSON(rawToChar(list$content)) to all of the created lists. I have tried using lapply but have yet to figure it out. Any help would be greatly welcomed!
When you use assign() on GET(pages$url[i]), you end up with a separate loose object for every URL, which is awkward to process afterwards. Better to store the responses in a list and keep them as response objects:
library(httr)
library(jsonlite)
library(dplyr) # for bind_rows
page_content <- list()
for (i in 1:length(pages$url)) page_content[[i]] <- GET(pages$url[i]) # Creates all the base lists in need of extraction
Then you can use the code you had written - fromJSON(rawToChar()) - to convert the raw bytes to text and parse the JSON:
results_list <- lapply(
page_content,
\(page) fromJSON(rawToChar(page[["content"]]))["results"][[1]]
)
results_table <- do.call(bind_rows, results_list)
dim(results_table) # 2500 27
names(results_table)
# [1] "url" "filing_uuid" "filing_type" "filing_type_display" "filing_year"
# [6] "filing_period" "filing_period_display" "filing_document_url" "filing_document_content_type" "filer_type"
# [11] "filer_type_display" "dt_posted" "contact_name" "comments" "address_1"
# [16] "address_2" "city" "state" "state_display" "zip"
# [21] "country" "country_display" "registrant" "lobbyist" "no_contributions"
# [26] "pacs" "contribution_items"
I am interested in creating a network graph similar to the one displayed on this person's website - the first one on this page >> http://minimaxir.com/2016/12/interactive-network/
I would like the nodes of this graph to be the words in a .txt document (after removing stopwords and other pre-processing). I would also like the edges of this graph to be the correlations between words in the document (e.g. the word "word" occurs frequently next to the word "up"), keeping only the stronger correlations. I was thinking "size of node" = "frequency of word" in the document overall, and "distance between nodes" = "strength/weakness of relationship" between words.
I am currently using a combination of R, quanteda and ggplot2, as well as some other dependencies.
If anyone has any advice on how I can generate word correlations in R (preferably with quanteda) and then plot them as a graph, I would be forever grateful!
Of course, if there are any improvements I can make to this question, please let me know. Here is where I'm at so far with my attempt:
library(quanteda)
library(readtext)
library(ggplot2)
library(stringi)
## Load the .txt doc
document <- texts(readtext("file1.txt"))
## Make everything lowercase... store in a separate variable
documentlower <- char_tolower(document)
## Tokenize the lower-case document
documenttokens <- tokens(documentlower, remove_punct = TRUE) %>% as.character()
(total_length <- length(documenttokens))
## Create the Document Frequency Matrix - here we can also remove stopwords and stem
docudfm <- dfm(documentlower, remove_punct = TRUE, remove = stopwords("english"), stem = TRUE)
## Inspect the top 10 Words by Count
textstat_frequency(docudfm, n = 10)
## Create a sorted list of tokens by frequency count
sorted_document <- topfeatures(docudfm, n = nfeat(docudfm))
## Normalize the data points to find their percentage of occurrence in the documents
sorted_document <- sorted_document / sum(sorted_document) * 100
## Also normalize the data points in the DFM
docudfm_pct <- dfm_weight(docudfm, scheme = "prop") * 100
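One possible direction for the word-correlation/network part (a rough, untested sketch): quanteda's fcm() builds a word co-occurrence matrix, and textplot_network() can draw the strongest links as a graph. Note that textplot_network() lives in the separate quanteda.textplots package in newer quanteda versions; the window size and min_freq cutoff below are arbitrary choices.
## Sketch only: co-occurrence network of the most frequent words
library(quanteda.textplots) # older quanteda versions have textplot_network() in quanteda itself
documenttoks <- tokens_remove(tokens(documentlower, remove_punct = TRUE), stopwords("english"))
docufcm <- fcm(documenttoks, context = "window", window = 5) # co-occurrences within a 5-word window
feat <- names(topfeatures(docufcm, 50))                      # keep the 50 most frequent words
textplot_network(fcm_select(docufcm, pattern = feat), min_freq = 0.5)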
I need as many tweets as possible for a given hashtag over a two-day time period. The problem is there are too many of them (I guess ~1 million) to extract using just a time-period specification:
It would definitely take a lot of time if I specify something like retryOnRateLimit = 120
I'll get blocked soon if I don't, and then I only get tweets for about half a day
The obvious answer for me is to extract a random sample with the given parameters, but I can't figure out how to do it.
My code is here:
a = searchTwitteR('hashtag', since="2017-01-13", n = 1000000, resultType = "mixed", retryOnRateLimit = 10)
The last try stopped at 17,500 tweets, which covers only the past 12 hours.
P.S. It may be useful not to extract retweets at all, but I still don't know how to specify that within searchTwitteR().
The twitteR package is deprecated in favor of the rtweet package. If I were you, I would use rtweet to get every last one of those tweets.
Technically, you could specify 1 million straight away using search_tweets() from the rtweet package. I recommend breaking it up into pieces, though, since collecting 200,000 tweets will take several hours.
library(rtweet)
maxid <- NULL
rt <- vector("list", 5)
for (i in seq_len(5)) {
  rt[[i]] <- search_tweets("hashtag", n = 200000,
                           retryonratelimit = TRUE,
                           max_id = maxid)
  maxid <- rt[[i]]$status_id[nrow(rt[[i]])]
}
## extract users data and combine into data frame
users <- do.call("rbind", users_data(rt))
## collapse tweets data into data frame
rt <- do.call("rbind", rt)
## add users data as attribute
attr(rt, "users") <- users
## preview data
head(rt)
## preview users data (rtweet exports magrittr's `%>%` pipe operator)
users_data(rt) %>% head()
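As for the P.S. about skipping retweets: search_tweets() has an include_rts argument, and Twitter's -filter:retweets search operator also works inside the query string, so either of these (untested here) should do it:
## drop retweets via the function argument
no_rts <- search_tweets("hashtag", n = 1000, include_rts = FALSE)
## or via Twitter's search operator in the query itself
no_rts2 <- search_tweets("hashtag -filter:retweets", n = 1000)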