I'm sure everyone is aware of the limitations of Twitter's API and the inability to get the number of replies a given tweet has.
I'm after this information, so I'm trying to loop over all the followers of a given account, look back at everything they've tweeted, and check whether those tweets are replies to the original account.
I believe this will get around the issue of only being able to retrieve the last week's worth of tweets, which is all the search function returns.
The code below doesn't work; any help would be very useful! I know Python can do this, but I'd like an R solution if possible.
library(rtweet)
library(plyr)
library(dplyr)
##set name of tweeter to look at (this can be changed)
targettwittername <- "realDonaldTrump"
##get this tweeter's timeline
tmls <- get_timeline(targettwittername, n=3200, retryonratelimit=TRUE)
##get their user id
targettwitteruserid <- as.numeric(select(lookup_users(targettwittername), user_id))
##get ids of their tweets
tweetids <- select(tmls, status_id)
tweetids <- transform(tweetids, status_id_num=as.numeric(status_id))
##get list of followers (who are most likely to reply)
targetfollowers <- get_followers(targettwittername, retryonratelimit=TRUE)
master_reply_counts <- data.frame(reply_to_status_id_num=integer(), n=integer())
getfollowersreplies <- function(follower){
  follower <- as.numeric(filter(targetfollowers, user_id==follower))
  followertl <- get_timeline(user = follower, n=3200, retryonratelimit=TRUE)
  followertl <- filter(followertl, reply_to_user_id == targettwitteruserid)
  followertl <- transform(followertl, reply_to_status_id_num=as.numeric(reply_to_status_id))
  join <- inner_join(followertl, tweetids, by=c("reply_to_status_id_num"="status_id_num"))
  replycounts <- join %>%
    group_by(reply_to_status_id_num) %>%
    summarise(n=n())
  master_reply_counts <- rbind(master_reply_counts, replycounts)
  return(follower)
}
data <- ldply(targetfollowers$user_id, getfollowersreplies)
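A note on why nothing accumulates here: the rbind() inside the function only updates a local copy of master_reply_counts, and the function returns the follower id rather than the counts. Below is a minimal sketch of an alternative, keeping the same rtweet/dplyr calls but simply returning each follower's reply counts so that ldply() can stack them; it is untested against the live API.
getfollowersreplies <- function(follower_id){
  ## pull the follower's timeline and keep only replies to the target account
  followertl <- get_timeline(user = follower_id, n = 3200, retryonratelimit = TRUE)
  followertl <- filter(followertl, reply_to_user_id == targettwitteruserid)
  followertl <- transform(followertl, reply_to_status_id_num = as.numeric(reply_to_status_id))
  joined <- inner_join(followertl, tweetids, by = c("reply_to_status_id_num" = "status_id_num"))
  ## return the per-tweet reply counts for this follower
  joined %>%
    group_by(reply_to_status_id_num) %>%
    summarise(n = n())
}
master_reply_counts <- ldply(targetfollowers$user_id, getfollowersreplies)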
I am trying to get data from the UN Stats API for a list of indicators (https://unstats.un.org/SDGAPI/swagger/).
I have constructed a loop that can be used to get the data for a single indicator (code is below). The loop can be applied to multiple indicators as needed. However, this is likely to cause problems relating to large numbers of requests, potentially being perceived as a DDoS attack and taking far too long.
Is there an alternative way to get data for an indicator for all years and countries without making a ridiculous number of requests or in a more efficient manner than below? I suppose this question likely applies more generally to other similar APIs as well. Any help would be most welcome.
Please note: I have seen the post here (Faster download for paginated nested JSON data from API in R?) but it is not quite what I am looking for.
Minimal working example
# libraries
library(jsonlite)
library(dplyr)
library(purrr)
# get the meta data
page <- "https://unstats.un.org/SDGAPI/v1/sdg/Series/List"
sdg_meta <- fromJSON(page) %>% as.data.frame()
# parameters
PAGE_SIZE <- 100000
N_PAGES <- 5
FULL_DF <- NULL
my_code <- "SI_COV_SOCINS"
# loop to go over pages
for(i in seq(1, N_PAGES, 1)){
  ind <- which(sdg_meta$code == my_code)
  cat(paste0("Processing : ", my_code, " ", i, " of ", N_PAGES, " \n"))
  # note the "&" before pageSize; without it the page-size parameter is dropped from the query
  my_data_page <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=", my_code, "&page=", i, "&pageSize=", PAGE_SIZE)
  df <- fromJSON(my_data_page) # depending on the data you are calling, you will get a list
  df <- df$data %>% as.data.frame() %>% distinct()
  # break the loop when there is nothing more to add
  if(is_empty(df)){
    break
  }
  FULL_DF <- rbind(FULL_DF, df)
  Sys.sleep(5) # sleep to avoid hammering the API
}
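For multiple indicators, one option is to wrap the paging logic in a function and iterate over a vector of series codes, keeping the pause between requests so the API is not flooded. A minimal sketch along those lines, assuming the same endpoint and pagination behaviour as above (the second indicator code is just a placeholder):
get_series <- function(code, page_size = 100000, max_pages = 5){
  pages <- list()
  for(i in seq_len(max_pages)){
    page_url <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=",
                       code, "&page=", i, "&pageSize=", page_size)
    res <- fromJSON(page_url)$data
    if(is_empty(res)) break          # stop once a page comes back empty
    pages[[i]] <- as.data.frame(res) %>% distinct()
    Sys.sleep(5)                     # be polite between requests
  }
  bind_rows(pages)
}
my_codes <- c("SI_COV_SOCINS", "SI_COV_PENSN")  # placeholder list of indicator codes
all_data <- map_dfr(my_codes, get_series)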
I have set up an API access key with a data provider of stock market data. With this key I am able to extract stock market data based on ticker code (e.g. APPL: Apple, FB: Facebook, etc.).
I am able to extract stock data on an individual-ticker basis using R, but I want to write a piece of code that extracts data for multiple stock tickers and puts them all in one data frame (the structure is the same for all stocks). I'm not sure how to create a loop that updates the data frame each time stock data is extracted. I get a message saying 'No encoding supplied: defaulting to UTF-8', which does not tell me much. A point in the right direction would be helpful.
I have the following code:
if (!require("httr")) {
install.packages("httr")
library(httr)
}
if (!require("jsonlite")) {
install.packages("jsonlite")
library(jsonlite)
}
stocks <- c("FB","APPL") #Example stocks actual stocks removed
len <- length(stocks)
url <- "URL" #Actual url removed
access_key <- "MY ACCESS KEY" #Actual access key removed
extraction <- lapply(stocks[1:len], function(ticker){
  # lapply passes each ticker symbol itself, so use it directly rather than indexing stocks by it
  call1 <- paste(url, "?access_key=", access_key, "&", "symbols", "=", ticker, sep = "")
  get_prices <- GET(call1)
  get_prices_text <- content(get_prices, "text", encoding = "UTF-8")
  get_prices_json <- fromJSON(get_prices_text, flatten = TRUE)
  get_prices_df <- as.data.frame(get_prices_json)
  return(get_prices_df)
})
file <- do.call(rbind,extraction)
I realised that this is not the most efficient way of doing this. A better way is to update the URL to include multiple stocks rather than using an lapply function. I am therefore closing the question.
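For reference, a minimal sketch of that single-request approach, assuming the provider accepts a comma-separated list in the symbols parameter (the exact query format depends on the API, so check its documentation):
call_all <- paste0(url, "?access_key=", access_key,
                   "&symbols=", paste(stocks, collapse = ","))
resp <- GET(call_all)
resp_text <- content(resp, "text", encoding = "UTF-8")
resp_json <- fromJSON(resp_text, flatten = TRUE)
file <- as.data.frame(resp_json)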
My goal is to scrape all the diamond data from bluenile.com. I've got some code that seems to be doing that, but it only grabs the first 61 rows.
By the way, I am using the SelectorGadget Chrome plugin to get the CSS selectors. If I scroll down a little, the highlighting stops. Is that something to do with the website?
library('rvest')
le_url <- "https://www.bluenile.com/diamonds/round-cut?track=DiaSearchRDmodrn"
webpage <- read_html(le_url)
shape_data_html <- html_nodes(webpage,'.shape')
price_data_html <- html_nodes(webpage,'.price')
carat_data_html <- html_nodes(webpage,'.carat')
cut_data_html <- html_nodes(webpage,'.cut')
color_data_html <- html_nodes(webpage,'.color')
clarity_data_html <- html_nodes(webpage,'.clarity')
#Converting data to text
shape_data <- html_text(shape_data_html)
price_data <- html_text(price_data_html)
carat_data <- html_text(carat_data_html)
cut_data <- html_text(cut_data_html)
color_data <- html_text(color_data_html)
clarity_data <- html_text(clarity_data_html)
# combine into a data frame (the first scraped row holds the column labels)
le_mat <- cbind(shape_data, price_data, carat_data, cut_data, color_data, clarity_data)
le_df <- as.data.frame(le_mat[-1, ], stringsAsFactors = FALSE)
colnames(le_df) <- le_mat[1, ]
Data is dynamically added via an API call as you scroll down the page. The API call has a query string that lets you specify the start row (startIndex) and the number of results per page (pageSize); the maximum page size seems to be 1000. The return is JSON, from which you can extract everything you want, including the total number of rows (available under the key countRaw). So you can request the initial 1000 results, parse out the total row count from countRaw, and loop, adjusting startIndex each time until you have all the results.
You can use a JSON parser, e.g. jsonlite, to handle the response.
Example API endpoint call for first 1000 results:
https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus=
library(jsonlite)
url <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url)
print(r$countRaw)
You get a list of 8 elements from each call. r$results is a dataframe containing info of main interest.
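A quick way to inspect the shape of the response before deciding what to keep (plain base-R inspection of the object returned above):
str(r, max.level = 1)   # the eight top-level elements
head(r$results)         # the data frame of diamond records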
Given the indicated result count I was expecting I could do something like (bearing in mind my limited R experience):
total <- r$countRaw
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=placeholder&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
if(total > 1000){
  for(i in seq(1000, total + 1, by = 1000)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$results
    # do something with newdf, e.g. bind it onto the earlier results
  }
}
However, it seems that there are only results for the first two calls, i.e. the initial df from r$results shown above and then:
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=1000&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url2)
df2 <- r$results
Searching the page with the CSS selector .row yields 1002 results, versus the indicated 'All diamonds' total; so I think there is some exploration to do around the filters.
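Putting the pieces together, here is a minimal sketch of the paging loop that simply stops when a page comes back empty and binds whatever pages do return rows (untested beyond the two pages that currently respond):
library(jsonlite)
base_url <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=placeholder&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
all_pages <- list()
start <- 0
repeat {
  page <- jsonlite::fromJSON(gsub("placeholder", as.character(start), base_url))$results
  if (is.null(page) || nrow(page) == 0) break   # no more rows returned
  all_pages[[length(all_pages) + 1]] <- page
  start <- start + 1000
}
diamonds <- do.call(rbind, all_pages)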
I need as many tweets as possible for a given hashtag over a two-day period. The problem is there are too many of them (I'd guess ~1 million) to extract with just a time-period specification:
It would take a very long time if I specify something like retryOnRateLimit = 120.
I'll get blocked soon if I don't, and will only get tweets for half a day.
The obvious answer for me is to extract a random sample with the given parameters, but I can't figure out how to do it.
My code is here:
a = searchTwitteR('hashtag', since="2017-01-13", n = 1000000, resultType = "mixed", retryOnRateLimit = 10)
The last attempt stopped at about 17,500 tweets, which covers only the past 12 hours.
P.S. It may be useful not to extract retweets, but I still don't know how to specify that within searchTwitteR().
The twitteR package is deprecated in favor of the rtweet package. If I were you, I would use rtweet to get every last one of those tweets.
Technically, you could specify 1 million straight away using search_tweets() from the rtweet package. I recommend breaking it up into pieces, however, since collecting 200,000 tweets will take several hours.
library(rtweet)
maxid <- NULL
rt <- vector("list", 5)
for (i in seq_len(5)) {
  rt[[i]] <- search_tweets("hashtag", n = 200000,
                           retryonratelimit = TRUE,
                           max_id = maxid)
  maxid <- rt[[i]]$status_id[nrow(rt[[i]])]
}
## extract users data and combine into data frame
users <- do.call("rbind", users_data(rt))
## collapse tweets data into data frame
rt <- do.call("rbind", rt)
## add users data as attribute
attr(rt, "users") <- users
## preview data
head(rt)
## preview users data (rtweet exports magrittr's `%>%` pipe operator)
users_data(rt) %>% head()
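On the side question about retweets: with rtweet you can drop them at collection time via the include_rts argument of search_tweets(), for example:
## include_rts = FALSE excludes retweets from the search results
rt_no_retweets <- search_tweets("hashtag", n = 18000, include_rts = FALSE)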
I am looking for ways to get the company description, key statistics, and chairman's name from Yahoo Finance (or another financial website) using R, for example with the quantmod package.
There is plenty of information on how to get current and historical prices, etc., but that is not what I want.
This R package does not support queries for Asian bourses. The problem appears to be with the underlying Yahoo APIs.
You can get that using Intrinio's API. Their data-tag directory lets you look up the tags you want; in your case, the "long_description" and "ceo" tags will get you the data you want:
#Install httr, which you need to request data via API
install.packages("httr")
require("httr")
#Create variables for your username and password; get those at intrinio.com/login
username <- "Your_API_Username"
password <- "Your_API_Password"
#Making the API calls for the company description and CEO. This puts together the different parts of each call
base <- "https://api.intrinio.com/"
endpoint <- "data_point"
stock <- "T"
item1 <- "long_description"
item2 <- "ceo"
#Pasting them together to make the API call
call1 <- paste(base,endpoint,"?","ticker","=", stock, "&","item","=",item1, sep="")
call2 <- paste(base,endpoint,"?","ticker","=", stock, "&","item","=",item2, sep="")
#Now we use the API call to request the data from Intrinio's database
ATT_description <- GET(call1, authenticate(username,password, type = "basic"))
ATT_CEO <- GET(call2, authenticate(username,password, type = "basic"))
#That gives us the raw responses, but they aren't in a good format, so we parse them
test1 <- unlist(content(ATT_description,"parsed"))
test2 <- unlist(content(ATT_CEO,"parsed"))
#Then make your data frame:
df1 <- data.frame(test1)
df2 <- data.frame(test2)
#From here you can rbind or cbind, and create loops to get the same data for many tickers
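As an example, here is a minimal sketch of such a loop over several tickers, reusing the call-building and parsing pattern above; the ticker list is just an illustration, and the columns you end up with depend on the fields Intrinio returns for each item:
tickers <- c("T", "VZ", "IBM")   # example tickers
get_item <- function(tk, item){
  call <- paste(base, endpoint, "?ticker=", tk, "&item=", item, sep = "")
  resp <- GET(call, authenticate(username, password, type = "basic"))
  unlist(content(resp, "parsed"))   # same parsing step as above
}
# one row per ticker, assuming each call returns the same set of fields
descriptions <- do.call(rbind, lapply(tickers, get_item, item = item1))
ceos <- do.call(rbind, lapply(tickers, get_item, item = item2))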
You can get your API keys from your Intrinio account (see intrinio.com/login above), and the full API documentation is available on Intrinio's site.