R - Use Twitter API to get every tweet from an account

R - Use Twitter API to get every tweet from an account - r

My goal is to get EVERY tweet ever for any twitter account. I picked the NYTimes for this example.
The code below works, but it only pulls the last 100 tweets. max_results does not allow you to put a value over 100.
The code below almost fully copy-paste-able, you would have to have your own bearer token.
How can I expand this to give me every tweet from an account?
One idea is that I can loop it for every day since the account was created, but that seems tedious if there is a faster way.
# NYT Example --------------------------------------------------------------------
library(httr)
library(jsonlite)
library(tidyverse)
bearer_token <- "insert your bearer token here"
headers <- c(`Authorization` = sprintf('Bearer %s', bearer_token))
params <- list(`user.fields` = 'description')
handle <- 'nytimes'
url_handle <- sprintf('https://api.twitter.com/2/users/by?usernames=%s', handle)
response <- httr::GET(url = url_handle,
httr::add_headers(.headers = headers),
query = params)
json_data <- fromJSON(httr::content(response, as = "text"), flatten = TRUE)
json_data %>%
as_tibble()
NYT_ID <- json_data$data$id
url_handle <- paste0("https://api.twitter.com/2/users/", NYT_ID, "/tweets")
params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
`max_results` = '100')
response <- httr::GET(url = url_handle,
httr::add_headers(.headers = headers),
query = params)
json_data <- fromJSON(httr::content(response, as = "text"), flatten = TRUE)
NYT_tweets <- json_data$data %>%
as_tibble() %>%
select(-id, -author_id, -9)
NYT_tweets

For anyone that finds this later on, I found a solution that works for me.
Using the parameters of start_time and end_time you can clarify dates for the tweets to be between. I was able to pull all tweets from November for example and then rbind those to the ones from December, etc. Sometimes I had to do two tweet pulls (half of March, second half of March) to get all of them, but it worked for this.
params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
`max_results` = '100',
`start_time` = '2021-11-01T00:00:01.000Z',
`end_time` = '2021-11-30T23:58:21.000Z')

Related

Scraping Tweets in R httr, jsonlite, dplyr

This is my code:
library(httr)
library(jsonlite)
library(dplyr)
bearer_token <- Sys.getenv("BEARER_TOKEN")
headers <- c('Authorization' = sprintf('Bearer %s', bearer_token))
params <- list('expansions' = 'attachments.media_keys')
handle <- readline('BenDuBose')
url_handle <-
sprintf('https://api.twitter.com/2/users/by?username=%s', handle)
response <-
httr::GET(url = url_handle,
httr::add_headers(.headers = headers),
query = params)
obj <- httr::content(response, as = "text")
print(obj)
This is my error message:
[1] "{"errors":[{"parameters":{"ids":[""]},"message":"The number of values in the ids query parameter list [0] is not between 1 and 100"}],"title":"Invalid Request","detail":"One or more parameters to your request was invalid.","type":"https://api.twitter.com/2/problems/invalid-request"}"
My end goal is to scrape an image from a specific tweet ID/user. I already have a list of users and tweet IDs, along with attachments.media_keys. But, I don't know how to use HTTR and I am trying to copy the Twitter Developer example verbatim to learn, but it isn't working.

Rvest wont return data

I have been trying to scrape the following table:

Your problem was that your request delivered you a html site, not a json response. Thus, parsing it as a json failed with the error you saw.
(I can't tell you exactly whether it was because you missed out on the accept_json() or whether the URL you used was a bit off.).
Either way, reverse engineering the essentials of the API request behind the table you linked, you'd have to put something like this together:
require(httr)
require(dplyr)
library(purrr)
first_req <- GET("https://www.barchart.com")
xsrf_token <- cookies(first_req) %>% filter(name == 'XSRF-TOKEN') %>% pull(value) %>% URLdecode()
req <- GET(
"https://www.barchart.com/proxies/core-api/v1/quotes/get",
query = list(
lists = "stocks.optionable.by_sector.all.us",
fields = "symbol,symbolName,lastPrice,priceChange,percentChange,highPrice,lowPrice,volume,tradeTime,symbolCode,symbolType,hasOptions",
orderBy = "symbol",
orderDir = "asc",
meta = "field.shortName,field.type,field.description",
hasOptions = TRUE,
#page = 1,
#limit = 100,
raw = 1
),
content_type_json(),
accept_json(),
add_headers(
"x-xsrf-token" = xsrf_token,
"referrer" = "https://www.barchart.com/options/stocks-by-sector?page=1"
)
)
table_data <- req %>%
content() %>%
.$data %>%
map_dfr(unlist)
This will get you the full list of 4258 items and coerce it into a tibble for convenience :)

Want to write a for loop that calls an API with some conditions (if/else)

I want to call an Api that gives me all my shop orders. I have a total of 86.000 orders where the ID of the first order is 2 and the ID of the most recent order is 250.000. Orders obviously are not count consecutive, due to some reason i dont know yet, but doesnt matter.
I started a simple script with a for loop where the ID gets updated in every loop, like this:
library(jsonlite)
library(httr)
user = "my_name"
token = "xyz"
y = 0
urls = rep("api.com/orders/", 250000)
for(i in urls){
y = y + 1
url = paste0(i, y)
a = httr::GET(url, authenticate(user, token))
a_content = httr:content(a, as = "text", encoding = "UTF-8")
a_json = jsonlite::fromJSON(a_content, flatten = T)
...
}
Problem here is, whenever there is an ID with no order, the loop stops with "{\"success\":false,\"message\":\"Order by id 1 not found\"}" So i somehow have to expand the code with some if else statements, like 'if the id does not match an order proceed to the next order'. Also i want to write all orders into a new list.
Any help apprichiated.

Maybe tryCatch can solve the problem. This url_exists function posted by user #hrbrmstr is called in the loop below.
Without testing, I would do something along the lines of:
base_url <- "api.com/orders/"
last_order_number <- 250000
for(i in seq.int(last_order_number)){
url = paste0(base_url, i)
if(url_exists(url)){
a <- httr::GET(url, authenticate(user, token))
a_content <- httr:content(a, as = "text", encoding = "UTF-8")
a_json <- jsonlite::fromJSON(a_content, flatten = T)
}
}

r Web scraping: Unable to read the main table

I am new to web scraping. I am trying to scrape a table with the following code. But I am unable to get it. The source of data is
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")
webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this apporach from the link (Web scraping of key stats in Yahoo! Finance with R) where it is applied on yahoo finance data.
In my opinion, in readHTMLTable(tableNodes[[12]]), it should be Table 12. But when I try giving tableNodes[[12]], it always gives me an error.
Error in do.call(data.frame, c(x, alis)) :
variable names are limited to 10000 bytes
Please suggest me the way to extract the table and combine the data from other tabs as well (Fundamental, Technical and Performance).

This data is returned dynamically as json. In R (behaves differently from Python requests) you get html from which you can extract a given page's results as json. A page includes all the tabs info and 50 records. From the first page you are given the total record count and therefore can calculate the total number of pages to loop over to get all results. Perhaps combine them info a final dataframe during a loop to total number of pages; where you alter the pn param of the XHR POST body to the appropriate page number for desired results in each new POST request. There are two required headers.
Probably a good idea to write a function that accepts a page number in signature and returns a given page's json as a dataframe. Apply that via a tidyverse package to handle loop and combining of results to final dataframe?
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers=headers), body = data)
s <- r %>%read_html()%>%html_node('p')%>% html_text()
page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])' )[1,2])
total_rows <- str_match(s, '"totalCount\":(\\d+),' )[1,2]%>%as.integer()
num_pages <- ceiling(total_rows/50)
My current attempt at combining which I would welcome feedback on. This is all the returned columns, for all pages, and I have to handle missing columns and different ordering of columns as well as 1 column being a data.frame. As the returned number is far greater than those visible on page, you could simply revise to subset returned columns with a mask just for the columns present in the tabs.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
get_data <- function(page_number){
data['pn'] = page_number
r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers=headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()
if(page_number==1){ return(s) }
else{return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])' )[1,2])))}
}
clean_df <- function(df){
interim <- df['viewData']
df_minus <- subset(df, select = -c(viewData))
df_clean <- cbind.data.frame(c(interim, df_minus))
return(df_clean)
}
initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])' )[1,2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),' )[1,2] %>% as.integer()
num_pages <- ceiling(total_rows/50)
dfs <- map(.x = 2:num_pages,
.f = ~clean_df(get_data(.)))
r <- rbindlist(c(list(df),dfs),use.names=TRUE, fill=TRUE)
write_csv(r, 'data.csv')

API Query for loop

I'm trying to pull some data from an API throw it all into a single data frame. I'm trying to put a variable into the URL I'm pulling from and then loop it to pull data from 54 keys. Here's what I have so far with notes.
library("jsonlite")
library("httr")
library("lubridate")
options(stringsAsFactors = FALSE)
url <- "http://api.kuroganehammer.com"
### This gets me a list of 58 observations, I want to use this list to
### pull data for each using an API
raw.characters <- GET(url = url, path = "api/characters")
## Convert the results from unicode to a JSON
text.raw.characters <- rawToChar(raw.characters$content)
## Convert the JSON into an R object. Check the class of the object after
## it's retrieved and reformat appropriately
characters <- fromJSON(text.raw.characters)
class(characters)
## This pulls data for an individual character. I want to get one of
## these for all 58 characters by looping this and replacing the 1 in the
## URL path for every number through 58.
raw.bayonetta <- GET(url = url, path = "api/characters/1/detailedmoves")
text.raw.bayonetta <- rawToChar(raw.bayonetta$content)
bayonetta <- fromJSON(text.raw.bayonetta)
## This is the function I tried to create, but I get a lexical error when
## I call it, and I have no idea how to loop it.
move.pull <- function(x) {
char.x <- x
raw.x <- GET(url = url, path = cat("api/characters/",char.x,"/detailedmoves", sep = ""))
text.raw.x <- rawToChar(raw.x$content)
char.moves.x <- fromJSON(text.raw.x)
char.moves.x$id <- x
return(char.moves.x)
}

The first part of this:
library(jsonlite)
library(httr)
library(lubridate)
library(tidyverse)
base_url <- "http://api.kuroganehammer.com"
res <- GET(url = base_url, path = "api/characters")
content(res, as="text", encoding="UTF-8") %>%
fromJSON(flatten=TRUE) %>%
as_tibble() -> chars
Gets you a data frame of the characters.
This:
pb <- progress_estimated(length(chars$id))
map_df(chars$id, ~{
pb$tick()$print()
Sys.sleep(sample(seq(0.5, 2.5, 0.5), 1)) # be kind to the free API
res <- GET(url = base_url, path = sprintf("api/characters/%s/detailedmoves", .x))
content(res, as="text", encoding="UTF-8") %>%
fromJSON(flatten=TRUE) %>%
as_tibble()
}, .id = "id") -> moves
Gets you a data frame of all the "moves" and adds the "id" for the character. You get a progress bar for free, too.
You can then either left_join() as needed or group & nest the moves data into a separate list-nest column. If you want that to begin with you can use map() vs map_df().
Leave in the time pause code. It's a free API and you should likely increase the pause times to avoid DoS'ing their site.