I recently used the search_fullarchive() function in the rtweet package to pull a large number of tweets (~500,000). Because the resulting file is so large and I needed to transfer the data to another computer, I saved it in .rds format, which appeared to be smaller. I then used the following base R function to read the file back in:
tweet <- readRDS("20201211164534-1.rds")
However, instead of a dataset I got the object captured in the screenshot below:
[screenshot of View(tweet)]
typeof(tweet)
[1] "list"
I opened every node but cannot find the dataset. Does anyone have an idea if/where I can find the dataset with my tweets in the .rds file? Thank you!
Update: Someone asked which function I used to save the tweet dataset. I did not use one; R automatically stored the .rds files in my working directory when I ran search_fullarchive() as below:
fever.june.1 <- search_fullarchive(
  q = fever,                # Search query on which to match/filter tweets
  n = 500000,               # Number of tweets to return; it is best to set this number in intervals of 100 for the '30day' API and either 100 (for sandbox) or 500 (for paid) for the 'fullarchive' API. Default is 100.
  fromDate = 202006010000,  # Oldest date-time (YYYYMMDDHHMM)
  toDate = 202006302359,    # Newest date-time (YYYYMMDDHHMM)
  env_name = "develop",     # Name/label of developer environment to use for the search
  # safedir = "~/Desktop/", # Name of directory to which each response object should be saved
  parse = TRUE,             # Logical indicating whether to convert data into a data frame
  token = token             # A token associated with a user-created APP
)
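A quick way to hunt for the tweet data inside an object like this is to inspect the list's structure and pull out any data-frame elements. This is only a sketch: it assumes the .rds file holds a (possibly nested) list of parsed responses, and the last step only applies if data frames actually turn up one level down.

tweet <- readRDS("20201211164534-1.rds")

# Look at the top couple of levels without printing the whole object
str(tweet, max.level = 2)

# Are any of the top-level elements already data frames?
which(sapply(tweet, is.data.frame))

# If instead each top-level element is a list that contains a data frame
# (e.g. one per API request), drop one level and stack the data frames
inner     <- unlist(tweet, recursive = FALSE)
dfs       <- Filter(is.data.frame, inner)
tweets_df <- dplyr::bind_rows(dfs)   # or do.call(rbind, dfs)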
Related
I want to import JSON data from a fitness tracker in order to run some analysis on it. The individual JSON files are quite large, while I am only interested in a few specific numbers per training session (each JSON file is one training session).
I managed to read in the file names and to grab the interesting content out of the files. Unfortunately, my code obviously does not work correctly if one or more pieces of information are missing in some of the JSON files (e.g. distance is not available because it was an indoor training session).
I stored all JSON files with training sessions in a folder (path in the code) and asked R for a list of the files in that folder:
json_files <- list.files(path, pattern = ".json", full.names = TRUE)  # this is the list of files
jlist <- as.list(json_files)
Then I wrote this function to get the data I'm interested in from each single file (reading in the full content of every file exceeded my available RAM):
importPFData <- function(x) {
  testimport <- fromJSON(x)   # fromJSON() from the jsonlite package
  sport            <- testimport$exercises$sport
  starttimesession <- testimport$exercises$startTime
  endtimesession   <- testimport$exercises$stopTime
  distance         <- testimport$exercises$distance
  durationsport    <- testimport$exercises$duration
  maxHRsession     <- testimport$exercises$heartRate$max
  minHRsession     <- testimport$exercises$heartRate$min
  avgHRsession     <- testimport$exercises$heartRate$avg
  calories         <- testimport$exercises$kiloCalories
  VO2max_overall   <- testimport$physicalInformationSnapshot$vo2Max
  return(c(starttimesession, endtimesession, sport, distance, durationsport,
           maxHRsession, minHRsession, avgHRsession, calories, VO2max_overall))
}
Next I applied this function to all elements of my list of files:
dataTest <- sapply(jlist, importPFData)
I receive a list with one entry per file, as expected. Unfortunately, not all of the data is available in every file, which results in some entries having 7 elements and others having 8, 9 or 10.
I struggle with getting this into a proper data frame because the missing information is not shown as NA or 0; it is just left out.
Is there an easy way to have the function above include NA whenever a piece of information is not found in an individual JSON file (e.g. distance not available -> NA for distance for that single entry)? A sketch of one approach follows the examples below.
Example (csv) of the content of a file with 10 entries:
"","c..2013.01.06T08.52.38.000....2013.01.06T09.52.46.600....RUNNING..."
"1","2013-01-06T08:52:38.000"
"2","2013-01-06T09:52:46.600"
"3","RUNNING"
"4","6890"
"5","PT3608.600S"
"6","234"
"7","94"
"8","139"
"9","700"
"10","48"
Example (csv) for a file with only 7 entries (columns won't match Example 1):
"","c..2014.01.22T18.38.30.000....2014.01.22T18.38.32.000....RUNNING..."
"1","2014-01-22T18:38:30.000"
"2","2014-01-22T18:38:32.000"
"3","RUNNING"
"4","0"
"5","PT2S"
"6","0"
"7","46"
I need to find historical time series for some stocks and have the result in R.
I already tried the quantmod package, but unfortunately most of the stocks are not covered.
I found Excel's STOCKHISTORY function to yield good results.
Hence, to allow for more solutions, I phrase my question more openly:
How do I get from a table that contains stocks (ticker), start date and end date to a list in R which contains each respective stock and its stock price time series?
My starting point looks like this: [screenshot of the input table]
My aim at the very end is to have something like this: [screenshot of the desired output]
(It's also fine if I end up with every single stock price time series as a csv.)
My ideas so far:
Excel VBA Solution - 1
Write a macro that executes Excel's STOCKHISTORY function on each of these stocks and writes the results as csv (or similar). Then read those in and create a list in R.
Excel VBA Solution - 2
Write a macro that executes Excel's STOCKHISTORY function on each of these stocks, each one in a new worksheet? Bad idea, since there are more than 4,000 stocks.
R Solution
(If possible) call the STOCKHISTORY function from R directly(?)
Any suggestions on how to tackle this?
Kind regards
I would recommend using an API, especially over trying to connect to Excel via VBA. There are many that are free but require an API key from their website, for example Alpha Vantage.
library(tidyverse)
library(jsonlite)
library(httr)

symbol <- "IBM"
av_key <- "demo"
url <- str_c("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=", symbol, "&apikey=", av_key, "&datatype=csv")
d <- read_csv(url)
d %>% head()
Credit and other options: https://quantnomad.com/2020/07/06/best-free-api-for-historical-stock-data-examples-in-r/
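To go from the table of tickers and dates to one time series per stock, something along these lines could work. This is a sketch only: it assumes the input table is called stocks with columns Ticker, StartDate and EndDate (stored as Dates), that every symbol is available on Alpha Vantage, and that you respect the API's rate limits (with 4,000+ symbols the free tier will be slow or insufficient).

library(tidyverse)

av_key <- "YOUR_KEY"   # your own Alpha Vantage API key

# Fetch daily prices for one symbol and keep only the requested date window
get_history <- function(symbol, start, end) {
  url <- str_c("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY",
               "&symbol=", symbol, "&outputsize=full",
               "&apikey=", av_key, "&datatype=csv")
  read_csv(url) %>%
    filter(timestamp >= start, timestamp <= end)
}

# stocks: input table with columns Ticker, StartDate, EndDate (assumed names)
price_list <- set_names(
  pmap(list(stocks$Ticker, stocks$StartDate, stocks$EndDate), get_history),
  stocks$Ticker
)

# or write every series to its own csv instead
walk2(price_list, names(price_list), ~ write_csv(.x, str_c(.y, ".csv")))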
I have a series of .json files each containing data captured from between 500 and 10,000 tweets (3-40 MB each). I am trying to use rtweet's parse_stream() function to read these files into R and store the tweet data in a data table. I have tried the following:
tweets <- parse_stream(path = "india1_2019090713.json")
There is no error message and the command creates a tweets object, but it is empty (NULL). I have tried this with other .json files and the result is the same. Has anyone encountered this behaviour, or is there something obvious I am doing wrong? I would appreciate any advice for an rtweet newbie!
I am using rtweet version 0.6.9.
Many thanks!
As an update and partial answer:
I've not made progress with the original issue, but I have had a lot more success using the jsonlite package, which is amply able to read in large and complex .json files containing Tweet data.
library(jsonlite)
I used the fromJSON() function as detailed here. I found I needed to edit the original .json file to match the required structure: beginning and ending the file with square brackets ([ ]) and adding a comma at the end of each line, i.e. after each Tweet. Then:
tweetsdf <- fromJSON("india1_2019090713.json", simplifyDataFrame = TRUE, flatten = TRUE)
simplifyDataFrame ensures the contents are saved as a data frame with one row per Tweet, and flatten collapses most of the nested Tweet attributes to separate columns for each sub-value rather than generating columns full of unwieldy list structures.
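If the file is newline-delimited JSON (one Tweet object per line, which is the usual format for streamed Twitter data), jsonlite can also read it without hand-editing the file; a sketch, assuming that format and the file name from the question:

library(jsonlite)

# stream_in() reads newline-delimited JSON (one object per line),
# so no brackets or trailing commas need to be added by hand
tweets_raw <- stream_in(file("india1_2019090713.json"))
tweetsdf   <- flatten(tweets_raw)   # spread nested attributes into separate columns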
I'm having trouble accessing the Energy Information Administration's API through R (https://www.eia.gov/opendata/).
On my office computer, if I try the link in a browser it works, and the data shows up (the full url: https://api.eia.gov/series/?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json).
I am also successfully connected to Bloomberg's API through R, so R is able to access the network.
Since the API is working and not blocked by my company's firewall, and R is in fact able to connect to the Internet, I have no clue what's going wrong.
The script works fine on my home computer, but on my office computer it is unsuccessful. So I gather it is a network issue, but if somebody could point me in any direction as to what the problem might be, I would be grateful (my IT department couldn't help).
library(XML)

api.key <- "e122a1411ca0ac941eb192ede51feebe"
series.id <- "PET.MCREXUS1.M"
my.url <- paste("http://api.eia.gov/series?series_id=", series.id, "&api_key=", api.key, "&out=xml", sep = "")
doc <- xmlParse(file = my.url, isURL = TRUE)  # yields an error
Error msg:
No such file or directoryfailed to load external entity "http://api.eia.gov/series?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json"
Error: 1: No such file or directory2: failed to load external entity "http://api.eia.gov/series?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json"
I tried some other methods like read_xml() from the xml2 package, but this gives a "could not resolve host" error.
To get XML, you need to request XML in your URL (&out=xml) and fetch it with httr:
my.url <- paste("http://api.eia.gov/series?series_id=", series.id, "&api_key=",
                api.key, "&out=xml", sep = "")
res <- httr::GET(my.url)
xml2::read_xml(res)
Or :
res <- httr::GET(my.url)
XML::xmlParse(res)
Otherwise, with the URL as in the post (i.e. &out=json):
res <- httr::GET(my.url)
jsonlite::fromJSON(httr::content(res,"text"))
or this:
xml2::read_xml(httr::content(res,"text"))
Please note that this answer simply provides a way to get the data; whether it is in the desired form is opinion-based and up to whoever is processing the data.
If it does not have to be XML output, you can also use the new eia package. (Disclaimer: I'm the author.)
Using your example:
remotes::install_github("leonawicz/eia")
library(eia)
x <- eia_series("PET.MCREXUS1.M")
This assumes your key is set globally (e.g., in .Renviron or previously in your R session with eia_set_key()). But you can also pass it directly to the function call above by adding key = "yourkeyhere".
The result is a tidyverse-style data frame: one row per series ID, with a data list column that contains the data frame for each time series (it can be unnested with tidyr::unnest() if desired).
Alternatively, if you set the argument tidy = FALSE, it will return the list result of jsonlite::fromJSON without the "tidy" processing.
Finally, if you set tidy = NA, no processing is done at all and you get the original JSON string output for those who intend to pass the raw output to other canned code or software. The package does not provide XML output, however.
There are more comprehensive examples and vignettes at the eia package website I created.
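For completeness, a minimal sketch of that workflow (the key is a placeholder, and the name of the data list column follows the description above):

library(eia)

eia_set_key("yourkeyhere")          # or store the key in .Renviron
x <- eia_series("PET.MCREXUS1.M")   # one row per series ID; observations in the data list column
crude <- tidyr::unnest(x, data)     # one row per observation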
I have a file in my Google Drive that is an .xlsx. It is too big, so it is not automatically converted to a Google Sheet (that's why using the googlesheets package did not work). The file is so big that I can't even preview it by clicking on it in my Google Drive; the only way to see it is to download it as an .xlsx. While I could load it as a local .xlsx file, I am trying instead to use the googledrive package.
So far what I have is:
library(googledrive)
drive_find(n_max = 50)
drive_download("filename_without_extension.xlsx",type = "xlsx")
but I got the following error:
'file' does not identify at least one Drive file.
Maybe it is because I am not specifying the path where the file lives in the Drive, for example: Work\Data\Project1\filename.xlsx
Could you give me an idea of how to load into R the file called filename.xlsx that is nested in the Drive like that?
I read the documentation but couldn't figure out how to do it. Thanks in advance.
You should be able to do this by:
library(googledrive)
drive_download("~/Work/Data/Project1/filename.xlsx")
The type parameter is only for Google native spreadsheets, and does not apply to raw files.
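Once downloaded, the local copy can be read with readxl; a short sketch (the path and local file name are just examples):

library(googledrive)
library(readxl)

# Download a local copy, then read it with readxl
drive_download("~/Work/Data/Project1/filename.xlsx",
               path = "filename.xlsx", overwrite = TRUE)
dat <- read_xlsx("filename.xlsx")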
I want to share my way.
I do it this way because I keep updating the .xlsx file; it is a query result that comes from an ERP.
So when I tried to do it by Google Drive ID, it gave me errors, because each time the ERP updates the file its ID changes.
This is my context; yours can be absolutely different. This file changes just two or three times a month. Even though it is a "big" .xlsx file (78-80K records with 19 factors), I use it for just a few seconds to calculate some values and then I can trash it. It does not make sense to store it (storing is more expensive than uploading).
library(googledrive)
library(googlesheets4)  # watch out: it is not the CRAN version yet (0.1.1.9000)
library(glue)           # for glue()
library(readxl)         # for read_xlsx()

drive_folder_owner <- "carlos.sxxx#xxxxxx.com"          # this is my account in this gDrive folder
drive_auth(email = drive_folder_owner)                  # previously authorized account
googlesheets4::sheets_auth(email = drive_folder_owner)  # yes, I know, it should be the same, but they are different

# Find the file created by the ERP; the search is shortened by filtering on type
d1 <- drive_find(pattern = "my_file.xlsx", type = drive_mime_type("xlsx"))

meta <- drive_get(id = d1$id)[["drive_resource"]]                             # get the file's metadata from Google Drive
n_id <- glue("https://drive.google.com/open?id=", d1$id[[1]])                 # build a path for reading
meta_name <- paste(getwd(), "/Files/", meta[[1]]$originalFilename, sep = "")  # and a path to save it temporarily

drive_download(file = as_id(n_id), overwrite = TRUE, path = meta_name)  # download and save locally
V_CMV <- data.frame(read_xlsx(meta_name))  # store in a data frame
file.remove(meta_name)                     # delete the file from the R server
rm(d1, n_id)                               # delete temporary variables