Problems parsing streamR JSON data in R

I am attempting to use the streamR package in R to download and analyze Twitter data, on the premise that this library can overcome the limitations of the twitteR package.
When downloading data, everything seems to work fabulously using the filterStream function (just to clarify: the function captures Twitter data, and running it produces the JSON file, saved in the working directory, that is used in the later steps):
filterStream(file.name = "tweets_test.json",
             track = "NFL", tweets = 20, oauth = credential, timeout = 10)
Capturing tweets...
Connection to Twitter stream was closed after 10 seconds with up to 21 tweets downloaded.
However, when I move on to parsing the JSON file, I keep getting all sorts of errors:
readTweets("tweets_test.json", verbose = TRUE)
0 tweets have been parsed.
list()
Warning message:
In readLines(tweets) : incomplete final line found on 'tweets_test.json'
Or with this function from the same package:
tweet_df <- parseTweets(tweets='tweets_test.json')
Error in `$<-.data.frame`(`*tmp*`, "country_code", value = NA) :
replacement has 1 row, data has 0
In addition: Warning message:
In stream_in_int(path.expand(path)) : Parsing error on line 0
I have tried reading the JSON file with jsonlite and rjson, with the same results.
Originally it seemed that the error came from special characters ({, then \) within the JSON file, which I tried to clean up following the suggestion from this post; however, not much came of it.
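A minimal diagnostic sketch, assuming streamR wrote one JSON object per line (its usual format) and that the final line may be truncated, as the readLines warning suggests: parse line by line with jsonlite and skip records that fail, which at least shows how many records are salvageable:
library(jsonlite)

raw_lines <- readLines("tweets_test.json", warn = FALSE)
raw_lines <- raw_lines[nzchar(raw_lines)]          # drop empty lines

parsed <- lapply(raw_lines, function(x) {
  tryCatch(fromJSON(x), error = function(e) NULL)  # NULL for malformed records
})
parsed <- Filter(Negate(is.null), parsed)
length(parsed)                                     # records that parsed cleanly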
I found out about the streamR package from this post, which presents the process as very straightforward and simple (which it is, except for the parsing part!).
If any of you have experience with this library and/or these parsing issues, I'd really appreciate your input. I have been searching nonstop but haven't been able to locate a solution.
Thanks!

Related

Using speech to text with googlelanguageR produces NULL transcripts

I'm using the R package 'googleLanguageR' to transcribe various 30-second audio files (over 500, so I want to automate this). I've followed all the steps in the googleLanguageR tutorials, got my key, and authenticated through R.
I'm able to transcribe the test audio (.wav) that comes with the package, but whenever I apply the same function to my files (.mp3), I get NULL for both transcript and timings.
This is the code provided in the tutorials:
# get the sample source file
test_audio <- system.file("woman1_wb.wav", package = "googleLanguageR")
gl_speech(test_audio)$transcript
If I use the same for my file, I get an empty element, so I've tried the following with no luck:
test_audio <- "/audio_location/filename.mp3"
gl_speech(test_audio)$transcript
Has anybody encountered a similar problem with this package or have any suspicions of why it produces NULL transcripts?
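A hedged workaround sketch, assuming the NULL transcripts come from the API rejecting MP3 input (the Speech-to-Text v1 API of this era accepted LINEAR16 and FLAC but not MP3): convert each file to WAV before calling gl_speech. The av package used for the conversion is an assumption, not part of googleLanguageR; the file path is the asker's placeholder:
library(googleLanguageR)
library(av)  # assumed installed; used only for the MP3 -> WAV conversion

wav_file <- tempfile(fileext = ".wav")
av_audio_convert("/audio_location/filename.mp3", wav_file, sample_rate = 16000)

result <- gl_speech(wav_file, sampleRateHertz = 16000, languageCode = "en-US")
result$transcript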

Batch processing a 30GB JSON file in R

I have a large (30GB) JSON file of tweets that I'd like to parse and conduct some text analysis on in R. The tweets were acquired about two years ago using the filterStream function from the streamR package. Here is a sample (pretty standard): https://www.dropbox.com/s/ecrfo3etk2ingcm/WomensMarch2018.json?dl=0.
My computer grinds to a halt anytime I attempt the following:
library(streamR)
mydata <- parseTweets("BigData.json", simplify = TRUE)
I know I need to batch process the file or else move to a cloud server with tons of RAM, but I don't know how to do either. Can anyone help?
Edit: I tried this solution (Reading a huge json file in R, issues), but I get the following error:
Error: lexical error: invalid char in json text.
_at":"Wed Jul 21 12:54:05 +{"created_at":"Sat Jan 21 17:18:2
(right here) ------^
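A rough chunked-processing sketch, assuming one JSON record per line (the lexical error above suggests some records were truncated or run together mid-stream, so bad lines may still need to be dropped or repaired first). It reads the file through a connection in fixed-size blocks, parses each block with parseTweets, and appends the result to a CSV so the full 30GB is never held in memory at once; the chunk size is a hypothetical starting point:
library(streamR)

con <- file("BigData.json", open = "r")
chunk_size <- 100000       # lines per batch; tune to your RAM
first_chunk <- TRUE
repeat {
  batch <- readLines(con, n = chunk_size)
  if (length(batch) == 0) break
  tmp <- tempfile(fileext = ".json")
  writeLines(batch, tmp)   # parseTweets reads from a file, not a vector
  df <- parseTweets(tmp, simplify = TRUE)
  write.table(df, "tweets_parsed.csv", sep = ",", row.names = FALSE,
              col.names = first_chunk, append = !first_chunk)
  first_chunk <- FALSE
  unlink(tmp)
}
close(con)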

Difficulty opening a package data file of unknown type

I am trying to load the state map from the maps package into an R object. I am hoping it is a SpatialPolygonsDataFrame, or something I can turn into one after I have inspected it. However, I am failing at the first step: getting it into an R object. I do not know the file type.
I first tried to assign the map() output to an R object directly:
st_m <- maps::map(database = "state")
draws the map, but str(st_m) appears to do nothing, unless it is redrawing the same map.
Then I tried loading it as a dataset: st_m <- data("stateMapEnv", package="maps") but this just returns a string:
> str(stateMapEnv)
chr "R_MAP_DATA_DIR"
I opened the maps directory win-library/3.4/maps/mapdata/ and found what I think is the map file, "state.L".
I tried reading it with scan and got an error message I do not understand:
scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
scan() expected 'a real', got '#'
I then opened the file with Notepad++. It appears to be a binary or compressed file.
So I thought it might be an R data file with an unusual extension. But my attempt to load it returned a “bad magic number” error:
st_m <- load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
bad restore file magic number (file may be corrupted) -- no data loaded
Observing that these responses have progressed from the unhelpful through the incomprehensible to the occult, I thought it best to seek assistance from the wizards of stackoverflow.
This should be able to export the 'state' or any other maps dataset for you:
library(ggplot2)
state_dataset <- map_data("state")
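If a SpatialPolygonsDataFrame is specifically wanted, a hedged sketch of one conversion route (assuming the maptools and sp packages are installed; map() needs fill = TRUE and plot = FALSE to return a usable "map" object instead of only drawing):
library(maps)
library(maptools)  # for map2SpatialPolygons
library(sp)

st_m <- map("state", fill = TRUE, plot = FALSE)
ids  <- sapply(strsplit(st_m$names, ":"), `[`, 1)  # one ID per state
st_sp <- map2SpatialPolygons(st_m, IDs = ids,
                             proj4string = CRS("+proj=longlat +datum=WGS84"))
# wrap a data slot around it to get a SpatialPolygonsDataFrame
st_spdf <- SpatialPolygonsDataFrame(st_sp,
             data.frame(state = unique(ids), row.names = unique(ids)))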

rtweet giving error in rbind when collecting large numbers of tweets

I'm using the rtweet package in R to pull tweets for data analysis.
When I run the following line of code requesting 18,000 tweets, everything works fine:
t <- search_tweets("at", n=18000, lang='en', geocode='-25.609139,134.361949,3500km', since='2017-08-01', type='recent', retryonratelimit=FALSE)
But when I try to extend this to 100,000 tweets, I get an error message:
t <- search_tweets("at", n=100000, lang='en', geocode='-25.609139,134.361949,3500km', since='2017-08-01', type='recent', retryonratelimit=TRUE)
Finished collecting tweets!
Error in rbind(deparse.level, ...) :
invalid list argument: all variables should have the same length
Why is this occurring, and how do I solve it? Thanks!
I suggest updating to the dev version of rtweet. It fixed this issue for me.
devtools::install_github("mkearney/rtweet")
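If updating is not an option, a hedged workaround sketch: collect in smaller batches and combine them with a fill-aware bind, since the rbind error comes from batches whose columns differ. This assumes the max_id argument of the rtweet 0.6.x-era search_tweets for paging, and uses dplyr::bind_rows, which pads missing columns with NA; the batch count is hypothetical:
library(rtweet)
library(dplyr)

batches  <- list()
next_max <- NULL
for (i in 1:6) {
  b <- search_tweets("at", n = 18000, lang = "en",
                     geocode = "-25.609139,134.361949,3500km",
                     type = "recent", max_id = next_max)
  if (nrow(b) == 0) break
  batches[[i]] <- b
  next_max <- b$status_id[nrow(b)]  # oldest ID in the batch (results come newest-first)
}
t <- bind_rows(batches)             # fill-aware, unlike rbind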

Using R package BerkeleyEarth

I'm working for the first time with the R package BerkeleyEarth and attempting to use its convenience functions to access the BEST data. I suspect it may just be a problem with their servers (a matter I've separately raised with the package's maintainer), but I wanted to know whether it's instead something silly I'm doing.
To reproduce my fault:
library(BerkeleyEarth)
downloadBerkeley()
which provides the following error message
trying URL 'http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip'
Error in download.file(urls$Url[thisUrl], destfile = file.path(destDir, :
cannot open URL 'http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip'
In addition: Warning message:
In download.file(urls$Url[thisUrl], destfile = file.path(destDir, :
InternetOpenUrl failed: 'A connection with the server could not be established'
Has anyone had a better experience using this package?
The error message points to a different URL from the ones listed at http://berkeleyearth.org/data/ for the zip-formatted files. There is another set of .nc files that appears to be more recent. I would replace the entries in the BerkeleyUrls dataframe with the ones that match your analysis strategy:
This is the current URL that should be in position 1,1:
http://berkeleyearth.lbl.gov/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip
And this is the one that is in the package dataframe:
> BerkeleyUrls[1,1]
[1] "http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip"
I suppose you could try:
BerkeleyUrls[, 1] <- sub( "download\\.berkeleyearth\\.org", "berkeleyearth.lbl.gov", BerkeleyUrls[, 1])
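A quick hedged check before kicking off the large downloads, to confirm the rewritten URL actually resolves (base R only; it just opens and closes a read connection to the first URL):
u <- BerkeleyUrls[1, 1]
con <- url(u, open = "rb")  # errors here if the server cannot be reached
close(con)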
