Batch processing a 30GB JSON file in R

I have a large (30GB) JSON file of tweets that I'd like to parse and analyze in R. The tweets were collected with the filterStream function from the streamR package about two years ago. Here is a sample (pretty standard): https://www.dropbox.com/s/ecrfo3etk2ingcm/WomensMarch2018.json?dl=0.
My computer grinds to a halt anytime I attempt the following:
library(streamR)
mydata <- parseTweets("BigData.json", simplify = TRUE)
I know I need to batch process the file, or else move to a cloud server with lots of RAM, but I don't know how to do either. Can anyone help?
Edit: I tried this solution (Reading a huge json file in R, issues), but I get the following error:
Error: lexical error: invalid char in json text.
_at":"Wed Jul 21 12:54:05 +{"created_at":"Sat Jan 21 17:18:2
(right here) ------^
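One approach that avoids loading the whole file at once: read the file in chunks, parse each chunk separately, and bind the results. Below is a minimal sketch along those lines, assuming the file is (mostly) line-delimited JSON with one tweet per line; the chunk size of 100,000 lines is arbitrary, and the tryCatch skips any chunk that fails to parse (the error context above suggests at least one line holds a truncated tweet fused to the start of the next).

library(streamR)

con <- file("BigData.json", open = "r")
chunks <- list()
i <- 1
repeat {
  batch <- readLines(con, n = 100000)    # read 100k lines at a time
  if (length(batch) == 0) break          # end of file
  tmp <- tempfile(fileext = ".json")
  writeLines(batch, tmp)
  chunks[[i]] <- tryCatch(
    parseTweets(tmp, simplify = TRUE),   # parse just this slice
    error = function(e) NULL             # skip a malformed chunk
  )
  unlink(tmp)
  i <- i + 1
}
close(con)
mydata <- do.call(rbind, chunks)

Keeping simplify = TRUE helps here, since it extracts only the core fields and keeps each chunk's data frame small.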

Related

RStudio not recognizing files when compiling raw db

I have long-read RNA sequencing data from PacBio IsoSeq and am trying to use the IsoPops software (https://kellycochran.github.io/IsoPops/site_files/walkthrough.html) to generate graphs to interpret the data. I am new to RStudio, so I have limited proficiency. There is a walkthrough for the program which seems quite intuitive. I have successfully installed the package in RStudio, opened the library, and assigned variables to my files (a FASTA file, a GFF file, and abundance files). Once the variables are assigned, I'm supposed to generate a raw database using the function compile_raw_db. However, when I run this function I get the error:
Error in system(command = paste("head -n1 ", filename, step = ""), intern = T) :
'head' not found
I've used head -n1 on the Linux command line and can see that there definitely is a header and contents in the file. The error occurs for any file that I give to RStudio, so my reasoning is that the program is not recognizing the files.
In case the files were empty, I opened them on Ubuntu; head -n2 worked fine for each file, so I don't think it's an issue with the files.
My data was originally stored on an online shared network drive, so I moved the files to my local computer to eliminate any possibility of path errors. I still get the error message.
I have tried the same function with a number of different files and always get the same error, so I think the issue is with how RStudio is reading the files, not the files themselves.
I would really appreciate some support with this; I'm at the end of my PhD trying to analyse the last piece of my data and really struggling. Thanks in advance for any feedback!
> library(IsoPops)
Welcome to IsoPops version 0.3.1.
> transcript_AD3 <- "C:/R_Package_for_IsoSeq/IsoPops/IsoPops_Data/AD3-hq_transcripts.fasta"
> abundance_AD3 <- "C:/R_Package_for_IsoSeq/IsoPops/IsoPops_Data/AD3_Collapsed_Filtered_Isoform_Counts.abundance.txt"
> GFF_AD3 <- "C:/R_Package_for_IsoSeq/IsoPops/IsoPops_Data/AD3-collapse_isoforms.gff"
> rawDB <- compile_raw_db(transcript_AD3, abundance_AD3, GFF_AD3)
[1] "Loading sequences..."
Error in system(command = paste("head -n1 ", filename, step = ""), intern = T) :
'head' not found
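The error message itself points at the cause: compile_raw_db shells out to the Unix utility head via system(), and head does not exist on a standard Windows PATH, so the failure has nothing to do with your files. One workaround sketch, assuming you have Rtools (or Git for Windows) installed: put a directory containing head.exe on the PATH before calling the function. The Rtools path below is illustrative, not part of IsoPops; adjust it to wherever head.exe lives on your machine.

# Expose a 'head' executable to system() calls made by the package.
Sys.setenv(PATH = paste("C:/rtools40/usr/bin", Sys.getenv("PATH"), sep = ";"))
Sys.which("head")  # should now return a file path rather than ""

rawDB <- compile_raw_db(transcript_AD3, abundance_AD3, GFF_AD3)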

Using speech-to-text with googleLanguageR produces NULL transcripts

I'm using the R package googleLanguageR to transcribe various 30-second audio files (over 500, so I want to automate this). I've followed all the steps in the googleLanguageR tutorials, got my key, and authenticated through R.
I'm able to transcribe the test audio (.wav) that comes with the package, but whenever I apply the same function to my files (.mp3), I get NULL for both transcript and timings.
This is the code provided in the tutorials:
# get the sample source file
test_audio <- system.file("woman1_wb.wav", package = "googleLanguageR")
gl_speech(test_audio)$transcript
If I use the same code for my file, I get an empty element, so I've tried the following, with no luck:
test_audio <- "/audio_location/filename.mp3"
gl_speech(test_audio)$transcript
Has anybody encountered a similar problem with this package or have any suspicions of why it produces NULL transcripts?
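One thing worth checking: Google's Speech-to-Text v1 API does not accept MP3 input, which could quietly yield empty results rather than an error. A sketch of a workaround, assuming the av package is available for the conversion (the file path is illustrative): convert each MP3 to a mono 16 kHz WAV and pass that to gl_speech.

library(av)
library(googleLanguageR)

# Convert the MP3 to a format the Speech API accepts (mono 16 kHz WAV).
wav_file <- tempfile(fileext = ".wav")
av_audio_convert("/audio_location/filename.mp3", wav_file,
                 channels = 1, sample_rate = 16000)

# Tell the API the sample rate and language explicitly.
result <- gl_speech(wav_file, sampleRateHertz = 16000,
                    languageCode = "en-US")
result$transcript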

Problems parsing StreamR JSON data

I am attempting to use the streamR package in R to download and analyze Twitter data, on the understanding that this library can overcome the limitations of the twitteR package.
When downloading data everything seems to work fabulously using the filterStream function (just to clarify, the function captures Twitter data; running it produces the json file, saved in the working directory, that is used in the further steps):
filterStream( file.name="tweets_test.json",
track="NFL", tweets=20, oauth=credential, timeout=10)
Capturing tweets...
Connection to Twitter stream was closed after 10 seconds with up to 21 tweets downloaded.
However, when moving on to parse the json file, I keep getting all sorts of errors:
readTweets("tweets_test.json", verbose = TRUE)
0 tweets have been parsed.
list()
Warning message:
In readLines(tweets) : incomplete final line found on 'tweets_test.json'
Or with this function from the same package:
tweet_df <- parseTweets(tweets='tweets_test.json')
Error in `$<-.data.frame`(`*tmp*`, "country_code", value = NA) :
replacement has 1 row, data has 0
In addition: Warning message:
In stream_in_int(path.expand(path)) : Parsing error on line 0
I have tried reading the json file with jsonlite and rjson with the same results.
Originally, it seemed that the error came from special characters ({, then \) within the json file, which I tried to clean up following the suggestion from this post; however, not much came of it.
I found out about the streamR package from this post, which shows the process as very straight forward and simple (which it is, except for the parsing part!).
If any of you have experience with this library and/or these parsing issues, I'd really appreciate your input. I have been searching non-stop but haven't been able to locate a solution.
Thanks!
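The warning about an "incomplete final line" is a plausible culprit here, since filterStream can close the connection mid-tweet and leave a truncated JSON object at the end of the file. A minimal cleanup sketch, assuming one tweet per line: keep only the lines that validate as complete JSON, then re-parse the cleaned file.

library(jsonlite)

# Drop any truncated records (e.g. a tweet cut off when the stream closed).
lines <- readLines("tweets_test.json", warn = FALSE)
complete <- vapply(lines, function(x) isTRUE(jsonlite::validate(x)),
                   logical(1), USE.NAMES = FALSE)
writeLines(lines[complete], "tweets_clean.json")

tweet_df <- streamR::parseTweets("tweets_clean.json")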

Difficulty opening a package data file of unknown type

I am trying to load the state map from the maps package into an R object. I am hoping it is a SpatialPolygonsDataFrame, or something I can turn into one after I have inspected it. However, I am failing at the first step: getting it into an R object. I do not know the file type.
I first tried to assign the map() output to an R object directly:
st_m <- maps::map(database = "state")
draws the map, but str(st_m) appears to do nothing, unless it is redrawing the same map.
Then I tried loading it as a dataset: st_m <- data("stateMapEnv", package="maps") but this just returns a string:
> str(stateMapEnv)
chr "R_MAP_DATA_DIR"
I opened the maps directory win-library/3.4/maps/mapdata/ and found what I think is the map file, "state.L".
I tried reading it with scan and got an error message I do not understand:
scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
scan() expected 'a real', got '#'
I then opened the file with Notepad++. It appears to be a binary or compressed file.
So I thought it might be an R data file with an unusual extension. But my attempt to load it returned a "bad magic number" error:
st_m <- load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
bad restore file magic number (file may be corrupted) -- no data loaded
Observing that these responses have progressed from the unhelpful through the incomprehensible to the occult, I thought it best to seek assistance from the wizards of stackoverflow.
This should export the 'state' (or any other) maps dataset for you:
library(ggplot2)
state_dataset <- map_data("state")
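If the goal really is a SpatialPolygonsDataFrame-like object rather than a plain data frame, one route (a sketch, assuming the sf package is installed) is to ask map() for filled polygons without plotting and convert the returned "map" object directly:

library(maps)
library(sf)

# fill = TRUE returns polygons; plot = FALSE skips drawing the map.
st_m <- maps::map(database = "state", fill = TRUE, plot = FALSE)
st_sf <- sf::st_as_sf(st_m)   # one feature per state, with geometry

# If you specifically need the older sp class (requires the sp package):
# st_sp <- as(st_sf, "Spatial")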

R read.spss error importing SPSS .por file - "Bad character in time"

I'm trying to import the NYPD stop-and-frisk data into R. The data is in SPSS .por files at http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/YYYY.zip
where YYYY is a year from 2003 to 2012
Most of the files load fine, but the 2004, 2007, and 2008 files all give me this error:
> library(foreign)
> mydata= read.spss("2004.por", to.data.frame=TRUE)
Error in read.spss("2004.por", to.data.frame = TRUE) :
error reading portable-file dictionary
In addition: Warning message:
In read.spss("2004.por", to.data.frame = TRUE) : Bad character in time
Execution halted
Any suggestions on how to debug this? I realize that read.spss does not support the latest SPSS versions, but given that most of the files (7 out of 10) import properly I wonder whether it's something more subtle.
psppire loads all the files without complaint, but the data looks corrupted, with some fields seemingly combined with others, and binary data in some of the fields.
I had some success using memisc, as recommended in Read SPSS file into R. Namely, after installing memisc:
> install.packages('memisc')
You can read the data rather easily:
> library(memisc)
> data <- as.data.set(spss.portable.file('2004.por'))
While I haven't thoroughly inspected the data, it appears on first glance to be right.
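If you need a plain data frame for analysis, a small addition on top of the memisc route above (the variable name is just the one from the snippet):

# as.data.set() returns a memisc data.set; coerce it for standard tools.
df_2004 <- as.data.frame(data)
str(df_2004)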
