R code halts in for loop with no error message

I'm working on a segment of R code that is a for loop reading columns from a collection of .csv files, the paths of which are compiled in a file-path directory I've made. As it reads each file, it stores four different columns into four different grand tables in my working environment.
I'm trying to make part of the code calculate the monthly average for each month and store it in a new column, so that I can use it to replace missing data points; this involves a couple more for loops for mapping the subset of the aggregate table.
All of this ends up as four nested for loops, which handle a great deal of data at once before overwriting it with the next large file.
After incorporating the three nested loops that build the monthly average vector, the code started halting as soon as it tries to read in the data file, with no error message. It just looks like this:
[screenshot: halted code, no error message]
If I add show_col_types = FALSE to the read_csv call, it looks like this instead, still halting the code:
[screenshot: halted code with show_col_types = FALSE]
I can't include the data or much of the code because my company will not allow it, but I would appreciate any input, since there isn't any error message I can google. Thanks!
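For reference, here is a minimal sketch of the kind of loop described above, with the nested averaging loops replaced by a grouped calculation. Every name in it (the path file, the date and value columns, the table names) is a placeholder, since the actual code and data cannot be shared.

library(readr)
library(dplyr)

# placeholder: a CSV listing the paths of the input files
file_paths <- read_csv("file_path_directory.csv", show_col_types = FALSE)$path

grand_table <- vector("list", length(file_paths))

for (i in seq_along(file_paths)) {
  dat <- read_csv(file_paths[i], show_col_types = FALSE)

  # monthly averages computed per group instead of with nested loops,
  # then used to fill in missing values
  dat <- dat %>%
    mutate(month = format(as.Date(date), "%Y-%m")) %>%
    group_by(month) %>%
    mutate(monthly_avg = mean(value, na.rm = TRUE),
           value = ifelse(is.na(value), monthly_avg, value)) %>%
    ungroup()

  grand_table[[i]] <- dat
}

grand_table <- bind_rows(grand_table)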

Related

Is there a method in R of extracting nested relational data tables from a JSON file?

I am currently researching publicly available payer transparency files across multiple insurers, and I am trying to parse and extract JSON files using R and output them into .CSV files to later use with SQL. The file I am currently working with contains tables nested within the top-level table.
I have attached the specific file I am working with in a link below, along with the code to load it into R's data viewer. I have used R extensively in healthcare analytics classes for statistical analysis and machine learning, though I have never used R for building out data tables.
My goal is to assign a primary key to the highest level of the table, apply foreign and primary keys to the lower tables, and then extract the lower tables and join them onto each other later to build out a large CSV or TXT file to load into SQL.
So far, I have used the jsonlite and rjson packages to read the JSON itself into R, but delisting and unnesting the tables within the tables remains an enigma to me even after extensive research. I also keep running into problems such as "subscript out of bounds" and "unimplemented list" errors.
It could also very well be the case that the JSON is too large for R's packages, or that the JSON is structurally flawed (I wouldn't know if it is; I am not accustomed to JSON). It seems this could be a problem better solved with Python, though I don't know Python very well, and I am optimistic about R given how powerful it is.
Any feedback or answers would be greatly appreciated.
JSON file link: https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json
Code to load JSON:
library(jsonlite)
json2 <- fromJSON('https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json')
The JSON loads correctly, but there are tables embedded within tables. I would hope that these tables could be easily exported and given keys for joining, but I cannot figure out how to de-nest these tables from within the data.
Some nested tables are out of subscript bounds for the data array. I have never encountered this problem before and am bewildered as to how to go about resolving it.
I cannot figure out how to 'extract' the lower-level tables, let alone open them, because of the subscript boundary error.
I can assign a row ID to the main/highest table in the file, but I cannot figure out how to add sub-row IDs to the lower tables for future joins.
Maybe the jsonStrings package can help. It lets you manipulate JSON without converting it to an R object. This is the first time I have tried it on such a big JSON string, and it works fine.
Here is how to get the table in the first element of the JSON array:
options(timeout = 300)
download.file(
"https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json",
"jsonFile.json"
)
library(jsonStrings)
# load the JSON file
jstring <- jsonString$new("jsonFile.json")
# extract table "plans" of first element (indexed by 0)
jsonTable <- jstring$at(0, "plans")
# get a dataframe
library(jsonlite)
dat <- fromJSON(jsonTable$asString())
But the data frame dat has a list column, and I don't know how you want to turn that into a CSV.
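If that list column holds simple vectors or small data frames, one possible way to flatten it before writing is sketched below; nested_col is just a placeholder for whatever the list column in dat is actually called.

library(tidyr)
library(readr)

# unnest the (placeholder) list column, then write the flat table out
flat <- unnest(dat, cols = c(nested_col))
write_csv(flat, "plans.csv")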

R misreading time from an xlsx data table

I have a very annoying issue.
I have some oxygen measurements saved in an .xlsx table (created directly by the device software). Opened with Excel, this is part of my file.
In the first picture, notice that the software sometimes skips a second (11:13:00, then 11:13:02).
In the second picture, just notice the continuity of the times from 11:19:01 to 11:19:09.
I read my Excel table into R with the readxl package:
library(readxl)
oxy <- read_excel("./Metabolism/20180502 DAPH 20.xlsx", 1)
And before any manipulation, when I check my table in R (RStudio), I see this:
In the first case, R kept the time continuity by adding 11:13:01 and shifting the next rows.
Then, later, the reverse situation: the continuity of the times was respected in Excel, but R skips a second and again shifts the next rows.
At the end there is the same number of rows. I guess it is a problem with the way R and Excel round the times, but these little errors prevent me from using the dates to merge two tables, and the calculations afterwards are wrong.
Can I do something to tell R to read the data exactly the way Excel saved them?
Thank you very much!
Index both tables with a sequential integer counter, each starting at the same point, and use that index for merging like with like. If you want the Excel version to be 'definitive', convert the index back to time with a lookup based on your Excel version. For example:
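A rough sketch of that index-based merge; oxy and other stand in for the two tables being combined.

# give each table a row counter starting at 1
oxy$idx   <- seq_len(nrow(oxy))
other$idx <- seq_len(nrow(other))

# merge on the counter rather than on the (unreliable) time stamps
merged <- merge(oxy, other, by = "idx")

# if the Excel times are definitive, keep the time column that came from
# the Excel-side table and drop the other one afterwards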

R merge large number of data frames

I have the output from a data submission in the form of multiple vector list objects in .rda files.
Each list object is in a separate .rda file, and I have nearly 2000 files.
I want to merge all the objects into a single object in a single .rda file in the fastest way possible (partly because I may need to repeat this several times).
All the .rda files are fairly small (~10 MB, though that is a compressed size), but it all adds up with the number of files.
Memory isn't a huge problem, as I am running this on a server with >700 GB of RAM.
My first approach, incrementally loading the files one by one, concatenating each with the merged list object, and removing the object that was just appended, went badly because of the time it was going to take (something like 40 days at a best guess).
My revised approach is below, but I'm wondering if there is a quicker way to do this, given that I may need to repeat the process:
load("data_1.rda")
load("data_2.rda")
load("data_3.rda") ...
load("data_2000.rda")
my.list <- list()
my.list <- c(my.list, data.1, data.2, data.3, ... , data.2000)
save(my.list, file="my_list.rda")
And just to add to things, I'm getting an error when doing this:
Error: attempt to set index 18446744071562067968/2877912830 in SET_STRING_ELT
It's not a very helpful error message.
All the .rda files load as objects into the environment fine, but the error appears when I try to concatenate them, and it seems to happen once it reaches a particular point, since it doesn't fail immediately. I wasn't sure whether it is some sort of limit on the number of concatenations you can do or rogue data, but from troubleshooting it appears to be syntax-related rather than data-related.
I have chunked it up into 5 batches and then do a final concatenation before saving the .rda. I have seen other answers for this sort of thing suggesting rbind, mget, and do.call or the list function; would using any of these make it faster and achieve the same thing?
Something like this:
my.list <- do.call(rbind, mget(ls(pattern="^data_")))
Thanks
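For illustration, here is one way the per-file loads could be collected into a list and concatenated in a single call rather than 2000 separate load statements. The file-name pattern is assumed from the question, and each file is assumed to contain exactly one list object.

rda_files <- sprintf("data_%d.rda", 1:2000)   # assumed naming pattern

pieces <- vector("list", length(rda_files))
for (i in seq_along(rda_files)) {
  e <- new.env()
  load(rda_files[i], envir = e)               # load into a private environment
  pieces[[i]] <- get(ls(e)[1], envir = e)     # grab the single object it contains
}

my.list <- do.call(c, pieces)                 # one concatenation instead of many
save(my.list, file = "my_list.rda")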

R - Writing data to CSV in a loop

I have a loop that goes through a list of variables (in a CSV), accesses a database, and extracts the relevant data. It does this for 4 different time periods (which depend on the variables).
I am trying to get R to write this data to a CSV, but at the moment I can only get it to store the data for the last variable, in 4 different CSV files, because it overwrites the previous variable each time.
I'd like to have all of the data for these variables for one time period in the same file/sheet (so either 4 sheets or 4 CSV files with all of the data in them). This is because I need to do some data manipulation on the variables before I feed them into the next loop of the script.
I'd like it to be something like this, but I need 4 separate sheets/files so I can cover each time period:
date/time | var1 | var2 | ... | varn
I would post the code, but even posting only the relevant loop and none of the surrounding code would be ~150 lines. I am not familiar with R (I can follow the script but struggle to write my own); I inherited this project and don't have long to work on it.
Note: each variable is recorded at a different frequency; some have only one data point an hour, others one every minute, so I will need to match them up based on the time recorded (to the nearest minute).
EDIT: I hope I've explained this clearly enough.
Four different .csv files would be easiest, because you could do something like the following in your loop:
outfile.name <- paste('Sales', year.of.data, '.csv', sep='')
write.csv(period.data, file = outfile.name, row.names = FALSE)  # period.data = the data frame for this period
You could also append the data into one data frame and then export it all at once into one file. You won't be able to export to multiple sheets with a .csv, because the CSV format doesn't have sheets. For example:
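A sketch of that append-then-write approach; variables and extract_variable() are placeholders for the existing loop variable and the database-extraction step already in the script.

all.data <- list()

for (v in variables) {                         # placeholder for the existing loop
  var.data <- extract_variable(v)              # placeholder for the database extraction
  all.data[[v]] <- var.data
}

combined <- do.call(rbind, all.data)           # stack everything extracted in the loop
write.csv(combined, "all_variables.csv", row.names = FALSE)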

Mysterious problems appending data frames with rbind

I am in a dilly of a pickle trying to join several files together into a master file. There are 5 files with the same structure, and I can read each file individually into a data frame with no problems. I even manually set the column classes for 200+ variables rather than letting R decide, because I believed that was causing the problem. However, appending any two files together causes me to run out of memory.
Warning messages:
1: In rbind(deparse.level, ...) :
Reached total allocation of 4043Mb: see help(memory.size)
So I did some experimenting:
I joined two different chunks of file 1 together. That works.
I joined a chunk of file 2 to a chunk of file 1. That works.
I joined a chunk of file 2 to the original file 1. That works.
Each of these files comes in at a little under 200 MB, so I am not sure why I should be running out of memory. If anybody is interested, the data comes from hearstchallenge.com. The competition is long over; we are just using the data for an analysis experiment (and not programming!).
Any suggestions for how to solve this?
I have run into similar problems. The solution is not to use rbind() or cbind() on large data; they tend to leak memory.
To solve your problem using only R, first create a data frame with the dimensions the combined data frame will have once the pieces are put together, then use assignments to fill in that large data frame, as sketched below.
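A minimal sketch of that pre-allocate-and-fill approach; the file names and the read.csv call are placeholders for however the five files are actually read.

files  <- sprintf("file_%d.csv", 1:5)                     # placeholder file names
pieces <- lapply(files, read.csv, stringsAsFactors = FALSE)

# pre-allocate a data frame with the final number of rows and the same columns
n.rows <- sum(vapply(pieces, nrow, integer(1)))
big    <- pieces[[1]][rep(NA_integer_, n.rows), ]

# fill it in piece by piece with assignments instead of repeated rbind() calls
start <- 1
for (p in pieces) {
  big[start:(start + nrow(p) - 1), ] <- p
  start <- start + nrow(p)
}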
