Merging huge data files in R (used for searching)

I am working in R 3.5 and need to build a huge database of around 200 million rows, then search a file of around 15 million rows against that database to find the reference values (and then cbind the two files: input file + matched rows).
For smaller database files (~10 million rows) I used the merge() function to join the input file with the database file, but that is now all but impossible.
I tried the RSQLite package and, although it worked, I was not happy with it.
Pros:
the reference data file is not loaded up front
it does not need any installation (other than the RSQLite package itself)
Cons:
it is very slow (even after creating indexes on the tables)
the database file is huge (around 10 GB)
binding the input file and the matched rows is not simple (the row order may differ)
I don't want to use SQL Server or MySQL, because both need installation and configuration, which is not feasible on every system and server.
Any suggestions or similar experiences with big-data matching?
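For what it's worth, the core lookup-then-cbind step can be sketched in base R with match(), which finds each input key's row in the reference table and preserves the input row order (column names here are illustrative, not from the actual files):

```r
# Sketch: look up each input row's key in a reference table and cbind the
# matched reference column onto the input, preserving input row order.
# "key" and "value" are illustrative column names.
reference <- data.frame(key = c("a", "b", "c"), value = c(10, 20, 30))
input     <- data.frame(key = c("c", "a", "x"))

idx    <- match(input$key, reference$key)           # one position per input row
result <- cbind(input, ref_value = reference$value[idx])
result  # unmatched keys (here "x") get NA
```

At 200 million rows the same join is usually done with data.table's keyed joins (setkey() on the reference table, then X[Y]), which are far faster and more memory-efficient than merge().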

Related

Is there a method in R of extracting nested relational data tables from a JSON file?

I am currently researching publicly available payer transparency files across multiple insurers and I am trying to parse and extract JSON files using R and output them into .CSV files to later use with SQL. The file I am currently working with contains nested tables within the highest table.
I have attached the specific file I am working with right now in a link below, along with the code to mount it into R's dataviewer. I have used R extensively in healthcare analytics classes for statistical analysis and machine learning; though, I have never used R for building out data tables.
My goal is to assign a primary key to the highest level of the table, apply primary and foreign keys to the lower tables, and extract the lower tables and join them onto each other later to build out a large CSV or TXT file to load into SQL.
So far, I have used the jsonlite and rjson packages to extract the JSON itself into R, but trying to delist and unnest the tables within the tables remains an enigma to me even after extensive research. I also keep running into "subscript out of bounds", "unimplemented list" and other errors.
It could also very well be that the JSON is too large for R's packages, or that the JSON is structurally flawed (I wouldn't know if it is; I am not accustomed to JSON). It seems this could be a problem better solved with Python, though I don't know Python well, and I am optimistic about R given how powerful it is.
Any feedback or answers would be greatly appreciated.
JSON file link: https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json
Code to load JSON:
library(jsonlite)
json2 <- fromJSON('https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json')
The JSON loads correctly, but there are tables embedded within tables. I would hope these tables could be easily exported and given keys for joining, but I cannot figure out how to de-nest them from within the data.
Some nested tables are out of subscript bounds for the data array. I have never encountered this problem and am bewildered as to how to resolve it.
I cannot figure out how to 'extract' the lower-level tables, let alone open them, due to the subscript boundary error.
I can assign a row ID to the main/highest table in the file, but I cannot figure out how to add sub row IDs to the lower tables for future joins.
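As a toy sketch of the key-assignment step described above (the structure is illustrative, not the actual payer file's schema): give the top table a primary key, then pull a nested list column out into a child table that carries that key as a foreign key.

```r
# Toy sketch: un-nest a list column into a child table with a foreign key.
# "plans" is an illustrative nested column, not the real file's schema.
parent <- data.frame(name = c("p1", "p2"))
parent$plans <- list(
  data.frame(plan_id = c("A", "B")),
  data.frame(plan_id = "C")
)
parent$pk <- seq_len(nrow(parent))            # primary key on the top table

child <- do.call(rbind, lapply(parent$pk, function(k) {
  cbind(fk = k, parent$plans[[k]])            # foreign key back to parent
}))
child  # three rows: fk 1, 1, 2 with plan_id A, B, C
```

The parent table (minus its list column) and each child table can then be written out separately and re-joined on pk/fk later.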
Maybe the jsonStrings package can help. It lets you manipulate JSON without converting it to an R object. This is the first time I have tried it on such a big JSON string, and it works fine.
Here is how to get the table in the first element of the JSON array:
options(timeout = 300)
download.file(
  "https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json",
  "jsonFile.json"
)
library(jsonStrings)
# load the JSON file
jstring <- jsonString$new("jsonFile.json")
# extract table "plans" of first element (indexed by 0)
jsonTable <- jstring$at(0, "plans")
# get a dataframe
library(jsonlite)
dat <- fromJSON(jsonTable$asString())
But the dataframe dat has a list column. I don't know how you want to make a CSV from this dataframe.

How to extract a portion of a large dataset via unzipping

I have a very large species dataset from GBIF: 178 GB zipped, approximately 800 GB when unzipped (TSV). My Mac only has 512 GB of storage and 8 GB of RAM; however, I do not need all of this data.
Are there any approaches I can take to unzip the file without eating all of my storage, extracting a portion of the dataset by filtering out rows based on a column? For example, it has occurrence values going back to 1600, and I only need data for the last two years, which I believe my PC can more than handle. Perhaps there is a library with a function that can filter rows while loading the data?
I am unsure how to unzip properly; the unzipping libraries I have looked at, and unzip() according to this article, truncate data over 4 GB. My worry is: where could I store 800 GB of data once unzipped?
Update:
It seems that all the packages I have come across stop at 4 GB after decompression. I am wondering if it is possible to create a function that decompresses up to the 4 GB limit, marks the point it has reached, resumes decompression from that point, and continues until the whole .zip file has been decompressed. It could store the decompressed files in a folder, so that you can access them with something like list.files(). Any ideas whether this can be done?
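Whether or not resumable decompression is possible, the filtering itself can be done in a streaming pass so that the full 800 GB never needs to exist on disk at once: reading through a connection (unz() or gzfile() instead of file()) decompresses as you read. A base-R sketch, under the assumption that the TSV has a year column (paths, column name and chunk size are illustrative):

```r
# Sketch: stream a large TSV in chunks and keep only recent rows, without
# ever holding the whole file in memory. Assumes a column named "year".
# For compressed input, open the connection with gzfile() or unz().
filter_tsv_by_year <- function(path, out_path, min_year, chunk_size = 10000) {
  con <- file(path, open = "r")
  out <- file(out_path, open = "w")
  on.exit({ close(con); close(out) })

  header <- readLines(con, n = 1)
  writeLines(header, out)
  year_idx <- match("year", strsplit(header, "\t")[[1]])

  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    years <- as.integer(vapply(strsplit(lines, "\t"), `[`, "", year_idx))
    writeLines(lines[!is.na(years) & years >= min_year], out)
  }
  invisible(out_path)
}

# tiny demo on a temporary file
tsv <- tempfile(fileext = ".tsv")
writeLines(c("species\tyear", "a\t1600", "b\t2022", "c\t2023"), tsv)
out <- tempfile(fileext = ".tsv")
filter_tsv_by_year(tsv, out, min_year = 2022)
read.delim(out)  # only the 2022 and 2023 rows remain
```

The output file then only contains the rows you actually need, which should fit comfortably on a 512 GB machine.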

Copying data from SQL Server to R using ODBC connection

I have successfully set up a R SQL Server ODBC connection by doing:
library(DBI)
library(odbc)
DBI_connection <- dbConnect(odbc(),
                            driver = "SQL Server",
                            server = server_name,
                            database = database_name)
Dataset_in_R <- dbFetch(dbSendQuery(DBI_connection,
                                    "SELECT * FROM MyTable_in_SQL"))
3 quick questions:
1. Is there a quicker way to copy data from SQL Server to R? This table has 44+ million rows and it is still running...
2. If I make any changes to this data in R, does it change anything in MyTable_in_SQL? I don't think so, because I have saved it in a global data.frame variable in R, but I am just checking.
3. How do I avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
1: Is there a quicker way to copy data from SQL Server to R?
This one is rather simple to answer. The odbc package in R does quite a bit under the hood to ensure compatibility between the result fetched from the server and R's data structures. You might obtain a slight speed-up by using an alternative package (RODBC is an old package, and it sometimes seems faster). In this case, however, with 44 million rows, I expect the bigger performance boost to come from preparing your SQL statement. The general idea is to:
Remove any unnecessary columns. Each column has to be downloaded, so if you have 20 columns, removing one may reduce your query execution time by ~5% (assuming linear run-time).
If you plan on performing aggregation, it will almost always be faster to do this directly in your query, e.g. if you have a column called Ticker and a column called Volume and you want the average Volume, you can calculate this directly in the query. Similarly for the last row, using last_value(colname) over ([partition by [grouping col 1], [grouping col 2] ...] order by [order col 1], [order col 2]) as last_value_colname.
If you choose to do this, it may help to test the query on a small subset of rows using TOP N or LIMIT N, e.g. select [select statement] from mytable_in_sql order by [order col] limit 100, which returns only the first 100 rows. As Martin Schmelzer commented, this can also be done from R with the dplyr::tbl function, but it is always faster to correct your statement directly.
Finally, if your query becomes more complex (which does not seem to be the case here), it may be beneficial to create a view on the table (CREATE VIEW) with the specific select statement and query that view instead. The server will then try to optimize the query, and if your problem is on the server side rather than the local side, this can improve performance.
One must also state the obvious: when you query the server, you are downloading some (maybe quite a lot of) data. This can be improved by improving your internet connection, by repositioning your computer or router, by connecting directly via a cable, or simply by upgrading your connection. For 44 million rows of a single 64-bit double-precision column, you have 44 × 10^6 × 8 / 1024^3 ≈ 0.33 GiB of data (uncompressed); with 10 such columns this grows to about 3.3 GiB. It is simply going to take a while to download all of this, so getting the row count down is extremely helpful!
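As a quick sanity check of that transfer-size arithmetic (a 64-bit double is 8 bytes):

```r
# Back-of-the-envelope transfer size for 44 million rows of 8-byte doubles.
rows <- 44e6
bytes_per_double <- 8
gib_one_col <- rows * bytes_per_double / 1024^3
round(gib_one_col, 2)       # about 0.33 GiB for a single column
round(10 * gib_one_col, 1)  # about 3.3 GiB for ten such columns
```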
As a side note, you might be able to download the table slightly faster directly via SSMS (still slow due to the table size) and then import the file locally. For the fastest speeds, you will likely have to look into SQL Server's bulk import and export functionality.
2: If I make any changes to this data in R does it change anything in my MyTable_in_SQL?
No. R keeps no internal pointer/connection to the table once it has been loaded. I am not even aware of a package (in R, at least) that opens a stream to the table and could dynamically update it. I know functionality like this exists in Excel, but even there it has some dangerous side effects and should (in my opinion) only be used in read-only applications, where the user wants to see an (almost) live stream of the data.
3: How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
To avoid this, simply save the table after every session. Whenever you close RStudio, it will ask whether you want to save your current session; if you click yes, it saves .Rhistory and .RData in the getwd() directory, and these are imported the next time you open a session (unless you changed the working directory before closing, using setwd(...)). However, I strongly suggest you do not do this for larger datasets, as it makes your R session take forever to open the next time, and it can create unnecessary copies of your data (for example, if you import the table into df and transform it into df2, you suddenly have two copies of a multi-gigabyte dataset to load every time you open R).

Instead, I highly suggest saving the file with arrow::write_parquet(df, file_path), which is a much (and I mean MUCH!) faster alternative to saving as RDS or CSV. Parquet files can't be opened as easily in Excel, but can be read in R with arrow::read_parquet and in Python with pandas.read_parquet or pyarrow.parquet.read_parquet, while being compressed to a size usually 50-80% smaller than the equivalent CSV file.
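A minimal sketch of the save-once, reload-later pattern, shown here with base R's saveRDS/readRDS (arrow::write_parquet and arrow::read_parquet follow the same shape and are much faster on large tables; the data and path are illustrative):

```r
# Save a data frame to disk once, then reload it in later sessions instead of
# re-querying the server. saveRDS()/readRDS() are base R; swap in
# arrow::write_parquet()/arrow::read_parquet() for large tables.
df <- data.frame(ticker = c("AAA", "BBB"), volume = c(100, 250))
path <- file.path(tempdir(), "mytable.rds")  # illustrative path

saveRDS(df, path)

# ...later, in a fresh session:
df_again <- readRDS(path)
identical(df, df_again)  # TRUE
```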
Note:
If you already saved your R session after loading in the file and you experience a very slow startup, I suggest removing the .RData file from your working directory, usually your Documents folder (C:/Users/[user]/Documents).
On question 2 you're correct: any changes in R won't change anything in the DB.
About question 3, you can call save.image() or save.image('path/image_name.RData') to save your environment, and recover it later in another session with load('path/image_name.RData').
Maybe with this you won't need a faster way to get the data from the DB.

Processing very large files in R

I have a dataset of 188 million rows and 41 columns. It comes as a massive compressed fixed-width file, and I am currently reading it into R with the vroom package like this:
library(vroom)
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column position],
                             [41 column names]))
vroom does a wonderful job here, in the sense that the data actually get read into an R session on a machine with 64 GB of memory. But when I run object.size on d, it is a whopping 61 GB in size, and when I turn around to do anything with the data, I can't: all I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything. I have tried base R with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed-width file to parquet, but I would like to split it into smaller, more manageable files based on the values in one of the columns, then write those to parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns, keeping the column I want to split by, write the parquet files, then bind/merge the pieces back together. That seems like a more error-prone solution, though, so I thought I would ask here what else is available.
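One common pattern is to process the file in chunks and append each chunk's rows to a per-group file on disk, so no full copy ever sits in memory. A base-R sketch of the splitting step (the "state" column, data and paths are illustrative):

```r
# Sketch: append each group's rows to its own file on disk. In the real
# workflow you would call this once per chunk read from the fixed-width file,
# then convert each per-group file to parquet with arrow::write_parquet().
write_split <- function(df, group_col, out_dir) {
  for (piece in split(df, df[[group_col]])) {
    out <- file.path(out_dir, paste0(piece[[group_col]][1], ".csv"))
    write.table(piece, out, sep = ",", row.names = FALSE,
                col.names = !file.exists(out), append = file.exists(out))
  }
}

# tiny demo
out_dir <- file.path(tempdir(), "split_demo")
dir.create(out_dir, showWarnings = FALSE)
d <- data.frame(state = c("NY", "CA", "NY"), x = 1:3)
write_split(d, "state", out_dir)
list.files(out_dir)  # one file per state
```

With vroom you can read only the split column plus a few others at a time (via the col_select argument), run each chunk through a function like this, and keep peak memory bounded by the chunk size rather than the 61 GB total.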

Sample A CSV File Too Large To Load Into R?

I have a 3 GB CSV file. It is too large to load into R on my computer. Instead, I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R to load/analyze your data in one go, then sooner or later you need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data has a consistent format, with one record per row, you can do:
sort -R data | head -n 1000 >data.sample
This randomly shuffles all the rows and writes the first 1000 into a separate file, data.sample.
(2) If the data is not small enough to fit into memory.
There is also the option of using a database to store the data. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can take a sample with:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also verify the mean or standard deviation of the whole dataset against your sample, if you want to take advantage of the power of the database.
These are the two most commonly used ways based on my experience for dealing with 'big' data.
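If you would rather stay in R, a two-pass streaming approach also works: count the lines once, draw random row numbers, then keep only those rows on a second pass, never holding more than one chunk in memory. A base-R sketch (path and sample size illustrative):

```r
# Sketch: sample n data rows from a large CSV without reading it all at once.
sample_csv <- function(path, n, chunk_size = 10000) {
  # pass 1: count data lines (streaming, constant memory)
  con <- file(path, open = "r")
  total <- 0
  while (length(chunk <- readLines(con, n = chunk_size)) > 0)
    total <- total + length(chunk)
  close(con)
  keep <- sort(sample(total - 1, n))  # row numbers, excluding the header

  # pass 2: collect only the chosen rows
  con <- file(path, open = "r")
  header <- readLines(con, n = 1)
  picked <- character(0); seen <- 0
  while (length(chunk <- readLines(con, n = chunk_size)) > 0) {
    idx <- keep[keep > seen & keep <= seen + length(chunk)] - seen
    picked <- c(picked, chunk[idx])
    seen <- seen + length(chunk)
  }
  close(con)
  read.csv(text = c(header, picked))
}

# tiny demo
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:500, x = rnorm(500)), csv, row.names = FALSE)
s <- sample_csv(csv, 10)
nrow(s)  # 10
```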
