Checkpointing a data frame to disk in a long process when no loop is involved - R

I have a long-running process in R and I want to save the working result data frame every "t" units of time.
There's no loop involved, so I cannot use a counter or iterator as the condition for writing to disk.
There are many circumstances in which you might need to preserve the in-progress data frame to disk. For instance, given a big dataset of locations (street + city + zip code), you try to get the latitude and longitude using the ggmap package:
df<- mutate_geocode(long.data, location)
....long time.....
write.table(df, file = "my_result.csv")
The result data frame is only written to disk at the end of the process.
The issue is that my laptop sometimes freezes, and the Google Maps limit is 2,500 queries per day, so my work is lost because it was never saved to disk. I then have to start the whole process again from scratch.
This is a very generic question, so no sample data is provided.
Is there any way in R to checkpoint my work to disk if no loop is involved?
Thanks

What about splitting your data into smaller data frames of, let's say, 100 rows, looping through them, saving each result to disk as you go, and stacking them all at the end?
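The split-and-save idea above could look roughly like the sketch below. It is only an illustration: `slow_lookup` is a hypothetical stand-in for the slow call (for example `mutate_geocode`), and `checkpointed_apply`, the chunk file names, and the checkpoint directory are all made up for this example.

```r
# Hypothetical stand-in for a slow call such as geocoding one chunk of rows.
slow_lookup <- function(chunk) {
  chunk$result <- nchar(chunk$location)
  chunk
}

# Process `data` in chunks, saving each finished chunk to its own .rds file.
# On a rerun, chunks whose file already exists are skipped.
checkpointed_apply <- function(data, fun, dir, chunk_size = 100) {
  stopifnot(nrow(data) > 0)
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  starts <- seq(1, nrow(data), by = chunk_size)
  for (k in seq_along(starts)) {
    out_file <- file.path(dir, sprintf("chunk_%05d.rds", k))
    if (file.exists(out_file)) next  # finished in an earlier run
    rows <- starts[k]:min(starts[k] + chunk_size - 1, nrow(data))
    saveRDS(fun(data[rows, , drop = FALSE]), out_file)
  }
  # Stack all saved chunks in order (zero-padded names sort correctly).
  files <- sort(list.files(dir, pattern = "^chunk_[0-9]+\\.rds$",
                           full.names = TRUE))
  do.call(rbind, lapply(files, readRDS))
}

long.data <- data.frame(location = sprintf("Street %d, City", 1:250),
                        stringsAsFactors = FALSE)
ckpt_dir <- tempfile("geocode_chunks")
df <- checkpointed_apply(long.data, slow_lookup, ckpt_dir)
```

If the session dies halfway, calling `checkpointed_apply` again with the same directory only redoes the chunks that were not saved, so at most one chunk of work (and of API quota) is lost.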

Related

R Updating A Column In a Large Dataframe

I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to rewrite the new dataframe and replace the old Games.csv every 1000 or so API calls. See the commented code below.
My overall question is: when I need to update a column in a large csv file, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)  # provides read_json()
df <- read.csv("Games.csv")
for (i in 1:nrow(df)) {
  data <- read_json(df$urls[i])
  if (data$analysisLogExists == TRUE) {
    df$Analyzed[i] <- 1
  }
  if (data$analysisLogExists == FALSE) {
    df$Analyzed[i] <- 0
  }
  Sys.sleep(.25)
  ## This won't work because the second time I run it then I'll just reread the original lines
  ## if I try to account for this by subsetting only the columns that haven't been updated,
  ## then this still doesn't work because then the write command below will not be writing the whole dataset to the csv
  if (i %% 1000 == 0) {  # write on every 1000th row
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
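One way to make the loop above resumable is to keep the whole data frame in memory, treat `NA` in `Analyzed` as "not checked yet", and only visit those rows, while always writing the full data frame back. The sketch below assumes exactly that convention; `is_analyzed` is a hypothetical stand-in for the real API call, and `update_in_stages`, `max_calls`, and `save_every` are names invented for this example.

```r
# Hypothetical stand-in for the API call; the real version would call
# read_json(url)$analysisLogExists and then Sys.sleep(.25).
is_analyzed <- function(url) grepl("[05]$", url)

# Check at most `max_calls` unprocessed rows, saving every `save_every` calls.
update_in_stages <- function(path, check, max_calls = Inf, save_every = 1000) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  todo <- which(is.na(df$Analyzed))  # rows not handled by earlier runs
  todo <- todo[seq_len(min(length(todo), max_calls))]
  for (n in seq_along(todo)) {
    i <- todo[n]
    df$Analyzed[i] <- as.integer(check(df$urls[i]))
    # Always write the whole data frame, so no rows are ever dropped.
    if (n %% save_every == 0) write.csv(df, path, row.names = FALSE)
  }
  write.csv(df, path, row.names = FALSE)  # final save for this stage
  invisible(df)
}

# Small demonstration on a temporary file with 25 unchecked games.
games_path <- tempfile(fileext = ".csv")
write.csv(data.frame(urls = sprintf("https://example.invalid/game/%d", 1:25),
                     Analyzed = NA),
          games_path, row.names = FALSE)
update_in_stages(games_path, is_analyzed, max_calls = 10, save_every = 5)
first_run <- read.csv(games_path)
update_in_stages(games_path, is_analyzed)  # picks up where the first run stopped
second_run <- read.csv(games_path)
```

Each stage still reads and writes the whole file, but only once per `save_every` calls, which is small next to the time spent sleeping between API requests.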

High-scale signal processing in R

I have high-dimensional data, for brain signals, that I would like to explore using R.
Since I am a data scientist I really do not work with Matlab, but R and Python. Unfortunately, the team I am working with is using Matlab to record the signals. Therefore, I have several questions for those of you who are interested in data science.
The Matlab files, recorded data, are single objects with the following dimensions:
1000*32*6000
1000: denotes the sampling rate of the signal.
32: denotes the number of channels.
6000: denotes the time in seconds, so that is 1 hour and 40 minutes long.
The questions/challenges I am facing:
I converted the "mat" files I have into CSV files, so I can use them in R.
However, CSV files are 2 dimensional files with the dimensions: 1000*192000.
The CSV files are rather large, about 1.3 gigabytes. Is there a better way to convert "mat" files into something compatible with R, and smaller in size? I have tried "R.matlab" with readMat, but it is not compatible with the 7th version of Matlab; so I tried to save as V6 version, but it says "Error: cannot allocate vector of size 5.7 Gb".
The time it takes to read the CSV file is rather long! It takes about 9 minutes to load the data. That is using "fread", since the base R function read.csv takes forever. Is there a better way to read files faster?
Once I read the data into R, it is 1000*192000, while it is actually 1000*32*6000. Is there a way to have a multidimensional object in R, where accessing signals and time frames at a given time becomes easier, like dataset[1007,2], which would be the time frame of the 1007th second and channel 2? The reason I want to access it this way is to compare time frames easily and plot them against each other.
Any answer to any question would be appreciated.
This is a good reference for reading large CSV files: https://rpubs.com/msundar/large_data_analysis A key takeaway is to assign the datatype for each column that you are reading versus having the read function decide based on the content.
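On the multidimensional question, base R arrays can hold the three dimensions directly. The sketch below uses tiny stand-in sizes and assumes the flat CSV stores its columns with the channel varying fastest (column = channel + n_channels * (second - 1)); that layout is a guess about how the export was done, so check it against the real file before relying on it.

```r
# Tiny stand-in sizes; the real data would be 1000 x 32 x 6000.
n_samples  <- 10
n_channels <- 3
n_seconds  <- 4

# Flat matrix as it would come out of the CSV: n_samples x (channels * seconds).
flat <- matrix(seq_len(n_samples * n_channels * n_seconds), nrow = n_samples)

# R stores arrays column-major, so under the assumed column layout the
# reshape is a pure relabelling: signal[sample, channel, second].
signal <- array(flat, dim = c(n_samples, n_channels, n_seconds))

# All samples recorded during second 2 on channel 3.
frame <- signal[, 3, 2]

# To index as [second, channel, sample], like dataset[1007, 2] in the question,
# permute the dimensions.
by_second <- aperm(signal, c(3, 2, 1))
same_frame <- by_second[2, 3, ]
```

With the array in this shape, two time frames can be compared with something like `plot(by_second[2, 3, ], by_second[3, 3, ])`.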

Load large datasets into data frame [duplicate]

I have a dataset stored in a text file, with 997 columns and 45000 rows. All values are doubles except the row names and column names. I use RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even opening it in Excel takes me only 2 minutes.
RStudio seems inefficient at this task. Any suggestions on how to make it faster? I don't want to read the data file every time.
I plan to load it once and store it in an Rdata object, which would make loading the data faster in the future. But the first load does not seem to work.
I am not a computer science graduate, so any kind help will be much appreciated.
I recommend data.table, although you will end up with a data.table object rather than a plain data frame. If you would rather not work with a data.table, you can simply convert it back to a normal data frame with as.data.frame().
library(data.table)
data <- fread('yourpathhere/yourfile')
As documented in the ?read.table help file there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains you can avoid the overhead associated with read.table trying to guess the type of data in each column. Secondly, by telling read.table how many rows the data file has you can avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for comments. Using all of these techniques I was able to read a .csv file with 997 columns and 45000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
# user system elapsed
#115.253 2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
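For completeness, here is a small sketch that passes all three of the arguments described above (`colClasses`, `nrows`, and `comment.char`) in one call. It uses a deliberately tiny temporary file, so it shows the call shape only and says nothing about timings.

```r
# Write a small all-numeric file to a temporary location.
tmp_path <- tempfile(fileext = ".csv")
tmp <- data.frame(matrix(rnorm(50 * 20), ncol = 20))
write.csv(tmp, tmp_path, row.names = FALSE)

x <- read.csv(tmp_path,
              colClasses = "numeric",  # skip per-column type guessing
              nrows = 50,              # known row count (header excluded)
              comment.char = "")       # do not scan for comment characters
```

When the row count is not known in advance, a mild overestimate for `nrows` is fine; the help page for `read.table` describes it as the maximum number of rows to read.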

Sample A CSV File Too Large To Load Into R?

I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, sooner or later you need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data falls into a consistent format. Each row is one record. You can do:
sort -R data | head -n 1000 >data.sample
This will randomly sort all the rows and get the first 1000 rows into a separate file - data.sample
(2) Database:
If the data is not small enough to fit into memory, another solution is to store it in a database. For example, I have many tables stored in a MySQL database in a clean tabular format. I can take a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to keep queries fast. You can also use the database to compare the mean or standard deviation of the whole dataset against those of your sample.
These are the two most commonly used ways based on my experience for dealing with 'big' data.
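If you would rather stay inside R, a third option (not one of the two above) is to stream the file through a connection and keep a random subset as you go, a technique usually called reservoir sampling. The sketch below is a minimal version under some assumptions: one record per line, no quoted fields containing line breaks, and a header on the first line. `sample_csv_rows` is a name made up for this example, and the per-line loop will be slow on a multi-gigabyte file.

```r
# Keep a uniform random sample of n data lines without loading the whole file.
sample_csv_rows <- function(path, n, chunk_size = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  reservoir <- character(0)
  seen <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    for (ln in lines) {
      seen <- seen + 1L
      if (seen <= n) {
        reservoir[seen] <- ln            # fill the reservoir first
      } else {
        j <- sample.int(seen, 1L)        # keep line with probability n / seen
        if (j <= n) reservoir[j] <- ln
      }
    }
  }
  # Parse only the sampled lines (plus the header) as CSV.
  read.csv(text = c(header, reservoir))
}

# Small demonstration on a temporary 1000-row file.
set.seed(1)
big_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = runif(1000)), big_path,
          row.names = FALSE)
smp <- sample_csv_rows(big_path, 50)
```

Only `chunk_size` lines plus the sample are held in memory at any time, so the file size is limited by patience rather than RAM.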

Strategies for reading in CSV files in pieces?

I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.
Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?
Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks:
Quickly reading very large tables as dataframes in R
I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.
So reading it in pieces still seems like the best way to go for now.
After reviewing this thread I noticed a conspicuous solution to this problem was not mentioned. Use connections!
1) Open a connection to your file
con = file("file.csv", "r")
2) Read in chunks of data with read.csv
read.csv(con, nrows="CHUNK SIZE",...)
Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as NULL.
3) Do whatever you need to do
4) Repeat.
5) Close the connection
close(con)
The advantage of this approach is connections. If you omit this step, it will likely slow things down a bit. By opening a connection manually, you essentially open the data set and do not close it until you call the close function. This means that as you loop through the data set you will never lose your place. Imagine that you have a data set with 1e7 rows. Also imagine that you want to load a chunk of 1e5 rows at a time. Since we open the connection we get the first 1e5 rows by running read.csv(con, nrow=1e5,...), then to get the second chunk we run read.csv(con, nrow=1e5,...) as well, and so on....
If we did not use the connection we would get the first chunk the same way, read.csv("file.csv", nrow=1e5,...), however for the next chunk we would need to read.csv("file.csv", skip = 1e5, nrow=1e5,...). Clearly this is inefficient. We have to find the 1e5+1 row all over again, despite the fact that we just read in the 1e5 row.
Finally, data.table::fread is great. But you cannot pass it connections, so this approach does not work with fread.
I hope this helps someone.
UPDATE
People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
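The connection-based steps above can be sketched as follows. This is only an illustration on a small temporary file. Two details are assumptions layered on top of the answer: the header is consumed by the first read, so later reads pass `header = FALSE` together with the saved column names, and the end of the file is detected by catching the error `read.csv` raises when no lines are left.

```r
# Small temporary file standing in for the large CSV.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, x = (1:25) / 2), path, row.names = FALSE)

chunk_size <- 10
con <- file(path, "r")                      # 1) open the connection once

chunk <- read.csv(con, nrows = chunk_size)  # 2) first chunk also reads the header
col_names <- names(chunk)

total <- 0
n_chunks <- 0
repeat {
  total <- total + sum(chunk$x)             # 3) do whatever you need per chunk
  n_chunks <- n_chunks + 1
  # 4) next chunk: the connection remembers its position, no skip needed
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL                # raised once the file is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
}
close(con)                                  # 5) close the connection
```

Passing `colClasses` as well, as the side note suggests, also keeps the column types consistent from one chunk to the next instead of being guessed per chunk.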
You could read it into a database using RSQLite, say, and then use an sql statement to get a portion.
If you need only a single portion then read.csv.sql in the sqldf package will read the data into an sqlite database. First, it creates the database for you, and the data does not go through R, so the limitations of R (primarily RAM in this scenario) won't apply. Second, after loading the data into the database, sqldf reads the output of a specified sql statement into R, and finally it destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several.
Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.
DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
See ?read.csv.sql and ?sqldf and also the sqldf home page.
