I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to rewrite the new dataframe and replace the old Games.csv every 1000 or say API calls. See the commented code below.
My overall question is, when I need to update a column in csv that is large, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)
df <- read.csv <- "Games.csv"
for(i in 1:nrow(df)){
data <- read_json(df$urls[i])
if(data$analysisLogExists == TRUE){
df$Analyzed[i] <- 1
}
if(data$analysisLogExists==FALSE){
df$Analyzed[i] = 0
}
Sys.sleep(.25)
##This won't work because the second time I run it then I'll just reread the original lines
##if i try to account for this by subsetting only the the columns that haven't been updated,
##then this still doesn't work because then the write command below will not be writing the whole dataset to the csv
if(i%%1000){
write.csv(df,"Games.csv",row.names = F)
}
}
Related
I'm wondering what the correct approach is to creating an Apache Arrow multi-file dataset as described here in batches. The tutorial explains quite well how to write a new partitioned dataset from data in memory, but is it possible to do this in batches?
My current approach is to simply write the datasets individually, but to the same directory. This appears to be working, but I have to imagine this causes issues with the metadata that powers the feature. Essentially, my logic is as follows (pseudocode):
data_ids <- c(123, 234, 345, 456, 567)
# write data in batches
for (id in data_ids) {
## assume this is some complicated computation that returns 1,000,000 records
df <- data_load_helper(id)
df <- group_by(df, col_1, col_2, col_3)
arrow::write_dataset(df, "arrow_dataset/", format = 'arrow')
}
# read in data
dat <- arrow::open_dataset("arrow_dataset/", format="arrow", partitioning=c("col_1", "col_2", "col_3"))
# check some data
dat %>%
filter(col_1 == 123) %>%
collect()
What is the correct way of doing this? Or is my approach correct? Loading all of the data into one object and then writing it at once is not viable, and certain chunks of the data will update at different periods over time.
TL;DR: Your solution looks pretty reasonable.
There may be one or two issues you run into. First, if your batches do not all have identical schemas then you will need to make sure to pass in unify_schemas=TRUE when you are opening the dataset for reading. This could also become costly and you may want to just save the unified schema off on its own.
certain chunks of the data will update at different periods over time.
If by "update" you mean "add more data" then you may need to supply a basename_template. Otherwise every call to write_dataset will try and create part-0.arrow and they will overwrite each other. A common practice to work around this is to include some kind of UUID in the basename_template.
If by "update" you mean "replace existing data" then things will be a little trickier. If you want to replace entire partitions worth of data you can use existing_data_behavior="delete_matching". If you want to replace matching rows I'm not sure there is a great solution at the moment.
This approach could also lead to small batches, depending on how much data is in each group in each data_id. For example, if you have 100,000 data ids and each data id has 1 million records spread across 1,000 combinations of col_1/col_2/col_3 then you will end up with 1 million files, each with 1,000 rows. This won't perform well. Ideally you'd want to end up with 1,000 files, each with 1,000,000 rows. You could perhaps address this with some kind of occasional compaction step.
I am working with a time-series data stream from an experiment. We record multiple data channels, including a trigger channel ('X7.ramptrig' in linked data: Time-Series Data Example), that indicates when a relevant event occurs in other channels.
I am trying to create subsets of the next n-rows (e.g. 15,000) of the time-series (time steps are 0.1ms) that occur after onset of a trigger ('1'). That column has multiple triggers ('1') interspersed at irregular intervals. Every other time step is a '0', indicating no new event.
I am asking to see if there is a more efficient solution to directly subset the subsequent n-rows after a trigger is detected instead of the indirect (possibly inflexible) solution I have come up with.
Link to simple example data:
https://gtvault-my.sharepoint.com/:t:/g/personal/shousley6_gatech_edu/EZZSVk6pPpJPvE0fXq1W2KkBhib1VDoV_X5B0CoSerdjFQ?e=izlkml
I have a working solution that creates an index from the trigger channel and splits the dataset on that index. Because triggers have variability in placement in time, the subsequent data frame subsets are not consistent and there are occasionally 'extra' subsets that precede 'important' ones ('res$0' in example). Additionally, I need to have the subsets be matched for total time and aligned for trigger onset.
My current solution 'cuts' the lists of data frames to the same size (in the example to the first 15,000 rows). While this technically works it seems clunky. I also tried to translate a SQL solution using FETCH NEXT but those functions are not available in the SQLite supported in R.
I am completely open to alternatives so please be unconstrained by my current solution.
##create index to detect whenever an event trigger occurs
idx<-c(0, cumsum(diff(Time_Series_Data_Example$X7.ramptrig) >0))
## split the original dataframe on event triggers
split1<-split(Time_Series_Data_Example, idx)
## cuts DFs down to 1.5s
res <- lapply(split1, function(x){
x <- top_n(x, -15000)
})
Here is an example of data output: 'head(res[["1"]]' 2
For the example data and code provided, the output is 4 subsets, 3 of which are 'important' and time synced to the trigger. The first 'res$0' is a throw away subset.
Thanks in advance and please let me know how I can improve my question asking (this is my first attempt).
I have a script I have to run periodically where I use a varying number of assignment statements in R, like this:
r5$NWord_1<-ifelse(r5$Match==1,NA,r5$NWord_1)
r5$NWord_2<-ifelse(r5$Match==2,NA,r5$NWord_2)
r5$NWord_3<-ifelse(r5$Match==3,NA,r5$NWord_3)
r5$NWord_4<-ifelse(r5$Match==4,NA,r5$NWord_4)
r5$NWord_5<-ifelse(r5$Match==5,NA,r5$NWord_5)
r5$NWord_6<-ifelse(r5$Match==6,NA,r5$NWord_6)
r5$NWord_7<-ifelse(r5$Match==7,NA,r5$NWord_7)
The problem is that the number of "NWord" variables changes from run to run usually between 5 and 7). I have the number of "NWord" variables stored separately as Size.
Size<-5
I have tried the following, but get() only works on objects, not columns of dataframes.
for(i in 1:Size){
get(paste("r5$NWord_",i,sep=""))<-ifelse(r5$Match==i,NA,get(paste("r5$NWord_",i,sep="")))
}
I am curious: What is the best way to automate this process so I do not have to manually run a subset of these statements every time?
For those interested: 1) The import data are in wide format (that is what the system allows me to download). 2) The export data have to be in wide format in order to upload to the system. Since these are just a few of some ~200 variables in the data set, going back and forth between wide and long and then long back to wide (possibly multiple times) seems cumbersome and prone to error. Therefore, I came up with this:
idx1<-which(colnames(r5)=="NWord_1")
idx2<-which(colnames(r5)==paste("NWord_",Size,sep=""))
for(i in idx1:idx2){
r5[,i]<-ifelse(r5$Match==1,NA,r5[,i])
}
Seems to work just fine, however, I'm not sure that it is the most efficient way to code this.
I have a long process in R and I want to save the working result dataframe every "t" time.
There's no loop involved, so you cannot use counter or iterator to explicit a condition to write to disk.
There's a lot of cirsumstances you could need to preserve the "on progress data frame" to disk, for instance, taking a big dataset of locations (Street + City + Zip code) you try to get Lat/Lon using ggmap package
df<- mutate_geocode(long.data, location)
....long time.....
write.table(df, file = "my_result.csv")
The result dataframe is only written to disk at the end of process.
The issue is, sometimes my laptop freezes and Google Maps limit is 2.500 querys per day, so my work is lost because it's not saved to disk. You have to start again all process from scratch.
is a very generic question, so no sample data is provided.
is there any way in R of checkpointing my work to disk if no loop is involved?
thanks
Guess if splitting your data into smaller data frames of , let's say, 100 rows and looping through all of them but saving to disk for every one and stacking all at the end?
I have a several data frames which start with a bit of text. Sometimes the information I need starts at row 11 and sometimes it starts at row 16 for instance. It changes. All the data frames have in common that the usefull information starts after a row with the title "location".
I'd like to make a loop to delete all the rows in the data frame above the useful information (including the row with "location").
I'm guessing that you want something like this:
readfun <- function(fn,n=-1,target="location",...) {
r <- readLines(fn,n=n)
locline <- grep(target,r)[1]
read.table(fn,skip=locline,...)
}
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (#MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...