I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to rewrite the new dataframe and replace the old Games.csv every 1000 or so API calls. See the commented code below.
My overall question is, when I need to update a column in csv that is large, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)  # provides read_json()

df <- read.csv("Games.csv")
for (i in 1:nrow(df)) {
  data <- read_json(df$urls[i])
  if (data$analysisLogExists == TRUE) {
    df$Analyzed[i] <- 1
  }
  if (data$analysisLogExists == FALSE) {
    df$Analyzed[i] <- 0
  }
  Sys.sleep(.25)
  ## This won't work because the second time I run it I'll just reread the original lines.
  ## If I try to account for this by subsetting only the rows that haven't been updated,
  ## it still doesn't work because the write command below will not be writing the whole dataset to the csv.
  if (i %% 1000 == 0) {  # save every 1000 calls
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
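One possible way to make the loop above resumable, sketched under stated assumptions rather than offered as a definitive answer: rows that have not been checked yet hold NA in Analyzed, the whole data frame is kept in memory, and the full file is rewritten at each checkpoint so nothing is lost. The names update_analyzed, fetch_fun, max_calls, save_every and pause are hypothetical, and fetch_fun stands in for the real API call.

```r
# Sketch only: process at most `max_calls` unchecked rows per run, then stop.
# Assumption: unchecked rows have NA in the Analyzed column.
update_analyzed <- function(path, fetch_fun, max_calls = 1000,
                            save_every = 100, pause = 0.25) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  todo <- head(which(is.na(df$Analyzed)), max_calls)  # rows still to check

  for (k in seq_along(todo)) {
    i <- todo[k]
    # fetch_fun() is a hypothetical wrapper returning TRUE/FALSE for one URL
    df$Analyzed[i] <- as.integer(isTRUE(fetch_fun(df$urls[i])))
    Sys.sleep(pause)

    # Checkpoint: write the WHOLE data frame to a temp file, then swap it in,
    # so an interrupted write does not leave a truncated csv behind.
    if (k %% save_every == 0 || k == length(todo)) {
      tmp <- paste0(path, ".tmp")
      write.csv(df, tmp, row.names = FALSE)
      file.rename(tmp, path)
    }
  }
  invisible(sum(is.na(df$Analyzed)))  # number of rows still unchecked
}
```

With the real data, fetch_fun could be something like function(u) jsonlite::read_json(u)$analysisLogExists, and each run would pick up where the previous one stopped. Rewriting a 1.3 million row file is itself slow, so save_every trades lost work against write time.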
I'm conducting simulations over a range of models and parameter values. At this point in time my drake workflow involves over 3,000 simulated data.frames and corresponding stanfit objects.
Trying to run make currently incurs a delay of ~2 minutes before plan execution begins. I assume that this is because drake is going through its cache to verify which steps in the plan will need updating. I would like to have some way of letting it know that it can represent all of these models as a single monolithic chunk of output. What I could do is make a function that writes all my output objects as a side-effect and then outputs a hash of sorts so that drake is "fooled" as to what needs to be checked but I can't restructure my code at this point in time given an upcoming deadline and the processing time involved.
Similarly, for purposes of using the dependency graph, having 3k+ objects show up makes it unusable. It would be nice to be able to collapse certain objects under a single "output type" group.
Great question. I know what you are saying, and I think about this problem all the time. In fact, trying to get rid of the delay is one of my top two priorities for drake for 2019.
Unfortunately, drake does not have a solution right now that will allow you to keep your targets up to date. The long-term solution will probably be speed improvements + https://github.com/ropensci/drake/issues/304 + https://github.com/ropensci/drake/issues/233. These are important areas of development, but also huge undertakings.
For new projects, you could have each target be a list of fitted stan models.
drake_plan(
  data1 = generate_data(...),
  data2 = generate_data(...),
  models_data1 = fit_models(data1),
  models_data2 = fit_models(data2)
)

# Each target is a list of fitted models rather than one model per target.
fit_models <- function(data){
  list(
    run_stan(data, "normal_priors"),
    run_stan(data, "t_priors")
  )
}
And for the graph visualizations, there is support for target clusters. See https://ropenscilabs.github.io/drake-manual/vis.html#clusters
EDIT: parallel computing and verbosity
If you run make(jobs = c(imports = 4, targets = 6)), drake will use 4 processes on your local machine to do the preprocessing. And make(verbose = 4) shows more progress messages than with the default setting.
What I want is to transport a matrix, e.g. a 1000x1000 matrix (actually bigger than that), from NodeA to NodeB in R. Right now I use the code below:
NodeA:
A<-matrix(0,1000,1000)
Conn1<-socketConnection(port=8000, server=TRUE)
write.table(A,file = Conn1, col.names = FALSE)
NodeB:
HostId<-'x.x.x.x'
Conn2<-socketConnection(host=HostId, port=8000, blocking=TRUE)
A<-read.table(file = Conn2,nrows =1000)
But it takes about 30s to finish the data transmission when I run 4 transmissions simultaneously and the dimension of the matrix reaches 1.5k (a matrix of ~20Mb). However, as far as I know, the speed of data transmission over FTP is about 10Mb/s, which should be much faster than 30s, so I'm wondering how I can improve my code.
Thanks in advance.
EDIT:
After trying Ralf Stubner's answer, something strange happened:
serialize outperforms write.table in test1:
t1<-proc.time()
S<-unserialize(Con,refhook = NULL)
t2<-proc.time() -t1
t3<-proc.time()
S<-read.table(file=Con)
t4<-proc.time() -t3
The elapsed time from proc.time() is 14s (unserialize) vs 70s (read.table).
But when I run 4 pieces of code at the same time in a framework like this answer (test2), serialize took much more time than write.table did.
The output of serialize is 101s (the third number in the ptn)
The output of write.table is 16s (the third number in the ptn)
Thanks to anyone who could bear such a long post (and my poor English). The command serialize might be the best answer if I have only 1 piece of code to run, but the strange results in test2 are really beyond me. I'm wondering if I have to use some external tool such as MPI.
With read.table and write.table you are converting the table to text before transferring it. This will take time and increase size. Have a look at serialize() for converting the matrix to a binary format.
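To illustrate the difference, here is a minimal in-memory sketch comparing the two representations. It assumes a matrix of non-trivial doubles (a matrix of zeros, as in the question's toy example, is unusually compact as text) and uses a small random matrix as a stand-in; the variable names are illustrative only.

```r
set.seed(42)
A <- matrix(runif(200 * 200), 200, 200)   # small stand-in for the real matrix

# Binary route: serialize() to a raw vector (connection = NULL keeps it in memory).
bin <- serialize(A, connection = NULL)

# Text route: the lines write.table() would push through the socket.
txt <- capture.output(write.table(A, col.names = FALSE))

bin_bytes <- length(bin)
txt_bytes <- sum(nchar(txt)) + length(txt)   # plus one newline per row

B <- unserialize(bin)   # exact round trip, no text parsing involved
identical(A, B)

# Over sockets the same calls take the connection instead, e.g.
#   serialize(A, Conn1)        on NodeA
#   A <- unserialize(Conn2)    on NodeB
# (assumption: both connections are opened in binary mode, e.g. open = "a+b")
```

Comparing bin_bytes with txt_bytes shows how much extra data the text route sends, on top of the time spent formatting and parsing numbers.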
Edit: You seem to have trouble interacting with multiple clients. If you are willing to do some learning, I would suggest something like ZeroMQ, e.g. via the rzmq package. You will have to think about the architecture, though. See http://zguide.zeromq.org/page:all for several examples.
I am attempting to download data consisting of approximately 1 million jpg files for which I have individual URL's and desired file names. The images have a mean filesize of approximately 120KB and range from 1KB to 1MB. I would like to use R to download the images.
I've tried a few things and eventually figured out a way that has let me download all million images in under three hours. My current strategy works, but it is a somewhat absurd solution that I would prefer not to use ever again, and I'm baffled as to why it even works. I would like to understand what's going on and to find a more elegant and efficient way of achieving the same result.
I started out with mapply and download.file() but this only managed a rate of 2 images per second. Next, I parallelized the process with the parallel package. This was very effective and improved the rate to 9 images per second. I assumed that would be the most I could achieve, but I noticed that the resources being used by my modest laptop were nowhere near capacity. I checked to make sure there wasn't a significant disk or network access bottleneck, and sure enough, neither was running at much more than ~10% of its capacity.
So I split up the url information and opened a new R console window where I ran a second instance of the same script on a different segment of the data to achieve 18 images per second. Then I just continued to open more and more instances, giving each of them a unique section of the full list of URL's. It was not until I had 12 open that there was any hint of slowing down. Each instance actually gave a nearly linear increase in downloads per second, and with some memory management, I approached my maximum down speed of 13 MB/s.
I have attached a graph showing the approximate total images being downloaded per second as a function of the number of instances running.
Also attached is a screenshot of my resource monitor while 10 simultaneous instances of R were running.
I find this result very surprising and I don't quite understand why this should be possible. What's making each individual script run so slowly? If the computer can run 12 instances of this code with little to no diminishing returns, what prevents it from just running 12 times as fast? Is there a way to achieve the same thing without having to open up entirely new R environments?
Here is the code I am asking about specifically. Unfortunately I cannot disclose the original URL's but the script is nearly identical to what I am using. I have replaced my data with a few CC images from wikimedia. For better replication, please replace "images" with your own large URL list if you have access to such a thing.
library(parallel)
library(data.table)

images <-
  data.table(
    file = c(
      "Otter.jpg",
      "Ocean_Ferret.jpg",
      "Aquatic_Cat.jpg",
      "Amphibious_Snake_Dog.jpg"
    ),
    url = c(
      "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/Otter_and_Bamboo_Wall_%2822222758789%29.jpg/640px-Otter_and_Bamboo_Wall_%2822222758789%29.jpg",
      "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Otter_Looking_Back_%2817939094316%29.jpg/640px-Otter_Looking_Back_%2817939094316%29.jpg",
      "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Otter1_%2814995327039%29.jpg/563px-Otter1_%2814995327039%29.jpg",
      "https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Otter_Profile_%2817962452452%29.jpg/640px-Otter_Profile_%2817962452452%29.jpg"
    ) #full URL's are redundant and unnecessary but I kept them in case there was some performance advantage over nesting a function inside download.file that combines strings.
  )

#Download with Mapply (just for benchmarking, not actually used in the script)
system.time(
  mapply(
    function(x, y)
      download.file(x, y, mode = 'wb', quiet = TRUE),
    x = images$url,
    y = images$file,
    SIMPLIFY = "vector",
    USE.NAMES = FALSE
  )
)

#Parallel Download with clusterMap (this is what each instance is running. I give each instance a different portion of the images data table)
cl <- makeCluster(detectCores())
system.time(
  clusterMap(
    cl,
    download.file,
    url = images$url,
    destfile = images$file,
    quiet = TRUE,
    mode = 'wb',
    .scheduling = 'dynamic',
    SIMPLIFY = 'vector',
    USE.NAMES = FALSE
  )
)
In summary, the questions I am asking are:
1) Why is my solution behaving this way? More specifically, why is 1 script not fully utilizing my computer's resources?
2) What is a better way to achieve the following with R: download 120GB composed of one million jpeg images directly via their URL's in under 3 hours.
Thank you in advance.
cl <- makeCluster(detectCores())
This line says to make a backend cluster with a number of nodes equal to your cores. That would probably be 2, 4 or 8, depending on how beefy a machine you have.
Since, as you noticed, the downloading process isn't CPU-bound, there's nothing stopping you from making the cluster as big as you want. Replace that line with something like
cl <- makeCluster(50)
and you'll have 50 R sessions downloading in parallel. Increase the number until you hit your bandwidth or memory limit.
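As a rough sketch of that idea, the snippet below runs more tasks than a typical laptop has cores. To stay runnable offline it replaces download.file() with fake_download(), a hypothetical stand-in that only waits, which mimics a task that spends its time on the network rather than the CPU. The URLs and the worker count are placeholders.

```r
library(parallel)

# Hypothetical stand-in for download.file(): an I/O-style wait with no CPU use.
fake_download <- function(url, destfile) {
  Sys.sleep(0.2)
  destfile
}

urls  <- sprintf("https://example.com/img_%02d.jpg", 1:16)  # placeholder URLs
files <- sprintf("img_%02d.jpg", 1:16)

n_workers <- 8   # deliberately not tied to detectCores()
cl <- makeCluster(n_workers)
elapsed <- system.time(
  done <- clusterMap(cl, fake_download, url = urls, destfile = files,
                     .scheduling = "dynamic", SIMPLIFY = TRUE,
                     USE.NAMES = FALSE)
)[["elapsed"]]
stopCluster(cl)   # release the worker sessions when finished
```

Run one after another, the 16 waits would add up to about 3.2 seconds. With 8 workers most of that time overlaps, so elapsed should come out well below that even on a machine with fewer than 8 cores. Each worker is a full R session, so memory rather than CPU becomes the practical limit.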
Being a programmer I occasionally find the need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something that I expect to be easier.
As an example to put the question in context, let me quickly show you an example from a CSV file I received today (heavily filtered for brevity):
date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296
I have about 100,000 data points like this with different variables that I want to plot in a scatter plot in order to look for correlations. Usually the data needs to be processed in some way for presentation purposes (such as converting nanoseconds to milliseconds and rounding fractional values), some columns may need to be added or inverted, or combined (like the date/time columns).
The usual recommendation for this kind of work is R and I have recently made a serious effort to use it, but after a few days of work my experience has been that most tasks that I expect to be simple seem to require many steps and have special cases; solutions are often non-generic (for example, adding a data set to an existing plot). It just seems to be one of those languages that people love because of all the powerful libraries that have accumulated over the years rather than the quality and usefulness of the core language.
Don't get me wrong, I understand the value of R to people who are using it, it's just that given how rarely I spend time on this kind of thing I think that I will never become an expert on it, and to a non-expert every single task just becomes too cumbersome.
Microsoft Excel is great in terms of usability but it just isn't powerful enough to handle large data sets. Also, both R and Excel tend to freeze completely (!) with no way out other than waiting or killing the process if you accidentally make the wrong kind of plot over too much data.
So, stack overflow, can you recommend something that is better suited for me? I'd hate to have to give up and develop my own tool, I have enough projects already. I'd love something interactive that could use hardware acceleration for the plot and/or culling to avoid spending too much time on rendering.
@flodin It would have been useful for you to provide an example of the code you use to read such a file into R. I regularly work with data sets of the size you mention and do not have the problems you describe. One thing that might be biting you if you don't use R often is that if you don't tell R what the column types are, it has to do some snooping on the file first, and that all takes time. Look at the colClasses argument in ?read.table.
For your example file, I would do:
dat <- read.csv("foo.csv", colClasses = c(rep("character",2), rep("integer", 3)))
then post process the date and time variables into an R date-time object class such as POSIXct, with something like:
dat <- transform(dat, dateTime = as.POSIXct(paste(date, time)))
As an example, let's read in your example data set, replicate it 50,000 times and write it out, then time different ways of reading it in, with foo containing your data:
> foo <- read.csv("log.csv")
> foo
date time PS.Eden.Space.used PS.Old.Gen.Used
1 2011-06-28 00:00:03 45004472 184177208
2 2011-06-28 00:00:18 45292232 184177208
PS.Perm.Gen.Used
1 94048296
2 94048296
Replicate that, 50000 times:
out <- data.frame(matrix(nrow = nrow(foo) * 50000, ncol = ncol(foo)))
out[, 1] <- rep(foo[,1], times = 50000)
out[, 2] <- rep(foo[,2], times = 50000)
out[, 3] <- rep(foo[,3], times = 50000)
out[, 4] <- rep(foo[,4], times = 50000)
out[, 5] <- rep(foo[,5], times = 50000)
names(out) <- names(foo)
Write it out
write.csv(out, file = "bigLog.csv", row.names = FALSE)
Time loading the naive way and the proper way:
system.time(in1 <- read.csv("bigLog.csv"))
system.time(in2 <- read.csv("bigLog.csv",
colClasses = c(rep("character",2),
rep("integer", 3))))
Which is very quick on my modest laptop:
> system.time(in1 <- read.csv("bigLog.csv"))
user system elapsed
0.355 0.008 0.366
> system.time(in2 <- read.csv("bigLog.csv",
colClasses = c(rep("character",2),
rep("integer", 3))))
user system elapsed
0.282 0.003 0.287
For both ways of reading in.
As for plotting, the graphics can be a bit slow, but depending on your OS this can be sped up a bit by changing the device you plot to. On Linux, for example, don't use the default X11() device, which uses Cairo; instead try the old X window without anti-aliasing. Also, what are you hoping to see with a data set as large as 100,000 observations on a graphics device with not many pixels? Perhaps try to rethink your strategy for data analysis. No stats software will be able to save you from doing something ill-advised.
It sounds as if you are developing code/analysis as you go along, on the full data set. It would be far more sensible to just work with a small subset of the data when developing new code or new ways of looking at your data, say with a random sample of 1000 rows, and work with that object instead of the whole data object. That way you guard against accidentally doing something that is slow:
working <- out[sample(nrow(out), 1000), ]
for example. Then use working instead of out. Alternatively, whilst testing and writing a script, set argument nrows to say 1000 in the call to load the data into R (see ?read.csv). That way whilst testing you only read in a subset of the data, but one simple change will allow you to run your script against the full data set.
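A minimal sketch of the nrows approach, assuming a single flag (here called testing, a hypothetical name) that is flipped once the script is ready for the full file. The temporary file and its two columns are stand-ins for bigLog.csv.

```r
# Hypothetical switch: set to FALSE to run against the full file.
testing <- TRUE

path <- tempfile(fileext = ".csv")   # stand-in for "bigLog.csv"
write.csv(data.frame(x = 1:5000, y = rnorm(5000)), path, row.names = FALSE)

dat <- read.csv(path,
                colClasses = c("integer", "numeric"),
                nrows = if (testing) 1000 else -1)   # -1 reads every row
nrow(dat)
```

Because only the first rows are read, this is not a random sample. If the file is ordered in a meaningful way, the sample() approach shown earlier gives a more representative subset.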
For data sets of the size you are talking about, I see no problem whatsoever in using R. Your point, about not becoming expert enough to use R, will more than likely apply to other scripting languages that might be suggested, such as python. There is a barrier to entry, but that is to be expected if you want the power of a language such as python or R. If you write scripts that are well commented (instead of just plugging away at the command line), and focus on a few key data import/manipulations, a bit of plotting and some simple analysis, it shouldn't take long to master that small subset of the language.
R is a great tool, but I never had to resort to using it. Instead I find python to be more than adequate for my needs when I need to pull data out of huge logs. Python really comes with "batteries included", with built-in support for working with csv files.
The simplest example of reading a CSV file:
import csv

# newline='' is what the csv module expects when opening files in Python 3
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
To use another separator, e.g. tab and extract n-th column, use
spamReader = csv.reader(open('spam.csv', newline=''), delimiter='\t')
for row in spamReader:
    print(row[n])  # n is the index of the column you want
To operate on columns use the built-in list data-type, it's extremely versatile!
To create beautiful plots I use matplotlib
The python tutorial is a great way to get started! If you get stuck, there is always stackoverflow ;-)
There seem to be several questions mixed together:
Can you draw plots quicker and more easily?
Can you do things in R with less learning effort?
Are there other tools which require less learning effort than R?
I'll answer these in turn.
There are three plotting systems in R, namely base, lattice and ggplot2 graphics. Base graphics will render quickest, but making them look pretty can involve pathological coding. ggplot2 is the opposite, and lattice is somewhere in between.
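As a hedged illustration of the base graphics route, the sketch below draws 100,000 simulated points using the smallest plotting symbol. The data, the file name and the styling choices are assumptions made for the example, not a recommendation for any particular data set.

```r
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- x + rnorm(n)   # simulated, correlated stand-in data

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)                      # file device, so no screen rendering is needed
plot(x, y,
     pch = ".",                    # smallest symbol, cheap to draw
     col = rgb(0, 0, 0, 0.2),      # semi-transparent to show point density
     xlab = "x", ylab = "y")
dev.off()
```

Drawing to a file device rather than the screen also avoids a frozen window if a plot turns out to be heavier than expected.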
Reading in CSV data, cleaning it and drawing a scatterplot sounds like a pretty straightforward task, and the tools are definitely there in R for solving such problems. Try asking a question here about specific bits of code that feel clunky, and we'll see if we can fix it for you. If your datasets all look similar, then you can probably reuse most of your code over and over. You could also give the ggplot2 web app a try.
The two obvious alternative languages for data processing are MATLAB (and its derivatives: Octave, Scilab, AcslX) and Python. Either of these will be suitable for your needs, and MATLAB in particular has a pretty shallow learning curve. Finally, you could pick a graph-specific tool like gnuplot or Prism.
SAS can handle larger data sets than R or Excel; however, many (if not most) people, myself included, find it a lot harder to learn. Depending on exactly what you need to do, it might be worthwhile to load the CSV into an RDBMS and do some of the computations (e.g. correlations, rounding) there, and then export only what you need to R to generate graphics.
ETA: There's also SPSS, and Revolution; the former might not be able to handle the size of data that you've got, and the latter is, from what I've heard, a distributed version of R (that, unlike R, is not free).