How can I optimize data transmission in local area network? - r

What I want is transport a matrix, e.g. a 1000x1000 matrix (actually biger than that), from NodeA to NodeB in R. Now I use the code below:
NodeA:
A<-matrix(0,1000,1000)
Conn1<-socketConnection(port=8000, server=TRUE)
write.table(A,file = Conn1, col.names = FALSE)
NodeB:
HostId<-'x.x.x.x'
Conn2<-socketConnection(host=HostId, port=8000, blocking=TRUE)
A<-read.table(file = Conn2,nrows =1000)
But it takes me about 30s to finish data transmission when I run 4 transmissions simultaneously and the dimension of the matrix meets 1.5k (a matrix with ~20Mb). However, in my point of view, the speed of data transmission in FTP is about 10Mb/s, which should much faster than 30s, so I'm wondering how can I improve my code?
Thanks in advance.
EDIT:
After trying Ralf Stubner's Answer, something strange happened:
serialize overrides write.table in the test1
t1<-proc.time()
S<-unserialize(Con,refhook = NULL)
t2<-proc.time() -t1
t3<-proc.time()
S<-read.table(file=Con)
t4<-proc.time() -t3
The output of proc.time() is 14s vs 70s:
But, when I run 4 pieces of code at the same time in a framework like this answer (test2), serialize took much time than write.table did.
The output of serialize is 101s (the third number in the ptn)
The output of write.table is 16s (the third number in the ptn)
Thanks for anyone who could bear such a long post(and my poor English). The command serialize might be the best answer if I have only 1 piece of code to run, but the strange events in the test2 are really out of my range. I'm wondering if I have to use some external tools such as MPI.

With read.table and write.table you are converting the table to text before transferring it. This will take time and increase size. Have a look at serialize() for converting the matrix to a binary format.
Edit: You seem to have trouble with interacting with multiple clients. If you are willing to some learning I would suggest something like ZeroMQ, e.g. via the rzmq package. You will have to think about the architecture, though. See http://zguide.zeromq.org/page:all for several examples.

Related

reading in large text files in r

I want to read in a large ido file that had just under 110,000,000 rows and 8 columns. The columns are made up of 2 integer columns and 6 logical columns. The delimiter "|" is used in the file. I tried using read.big.matrix and it took forever. I also tried dumpDf and it ran out of RAM. I tried ff which I heard was a good package and I am struggling with errors. I would like to do some analysis with this table if I can read it in some way. If anyone has any suggestions that would be great.
Kind Regards,
Lorcan
Thank you for all your suggestions. I managed to figure out why I couldn't get the error to work. I'll give you all answers and suggestions so no one can make my stupid mistake again.
First of all, the data that was been giving to me contained some errors in it so I was doomed to fail from the start. I was unaware until a colleague came across it in another piece of software. In a column that contained integers there were some letters so that when the read.table.ff package tried to read in the data set it somehow got confused or I don't know. Whatever though I was given another sample of data, 16,000,000 rows and 8 columns with correct entries and it worked perfectly. The code that I ran is as follows and took about 30 seconds to read:
setwd("D:/data test")
library(ff)
ffdf1 <- read.table.ffdf(file = "test.ido", header = TRUE, sep = "|")
Thank you all for your time and if you have any questions about the answer feel free to ask and I will do my best to help.
Do you really need all the data for your analysis? Maybe you could aggregate your dataset (say from minute values to daily averages). This aggregation only needs to be done once, and can hopefully be done in chunks. In this way you do need to load all your data into memory at once.
Reading in chunks can be done using scan, the important arguments are skip and n. Alternatively, put your data into a database and extract the chunks in that way. You could even using the functions from the plyr package to run chunks in parallel, see this blog post of mine for an example.

Strategies for reading in CSV files in pieces?

I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.
Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?
Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks:
Quickly reading very large tables as dataframes in R
I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.
So reading it in pieces still seems like the best way to go for now.
After reviewing this thread I noticed a conspicuous solution to this problem was not mentioned. Use connections!
1) Open a connection to your file
con = file("file.csv", "r")
2) Read in chunks of code with read.csv
read.csv(con, nrows="CHUNK SIZE",...)
Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as NULL.
3) Do what ever you need to do
4) Repeat.
5) Close the connection
close(con)
The advantage of this approach is connections. If you omit this step, it will likely slow things down a bit. By opening a connection manually, you essentially open the data set and do not close it until you call the close function. This means that as you loop through the data set you will never lose your place. Imagine that you have a data set with 1e7 rows. Also imagine that you want to load a chunk of 1e5 rows at a time. Since we open the connection we get the first 1e5 rows by running read.csv(con, nrow=1e5,...), then to get the second chunk we run read.csv(con, nrow=1e5,...) as well, and so on....
If we did not use the connections we would get the first chunk the same way, read.csv("file.csv", nrow=1e5,...), however for the next chunk we would need to read.csv("file.csv", skip = 1e5, nrow=2e5,...). Clearly this is inefficient. We are have to find the 1e5+1 row all over again, despite the fact that we just read in the 1e5 row.
Finally, data.table::fread is great. But you can not pass it connections. So this approach does not work.
I hope this helps someone.
UPDATE
People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
You could read it into a database using RSQLite, say, and then use an sql statement to get a portion.
If you need only a single portion then read.csv.sql in the sqldf package will read the data into an sqlite database. First, it creates the database for you and the data does not go through R so limitations of R won't apply (which is primarily RAM in this scenario). Second, after loading the data into the database , sqldf reads the output of a specified sql statement into R and finally destroys the database. Depending on how fast it works with your data you might be able to just repeat the whole process for each portion if you have several.
Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.
DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
See ?read.csv.sql and ?sqldf and also the sqldf home page.

What software package can you suggest for a programmer who rarely works with statistics?

Being a programmer I occasionally find the need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something that I expect to be easier.
As an example to put the question in context, let me quickly show you an example from a CSV file I received today (heavily filtered for brevity):
date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296
I have about 100,000 data points like this with different variables that I want to plot in a scatter plot in order to look for correlations. Usually the data needs to be processed in some way for presentation purposes (such as converting nanoseconds to milliseconds and rounding fractional values), some columns may need to be added or inverted, or combined (like the date/time columns).
The usual recommendation for this kind of work is R and I have recently made a serious effort to use it, but after a few days of work my experience has been that most tasks that I expect to be simple seem to require many steps and have special cases; solutions are often non-generic (for example, adding a data set to an existing plot). It just seems to be one of those languages that people love because of all the powerful libraries that have accumulated over the years rather than the quality and usefulness of the core language.
Don't get me wrong, I understand the value of R to people who are using it, it's just that given how rarely I spend time on this kind of thing I think that I will never become an expert on it, and to a non-expert every single task just becomes too cumbersome.
Microsoft Excel is great in terms of usability but it just isn't powerful enough to handle large data sets. Also, both R and Excel tend to freeze completely (!) with no way out other than waiting or killing the process if you accidentally make the wrong kind of plot over too much data.
So, stack overflow, can you recommend something that is better suited for me? I'd hate to have to give up and develop my own tool, I have enough projects already. I'd love something interactive that could use hardware acceleration for the plot and/or culling to avoid spending too much time on rendering.
#flodin It would have been useful for you to provide an example of the code you use to read in such a file to R. I regularly work with data sets of the size you mention and do not have the problems you mention. One thing that might be biting you if you don't use R often is that if you don't tell R what the column-types R, it has to do some snooping on the file first and that all takes time. Look at argument colClasses in ?read.table.
For your example file, I would do:
dat <- read.csv("foo.csv", colClasses = c(rep("character",2), rep("integer", 3)))
then post process the date and time variables into an R date-time object class such as POSIXct, with something like:
dat <- transform(dat, dateTime = as.POSIXct(paste(date, time)))
As an example, let's read in your example data set, replicate it 50,000 times and write it out, then time different ways of reading it in, with foo containing your data:
> foo <- read.csv("log.csv")
> foo
date time PS.Eden.Space.used PS.Old.Gen.Used
1 2011-06-28 00:00:03 45004472 184177208
2 2011-06-28 00:00:18 45292232 184177208
PS.Perm.Gen.Used
1 94048296
2 94048296
Replicate that, 50000 times:
out <- data.frame(matrix(nrow = nrow(foo) * 50000, ncol = ncol(foo)))
out[, 1] <- rep(foo[,1], times = 50000)
out[, 2] <- rep(foo[,2], times = 50000)
out[, 3] <- rep(foo[,3], times = 50000)
out[, 4] <- rep(foo[,4], times = 50000)
out[, 5] <- rep(foo[,5], times = 50000)
names(out) <- names(foo)
Write it out
write.csv(out, file = "bigLog.csv", row.names = FALSE)
Time loading the naive way and the proper way:
system.time(in1 <- read.csv("bigLog.csv"))
system.time(in2 <- read.csv("bigLog.csv",
colClasses = c(rep("character",2),
rep("integer", 3))))
Which is very quick on my modest laptop:
> system.time(in1 <- read.csv("bigLog.csv"))
user system elapsed
0.355 0.008 0.366
> system.time(in2 <- read.csv("bigLog.csv",
colClasses = c(rep("character",2),
rep("integer", 3))))
user system elapsed
0.282 0.003 0.287
For both ways of reading in.
As for plotting, the graphics can be a bit slow, but depending on your OS this can be sped up a bit by altering the device you plot - on Linux for example, don't use the default X11() device, which uses Cairo, instead try the old X window without anti-aliasing. Also, what are you hoping to see with a data set as large as 100,000 observations on a graphics device with not many pixels? Perhaps try to rethink your strategy for data analysis --- no stats software will be able to save you from doing something ill-advised.
It sounds as if you are developing code/analysis as you go along, on the full data set. It would be far more sensible to just work with a small subset of the data when developing new code or new ways of looking at your data, say with a random sample of 1000 rows, and work with that object instead of the whole data object. That way you guard against accidentally doing something that is slow:
working <- out[sample(nrow(out), 1000), ]
for example. Then use working instead of out. Alternatively, whilst testing and writing a script, set argument nrows to say 1000 in the call to load the data into R (see ?read.csv). That way whilst testing you only read in a subset of the data, but one simple change will allow you to run your script against the full data set.
For data sets of the size you are talking about, I see no problem whatsoever in using R. Your point, about not becoming expert enough to use R, will more than likely apply to other scripting languages that might be suggested, such as python. There is a barrier to entry, but that is to be expected if you want the power of a language such as python or R. If you write scripts that are well commented (instead of just plugging away at the command line), and focus on a few key data import/manipulations, a bit of plotting and some simple analysis, it shouldn't take long to masters that small subset of the language.
R is a great tool, but I never had to resort to use it. Instead I find python to be more than adequate for my needs when I need to pull data out of huge logs. Python really comes with "batteries included" with built-in support for working with csv-files
The simplest example of reading a CSV file:
import csv
with open('some.csv', 'rb') as f:
reader = csv.reader(f)
for row in reader:
print row
To use another separator, e.g. tab and extract n-th column, use
spamReader = csv.reader(open('spam.csv', 'rb'), delimiter='\t')
for row in spamReader:
print row[n]
To operate on columns use the built-in list data-type, it's extremely versatile!
To create beautiful plots I use matplotlib
code
The python tutorial is a great way to get started! If you get stuck, there is always stackoverflow ;-)
There seem to be several questions mixed together:
Can you draw plots quicker and more easily?
Can you do things in R with less learning effort?
Are there other tools which require less learning effort than R?
I'll answer these in turn.
There are three plotting systems in R, namely base, lattice and ggplot2 graphics. Base graphics will render quickest, but making them look pretty can involve pathological coding. ggplot2 is the opposite, and lattice is somewhere in between.
Reading in CSV data, cleaning it and drawing a scatterplot sounds like a pretty straightforward task, and the tools are definitely there in R for solving such problems. Try asking a question here about specific bits of code that feel clunky, and we'll see if we can fix it for you. If your datasets all look similar, then you can probably reuse most of your code over and over. You could also give the ggplot2 web app a try.
The two obvious alternative languages for data processing are MATLAB (and its derivatives: Octave, Scilab, AcslX) and Python. Either of these will be suitable for your needs, and MATLAB in particular has a pretty shallow learning curve. Finally, you could pick a graph-specific tool like gnuplot or Prism.
SAS can handle larger data sets than R or Excel, however many (if not most) people--myself included--find it a lot harder to learn. Depending on exactly what you need to do, it might be worthwhile to load the CSV into an RDBMS and do some of the computations (eg correlations, rounding) there, and then export only what you need to R to generate graphics.
ETA: There's also SPSS, and Revolution; the former might not be able to handle the size of data that you've got, and the latter is, from what I've heard, a distributed version of R (that, unlike R, is not free).

Trimming a huge (3.5 GB) csv file to read into R

So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's county level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.
So my question is this, given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?
Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?
Any ideas would be greatly appreciated.
Update:
Constraints
Needs to use my machine, so no EC2 instances
As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
As you can see below, the data contains mixed types, which I need to operate on later
Data
The data is 3.5GB, with about 8.5 million rows and 17 columns
A couple thousand rows (~2k) are malformed, with only one column instead of 17
These are entirely unimportant and can be dropped
I only need ~100,000 rows out of this file (See below)
Data example:
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC [Malformed row]
[8.5 Mill rows]
I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:
County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]
Results:
After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.
I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want.
It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.
## Open a connection separately to hold the cursor position
file.in <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')
line <- readLines(file.in, n=1)
line.split <- strsplit(line, ';')
# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
## Use a loop to read in the rest of the lines
line <- readLines(file.in, n=1)
while (length(line)) {
line.split <- strsplit(line, ';')
if (length(line.split[[1]]) > 1) {
if (line.split[[1]][3] == '2009') {
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
}
}
line<- readLines(file.in, n=1)
}
close(file.in)
close(file.out)
Failings by Approach:
sqldf
This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.
MapReduce
To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
bigmemory
This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.
scan
Scan seemed to have similar type issues as big memory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
My try with readLines. This piece of a code creates csv with selected years.
file_in <- file("in.csv","r")
file_out <- file("out.csv","a")
x <- readLines(file_in, n=1)
writeLines(x, file_out) # copy headers
B <- 300000 # depends how large is one pack
while(length(x)) {
ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)
if (length(ind)) writeLines(x[ind], file_out)
x <- readLines(file_in, n=B)
}
close(file_in)
close(file_out)
I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including:
mapReduce (pure R)
RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files
Alternatively, R provides several packages to deal with large data that go outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
Is there a similar way to read in files a piece at a time in R?
Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.
If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.
If you find yourself wanting to read many gigs of delimited data into R you should at least research the sqldf package which allows you to import directly into sqldf from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
It passes any arguments along to read.table, so the combination should let you subset pretty tightly.
The ff package is a transparent way to deal with huge files.
You may see the package website and/or a presentation about it.
I hope this helps
What about using readr and the read_*_chunked family?
So in your case:
testfile.csv
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5
lol
Ada County;NC;2013;1;FIRE;Financial;Banks;82.5
Actual code
require(readr)
f <- function(x, pos) subset(x, Year %in% c(2009, 2010))
read_csv2_chunked("testfile.csv", DataFrameCallback$new(f), chunk_size = 1)
This applies f to each chunk, remembering the col-names and combining the filtered results in the end. See ?callback which is the source of this example.
This results in:
# A tibble: 2 × 8
County State Year Quarter Segment `Sub-Segment` `Sub-Sub-Segment` GDP
* <chr> <chr> <int> <int> <chr> <chr> <chr> <dbl>
1 Ada County NC 2009 4 FIRE Financial Banks 801
2 Ada County NC 2010 1 FIRE Financial Banks 825
You can even increase chunk_size but in this example there are only 4 lines.
You could import data to SQLite database and then use RSQLite to select subsets.
Have you consisered bigmemory ?
Check out this and this.
Perhaps you can migrate to MySQL or PostgreSQL to prevent youself from MS Access limitations.
It is quite easy to connect R to these systems with a DBI (available on CRAN) based database connector.
scan() has both an nlines argument and a skip argument. Is there some reason you can just use that to read in a chunk of lines a time, checking the date to see if it's appropriate? If the input file is ordered by date, you can store an index that tells you what your skip and nlines should be that would speed up the process in the future.
These days, 3.5GB just isn't really that big, I can get access to a machine with 244GB RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data type solutions? How much is your time worth? Yes it will take you an hour or two to figure out how to use AWS - but you can learn the basics on a free tier, upload the data and read the first 10k lines into R to check it worked and then you can fire up a big memory instance like r3.8xlarge and read it all in! Just my 2c.
Now, 2017, I would suggest to go for spark and sparkR.
the syntax can be written in a simple rather dplyr-similar way
it fits quite well to small memory (small in the sense of 2017)
However, it may be an intimidating experience to get started...
I would go for a DB and then make some queries to extract the samples you need via DBI
Please avoid importing a 3,5 GB csv file into SQLite. Or at least double check that your HUGE db fits into SQLite limits, http://www.sqlite.org/limits.html
It's a damn big DB you have. I would go for MySQL if you need speed. But be prepared to wait a lot of hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...
Amazon's EC2 could be a good solution also for instantiating a server running R and MySQL.
my two humble pennies worth.

Undo command in R

I can't find something to the effect of an undo command in R (neither on An Introduction to R nor in R in a Nutshell). I am particularly interested in undoing/deleting when dealing with interactive graphs.
What approaches do you suggest?
You should consider a different approach which leads to reproducible work:
Pick an editor you like and which has R support
Write your code in 'snippets', ie short files for functions, and then use the facilities of the editor / R integration to send the code to the R interpreter
If you make a mistake, re-edit your snippet and run it again
You will always have a log of what you did
All this works tremendously well in ESS which is why many experienced R users like this environment. But editors are a subjective and personal choice; other people like Eclipse with StatET better. There are other solutions for Mac OS X and Windows too, and all this has been discussed countless times before here on SO and on other places like the R lists.
In general I do adopt Dirk's strategy. You should aim for your code to be a completely reproducible record of how you have transformed your raw data into output.
However, if you have complex code it can take a long time to re-run it all. I've had code that takes over 30 minutes to process the data (i.e., import, transform, merge, etc.).
In these cases, a single data-destroying line of code would require me to wait 30 minutes to restore my workspace.
By data destroying code I mean things like:
x <- merge(x, y)
df$x <- df$x^2
e.g., merges, replacing an existing variable with a transformation, removing rows or columns, and so on. In these cases, it's easy, especially when first learning R to make a mistake.
To avoid having to wait this 30 minutes, I adopt several strategies:
If I'm about to do something where there's a risk of destroying my active objects, I'll first copy the result into a temporary object. I'll then check that it worked with the temporary object and then rerun replacing it with the proper object.
E.g., first run temp <- merge(x, y); check that it worked str(temp); head(temp); tail(temp) and if everything looks good x <- merge(x, y)
As is common in psychological research, I often have large data frames with hundreds of variables and different subsets of cases. For a given analysis (e.g., a table, a figure, some results text), I'll often extract just the subset of cases and variables that I need into a separate object for the analysis and work with that object when preparing and finalising my analysis code. That way, I'm less likely to accidentally damage my main data frame. This assumes that the results of the analysis does not need to be fed back into the main data frame.
If I have finished performing a large number of complex data transformations, I may save a copy of the core workspace objects. E.g., save(x, y, z , file = 'backup.Rdata') That way, If I make a mistake, I only have to reload these objects.
df$x <- NULL is a handy way of removing a variable in a data frame that you did not want to create
However, in the end I still run all the code from scratch to check that the result is reproducible.

Resources