What are the file formats that read into R the fastest?

It seems most intuitive that .RData files might be the fastest file format for R to load, but when scanning some of the Stack Overflow posts it seems that more attention has gone into improving load times for .csv and other formats. Is there a definitive answer?

Not a definitive answer, but below are the times it took to load the same data frame from a .tab file with utils::read.delim(), readr::read_tsv() and data.table::fread(), and from a binary .RData file of the same object, all timed with system.time():
.tab with utils::read.delim
system.time(
  read.delim("file.tab")
)
# user system elapsed
# 52.279 0.146 52.465
.tab with readr::read_tsv
library(readr)
system.time(
  read_tsv("file.tab")
)
# user system elapsed
# 23.417 0.839 24.275
.tab with data.table::fread
At @Roman's request, the same ~500 MB file loaded in a blistering 3 seconds:
system.time(
  data.table::fread("file.tab")
)
# Read 49739 rows and 3005 (of 3005) columns from 0.400 GB file in 00:00:04
# user system elapsed
# 3.078 0.092 3.172
.RData binary file of the same dataframe
system.time(
  load("file.RData")
)
# user system elapsed
# 2.181 0.028 2.210
Clearly not definitive (sample size = 1!), but in my case, with a ~500 MB data frame:
Binary .RData is quickest
data.table::fread() is a close second
readr::read_tsv is an order of magnitude slower
utils::read.delim() is slowest, taking roughly twice as long as readr
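For anyone who wants to repeat the comparison on their own data, here is a minimal sketch; the file names file.tab and file.RData are placeholders for your own files, and the absolute timings will of course differ by machine:
library(readr)
library(data.table)

# Time each reader on the same data; adjust the paths to point at your files.
timings <- list(
  read.delim = system.time(utils::read.delim("file.tab")),
  read_tsv   = system.time(readr::read_tsv("file.tab")),
  fread      = system.time(data.table::fread("file.tab")),
  RData      = system.time(load("file.RData"))
)

# Elapsed seconds per method
sapply(timings, `[[`, "elapsed")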

Related

R - How to speed up panel data construction process using for-loop

I am planning to construct a panel dataset and, as a first step, I am trying to create a vector in which each of the 99,573 unique ids is repeated 25 times (the same repeated ids will later be assigned to each year). What I have so far is:
unique_id <- 1:99573
panel <- c()
for (i in 1:99573) {
  x <- rep(unique_id[i], 25)
  panel <- append(panel, x)  # grows the vector on every iteration
}
The problem is that the code above takes too much time: RStudio keeps processing and does not give me any output. Is there any other way to speed up the process? Please share any ideas with me.
We don't need a loop here
panel <- rep(unique_id, each = 25)
Benchmarks
system.time(panel <- rep(unique_id, each = 25))
# user system elapsed
# 0.046 0.002 0.047
length(panel)
#[1] 2489325
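If the eventual goal is a full id-by-year panel rather than just the id vector, the same vectorised idea extends naturally. A minimal sketch, assuming 25 hypothetical years (substitute the actual years):
unique_id <- 1:99573
years <- 1:25   # hypothetical placeholder for the 25 actual years

panel_df <- data.frame(
  id   = rep(unique_id, each = length(years)),   # each id repeated once per year
  year = rep(years, times = length(unique_id))   # the year sequence cycled within each id
)

nrow(panel_df)
# [1] 2489325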

Exporting data.frame in R into a .txt file

I have a data.frame that has 15 columns and looks like the following:
Word Syllable TimeStart TimeEnd Duration PitchMin PitchMax TimePitchMin
Einen "aI 0.00 0.11 0.11 98.173 106.158 0.053
Einen n#n 0.11 0.24 0.13 106.158 123.176 0.110
TimePitchMax PitchSlope IntenMax IntenMin TimeIntenMax TimeIntenMin PitchAccent
0.110 140.443 83.794 82.583 0.095 0.051 no
0.210 169.359 83.875 80.458 0.210 0.234 no
I want to save the data into a .txt file, but when I use the standard write.table(table, "outfile.txt") call, the result looks like a mess.
What appropriate arguments can be used to solve this problem?
EDIT: [screenshot of the garbled output omitted]
What happens if you use write.table(table, "outfile.txt", sep="\t", row.names=FALSE)? That should help you create a tab-delimited text file.
If the output still looks like a mess, you can export your file as a csv with write.csv(table, "outfile.txt", row.names=FALSE).
Did you check the structure of your table with str(table) before exporting? It looks like the table may contain some malformed variable names and/or values, which may in turn cause export problems. In an ideal case, when you run str(table), you should see that the table object is a data.frame (or tibble) with proper variable names and values. If you see variable names like """ or c(9,11,11, ...) etc., that's a signal that your problem is with how you create table, not with how you export it.
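Putting those suggestions together, a minimal export sketch might look like the following; quote = FALSE is an extra assumption on my part (it only makes sense if the character columns contain no embedded tabs):
# Inspect the structure first: variable names and types should look sane
str(table)

# Tab-delimited export; quote = FALSE is optional and assumes no embedded tabs in the data
write.table(table, "outfile.txt", sep = "\t", row.names = FALSE, quote = FALSE)

# Or export as CSV, which spreadsheet programs open cleanly
write.csv(table, "outfile.csv", row.names = FALSE)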

how to run mclust faster on 50000 records dataset

I am a beginner trying to cluster a data frame (with 50,000 records) that has 2 features (x, y) using the mclust package. However, it feels like forever to run a command (e.g. Mclust(XXX.df) or densityMclust(XXX.df)).
Is there any way to execute the command faster? Example code would be helpful.
For your info, I'm using a 4-core processor with 6 GB RAM. The same analysis (clustering) took me about 15 minutes in Weka, while in R the process is still running after 1.5 hours. I do really want to use R for the analysis.
Dealing with large datasets while using mclust is described in the package's Technical Report, subsection 11.1.
Briefly, functions Mclust and mclustBIC include a provision for using a subsample of the data in the hierarchical clustering phase before applying EM to the full data set, in order to extend the method to larger datasets.
Generic example:
library(mclust)
set.seed(1)
##
## Data generation
##
N <- 5e3
df <- data.frame(x=rnorm(N)+ifelse(runif(N)>0.5,5,0), y=rnorm(N,10,5))
##
## Full set
##
system.time(res <- Mclust(df))
# > user system elapsed
# > 66.432 0.124 67.439
##
## Subset for initial stage
##
M <- 1e3
system.time(res <- Mclust(df, initialization=list(subset=sample(1:nrow(df), size=M))))
# > user system elapsed
# > 19.513 0.020 19.546
"Subsetted" version runs approximately 3.5 times faster on my Dual Core (although Mclust uses only single core).
With N <- 5e4 (as in your example) and M <- 1e3, the subsetted version took about 3.5 minutes to complete.
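For reference, here is a sketch of the call scaled up to the size in the question (N of 50,000, subsample of 1,000); it follows the same pattern as the generic example above, and the timings will vary by machine:
library(mclust)
set.seed(1)

N <- 5e4
df <- data.frame(x = rnorm(N) + ifelse(runif(N) > 0.5, 5, 0),
                 y = rnorm(N, 10, 5))

# Subsample used only for the initial hierarchical clustering phase
M <- 1e3
idx <- sample(1:nrow(df), size = M)
system.time(res <- Mclust(df, initialization = list(subset = idx)))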

R: Calculate means for subset of a group

I want to calculate the mean for each "Day", but only for a portion of the day (Time = 12-14). The code below works for me, but I have to enter each day as a new line of code, which will amount to hundreds of lines.
This seems like it should be simple to do. I've done this easily when the grouping variables are the same, but I don't know how to do it when I don't want to include all values for the day.
Is there a better way to do this?
sapply(sap[sap$Day==165 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
sapply(sap[sap$Day==166 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
Here's what the data looks like:
Day Time StomCond_Trunc
165 12 33.57189926
165 12.1 50.29437636
165 12.2 35.59876214
165 12.3 24.39879768
Try this:
aggregate(StomCond_Trunc~Day,data=subset(sap,Time>=12 & Time<=14),mean)
If you have a large dataset, you may also want to look into the data.table package. Converting a data.frame to a data.table is quite easy.
Example:
Large(ish) dataset:
df <- data.frame(Day=1:1000000,
                 Time=sample(1:14,1000000,replace=T),
                 StomCond_Trunc=rnorm(100000)*20)
Using aggregate on the data.frame:
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
# user system elapsed
# 16.255 0.377 24.263
Converting it to a data.table:
library(data.table)
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
# user system elapsed
# 9.534 0.178 15.270
Update from Matthew: this timing has improved dramatically since the question was originally answered, due to a new optimization feature in data.table 1.8.2.
Retesting the difference between the two approaches, using data.table 1.8.2 in R 2.15.1:
df <- data.frame(Day=1:1000000,
Time=sample(1:14,1000000,replace=T),
StomCond_Trunc=rnorm(100000)*20)
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
# user system elapsed
# 10.19 0.27 10.47
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
# user system elapsed
# 0.31 0.00 0.31
Using your original method, but with less typing:
sapply(sap[sap$Day==165 & sap$Time %in% seq(12, 14, 0.1), ],mean)
However, this is only a slight improvement over your original method. It's not as flexible as the other answers, since it depends on 0.1 increments in your time values; the other methods don't care about the increment size, which makes them more versatile. I'd recommend @Maiasaura's answer using data.table.
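One caveat with matching on seq(): floating-point representation can make a value like 12.3 generated by seq(12, 14, 0.1) differ infinitesimally from the 12.3 stored in the data, so rounding both sides before comparing is a safer variant (a sketch using the question's column names):
# Round both sides to one decimal so tiny floating-point differences don't break %in%
times <- round(seq(12, 14, 0.1), 1)
idx   <- sap$Day == 165 & round(sap$Time, 1) %in% times
sapply(sap[idx, ], mean)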

MPI parallelization using SNOW is slow

My foray into parallelization continues. I initially had difficulty installing Rmpi, but I got that going (I needed to sudo apt-get it). I should say that I'm running a machine with Ubuntu 10.10.
I ran the same simulation as my previous question. Recall the system times for the unclustered and SNOW SOCK cluster respectively:
> system.time(CltSim(nSims=10000, size=100))
user system elapsed
0.476 0.008 0.484
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
user system elapsed
0.028 0.004 0.375
Now, using an MPI cluster, I get a slowdown relative to not clustering:
> stopCluster(cl)
> cl <- getMPIcluster()
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
user system elapsed
0.088 0.196 0.604
Not sure whether this is useful, but here's info on the cluster created:
> cl
[[1]]
$rank
[1] 1
$RECVTAG
[1] 33
$SENDTAG
[1] 22
$comm
[1] 1
attr(,"class")
[1] "MPInode"
[[2]]
$rank
[1] 2
$RECVTAG
[1] 33
$SENDTAG
[1] 22
$comm
[1] 1
attr(,"class")
[1] "MPInode"
attr(,"class")
[1] "spawnedMPIcluster" "MPIcluster" "cluster"
Any idea about what could be going on here? Thanks for your assistance as I try out these parallelization options.
Cordially,
Charlie
It's a bit the same story as with your other question: the communication between the nodes in the cluster takes up more time than the actual computation.
This can be illustrated by changing your functions:
library(snow)
cl <- makeCluster(2)

SnowSim <- function(cluster, nSims = 10, n) {
  parSapply(cluster, 1:nSims, function(x) {
    Sys.sleep(n)
    x
  })
}

library(foreach)
library(doSNOW)
registerDoSNOW(cl)

ForSim <- function(nSims = 10, n) {
  foreach(i = 1:nSims, .combine = c) %dopar% {
    Sys.sleep(n)
    i
  }
}
This way we can simulate a long calculation and a short calculation over different numbers of simulations. Let's take two cases: one where you have a 1-second calculation and 10 loops, and one with a 1 ms calculation and 10,000 loops. Both should last 10 seconds when run sequentially:
> system.time(SnowSim(cl,10,1))
user system elapsed
0 0 5
> system.time(ForSim(10,1))
user system elapsed
0.03 0.00 5.03
> system.time(SnowSim(cl,10000,0.001))
user system elapsed
0.02 0.00 9.78
> system.time(ForSim(10000,0.001))
user system elapsed
10.04 0.00 19.81
Basically what you see is that for long-running functions and few simulations, the parallelized versions cleanly cut the calculation time in half, as expected.
Now the simulations you do are of the second kind. There you see that the snow solution doesn't really make a difference any more, and that the foreach solution even takes twice as long. This is simply due to the overhead of communication to and between nodes, and of handling the data that gets returned. The overhead of foreach is a lot bigger than with snow, as shown in my answer to your previous question.
I didn't fire up my Ubuntu box to try this with an MPI cluster, but it's basically the same story: there are subtle differences between the cluster types in the time needed for communication, partly due to differences between the underlying packages.
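One way to mitigate that communication overhead, whatever the cluster type, is to hand each worker a few large chunks of work instead of many tiny tasks. A rough sketch; since CltSim's internals aren't shown here, the replicate(mean(rnorm(...))) body is just a stand-in for a single simulation:
library(snow)

# One chunk per worker: each worker is contacted once instead of once per simulation.
ChunkedSim <- function(cluster, nSims = 10000, size = 100) {
  nWorkers   <- length(cluster)
  chunkSizes <- rep(nSims %/% nWorkers, nWorkers)
  chunkSizes[1] <- chunkSizes[1] + nSims %% nWorkers   # put any remainder in the first chunk
  res <- parLapply(cluster, chunkSizes, function(k, size) {
    replicate(k, mean(rnorm(size)))   # placeholder for a single simulation
  }, size = size)
  unlist(res)
}

# cl <- makeCluster(2)
# system.time(ChunkedSim(cl, nSims = 10000, size = 100))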
