I am planning to construct a panel dataset and, as a first step, I am trying to create a vector in which each of my 99573 unique ids is repeated 25 times (the same 25 copies will be matched to the 25 years later). What I have so far is:
unique_id <- 1:99573
panel <- c()
for (i in 1:99573) {
  x <- rep(unique_id[i], 25)   # repeat the i-th id 25 times
  panel <- append(panel, x)    # grow the result one block at a time
}
The problem is that the code above takes too long: RStudio keeps processing and never returns any output. Is there another way to speed this up? Please share any ideas.
We don't need a loop here; rep() with the each argument does this in a single vectorized call:
panel <- rep(unique_id, each = 25)
Benchmarks
system.time(panel <- rep(unique_id, each = 25))
# user system elapsed
# 0.046 0.002 0.047
length(panel)
#[1] 2489325
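If the next step is to attach the 25 years to each id, the same vectorized idea builds the full panel skeleton in one shot. A minimal sketch, assuming 25 consecutive years starting in 2000 (the year range is just a placeholder; substitute your own):
unique_id <- 1:99573
years <- 2000:2024                              # 25 placeholder years
panel <- data.frame(
  id   = rep(unique_id, each = length(years)),  # each id repeated 25 times
  year = rep(years, times = length(unique_id))  # the 25 years recycled per id
)
nrow(panel)
#[1] 2489325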
It seems intuitive that .RData files would be the fastest file format for R to load, but when scanning some Stack posts it seems that more attention has gone to speeding up loading of .csv and other formats. Is there a definitive answer?
Not a definitive answer, but below are the times it took to load the same data frame, read as a .tab file with utils::read.delim(), readr::read_tsv(), and data.table::fread(), and as a binary .RData file, each timed with the system.time() function:
.tab with utils::read.delim
system.time(
read.delim("file.tab")
)
# user system elapsed
# 52.279 0.146 52.465
.tab with readr::read_tsv
system.time(
read_tsv("file.tab")
)
# user system elapsed
# 23.417 0.839 24.275
.tab with data.table::fread
At @Roman's request, the same ~500 MB file loaded in a blistering 3 seconds:
system.time(
data.table::fread("file.tab")
)
# Read 49739 rows and 3005 (of 3005) columns from 0.400 GB file in 00:00:04
# user system elapsed
# 3.078 0.092 3.172
.RData binary file of the same dataframe
system.time(
load("file.RData")
)
# user system elapsed
# 2.181 0.028 2.210
Clearly not definitive (sample size = 1!) but in my case with a 500MB data frame:
Binary .RData is quickest
data.table::fread() is a close second
readr::read_tsv is an order of magnitude slower
utils::read.delim() is slowest, taking roughly twice as long as readr
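For anyone who wants to repeat the comparison on their own machine, here is a minimal sketch on a synthetic data frame; the file names and dimensions are purely illustrative and not part of the benchmark above:
library(readr)
library(data.table)

set.seed(1)
df <- as.data.frame(matrix(rnorm(5e4 * 100), nrow = 5e4))    # toy data frame

write.table(df, "bench.tab", sep = "\t", row.names = FALSE)  # tab-delimited copy
save(df, file = "bench.RData")                               # binary copy

system.time(utils::read.delim("bench.tab"))
system.time(readr::read_tsv("bench.tab"))
system.time(data.table::fread("bench.tab"))
system.time(load("bench.RData"))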
I have tried measuring the speed of these two ways of taking a square root:
> system.time(expr = replicate(10000, 1:10000 ** (1/2)))
## user system elapsed
## 0.027 0.001 0.028
> system.time(expr = replicate(10000, sqrt(1:10000)))
## user system elapsed
## 3.722 0.665 4.494
If the sqrt() function cannot compete with ** 0.5, why do we need such a function?
(The system is OS X Yosemite, and the R version is 3.1.2.)
You forgot important parentheses. Here are the timings after correcting that:
system.time(expr = replicate(10000, (1:10000) ** (1/2)))
#user system elapsed
#4.76 0.32 5.12
system.time(expr = replicate(10000, sqrt(1:10000)))
#user system elapsed
#2.67 0.57 3.31
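To isolate the two operations from the overhead of replicate(), the microbenchmark package is one option; a sketch only (not part of the original timings), assuming an example input x:
library(microbenchmark)
x <- 1:10000
microbenchmark(sqrt(x), x ** (1/2), times = 1000)  # compares both on identical input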
To add to @Roland's answer, you fell into the operator-precedence "trap": ^ binds more tightly than : ("** is translated in the parser to ^", as per the documentation in ?"**").
What really happened is
`:`(1, 10000 ** (1/2))
That means ** is evaluated first, and only then is the 1:... sequence built.
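You can see this directly at the console: ** applies to 10000 before : builds the sequence, so the result has only 100 elements.
10000 ** (1/2)
## [1] 100
length(1:10000 ** (1/2))
## [1] 100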
A tip for the future: try to debug your code on small inputs before running expensive operations. For example, testing
1:5 ** (1/2)
## [1] 1 2
sqrt(1:5)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
would reveal the issue.
I am a beginner, and I am trying to cluster a data frame (with 50,000 records) that has 2 features (x, y) using the mclust package. However, running a command such as Mclust(XXX.df) or densityMclust(XXX.df) seems to take forever.
Is there any way to execute the command faster? Example code would be helpful.
For your information, I'm using a 4-core processor with 6 GB of RAM. The same analysis (clustering) took me about 15 minutes in Weka, while in R the process is still running after more than 1.5 hours. I really do want to use R for the analysis.
Dealing with large datasets while using mclust is described in the mclust Technical Report, subsection 11.1.
Briefly, functions Mclust and mclustBIC include a provision for using a subsample of the data in the hierarchical clustering phase before applying EM to the full data set, in order to extend the method to larger datasets.
Generic example:
library(mclust)
set.seed(1)
##
## Data generation
##
N <- 5e3
df <- data.frame(x=rnorm(N)+ifelse(runif(N)>0.5,5,0), y=rnorm(N,10,5))
##
## Full set
##
system.time(res <- Mclust(df))
# > user system elapsed
# > 66.432 0.124 67.439
##
## Subset for initial stage
##
M <- 1e3
system.time(res <- Mclust(df, initialization=list(subset=sample(1:nrow(df), size=M))))
# > user system elapsed
# > 19.513 0.020 19.546
"Subsetted" version runs approximately 3.5 times faster on my Dual Core (although Mclust uses only single core).
When N <- 5e4 (as in your example) and M <- 1e3, the subsetted version took about 3.5 minutes to complete.
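The question also mentions densityMclust(); since it is built on the same fitting machinery, the same subsampling trick should apply there as well. A sketch only (I have not timed this, and I am assuming densityMclust() forwards the initialization argument the same way Mclust() does):
M <- 1e3
dens <- densityMclust(df, initialization = list(subset = sample(1:nrow(df), size = M)))
plot(dens, what = "density")   # visual check of the fitted density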
I want to calculate the mean for each "Day", but only for a portion of the day (Time = 12-14). The code below works for me, but I have to enter each day as a new line of code, which will amount to hundreds of lines.
This seems like it should be simple to do. I've done this easily when the grouping variables are the same, but I don't know how to do it when I don't want to include all values for the day.
Is there a better way to do this?
sapply(sap[sap$Day==165 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
sapply(sap[sap$Day==166 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
Here's what the data looks like:
Day Time StomCond_Trunc
165 12 33.57189926
165 12.1 50.29437636
165 12.2 35.59876214
165 12.3 24.39879768
Try this:
aggregate(StomCond_Trunc~Day,data=subset(sap,Time>=12 & Time<=14),mean)
If you have a large dataset, you may also want to look into the data.table package. Converting a data.frame to a data.table is quite easy.
Example:
Large(ish) dataset
df <- data.frame(Day=1:1000000,Time=sample(1:14,1000000,replace=T),StomCond_Trunc=rnorm(100000)*20)
Using aggregate on the data.frame
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
user system elapsed
16.255 0.377 24.263
Converting it to a data.table (the data.table package needs to be loaded first)
library(data.table)
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
user system elapsed
9.534 0.178 15.270
Update from Matthew: this timing has improved dramatically since the answer was originally written, due to a new optimization feature in data.table 1.8.2.
Retesting the difference between the two approaches, using data.table 1.8.2 in R 2.15.1:
df <- data.frame(Day=1:1000000,
Time=sample(1:14,1000000,replace=T),
StomCond_Trunc=rnorm(100000)*20)
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
# user system elapsed
# 10.19 0.27 10.47
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
# user system elapsed
# 0.31 0.00 0.31
Using your original method, but with less typing:
sapply(sap[sap$Day==165 & sap$Time %in% seq(12, 14, 0.1), ],mean)
However, this is only slightly better than your original method. It's not as flexible as the other answers, since it depends on the 0.1 increments in your time values; the other methods don't care about the increment size, which makes them more versatile. I'd recommend @Maiasaura's answer using data.table.
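For completeness, a sketch of what that data.table route would look like on a table shaped like your sample (assuming sap has the Day, Time and StomCond_Trunc columns shown above; the result column name mean_StomCond is just illustrative):
library(data.table)
sap_dt <- as.data.table(sap)
sap_dt[Time >= 12 & Time <= 14,                  # keep only the 12:00-14:00 window
       .(mean_StomCond = mean(StomCond_Trunc)),  # named mean per group
       by = Day]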