How to speed up sequential analysis in R?

I need to sequentially analyze a dataset, using sub-results of the earlier operations.
Since I know R, I decided to work with it, and one of the solutions I tried is a for loop.
The dataset I loop through has around 8 million rows with 4 columns.
I use a data.table, and the variables are of type character, e.g. "XXXXXXXXX".
I tried looping through, but it takes approx. 0.7 seconds per cycle, of which the "<-" assignment takes half a second.
Can anybody recommend a better technique, potentially Rcpp, apply, or something else?
Thanks for your support,
Holger
'%!in%' <- function(x, y) !('%in%'(x, y))
library('data.table')
dt_loop <- data.table(
  paste0("XXXXXXXXXX", 1:80000000),
  paste0("YXXXXXXXXX", 1:80000000),
  paste0("ZXXXXXXXXX", 1:80000000),
  paste0("AXXXXXXXXX", 1:80000000)
)
colnames(dt_loop)[colnames(dt_loop) == "V1"] <- "m"
colnames(dt_loop)[colnames(dt_loop) == "V2"] <- "c"
colnames(dt_loop)[colnames(dt_loop) == "V3"] <- "ma"
colnames(dt_loop)[colnames(dt_loop) == "V4"] <- "unused"
for(i in 1:nrow(dt_loop)){
  m <- dt_loop$m[i]
  c <- dt_loop$m[i]
  if(m %!in% dt_loop$ma[1:i] & c %!in% dt_loop$ma[1:i]){
    dt_loop$ma[i] <- m
  } else {
    if(m %in% dt_loop$ma[1:i]){
      dt_loop$ma[i] <- m
    } else {
      dt_loop$ma[i] <- c
    }
  }
}

This is a self-join Cartesian product solution. I modified your code to get somewhat meaningful results. I also think that with 8 million rows you're going to run into performance trouble whenever the nth iteration depends on the iterations before it.
Changes in data structure:
- Used sample to get some repeats in the data.table
- Simplified the column renaming with data.table's setnames()
- Added an ID field
- Removed the unused column
'%!in%' <- function(x, y) !('%in%'(x, y))
library('data.table')

# Generate Data -----------------------------------------------------------
set.seed(1)
n_rows <- 10
dt_loop <- data.table(
  sample(paste0("X", 1:n_rows), n_rows, replace = T),
  sample(paste0("Y", 1:n_rows), n_rows, replace = T),
  sample(paste0("X", 1:n_rows), n_rows, replace = T)
)
setnames(dt_loop, c('m', 'c', 'ma'))
dt_loop[, ID := .I]
I made significant changes to your loop:
- Assigned c <- dt_loop$c[i], as I don't know what using m there did
- Removed the first if statement because of the new assignment of c
# Original loop with Minor Mod --------------------------------------------
for(i in 1:nrow(dt_loop)){
  m <- dt_loop$m[i]
  c <- dt_loop$c[i]  # changed to c instead of m
  # Removed first ifelse condition
  # as it didn't make sense as originally constructed
  # if(m %!in% dt_loop$ma[1:i] & c %!in% dt_loop$ma[1:i]){
  #   dt_loop$ma2[i] <- m
  # } else {
  if(m %in% dt_loop$ma[1:i]){
    dt_loop$ma2[i] <- m
  } else {
    dt_loop$ma2[i] <- c
  }
  # }
}
dt_loop
      m   c  ma ID ma2
 1:  X3  Y3 X10  1  Y3
 2:  X4  Y2  X3  2  Y2
 3:  X6  Y7  X7  3  Y7
 4: X10  Y4  X2  4 X10
 5:  X3  Y8  X3  5  X3
 6:  X9  Y5  X4  6  Y5
 7: X10  Y8  X1  7 X10
 8:  X7 Y10  X4  8  X7
 9:  X7  Y4  X9  9  X7
10:  X1  Y8  X4 10  X1
The self-join seems to be faster than the loop when I increase the rows to 10,000, but it still slows down. Of note, you can see where there is duplication in ma because the Cartesian product expands the results, so you get N == 2.
I believe there are ways to get the self-join to return only the nth row, which should relieve some pressure.
dt_loop[dt_loop
        , on = .(ID <= ID
                 , ma = m)
        , .(.N
            , i.ma2  # for comparison - remove
            , ma3 = ifelse(is.na(x.ID), i.c, i.m)
            , i.ID, i.m, i.c, i.ma
            , x.ID, x.m, x.c, x.ma
        )
        , by = .EACHI
        , allow.cartesian = T]
    ID  ma N i.ma2 ma3 i.ID i.m  i.c i.ma x.ID  x.m  x.c x.ma
 1:  1  X3 0    Y3  Y3    1  X3   Y3  X10   NA <NA> <NA> <NA>
 2:  2  X4 0    Y2  Y2    2  X4   Y2   X3   NA <NA> <NA> <NA>
 3:  3  X6 0    Y7  Y7    3  X6   Y7   X7   NA <NA> <NA> <NA>
 4:  4 X10 1   X10 X10    4 X10   Y4   X2    1   X3   Y3  X10
 5:  5  X3 2    X3  X3    5  X3   Y8   X3    2   X4   Y2   X3
 6:  5  X3 2    X3  X3    5  X3   Y8   X3    5   X3   Y8   X3
 7:  6  X9 0    Y5  Y5    6  X9   Y5   X4   NA <NA> <NA> <NA>
 8:  7 X10 1   X10 X10    7 X10   Y8   X1    1   X3   Y3  X10
 9:  8  X7 1    X7  X7    8  X7  Y10   X4    3   X6   Y7   X7
10:  9  X7 1    X7  X7    9  X7   Y4   X9    3   X6   Y7   X7
11: 10  X1 1    X1  X1   10  X1   Y8   X4    7  X10   Y8   X1
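As a sketch of that idea (untested at scale): data.table's mult = "first" argument keeps only the first matching row per row of i instead of expanding the Cartesian product, and the logic above only needs to know whether any match exists at all:
dt_loop[dt_loop
        , on = .(ID <= ID, ma = m)
        , mult = "first"  # keep just the first match rather than all of them
        , .(ID = i.ID, ma3 = ifelse(is.na(x.ID), i.c, i.m))]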

Consider all the variables created in a for loop as a vector, e.g. `c(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10)`

This code:
for (i in 1:10) {
  assign(paste0('x', i), i)
}
creates 10 variables in a for loop. I would like to consider all these variables in a vector, e.g.
c(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10)
The problem can be solved with a combination of ls and mget.
for (i in 1:10) {
  assign(paste0('x', i), i)
}
x_vars <- ls(pattern = "^x\\d+$", envir = .GlobalEnv)
x_all <- mget(x_vars, envir = .GlobalEnv)
x_all <- sort(unlist(x_all))
x_all
# x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
# 1 2 3 4 5 6 7 8 9 10
I am assuming the vector elements must be in the same order they were created in the for loop.
Or, without creating x_vars:
x_all <- mget(ls(pattern = "^x\\d+$", envir = .GlobalEnv), envir = .GlobalEnv)
x_all <- sort(unlist(x_all))
No loop needed. Why not just name a vector of values?
x_all <- 1:10
names(x_all) <- paste0("x", x_all)
x_all
#----------------
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 2 3 4 5 6 7 8 9 10
An option with str_c:
library(stringr)
names(x_all) <- str_c("x", x_all)
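For completeness, base R's setNames() collapses the value-and-name steps into a single line:
x_all <- setNames(1:10, paste0("x", 1:10))  # same result as naming afterwards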

bin and transpose in R

I am still getting the hang of R and coding in general, so bear with me on this.
My problem: this is a dimension reduction idea I have, consisting of three steps. I need help with the first two.
1. Bin the rows
2. Transpose the binned rows into new columns, so the columns increase and the rows decrease by the bin size
3. Perform PCA to then reduce the columns
So the data would go from this:
A B C D
1 W1 X1 Y1 Z1
2 W2 X2 Y2 Z2
3 W3 X3 Y3 Z3
4 W4 X4 Y4 Z4
5 W5 X5 Y5 Z5
6 W6 X6 Y6 Z6
So, if I bin by 2 and transpose, it would look something like this:
A A B B C C D D
1 W1 W2 X1 X2 Y1 Y2 Z1 Z2
2 W3 W4 X3 X4 Y3 Y4 Z3 Z4
3 W5 W6 X5 X6 Y5 Y6 Z5 Z6
I'm pretty sure I need to nest bin and transpose in some sort of function, but I'm not sure which comes first, or really at all how to approach this, so any suggestions will help!
I really hope this makes some sense, let me know how I can rephrase if needed!
EDIT
I am working with integer data types; here is a snippet of my actual data I'd like to bin and expand.
> head(dataset[1:4])
EMG1 EMG2 EMG3 EMG4
1 32744 32571 32935 32279
2 32788 32934 32767 32624
3 32828 33202 32587 32377
4 32870 33269 32423 32954
5 32838 33319 32126 32721
6 32903 33502 32652 32151
Assuming these letter-digit entries are not supposed to be stand-ins for numerics, I would first run this:
dat[] <- lapply(dat, as.character) # ensures we get rid of factors
This uses recycling of logical indices inside a function that gets serially applied across your dataframe to create two lists from each column. That is then coerced to a dataframe. The initial result res has rather odd names which get shortened with some simple regex work.
res <- data.frame(lapply(dat,
                         function(cl) list(cl[c(TRUE, FALSE)],
                                           cl[!c(TRUE, FALSE)])))
names(res) <- sub("\\..+$", "", names(res))
> res
A A B B C C D D
1 W1 W2 X1 X2 Y1 Y2 Z1 Z2
2 W3 W4 X3 X4 Y3 Y4 Z3 Z4
3 W5 W6 X5 X6 Y5 Y6 Z5 Z6
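To generalize (a rough sketch, only tried against this toy case; bin_transpose is just a hypothetical helper name): matrix() with byrow = TRUE does the binning for an arbitrary bin size k, assuming the row count is a multiple of k.
bin_transpose <- function(dat, k) {
  stopifnot(nrow(dat) %% k == 0)  # assumes complete bins only
  res <- as.data.frame(do.call(cbind,
           lapply(dat, function(cl) matrix(cl, ncol = k, byrow = TRUE))))
  names(res) <- rep(names(dat), each = k)  # duplicated headers, as in the output above
  res
}
bin_transpose(dat, 2)  # reproduces the A A B B C C D D layout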

R shifting values to the right

I believe I'm facing a very simple problem.
Example data:
example <- data.frame(x = c(1,2,3,4,5,6,7,8), y = c(1,2,3,4,5,6,7,8), z = c(1,2,3,4,5,6,7,8))
which looks like this:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
This is the ideal case. Because my data comes in a single-column format, I am formatting it into a data frame like this. Now, sometimes (not in every observation) there is an additional value that always comes at the same place (e.g. always in the same column after formatting it into a DF).
It looks like this:
data.frame(x = c("X1","X2","X3","X4","X5","X6","X7","X8"),
           y = c("Y1","Y2","Y3","A38","A15","A8","Y7","Y8"),
           z = c("Z1","Z2","Z3","Y4","Y5","Y6","Z7","Z8"),
           aa = c(NA,NA,NA,"Z4","Z5","Z6",NA,NA))
resulting in:
x y z aa
1 X1 Y1 Z1 <NA>
2 X2 Y2 Z2 <NA>
3 X3 Y3 Z3 <NA>
4 X4 A38 Y4 Z4
5 X5 A15 Y5 Z5
6 X6 A8 Y6 Z6
7 X7 Y7 Z7 <NA>
8 X8 Y8 Z8 <NA>
The "aa" column was created in the first place only because I had an additional value (the A-something) in the original single-column file. I want to save the A-values in their own column (always starting with an A, for instance), and I do not really care about the "aa" one. I just want to shift all the values like this:
x y z aa A
1 X1 Y1 Z1 <NA> <NA>
2 X2 Y2 Z2 <NA> <NA>
3 X3 Y3 Z3 <NA> <NA>
4 X4 Y4 Z4 Z4 A38
5 X5 Y5 Z5 Z5 A15
6 X6 Y6 Z6 Z6 A8
7 X7 Y7 Z7 <NA> <NA>
8 X8 Y8 Z8 <NA> <NA>
For the sake of the example I have put only a few columns here, but in the real case there can be as many as 300 columns, so almost that many values to shift left, up to the point where we find NA in that row.
I have worked around this type of thing so far using this code:
format_A_things <- function(df, col) {
  x <<- grep("A", df$ColumnWhereAareExpected)  # selecting lines containing an A
  if (length(x) > 0){
    for (i in x){
      df[i, "SpecificColumnforA"] = df[i, col]
      for (j in col:(ncol(df) - 2)){
        if (is.na(df[i, j]) | is.na(df[i, j + 1])){
          df[i, j] = NA
        } else {
          df[i, j] = df[i, j + 1]
        }
      }
    }
  }
  return(df)
}
This works fine, except that it's very slow (I may have several hundred lines to go through, and it's taking tens of minutes, which is not ideal). I know I've probably used the slowest way to do this type of thing, hence my question to you all: how do I achieve such a result faster?
Thanks ahead of time for your help!
Based on the comments, here is how I transformed my function to perform the action:
countString <- function(df, col, str) {
  sel <<- grepl(str, df[, col], fixed = TRUE)
  while (any(sel)){
    df[sel, "Column"] = df[sel, "Column"] + 1
    # shift the affected rows one column to the left
    df[sel, col:(ncol(df) - 2)] <- df[sel, (col + 1):(ncol(df) - 1)]
    sel <<- grepl(str, df[, col], fixed = TRUE)
  }
  return(df)
}
I would have hoped that, with the while loop and the selection being reset at the end of each pass, it would go through several occurrences if needed. It does not, so when I call my function I recompute the selection, and if any(sel) is still true I call the function again.
Anyway, it is much faster than my previous method, which was the primary intent here.
Thanks everyone for helping out.
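For reference, a fully vectorized sketch of the single-shift case, assuming the A-values can only appear in column y and all columns are character (shift_A_left is a hypothetical name; the column ranges would need adapting to the real 300-column layout):
shift_A_left <- function(df) {
  sel <- grepl("^A", df$y)       # rows where y holds an A-value
  df$A <- ifelse(sel, df$y, NA)  # park the A-values in their own column
  # shift everything after x one step to the left on the affected rows
  df[sel, 2:(ncol(df) - 2)] <- df[sel, 3:(ncol(df) - 1)]
  df
}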

How to "unmelt" data with reshape in R

I have a data frame that I melted using the reshape package and that I would like to "unmelt".
Here is a toy example of the melted data (the real data frame is 500x100 or larger):
variable<-c(rep("X1",3),rep("X2",3),rep("X3",3))
value<-c(rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3))
dat <-data.frame(variable,value)
dat
variable value
1 X1 0.5285376
2 X1 0.5285376
3 X1 0.5285376
4 X2 0.1694908
5 X2 0.1694908
6 X2 0.1694908
7 X3 0.7446906
8 X3 0.7446906
9 X3 0.7446906
Each variable (X1, X2, X3) has values estimated at 3 different times (which in this toy example happen to be the same, but that is never the case in practice).
I would like to get it (back) in the form of :
X1 X2 X3
1 0.5285376 0.1694908 0.7446906
2 0.5285376 0.1694908 0.7446906
3 0.5285376 0.1694908 0.7446906
Basically, I would like the variable column to be sorted on ID (X1, X2, etc.) and become the column headings. I have tried various permutations of cast, dcast, recast, etc., and can't seem to get the data into the format that I want. It was easy enough to melt the data from the wide form to the longer form (e.g. the dat dataset), but getting it back is proving difficult. Any ideas? I know this is relatively simple, but I am having a hard time conceptualizing how to do this in reshape or reshape2.
Thanks,
LP
I typically do this by creating an id column and then using dcast:
> dat
variable value
1 X1 0.4299397
2 X1 0.4299397
3 X1 0.4299397
4 X2 0.2531551
5 X2 0.2531551
6 X2 0.2531551
7 X3 0.3972119
8 X3 0.3972119
9 X3 0.3972119
> dat$id <- rep(1:3,times = 3)
> dcast(data = dat,formula = id~variable,fun.aggregate = sum,value.var = "value")
id X1 X2 X3
1 1 0.4299397 0.2531551 0.3972119
2 2 0.4299397 0.2531551 0.3972119
3 3 0.4299397 0.2531551 0.3972119
Depending on how robust you need this to be, the following will correctly cast for varying numbers of occurrences of the variables (and in any order).
> variable<-c(rep("X1",5),rep("X2",4),rep("X3",3))
> value<-c(rep(rnorm(1,.5,.2),5),rep(rnorm(1,.5,.2),4),rep(rnorm(1,.5,.2),3))
> dat <-data.frame(variable,value)
> dat <- dat[order(rnorm(nrow(dat))),]
> dat
variable value
11 X3 1.0294454
8 X2 0.6147509
2 X1 0.3537012
7 X2 0.6147509
9 X2 0.6147509
5 X1 0.3537012
4 X1 0.3537012
12 X3 1.0294454
3 X1 0.3537012
1 X1 0.3537012
10 X3 1.0294454
6 X2 0.6147509
> dat$id = numeric(nrow(dat))
> for (i in 1:nrow(dat)){
+ dat_temp <- dat[1:i,]
+ dat[i,]$id <- nrow(dat_temp[dat_temp$variable == dat[i,]$variable,])
+ }
> cast(dat, id~variable, value = 'value')
id X1 X2 X3
1 1 0.3537012 0.6147509 1.029445
2 2 0.3537012 0.6147509 1.029445
3 3 0.3537012 0.6147509 1.029445
4 4 0.3537012 0.6147509 NA
5 5 0.3537012 NA NA
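As a side note (a base R sketch, not part of the answer above): the occurrence-counting loop can be vectorized with ave(), which should scale much better on a 500x100 frame:
dat$id <- ave(seq_along(dat$variable), dat$variable, FUN = seq_along)  # nth occurrence within each variable
dcast(data = dat, formula = id ~ variable, value.var = "value")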

Selecting values from a 3-column dataframe in R

I have a 3-column data frame, the variables being x, y and z: x is a list of places, y is a list of times, and z is a list of names. The list of names does not start at the same initial time across the places:
x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2
How do I find the first non-NA z for every x? I want the output matrix or data frame to be:
x z
x1 z2
x2 z5
x3 z3
EDITED, after example data was supplied
You can use the function ddply() from package plyr:
dat <- "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2"
df <- read.table(textConnection(dat), header=TRUE, stringsAsFactors=FALSE)
library(plyr)
ddply(df, .(x), function(x)x[!is.na(x$z), ][1, "z"])
x V1
1 x1 z2
2 x2 z5
3 x3 z3
If you don't want to use plyr:
t(data.frame(lapply(split(df, as.factor(df$x)), function(k) head(k$z[!is.na(k$z)], 1))))
[,1]
x1 "z2"
x2 "z5"
x3 "z3"
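A base R alternative (a sketch; it relies on aggregate()'s default na.action dropping rows with NA in z, so the first remaining z per x is the first non-NA one):
aggregate(z ~ x, data = df, FUN = function(v) v[1])
#    x  z
# 1 x1 z2
# 2 x2 z5
# 3 x3 z3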
