Randomizing a column in a list of data frames - r

I want to have multiple copies of a dataframe, each with a new randomization of one variable. My objective is to run multiple iterations of an analysis with a randomized value for that variable.
I've started by creating a list of data frames, each a copy of my original data frame:
a <- c(1, 2, 3, 4, 5)
b <- c(45, 34, 50, 100, 64)
test <- data.frame(a, b)
test2 <- lapply(1:2, function(x) test) # List of 2 data frames, each identical to test
I know about transform and sample, to randomize the values of a column:
test1 <- transform(test, a = sample(a))
I just cannot find how to apply it to the entire list of data frames. I've tried this:
test3<- lapply(test2,function(i) sample(i[["a"]]))
But I lost the other variables. And this:
test3 <- lapply(test2,function(i) {transform(i, i[["a"]]==sample(i[["a"]]))})
But my variable is not randomized.
Several questions are similar to mine, but they didn't help me solve my problem:
Adding columns to each in a list of dataframes
Add a column in a list of data frames

You can try the following:
lapply(test2, function(df) {df$a <- sample(df$a); df})
Or, using transform:
lapply(test2, function(df) transform(df, a = sample(a)))
Or just
lapply(test2, transform, a = sample(a))
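Since the aim is many iterations, the copying and the shuffling can also be collapsed into one step; a small sketch (100 copies is just an illustrative number):
# build 100 copies of test, each with column a shuffled independently
test_list <- lapply(1:100, function(i) transform(test, a = sample(a)))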

Is there a reason you need them in separate lists?
This will give you 10 extra columns, each a randomized sample of a, and you can then loop through those columns for your further analysis.
a <- c(1, 2, 3, 4, 5)
b <- c(45, 34, 50, 100, 64)
test <- data.frame(a, b)
for (i in 3:12) {
  test[, i] <- sample(a)  # each new column is an independent shuffle of a
}
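A more compact equivalent, as a sketch (the a_rand_ column names are purely illustrative), adds all 10 shuffled columns at once with replicate:
test <- data.frame(a, b)  # start again from the original two columns
# replicate() returns a 5 x 10 matrix; assigning it to 10 new column
# names splits it into one shuffled copy of a per column
test[paste0("a_rand_", 1:10)] <- replicate(10, sample(a))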

Related

combine or merge two or more .fd class functional data in R

I am a newcomer to functional data analysis (FDA).
library(fda)
set.seed(151)
I1 <- matrix(rnorm(20*20,mean=0,sd=1),20,20)
I2 <- matrix(rnorm(15*25, mean=0.5, sd=1), 15, 25)
data1 <- t(I1)
data2 <- t(I2)
minutetime1 <- seq(from = 1, to = 25, length.out = 20)
minutetime2 <- seq(from = 1, to = 25, length.out = 25)
minutebasis <- create.bspline.basis(rangeval=c(0,25),nbasis=10)
fd1<- Data2fd(data1, minutetime1, basisobj=minutebasis)
is.fd(fd1)
# [1] TRUE
fd2<- Data2fd(data2, minutetime2, basisobj=minutebasis)
is.fd(fd2)
# [1] TRUE
I would like to merge or combine fd1 and fd2 (like combining two vectors), so that the result is also of class fd. I have tried c(fd1, fd2), merge(fd1, fd2), modifyList(fd1, fd2), etc., but none of these works.
Because your data are represented on the same basis, you should be able to create a combined fd object by combining the basis function coefficients from fd1 and fd2.
The coefficients are stored in matrices. Rows correspond to basis functions and columns correspond to observations (see ?fd), so we can bind columns to combine observations.
merged_coefs <- cbind(fd1$coefs, fd2$coefs)
merged_fd <- fd(coef = merged_coefs, basisobj = minutebasis)
is.fd(merged_fd)
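As a quick sanity check, the merged object should have one coefficient column per curve from both inputs combined:
# columns of the coefficient matrix = curves; rows = basis functions
ncol(merged_fd$coefs) == ncol(fd1$coefs) + ncol(fd2$coefs)  # TRUE
plot(merged_fd)  # all curves plotted on the shared basis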

How to efficiently split each row into test and train subsets using R?

I have a data table that gives the length and composition of a set of vectors.
For example:
library(data.table)
set.seed(1)
dt <- data.table(length = c(100, 150),
                 n_A = c(30, 30),
                 n_B = c(20, 100),
                 n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list() # create output lists
dt_20_list <- list()
for (i in 1:nrow(dt)) {  # for each row in the data.table
  # create a randomised vector with the given number of each component
  sample_vec <- sample(c(rep("A", dt$n_A[i]),
                         rep("B", dt$n_B[i]),
                         rep("C", dt$n_C[i])))
  # subset 80% of the vector
  sample_vec_80 <- sample_vec[1:floor(length(sample_vec) * 0.8)]
  # count the number of each component in the subset and output to list
  dt_80_list[[i]] <- data.table(length = length(sample_vec_80),
                                n_A = length(sample_vec_80[which(sample_vec_80 == "A")]),
                                n_B = length(sample_vec_80[which(sample_vec_80 == "B")]),
                                n_C = length(sample_vec_80[which(sample_vec_80 == "C")]))
  # subtract the number of each component in the 80% to identify the number in the 20%
  dt_20_list[[i]] <- data.table(length = dt$length[i] - dt_80_list[[i]]$length,
                                n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
                                n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
                                n_C = dt$n_C[i] - dt_80_list[[i]]$n_C)
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows, but only a few columns.)
Here's a version I came up with, with three main changes:
- use .N and by= to count the number of "A", "B", "C" drawn in each row
- use the size argument of sample
- join the original dt and dt_80 to calculate dt_20 without a for loop
## draw training data
dt_80 <- dcast(
  dt[, row := 1:nrow(dt)
     ][, .(draw = sample(c(rep("A80", n_A),
                           rep("B80", n_B),
                           rep("C80", n_C)),
                         size = .8 * length))
       , by = row
       ][, .N, by = .(row, draw)],
  row ~ draw, value.var = "N")[, length80 := A80 + B80 + C80]
## draw test data
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80),
            on = "row"][, length20 := A20 + B20 + C20]
There is probably still room for optimization, but I hope it already helps :)
EDIT
Here I add my initial idea. I did not post it at first because the code above is much faster, but this one might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, it might be of interest...
library(data.table)
library(Rfast)
## add row numbers
dt[, row := 1:nrow(dt)]
## sampling function
sampfunc <- function(n_A, n_B, n_C) {
  draw <- sample(c(rep("A80", n_A),
                   rep("B80", n_B),
                   rep("C80", n_C)),
                 size = .8 * (n_A + n_B + n_C))
  out <- Rfast::Table(draw)
  return(as.list(out))
}
## draw training data
dt_80 <- dt[, sampfunc(n_A, n_B, n_C), by = row]
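The 20% test counts can then be derived from this dt_80 with the same join as in the first version (a sketch, assuming every category is drawn at least once so that the A80/B80/C80 columns all exist):
dt_80[, length80 := A80 + B80 + C80]
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80),
            on = "row"][, length20 := A20 + B20 + C20]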

R loop for filtering through each column

I have a data frame like this:
[screenshot: gene expression data frame, samples in columns and genes in rows]
Column names are different samples and row names are different genes.
Now I want to know the number of genes left after filtering each column by a given threshold.
For example,
sample1_more_than_5 <- df[(df[,1]>5),]
sample1_more_than_10 <- df[(df[,1]>10),]
sample1_more_than_20 <- df[(df[,1]>20),]
sample1_more_than_30 <- df[(df[,1]>30),]
Then,
sample2_more_than_5 <- df[(df[,2]>5),]
sample2_more_than_10 <- df[(df[,2]>10),]
sample2_more_than_20 <- df[(df[,2]>20),]
sample2_more_than_30 <- df[(df[,2]>30),]
But I don't want to repeat this 100 times as I have 100 samples.
Can anyone write a loop for me for this situation? Thank you
Here is a solution using two loops that calculates, for each sample (column), the number of genes (rows) with a value greater than each threshold in the nums vector.
# Create the vector of thresholds used to filter each column
nums <- c(5, 10, 20, 30)
# Loop over each column
resul <- apply(df, 2, function(x) {
  # Count the rows with a value greater than each entry of nums
  sapply(nums, function(y) {
    length(x[x > y])
  })
})
# Turn the result into a data.frame and add the nums vector as the first column
resul <- data.frame(greaterthan = nums,
                    as.data.frame(resul))
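As a quick illustration with a made-up two-sample data frame (the values are purely hypothetical), resul ends up with one row per threshold and one column per sample:
df <- data.frame(sample1 = c(6, 12, 25, 40, 3),
                 sample2 = c(1, 11, 22, 35, 50))
# running the code above then gives:
#   greaterthan sample1 sample2
# 1           5       4       4
# 2          10       3       4
# 3          20       2       3
# 4          30       1       2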
We can loop over the columns and create the grouping with cut:
lst1 <- lapply(df, function(x) split(x, cut(x, breaks = c(5, 10, 20, 30))))
or findInterval and then split
lst1 <- lapply(df, function(x) split(x, findInterval(x, c(5, 10, 20, 30))))
If we go by the way the objects are created in the OP's post, there would be 100 * 4, i.e. 400, objects in the global environment (100 samples times 4 thresholds). Instead, it can all be kept in a single list object.
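A minimal sketch of that single-list alternative (the thresholds and the element names are just illustrative):
thresholds <- c(5, 10, 20, 30)
# one list element per sample, each holding the four filtered data frames
lst_filtered <- setNames(lapply(seq_along(df), function(i) {
  setNames(lapply(thresholds, function(j) df[df[, i] > j, , drop = FALSE]),
           paste0("more_than_", thresholds))
}), paste0("sample", seq_along(df)))
# e.g. lst_filtered[["sample1"]][["more_than_10"]] replaces sample1_more_than_10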
The individual objects can still be created as in the OP's post, but it is not recommended:
v1 <- c(5, 10, 20, 30)
v2 <- seq_along(df)
for (i in v2) {
  for (j in v1) {
    assign(sprintf('sample%d_more_than_%d', i, j),
           value = df[df[, i] > j, , drop = FALSE])
  }
}
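If the objects have been created this way anyway, they can still be gathered back into a single list afterwards (a sketch relying on the naming pattern above):
lst2 <- mget(ls(pattern = "^sample\\d+_more_than_\\d+$"))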

Compare two ffdf

I have two very large data sets (50M rows, 130 columns) which I can't compare with basic packages. Therefore I have to use an ffdf. It's the first time I am working with the ff package. I am trying to compare two ffdf objects and then write the differences to two output files ("in_file1_not_in_file2", "in_file2_not_in_file1"). Here is an example:
# For easy reproduction; normally a CSV file
set.seed(1234)
data1 <- data.frame(row.names = 1:10,
                    var1 = sample(c(TRUE, FALSE), 10, replace = TRUE),
                    var2 = sample(1:8, 10, replace = TRUE),
                    var3 = as.factor(sample(c('AAA', 'BBB', 'CCC'), 10, replace = TRUE)))
data2 <- data.frame(row.names = 1:10,
                    var1 = sample(c(TRUE, FALSE), 10, replace = TRUE),
                    var2 = sample(1:10, 10, replace = TRUE),
                    var3 = as.factor(sample(c('AAA', 'BBB', 'CCC'), 10, replace = TRUE)))
# Convert to an ffdf (requires the ff package)
library(ff)
ffdata1 <- as.ffdf(data1)
ffdata2 <- as.ffdf(data2)
So now I am stuck. Normally I would combine all the values of a row into one key column and compare these keys with each other. Something like this:
# Normally - Combined columns
data1$CCID <- apply(data1, 1, paste, collapse='.')
data2$CCID <- apply(data2, 1, paste, collapse='.')
# Combine columns of ffdf?
ffdata1$CCID <- ??
ffdata2$CCID <- ??
# Normally - Comparison
cdata3 <- sapply(data2$CCID, FUN=function(x) { x == data1$CCID })
output1 <- data2[colSums(cdata3)>0,]
output2 <- data1[rowSums(cdata3)>0,]
# Comparison of ffdf?
ffcdata3 <- ??
ffoutput1 <- ??
ffoutput2 <- ??
I hope this is understandable; sorry, I just have no idea how to work with ffdf.

Using coefficient of variation in aggregate

I have a data frame with 50000 rows and 200 columns. There are duplicate rows in the data, and I want to aggregate it by choosing, among the duplicates, the row with the maximum coefficient of variation, using the aggregate function in R. With aggregate I can use "mean" or "sum" by default, but not the coefficient of variation.
For example
aggregate(data, as.columnname, FUN=mean)
Works fine.
I have a custom function for calculating coefficient of variation but not sure how to use it with aggregate.
co.var <- function(x) {
  100 * sd(x) / mean(x)
}
I have tried
aggregate(data, as.columnname, function (x) max (co.var (x, data[index (x),])
but it gives an error saying object x is not found.
Assuming that I understand your problem, I would suggest using tapply() instead of aggregate() (see ?tapply for more info). However, a minimal working example would be very helpful.
co.var <- function(x) ( 100*sd(x)/mean(x) )
## Data with multiple repeated measurements.
## There are three items (ID 1, 2, 3), each measured twice,
## with two values per measurement (val1 and val2)
myDF <- data.frame(ID = c(1, 2, 3, 1, 2, 3),
                   val1 = c(20, 10, 5, 25, 7, 2),
                   val2 = c(19, 9, 4, 24, 4, 1))
## Calculate the coefficient of variation for each measurement set
myDF$coVar <- apply(myDF[, c("val1", "val2")], 1, co.var)
## Use tapply() instead of aggregate()
mySel <- tapply(seq_len(nrow(myDF)), myDF$ID, function(x) {
  curSub <- myDF[x, ]
  return(x[which(curSub$coVar == max(curSub$coVar))])
})
## The mySel vector is then the vector of row indices that correspond to the
## maximum coefficient of variation for each ID
myDF[mySel, ]
EDIT:
There are faster ways, one of which is below. However, with a 40000 by 100 dataset, the above code only took between 16 and 20 seconds on my machine.
# Create a big dataset
myDF <- data.frame(val1 = c(20, 10, 5, 25, 7, 2),
                   val2 = c(19, 9, 4, 24, 4, 1))
myDF <- myDF[sample(seq_len(nrow(myDF)), 40000, replace = TRUE), ]
myDF <- cbind(myDF, rep(myDF, 49))
myDF$ID <- sample.int(nrow(myDF)/5, nrow(myDF), replace = TRUE)
# Define a new function to work (slightly) better with large datasets
co.var.df <- function(x) ( 100*apply(x,1,sd)/rowMeans(x) )
# Create two datasets to benchmark the two methods
# (A second method proved slower than the third, hence the naming)
myDF.firstMethod <- myDF
myDF.thirdMethod <- myDF
Time the original method
startTime <- Sys.time()
myDF.firstMethod$coVar <- apply(myDF.firstMethod[,
    grep("val", names(myDF.firstMethod))], 1, co.var)
mySel <- tapply(seq_len(nrow(myDF.firstMethod)),
                myDF.firstMethod$ID, function(x) {
  curSub <- myDF.firstMethod[x, ]
  return(x[which(curSub$coVar == max(curSub$coVar))])
}, simplify = FALSE)
endTime <- Sys.time()
R> endTime-startTime
Time difference of 17.87806 secs
Time the second method
startTime3 <- Sys.time()
coVar3 <- co.var.df(myDF.thirdMethod[,
    grep("val", names(myDF.thirdMethod))])
mySel3 <- tapply(seq_along(coVar3),
                 myDF[, "ID"], function(x) {
  return(x[which(coVar3[x] == max(coVar3[x]))])
}, simplify = FALSE)
endTime3 <- Sys.time()
R> endTime3-startTime3
Time difference of 2.024207 secs
And check to see that we get the same results:
R> all.equal(mySel,mySel3)
[1] TRUE
There is an additional change from the original post, in that the edited code considers that there may be more than one row with the highest CV for a given ID. Therefore, to get the results from the edited code, you must unlist the mySel or mySel3 objects:
myDF.firstMethod[unlist(mySel),]
myDF.thirdMethod[unlist(mySel3),]
