I'm looking for an efficient way to do the following transformation:
Example input:
ob <- list(a = 2, b = 3)
Expected output:
key value
1 a 2
2 b 3
Current (verbose) solution:
data.frame(key = names(ob), value = unlist(ob, use.names = FALSE))
Try stack
stack(ob)
# values ind
#1 2 a
#2 3 b
You would need to change the names though. Use setNames
setNames(stack(ob), nm = c("value", "key"))
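As a quick sanity check (a sketch, not part of the original answers), the two approaches can be compared on the example input. Note that stack() returns the names as a factor and puts the value column first, so both are adjusted here before comparing:

```r
ob <- list(a = 2, b = 3)

# OP's direct construction
direct <- data.frame(key = names(ob),
                     value = unlist(ob, use.names = FALSE),
                     stringsAsFactors = FALSE)

# stack() plus renaming; convert the factor and reorder the columns
stacked <- setNames(stack(ob), nm = c("value", "key"))
stacked$key <- as.character(stacked$key)
stacked <- stacked[, c("key", "value")]

# compare column by column
all.equal(direct$key, stacked$key)
all.equal(direct$value, stacked$value)
```

If the factor column is acceptable for your use case, the as.character() step can be dropped.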
benchmark
Addressing @Roland's comment: stack turns out to be much slower in the benchmark below, so please don't use stack but the OP's solution for efficiency reasons.
n <- 1e5
lst <- as.list(seq_len(n))
names(lst) <- paste0("a", seq_len(n))
library(microbenchmark)
benchmark <- microbenchmark(
snoram = snoram(lst),
markus = markus(lst), times = 50
)
benchmark
#Unit: milliseconds
# expr min lq mean median uq max neval
# snoram 2.475258 2.594479 2.739639 2.652843 2.715575 5.92216 50
# markus 114.387692 119.028200 134.745626 137.524606 144.045112 162.11510 50
Functions used (so far)
snoram <- function(l) {
data.frame(key = names(l), value = unlist(l, use.names = FALSE),
stringsAsFactors = FALSE) # this gives a huge performance gain
# thanks to @Roland
}
markus <- function(l) {
setNames(stack(l), nm = c("value", "key"))
}
I have a very large data set with categorical labels a and a vector b that contains all possible labels in the data set:
a <- c(1,1,3,2) # artificial data
b <- c(1,2,3,4) # fixed categories
Now I want to find for each observation in a the set of all remaining categories (that is, the elements of b excluding the given observation in a). From these remaining categories, I want to sample one at random.
My approach using a loop is
goal <- numeric() # container for results
for(i in 1:4){
d <- setdiff(b, a[i]) # find the categories except the one observed in the data
goal[i] <- sample(d,1) # sample one of the remaining categories randomly
}
goal
[1] 4 4 1 1
However, this has to be done a large number of times and applied to very large data sets. Does anyone have a more efficient version that leads to the desired result?
EDIT:
The function by akrun is unfortunately slower than the original loop. If anyone has a creative idea with a competitive result, I'm happy to hear it!
We can use vapply
vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1))
set.seed(24)
a <- sample(c(1:4), 10000, replace=TRUE)
b <- 1:4
system.time(vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)))
# user system elapsed
# 0.208 0.007 0.215
It turns out that resampling the labels that are equal to the labels in the data is an even faster approach, using
test = sample(b, length(a), replace=T)
resample = (a == test)
while(sum(resample) > 0){
test[resample] = sample(b, sum(resample), replace=T)
resample = (a == test)
}
Updated Benchmarks for N=10,000:
Unit: microseconds
expr min lq mean median uq max neval
loop 14337.492 14954.595 16172.2165 15227.010 15585.5960 24071.727 100
akrun 14899.000 15507.978 16271.2095 15736.985 16050.6690 24085.839 100
resample 87.242 102.423 113.4057 112.473 122.0955 174.056 100
shree(data = a, labels = b) 5195.128 5369.610 5472.4480 5454.499 5574.0285 5796.836 100
shree_mapply(data = a, labels = b) 1500.207 1622.516 1913.1614 1682.814 1754.0190 10449.271 100
Update: Here's a fast version with mapply. This method avoids calling sample() for every iteration, so it is a bit faster:
mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
Here's a version without setdiff (setdiff can be a bit slow), although I think even more optimization is possible:
vapply(a, function(x) sample(b[!b == x], 1), numeric(1))
Benchmarks:
set.seed(24)
a <- sample(c(1:4), 1000, replace=TRUE)
b <- 1:4
microbenchmark::microbenchmark(
akrun = vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)),
shree = vapply(a, function(x) sample(b[!b == x], 1), numeric(1)),
shree_mapply = mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
)
Unit: milliseconds
expr min lq mean median uq max neval
akrun 28.7347 30.66955 38.319655 32.57875 37.45455 237.1690 100
shree 5.6271 6.05740 7.531964 6.47270 6.87375 45.9081 100
shree_mapply 1.8286 2.01215 2.628989 2.14900 2.54525 7.7700 100
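One possible further step along the lines of the mapply idea, shown only as a sketch: draw an index among the k - 1 remaining categories for every observation at once, then shift it past the observed label. The function name sample_other is made up for illustration, and the sketch assumes that b has no duplicates and that every element of a occurs in b.

```r
# Hypothetical helper: for each element of a, pick one element of b
# uniformly at random among those different from it, without an R-level loop.
sample_other <- function(a, b) {
  pos <- match(a, b)                                   # position of each observed label in b
  idx <- sample.int(length(b) - 1L, length(a), replace = TRUE)
  idx <- idx + (idx >= pos)                            # skip over the observed label
  b[idx]
}

set.seed(24)
a <- sample(1:4, 10000, replace = TRUE)
b <- 1:4
goal <- sample_other(a, b)
any(goal == a)   # no sampled label equals the observed one
```

Since only vectorized calls are involved, it would be worth adding to the benchmark above before relying on it.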
I have this data.frame:
set.seed(1)
df <- cbind(matrix(rnorm(26,100),26,100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Each row is 100 measurements from a certain subject (designated by id), which is associated with a parent ID (designated by parent.id). The relationship between parent.id and id is one-to-many.
I'm looking for a fast way to get the fraction of each df$id (for each of its 100 measurements) out of the measurements of its parent.id. That is, for each id in df$id I want to divide each of its 100 measurements by the sum of that measurement across all df$id's that share its df$parent.id.
What I'm trying is:
sum.df <- dplyr::select(df,-id) %>% dplyr::group_by(parent.id) %>% dplyr::summarise_all(sum)
fraction.df <- do.call(rbind,lapply(df$id,function(i){
pid <- dplyr::filter(df,id == i)$parent.id
(dplyr::filter(df,id == i) %>% dplyr::select(-id,-parent.id))/
(dplyr::filter(sum.df,parent.id == pid) %>% dplyr::select(-parent.id))
}))
But for the real dimensions of my data: length(df$id) = 10,000 with 1,024 measurements, this is not fast enough.
Any idea how to improve this, ideally using dplyr functions?
Let's compare these options with microbenchmark, all using the new definition for the dataset in @Sathish's answer:
OP method:
Units: seconds
min lq mean median uq max neval
1.423583 1.48449 1.602001 1.581978 1.670041 2.275105 100
@Sathish's method speeds it up by a factor of about 4 (median 363 ms vs. 1,582 ms). This is valuable, to be sure:
Units: milliseconds
min lq mean median uq max neval
299.3581 334.787 388.5283 363.0363 398.6714 951.4654 100
One possible base R implementation below, using principles of efficient R code, improves things by a factor of about 65 (24 milliseconds, vs 1,582 milliseconds):
Units: milliseconds
min lq mean median uq max neval
21.49046 22.59205 24.97197 23.81264 26.36277 34.72929 100
Here's the base R implementation. As is the case for the OP's implementation, the parent.id and id columns are not included in the resulting structure (here fractions). fractions is a matrix with rows ordered according to sort(interaction(df$id, df$parent.id, drop = TRUE)).
values <- df[1:100]
parents <- split(values, df$parent.id)
sums <- vapply(parents, colSums, numeric(100), USE.NAMES = FALSE)
fractions <- matrix(0, 26, 100)
f_count <- 0
for (p_count in seq_along(parents)){
parent <- as.matrix(parents[[p_count]])
dimnames(parent) <- NULL
n <- nrow(parent)
for (p_row in seq_len(nrow(parent))){
fractions[(f_count + p_row),] <- parent[p_row,] / sums[,p_count]
}
f_count <- f_count + p_row
}
Note: there's still room for improvement. split() is not particularly efficient.
Note 2: What "principles of efficient R code" were used?
Get rid of names whenever you can
It's faster to find things in a matrix than a data frame
Don't be afraid of for loops for efficiency, provided you're not growing an object
Prefer vapply to the other apply family functions.
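For comparison, here is a loop-free sketch of the same computation using base R's rowsum(). It is not one of the benchmarked functions, and it rebuilds a small stand-in for the data as a plain matrix plus a grouping vector rather than using df itself:

```r
set.seed(1)
# Stand-in data: 26 subjects x 100 measurements, positive values
vals <- matrix(rnorm(26 * 100, mean = 100), nrow = 26, ncol = 100)
parent.id <- sample(letters[1:5], 26, replace = TRUE)

# One row of column sums per parent.id (row names are the group labels)
sums <- rowsum(vals, parent.id)

# Expand the sums back to one row per subject and divide element-wise;
# rows stay in the original order of the data
fractions <- vals / sums[parent.id, , drop = FALSE]

# Within each parent group, every column of fractions sums to 1
range(rowsum(fractions, parent.id))
```

Unlike the loop above, the rows of fractions keep the input order, so no re-sorting by parent.id is needed afterwards.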
The problem with your data is that all columns are duplicates of each other (rnorm(26, 100) is recycled to fill the 26 x 100 matrix), so I changed it slightly to get different values throughout the dataset.
Data:
set.seed(1L)
df <- cbind(matrix(rnorm(2600), nrow = 26, ncol = 100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Code:
library('data.table')
setDT(df) # assign data.table class by reference
# compute sum for each `parent.id` for each column (100 columns)
sum_df <- df[, .SD, .SDcols = which(colnames(df) != 'id' )][, lapply(.SD, sum ), by = .(parent.id ) ]
# get column names for sum_df and df which are sorted for consistency
no_pid_id_df <- gtools::mixedsort( colnames(df)[ ! ( colnames(df) %in% c( 'id', 'parent.id' ) ) ] )
no_pid_sum_df <- gtools::mixedsort( colnames(sum_df)[ colnames(sum_df) != 'parent.id' ] )
# match the `parent.id` for each `id` and then divide its value by the value of `sum_df`.
df[, .( props = {
pid <- parent.id
unlist( .SD[, .SD, .SDcols = no_pid_id_df ] ) /
unlist( sum_df[ parent.id == pid, ][, .SD, .SDcols = no_pid_sum_df ] )
}, parent.id ), by = .(id)]
Output:
# id props parent.id
# 1: A -0.95157186 e
# 2: A 0.06105359 e
# 3: A -0.42267771 e
# 4: A -0.03376174 e
# 5: A -0.16639600 e
# ---
# 2596: Z 2.34696158 e
# 2597: Z 0.23762369 e
# 2598: Z 0.60068440 e
# 2599: Z 0.14192337 e
# 2600: Z 0.01292592 e
Benchmark:
library('microbenchmark')
microbenchmark( sathish(), frank(), dan())
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# sathish() 404.450219 413.456675 433.656279 420.46044 429.876085 593.44202 100 c
# frank() 2.035302 2.304547 2.707019 2.47257 2.622025 18.31409 100 a
# dan() 17.396981 18.230982 19.316653 18.59737 19.700394 27.13146 100 b
I have the following data frame 'df'.
Each participant (here 10 participants) saw several stimuli (here 100) and made
a judgment about each one (here a random number). For each stimulus, I know the true
answer (here a random number; a different number for each stimulus but always
the same answer for all participants).
library(dplyr) # needed for %>% and arrange()
participant <- rep(1:10, each=100)
stimuli <- rep(1:100, 10)
judgment <- rnorm(1000)
df1 <- data.frame(participant, stimuli, judgment)
df2 <- data.frame(stimuli=1:100, criterion=rnorm(100))
df <- merge(df1, df2, by='stimuli') %>% arrange(participant, stimuli)
Here is what I am trying to do:
1) Taking n randomly selected participants (here n is between 1 and 10).
2) Computing the mean of their judgments per stimulus
3) Computing the correlation between this mean and the true answer
I want to perform steps 1-3 for all n. That is, I want to take 1 randomly selected participant and perform steps 1-3, then take 2 randomly selected participants and perform steps 1-3, and so on up to 10 randomly selected participants.
The results should be a data frame with 10 rows and 2 variables: N and the correlation. I want to work only with dplyr.
My solution is based on lapply. Here it is:
participants_id = unique (df$participant)
MyFun = function(Data) {
HelpFun = function(x, Data) {
# x is the index for the number of participants.
# It will be used in the lapply call below
participants_x = sample(participants_id, x)
filter(Data, participant %in% participants_x) %>%
group_by(stimuli) %>%
summarise( mean_x = mean(judgment),
criterion = unique(criterion) ) %>%
summarise(cor = cor(.$mean_x, .$criterion))
}
N <- length(unique(Data$participant))
lapply(1:N, HelpFun, Data) %>% bind_rows()
}
MyFun(df)
The problem is that this code is slow. Since every selection is random, I
perform all of this 10,000 times, and this is slow. On my machine (Windows 10, 16 GB), 1,000 simulations take 2 minutes and 10,000 simulations take 20 minutes. (I also tried with loops, but it did not help much, although for some reason it was a little faster.) There has to be a faster solution. After all, the computations are not so complicated.
Below I run only 100 simulations so as not to tie up your computer.
system.time(replicate(100, MyFun(df), simplify = FALSE ) %>% bind_rows())
Any idea about making all of this faster?
Using data.table and for loops we can get a solution that is about 10 times faster.
My function:
minem <- function(n) { # n - simulation count
require(data.table)
participants_id <- unique(df$participant)
N <- length(unique(df$participant))
dt <- as.data.table(df)
setkey(dt, stimuli)
L <- list()
for (j in 1:n) {
corss <- rep(0, N)
for (i in 1:N) {
participants_x <- sample(participants_id, i)
xx <- dt[participant %in% participants_x,
.(mean_x = mean(judgment),
criterion = first(criterion)),
by = stimuli]
corss[i] <- cor(xx$mean_x, xx$criterion)
}
L[[j]] <- corss
}
unlist(L)
}
head(minem(10))
# [1] 0.13642499 -0.02078109 -0.14418400 0.04966805 -0.09108837 -0.15403185
Your function:
Meir <- function(n) {
replicate(n, MyFun(df), simplify = FALSE) %>% bind_rows()
}
Benchmarks:
microbenchmark::microbenchmark(
Meir(10),
minem(10),
times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Meir(10) 1897.6909 1956.3427 1986.5768 1973.5594 2043.4337 2048.5809 10 b
# minem(10) 193.5403 196.0426 201.4132 202.1085 204.9108 215.9961 10 a
around 10 times faster
system.time(minem(1000)) # ~19 sec
Update
If your data size and memory limit allow, you can do it much faster with this approach:
minem2 <- function(n) {
require(data.table)
participants_id <- unique(df$participant)
N <- length(unique(df$participant))
dt <- as.data.table(df)
setkey(dt, participant)
L <- lapply(1:n, function(x)
sapply(1:N, function(i)
sample(participants_id, i)))
L <- unlist(L, recursive = F)
names(L) <- 1:length(L)
g <- sapply(seq_along(L), function(x) rep(names(L[x]), length(L[[x]])))
L <- data.table(participant = unlist(L), .id = as.integer(unlist(g)),
key = "participant")
L <- dt[L, allow.cartesian = TRUE]
xx <- L[, .(mean_x = mean(judgment), criterion = first(criterion)),
keyby = .(.id, stimuli)]
xx <- xx[, cor(mean_x, criterion), keyby = .id][[2]]
xx
}
microbenchmark::microbenchmark(
Meir(100),
minem(100),
minem2(100),
times = 2, unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval cld
# Meir(100) 316.34965 316.34965 257.30832 257.30832 216.85190 216.85190 2 c
# minem(100) 31.49818 31.49818 26.48945 26.48945 23.05735 23.05735 2 b
# minem2(100) 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 2 a
But you will need to test it yourself on your own data.
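Another direction, given only as a rough sketch and not benchmarked against the functions above: since every participant judges every stimulus, the judgments can be stored as a stimuli x participants matrix, and each simulation step becomes a rowMeans() over randomly chosen columns. The names judg and one_sim are made up for illustration, and the data below is a stand-in generated the same way as in the question.

```r
set.seed(42)
n_part <- 10
n_stim <- 100

# Stand-in data: rows are stimuli, columns are participants
judg <- matrix(rnorm(n_stim * n_part), nrow = n_stim, ncol = n_part)
criterion <- rnorm(n_stim)

# One simulation: for k = 1..N, sample k participants, average their
# judgments per stimulus, and correlate the averages with the criterion
one_sim <- function(judg, criterion) {
  n_part <- ncol(judg)
  vapply(seq_len(n_part), function(k) {
    cols <- sample.int(n_part, k)
    cor(rowMeans(judg[, cols, drop = FALSE]), criterion)
  }, numeric(1))
}

res <- one_sim(judg, criterion)          # one correlation per k
sims <- replicate(100, one_sim(judg, criterion))  # 10 x 100 matrix, one column per simulation
```

This only applies when the design is complete (no missing participant/stimulus combinations); otherwise the grouped data.table approach above is the safer choice.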
I want to count the number of 1's that occur from RIGHT to LEFT across multiple columns, which stops when encountering the first 0.
Example DF:
df<-data.frame(replicate(7,sample(0:1,30,rep=T)))
colnames(df)<-seq(1950,2010,10)
I've manually entered the desired result here under a new column "condition" as an example:
Thanks in advance for your help,
Cai
Here's a fully vectorized attempt
indx <- rowSums(df) == ncol(df) # Per Jaap's comment
df$condition <- ncol(df) - max.col(-df, ties = "last")
df$condition[indx] <- ncol(df) - 1
This basically finds the first zero from the right and counts how many columns lie to its right (which are all 1s in binary data).
EDIT
Had to add handling for the special case when a row is all ones
df$condition <- apply(df, 1, function(x) {
y <- rev(x)
sum(cumprod(y))
})
[Edit: now works]
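A small worked example of why the cumprod() approach works (the matrix m is made up for illustration): after reversing a row, the running product stays 1 until the first 0 is reached and is 0 from then on, so its sum is the number of trailing ones.

```r
# Count the 1s at the right-hand end of a 0/1 vector
trailing_ones <- function(x) sum(cumprod(rev(x)))

m <- rbind(c(1, 0, 1, 1),   # two trailing ones
           c(1, 1, 1, 1),   # all ones
           c(1, 1, 1, 0),   # ends in zero
           c(0, 1, 0, 1))   # one trailing one

apply(m, 1, trailing_ones)
# [1] 2 4 0 1
```

The all-ones row needs no special handling here, because the running product never drops to 0.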
Try this
df$condition <- apply(df,1,function(x){x<- rev(x);m <- match(0,x)[1]; if (is.na(m)) sum(x) else sum(x[1:m])})
We match the first 0 in the reversed row, then sum up to that element.
If there's no zero, we sum the full row.
Here's a benchmark of all solutions :
library(stringr)
microbenchmark(
Moody_Mudskipper = apply(df,1,function(x){x<- rev(x);m <- match(0,x)[1]; if (is.na(m)) sum(x) else sum(x[1:m])}),
akrun = apply(df, 1, function(x) {x1 <- rle(x)
x2 <- tail(x1$lengths, 1)[tail(x1$values, 1)==1]
if(length(x2)==0) 0 else x2}),
akrun2 = str_count(do.call(paste0, df), "[1]+$"),
roland = apply(df, 1, function(x) {y <- rev(x);sum(y * cumprod(y != 0L))}),
David_Arenburg = ncol(df) - max.col(-df, ties = "last"),
times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval
# Moody_Mudskipper 1437.948 1480.417 1677.1929 1536.159 1597.209 3009.320 10
# akrun 6985.174 7121.078 7718.2696 7691.053 7856.862 9289.146 10
# akrun2 1101.731 1188.793 1290.8971 1226.486 1343.099 1790.091 10
# akrun3 693.315 791.703 830.3507 820.371 884.782 1030.240 10
# roland 1197.995 1270.901 1708.5143 1332.305 1727.802 4568.660 10
# David_Arenburg 2845.459 3060.638 3406.3747 3167.519 3495.950 5408.494 10
# David_Arenburg_corrected 3243.964 3341.644 3757.6330 3384.645 4195.635 4943.099 10
For a bigger example David's solution is indeed the fastest, as said in the chosen solution's comments:
df<-data.frame(replicate(7,sample(0:1,1000,rep=T)))
# Unit: milliseconds
# expr min lq mean median uq max neval
# Moody_Mudskipper 31.324456 32.155089 34.168533 32.827345 33.848560 44.952570 10
# akrun 225.592061 229.055097 238.307506 234.761584 241.266853 271.000470 10
# akrun2 28.779824 29.261499 33.316700 30.118144 38.026145 46.711869 10
# akrun3 14.184466 14.334879 15.528201 14.633227 17.237317 18.763742 10
# roland 27.946005 28.341680 29.328530 28.497224 29.760516 33.692485 10
# David_Arenburg 3.149823 3.282187 3.630118 3.455427 3.727762 5.240031 10
# David_Arenburg_corrected 3.464098 3.534527 4.103335 3.833937 4.187141 6.165159 10
We can loop through the rows, use rle
df$condition <- apply(df, 1, function(x) {x1 <- rle(x)
x2 <- tail(x1$lengths, 1)[tail(x1$values, 1)==1]
if(length(x2)==0) 0 else x2})
Or another option is str_extract
library(stringr)
v1 <- str_extract(do.call(paste0, df), "1+$")
df$condition <- ifelse(is.na(v1), 0, nchar(v1))
Or with a slightly more efficient stringi
library(stringi)
v1 <- stri_count(stri_extract(do.call(paste0, df), regex = "1+$"), regex = ".")
v1[is.na(v1)] <- 0
df$condition <- v1
Or with a more compact option
stri_count(do.call(paste0, df), regex = '(?=1+$)')
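If adding a package is not an option, the same string idea can be sketched in base R with regexpr(), whose "match.length" attribute gives the length of the trailing run of 1s (or -1 when there is no match). This is an alternative to the stringr/stringi calls above, not something from the original answers, and the small data frame is made up for illustration:

```r
df <- data.frame(`1990` = c(1, 0, 1),
                 `2000` = c(1, 1, 0),
                 `2010` = c(1, 1, 0),
                 check.names = FALSE)

s <- do.call(paste0, df)                       # "111" "011" "100"
len <- attr(regexpr("1+$", s), "match.length") # -1 where the row ends in 0
len[len < 0] <- 0
len
# [1] 3 2 0
```

The result can then be assigned to df$condition in the same way as in the other variants.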
Hello everyone,
I am working with a large list that contains lists. Each of the sublists contains n elements. I always want to get the 3rd one, e.g.
l = list()
l[[1]] = list(A=runif(1), B=runif(1), C=runif(1))
l[[2]] = list(A=runif(1), B=runif(1), C=runif(1))
l[[3]] = list(A=runif(1), B=runif(1), C=runif(1))
res = sapply(l, function(x) x$C)
res = sapply(l, function(x) x[[3]]) #alternative
But my list contains several thousands of elements and I am performing this operation a lot of times. So, is there a faster way to do the operation above?
Best regards,
Mario
If you do this multiple times, then it would be better to convert your list to an easier structure like data.table.
library(data.table)
DT=rbindlist(l);
res = DT$C
# or if you prefer the 3rd element, not necessarily called 'C' then:
res = DT[[3]] # or DT[,C] which might be faster. Please check @richard-scriven's comment
Alternatively if you want to keep base R you could use rbind
res = do.call(rbind.data.frame, l)$C # or [[3]]
Would this make things easier?
UPDATE
Here are some benchmarks showing different solutions to the problem:
preparations:
library(data.table)
library(microbenchmark)
# creating a list and filling it with items
nbr = 1e5;
l = vector("list",nbr)
for (i in 1:nbr) {
l[[i]] = list(A=runif(1), B=runif(1), C=runif(1))
}
# creating data.frame and data.table versions
DT <- rbindlist(l)
DF <- data.frame(rbindlist(l))
benchmarking:
# doing the benchmarking
op <-
microbenchmark(
LAPPLY.1 = lapply(l, function(x) x$C),
LAPPLY.2 = lapply(l, `[`, "C"),
LAPPLY.3 = lapply(l, `[[`, "C"),
SAPPLY.1 = sapply(l, function(x) x$C),
SAPPLY.2 = sapply(l, function(x) x[[3]]),
SAPPLY.3 = sapply(l, `[[`, 3),
DT.1 = rbindlist(l)$C,
DT.2 = DT$C,
DF.2 = DF$C,
times = 100
)
results:
op
## Unit: microseconds
## expr min lq mean median uq max neval
## LAPPLY.1 124088 142390 161672 154415 163240 396761 100
## LAPPLY.2 111397 134745 156012 150062 165229 364539 100
## LAPPLY.3 66965 71608 82975 77329 84949 323041 100
## SAPPLY.1 133220 149093 166653 159222 172495 311857 100
## SAPPLY.2 105917 119533 137990 133364 139216 346759 100
## SAPPLY.3 70391 74726 81910 80520 85792 110062 100
## DT.1 46895 48943 49113 49178 49391 51377 100
## DT.2 8 18 37 47 49 58 100
## DF.2 7 13 33 40 42 82 100
(1) In general it would be best to use a table-like structure such as data.frame or data.table in the first place; selecting columns from those costs the least time.
(2) If this is not possible, it is better to first turn the list into a data.frame or data.table and then extract the values in one single operation.
(3) Interestingly, using sapply or lapply with the base R (optimized) [[-function results in processing times that are less than twice as long as using rbindlist and then extracting the values as a column.
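Building on point (3), a minimal sketch that was not part of the benchmark: vapply() with the `[[` function does the same extraction as the sapply variants but also declares the expected result type, so a malformed sublist raises an error instead of silently changing the shape of the output. The small list here is made up for illustration.

```r
l <- list(list(A = 1, B = 2, C = 3),
          list(A = 4, B = 5, C = 6),
          list(A = 7, B = 8, C = 9))

# Extract element "C" from every sublist; each result must be one numeric value
res <- vapply(l, `[[`, numeric(1), "C")
res
# [1] 3 6 9

# The positional variant works the same way
res3 <- vapply(l, `[[`, numeric(1), 3)
```

Whether it is measurably faster than sapply on your data would need to be checked with microbenchmark.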