I have a list where different rows are of different lengths (sometimes of length 1). I would like to apply sample to each row using:
sapply(1:99, function(x) sample(mat[[x]], 1))
The problem, of course, is that whenever a row is of length one, sample will choose from 1:x instead of always returning that single number.
Is there a way to force sample to return the same value whenever the length is 1?
What is an alternative way to avoid this problem?
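For illustration (not part of the original code), the documented behavior that causes the problem: when x is a single positive number, sample(x, 1) draws from 1:x rather than returning x itself.
sample(7, 1)        # any value in 1:7, not necessarily 7
sample(c(7, 9), 1)  # 7 or 9, as expected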
Since the 1:x behavior is hard-coded into sample, the best option is just to use an if/else check:
sapply(mat[1:99], function(x) if(length(x)==1) x else sample(x, 1))
You could use the example on the help page ?sample:
resample <- function(x, ...) x[sample.int(length(x), ...)]
Just use the above resample function in place of sample. Or rename it, modify it, etc. if you want it to work a little differently.
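For example, with the resample helper above (and assuming mat is the list from the question):
resample(c(7), 1)                      # always returns 7
resample(c(7, 9), 1)                   # returns 7 or 9
sapply(mat[1:99], resample, size = 1)  # one draw per element; length-1 elements are returned as-is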
To satisfy my own curiosity I did a quick benchmark of the suggestions so far:
library(microbenchmark)
mylist <- lapply( sample( rep( 1:10, 10 ) ), rpois, lambda=3 )
resample <- function(x, ...) x[sample.int(length(x), ...)]
sample1 <- function(x) x[sample.int(length(x), 1)]
ie1 <- function(x) if(length(x)==1) x else sample(x,1)
ie2 <- function(x) ifelse( length(x)==1, x, sample(x,1) )
rep1 <- function(x) { if( length(x) < 2 ) x <- rep(x,2); sample(x,1) }
(out <- microbenchmark(
sapply(mylist, resample, size=1),
sapply(mylist, sample1),
sapply(mylist, ie1),
sapply(mylist, ie2),
sapply(mylist, rep1)
))
With results:
Unit: microseconds
expr min lq median uq max neval
sapply(mylist, resample, size = 1) 360.846 388.1455 398.4085 409.4925 2036.169 100
sapply(mylist, sample1) 339.499 365.7720 375.8300 391.6345 1846.100 100
sapply(mylist, ie1) 493.853 534.2900 543.3205 561.3840 2091.589 100
sapply(mylist, ie2) 1225.397 1291.6955 1328.4365 1395.1455 3787.850 100
sapply(mylist, rep1) 566.926 614.3405 627.2720 649.4405 2178.209 100
Once you have the matrix vs. data frame (or whatever it is) question straightened out, here's a workaround I've used:
vec.len <- length(my_vector)
if (vec.len < 2) my_vector <- rep(my_vector, 2)
sample(my_vector, 1)
library(xml2)
library(rvest)
datpackage <- paste0("dat",1:10)
for(i in 1:10){
assign(datpackage[i], runif(2))
}
datlist <- list(dat1, dat2, dat3, dat4, dat5, dat6, dat7, dat8, dat9, dat10)
"datlist" is what I want, but is there easier way to make a list ?
datlist2 <- for (i in 1:10) {
list(paste0("dat",i))
}
datlist3 <- list(datpackage)
I've tried datlist2 and datlist3, but neither is the same as "datlist".
What should I do when I need to make a list of thousands of data objects?
We can use paste0 with mget if the objects are already created:
datlist <- mget(paste0("dat", 1:10))
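One small difference worth noting: mget returns a named list ("dat1", "dat2", ...), while the datlist built with list(...) in the question is unnamed, so the names can be dropped if an exact match matters:
datlist <- unname(mget(paste0("dat", 1:10)))
identical(datlist, list(dat1, dat2, dat3, dat4, dat5, dat6, dat7, dat8, dat9, dat10))
# TRUE, provided dat1..dat10 exist in the current environment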
But, if we need to create a list of random uniform numbers,
datlist <- replicate(10, runif(2), simplify = FALSE)
For creating lists with random numbers I would also suggest:
datlist2 <- lapply(vector("list", 10), function(x) {runif(2)})
Benchmarking
It may be worth adding that the lapply/vector approach appears to be faster:
funA <- function(x) {replicate(10, runif(2), simplify = FALSE)}
funB <- function(x) {lapply(vector("list", 10), function(x) {runif(2)})}
microbenchmark::microbenchmark(funA(), funB(), times = 1e4)
Results
Unit: microseconds
expr min lq mean median uq max neval cld
funA() 24.053 27.3305 37.98530 28.6665 34.4045 2478.510 10000 b
funB() 19.507 21.6400 30.37437 22.9235 27.0500 2547.145 10000 a
I am trying to merge two data frames. The original data frame is much larger than the data frame being merged with it; however, there is only one possible match for each row. The match is found by comparing the type (a factor) and the level. The level is an integer that falls into one of several buckets (the example only has two).
My current method works but uses sapply and is slow for large numbers of rows. How can I vectorise this operation?
set.seed(123)
sample <- 100
data <- data.frame(type= sample(LETTERS[1:4], sample, replace=TRUE), level =round(runif(sample, 1,sample)), value = round(runif(sample, 200,1000)))
data2 <- data.frame(type= rep(LETTERS[1:4],2), lower= c(rep(1,4), rep(51,4)), upper = c(rep(50,4), rep(sample,4)), cost1 = runif(8, 0,1), cost2 = runif(8, 0,1),cost3 = runif(8, 0,1))
data2[,4:6] <- data2[,4:6]/rowSums(data2[,4:6]) # turns the variables into percentages; not necessary on real data
x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)))
data3 <- cbind(data, percentage= data2[x, -c(1:3)])
If I understand the matching problem you've set up, the following code seems to speed things up a bit by dividing data by type and then using cut to find the proper bucket. I think it will accommodate larger numbers of pairs of lower and upper values but haven't checked carefully.
library(plyr)
percents <- function(value, cost) {
cost <- cost[cost[,1]== value[1,1],]
cost <- cost[order(cost[,2]),]
ints <- cut(value[,2], breaks=c(t(cost[,2:3])), labels=FALSE, include.lowest=TRUE )
cbind(value,percentage=cost[ceiling(ints/2),-(1:3)])
}
data4 <- rbind.fill(mapply(percents, value=split(data, data$type), cost=list(data2), SIMPLIFY=FALSE) )
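To see why ceiling(ints/2) recovers the matching row of cost: c(t(cost[,2:3])) interleaves the lower and upper bounds into a single breaks vector, so cut returns interval indices in which only the odd intervals are real buckets. A small illustration with made-up numbers (cost_ex is hypothetical, mirroring the structure of data2):
cost_ex <- data.frame(type = "A", lower = c(1, 51), upper = c(50, 100))
brks <- c(t(cost_ex[, 2:3]))  # 1 50 51 100
ints <- cut(c(10, 75), breaks = brks, labels = FALSE, include.lowest = TRUE)
ints                          # 1 3 -- odd interval indices are the real buckets
ceiling(ints / 2)             # 1 2 -- the matching rows of cost_ex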
Setting
sample <- 10000
gives the following execution time comparisons
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)));
data3 <- cbind(data, percentage= data2[x, -c(1:3)])} ,
data4 <- rbind.fill(mapply(percents, value=split(data, data$type), cost=list(data2), SIMPLIFY=FALSE) ),
times=10)
Unit: milliseconds
expr
{ x <- unlist(sapply(1:sample, function(n) which(ll <- data$type[n] == data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper))) data3 <- cbind(data, percentage = data2[x, -c(1:3)]) }
data4 <- rbind.fill(mapply(percents, value = split(data, data$type), cost = list(data2), SIMPLIFY = FALSE))
min lq mean median uq max neval
1198.18269 1214.10560 1225.85117 1226.79838 1234.2671 1258.63122 10
20.81022 20.93255 21.50001 21.24237 22.1305 22.65291 10
where the first numbers are for the code shown in your question and the second times are for the code in my post. For this case, the new code seems almost 60 times faster.
Edit
To use rbind_all and avoid mapply, use the following:
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)));
data3 <- cbind(data, percentage= data2[x, -c(1:3)])} ,
data4 <- rbind_all(lapply(split(data, data$type), percents, cost=data2 )),
times=10)
which gives slightly improved execution times
min lq mean median uq max neval
1271.57023 1289.17614 1297.68572 1301.84540 1308.31476 1313.56822 10
18.33819 18.57373 23.28578 19.53742 19.95132 58.96143 10
Edit 2
Modification to use the data2$lower values only for setting intervals
percents <- function(value, cost) {
cost <- cost[cost[,"type"] == value[1,"type"],]
cost <- cost[order(cost[,"lower"]),]
ints <- cut(value[,"value"], breaks= c(cost[,"lower"], max(cost[,"upper"])), labels=FALSE, right=FALSE, include.highest=TRUE )
cbind(value,percentage=cost[ints,-(1:3)])
}
to use with
data4 <- rbind_all(lapply(split(data, data$type), percents, cost=data2 ))
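As a side note, rbind_all has since been deprecated in dplyr; with recent dplyr versions the equivalent call (assuming the percents function above) would be:
data4 <- dplyr::bind_rows(lapply(split(data, data$type), percents, cost = data2))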
I have a dataframe that houses columns of numbers. I'd like to compute the range across these columns by row and create a new column that contains this range:
tool1 tool2 tool3 range
    1    34    12    33
   NA    19    23     4
It has to be able to handle NAs too, by just ignoring them.
How could this be done?
I've decided to expand this, because operating on rows in R is always a pain, so I'm comparing base R against the two very efficient packages data.table and dplyr
(I'm not a dplyr expert, so if someone wants to modify my answer, please do)
Note:
Your case isn't a classic case of operating on rows, because it can be solved using the vectorized pmax and pmin, which we won't always be able to use
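For the NA handling asked about in the question, note that pmax and pmin both accept na.rm = TRUE; a minimal sketch on the question's two example rows:
ex <- data.frame(tool1 = c(1, NA), tool2 = c(34, 19), tool3 = c(12, 23))
ex$range <- pmax(ex$tool1, ex$tool2, ex$tool3, na.rm = TRUE) -
            pmin(ex$tool1, ex$tool2, ex$tool3, na.rm = TRUE)
ex$range  # 33 4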
So, creating a bit bigger data than in your example:
n <- 1e4
set.seed(123)
df <- data.frame(tool1 = sample(100, n, replace = T),
tool2 = sample(100, n, replace = T),
tool3 = sample(100, n, replace = T))
Loading the necessary packages
library(data.table)
library(dplyr)
library(microbenchmark)
Defining the functions
apply1 <- function(y) apply(y, 1, function(x) max(x, na.rm = T) - min(x, na.rm = T))
apply2 <- function(y) apply(y, 1, function(x) diff(range(x, na.rm = T)))
trans <- function(y) transform(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))
DTfunc <- function(y) setDT(y)[, range := pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3)]
DTfunc2 <- function(y) set(y, j = "range", value = with(y, pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))) # Thanks to @Arun for this
dplyrfunc <- function(y) mutate(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))
df2 <- as.data.table(df) # This is in order to avoid overriding df by `setDT` during benchmarking
Running some benchmarks
microbenchmark(apply1(df), apply2(df), trans(df), DTfunc(df2), DTfunc2(df2), dplyrfunc(df), times = 100)
Unit: microseconds
expr min lq median uq max neval
apply1(df) 37221.513 40699.3790 44103.3495 46777.305 94845.463 100
apply2(df) 262440.581 278239.6460 287478.4710 297301.116 343962.869 100
trans(df) 1088.799 1178.3355 1234.9940 1287.503 1965.328 100
DTfunc(df2) 2068.750 2221.8075 2317.5680 2400.400 5935.883 100
DTfunc2(df2) 903.981 959.0435 986.3355 1026.395 1235.951 100
dplyrfunc(df) 1040.280 1118.9635 1159.9815 1200.680 1509.189 100
Seems like the second data.table approach is the most efficient. Base R transform and dplyr are pretty much the same, and both are more efficient than the first data.table approach because of the overhead of calling [.data.table.
Suppose I have some object (any object), for example:
X <- array(NA,dim=c(2,2))
Also I have some list:
L <- list()
I want L[[1]], L[[2]], L[[3]],...,L[[100]],...,L[[1000]] all to have the object X inside it. That is, if I type into the console L[[i]], it will return X, where i is in {1,2,...,1000}.
How do I do this efficiently without relying on a for loop or lapply?
Make a list of length 1 and replicate it:
L <- rep(list(X), 1000)
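A quick check, using the X defined in the question:
L <- rep(list(X), 1000)
length(L)               # 1000
identical(L[[537]], X)  # TRUE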
Using replicate, even if it is still a kind of loop solution:
L <- replicate(1000,X,simplify=FALSE)
EDIT: benchmarking the 2 solutions:
X <- array(NA,dim=c(2,2))
library(microbenchmark)
microbenchmark( rep(list(X), 10000),
replicate(10000,X,simplify=FALSE))
expr min lq median uq max neval
rep(list(X), 10000) 1.743070 2.114173 3.088678 5.178768 25.62722 100
replicate(10000, X, simplify = FALSE) 5.977105 7.573593 10.557783 13.647407 80.69774 100
rep is roughly 3.5 times faster here. I guess that's because replicate evaluates the expression at each iteration.
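A quick way to see that difference, using a side-effecting expression purely for illustration:
invisible(replicate(3, {cat("evaluated\n"); X}, simplify = FALSE))  # prints three times
invisible(rep(list({cat("evaluated\n"); X}), 3))                    # prints once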
I am using colSums in my code, but I also need the standard deviation alongside the sum.
I searched on the internet and found this page, which contains only colSums and colMeans:
http://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html
I tried this:
colSd
but I got this error:
Error: could not find function "colSd"
How can I do the same thing, but for the standard deviation:
colSd
Here is the code:
results <- colSums(x, na.rm=TRUE) #### here I want colSd
I want to provide a fourth approach (very similar to @Thomas's) and some benchmarking:
library("microbenchmark")
library("matrixStats")
colSdApply <- function(x, ...)apply(X=x, MARGIN=2, FUN=sd, ...)
colSdMatrixStats <- colSds
colSdColMeans <- function(x, na.rm=TRUE) {
if (na.rm) {
n <- colSums(!is.na(x)) # thanks #flodel
} else {
n <- nrow(x)
}
colVar <- colMeans(x*x, na.rm=na.rm) - (colMeans(x, na.rm=na.rm))^2
return(sqrt(colVar * n/(n-1)))
}
colSdThomas <- function(x)sqrt(rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1)))
m <- matrix(runif(1e7), nrow=1e3)
microbenchmark(colSdApply(m), colSdMatrixStats(m), colSdColMeans(m), colSdThomas(m))
# Unit: milliseconds
# expr min lq median uq max neval
# colSdApply(m) 435.7346 448.8673 456.6176 476.8373 512.9783 100
# colSdMatrixStats(m) 344.6416 357.5439 383.8736 389.0258 465.5715 100
# colSdColMeans(m) 124.2028 128.9016 132.9446 137.6254 172.6407 100
# colSdThomas(m) 231.5567 240.3824 245.4072 274.6611 307.3806 100
all.equal(colSdApply(m), colSdMatrixStats(m))
# [1] TRUE
all.equal(colSdApply(m), colSdColMeans(m))
# [1] TRUE
all.equal(colSdApply(m), colSdThomas(m))
# [1] TRUE
colSds and rowSds are two of many similar functions in the matrixStats package
Use the following:
colSd <- function (x, na.rm=FALSE) apply(X=x, MARGIN=2, FUN=sd, na.rm=na.rm)
This is a concise one-liner for calculating the standard deviation of the columns:
sqrt(diag(cov(data_matrix)))
Since the diagonal of a covariance matrix consists of the variances of each variable, we do the following:
Calculate the covariance matrix using cov
Extract the diagonal of the matrix using diag
Take the square root of the diagonal values using sqrt in order to get the standard deviation
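A small check against per-column sd, on toy data, just for verification:
set.seed(1)
data_matrix <- matrix(rnorm(20), ncol = 2)
sqrt(diag(cov(data_matrix)))  # matches
apply(data_matrix, 2, sd)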
I hope that helps :)
I don't know if these are particularly fast, but why not just use the formulae for SD:
x <- data.frame(y = rnorm(1000,0,1), z = rnorm(1000,2,3))
# If you have a population:
colsdpop <- function(x,...)
sqrt(rowMeans((t(x)-colMeans(x,...))^2,...))
colsdpop(x)
sd(x$y); sd(x$z) # won't match `sd`
# If you have a sample:
colsdsamp <- function(x)
sqrt( (rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1))) )
colsdsamp(x)
sd(x$y); sd(x$z) # will match `sd`
Note: the sample solution won't handle NAs well. One could incorporate something like apply(x,2,function(z) sum(!is.na(z))) into the right-most part of the formula to get an appropriate denominator, but it would get really murky quite quickly.
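For completeness, a rough sketch of that NA-tolerant variant (an illustration only; colsdsamp_na is a made-up name and the approach is only lightly checked): use the per-column non-NA counts as the denominator and pass na.rm to the means.
colsdsamp_na <- function(x) {
  n <- apply(x, 2, function(z) sum(!is.na(z)))  # non-NA count per column
  sqrt(rowMeans((t(x) - colMeans(x, na.rm = TRUE))^2, na.rm = TRUE) * n/(n - 1))
}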
I believe I have found a more elegant solution in diag(sqrt(var(data)))
This worked for me to get the standard deviation of each of my columns. However, it does compute a bunch of extra unnecessary covariances (and their square roots) along the way, so it isn't necessarily the most efficient approach. But if your data is small, it works excellently.
EDIT: I just realized that sqrt(diag(var(data))) is probably a bit more efficient, since it drops the unnecessary covariance terms earlier.
I usually do column sd's with apply:
x <- data.frame(y = rnorm(20,0,1), z = rnorm(20,2,3))
> apply(x, 2, sd)
y z
0.8022729 3.4700314
Verify:
> sd(x$y)
[1] 0.8022729
> sd(x$z)
[1] 3.470031
You can also do it with dplyr easily:
library(dplyr)
library(magrittr) # for pipes
> x %>% summarize_all(.,sd)
y z
1 0.8022729 3.470031
You can just use the apply function:
all.sd <- apply(data, 2,sd)