How to subtract two columns in a .csv file loaded into R?
I have named the new column using reading <- $started_time- $ended_time
Since you did not post any example data, here is an example based on the built-in iris dataset. You can simply use - to subtract vectors of the same length (if the lengths differ, the shorter vector is recycled).
You can select a column from your dataset with the $ operator or with the [] operator.
data(iris)
#assigning the result to a new column
iris$subtraction <- iris$Sepal.Length-iris$Sepal.Width
iris$subtraction <- iris[,1]-iris[,2]
#assigning the result to a new variable
subtraction <- iris[,1]-iris[,2]
subtraction <- iris$Sepal.Length-iris$Sepal.Width
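For the CSV case in the question, the same idea applies once the file has been read. This is only a minimal sketch: the file contents and the data frame name df are made up for illustration, and only the column names started_time and ended_time come from the question. The last lines show what recycling does when the lengths differ.

```r
# Hypothetical CSV standing in for the asker's file (contents are invented)
path <- tempfile(fileext = ".csv")
writeLines(c("started_time,ended_time",
             "10,25",
             "12,30",
             "15,21"), path)

df <- read.csv(path)

# Columns must be referenced through the data frame: df$col, not $col
df$reading <- df$started_time - df$ended_time
print(df$reading)

# Recycling: the length-2 vector is reused to match the length-4 vector
recycled <- c(10, 20, 30, 40) - c(1, 2)
print(recycled)
```

If the longer length is not a multiple of the shorter one, R still recycles but emits a warning, which is usually a sign that the two columns do not belong together.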
EDIT
A microbenchmark of three equivalent solutions:
library(microbenchmark)
library(data.table)
library(dplyr)
library(ggplot2)
#prepare simulation ------------------------------------------------------------
#number of rows to be tested
nr <- seq(100000,10000000,100000)
#initialize a list to store results
time <- as.list(rep(NA,100))
#benchmark
for (i in 1:length(nr)) {
set.seed(5)
#create data
df <- data.frame(x=rnorm(nr[i]),y=rnorm(nr[i]))
dt <- data.table(x=rnorm(nr[i]),y=rnorm(nr[i]))
#benchmark
x <- print(microbenchmark(
base=df$new.col <- df$x-df$y,
DT=dt <- dt[,new.col:=x-y],
dplyr=df %>% mutate(new.col=x-y),
times = 10
))
#store results
time[[i]] <- x[,c(1,4)]
}
#discard the first 4 elements because they run in microseconds
bench <- do.call(rbind,time[5:100])
#add the number of rows as column
bench$nrow <- rep(nr[5:100],each=3)
ggplot(bench,aes(x=nrow,y=mean,group=expr,col=expr))+
geom_smooth(se=F)+
theme_minimal()+
xlab("# rows")+
ylab("time (milliseconds)")
As you can see, for this simple task the base and data.table solutions perform equivalently, while the mutate solution is a bit slower. However, the entire simulation runs in about a minute and each single operation takes a few milliseconds.
My PC has 16 GB of RAM and 12 cores.
EDIT
After the OP asked for a date case, here is a small example with dates of class POSIXct:
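library(chron)
day <- Sys.Date()
# assumed definition (missing from the original post): the last 7 days
last7days <- day - 0:6
hm <- merge(0:23, seq(0, 45, by = 15))
datetime <- merge(last7days, chron(time = paste(hm$x, ':', hm$y, ':', 0)))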
colnames(datetime) <- c('date', 'time')
# create datetime
dt <- as.POSIXct(paste(datetime$date, datetime$time))
df <- data.frame(x=sample(dt,200000,replace = T),y=sample(dt,200000,replace = T))
microbenchmark(df$x-df$y)
The operation runs in a few milliseconds, as expected:
Unit: milliseconds
expr min lq mean median uq max neval
df$x - df$y 1.459801 1.544301 2.755227 1.624501 1.845401 62.7416 100
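One thing to watch with date-time subtraction is the unit of the result. A small sketch with two made-up timestamps (the names start and end are mine, not from the question): plain - chooses the unit automatically, while difftime() lets you fix it.

```r
# Two made-up timestamps in a fixed time zone so the result is reproducible
start <- as.POSIXct("2020-01-01 08:00:00", tz = "UTC")
end   <- as.POSIXct("2020-01-01 09:30:00", tz = "UTC")

# "-" picks a unit automatically, based on the size of the difference
auto <- end - start
print(units(auto))

# difftime() lets you fix the unit explicitly
mins <- difftime(end, start, units = "mins")
print(as.numeric(mins))
```

Fixing the unit matters when the result is stored in a column and later treated as a plain number, because the automatic unit can differ from one dataset to the next.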
I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data is incorrectly classified, and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package, but the predictions take too long to compute. In researching alternatives I have come across the nn2 function from the RANN package, which performs a nearest neighbour search that is significantly faster than K-Nearest Neighbours.
library(RANN)
library(tidyverse)
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
The result of the nn2 function is a list of two matrices, one of indices and one of distances. I want to use the index matrix to work out the most likely grouping of each observation; however, it returns the row number of each neighbour and not its group. I need to replace this with the group it belongs to (in this case, the Species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row? Speed of calculation is the key requirement.
Your scaling can be a real bottleneck when you have more columns (tested on 200 x 22216 gene expression matrix). My version might not seem that impressive with the iris dataset, but on the larger dataset I get 1.3 sec vs. 32.8 sec execution time.
Using tabulate instead of table gives an additional improvement, which is dwarfed, however, by the matrix scaling.
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
require, character.only=TRUE))
enableJIT(3)
# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
if (!is.null(rows) && !is.null(cols)) {x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) x <- x[, cols, drop = FALSE]
cm <- colMeans(x, na.rm = TRUE)
if (scale) csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE) else
csd <- rep(1, length = length(cm))
if (!center) cm <- rep(0, length = length(cm))
x <- t((t(x) - cm) / csd)
return(x)
}
# your posted version (mostly):
oldv <- function(){
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
category_neighbours <- matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
class <- apply(category_neighbours, 1, function(x) {
x1 <- table(x)
names(x1)[which.max(x1)]})
cbind(iris, class)
}
## my version:
myv <- function(){
iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2]-1)]))
iris.nn2 <- nn2(iris.scaled)
# set self neighbors to NA
iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
# match up categories
category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[,]],
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
# turn category_neighbours into numeric for tabulate
cn <- matrix(as.numeric(factor(category_neighbours, exclude=NULL)),
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
cnl <- levels(factor(category_neighbours, exclude = NULL))
# tabulate frequencies and match up with factor levels
class <- apply(cn, 1, function(x) {
cnl[which.max(tabulate(x, nbins=length(cnl))[!is.na(cnl)])]})
cbind(iris, class)
}
microbenchmark(oldv(), myv(), times=100L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201 100 b
#> myv() 2.430544 2.551342 3.020262 2.612714 2.691179 22.41435 100 a
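The per-row apply() over tabulate() can also be avoided altogether. The following is a rough base R sketch of that idea, not a benchmarked replacement: idx is a hand-made stand-in for the neighbour index matrix (nn.idx without the self column) and target for the class labels, so neither is real nn2 output.

```r
# Toy inputs standing in for the neighbour index matrix and the class labels
target <- factor(c("a", "a", "a", "b", "b", "b"))
idx <- rbind(c(2, 3, 4),
             c(1, 3, 5),
             c(1, 2, 4),
             c(5, 6, 3),
             c(4, 6, 3),
             c(4, 5, 1))

# Integer class code of every neighbour, same shape as idx
codes <- matrix(as.integer(target)[idx], nrow = nrow(idx))

# One column of counts per class, computed for all rows at once
counts <- vapply(seq_along(levels(target)),
                 function(l) rowSums(codes == l),
                 numeric(nrow(idx)))

# Most frequent class per row; ties go to the first level
majority <- levels(target)[max.col(counts, ties.method = "first")]
print(majority)
```

The loop here runs once per class rather than once per observation, which should help when there are many rows and few classes; I have not timed it against the versions above.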
I have a piece of R code I want to optimise for speed when working with larger datasets. It currently depends on sapply cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem; it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching big vectors to each other, and two nested paste statements for each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorize the sapply call that will make it an order of magnitude or three faster?
library(microbenchmark)
# create an example object like a simple_triplet_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000
# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10
stm <- data.frame(
i = sample(1:n, nonzerovalues, replace = TRUE),
j = sample(1:ncols, nonzerovalues, replace = TRUE),
v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)
# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v
expensive <- function(){
sapply(1:n, function(k){
# microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
whichi <- which(i == k)
paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
})
}
microbenchmark(expensive())
The output of expensive is a character vector, of n elements, that looks like this:
[1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"
[2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"
[3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8"
For what it's worth, the motivation is to efficiently write data from a sparse matrix format (either from slam or Matrix, but starting with slam) into libsvm format, which is the format above but with each row beginning with a number representing a target variable for a support vector machine (omitted in this example as it's not part of the speed problem). I am trying to improve on the answers to this question. I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine, but it doesn't scale up.
Use the data.table package. Its by argument, combined with its fast sorting, saves you from having to find the indices of equal i values yourself.
res1 <- expensive()
library(data.table)
cheaper <- function() {
setDT(stm)
res <- stm[, .(i, jv = paste(j, v, sep = ":"))
][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
setDF(stm) #clean-up which might not be necessary
res
}
res2 <- cheaper()
all.equal(res1, res2)
#[1] TRUE
microbenchmark(expensive(),
cheaper())
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417 100 b
# cheaper() 15.31835 15.66584 16.16267 15.98363 16.33637 18.35359 100 a
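If you would rather stay in base R, the same grouping idea can be sketched with split(). This is an illustration on a smaller made-up matrix, assuming the same i, j, v layout as in the question; I have not benchmarked it against the data.table version.

```r
set.seed(42)
n <- 200
stm <- data.frame(i = sample(1:n, n * 10, replace = TRUE),
                  j = sample(1:100, n * 10, replace = TRUE),
                  v = rpois(n * 10, 5))

# Build every "j:v" token once, outside any loop
jv <- paste(stm$j, stm$v, sep = ":")

# Group tokens by row index; factor levels keep rows that have no entries
groups <- split(jv, factor(stm$i, levels = seq_len(n)))

# Collapse each group into one space-separated string
res_split <- unname(vapply(groups, paste, character(1), collapse = " "))

# Reference: the loop-based approach from the question
res_loop <- sapply(seq_len(n), function(k) {
  w <- which(stm$i == k)
  paste(paste(stm$j[w], stm$v[w], sep = ":"), collapse = " ")
})
print(identical(res_split, res_loop))
```

The gain comes from scanning i once in split() instead of once per row in which(i == k).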
I am currently using the following code to merge more than 130 data frames, and it takes too many hours to run (I have never actually got it to complete on such a big dataset, only on subsets). Each table has two columns: unit (string) and counts (integer). I am merging by unit.
tables <- lapply(files, function(x) read.table(x, col.names = c("unit", x)))
MyMerge <- function(x, y){
df <- merge(x, y, by="unit", all.x= TRUE, all.y= TRUE)
return(df)
}
data <- Reduce(MyMerge, tables)
Is there any way to speed this up easily? Each table/data frame has around 500,000 rows, and many of those are unique to that table. Therefore, merging multiple tables quickly pushes the number of rows in the merged data frame to many millions.
At the end, I will drop rows with too low summary counts from my big merged table, but I don't want to do that during merging, as the order of my files would then matter.
Here is a small comparison, first with a rather small dataset, then with a larger one:
library(data.table)
library(plyr)
library(dplyr)
library(microbenchmark)
# sample size:
n = 4e3
# create some data.frames:
df_list <- lapply(1:100, function(x) {
out <- data.frame(id = c(1:n),
type = sample(c("coffee", "americano", "espresso"),n, replace=T))
names(out)[2] <- paste0(names(out)[2], x)
out})
# transform dfs into data.tables:
dt_list <- lapply(df_list, function(x) {
out <- as.data.table(x)
setkey(out, "id")
out
})
# set options to outer join for all methods:
mymerge <- function(...) base::merge(..., by="id", all=T)
mydplyr <- function(...) dplyr::full_join(..., by="id")
myplyr <- function(...) plyr::join(..., by="id", type="full")
mydt <- function(...) merge(..., by="id", all=T)
# Compare:
microbenchmark(base = Reduce(mymerge, df_list),
dplyr= Reduce(mydplyr, df_list),
plyr = Reduce(myplyr, df_list),
dt = Reduce(mydt, dt_list), times=50)
This gives the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
base 944.0048 956.9049 974.8875 962.9884 977.6824 1221.5301 50 c
dplyr 316.5211 322.2476 329.6281 326.9907 332.6721 381.6222 50 a
plyr 2682.9981 2754.3139 2788.7470 2773.8958 2812.5717 3003.2481 50 d
dt 537.2613 554.3957 570.8851 560.5323 572.5592 757.6631 50 b
We can see that the two contenders are dplyr and data.table. Changing the sample size to 5e5 yields the following comparison, showing that data.table indeed dominates. Note that I added this part after @BenBolker's suggestion.
microbenchmark(dplyr= Reduce(mydplyr, df_list),
dt = Reduce(mydt, dt_list), times=50)
Unit: seconds
expr min lq mean median uq max neval cld
dplyr 34.48993 34.85559 35.29132 35.11741 35.66051 36.66748 50 b
dt 10.89544 11.32318 11.61326 11.54414 11.87338 12.77235 50 a
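For the specific layout in the question (every table is just unit plus one counts column), a different technique from repeated merging is to stack all tables into one long table and reshape it once. A base R sketch on three made-up tables follows; the names f1, f2, f3 are hypothetical, and it assumes each unit appears at most once per table.

```r
# Three small made-up tables shaped like the question's: unit + counts
tables <- list(
  f1 = data.frame(unit = c("a", "b", "c"), counts = c(1L, 2L, 3L)),
  f2 = data.frame(unit = c("b", "c", "d"), counts = c(4L, 5L, 6L)),
  f3 = data.frame(unit = c("a", "d"),      counts = c(7L, 8L))
)

# Stack everything into one long table, tagging rows with their source
long <- do.call(rbind, Map(function(d, nm) {
  data.frame(unit = d$unit, file = nm, counts = d$counts)
}, tables, names(tables)))

# Spread to wide: one row per unit, one column per file, NA where absent
wide <- tapply(long$counts, list(long$unit, long$file), sum)
print(wide)
```

This gives the same cells as a full outer join on unit, but the cost grows with the total number of rows rather than with the size of an ever-growing intermediate result. The result is a matrix, not a data frame.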
I am trying to merge two data frames. The original data frame is much larger than the data frame it is going to be merged with; however, there is only one possible match for each row. The row is found by matching the type (a factor) and the level. The level is an integer that falls into one of several buckets (the example only has two).
My current method works but uses sapply and is slow for large numbers of rows. How can I vectorise this operation?
set.seed(123)
sample <- 100
data <- data.frame(type= sample(LETTERS[1:4], sample, replace=TRUE), level =round(runif(sample, 1,sample)), value = round(runif(sample, 200,1000)))
data2 <- data.frame(type= rep(LETTERS[1:4],2), lower= c(rep(1,4), rep(51,4)), upper = c(rep(50,4), rep(sample,4)), cost1 = runif(8, 0,1), cost2 = runif(8, 0,1),cost3 = runif(8, 0,1))
data2[,4:6] <- data2[,4:6]/rowSums(data2[,4:6]) #turns the variables in to percentages, not necessary on real data
x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)))
data3 <- cbind(data, percentage= data2[x, -c(1:3)])
If I understand the matching problem you've set up, the following code seems to speed things up a bit by dividing data by type and then using cut to find the proper bucket. I think it will accommodate larger numbers of pairs of lower and upper values but haven't checked carefully.
library(plyr)
percents <- function(value, cost) {
cost <- cost[cost[,1]== value[1,1],]
cost <- cost[order(cost[,2]),]
ints <- cut(value[,2], breaks=c(t(cost[,2:3])), labels=FALSE, include.lowest=TRUE )
cbind(value,percentage=cost[ceiling(ints/2),-(1:3)])
}
data4 <- rbind.fill(mapply(percents, value=split(data, data$type), cost=list(data2), SIMPLIFY=FALSE) )
Setting
sample <- 10000
gives the following execution time comparisons
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)));
data3 <- cbind(data, percentage= data2[x, -c(1:3)])} ,
data4 <- rbind.fill(mapply(percents, value=split(data, data$type), cost=list(data2), SIMPLIFY=FALSE) ),
times=10)
Unit: milliseconds
expr
{ x <- unlist(sapply(1:sample, function(n) which(ll <- data$type[n] == data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper))) data3 <- cbind(data, percentage = data2[x, -c(1:3)]) }
data4 <- rbind.fill(mapply(percents, value = split(data, data$type), cost = list(data2), SIMPLIFY = FALSE))
min lq mean median uq max neval
1198.18269 1214.10560 1225.85117 1226.79838 1234.2671 1258.63122 10
20.81022 20.93255 21.50001 21.24237 22.1305 22.65291 10
where the first numbers are for the code shown in your question and the second times are for the code in my post. For this case, the new code seems almost 60 times faster.
Edit
To use rbind_all and avoid mapply, use the following:
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)));
data3 <- cbind(data, percentage= data2[x, -c(1:3)])} ,
data4 <- rbind_all(lapply(split(data, data$type), percents, cost=data2 )),
times=10)
which gives slightly improved execution times
min lq mean median uq max neval
1271.57023 1289.17614 1297.68572 1301.84540 1308.31476 1313.56822 10
18.33819 18.57373 23.28578 19.53742 19.95132 58.96143 10
Edit 2
Modification to use the data2$lower values only for setting intervals
percents <- function(value, cost) {
cost <- cost[cost[,"type"] == value[1,"type"],]
cost <- cost[order(cost[,"lower"]),]
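# bucket on the level column; cut() has no include.highest argument, and with right=FALSE it is include.lowest=TRUE that closes the last interval
ints <- cut(value[,"level"], breaks= c(cost[,"lower"], max(cost[,"upper"])), labels=FALSE, right=FALSE, include.lowest=TRUE )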
cbind(value,percentage=cost[ints,-(1:3)])
}
to use with
data4 <- rbind_all(lapply(split(data, data$type), percents, cost=data2 ))
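Another fully vectorised option, shown here only as a sketch, is to join on type alone and then filter to the bucket that contains each level. The data below is a trimmed-down version of the question's example (one cost column, and n instead of sample as the size); the id column is my addition to restore the original row order.

```r
set.seed(123)
n <- 100
data <- data.frame(type  = sample(LETTERS[1:4], n, replace = TRUE),
                   level = round(runif(n, 1, n)))
data2 <- data.frame(type  = rep(LETTERS[1:4], 2),
                    lower = rep(c(1, 51), each = 4),
                    upper = rep(c(50, n), each = 4),
                    cost1 = runif(8))

# Remember the original row order, since merge() reorders rows
data$id <- seq_len(nrow(data))

# Join on type only, then keep the single bucket containing each level
m <- merge(data, data2, by = "type")
m <- m[m$level >= m$lower & m$level <= m$upper, ]
m <- m[order(m$id), ]
print(nrow(m) == nrow(data))
```

The intermediate result has one row per observation per bucket, so this trades memory for simplicity; with many buckets per type the cut() approach above is likely the better fit.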
I have a data frame with 50000 rows and 200 columns. There are duplicate rows in the data, and I want to aggregate the data by choosing the row with the maximum coefficient of variation among the duplicates, using the aggregate function in R. With aggregate I can use "mean" or "sum" by default, but not the coefficient of variation.
For example
aggregate(data, as.columnname, FUN=mean)
Works fine.
I have a custom function for calculating coefficient of variation but not sure how to use it with aggregate.
co.var <- function(x)
(
100*sd(x)/mean(x)
)
I have tried
aggregate(data, as.columnname, function (x) max (co.var (x, data[index (x),])
but it is giving an error as object x is not found.
Assuming that I understand your problem, I would suggest using tapply() instead of aggregate() (see ?tapply for more info). However, a minimal working example would be very helpful.
co.var <- function(x) ( 100*sd(x)/mean(x) )
## Data with multiple repeated measurements.
## There are three things (ID 1, 2, 3) that
## are measured two times, twice each (val1 and val2)
myDF<-data.frame(ID=c(1,2,3,1,2,3),val1=c(20,10,5,25,7,2),
val2=c(19,9,4,24,4,1))
## Calculate coefficient of variation for each measurement set
myDF$coVar<-apply(myDF[,c("val1","val2")],1,co.var)
## Use tapply() instead of aggregate
mySel<-tapply(seq_len(nrow(myDF)),myDF$ID,function(x){
curSub<-myDF[x,]
return(x[which(curSub$coVar==max(curSub$coVar))])
})
## The mySel vector is then the vector of rows that correspond to the
## maximum coefficient of variation for each ID
myDF[mySel,]
EDIT:
There are faster ways, one of which is below. However, with a 40000 by 100 dataset, the above code only took between 16 and 20 seconds on my machine.
# Create a big dataset
myDF <- data.frame(val1 = c(20, 10, 5, 25, 7, 2),
val2 = c(19, 9, 4, 24, 4, 1))
myDF <- myDF[sample(seq_len(nrow(myDF)), 40000, replace = TRUE), ]
myDF <- cbind(myDF, rep(myDF, 49))
myDF$ID <- sample.int(nrow(myDF)/5, nrow(myDF), replace = TRUE)
# Define a new function to work (slightly) better with large datasets
co.var.df <- function(x) ( 100*apply(x,1,sd)/rowMeans(x) )
# Create two datasets to benchmark the two methods
# (A second method proved slower than the third, hence the naming)
myDF.firstMethod <- myDF
myDF.thirdMethod <- myDF
Time the original method
startTime <- Sys.time()
myDF.firstMethod$coVar <- apply(myDF.firstMethod[,
grep("val", names(myDF.firstMethod))], 1, co.var)
mySel <- tapply(seq_len(nrow(myDF.firstMethod)),
myDF.firstMethod$ID, function(x) {
curSub <- myDF.firstMethod[x, ]
return(x[which(curSub$coVar == max(curSub$coVar))])
}, simplify = FALSE)
endTime <- Sys.time()
R> endTime-startTime
Time difference of 17.87806 secs
Time the faster method (called thirdMethod in the code)
startTime3 <- Sys.time()
coVar3<-co.var.df(myDF.thirdMethod[,
grep("val",names(myDF.thirdMethod))])
mySel3 <- tapply(seq_along(coVar3),
myDF[, "ID"], function(x) {
return(x[which(coVar3[x] == max(coVar3[x]))])
}, simplify = FALSE)
endTime3 <- Sys.time()
R> endTime3-startTime3
Time difference of 2.024207 secs
And check to see that we get the same results:
R> all.equal(mySel,mySel3)
[1] TRUE
There is an additional change from the original post, in that the edited code considers that there may be more than one row with the highest CV for a given ID. Therefore, to get the results from the edited code, you must unlist the mySel or mySel3 objects:
myDF.firstMethod[unlist(mySel),]
myDF.thirdMethod[unlist(mySel3),]
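The selection step can also be written without tapply() at all. A short sketch using ave() on the small example data from the start of this answer (the names vals, cv and keep are mine):

```r
myDF <- data.frame(ID   = c(1, 2, 3, 1, 2, 3),
                   val1 = c(20, 10, 5, 25, 7, 2),
                   val2 = c(19, 9, 4, 24, 4, 1))

# Row-wise coefficient of variation over the value columns
vals <- as.matrix(myDF[, c("val1", "val2")])
cv <- 100 * apply(vals, 1, sd) / rowMeans(vals)

# ave() returns the per-ID maximum aligned with the rows, so one
# vectorised comparison flags the row(s) to keep, ties included
keep <- cv == ave(cv, myDF$ID, FUN = max)
print(myDF[keep, ])
```

Like the edited tapply() code, this keeps every row that ties for the highest CV within an ID.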