I need to loop over a data frame and calculate functions over the variable that is being looped.
A table example:
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
I create a list of variables:
nums<-colnames(table)[sapply(table, class)=='numeric']
And the table that I will populate:
planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
And the loop:
for (i in 1:length(nums)){
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
I need to get the min and max of the variable for each decile. But instead I get:
deciles min max pos
1 1 num1 num1 4
2 2 num2 num2 4
3 3 <NA> <NA> 2
4 4 <NA> <NA> 0
5 5 <NA> <NA> 0
6 1 num1 num1 4
7 2 num2 num2 4
8 3 <NA> <NA> 2
9 4 <NA> <NA> 0
10 5 <NA> <NA> 0
For variable num1 I need to get the result of:
ddply(table, 'deciles', summarize, min=min(num1),
deciles min max pos
1 5.736842 7.157895 0
2 7.631579 9.052632 0
3 1.000000 10.000000 2
4 1.947368 3.368421 4
5 3.842105 5.263158 4
And below the result of doing the same with num2.
I understand that I need to introduce the variable with the following form:
but the code is writing
I tried with:
But I get an error:
Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument
how can I calculate a function over the variable that is being looped?
The gist of your question is how to apply a list of functions within the split-apply-combine method, so here is one way to do this in base R.
## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
table$deciles <- ntile(table[, nums[1]], 5) # ntile() comes from dplyr, so run library(dplyr) first
FUNS <- list(min = min, max = max, mean = mean)
## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))
## combine results
do.call('rbind', x)
# min max mean
# 1 1.000000 2.421053 1.710526
# 2 2.894737 4.315789 3.605263
# 3 4.789474 6.210526 5.500000
# 4 6.684211 8.105263 7.394737
# 5 8.578947 10.000000 9.289474
Instead of using a loop, since we have the above which works and is fairly simple, put it into a function like below
f <- function(num, data = table) {
  FUNS <- list(min = min, max = max, mean = mean)
  ## split the column by deciles and apply every function to each piece
  x <- tapply(data[, num], data[, 'deciles'], function(x)
    setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))
  ## combine the pieces and add the decile as a column
  cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}
This way we have the method generalized so it can use any column you have with any data you have. You can call it for individual columns like
Or use a loop to get everything at once
lapply(c('num1','num2'), f)
# [[1]]
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
# [[2]]
# deciles min max mean
# 1 1 20.00000 21.57895 20.78947
# 2 2 22.10526 23.68421 22.89474
# 3 3 24.21053 25.78947 25.00000
# 4 4 26.31579 27.89474 27.10526
# 5 5 28.42105 30.00000 29.21053
If you don't like lapply, you can Vectorize the function to make it a little easier:
Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))
Which you would more commonly use like this (SIMPLIFY = FALSE to retain the list structures)
v <- Vectorize(f, SIMPLIFY = FALSE)
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
I would strictly prefer to use dplyr for this, even though there is some ugliness in handling string variable names in the call to summarize_ (note the trailing _):
library(dplyr)     # group_by(), summarize_(), %>%
library(lazyeval)  # interp()
# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
                 char1=c(rep('a',10), rep('b',10)),
                 target=c(rep(1,10), rep(0,10)))
# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")
# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
  lapply(
    numericCols, function(x) {
      # compute the quantiles
      quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))
      # create quantile membership
      dfX[["quantile_membership"]] =
        findInterval(dfX[[x]], vec = quantiles,
                     rightmost.closed = TRUE,
                     all.inside = TRUE)
      # summarize variables by decile
      dfX %>%
        group_by(quantile_membership) %>%
        summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
                   max = interp( ~ max(x_name), x_name = as.name(x)),
                   mean = interp( ~ mean(x_name), x_name = as.name(x)))
    }),
  numericCols)
# inspect the output
liDFY
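The original attempt returned "num1" because min(nums[i]) takes the minimum of the column name, which is a character string, rather than of the column. A minimal base R sketch of looking the column up by its name with [[ ]] follows. The helper summarise_col and the grouping column grp are hypothetical names used only for illustration, and grp is a simple stand-in for the deciles column.

```r
# Data from the question, plus a hypothetical grouping column standing in for 'deciles'
tbl <- data.frame(num1 = seq(1, 10, length.out = 20),
                  num2 = seq(20, 30, length.out = 20))
tbl$grp <- rep(1:5, each = 4)

# Summarise one column, given its name as a string
summarise_col <- function(col, data, group = "grp") {
  v <- data[[col]]    # the column itself, not its name
  g <- data[[group]]  # the grouping values
  data.frame(group = sort(unique(g)),
             min   = as.vector(tapply(v, g, min)),
             max   = as.vector(tapply(v, g, max)))
}

# Loop over the column names; the result is a named list of data frames
nums <- c("num1", "num2")
res <- lapply(setNames(nums, nums), summarise_col, data = tbl)
res$num1
```

Because the column is fetched with data[[col]], the same function works for any column name stored in a character vector, with no need for as.name() or non-standard evaluation.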
Here is a sample dataset:
data <- data.frame(x=c(4,3,4,4,99),
y = c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4))
I would like to create a new column of means using rowMeans. I would like to keep na.rm=F, because if a value is truly NA I do not want that row included in my means calculation.
But if a value is either 88 or 99, I would like R to ignore it while calculating the mean and still use the remaining valid values. So far I have the below.
data$mean <- rowMeans(subset(data, select = c(`x`,`y`,`z`,`w`)), na.rm = T)
But I am not sure how to add in a function where it would just ignore the 88 and 99 from calculations.
This is what I am hoping to get
data <- data.frame(x=c(4,3,4,4,99),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4),
Any help is appreciated - thank you!
You can still use rowMeans with na.rm=TRUE, but only on the subset of rows without NA, temporarily replacing 88 and 99 with NA.
s <- rowSums(is.na(data)) == 0 ## store row subset
v <- c("x", "y", "z", "w") ## col subset to calc. mean
data$mean <- NA ## ini column
m <- as.matrix(data[v]) ## we'll need a matrix
data$mean[s] <- rowMeans(replace(m[s, v], m[s, v] %in% c(88, 99), NA), na.rm=TRUE)
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
Or simply use apply, although it is much slower.
f <- \(x) if (any(is.na(x))) NA else mean(x[!x %in% c(88, 99)])
cbind(data, mean=apply(data, 1, f))
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
From microbenchmark.
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# apply 35.018162 35.941815 38.834333 36.394632 36.960161 212.469412 100 b
# rowMeans 1.097393 1.119396 1.493563 1.193787 1.226691 9.352118 100 a
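The same idea can be wrapped in a small reusable function. This is only a sketch: row_mean_skip and its codes argument are hypothetical names, and the y column is inferred from the printed outputs above. It marks rows that contain a genuine NA first, then masks the sentinel codes and lets rowMeans skip them.

```r
data <- data.frame(x = c(4, 3, 4, 4, 99),
                   y = c(4, NA, 3, 2, 4),   # inferred from the printed output
                   z = c(88, NA, 4, 4, 5),
                   w = c(4, 5, 2, 3, 4))

# Row means that ignore sentinel codes but stay NA for rows with a real NA
row_mean_skip <- function(d, codes = c(88, 99)) {
  m <- as.matrix(d)
  has_na <- rowSums(is.na(m)) > 0   # rows with a genuine missing value
  m[m %in% codes] <- NA             # mask the sentinel codes
  out <- rowMeans(m, na.rm = TRUE)  # mean of the remaining valid values
  out[has_na] <- NA                 # restore NA for rows that had a real NA
  out
}

data$mean <- row_mean_skip(data[c("x", "y", "z", "w")])
data
```

A row consisting only of sentinel codes would come out as NaN under this sketch, so that case may need separate handling.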
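data <- data.frame(x=c(4,3,4,4,99),
y = c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4))
data$mean <- apply(data, 1, function(x) {
  # keep only the positions whose value is not one of the codes 88 / 99
  idx <- which((x %in% c(88, 99)) == FALSE)
  # na.rm = TRUE also drops genuine NAs, so row 2 gets a mean instead of NA
  mean(x[ idx ], na.rm = TRUE)
})
data
x y z w mean
1 4 4 88 4 4.000000
2 3 NA NA 5 4.000000
3 4 3 4 2 3.250000
4 4 2 4 3 3.250000
5 99 4 5 4 4.333333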
I have the following dataset:
df <- data.frame(a=1:10,b=10:1)
I have the following function:
fun <- function(x,y) x*y/1000+x+y
I want the following output:
for (i in 2:10){df$a[i] = fun(df$a[i],df$a[i-1])};for (i in 2:10){df$b[i] = fun(df$b[i],df$b[i-1])}
# a b
# 1 1.000000 10.00000
# 2 3.002000 19.09000
# 3 6.011006 27.24272
# 4 10.035050 34.43342
# 5 15.085225 40.64002
# 6 21.175737 45.84322
# 7 28.323967 50.02659
# 8 36.550559 53.17667
# 9 45.879514 55.28303
# 10 56.338309 56.33831
Essentially, the element for row i is the function output from the last row and current row, and this is executed recursively. Is there a better way to do it?
We can use the accumulate function from the purrr package.
library(purrr) # provides accumulate()
df <- data.frame(a=1:10,b=10:1)
fun <- function(x,y) x*y/1000+x+y
df$a <- accumulate(df$a, fun)
df$b <- accumulate(df$b, fun)
# a b
# 1 1.000000 10.00000
# 2 3.002000 19.09000
# 3 6.011006 27.24272
# 4 10.035050 34.43342
# 5 15.085225 40.64002
# 6 21.175737 45.84322
# 7 28.323967 50.02659
# 8 36.550559 53.17667
# 9 45.879514 55.28303
# 10 56.338309 56.33831
A base R option is Reduce with accumulate = TRUE:
df[] <- lapply(df, function(x) Reduce(fun, x, accumulate = TRUE))
# a b
#1 1.000000 10.00000
#2 3.002000 19.09000
#3 6.011006 27.24272
#4 10.035050 34.43342
#5 15.085225 40.64002
#6 21.175737 45.84322
#7 28.323967 50.02659
#8 36.550559 53.17667
#9 45.879514 55.28303
#10 56.338309 56.33831
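To see why Reduce reproduces the loop, the sketch below compares both on a single vector. Reduce calls fun(accumulated value, next element), whereas the loop in the question calls fun(current element, previous result). The two agree here because fun is symmetric in its arguments; for a non-symmetric function the argument order would matter.

```r
fun <- function(x, y) x * y / 1000 + x + y
x <- 1:10

# Running result with Reduce: element i is fun(result[i - 1], x[i])
acc <- Reduce(fun, x, accumulate = TRUE)

# The explicit loop from the question, written on a plain vector
loop <- numeric(length(x))
loop[1] <- x[1]
for (i in 2:length(x)) loop[i] <- fun(x[i], loop[i - 1])

# Both approaches give the same running values
isTRUE(all.equal(acc, loop))
```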
I want to find the max and min for each Gene in the following table.
I know that the following function gives the max (or the min), but I could not manage to get both at the same time.
tapply(df$Value, df$Gene, max)
Small test set:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
A 123
A 1
B 3
B 5
B 6
C 1
D 3
D 45
D 98
D 234
D 4')
range is the function that returns both the min and the max.
Here you'd do:
tapply(df$Value, df$Gene, range)
# $A
# [1] 1 123
# $B
# [1] 3 6
# $C
# [1] 1 1
# $D
# [1] 3 234
You can keep using tapply, and simply modify the FUN argument to return multiple summary statistics. For example:
do.call(rbind, tapply(df$Value, df$Gene, FUN = function(x) c(max = max(x), min = min(x))))
# max min
#A 123 1
#B 6 3
#C 1 1
#D 234 3
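A related base R sketch, shown here as one possible alternative rather than the answer's own method, combines split with sapply and range to get a matrix with one row per gene. The column labels min and max are added by hand, since range returns an unnamed pair.

```r
df <- read.table(header = TRUE, text = "Gene Value
A 12
A 10
A 123
A 1
B 3
B 5
B 6
C 1
D 3
D 45
D 98
D 234
D 4")

# split() gives one vector of values per gene; range() returns c(min, max)
res <- t(sapply(split(df$Value, df$Gene), range))
colnames(res) <- c("min", "max")
res
```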
I have a function (call it random_func) that generates random numbers according to some rules using parameters. I'm trying to repeatedly call that function and store the results in a dataframe.
df <- lapply(c(1,2,3,4,5), FUN = function(x) replicate(100, expr = random_func(n=10, param=x)))
Right now, the output is a list of 5 vectors each with 100 elements. What R voodoo do I need to do in order to get it to look something like:
param, result
1, 5
1, 6
1, 8
5, 10
do.call(rbind, #rbind results for different x together
lapply(c(1,2), FUN = function(x)
data.frame(param = x, #will be recycled
result = do.call(what = c, #concatenate results of replicate
replicate(n = 2,
expr = rnorm(n = 3, mean = x), #replace with random_func
simplify = FALSE))))) #when FALSE, replicate returns list
# param result
# 1 1 2.3709584
# 2 1 0.4353018
# 3 1 1.3631284
# 4 1 1.6328626
# 5 1 1.4042683
# 6 1 0.8938755
# 7 2 3.5115220
# 8 2 1.9053410
# 9 2 4.0184237
# 10 2 1.9372859
# 11 2 3.3048697
# 12 2 4.2866454
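When each call returns a single number, as in the question, the list from lapply can also be flattened directly. The sketch below assumes a hypothetical stand-in for random_func (the real function is not shown in the question) and builds the long data frame with rep and unlist.

```r
# Hypothetical stand-in for random_func: returns one number per call
random_func <- function(n, param) mean(rnorm(n, mean = param))

set.seed(1)
params <- 1:5

# One numeric vector of 100 results per parameter value
res_list <- lapply(params, function(p)
  replicate(100, random_func(n = 10, param = p)))

# Repeat each parameter once per result and flatten the list
long <- data.frame(param  = rep(params, lengths(res_list)),
                   result = unlist(res_list))
head(long)
```

Using lengths(res_list) instead of a fixed 100 keeps the param column aligned with the results even if the number of replications changes.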
A solution using rerun and map_df from purrr.
Random function:
random_func <- function(n, param) {
myfun <- function() {
  # repeat the parameter list 100 times, then bind one data frame per repetition
  df <- 100 %>%
    rerun(x=10, y=1:5) %>%
    map_df(~data.frame(param=.x$y, result=random_func(n=.x$x, param=.x$y)))
}
df <- myfun()
param result
1 1 10.15325
2 2 19.52867
3 3 30.08218
4 4 40.06418
5 5 48.39804
6 1 11.00435
Additional validation
df %>%
group_by(param) %>%
summarise(mean = mean(result))
param mean
1 1 10.00634
2 2 20.03874
3 3 30.11093
4 4 40.06166
5 5 50.02632
expr min lq mean median uq max neval
myfun() 65.93166 66.80521 69.42349 68.5152 69.57185 90.77295 100
I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never come across any function that does this.
The problem here is that your column variance is equal to zero. You can check which column of a data frame is constant this way, for example :
df <- data.frame(x=1:5, y=rep(1,5))
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
# (index names(df) directly: df[, cond] drops to a plain vector when one column matches)
names(df)[sapply(df, function(v) var(v, na.rm=TRUE)==0)]
# [1] "y"
So if you want to exclude these columns, you can use :
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0), drop=FALSE] # drop=FALSE keeps a data frame even if one column remains
EDIT : In fact it is simpler to use apply instead. Something like this :
df[,apply(df, 2, var, na.rm=TRUE) != 0, drop=FALSE]
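The variance test only works for numeric columns. As a sketch under that observation, the hypothetical helpers is_constant and drop_constant below count distinct non-missing values instead, so character and factor columns are handled too, and drop = FALSE keeps the result a data frame.

```r
# TRUE when a column has at most one distinct non-missing value
is_constant <- function(v) length(unique(v[!is.na(v)])) <= 1

# Keep only the columns that are not constant
drop_constant <- function(d) {
  d[, !vapply(d, is_constant, logical(1)), drop = FALSE]
}

df <- data.frame(x = 1:5,
                 y = rep(1, 5),
                 z = c("a", NA, "a", "a", "a"))
drop_constant(df)
```

Under this definition a column that is entirely NA also counts as constant and is removed.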
This Q&A is a popular Google search result, but the first answer is a bit slow for a large matrix, and I do not have enough reputation to comment on it, so I am posting a new answer.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is the test. This method takes over 90% less time than the first answer. It is also faster than the approach from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the first answer is a bit slow for a large matrix and @raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
First build an example data.table with more rows than columns (which is usually the case) and 10% of NAs:
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
library(data.table)      # as.data.table()
library(dataPreparation) # which_are_constant()
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method
The results are the following:
time1 # Variance approach
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is about 10 times faster than the other approaches here.
The more rows you have, the larger the benefit.
The janitor library has the function remove_constant, which can help delete constant columns.
Let's create some synthetic data for illustration:
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
The function remove_constant then deletes the constant column:
remove_constant(test_dat, na.rm= TRUE)
With the call above, we get:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that any column having one value and NA will also be deleted. For example,
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
The resulting test_dat_with_NA is:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
Calling remove_constant without na.rm = TRUE does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the call
remove_constant(test_dat_with_NA, na.rm= TRUE)
does delete column A, which contains only the value 1 and NA:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr) # select_if(), %>%
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)