I need to loop over a data frame and calculate functions over the variable that is being looped.
A table example:
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
I create a list of variables:
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
And the table that I will populate:
planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]
And the loop:
library(plyr)
library(dplyr) # for ntile()
for (i in 1:length(nums)){
table$deciles<-ntile(table[,nums[i]],5)
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
max=max(nums[i]),pos=sum(target))
planF<-rbind(planF,plan)
}
I need to get the min and max of the variable for each decile. But instead I get:
deciles min max pos
1 1 num1 num1 4
2 2 num2 num2 4
3 3 <NA> <NA> 2
4 4 <NA> <NA> 0
5 5 <NA> <NA> 0
6 1 num1 num1 4
7 2 num2 num2 4
8 3 <NA> <NA> 2
9 4 <NA> <NA> 0
10 5 <NA> <NA> 0
For variable num1 I need to get the result of:
ddply(table, 'deciles', summarize, min=min(num1),
max=max(num1),pos=sum(target))
deciles min max pos
1 5.736842 7.157895 0
2 7.631579 9.052632 0
3 1.000000 10.000000 2
4 1.947368 3.368421 4
5 3.842105 5.263158 4
And below the result of doing the same with num2.
I understand that I need to pass the variable as a bare name:
num1
but my code is passing the string:
'num1'
I tried with:
min=min(as.name(nums[i]))
But I get an error:
Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument
How can I apply a function to the variable that is being looped over?
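For reference, a minimal fix that keeps the loop is to pass an anonymous function to ddply instead of summarize: the function receives each piece as a data frame, so the column can be selected by its string name. A sketch, reusing table, nums, and planF from above (with plyr and dplyr loaded):
for (i in seq_along(nums)) {
  table$deciles <- ntile(table[, nums[i]], 5)
  plan <- ddply(table, 'deciles', function(d)
    data.frame(min = min(d[, nums[i]]),
               max = max(d[, nums[i]]),
               pos = sum(d$target)))
  planF <- rbind(planF, plan)
}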
The gist of your question is applying a list of functions with the split-apply-combine method, so here is one way you can do this in base R.
## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
table$deciles <- ntile(table[, nums[1]], 5) # ntile() is from dplyr
FUNS <- list(min = min, max = max, mean = mean)
## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))
## combine results
do.call('rbind', x)
# min max mean
# 1 1.000000 2.421053 1.710526
# 2 2.894737 4.315789 3.605263
# 3 4.789474 6.210526 5.500000
# 4 6.684211 8.105263 7.394737
# 5 8.578947 10.000000 9.289474
Since the above works and is fairly simple, put it into a function rather than repeating it in a loop:
f <- function(num, data = table) {
FUNS <- list(min = min, max = max, mean = mean)
x <- tapply(data[, num], data[, 'deciles'], function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))
cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}
This generalizes the method so it can be used with any column and any data. You can call it for individual columns like
f('num1')
f('num2')
Or use a loop to get everything at once
lapply(c('num1','num2'), f)
# [[1]]
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# [[2]]
# deciles min max mean
# 1 1 20.00000 21.57895 20.78947
# 2 2 22.10526 23.68421 22.89474
# 3 3 24.21053 25.78947 25.00000
# 4 4 26.31579 27.89474 27.10526
# 5 5 28.42105 30.00000 29.21053
If you don't like lapply, you can Vectorize the function to make it a little easier:
Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))
Which you would more commonly use like this (SIMPLIFY = FALSE retains the list structure):
v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num1'))
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
I would strictly prefer to use dplyr for this, even though there is some ugliness in handling string variable names in the call to summarize_ (note the trailing _):
library(lazyeval)
library(dplyr)
# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10))
)
# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")
# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
lapply(
numericCols, function(x) {
# compute the quantiles
quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))
# create quantile membership
dfX[["quantile_membership"]] =
findInterval(dfX[[x]], vec = quantiles,
rightmost.closed = TRUE,
all.inside = TRUE)
# summarize variables by decile
dfX %>%
group_by(quantile_membership) %>%
summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
max = interp( ~ max(x_name), x_name = as.name(x)),
mean = interp( ~ mean(x_name), x_name = as.name(x)))
}),
numericCols
)
# inspect the output
liDFY[[numericCols[1]]]
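For reference, dplyr 1.0 and later can express the same per-quantile summary without lazyeval, using the .data pronoun. A sketch under that assumption (summarise_by_quantile is an illustrative name, and it bins with ntile() rather than quantile() + findInterval(), so boundary handling differs slightly):
library(dplyr)

summarise_by_quantile <- function(x, data = dfX) {
  data %>%
    mutate(quantile_membership = ntile(.data[[x]], 5)) %>%
    group_by(quantile_membership) %>%
    summarise(min = min(.data[[x]]),
              max = max(.data[[x]]),
              mean = mean(.data[[x]]))
}

summarise_by_quantile("num1")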
Here is a sample dataset:
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4))
I would like to create a new column of row means using rowMeans. I want to keep na.rm = FALSE, because if a value is truly NA I do not want that row included in the mean calculation.
But if a value is 88 or 99, I would like R to ignore it while calculating the mean and still use the remaining valid values. So far I have this:
data$mean <- rowMeans(subset(data, select = c(`x`,`y`,`z`,`w`)), na.rm = T)
But I am not sure how to add in a function where it would just ignore the 88 and 99 from calculations.
This is what I am hoping to get
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4),
mean=c(4,NA,3.25,3.25,4.3))
Any help is appreciated - thank you!
You can still use rowMeans with na.rm=TRUE, but on a row subset, temporarily replacing 88 and 99 with NA.
s <- rowSums(is.na(data)) == 0 ## store row subset
v <- c("x", "y", "z", "w") ## col subset to calc. mean
data$mean <- NA ## initialize the column
m <- as.matrix(data[v]) ## we'll need a matrix
data$mean[s] <- rowMeans(replace(m[s, v], m[s, v] %in% c(88, 99), NA), na.rm=TRUE)
data
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
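The same logic can also be written without the subset bookkeeping; a sketch (vals and masked are illustrative names):
## mask 88/99 first, then respect the original NAs
vals <- as.matrix(data[c("x", "y", "z", "w")])
masked <- replace(vals, vals %in% c(88, 99), NA)
data$mean <- ifelse(rowSums(is.na(vals)) == 0,
                    rowMeans(masked, na.rm = TRUE), NA)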
Or simply use apply, which is much slower:
f <- \(x) if (any(is.na(x))) NA else mean(x[!x %in% c(88, 99)])
cbind(data, mean=apply(data, 1, f))
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
The timings below come from microbenchmark.
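A sketch of the benchmark call, reusing f, m, s, and v from above (the timings shown were presumably taken on a much larger data set than the five-row example):
library(microbenchmark)
microbenchmark(
  apply    = apply(data, 1, f),
  rowMeans = rowMeans(replace(m[s, v], m[s, v] %in% c(88, 99), NA), na.rm = TRUE),
  times = 100
)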
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# apply 35.018162 35.941815 38.834333 36.394632 36.960161 212.469412 100 b
# rowMeans 1.097393 1.119396 1.493563 1.193787 1.226691 9.352118 100 a
Another apply-based approach (note that na.rm = TRUE here means row 2 averages the remaining values instead of returning NA, which differs slightly from the requested output):
data <- data.frame(x=c(4,3,4,4,99),
                   y=c(4,NA,3,2,4),
                   z = c(88,NA,4,4,5),
                   w = c(4,5,2,3,4))
data$mean <- apply(data, 1, function(x) {
  idx <- which(!(x %in% c(88, 99)))
  mean(x[idx], na.rm = TRUE)
})
data
   x  y  z w     mean
1  4  4 88 4 4.000000
2  3 NA NA 5 4.000000
3  4  3  4 2 3.250000
4  4  2  4 3 3.250000
5 99  4  5 4 4.333333
I have the following dataset:
df <- data.frame(a=1:10,b=10:1)
I have a following function:
fun <- function(x,y) x*y/1000+x+y
I want the following output:
for (i in 2:10){df$a[i] = fun(df$a[i],df$a[i-1])}; for (i in 2:10){df$b[i] = fun(df$b[i],df$b[i-1])}
df
# a b
# 1 1.000000 10.00000
# 2 3.002000 19.09000
# 3 6.011006 27.24272
# 4 10.035050 34.43342
# 5 15.085225 40.64002
# 6 21.175737 45.84322
# 7 28.323967 50.02659
# 8 36.550559 53.17667
# 9 45.879514 55.28303
# 10 56.338309 56.33831
Essentially, the value at row i is the function applied to the (already updated) value at row i-1 and the current value, so the computation runs cumulatively down each column. Is there a better way to do it?
We can use the accumulate function from the purrr package.
library(purrr)
df <- data.frame(a=1:10,b=10:1)
fun <- function(x,y) x*y/1000+x+y
df$a <- accumulate(df$a, fun)
df$b <- accumulate(df$b, fun)
df
# a b
# 1 1.000000 10.00000
# 2 3.002000 19.09000
# 3 6.011006 27.24272
# 4 10.035050 34.43342
# 5 15.085225 40.64002
# 6 21.175737 45.84322
# 7 28.323967 50.02659
# 8 36.550559 53.17667
# 9 45.879514 55.28303
# 10 56.338309 56.33831
A base R option is Reduce() with accumulate = TRUE:
df[] <- lapply(df, function(x) Reduce(fun, x, accumulate = TRUE))
df
# a b
#1 1.000000 10.00000
#2 3.002000 19.09000
#3 6.011006 27.24272
#4 10.035050 34.43342
#5 15.085225 40.64002
#6 21.175737 45.84322
#7 28.323967 50.02659
#8 36.550559 53.17667
#9 45.879514 55.28303
#10 56.338309 56.33831
I want to find the max and min for each Gene in the following table.
I know that the following call gives the max (or min), but I could not manage to get both at the same time.
tapply(df$Value, df$Gene, max)
Any help is appreciated!
Small test set:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
A 123
A 1
B 3
B 5
B 6
C 1
D 3
D 45
D 98
D 234
D 4')
range() is the function that returns both the min and the max. Here you'd do:
tapply(df$Value, df$Gene, range)
# $A
# [1] 1 123
# $B
# [1] 3 6
# $C
# [1] 1 1
# $D
# [1] 3 234
You can keep using tapply, and simply modify the FUN argument to return multiple summary statistics. For example:
do.call(rbind, tapply(df$Value, df$Gene, FUN = function(x) c(max = max(x), min = min(x))))
# max min
#A 123 1
#B 6 3
#C 1 1
#D 234 3
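For reference, base R's aggregate() gives a similar per-group summary in one call; a sketch:
aggregate(Value ~ Gene, data = df,
          FUN = function(x) c(min = min(x), max = max(x)))
#   Gene Value.min Value.max
# 1    A         1       123
# 2    B         3         6
# 3    C         1         1
# 4    D         3       234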
I have a function (call it random_func) that generates random numbers according to some rules using parameters. I'm trying to repeatedly call that function and store the results in a dataframe.
df <- lapply(c(1,2,3,4,5), FUN = function(x) replicate(100, expr = random_func(n=10, param=x)))
Right now, the output is a list of 5 vectors each with 100 elements. What R voodoo do I need to do in order to get it to look something like:
param, result
1, 5
1, 6
1, 8
...
5, 10
set.seed(42)
do.call(rbind, #rbind results for different x together
lapply(c(1,2), FUN = function(x)
data.frame(param = x, #will be recycled
result = do.call(what = c, #concatenate results of replicate
replicate(n = 2,
expr = rnorm(n = 3, mean = x), #replace with random_func
simplify = FALSE))))) #when FALSE, replicate returns list
# param result
# 1 1 2.3709584
# 2 1 0.4353018
# 3 1 1.3631284
# 4 1 1.6328626
# 5 1 1.4042683
# 6 1 0.8938755
# 7 2 3.5115220
# 8 2 1.9053410
# 9 2 4.0184237
# 10 2 1.9372859
# 11 2 3.3048697
# 12 2 4.2866454
A rerun and map_df solution from purrr:
library(dplyr)
library(purrr)
The random function:
random_func <- function(n, param) {
rnorm(n)+(param*10)
}
The solution:
myfun <- function() {
  100 %>%
    rerun(x = 10, y = 1:5) %>%
    map_df(~ data.frame(param = .x$y, result = random_func(n = .x$x, param = .x$y)))
}
Output:
df <- myfun()
head(df)
param result
1 1 10.15325
2 2 19.52867
3 3 30.08218
4 4 40.06418
5 5 48.39804
6 1 11.00435
Additional validation:
df %>%
group_by(param) %>%
summarise(mean = mean(result))
param mean
1 1 10.00634
2 2 20.03874
3 3 30.11093
4 4 40.06166
5 5 50.02632
Performance:
library(microbenchmark)
microbenchmark(myfun())
expr min lq mean median uq max neval
myfun() 65.93166 66.80521 69.42349 68.5152 69.57185 90.77295 100
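Note that rerun() is deprecated in purrr 1.0, so the same long data frame can be built with map_dfr(); a sketch, reusing random_func from above:
library(purrr)

df <- map_dfr(1:5, function(p)
  data.frame(param = p,
             result = c(replicate(100, random_func(n = 10, param = p)))))
head(df)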
I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never come across any function that does this.
Thanks,
The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use:
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
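Tying this back to the prcomp error in the question, a minimal sketch (x stands for whatever numeric data you were passing to prcomp):
x_clean <- x[, apply(x, 2, var, na.rm = TRUE) != 0] # drop zero-variance columns
prcomp(x_clean, scale. = TRUE)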
This Q&A is a popular Google search result, but the answer above is a bit slow for a large matrix, and I do not have enough reputation to comment on it, so I am posting a new answer.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is the test: it cuts more than 90% of the runtime compared to the first answer, and it is also faster than Keith's length(unique()) method from the comments.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the answer above is a bit slow for a large matrix and @raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This a function I implemented in dataPreparation package.
First build an example data.table with more rows than columns (which is usually the case) and 10% NAs:
library(data.table)
library(dataPreparation)
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method
The results are the following:
time1 # Variance approch
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is roughly 10 times faster than the other approaches, and the more rows you have, the bigger the advantage.
The janitor package has the function remove_constant, which deletes constant columns.
Let's create some synthetic data for illustration:
library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
then the function remove_constant deletes the constant column:
remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)
Either of the above two calls gives:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that a column containing a single value plus NAs is also deleted. For example,
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA
test_dat_with_NA looks like this:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
then the call
remove_constant(test_dat_with_NA)
does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the call
remove_constant(test_dat_with_NA, na.rm= TRUE)
deletes column A, whose only values are 1 and NA:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
A tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)
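In dplyr 1.0 and later, select_if() is superseded by where(); an equivalent sketch of the variance-based filter:
df %>% select(where(~ var(.x, na.rm = TRUE) != 0))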