Grouping an entire data set and aggregating - r

I have a dataset of 20 variables V1, V2, V3, ..., V20 with 1,200 rows.
I want to average every four rows in my data frame, i.e. my output dataset should have the same 20 columns
V1, V2, V3, ..., V20 but only 300 rows, each containing the average of a group of 4 consecutive rows.
I cannot use tapply, since it takes one variable at a time; I want to average all 20 variables at once.
Is there an efficient way to do this? I would like to use functions from the apply family and
avoid explicit loops.

Using lapply with colMeans
set.seed(42)
dat <- as.data.frame(matrix(sample(1:20, 20*1200, replace=TRUE), ncol=20))
n <- seq_len(nrow(dat))
res <- do.call(rbind, lapply(split(dat, (n - 1) %/% 4 + 1), colMeans, na.rm = TRUE))
dim(res)
#[1] 300 20
Explanation
The idea is to create a grouping variable that splits the dataset into a list of subsets, so that rows 1:4 go into the first subset, rows 5:8 into the second, and so on; the last (300th) subset holds rows 1197:1200. For easier understanding, here is the same idea on a smaller example. Suppose your dataset has 10 rows:
n1 <- seq_len(10)
n1
#[1] 1 2 3 4 5 6 7 8 9 10
(n1 - 1) %/% 4 # create a numeric index to split by group
# [1] 0 0 0 0 1 1 1 1 2 2
I added 1 to the above to start from 1 instead of 0:
(n1 - 1) %/% 4 + 1
#[1] 1 1 1 1 2 2 2 2 3 3
You could also use gl, where the first argument is the number of levels (groups) and the second is the group size, i.e.
gl(3, 4, 10)
# [1] 1 1 1 1 2 2 2 2 3 3
# Levels: 1 2 3
For the full dataset it would be
gl(300, 4, 1200)
(gl(1200, 4, 1200) would give the same values but with 900 unused levels, making split() return empty groups.)
Now, you can split either n1 or the dataset itself by the newly created grouping index:
split(n1, (n1 - 1) %/% 4 + 1) # you can check the result of this
For a subset of 10 rows of the dataset:
split(dat[1:10, ], (n1 - 1) %/% 4 + 1)
Then lapply applies colMeans to each list element, and do.call(rbind, ...) stacks the per-group means back into a single matrix.
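Putting the pieces together on the 10-row subset (just a sketch to see the shape; the actual values depend on the random data):
do.call(rbind, lapply(split(dat[1:10, ], (n1 - 1) %/% 4 + 1), colMeans, na.rm = TRUE))
# a 3 x 20 matrix: row 1 holds the means of rows 1:4, row 2 of rows 5:8, row 3 of rows 9:10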
Or
summarise with across from dplyr
(summarise_each and funs are deprecated in current dplyr; across is the modern equivalent.)
library(dplyr)
res2 <- dat %>%
  mutate(N = (row_number() - 1) %/% 4 + 1) %>%
  group_by(N) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) %>%
  select(-N)
dim(res2)
#[1] 300 20
all.equal(as.data.frame(res), as.data.frame(res2), check.attributes=FALSE)
#[1] TRUE
Or
Using data.table
library(data.table)
DT1 <- setDT(dat)[, N := (seq_len(.N) - 1) %/% 4 + 1][,
         lapply(.SD, mean, na.rm = TRUE), by = N][, N := NULL]
dim(DT1)
#[1] 300 20
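Note that setDT() converts dat to a data.table by reference, so after the chain above dat itself still carries the helper column N. A sketch of how to avoid that side effect, using data.table::copy():
DT1 <- setDT(copy(dat))[, N := (seq_len(.N) - 1) %/% 4 + 1][,
         lapply(.SD, mean, na.rm = TRUE), by = N][, N := NULL]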

Split all values in column and store them in a single numeric vector

I have a dataframe in R and I want to create a single numeric vector by splitting all of the character values in a specific column and then appending them to the vector or list. The values in the column are all comma-separated numbers and there are rows with missing values or NA.
Current data
id col
1 2,6,10
2 NA
3 5, 10
4 1
Final vector
v <- c(2, 6, 10, 5, 10, 1)
v
#[1] 2 6 10 5 10 1
I'm able to do this by iterating through all the values in the column but I know this isn't the most efficient way since R is made to work easily with vectors. Is there a better way to do this?
v <- c()
for (val in df$col) {
  if (!is.na(val)) {
    ints <- as.numeric(unlist(strsplit(val, ",")))
    v <- c(v, ints)
  }
}
You already have the answer in your code: all the functions you are using are vectorised, so they can be applied to the whole column at once.
v <- as.numeric(na.omit(unlist(strsplit(df$col, ','))))
v
#[1] 2 6 10 5 10 1
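Reading the one-liner inside out, with intermediate results (a sketch, given the df above):
strsplit(df$col, ",")         # list(c("2","6","10"), NA, c("5", " 10"), "1")
unlist(strsplit(df$col, ",")) # "2" "6" "10" NA "5" " 10" "1"
# na.omit() then drops the NA, and as.numeric() coerces the rest,
# quietly trimming the stray whitespace in " 10"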
Does this work?
library(dplyr)
library(tidyr)
v <- df %>% separate_rows(col) %>% na.omit() %>% pull(col) %>% as.numeric()
v
#[1] 2 6 10 5 10 1
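A small refinement, assuming every split piece is numeric: separate_rows() has a convert argument that type-converts the pieces, so the trailing as.numeric() can be dropped:
v <- df %>% separate_rows(col, convert = TRUE) %>% na.omit() %>% pull(col)
v
#[1] 2 6 10 5 10 1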
Data used:
df
# A tibble: 4 x 2
id col
<dbl> <chr>
1 1 2,6,10
2 2 NA
3 3 5, 10
4 4 1

Get the mean across list of dataframes by rows

I have a list of dataframes and I want to calculate a row-wise mean: one mean over all the first rows, one over all the second rows, and so on.
I think this is possible by creating a common row index, stacking the dataframes with rbind, and then aggregating with aggregate(value ~ row.index, large.df, mean). However, I guess there is a more straightforward way?
Here is my example:
df1 <- data.frame(val = c(4,1,0))
df2 <- data.frame(val = c(5,2,1))
df3 <- data.frame(val = c(6,3,2))
myLs <- list(df1, df2, df3)
myLs
[[1]]
val
1 4
2 1
3 0
[[2]]
val
1 5
2 2
3 1
[[3]]
val
1 6
2 3
3 2
And my expected dataframe output, as row-wise means:
df.means
mean
1 5
2 2
3 1
My first steps, not working as expected yet:
# Calculate the mean of list by rows
lapply(myLs, function(x) mean(x[1,]))
A simple way would be to cbind the list into a single data frame and calculate the mean of each row with rowMeans:
rowMeans(do.call(cbind, myLs))
#[1] 5 2 1
We can also use bind_cols from dplyr to combine all the dataframes.
rowMeans(dplyr::bind_cols(myLs))
Here is another base R solution using unlist + data.frame + rowMeans, i.e.,
rowMeans(data.frame(unlist(myLs, recursive = FALSE)))
# [1] 5 2 1
Using a double loop:
sapply(1:3, function(i) mean(sapply(myLs, function(j) j[i, ] )))
# [1] 5 2 1
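The hardcoded 1:3 can also be derived from the data — a small tweak, assuming every data frame in the list has the same number of rows:
sapply(seq_len(nrow(myLs[[1]])), function(i) mean(sapply(myLs, function(j) j[i, ])))
# [1] 5 2 1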
Another base R possibility could be:
Reduce("+", myLs)/length(myLs)
val
1 5
2 2
3 1
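Why this works: Reduce folds + over the list, computing df1 + df2 + df3 element-wise (data frames of identical shape support element-wise arithmetic), and dividing by length(myLs) turns the sum into a mean. A quick check:
identical(Reduce(`+`, myLs), df1 + df2 + df3)
#[1] TRUE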

Filter rows when all columns are greater than a value

I have a data frame and I would like to subset the rows where all column values meet my cutoff.
Here is the data frame:
A B C
1 1 3 5
2 4 3 5
3 2 1 2
What I would like to select is the rows where all columns are greater than 2.
The second row is what I want to get:
[1] 4 3 5
Here is my code:
subset_data <- df[which(df[,c(1:ncol(df))] > 2),]
But my code does not apply the condition to all columns. Do you have any idea how I can fix this?
We can create a logical matrix by comparing the entire data frame with 2, then take rowSums over it and select only those rows whose sum equals the number of columns in df (i.e. every column passed the test):
df[rowSums(df > 2) == ncol(df), ]
# A B C
#2 4 3 5
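The intermediate steps, spelled out on the df above:
df > 2
#      A     B     C
#1 FALSE  TRUE  TRUE
#2  TRUE  TRUE  TRUE
#3 FALSE FALSE FALSE
rowSums(df > 2)
#1 2 3
#2 3 0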
A dplyr approach using filter_all and all_vars (both superseded in dplyr >= 1.0.0 by if_all, shown below):
library(dplyr)
df %>% filter_all(all_vars(. > 2))
# A B C
#1 4 3 5
dplyr >= 1.0.0
#1. if_all
df %>% filter(if_all(everything(), ~ . > 2))
#2. across (works, though filter(across()) is deprecated in favour of if_all)
df %>% filter(across(.fns = ~ . > 2))
An apply approach
#Using apply
df[apply(df > 2, 1, all), ]
#Using lapply, as shared by @thelatemail
df[Reduce(`&`, lapply(df, `>`, 2)),]
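In the lapply variant, each column is compared against 2 separately, and Reduce(`&`, ...) collapses the resulting logical vectors element-wise into a single row filter:
lapply(df, `>`, 2) # list of three logical vectors, one per column
Reduce(`&`, lapply(df, `>`, 2))
#[1] FALSE  TRUE FALSE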

Random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to a group, take all of that group's columns).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
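Conceptually, that Map call pairs each group of indices with its sample size, so it is roughly equivalent to writing out (using the idx shown above):
list(A = sample(idx$A, 7), B = sample(idx$B, 7), C = sample(idx$C, 6))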
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
  lapply(unique(colnames(dframe)), function(x) {
    cols <- which(colnames(dframe) == x)   # all columns in this group
    # take the whole group if it is small enough, otherwise sample nc of its columns
    dframe[, if (length(cols) <= nc) cols else sample(cols, nc, replace = FALSE)]
  }))
It might look complicated, but it really just takes all columns of a group if there are nc or fewer, and samples nc random columns otherwise.
And to restore your original column-name scheme, gsub does the trick (escaping the dot and allowing multi-digit suffixes such as A.10):
colnames(res) <- gsub('\\.[[:digit:]]+$', '', colnames(res))

Perform ifelse() on every element of a data frame, but different test for each column in R

I've got a large data frame [4000, 600] and I'd like to convert elements to 0 if they are more than three orders of magnitude smaller than their column's maximum. So each element needs to be compared to the maximum of its column: if element < 0.001 * column_max, it should be converted to 0; otherwise it should remain the same.
I am having a tough time getting apply() to let me use an ifelse() function. Is there a better approach or a function I am missing? I'm fairly new to R.
Use lapply to loop over each column with a replace call:
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))
dat
# a b
#1 1 3
#2 2 4
#3 1001 3003
dat[] <- lapply(dat, function(x) replace(x, x < max(x)/10^3, 0) )
dat
# a b
#1 0 0
#2 2 4
#3 1001 3003
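For the record, replace() is just a functional wrapper around subassignment, and assigning into dat[] swaps the columns while keeping the data.frame structure. The equivalence on a single column:
x <- c(1, 2, 1001)
replace(x, x < max(x)/10^3, 0) # functional: returns a modified copy
#[1]    0    2 1001
x[x < max(x)/10^3] <- 0; x     # subassignment: same result, modifies x in place
#[1]    0    2 1001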
This should work with ifelse if you use apply column-wise (note that apply returns a matrix; wrap it in as.data.frame() if you need a data frame back):
df <- data.frame(a = c(1:10, 4000), b = c(4:13, 7000))
apply(df, 2, function(x) ifelse(x < 0.001 * max(x), 0, x))
We could do this without ifelse, by multiplying each column by its logical test (TRUE/FALSE count as 1/0). mutate_each is deprecated in current dplyr, so using across:
library(dplyr)
dat %>%
  mutate(across(everything(), ~ (.x >= 0.001 * max(.x)) * .x))
#     a    b
#1    0    0
#2    2    4
#3 1001 3003
data
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))
