Compute group aggregates over a dynamic number of columns in R

I have a large dataset similar to the following table (called results.raw further down) with some independent (X000 to X306) and some dependent variables (they have different names):
X000 X001 X002 ... X306 MEASURE1 OUT2 ... RESULTN
1 2 1 2 1 2 2
1 2 1 2 2 3 1
...
2 3 1 4 5 3 3
...
I want to average this dataset, grouping rows whenever the independent variables are equal. I came up with the following R command, which seems to work but is very slow:
aggregate(results.raw, by = as.list(lapply(as.list(colnames(results.raw)[1:307]), FUN = function (x) { results.raw[,x] })), FUN = mean)
How can this be made faster?

We can either use tidyverse
library(dplyr)
results.raw %>%
group_by_at(1:307) %>%
summarise_all(mean)
Or with data.table
library(data.table)
setDT(results.raw)[, lapply(.SD, mean), by = c(names(results.raw)[1:307])]
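For newer dplyr versions (1.0.0 or later), the same grouping can be sketched with across(). This is only an illustration on a made-up toy data frame: the names toy, n_keys and toy_means are hypothetical, and n_keys would be 307 for the data in the question.

```r
library(dplyr)

# Toy stand-in for results.raw: two key columns and two outcome columns.
# Column values are invented for illustration only.
toy <- data.frame(
  X000     = c(1, 1, 2),
  X001     = c(2, 2, 3),
  MEASURE1 = c(1, 2, 5),
  OUT2     = c(2, 3, 3)
)

n_keys <- 2  # number of leading key columns (307 in the question)

toy_means <- toy %>%
  # group by the first n_keys columns, selected by name
  group_by(across(all_of(names(toy)[seq_len(n_keys)]))) %>%
  # average every remaining (non-grouping) column
  summarise(across(everything(), mean), .groups = "drop")

print(toy_means)
```

Selecting the key columns by position keeps the call independent of how many independent variables there are.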

Related

Repeat (duplicate) just one row twice in R

I'm trying to duplicate just the second row in a dataframe, so that row will appear twice. A dplyr or tidyverse approach would be great. I've tried using slice() but I can only get it to either duplicate the row I want and remove all the other data, or duplicate all the data, not just the second row.
So I want something like df1:
df <- data.frame(t = c(1,2,3,4,5),
r = c(2,3,4,5,6))
df1 <- data.frame(t = c(1,2,2,3,4,5),
r = c(2,3,3,4,5,6))
Thanks!
Here's also a tidyverse approach with uncount:
library(tidyverse)
df %>%
mutate(nreps = if_else(row_number() == 2, 2, 1)) %>%
uncount(nreps)
Basically the idea is to set the number of times you want the row to occur (in this case row number 2 - hence row_number() == 2 - will occur twice and all others occur only once but you could potentially construct a more complex feature where each row has a different number of repetitions), and then uncount this variable (called nreps in the code).
Output:
t r
1 1 2
2 2 3
2.1 2 3
3 3 4
4 4 5
5 5 6
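The "different number of repetitions per row" idea can also be sketched in base R with rep() on the row indices. The nreps vector here is a made-up example, not something from the question.

```r
df <- data.frame(t = c(1, 2, 3, 4, 5),
                 r = c(2, 3, 4, 5, 6))

# Hypothetical repetition counts, one per row:
# row 2 appears twice, row 4 three times, all others once.
nreps <- c(1, 2, 1, 3, 1)

# rep() repeats each row index the requested number of times.
expanded <- df[rep(seq_len(nrow(df)), times = nreps), ]
rownames(expanded) <- NULL  # reset row names such as "2.1"

print(expanded)
```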
One way with slice would be :
library(dplyr)
df %>% slice(sort(c(row_number(), 2)))
# t r
#1 1 2
#2 2 3
#3 2 3
#4 3 4
#5 4 5
#6 5 6
Also :
df %>% slice(sort(c(seq_len(n()), 2)))
In base R, this can be written as :
df[sort(c(seq(nrow(df)), 2)), ]

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df<-data.frame(
id=c(1:5),
ax1=c(5,3,7,-1,9),
bx1=c(0,1,-1,0,3),
cx1=c(2,1,5,-1,5),
dx1=c(3,7,2,1,8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 in any of these columns (i.e., ax1, bx1, cx1, dx1). I would like to know if there is a way to automate filtering (or the filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt by starting with
mutate_at( vars(ends_with("x1"))
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired results. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. There is a similar case resolved here, but it was not done in an automated way. I want to adapt the automation to the case of large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
filter_at(vars(ends_with("x1")), all_vars(. != -1))
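In dplyr 1.0.4 and later, if_all() and if_any() are the helpers intended for this kind of row filter across several columns. A minimal sketch on the question's data (the names kept and dropped are just for illustration):

```r
library(dplyr)

df <- data.frame(
  id  = 1:5,
  ax1 = c(5, 3, 7, -1, 9),
  bx1 = c(0, 1, -1, 0, 3),
  cx1 = c(2, 1, 5, -1, 5),
  dx1 = c(3, 7, 2, 1, 8)
)

# Keep rows where every x1 column differs from -1.
kept <- df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))

# The complementary view: rows where at least one x1 column equals -1.
dropped <- df %>%
  filter(if_any(ends_with("x1"), ~ .x == -1))

print(kept)
print(dropped)
```

Because the columns are chosen with ends_with(), the same call works unchanged when more time points are added.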
Using base R :
With rowSums
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
Or with apply :
df[!apply(df[cols] == -1, 1, any), ]
Using filter_at:
library(tidyverse)
df<-data.frame(
id=c(1:5),
ax1=c(5,3,7,-1,9),
bx1=c(0,1,-1,0,3),
cx1=c(2,1,5,-1,5),
dx1=c(3,7,2,1,8))
df
df %>%
filter_at(vars(ax1:dx1), ~. != as.numeric(-1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8

subset dataframe based on hierarchical preference of factor levels within column in R

I have a dataframe which I would like to subset based on a hierarchical preference of factor levels within a column. With the following example I want to show that, per level of "ID", I want to select only one "method". Specifically, keep "CACL" if possible; if "CACL" doesn't exist for this level, then subset for "KCL"; and if that doesn't exist, then subset for "H2O".
ID<-c(1,1,1,2,2,3)
method<-c("CACL","KCL","H2O","H2O","KCL","H2O")
df1<-data.frame(ID,method)
ID method
1 1 CACL
2 1 KCL
3 1 H2O
4 2 H2O
5 2 KCL
6 3 H2O
ID<-c(1,2,3)
method<-c("CACL","KCL","H2O")
df2<-data.frame(ID,method)
ID method
1 1 CACL
2 2 KCL
3 3 H2O
I have done something similar subsetting by selecting a minimum number within a level, but am not able to adapt it. Am wondering whether I should use ifelse here too?
#if present, choose rows containing "number" 2 instead of 1 (this column contained only the two numbers 1 and 2)
library(dplyr)
new<-df %>%
group_by(col1,col2,col3) %>%
summarize(number = ifelse(any(number > 1), min(number[number>1]),1))
dfnew<-merge(new,df,by=c("colxyz","number"),all.x=T)
You can use order with match and then simply !duplicated:
df1 <- df1[order(match(df1$method, c("CACL","KCL","H2O"))),]
df1[!duplicated(df1$ID),]
# ID method
#1 1 CACL
#5 2 KCL
#6 3 H2O
#Variant not changing df1
i <- order(match(df1$method, c("CACL","KCL","H2O")))
df1[i[!duplicated(df1$ID[i])],]
An option using dplyr:
df1 %>%
mutate(preference = match(method, c("CACL","KCL","H2O"))) %>%
group_by(ID) %>%
filter(preference == min(preference)) %>%
select(-preference)
# A tibble: 3 x 2
# Groups: ID [3]
ID method
<dbl> <fct>
1 1 CACL
2 2 KCL
3 3 H2O

Aggregate in R taking way too long

I'm trying to count the unique values of x across groups y.
This is the function:
aggregate(x~y,z[which(z$grp==0),],function(x) length(unique(x)))
This is taking way too long (~6 hours and not done yet). I don't want to stop processing as I have to finish this tonight.
by() was taking too long as well
Any ideas what is going wrong and how I can reduce the processing time ~ 1 hour?
My dataset has 3 million rows and 16 columns.
Input dataframe z
x y grp
1 1 0
2 1 0
1 2 1
1 3 0
3 4 1
I want to get the count of unique (x) for each y where grp = 0
UPDATE: Using @eddi's excellent answer. I have
x y
1: 2 1
2: 1 3
Any idea how I can quickly summarize this as the number of x's for each value y?
So for this it will be
Number of x y
5 1
1 3
Here you go:
library(data.table)
setDT(z) # to convert to data.table in place
z[grp == 0, uniqueN(x), by = y]
# y V1
#1: 1 2
#2: 3 1
library(dplyr)
z %>%
filter(grp == 0) %>%
group_by(y) %>%
summarize(nx = n_distinct(x))
is the dplyr way, though it may not be as fast as data.table.
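For a quick check, the data.table call can be run on the small sample from the question. The result column name nx is chosen here for readability; it is not part of the original answer.

```r
library(data.table)

# Sample input from the question.
z <- data.table(
  x   = c(1, 2, 1, 1, 3),
  y   = c(1, 1, 2, 3, 4),
  grp = c(0, 0, 1, 0, 1)
)

# Count distinct x per y, restricted to grp == 0, with a named result column.
res <- z[grp == 0, .(nx = uniqueN(x)), by = y]

print(res)
```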

aggregate over several variables in r

I have a rather large dataset in a long format where I need to count the number of instances of the ID due to two different variables, A & B. E.g. The same person can be represented in multiple rows due to either A or B. What I need to do is to count the number of instances of ID which is not too hard, but also count the number of ID due to A and B and return these as variables in the dataset.
Regards,
//Mi
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, then the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$group, FUN = length)
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
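The same two frequency columns can also be sketched in base R with ave(), which avoids the merge steps and keeps the original row order. This assumes the example data frame above; it is an alternative illustration, not the method used in that answer.

```r
df <- data.frame(ID  = rep(c(1, 2), 4),
                 GRP = rep(c("a", "a", "b", "b"), 2))

# Number of rows for each ID.
df$ID.FREQ <- ave(seq_along(df$ID), df$ID, FUN = length)

# Number of rows for each ID/GRP combination.
df$GRP.FREQ <- ave(seq_along(df$ID), df$ID, df$GRP, FUN = length)

print(df)
```

Passing seq_along(df$ID) as the first argument just gives ave() something to count; only the group sizes matter.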
