Aggregate in R taking way too long

I'm trying to count the unique values of x across groups y.
This is the function:
aggregate(x~y,z[which(z$grp==0),],function(x) length(unique(x)))
This is taking way too long (~6 hours and still not done). I don't want to stop the processing as I have to finish this tonight.
by() was taking too long as well.
Any ideas what is going wrong and how I can reduce the processing time to ~1 hour?
My dataset has 3 million rows and 16 columns.
Input dataframe z
x y grp
1 1 0
2 1 0
1 2 1
1 3 0
3 4 1
I want to get the count of unique x values for each y where grp = 0.
UPDATE: Using @eddi's excellent answer, I have
x y
1: 2 1
2: 1 3
Any idea how I can quickly summarize this as the number of x's for each value y?
So for this it will be
Number of x  y
          5  1
          1  3

Here you go:
library(data.table)
setDT(z) # to convert to data.table in place
z[grp == 0, uniqueN(x), by = y]
# y V1
#1: 1 2
#2: 3 1

library(dplyr)
z %>%
  filter(grp == 0) %>%
  group_by(y) %>%
  summarize(nx = n_distinct(x))
is the dplyr way, though it may not be as fast as data.table.
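On the UPDATE: it isn't entirely clear whether "the number of x's for each value y" means the distinct count again or the plain row count per y. If it is the latter, a minimal data.table sketch (assuming z has already been converted with setDT() as above) would be:
library(data.table)
# row count (not distinct x values) for each y where grp == 0
z[grp == 0, .(`Number of x` = .N), by = y]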

Related

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df <- data.frame(
  id = c(1:5),
  ax1 = c(5, 3, 7, -1, 9),
  bx1 = c(0, 1, -1, 0, 3),
  cx1 = c(2, 1, 5, -1, 5),
  dx1 = c(3, 7, 2, 1, 8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 in any of those columns (i.e. ax1, bx1, cx1, dx1). I would like to know if there is a way to automate the filtering (or the filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt by starting with
mutate_at( vars(ends_with("x1"))
to select the required columns, but I am not sure about how to combine this with the filter function to produce the desired results. The expect output would have the 3rd and 4th row being deleted. I appreciate any help on this. There is a similar case resolved here but this has not been done through the automation process. I want to adapt the automation to the case of large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
  filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
  filter_at(vars(ends_with("x1")), all_vars(. != -1))
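If you are on a recent dplyr (if_all() was added in 1.0.4), if_all() is the currently recommended way to express this "all columns must satisfy the condition" filter; a small sketch of the same operation:
library(dplyr)
# keep rows where every column ending in "x1" differs from -1
df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))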
Using base R:
With rowSums:
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
Or with apply:
df[!apply(df[cols] == -1, 1, any), ]
Using filter_at:
library(tidyverse)
df <- data.frame(
  id = c(1:5),
  ax1 = c(5, 3, 7, -1, 9),
  bx1 = c(0, 1, -1, 0, 3),
  cx1 = c(2, 1, 5, -1, 5),
  dx1 = c(3, 7, 2, 1, 8))
df
df %>%
  filter_at(vars(ax1:dx1), ~ . != as.numeric(-1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8

Data manipulation in R with data over time

Based on the R data.frame below, I am looking for an elegant solution to count the number of people transitioning between groups between times.
dat <- data.frame(people = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                  time = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                  group = c(5,4,4,3,2,4,4,3,2,1,5,5,4,4,4,3,3,2,2,1))
I would like a generalized solution as my problem is much larger in scale. I was considering that something with mutate could accomplish this, but I'm not sure where to start.
An example of the start of the output I am looking for would be this:
dat_result <- data.frame(time_start = c(1,1,1,1,1),
                         time_end = c(2,2,2,2,2),
                         group_start = c(1,1,1,1,1),
                         group_end = c(1,2,3,4,5),
                         count = "")
which would be repeated for all time transitions and all group transitions. Time is of course linear so 1 can only go to 2 and 2 to 3, etc. However, any group can transition to any other group including staying in the same group between times.
I'm using package data.table because it makes it easy to work by groups. The same steps can be done with dplyr (a rough sketch follows the output below), but I'm not familiar with it.
library(data.table)
# Convert to data.table:
setDT(dat)
# Make sure your data is ordered by people and time:
setorder(dat, people, time)
# Create a new column with the next group
dat[, next.group := shift(group, -1), by = people]
# Remove rows where there's no change:
# (since this will remove data, you may want to assign the result to a new object)
new <- dat[group != next.group]
# Add end.time:
new[, end.time := shift(time, -1, max(dat$time)), by = people]
# Count the occurrences (and order the result):
new[, .N, by = .(time, end.time, group, next.group)][order(time, end.time, group)]
time end.time group next.group N
1: 1 3 5 4 1
2: 2 3 4 3 1
3: 2 4 3 2 1
4: 2 5 5 4 1
5: 3 4 3 2 1
6: 3 4 4 3 1
7: 4 5 2 1 2
8: 4 5 3 2 1
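For completeness, here is a sketch of how the same steps might look in dplyr (not benchmarked against the data.table version; lead() plays the role of shift()):
library(dplyr)
dat %>%
  arrange(people, time) %>%
  group_by(people) %>%
  mutate(next.group = lead(group)) %>%                          # group at the next time point
  filter(group != next.group) %>%                               # keep only rows where a change happens
  mutate(end.time = lead(time, default = max(dat$time))) %>%    # time of the next change
  ungroup() %>%
  count(time, end.time, group, next.group, name = "N") %>%
  arrange(time, end.time, group)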

Compute group aggregates over a dynamic number of columns in R

I have a large dataset similar to the following table (called results.raw further down) with some independent (X000 to X306) and some dependent variables (they have different names):
X000 X001 X002 ... X306 MEASURE1 OUT2 ... RESULTN
1 2 1 2 1 2 2
1 2 1 2 2 3 1
...
2 3 1 4 5 3 3
...
I want to average this dataset, grouping whenever the independent variables are equal. I came up with the following R command, which seems to work but is very slow:
aggregate(results.raw, by = as.list(lapply(as.list(colnames(results.raw)[1:307]), FUN = function (x) { results.raw[,x] })), FUN = mean)
How can this be made faster?
We can either use tidyverse
library(dplyr)
results.raw %>%
  group_by_at(1:307) %>%
  summarise_all(mean)
Or with data.table
library(data.table)
setDT(results.raw)[, lapply(.SD, mean), by = c(names(results.raw)[1:307])]
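group_by_at() and summarise_all() are superseded in current dplyr; roughly the same thing can be written with across(), for example:
library(dplyr)
results.raw %>%
  group_by(across(1:307)) %>%                          # group by all independent variables
  summarise(across(everything(), mean), .groups = "drop")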

R Conditional Counter based on multiple columns

I have a data frame with multiple responses from subjects (subid), which are in the column labelled trials. The trials count up and then start over within one subject.
Here's an example dataframe:
subid <- rep(1:2, c(10,10))
trial <- rep(1:5, 4)
response <- rnorm(20, 10, 3)
df <- as.data.frame(cbind(subid,trial, response))
df
subid trial response
1 1 1 3.591832
2 1 2 8.980606
3 1 3 12.943185
4 1 4 9.149388
5 1 5 10.192392
6 1 1 15.998124
7 1 2 13.288248
I want a column that increments every time trial starts over within one subject ID (subid):
df$block <- c(rep(1:2, c(5,5)),rep(1:2, c(5,5)))
df
subid trial response block
1 1 1 3.591832 1
2 1 2 8.980606 1
3 1 3 12.943185 1
4 1 4 9.149388 1
5 1 5 10.192392 1
6 1 1 15.998124 2
7 1 2 13.288248 2
The trials are not predictable in where they will start over. My solution so far is messy and utilizes a for loop.
Solution:
block <- 0
blocklist <- numeric(0)
for (i in seq_along(df$trial)) {
  if (df$trial[i] == 1) {   # a new block starts whenever trial resets to 1
    block <- block + 1
  }
  blocklist <- c(blocklist, block)
}
df$block <- blocklist
This solution does not start over at a new subid. Before I came to this I was attempting to use Wickham's tidyverse with mutate() and ifelse() in a pipe. If somebody knows a way to accomplish this with that package I would appreciate it, but I'll take a solution from any package. I've searched for about a day now and don't think this is a duplicate of other similar questions.
We can do this with ave from base R
df$block <- with(df, ave(trial, subid, FUN = function(x) cumsum(x==1)))
Or with dplyr
library(dplyr)
df %>%
  group_by(subid) %>%
  mutate(block = cumsum(trial == 1))
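If the real data is large, the same cumulative-sum idea carries over to data.table; a minimal sketch:
library(data.table)
setDT(df)                                        # convert to data.table in place
df[, block := cumsum(trial == 1), by = subid]    # new block whenever trial resets to 1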

countif within R repeated across each row

I'm having trouble trying to replicate some of the COUNTIF functionality I'm familiar with in Excel. I've got a data frame, and it has a large number of rows. I'm trying to take 2 variables (x & z) and do a countif of how many other rows within my dataframe match that combination. I figured out doing:
sum(`mydataframe`$x == `mydataframe`$x[1] & `mydataframe`$z == `mydataframe`$z[1])
This gives me the correct countif for the x & z combination of the first row [1] across the whole data set. The problem is I've got to use that [1]. I've tried using with(), but then I can no longer access the whole column.
I'd like to be able to do the count of x & z combination for each row within the data frame then have that output as a new vector that I can just add as another column. And I'd like this to go on for every row through to the end.
Hopefully this is pretty simple. I figure some combination of with() or apply() will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.
It seems that you are asking for a way to create a new column that contains the number of rows in the entire data frame with x and z value equal to the values of those variables for that row.
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
# x z n
# (dbl) (dbl) (int)
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
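dplyr also has add_count(), which wraps the same group_by() + mutate(n = n()) pattern into a single call:
library(dplyr)
dat %>% add_count(x, z)   # adds a column n with the number of rows sharing each x/z pair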
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
An option using data.table would be to convert the 'data.frame' to a 'data.table' (setDT(dat)), group by 'x' and 'z', and assign 'n' as the number of elements in each group (.N).
library(data.table)
setDT(dat)[, n:= .N, by = .(x,z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1
