R Conditional Counter based on multiple columns

I have a data frame with multiple responses from subjects (subid), which are in the column labelled trial. The trials count up and then start over within one subject.
Here's an example dataframe:
subid <- rep(1:2, c(10, 10))
trial <- rep(1:5, 4)
response <- rnorm(20, 10, 3)
df <- data.frame(subid, trial, response)
df
subid trial response
1 1 1 3.591832
2 1 2 8.980606
3 1 3 12.943185
4 1 4 9.149388
5 1 5 10.192392
6 1 1 15.998124
7 1 2 13.288248
I want a column that increments every time trial starts over within one subject ID (subid):
df$block <- c(rep(1:2, c(5,5)),rep(1:2, c(5,5)))
df
subid trial response block
1 1 1 3.591832 1
2 1 2 8.980606 1
3 1 3 12.943185 1
4 1 4 9.149388 1
5 1 5 10.192392 1
6 1 1 15.998124 2
7 1 2 13.288248 2
The trials are not predictable in where they will start over. My solution so far is messy and utilizes a for loop.
Solution:
block <- 0
blocklist <- numeric(0)
for (i in seq_along(df$trial)) {
  if (df$trial[i] == 1) {
    block <- block + 1
  }
  blocklist <- c(blocklist, block)
}
df$block <- blocklist
This solution does not start over at a new subid. Before this I was attempting to use Wickham's tidyverse with mutate() and ifelse() in a pipe. If somebody knows a way to accomplish this with that package, I would appreciate it; however, I'll take a solution from any package. I've searched for about a day now and don't believe this duplicates other similar questions.

We can do this with ave from base R:
df$block <- with(df, ave(trial, subid, FUN = function(x) cumsum(x == 1)))
Or with dplyr:
library(dplyr)
df %>%
  group_by(subid) %>%
  mutate(block = cumsum(trial == 1))
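To see why this works: trial == 1 is TRUE exactly where a block restarts, and cumsum() treats each TRUE as 1, so the running sum increments at every restart. A minimal illustration on a plain vector:
trial <- c(1, 2, 3, 1, 2, 3)
cumsum(trial == 1)
# [1] 1 1 1 2 2 2
Grouping by subid (via ave() or group_by()) simply restarts this running sum for each subject.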

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df <- data.frame(
  id = 1:5,
  ax1 = c(5, 3, 7, -1, 9),
  bx1 = c(0, 1, -1, 0, 3),
  cx1 = c(2, 1, 5, -1, 5),
  dx1 = c(3, 7, 2, 1, 8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 in any of those columns (i.e. ax1, bx1, cx1, dx1). I would like to know if there is a way to automate the filtering (or the filter function) to perform this task. I am familiar with filtering rows based on a single column (or variable).
For the current case, I made an attempt by starting with
mutate_at(vars(ends_with("x1")))
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired result. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. There is a similar case resolved here, but it was not done through automation. I want to adapt the automation to large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
  filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
  filter_at(vars(ends_with("x1")), all_vars(. != -1))
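Note that in later dplyr versions (1.0.4 and up), if_all() expresses the same "every selected column must pass" condition directly, and using across() inside filter() was subsequently deprecated in its favor. A sketch, assuming dplyr >= 1.0.4:
df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))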
Using base R:
With rowSums:
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
Or with apply:
df[!apply(df[cols] == -1, 1, any), ]
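Yet another base R option (my addition, not from the original answers) is to combine per-column logical vectors with Reduce():
# TRUE for rows where every x1 column differs from -1
keep <- Reduce(`&`, lapply(df[cols], function(col) col != -1))
df[keep, ]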
Using filter_at:
library(tidyverse)
df <- data.frame(
  id = 1:5,
  ax1 = c(5, 3, 7, -1, 9),
  bx1 = c(0, 1, -1, 0, 3),
  cx1 = c(2, 1, 5, -1, 5),
  dx1 = c(3, 7, 2, 1, 8))
df
df %>%
  filter_at(vars(ax1:dx1), ~ . != -1)
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8

To count how many times one row is equal to a value

I have a df here:
df <- data.frame(
  v1 = c(1, 2, 3, 4, 5),
  v2 = c(1, 2, 1, 1, 2),
  v3 = c(NA, 2, 1, 4, '1'),
  v4 = c(1, 2, 3, NaN, 5),
  logical = c(1, 2, 3, 4, 5))
I would like to know how many times each row equals the value of the variable 'logical', stored in a new variable 'count'.
I wrote a for loop like this:
attach(df)
df$count <- 0
for (i in colnames(v1:v4)) {
  if (df$logical == i) {
    df$count <- df$count + 1
  }
}
but it doesn't work; the new variable 'count' is still all 0.
Please help me fix it.
The expected result should look like this:
df <- data.frame(
  v1 = c(1, 2, 3, 4, 5),
  v2 = c(1, 2, 1, 1, 2),
  v3 = c(NA, 2, 1, 4, '1'),
  v4 = c(1, 2, 3, NaN, 5),
  logical = c(1, 2, 3, 4, 5),
  count = c(3, 4, 2, 2, 2))
Many thanks from a beginner.
We can use rowSums after creating a logical matrix
df$count <- rowSums(df[1:4] == df$logical, na.rm = TRUE)
df$count
#[1] 3 4 2 2 2
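To see the mechanics: df[1:4] == df$logical recycles the logical column down each of the four columns, producing a logical matrix, and rowSums(..., na.rm = TRUE) counts the TRUEs per row while treating NA comparisons as non-matches:
df[1:4] == df$logical
#     v1    v2    v3    v4
# 1 TRUE  TRUE    NA  TRUE
# 2 TRUE  TRUE  TRUE  TRUE
# 3 TRUE FALSE FALSE  TRUE
# 4 TRUE FALSE  TRUE    NA
# 5 TRUE FALSE FALSE  TRUE
Row 4's v4 entry is NA because NaN == 4 evaluates to NA, which na.rm = TRUE then discards.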
Personally, I think the solution by @akrun above is the most elegant and also the most efficient way to add the count column.
Another way to attach the count column to the end of df (perhaps not the "elegance" you are looking for) is within, i.e.,
df <- within(df, count <- rowSums(df[1:4] == logical, na.rm = TRUE))
such that you will get
> df
v1 v2 v3 v4 logical count
1 1 1 <NA> 1 1 3
2 2 2 2 2 2 4
3 3 1 1 3 3 2
4 4 1 4 NaN 4 2
5 5 2 1 5 5 2

Aggregate in R taking way too long

I'm trying to count the unique values of x across groups y.
This is the function:
aggregate(x ~ y, z[which(z$grp == 0), ], function(x) length(unique(x)))
This is taking way too long (~6 hours and not done yet). I don't want to stop processing as I have to finish this tonight.
by() was taking too long as well
Any ideas what is going wrong and how I can reduce the processing time to about an hour?
My dataset has 3 million rows and 16 columns.
Input dataframe z
x y grp
1 1 0
2 1 0
1 2 1
1 3 0
3 4 1
I want to get the count of unique (x) for each y where grp = 0
UPDATE: Using @eddi's excellent answer, I have
x y
1: 2 1
2: 1 3
Any idea how I can quickly summarize this as the number of x's for each value y?
So for this it will be
Number of x y
5 1
1 3
Here you go:
library(data.table)
setDT(z) # to convert to data.table in place
z[grp == 0, uniqueN(x), by = y]
# y V1
#1: 1 2
#2: 3 1
library(dplyr)
z %>%
  filter(grp == 0) %>%
  group_by(y) %>%
  summarize(nx = n_distinct(x))
is the dplyr way, though it may not be as fast as data.table.
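Another way to spell the same thing in dplyr (my addition, not from the original answer) is to deduplicate first and then count rows, which can help when x has many repeated values:
z %>%
  filter(grp == 0) %>%
  distinct(y, x) %>%
  count(y)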

Sequentially numbering repetitive interactions in R

I have a data frame in R that has been previously sorted with data that looks like the following:
id creatorid responderid
1 1 2
2 1 2
3 1 3
4 1 3
5 1 3
6 2 3
7 2 3
I'd like to add a value, called repetition to the data frame that shows how many times that combination of (creatorid,responderid) has previously appeared. For example, the output in this case would be:
id creatorid responderid repetition
1 1 2 0
2 1 2 1
3 1 3 0
4 1 3 1
5 1 3 2
6 2 3 0
7 2 3 1
I have a hunch that this is something that can be easily done with dlply and transform, but I haven't been able to work it out. Here's the simple code that I'm using to attempt it:
dlply(df, .(creatorid, responderid), transform, repetition = function(dfrow) {
  seq(0, nrow(dfrow) - 1)
})
Unfortunately, this throws the following error (pasted from my real data - the first repetition appears 166 times):
Error in data.frame(list(id = c(39684L, 55374L, 65158L, 54217L, 10004L, :
arguments imply differing number of rows: 166, 0
Any suggestions on an easy and efficient way to accomplish this task?
Using plyr:
ddply(df, .(creatorid, responderid), function(x)
  transform(x, repetition = seq_len(nrow(x)) - 1))
Using data.table:
require(data.table)
dt <- data.table(df)
dt[, repetition := seq_len(.N)-1, by = list(creatorid, responderid)]
Using ave:
within(df, {
  repetition <- ave(id, list(creatorid, responderid),
                    FUN = function(x) seq_along(x) - 1)
})
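For completeness, a dplyr equivalent (not among the original answers) using row_number(), which gives the zero-based position of each row within its group:
library(dplyr)
df %>%
  group_by(creatorid, responderid) %>%
  mutate(repetition = row_number() - 1) %>%
  ungroup()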

aggregate over several variables in r

I have a rather large dataset in a long format where I need to count the number of instances of each ID arising from two different variables, A and B. The same person can be represented in multiple rows due to either A or B. I need to count the number of instances of each ID, which is not too hard, but also to count the number of instances due to A and due to B, and return these as variables in the dataset.
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df (the example df with ID and GRP columns defined in the last answer below):
library(plyr)
x <- ddply(.data = df, .var = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, the fastest and easiest solution is:
df$IDCount <- ave(df$ID, df$ID, FUN = length)
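A dplyr alternative (my addition, assuming the example df with columns ID and GRP defined in the next answer) adds both counts in one pipe with add_count():
library(dplyr)
df %>%
  add_count(ID, name = "ID.FREQ") %>%       # rows per ID
  add_count(ID, GRP, name = "GRP.FREQ")     # rows per ID and GRP combination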
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
