Sequentially numbering repetitive interactions in R

I have a data frame in R that has been previously sorted with data that looks like the following:
id creatorid responderid
1 1 2
2 1 2
3 1 3
4 1 3
5 1 3
6 2 3
7 2 3
I'd like to add a column, called repetition, to the data frame that shows how many times that combination of (creatorid, responderid) has previously appeared. For example, the output in this case would be:
id creatorid responderid repetition
1 1 2 0
2 1 2 1
3 1 3 0
4 1 3 1
5 1 3 2
6 2 3 0
7 2 3 1
I have a hunch that this is something that can be easily done with dlply and transform, but I haven't been able to work it out. Here's the simple code that I'm using to attempt it:
dlply(df, .(creatorid, responderid), transform, repetition = function(dfrow) {
  seq(0, nrow(dfrow) - 1)
})
Unfortunately, this throws the following error (pasted from my real data - the first repetition appears 166 times):
Error in data.frame(list(id = c(39684L, 55374L, 65158L, 54217L, 10004L, :
arguments imply differing number of rows: 166, 0
Any suggestions on an easy and efficient way to accomplish this task?

Using plyr:
ddply(df, .(creatorid, responderid), function(x)
  transform(x, repetition = seq_len(nrow(x)) - 1))
Using data.table:
require(data.table)
dt <- data.table(df)
dt[, repetition := seq_len(.N)-1, by = list(creatorid, responderid)]
Using ave:
within(df, {repetition <- ave(id, list(creatorid, responderid),
                              FUN = function(x) seq_along(x) - 1)})
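All three approaches produce the same repetition column. A minimal base-R check, rebuilding the sample data from the question for illustration:

```r
# Rebuild the sample data from the question
df <- data.frame(id          = 1:7,
                 creatorid   = c(1, 1, 1, 1, 1, 2, 2),
                 responderid = c(2, 2, 3, 3, 3, 3, 3))

# ave() numbers each (creatorid, responderid) group from 0
df$repetition <- ave(df$id, df$creatorid, df$responderid,
                     FUN = function(x) seq_along(x) - 1)
df$repetition
# 0 1 0 1 2 0 1
```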

Related

Expand.grid with unknown number of columns

I have the following data frame:
map_value LDGroup ComboNum
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 3 1
1 3 2
I want to find all combinations, selecting one from each LDGroup. expand.grid seems to work for this, doing
expand.grid(df[df$LDGroup==1,3],df[df$LDGroup==2,3],df[df$LDGroup==3,3])
My problem is that I have about 500 map_values I need to do this for and I do not know what number of LDGroups will exist for each map_value. Is there a way to dynamically provide the function arguments?
We can split the 3rd column by the 'LDGroup' and apply the expand.grid
out <- expand.grid(split(df$ComboNum, df$LDGroup))
names(out) <- paste0("Var", names(out))
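To repeat this for all 500 map_values, one sketch (the name combos_by_value is just illustrative) is to split the frame by map_value first and apply the same idea inside lapply:

```r
# Sample data with a single map_value, as in the question
df <- data.frame(map_value = 1,
                 LDGroup   = c(1, 1, 1, 2, 2, 3, 3),
                 ComboNum  = c(1, 2, 3, 1, 2, 1, 2))

# One expand.grid per map_value, however many LDGroups each one has
combos_by_value <- lapply(split(df, df$map_value), function(d)
  expand.grid(split(d$ComboNum, d$LDGroup)))

nrow(combos_by_value[["1"]])  # 3 * 2 * 2 = 12 combinations
```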

R Conditional Counter based on multiple columns

I have a data frame with multiple responses from subjects (subid), which are in the column labelled trials. The trials count up and then start over within one subject.
Here's an example dataframe:
subid <- rep(1:2, c(10,10))
trial <- rep(1:5, 4)
response <- rnorm(20, 10, 3)
df <- as.data.frame(cbind(subid,trial, response))
df
subid trial response
1 1 1 3.591832
2 1 2 8.980606
3 1 3 12.943185
4 1 4 9.149388
5 1 5 10.192392
6 1 1 15.998124
7 1 2 13.288248
I want a column that increments every time trial starts over within one subject ID (subid):
df$block <- c(rep(1:2, c(5,5)),rep(1:2, c(5,5)))
df
subid trial response block
1 1 1 3.591832 1
2 1 2 8.980606 1
3 1 3 12.943185 1
4 1 4 9.149388 1
5 1 5 10.192392 1
6 1 1 15.998124 2
7 1 2 13.288248 2
The trials are not predictable in where they will start over. My solution so far is messy and utilizes a for loop.
Solution:
block <- 0
blocklist <- 0
for (i in seq_along(df$trial)) {
  if (df$trial[i] == 1) {
    block <- block + 1
  } else {
    block <- block
  }
  blocklist <- c(blocklist, block)
}
blocklist <- blocklist[-1]
df$block <- blocklist
This solution does not start over at a new subid. Before I came to this I was attempting to use Wickham's tidyverse with mutate() and ifelse() in a pipe. If somebody knows a way to accomplish this with that package I would appreciate it. However, I'll use a solution from any package. I've searched for about a day now and don't think this duplicates other similar questions.
We can do this with ave from base R
df$block <- with(df, ave(trial, subid, FUN = function(x) cumsum(x==1)))
Or with dplyr
library(dplyr)
df %>%
group_by(subid) %>%
mutate(block = cumsum(trial==1))
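The cumsum(trial == 1) trick increments exactly when a trial sequence restarts. A minimal base-R check on data shaped like the question's example:

```r
# Two subjects, each running through trials 1..5 twice
df <- data.frame(subid = rep(1:2, each = 10),
                 trial = rep(1:5, 4))

# A new block starts whenever trial resets to 1, counted per subject
df$block <- with(df, ave(trial, subid, FUN = function(x) cumsum(x == 1)))
df$block[1:10]
# 1 1 1 1 1 2 2 2 2 2
```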

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and value). Next I would like a count column that lists the # of times that the same value occurs per id. If it occurs more than once, it will obviously repeat the value. There are other variables in this data set, but the new count variable needs to be conditional only on 2 of them. I have scoured this blog, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
> df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by @Arun, you can replace transform with mutate if you are working with a large data.frame.
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)]
The special symbol .N is a length-1 integer vector giving the number of observations in the current group.
Note that since := adds the column by reference, the result keeps the original dimensions; a join back to the initial data.frame would only be needed if you computed the counts as a separate aggregate.
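A quick base-R check that the ave() approach reproduces the Count column from the question's table:

```r
# Sample data from the question
df <- data.frame(ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Value = c("a", "a", "b", "a", "a", "a", "b", "b", "b"))

# ave() over the interaction of ID and Value gives per-group sizes
df$Count <- ave(seq_len(nrow(df)), df$ID, df$Value, FUN = length)
df$Count
# 2 2 1 2 2 1 3 3 3
```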

How to find the final value from repeated measures in R?

I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use head(mass, 1) in place of mass[1] and tail(mass, 1) in place of mass[length(mass)].
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
Once you have the sorted table, it's pretty simple:
t <- tapply(test$mass, test$indv, max)
This will give you an array with indv as the names and mass_f as the values (note this relies on each individual's final mass also being its maximum, as it is in this data).
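Equivalently in base R, after ordering by time, ave() can pull the first and last mass per individual without relying on the maximum. A sketch on the question's original data:

```r
# Data from the question
test <- data.frame(indv = c(1, 2, 1, 2, 2, 1),
                   time = c(10, 5, 5, 4, 14, 15),
                   mass = c(7, 3, 1, 4, 14, 15))

# Sort by individual and time, then take first/last mass per individual
test <- test[order(test$indv, test$time), ]
test$mass_i <- ave(test$mass, test$indv, FUN = function(x) x[1])
test$mass_f <- ave(test$mass, test$indv, FUN = function(x) x[length(x)])
```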

Create a vector listing run length of original vector with same length as original vector

This problem seems trivial but I'm at my wits end after hours of reading.
I need to generate a vector of the same length as the input vector that lists for each value of the input vector the total count for that value. So, by way of example, I would want to generate the last column of this dataframe:
> df
customer.id transaction.count total.transactions
1 1 1 4
2 1 2 4
3 1 3 4
4 1 4 4
5 2 1 2
6 2 2 2
7 3 1 3
8 3 2 3
9 3 3 3
10 4 1 1
I realise this could be done two ways, either by using run lengths of the first column, or grouping the second column using the first and applying a maximum.
I've tried both tapply:
> tapply(df$transaction.count, df$customer.id, max)
And rle:
> rle(df$customer.id)
But both return a vector of shorter length than the original:
[1] 4 2 3 1
Any help gratefully accepted!
You can do it without creating a transaction counter:
df$total.transactions <- with(df,
  ave(transaction.count, customer.id, FUN = length))
You can use rle with rep to get what you want:
x <- rep(1:4, 4:1)
> x
[1] 1 1 1 1 2 2 2 3 3 4
> rep(rle(x)$lengths, rle(x)$lengths)
[1] 4 4 4 4 3 3 3 2 2 1
For performance purposes, you could store the rle object separately so it is only called once.
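Storing the rle object once, as suggested, avoids scanning the vector twice:

```r
x <- rep(1:4, 4:1)  # 1 1 1 1 2 2 2 3 3 4

r <- rle(x)                       # run lengths computed once
total <- rep(r$lengths, r$lengths)
total
# 4 4 4 4 3 3 3 2 2 1
```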
Or as Karsten suggested with ddply from plyr:
require(plyr)
#Expects data.frame
dat <- data.frame(x = rep(1:4, 4:1))
ddply(dat, "x", transform, total = length(x))
You are probably looking for split-apply-combine approach; have a look at ddply in the plyr package or the split function in base R.
