Data manipulation in R with data over time

Based on the R data.frame below, I am looking for an elegant solution to count the number of people transitioning between groups from one time point to the next.
dat <- data.frame(people = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
time = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
group = c(5,4,4,3,2,4,4,3,2,1,5,5,4,4,4,3,3,2,2,1))
I would like a generalized solution, as my actual problem is much larger in scale. I suspect something with mutate could accomplish this, but I'm not sure where to start.
An example of the start of the output I am looking for would be this:
dat_result <- data.frame(time_start = c(1,1,1,1,1),
time_end = c(2,2,2,2,2),
group_start = c(1,1,1,1,1),
group_end = c(1,2,3,4,5),
count = "")
which would be repeated for all time transitions and all group transitions. Time is of course linear, so 1 can only go to 2, 2 to 3, and so on. However, any group can transition to any other group, including staying in the same group between times.

I'm using the data.table package because it makes it easy to work by groups. The same steps can be done with dplyr, but I'm not familiar with it; a rough dplyr translation is sketched after the output below.
library(data.table)
# Convert to data.table:
setDT(dat)
# Make sure your data is ordered by people and time:
setorder(dat, people, time)
# Create a new column with the next group
dat[, next.group := shift(group, -1), by = people]
# Remove rows where there's no change:
# (since this will remove rows, you may want to assign the result to a new object)
new <- dat[group != next.group]
# Add end.time:
new[, end.time := shift(time, -1, max(dat$time)), by = people]
# Count the occurrences (and order the result):
new[, .N, by = .(time, end.time, group, next.group)][order(time, end.time, group)]
time end.time group next.group N
1: 1 3 5 4 1
2: 2 3 4 3 1
3: 2 4 3 2 1
4: 2 5 5 4 1
5: 3 4 3 2 1
6: 3 4 4 3 1
7: 4 5 2 1 2
8: 4 5 3 2 1
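For reference, a rough dplyr translation of the same steps (a sketch, starting from the original dat; not tested against edge cases):
library(dplyr)
dat %>%
  arrange(people, time) %>%
  group_by(people) %>%
  mutate(next.group = lead(group)) %>%                    # next group per person
  filter(group != next.group) %>%                         # keep only the changes
  mutate(end.time = lead(time, default = max(dat$time))) %>%
  ungroup() %>%
  count(time, end.time, group, next.group) %>%
  arrange(time, end.time, group)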

Related

Group observations into specified number of groups according to id with data.table solution

I have the following data.table:
dt <- data.table(id = rep(1:5, 5), obs = rnorm(25, mean = 1))[order(id)]
dt
id obs
1: 1 0.1470735
2: 1 1.6954685
3: 1 2.3947260
4: 1 2.1782338
5: 1 0.5168873
6: 2 -0.8879545
7: 2 1.9320034
8: 2 2.6269272
9: 2 1.5212627
10: 2 -0.1581711
Which has a total of 5 distinct ids (numbers 1 through 5) and 5 observations (obs) for each id. I want to group the ids together randomly in groups of X ids according to id and create a new column with the grouping. For this example, let's say I want to end up with a data.table like this:
id obs group
1: 1 0.1470735 A
2: 1 1.6954685 A
3: 1 2.3947260 A
4: 1 2.1782338 A
5: 1 0.5168873 A
6: 2 -0.8879545 A
7: 2 1.9320034 A
8: 2 2.6269272 A
9: 2 1.5212627 A
10: 2 -0.1581711 A
Where ids 1 and 2 are assigned to group A, ids 3 and 4 are assigned to group B, and id 5 is assigned to group C.
My actual dataset is much larger and will not necessarily group evenly, but I do not need the groups to contain the same number of ids. I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine).
Could someone please help me with an elegant data.table way to accomplish this?
This is the same idea as @Shree's answer, just using length.out in rep and no dplyr.
I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine).
You can make an id table; assign groups there; and if necessary merge back:
# bigger, reproducible example
library(data.table)
max_per_group = 5
n_ids = 1e5+1
DT = data.table(id = rep(1:n_ids, each = max_per_group), obs = 1)
# make an id table
idDT = unique(DT[, "id"])
# randomly assign groups
idDT[, g := sample(rep(.I, each = max_per_group, length.out = .N))]
# merge back if needed
DT[idDT, on=.(id), g := i.g]
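A quick sanity check on the assignment (reusing the objects above): no group should end up with more than max_per_group ids.
sizes <- idDT[, .N, by = g]
stopifnot(all(sizes$N <= max_per_group))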
You refer to "my actual dataset" -- but R allows you to juggle multiple tables. Trying to do everything in one is almost always counterproductive.
EDIT: I didn't notice that you needed this with data.table. I'll leave this here as an alternative.
I am creating a data frame with id and a randomly assigned group, which is then joined back to your data to get a group for each record by id (note that rep(LETTERS, max_per_group) caps this approach at 26 * max_per_group ids):
library(dplyr)
library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(25, mean = 1))[order(id)]
max_per_group <- 5
n_ids <- length(unique(dt$id))
data.frame(id = unique(dt$id), grp = sample(rep(LETTERS, max_per_group), n_ids)) %>%
left_join(dt, ., by = "id")
id obs grp
1 1 1.28879713 S
2 1 1.04471197 S
3 1 0.36470847 S
4 1 0.46741567 S
5 1 1.07749891 S
6 2 1.73640785 K
7 2 1.61144042 K
8 2 2.85196859 K
9 2 1.84848117 K
10 2 2.11395863 K
11 3 0.88623462 S
12 3 2.11706351 S
13 3 1.29225433 S
14 3 0.30458037 S
15 3 -1.72070005 S
16 4 2.24593162 U
17 4 2.10346287 U
18 4 2.28724412 U
19 4 0.02978044 U
20 4 0.56234660 U
21 5 2.92050008 F
22 5 1.08048974 F
23 5 0.58885261 F
24 5 1.53299092 F
25 5 1.47271123 F
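If you assign the join result, you can check the group sizes the same way; a sketch (re-running the pipeline, so the letters will differ from the output above):
res <- left_join(dt,
                 data.frame(id = unique(dt$id),
                            grp = sample(rep(LETTERS, max_per_group), n_ids)),
                 by = "id")
res %>% distinct(id, grp) %>% count(grp) %>% summarise(max_size = max(n))  # should be <= max_per_group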

Repeat sequence by group

I have the following dataframe:
a <- data.frame(
group1=factor(rep(c("a","b"),each=6,times=1)),
time=rep(1:6,each=1,times=2),
newcolumn = c(1,1,2,2,3,3,1,1,2,2,3,3)
)
I'm looking to replicate the output of newcolumn with a grouped rep (the time variable is there for ordering purposes). In other words, for each group, ordered by time, how can I assign the sequence 1,1,2,2,...,n,n? I also need a general solution, for cases where groups have differing numbers of rows or where I want to repeat each value 3, 10, or n times.
For instance, I can generate that sequence with this:
newcolumn=rep(1:3,each=2,times=2)
But that wouldn't work in a grouped statement where group1 has differing numbers of rows.
We specify the length.out in the rep after grouping by 'group1'
library(dplyr)
a %>%
group_by(group1) %>%
mutate(new = rep(seq_len(n()/2), each = 2, length.out = n()))
NOTE: each is applied before times, so the two can technically be combined, but when length.out is supplied it takes priority over times; that is why each is paired with length.out here.
EDIT: Based on comments from @r2evans
A data.table alternative:
library(data.table)
DT <- as.data.table(a[1:2])
DT[order(time), newcolumn := rep(seq_len(.N/2), each = 2, length.out = .N), by = "group1"]
DT
# group1 time newcolumn
# 1: a 1 1
# 2: a 2 1
# 3: a 3 2
# 4: a 4 2
# 5: a 5 3
# 6: a 6 3
# 7: b 1 1
# 8: b 2 1
# 9: b 3 2
# 10: b 4 2
# 11: b 5 3
# 12: b 6 3
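For completeness, a base-R sketch of the same idea using ave() (assuming rows are already ordered by time within each group):
a$new <- ave(seq_along(a$group1), a$group1,
             FUN = function(i) rep(seq_len(length(i) / 2), each = 2,
                                   length.out = length(i)))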

Wrapping cumulative sum from a set starting row in R

I have a data frame that looks a bit like this:
wt <- data.frame(region = c(rep("A", 5), rep("B", 5)), time = c(1:5, 1:5),
start = c(rep(2,5), rep(4, 5)), value = rep(1, 10))
The values in the value column could be any numbers (I am working in a very large data set), but each region will be over an equal-length time series and have a single starting point.
I want to perform a cumulative sum within each region that begins accumulating at the starting point, continues forward in the time series, and wraps to the rows before the starting point in the time series.
The full data table, WITH the intended result, would look like this:
region time start value result
A 1 2 1 5
A 2 2 1 1
A 3 2 1 2
A 4 2 1 3
A 5 2 1 4
B 1 4 1 3
B 2 4 1 4
B 3 4 1 5
B 4 4 1 1
B 5 4 1 2
A simple transformation of the time column followed by cumsum does not work, since cumsum depends on row order rather than on any particular column.
With that in mind, I am operating on a huge data table, and runtime is absolutely a concern, so any solution must avoid re-ordering rows.
Ideas of how to do this? Thanks in advance.
EDIT: Consider time to be a cycle such as hours in a day - and for example, if the start time is 2, that means observations start at one instance of time 2 and end at the next time 1.
We can do this in an efficient way with data.table
library(data.table)
setDT(wt)[time >= start, result := seq_len(.N), region]
wt[, Max := max(result, na.rm = TRUE), region]
wt[is.na(result), result := Max + seq_len(.N), region][, Max := NULL][]
# region time start value result
#1: A 1 2 1 5
#2: A 2 2 1 1
#3: A 3 2 1 2
#4: A 4 2 1 3
#5: A 5 2 1 4
#6: B 1 4 1 3
#7: B 2 4 1 4
#8: B 3 4 1 5
#9: B 4 4 1 1
#10: B 5 4 1 2
akrun's solution works for the example I gave (hence I accepted it as the answer), but here's a version that works for any values in the value column:
library(data.table)
setDT(wt)[time >= start, result := cumsum(value), region]
wt[, Max := max(result, na.rm = TRUE), region]
wt[is.na(result), result := Max + cumsum(value), region][, Max := NULL][]
This simply substitutes cumsum(value) for the row-count sequence seq_len(.N).
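A quick check against the expected result column from the question (with value = 1 as above):
expected <- c(5, 1, 2, 3, 4, 3, 4, 5, 1, 2)
stopifnot(all(wt$result == expected))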

Is my way of duplicating rows in data.table efficient?

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.
My approach is as follows: Duplicating the annual data for every month and then join the monthly and annual data. And now I have a question regarding the duplication of rows. I know how to do it, but I'm not sure if it is the best way to do it, so some opinions would be great.
Here is an example data.table DT for my annual data and how I currently duplicate it:
library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
values = 10:15,
startMonth = seq(from=1, by=2, length=6),
endMonth = seq(from=3, by=3, length=6))
DT
ID values startMonth endMonth
[1,] a_1 10 1 3
[2,] a_2 11 3 6
[3,] a_3 12 5 9
[4,] b_1 13 7 12
[5,] b_2 14 9 15
[6,] b_3 15 11 18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT, ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
[...]
The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except carrying along the other columns of DT, so I was wondering if I could get rid of the last three lines of my code, i.e. the setkey and join operations. It turns out you can; just do the following:
#2. Alternative: more intuitive and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
This, however, only works because I hardcoded the column names into the list expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the column MONTH that I compute as shown above and all the other columns of DT. .SD seemed to be able to do the trick, but:
DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(MONTH = startMonth:endMonth, .SD), by = "ID") :
maxn (4) is not exact multiple of this j column's length (3)
So to summarize: I know how it can be done, but I wonder whether this is the best way to do it, because I'm still struggling a bit with the syntax of data.table and often read in posts and on the wiki that there are good and bad ways of doing things. Also, I don't quite get why I get an error when using .SD. I thought it was just an easy way to tell data.table that you want all columns. What am I missing?
Looking at this I realized that the answer was only possible because ID was a unique key (without duplicates). Here is another answer with duplicates. But, by the way, some NA seem to creep in. Could this be a bug? I'm using v1.8.7 (commit 796).
library(data.table)
DT <- data.table(x=c(1,1,1,1,2,2,3),y=c(1,1,2,3,1,1,2))
DT[,rep:=1L][c(2,7),rep:=c(2L,3L)] # duplicate row 2 and triple row 7
DT[,num:=1:.N] # to group each row by itself
DT
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 2 1 3
4: 1 3 1 4
5: 2 1 1 5
6: 2 1 1 6
7: 3 2 3 7
DT[,cbind(.SD,dup=1:rep),by="num"]
num x y rep dup
1: 1 1 1 1 1
2: 2 1 1 1 NA # why these NA?
3: 2 1 1 2 NA
4: 3 1 2 1 1
5: 4 1 3 1 1
6: 5 2 1 1 1
7: 6 2 1 1 1
8: 7 3 2 3 1
9: 7 3 2 3 2
10: 7 3 2 3 3
Just for completeness, a faster way is to rep the row numbers and then take the subset in one step (no grouping and no use of cbind or .SD) :
DT[rep(num,rep)]
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 1 2 2
4: 1 2 1 3
5: 1 3 1 4
6: 2 1 1 5
7: 2 1 1 6
8: 3 2 3 7
9: 3 2 3 7
10: 3 2 3 7
where in this example data the column rep happens to be the same name as the rep() base function.
Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.
In the meantime, try :
DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
Also, just to check: have you seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then roll join to it. Your example data has overlapping month ranges, though, which complicates it.
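To illustrate the roll join idea with made-up data (a sketch in current data.table syntax; on= requires v1.9.6+, and the column names here are illustrative, not from the question):
library(data.table)
annual  <- data.table(startMonth = c(1, 4, 9), values = 10:12, key = "startMonth")
monthly <- data.table(month = 1:12)
# roll = TRUE carries each annual value forward until the next startMonth
annual[monthly, on = .(startMonth = month), roll = TRUE]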
Here is a function I wrote which mimics disaggregate (I needed something that handled complex data). It might be useful for you, if it isn't overkill. To expand only rows, set the argument fact to c(1,12), where 12 gives 12 'month' rows for each 'year' row.
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
         '1' = xfact <- yfact <- fact,
         '2' = {xfact <- fact[1]; yfact <- fact[2]},
         {xfact <- fact[1]; yfact <- fact[2]
          warning('fact is too long. First two values used.')})
  if (xfact < 1) stop('fact[1] must be > 0')
  if (yfact < 1) stop('fact[2] must be > 0')
  # new nonloop method, seems to work just ducky
  # column expansion: repeat each element xfact times within rows
  bigtmp <- matrix(rep(t(inarray), each = xfact),
                   nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
  # row expansion: repeat each row yfact times
  bigx <- t(matrix(rep(bigtmp, each = yfact),
                   ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
  return(invisible(bigx))
}
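A quick illustration of the row-expansion path (zexpand returns invisibly, so assign the result):
m <- matrix(1:4, nrow = 2)         # two 'year' rows
out <- zexpand(m, fact = c(1, 3))  # each row repeated 3 times
out                                # a 6 x 2 matrix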
The fastest and most succinct way of doing it (note the + 1 so that both startMonth and endMonth are included, matching the MONTH expansion above):
DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
We can also enumerate the duplicates within each ID:
dd <- DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
dd[, nn := 1:.N, by = ID]
dd
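If you also need the actual month for each duplicated row, it can be reconstructed from the enumeration (given the + 1 fix above, nn runs from 1 to endMonth - startMonth + 1 within each ID):
dd[, MONTH := startMonth + nn - 1L]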

aggregate over several variables in r

I have a rather large dataset in long format where I need to count the number of instances of each ID arising from two different variables, A and B; the same person can appear in multiple rows due to either A or B. Counting the instances of ID is not too hard, but I also need to count the instances due to A and due to B, and return all of these as new variables in the dataset.
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df (defined in the last answer below as df <- data.frame(ID = rep(c(1,2), 4), GRP = rep(c("a","a","b","b"), 2))):
library(plyr)
x <- ddply(.data = df, .var = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, the fastest and easiest solution is:
df$IDCount <- ave(df$ID, df$ID, FUN = length)
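For comparison, a dplyr sketch of both counts (add_count requires dplyr 0.8 or later; df as defined in the answer below):
library(dplyr)
df %>%
  add_count(ID, name = "ID.FREQ") %>%      # instances of each ID
  add_count(ID, GRP, name = "GRP.FREQ")    # instances of each ID within each GRP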
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
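And the same two counts in data.table (a sketch, starting from a fresh df as defined above):
library(data.table)
setDT(df)[, ID.FREQ := .N, by = ID][, GRP.FREQ := .N, by = .(ID, GRP)]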
