Aggregate monthly status data to sequence data - r

I wonder if there is a simple solution to the following problem: imagine working with monthly status information about whether somebody works (work = 1) or not (work = 0). This illustrates the original data:
orig <- data.frame(id = rep(1:2, each = 10),
                   month.nr = rep(1:10, 2),
                   work.yn = c(0,1,1,0,0,0,1,1,1,0,
                               1,1,1,1,0,1,1,0,0,1))
id month.nr work.yn
1 1 0
1 2 1
1 3 1
1 4 0
1 5 0
1 6 0
1 7 1
1 8 1
1 9 1
1 10 0
2 1 1
2 2 1
2 3 1
2 4 1
2 5 0
2 6 1
2 7 1
2 8 0
2 9 0
2 10 1
I'm looking for a simple function or algorithm which transforms the data, keeping only the start and end months of each working period, and which numbers the resulting sequences per person (id). The resulting data for the sample above would look like this:
id month.start.work month.end.work sequence.nr
1 2 3 1
1 7 9 2
2 1 4 1
2 6 7 2
2 10 10 3
As my data volume is not small, a resource-efficient solution is very much appreciated.
Edit: doing the task with a loop (and maybe a lag function) would work, but I'm looking for a more vectorized solution.

Here's a somewhat similar solution using the rleid function in data.table v1.9.6+ (the newest stable version):
library(data.table) # v.1.9.6+
setDT(orig)[, indx := rleid(work.yn)
][work.yn != 0, .(start = month.nr[1L],
end = month.nr[.N]),
by = .(id, indx)
][, seq := 1:.N,
by = id][]
# id indx start end seq
# 1: 1 2 2 3 1
# 2: 1 4 7 9 2
# 3: 2 6 1 4 1
# 4: 2 8 6 7 2
# 5: 2 10 10 10 3
A slight variant of the above that avoids creating the index column first, thereby saving one grouping operation:
setDT(orig)[, if (work.yn[1L])
.(start=month.nr[1L], end=month.nr[.N]),
by=.(id, rleid(work.yn))
][, seq := seq_len(.N), by=id][]
Or we could just use range for shorter code:
setDT(orig)[, if (work.yn[1L]) as.list(range(month.nr)),
by = .(id, rleid(work.yn))
][, seq := seq_len(.N), by = id][]
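To see what rleid contributes in all three variants, a quick illustration on the work.yn values of id 1 (assuming data.table is attached): it assigns one id per consecutive run, so each working period becomes its own group.
rleid(c(0, 1, 1, 0, 0, 0, 1, 1, 1, 0))
# [1] 1 2 2 3 3 3 4 4 4 5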

You can use the data.table package, with this small utility function:
library(data.table)
f = function(x, y)
{
  # x: 0/1 work indicator, y: month numbers (assumed consecutive per id)
  r = rle(x)                               # run-length encode the 0/1 runs
  end = y[cumsum(r$lengths)[!!r$values]]   # last month of each run of 1s
  start = end - r$lengths[!!r$values] + 1  # first month of each run of 1s
  list(month.start = start, month.end = end)
}
setDT(orig)[, f(work.yn, month.nr), id][, sequence.nr := seq(.N), id][]
# id month.start month.end sequence.nr
#1: 1 2 3 1
#2: 1 7 9 2
#3: 2 1 4 1
#4: 2 6 7 2
#5: 2 10 10 3

A solution using the dplyr library:
require("dplyr")
orig %>%
  filter(work.yn == 1) %>%
  group_by(id) %>%
  mutate(sequence.nr = cumsum(diff(c(-1, month.nr)) != 1)) %>%
  group_by(id, sequence.nr) %>%
  mutate(start_mon = min(month.nr),
         end_mon = max(month.nr)) %>%
  select(-month.nr, -work.yn) %>%
  distinct
# id sequence.nr start_mon end_mon
# 1 1 1 2 3
# 2 1 2 7 9
# 3 2 1 1 4
# 4 2 2 6 7
# 5 2 3 10 10
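For a dependency-free comparison, here is a base R sketch of the same run-detection idea (not from the thread above; it assumes orig as defined in the question, with month.nr consecutive within each id):
res <- do.call(rbind, lapply(split(orig, orig$id), function(d) {
  r <- rle(d$work.yn)                      # runs of 0s and 1s within one id
  keep <- r$values == 1                    # keep only the working runs
  end <- d$month.nr[cumsum(r$lengths)][keep]
  start <- end - r$lengths[keep] + 1
  data.frame(id = d$id[1],
             month.start.work = start,
             month.end.work = end,
             sequence.nr = seq_along(start))
}))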

Related

create list from columns of data table expression

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
which looks like this:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I'm grouping each column by its unique values and counting how many times each unique value appears:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in dt, and thus a.N is 2; the same logic applies to the other values.
The problem is that if these per-column transformations end up with different numbers of rows, values get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer, because b.N should now have only 3 rows; the shorter vectors got recycled.
This is why I would like to turn the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list whose elements can have different lengths, with the item names being the column names of the transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6
$b.N
3 1 1
Or even better solution would be to get a datatable with a track of the columns on another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b
Get the data into long format and then aggregate by group:
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In the tidyverse:
library(dplyr)
library(tidyr)
dt %>%
  pivot_longer(cols = everything()) %>%
  count(name, value)
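If the ragged named list sketched in the question is really needed, a minimal sketch (assuming dt as defined above):
newlist <- unlist(lapply(names(dt), function(col) {
  agg <- dt[, .N, by = col]               # counts per unique value of one column
  setNames(list(agg[[col]], agg$N),       # pair of vectors, possibly of
           paste0(col, c(".agg", ".N")))  # different lengths per column
}), recursive = FALSE)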

How to group rows in data frame while counting occurrences in one column and summing values in other?

I am trying to modify my data frame:
start end duration_time
1 1 2 2.438
2 2 1 3.901
3 1 2 18.037
4 2 3 85.861
5 3 4 83.922
and create something like this:
start end duration_time weight
1 1 2 20.475 2
2 2 1 3.901 1
4 2 3 85.861 1
5 3 4 83.922 1
So duplicate start-end combinations will be removed, the weight will increase, and the duration time will be summed.
I already have a part working; I just can't get the weight to work:
library('plyr')
df <- read.table(header = TRUE, text = "start end duration_time
1 1 2 2.438
2 2 1 3.901
3 1 2 18.037
4 2 3 85.861
5 3 4 83.922")
ddply(df, c("start","end"), summarise, weight=? ,duration_time=sum(duration_time))
A base R option is aggregate:
# aggregate returns a matrix column when FUN returns a vector;
# wrapping in do.call(data.frame, ...) flattens it into separate columns
do.call(data.frame, aggregate(duration_time ~ ., df1,
        FUN = function(x) c(duration_time = sum(x), weight = length(x))))
The simplest solution uses data.table:
library(data.table)
setDT(df)[, .(duration_time=sum(duration_time), wt = .N) , by =c("start", "end")]
start end duration_time wt
1: 1 2 20.475 2
2: 2 1 3.901 1
3: 2 3 85.861 1
4: 3 4 83.922 1
Trying something using dplyr and tidyr:
library(dplyr)
library(tidyr)
df1 <- df %>% unite(by_var, start, end)
df2 <- cbind(df1 %>% count(by_var),
             df1 %>% group_by(by_var) %>%
               summarise(duration_time = sum(duration_time)) %>%
               separate(by_var, c("start", "end")))[c(3, 4, 5, 2)]
> df2
start end duration_time n
1 1 2 20.475 2
2 2 1 3.901 1
3 2 3 85.861 1
4 3 4 83.922 1
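For completeness, the OP's own plyr attempt can be finished by using length() as the counter; a sketch (weight here simply counts the rows in each group):
library(plyr)
ddply(df, c("start", "end"), summarise,
      weight = length(duration_time),        # number of rows per group
      duration_time = sum(duration_time))    # summed durations per group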

How can I arrange data from wide format to long format, and specify relationships

Currently I have a file which I need to convert from wide format to long format. An example of the data:
Subject,Cat1_Weight,Cat2_Weight,Cat3_Weight,Cat1_Sick,Cat2_Sick,Cat3_Sick
1,10,11,12,1,0,0
2,7,8,9,1,0,0
However, I need it in the long format as follows
Subject,CatNumber,Weight,Sickness
1,1,10,1
1,2,11,0
1,3,12,0
2,1,7,1
2,2,8,0
2,3,9,0
So far I have tried the melt function in R:
datalong <- melt(exp2_simon_shortform, id = "Subject")
But it treats every single column name as a unique variable, each with its own value. Does anybody know how I could get from wide to long as specified, making reference to the column header names?
Cheers.
EDIT: I've realised I made an error. My final output needs to be as follows, so from the Cat1_ portion I actually need to extract both "Cat" and "1":
Subject Animal CatNumber Weight Sickness
1 Cat 1 10 1
1 Cat 2 11 0
1 Cat 3 12 0
2 Cat 1 7 1
2 Cat 2 8 0
2 Cat 3 9 0
Any updated solutions much appreciated.
The "dplyr" + "tidyr" approach might be something like:
library(dplyr)
library(tidyr)
mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val)
# Subject CatNumber Sick Weight
# 1 1 Cat1 1 10
# 2 1 Cat2 0 11
# 3 1 Cat3 0 12
# 4 2 Cat1 1 7
# 5 2 Cat2 0 8
# 6 2 Cat3 0 9
Add a mutate in there, along with gsub, to remove the "Cat" part of the "CatNumber" column; a sketch follows.
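A hypothetical version of that tweak, extending the pipe above (also adding the Animal column the edit asks for):
mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val) %>%
  mutate(Animal = "Cat",                                        # constant here
         CatNumber = as.integer(gsub("Cat", "", CatNumber)))    # "Cat1" -> 1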
Update
Based on the discussions in chat, your data actually look something more like:
A = c("ATCint", "Blank", "None"); B = 1:5; C = c("ResumptionTime", "ResumptionMisses")
colNames <- expand.grid(A, B, C)
colNames <- sprintf("%s%d_%s", colNames[[1]], colNames[[2]], colNames[[3]])
subject = 1:60
set.seed(1)
M <- matrix(sample(10, length(subject) * length(colNames), TRUE),
nrow = length(subject), dimnames = list(NULL, colNames))
mydf <- data.frame(Subject = subject, M)
Thus, you will need to do a few additional steps to get the output you desire. Try:
library(dplyr)
library(tidyr)
mydf %>%
  group_by(Subject) %>%                                         ## Your ID variable
  gather(var, val, -Subject) %>%                                ## Make long data; everything except the IDs
  separate(var, into = c("partA", "partB")) %>%                 ## Split new column into two parts
  mutate(partA = gsub("(.*)([0-9]+)", "\\1_\\2", partA)) %>%    ## Make new col easy to split
  separate(partA, into = c("A1", "A2")) %>%                     ## Split this new column
  spread(partB, val)                                            ## Transform to wide form
Which yields:
Source: local data frame [900 x 5]
Subject A1 A2 ResumptionMisses ResumptionTime
(int) (chr) (chr) (int) (int)
1 1 ATCint 1 9 3
2 1 ATCint 2 4 3
3 1 ATCint 3 2 2
4 1 ATCint 4 7 4
5 1 ATCint 5 7 1
6 1 Blank 1 4 10
7 1 Blank 2 2 4
8 1 Blank 3 7 5
9 1 Blank 4 1 9
10 1 Blank 5 10 10
.. ... ... ... ... ...
You can do it with base reshape, like:
reshape(dat, idvar="Subject", direction="long", varying=list(2:4,5:7),
v.names=c("Weight","Sick"), timevar="CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
Alternatively, since reshape expects names like variablename_groupname, you could change the names and then let reshape do the hard work:
names(dat) <- gsub("Cat(.+)_(.+)", "\\2_\\1", names(dat))
reshape(dat, idvar="Subject", direction="long", varying=-1,
sep="_", timevar="CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
We can use melt from library(data.table), which can take multiple patterns for the measure variables:
library(data.table)#v1.9.6+
DT <- melt(setDT(df1), measure=patterns('Weight$', 'Sick$'),
variable.name='CatNumber', value.name=c('Weight', 'Sick'))[order(Subject)]
DT
# Subject CatNumber Weight Sick
#1: 1 1 10 1
#2: 1 2 11 0
#3: 1 3 12 0
#4: 2 1 7 1
#5: 2 2 8 0
#6: 2 3 9 0
If we need the 'Animal' column, we can grep for the 'Cat' columns, remove the suffix with sub, and assign (:=) the result to create the 'Animal' column.
DT[, Animal := sub('\\d+\\_.*', '', grep('Cat', colnames(df1), value=TRUE))]
DT
# Subject CatNumber Weight Sick Animal
#1: 1 1 10 1 Cat
#2: 1 2 11 0 Cat
#3: 1 3 12 0 Cat
#4: 2 1 7 1 Cat
#5: 2 2 8 0 Cat
#6: 2 3 9 0 Cat
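With current tidyr (>= 1.0.0), the whole reshape, including the Animal column, can be sketched in a single pivot_longer call (assuming mydf holds the original Cat example); names_pattern splits "Cat1_Weight" into the animal, number, and value-column parts:
library(tidyr)
pivot_longer(mydf, -Subject,
             names_pattern = "([A-Za-z]+)(\\d+)_(.*)",
             names_to = c("Animal", "CatNumber", ".value"))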

for loop & if function in R

I was writing a loop with an if function in R. The table is below:
ID category
1 a
1 b
1 c
2 a
2 b
3 a
3 b
4 a
5 a
I want to use a for loop with an if function to add another column that counts the rows within each grouped ID, like the Count column below:
ID category Count
1 a 1
1 b 2
1 c 3
2 a 1
2 b 2
3 a 1
3 b 2
4 a 1
5 a 1
My code is (output1 is the table name):
for (i in 2:nrow(output1)){
  if(output1[i,1] == output[i-1,1]){
    output1[i,"rn"] <- output1[i-1,"rn"]+1
  }
  else{
    output1[i,"rn"] <- 1
  }
}
But the result comes back with all Count column values equal to "1":
ID category Count
1 a 1
1 b 1
1 c 1
2 a 1
2 b 1
3 a 1
3 b 1
4 a 1
5 a 1
Please help me out... Thanks
There are packages and vectorized ways to do this task, but if you are practicing with loops try:
output1$rn <- 1
for (i in 2:nrow(output1)){
  if(output1[i,1] == output1[i-1,1]){
    output1[i,"rn"] <- output1[i-1,"rn"]+1
  }
  else{
    output1[i,"rn"] <- 1
  }
}
With your original code, the call output1[i-1,"rn"]+1 in the third line of your loop referenced an "rn" column that didn't exist on the first pass. By first creating the column and filling it with the value 1, you give the loop something explicit to refer to. (Note your condition also had a typo: output should be output1.)
output1
# ID category rn
# 1 1 a 1
# 2 1 b 2
# 3 1 c 3
# 4 2 a 1
# 5 2 b 2
# 6 3 a 1
# 7 3 b 2
# 8 4 a 1
# 9 5 a 1
With the dplyr package you can accomplish it quickly:
library(dplyr)
output1 %>% group_by(ID) %>% mutate(rn = 1:n())
Or with data.table:
library(data.table)
setDT(output1)[,rn := 1:.N, by=ID]
With base R you can also use:
output1$rn <- with(output1, ave(as.character(category), ID, FUN=seq))
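Since ave returns a vector of the same type as its first argument, rn here comes back as character. A numeric variant that doesn't depend on category at all (a sketch):
output1$rn <- with(output1, ave(seq_along(ID), ID, FUN = seq_along))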
There are vignettes and tutorials for the two packages mentioned, and see ?ave in the R console for the last approach.
A looping solution will be painfully slow for bigger data. Here is a one-line solution using data.table:
require(data.table)
a<-data.table(ID=c(1,1,1,2,2,3,3,4,5),category=c('a','b','c','a','b','a','b','a','a'))
a[,':='(category_count = 1:.N),by=.(ID)]
What you want is actually the numeric code of the factor levels. Do this:
df$count = as.numeric(df$category)
This will give output as:
ID category count
1 1 a 1
2 1 b 2
3 1 c 3
4 2 a 1
5 2 b 2
6 3 a 1
7 3 b 2
8 4 a 1
9 5 a 1
provided your category is already a factor. If not, first convert it to a factor:
df$category = as.factor(df$category)
df$count = as.numeric(df$category)
Note this matches the desired count only because the categories within each ID start at "a" and run consecutively through the alphabet; it is not a general per-group counter.

Replacing the last value within groups with different values

My question is similar to this post, but the difference is that instead of replacing the last value within each group/id with all 0's, a different value is used for each group/id.
Here is an example (I borrowed it from the above link):
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 9999
6 2 0
7 2 9
8 2 500
9 3 0
10 3 1
In the above link, the last value within each group/id was replaced by a zero, using something like:
df %>%
  group_by(id) %>%
  mutate(Time = c(Time[-n()], 0))
And the output was
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 0
6 2 0
7 2 9
8 2 0
9 3 0
10 3 0
In my case, I would like the last value within each group/id to be replaced by a different value. Originally, the last values within each group/id were 9999, 500, and 1. Now I would like 9999 replaced by 5, 500 replaced by 12, and 1 replaced by 92. The desired output is:
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 5
6 2 0
7 2 9
8 2 12
9 3 0
10 3 92
I tried this one:
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), c(5, 12, 92)))
but it did not work.
This can be solved using almost the same solution I posted in the linked question; just replace 0L with the desired values:
library(data.table)
indx <- setDT(df)[, .I[.N], by = id]$V1
df[indx, Time := c(5L, 12L, 92L)]
df
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
So to add some explanations:
.I is identical to row_number() or 1:n() in dplyr for ungrouped data, i.e., 1:nrow(df) in base R.
.N is like n() in dplyr, i.e., the size of a certain group (or of the whole data set). So when I run .I[.N] by group, I'm retrieving the global row index of the last row of each group.
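To make that concrete, here is what the index expression returns for the example data:
setDT(df)[, .I[.N], by = id]
#    id V1
# 1:  1  5
# 2:  2  8
# 3:  3 10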
The next step is simply to use this index as a row index into df while assigning the desired values to Time by reference, using the := operator.
Edit
Per the OP's request, here's a possible dplyr solution. Your original attempt doesn't work because you are operating per group, so you were trying to pass all three values to every group.
The only way I can think of is to first calculate the group sizes, then ungroup, and then mutate on the cumulative sum of those sizes, something along these lines:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(indx = n()) %>%
  ungroup() %>%
  mutate(Time = replace(Time, cumsum(unique(indx)), c(5, 12, 92))) %>%
  select(-indx)
# Source: local data frame [10 x 2]
#
# id Time
# 1 1 3
# 2 1 10
# 3 1 1
# 4 1 0
# 5 1 5
# 6 2 0
# 7 2 9
# 8 2 12
# 9 3 0
# 10 3 92
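Note that cumsum(unique(indx)) relies on the three group sizes being distinct. With newer dplyr (>= 1.0.0, released after this was written), a more robust sketch picks the replacement via cur_group_id(), assuming new_vals is ordered by id:
library(dplyr)
new_vals <- c(5, 12, 92)   # one replacement per id, in id order
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), new_vals[cur_group_id()])) %>%
  ungroup()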
Another way using data.table would be to create a second data.table that contains the replacement value for each id, and then join and update by reference in one step.
require(data.table) # v1.9.5+ (for 'on = ' feature)
replace = data.table(id = 1:3, val = c(5L, 12L, 92L)) # from @David
setDT(df)[replace, Time := val, on = "id", mult = "last"]
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
In data.table, joins are considered an extension of subsetting. It's natural to think of doing whatever operation we do on subsets also on joins; both operations do something on some rows.
For each replace$id, we find the last matching row (mult = "last") in df$id, and update that row with the corresponding val.
Installation instructions for v1.9.5 here. Hope this helps.
