How can I arrange data from wide format to long format, and specify relationships (R)

Currently I have a file which I need to convert from wide format to long format. An example of the data:
Subject,Cat1_Weight,Cat2_Weight,Cat3_Weight,Cat1_Sick,Cat2_Sick,Cat3_Sick
1,10,11,12,1,0,0
2,7,8,9,1,0,0
However, I need it in the long format as follows
Subject,CatNumber,Weight,Sickness
1,1,10,1
1,2,11,0
1,3,12,0
2,1,7,1
2,2,8,0
2,3,9,0
So far I have tried the melt function in R:
datalong <- melt(exp2_simon_shortform, id ="Subject")
But it treats every single column name as a unique variable, each with its own value. Does anybody know how I could get from wide to long as specified, making reference to the column header names?
Cheers.
EDIT: I've realised I made an error. My final output needs to be as follows; from the Cat1_ portion, I actually need to get out "Cat" and "1":
Subject Animal CatNumber Weight Sickness
1 Cat 1 10 1
1 Cat 2 11 0
1 Cat 3 12 0
2 Cat 1 7 1
2 Cat 2 8 0
2 Cat 3 9 0
Any updated solutions much appreciated.

The "dplyr" + "tidyr" approach might be something like:
library(dplyr)
library(tidyr)
mydf %>%
gather(var, val, -Subject) %>%
separate(var, into = c("CatNumber", "variable")) %>%
spread(variable, val)
# Subject CatNumber Sick Weight
# 1 1 Cat1 1 10
# 2 1 Cat2 0 11
# 3 1 Cat3 0 12
# 4 2 Cat1 1 7
# 5 2 Cat2 0 8
# 6 2 Cat3 0 9
Add a mutate in there along with gsub to remove the "Cat" part of the "CatNumber" column.
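A minimal sketch of that step (untested), extending the pipe above and also adding the Animal column requested in the question's edit:

mydf %>%
  gather(var, val, -Subject) %>%
  separate(var, into = c("CatNumber", "variable")) %>%
  spread(variable, val) %>%
  mutate(Animal = gsub("[0-9]+$", "", CatNumber),              ## "Cat1" -> "Cat"
         CatNumber = as.integer(gsub("[^0-9]", "", CatNumber)))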
Update
Based on the discussions in chat, your data actually look something more like:
A = c("ATCint", "Blank", "None"); B = 1:5; C = c("ResumptionTime", "ResumptionMisses")
colNames <- expand.grid(A, B, C)
colNames <- sprintf("%s%d_%s", colNames[[1]], colNames[[2]], colNames[[3]])
subject = 1:60
set.seed(1)
M <- matrix(sample(10, length(subject) * length(colNames), TRUE),
nrow = length(subject), dimnames = list(NULL, colNames))
mydf <- data.frame(Subject = subject, M)
Thus, you will need to do a few additional steps to get the output you desire. Try:
library(dplyr)
library(tidyr)
mydf %>%
group_by(Subject) %>% ## Your ID variable
gather(var, val, -Subject) %>% ## Make long data. Everything except your IDs
separate(var, into = c("partA", "partB")) %>% ## Split new column into two parts
mutate(partA = gsub("(.*)([0-9]+)", "\\1_\\2", partA)) %>% ## Make new col easy to split
separate(partA, into = c("A1", "A2")) %>% ## Split this new column
spread(partB, val) ## Transform to wide form
Which yields:
Source: local data frame [900 x 5]
Subject A1 A2 ResumptionMisses ResumptionTime
(int) (chr) (chr) (int) (int)
1 1 ATCint 1 9 3
2 1 ATCint 2 4 3
3 1 ATCint 3 2 2
4 1 ATCint 4 7 4
5 1 ATCint 5 7 1
6 1 Blank 1 4 10
7 1 Blank 2 2 4
8 1 Blank 3 7 5
9 1 Blank 4 1 9
10 1 Blank 5 10 10
.. ... ... ... ... ...

You can do it with base reshape, like:
reshape(dat, idvar="Subject", direction="long", varying=list(2:4,5:7),
v.names=c("Weight","Sick"), timevar="CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
Alternatively, since reshape expects names like variablename_groupname you could change the names then get reshape to do the hard work:
names(dat) <- gsub("Cat(.+)_(.+)", "\\2_\\1", names(dat))
reshape(dat, idvar="Subject", direction="long", varying=-1,
sep="_", timevar="CatNumber")
# Subject CatNumber Weight Sick
#1.1 1 1 10 1
#2.1 2 1 7 1
#1.2 1 2 11 0
#2.2 2 2 8 0
#1.3 1 3 12 0
#2.3 2 3 9 0
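For completeness, newer tidyr (1.0+) can produce the edited desired output, Animal column included, in one step with pivot_longer; a sketch (not from the original answers), assuming dat still has its original column names:

library(tidyr)
pivot_longer(dat, -Subject,
             names_pattern = "(Cat)([0-9]+)_(.*)",
             names_to = c("Animal", "CatNumber", ".value"))

Here ".value" tells pivot_longer that the third captured group (Weight/Sick) should name the output columns.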

We can use melt from library(data.table), which can take multiple patterns for the measure variables.
library(data.table)#v1.9.6+
DT <- melt(setDT(df1), measure=patterns('Weight$', 'Sick$'),
variable.name='CatNumber', value.name=c('Weight', 'Sick'))[order(Subject)]
DT
# Subject CatNumber Weight Sick
#1: 1 1 10 1
#2: 1 2 11 0
#3: 1 3 12 0
#4: 2 1 7 1
#5: 2 2 8 0
#6: 2 3 9 0
If we need the 'Animal' column, we can grep for the 'Cat' columns, strip the suffix with sub, and assign (:=) the result to create the 'Animal' column.
DT[, Animal := sub('\\d+\\_.*', '', grep('Cat', colnames(df1), value=TRUE))]
DT
# Subject CatNumber Weight Sick Animal
#1: 1 1 10 1 Cat
#2: 1 2 11 0 Cat
#3: 1 3 12 0 Cat
#4: 2 1 7 1 Cat
#5: 2 2 8 0 Cat
#6: 2 3 9 0 Cat
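If the animal prefix itself varied across columns (say, Cat and Dog), one more general option would be to melt to fully long form and split the variable names afterwards; a sketch (untested), assuming the same df1:

DTlong <- melt(setDT(df1), id.vars = 'Subject')
## split e.g. 'Cat1_Weight' at the letter/digit boundary and at '_'
DTlong[, c('Animal', 'CatNumber', 'Measure') :=
         tstrsplit(as.character(variable), '(?<=[a-zA-Z])(?=[0-9])|_', perl = TRUE)]
dcast(DTlong, Subject + Animal + CatNumber ~ Measure, value.var = 'value')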

create list from columns of data table expression

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
That looks like that:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I'm aggregating each column by its unique values and then counting how many unique values each column has:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in dt, and thus a.N is 2; the same logic applies to the other values.
But the problem is that if these transformations of the original data.table end up with different dimensions, things get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer, because the b columns should now have only 3 rows; the shorter vectors got recycled.
This is why I would like to turn the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list whose items can have different lengths, with the names of the items in the list being the names of the columns in the new transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6
$b.N
3 1 1
Or, even better, a solution that returns a data.table which keeps track of the original column in another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b
Get the data in long format and then aggregate by group.
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In the tidyverse:
library(dplyr)
library(tidyr)
dt %>%
pivot_longer(cols = everything()) %>%
count(name, value)
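If the named-list form sketched in the question is really needed, it can be derived from the long counts; a sketch, assuming dt_long from above:

counts <- dt_long[, .N, .(variable, value)]
## one element per original column and statistic: a.agg, b.agg, a.N, b.N
newlist <- c(split(counts$value, paste0(counts$variable, '.agg')),
             split(counts$N, paste0(counts$variable, '.N')))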

Add missing values in time series efficiently

I have 500 datasets (panel data). In each I have a time series (week) across different shops (store). Within each shop, I need to add the missing time-series observations.
A sample of my data would be:
store week value
1 1 50
1 3 52
1 4 10
2 1 4
2 4 84
2 5 2
which I would like to look like:
store week value
1 1 50
1 2 0
1 3 52
1 4 10
2 1 4
2 2 0
2 3 0
2 4 84
2 5 2
I currently use the following code (which works, but takes very, very long on my data):
stores <- unique(mydata$store)
result <- NULL
for (i in 1:length(stores)) {
  # full week range for this store, merged back onto its data
  sub <- merge(
    expand.grid(store = stores[i], week = min(mydata$week):max(mydata$week)),
    mydata[mydata$store == stores[i], ], all.x = TRUE)
  sub[is.na(sub)] <- 0
  result <- rbind(result, sub)
}
mydata <- result
Are there better and more efficient ways to do so?
Here's a dplyr/tidyr option you could try:
library(dplyr); library(tidyr)
group_by(df, store) %>%
complete(week = full_seq(week, 1L), fill = list(value = 0))
#Source: local data frame [9 x 3]
#
# store week value
# (int) (int) (dbl)
#1 1 1 50
#2 1 2 0
#3 1 3 52
#4 1 4 10
#5 2 1 4
#6 2 2 0
#7 2 3 0
#8 2 4 84
#9 2 5 2
By default, if you don't specify the fill parameter, new rows are filled with NA. Since you seem to have many other columns, I would advise leaving out the fill parameter so you end up with NAs and, if required, adding another step with mutate_each to turn the NAs into 0 (if that's appropriate).
group_by(df, store) %>%
complete(week = full_seq(week, 1L)) %>%
mutate_each(funs(replace(., which(is.na(.)), 0)), -store, -week)
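A data.table alternative, not from this thread but sketched here for comparison (assuming the same df with columns store, week, value): build each store's full week range, join the data back, and fill the gaps.

library(data.table)
setDT(df)
grid <- df[, .(week = min(week):max(week)), by = store]  ## per-store week range
full <- df[grid, on = .(store, week)]                    ## join back; gaps become NA
full[is.na(value), value := 0]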

Aggregate monthly status data to sequence data

I wonder if there is a simple solution for the following problem: imagine working with monthly status information about whether somebody works (work = 1) or not (work = 0). This illustrates the original data:
orig <- data.frame(id=c(rep(1:2, each=10)),
month.nr=c(rep(1:10,2)),
work.yn=c(0,1,1,0,0,0,1,1,1,0,
1,1,1,1,0,1,1,0,0,1))
id month.nr work.yn
1 1 0
1 2 1
1 3 1
1 4 0
1 5 0
1 6 0
1 7 1
1 8 1
1 9 1
1 10 0
2 1 1
2 2 1
2 3 1
2 4 1
2 5 0
2 6 1
2 7 1
2 8 0
2 9 0
2 10 1
I'm looking for a simple function or algorithm which transforms the data, keeping only the start and end months of each working period, and which numbers the resulting sequences by person (id). The resulting data for the sample above would look like this:
id month.start.work month.end.work sequence.nr
1 2 3 1
1 7 9 2
2 1 4 1
2 6 7 2
2 10 10 3
As my data volume is not so small, a resource-efficient solution is very much appreciated.
Edit: doing the task with a loop (and maybe a lag function) would work, but I'm looking for a more vectorized solution.
Here's a somewhat similar solution using the rleid function in data.table v1.9.6+ (the newest stable version):
library(data.table) # v.1.9.6+
setDT(orig)[, indx := rleid(work.yn)
  ][work.yn != 0, .(start = month.nr[1L],
                    end = month.nr[.N]),
    by = .(id, indx)
  ][, seq := 1:.N, by = id][]
# id indx start end seq
# 1: 1 2 2 3 1
# 2: 1 4 7 9 2
# 3: 2 6 1 4 1
# 4: 2 8 6 7 2
# 5: 2 10 10 10 3
A slight variant of the above, without having to create the index first, thereby avoiding one grouping operation:
setDT(orig)[, if (work.yn[1L])
.(start=month.nr[1L], end=month.nr[.N]),
by=.(id, rleid(work.yn))
][, seq := seq_len(.N), by=id][]
Or we could just use range for shorter code
setDT(orig)[, if (work.yn[1L]) as.list(range(month.nr)),
by = .(id, rleid(work.yn))
][, seq := seq_len(.N), by = id][]
You can use the data.table package, with this small utility function:
library(data.table)
f = function(x, y)
{
  r = rle(x)
  # last month of each run of 1s
  end = y[cumsum(r$lengths)[!!r$values]]
  # step back by the run length to get the first month of each run
  start = end - r$lengths[!!r$values] + 1
  list(month.start = start, month.end = end)
}
setDT(orig)[, f(work.yn,month.nr),id][, sequence.nr:=seq(.N),id][]
# id month.start month.end sequence.nr
#1: 1 2 3 1
#2: 1 7 9 2
#3: 2 1 4 1
#4: 2 6 7 2
#5: 2 10 10 3
A solution using the dplyr library:
require("dplyr")
orig %>% filter(work.yn == 1) %>% group_by(id) %>%
mutate(sequence.nr = cumsum(diff(c(-1, month.nr)) != 1)) %>%
group_by(id, sequence.nr) %>% mutate(start_mon = min(month.nr),
end_mon = max(month.nr)) %>%
select(-month.nr, -work.yn) %>% distinct
# id sequence.nr start_mon end_mon
# 1 1 1 2 3
# 2 1 2 7 9
# 3 2 1 1 4
# 4 2 2 6 7
# 5 2 3 10 10
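With newer dplyr (>= 1.1), consecutive_id() plays the role of data.table's rleid, so the same idea can be written as follows; a sketch, assuming the orig data above:

library(dplyr)  # >= 1.1.0 for consecutive_id() and .by
orig %>%
  mutate(run = consecutive_id(work.yn), .by = id) %>%
  filter(work.yn == 1) %>%
  summarise(month.start.work = min(month.nr),
            month.end.work = max(month.nr), .by = c(id, run)) %>%
  mutate(sequence.nr = row_number(), .by = id) %>%
  select(-run)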

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1: mode    df2: sex
     1            1
     2            2
     3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr, obtaining all combinations (with non-existent ones = 0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left join (left_join(df1, df3)), you recover the modes not in df3, but sex appears as NA; the same happens if you do left_join(df2, df3).
So how can you do both left joins to recover all the absent combinations, with cases = 0? dplyr preferred, but sqldf is an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day; it seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
group_by(mode, sex) %>%
summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First, here's your data in a more friendly, reproducible format:
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then I left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
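With current tidyverse versions, the cross join itself can stay inside the pipe; a sketch using tidyr::crossing():

library(dplyr)
library(tidyr)
crossing(df1, df2) %>%                      ## all mode/sex combinations
  left_join(df3, by = c("mode", "sex")) %>%
  group_by(mode, sex) %>%
  summarise(cases = sum(cases, na.rm = TRUE))

Summing with na.rm = TRUE turns the all-NA groups into 0, so no separate replacement step is needed.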

aggregate dataframe subsets in R

I have the dataframe ds
CountyID ZipCode Value1 Value2 Value3 ... Value25
1 1 0 etc etc etc
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0
and would like to aggregate based on ds$ZipCode and set ds$CountyID equal to the primary county based on the highest ds$Value1. For the above example, it would look like this:
CountyID ZipCode Value1 Value2 Value3 ... Value25
2 1 4 etc etc etc
5 2 2
6 3 3
7 4 9
9 5 1
10 6 0
All the ValueX columns are the sum of that column grouped by ZipCode.
I've tried a bunch of different strategies over the last couple of days, but none of them work. The best I've come up with is:
#initialize the dataframe
ds_temp = data.frame()
#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode)) {
sub <- subset(ds, ds$ZipCode == zip)
len <- length(sub)
maxIndex <- which.max(sub$Value1)
#do the aggregation
row <- aggregate(sub[3:27], FUN=sum, by=list(
CountyID = rep(sub$CountyID[maxIndex], len),
ZipCode = sub$ZipCode))
rbind(ds_temp, row)
}
ds <- ds_temp
I haven't been able to test this on the real data, but with dummy datasets (such as the one above), I keep getting the error "arguments must have the same length". I've messed around with rep() and fixed vectors (e.g. c(1,2,3,4)), but no matter what I do, the error persists. I also occasionally get an error to the effect of "cannot subset data of type 'closure'".
Any ideas? I've also tried messing around with data.frame(), ddply(), data.table(), dcast(), etc.
You can try this:
data.frame(aggregate(df[,3:27], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
Fully reproducible sample data:
df<-read.table(text="
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header=TRUE)
data.frame(aggregate(df[,3], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
# Group.1 x CountyID
#1 1 4 2
#2 2 2 5
#3 3 3 6
#4 4 9 7
#5 5 1 9
#6 6 0 10
In response to your comment on Frank's answer, you can preserve the column names by using the formula method in aggregate. Using Frank's data df, this would be:
> cbind(aggregate(Value1 ~ ZipCode, df, sum),
CountyID = sapply(split(df, df$ZipCode), function(x) {
with(x, CountyID[Value1 == max(Value1)]) }))
# ZipCode Value1 CountyID
# 1 1 4 2
# 2 2 2 5
# 3 3 3 6
# 4 4 9 7
# 5 5 1 9
# 6 6 0 10
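For reference, the same result in dplyr (>= 1.0 for across()); a sketch, assuming df as above with however many Value columns are present:

library(dplyr)
df %>%
  group_by(ZipCode) %>%
  summarise(CountyID = CountyID[which.max(Value1)],  ## county with the top Value1
            across(starts_with("Value"), sum))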
