R: Transposing from long to wide and aggregating rows with matching ID

This is something I've been working around for a while by making separate data frames and doing a full_join, but I think there's an easier way.
Overall, I want to calculate the difference between an individual ID's value from time 1 to time 2, by type, starting from a long-form data frame. This is one way I think I could do it, but if other people have other techniques or ideas, I'd like to hear them too.
However, I'd also like to know how to address this transposing issue anyway, because I'm curious.
Here's my issue.
I have a data frame in long form with 5 different measures for two different time periods. I want to convert this data frame from long form into a wide form, so that instead of having a DF that looks like this (note: not all types are included -- I only show 2 to keep this short):
(example df1)
ID Time Value Type
1 1 7 Type1
1 2 8 Type1
2 1 9 Type1
2 2 10 Type1
1 1 13 Type2
1 2 15 Type2
2 1 17 Type2
2 2 19 Type2
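For reference, a minimal way to build this example data frame (matching the table above):
df <- data.frame(
  ID    = rep(c(1, 1, 2, 2), times = 2),
  Time  = rep(c(1, 2), times = 4),
  Value = c(7, 8, 9, 10, 13, 15, 17, 19),
  Type  = rep(c("Type1", "Type2"), each = 4)
)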
I want it to look more like this:
(example df 2)
ID Type1.1 Type1.2 Type2.1 Type2.2
1 7 8 13 15
2 9 10 17 19
I use:
library(dplyr)
library(tidyr)
df.new <- df %>%
  spread(Type, Value)
and get this from example df 1, which is on the right track:
(example df 3)
ID Time Type1 Type2
1 1 7 13
1 2 8 15
2 1 9 17
2 2 10 19
But now I want to spread the time for each type. When I do something like this on example df3:
newer.df <- df.new %>%
  spread(Time, Type1)
I get this:
ID Type1.1 Type1.2
1 7 NA
1 NA 8
2 9 NA
2 NA 10
So it's producing an NA for each row -- is there a way I can collapse the rows onto each other by ID? I think I'm missing something.
Remember, in my example code I'm only using 2 types, but in reality I have 5 -- I just wanted to give simplified code.

We can use dcast() from the reshape2 package.
library(reshape2)
dcast(df, ID ~ Type + Time, value.var = "Value")
# ID Type1_1 Type1_2 Type2_1 Type2_2
#1 1 7 8 13 15
#2 2 9 10 17 19

Or, using the tidyr package, we could do this:
library(tidyr)
df$Type <- paste(df$Type, df$Time, sep="_")
df$Time <- NULL
spread(df, key=Type, value=Value)
ID Type1_1 Type1_2 Type2_1 Type2_2
1 7 8 13 15
2 9 10 17 19
Nulling the Time column did the trick for me. It seems that spread treats all columns not otherwise used as what dcast would call id.vars. There might be a more elegant solution using tidyr, though.
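One such variant (a sketch) stays within tidyr by letting unite() build the combined key instead of paste():
library(tidyr)
df %>%
  unite(Type, Type, Time, sep = "_") %>%   # combine Type and Time into one key column
  spread(key = Type, value = Value)
And in tidyr 1.0 and later, spread() itself has been superseded by pivot_wider(), which accepts several names_from columns at once, so the whole reshape becomes one call (assuming tidyr >= 1.0):
pivot_wider(df, names_from = c(Type, Time), values_from = Value)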

Equivalent to first./last. SAS processing in R

I did find a thread on this (R equivalent of .first or .last sas operator) but it did not fully answer my question.
I come from a SAS background and a common operation is, for example, when you have your patient ID with several different values, and you want to keep only the row with the minimum/maximum value for another variable for each ID. For example, I might have data with dates of a certain medical problem for each ID, and I want a dataset with just the first/last problem date for each patient.
Here's a simple example that gets me what I want, but I want to know if there's a better way to do it. I sort by id and then by count, and I want to keep just the row with the largest count for each id.
testdata <- data.frame(id = c(1,1,1,2,3,3,4,3,4,4,4),
                       count = c(5,9,2,6,16,12,0,11,8,8,7))
library(dplyr)
testdata2 <- arrange(testdata, id, count)
testdata3 <- cbind(testdata2, !duplicated(testdata2$id, fromLast = TRUE))
testdata4 <- subset(testdata3, testdata3[, 3] == TRUE)[, -3]
> testdata4
id count
3 1 9
4 2 6
7 3 16
11 4 8
Is there a more compact way to do this?
Thank you.
do.call(rbind.data.frame,
        c(by(testdata, testdata$id, function(d) d[c(1L, nrow(d)), ]),
          stringsAsFactors = FALSE))
# id count
# 1.1 1 5
# 1.3 1 2
# 2.4 2 6
# 2.4.1 2 6
# 3.5 3 16
# 3.8 3 11
# 4.7 4 0
# 4.11 4 7
Breaking it down:
d[c(1L, nrow(d)), ] returns the first and last row of the data frame. (I'm assuming the frame has already been ordered appropriately.)
by(testdata, testdata$id, ...) breaks the larger frame into smaller frames by $id and passes each smaller frame to the anonymous function, returning a by-list of the results.
do.call(rbind.data.frame, ...) takes that list and row-binds it back together into a single frame. Since the default is to create factors, I added stringsAsFactors=FALSE.
If you want to use dplyr, you can do:
library(dplyr)
group_by(testdata, id) %>%
slice(c(1,n())) %>%
ungroup()
# # A tibble: 8 × 2
# id count
# <dbl> <dbl>
# 1 1 5
# 2 1 2
# 3 2 6
# 4 2 6
# 5 3 16
# 6 3 11
# 7 4 0
# 8 4 7
where n() is a special function within dplyr pipes that returns the number of rows in that (optionally-grouped) frame.
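Both answers above keep the first and last row per id. If, as in the original question, you only want the single row with the largest count per id, a compact dplyr variant would be this sketch (it assumes dplyr >= 1.0, which introduced slice_max()):
library(dplyr)
testdata %>%
  group_by(id) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%  # one max-count row per id
  ungroup()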

sum up certain variables (columns) by variable names

I want to sum up certain variables (columns) in a data frame.
I would like to select those variables by parts of their names.
The tricky thing is that I have several conditions, so using a single contains() from dplyr does not work.
Here is an example:
ab_yy <- c(1:5)
bc_yy <- c(5:9)
cd_yy <- c(2:6)
de_xx <- c(3:7)
dat <- data.frame(ab_yy, bc_yy, cd_yy, de_xx)
dat
#   ab_yy bc_yy cd_yy de_xx
# 1     1     5     2     3
# 2     2     6     3     4
# 3     3     7     4     5
# 4     4     8     5     6
# 5     5     9     6     7
# sum up all variables that contain yy, subject to certain extra conditions;
# it might look something like: rowSums(select(dat, contains(("yy&ab")|("yy&bc"))))
desired result:
6 8 10 12 14
EDIT: Fixed, sorry, low on caffeine
If you want to use dplyr, try matches(), which selects columns via a regular expression:
library(dplyr)
dat %>%
  select(matches("yy$")) %>%
  select(matches("^ab|^bc")) %>%
  rowSums()
[1] 6 8 10 12 14
I don't think it's the best way, but you can also do it with grepl():
rowSums(dat[, grepl("ab.*yy|bc.*yy", colnames(dat))])
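Since every wanted column shares the _yy suffix here, both conditions can also be collapsed into one anchored pattern (a sketch that assumes the names really follow the prefix_suffix layout above):
rowSums(dat[, grepl("^(ab|bc)_yy", colnames(dat))])
# [1]  6  8 10 12 14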

Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which, from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- rep(c(2,8,10,11,16,18), each = 3)
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- rep(c("A","B","C"), each = 6)
df <- data.frame(id, count, group)

newid <- c()
newcount <- c()
newgroup <- c()
for (i in 1:length(unique(df$id))) {
  newid[i] <- unique(df$id)[i]
  newcount[i] <- sum(df[df$id == unique(df$id)[i], 2][1:2])
  newgroup[i] <- as.character(df$group[df$id == newid[i]][1])
}
newdf <- data.frame(newid, newcount, newgroup)
Some possible improvements/alternatives I'm not sure about:
for loops vs. apply functions
Can I create a data frame directly inside a for loop, or should I stick to creating vectors that I later assign to a data frame?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
You could do this using data.table:
library(data.table)
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows of each group, then sum the counts.
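The slice step can also be folded into the summary itself, since head(count, 2) takes the first two counts within each group (a sketch):
library(dplyr)
df %>%
  group_by(id, group) %>%
  summarise(newcount = sum(head(count, 2)))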
You can try using a self-defined function in aggregate:
sum1sttwo <- function(x) {
  return(x[1] + x[2])
}
aggregate(count ~ id + group, data = df, FUN = sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large. One of the major disadvantages of base R is that its data frame operations are slow. However, if you just need to aggregate a fairly simple/small data set, aggregate() in base R serves its purpose.
library(plyr)
# Keep the first 2 rows for each id and group
df2 <- ddply(df, c("id","group"), function(x) x$count[1:2])
# Aggregate by id and group
df3 <- ddply(df2, c("id","group"), summarize, count = V1 + V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample in which every ID is represented once and only once.
That means ID 1 will be chosen, then one of the two rows with ID 2, then ID 3, then one of the two rows with ID 4, etc.
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
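In more recent dplyr (>= 1.0), sample_n() has been superseded by slice_sample(), so an equivalent sketch is:
library(dplyr)
df %>% group_by(ID) %>% slice_sample(n = 1)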
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
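If the draw needs to be reproducible, fix the random seed first (a sketch; the seed value itself is arbitrary):
set.seed(123)  # any fixed seed makes the shuffle, and thus the sample, repeatable
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]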

How to aggregate some columns while keeping other columns in R?

I have a data frame like this:
id no age
1 1 7 23
2 1 2 23
3 2 1 25
4 2 4 25
5 3 6 23
6 3 1 23
and I hope to aggregate the data frame by id into a form like this (just sum the no values that share the same id, but keep age as-is):
id no age
1 1 9 23
2 2 5 25
3 3 7 23
How can I achieve this in R?
Assuming that your data frame is named df.
aggregate(no ~ id + age, df, sum)
# id age no
# 1 1 23 9
# 2 3 23 7
# 3 2 25 5
Even better, data.table:
library(data.table)
# convert your object to a data.table (by reference) to unlock data.table syntax
setDT(df)
df[, .(sum_no = sum(no), unq_age = unique(age)), by = id]
Alternatively, you could use ddply from the plyr package:
require(plyr)
ddply(df, .(id, age), summarise, no = sum(no))
In this particular example the results are identical. However, this is not always the case; the difference between the two functions is outlined here. Both functions have their uses and are worth exploring, which is why I felt this alternative should be mentioned.
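For completeness, a dplyr version of the same aggregation (a sketch; grouping by age as well works here because age is constant within each id):
library(dplyr)
df %>%
  group_by(id, age) %>%
  summarise(no = sum(no)) %>%
  ungroup()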
