Loop or apply for sum of rows based on multiple conditions in an R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite clumsy. Moreover, it uses for loops, which, from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- rep(c(2, 8, 10, 11, 16, 18), each = 3)
count <- c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6)
group <- rep(c("A", "B", "C"), each = 6)
df <- data.frame(id, count, group)

newid <- c()
newcount <- c()
newgroup <- c()
for (i in seq_along(unique(df$id))) {
  newid[i] <- unique(df$id)[i]
  newcount[i] <- sum(df[df$id == unique(df$id)[i], 2][1:2])
  newgroup[i] <- as.character(df$group[df$id == newid[i]][1])
}
newdf <- data.frame(newid, newcount, newgroup)
Some possible improvements/alternatives I'm not sure about:
For loops vs apply functions (see the sketch just below)
Can I create a dataframe directly inside a for loop, or should I stick to creating vectors I can later assign to a dataframe?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
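On the first point, here is a base R apply-family sketch of the same computation, using by() to split df by id and bind the per-id summaries back together (the helper name firsttwo is just illustrative):
firsttwo <- function(d) {
  data.frame(newid = d$id[1],
             newcount = sum(d$count[1:2]),
             newgroup = d$group[1])
}
newdf <- do.call(rbind, by(df, df$id, firsttwo))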

You could do this using data.table:
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
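One caveat, assuming your real data might be less regular than the example (not part of the original answer): if an id had fewer than two rows, count[1:2] would pad with NA and the sum would be NA; head(count, 2) sidesteps that:
setDT(df)[, .(newcount = sum(head(count, 2))), by = .(id, group)]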

You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows of each group, then sum the counts.

You can use a self-defined function with aggregate:
sum1sttwo <- function(x) {
  x[1] + x[2]
}
aggregate(count ~ id + group, data = df, FUN = sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large; one of the main drawbacks of base R is that data.frame operations are slow. However, if you just need to aggregate a fairly simple/small data set, the aggregate function in base R serves its purpose.
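Incidentally, the helper can also be written inline, and using head() makes it robust to ids with a single row (a sketch along the same lines as the answer above):
aggregate(count ~ id + group, data = df, FUN = function(x) sum(head(x, 2)))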

library(plyr)
# Keep the first 2 rows for each id/group combination
df2 <- ddply(df, c("id", "group"), function(x) x$count[1:2])
# Aggregate by id and group
df3 <- ddply(df2, c("id", "group"), summarize, count = V1 + V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
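The two ddply calls can also be collapsed into one (a sketch, same logic as above):
ddply(df, c("id", "group"), summarize, count = sum(count[1:2]))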

Related

Sort data.frame or data.table using a vector of column names

I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A = 5:1, B = 11:15, C = c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference, which is very fast and memory efficient. Both functions work on data.tables as well as on data.frames.
If you want to combine ascending and descending ordering, you can use the order parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)
Using dplyr, you can use arrange_at, which accepts string column names:
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or with newer dplyr versions (1.0.0 and later), where the _at verbs are superseded by across():
DF %>% arrange(across(sortby))
In base R, we can use
DF[do.call(order, DF[sortby]), ]
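If you also need mixed directions in base R: order() accepts a vector for its decreasing argument when method = "radix" is used (available since R 3.3.0). A sketch, ascending by C and descending by A:
ord <- do.call(order, c(unname(as.list(DF[sortby])),
                        list(decreasing = c(FALSE, TRUE), method = "radix")))
DF[ord, ]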
Also possible with dplyr, although get() only resolves a single column name; for a vector of names, splice symbols instead:
DF %>%
  arrange(!!!syms(sortby))
But Ronak's answer is more elegant.

R: Order one factor level first so it becomes the first column when reshaping long to wide (using spread)

I have a problem after changing my dataset from long to wide (using spread from the tidyr library on the Result_Type column). I have the following example df:
Group <- c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D")
Result_Type <- c("Final.Result", "Verification", "Test", "Verification", "Final.Result", "Fast",
                 "Verification", "Fast", "Final.Result", "Test", "Final.Result")
Result <- c(7, 1, 8, 7, NA, 9, 10, 12, 17, 50, 11)
df <- data.frame(Group, Result_Type, Result)
df
Group Result_Type Result
1 A Final.Result 7
2 A Verification 1
3 A Test 8
4 B Verification 7
5 B Final.Result NA
6 B Fast 9
7 C Verification 10
8 C Fast 12
9 C Final.Result 17
10 D Test 50
11 D Final.Result 11
In the column Result_Type there are many possible result types, and some datasets contain Result_Types that do not occur in other datasets. However, one level, Final.Result, does occur in every dataset.
Also: This is example data but the actual data has many different columns, and as these differ across the datasets I use, I used spread (from the tidyr library) so I don't have to give any specific column names other than my target columns.
library("tidyr")
df_spread <- spread(df, key = Result_Type, value = Result)
Group Fast Final.Result Test Verification
1 A <NA> 7 8 1
2 B 9 NA <NA> 7
3 C 12 17 <NA> 10
4 D <NA> 11 50 <NA>
What I would like is that once I convert the dataset from long to wide, Final.Result is the first column; how the rest of the columns are arranged doesn't matter. So I would like it to look like this (without naming any of the other spread columns or using index numbers):
Group Final.Result Fast Test Verification
1 A 7 <NA> 8 1
2 B NA 9 <NA> 7
3 C 17 12 <NA> 10
4 D 11 <NA> 50 <NA>
I saw some answers indicating that you can reverse the order of the spread columns, or turn off spread's ordering, but neither guarantees that Final.Result is always the first of the spread columns.
I hope I am making myself clear; it's a little complicated to explain. If someone needs extra info I will be happy to explain more!
spread creates columns in the order of the key column's factor levels. Within the tidyverse, forcats::fct_relevel is a convenience function for rearranging factor levels. The default is that the level(s) you specify will be moved to the front.
library(dplyr)
library(tidyr)
...
levels(df$Result_Type)
#> [1] "Fast" "Final.Result" "Test" "Verification"
Calling fct_relevel will put "Final.Result" as the first level, keeping the rest of the levels in their previous order.
reordered <- df %>%
mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result"))
levels(reordered$Result_Type)
#> [1] "Final.Result" "Fast" "Test" "Verification"
Adding that into your pipeline puts Final.Result as the first column after spreading.
df %>%
mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result")) %>%
spread(key = Result_Type, value = Result)
#> Group Final.Result Fast Test Verification
#> 1 A 7 <NA> 8 1
#> 2 B NA 9 <NA> 7
#> 3 C 17 12 <NA> 10
#> 4 D 11 <NA> 50 <NA>
One option is to refactor Result_Type to put Final.Result as the first level:
u <- as.character(unique(df$Result_Type))
df$Result_Type <- factor(df$Result_Type, levels = c("Final.Result", u[u != "Final.Result"]))
spread(df, key = Result_Type, value = Result)
Group Final.Result Verification Test Fast
1 A 7 1 8 NA
2 B NA 7 NA 9
3 C 17 10 NA 12
4 D 11 NA 50 NA
If you'd like you can use this opportunity to also sort the rest of the columns whichever way you want.
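For example, a sketch that puts Final.Result first and sorts the remaining levels alphabetically:
lv <- as.character(unique(df$Result_Type))
df$Result_Type <- factor(df$Result_Type,
                         levels = c("Final.Result", sort(setdiff(lv, "Final.Result"))))
spread(df, key = Result_Type, value = Result)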

R: Transposing from long to wide and aggregating rows with matching ID

This is something I've been working around for a while by making separate data frames and doing a full_join, but I think there's an easier way.
Overall, I'm wanting to calculate the differences between an individual ID's value from time 1 to time 2 by type from a long form data frame. This is one of the ways I think I could do it but if other people have other techniques or ideas I'd like to hear them too.
However, I'd also like to know how to address this transposing issue anyway because I'm curious.
Here's my issue.
I have a data frame in long form with 5 different measures for two different time periods. I want to convert this data frame from long form into a wide form so that instead of having a DF look like this (note, not all types are included -- just did 2 for sake of length):
(example df1)
ID Time Value Type
1 1 7 Type1
1 2 8 Type1
2 1 9 Type1
2 2 10 Type1
1 1 13 Type2
1 2 15 Type2
2 1 17 Type2
2 2 19 Type2
I want it to look more like this:
(example df 2)
ID Type1.1 Type1.2 Type2.1 Type2.2
1 7 8 13 15
2 9 10 17 19
I use:
library(dplyr)
library(tidyr)
df.new <- df %>%
  spread(Type, Value)
and get this from example df 1 which is on the right track:
(example df 3)
ID Time Type1 Type2
1 1 7 13
1 2 8 15
2 1 9 17
2 2 10 19
But now I want to spread the time for each type. When I do something like this on example df3:
newer.df <- df.new %>%
  spread(Time, Type1)
to make this:
ID Type1.1 Type1.2
1 7 NA
1 NA 8
2 9 NA
2 NA 10
So it's producing an NA for each row. Is there a way I can collapse rows onto each other by ID? I think I'm missing something.
Remember, in my example code I'm only using 2 types but in reality I have 5 types -- just wanted to give simplified code.
We can use dcast() from the reshape2 package.
library(reshape2)
dcast(df, ID ~ Type + Time, value.var = "Value")
# ID Type1_1 Type1_2 Type2_1 Type2_2
#1 1 7 8 13 15
#2 2 9 10 17 19
Or, using the tidyr package itself, we could do this:
library(tidyr)
df$Type <- paste(df$Type, df$Time, sep = "_")
df$Time <- NULL
spread(df, key = Type, value = Value)
ID Type1_1 Type1_2 Type2_1 Type2_2
1 7 8 13 15
2 9 10 17 19
Nulling the Time column did the trick: it seems that spread treats all otherwise-unused columns as identifiers (what dcast would call id.vars). There might be a more elegant solution using tidyr, though.
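For what it's worth, tidyr 1.0.0 introduced pivot_wider, which handles multiple key columns directly; a sketch replacing the paste/spread workaround above:
library(tidyr)
# names_from can take several columns; they are glued with names_sep ("_" by default)
pivot_wider(df, names_from = c(Type, Time), values_from = Value)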

Add row value to previous row value in R

I have a very basic question, as I am relatively new to R. How do I add each value in a column to the previous value, repeated down a column of thousands of values, outputting the sums as a new column? Note that I do not want a cumulative sum, so cumsum is of no use. Say my column is called WD: I want to add WD1 to WD2, WD2 to WD3, WD3 to WD4, and so on all the way down. Is there an easy way? Many thanks.
A reproducible example:
set.seed(111)
df1 <- data.frame(WD=sample(10))
#result
df1
WD new
1 6 6
2 7 13
3 3 10
4 4 7
5 8 12
6 10 18
7 1 11
8 2 3
9 9 11
10 5 14
We add each value to the one before it by summing WD[-1] (all but the first value) with WD[-nrow(df1)] (all but the last value), then prepend the first element to complete the column:
df1$new <- with(df1, c(WD[1], WD[-1] + WD[-nrow(df1)]))
Another option, using lag() from dplyr:
library(dplyr)
mutate(df1, new = WD + lag(WD, default = 0))
Or using shift() from data.table:
library(data.table)
setDT(df1)[, new := WD + shift(WD, fill = 0)]
Note: The default type of shift() is "lag". The other possible value is "lead".
Which gives:
# WD new
#1 6 6
#2 7 13
#3 3 10
#4 4 7
#5 8 12
#6 10 18
#7 1 11
#8 2 3
#9 9 11
#10 5 14
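Equivalently in base R, you can build the lagged vector yourself by padding with a zero (a sketch):
# c(0, head(df1$WD, -1)) is WD shifted down one position, with 0 in front
df1$new <- df1$WD + c(0, head(df1$WD, -1))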

R sort summarise ddply by group sum

I have a data.frame like this:
x <- data.frame(Category=factor(c("One", "One", "Four", "Two","Two",
"Three", "Two", "Four","Three")),
City=factor(c("D","A","B","B","A","D","A","C","C")),
Frequency=c(10,1,5,2,14,8,20,3,5))
Category City Frequency
1 One D 10
2 One A 1
3 Four B 5
4 Two B 2
5 Two A 14
6 Three D 8
7 Two A 20
8 Four C 3
9 Three C 5
I want to make a pivot table with sum(Frequency) and used the ddply function like this:
ddply(x,.(Category,City),summarize,Total=sum(Frequency))
Category City Total
1 Four B 5
2 Four C 3
3 One A 1
4 One D 10
5 Three C 5
6 Three D 8
7 Two A 34
8 Two B 2
But I need these results sorted by the total in each Category group. Something like this:
Category City Frequency
1 Two A 34
2 Two B 2
3 Three D 8
4 Three C 5
5 One D 10
6 One A 1
7 Four B 5
8 Four C 3
I have looked and tried sort, order, arrange, but nothing seems to do what I need. How can I do this in R?
Here is a base R version, where DF is the result of your ddply call:
with(DF, DF[order(-ave(Total, Category, FUN=sum), Category, -Total), ])
produces:
Category City Total
7 Two A 34
8 Two B 2
6 Three D 8
5 Three C 5
4 One D 10
3 One A 1
1 Four B 5
2 Four C 3
The logic is basically the same as in the data.table answer below: calculate the sum of Total for each Category, attach that number to every row in the Category (we do this with ave(..., FUN = sum)), and then sort by it, plus some tie breakers to make sure everything comes out as expected.
This is a nice question, and I can't think of a more direct way than creating a total-size index and then sorting by it. Here's a possible data.table approach using the setorder function, which orders your data by reference:
library(data.table)
Res <- setDT(x)[, .(Total = sum(Frequency)), by = .(Category, City)]
setorder(Res[, size := sum(Total), by = Category], -size, -Total, Category)[]
# Category City Total size
# 1: Two A 34 36
# 2: Two B 2 36
# 3: Three D 8 13
# 4: Three C 5 13
# 5: One D 10 11
# 6: One A 1 11
# 7: Four B 5 8
# 8: Four C 3 8
Or, if you're deep in the Hadleyverse, you can reach a similar result using the newer dplyr package (as suggested by @akrun):
library(dplyr)
x %>%
  group_by(Category, City) %>%
  summarise(Total = sum(Frequency)) %>%
  mutate(size = sum(Total)) %>%
  ungroup() %>%
  arrange(-size, -Total, Category)
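If you don't want the helper column in the result, append a select step (a small addition, not part of the original answer):
x %>%
  group_by(Category, City) %>%
  summarise(Total = sum(Frequency)) %>%
  mutate(size = sum(Total)) %>%
  ungroup() %>%
  arrange(-size, -Total, Category) %>%
  select(-size)   # drop the helper column to match the desired output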
