Operate over levels of two factors in R

I have a dataset that looks something like this, with many classes, each with many (5-10) subclasses, each with a value associated with it:
> data.frame(class=rep(letters[1:4], each=4), subclass=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8), value=1:16)
class subclass value
1 a 1 1
2 a 1 2
3 a 2 3
4 a 2 4
5 b 3 5
6 b 3 6
7 b 4 7
8 b 4 8
9 c 5 9
10 c 5 10
11 c 6 11
12 c 6 12
13 d 7 13
14 d 7 14
15 d 8 15
16 d 8 16
I want to first sum the values for each class/subclass, then take the median value for each class among all the subclasses.
I.e., the intermediate step would sum the values for each subclass for each class, and would look like this (note that I don't need to keep the data from this intermediate step):
> data.frame(class=rep(letters[1:4], each=2), subclass=1:8, sum=c(3,7,11,15,19,23,27,31))
class subclass sum
1 a 1 3
2 a 2 7
3 b 3 11
4 b 4 15
5 c 5 19
6 c 6 23
7 d 7 27
8 d 8 31
The second step would take the median for each class among all the subclasses, and would look like this:
> data.frame(class=letters[1:4], median=c(median(c(3,7)), median(c(11,15)), median(c(19,23)), median(c(27,31))))
class median
1 a 5
2 b 13
3 c 21
4 d 29
This is the only data I need to keep. Note that both $class and $subclass will be factor variables, and value will always be a non-missing positive integer. Each class will have a varying number of subclasses.
I'm sure I can do this with some nasty for loops, but I was hoping for a better way that's vectorized and easier to maintain.

Here is an example using aggregate (assuming the question's data is stored in a data frame named df):
temp <- aggregate(df$value, list(class = df$class, subclass = df$subclass), sum)
aggregate(temp$x, list(class = temp$class), median)
Output:
class x
1 a 5
2 b 13
3 c 21
4 d 29
Or if you like a one-liner solution, you can do:
aggregate(value ~ class, median, data=aggregate(value ~ ., sum, data=df))

You could try for your first step:
df_sums <- aggregate(value ~ class + subclass, sum, data=df)
Then:
aggregate(value ~ class, median, data=df_sums)

Here are two other alternatives.
The first uses ave inside a within call, where we progressively reduce the source data.frame after adding in our aggregated data. Since this results in many repeated rows, we can safely use unique as the last step to get the output you want.
unique(within(mydf, {
  Sum <- ave(value, class, subclass, FUN = sum)
  rm(subclass, value)
  Median <- ave(Sum, class, FUN = median)
  rm(Sum)
}))
# class Median
# 1 a 5
# 5 b 13
# 9 c 21
# 13 d 29
A second option is to use the "data.table" package and "compound" your statements as below. V1 is the name that will be automatically created by data.table if a name is not specified by the user.
library(data.table)
DT <- data.table(mydf)
DT[, sum(value), by = c("class", "subclass")][, median(V1), by = "class"]
# class V1
# 1: a 5
# 2: b 13
# 3: c 21
# 4: d 29
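For comparison, a dplyr sketch of the same two-step aggregation (assuming the data frame is called mydf, as in the answer above, and dplyr >= 1.0):
library(dplyr)
mydf %>%
  group_by(class, subclass) %>%
  summarise(sum = sum(value), .groups = "drop") %>%   # sum per class/subclass
  group_by(class) %>%
  summarise(median = median(sum))                     # median per class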

Related

Keep the row if the specific column is the minimum value of that row

I cannot share the dataset but I will explain it as best as I can.
The dataset has 50 columns, 48 of which are in Y/m/d h:m:s format. The data also has many NAs, but they must not be removed.
Let's say there is a column B. I want to remove the rows where the value of B is not the earliest (i.e., the minimum) in that row.
How can I do this in R? For example, the original would be like this:
df <- data.frame(
  A = c(11, 19, 17, 6, 13),
  B = c(18, 9, 5, 16, 12),
  C = c(14, 15, 8, 87, 16))
A B C
1 11 18 14
2 19 9 15
3 17 5 8
4 6 16 87
5 13 12 16
but I want this:
A B C
1 19 9 15
2 17 5 8
3 13 12 16
You could use apply() to find the minimum for each row.
df |> subset(B == apply(df, 1, min, na.rm = TRUE))
# A B C
# 2 19 9 15
# 3 17 5 8
# 5 13 12 16
The tidyverse equivalent is
library(tidyverse)
df %>% filter(B == pmap(across(A:C), min, na.rm = TRUE))
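If the list returned by pmap() is a concern, a rowwise()/c_across() sketch (dplyr >= 1.0) should give the same result:
df %>%
  rowwise() %>%
  filter(B == min(c_across(A:C), na.rm = TRUE)) %>%
  ungroup()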
If you are willing to use data.table, you could do the following for the example.
library(data.table)
setDT(df)
df[(B < A & B < C)]
A B C
1: 19 9 15
2: 17 5 8
3: 13 12 16
More generally, you could do
df <- as.data.table(df)
df[, min := do.call(pmin, .SD)][B == min, !"min"]
Using .SDcols in the first [ would let you control which columns you take the minimum over, if you wanted to, e.g., exclude some. I am not deeply familiar with the inner workings of data.table, but I believe creating this new column is reasonably efficient RAM-wise.
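For example, a sketch restricting the minimum to a hypothetical subset of columns (here A and B) could look like:
cols <- c("A", "B")   # columns to take the row-wise minimum over
df[, min := do.call(pmin, .SD), .SDcols = cols][B == min, !"min"]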

(Using a custom function to) Sum above N rows in a datatable (dataframe) by groups

I need a function that, for each row, sums that row together with the N rows above it (a window of N+1 rows) in a data frame (data table), by group.
An equivalent function for a vector would be something like the one below (please forgive me if it is inefficient):
Function1 <- function(x, N) {
  y <- vector(length = length(x))
  for (i in 1:length(x)) {
    if (i <= N)
      y[i] <- sum(x[1:i])
    else if (i > N)
      y[i] <- sum(x[(i-N):i])
  }
  return(y)
}
Function1(c(1,2,3,4,5,6),3)
#[1] 1 3 6 10 14 18 # each element sums the current value and up to 3 previous ones (a window of 4 rows)
I wanted to use this function with sapply, like below:
sapply(X=DF<-data.frame(A=c(1:10), B=2), FUN=Function1(N=3))
but couldn't, because I could not figure out how to set a default for the x in my function. So I built another function for data frames.
Function2 <- function(x, N) {
  if (is.data.frame(x)) {
    y <- data.frame()
    for (j in 1:ncol(x)) {
      for (i in 1:nrow(x)) {
        if (i <= N) {
          y[i, j] <- sum(x[1:i, j])
        } else if (i > N) {
          y[i, j] <- sum(x[(i-N):i, j])
        }
      }
    }
    return(y)
  }
}
DF<-data.frame(A=c(1:10), B=2)
Function2(DF, 2)
#     V1 V2
# 1    1  2
# 2    3  4
# 3    6  6
# 4    9  6
# 5   12  6
# 6   15  6
# 7   18  6
# 8   21  6
# 9   24  6
# 10  27  6
However, I still need to perform this by group. For example, take the following data frame with a character column:
DF <- data.frame(Name=rep(c("A","B"), each=5), A=c(1:10), B=2)
I would like to apply my function by the group "Name", which would result in:
A 1 2
A 3 4
A 6 6
A 9 6
A 12 6
B 6 2
B 13 4
B 21 6
B 24 6
B 27 6
# Perform Function2 separately for groups A and B.
I was hoping to use the function with the data.table package (grouping with by=), but couldn't figure out how.
What would be the best way to do this?
(Also, it would be really nice if I could learn how to make Function1 work with sapply.)
With data.table, we group by 'Name', loop through the columns of interest specified in .SDcols (here all the columns are of interest, so we do not specify it), and apply Function1:
library(data.table)
setDT(DF)[, lapply(.SD, Function1, 2), Name]
# Name A B
# 1: A 1 2
# 2: A 3 4
# 3: A 6 6
# 4: A 9 6
# 5: A 12 6
# 6: B 6 2
# 7: B 13 4
# 8: B 21 6
# 9: B 24 6
#10: B 27 6
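As a side note on the sapply part of the question: sapply passes each column as the first argument, so any further arguments can simply be named after the function; a sketch, assuming an all-numeric data frame:
DF <- data.frame(A = 1:10, B = 2)
sapply(DF, Function1, N = 3)   # extra arguments are forwarded to Function1; returns a matrix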

ratios according to two variables, function aggregate in R?

I've been playing with some data in order to obtain the ratios between the two levels of one variable, taking into account two other variables. I've been using the function aggregate(), which is very useful for calculating means and sums. However, I'm stuck when I want to calculate ratios (divisions).
Here you find a dataframe very similar to my data:
w<-c("A","B","C","D","E","F","A","B","C","D","E","F")
x<-c(1,1,1,1,1,1,2,2,2,2,2,2)
y<-c(3,4,5,6,8,10,3,4,5,7,9,10)
z<-runif(12)
df<-data.frame(w,x,y,z)
df
w x y z
1 A 1 3 0.93767621
2 B 1 4 0.09169992
3 C 1 5 0.49012926
4 D 1 6 0.90886690
5 E 1 8 0.37058120
6 F 1 10 0.83558267
7 A 2 3 0.42670001
8 B 2 4 0.05656252
9 C 2 5 0.70694423
10 D 2 7 0.13634309
11 E 2 9 0.92065671
12 F 2 10 0.56276176
What I want is to obtain the ratio of z between the two levels of x, taking into account the variables w and y. So for level "A" of variable "w" and level "3" of variable "y", the ratio should be:
df$z[1]/df$z[7]
With the aggregate function it should be something like this:
final<-aggregate(z~y:w, data=df)
However, I know I'm missing something, because some classes of the variable y do not appear in both categories of x (e.g. 7, 8 and 9).
Any help will be welcome!
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)), and then, grouped by 'w' and 'y', if the number of rows (.N) is 2 we divide the first value by the second, otherwise we return 'z' as is. We assign (:=) the output to a new column 'z1'.
library(data.table)
setDT(df)[, z1 := if (.N == 2) z[1]/z[2] else z, by = .(w, y)]
df
# w x y z z1
# 1: A 1 3 0.93767621 2.1975069
# 2: B 1 4 0.09169992 1.6212135
# 3: C 1 5 0.49012926 0.6933068
# 4: D 1 6 0.90886690 0.9088669
# 5: E 1 8 0.37058120 0.3705812
# 6: F 1 10 0.83558267 1.4847894
# 7: A 2 3 0.42670001 2.1975069
# 8: B 2 4 0.05656252 1.6212135
# 9: C 2 5 0.70694423 0.6933068
#10: D 2 7 0.13634309 0.1363431
#11: E 2 9 0.92065671 0.9206567
#12: F 2 10 0.56276176 1.4847894
If we just want the summary output we don't need to use :=
setDT(df)[, list(z=if(.N==2) z[1]/z[2] else z) , by = .(w,y)]
Or using aggregate
aggregate(z ~ w + y, df, FUN = function(x)
  if (length(x) == 2) x[1]/x[2] else x)
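If the division should explicitly be the x == 1 value over the x == 2 value, rather than relying on row order within each group, a data.table sketch along these lines could be used; note that it silently drops any w/y combination that lacks either level:
library(data.table)
setDT(df)[, .(ratio = z[x == 1] / z[x == 2]), by = .(w, y)]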

Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which, from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- c(rep(c(2, 8, 10, 11, 16, 18), each = 3))
count <- c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6)
group <- c(rep(c("A", "B", "C"), each = 6))
df <- data.frame(id, count, group)

newid <- c()
newcount <- c()
newgroup <- c()
for (i in 1:length(unique(df$"id"))) {
  newid[i] <- unique(df$"id")[i]
  newcount[i] <- sum(df[df$"id" == unique(df$"id")[i], 2][1:2])
  newgroup[i] <- as.character(df$"group"[df$"id" == newid[i]][1])
}
newdf <- data.frame(newid, newcount, newgroup)
Some possible improvements/alternatives I'm not sure about:
- for loops vs apply functions
- whether I can create a dataframe directly inside a for loop, or should stick to creating vectors I can later assign to a dataframe
- more consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
You could do this using data.table
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows for each group, then sum the counts
You can try using a self-defined function in aggregate:
sum1sttwo <- function(x) {
  return(x[1] + x[2])
}
aggregate(count ~ id + group, data = df, sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large. One of the main disadvantages of base R is that data.frame operations can be slow. However, if you just need to aggregate a fairly simple/small data set, the aggregate function in base R serves its purpose.
library(plyr)
# Keep the first 2 rows for each group and id
df2 <- ddply(df, c("id", "group"), function(x) x$count[1:2])
# Aggregate by group and id
df3 <- ddply(df2, c("id", "group"), summarize, count = V1 + V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
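Regarding the "for loops vs apply functions" point in the question, a loop-free base R sketch using by() might look like this (assuming the df defined in the question):
newdf <- do.call(rbind, by(df, df$id, function(d)
  data.frame(id = d$id[1], group = d$group[1], newcount = sum(d$count[1:2]))))
newdf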

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col==something] <- adataframe[adataframe$col==something] + 1
This is rather long and tedious to type. Is there some way for me to reference the object I am trying to change, such as
adataframe[adataframe$col==something] <- $self + 1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part, col1==something, is the subset. You can put anything here and use the column names as if they were variables; i.e., there is no need to use $. The second part, col2:=col3+1, then assigns the RHS to the LHS within that subset, where the column names can be assigned to as if they were variables. := is assignment by reference. No copies of any object are taken, so it is faster than <-, =, within and transform.
Also, one end goal of j's syntax allowing := in j like that is combining it with by; at the time of writing this was due in v1.8.1 (see the question: when should I use the := operator in data.table?).
UPDATE: that was indeed released (:= by group) in July 2012.
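Applied to the example in the question (assuming, as the original code suggests, that col is both the column being tested and the one being incremented, and that something is a plain value), that would look roughly like:
library(data.table)
setDT(adataframe)                              # convert in place to a data.table
adataframe[col == something, col := col + 1]   # increment col by reference in the matching rows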
You should be paying more attention to Gabor Grothendieck (and not just in this instance). The inc function cited on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc = 1) {
  eval.parent(substitute(x <- x + inc))
}
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took on the challenge of adding yet another function argument. Supplied with a portion of a dataframe as its single argument, it will still increment that range of values by one. Up to this point it has only been very lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function that accepts only two arguments:
transfn <- function(x, func = "+", inc = 1) {
  eval.parent(substitute(x <- do.call(func, list(x, inc))))
}
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 = 1:10, a2 = 21:30, b = 1:2)
inc <- function(x) {
  eval.parent(substitute(x <- x + 1))
}
#---- examples===============>
> inc(df$a1) # works on whole columns
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 6 25 1
6 7 26 2
7 8 27 1
8 9 28 2
9 10 29 1
10 11 30 2
> inc(df$a1[df$a1>5]) # testing on a restricted range of one column
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 7 25 1
6 8 26 2
7 9 27 1
8 10 28 2
9 11 29 1
10 12 30 2
> inc(df[ df$a1>5, ]) #testing on a range of rows for all columns being transformed
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
# and even in selected rows and grepped names of columns meeting a criterion
> inc(df[ df$a1 <= 3, grep("a", names(df)) ])
> df
a1 a2 b
1 3 22 1
2 4 23 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
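transfn itself is not exercised above; a hypothetical call, assuming the definition given earlier, might be:
> transfn(df$a1, "*", 2)   # multiplies df$a1 in place, again via eval.parent()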
Here is what you can do. Let us say you have a dataframe:
df = data.frame(x = 1:10, y = rnorm(10))
and you want to increment all the y values by 1. You can do this easily using transform:
df = transform(df, y = y + 1)
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col==something
adataframe[ridx,] <- adataframe[ridx,] + 1
which doesn't rely on any fancy/fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. It also tends to break lines into nicely human-parseable units, and there is something appealing about using standard idioms -- R's vocabulary and idiosyncrasies are already large enough for my taste.
