How to use apply function once for each unique factor value - r

I'm trying out some commands on R's built-in dataset, ChickWeight. The data looks as follows.
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
11 199 20 1 1
12 205 21 1 1
13 40 0 2 1
14 49 2 2 1
15 58 4 2 1
Now what I would like to do is simply output, for each value of the "Chick" column, the difference in chicken weight between time 0 and time 21 (the last time value), i.e. the weight the chick has put on.
I've been trying tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1]). But this of course applies the value to all rows.
How do I make it so that it applies only once for each unique Chick-value?

If we need a single value per each 'factor' column (assuming that 'Chick', and 'Diet' are the factor columns)
library(data.table)
setDT(df1)[, list(Diff= abs(weight[Time==21]-weight[Time==0])) ,.(Chick, Diet)]
And if we need to create a column
setDT(df1)[, Diff:= abs(weight[Time==21]-weight[Time==0]) ,.(Chick, Diet)]
I noticed that in the example Time = 21 is not present for Chick 2. In that case, we may need to fall back to the single value that is available:
setDT(df1)[, {tmp <- Time %in% c(0,21)
list(Diff= if(sum(tmp)>1) abs(diff(weight[tmp])) else weight[tmp]) } ,
by = .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 40
If we are taking the difference of 'weight' based on the max and min 'Time' for each group
setDT(df1)[, list(Diff=weight[which.max(Time)]-
weight[which.min(Time)]), .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 18
Also, if the 'Time' is ordered
setDT(df1)[, list(Diff= abs(diff(weight[c(1L,.N)]))), by =.(Chick, Diet)]
Using by from base R
by(df1[1:2], df1[3:4], FUN= function(x) with(x,
abs(weight[which.max(Time)]-weight[which.min(Time)])))
#Chick: 1
#Diet: 1
#[1] 163
#------------------------------------------------------------
#Chick: 2
#Diet: 1
#[1] 18
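The caveat above about groups that have no Time == 21 row can be checked directly on the full dataset. The following is a minimal base R sketch, assuming the built-in ChickWeight data; the names cw, last_time and incomplete are made up for this illustration.

```r
# ChickWeight ships with R in the 'datasets' package.
cw <- as.data.frame(ChickWeight)

# Latest recorded Time per chick; tapply returns one value per factor level.
last_time <- tapply(cw$Time, cw$Chick, max)

# Chicks whose measurements stop before day 21.
incomplete <- names(last_time)[last_time < 21]
print(incomplete)
```

For these chicks, weight[Time == 21] has length zero, which is why the which.max(Time) / which.min(Time) variants are the more robust choice.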

Here's a solution using dplyr:
library(dplyr)
ChickWeight %>%
group_by(Chick = as.numeric(as.character(Chick))) %>%
summarise(weight_gain = last(weight) - first(weight), final_time = last(Time))
(first and last as suggested by @ulfelder.)
Note that ChickWeight$Chick is an ordered factor so without coercing it into numeric the final order looks odd.
Using base R:
ChickWeight$Chick <- as.numeric(as.character(ChickWeight$Chick))
tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])
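To see that tapply already returns one value per chick rather than one per row, here is a small sketch. It assumes rows are ordered by Time within each chick (the sort below makes that explicit), and the names cw, gain and gain_df are illustrative only.

```r
# Work on a plain data.frame copy of the built-in dataset.
cw <- as.data.frame(ChickWeight)
cw$Chick <- as.integer(as.character(cw$Chick))

# Make the "first row = earliest Time" assumption explicit.
cw <- cw[order(cw$Chick, cw$Time), ]

# One result per unique Chick value: last weight minus first weight.
gain <- tapply(cw$weight, cw$Chick, function(w) w[length(w)] - w[1])

# Optional: turn the named result into a two-column data frame.
gain_df <- data.frame(Chick = as.integer(names(gain)),
                      gain  = as.vector(gain))
head(gain_df, 3)
```

The result has as many elements as there are chicks, not as many as there are rows in ChickWeight.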

Related

Calculate a new column based on the difference between current row and previous row in another variable [duplicate]

This is my df (a data.frame):
group value
1 10
1 20
1 25
2 5
2 10
2 15
I need to calculate difference between values in consecutive rows by group.
So I need this result:
group value diff
1 10 NA # because there is a no previous value
1 20 10 # value[2] - value[1]
1 25 5 # value[3] - value[2]
2 5 NA # because group is changed
2 10 5 # value[5] - value[4]
2 15 5 # value[6] - value[5]
I can handle this problem using ddply, but it takes too much time because I have a lot of groups in my df (over 1,000,000).
Are there any other effective approaches to handle this problem?
The package data.table can do this fairly quickly, using the shift function.
require(data.table)
df <- data.table(group = rep(c(1, 2), each = 3), value = c(10,20,25,5,10,15))
#setDT(df) #if df is already a data frame
df[ , diff := value - shift(value), by = group]
# group value diff
#1: 1 10 NA
#2: 1 20 10
#3: 1 25 5
#4: 2 5 NA
#5: 2 10 5
#6: 2 15 5
setDF(df) #if you want to convert back to old data.frame syntax
Or using the lag function in dplyr
library(dplyr)
df %>%
group_by(group) %>%
mutate(Diff = value - lag(value))
# group value Diff
# <int> <int> <int>
# 1 1 10 NA
# 2 1 20 10
# 3 1 25 5
# 4 2 5 NA
# 5 2 10 5
# 6 2 15 5
For alternatives pre-data.table::shift and pre-dplyr::lag, see edits.
You can use the base function ave() for this
df <- data.frame(group=rep(c(1,2),each=3),value=c(10,20,25,5,10,15))
df$diff <- ave(df$value, factor(df$group), FUN=function(x) c(NA,diff(x)))
which returns
group value diff
1 1 10 NA
2 1 20 10
3 1 25 5
4 2 5 NA
5 2 10 5
6 2 15 5
Try this with tapply:
df$diff<-as.vector(unlist(tapply(df$value,df$group,FUN=function(x){ return (c(NA,diff(x)))})))
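One point worth checking before choosing between the two base R answers: unlist(tapply(...)) concatenates the results group by group, so assigning it back only lines up when the rows are already sorted by group, whereas ave() writes each result back to its original row. A small sketch, assuming a reshuffled copy of the example data (the name df_unsorted is made up here):

```r
# Same values as the example, but with the two groups interleaved.
df_unsorted <- data.frame(group = c(1, 2, 1, 2, 1, 2),
                          value = c(10, 5, 20, 10, 25, 15))

# ave() applies FUN within each group and keeps the original row order.
df_unsorted$diff <- ave(df_unsorted$value, df_unsorted$group,
                        FUN = function(x) c(NA, diff(x)))
df_unsorted
```

If the data cannot be guaranteed to be sorted, sorting by group first (or using ave) avoids misaligned differences.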
Since dplyr 1.1.0, you can shorten the dplyr version with inline temporary grouping with .by:
mutate(df, diff = value - lag(value), .by = group)

Subset specific row and last row from data frame

I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID regardless of whether the score goes above 5 or below -5 during the event( this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
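As a variation on the same idea, both conditions can be combined into one logical index so that no duplicates arise and the original row order is kept. This is only a sketch: it rebuilds the question's data by hand, and the names is_last, is_extreme and result are illustrative.

```r
# The example data from the question.
Data <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time  = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

# TRUE for the last row of each ID (scanning duplicates from the end).
is_last <- !duplicated(Data$ID, fromLast = TRUE)

# TRUE where the score is above 5 or below -5.
is_extreme <- abs(Data$Score) > 5

result <- Data[is_last | is_extreme, ]
result
```

Unlike the rle approach, duplicated(..., fromLast = TRUE) does not require the rows of one ID to be contiguous.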
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

Perform operations on a data frame based on factors

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
I inserted the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
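Note that indexing total by the factor df$run uses the factor's integer codes as row positions, so it works when the rows of total are in the same order as the levels of run. A sketch of a lookup that matches on the values instead, using match() (the name idx is made up for this example):

```r
# Example data from the question.
df <- data.frame(run   = factor(c(1, 1, 1, 2, 2, 2)),
                 group = factor(rep(c("a", "b", "c"), 2)),
                 sum   = c(1, 8, 34, 2, 7, 33))
total <- data.frame(run = factor(c(1, 2)), total = c(45, 47))

# Position in 'total' of each row's run value, compared as text.
idx <- match(as.character(df$run), as.character(total$run))

df$percent <- df$sum / total$total[idx]
df
```

This gives the same result here and keeps working if total is stored in a different row order.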
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (this needs dplyr, plus magrittr for the %<>% assignment pipe):
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Delete following observations when goal has been reached

Given the dataframe:
df = data.frame(
ID = c(1,1,1,1,2,3,3),
Start = c(0,8,150,200,6,7,60),
Stop = c(5,60,170,210,NA,45,80))
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
3 1 150 170 1
4 1 200 210 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
For each ID, I would like to keep all rows until Start[i+1] - Stop[i] >= 28, and then delete the following observations of that ID
In this example, the output should be
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
I ended up having to set the NAs to a value that is easy to identify later, and used the following code:
df$Stop[is.na(df$Stop)] = 10000
df$diff <- df$Start-c(0,df$Stop[1:(length(df$Stop)-1)])
space <- with(df, unique(ID[diff<28]))
df2 <- subset(df, (ID %in% space & diff < 28) | !ID %in% space)
Using data.table...
library(data.table)
setDT(df)
df[,{
w = which( shift(Start,type="lead") - Stop >= 28 )
if (length(w)) .SD[seq(w[1])] else .SD
}, by=ID]
# ID Start Stop
# 1: 1 0 5
# 2: 1 8 60
# 3: 2 6 NA
# 4: 3 7 45
# 5: 3 60 80
.SD is the Subset of Data associated with each by=ID group.
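The same per-ID logic can be sketched in base R with split() and lapply(). This is an illustration under the question's rule (keep rows up to and including the first one whose following Start is at least 28 after its Stop); the helper name keep_until_gap and its gap argument are hypothetical.

```r
df <- data.frame(ID    = c(1, 1, 1, 1, 2, 3, 3),
                 Start = c(0, 8, 150, 200, 6, 7, 60),
                 Stop  = c(5, 60, 170, 210, NA, 45, 80))

# Hypothetical helper: truncate one ID's rows at the first large gap.
keep_until_gap <- function(d, gap = 28) {
  # Gap between each row's Stop and the next row's Start.
  g <- d$Start[-1] - d$Stop[-nrow(d)]
  w <- which(g >= gap)  # which() silently drops NA comparisons
  if (length(w)) d[seq_len(w[1]), , drop = FALSE] else d
}

result <- do.call(rbind, lapply(split(df, df$ID), keep_until_gap))
rownames(result) <- NULL
result
```

Because which() ignores NA, the ID with a missing Stop is kept without any placeholder value.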
Create a diff column.
df$diff<-df$Start-c(0,df$Stop[1:(length(df$Stop)-1)])
Subset on the basis of this column
df[df$diff<28,]
PS: I have converted 'NA' to 0. You would have to handle that anyway.
p <- which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28)
df <- df[p,]
This assumes you want to keep entries where the next entry's Start is higher than the given entry's Stop by 28 or more.
The result is:
> p
[1] 2 3
> df[p,]
ID Start Stop
2 1 8 60
3 1 150 170
Start in row 3 (i + 1 = 3) is higher than Stop in row 2 (i = 2) by 90.
Or, if by until you mean the reverse condition, then
df <- df[which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] < 28),]
Inclusion of NA in your data frame got me thinking. You have to be very careful how you word your condition. If you want to keep all the cases where difference between next start and stop is less than 28, then the above statement will do.
However, if you want to keep all cases EXCEPT those where the difference is 28 or more, then you should use
p <- which((df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28))
rp <- which((!is.element(1:nrow(df),p)))
df <- df[rp,]
This also keeps the rows whose difference is unknown (NA).

transform data frame to another form

I'm new to R, and recently I got stuck on how to perform an operation on a data.frame.
Now I have a data.frame called frame. And I want to transform it to another form.
> frame
A B Freq total
1 0 0 75 110
2 1 0 21 110
3 0 1 8 110
4 1 1 6 110
the expected form is:
> frame(B=1)
A Freq total
1 0 8 83
2 1 6 27
Can anyone give some suggestions? Thanks
One option would be using dplyr. We group by 'A', and create a new column 'total' as the sum of "Freq", filter the rows where 'B' = 1, and select all other columns except 'B'
library(dplyr)
frame %>%
group_by(A) %>%
mutate(total= sum(Freq)) %>%
filter(B==1)%>%
select(-B)
# A Freq total
#1 0 8 83
#2 1 6 27
Or using data.table, we convert the data.frame to data.table (setDT(frame) or we can do as.data.table(frame)), create a new column total as the sum of 'Freq', grouped by 'A', subset the rows with B=1, and remove the 'B' column by assigning it to NULL.
library(data.table)
setDT(frame)[, total:= sum(Freq), A][B==1][,B:=NULL]
# A Freq total
#1: 0 8 83
#2: 1 6 27
Or using base R, we create the 'total' using transform/ave and then subset the rows that are 1 for 'B'.
subset(transform(frame, total=ave(Freq, A, FUN=sum)), B==1, select=-B)
# A Freq total
#3 0 8 83
#4 1 6 27
Below is an example using functions in the base package - aggregate() and merge().
frame <- read.table(header = TRUE, text = "
A B Freq total
1 0 0 75 110
2 1 0 21 110
3 0 1 8 110
4 1 1 6 110")
# obtain sum by column A
frame1 <- aggregate(frame$Freq, by = list(frame$A), sum)
names(frame1) <- c("A", "total")
# merge Freq
frame2 <- merge(frame1, frame[frame$B == 1, c(1,3)], by="A")
# A total Freq
#1 0 83 8
#2 1 27 6
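For comparison, the same result can be sketched without merge(), using the named vector returned by tapply as a lookup. The data frame is rebuilt by hand from the question, and the names totals and out are illustrative.

```r
frame <- data.frame(A = c(0, 1, 0, 1),
                    B = c(0, 0, 1, 1),
                    Freq = c(75, 21, 8, 6),
                    total = 110)

# Sum of Freq for each value of A, named by that value ("0", "1").
totals <- tapply(frame$Freq, frame$A, sum)

# Keep the B == 1 rows and look up each row's per-A total by name.
out <- frame[frame$B == 1, c("A", "Freq")]
out$total <- as.vector(totals[as.character(out$A)])
out
```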
