I have a data-frame like this:
ID Time Testscore
20 2 300
20 1 350
20 3 -150
30 2 200
30 1 100
40 1 300
40 2 NA
Three questions:
How can I calculate the difference between the last score and the first score, grouped by ID and Time, where the last time is the one with the largest number (some IDs have more repeated measures than others)?
How should NA values be handled in the calculation?
Is there a way to arrange the Time variable in ascending order while keeping the rows grouped by ID?
Thanks for the help.
Using tapply
with(dat, tapply(Testscore, ID, \(x) x[length(x)] - x[1]))
# 20 30 40
# -450 -100 NA
or ave.
transform(dat, d=ave(Testscore, ID, FUN=\(x) x[length(x)] - x[1]))
# ID Time Testscore d
# 1 20 2 300 -450
# 2 20 1 350 -450
# 3 20 3 -150 -450
# 4 30 2 200 -100
# 5 30 1 100 -100
# 6 40 1 300 NA
# 7 40 2 NA NA
Here it is by ID and Time, although that doesn't make much sense with your sample data, since each ID/Time combination has only a single row.
with(dat, tapply(Testscore, list(ID, Time), \(x) x[length(x)] - x[1]))
transform(dat, d=ave(Testscore, ID, Time, FUN=\(x) x[length(x)] - x[1]))
Sorting dataframe rows is usually done with the order function.
dfrm <- dfrm[ order(dfrm$ID, dfrm$Time) , ]
Then you can use split in traditional R or group_by in the tidyverse to separately handle the difference calculations.
diffs <- sapply( split(dfrm, dfrm$ID), function(grp){
# after sorting, Time runs 1, 2, ..., n within each ID, so it doubles as the row position
grp[ max(grp$Time, na.rm=TRUE), "Testscore"] -
grp[ min(grp$Time, na.rm=TRUE), "Testscore"] })
diffs
#---------------
20 30 40
-500 100 NA
I didn't see a request to put these differences alongside the dataframe.
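The row lookup above works because, after sorting, Time within each ID runs 1, 2, ..., n, so the largest Time is also the last row position. A minimal sketch that does not rely on that, assuming the same columns as the question (the data frame is re-created here so the block runs on its own):

```r
dfrm <- data.frame(
  ID        = c(20, 20, 20, 30, 30, 40, 40),
  Time      = c(2, 1, 3, 2, 1, 1, 2),
  Testscore = c(300, 350, -150, 200, 100, 300, NA)
)

# which.max()/which.min() return the row position of the latest/earliest
# Time in each group, so no prior sorting is needed
diffs <- sapply(split(dfrm, dfrm$ID), function(grp) {
  grp$Testscore[which.max(grp$Time)] - grp$Testscore[which.min(grp$Time)]
})
diffs
```

ID 40 still comes out as NA, because the score at its latest Time is missing.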
Using dplyr:
library(dplyr)
df %>%
arrange(ID, Time) %>%
group_by(ID) %>%
mutate(Diff = last(Testscore) - first(Testscore))
# A tibble: 7 × 4
# Groups: ID [3]
# ID Time Testscore Diff
# <dbl> <dbl> <dbl> <dbl>
# 1 20 1 350 -500
# 2 20 2 300 -500
# 3 20 3 -150 -500
# 4 30 1 100 100
# 5 30 2 200 100
# 6 40 1 300 NA
# 7 40 2 NA NA
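The answers so far return NA for ID 40 because its last score is missing. If the goal is instead to compare the first and last non-missing scores, one possible sketch is shown here. This is an assumption about what the NA question is asking, and last_minus_first is a hypothetical helper name:

```r
dat <- data.frame(
  ID        = c(20, 20, 20, 30, 30, 40, 40),
  Time      = c(2, 1, 3, 2, 1, 1, 2),
  Testscore = c(300, 350, -150, 200, 100, 300, NA)
)

# sort by ID, then Time, so "first" and "last" follow time order
dat <- dat[order(dat$ID, dat$Time), ]

# hypothetical helper: drop missing scores, then take last minus first
last_minus_first <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  x[length(x)] - x[1]
}

diffs <- tapply(dat$Testscore, dat$ID, last_minus_first)
diffs
```

With this choice, an ID that has only one non-missing score (ID 40 here) gets a difference of 0 rather than NA; whether that is desirable depends on the analysis.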
Related
I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (so patients that have complete data for 0 or only 1 time points should be thrown out). For this example, my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable in 2 or 3 of the time points.
We can group by 'patientid', count the non-NA 'outcome' values in each group with sum(!is.na(outcome)), and filter to keep the patientids that have more than one.
library(dplyr)
Data %>%
group_by(patientid) %>%
filter(sum(!is.na(outcome)) > 1) %>%
ungroup
-output
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
Data,
ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks to #akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
library(dplyr)
Data %>%
group_by(patientid) %>%
# sum the non-missing outcomes per patient; this equals the number of
# observations only because every recorded outcome here is 1
mutate(observation = sum(outcome, na.rm = TRUE)) %>%
filter(observation >=2) %>%
ungroup
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
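The last answer counts observations with sum(outcome, na.rm = TRUE), which matches sum(!is.na(outcome)) only while every recorded outcome equals 1. A small sketch with made-up outcome values (not from the question) shows the two diverging:

```r
# hypothetical outcomes for one patient: two recorded values, one missing
outcome <- c(0, 5, NA)

value_sum <- sum(outcome, na.rm = TRUE)  # adds up the recorded values
n_obs     <- sum(!is.na(outcome))        # counts the recorded values

value_sum
n_obs
```

If the outcome can take values other than 1, counting with !is.na() is probably the safer filter condition.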
Say, I have a data frame with 3 columns
ID Type Amount
1 4 100
1 4 50
1 1 20
2 4 30
2 1 10
I want to do some calculations in the data frame which are based on the groups of ID and Type. For example, I want to calculate the sum of amount for type 4 - sum of amount for type 1 for all of the IDs of the data frame and append it to the end, so the final result would be something like
ID Type Amount Calculation
1 4 100 (100 + 50) - 20
1 4 50 (100 + 50) - 20
1 1 20 (100 + 50) - 20
2 4 30 30 - 10
2 1 10 30 - 10
Is there an easy way to implement this? Easy, because I want to do some more complex calculations, but want to get the basics right first.
I tried to work it out with dplyr
Something like
df %>%
group_by(ID) %>%
sum( Calculation = Amount[Type == 4] - Amount[Type == 1])
This gave me the same value for all the columns in my data frame, so it doesn't seem to work. Any ideas?
This does what you need with dplyr
library(dplyr)
df <- data.frame(ID = c(1,1,1,2,2), Type = c(4,4,1,4,1), Amount = c(100,50,20,30,10))
df %>% group_by(ID) %>% mutate(Calculation = sum(Amount[Type == 4]) - sum(Amount[Type == 1]))
# A tibble: 5 x 4
# Groups: ID [2]
ID Type Amount Calculation
<dbl> <dbl> <dbl> <dbl>
1 1 4 100 130
2 1 4 50 130
3 1 1 20 130
4 2 4 30 20
5 2 1 10 20
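For comparison, the same per-ID calculation can be sketched in base R with ave(). The signed vector is an illustrative helper name, not something from the question; it counts type 4 amounts as positive, type 1 amounts as negative, and anything else as zero:

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2),
                 Type = c(4, 4, 1, 4, 1),
                 Amount = c(100, 50, 20, 30, 10))

# hypothetical helper: +Amount for type 4, -Amount for type 1, 0 otherwise
signed <- with(df, ifelse(Type == 4, Amount, ifelse(Type == 1, -Amount, 0)))

# ave() sums within each ID and repeats the result on every row of that ID
df$Calculation <- ave(signed, df$ID, FUN = sum)
df
```

Summing one signed vector per group gives the same numbers as subtracting the two per-type sums, and it extends to more types by adding further signs.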
I need a dplyr solution that creates a cumsum column.
# Input dataframe
df <- data.frame(OilChanged = c("No","No","Yes","No","No","No","No","No","No","No","No","Yes","No"),
Odometer = c(300,350,410,420,430,450,500,600,600,600,650,660,700))
# Create difference column - first row starting with zero
df <- df %>% dplyr::mutate(Odometer_delta = Odometer - lag(Odometer, default = Odometer[1]))
I'm trying to make a reset condition based on the factor column for a cumulative sum.
The result needs to be exactly like this.
# Wanted result dataframe
df <- data.frame(OilChanged = c("No","No","Yes","No","No","No","No","No","No","No","No","Yes","No"),
Odometer = c(300,350,410,420,430,450,500,600,600,600,650,660,700),
Diff = c(0,50,60,10,10,20,50,100,0,0,50,10,40),
CumSum = c(0,50,110,10,20,40,90,190,190,190,240,250,40))
You can create a new group every time OilChanged == 'Yes' and take the cumsum of the Diff values in each group.
library(dplyr)
df %>%
group_by(grp = lag(cumsum(OilChanged == 'Yes'), default = 0)) %>%
mutate(newcumsum = cumsum(Diff)) %>%
ungroup %>%
select(-grp)
# OilChanged Odometer Diff CumSum newcumsum
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 No 300 0 0 0
# 2 No 350 50 50 50
# 3 Yes 410 60 110 110
# 4 No 420 10 10 10
# 5 No 430 10 20 20
# 6 No 450 20 40 40
# 7 No 500 50 90 90
# 8 No 600 100 190 190
# 9 No 600 0 190 190
#10 No 600 0 190 190
#11 No 650 50 240 240
#12 Yes 660 10 250 250
#13 No 700 40 40 40
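The answer assumes the Diff column from the wanted result already exists, whereas the question's own code names it Odometer_delta. A base R sketch that builds both columns from the raw input, using the same grouping idea (grp is a throwaway helper name):

```r
df <- data.frame(
  OilChanged = c("No","No","Yes","No","No","No","No","No","No","No","No","Yes","No"),
  Odometer   = c(300,350,410,420,430,450,500,600,600,600,650,660,700)
)

# difference to the previous reading, with 0 for the first row
df$Diff <- c(0, diff(df$Odometer))

# running count of oil changes, shifted down one row so that the "Yes" row
# itself still belongs to the group it closes
grp <- c(0, head(cumsum(df$OilChanged == "Yes"), -1))

# cumulative sum of Diff that restarts after every oil change
df$CumSum <- ave(df$Diff, grp, FUN = cumsum)
df
```

Shifting the counter by one row is what makes the reset happen on the row after a "Yes", which matches the wanted result.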
I want to create a cumulative sum by id, but it should not include the value of the row on which it is being calculated.
I've already tried cumsum. However, I do not know how to specify that the amount of the current row should be left out of the sum. The result I am looking for is the third column, called "sum".
For example, for id 1 the first row has sum=0, because the row's own amount should not be added. For id 1 and row 2, sum=100, because the amount of id 1 before row 2 was 100, and so on.
id amount sum
1: 1 100 0
2: 1 20 100
3: 1 150 120
4: 2 60 0
5: 2 100 60
6: 1 30 270
7: 2 40 160
This is what I've tried:
df[,sum:=cumsum(amount),
by ="id"]
Data:
df <- data.table(id = c(1, 1, 1, 2, 2, 1, 2),
amount = c(100, 20, 150, 60, 100, 30, 40),
sum = c(0, 100, 120, 0, 60, 270, 160),
stringsAsFactors = FALSE)
You can do this without using lag:
df %>%
group_by(id) %>%
mutate(sum = cumsum(amount) - amount)
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
#1 1 100 0
#2 1 20 100
#3 1 150 120
#4 2 60 0
#5 2 100 60
#6 1 30 270
#7 2 40 160
With dplyr -
df %>%
group_by(id) %>%
mutate(sum = lag(cumsum(amount), default = 0)) %>%
ungroup()
# A tibble: 7 x 3
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
Thanks to #thelatemail here's the data.table version -
df[, sum := cumsum(shift(amount, fill=0)), by=id]
Here is an option in base R
df$Sum <- with(df, ave(amount, id, FUN = cumsum) - amount)
df$Sum
#[1] 0 100 120 0 60 270 160
Or by removing the last observation, take the cumsum
with(df, ave(amount, id, FUN = function(x) c(0, cumsum(x[-length(x)]))))
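The two base R variants are interchangeable: subtracting the current value from the running total gives the same result as prepending 0 to the cumsum of everything but the last element. A quick check on one group's amounts (the id 1 rows from the question):

```r
# amounts for id 1, in row order
amount <- c(100, 20, 150, 30)

# variant 1: running total minus the current row's own amount
exclusive_a <- cumsum(amount) - amount

# variant 2: start at 0 and accumulate all rows except the last
exclusive_b <- c(0, cumsum(amount[-length(amount)]))

exclusive_a
exclusive_b
```

The second form avoids the subtraction, which may matter if amount contains NA or infinite values.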
You can shift the values you're summing by using the lag function.
library(tidyverse)
df <- data.frame(id = c(1, 1, 1, 2, 2, 1, 2),
amount = c(100, 20, 150, 60, 100, 30, 40),
sum = c(0, 100, 120, 0, 60, 270, 160),
stringsAsFactors = FALSE)
df %>%
group_by(id) %>%
mutate(sum = cumsum(lag(amount, 1, default=0)))
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
I have data that looks the following way:
Participant Round Total
1 100 5
1 101 8
1 102 12
1 200 42
2 100 14
2 101 71
40 100 32
40 101 27
40 200 18
I want to get a table with the Total of the last Round (200) minus the Total of the first Round (100).
For example, for Participant 1 it is 42 - 5 = 37.
The final output should look like:
Participant Total
1 37
2
40 -14
With base R
aggregate(Total ~ Participant, df[df$Round %in% c(100, 200), ], diff)
# Participant Total
# 1 1 37
# 2 2
# 3 40 -14
Or similarly combined with subset
aggregate(Total ~ Participant, df, subset = Round %in% c(100, 200), diff)
Or with data.table
library(data.table)
setDT(df)[Round %in% c(100, 200), diff(Total), by = Participant]
# Participant V1
# 1: 1 37
# 2: 40 -14
Or using binary join
setkey(setDT(df), Round)
df[.(c(100, 200)), diff(Total), by = Participant]
# Participant V1
# 1: 1 37
# 2: 40 -14
Or with dplyr
library(dplyr)
df %>%
group_by(Participant) %>%
filter(Round %in% c(100, 200)) %>%
summarise(Total = diff(Total))
# Source: local data table [2 x 2]
#
# Participant Total
# 1 1 37
# 2 40 -14
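You can try this (it uses each participant's first and last rows rather than rounds 100 and 200, hence the 57 for participant 2):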
library(dplyr)
group_by(df, Participant) %>%
filter(row_number()==1 | row_number()==max(row_number())) %>%
mutate(df = diff(Total)) %>%
select(Participant, df) %>%
unique()
Source: local data frame [3 x 2]
Groups: Participant
Participant df
1 1 37
2 2 57
3 40 -14
Try this:
df <- read.table(header = TRUE, text = "
Participant Round Total
1 100 5
1 101 8
1 102 12
1 200 42
2 100 14
2 101 71
2 200 80
40 100 32
40 101 27
40 200 18")
library(data.table)
setDT(df)[ , .(Total = Total[Round == 200] - Total[Round == 100]), by = Participant]
Everyone loves a bit of sqldf, so if your requirement isn't to use apply then try this:
Firstly some test data:
df <- read.table(header = TRUE, text = "
Participant Round Total
1 100 5
1 101 8
1 102 12
1 200 42
2 100 14
2 101 71
2 200 80
40 100 32
40 101 27
40 200 18")
Next use SQL to create 2 columns - one for the 100 round and one for the 200 round and subtract them
library(sqldf)
rolled <- sqldf("
SELECT tab_a.Participant AS Participant
,tab_b.Total_200 - tab_a.Total_100 AS Difference
FROM (
SELECT Participant
,Total AS Total_100
FROM df
WHERE Round = 100
) tab_a
INNER JOIN (
SELECT Participant
,Total AS Total_200
FROM df
WHERE Round = 200
) tab_b ON (tab_a.Participant = tab_b.Participant)
")