dplyr: Create a new column with a complex operation using mutate() - r

I'd like to create a new data frame (new.df) from an original one (df), adding a new column (Age) computed by a complex operation with the mutate() function from the dplyr package. My steps are:
# Artificial dataframe
IDtest<-c(1,1,1,1,1,1,2,2,2,3,3,3,3)
Class<-c(1,1,2,2,2,3,1,1,2,1,2,2,3)
Day<-c(0,47,76,100,150,173,0,47,76,0,47,76,100)
Area<-c(0.45,0.85,1.50,1.53,1.98,5.2,
0.36,0.58,1.2,
0.85,1.36,2.26,3.59)
df<-data.frame(cbind(IDtest, Class, Day, Area))
str(df)
#Split each IDtest
df[df[,1]==1,]
# IDtest Class Day Area
#1 1 1 0 0.45
#2 1 1 47 0.85
#3 1 2 76 1.50
#4 1 2 100 1.53
#5 1 2 150 1.98
#6 1 3 173 5.20
Inside each IDtest:
1) Take the last Area inside each Class minus a factor (0 if Area < 1; 1 if < 2.9; 3 if < 8.9; 9 if < 24.9; 25 if > 25); and
2) Then divide 1) by the difference between the last and first Area divided by Day, inside each Class
3) Then to 1) and 2) add the Day inside each Class minus the last Day. Finally, sum everything, divide by 365, and create a new column Age
#For Class 1
(0.85-0)/((0.85-0.45)/47) + (47 - 0)
#For Class 2
(1.98-1)/((1.98-0.85)/150) + (157 - 47)
#For Class 3
(5.20-3)/((5.2-1.98)/173) + (173 - 150)
#Final
Age<-((0.85-0)/((0.85-0.45)/47) + (47 - 0) +
(1.98-1)/((1.98-0.85)/150) + (157 - 47) +
(5.20-3)/((5.2-1.98)/173) + (173 - 150))/365
Age
#[1] 1.44702
# Desirable output
new.df
# IDtest Class Day Area Age
#1 1 1 0 0.45 1.44702
#2 1 1 47 0.85 1.44702
#3 1 2 76 1.50 1.44702
#4 1 2 100 1.53 1.44702
#5 1 2 150 1.98 1.44702
#6 1 3 173 5.20 1.44702
Any ideas, please?
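The factor rule described in the question (0 below 1, 1 below 2.9, 3 below 8.9, 9 below 24.9, otherwise 25) can be sketched as a small lookup with base R's findInterval(). The function name area_factor is hypothetical, and treating values between 24.9 and 25 as 25 is an assumption, since the question leaves that gap undefined.

```r
# Hypothetical helper: map an Area value to the factor to subtract.
# Assumption: values in the undefined gap [24.9, 25] are treated as 25.
area_factor <- function(area) {
  breaks  <- c(1, 2.9, 8.9, 24.9)   # upper bounds of the first four bands
  factors <- c(0, 1, 3, 9, 25)      # one factor per band
  # findInterval() returns 0 for area < 1, 1 for 1 <= area < 2.9, and so on
  factors[findInterval(area, breaks) + 1]
}

# Last Area of Class 1, 2 and 3 for IDtest 1
area_factor(c(0.85, 1.98, 5.20))
#[1] 0 1 3
```

These are the same factors (0, 1 and 3) that appear in the hand calculation for Class 1, 2 and 3.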

It's quite tricky, so I have done every step separately to make it easier for you to spot any possible misunderstanding.
Is it possible that there is a mistake in this line of yours?
(1.98-1)/((1.98-0.85)/150) + (157 - 47) # 157? wouldn't it be 150?
That said, my results for Class 1 are the same as yours, but please be careful with Class 2 and 3, because I'm not sure I have understood the second and third steps properly. In particular, I'm not sure about your use of "last" (i.e. "last" within the Class, or the "previous" Class).
In the second step I use "last" within the Class, and in the third one I use a for loop to get "the previous" Class. I think you can adapt the idea.
library(dplyr)

# Number the rows inside each IDtest/Class group
df2 <- df %>%
  group_by(IDtest, Class) %>%
  mutate(
    DayOrder = row_number()
  )
df2 <- df2 %>%
  mutate(step1a = Area[max(DayOrder)], # I divide step1 in several steps to make it clearer
         minus = # what you want to subtract
           case_when(
             step1a < 1 ~ 0,
             step1a < 2.9 ~ 1,
             step1a < 8.9 ~ 3,
             step1a < 24.9 ~ 9,
             step1a > 25 ~ 25
           ),
         step1done = step1a - minus,
         step2a = Area[max(DayOrder)] - Area[min(DayOrder)], # "Last" inside the same Class (as it is inside mutate, which is under group_by)
         step2b = Day[max(DayOrder)],
         step2done = step2a / step2b,
         step1by2 = step1done / step2done
  )
df2$step3 <- NA
for (i in 1:max(df2$Class)) {
  if (i == 1) {
    df2$step3[df2$Class == i] <- max(df2$Day[df2$Class == i]) - 0 # quite silly
  } else {
    df2$step3[df2$Class == i] <- max(df2$Day[df2$Class == i]) - max(df2$Day[df2$Class == i - 1]) # "Last" as the "previous" Class, not inside the same Class
  }
}
df2 %>%
  mutate(
    step3done = step1by2 + step3,
    Age = step3done / 365 # Do you want "Age" as a single value rather than one value per row? In that case I would compute it outside mutate and add it as a new column
  )
If I have misunderstood you, I hope you can at least take some ideas!
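The for loop in the answer pools the rows of every IDtest when it looks up the last Day of a Class. As a possible variation, the same "last Day minus the previous Class's last Day" quantity can be computed separately for each IDtest with summarise() and lag(). This is only a sketch under that reading of step 3; the names last_day and step3_tbl are illustrative, not part of the original answer.

```r
library(dplyr)

# Same IDtest, Class and Day values as the question's artificial data frame
df <- data.frame(
  IDtest = c(1,1,1,1,1,1,2,2,2,3,3,3,3),
  Class  = c(1,1,2,2,2,3,1,1,2,1,2,2,3),
  Day    = c(0,47,76,100,150,173,0,47,76,0,47,76,100)
)

# One row per IDtest/Class with its last Day, then the gap to the
# previous Class's last Day (0 is used for the first Class of an IDtest)
step3_tbl <- df %>%
  group_by(IDtest, Class) %>%
  summarise(last_day = max(Day), .groups = "drop") %>%
  group_by(IDtest) %>%
  arrange(Class, .by_group = TRUE) %>%
  mutate(step3 = last_day - lag(last_day, default = 0)) %>%
  ungroup()

step3_tbl
```

For IDtest 1 this gives 47, 103 and 23 for Class 1, 2 and 3. The table can be joined back onto the row-level data with left_join(df, step3_tbl, by = c("IDtest", "Class")).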

Related

Dynamic summarise through dynamic multiplication by an external vector

I have a tibble such:
X = tibble(Name = rep(c("A","B","C"),5),
Coeffs_1 = runif(15,0,1),
Coeffs_2 = runif(15,0,1)) %>% arrange(Name)
Y = runif(10,0,100)
In this example the number of "Reps" per "Name" is arbitrarily fixed at 5, and the number of Coeffs columns is arbitrarily fixed at 2, but in my code they could be any number and I don't know the exact number in advance.
I also have a vector with n = reps*2 elements:
Y = runif(10,0,100)
In this specific case, it's 10 = 5*2.
My task is to summarise, per each Name, per each Coeff, this formula:
Result_x[1] = sum(Coeff_x[1]*Y[2] + Coeff_x[2]* Y[3] + ... + Coeff_x[Reps]*Y[Reps+1]) - Y[1]
Result_x[2] = sum(Coeff_x[1]*Y[3] + Coeff_x[2]* Y[4] + ... + Coeff_x[Reps]*Y[Reps+2]) - Y[2]
.
.
.
Result_x[Reps] = sum(Coeff_x[1]* Y[Reps+1] + Coeff_x[2]* Y[Reps+2] + ... + Coeff_x[Reps]*Y[Reps*2]) - Y[Reps]
So that in the end, the final summarise tibble should look like:
Name  Lag   Result_1  ...      Result_x
A     +1    a number  numbers  a number
A     +2    a number  numbers  a number
A     ...   a number  numbers  a number
A     Reps  a number  numbers  a number
B     +1    a number  numbers  a number
B     +2    a number  numbers  a number
...   ...   a number  numbers  a number
...   Reps  a number  numbers  a number
The dynamic nature of the problem makes it hard for me to express it with a for loop, and the external vector, which must be re-indexed and summarised for each row of the original tibble, makes it difficult to work within a pipeline.
I thought that defining a custom function could help, but again it does not fit well into pipeline code.
Split 'X' by 'Name' and loop over the resulting list (map), while creating the shifted lead values of 'Y' as a list, with n specified as a vector. Loop over that list and, for each element, summarise across the 'Coeff' columns by taking the sum of the products of the column values with the corresponding 'y' (trimmed to the same length), then subtract the first value of 'y'.
library(dplyr)
library(purrr)
library(data.table)
X %>%
group_split(Name) %>%
map_dfr(~ map_dfr(shift(Y, n = 1:nrow(.x), type = 'lead'),
function(y) .x %>%
summarise(Name = first(Name), across(starts_with('Coeff'),
~ sum(. * y[seq_along(.)], na.rm = TRUE) - first(y)))) ) %>%
mutate(Lag = rowid(Name))
-output
# A tibble: 15 × 4
Name Coeffs_1 Coeffs_2 Lag
<chr> <dbl> <dbl> <int>
1 A 127. 54.4 1
2 A 162. 134. 2
3 A 127. 68.2 3
4 A 109. 38.0 4
5 A 108. 94.0 5
6 B 175. 197. 1
7 B 187. 240. 2
8 B 151. 200. 3
9 B 132. 159. 4
10 B 102. 152. 5
11 C 48.8 131. 1
12 C 89.1 128. 2
13 C 42.5 98.7 3
14 C 29.4 95.7 4
15 C 41.7 50.1 5
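To make the indexing in the question's formula explicit, here is a minimal base R sketch for a single coefficient vector. It follows the formula as written in the question, Result[k] = sum(Coeff * Y[(k+1):(k+Reps)]) - Y[k], so its numbers are not meant to reproduce the pipeline output above. The function name lagged_result and the toy inputs are hypothetical.

```r
# Hypothetical helper implementing the question's formula literally:
# Result[k] = sum(coef * Y[(k + 1):(k + reps)]) - Y[k], for k = 1..reps
lagged_result <- function(coef, Y) {
  reps <- length(coef)
  stopifnot(length(Y) >= 2 * reps)   # Y needs reps * 2 elements
  vapply(seq_len(reps),
         function(k) sum(coef * Y[(k + 1):(k + reps)]) - Y[k],
         numeric(1))
}

# Toy input with Reps = 3, so Y has 6 elements
coef <- c(1, 0, 2)
Y    <- c(10, 20, 30, 40, 50, 60)
lagged_result(coef, Y)
#[1]  90 110 130
```

For k = 1 the window is Y[2:4], giving 1*20 + 0*30 + 2*40 - 10 = 90. Applying the helper to each Coeffs column within each Name would produce one row per lag.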

Conditionally add values to a new column and replace values in the conditioning column in R

I am working on a project where I need to read files into my environment and afterwards based on the row's name change a value and add new values to new columns: i.e.
X1 Area Mean Min Max file_row_name
55 0.165 31.384 4 82 ./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv
56 0.097 45.867 4 121 ./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv
168 0.042 28.252 20 49 ./Fluorescence Analysis/T0-25-50-10x-1-1.csv
So in the example I want to look at each row's file_row_name and if the rows have the same name, create two variables: Conc & Rep and replace the values at file_row_name so as to look like this:
X1 Area Mean Min Max file_row_name Conc Rep
55 0.165 31.384 4 82 T0 12.5 3
56 0.097 45.867 4 121 T0 12.5 3
168 0.042 28.252 20 49 T0 25 1
So far what I've done is:
my_df$Conc[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- 12.5
my_df$Rep[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- 3
my_df$file_row_name[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- "T0"
my_df$Conc[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- 12.5
my_df$Rep[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- 3
my_df$file_row_name[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- "T0"
But this takes too long and when I try an if clause:
if(my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
"./Fluorescence Analysis/T0-12.5-150-10x-3.csv")){
my_df$Conc = "12.5"
my_df$Rep = 3
my_df$file_row_name = "T0"
}
it tells me that:
Warning message:
In if (my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv", :
the condition has length > 1 and only the first element will be used
And if I manage to bypass that warning message with another code piece, basically the columns file_row_name Conc and Rep get replaced with the same value and nothing is changed based on condition.
Instead of if (which is not vectorized), we create a logical row index and use it to assign the values
i1 <- my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
"./Fluorescence Analysis/T0-12.5-150-10x-3.csv")
my_df[i1, c("Conc", "Rep", "file_row_name")] <- list("12.5", 3, "T0")
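If there are many files, hard-coding each name does not scale. A possible alternative is to parse the values out of the file name itself. The sketch below assumes every name follows the pattern seen in the example, with fields separated by "-", the label first, Conc second and Rep fifth; that layout is inferred from the three sample rows and may not hold for all files.

```r
# Small stand-in for the question's data
my_df <- data.frame(
  X1 = c(55, 56, 168),
  file_row_name = c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
                    "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
                    "./Fluorescence Analysis/T0-25-50-10x-1-1.csv"),
  stringsAsFactors = FALSE
)

# Drop the directory and the .csv extension, then split on "-"
stem  <- tools::file_path_sans_ext(basename(my_df$file_row_name))
parts <- strsplit(stem, "-", fixed = TRUE)

# Assumed field positions: 1 = label, 2 = Conc, 5 = Rep
my_df$Conc <- as.numeric(vapply(parts, `[`, character(1), 2))
my_df$Rep  <- as.integer(vapply(parts, `[`, character(1), 5))
my_df$file_row_name <- vapply(parts, `[`, character(1), 1)

my_df
```

Stripping the extension first means a name such as T0-12.5-150-10x-3.csv, which has no trailing field, still yields Rep = 3.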

How to use if else statement in a dataframe when comparing dates?

I have a dataframe D and I would want to calculate a daily return of "Close" only if they share the same month. So for example there would be 0 for 1995-08-01
Date Close Month
1 1995-07-27 163.32 1995-07
2 1995-07-28 161.36 1995-07
3 1995-07-30 162.91 1995-07
4 1995-08-01 162.95 1995-08
5 1995-08-02 162.69 1995-08
I am trying to use an if-else statement and looping to apply it on other dataframes.
D1 <- D[-1,]
for (i in c("Close"))
{ TT <- dim(D)[1]
if (D[1:(TT-1),"Month"] == D[2:TT,"Month"]) {
D1[,i] = round((100*(log(D[2:TT,i]/D[1:(TT-1),i]))), digits = 4)
}
else {
D1[i] = 0 }
}
I get these results, but in the fourth row it should be 0.0000 because the fourth row is from a different month than the third row. Moreover, I get this warning message: "Warning message: In if (D[1:(TT - 1), "Month"] == D[2:TT, "Month"]) { : the condition has length > 1 and only the first element will be used". Can you please help me out? Thank you.
Date Close Month
1 1995-07-27 0.5903 1995-07
2 1995-07-28 1.4577 1995-07
3 1995-07-30 0.9139 1995-07
4 1995-08-01 0.0006 1995-08
5 1995-08-02 0.0255 1995-08
Next time you should REALLY provide a reproducible example; here I did it for you. My solution uses diff and ifelse as requested.
month <- c(1,1:5,5:6)
data <- (1:8)*(1:8)
df <- data.frame(cbind(month, data))
diffs <- sapply(df, diff)
diffs <- data.frame(rbind(NA, diffs))
df$result <- ifelse(diffs$month==0, diffs$data, 0)
df
month data result
1 1 1 NA
2 1 4 3
3 2 9 0
4 3 16 0
5 4 25 0
6 5 36 0
7 5 49 13
8 6 64 0
if() expects a single value (usually TRUE or FALSE, but it also accepts a single number, treating any nonzero value as TRUE). You are feeding in a vector of values. The warning message is telling you that it is ignoring all the other values of the vector except the first, which is usually a strong indication that your code is not doing what you intend it to do. (Since R 4.2.0, a condition of length greater than one is an error rather than a warning.)
Here's one do-it-yourself approach with no loops (I'm sure some time-series package has a function to calculate returns):
# create your example dataset
D <- data.frame(
Date = (as.Date("1995-07-27") + 0:6)[-c(3,5)],
Close = 162 + c(1.32, -.64, .91, .95, .69)
)
# get lagged values as new columns
D$Close_lag <- dplyr::lag(D$Close)
D$Date_lag <- dplyr::lag(D$Date)
# calculate all returns
D$return <- D$Close / D$Close_lag - 1
# identify month switches
D$new_month <- lubridate::month(D$Date) != lubridate::month(D$Date_lag)
# replace returns with zeros when month switches
D[!is.na(D$return) & D$new_month==TRUE, "return"] <- 0
# print results
D
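For comparison, here is a minimal base R sketch that keeps the question's own return definition (100 times the log ratio, rounded to 4 digits) and uses the vectorized ifelse() in place of if. The column name Return and the NA for the first row, which has no previous close, are choices made for this illustration.

```r
# The five rows shown in the question
D <- data.frame(
  Date  = as.Date(c("1995-07-27", "1995-07-28", "1995-07-30",
                    "1995-08-01", "1995-08-02")),
  Close = c(163.32, 161.36, 162.91, 162.95, 162.69)
)
D$Month <- format(D$Date, "%Y-%m")

n <- nrow(D)
# TRUE where a row is in the same month as the row before it
same_month <- D$Month[-1] == D$Month[-n]
# Log return in percent between consecutive rows
log_ret <- round(100 * log(D$Close[-1] / D$Close[-n]), 4)

# ifelse() checks every element; the first row has no previous value
D$Return <- c(NA, ifelse(same_month, log_ret, 0))
D
```

The row for 1995-08-01 gets 0 because the row before it belongs to 1995-07.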

Combine if else statement with two conditions

I have one dataset which contains two columns, "Code" and "Gross_i". You can see the data below:
# Data
TABLE<-data.frame(Code=as.integer(c("1","2","3","4","5")),
Gross_i=as.integer(c("10","20","30","40","50")
))
TAX_RATE1<-0.20
TAX_RATE2<-0.25
My intention here is to multiply the second column, "Gross_i", by two different tax rates. I need to multiply the first three codes, "1", "2" and "3", by TAX_RATE1 (20%) and observations "4" and "5" by TAX_RATE2 (25%). To do this I tried the following line of code (an if-else statement), but the results are not good:
pit_1=if_else(filter(Code %in% c("1","2","3")),Gross_i*TAX_RATE1,Gross_i*TAX_RATE2)
So can anybody help how to fix this line of code?
This approach with dplyr can be useful:
library(dplyr)
#Code
TABLE %>% mutate(Value=if_else(Code %in% c("1","2","3"),Gross_i*TAX_RATE1,Gross_i*TAX_RATE2))
Output:
Code Gross_i Value
1 1 10 2.0
2 2 20 4.0
3 3 30 6.0
4 4 40 10.0
5 5 50 12.5
If you have only two tax rates, you can do :
library(dplyr)
TABLE %>% mutate(pit_1 = Gross_i * c(TAX_RATE2, TAX_RATE1)[(Code %in% 1:3) + 1])
# Code Gross_i pit_1
#1 1 10 2.0
#2 2 20 4.0
#3 3 30 6.0
#4 4 40 10.0
#5 5 50 12.5
If you have a lot of rates like this, it would be easy to specify the conditions within case_when:
TABLE %>%
mutate(pit_1 = Gross_i * case_when(Code %in% 1:3 ~ TAX_RATE1,
TRUE ~ TAX_RATE2))
Your attempt at if_else translates easily to the following using ifelse
pit_1 <- ifelse(TABLE$Code %in% c("1", "2", "3"),
yes = TABLE$Gross_i * TAX_RATE1,
no = TABLE$Gross_i * TAX_RATE2)
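When each code has its own rate, another option is a named lookup vector indexed by the code. This is only a sketch; the name rate_by_code is made up for the illustration, and the rates are the two from the question.

```r
TABLE <- data.frame(Code = 1:5, Gross_i = c(10L, 20L, 30L, 40L, 50L))
TAX_RATE1 <- 0.20
TAX_RATE2 <- 0.25

# Hypothetical lookup: one rate per code, named by the code
rate_by_code <- c("1" = TAX_RATE1, "2" = TAX_RATE1, "3" = TAX_RATE1,
                  "4" = TAX_RATE2, "5" = TAX_RATE2)

# Index by name so the order of the rows does not matter
TABLE$pit_1 <- TABLE$Gross_i * unname(rate_by_code[as.character(TABLE$Code)])
TABLE
#  Code Gross_i pit_1
#1    1      10   2.0
#2    2      20   4.0
#3    3      30   6.0
#4    4      40  10.0
#5    5      50  12.5
```

Adding a new code then only requires a new entry in the lookup vector.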

For loop that references prior rows

I'm interested in filtering out data based on a set of rules.
I have a dataset that contains play data for all games in which a team had a .8 win probability at some point. What I'd like to do is find that point in which the win probability reached .8 and remove every play thereafter until the next game data begins. The dataset contains numerous games so once a game ends data from a new one begins in which the win probability goes back to around .5.
Here are the relevant columns and each row is a play in the game:
game_id = unique num for each game
team = team that will eventually get an .8 win prob
play_id = num that is increased (but not necessary in seq order for some reason) after each play
win_per = num showing what the teams win percentage chance at the start of that recorded play was
Example df
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
So in this small example, I have recorded 5 plays of two teams (a and b) who both obtained a win_prob of at least .8 at some point in the game. In both example cases, I would want to have all the plays removed AFTER they attained this .8 mark regardless of whether the win_prob kept rising or fell back below .8.
So team a would have the final two rows of data removed (win_prob == .81 and .85) and team b would have the final row removed (win_prob = .77)
I'm imagining running a for loop that checks if the team in any row is the same team as the prior row, and if so, find a win_prob >= .8 with the lowest play-id (as this would be the first time the team reached .8) and then somehow remove the rest of the rows following that match UNTIL the team != prior row's team.
Of course, you might know a better way as well. Thank you so much for helping me out!
No need to use a loop, that whole selection can be performed in 1 line using the dplyr package:
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
library(dplyr)
#group by team
#find the first row that exceeds .80 and add temp column
#save the row from 1 to the row that exceeds 0.80
#remove temp column
df %>% group_by(team, game_id) %>%
mutate(g80= min(which(win_per>=0.80))) %>%
slice(1:first(g80)) %>%
select(-g80)
# A tibble: 7 x 4
# Groups: team [2]
game_id team play_id win_per
<dbl> <fct> <dbl> <dbl>
1 122 a 1 0.5
2 122 a 5 0.6
3 122 a 22 0.86
4 144 b 45 0.54
5 144 b 47 0.43
6 144 b 55 0.47
7 144 b 58 0.81
Here is a base R way using cumsum in ave
subset(df, ave(win_per >= 0.8, game_id, FUN = function(x) c(0, cumsum(x)[-length(x)])) == 0)
# game_id team play_id win_per
#1 122 a 1 0.50
#2 122 a 5 0.60
#3 122 a 22 0.86
#6 144 b 45 0.54
#7 144 b 47 0.43
#8 144 b 55 0.47
#9 144 b 58 0.81
and using the similar concept in dplyr
library(dplyr)
df %>% group_by(game_id) %>% filter(lag(cumsum(win_per >= 0.8) == 0, default = TRUE))
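The cumsum-based answers rely on counting how many earlier plays in the game already reached the threshold. As an illustration, here are the intermediate vectors for game 122 alone, in base R; the variable names are chosen for this sketch only.

```r
# win_per for game 122 from the example data
win_per <- c(.5, .6, .86, .81, .85)

hit         <- win_per >= 0.8               # play is at or above the mark
hits_before <- c(0, head(cumsum(hit), -1))  # hits among the earlier plays
keep        <- hits_before == 0             # nothing reached the mark earlier

hit
#[1] FALSE FALSE  TRUE  TRUE  TRUE
hits_before
#[1] 0 0 0 1 2
win_per[keep]
#[1] 0.50 0.60 0.86
```

Shifting the running count by one play is what keeps the first play that reaches the mark while dropping everything after it.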

Resources