Converting dplyr to data.table - r

library(dplyr)
library(data.table)

df <- data.frame(year = rep(2003:2004, each = 4),
                 mg = rep(rep(c("a", "b"), each = 2), times = 2),
                 plant_date = c(20, 30, 20, 30, 33, 40, 33, 40),
                 stage1 = c(40, 50, 42, 52, 43, 55, 48, 57),
                 stage2 = c(55, 65, 57, 66, 58, 68, 59, 65),
                 stage3 = c(61, 75, 63, 76, 66, 77, 68, 79))

set.seed(123)
dat <- data.frame(year = rep(2003:2004, each = 365), doy = rep(1:365, times = 2),
                  rainfall = sample(0:20, 730, replace = TRUE))

final.dat <- dat %>% dplyr::left_join(df)
I have used dplyr to do some calculations as follows:
final.dat %>%
  dplyr::group_by(year, plant_date, mg) %>%
  dplyr::summarise(
    sum_rain_stage1 = sum(rainfall[doy >= plant_date & doy <= stage1]),
    sum_rain_stage2 = sum(rainfall[doy >= plant_date & doy <= stage2]),
    mean_rain_stage1 = mean(rainfall[doy >= plant_date & doy <= stage1]),
    mean_rain_stage2 = mean(rainfall[doy >= plant_date & doy <= stage2]),
    var.x1 = (sum_rain_stage1 > sum_rain_stage2) * (mean_rain_stage1 - mean_rain_stage2),
    var.x2 = (mean_rain_stage1 > mean_rain_stage2) * (sum_rain_stage1 - sum_rain_stage2))
I want to do the same thing using data.table
final.dat <- data.table(final.dat)
final.dat[, j = list(sum(rainfall[doy >= plant_date & doy <= stage1]),
                     sum(rainfall[doy >= plant_date & doy <= stage2]),
                     mean(rainfall[doy >= plant_date & doy <= stage1]),
                     mean(rainfall[doy >= plant_date & doy <= stage2])),
          by = list(year, plant_date, mg)]
I have two questions:
1) How can I keep my column names as in dplyr instead of the default V1, V2, V3 and V4?
2) I do not know how to implement this part of the dplyr code in data.table:
var.x1 = (sum_rain_stage1 > sum_rain_stage2) * (mean_rain_stage1 - mean_rain_stage2),
var.x2 = (mean_rain_stage1 > mean_rain_stage2) * (sum_rain_stage1 - sum_rain_stage2))
Thanks

dt = as.data.table(final.dat) # or setDT to convert in place

dt[, .(sum_rain_stage1 = sum(rainfall[doy >= plant_date & doy <= stage1]),
       sum_rain_stage2 = sum(rainfall[doy >= plant_date & doy <= stage2]),
       mean_rain_stage1 = mean(rainfall[doy >= plant_date & doy <= stage1]),
       mean_rain_stage2 = mean(rainfall[doy >= plant_date & doy <= stage2]))
   , by = .(year, plant_date, mg)][
   , `:=`(var.x1 = (sum_rain_stage1 > sum_rain_stage2) * (mean_rain_stage1 - mean_rain_stage2),
          var.x2 = (mean_rain_stage1 > mean_rain_stage2) * (sum_rain_stage1 - sum_rain_stage2))][]
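As an aside on question 1: if you do end up with the default V1/V2 names (as in the j = list(...) attempt above), setnames() can also rename the columns after the fact; a small sketch:
res <- dt[, .(sum(rainfall[doy >= plant_date & doy <= stage1]),
              sum(rainfall[doy >= plant_date & doy <= stage2])),
          by = .(year, plant_date, mg)]
setnames(res, c("V1", "V2"), c("sum_rain_stage1", "sum_rain_stage2"))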

Backing up a step, your approach to conditions like doy >= plant_date & doy <= stage1 can be simplified in two ways...
First, there's doy %between% list(plant_date, stage1) so you don't have to type the var name twice.
Second, since doy and the other columns come from different tables, you can do this inside a non-equi join that updates df, instead of making a new table final.dat:
setDT(df)
setDT(dat)

# join on year as well, so rainfall is only aggregated within the matching year
df[, c("sum_rs1", "mean_rs1") :=
     dat[df, on = .(year, doy >= plant_date, doy <= stage1),
         .(sum(rainfall), mean(rainfall)), by = .EACHI][, .(V1, V2)]]
df[, c("sum_rs2", "mean_rs2") :=
     dat[df, on = .(year, doy >= plant_date, doy <= stage2),
         .(sum(rainfall), mean(rainfall)), by = .EACHI][, .(V1, V2)]]

# same as @eddi's
df[, `:=`(
  var.x1 = (sum_rs1 > sum_rs2) * (mean_rs1 - mean_rs2),
  var.x2 = (mean_rs1 > mean_rs2) * (sum_rs1 - sum_rs2)
)]
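For completeness, here is the first simplification (%between%) applied to the grouped summary; a sketch that assumes final.dat is already a data.table (as in the question's data.table(final.dat) step):
final.dat[, .(sum_rain_stage1 = sum(rainfall[doy %between% list(plant_date, stage1)]),
              mean_rain_stage1 = mean(rainfall[doy %between% list(plant_date, stage1)])),
          by = .(year, plant_date, mg)]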

As with summarise, we can simply name the elements of the list, e.g.
list(a = 1:2, b = 3:5)
Similarly,
final.dat[, list(sum_rain_stage1 = sum(rainfall[doy >= plant_date & doy <= stage1]),
                 sum_rain_stage2 = sum(rainfall[doy >= plant_date & doy <= stage2]),
                 mean_rain_stage1 = mean(rainfall[doy >= plant_date & doy <= stage1]),
                 mean_rain_stage2 = mean(rainfall[doy >= plant_date & doy <= stage2])),
          by = list(year, plant_date, mg)]
# year plant_date mg sum_rain_stage1 sum_rain_stage2 mean_rain_stage1 mean_rain_stage2
#1: 2003 20 a 251 355 11.952381 9.861111
#2: 2003 30 a 176 304 8.380952 8.444444
#3: 2003 20 b 261 361 11.347826 9.500000
#4: 2003 30 b 185 313 8.043478 8.459459
#5: 2004 33 a 109 247 9.909091 9.500000
#6: 2004 40 a 134 279 8.375000 9.620690
#7: 2004 33 b 157 253 9.812500 9.370370
#8: 2004 40 b 158 242 8.777778 9.307692

Related

How to return only the true condition of a result on a list in R?

I have a problem with my R code.
Here I have a list named Bought_list with elements customer and checkout (checkout is a data frame).
This is what checkout looks like:
items     price qty total
Milk         10   2    20
Dolls        15  10   150
Chocolate     5   5    25
Toys         50   1    50
I want to know which items are for play_purpose and which are for date_purpose,
so I made boolean variables:
play_purpose <- Bought_list[["checkout"]][,"total"] >= 50 & Bought_list[["checkout"]][,"total"] <= 150
date_purpose <- Bought_list[["checkout"]][,"total"] > 0 & Bought_list[["checkout"]][,"total"] < 50
How can I return the item names and total values for each condition, like this?
for play_purpose:
Dolls 150
Toys 50
for date_purpose :
Milk 20
Chocolate 25
I'm not clear on the structure of your data, but you could subset with your current code:
play_purpose <-
  Bought_list[["checkout"]][Bought_list[["checkout"]][, "total"] >= 50 &
                              Bought_list[["checkout"]][, "total"] <= 150, c(1, 4)]
#       items total
# 2     Dolls   150
# 4      Toys    50

date_purpose <-
  Bought_list[["checkout"]][Bought_list[["checkout"]][, "total"] > 0 &
                              Bought_list[["checkout"]][, "total"] < 50, c(1, 4)]
#       items total
# 1      Milk    20
# 3 Chocolate    25
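A slightly shorter base R alternative to the indexing above is subset(), using the same conditions (a sketch):
subset(Bought_list$checkout, total >= 50 & total <= 150, select = c(items, total))
subset(Bought_list$checkout, total > 0 & total < 50, select = c(items, total))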
Another option is to use dplyr:
Bought_list$checkout %>%
  filter(total >= 50 & total <= 150) %>%
  select(items, total)

Bought_list$checkout %>%
  filter(total > 0 & total < 50) %>%
  select(items, total)
Or if you need to apply this to multiple data frames in the list, we could use map from purrr:
map(Bought_list, ~ .x %>%
      filter(total >= 50 & total <= 150) %>%
      select(items, total))

map(Bought_list, ~ .x %>%
      filter(total > 0 & total < 50) %>%
      select(items, total))
Data
Bought_list <- list(checkout = structure(
  list(items = c("Milk", "Dolls", "Chocolate", "Toys"),
       price = c(10L, 15L, 5L, 50L),
       qty = c(2L, 10L, 5L, 1L),
       total = c(20L, 150L, 25L, 50L)),
  class = "data.frame", row.names = c(NA, -4L)))

for loop in R: How to apply a defined function to loop through rows and columns in R?

I have a dataset named trainset that has 50 variables. For each row, I need to sum up the values derived from the columns Systolic.Blood.Pressure, Blood.Urea.Nitrogen, Blood.Sodium, Age, heart.rR and COPD. I have already defined a function to sum these up:
m.gwtg = function(Systolic.BP, Sodium, COPD){
if (Systolic.BP>=200){
pt.sbp = 0
}else if (Systolic.BP>= 190){
pt.sbp = 2
}else if (Systolic.BP>= 180){
pt.sbp = 4
}else if (Systolic.BP>= 170){
pt.sbp = 6
}else if (Systolic.BP>= 160){
pt.sbp = 8
}else if (Systolic.BP>= 150){
pt.sbp = 9
}else if (Systolic.BP>= 140){
if (Sodium>=139){
pt.sodium = 0
}else if (Sodium>= 137){
pt.sodium = 1
}else if (Sodium>= 134){
pt.sodium = 2
}
if (Age>=110){
pt.age = 28
}else if (Age>= 100){
pt.age = 25
}else if (Age>= 90){
pt.age = 22
}else if (Age>= 80){
pt.age = 19
}else if (Age>= 70){
if (COPD == 1){
pt.copd =2
} else {
pt.copd = 0
}
total = pt.sbp + pt.bun+ pt.sodium +pt.age + pt.hr+ pt.copd
return(if (total < 79){
outcome = 0
} else {
outcome = 1
})
}
I am having problems writing the for loop that goes through trainset and applies the defined function to each row. I tried to code the loop like this:
for (i in 1:nrow(trainset)) {
  Systolic.BP[i] <- trainset$Systolic.blood.pressure[i]
  Sodium[i] <- trainset$Blood.sodium[i]
  COPD[i] <- trainset$COPD[i]
  total[i] <- Systolic.BP[i] + Sodium[i] + COPD[i]
  outcome.gwtg.trainset[i] = m.gwtg(total[i])
}
I got quite confused writing the for loop. Thanks for any answers!
You should avoid explicit loops and row-wise constructs (e.g., *apply() and purrr::map()) whenever possible in R. R is designed to work with vectors, and element-by-element loops are usually much slower than vectorized operations.
Instead of this loop operation, do the following:
Recode each column into its transformed values
Add up the transformed values
Recode the transformed values into the 0/1 outcome
These operations should be done vector-wise to be efficient (and to avoid the tricky indexing problems you are encountering).
For example:
m.gwtg <- function(data) {
data <- dplyr::mutate(data,
pt.sbp = dplyr::case_when(
Systolic.BP >= 200 ~ 0,
Systolic.BP >= 190 ~ 2,
Systolic.BP >= 180 ~ 4,
Systolic.BP >= 170 ~ 6,
Systolic.BP >= 160 ~ 8,
Systolic.BP >= 150 ~ 9,
Systolic.BP >= 140 ~ 11,
Systolic.BP >= 130 ~ 13,
Systolic.BP >= 120 ~ 15,
Systolic.BP >= 110 ~ 17,
Systolic.BP >= 100 ~ 19,
Systolic.BP >= 90 ~ 21,
Systolic.BP >= 80 ~ 23,
Systolic.BP >= 70 ~ 24,
Systolic.BP >= 60 ~ 26,
Systolic.BP >= 50 ~ 28, # should Systolic.BP < 50 be NA or 28?
# else
TRUE ~ NA_real_
),
pt.bun = dplyr::case_when(
BUN >= 150 ~ 28,
BUN >= 140 ~ 27,
BUN >= 130 ~ 25,
BUN >= 120 ~ 23,
BUN >= 110 ~ 21,
BUN >= 100 ~ 19,
BUN >= 90 ~ 17,
BUN >= 80 ~ 15,
BUN >= 70 ~ 13,
BUN >= 60 ~ 11,
BUN >= 50 ~ 9,
BUN >= 40 ~ 8,
BUN >= 30 ~ 6,
BUN >= 20 ~ 4,
BUN >= 10 ~ 2,
BUN < 10 ~ 0,
# else
TRUE ~ NA_real_
),
pt.sodium = dplyr::case_when(
Sodium >= 139 ~ 0,
Sodium >= 137 ~ 1,
Sodium >= 134 ~ 2,
Sodium >= 131 ~ 3,
Sodium < 131 ~ 4,
# else
TRUE ~ NA_real_
),
pt.age = dplyr::case_when(
Age >= 110 ~ 28,
Age >= 100 ~ 25,
Age >= 90 ~ 22,
Age >= 80 ~ 19,
Age >= 70 ~ 17,
Age >= 60 ~ 14,
Age >= 50 ~ 11,
Age >= 40 ~ 8,
Age >= 30 ~ 6,
Age >= 20 ~ 3,
Age < 20 ~ 0,
# else
TRUE ~ NA_real_
),
pt.hr = dplyr::case_when(
HR >= 105 ~ 8,
HR >= 100 ~ 6,
HR >= 95 ~ 5,
HR >= 90 ~ 4,
HR >= 85 ~ 3,
HR >= 80 ~ 1,
HR < 80 ~ 0,
# else
TRUE ~ NA_real_
),
pt.copd = dplyr::case_when(
COPD == 1 ~ 2,
COPD == 0 ~ 0,
# else
TRUE ~ NA_real_
),
total = pt.sbp + pt.bun + pt.sodium + pt.age + pt.hr + pt.copd,
outcome = dplyr::if_else(total < 79, 0, 1)
)
return(data)
}
example_data <- data.frame(
Systolic.BP = c(170, 160, 200),
BUN = c(60, 150, 10),
Sodium = c(134, 131, 139),
Age = c(40, 80, 20),
HR = c(90, 105, 80),
COPD = c(1, 0, 0)
)
m.gwtg(example_data)
#>   Systolic.BP BUN Sodium Age  HR COPD pt.sbp pt.bun pt.sodium pt.age pt.hr
#> 1         170  60    134  40  90    1      6     11         2      8     4
#> 2         160 150    131  80 105    0      8     28         3     19     8
#> 3         200  10    139  20  80    0      0      2         0      3     1
#>   pt.copd total outcome
#> 1       2    33       0
#> 2       0    66       0
#> 3       0     6       0
Created on 2022-03-25 by the reprex package (v2.0.1)
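To run this on the question's trainset, the columns only need to be renamed to the names the helper expects; a sketch (the old names below are taken from the question and are assumptions, so adjust them to the names actually present in your data):
# rename question columns to what m.gwtg() expects (Age and COPD already match)
trainset_renamed <- dplyr::rename(trainset,
                                  Systolic.BP = Systolic.Blood.Pressure,
                                  BUN = Blood.Urea.Nitrogen,
                                  Sodium = Blood.Sodium,
                                  HR = heart.rR)
trainset_scored <- m.gwtg(trainset_renamed)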
This is a more concise way to calculate the outcome:
library(tidyverse)
calc_score <- function(systolic_bp, bun, sodium) {
systolic_bp_score <- case_when(
systolic_bp >= 200 ~ 0,
systolic_bp >= 190 ~ 2,
systolic_bp >= 180 ~ 4
)
bun_score <- case_when(
bun >= 150 ~ 28,
bun >= 140 ~ 27
)
sodium_score <- case_when(
sodium >= 139 ~ 0,
sodium >= 137 ~ 1
)
systolic_bp_score + bun_score + sodium_score
}
# example data
trainset <- tibble(
systolic_bp = c(180, 195),
bun = c(145, 180),
sodium = c(138, 140)
)
trainset %>%
mutate(
score = list(systolic_bp, bun, sodium) %>% pmap_dbl(calc_score),
outcome = as.numeric(score > 97)
)
#> # A tibble: 2 × 5
#> systolic_bp bun sodium score outcome
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 180 145 138 32 0
#> 2 195 180 140 30 0
Created on 2022-03-25 by the reprex package (v2.0.0)

How to group by and take the mean of a value by symmetrically looping forward and backward on the date value in R

I need to group the data on id and then replace the missing price values by iterating over the date values up and down: first look one date value up and down; if there is no data, go two date values up and down, and so on, until there is a mean value for all rows.
Input data :
df1 <- data.frame(
  id = c(11, 11, 11, 11, 11, 11, 11, 11, 555, 555, 555, 555, 555, 555, 555, 555, 555),
  Date = c("1-Jun", "18-Jun", "3-Jul", "4-Jul", "25-Jul", "3-Nov", "7-Nov", "28_Nov",
           "1-Jun", "18-Jun", "3-Jul", "4-Jul", "25-Jul", "3-Nov", "7-Nov", "28_Nov", "30-Nov"),
  price = c(NA, NA, 100, NA, 25, NA, 50, NA, 400, NA, NA, NA, NA, NA, NA, NA, 200)
)
Updated requirement:
Input data :
df1 <- data.frame(
  id = c(11, 11, 11, 11, 11, 11, 11, 11),
  Date = c("1-Jun", "5-Jun", "8-Jun", "9-Jun", "14-Jun", "16-Jun", "20-Jun", "21-Jun"),
  price = c(NA, NA, 100, NA, 50, NA, 200, NA)
)
I need to impute all missing dates between the available dates for each id and then go symmetrically up and down to impute the missing prices. Also, I do not always need the average of two values; e.g., when I go two dates up and down and see only one value, I would impute that value.
Please find below, with a reprex, one possible solution using the data.table library.
I built a function to make it easier to use.
Reprex
Code of the NA_imputations() function
library(data.table)
NA_imputations <- function(x) {
x[, rows := .I]
z <- x[, .I[!is.na(price)]]
id_1 <- z[-length(z)]
id_2 <- z[-1]
values <- x[z, .(price = price, id = id)]
values_1 <- values[-nrow(values)]
names(values_1) <- c("price_1", "id_o1")
values_2 <- values[-1]
names(values_2) <- c("price_2", "id_o2")
subtract <- z[-1] - z[-length(z)]
r <- data.table(id_1, values_1, id_2, values_2, subtract)
Results <- r[, `:=` (id_mean = fifelse(subtract > 2 & subtract %% 2 == 0, id_1+(subtract/2), (id_1+id_2)/2),
mean = fifelse(subtract >= 2 & subtract %% 2 == 0 & id_o1 == id_o2, (price_1+price_2)/2, NA_real_))
][, `:=` (price_1 = NULL, id_o1 = NULL, id_2 = NULL, price_2 = NULL, id_o2 = NULL)
][x, on = .(id_mean = rows)
][, price := fcoalesce(price, mean)
][, mean := NULL
][r[subtract > 2 & subtract %% 2 == 0,id_1]:r[subtract > 2 & subtract %% 2 == 0,id_mean-1], price := lapply(price, nafill, type = "nocb"), by = .(id)
][, price := nafill(price, type = "nocb"), by = .(id)
][, price := nafill(price, type = "locf")
][, `:=` (id_1 = NULL, id_mean = NULL, subtract = NULL)][]
return(Results)
}
Output of the NA_imputations() function
NA_imputations(df1)
#> id Date price
#> <num> <char> <num>
#> 1: 11 1-Jun 100.0
#> 2: 11 18-Jun 100.0
#> 3: 11 3-Jul 100.0
#> 4: 11 4-Jul 62.5
#> 5: 11 25-Jul 25.0
#> 6: 11 3-Nov 37.5
#> 7: 11 7-Nov 50.0
#> 8: 11 28_Nov 50.0
#> 9: 555 1-Jun 400.0
#> 10: 555 18-Jun 400.0
#> 11: 555 3-Jul 400.0
#> 12: 555 4-Jul 400.0
#> 13: 555 25-Jul 300.0
#> 14: 555 3-Nov 200.0
#> 15: 555 7-Nov 200.0
#> 16: 555 28_Nov 200.0
#> 17: 555 30-Nov 200.0
Created on 2021-12-05 by the reprex package (v2.0.1)
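For comparison, here is a shorter nearest-neighbour sketch using only nafill() and fifelse(): for each NA it looks k rows up and down, averages when both neighbours sit at the same distance, and otherwise takes the single value found. It should reproduce the output above for the original df1, but note it does not handle the updated requirement of first expanding to all calendar dates:
library(data.table)
impute_nearest <- function(p) {
  idx <- seq_along(p)
  obs <- fifelse(is.na(p), NA_integer_, idx)
  prev <- nafill(obs, type = "locf")   # index of previous observed price
  nxt  <- nafill(obs, type = "nocb")   # index of next observed price
  d_prev <- idx - prev
  d_next <- nxt - idx
  fifelse(!is.na(p), p,
    fifelse(is.na(prev), p[nxt],
      fifelse(is.na(nxt), p[prev],
        fifelse(d_prev < d_next, p[prev],
          fifelse(d_next < d_prev, p[nxt], (p[prev] + p[nxt]) / 2)))))
}
setDT(df1)[, price := impute_nearest(price), by = id][]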

Recode continuous variable in R based on conditions

I want to "translate" syntax written in SPSS into R code, but I am a total beginner in R and am struggling to get it to work. The SPSS syntax is:
DO IF (Geschlecht = 0).
RECODE hang0 (SYSMIS=SYSMIS) (Lowest thru 22.99=0) (23 thru 55=1) (55.01 thru Highest=2)
INTO Hang.
ELSE IF (Geschlecht = 1).
RECODE hang0 (SYSMIS=SYSMIS) (Lowest thru 21.99=0) (22 thru 54=1) (54.01 thru Highest=2)
INTO Hang.
END IF.
I have installed the "car" package in R, but I neither get the "range" recoding to work (I have tried
td_new$Hang <- recode(td_new$hang0, "0:22.99=0; 23:55=1; else=2")
) nor do I manage to work with the if-else function. My last attempt was
if(td_new$Geschlecht == 0){
td_new$Hang <- td_new$hang0 = 3
} else if (td_new$Geschlecht == 1) {
td_new$Hang <- td_new$hang0 = 5)
} else
td_new$hang0 <- NA
(this was without the recoding, just to test the if-else function).
Would be very happy if someone helped!
Thanks a lot in advance :)!
Sorry, edited to add:
The data structure looks as follows:
Geschlecht hang0
0 15
1 45
1 7
0 11
And I want to recode hang0 such that
for boys (Geschlecht = 0): all values < 23 = 0, values between 23 and 55 = 1, all values > 55 = 2
and for girls (Geschlecht = 1): all values < 22 = 0, values between 22 and 54 = 1, all values > 54 = 2
Here's an approach with case_when:
library(dplyr)
td_new %>%
  mutate(Hang = case_when(Geschlecht == 0 & hang0 < 23 ~ 0,
                          Geschlecht == 0 & hang0 >= 23 & hang0 < 55 ~ 1,
                          Geschlecht == 0 & hang0 >= 55 ~ 2,
                          Geschlecht == 1 & hang0 < 22 ~ 0,
                          Geschlecht == 1 & hang0 >= 22 & hang0 < 54 ~ 1,
                          Geschlecht == 1 & hang0 >= 54 ~ 2,
                          TRUE ~ NA_real_))
# Geschlecht hang0 Hang
#1 0 15 0
#2 1 45 1
#3 1 7 0
#4 0 11 0
The final line is there to catch NAs.
Data
td_new <- structure(list(Geschlecht = c(0L, 1L, 1L, 0L), hang0 = c(15L, 45L, 7L, 11L)), class = "data.frame", row.names = c(NA, -4L))
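For reference, the same recode can also be written without dplyr, for example with base cut(); this is a sketch that assumes the td_new structure above and takes its cut points from the SPSS syntax in the question:
# cut() returns interval codes 1/2/3, shifted here to 0/1/2; the right-closed
# intervals mirror "Lowest thru 22.99", "23 thru 55", "55.01 thru Highest"
boy_breaks <- c(-Inf, 22.99, 55, Inf)
girl_breaks <- c(-Inf, 21.99, 54, Inf)
td_new$Hang <- ifelse(td_new$Geschlecht == 0,
                      cut(td_new$hang0, boy_breaks, labels = FALSE) - 1,
                      cut(td_new$hang0, girl_breaks, labels = FALSE) - 1)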

R: remove a subset of a dataframe from the original one with multiple conditions using for loop

I have a dataframe as follows (the dput of the original table is quite big, hence this small example):
Date Sales Depo
2020-01 100 ABC
2020-02 125 ABC
2020-03 0 ABC
2020-04 0 ABC
2020-01 0 BBC
2020-02 0 BBC
2020-03 0 BBC
2020-04 5 BBC
I want to remove all the records pertaining to BBC based on the following conditions:
either sum(Sales) <= max(Sales), or the count of rows with zero Sales exceeds 80% of the total row count.
The above rule should be applied for each Depo.
So the resultant df would be
Date Sales Depo
2020-01 100 ABC
2020-02 125 ABC
2020-03 0 ABC
2020-04 0 ABC
My Approach:
df_final = data.frame(Date = NULL, Sales = NULL, Depo = NULL)
for (v in unique(df$Depo)) {
  temp <- subset(data, Depo == v)
  temp_f <- temp[, colSums(Sales != 0) > 0]
  df_final <- rbind(df_final, temp_f)
}
But the above gives me a NULL data frame. Can anybody throw any light on this? How can I achieve the same result?
Using dplyr :
library(dplyr)
df %>%
  group_by(Depo) %>%
  filter((sum(Sales) > max(Sales)) & (sum(Sales == 0) < (0.8 * n())))

# The opposite can be written as:
# filter(!((sum(Sales) <= max(Sales)) | (sum(Sales == 0) > (0.8 * n()))))
The same logic can also be implemented in base R :
subset(df, as.logical(ave(Sales, Depo, FUN = function(x)
(sum(x) > max(x)) & (sum(x == 0) < (0.8 * length(x))))))
and data.table :
library(data.table)
setDT(df)[, .SD[(sum(Sales) > max(Sales)) & (sum(Sales == 0) < (0.8 * .N))], Depo]
data
df <- structure(list(Date = c("2020-01", "2020-02", "2020-03", "2020-04",
                              "2020-01", "2020-02", "2020-03", "2020-04"),
                     Sales = c(100L, 125L, 0L, 0L, 0L, 0L, 0L, 5L),
                     Depo = c("ABC", "ABC", "ABC", "ABC", "BBC", "BBC", "BBC", "BBC")),
                class = "data.frame", row.names = c(NA, -8L))
