I am trying to get a summary of my data based on combinations of two variables.
The following code used to work on the data:
df <- data_frame(fc = runif(1000, -5, 5),
padj = runif(1000, 0, 1))
df %>%
summarise(
dn_red = count(fc < -1.5, padj <= 0.1),
dn_pink = count(fc < -1.5, padj >= 0.1),
dn_blue = count(fc>-1.5 & fc< 0, padj <= 0.1),
dn_grey = count(fc>-1.5 & fc< 0, padj >= 0.1),
up_red = count(fc > 1.5, padj <= 0.1),
up_pink = count(fc > 1.5, padj >= 0.1),
up_blue = count(fc < 1.5 & fc > 0, padj <= 0.1),
up_grey = count(fc < 1.5 & fc > 0, padj >= 0.1)
)
Running it after a couple of months since writing it throws the following error:
Error: Problem with `summarise()` input `dn_red`.
x no applicable method for 'count' applied to an object of class "logical"
ℹ Input `dn_red` is `count(fc < -1.5, padj <= 0.1)`.
I can see that count outputs a tibble with logical vectors corresponding to the conditions. What I am trying to get out of it is a summary of the counts, where both the conditions are TRUE. The code above used to do just that...
You perhaps want sum instead of count!
set.seed(1)
df <- data.frame(fc = runif(1000, -5, 5),
padj = runif(1000, 0, 1))
df %>%
summarise(
dn_red = sum(fc < -1.5, padj <= 0.1),
dn_pink = sum(fc < -1.5, padj >= 0.1),
dn_blue = sum(fc>-1.5 & fc< 0, padj <= 0.1),
dn_grey = sum(fc>-1.5 & fc< 0, padj >= 0.1),
up_red = sum(fc > 1.5, padj <= 0.1),
up_pink = sum(fc > 1.5, padj >= 0.1),
up_blue = sum(fc < 1.5 & fc > 0, padj <= 0.1),
up_grey = sum(fc < 1.5 & fc > 0, padj >= 0.1)
)
dn_red dn_pink dn_blue dn_grey up_red up_pink up_blue up_grey
1 494 1250 269 1025 458 1214 267 1023
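But this double-counts: sum(a, b) adds the number of TRUE values in a to the number of TRUE values in b, so each total includes rows that meet only one of the conditions. To count rows where both conditions hold, combine them with & (or use | for "either") instead of a comma: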
df %>%
summarise(
dn_red = sum(fc < -1.5 & padj <= 0.1),
dn_pink = sum(fc < -1.5 & padj >= 0.1),
dn_blue = sum(fc>-1.5 & fc< 0 & padj <= 0.1),
dn_grey = sum(fc>-1.5 & fc< 0 & padj >= 0.1),
up_red = sum(fc > 1.5 & padj <= 0.1),
up_pink = sum(fc > 1.5 & padj >= 0.1),
up_blue = sum(fc < 1.5 & fc > 0 & padj <= 0.1),
up_grey = sum(fc < 1.5 & fc > 0 & padj >= 0.1)
)
dn_red dn_pink dn_blue dn_grey up_red up_pink up_blue up_grey
1 44 328 20 127 40 296 18 127
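To see why the comma and & give different totals, here is a minimal sketch on two small made-up logical vectors (a, b and the n_* names are illustrative, not columns from the question):

```r
# Two hypothetical logical vectors standing in for the two conditions
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

n_comma <- sum(a, b)   # adds all TRUEs in a and all TRUEs in b: 2 + 2 = 4
n_both  <- sum(a & b)  # counts positions where both are TRUE: 1
n_any   <- sum(a | b)  # counts positions where at least one is TRUE: 3
```

Only sum(a & b) corresponds to "rows where both conditions are TRUE".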
If these are the counts you expected, a tidier approach is to label each of the 1000 data points with one of the eight categories using case_when() and then count the labels:
df %>% mutate(new = case_when(
fc < -1.5 & padj <= 0.1 ~ 'dn_red',
fc < -1.5 & padj >= 0.1 ~ 'dn_pink',
fc > -1.5 & fc < 0 & padj <= 0.1 ~ 'dn_blue',
fc > -1.5 & fc < 0 & padj >= 0.1 ~'dn_grey',
fc > 1.5 & padj <= 0.1 ~ 'up_red',
fc > 1.5 & padj >= 0.1 ~ 'up_pink',
fc < 1.5 & fc > 0 & padj <= 0.1 ~ 'up_blue',
fc < 1.5 & fc > 0 & padj >= 0.1 ~ 'up_grey',
TRUE ~ 'others'
)) %>% count(new)
new n
1 dn_blue 20
2 dn_grey 127
3 dn_pink 328
4 dn_red 44
5 up_blue 18
6 up_grey 127
7 up_pink 296
8 up_red 40
Or, better, use janitor to get a frequency table with percentages and a total row:
df %>% mutate(new = case_when(
fc < -1.5 & padj <= 0.1 ~ 'dn_red',
fc < -1.5 & padj >= 0.1 ~ 'dn_pink',
fc > -1.5 & fc < 0 & padj <= 0.1 ~ 'dn_blue',
fc > -1.5 & fc < 0 & padj >= 0.1 ~'dn_grey',
fc > 1.5 & padj <= 0.1 ~ 'up_red',
fc > 1.5 & padj >= 0.1 ~ 'up_pink',
fc < 1.5 & fc > 0 & padj <= 0.1 ~ 'up_blue',
fc < 1.5 & fc > 0 & padj >= 0.1 ~ 'up_grey',
TRUE ~ 'others'
)) %>% janitor::tabyl(new) %>%
janitor::adorn_totals()
new n percent
dn_blue 20 0.020
dn_grey 127 0.127
dn_pink 328 0.328
dn_red 44 0.044
up_blue 18 0.018
up_grey 127 0.127
up_pink 296 0.296
up_red 40 0.040
Total 1000 1.000
I have a dataset named trainset that has 50 variables. For each row, I need to sum up the values under columns called Systolic.Blood.Pressure, Blood.Urea.Nitrogen, Blood.Sodium, Age, heart.rR and COPD values. I already defined a function to sum these up:
m.gwtg = function(Systolic.BP, Sodium, COPD){
if (Systolic.BP>=200){
pt.sbp = 0
}else if (Systolic.BP>= 190){
pt.sbp = 2
}else if (Systolic.BP>= 180){
pt.sbp = 4
}else if (Systolic.BP>= 170){
pt.sbp = 6
}else if (Systolic.BP>= 160){
pt.sbp = 8
}else if (Systolic.BP>= 150){
pt.sbp = 9
}else if (Systolic.BP>= 140){
if (Sodium>=139){
pt.sodium = 0
}else if (Sodium>= 137){
pt.sodium = 1
}else if (Sodium>= 134){
pt.sodium = 2
}
if (Age>=110){
pt.age = 28
}else if (Age>= 100){
pt.age = 25
}else if (Age>= 90){
pt.age = 22
}else if (Age>= 80){
pt.age = 19
}else if (Age>= 70){
if (COPD == 1){
pt.copd =2
} else {
pt.copd = 0
}
total = pt.sbp + pt.bun+ pt.sodium +pt.age + pt.hr+ pt.copd
return(if (total < 79){
outcome = 0
} else {
outcome = 1
})
}
I am having trouble writing a for loop that goes through trainset and applies this function to each row. I tried to code the loop like this:
for (i in 1:nrow(trainset)) {
Systolic.BP[i] <- trainset$Systolic.blood.pressure[i]
Sodium[i] <- trainset$Blood.sodium[i]
COPD[i] <- trainset$COPD[i]
total[i] <- Systolic.BP[i]+ Sodium[i] +COPD[i]
outcome.gwtg.trainset[i]= m.gwtg(total[i])
}
I got quite confused about how to write the for loop. Thanks for any answers!
In R, prefer vectorized operations over explicit loops and element-wise helpers (e.g., *apply() and purrr::map()) where you can. R is designed to work with whole vectors, and row-by-row loops are usually slower and harder to index correctly.
Instead of this loop operation, do the following:
Recode each column into its transformed values
Add up the transformed values
Recode the transformed values into the 0/1 outcome
These operations should be done vector-wise to be efficient (and to avoid the tricky indexing problems you are encountering).
For example:
m.gwtg <- function(data) {
data <- dplyr::mutate(data,
pt.sbp = dplyr::case_when(
Systolic.BP >= 200 ~ 0,
Systolic.BP >= 190 ~ 2,
Systolic.BP >= 180 ~ 4,
Systolic.BP >= 170 ~ 6,
Systolic.BP >= 160 ~ 8,
Systolic.BP >= 150 ~ 9,
Systolic.BP >= 140 ~ 11,
Systolic.BP >= 130 ~ 13,
Systolic.BP >= 120 ~ 15,
Systolic.BP >= 110 ~ 17,
Systolic.BP >= 100 ~ 19,
Systolic.BP >= 90 ~ 21,
Systolic.BP >= 80 ~ 23,
Systolic.BP >= 70 ~ 24,
Systolic.BP >= 60 ~ 26,
Systolic.BP >= 50 ~ 28, # should Systolic.BP < 50 be NA or 28?
# else
TRUE ~ NA_real_
),
pt.bun = dplyr::case_when(
BUN >= 150 ~ 28,
BUN >= 140 ~ 27,
BUN >= 130 ~ 25,
BUN >= 120 ~ 23,
BUN >= 110 ~ 21,
BUN >= 100 ~ 19,
BUN >= 90 ~ 17,
BUN >= 80 ~ 15,
BUN >= 70 ~ 13,
BUN >= 60 ~ 11,
BUN >= 50 ~ 9,
BUN >= 40 ~ 8,
BUN >= 30 ~ 6,
BUN >= 20 ~ 4,
BUN >= 10 ~ 2,
BUN < 10 ~ 0,
# else
TRUE ~ NA_real_
),
pt.sodium = dplyr::case_when(
Sodium >= 139 ~ 0,
Sodium >= 137 ~ 1,
Sodium >= 134 ~ 2,
Sodium >= 131 ~ 3,
Sodium < 131 ~ 4,
# else
TRUE ~ NA_real_
),
pt.age = dplyr::case_when(
Age >= 110 ~ 28,
Age >= 100 ~ 25,
Age >= 90 ~ 22,
Age >= 80 ~ 19,
Age >= 70 ~ 17,
Age >= 60 ~ 14,
Age >= 50 ~ 11,
Age >= 40 ~ 8,
Age >= 30 ~ 6,
Age >= 20 ~ 3,
Age < 20 ~ 0,
# else
TRUE ~ NA_real_
),
pt.hr = dplyr::case_when(
HR >= 105 ~ 8,
HR >= 100 ~ 6,
HR >= 95 ~ 5,
HR >= 90 ~ 4,
HR >= 85 ~ 3,
HR >= 80 ~ 1,
HR < 80 ~ 0,
# else
TRUE ~ NA_real_
),
pt.copd = dplyr::case_when(
COPD == 1 ~ 2,
COPD == 0 ~ 0,
# else
TRUE ~ NA_real_
),
total = pt.sbp + pt.bun + pt.sodium + pt.age + pt.hr + pt.copd,
outcome = dplyr::if_else(total < 79, 0, 1)
)
return(data)
}
example_data <- data.frame(
Systolic.BP = c(170, 160, 200),
BUN = c(60, 150, 10),
Sodium = c(134, 131, 139),
Age = c(40, 80, 20),
HR = c(90, 105, 80),
COPD = c(1, 0, 0)
)
m.gwtg(example_data)
#> Systolic.BP BUN Sodium Age HR COPD pt.sbp pt.bun pt.sodium pt.age pt.hr
#> 1 170 60 134 40 90 1 6 11 2 8 4
#> 2 160 150 131 80 105 0 8 28 3 19 8
#> 3 200 10 139 20 80 0 0 2 0 3 1
#> pt.copd total outcome
#> 1 2 33 0
#> 2 0 66 0
#> 3 0 6 0
Created on 2022-03-25 by the reprex package (v2.0.1)
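As a possible base R alternative to a long case_when() ladder, findInterval() can look up which band a value falls into. The sketch below reuses the heart-rate cut-offs from the function above; hr_breaks, hr_scores and hr_points are names introduced here for illustration, and it assumes every band is closed on the left (>=), as in the case_when() version.

```r
# Lower bounds of the heart-rate bands, in increasing order
hr_breaks <- c(80, 85, 90, 95, 100, 105)
# Points for: below 80, then one entry per band starting at each break
hr_scores <- c(0, 1, 3, 4, 5, 6, 8)

# findInterval() returns how many breaks are <= the value (0 if none),
# so adding 1 gives the index into hr_scores. NA input yields NA output.
hr_points <- function(hr) {
  hr_scores[findInterval(hr, hr_breaks) + 1]
}

hr_points(c(79, 80, 90, 105))  # 0 1 4 8
```

The same pattern would apply to the other banded columns by swapping in their break and score vectors.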
This is a more concise way to calculate the outcome:
library(tidyverse)
calc_score <- function(systolic_bp, bun, sodium) {
systolic_bp_score <- case_when(
systolic_bp >= 200 ~ 0,
systolic_bp >= 190 ~ 2,
systolic_bp >= 180 ~ 4
)
bun_score <- case_when(
bun >= 150 ~ 28,
bun >= 140 ~ 27
)
sodium_score <- case_when(
sodium >= 139 ~ 0,
sodium >= 137 ~ 1
)
systolic_bp_score + bun_score + sodium_score
}
# example data
trainset <- tibble(
systolic_bp = c(180, 195),
bun = c(145, 180),
sodium = c(138, 140)
)
trainset %>%
mutate(
score = list(systolic_bp, bun, sodium) %>% pmap_dbl(calc_score),
outcome = as.numeric(score >= 79) # cut-off taken from the question: total < 79 gives 0
)
#> # A tibble: 2 × 5
#> systolic_bp bun sodium score outcome
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 180 145 138 32 0
#> 2 195 180 140 30 0
Created on 2022-03-25 by the reprex package (v2.0.0)
I have a data frame like this:
Input_df <- data.frame(Enl_ID = c("INTS121410", "INTS175899", "INTS171428", "INTS156006", "INTS196136", "INTS114771" ), `CN4244` = c(5, 0, -0.4, -0.6, 10, 2), `CN4249` = c(10, -4, -10, -2, 6, 0), `CN4250` = c(40, 10, 4, -10, 0, 4))
I'm trying to rescale the positive values to the range 0 to 1 and the negative values to the range 0 to -1, so the output would look like
Output_df <- data.frame(Enl_ID = c("INTS121410", "INTS175899", "INTS171428", "INTS156006", "INTS196136", "INTS114771" ), `CN4244` = c(0.5, 0, -0.66, -1, 1, 0.2), `CN4249` = c(1, -0.4, -1, -0.2, 0.6, 0), `CN4250` = c(1, 0.25, 0.1, -1, 0, 0.1))
I found a few examples on Stack Overflow, but they only handle a single column. My file has almost 2000 columns, so rescaling each one manually is not feasible.
Any idea how to do it?
Any help would be appreciated. Thanks in advance
You could use
library(dplyr)
Input_df %>%
mutate(across(starts_with("CN"), ~.x / max(abs(.x))))
This returns
Enl_ID CN4244 CN4249 CN4250
1 INTS121410 0.50 1.0 1.00
2 INTS175899 0.00 -0.4 0.25
3 INTS171428 -0.04 -1.0 0.10
4 INTS156006 -0.06 -0.2 -0.25
5 INTS196136 1.00 0.6 0.00
6 INTS114771 0.20 0.0 0.10
Or, if you want different rescaling factors for positive and negative values:
Input_df %>%
mutate(across(starts_with("CN"),
~case_when(.x >= 0 ~ .x / max(.x),
TRUE ~ - .x / min(.x))))
This returns
Enl_ID CN4244 CN4249 CN4250
1 INTS121410 0.5000000 1.0 1.00
2 INTS175899 0.0000000 -0.4 0.25
3 INTS171428 -0.6666667 -1.0 0.10
4 INTS156006 -1.0000000 -0.2 -1.00
5 INTS196136 1.0000000 0.6 0.00
6 INTS114771 0.2000000 0.0 0.10
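For readers who want to check the sign-specific rescaling on a single vector without dplyr, here is a minimal base R sketch. rescale_signed is a hypothetical helper name, and the choice to leave zeros and NAs unchanged is an assumption of this sketch rather than something stated in the question.

```r
# Scale positive values by the largest positive value and negative values
# by the magnitude of the most negative value.
rescale_signed <- function(x) {
  out <- x                     # zeros and NAs pass through unchanged
  pos <- which(x > 0)          # which() drops NAs
  neg <- which(x < 0)
  if (length(pos) > 0) out[pos] <- x[pos] / max(x[pos])
  if (length(neg) > 0) out[neg] <- x[neg] / abs(min(x[neg]))
  out
}

# First column of the example data (CN4244)
scaled <- rescale_signed(c(5, 0, -0.4, -0.6, 10, 2))

# Applying it to every column whose name starts with "CN" (toy data frame)
toy <- data.frame(id = c("a", "b", "c"), CN1 = c(4, -2, 0), CN2 = c(-1, -4, 8))
cn_cols <- grep("^CN", names(toy))
toy[cn_cols] <- lapply(toy[cn_cols], rescale_signed)
```

The length checks avoid dividing by the maximum of an empty set when a column has no positive or no negative values.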
I want to "translate" a syntax written in SPSS into R code but am a total beginner in R and struggling to get it to work.
The SPSS syntax is
DO IF (Geschlecht = 0).
RECODE hang0 (SYSMIS=SYSMIS) (Lowest thru 22.99=0) (23 thru 55=1) (55.01 thru Highest=2)
INTO Hang.
ELSE IF (Geschlecht = 1).
RECODE hang0 (SYSMIS=SYSMIS) (Lowest thru 21.99=0) (22 thru 54=1) (54.01 thru Highest=2)
INTO Hang.
END IF.
I have installed the "car" package in R, but I cannot get the "range" recoding to work. I have tried
td_new$Hang <- recode(td_new$hang0, "0:22.99=0; 23:55=1; else=2")
I also cannot get the if-else construct to work. My last attempt was
if(td_new$Geschlecht == 0){
td_new$Hang <- td_new$hang0 = 3
} else if (td_new$Geschlecht == 1) {
td_new$Hang <- td_new$hang0 = 5)
} else
td_new$hang0 <- NA
(this was without the recoding, just to test the if-else function).
Would be very happy if someone helped!
Thanks a lot in advance :)!
Sorry, edited to add:
The data structure looks as follows:
Geschlecht hang0
0 15
1 45
1 7
0 11
And I want to recode hang0 such that
for boys (Geschlecht = 0): all values < 23 = 0, values between 23 and 55 = 1, all values > 55 = 2
and for girls (Geschlecht = 1): all values < 22 = 0, values between 22 and 54 = 1, all values > 54 = 2
Here's an approach with case_when:
library(dplyr)
td_new %>%
mutate(Hang = case_when(Geschlecht == 0 & hang0 < 23 ~ 0,
Geschlecht == 0 & hang0 >= 23 & hang0 <= 55 ~ 1,
Geschlecht == 0 & hang0 > 55 ~ 2,
Geschlecht == 1 & hang0 < 22 ~ 0,
Geschlecht == 1 & hang0 >= 22 & hang0 <= 54 ~ 1,
Geschlecht == 1 & hang0 > 54 ~ 2,
TRUE ~ NA_real_))
# Geschlecht hang0 Hang
#1 0 15 0
#2 1 45 1
#3 1 7 0
#4 0 11 0
The final TRUE ~ NA_real_ line catches any row that matches none of the conditions, including rows where Geschlecht or hang0 is NA.
Data
td_new <- structure(list(Geschlecht = c(0L, 1L, 1L, 0L), hang0 = c(15L, 45L, 7L, 11L)), class = "data.frame", row.names = c(NA, -4L))
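A possible base R sketch of the same recode, without dplyr. It assumes Geschlecht only takes the values 0 and 1 (anything else yields NA) and that values falling in the small gaps of the SPSS ranges (e.g. between 22.99 and 23) count as below the lower cut-off. lower and upper are helper names introduced here for illustration.

```r
# Same example data as above
td_new <- data.frame(Geschlecht = c(0L, 1L, 1L, 0L),
                     hang0 = c(15L, 45L, 7L, 11L))

# Sex-specific cut-offs, one value per row
lower <- ifelse(td_new$Geschlecht == 0, 23, ifelse(td_new$Geschlecht == 1, 22, NA))
upper <- ifelse(td_new$Geschlecht == 0, 55, ifelse(td_new$Geschlecht == 1, 54, NA))

# Each comparison contributes 0 or 1, so the sum is 0, 1 or 2;
# an NA in hang0 or in the cut-offs propagates to Hang
td_new$Hang <- (td_new$hang0 >= lower) + (td_new$hang0 > upper)
```

Keeping the cut-offs in per-row vectors means the comparison itself is written only once for both groups.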
I have data set with 3 features as below:
V1 V2 V3
0.268 0.917 0.191
0.975 0.467 0.447
0.345 0.898 0.984
0.901 0.043 0.456
0.243 0.453 0.964
0.001 0.464 0.953
0.998 0.976 0.978
0.954 0.932 0.923
How can I plot this data in a 3D graphic, giving a different colour to the points that meet each of the following conditions?
(v1>=0.90 && v3>=0.90 && v3>=0.90) || (v1>=0.90 && v3< 0.50 && v3< 0.50) || (v1 < 0.50 && v3>=0.90 && v3< 0.50)|| (v1< 0.50 && v3< 0.50 && v3>=0.90)
I assumed the second statement in each condition is referring to V2, which makes more sense. To color the points according to which condition is met first you need to create a column with that value:
df = data.frame(
"V1" = c(0.268,0.975,0.345,0.901,0.243,0.001,0.998,0.954),
"V2" = c(0.917,0.467,0.898,0.043,0.453,0.464,0.976,0.932),
"V3" = c(0.191,0.447,0.984,0.456,0.964,0.953,0.978,0.923)
)
library(dplyr)
df = df %>%
mutate(
group = case_when(
V1 >= 0.9 & V2 >= 0.9 & V3 >=0.9 ~ "1",
V1 >= 0.9 & V2 < 0.5 & V3 < 0.5 ~ "2",
V1 < 0.5 & V2 >= 0.9 & V3 <0.5 ~ "3",
V1 <0.5 & V2 <0.5 & V3 >=0.9 ~ "4",
TRUE ~ "5"
))
Then we can use the plotly or scatterplot3d packages to build the graph:
library(scatterplot3d)
scatterplot3d(x=df$V1,y=df$V2,z=df$V3,color=df$group)
library(plotly)
plot_ly(x=df$V1,y=df$V2,z=df$V3,color = df$group)
You can start by creating a logical vector using the vectorized operators & and |:
# Create the logical vector
# Assumes the second term of each group was meant to test v2 (the question repeats v3)
ind <- (mat$v1>=0.90 & mat$v2>=0.90 & mat$v3>=0.90) | (mat$v1>=0.90 & mat$v2< 0.50 & mat$v3< 0.50) |
(mat$v1 < 0.50 & mat$v2>=0.90 & mat$v3< 0.50) | (mat$v1< 0.50 & mat$v2< 0.50 & mat$v3>=0.90)
Now you can plot the selected points, e.g. using plotly:
# plot
plotly::plot_ly(x = mat$v1[ind], y = mat$v2[ind], z = mat$v3[ind])
With the data
mat = structure(list(v1 = c(0.268, 0.975, 0.345, 0.901, 0.243, 0.001,
0.998, 0.954), v2 = c(0.917, 0.467, 0.898, 0.043, 0.453, 0.464,
0.976, 0.932), v3 = c(0.191, 0.447, 0.984, 0.456, 0.964, 0.953,
0.978, 0.923)), class = "data.frame", row.names = c(NA, -8L))
library(dplyr)
library(data.table)
df <- data.frame(year = rep(2003:2004, each = 4),
mg = rep(rep(c("a", "b"), each = 2), times = 2),
plant_date = c(20, 30, 20, 30, 33, 40, 33, 40),
stage1 = c(40, 50, 42, 52, 43, 55, 48, 57),
stage2 = c(55, 65, 57, 66, 58, 68, 59, 65),
stage3 = c(61, 75, 63, 76, 66, 77, 68, 79))
set.seed(123)
dat <- data.frame(year = rep(2003:2004, each = 365), doy = rep(1:365, times = 2),
rainfall = sample(0:20, 730, replace = T))
final.dat <- dat %>% dplyr::left_join(df)
I have used dplyr to do some calculations as follows:
final.dat %>% dplyr::group_by(year, plant_date, mg) %>%
dplyr::summarise(
sum_rain_stage1 = sum(rainfall[doy >= plant_date & doy <= stage1]),
sum_rain_stage2 = sum(rainfall[doy >= plant_date & doy <= stage2]),
mean_rain_stage1 = mean(rainfall[doy >= plant_date & doy <= stage1]),
mean_rain_stage2 = mean(rainfall[doy >= plant_date & doy <= stage2]),
var.x1 = (sum_rain_stage1 > sum_rain_stage2) * (mean_rain_stage1 - mean_rain_stage2),
var.x2 = (mean_rain_stage1 > mean_rain_stage2) * (sum_rain_stage1 - sum_rain_stage2))
I want to do the same thing using data.table
final.dat <- data.table(final.dat)
final.dat[, j = list(sum(rainfall[doy >= plant_date & doy <= stage1]),
sum(rainfall[doy >= plant_date & doy <= stage2]),
mean(rainfall[doy >= plant_date & doy <= stage1]),
mean(rainfall[doy >= plant_date & doy <= stage2])),
by = list(year, plant_date, mg)]
I have two questions:
1) How can I keep my column names as in dplyr instead of the default V1, V2, V3 and V4.
2) I do not know how to implement this part of dplyr in data.table
var.x1 = (sum_rain_stage1 > sum_rain_stage2) * (mean_rain_stage1 - mean_rain_stage2),
var.x2 = (mean_rain_stage1 > mean_rain_stage2) * (sum_rain_stage1 - sum_rain_stage2))
Thanks
dt = as.data.table(final.dat) # or setDT to convert in place
dt[, .(sum_rain_stage1 = sum(rainfall[doy >= plant_date & doy <= stage1]),
sum_rain_stage2 = sum(rainfall[doy >= plant_date & doy <= stage2]),
mean_rain_stage1 = mean(rainfall[doy >= plant_date & doy <= stage1]),
mean_rain_stage2 = mean(rainfall[doy >= plant_date & doy <= stage2]))
, by = .(year, plant_date, mg)][
, `:=`(var.x1 = (sum_rain_stage1 > sum_rain_stage2) * (mean_rain_stage1 - mean_rain_stage2),
var.x2 = (mean_rain_stage1 > mean_rain_stage2) * (sum_rain_stage1 - sum_rain_stage2))][]
Backing up a step, your approach to conditions like doy >= plant_date & doy <= stage1 can be simplified in two ways...
First, there's doy %between% list(plant_date, stage1) so you don't have to type the var name twice.
Second, since doy and the other columns come from different tables, you can do this inside a non-equi join to update df instead of making new table final.dat:
setDT(df)
setDT(dat)
df[, c("sum_rs1", "mean_rs1") :=
dat[df, on=.(doy >= plant_date, doy <= stage1), .(sum(rainfall), mean(rainfall)), by=.EACHI][, .(V1, V2)]]
df[, c("sum_rs2", "mean_rs2") :=
dat[df, on=.(doy >= plant_date, doy <= stage2), .(sum(rainfall), mean(rainfall)), by=.EACHI][, .(V1, V2)]]
# same as @eddi's
df[, `:=`(
var.x1 = (sum_rs1 > sum_rs2) * (mean_rs1 - mean_rs2),
var.x2 = (mean_rs1 > mean_rs2) * (sum_rs1 - sum_rs2)
)]
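To illustrate the %between% shorthand mentioned above, here is a small sketch on a made-up vector. It assumes the data.table package is installed; doy, lower, upper and the in_range_* names are toy values introduced here, not the question's data.

```r
library(data.table)

doy <- 1:10

# Long form: the variable name is typed twice
in_range_long <- doy >= 3 & doy <= 6

# Shorthand with scalar bounds (inclusive on both ends by default)
in_range_short <- doy %between% c(3, 6)

# Bounds can also vary per element when passed as a list of two vectors
lower <- rep(3, length(doy))
upper <- rep(6, length(doy))
in_range_vec <- doy %between% list(lower, upper)
```

All three results mark positions 3 to 6 as TRUE, so the shorthand can replace the two-sided comparison wherever it appears in j.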
To keep the column names as in summarise, name the elements of the list in j, just as you would name the elements of an ordinary list:
list(a = 1:2, b = 3:5)
Similarly,
final.dat[, list(sum_rain_stage1 = sum(rainfall[doy >= plant_date & doy <= stage1]),
sum_rain_stage2 = sum(rainfall[doy >= plant_date & doy <= stage2]),
mean_rain_stage1 = mean(rainfall[doy >= plant_date & doy <= stage1]),
mean_rain_stage2 = mean(rainfall[doy >= plant_date & doy <= stage2])),
by = list(year, plant_date, mg)]
# year plant_date mg sum_rain_stage1 sum_rain_stage2 mean_rain_stage1 mean_rain_stage2
#1: 2003 20 a 251 355 11.952381 9.861111
#2: 2003 30 a 176 304 8.380952 8.444444
#3: 2003 20 b 261 361 11.347826 9.500000
#4: 2003 30 b 185 313 8.043478 8.459459
#5: 2004 33 a 109 247 9.909091 9.500000
#6: 2004 40 a 134 279 8.375000 9.620690
#7: 2004 33 b 157 253 9.812500 9.370370
#8: 2004 40 b 158 242 8.777778 9.307692