Creating a column with factor variables conditional on multiple other columns? - r

I have 4 columns, called Amplification, CNV.Gain, Homozygous.Deletion.Frequency, Heterozygous.Deletion.Frequency. I want to create a new column in which, if any of the values in these 4 columns is:
greater than or equal to 5 and less than or equal to 10, it returns low
greater than 10 and less than or equal to 20, it returns medium
greater than 20, it returns high
An example of the final table (long_fused) would look like this:
CNV.Gain  Amplification  Homozygous.Deletion.Frequency  Heterozygous.Deletion.Frequency  Threshold
3         5              10                             0                                Low
0         0              11                             8                                Medium
7         16             25                             0                                High
So far, I've tried the following code. Although it seems to fill in the "Threshold" column, it does so incorrectly.
library(dplyr)
long_fused <- long_fused %>%
mutate(Percent_sample_altered = case_when(
Amplification>=5 & Amplification < 10 & CNV.gain>=5 & CNV.gain < 10 | CNV.gain>=5 & CNV.gain<=10 & Homozygous.Deletion.Frequency>=5 & Homozygous.Deletion.Frequency<=10| Heterozygous.Deletion.Frequency>=5 & Heterozygous.Deletion.Frequency<=10 ~ 'Low',
Amplification>= 10 & Amplification<20 |CNV.gain>=10 & CNV.gain<20| Homozygous.Deletion.Frequency>= 10 & Homozygous.Deletion.Frequency<20 | Heterozygous.Deletion.Frequency>=10 & Heterozygous.Deletion.Frequency<20 ~ 'Medium',
Amplification>20 | CNV.gain >20 | Homozygous.Deletion.Frequency >20 | Heterozygous.Deletion.Frequency>20 ~ 'High'))
As always any help is appreciated!
Data in dput format
long_fused <-
structure(list(CNV.Gain = c(3L, 0L, 7L), Amplification = c(5L,
0L, 16L), Homozygous.Deletion.Frequency = c(10L, 11L, 25L),
Heterozygous.Deletion.Frequency = c(0L, 8L, 0L), Threshold =
c("Low", "Medium", "High")), class = "data.frame",
row.names = c(NA, -3L))

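Here is a way with rowwise followed by the base function cut.
library(dplyr)
long_fused %>%
  rowwise() %>%
  # row maximum across all numeric columns
  mutate(new = max(c_across(-Threshold)),
         # bins are [5,10], (10,20], (20,Inf]; include.lowest = TRUE keeps a maximum of exactly 5 in "Low"
         new = cut(new, c(5, 10, 20, Inf), labels = c("Low", "Medium", "High"), include.lowest = TRUE)) %>%
  ungroup()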
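To see how cut() treats the interval boundaries, here is a small base R sketch. The test values in x are made up for illustration and are not from the question. With the default right = TRUE the intervals are closed on the right, and include.lowest = TRUE additionally closes the first interval at 5; values below 5 fall outside all breaks and come back as NA.

```r
# Hypothetical test values chosen to hit every boundary
x <- c(4, 5, 10, 11, 20, 21)

# right = TRUE (the default) gives intervals (a, b];
# include.lowest = TRUE also closes the first one on the left: [5, 10]
res <- cut(x, breaks = c(5, 10, 20, Inf),
           labels = c("Low", "Medium", "High"),
           include.lowest = TRUE)

as.character(res)
# NA "Low" "Low" "Medium" "Medium" "High"
```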

Here's an alternative using case_when -
library(dplyr)
long_fused %>%
  mutate(max = do.call(pmax, select(., -Threshold)),
         #If you don't have Threshold column in your data just use .
         #mutate(max = do.call(pmax, .),
         #Rows whose maximum is below 5 match no condition and get NA
         Threshold = case_when(max >= 5 & max <= 10 ~ 'Low',
                               max > 10 & max <= 20 ~ 'Medium',
                               max > 20 ~ 'High'))
# CNV.Gain Amplification Homozygous.Deletion.Frequency
#1 3 5 10
#2 0 0 11
#3 7 16 25
# Heterozygous.Deletion.Frequency max Threshold
#1 0 10 Low
#2 8 11 Medium
#3 0 25 High
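For readers unfamiliar with the do.call(pmax, ...) idiom, the following base R sketch shows what it computes, using the numeric columns of the example data. The names vals, row_max_vec and row_max_apply are only illustrative.

```r
# Numeric columns of the example data from the question
vals <- data.frame(CNV.Gain = c(3L, 0L, 7L),
                   Amplification = c(5L, 0L, 16L),
                   Homozygous.Deletion.Frequency = c(10L, 11L, 25L),
                   Heterozygous.Deletion.Frequency = c(0L, 8L, 0L))

# pmax() compares its arguments element-wise; do.call() passes each
# column of the data frame as a separate argument
row_max_vec <- do.call(pmax, vals)

# Row-by-row equivalent, for comparison
row_max_apply <- unname(apply(vals, 1, max))

row_max_vec
# 10 11 25
```

Both give the same row maxima; the pmax version is vectorised and avoids looping over rows.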

Related

Making a table that contains Mean and SD of a Dataset

I am using this dataset: http://www.openintro.org/stat/data/cdc.R
to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:
Mean Standard Deviation
Age: 44.27 16.715
Height: 70.25 3.009219
Weight: 189.3 36.55036
Desired Weight: 178.6 26.25121
I created a subset for males and females with this code:
mdata <- subset(cdc, cdc$gender == ("m"))
fdata <- subset(cdc, cdc$gender == ("f"))
How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?
The data frame you provided sucked up all the memory on my laptop, and it's not needed to provide that much data to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:
library(dplyr)
library(tidyr)
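starwars |>
  group_by(sex) |>
  summarise(across(
    where(is.numeric),
    # anonymous functions carry na.rm, since passing extra arguments through across() is deprecated
    .fns = list(Mean = \(x) mean(x, na.rm = TRUE), SD = \(x) sd(x, na.rm = TRUE)),
    .names = "{col}__{fn}"
  )) |>
  pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")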
# A tibble: 15 × 4
sex var Mean SD
<chr> <chr> <dbl> <dbl>
1 female height 169. 15.3
2 female mass 54.7 8.59
3 female birth_year 47.2 15.0
4 hermaphroditic height 175 NA
5 hermaphroditic mass 1358 NA
6 hermaphroditic birth_year 600 NA
7 male height 179. 36.0
8 male mass 81.0 28.2
9 male birth_year 85.5 157.
10 none height 131. 49.1
11 none mass 69.8 51.0
12 none birth_year 53.3 51.6
13 NA height 181. 2.89
14 NA mass 48 NA
15 NA birth_year 62 NA
Just make a data frame of the column means and column SDs. Note that subset() can also select columns.
fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)
res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
# mean sd
# age 44.27307 16.719940
# height 70.25165 3.009219
# weight 189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------
# cdc$gender: f
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
To extract only one of the data frames in the list-like object use e.g. res$m.
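The remark that by is basically a combination of split and lapply can be sketched as follows. This uses a reduced copy of the six-row sample data (three columns only), and the names cdc_small, pieces and res_list are illustrative, not part of the answer.

```r
# Six-row sample of cdc, reduced to three columns for brevity
cdc_small <- data.frame(
  height = c(69, 66, 73, 65, 67, 69),
  weight = c(224, 215, 200, 216, 165, 170),
  gender = factor(c("m", "f", "m", "f", "f", "m"), levels = c("m", "f"))
)

# split() cuts the numeric columns into one data frame per gender ...
pieces <- split(cdc_small[c("height", "weight")], cdc_small$gender)

# ... and lapply() builds the mean/sd table for each piece
res_list <- lapply(pieces, function(x) {
  data.frame(mean = colMeans(x), sd = sapply(x, sd))
})

res_list$m
```

The result is a plain named list, so res_list$m and res_list$f work the same way as with by.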
Usually we use aggregate for this, which you also might consider:
aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
do.call(what=data.frame)
# gender age.mean age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1 m 44.27307 16.71994 70.251646 3.009219 189.32271 36.55036 178.61657 26.25121
# 2 f 45.79772 17.58442 64.367750 2.787304 151.66619 34.29752 133.51500 18.96301
The pipe |> do.call(what=data.frame) is just needed to get rid of matrix columns, which is useful in case you aim to further process the data.
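What "matrix columns" means here can be seen in a small sketch on a reduced copy of the sample data (cdc_small, agg and flat are illustrative names). When the summary function returns more than one value, aggregate stores each summarised variable as a single matrix-valued column.

```r
# Six-row sample of cdc, reduced to three columns for brevity
cdc_small <- data.frame(
  height = c(69, 66, 73, 65, 67, 69),
  weight = c(224, 215, 200, 216, 165, 170),
  gender = factor(c("m", "f", "m", "f", "f", "m"), levels = c("m", "f"))
)

agg <- aggregate(cbind(height, weight) ~ gender, cdc_small,
                 function(x) c(mean = mean(x), sd = sd(x)))

ncol(agg)              # 3: gender plus two matrix columns
is.matrix(agg$height)  # TRUE

# do.call(data.frame, ...) expands each matrix into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
# "gender" "height.mean" "height.sd" "weight.mean" "weight.sd"
```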
Note: R >= 4.1 used.
Data:
source('https://www.openintro.org/stat/data/cdc.R')
or
cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent",
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0,
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1,
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L,
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L,
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L,
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995",
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")

How To return the true condition only of a result on a list in r?

I have a problem with my R code.
I have a list named Bought_list that holds customer and checkout (checkout is a data frame).
This is what the checkout data frame looks like:
items price qty total
Milk 10 2 20
Dolls 15 10 150
Chocolate 5 5 25
Toys 50 1 50
I want to know which items are for play_purpose and which are for date_purpose,
so I made two logical vectors:
play_purpose <- Bought_list[["checkout"]][,"total"] >= 50 & Bought_list[["checkout"]][,"total"] <= 150
date_purpose <- Bought_list[["checkout"]][,"total"] > 0 & Bought_list[["checkout"]][,"total"] < 50
How can I return the item names and total values for each condition, like this?
for play_purpose:
Dolls 150
Toys 50
for date_purpose :
Milk 20
Chocolate 25
I'm not clear on the structure of your data, but you could subset with your current code:
play_purpose <-
Bought_list[["checkout"]][Bought_list[["checkout"]][, "total"] >= 50 &
Bought_list[["checkout"]][, "total"] <= 150, c(1, 4)]
# items total
#2 Dolls 150
#4 Toys 50
date_purpose <-
Bought_list[["checkout"]][Bought_list[["checkout"]][, "total"] > 0 &
Bought_list[["checkout"]][, "total"] < 50, c(1, 4)]
# items total
#1 Milk 20
#3 Chocolate 25
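The logical vectors from the question can also be stored once and reused as row indices, which avoids repeating the condition inside the brackets. A minimal sketch, assuming the same checkout data; is_play and is_date are illustrative names.

```r
checkout <- data.frame(
  items = c("Milk", "Dolls", "Chocolate", "Toys"),
  price = c(10L, 15L, 5L, 50L),
  qty   = c(2L, 10L, 5L, 1L),
  total = c(20L, 150L, 25L, 50L)
)

# One TRUE/FALSE per row
is_play <- checkout$total >= 50 & checkout$total <= 150
is_date <- checkout$total > 0 & checkout$total < 50

# Logical row index plus column names keeps only the matching rows
play <- checkout[is_play, c("items", "total")]
date <- checkout[is_date, c("items", "total")]

play
date
```

Selecting columns by name rather than by position (c(1, 4)) keeps the code working if the column order changes.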
Another option is to use dplyr:
library(dplyr)
Bought_list$checkout %>%
  filter(total >= 50 & total <= 150) %>%
  select(items, total)
Bought_list$checkout %>%
  filter(total > 0 & total < 50) %>%
  select(items, total)
Or, if you need to apply this to multiple data frames in the list, we could use map from purrr:
library(purrr)
map(Bought_list, ~ .x %>%
  filter(total >= 50 & total <= 150) %>%
  select(items, total))
map(Bought_list, ~ .x %>%
  filter(total > 0 & total < 50) %>%
  select(items, total))
Data
Bought_list <- list(checkout = structure(list(items = c("Milk", "Dolls", "Chocolate",
"Toys"), price = c(10L, 15L, 5L, 50L), qty = c(2L, 10L, 5L, 1L
), total = c(20L, 150L, 25L, 50L)), class = "data.frame", row.names = c(NA,
-4L)))

Recode continuous variable in R based on conditions

I want to "translate" a syntax written in SPSS into R code but am a total beginner in R and struggling to get it to work.
The SPSS syntax is
DO IF (Geschlecht = 0).
RECODE hang0 (SYSMIS=SYSMIS) (Lowest thru 22.99=0) (23 thru 55=1) (55.01 thru Highest=2)
INTO Hang.
ELSE IF (Geschlecht = 1).
RECODE hang0 (SYSMIS=SYSMIS) (Lowest thru 21.99=0) (22 thru 54=1) (54.01 thru Highest=2)
INTO Hang.
END IF.
I have installed the "car" package in R, but I can't get the range recoding to work. I have tried
td_new$Hang <- recode(td_new$hang0, "0:22.99=0; 23:55=1; else=2")
Nor do I manage to work with the if-else function. My last attempt was
if(td_new$Geschlecht == 0){
td_new$Hang <- td_new$hang0 = 3
} else if (td_new$Geschlecht == 1) {
td_new$Hang <- td_new$hang0 = 5)
} else
td_new$hang0 <- NA
(this was without the recoding, just to test the if-else function).
Would be very happy if someone helped!
Thanks a lot in advance :)!
Sorry, edited to add:
The data structure looks as follows:
Geschlecht hang0
0 15
1 45
1 7
0 11
And I want to recode hang0 such that
for boys (Geschlecht = 0): all values < 23 = 0, values between 23 and 55 = 1, all values > 55 = 2
and for girls (Geschlecht = 1): all values < 22 = 0, values between 22 and 54 = 1, all values > 54 = 2
Here's an approach with case_when:
library(dplyr)
td_new %>%
  # == tests equality; a single = would be read as a named argument
  mutate(Hang = case_when(Geschlecht == 0 & hang0 < 23 ~ 0,
                          Geschlecht == 0 & hang0 >= 23 & hang0 <= 55 ~ 1,
                          Geschlecht == 0 & hang0 > 55 ~ 2,
                          Geschlecht == 1 & hang0 < 22 ~ 0,
                          Geschlecht == 1 & hang0 >= 22 & hang0 <= 54 ~ 1,
                          Geschlecht == 1 & hang0 > 54 ~ 2,
                          TRUE ~ NA_real_))
# Geschlecht hang0 Hang
#1 0 15 0
#2 1 45 1
#3 1 7 0
#4 0 11 0
The final line is there to catch NAs (and any Geschlecht value other than 0 or 1).
Data
td_new <- structure(list(Geschlecht = c(0L, 1L, 1L, 0L), hang0 = c(15L, 45L, 7L, 11L)), class = "data.frame", row.names = c(NA, -4L))
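As a base R alternative to the case_when approach, the sex-specific cut points can be looked up first and then compared in a vectorised way. This is only a sketch: lower and upper are illustrative names, and it assumes values strictly below the lower cut point should be 0, which matches the SPSS syntax for whole-number scores.

```r
td_new <- data.frame(Geschlecht = c(0L, 1L, 1L, 0L),
                     hang0 = c(15L, 45L, 7L, 11L))

# Cut points per row, taken from the SPSS syntax;
# any Geschlecht other than 0 or 1 gives NA
lower <- ifelse(td_new$Geschlecht == 0, 23,
                ifelse(td_new$Geschlecht == 1, 22, NA))
upper <- ifelse(td_new$Geschlecht == 0, 55,
                ifelse(td_new$Geschlecht == 1, 54, NA))

# Each comparison is TRUE/FALSE; adding them yields 0, 1 or 2
td_new$Hang <- (td_new$hang0 >= lower) + (td_new$hang0 > upper)

td_new$Hang
# 0 1 0 0
```

Missing values in hang0 or Geschlecht propagate to NA in Hang, which corresponds to the SYSMIS=SYSMIS rule.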

how to make new column with if and else condition

What should I do when I want to make a new column with mutate, but based on an if condition?
Example:
dt <- read.table(text="
name,gender,fat_%
adam,male,32
anya,female,27
gilang,male,24
andine,female,34
",sep=',',header=TRUE)
## > dt
## name gender fat_.
## 1 adam male 32
## 2 anya female 27
## 3 gilang male 24
## 4 andine female 34
My question:
What code do I have to write to make a new column that takes one of two values, "yes" or "no"?
My new column should look like this:
name gender fat_% obesity
adam male 32 yes
anya female 27 no
gilang male 24 yes
andine female 34 no
Note: the formula to find obesity is
if male and fat > 26, or female and fat > 32, then yes; if male and fat < 26, or female and fat < 32, then no
A couple of suggestions first. Gender can be a single char M/F. A % is not syntactically valid in a column name, so read.table() converts it to a dot by default (fat_% becomes fat_.). Your column name 'fat', you probably meant BMI??
Does this work for you?
library(dplyr)
dt %>%
  # the column is called fat_. after read.table(); each comparison already yields TRUE/FALSE
  mutate(newcol = ifelse(gender == "male", fat_. > 26, fat_. > 32))
Two solutions.
First, a base R solution:
df$obesity <- ifelse (df$gender == "m" & df$fat_ > 26 , "yes",
ifelse(df$gender == "f" & df$fat_ > 32, "yes", "no"))
Using mutate from dplyr, a more compact version based on dplyr's if_else rather than base R's ifelse is this:
library(dplyr)
df %>%
  mutate(obesity = if_else((gender == "m" & fat_ > 26) | (gender == "f" & fat_ > 32), "yes", "no"))
RESULT:
df
name gender fat_ obesity
1 adam m 32 yes
2 anya f 27 no
3 gilang m 24 no
4 andine f 34 yes
DATA:
df <- data.frame(
name = c("adam", "anya", "gilang", "andine"),
gender = c("m", "f", "m", "f"),
fat_ = c(32,27,24,34)
)
One approach is to use case_when from dplyr:
library(dplyr)
df %>%
mutate(obesity = case_when(gender == "male" & fat > 26 ~ "yes",
gender == "female" & fat > 32 ~ "yes",
TRUE ~ "no"))
# name gender fat obesity
#1 adam male 32 yes
#2 anya female 27 no
#3 gilang male 24 no
#4 andine female 34 yes
Once you understand the syntax, it comes in handy quite often.
Data
df <- structure(list(name = structure(c(1L, 3L, 4L, 2L), .Label = c("adam",
"andine", "anya", "gilang"), class = "factor"), gender = structure(c(2L,
1L, 2L, 1L), .Label = c("female", "male"), class = "factor"),
fat = c(32, 27, 24, 34)), class = "data.frame", row.names = c(NA,
-4L))

ifelse in r with two or more conditions

How can I use a conditional statement in R to define value in a column based on two column conditions?
Data
Term(in month) DayLate NEW_STATUS
12 0 .....
24 24 .....
17 30 .....
9 15 .....
36 21 .....
Pseudocode
if(term <= 12){
if(DayLate <= 14) then NEW_STATUS = "NORM"
if(DayLate between 15~30) then NEW_STATUS = "SPECIAL"
}else if(term > 12){
if(DayLate <= 29) then NEW_STATUS = "NORM"
if(DayLate between 30~89) then NEW_STATUS = "SPECIAL"
}
It can be achieved by nested conditional statements with ifelse() in base or if_else(), case_when() in dplyr.
# data
df <- structure(list(Term = c(12L, 24L, 17L, 9L, 36L), DayLate = c(0L,
24L, 30L, 15L, 21L)), class = "data.frame", row.names = c(NA, -5L))
(1) base way
within(df,
NEW_STATUS <- ifelse(Term <= 12,
ifelse(DayLate <= 14, "NORM", "SPECIAL"),
ifelse(DayLate <= 29, "NORM", "SPECIAL"))
)
(2) dplyr
library(dplyr)
df %>% mutate(
  NEW_STATUS = case_when(
    Term <= 12 ~ if_else(DayLate <= 14, "NORM", "SPECIAL"),
    TRUE ~ if_else(DayLate <= 29, "NORM", "SPECIAL")
  )
)
Output
# Term DayLate NEW_STATUS
# 1 12 0 NORM
# 2 24 24 NORM
# 3 17 30 SPECIAL
# 4 9 15 SPECIAL
# 5 36 21 NORM
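A flatter way to write the same rule is to look up the term-specific limit first and compare against it once. This is a sketch under the same simplification as the answers above: anything past the limit is labelled "SPECIAL", even though the pseudocode leaves DayLate above 30 (or 89) undefined. The name limit is illustrative.

```r
df <- data.frame(Term = c(12L, 24L, 17L, 9L, 36L),
                 DayLate = c(0L, 24L, 30L, 15L, 21L))

# Largest DayLate that still counts as "NORM" for each row's term
limit <- ifelse(df$Term <= 12, 14, 29)

df$NEW_STATUS <- ifelse(df$DayLate <= limit, "NORM", "SPECIAL")

df$NEW_STATUS
# "NORM" "NORM" "SPECIAL" "SPECIAL" "NORM"
```

Keeping the limits in one place makes it easier to add further term brackets later without nesting more ifelse calls.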
