I'm trying to conditionally replace values with NA in R.
Here's what I've tried so far using the dplyr package.
Data
have <- data.frame(id = 1:3,
gender = c("Female", "I Do Not Wish to Disclose", "Male"))
First try
want = as.data.frame(have %>%
mutate(gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
)
This gives me an error.
Second try
want = as.data.frame(have %>%
mutate(gender = ifelse(gender == "I Do Not Wish to Disclose", NA, gender))
)
This runs without an error but turns Female into 1, Male into 3 and I Do Not Wish to Disclose into 2...
This happens because the column is a factor. Convert it to character first and it should work:
library(dplyr)
have %>%
mutate(gender = as.character(gender),
gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
The values in gender change because ifelse() drops the factor attributes, so the column gets coerced to its underlying integer storage codes:
as.integer(factor(c("Male", "Female", "Male")))
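To make the coercion concrete, here is a minimal base R sketch. It assumes gender is stored as a factor (the default for data.frame() before R 4.0.0); the vector names codes and fixed are only for illustration.

```r
# Assumption: the column is a factor, as in the question's second attempt
gender <- factor(c("Female", "I Do Not Wish to Disclose", "Male"))

# ifelse() drops the factor attributes, so the "no" branch yields integer codes
codes <- ifelse(gender == "I Do Not Wish to Disclose", NA, gender)
print(codes)

# Converting to character first keeps the labels
fixed <- ifelse(gender == "I Do Not Wish to Disclose", NA, as.character(gender))
print(fixed)
```

The first result holds the level codes (levels are sorted alphabetically), which is why Female turned into 1 and Male into 3.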
I would use the very neat function na_if() from dplyr.
library(dplyr)
have <- data.frame(gender = c("F", "M", "NB", "I Do Not Wish to Disclose"))
have |> mutate(gender2 = na_if(gender, "I Do Not Wish to Disclose"))
Output:
#> gender gender2
#> 1 F F
#> 2 M M
#> 3 NB NB
#> 4 I Do Not Wish to Disclose <NA>
Created on 2022-04-19 by the reprex package (v2.0.1)
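For comparison, the same replacement can be sketched in base R without dplyr. This is only an illustration: the data mirrors the question, stringsAsFactors = TRUE is set on purpose to show that it also works on a factor column, and the droplevels() step is optional.

```r
# Hypothetical copy of the question's data, forced to a factor column
have <- data.frame(id = 1:3,
                   gender = c("Female", "I Do Not Wish to Disclose", "Male"),
                   stringsAsFactors = TRUE)

want <- have
# Logical indexing assigns NA only where the condition holds
want$gender[want$gender == "I Do Not Wish to Disclose"] <- NA
# Optionally drop the now-unused level
want$gender <- droplevels(want$gender)
print(want)
```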
Have seen several posts on this, but can't seem to get it to work for my specific use case.
I'm trying to assign a new field value based on ifelse logic. My input dataset looks like:
If the value for X is missing, I am trying to replace it with the previous value of X, only when the value of unique_id is the same as the previous value of unique_id. I would like the output dataset to look like this:
The code I've written (I'm a total beginner) doesn't throw an error, but the data doesn't change:
within(data3, data3$Output <- ifelse(data3$unique_id == lag(data3$unique_id) & is.na(data3$Output), data3$Output == lag(data3$Output), data3$Output == data3$Output))
I do change missing data values ("-") in the input dataset to official NA missing values in a previous step... hopefully allowing me to use the is.na function.
data.table option where you replace the NA with the non-NA value per group:
df <- data.frame(unique_id = c("m", "m"),
X = c(73500, NA),
MoM = c("4%", "0%"))
library(data.table)
setDT(df)
df[, X := X[!is.na(X)][1L], by = unique_id]
df
#> unique_id X MoM
#> 1: m 73500 4%
#> 2: m 73500 0%
Created on 2022-07-09 by the reprex package (v2.0.1)
In addition to the provided solutions: One of these:
fill()
suggested by @jared_marot in the comments
library(dplyr)
library(tidyr)
df %>%
fill(X)
first()
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(X = first(X))
lag()
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(X = lag(X, default = X[1]))
base R
df[2,2] <- df[1,2]
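The base R line above only handles the two-row example. A more general base R sketch, assuming the goal is "carry the last non-missing X forward within each unique_id", could look like this. The helper fill_down() and the extra "z" rows are made up for illustration.

```r
# Hypothetical helper: carry the last non-NA value forward within a vector
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1L]   # leading NAs have idx 0 and stay NA
}

# Made-up data: group "z" starts with an NA, which has nothing to copy from
df <- data.frame(unique_id = c("m", "m", "z", "z"),
                 X = c(73500, NA, NA, 4000))

# ave() applies the helper separately within each unique_id
df$X <- ave(df$X, df$unique_id, FUN = fill_down)
print(df)
```

Unlike the first()-based options, this leaves a missing value in place when no earlier value exists in the same group.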
You could group by the IDs, then use fill() to copy values down, replacing NAs within each group. See the reproducible example below.
(If NAs could appear before or after the value, you could add .direction = "downup" to the fill() call, as done here.)
library(tidyverse)
# Sample data
df <- tribble(
~unique_id, ~x, ~mom,
"m", 73500, 4,
"m", NA, 0,
"z", 4000, 5,
"z", NA, 0,
)
df2 <- df |>
group_by(unique_id) |>
fill(x, .direction = "downup") |>
ungroup()
#> # A tibble: 4 × 3
#> unique_id x mom
#> <chr> <dbl> <dbl>
#> 1 m 73500 4
#> 2 m 73500 0
#> 3 z 4000 5
#> 4 z 4000 0
Created on 2022-07-09 by the reprex package (v2.0.1)
Let's suppose we have a big data.frame named df with three categorical variables and a counter column:
Gender: which can be M or F (2 possible answers)
Hair: which can be "black", "brown", "blond", "red", "other" (5 possible values)
Sport: which can be "yes" or "no" (2 different values)
Value: always 1 in order to count the number of events
When I use the collap function from collapse package I run the following code
collap(df, ~ Gender + Hair + Sport, FUN = sum, cols = "Value")
What I expect is a data.frame with 20 different rows (one per each combination); however, if there is a combination with no occurrences, the row does not appear.
Do you know how I can get all the possible combinations, with a 0 where there are no events for a given combination?
You can complete unused factor levels like this, which produces a row for females even though every row in the data is male:
library(tidyverse)
library(collapse)
#> collapse 1.7.6, see ?`collapse-package` or ?`collapse-documentation`
#>
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#>
#> D
data <- tribble(
~Gender, ~Hair, ~Value,
"M", "black", 1
)
data %>%
mutate(Gender = Gender %>% factor(levels = c("M", "F"))) %>%
complete(Gender, fill = list(Value = 0)) %>%
collap(~ Gender + Hair, FUN = sum, cols = "Value")
#> # A tibble: 2 × 3
#> Gender Hair Value
#> <fct> <chr> <dbl>
#> 1 M black 1
#> 2 F <NA> 0
Created on 2022-05-03 by the reprex package (v2.0.0)
This is the answer to my question, based on the response by @danloo:
df %>%
complete(Gender, Hair, Sport) %>%
collap( ~Gender + Hair + Sport, FUN = sum, cols = "Value")
Running that I get a data.frame with 20 different rows where NA are placed for those combinations with no events.
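If zeros rather than NA are wanted, one possible alternative is a base R sketch with xtabs(), swapped in here instead of collap(). It assumes the grouping columns are factors that carry all of their levels; the three-row df below is made up for illustration.

```r
# Made-up data; factor levels list every possible answer, observed or not
df <- data.frame(
  Gender = factor(c("M", "M", "F"), levels = c("M", "F")),
  Hair   = factor(c("black", "red", "black"),
                  levels = c("black", "brown", "blond", "red", "other")),
  Sport  = factor(c("yes", "no", "yes"), levels = c("yes", "no")),
  Value  = 1
)

# xtabs() sums Value over every combination of levels; empty cells become 0
counts <- as.data.frame(xtabs(Value ~ Gender + Hair + Sport, data = df),
                        responseName = "Value")
print(nrow(counts))
```

With 2 x 5 x 2 levels this gives one row per combination, including the ones that never occur.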
I am working with a data frame where each row is a patient with a particular illness. There is a column for their age category, and several columns with text (Yes or No) as to whether or not they are experiencing a particular symptom. Example provided below
set.seed(1)
Sick <- data.frame(age=sample(c("Infant", "Child", "Adult", "Elderly"), size=20, replace = TRUE),
cough= sample(c("Yes", "No"), size=20, replace = TRUE),
fever= sample(c("Yes", "No"), size=20, replace = TRUE),
chills= sample(c("Yes", "No"), size=20, replace = TRUE),
fatigue=sample(c("Yes", "No"), size=20, replace = TRUE))
What I am trying to get is a nicely structured frame that indicates how many patients in each category experience the symptom where the columns are the age categories and the rows are the count of how many people in that category experienced that symptom. The code below shows what I want my end result to be.
Count <- data.frame(symptom=c("cough", "fever", "chills", "fatigue"),
Infant=c(5, 1, 4, 2),
Child= c(4, 3, 2, 4),
Adult= c(2, 3, 1, 5),
Elderly = c(1, 0, 0, 0))
I know I could create this with the table and rbind functions, however, I was wondering if anyone had advice on how to streamline this? The real frame has about 10 age categories and 25 symptoms, so doing lots of tables may not be the most efficient.
Thank you for any and all help!
The above is great (upvoted, but tidyverse is needed as well), or even simpler
library(tidyverse)
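Sick %>%
pivot_longer(-age, names_to = 'symptom') %>%
filter(value == 'Yes') %>% # count only patients who report the symptom
count(age, symptom) %>%
# one column per age category, as in the desired output; absent combinations become 0
pivot_wider(names_from = 'age', values_from = 'n', values_fill = 0)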
I've found in learning R that a great many problems can be solved by pivoting long and then wide or vice versa with some transform or calculation in between :)
I hope this is right. If I understand your question you just want to count the yes's for each category. I've put it into a function so just change x = Sick to whatever your dataframe is called and run the function.
EDIT: I forget which package the pipe and column_to_rownames come from. I've added dplyr as a require, but they may come from magrittr. If in doubt, just load the tidyverse.
sick_tbl <- function(x = Sick){
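require(dplyr) # pipe, count(), filter(), select()
require(tidyr) # pivot_longer(), pivot_wider()
require(tibble) # column_to_rownames()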
sick_piv <- pivot_longer(x, names_to = "names", values_to = "values",
-c(age))
count <- sick_piv%>%
count(values, names, age) %>%
filter(values == "Yes") %>%
select(!values)
sick_out <- pivot_wider(count,
names_from = "age",
values_from = "n") %>%
column_to_rownames(var = "names")
sick_out[is.na(sick_out)] <- 0
sick_out <<- sick_out}
To run on your example data:
sick_tbl(x = Sick)
Adult Child Elderly Infant
chills 1 2 4 NA
cough 4 2 5 1
fatigue 3 3 2 1
fever 2 4 2 2
Here are three pretty concise options:
Base R
# Can skip the lapply if the Y/N columns were character to begin with
with(subset(cbind(Sick[1], stack(lapply(Sick[-1], as.character))), values == "Yes"),
table(ind, age))
data.table
library(data.table)
melt(as.data.table(Sick), id.vars="age")[value == "Yes", table(variable, age)]
questionr
library(questionr)
cross.multi.table(Sick[-1], Sick[[1]], true.codes = list("Yes"))
First reshape, then use the datasummary function from the modelsummary package (self-promotion alert). The benefit of this solution is that you can customize the look of your table and save it to many formats (html, latex, word, markdown, etc.):
library('modelsummary')
library('tidyverse')
dat = pivot_longer(Sick, -age, names_to = "Symptom") %>%
filter(value == "Yes")
datasummary(Symptom ~ N * age, data = dat)
        Adult Child Elderly Infant
chills  1     2     0       4
cough   2     4     1       5
fatigue 5     4     0       2
fever   3     3     0       1
I need to pick grade values for a list of subjects (math, language, sci, etc.) conditional on the presence of valid values in 2016 (validity_2016=="yes"), and store them in new variables called grades_{subject} (e.g. grades_math).
df<-tibble(person = c("Alice", "Bob", "Mary"),
validity_2016 = c(NA, "yes", NA),
likes_ham = c("no", "yes", "yes"),
grades_math_2015=c(6,2,4),
grades_math_2016=c(3,5,7),
grades_language_2015=c(7,1,9),
grades_language_2016=c(3,6,7),
grades_sci_2015=c(7,1,9),
grades_sci_2016=c(3,6,7))
I was wondering whether it is viable to use dplyr's mutate_at or mutate(across(...)) in the following way:
dplyr::mutate(across(grades_math_2016, grades_language_2016,grades_sci_2016),
~dplyr::case_when(!is.na(validity_2016)~list(grades_math_2015,grades_language_2015,grades_sci_2015)~.),
.names="{col}"))
The outcomes should look like this:
df<-tibble(person = c("Alice", "Bob", "Mary"),
validity_2016 = c(NA, "yes", NA),
likes_ham = c("no", "yes", "yes"),
grades_math_2015=c(6,2,4),
grades_math_2016=c(3,5,7),
grades_language_2015=c(7,1,9),
grades_language_2016=c(3,6,7),
grades_sci_2015=c(7,1,9),
grades_sci_2016=c(3,6,7),
grades_math=c(6,5,4),
grades_language=c(7,6,9),
grades_sci=c(7,6,9))
I'd recommend using a mutate and ifelse for each subject. Something like:
df2 = df %>%
# %in% treats NA as FALSE, so rows with a missing validity_2016 fall back to 2015
mutate(grades_math = ifelse(validity_2016 %in% "yes", grades_math_2016, grades_math_2015))
The downside of this approach is that you need to repeat it for each subject. This could be automated with something like:
out_cols = c("grades_math", "grades_language", "grades_sci")
for(col in out_cols){
c15 = paste0(col,"_2015")
c16 = paste0(col,"_2016")
df = df %>% mutate(!!sym(col) := ifelse(validity_2016 %in% "yes", !!sym(c16), !!sym(c15)))
}
Where !!sym(x) takes the text saved in the variable x and turns it into a variable name (e.g. if x = "sci" then !!sym(x) gives us the variable sci instead of the text "sci" or the variable x).
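The same string-to-column lookup can also be sketched in base R, where df[[x]] plays the role of !!sym(x). This is a swapped-in illustration rather than the dplyr approach above; it uses a plain data.frame copy of the question's data and treats a missing validity_2016 as "not valid".

```r
# Plain data.frame copy of the question's data
df <- data.frame(
  person = c("Alice", "Bob", "Mary"),
  validity_2016 = c(NA, "yes", NA),
  grades_math_2015 = c(6, 2, 4),     grades_math_2016 = c(3, 5, 7),
  grades_language_2015 = c(7, 1, 9), grades_language_2016 = c(3, 6, 7),
  grades_sci_2015 = c(7, 1, 9),      grades_sci_2016 = c(3, 6, 7)
)

# %in% returns FALSE for NA, so missing validity falls back to 2015
use_2016 <- df$validity_2016 %in% "yes"

for (col in c("grades_math", "grades_language", "grades_sci")) {
  # df[[name]] looks a column up by the text stored in `name`
  df[[col]] <- ifelse(use_2016,
                      df[[paste0(col, "_2016")]],
                      df[[paste0(col, "_2015")]])
}
print(df[c("person", "grades_math", "grades_language", "grades_sci")])
```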
tidyverse and rlang example:
This example uses mutate and case_when to assign the variables as you described. I wrapped it in a function in case this is something you would do often.
library(tidyverse)
library(rlang)
make_grade_columns <- function(df, condition_col, year_view){
year_column_names <- colnames(df)[str_detect(colnames(df), year_view) & colnames(df) != condition_col & !str_detect(colnames(df), "validity")]
year_prior_column_names <- colnames(df)[str_detect(colnames(df), as.character(as.numeric(year_view) - 1)) & colnames(df) != condition_col]
return_col_names <- str_remove(year_column_names, "_\\d\\d\\d\\d")
df <- df %>% mutate(
!!return_col_names[1] := case_when(
(df %>% select(!!!condition_col)) == "yes" ~ !! sym(year_column_names[1]),
T ~ !! sym(year_prior_column_names[1])),
!!return_col_names[2] := case_when(
(df %>% select(!!!condition_col)) == "yes" ~ !! sym(year_column_names[2]),
T ~ !! sym(year_prior_column_names[2])),
!!return_col_names[3] := case_when(
(df %>% select(!!!condition_col)) == "yes" ~ !! sym(year_column_names[3]),
T ~ !! sym(year_prior_column_names[3])))
return(df)
}
make_grade_columns(df, "validity_2016", "2016") %>%
select(person, validity_2016, grades_math, grades_sci, grades_language)
# # A tibble: 3 x 5
# person validity_2016 grades_math grades_sci grades_language
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 Alice NA 6 7 7
# 2 Bob yes 5 6 6
# 3 Mary NA 4 9 9
Suppose you changed it up and wanted to pick the grades based on whether they answered "yes" to likes_ham. Simply use that as the conditioning column for the function.
make_grade_columns(df, "likes_ham", "2016")%>%
select(person, likes_ham, grades_math, grades_sci, grades_language)
# # A tibble: 3 x 5
# person likes_ham grades_math grades_sci grades_language
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 Alice no 6 7 7
# 2 Bob yes 5 6 6
# 3 Mary yes 7 7 7
The function takes a "yes" answer and returns the values from the given year. If the answer is "no" (or missing), it returns the value from the prior year instead.
demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), stock = c(43,5,2), bill = c(43,112,33))
How can I identify the gender from the names column?
Expected output:
demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), gender = c("female", "female", "male"), stock = c(43,5,2), bill = c(43,112,33))
Tried this
library(gender)
test <- gender_df(demo_df, method = "demo",
name_col = "name", year_col = c("1900", "2000"))
but I receive this error
Error in gender_df(demo_df, method = "demo", name_col = "name") :
year_col %in% names(data) is not TRUE
Use gender() instead of gender_df().
Note that gender() automatically sorts output alphabetically by name, so it won't work to simply add the output as a new vector to demo_df, as the ordering may be wrong.
Two options to handle this:
1. Sort demo_df alphabetically by name before you call gender().
library(dplyr)
demo_df %>%
arrange(names) %>%
mutate(gender = gender::gender(demo_df$names)$gender)
2. Use a join method, like dplyr::inner_join, to merge demo_df and the resulting data frame output of the call to gender(), on the names column.
gender_df <- gender::gender(demo_df$names) %>%
select(names = name, gender)
inner_join(demo_df, gender_df, by = "names")
Output:
id names stock bill gender
1 1 Hillary 43 43 female
2 2 Madison 5 112 female
3 3 John 2 33 male
All of this is possible in base R, too, not including the gender imputation part. I just prefer dplyr.
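As a rough base R sketch of the join step: the lookup data frame below is a hard-coded stand-in for the output of gender::gender() (sorted alphabetically by name, with the values shown in the output above), so the gender package is not needed to run it. match() aligns the lookup to the original row order.

```r
demo_df <- data.frame(id = c(1, 2, 3),
                      names = c("Hillary", "Madison", "John"),
                      stock = c(43, 5, 2),
                      bill = c(43, 112, 33),
                      stringsAsFactors = FALSE)

# Stand-in for gender::gender(demo_df$names): note the alphabetical order
lookup <- data.frame(name = c("Hillary", "John", "Madison"),
                     gender = c("female", "male", "female"),
                     stringsAsFactors = FALSE)

# match() returns, for each row of demo_df, the position of its name in lookup
demo_df$gender <- lookup$gender[match(demo_df$names, lookup$name)]
print(demo_df)
```

Names missing from the lookup would get NA here, which is similar to what a left join would do.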