Recoding race variables into multiracial category by group - r

I have been trying to learn the best way to recode values in a column when a name is associated with more than one race.
I have been working with a dataframe like this:
df <- data.frame('Name' = c("Jon", "Jon", "Bobby", "Sarah", "Fred"),
'Race' = c("Black", "White", "Asian", "Asian", "Black"))
What I am trying to do is recode any value that appears more than once in a group and transform it into a "multi-racial" category.
The end goal is to construct a dataframe like below:
df1 <- data.frame('Name' = c("Jon", "Bobby", "Sarah", "Fred"),
'Race' = c("Multiracial", "Asian", "Asian", "Black"))
My current approach is to count the distinct races per name, pick out the names with more than one answer, and replace the race with "multi-racial" for those names only. Code shown below:
library(dplyr) # needed for %>% and n_distinct()
df1 <- unique(df[, c('Name', 'Race')])
multi_answer <-
df1 %>%
dplyr::group_by(Name) %>%
dplyr::summarise(n_answers = n_distinct(Race))
multi_answer <- multi_answer[multi_answer$n_answers >1,]
df1[df1$Name %in% c(multi_answer$Name), 'Race'] <- 'multi-racial'
df1 <- unique(df1)

You can group by Name and then summarise the data, using the condition "there is more than one entry in the group" (i.e., n() > 1):
library(tidyverse)
df |>
group_by(Name)|>
summarise(Race = ifelse(n() > 1, "multi-racial", Race))
#> # A tibble: 4 x 2
#> Name Race
#> <chr> <chr>
#> 1 Bobby Asian
#> 2 Fred Black
#> 3 Jon multi-racial
#> 4 Sarah Asian
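One caveat: n() counts rows, so a name repeated with the same race would also be labelled multi-racial. A minimal sketch, assuming the intent is "more than one distinct race" (as in the question's own n_distinct() approach). The second Fred row is hypothetical and added only to show the difference:

```r
library(dplyr)

# Question's data plus one hypothetical duplicate row (Fred / Black)
df <- data.frame(
  Name = c("Jon", "Jon", "Bobby", "Sarah", "Fred", "Fred"),
  Race = c("Black", "White", "Asian", "Asian", "Black", "Black")
)

result <- df |>
  group_by(Name) |>
  # n_distinct() counts unique races, so repeated identical rows are harmless
  summarise(Race = if (n_distinct(Race) > 1) "multi-racial" else first(Race))
```

With this version Fred stays "Black" even though he appears twice, while Jon is still recoded.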

Related

Conditionally replace values with NA in R

I'm trying to conditionally replace values with NA in R.
Here's what I've tried so far using dplyr package.
Data
have <- data.frame(id = 1:3,
gender = c("Female", "I Do Not Wish to Disclose", "Male"))
First try
want = as.data.frame(have %>%
mutate(gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
)
This gives me an error.
Second try
want = as.data.frame(have %>%
mutate(gender = ifelse(gender == "I Do Not Wish to Disclose", NA, gender))
)
This runs without an error but turns Female into 1, Male into 3 and I Do Not Wish to Disclose into 2...
This is a case where the column is a factor. Convert it to character and it should work:
library(dplyr)
have %>%
mutate(gender = as.character(gender),
gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
The values in gender change because the factor gets coerced to its integer storage values:
as.integer(factor(c("Male", "Female", "Male")))
#> [1] 2 1 2
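To make the coercion concrete, here is a small base R sketch. The vector gender is built directly as a factor for illustration. Note that since R 4.0.0, data.frame() keeps strings as character by default, so the problem only shows up when the column really is a factor:

```r
gender <- factor(c("Female", "I Do Not Wish to Disclose", "Male"))

# ifelse() drops the factor attributes, leaving the integer level codes
codes <- ifelse(gender == "I Do Not Wish to Disclose", NA, gender)

# Converting to character first keeps the labels
fixed <- ifelse(gender == "I Do Not Wish to Disclose", NA, as.character(gender))
```

Here codes holds 1, NA, 3 (the level positions), while fixed holds "Female", NA, "Male".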
I would use the very neat function na_if() from dplyr.
library(dplyr)
have <- data.frame(gender = c("F", "M", "NB", "I Do Not Wish to Disclose"))
have |> mutate(gender2 = na_if(gender, "I Do Not Wish to Disclose"))
Output:
#> gender gender2
#> 1 F F
#> 2 M M
#> 3 NB NB
#> 4 I Do Not Wish to Disclose <NA>
Created on 2022-04-19 by the reprex package (v2.0.1)

Select groups with row containing specific value (with dplyr and pipes)

I'm trying to select groups in a grouped df that contain a specific string on a specific row within each group.
Consider the following df:
df <- data.frame(id = c(rep("id_1", 4),
rep("id_2", 4),
rep("id_3", 4)),
string = c("here",
"is",
"some",
"text",
"here",
"is",
"other",
"text",
"there",
"are",
"final",
"texts"))
I want to create a dataframe that contains just the groups that have the word "is" on the second row.
Here is some incorrect code:
desired_df <- df %>% group_by(id) %>%
filter(slice(select(., string), 2) %in% "is")
Here is the desired output:
desired_df <- data.frame(id = c(rep("id_1", 4),
rep("id_2", 4)),
string = c("here",
"is",
"some",
"text",
"here",
"is",
"other",
"text"))
I've looked here but this doesn't solve my issue because this finds groups with any occurrence of the specified string.
I could also do some sort of separate code where I identify the ids and then use that to subset the original df, like so:
ids <- df %>% group_by(id) %>% slice(2) %>% filter(string %in% "is") %>% select(id)
desired_df <- df %>% filter(id %in% ids$id)
But I'm wondering if I can do something simpler within a single pipe series.
Help appreciated!
After grouping by 'id', take the second element of 'string' and put "is" on the left-hand side of %in%, so that it returns a single TRUE or FALSE per group:
library(dplyr)
df %>%
group_by(id) %>%
filter('is' %in% string[2]) %>%
ungroup()
Output:
# A tibble: 8 x 2
# id string
# <chr> <chr>
#1 id_1 here
#2 id_1 is
#3 id_1 some
#4 id_1 text
#5 id_2 here
#6 id_2 is
#7 id_2 other
#8 id_2 text
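For comparison, the same idea can be sketched in base R without any packages. This is only one possible translation, assuming the rows are already in the intended order within each id:

```r
df <- data.frame(
  id = rep(c("id_1", "id_2", "id_3"), each = 4),
  string = c("here", "is", "some", "text",
             "here", "is", "other", "text",
             "there", "are", "final", "texts")
)

# One TRUE/FALSE per id: is the second string in the group equal to "is"?
second_is <- tapply(df$string, df$id, function(s) "is" %in% s[2])

# Keep all rows belonging to the ids that passed the check
desired_df <- df[df$id %in% names(second_is)[second_is], ]
```

Using %in% rather than == means a group with fewer than two rows yields FALSE instead of NA.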

Add gender column recognition

demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), stock = c(43,5,2), bill = c(43,112,33))
How can I identify gender from the names column?
Expected output:
demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), gender = c("female", "female", "male"), stock = c(43,5,2), bill = c(43,112,33))
I tried this:
library(gender)
test <- gender_df(demo_df, method = "demo",
name_col = "name", year_col = c("1900", "2000"))
but I receive this error
Error in gender_df(demo_df, method = "demo", name_col = "name") :
year_col %in% names(data) is not TRUE
Use gender() instead of gender_df().
Note that gender() automatically sorts output alphabetically by name, so it won't work to simply add the output as a new vector to demo_df, as the ordering may be wrong.
Two options to handle this:
1. Sort demo_df alphabetically by name before you call gender().
library(dplyr)
demo_df %>%
arrange(names) %>%
mutate(gender = gender::gender(demo_df$names)$gender)
2. Use a join method, like dplyr::inner_join, to merge demo_df and the resulting data frame output of the call to gender(), on the names column.
gender_df <- gender::gender(demo_df$names) %>%
select(names = name, gender)
inner_join(demo_df, gender_df, by = "names")
Output:
id names stock bill gender
1 1 Hillary 43 43 female
2 2 Madison 5 112 female
3 3 John 2 33 male
All of this is possible in base R, too, not including the gender imputation part. I just prefer dplyr.
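As a rough illustration of the base R route for the merging step, match() can align a lookup table to demo_df without changing its row order. The lookup data frame below is a hand-written, hypothetical stand-in for the output of gender() (sorted by name), not real package output:

```r
demo_df <- data.frame(
  id = c(1, 2, 3),
  names = c("Hillary", "Madison", "John"),
  stock = c(43, 5, 2),
  bill = c(43, 112, 33)
)

# Hypothetical stand-in for the result of gender(): sorted alphabetically by name
lookup <- data.frame(
  name = c("Hillary", "John", "Madison"),
  gender = c("female", "male", "female")
)

# match() finds each name's row in the lookup, so demo_df keeps its original order
demo_df$gender <- lookup$gender[match(demo_df$names, lookup$name)]
```

Names that are missing from the lookup would simply get NA, which makes unmatched rows easy to spot.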

Filter rows when values match in certain columns in R [duplicate]

I want to filter a data frame to include only rows that have matching values in certain columns.
My data:
library(lubridate) # provides ymd()
df <- data.frame("Date" = ymd(c("2005-01-01", "2005-01-02", "2005-01-02", "2005-01-01", "2005-01-01")),
"Person" = c("John", "John", "John", "Maria", "Maria"),
"Job" = c("OR", "ER", "Heart", "Liver", "CV"),
"Type" = c("Day", "Night", "Night", "Day", "Night"))
I want to create a smaller data frame that includes rows that match on the date, the person, and the type.
The data frame I want to see is this:
df1 <- data.frame("Date" = ymd(c("2005-01-02", "2005-01-02")),
"Person" = c("John", "John"),
"Job" = c("ER", "Heart"),
"Type" = c("Night", "Night"))
We can use group_by and filter from dplyr:
library(dplyr)
df %>%
group_by(Date, Person, Type) %>%
filter(n() > 1)
Output:
# A tibble: 2 x 4
# Groups: Date, Person, Type [1]
Date Person Job Type
<date> <fct> <fct> <fct>
1 2005-01-02 John ER Night
2 2005-01-02 John Heart Night
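The same filter can be sketched in base R with duplicated(). This version uses as.Date() instead of ymd() so it runs without extra packages; the variable names keys and is_dup are just for illustration:

```r
df <- data.frame(
  Date = as.Date(c("2005-01-01", "2005-01-02", "2005-01-02", "2005-01-01", "2005-01-01")),
  Person = c("John", "John", "John", "Maria", "Maria"),
  Job = c("OR", "ER", "Heart", "Liver", "CV"),
  Type = c("Day", "Night", "Night", "Day", "Night")
)

# Only these columns decide whether two rows "match"
keys <- df[c("Date", "Person", "Type")]

# duplicated() skips the first occurrence and fromLast = TRUE skips the last,
# so combining both flags every row that has a match
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
df1 <- df[is_dup, ]
```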

Manipulating variables to produce a new dataset in R

I'm a relatively new R user. I would really appreciate any help with my dataset please.
I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.
Some patients appear in the dataset more than once (i.e., they have picked up medications from different pharmacies at different time points).
The data frame looks like this:
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.
For example: for Tom his most frequent pharmacy is Pharmacy B because he has picked up 13 medications from there (5+8 meds). The dataset I would like to generate:
data.frame(name = c("Tom", "Rob", "Amy"),
pharmacy = c("B", "B", "C"),
meds = c(13, 2, 2))
Can someone please help me with writing a code to do this?
I have tried various packages and functions in R, such as dplyr, tidyr, and aggregate(), with no success. Any help would be genuinely appreciated.
Thank you very much
Alex
Your question is not reproducible. But here is one solution:
# create reproducible example of data
dataset1 <- data.frame(
name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
meds_count = c(3, 2, 5, 8, 2))
library(dplyr) #load dplyr
dataset2 <- dataset1 %>% group_by(name, pharmacy) %>% # group by your grouping variables
summarise(meds_count = sum(meds_count)) %>% # sum no. of meds by your grouping variables
top_n(1, meds_count) %>% # filter for only the top 1 count
ungroup()
Resulting dataframe:
> dataset2
# A tibble: 3 x 3
name pharmacy meds_count
<fct> <fct> <dbl>
1 Amy pharmacy_C 2.00
2 Rob pharmacy_B 2.00
3 Tom pharmacy_B 13.0
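Two details in this approach are implicit: summarise() drops the last grouping variable (so the result is still grouped by name), and top_n() has been superseded by slice_max() since dplyr 1.0.0. A hedged sketch that spells both out, assuming a reasonably recent dplyr; with_ties = FALSE is an added assumption to force exactly one pharmacy per patient:

```r
library(dplyr)

dataset1 <- data.frame(
  name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
  pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
  meds_count = c(3, 2, 5, 8, 2)
)

dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%
  # keep the result grouped by name only, so the next step runs per patient
  summarise(meds_count = sum(meds_count), .groups = "drop_last") %>%
  # one row per patient: the pharmacy with the largest total
  slice_max(meds_count, n = 1, with_ties = FALSE) %>%
  ungroup()
```

If a patient has two pharmacies with the same total, with_ties = FALSE keeps only one of them; drop that argument to keep all tied rows.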
If I understood you correctly, I think you're looking for something like this.
require(tidyverse)
#Sample data. I copied yours.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
Edit. I changed the group_by(), summarise() and added filter.
df %>%
group_by(name, pharmacy) %>%
summarise(SumMeds = sum(meds, na.rm = TRUE)) %>%
filter(SumMeds == max(SumMeds))
Results:
name pharmacy SumMeds
<fct> <fct> <dbl>
1 Amy C 2.
2 Rob B 2.
3 Tom B 13.
Generating your dataset:
patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)
df is your dataframe
library(dplyr)
df = df %>% group_by(patient,pharmacy) %>%
summarize(meds =sum(meds)) %>%
group_by(patient) %>%
filter(meds == max(meds))
1. Take your df and group by patient and pharmacy.
2. Calculate the total medicines bought by each patient from each pharmacy by taking the sum of meds.
3. Then group by patient.
4. Finally, filter for the max.
Print the dataframe
print(df)
You can do it in base R with aggregate twice followed by merge.
It seems to me a bit complicated to have to use aggregate twice. Maybe dplyr solutions run more quickly, especially with a dataset with 24 million rows.
agg <- aggregate(meds ~ name + pharmacy, df, FUN = sum)
agg2 <- aggregate(meds ~ name, agg, FUN = max)
merge(agg, agg2)[c(1, 3, 2)]
# name pharmacy meds
#1 Amy C 2
#2 Rob B 2
#3 Tom B 13
Data.
This is the dataset in the question after the edit.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
Assuming the following dataset:
library(tidyverse) # tribble(), str_extract() and the dplyr verbs
df <- tribble(
~patient, ~pharmacy, ~medication,
"Tom", "Pharmacy A", "3 meds",
"Rob", "Pharmacy B", "2 meds",
"Tom", "Pharmacy B", "5 meds",
"Tom", "Pharmacy B", "8 meds",
"Amy", "Pharmacy C", "2 meds"
)
A tidyverse-friendly option could be:
df %>%
mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>% # 1
group_by(patient, pharmacy) %>% # 2
mutate(med_sum = sum(med_n)) %>% # 3
group_by(patient) %>% # 4
filter(med_sum == max(med_sum)) %>% # 5
select(patient, pharmacy, med_sum) %>% # 6
distinct() # 7
1. create a numeric variable as you can't add strings
2. among all patient / pharmacy couples
3. find the total number of medications
4. then among all patients
5. keep only pharmacies with the highest patient / pharm totals
6. discard useless variables
7. discard duplicated lines (several lines per patient / pharmacy couple)
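One thing to watch when extracting the count from strings like "3 meds": a character class without a quantifier matches a single digit only, so the pattern needs a + to capture multi-digit counts. A small base R sketch of the difference (the "13 meds" value is hypothetical; stringr's str_extract() likewise returns just the first match of whatever pattern it is given):

```r
medication <- c("3 meds", "13 meds")

# "[0-9]" matches one digit only, so "13 meds" is truncated to 1
one_digit <- as.numeric(regmatches(medication, regexpr("[0-9]", medication)))

# "[0-9]+" matches the whole run of digits
all_digits <- as.numeric(regmatches(medication, regexpr("[0-9]+", medication)))
```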
