Add gender column recognition - r

demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), stock = c(43,5,2), bill = c(43,112,33))
How can I identify the gender from the names column and add it as a new column?
Expected output:
demo_df <- data_frame(id = c(1,2,3), names = c("Hillary", "Madison", "John"), gender = c("female", "female", "male"), stock = c(43,5,2), bill = c(43,112,33))
Tried this
library(gender)
test <- gender_df(demo_df, method = "demo",
name_col = "name", year_col = c("1900", "2000"))
but I receive this error
Error in gender_df(demo_df, method = "demo", name_col = "name") :
year_col %in% names(data) is not TRUE

Use gender() instead of gender_df().
Note that gender() automatically sorts its output alphabetically by name, so you cannot simply attach the output to demo_df as a new column; the row order may not match.
Two options to handle this:
1. Sort demo_df alphabetically by name before you call gender().
library(dplyr)
demo_df %>%
  arrange(names) %>%
  mutate(gender = gender::gender(demo_df$names)$gender)
2. Use a join method, like dplyr::inner_join, to merge demo_df and the resulting data frame output of the call to gender(), on the names column.
gender_df <- gender::gender(demo_df$names) %>%
  select(names = name, gender)
inner_join(demo_df, gender_df, by = "names")
Output:
id names stock bill gender
1 1 Hillary 43 43 female
2 2 Madison 5 112 female
3 3 John 2 33 male
Apart from the gender imputation itself, all of this is possible in base R too. I just prefer dplyr.
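The base R route mentioned above could look roughly like this. It is only a sketch: gender_lookup is a hand-written, hypothetical stand-in for the data frame returned by gender::gender(demo_df$names), so the example runs without the gender package.

```r
# Data from the question, built with base data.frame()
demo_df <- data.frame(
  id = c(1, 2, 3),
  names = c("Hillary", "Madison", "John"),
  stock = c(43, 5, 2),
  bill = c(43, 112, 33),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the output of gender::gender(demo_df$names),
# deliberately in alphabetical order to mimic the sorting described above
gender_lookup <- data.frame(
  name = c("Hillary", "John", "Madison"),
  gender = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

# Join on the name columns, so the row order of the lookup does not matter
result <- merge(demo_df, gender_lookup, by.x = "names", by.y = "name")

# merge() reorders rows and columns; restore the original layout
result <- result[order(result$id), c("id", "names", "gender", "stock", "bill")]
print(result)
```

Because the match is done by key rather than by position, this behaves like the inner_join option and not like the sort-then-attach option.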

Related

Recoding race variables into multiracial category by group

I have been trying to learn the best way to recode variables in a column based on the condition of a name being associated with more than one race.
I have been working with a dataframe like this:
df <- data.frame('Name' = c("Jon", "Jon", "Bobby", "Sarah", "Fred"),
'Race' = c("Black", "White", "Asian", "Asian", "Black"))
What I am trying to do is recode any value that appears more than once in a group and transform it into a "multi-racial" category.
The end goal is to construct a dataframe like below:
df1 <- data.frame('Name' = c("Jon", "Bobby", "Sarah", "Fred"),
'Race' = c("Multiracial", "Asian", "Asian", "Black"))
My current approach is to count the distinct races per name, collect the names with more than one answer, and replace the race for those names with "multi-racial". Code shown below:
df1 <- unique(df[, c('Name', 'Race')])
multi_answer <-
df1 %>%
dplyr::group_by(Name) %>%
dplyr::summarise(n_answers = n_distinct(Race))
multi_answer <- multi_answer[multi_answer$n_answers >1,]
df1[df1$Name %in% c(multi_answer$Name), 'Race'] <- 'multi-racial'
df1 <- unique(df1)
You can group_by the Name and then summarise the data, using the condition "there is more than one entry for this name" (i.e., n() > 1). Note that n() counts rows, so this assumes a name never repeats with the same race:
library(tidyverse)
df |>
  group_by(Name) |>
  summarise(Race = ifelse(n() > 1, "multi-racial", Race))
#> # A tibble: 4 x 2
#> Name Race
#> <chr> <chr>
#> 1 Bobby Asian
#> 2 Fred Black
#> 3 Jon multi-racial
#> 4 Sarah Asian
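If a name can appear several times with the same race, counting rows would mislabel it. A variant that counts distinct races instead, as the question's own code does with n_distinct, might look like this. The extra (Fred, Black) row is an assumption added here purely to show the difference.

```r
library(dplyr)

# Data from the question plus one duplicated (Fred, Black) row (added for illustration)
df <- data.frame(
  Name = c("Jon", "Jon", "Bobby", "Sarah", "Fred", "Fred"),
  Race = c("Black", "White", "Asian", "Asian", "Black", "Black")
)

result <- df %>%
  group_by(Name) %>%
  # Only names with more than one *distinct* race become "multi-racial"
  summarise(Race = if (n_distinct(Race) > 1) "multi-racial" else first(Race))

print(result)
```

Here Fred stays "Black" even though he has two rows, while Jon is recoded.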

Conditionally replace values with NA in R

I'm trying to conditionally replace values with NA in R.
Here's what I've tried so far using dplyr package.
Data
have <- data.frame(id = 1:3,
gender = c("Female", "I Do Not Wish to Disclose", "Male"))
First try
want = as.data.frame(have %>%
mutate(gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
)
This gives me an error.
Second try
want = as.data.frame(have %>%
mutate(gender = ifelse(gender == "I Do Not Wish to Disclose", NA, gender))
)
This runs without an error but turns Female into 1, Male into 3 and I Do Not Wish to Disclose into 2...
This happens when the column is a factor. Convert it to character and it should work:
library(dplyr)
have %>%
mutate(gender = as.character(gender),
gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
The values in gender change because the factor gets coerced to its underlying integer codes:
as.integer(factor(c("Male", "Female", "Male")))
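A small base R sketch of the same effect, using the values from the question. It assumes gender is stored as a factor, which was the data.frame() default before R 4.0.0.

```r
# The gender column from the question, stored explicitly as a factor
gender <- factor(c("Female", "I Do Not Wish to Disclose", "Male"))

# ifelse() drops the factor attributes, leaving only the integer codes
codes <- ifelse(gender == "I Do Not Wish to Disclose", NA, gender)
print(codes)

# Converting to character first keeps the labels
labels <- ifelse(gender == "I Do Not Wish to Disclose", NA, as.character(gender))
print(labels)
```

The first result contains the codes 1 and 3 for Female and Male, which matches the behaviour described in the question.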
I would use the very neat function na_if() from dplyr.
library(dplyr)
have <- data.frame(gender = c("F", "M", "NB", "I Do Not Wish to Disclose"))
have |> mutate(gender2 = na_if(gender, "I Do Not Wish to Disclose"))
Output:
#> gender gender2
#> 1 F F
#> 2 M M
#> 3 NB NB
#> 4 I Do Not Wish to Disclose <NA>
Created on 2022-04-19 by the reprex package (v2.0.1)

Assign a conditional value to new created column

My Data frame looks like this
Now, I want to add a new column which assigns one (!) specific value to each country. That means, there is only one value for Australia, one for Canada etc. for every year.
It should look like this:
Year Country R Ineq Adv NEW_COL
2018 Australia R1 Ineq1 1 x_Australia
2019 Australia R2 Ineq2 1 x_Australia
1972 Canada R1 Ineq1 1 x_Canada
...
Is there a smart way to do this?
Appreciate any help!
You can use merge.
x = data.frame(country = c("AUS","CAN","AUS","USA"),
val1 = c(1:4))
y = data.frame(country = c("AUS","CAN","USA"),
val2 = c("a","b","c"))
merge(x,y)
country val1 val2
1 AUS 1 a
2 AUS 3 a
3 CAN 2 b
4 USA 4 c
You just manually create the (probably significantly smaller!) reference table that then gets duplicated in the original table in the merge. As you can see, my 3 row table (with a,b,c) is correctly duplicated up to the original (4 row) table such that every AUS gets "a".
You may use mutate and case_when from the package dplyr:
library(dplyr)
data <- data.frame(country = rep(c("AUS", "CAN"), each = 2))
data <- mutate(data,
newcol = case_when(
country == "CAN" ~ 1,
country == "AUS" ~ 2))
print(data)
You can use mutate and group_indices (note that passing grouping columns to group_indices() is deprecated as of dplyr 1.0.0):
library(dplyr)
Sample data:
sample.df <- data.frame(Year = sample(1971:2019, 10, replace = TRUE),
                        Country = sample(c("AUS", "Can", "UK", "US"), 10, replace = TRUE))
Create new variable called ID, and assign unique ID to each Country group:
sample.df <- sample.df %>%
mutate(ID = group_indices(., Country))
If you want it to appear as x_Country, you can use paste (as commented):
sample.df <- sample.df %>%
mutate(ID = paste(group_indices(., Country), Country, sep = "_"))
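On dplyr 1.0.0 or later, the same idea can be sketched with group_by() and cur_group_id(). The four-row data frame below is a fixed, made-up sample (not the asker's data) so that the result is reproducible.

```r
library(dplyr)

# Small fixed sample (illustrative values only)
sample.df <- data.frame(
  Year = c(2018, 2019, 1972, 1980),
  Country = c("AUS", "AUS", "Can", "UK")
)

sample.df <- sample.df %>%
  group_by(Country) %>%
  # cur_group_id() gives one integer per Country group
  mutate(ID = paste(cur_group_id(), Country, sep = "_")) %>%
  ungroup()

print(sample.df)
```

Every row of a given country receives the same ID, which is the "one value per country" behaviour asked for.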

Replace all nicknames with full names based on a different dataframe in R

I have a dataframe with names that are a mixture of full names and nicknames. I want to replace all of the nicknames in that dataframe with full names from a different dataset.
temp <- data.frame("Id" = c(1,2,3,4,5), "name" = c("abe", "bob", "tim","timothy", "Joe"))
temp2 <-data.frame("name" = c("abraham", "robert", "timothy","joseph"),"nickname1"=c("abe", "rob", "tim","joe"),"nickname2"=c("", "bob", "","joey"))
If a value in the name column of temp appears in either nickname1 or nickname2 in temp2, replace it with the corresponding value in the name column of temp2.
So the result should look like this:
temp3<- data.frame("Id" = c(1,2,3,4,5), "name" = c("abraham", "robert", "timothy","timothy", "Joseph"))
As mentioned by @thelatemail, you can reshape temp2 into long format and then do a join. Your data also mixes upper and lower case, so convert it to one uniform case before joining. If a value is present in temp2, take the full name from there; otherwise keep the value from temp, using coalesce.
library(dplyr)
temp2 %>%
tidyr::pivot_longer(cols = -name, names_to = 'nickname') %>%
filter(value != '') %>%
mutate(name = tolower(name)) %>%
right_join(temp %>% mutate(name = tolower(name)), by = c('value' = 'name')) %>%
mutate(name = coalesce(name, value)) %>%
select(Id, name)
# Id name
# <dbl> <chr>
#1 1 abraham
#2 2 robert
#3 3 timothy
#4 4 timothy
#5 5 joseph
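The same long-format lookup can also be sketched in base R with match(). This is an alternative under the same assumptions as above (names compared in lower case, empty nicknames ignored); the helper names lookup and idx are made up for the example.

```r
temp <- data.frame(Id = c(1, 2, 3, 4, 5),
                   name = c("abe", "bob", "tim", "timothy", "Joe"),
                   stringsAsFactors = FALSE)
temp2 <- data.frame(name = c("abraham", "robert", "timothy", "joseph"),
                    nickname1 = c("abe", "rob", "tim", "joe"),
                    nickname2 = c("", "bob", "", "joey"),
                    stringsAsFactors = FALSE)

# Stack both nickname columns into one long lookup table
lookup <- data.frame(nickname = c(temp2$nickname1, temp2$nickname2),
                     full = rep(temp2$name, times = 2),
                     stringsAsFactors = FALSE)
lookup <- lookup[lookup$nickname != "", ]

# Find each (lower-cased) name among the nicknames; NA means no nickname matched
idx <- match(tolower(temp$name), lookup$nickname)

# Use the full name where a nickname matched, otherwise keep the original value
temp3 <- temp
temp3$name <- ifelse(is.na(idx), tolower(temp$name), lookup$full[idx])
print(temp3)
```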

Join based on substring

I have a dataset with emails like:
my_df <- data.frame(email = c("mirko#asdoi.com", "elsa#asodida.co.uk", "elsapina#asoqw.com"))
And I have an open source dataset like:
open_data <- data.frame(name = c("mirko", "elsa", "pina"), gender = c("male", "female", "male"))
How can I perform a lookup of my_df with open_data to associate the gender to each email?
If an email matches more than one name, I want one record per match.
The result should be:
result <- data.frame(email = c("mirko#asdoi.com", "elsa#asodida.co.uk", "elsapina#asoqw.com", "elsapina#asoqw.com"), gender = c("male", "female", "female", "male"))
One option is to use the sqldf library and solve this via a database style join between the two data frames:
library(sqldf)
my_df$name <- sub("#.*$", "", my_df$email)
sql <- "select t1.email, t2.gender from my_df t1 inner join open_data t2 "
sql <- paste0(sql, "on t1.name like '%' || t2.name || '%'")
result <- sqldf(sql)
Perhaps something along these lines? Not sure how robust this will be for more complex cases though.
library(tidyverse)
open_data %>%
  rowwise() %>%
  mutate(email = list(grep(name, my_df$email))) %>%
  unnest(email) %>%
  mutate(email = my_df$email[email])
## A tibble: 4 x 3
# name gender email
# <fct> <fct> <fct>
#1 mirko male mirko#asdoi.com
#2 elsa female elsa#asodida.co.uk
#3 elsa female elsapina#asoqw.com
#4 pina male elsapina#asoqw.com
Explanation: We use grep to find matches of open_data$name in my_df$email; then unnest to expand multiple matches, and use row indices to extract email entries.
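For comparison, a base R sketch of the same substring join. It matches each name against the part of the email before the "#" separator, as the sqldf answer does, and uses fixed = TRUE so names are treated as literal text and not as regular expressions. The helper names local_part and matches are made up for the example.

```r
my_df <- data.frame(email = c("mirko#asdoi.com", "elsa#asodida.co.uk", "elsapina#asoqw.com"),
                    stringsAsFactors = FALSE)
open_data <- data.frame(name = c("mirko", "elsa", "pina"),
                        gender = c("male", "female", "male"),
                        stringsAsFactors = FALSE)

# Part of each email before the separator used in this data
local_part <- sub("#.*$", "", my_df$email)

# For each name, collect every email whose local part contains it
matches <- lapply(seq_len(nrow(open_data)), function(i) {
  hit <- grepl(open_data$name[i], local_part, fixed = TRUE)
  data.frame(email = my_df$email[hit],
             gender = rep(open_data$gender[i], sum(hit)),
             stringsAsFactors = FALSE)
})

# One row per (email, matching name) pair
result <- do.call(rbind, matches)
print(result)
```

An email that contains two names, such as the "elsapina" address, produces two rows, one per match.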

Resources