How to convert gender in dataframes from integer to character using R - r

I have a column in my dataframe where gender is coded 1 and 0 for male and female respectively. It's not a replica, but looks something like this:
df <- read.csv("df.csv")
" Gender Age Width
1 0 35 1.4
2 0 30 1.4
3 1 32 1.3
4 1 31 1.5
5 0 36 1.4
6 1 39 1.7 "
I've managed to change its class to factor and give it labels:
df$Gender <- as.factor(df$Gender)
class(df$Gender)
df$Gender <- factor(df$Gender,
levels = c("1","0"),
labels = c("male", "female"))
However, when I try to print df$Gender, every value in the output is NA.
UPDATE:
Thank you all for your help!
I realised that my code works when I run it the first time. It only becomes "NA" when I rerun the second chunk. Will this be a problem or can I just ignore it?

You can use
library(tidyverse)
df %>%
  # list the levels explicitly: the default order (0, 1) would label 0 as "male"
  mutate(gender = factor(Gender, levels = c(1, 0), labels = c("male", "female")))
or simply
df$gender <- ifelse(df$Gender == 1,"male","female")
or
df %>%
mutate(gender = if_else(Gender == 1,"male","female"))
or
df %>%
mutate(gender = case_when(Gender == 1 ~ "male",
Gender == 0 ~ "female"))
Data
df = structure(list(Sn = 1:6, Gender = c(0L, 0L, 1L, 1L, 0L, 1L),
Age = c(35L, 30L, 32L, 31L, 36L, 39L), Width = c(1.4, 1.4,
1.3, 1.5, 1.4, 1.7), gender = c("female", "female", "male",
"male", "female", "male")), row.names = c(NA, -6L), class = "data.frame")

Everything you have done seems correct. It works when I reconstruct your input:
Gender <- c(1,0,1,0)
Age <- c(50,30,40,30)
df <- data.frame(Gender,Age)
df$Gender <- factor(df$Gender,
levels = c("1","0"),
labels = c("male", "female"))
print(df$Gender)
If you really want it as character, you can then add:
df$Gender <- as.character(df$Gender)
But I think (as others have already mentioned) it's because of your input data, so try adding stringsAsFactors to your import command:
df <- read.csv("df.csv", stringsAsFactors = FALSE)
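The behaviour described in the question's update (NA only on a second run) can be reproduced with a small stand-in data frame. This is a sketch, assuming the chunk overwrites df$Gender in place; the data frame below is made up for illustration and is not the asker's CSV.

```r
# Hypothetical stand-in for the data read from df.csv
df <- data.frame(Gender = c(0L, 0L, 1L, 1L, 0L, 1L))

# First run: the values are 0/1, so they match the levels "1" and "0"
df$Gender <- factor(df$Gender,
                    levels = c("1", "0"),
                    labels = c("male", "female"))
first_run <- as.character(df$Gender)

# Second run: the values are now "male"/"female"; none of them matches
# "1" or "0", so every element becomes NA
rerun <- factor(df$Gender,
                levels = c("1", "0"),
                labels = c("male", "female"))
all_na_on_rerun <- all(is.na(rerun))

# One rerun-safe variant: keep the raw codes and write labels to a new column
raw <- data.frame(Gender = c(0L, 0L, 1L, 1L, 0L, 1L))
raw$gender <- factor(raw$Gender, levels = c("1", "0"), labels = c("male", "female"))
raw$gender <- factor(raw$Gender, levels = c("1", "0"), labels = c("male", "female"))
```

Because the second variant never overwrites the 0/1 codes, running the chunk again gives the same result each time.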

Related

Ifelse conditional on same strings in multiple columns

I guess this is possible with one very long line of code using mutate() and ifelse(), but I want to know if there is a way of doing it without writing a ton of code.
I have data where the degree of each individual is written in a non-ordered fashion. The data looks like this:
id <- c(1, 2, 3, 4, 5, 6)
degree1 <- c("masters", "bachelors", "PhD", "bachelors", "bachelors", NA)
degree2 <- c("PhD", "masters", "bachelors", NA, NA, NA)
degree3 <- c("bachelors", NA, "masters", NA, "masters", NA)
Now I want to create a new column containing the string for the highest degree, like this
dat$highest_degree <- c("PhD", "masters", "PhD", "bachelors", "masters", NA)
How can I achieve this?
An option is to loop over the rows for the selected 'degree' column, convert to factor with levels specified in the order, drop the levels to remove the unused levels and select the first level
v1 <- c("PhD", "masters", "bachelors")
dat$highest_degree <- apply(dat[-1], 1, function(x)
levels(droplevels(factor(x, levels = v1)))[1])
dat$highest_degree
#[1] "PhD" "masters" "PhD" "bachelors" "masters" NA
Or, using the tidyverse, reshape into 'long' format, arrange the long-format column by matching it against an ordered degree vector, group by 'id', slice the first row of each group, then join with the original data
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(cols = starts_with('degree'), values_to = 'highest_degree') %>%
select(-name) %>%
arrange(id, match(highest_degree, v1)) %>%
group_by(id) %>%
slice_head(n = 1) %>%
ungroup %>%
left_join(dat, .)
data
dat <- data.frame(id, degree1, degree2, degree3)
Here is a base R option using pmin + factor
lvs <- c("PhD", "masters", "bachelors")
dat$highest_degree <- lvs[
do.call(
pmin,
c(asplit(matrix(as.integer(factor(as.matrix(dat[-1]), levels = lvs)), nrow(dat)), 2),
na.rm = TRUE
)
)
]
which gives
> dat
id degree1 degree2 degree3 highest_degree
1 1 masters PhD bachelors PhD
2 2 bachelors masters <NA> masters
3 3 PhD bachelors masters PhD
4 4 bachelors <NA> <NA> bachelors
5 5 bachelors <NA> masters masters
6 6 <NA> <NA> <NA> <NA>
Data
> dput(dat)
structure(list(id = c(1, 2, 3, 4, 5, 6), degree1 = c("masters",
"bachelors", "PhD", "bachelors", "bachelors", NA), degree2 = c("PhD",
"masters", "bachelors", NA, NA, NA), degree3 = c("bachelors",
NA, "masters", NA, "masters", NA)), class = "data.frame", row.names = c(NA,
-6L))
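The ranking idea the answers above share (an earlier position in the ordered vector means a higher degree) can also be sketched with match(). This is only an illustrative variant, assuming the three degree columns are named as in the question; degree_cols, ranks and best are names introduced here.

```r
dat <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  degree1 = c("masters", "bachelors", "PhD", "bachelors", "bachelors", NA),
  degree2 = c("PhD", "masters", "bachelors", NA, NA, NA),
  degree3 = c("bachelors", NA, "masters", NA, "masters", NA),
  stringsAsFactors = FALSE
)

# Degrees ordered from highest to lowest
lvs <- c("PhD", "masters", "bachelors")
degree_cols <- c("degree1", "degree2", "degree3")

# Turn each degree into its position in lvs (1 = highest); NA stays NA
ranks <- sapply(dat[degree_cols], match, table = lvs)

# Smallest rank per row, or NA when a row has no degree at all
best <- apply(ranks, 1, function(r) {
  if (all(is.na(r))) NA_integer_ else min(r, na.rm = TRUE)
})

dat$highest_degree <- lvs[best]
```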

R function that drops columns according to column values and primary keys?

Currently I have a compiled data frame, so for the same item code there are fixed and changing variables. For example:
   primary key  b       c
1. 1234         apple   pear
2. 1234         apple   orange
3. 5678         berry   lime
4. 5679         orange  apple
5. 5679         orange  apple
In this case, column c holds different values in rows 1 and 2 even though both rows share the primary key 1234, so column c should be dropped.
Is there any way I can do this without hard-coding the column names?
Using dplyr, we can find the columns that need to be dropped using summarize_all():
library(dplyr)
df <- tibble(key = c(1, 1, 2, 3, 3),
b = c("a", "a", "b", "c", "c"),
c = c("p", "o", "l", "a", "a"))
drop_cols <- df %>%
group_by(key) %>%
summarize_all(~ any(. != .[1])) %>%
select(-key) %>%
select_if(any) %>%
colnames()
df %>% select(- one_of(drop_cols))
One way with dplyr is to count, for each primary key, the number of distinct elements in each column. We then select the columns that have only one unique value for every primary key.
library(dplyr)
df %>%
select(df %>%
group_by(primary) %>%
mutate_at(vars(-group_cols()), n_distinct) %>%
select_if(~all(. == 1)) %>%
names)
# primary b
#1 1234 apple
#2 1234 apple
#3 5678 berry
#4 5679 orange
#5 5679 orange
data
df <- structure(list(primary = c(1234L, 1234L, 5678L, 5679L, 5679L),
b = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("apple",
"berry", "orange"), class = "factor"), c = structure(c(4L,
3L, 2L, 1L, 1L), .Label = c("apple", "lime", "orange", "pear"
), class = "factor")), class = "data.frame", row.names = c(NA,-5L))
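The same check (does any column take more than one value within a key?) can be sketched in base R without naming the columns to drop. This is a minimal sketch, assuming the key column is called primary as in the data above; key, other_cols, varies and kept are names introduced here.

```r
df <- data.frame(
  primary = c(1234L, 1234L, 5678L, 5679L, 5679L),
  b = c("apple", "apple", "berry", "orange", "orange"),
  c = c("pear", "orange", "lime", "apple", "apple"),
  stringsAsFactors = FALSE
)

key <- "primary"
other_cols <- setdiff(names(df), key)

# For every non-key column, count the distinct values within each key and
# flag the column if any key has more than one
varies <- sapply(other_cols, function(col) {
  n_per_key <- tapply(df[[col]], df[[key]], function(v) length(unique(v)))
  any(n_per_key > 1)
})

# Keep the key plus the columns that are constant within every key
kept <- df[, c(key, other_cols[!varies]), drop = FALSE]
```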

Create a contingency table with 2 factors from messy data

I have the following data in messy format:
structure(list(com_level = c("B", "B", "B", "B", "A", "A"),
hf_com = c(1, 1, 1, 1, 1, 1),
sal_level = c("2", "3", "1", "2", "1", "4"),
exp_sal = c(NA, 1, 1, NA, 1, NA)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -6L))
Column com_level is the factor with 2 levels and column hf_com gives the frequency count for that level.
Column sal_level is the factor with 4 levels and column exp_sal gives the frequency count for that level.
I want to create a contingency table similar to this:
structure(list(`1` = c(1L, 2L),
`2` = c(0L, 1L),
`3` = c(0L, 2L),
`4` = c(1L, 0L)),
row.names = c("A", "B"), class = "data.frame")
I have code that works when I want to compare two columns with the same factor:
# 1 step to create table with frequency counts for exp_sal and curr_sal per category of level
cs_es_table <- df_not_na_num %>%
dplyr::count(sal_level, exp_sal, curr_sal) %>%
tidyr::spread(key = sal_level,value = n) %>% # this code spreads on just one key
select(curr_sal, exp_sal, 1, 2, 3, 4, 5, 6, 7, -8) %>% # reorder columns and omit Column 8 (no answer)
as.data.frame()
# step 2- convert cs_es_table to long format and summarise exp_sal and curr_sal frequencies
cs_es_table <- cs_es_table %>%
gather(key, value, -curr_sal,-exp_sal) %>% # crucial step to make data long
mutate(curr_val = ifelse(curr_sal == 1,value,NA),
exp_val = ifelse(exp_sal == 1,value,NA)) %>% #mutate actually cleans up the data and assigns a value to each new column for 'exp' and 'curr'
group_by(key) %>% #for your summary, because you want to sum up your previous rows which are now assigned a key in a new column
summarise_at( .vars = vars(curr_val, exp_val), .funs = sum, na.rm = TRUE)
This code produces this table but just spreads on one key in step 1:
structure(list(curr_val = c(533L, 448L, 237L, 101L, 56L), exp_val = c(179L,
577L, 725L, 401L, 216L)), row.names = c("< 1000 EUR", "1001-1500 EUR",
"2001-3000 EUR", "3001-4000 EUR", "4001-5000 EUR"), class = "data.frame")
Will I need to use pivot_wider as in this example?
Is it possible to use spread on multiple columns in tidyr similar to dcast?
or
tidyr::spread() with multiple keys and values
Any help would be appreciated to compare the two columns with different factors.
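One possible starting point is a plain cross-tabulation with base R's table(). This is a sketch under a strong assumption: that each row of the messy data is one observation and the goal is to count rows per combination of com_level and sal_level. The columns hf_com and exp_sal are ignored here, so the counts need not match the illustrative table in the question.

```r
df <- data.frame(
  com_level = c("B", "B", "B", "B", "A", "A"),
  hf_com = c(1, 1, 1, 1, 1, 1),
  sal_level = c("2", "3", "1", "2", "1", "4"),
  exp_sal = c(NA, 1, 1, NA, 1, NA),
  stringsAsFactors = FALSE
)

# Count rows for every com_level / sal_level combination
tab <- table(com_level = df$com_level, sal_level = df$sal_level)

# Convert to a wide data frame: rows are com_level, columns are sal_level
wide <- as.data.frame.matrix(tab)
```

If the frequency columns should act as weights instead, the counting step would need to change accordingly.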

Showing an entire line of data from a factor within a column

I have data in the following format:
ID Species Sex
1 spA M
2 spB F
3 spA Sex Required
I would like to write a line of code that returns every full row containing "Sex Required" in the Sex column.
I have tried:
df[df$Sex == "Sex Required",]
But it is not working correctly. I know this is probably simple but I am new to R.
Try:
df<- structure(list(ID = 1:3, Species = structure(c(1L, 2L, 1L), .Label = c("spA",
"spB"), class = "factor"), Sex = structure(c(2L, 1L, 3L), .Label = c("F",
"M", " Sex Required"), class = "factor")), .Names = c("ID", "Species",
"Sex"), class = "data.frame", row.names = c(NA, -3L))
I assume there is stray whitespace in the values, i.e. "Sex Required " or " Sex Required", in which case the exact comparison matches nothing:
df[df$Sex == "Sex Required",]
#[1] ID Species Sex
#<0 rows> (or 0-length row.names)
df[grep("Sex Required", df$Sex),]
# ID Species Sex
# 3 3 spA Sex Required
Or
library(stringr)
df$Sex <- str_trim(df$Sex)
df[df$Sex == "Sex Required",]
# ID Species Sex
#3 3 spA Sex Required
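If stray whitespace is indeed the cause, base R's trimws() does the same job without loading stringr. This is a small sketch on a re-typed copy of the example data, assuming a leading space in the third value.

```r
df <- data.frame(
  ID = 1:3,
  Species = c("spA", "spB", "spA"),
  Sex = c("M", "F", " Sex Required"),  # note the leading space
  stringsAsFactors = FALSE
)

# The exact comparison fails while the space is still there
rows_before <- nrow(df[df$Sex == "Sex Required", ])

# Strip leading and trailing whitespace, then compare again
df$Sex <- trimws(df$Sex)
hits <- df[df$Sex == "Sex Required", ]
```

Unlike grep(), which also matches longer strings that merely contain the pattern, the comparison after trimming only returns exact matches.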

Passing current value of ddply split on to function

Here is some sample data for which I want to encode the gender of the names over time:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
# If we get only one year (1890) convert it to a range of years (1890-1890)
if (length(years) == 1) years <- c(years, years)
# Calculate the male and female proportions for the given range of years
ssa_select <- ssa_demo %.%
filter(year >= years[1], year <= years[2]) %.%
group_by(name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = sapply(proportion_female, male_or_female))
return(ssa_select)
}
# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
if (proportion_female > 0.5) {
return("female")
} else if(proportion_female == 0.5000) {
return("either")
} else {
return("male")
}
}
Now what I want to do is use plyr, specifically ddply, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is the code I have.
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year), this command works just fine if I hard code a value like 1890 as the argument to the function. But when I try to pass it the current value for year that ddply is working with, I get an error message:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year on to ddply?
I think you're making things too complicated by trying to do a join inside ddply. If I were to use dplyr I would probably do something more like this:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
group_by(year,name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = ifelse(proportion_female == 0.5,"either",
ifelse(proportion_female > 0.5,"female","male")))
Note that dplyr 0.1.1 is still a little finicky about the types of join columns, so I had to convert them first. I think I saw some activity on GitHub suggesting that this was either fixed in the dev version or is at least being worked on. (The %.% chain operator used here was replaced by %>% in later dplyr versions.)
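On the original question of handing each piece's year to the helper: one option is to wrap the work in a function that reads the year from the piece itself. Below is a base R sketch of that split-apply-combine pattern, using split() and lapply() in place of ddply and a simplified lookup in place of select_ssa. encode_piece is a hypothetical helper name, and the data are re-typed from the question as plain data frames.

```r
names_to_encode <- data.frame(
  names = c("john", "john", "jane", "jane", "madison", "madison"),
  year = c(1890, 1990, 1890, 1990, 1890, 2012),
  stringsAsFactors = FALSE
)
ssa_demo <- data.frame(
  name = c("jane", "jane", "john", "john", "madison", "madison"),
  year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male = c(0, 8, 8502, 29066, 14, 145),
  stringsAsFactors = FALSE
)

# Hypothetical helper: receives one per-year piece and looks up its own year
encode_piece <- function(piece) {
  current_year <- piece$year[1]
  lookup <- ssa_demo[ssa_demo$year == current_year, ]
  idx <- match(piece$names, lookup$name)  # NA when the name or year is missing
  prop_female <- lookup$female[idx] / (lookup$female[idx] + lookup$male[idx])
  piece$proportion_female <- round(prop_female, 4)
  piece$gender <- as.character(
    ifelse(prop_female == 0.5, "either",
           ifelse(prop_female > 0.5, "female", "male"))
  )
  piece
}

# Split by year, encode each piece, and stack the results
pieces <- split(names_to_encode, names_to_encode$year)
encoded <- do.call(rbind, lapply(pieces, encode_piece))
```

With ddply, the same idea would mean passing a function of the piece rather than a call that mentions year directly. Rows without a matching year in the lookup data (2012 here) come back with NA.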
