Create a mapping system from factor variable to numeric in R

I have an education column that consists of a number of strings. These are the levels of the factor variable.
[1] "NIU or no schooling" "NIU"
[3] "None or preschool" "Grades 1, 2, 3, or 4"
[5] "Grade 1" "Grade 2"
[7] "Grade 3" "Grade 4"
[9] "Grades 5 or 6" "Grade 5"
[11] "Grade 6" "Grades 7 or 8"
[13] "Grade 7" "Grade 8"
[15] "Grade 9" "Grade 10"
[17] "Grade 11" "Grade 12"
[19] "12th grade, no diploma" "12th grade, diploma unclear"
[21] "High school diploma or equivalent" "1 year of college"
[23] "Some college but no degree" "2 years of college"
[25] "Associate's degree, occupational/vocational program" "Associate's degree, academic program"
[27] "3 years of college" "4 years of college"
[29] "Bachelor's degree" "5+ years of college"
[31] "5 years of college" "6+ years of college"
[33] "Master's degree" "Professional school degree"
[35] "Doctorate degree" "Missing/Unknown"
I want to convert this factor variable to numeric so that Grade 1 --> 1, Grade 2 --> 2, and a level that spans several grades (e.g. "Grades 5 or 6") maps to the average of the grades it lists.
I can do this manually, but it would require 35 individual commands. Is there an easier way in R to build a mapping from the factor levels to numbers?
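One possible approach, offered as a sketch rather than a tested answer against the full data: build a named lookup vector from the level labels, then index it with the factor's labels. The helper grade_value and the names edu_map and education are made up for illustration. Only the "Grade"/"Grades" labels are handled here; every other level (college years, degrees, NIU, and so on) comes out as NA and would need a value assigned by hand.

```r
# A few of the levels from the question; in practice use levels(df$education)
lvls <- c("NIU", "Grade 1", "Grades 1, 2, 3, or 4", "Grades 5 or 6",
          "Grade 10", "Bachelor's degree")

# Hypothetical helper: mean of all numbers found in a "Grade"/"Grades" label,
# NA for every other label
grade_value <- function(lbl) {
  if (!grepl("^Grades? ", lbl)) return(NA_real_)
  nums <- as.numeric(regmatches(lbl, gregexpr("[0-9]+", lbl))[[1]])
  mean(nums)
}

# Named lookup vector: level label -> numeric value
edu_map <- vapply(lvls, grade_value, numeric(1))

# Apply the mapping to a factor by indexing with its labels
education <- factor(c("Grade 1", "Grades 5 or 6", "NIU"), levels = lvls)
education_num <- unname(edu_map[as.character(education)])
education_num
```

Levels that the helper leaves as NA can then be filled in one at a time, e.g. edu_map[["Bachelor's degree"]] <- (whatever value fits your coding scheme).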

Related

R loop to iterate and find unique combination between each item

concept_id concept_name event
1: 443387 Malignant tumor of stomach comorb
2: 4193704 Type 2 diabetes mellitus without complication comorb
3: 4095320 Malignant tumor of body of stomach comorb
4: 201826 Type 2 diabetes mellitus comorb
5: 4174977 Retinopathy due to diabetes mellitus comorb
For the above data, I am trying to create a list of combinations of concept_ids. There are 5 concept ids, so pairing each concept_id with every concept_id in a nested loop gives a list like this:
nrow(comorb_event)
for (i in (1:nrow(comorb_event))) {
for (j in (1:nrow(comorb_event))){
print(paste(i,j))
}
}
[1] "1 1"
[1] "1 2"
[1] "1 3"
[1] "1 4"
[1] "1 5"
[1] "2 1"
[1] "2 2"
[1] "2 3"
[1] "2 4"
[1] "2 5"
[1] "3 1"
[1] "3 2"
[1] "3 3"
[1] "3 4"
[1] "3 5"
[1] "4 1"
[1] "4 2"
[1] "4 3"
[1] "4 4"
[1] "4 5"
[1] "5 1"
[1] "5 2"
[1] "5 3"
[1] "5 4"
[1] "5 5"
This output is not what I want. A pair such as [1,1] combines an item with itself, so it can be dropped; likewise [2,1] is already covered by [1,2], so it can be dropped too. After removing the redundant combinations, the expected list would look like this:
[1] "1 2"
[1] "1 3"
[1] "1 4"
[1] "1 5"
[1] "2 3"
[1] "2 4"
[1] "2 5"
[1] "3 4"
[1] "3 5"
[1] "4 5"
Sample data
comorb_event <- structure(list(concept_id = c("443387", "4193704", "4095320",
"201826", "4174977"), concept_name = c("Malignant tumor of stomach",
"Type 2 diabetes mellitus without complication", "Malignant tumor of body of stomach",
"Type 2 diabetes mellitus", "Retinopathy due to diabetes mellitus"
), event = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("comorb",
"drug", "primary_dx"), class = "factor")), class = c("data.table",
"data.frame"), row.names = c(NA, -5L))
We need combn, which generates every unordered pair of row indices exactly once:
t(combn(seq_len(nrow(comorb_event)), 2))
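The same call works directly on the ids instead of the row indices. A small sketch using the concept_id values from the sample data (the names id_pairs, id_1 and id_2 are illustrative):

```r
concept_id <- c("443387", "4193704", "4095320", "201826", "4174977")

# Each column of combn() is one unordered pair; transpose to get one pair per row
id_pairs <- t(combn(concept_id, 2))
colnames(id_pairs) <- c("id_1", "id_2")

nrow(id_pairs)      # choose(5, 2) = 10 pairs
head(id_pairs, 3)
```

Because combn() never pairs an element with itself and never emits both (a, b) and (b, a), no filtering step is needed afterwards.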

Renaming labels of a factor in R

I have census data of Male and Female populations organized by age group:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
if (!file.exists("./datafiles/cc-est2018-alldata-54.csv"))
download.file(url, destfile = "./datafiles/cc-est2018-alldata-54.csv", mode = "wb")
popSample <- read.csv("./datafiles/cc-est2018-alldata-54.csv") %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
popSample$AGEGRP <- as.factor(popSample$AGEGRP)
I then plot the Male and Female population relationships, faceted by age group (1-18, which is currently treated as an int):
g <- ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups", x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
g
Which results in this plot: https://share.getcloudapp.com/v1ur6O4e
The problem: I am trying to convert the column AGEGRP from ‘int’ to ‘factor’, and change the factors labels from “1”, “2”, “3”, … “18” to "AgeGroup1", "AgeGroup2", "AgeGroup3", … "AgeGroup18"
When I try this code, my AGEGRP column's observation values are all replaced with NAs:
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
https://share.getcloudapp.com/qGuo1O4y
Thank you for your help,
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = 1:18, labels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
The levels argument has to match the values actually present in the column (1 to 18); the new names go in labels. You need to supply all 18 labels though.
Alternatively
levels(popSample$AGEGRP) <- c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
should work as well.
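To see why the call in the question returns only NAs: levels has to name values that occur in the data, while labels is what renames them. A toy sketch (the vector age_grp is made up and has only three groups):

```r
age_grp <- c(1, 2, 3, 2)

# None of the values 1, 2, 3 equals "0 to 4" etc., so every entry becomes NA
wrong <- factor(age_grp, levels = c("0 to 4", "5 to 9", "10 to 14"))

# levels matches the data, labels supplies the new names
right <- factor(age_grp, levels = 1:3,
                labels = c("0 to 4", "5 to 9", "10 to 14"))

wrong
right
```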
Read in the csv again:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
popSample <- read.csv(url) %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
If you just want to add a prefix "AgeGroup" to your facet labels, you do:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP,labeller=labeller(AGEGRP = function(i)paste0("AgeGroup",i))) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
If you need new factor levels, then you have to rebuild the factor (like @Annet's answer):
lvls = c("0 to 4", "5 to 9", "10 to 14", "15 to 19",
"20 to 24", "25 to 29", "30 to 34", "35 to 39",
"40 to 44", "45 to 49", "50 to 54", "55 to 59",
"60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
# needed because AGEGRP has already been converted to a factor;
# if you read the csv again, AGEGRP is still an integer and the same line works
popSample$AGEGRP = factor(lvls[popSample$AGEGRP],levels=lvls)
Then plot:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
To change all the factor labels with one function, you can use forcats::fct_relabel (forcats ships as part of the tidyverse, which you've already got loaded). The changed factor labels will carry over to the plot facets and the order stays the same.
First few entries:
# before relabelling
popSample$AGEGRP[1:4]
#> [1] 1 2 3 4
#> Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# after relabelling
forcats::fct_relabel(popSample$AGEGRP, ~paste0("AgeGroup", .))[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18
Or with base R, reassign the levels:
levels(popSample$AGEGRP) <- paste0("AgeGroup", levels(popSample$AGEGRP))
popSample$AGEGRP[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18

Error: NAs introduced by coercion when converting character to numeric

I want to convert the column Annual.income, stored as character in my data frame, to numeric. The code I use gives NA values, although the new class is numeric.
I have looked for answers on forums, but none of the questions address my problem:
I do not have NAs in the column Annual.income; there are only numbers. All the data is formatted to use "." instead of "," for decimals.
Here is the code I use.
data$Annual.income <- as.numeric(as.character(data$Annual.income))
******************************UPDATE********************************************
Here is the dput of the column Annual.income.
dput(data$Annual.income)
c("34 500", "51 400", "43 200", "40 100", "36 400", "39 100",
"41 900", "48 700", "45 500", "45 500", "49 100", "35 100", "34 500",
"29 200", "32 200", "36 300", "35 800", "31 500", "33 000", "34 600",
"32 100", "32 000", "31 400", "33 200", "42 600", "29 200", "34 600",
"29 200", "34 100", "30 600", "34 034", "33 600", "31 000", "35 500",
"30 600", "30 600", "30 600", "30 800", "34 034", "33 200", "32 900"
)
The following still gives me NAs.
data$Annual.income <- as.numeric(data$Annual.income)
I imported the data using the Import Dataset command of the Environment pane, with stringsAsFactors unchecked, Heading = Yes, Separator = Semicolon, Decimal = Period.
Thanks
...
The white space inside the numbers causes the problem here: as.numeric() cannot parse "34 500". Simply remove all white space characters with gsub() before converting, e.g.
Annual.income <- c("34 500", "51 400", "43 200", "40 100", "36 400", "39 100",
"41 900", "48 700", "45 500", "45 500", "49 100", "35 100", "34 500",
"29 200", "32 200", "36 300", "35 800", "31 500", "33 000", "34 600",
"32 100", "32 000", "31 400", "33 200", "42 600", "29 200", "34 600",
"29 200", "34 100", "30 600", "34 034", "33 600", "31 000", "35 500",
"30 600", "30 600", "30 600", "30 800", "34 034", "33 200", "32 900"
)
as.numeric(gsub("\\s", "", Annual.income))
#> [1] 34500 51400 43200 40100 36400 39100 41900 48700 45500 45500 49100
#> [12] 35100 34500 29200 32200 36300 35800 31500 33000 34600 32100 32000
#> [23] 31400 33200 42600 29200 34600 29200 34100 30600 34034 33600 31000
#> [34] 35500 30600 30600 30600 30800 34034 33200 32900
Created on 2019-05-17 by the reprex package (v0.2.1)
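A possible variation, in case the thousands separator is not an ordinary space: some exports use a non-breaking space (U+00A0), which "\\s" may not match depending on the regex engine and locale. A sketch that instead drops everything except digits and the decimal point (the sample values below are made up, and the pattern assumes no negative numbers or comma decimals):

```r
# Second element uses a non-breaking space as separator
income <- c("34 500", "51\u00a0400", "1 234.5")

# Keep only digits and ".", then convert
income_num <- as.numeric(gsub("[^0-9.]", "", income))
income_num
```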

Changing levels in R

I have a field where the levels are broken down as below:
levels(demo$age)
"18 to 24 years old" "25 to 34 years old" "35 to 44 years old" "45 to 54 years old" "55 to 64 years old" "65 to 74 years old" "75 years old or older"
How can I change the levels to
"Total " "18 to 24 years old" "25 plus".
We create a vector of levels that needs to be changed
v1 <- c("25 to 34 years old", "35 to 44 years old", "45 to 54 years old",
"55 to 64 years old", "65 to 74 years old" , "75 years old or older")
then, assign those to new level
levels(demo$age)[levels(demo$age) %in% v1] <- "25 plus"
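If we need a 'Total' level as well, rebuild the factor with the extra level (assigning a longer vector with levels(demo$age) <- ... would rename the existing levels by position instead of adding one)
demo$age <- factor(demo$age, levels = c("Total", levels(demo$age)))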
levels(demo$age)
#[1] "Total" "18 to 24 years old" "25 plus"
data
set.seed(24)
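# stringsAsFactors = TRUE so that age is a factor (not the default since R 4.0)
demo <- data.frame(age = sample(c("18 to 24 years old", v1), 100, replace = TRUE),
stringsAsFactors = TRUE)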
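One caveat when adding an extra level such as "Total": assigning a longer vector through levels(x) <- ... relabels the existing levels by position, so the data values change. A minimal sketch with a toy factor (the names age, bad and good are illustrative):

```r
age <- factor(c("18 to 24 years old", "25 plus", "25 plus"))

# Positional relabelling: level 1 becomes "Total", level 2 becomes "18 to 24 years old"
bad <- age
levels(bad) <- c("Total", levels(bad))
as.character(bad)

# Rebuilding the factor keeps the values and adds "Total" as an unused first level
good <- factor(age, levels = c("Total", levels(age)))
as.character(good)
levels(good)
```

Both versions report the same levels(), which is why the difference is easy to miss; comparing as.character() or table() before and after shows it.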

Convert a date range to Date type in R

This vector of date ranges is included in a dataframe of mine with class 'character'. The formats vary depending on whether the date range crosses into a different month:
dput(pollingdata$dates)
c("Nov. 1-7", "Nov. 1-7", "Oct. 24-Nov. 6", "Oct. 4-Nov. 6",
"Oct. 30-Nov. 6", "Oct. 25-31", "Oct. 7-27", "Oct. 21-Nov. 3",
"Oct. 20-24", "Jul. 19", "Oct. 29-Nov. 4", "Oct. 28-Nov. 3",
"Oct. 27-Nov. 2", "Oct. 20-28", "Sep. 30-Oct. 20", "Oct. 15-19",
"Oct. 26-Nov. 1", "Oct. 25-31", "Oct. 24-30", "Oct. 18-26",
"Oct. 10-14", "Oct. 4-9", "Sep. 23-Oct. 6", "Sep. 16-29", "Sep. 2-22",
"Oct. 21-Nov. 2", "Oct. 17-25", "Sep. 30-Oct. 13", "Sep. 27-Oct. 3",
"Sep. 21-26", "Sep. 14-20", "Aug. 26-Sep. 15", "Sep. 7-13",
"Aug. 19-Sep. 8", "Aug. 31-Sep. 6", "Aug. 12-Sep. 1", "Aug. 9-Sep. 1",
"Aug. 24-30", "Aug. 5-25", "Aug. 17-23", "Jul. 29-Aug. 18",
"Aug. 10-16", "Jan. 12")
I would like to convert this vector into two separate columns in my dataframe, startdate and enddate, for the beginning and end of each range. Both columns should be of class 'Date', which will make it easier for me to use the data in my project. Does anyone know an easy way to do this manipulation? I have been struggling with it.
Thanks in advance,
We can split the vector (here v1, i.e. pollingdata$dates) on "-" into a list. Where the last element starts with a number, i.e. the end of the range has no month, we paste the month substring from the start onto it. Entries with fewer than 2 elements are padded with NA using `length<-`, and the list is converted to a data.frame with do.call(rbind.data.frame, ...)
lst <- lapply(strsplit(v1, "-"), function(x) {
i1 <- grepl("^[0-9]+", x[length(x)])
if(i1) {
x[length(x)] <- paste(substr(x[1], 1, 4), x[length(x)])
x} else x})
d1 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(lengths(lst))))
colnames(d1) <- c("Start_Date", "End_Date")
As per the OP's post, we need to convert to Date class, but a Date needs a full year-month-day (it prints as %Y-%m-%d) and the vector contains no year. I am not sure whether it is acceptable to paste in an assumed year (2017 here) before converting. If that is permissible, then
d1[] <- lapply(d1, function(x) as.Date(paste(x, 2017), "%b. %d %Y"))
head(d1)
# Start_Date End_Date
#1 2017-11-01 2017-11-07
#2 2017-11-01 2017-11-07
#3 2017-10-24 2017-11-06
#4 2017-10-04 2017-11-06
#5 2017-10-30 2017-11-06
#6 2017-10-25 2017-10-31
You may use the stringr function str_split_fixed to split the fields and then process the data. Load stringr and proceed as below:
library(stringr)
dat <- data.frame(date=c("Nov. 1-7", "Nov. 1-7", "Oct. 24-Nov. 6", "Oct. 4-Nov. 6",
"Oct. 30-Nov. 6", "Oct. 25-31", "Oct. 7-27", "Oct. 21-Nov. 3",
"Oct. 20-24", "Jul. 19", "Oct. 29-Nov. 4", "Oct. 28-Nov. 3",
"Oct. 27-Nov. 2", "Oct. 20-28", "Sep. 30-Oct. 20", "Oct. 15-19",
"Oct. 26-Nov. 1", "Oct. 25-31", "Oct. 24-30", "Oct. 18-26",
"Oct. 10-14", "Oct. 4-9", "Sep. 23-Oct. 6", "Sep. 16-29", "Sep. 2-22",
"Oct. 21-Nov. 2", "Oct. 17-25", "Sep. 30-Oct. 13", "Sep. 27-Oct. 3",
"Sep. 21-26", "Sep. 14-20", "Aug. 26-Sep. 15", "Sep. 7-13",
"Aug. 19-Sep. 8", "Aug. 31-Sep. 6", "Aug. 12-Sep. 1", "Aug. 9-Sep. 1",
"Aug. 24-30", "Aug. 5-25", "Aug. 17-23", "Jul. 29-Aug. 18",
"Aug. 10-16", "Jan. 12"))
Output processing:
# splitting on space and dash
dt <- data.frame(str_split_fixed(dat$date, "[-]|\\s",4))
names(dt) <- c("stdt1","stdt2","endt1","endt2")
## removing the dot (.) by replacing it with ""
dt1 <- data.frame(sapply(dt,function(x)gsub("[.]","",x)))
dt1$stdt <- as.Date(paste0(dt1$stdt2,dt1$stdt1,"2016"),format="%d%b%Y")
dt1$endt <- ifelse(dt1$endt2=="",paste0(dt1$endt1,dt1$stdt1,"2016"),
paste0(dt1$endt2,dt1$endt1,"2016"))
dt1$endt <-as.Date(ifelse(nchar(dt1$endt)==7,paste0(dt1$stdt2,dt1$endt),dt1$endt),"%d%b%Y")
Assumptions:
1) No year is provided, hence I have taken the year as 2016.
2) In the 10th and 43rd rows there is no end date, hence I have assumed the end date is the same day as the start date.
Answer:
> dt1
   stdt1 stdt2 endt1 endt2       stdt       endt
1    Nov     1     7       2016-11-01 2016-11-07
2    Nov     1     7       2016-11-01 2016-11-07
3    Oct    24   Nov     6 2016-10-24 2016-11-06
4    Oct     4   Nov     6 2016-10-04 2016-11-06
5    Oct    30   Nov     6 2016-10-30 2016-11-06
6    Oct    25    31       2016-10-25 2016-10-31
7    Oct     7    27       2016-10-07 2016-10-27
8    Oct    21   Nov     3 2016-10-21 2016-11-03
9    Oct    20    24       2016-10-20 2016-10-24
10   Jul    19             2016-07-19 2016-07-19
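The steps used in the answers above (split on the dash, borrow the month when the end has none, reuse the start for single days, paste in an assumed year) can also be wrapped in one small base R function. This is only a sketch: parse_range and its year argument are hypothetical names, the year still has to be assumed because the data contains none, and "%b" relies on an English (or C) LC_TIME locale to recognise "Nov", "Oct", etc.

```r
parse_range <- function(x, year = 2017) {
  parts <- strsplit(x, "-", fixed = TRUE)
  start <- vapply(parts, `[`, character(1), 1)
  # single-day entries such as "Jul. 19" use the start as the end
  end <- vapply(parts, function(p) if (length(p) > 1) p[2] else p[1],
                character(1))
  # an end like "7" has no month: borrow the month token from the start
  no_month <- !grepl("^[A-Za-z]", end)
  end[no_month] <- paste(sub(" .*", "", start[no_month]), end[no_month])
  to_date <- function(s) as.Date(paste(s, year), format = "%b. %d %Y")
  data.frame(startdate = to_date(start), enddate = to_date(end))
}

res <- parse_range(c("Nov. 1-7", "Oct. 24-Nov. 6", "Jul. 19"))
res
```

Usage on the original data would be something like pollingdata <- cbind(pollingdata, parse_range(pollingdata$dates)). Ranges that cross a year boundary would need extra handling, since a single year is applied to both ends.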
