I have survey data of minutes to complete a journey in a dataframe, of character type. Some entries are written as a range, e.g. '5-10'. I want to change these entries to the mean of the range.
My data looks like this.
[1] "30" "15"
[3] "30" "15 Minutes "
[5] "15" "20 mins "
[7] "30" "half an hour to 40 minutes"
[9] "30" "40"
[11] "20" "30"
[13] "15" "20"
[15] "40" "20"
[17] "40" "30"
[19] "15" "15"
[21] "20" "30mins"
[23] "20" "20"
[25] "15" "40"
[27] "15" "25"
[29] "30" "20"
[31] "Depends where you live" "30-45"
[33] "30 min " "20"
[35] "30" "20"
[37] "60 minutes" "30 mins"
[39] "15" "10"
[41] "20" "40"
[43] "60" "60"
[45] "30" "49"
[47] "50 minutes" "20 minutes"
[49] "90" "7-10 minutes "
[51] "15-20" "25 minutes"
[53] "25" "45"
[55] "60 minutes " "2-4 hours"
[57] "30" "30 min"
[59] "20" "30"
[61] "20" "25"
[63] "2-4hrs" "30"
[65] "45" "45"
[67] "75" "20"
[69] "60" "45mins"
[71] "60" "20"
I have tried the following code:
data <- data %>% mutate(
est_time = case_when(
grepl('-', est_time) ~ mean(as.numeric(unlist(str_split(est_time, '-'))))
))
data <- data %>% mutate(
est_time = ifelse(
grepl('-', est_time),
mean(as.numeric(unlist(str_split(est_time, '-')))),
est_time)
)
Each time, I receive:
Warning message:
Problem while computing `est_time = case_when(...)`.
ℹ NAs introduced by coercion
I suspect this may be because the unlist function spreads the list data over multiple rows.
How can I resolve this and achieve my aim?
Instead of using case_when or ifelse, one option is to select only the rows that consist of a plain range (digits, a -, digits), read them with read.table, take the rowMeans, and assign the result back:
i1 <- grepl('^(\\d+)-(\\d+)$', data$est_time)
data$est_time[i1] <- rowMeans(read.table(text = data$est_time[i1],
sep = '-', header = FALSE), na.rm = TRUE)
If we instead want the mean for every entry that contains a range (i.e. entries such as 2-4 hours or 7-10 minutes in addition to 15-20):
library(stringr)
library(dplyr)
data %>%
mutate(est_time2 = str_replace_all(est_time, "(\\d+-\\d+)",
function(x) mean(scan(text = x, what = numeric(),
sep = '-', quiet = TRUE))))
-output
# A tibble: 9 × 2
est_time est_time2
<chr> <chr>
1 "15 Minutes" "15 Minutes"
2 "20 mins" "20 mins"
3 "40" "40"
4 "15" "15"
5 "Depends where you live" "Depends where you live"
6 "7-10 minutes " "8.5 minutes "
7 "15-20" "17.5"
8 "2-4 hours" "3 hours"
9 "30 min" "30 min"
data
data <- structure(list(est_time = c("15 Minutes", "20 mins", "40", "15",
"Depends where you live", "7-10 minutes ", "15-20", "2-4 hours",
"30 min")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L))
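As for why the attempts in the question fail: unlist() pools the pieces of every row into one vector, so mean() returns a single value for the whole column, and pieces such as "10 minutes " trigger the coercion warning. Below is a minimal base R sketch of a row-wise alternative. The helper name range_mean is hypothetical, and the sample values are copied from the question.

```r
# Sample values copied from the question; `range_mean` is a hypothetical helper name.
est_time <- c("30", "15-20", "7-10 minutes ", "2-4 hours", "Depends where you live")

# What the original attempt computes: one mean over every piece of every row.
pooled <- suppressWarnings(mean(as.numeric(unlist(strsplit(est_time, "-")))))
pooled  # NA, because pieces such as "10 minutes " cannot be coerced to numbers

# Row-wise alternative: capture the two numbers of the first range in each string.
range_mean <- function(x) {
  m <- regmatches(x, regexec("([0-9]+)-([0-9]+)", x))
  vapply(m, function(p) {
    # p holds the full match plus two captured groups, or is empty when no range exists
    if (length(p) == 3) mean(as.numeric(p[2:3])) else NA_real_
  }, numeric(1))
}

means <- range_mean(est_time)  # NA where the entry holds no range
```

Entries without a range come back as NA here, so they would still need to be filled from the original column (for example with ifelse(is.na(means), est_time, means)).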
I have census data of Male and Female populations organized by age group:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
if (!file.exists("./datafiles/cc-est2018-alldata-54.csv"))
download.file(url, destfile = "./datafiles/cc-est2018-alldata-54.csv", mode = "wb")
popSample <- read.csv("./datafiles/cc-est2018-alldata-54.csv") %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
popSample$AGEGRP <- as.factor(popSample$AGEGRP)
I then plot the Male and Female population relationships, faceted by age group (1-18, which is currently treated as an int):
g <- ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups", x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
g
Which results in this plot: https://share.getcloudapp.com/v1ur6O4e
The problem: I am trying to convert the column AGEGRP from ‘int’ to ‘factor’, and change the factors labels from “1”, “2”, “3”, … “18” to "AgeGroup1", "AgeGroup2", "AgeGroup3", … "AgeGroup18"
When I try this code, my AGEGRP column's observation values are all replaced with NAs:
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
https://share.getcloudapp.com/qGuo1O4y
Thank you for your help,
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = 1:18, labels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
The levels argument has to match the values already in the column (1 to 18); the new names go in labels, and all 18 of them need to be supplied.
Alternatively
levels(popSample$AGEGRP) <- c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
should work as well.
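To see why the call in the question returns only NAs, here is a small sketch. The vector agegrp is a hypothetical stand-in for the AGEGRP column, and age_labels is an assumed name for the label vector.

```r
# Hypothetical stand-in for the AGEGRP column (integer codes 1 to 18).
agegrp <- c(1L, 2L, 18L, 2L)
age_labels <- c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29",
                "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59",
                "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")

# `levels` is matched against the existing values, so nothing matches: all NA.
bad <- factor(agegrp, levels = age_labels)

# `levels` names the existing values; `labels` supplies their display names.
good <- factor(agegrp, levels = 1:18, labels = age_labels)
```

Printing bad shows four NAs, while good holds "0 to 4", "5 to 9", "85+", "5 to 9" with all 18 levels in age order.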
Read in the csv again:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
popSample <- read.csv(url) %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
If you just want to add a prefix "AgeGroup" to your facet labels, you do:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP,labeller=labeller(AGEGRP = function(i)paste0("AgeGroup",i))) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
If you need new factor labels, then you need to refactor (like @Annet's answer below):
lvls = c("0 to 4", "5 to 9", "10 to 14", "15 to 19",
"20 to 24", "25 to 29", "30 to 34", "35 to 39",
"40 to 44", "45 to 49", "50 to 54", "55 to 59",
"60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
# AGEGRP was already converted to a factor earlier, so index lvls by it;
# if you read the csv again, you can skip that earlier conversion
popSample$AGEGRP = factor(lvls[popSample$AGEGRP],levels=lvls)
Then plot:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
To change all the factor labels with one function, you can use forcats::fct_relabel (forcats ships as part of the tidyverse, which you've already got loaded). The changed factor labels will carry over to the plot facets and the order stays the same.
First few entries:
# before relabelling
popSample$AGEGRP[1:4]
#> [1] 1 2 3 4
#> Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# after relabelling
forcats::fct_relabel(popSample$AGEGRP, ~paste0("AgeGroup", .))[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18
Or with base R, reassign the levels:
levels(popSample$AGEGRP) <- paste0("AgeGroup", levels(popSample$AGEGRP))
popSample$AGEGRP[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18
I have problems with the output after I bin a numeric vector.
I am trying to bin the length of stay, which was calculated beforehand with the difftime function. It does not make sense to provide the whole code since this is only the background. Yet, when I bin, I do not get the right answer.
Here is the length of stay, assigned to los.
dput(los)
c(61.0416666666667, 61.0416666666667, 61.0416666666667, 2, 2, 3, 3)
Here are my breaks. I tried several variants of na.rm inside max(): TRUE, FALSE, and leaving it out altogether.
breaks <- c(0, 0.8, 0.16,
1.0, 1.8, 1.16,
2.0, 2.8, 2.16,
3.0, 3.8, 3.16,
4.0, 4.8, 4.16,
5.0, 5.8, 5.16,
6.0, 6.8, 6.16,
7.0, 14.0, 21.0, 28.0, max(los)) #, , na.rm = FALSE
Nevertheless, the code I tried next
dt_los$losbinned <- cut(dt_los$LOS,
breaks = breaks,
labels = c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"),
right = FALSE)#
gives me the following, depending on the value passed for 'right':
When right = FALSE, the LOS of 61.04 is not binned into the category "> 28 d", but I do get the right bins for the other values, 2.00 and 3.00.
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(NA, NA, NA, 7L, 7L,
10L, 10L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
When I pass right = TRUE, 61.04 is binned into "> 28 d", which is the desired answer. However, I do not get the right bins for 2.0 and 3.0: 2.0 is binned into "1 d 16hrs" and 3.0 into "2 d 16hrs". These should be binned into "2 d" and "3 d" respectively.
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(25L, 25L, 25L, 6L, 6L,
9L, 9L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
The expected result is the right bin assigned to each length of stay: for 61.04 -> "> 28 d", for 2 -> "2 d", for 3 -> "3 d".
If this can be done with tidyverse that would be amazing, as long as it respects the bins I have assigned. However, I am aware that may not be possible yet, so I am also okay with a corrected version of the code I came up with.
By default (right = TRUE), the cut function's bins are exclusive on the left and inclusive on the right.
From the cut function's help: The factor level labels are constructed as "(b1, b2]", "(b2, b3]" etc. for right = TRUE and as "[b1, b2)", ... if right = FALSE.
In order to include the lowest value (or, with right = FALSE as in this case, the highest value), the include.lowest = TRUE option is required. With right = FALSE this makes the last bin inclusive on both ends, "[b1, b2]".
Try:
labels<-c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d")
dt_los$losbinned <- cut(los, breaks=breaks, labels=labels, right=FALSE, include.lowest = TRUE)
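The effect of right and include.lowest can be checked on a smaller example. This is only a sketch: the breaks and labels below are shortened, hypothetical versions of the ones in the question, and the Inf variant is an alternative rather than part of the answer above.

```r
# Hypothetical, shortened breaks and labels (in days) to keep the example small.
los <- c(61.0416666666667, 2, 3)
brk <- c(0, 2, 3, 28, max(los))
lab <- c("< 2 d", "2 d", "3 - 28 d", "> 28 d")

# right = FALSE alone: the last bin is [28, max), so max(los) itself becomes NA.
open_top <- cut(los, breaks = brk, labels = lab, right = FALSE)

# include.lowest = TRUE closes that last bin: [28, max].
closed_top <- cut(los, breaks = brk, labels = lab, right = FALSE,
                  include.lowest = TRUE)

# Alternative: an infinite upper break avoids the edge case altogether.
inf_top <- cut(los, breaks = c(0, 2, 3, 28, Inf), labels = lab, right = FALSE)
```

One thing that may be worth checking separately: cut() sorts its breaks, so a sequence such as 0, 0.8, 0.16 is reordered to 0, 0.16, 0.8 before the labels are applied. If LOS is measured in days, 8 and 16 hours would also correspond to 1/3 and 2/3 rather than 0.8 and 0.16.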
This vector of date ranges is included in a dataframe of mine with class 'character'. The formats vary depending on whether the date range crosses into a different month:
dput(pollingdata$dates)
c("Nov. 1-7", "Nov. 1-7", "Oct. 24-Nov. 6", "Oct. 4-Nov. 6",
"Oct. 30-Nov. 6", "Oct. 25-31", "Oct. 7-27", "Oct. 21-Nov. 3",
"Oct. 20-24", "Jul. 19", "Oct. 29-Nov. 4", "Oct. 28-Nov. 3",
"Oct. 27-Nov. 2", "Oct. 20-28", "Sep. 30-Oct. 20", "Oct. 15-19",
"Oct. 26-Nov. 1", "Oct. 25-31", "Oct. 24-30", "Oct. 18-26",
"Oct. 10-14", "Oct. 4-9", "Sep. 23-Oct. 6", "Sep. 16-29", "Sep. 2-22",
"Oct. 21-Nov. 2", "Oct. 17-25", "Sep. 30-Oct. 13", "Sep. 27-Oct. 3",
"Sep. 21-26", "Sep. 14-20", "Aug. 26-Sep. 15", "Sep. 7-13",
"Aug. 19-Sep. 8", "Aug. 31-Sep. 6", "Aug. 12-Sep. 1", "Aug. 9-Sep. 1",
"Aug. 24-30", "Aug. 5-25", "Aug. 17-23", "Jul. 29-Aug. 18",
"Aug. 10-16", "Jan. 12")
I would like to convert this vector into two separate columns in my dataframe, 1. startdate and 2. enddate, for the beginning and end of the range. Both columns should be saved as class 'Date', which will make it easier for me to use the data in my project. Does anyone know an easy way to do this manipulation? I have been struggling with it.
Thanks in advance,
We can split the vector by - into a list, prefix the last element with the month substring where that element is only a day number, pad entries that have fewer than 2 elements with NA (using `length<-`), and convert to a data.frame (with do.call(rbind.data.frame, ...)):
v1 <- pollingdata$dates
lst <- lapply(strsplit(v1, "-"), function(x) {
i1 <- grepl("^[0-9]+", x[length(x)])
if(i1) {
x[length(x)] <- paste(substr(x[1], 1, 4), x[length(x)])
x} else x})
d1 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(lengths(lst))))
colnames(d1) <- c("Start_Date", "End_Date")
As per the OP's post, the columns should be of class Date, but a Date needs a year (the default format is %Y-%m-%d) and the vector has none. I am not sure whether it is acceptable to paste in a year; if it is, then
d1[] <- lapply(d1, function(x) as.Date(paste(x, 2017), "%b. %d %Y"))
head(d1)
# Start_Date End_Date
#1 2017-11-01 2017-11-07
#2 2017-11-01 2017-11-07
#3 2017-10-24 2017-11-06
#4 2017-10-04 2017-11-06
#5 2017-10-30 2017-11-06
#6 2017-10-25 2017-10-31
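The same idea can be written as a self-contained sketch that returns Date columns directly. It makes a few assumptions that are not in the question: the year is taken to be 2017, month abbreviations are parsed in an English locale, and single-day entries such as "Jul. 19" get an end date equal to the start date instead of NA. The names to_date, start_txt, end_txt and res are hypothetical.

```r
# Three sample values from the question: same month, two months, single day.
dates <- c("Nov. 1-7", "Oct. 24-Nov. 6", "Jul. 19")

parts <- strsplit(dates, "-", fixed = TRUE)

# The start is always the first piece.
start_txt <- vapply(parts, function(p) p[1], character(1))

# The end is the last piece; if it is only a day number, borrow the month
# abbreviation (first four characters, e.g. "Nov.") from the start.
end_txt <- vapply(parts, function(p) {
  last <- p[length(p)]
  if (grepl("^[0-9]+$", last)) paste(substr(p[1], 1, 4), last) else last
}, character(1))

# Assumption: every date falls in the given year (2017 here).
to_date <- function(x, year = 2017) {
  as.Date(paste(x, year), format = "%b. %d %Y")
}

res <- data.frame(startdate = to_date(start_txt), enddate = to_date(end_txt))
```

Both columns of res are of class Date, so they can be assigned to the original data frame as startdate and enddate.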
You can use the stringr function str_split_fixed to split the fields and then process the data. Load stringr and proceed as below:
library(stringr)
dat <- data.frame(date=c("Nov. 1-7", "Nov. 1-7", "Oct. 24-Nov. 6", "Oct. 4-Nov. 6",
"Oct. 30-Nov. 6", "Oct. 25-31", "Oct. 7-27", "Oct. 21-Nov. 3",
"Oct. 20-24", "Jul. 19", "Oct. 29-Nov. 4", "Oct. 28-Nov. 3",
"Oct. 27-Nov. 2", "Oct. 20-28", "Sep. 30-Oct. 20", "Oct. 15-19",
"Oct. 26-Nov. 1", "Oct. 25-31", "Oct. 24-30", "Oct. 18-26",
"Oct. 10-14", "Oct. 4-9", "Sep. 23-Oct. 6", "Sep. 16-29", "Sep. 2-22",
"Oct. 21-Nov. 2", "Oct. 17-25", "Sep. 30-Oct. 13", "Sep. 27-Oct. 3",
"Sep. 21-26", "Sep. 14-20", "Aug. 26-Sep. 15", "Sep. 7-13",
"Aug. 19-Sep. 8", "Aug. 31-Sep. 6", "Aug. 12-Sep. 1", "Aug. 9-Sep. 1",
"Aug. 24-30", "Aug. 5-25", "Aug. 17-23", "Jul. 29-Aug. 18",
"Aug. 10-16", "Jan. 12"))
Output processing:
# splitting on space and dash
dt <- data.frame(str_split_fixed(dat$date, "[-]|\\s",4))
names(dt) <- c("stdt1","stdt2","endt1","endt2")
# removing the dot (.) by replacing it with ""
dt1 <- data.frame(sapply(dt,function(x)gsub("[.]","",x)))
dt1$stdt <- as.Date(paste0(dt1$stdt2,dt1$stdt1,"2016"),format="%d%b%Y")
dt1$endt <- ifelse(dt1$endt2=="",paste0(dt1$endt1,dt1$stdt1,"2016"),
paste0(dt1$endt2,dt1$endt1,"2016"))
dt1$endt <-as.Date(ifelse(nchar(dt1$endt)==7,paste0(dt1$stdt2,dt1$endt),dt1$endt),"%d%b%Y")
Assumptions:
1) No year is provided, so I have taken the year as 2016.
2) Rows 10 and 43 have no end day, so I have assumed the end date is the same as the start date.
Answer:
> dt1
   stdt1 stdt2 endt1 endt2       stdt       endt
1    Nov     1     7       2016-11-01 2016-11-07
2    Nov     1     7       2016-11-01 2016-11-07
3    Oct    24   Nov     6 2016-10-24 2016-11-06
4    Oct     4   Nov     6 2016-10-04 2016-11-06
5    Oct    30   Nov     6 2016-10-30 2016-11-06
6    Oct    25    31       2016-10-25 2016-10-31
7    Oct     7    27       2016-10-07 2016-10-27
8    Oct    21   Nov     3 2016-10-21 2016-11-03
9    Oct    20    24       2016-10-20 2016-10-24
10   Jul    19             2016-07-19 2016-07-19
I have an education column that consists of a number of strings. These are the levels of the factor variable.
[1] "NIU or no schooling" "NIU"
[3] "None or preschool" "Grades 1, 2, 3, or 4"
[5] "Grade 1" "Grade 2"
[7] "Grade 3" "Grade 4"
[9] "Grades 5 or 6" "Grade 5"
[11] "Grade 6" "Grades 7 or 8"
[13] "Grade 7" "Grade 8"
[15] "Grade 9" "Grade 10"
[17] "Grade 11" "Grade 12"
[19] "12th grade, no diploma" "12th grade, diploma unclear"
[21] "High school diploma or equivalent" "1 year of college"
[23] "Some college but no degree" "2 years of college"
[25] "Associate's degree, occupational/vocational program" "Associate's degree, academic program"
[27] "3 years of college" "4 years of college"
[29] "Bachelor's degree" "5+ years of college"
[31] "5 years of college" "6+ years of college"
[33] "Master's degree" "Professional school degree"
[35] "Doctorate degree" "Missing/Unknown"
I want to convert this factor variable into a numeric such that Grade 1 --> 1, Grade 2 --> 2, and in-between grades are simply the average between any listed grade.
I can do this manually, but it would require 35 individual commands. Is there an easier way in R to create a mapping between the factor levels and numeric values?
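One possible direction, shown only as a rough sketch: build a named lookup vector from the level names themselves, averaging whatever numbers appear in levels that start with "Grade" or "Grades". The names edu_levels, grade_value, edu and edu_num are hypothetical, only a handful of the levels are used, and the non-grade levels (college years, degrees) are left as NA because the question does not say how they should be scored.

```r
# A few of the levels from the question, enough to show each case.
edu_levels <- c("Grade 1", "Grades 5 or 6", "Grades 1, 2, 3, or 4",
                "12th grade, no diploma", "Bachelor's degree")

# Pull every number out of each level name and average them.
nums <- regmatches(edu_levels, gregexpr("[0-9]+", edu_levels))
grade_value <- vapply(nums, function(n) {
  if (length(n) > 0) mean(as.numeric(n)) else NA_real_
}, numeric(1))

# Assumption: only levels starting with "Grade"/"Grades" are scored this way.
grade_value[!grepl("^Grades? ", edu_levels)] <- NA
names(grade_value) <- edu_levels

# Look up the numeric value for each observation by its level name.
edu <- factor(c("Grade 1", "Grades 5 or 6", "Bachelor's degree"),
              levels = edu_levels)
edu_num <- unname(grade_value[as.character(edu)])
```

The remaining levels could then be filled in by assigning to individual names, e.g. grade_value["Bachelor's degree"] <- some chosen value, which keeps the manual part down to the levels that cannot be derived from their text.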