split a column into numeric and non-numeric components - r

I need to split one column into two, where the resulting columns contain the numeric and character portions of the original column.
df <- data.frame(myCol = c("24 hours", "36days", "1month", "2 months +"))
myCol
24 hours
36days
1month
2 months +
result should be:
alpha numeric
hours 24
days 36
month 1
months + 2
Note the inconsistent formatting of the original dataframe (sometimes with spaces, sometimes without).
tidy or base solutions are fine
Thanks

One solution could be:
library(tidyverse)
df %>%
separate(myCol,
into = c("numeric", "alpha"),
sep = "(?=[a-z +]+)(?<=[0-9])"
)
Which returns:
numeric alpha
1 24 hours
2 36 days
3 1 month
4 2 months +

You could do:
library(stringr)
df$numeric <- str_extract(df$myCol, "[0-9]+")
df$alpha <- str_remove(df$myCol, df$numeric)
Or with base functions
df$numeric <- regmatches(df$myCol, regexpr("[0-9]+", df$myCol))
df$alpha <- gsub("[0-9]+", "", df$myCol)
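Note that both variants leave the text around the digits untouched, so "24 hours" yields " hours" with a leading space, and the numeric column is still of type character. A minimal base R sketch that trims the leftover whitespace and converts the digits, assuming every value contains exactly one run of digits as in the sample data (the name `out` is only illustrative):

```r
df <- data.frame(myCol = c("24 hours", "36days", "1month", "2 months +"))

# Pull out the first run of digits and convert it to an integer
num <- as.integer(regmatches(df$myCol, regexpr("[0-9]+", df$myCol)))

# Drop the digits, then trim the whitespace left behind
alpha <- trimws(sub("[0-9]+", "", df$myCol))

out <- data.frame(alpha = alpha, numeric = num)
out
```

regmatches() silently drops elements without a match, so this sketch would need an extra check if some rows may contain no digits at all.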

Related

Is there a tidyverse-way to split one column of character strings into two columns, where sorting depends on specific characters?

I have columns with data for age; e.g 2y:3m equals 2 years and 3 months and 5m = 5 months.
Age
2y:3m
5m
I wish to separate this column into two: "Years" and "Months", respectively.
I can do this by using the tidyr separate-function with ":" as separator.
However, my problem is that for children whose age is only reported in months, e.g. 5m, the separation puts 5m into the "Years" column and NA into the "Months" column, like this:
Years
Months
2y
3m
5m
NA
Does anyone know a handy way to solve this, preferably within the tidyverse packages.
Hope one of you can help!
This is what I tried (after streamlining the notation of years ("years", "year", "ys") and months ("month", "months", "mths") to only "y" and "m"):
childage1 <- Child_age %>%
separate(eage_clean,c("Years","Months"),sep=":")
One idea might be to first put a ":" in front of all values that contain only months, but I could not work out how to do this...
If you want the years and months as numbers, you could extract with regex rather than separating:
df %>% mutate(years = as.numeric(sub("^.*(\\d+)y.*$", "\\1", Age)),
months = as.numeric(sub("^.*(\\d+)m.*$", "\\1", Age)))
#> Age years months
#> 1 2y:3m 2 3
#> 2 5m NA 5
It seems that it may be more useful in this data set to have 0 rather than NA, since "5m" probably represents 0 years, 5 months, and e.g. "2y" probably means 2 years, 0 months. If so, then you may prefer:
df %>% mutate(years = as.numeric(sub("^.*(\\d+)y.*$", "\\1", Age)),
months = as.numeric(sub("^.*(\\d+)m.*$", "\\1", Age))) %>%
mutate(across(c(years, months), ~ifelse(is.na(.x), 0, .x)))
#> Age years months
#> 1 2y:3m 2 3
#> 2 5m 0 5
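One caveat with the patterns above: the leading `.*` is greedy, so for a multi-digit value such as "12y:3m" the capture group would receive only the last digit. A base R sketch that sidesteps this with a lookahead, assuming each value has at most one `<digits>y` part and one `<digits>m` part (the helper name `extract_unit` is hypothetical, and missing parts are returned as 0):

```r
# Hypothetical helper: the number directly before `unit`, or 0 if absent
extract_unit <- function(x, unit) {
  # Digits that are immediately followed by the unit letter
  pattern <- paste0("[0-9]+(?=", unit, ")")
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) != -1L
  out <- rep(0, length(x))
  # regmatches() returns values only for the elements that matched
  out[hit] <- as.numeric(regmatches(x, m))
  out
}

Age <- c("2y:3m", "5m", "12y:10m", "2y")
res <- data.frame(Age,
                  years  = extract_unit(Age, "y"),
                  months = extract_unit(Age, "m"))
res
```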
With separate, use fill = "left" so that missing pieces are filled with NA on the left; a lone value such as 5m then lands in the "Months" column:
library(tidyr)
separate(Child_age, Age,c("Years","Months"),sep=":", fill = "left")

Using the same variable within str_detect and mutate

I have a variable time_col which contains the word 'minutes' if it is in minutes (e.g. 20 minutes), and only contains a number if it is in hours (e.g. 2). I want to remove the word 'minutes' and convert it into hours when the observation is in minutes.
df <- raw_df %>%
mutate(time_col = ifelse(str_detect(time_col, "minutes"), time_col/60, time_col))
However, this gives an error:
'Error: Problem with `mutate()` input `time_col`. x non-numeric
argument to binary operator.'
I don't have this issue when I use ifelse(str_detect(time_col, "minutes"), 1, 0) so I think this is because the str_detect replaces time_col before going over to the ifelse condition.
How do I fix this issue?
I've created a dummy dataframe to demonstrate.
Since your time_col is character, you'll need to first get rid of the string " minutes" (note the space before "minutes"), change it to numeric, then divide it by 60.
Input
library(tidyverse)
df <- data.frame(Dummy = letters[1:3],
time_col = c("2", "20 minutes", "30 minutes"))
df
Dummy time_col
1 a 2
2 b 20 minutes
3 c 30 minutes
Code and output
df %>% mutate(time_col = ifelse(
str_detect(time_col, "minutes"),
as.numeric(gsub(" minutes", "", time_col)) / 60,
time_col
))
Dummy time_col
1 a 2
2 b 0.333333333333333
3 c 0.5
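Because the "no" branch of ifelse() is still character, the result above remains a character column (hence "0.333333333333333"). A base R sketch that returns a numeric column of hours instead, assuming every value is either a bare number or "<number> minutes" (the names `is_minutes` and `value` are illustrative):

```r
df <- data.frame(Dummy = letters[1:3],
                 time_col = c("2", "20 minutes", "30 minutes"))

# Remember which rows were recorded in minutes
is_minutes <- grepl("minutes", df$time_col, fixed = TRUE)

# Strip the unit so that every value parses as a number
value <- as.numeric(sub(" *minutes", "", df$time_col))

# Convert only the minute rows to hours
df$time_col <- ifelse(is_minutes, value / 60, value)
df
```

Computing the logical flag before converting means both branches of ifelse() are numeric, so no coercion back to character takes place.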

How to tidy my weekyear variable in the dataset

I have a dataset with a weekyear variable.
For example:
Weekyear
12016
22016
32016
...
422016
432016
442016
As you might understand this creates some difficulties as approaching this variable as an integer does not allow me to sort it descending-wise.
Therefore, I want to change the variable from 12016 to 201601 to allow descending ordering. This would be easy if all values had the same number of characters, but they don't (for example 12016 and 432016).
Does anyone know how to treat this variable? Thanks in advance!
Diederik
You could use stringr::str_sub to get the format you want:
# Getting the year
years <- stringr::str_sub(text, -4)
# Getting the weeks
weeks <- stringr::str_sub(text, end = nchar(text) - 4)
weeks <- ifelse(nchar(weeks) == 1, paste0(0, weeks), weeks)
as.integer(paste0(years, weeks))
[1] 201601 201602 201603 201642 201643 201644
Data:
text <- c(12016, 22016, 32016, 422016, 432016, 442016)
EDIT:
Or, you can use a combo of str_pad and str_sub:
library(stringr)
text_paded <- str_pad(text, 6, "left", 0)
as.integer(paste0(str_sub(text_paded, start = -4), str_sub(text_paded, end = 2)))
[1] 201601 201602 201603 201642 201643 201644
You can extract the year and week using modulo arithmetic and integer division.
x <- 432016
year <- x %% 10000
week <- x %/% 10000
week <- sprintf("%02d", week) # make sure single digits have leading zeros
new_x <- paste0(year, week)
new_x <- as.integer(new_x)
new_x
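The same arithmetic also works on a whole vector and can skip the string round trip, since the target value is simply year * 100 + week. A short sketch, assuming the year always occupies the last four digits (the names `weekyear` and `yearweek` are illustrative):

```r
weekyear <- c(12016, 22016, 32016, 422016, 432016, 442016)

year <- weekyear %% 10000    # last four digits
week <- weekyear %/% 10000   # whatever precedes them

# Shift the year two places left and add the week
yearweek <- as.integer(year * 100 + week)

yearweek
sort(yearweek, decreasing = TRUE)
```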
Here is a very short approach using regex. No packages needed.
To better understand it, I split it in 2 steps but you can nest the calls.
text <- c(12016, 22016, 32016, 422016, 432016, 442016)
# first add a zero to weeks with one digit
text1 <- gsub("(\\b\\d{5}\\b)", "0\\1", text)
# then change position of first two and last four digits
gsub("([0-9]{2})([0-9]{4})", "\\2\\1", text1)

How can I merge three variables into one variable that represents the merged variables separated by a comma? [duplicate]

I have three variables: Year, Month, and Day. How can I merge them into one variable ("Date") so that the variable is represented as such:
yyyy-mm-dd
Thanks in advance and best regards!
How do you merge three variables into one variable?
Consider two methods:
Old school
With dplyr, lubridate, and data frames
And consider the data types. You can have:
Number or character
Date or POSIXct final type
Old School Method
The old school method is straightforward. I assume you are using vectors or lists and don't know data frames yet. Let's take your data, force it to a standardized, unambiguous format, and concatenate the data.
> y <- 2012:2015
> y
[1] 2012 2013 2014 2015
> m <- 1:4
> m
[1] 1 2 3 4
> d <- 10:13
> d
[1] 10 11 12 13
Use as.numeric if you want to be safe and convert everything to the same format before concatenation. If you get any NA values you will need to handle them with the is.na function and provide a default value.
Use paste with the sep separator value set to your delimiter, in this case, the hyphen.
> paste(y,m,d, sep = '-')
[1] "2012-1-10" "2013-2-11" "2014-3-12" "2015-4-13"
Dataframe / Dplyr / Lubridate Way
> df <- data.frame(year = y, mon = m, day = d)
> df
year mon day
1 2012 1 10
2 2013 2 11
3 2014 3 12
4 2015 4 13
Below I do the following:
Take the df object
Create a new variable name Date
Concatenate the numeric columns year, mon, and day with a - separator
Convert the resulting string into a Date with ymd()
> library(dplyr)
> library(lubridate)
> df %>%
mutate(Date = ymd(
paste(year, mon, day, sep = '-')
)
)
year mon day Date
1 2012 1 10 2012-01-10
2 2013 2 11 2013-02-11
3 2014 3 12 2014-03-12
4 2015 4 13 2015-04-13
Below we create year-month-day character strings, yyyy-mm-dd character strings (similar except one digit month and day are zero padded out to 2 digits) and Date class. The last one prints out as yyyy-mm-dd and can be manipulated in ways that character strings can't, for example adding one to a Date class object gives the next day.
First we set up some sample input:
year <- c(2017, 2015, 2014)
month <- c(3, 1, 10)
day <- c(15, 9, 25)
Convert to year-month-day character string. This is not quite yyyy-mm-dd since 1 digit months and days are not zero padded to 2 digits:
paste(year, month, day, sep = "-")
## [1] "2017-3-15" "2015-1-9" "2014-10-25"
Convert to Date class. It prints on console as yyyy-mm-dd. Two alternatives:
as.Date(paste(year, month, day, sep = "-"))
## [1] "2017-03-15" "2015-01-09" "2014-10-25"
as.Date(ISOdate(year, month, day))
## [1] "2017-03-15" "2015-01-09" "2014-10-25"
Convert to character string yyyy-mm-dd. In this case 1 digit month and day are zero padded out to 2 characters. Two alternatives:
as.character(as.Date(paste(year, month, day, sep = "-")))
## [1] "2017-03-15" "2015-01-09" "2014-10-25"
sprintf("%d-%02d-%02d", year, month, day)
## [1] "2017-03-15" "2015-01-09" "2014-10-25"
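To illustrate the point that a Date can be manipulated in ways a character string cannot, here is a small sketch using the same sample input (the names `dates`, `next_day`, and `chars` are illustrative):

```r
year <- c(2017, 2015, 2014)
month <- c(3, 1, 10)
day <- c(15, 9, 25)

# Build zero-padded strings, then convert them to Date class
dates <- as.Date(sprintf("%d-%02d-%02d", year, month, day))

# Date arithmetic is done in days
next_day <- dates + 1

# format() turns a Date back into a "yyyy-mm-dd" character string
chars <- format(dates)

next_day
chars
```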

Convert age entered as 'X Weeks, Y Days, Z hours' in R

I have an age variable containing observations that follow this (inconsistent) format:
3 weeks, 2 days, 4 hours
4 weeks, 6 days, 12 hours
3 days, 18 hours
4 days, 3 hours
7 hours
8 hours
I need to convert each observation to hours using R.
I have used strsplit(vector, ',') to split the variable at each comma.
I am running into trouble because splitting each observation at the ',' yields anywhere from 1 to 3 entries for each observation. I do not know how to properly index these entries so that I end up with one row for each observation.
I am guessing that once I am able to store these values in sensible rows, I can extract the numeric data from each column in a row and convert accordingly, then sum the entire row.
I am also open to any different methods of approaching this problem.
After you split your data you can parse the resulting list for the keywords defining the times like 'hours', 'weeks', 'days' and create a dataframe containing the relevant value (or 0 if there is no value for a certain keyword). You can achieve that with something like this:
library(dplyr)
vector = c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours", "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
split_vector = strsplit(vector, ",", fixed = TRUE)
parse_string = function(i){
x = split_vector[[i]]
# tibble() is used here in place of the deprecated data_frame()
tibble(ID = i) %>%
mutate(hours = ifelse(any(grepl("hours", x)), as.numeric(gsub("\\D", "", x[grepl("hours", x)])), 0),
days = ifelse(any(grepl("days", x)), as.numeric(gsub("\\D", "", x[grepl("days", x)])), 0),
weeks = ifelse(any(grepl("weeks", x)), as.numeric(gsub("\\D", "", x[grepl("weeks", x)])), 0))
}
all_parsed = lapply(seq_along(split_vector), parse_string)
# bind_rows() is the current replacement for rbind_all()
all_parsed = bind_rows(all_parsed) %>%
mutate(final_hours = hours + days * 24 + weeks * 7 * 24)
Hadleyverse comes to the rescue again:
library(lubridate)
library(stringr)
dat <- readLines(textConnection(" 3 weeks, 2 days, 4 hours
4 week, 6 days, 12 hours
3 days, 18 hours
4 day, 3 hours
7 hours
8 hour"))
sapply(str_split(str_trim(dat), ",[ ]*"), function(x) {
sum(sapply(x, function(y) {
bits <- str_split(str_trim(y), "[ ]+")[[1]]
duration(as.numeric(bits[1]), bits[2])
})) / 3600
})
## [1] 556 828 90 99 7 8
I whacked the data a bit to show it's also somewhat flexible in how it parses things. I really don't think the second str_trim is absolutely necessary, but I didn't have cycles to verify.
The exposition is that it trims the original vector then splits it into components (which makes a list of vectors). That list is then iterated over and the individual vector elements are further trimmed and split into # and unit duration. That's passed to lubridate and the value is returned and automatically converted to numeric seconds by the call to sum and we then make it into hours.
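For comparison, here is a base R sketch of the same conversion that needs no packages. It assumes every component is written as "<number> <unit>" with a unit beginning with week, day, or hour; the names `to_hours` and `hours_per` are illustrative.

```r
age <- c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours",
         "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")

# Hours represented by one unit of each kind
hours_per <- c(week = 7 * 24, day = 24, hour = 1)

to_hours <- function(x) {
  # Find every "<number> <unit>" piece, e.g. "3 week"
  pieces <- regmatches(x, gregexpr("[0-9]+ *(week|day|hour)", x))[[1]]
  n <- as.numeric(sub(" *[a-z]+$", "", pieces))   # the number
  unit <- sub("^[0-9]+ *", "", pieces)            # the unit stem
  # Look up the multiplier for each piece by name and add everything up
  sum(n * hours_per[unit])
}

hours <- vapply(age, to_hours, numeric(1), USE.NAMES = FALSE)
hours
```

Matching only the unit stem means singular and plural spellings ("week", "weeks") are treated alike, and each observation collapses to a single number, which avoids the ragged-list indexing problem from the question.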
