I have an age variable containing observations that follow this (inconsistent) format:
3 weeks, 2 days, 4 hours
4 weeks, 6 days, 12 hours
3 days, 18 hours
4 days, 3 hours
7 hours
8 hours
I need to convert each observation to hours using R.
I have used strsplit(vector, ',') to split the variable at each comma.
I am running into trouble because splitting each observation at the ',' yields anywhere from 1 to 3 entries per observation. I do not know how to index these entries properly so that I end up with one row for each observation.
I am guessing that once I am able to store these values in sensible rows, I can extract the numeric data from each column in a row and convert accordingly, then sum the entire row.
I am also open to any different methods of approaching this problem.
After you split your data you can parse the resulting list for the keywords defining the times like 'hours', 'weeks', 'days' and create a dataframe containing the relevant value (or 0 if there is no value for a certain keyword). You can achieve that with something like this:
library(dplyr)
vector = c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours", "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
split_vector = strsplit(vector, ",", fixed = TRUE)
parse_string = function(i){
  x = split_vector[[i]]
  # tibble() replaces the deprecated data_frame()
  tibble(ID = i) %>%
    mutate(hours = ifelse(any(grepl("hours", x)), as.numeric(gsub("\\D", "", x[grepl("hours", x)])), 0),
           days = ifelse(any(grepl("days", x)), as.numeric(gsub("\\D", "", x[grepl("days", x)])), 0),
           weeks = ifelse(any(grepl("weeks", x)), as.numeric(gsub("\\D", "", x[grepl("weeks", x)])), 0))
}
all_parsed = lapply(seq_along(split_vector), parse_string)
# bind_rows() replaces rbind_all(), which has been removed from dplyr
all_parsed = bind_rows(all_parsed) %>%
  mutate(final_hours = hours + days * 24 + weeks * 7 * 24)
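The same keyword-matching idea can also be sketched in base R without splitting at all. This is only a minimal sketch: extract_unit is a hypothetical helper name, and it assumes each unit appears at most once per string and is always preceded by a whole number.

```r
# Hypothetical helper: return the number in front of a unit keyword,
# or 0 when that keyword does not occur in the string.
extract_unit <- function(x, unit) {
  out <- numeric(length(x))                       # defaults to 0
  hit <- grepl(paste0("[0-9]+\\s*", unit), x, perl = TRUE)
  pattern <- paste0("^.*?([0-9]+)\\s*", unit, ".*$")
  out[hit] <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  out
}

age <- c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours",
         "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")

# Convert every component to hours and add them up, fully vectorised
total_hours <- extract_unit(age, "week") * 7 * 24 +
  extract_unit(age, "day") * 24 +
  extract_unit(age, "hour")
total_hours
# [1] 556 828  90  99   7   8
```

Because the keywords are matched without the trailing "s", singular forms such as "1 week" are handled as well.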
Hadleyverse comes to the rescue again:
library(lubridate)
library(stringr)
dat <- readLines(textConnection(" 3 weeks, 2 days, 4 hours
4 week, 6 days, 12 hours
3 days, 18 hours
4 day, 3 hours
7 hours
8 hour"))
sapply(str_split(str_trim(dat), ",[ ]*"), function(x) {
sum(sapply(x, function(y) {
bits <- str_split(str_trim(y), "[ ]+")[[1]]
duration(as.numeric(bits[1]), bits[2])
})) / 3600
})
## [1] 556 828 90 99 7 8
I whacked the data a bit to show it's also somewhat flexible in how it parses things. I really don't think the second str_trim is absolutely necessary, but I didn't have cycles to verify.
To explain: the code trims the original vector and splits each element at the commas, which yields a list of character vectors. It then iterates over that list, trimming each element again and splitting it into a number and a unit. Those are passed to lubridate's duration(); the results are added up as numeric seconds by the call to sum, and dividing by 3600 turns them into hours.
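For readers who want to see those steps without any packages, here is a rough base R sketch of the same split-then-convert flow. It swaps lubridate's duration() for a plain lookup vector; hours_per and to_hours are names made up for this illustration, and only weeks, days and hours are covered.

```r
dat <- c(" 3 weeks, 2 days, 4 hours", "4 week, 6 days, 12 hours", "8 hour")

# Assumed conversion table: hours per singular unit
hours_per <- c(week = 7 * 24, day = 24, hour = 1)

to_hours <- function(s) {
  parts <- strsplit(trimws(s), ",\\s*")[[1]]        # "3 weeks" "2 days" ...
  bits  <- strsplit(trimws(parts), "\\s+")          # list of c(number, unit)
  num   <- as.numeric(vapply(bits, `[`, character(1), 1))
  unit  <- sub("s$", "", vapply(bits, `[`, character(1), 2))  # drop plural "s"
  sum(num * hours_per[unit])
}

hours <- vapply(dat, to_hours, numeric(1), USE.NAMES = FALSE)
hours
# [1] 556 828   8
```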
Related
I have columns with data for age; e.g 2y:3m equals 2 years and 3 months and 5m = 5 months.
Age
2y:3m
5m
I wish to separate this column into two: "Years" and "Months", respectively.
I can do this by using the tidyr separate-function with ":" as separator.
However, my problem is that for children whose age is only reported in months, e.g. 5m, the separation puts 5m into the "Years" column and NA into the "Months" column, like this:
Years  Months
2y     3m
5m     NA
Does anyone know a handy way to solve this, preferably within the tidyverse packages.
Hope one of you can help!
This is what I tried, after streamlining the notation of years ("years", "year", "ys") and months ("month", "months", "mths") to only "y" and "m":
childage1 <- Child_age %>%
separate(eage_clean,c("Years","Months"),sep=":")
One idea is to first put a ":" in front of all values that only contain months, but I had problems working out how to do this...
If you want the years and months as numbers, you could extract with regex rather than separating:
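# The lazy ".*?" (with perl = TRUE) keeps multi-digit values such as "12y" intact
df %>% mutate(years = as.numeric(sub("^.*?(\\d+)y.*$", "\\1", Age, perl = TRUE)),
months = as.numeric(sub("^.*?(\\d+)m.*$", "\\1", Age, perl = TRUE)))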
#> Age years months
#> 1 2y:3m 2 3
#> 2 5m NA 5
It seems that it may be more useful in this data set to have 0 rather than NA, since "5m" probably represents 0 years, 5 months, and e.g. "2y" probably means 2 years, 0 months. If so, then you may prefer:
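# The lazy ".*?" (with perl = TRUE) keeps multi-digit values such as "12y" intact
df %>% mutate(years = as.numeric(sub("^.*?(\\d+)y.*$", "\\1", Age, perl = TRUE)),
months = as.numeric(sub("^.*?(\\d+)m.*$", "\\1", Age, perl = TRUE))) %>%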
mutate(across(c(years, months), ~ifelse(is.na(.x), 0, .x)))
#> Age years months
#> 1 2y:3m 2 3
#> 2 5m 0 5
With separate, fill = "left" pads missing pieces on the left, so a months-only value such as "5m" ends up in the "Months" column and "Years" becomes NA:
library(tidyr)
separate(Child_age, Age,c("Years","Months"),sep=":", fill = "left")
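If the column can also contain years-only values such as "2y", padding on one side is no longer enough. The following is a hedged base R sketch for that case; get_num is a hypothetical helper, and the extra rows "2y" and "12y:10m" are invented here purely to exercise the edge cases.

```r
Age <- c("2y:3m", "5m", "2y", "12y:10m")

# Hypothetical helper: number directly in front of a suffix, 0 if absent
get_num <- function(x, suffix) {
  m <- regmatches(x, regexec(paste0("([0-9]+)", suffix), x))
  # each element is c(full match, captured digits), or character(0) if no match
  vapply(m, function(g) if (length(g) == 2) as.numeric(g[2]) else 0, numeric(1))
}

ages <- data.frame(Age,
                   Years  = get_num(Age, "y"),
                   Months = get_num(Age, "m"))
ages
```

Missing parts come back as 0 rather than NA, which matches the "0 years, 5 months" reading discussed above.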
I am a beginner with R. I have a vector called "condition" with the following values, and I want to convert each value to the number of hours left.
condition <- c("10 days left", "3 days left", "22 hours left", "54 hours left", "10 minutes", "9 minutes", "Listing Expired")
Just to verify whether my logical concept is good, I first tried my logic on one element i.e. element[2] of the vector. It worked and I got my desired result
b <- as.numeric(word(condition[2], 1))*24
b
# [1] 72
bc <- paste(b, gsub(".*days left.*", "hours left", condition[2]))
bc
# "72 hours left"
I then tried to put it in for loop to apply the same logic on every element of vector. However it just worked for 1st two values but rest iterations are completely wrong. I am getting warnings too. Please help what changes do I need to make to get my desired output?
Declaring a variable to get the 2nd element of my strings
var2nd <- word(condition,2)
var2nd
Now creating for loop to calculate hours left for every value in the vector condition
Hours_left <- for (var in condition) {
if (var2nd == "days") {
b = as.numeric(word(var,1))*24
c = paste(b, gsub(".*days left.*", "hours left", var))
print(c)
} else if ( var2nd == "minutes") {
d <- round(as.numeric(word(var, 1)/60, digits = 2))
e <- as.character(paste(d, gsub(".*minutes.*", "hours left", var)))
print(e)
} else if (var2nd == "Expired") {
print(var)
} else if (var2nd == "hours") {
print(var)
}
}
[1] "240 hours left"
[1] "72 hours left"
[1] "528 22 hours left"
[1] "1296 54 hours left"
[1] "240 10 minutes"
[1] "216 9 minutes"
[1] "NA Listing Expired"
Warning messages:
1: In if (var2nd == "days") { :
the condition has length > 1 and only the first element will be used
2: In if (var2nd == "days") { :
the condition has length > 1 and only the first element will be used
3: In if (var2nd == "days") { :
the condition has length > 1 and only the first element will be used
4: In if (var2nd == "days") { :
the condition has length > 1 and only the first element will be used
5: In if (var2nd == "days") { :
the condition has length > 1 and only the first element will be used
6: In if (var2nd == "days") { :
the condition has length > 1 and only the first element will be used
7: In if (var2nd == "days") { :
the condition has length > 1 and only the first element will be used
8: NAs introduced by coercion
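The warnings above arise because var2nd is a vector of length 7, so every if() test only looks at its first element ("days"); from R 4.2.0 onwards this is an error rather than a warning. A minimal sketch of a per-element loop, using base R's strsplit in place of stringr's word() and assuming the number and unit are always the first two space-separated words:

```r
condition <- c("10 days left", "3 days left", "22 hours left", "54 hours left",
               "10 minutes", "9 minutes", "Listing Expired")

hours_left <- character(length(condition))
for (i in seq_along(condition)) {
  words <- strsplit(condition[i], " ", fixed = TRUE)[[1]]
  value <- suppressWarnings(as.numeric(words[1]))  # NA for "Listing"
  unit  <- words[2]                                # one unit per iteration
  hours_left[i] <- if (unit == "days") {
    paste(value * 24, "hours left")
  } else if (unit == "minutes") {
    paste(round(value / 60, 2), "hours left")
  } else {
    condition[i]                                   # hours / expired: keep as is
  }
}
hours_left
```

The key change is that unit is computed inside the loop from the current element, so each if() sees a single value.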
I don't know word, but base R can do this relatively easily with a cautious regular expression:
dat <- strcapture("\\b(-?[0-9]+\\.?[0-9]*)\\b\\s*(day|hour|minute)",
condition, data.frame(num=numeric(0), unit=character(0)))
dat$orig <- condition
dat
# num unit orig
# 1 10 day 10 days left
# 2 3 day 3 days left
# 3 22 hour 22 hours left
# 4 54 hour 54 hours left
# 5 10 minute 10 minutes
# 6 9 minute 9 minutes
# 7 NA <NA> Listing Expired
(Notice that I don't try to save the pluralizing s on the strings.)
Explanation of the regex:
\\b is a word boundary, either the beginning of the string or some non-alpha character (break between letters and non-letters)
(...) is a group, there are two groups, and in strcapture these are assigned to the num= and unit= portions of the third argument
-? start with an optional literal minus-sign
[0-9]+ one or more digits; side note, this means that -.5 does not match, the regex can be modified to allow for this, but with a bit more work ... I'm also assuming decimals may be possible, so if it is always integers then this can be simplified a bit
\\.? literal decimal, optional
[0-9]* zero or more digits
\\s* zero or more "blank space" such as "space" and "tab"
(day|hour|minute) one of the three strings: "day", "hour", or "minute"; note that since I don't use \\b (word-boundary) here, there may or may not be an s after the word (or anything else after it, for that matter), and we're perfectly fine with that
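As a sketch of the "bit more work" mentioned for values like -.5, one option is an alternation inside the number group. This variant is an assumption on my part rather than part of the answer: it drops the \\b anchors, uses perl = TRUE for the non-capturing group, and the input strings are invented.

```r
x <- c("-.5 days left", "1.5 hours left", "10 minutes")

# Number group: digits with optional decimals, OR a bare ".digits" form
pattern <- "(-?(?:[0-9]+\\.?[0-9]*|\\.[0-9]+))\\s*(day|hour|minute)"

parsed <- strcapture(pattern, x,
                     data.frame(num = numeric(0), unit = character(0),
                                stringsAsFactors = FALSE),
                     perl = TRUE)
parsed
```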
From here, there are a few techniques for converting the units.
Lookup/dictionary; in R this can be a named vector:
conversions <- c(day = 24, hour = 1, minute = 1/60)
dat$num * conversions[ dat$unit ]
# day day hour hour minute minute <NA>
# 240.000 72.000 22.000 54.000 0.167 0.150 NA
Merge in the units, often more powerful with more things to join on; often easier to "maintain", since it can be in a CSV and maintained without editing R source.
conversions_df <- data.frame(unit = c("day", "hour", "minute"), mult = c(24, 1, 1/60))
dat <- merge(dat, conversions_df, by = "unit", all.x = TRUE)
dat$num * dat$mult
# [1] 240.000 72.000 22.000 54.000 0.167 0.150 NA
Simple switch, which can be more useful than the first (lookup), since it more easily affords default values. The disadvantage is that it is not vectorized, so we need to do it ourselves (I'll use sapply for that):
sapply(dat$unit, switch, day = 24, hour = 1, minute = 1/60, 999)
# day day hour hour minute minute <NA>
# 24.0000 24.0000 1.0000 1.0000 0.0167 0.0167 999.0000
(and then multiply that as in the others). 999 is here just to demonstrate a default value ... it is not useful for this application.
While some may default to trying nested ifelse, when I see 3 or more things to check I tend towards either one of the techniques above or dplyr::case_when:
library(dplyr)
dat %>%
mutate(
mult = case_when(
unit == "day" ~ 24,
unit == "hour" ~ 1,
unit == "minute" ~ 1/60,
TRUE ~ 9999
),
timeleft = num * mult
)
# num unit orig mult timeleft
# 1 10 day 10 days left 2.40e+01 240.000
# 2 3 day 3 days left 2.40e+01 72.000
# 3 22 hour 22 hours left 1.00e+00 22.000
# 4 54 hour 54 hours left 1.00e+00 54.000
# 5 10 minute 10 minutes 1.67e-02 0.167
# 6 9 minute 9 minutes 1.67e-02 0.150
# 7 NA <NA> Listing Expired 1.00e+04 NA
(Again, 9999 is used to demonstrate the potential for a default value ... I don't think it truly applies here.)
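Putting the named-vector lookup together end to end, here is a minimal base R sketch that produces the "N hours left" strings asked for in the question. It simplifies the regex to whole numbers, and the choice to keep the original text when no unit is found is an assumption about the desired output.

```r
condition <- c("10 days left", "3 days left", "22 hours left", "54 hours left",
               "10 minutes", "9 minutes", "Listing Expired")

parsed <- strcapture("([0-9]+)\\s*(day|hour|minute)", condition,
                     data.frame(num = numeric(0), unit = character(0),
                                stringsAsFactors = FALSE))

conversions <- c(day = 24, hour = 1, minute = 1/60)
hours <- round(parsed$num * unname(conversions[parsed$unit]), 2)

# Unparseable entries (NA) fall back to the original string
result <- ifelse(is.na(hours), condition, paste(hours, "hours left"))
result
```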
Using lubridate and stringr from the tidyverse.
library("tidyverse")
condition %>%
str_extract("^\\d+\\s\\w+\\b") %>%
time_length(unit="hours") %>%
round(2) %>%
str_c(" hours left") %>%
replace_na("listing expired")
[1] "240 hours left" "72 hours left" "22 hours left" "54 hours left"
[5] "0.17 hours left" "0.15 hours left" "listing expired"
I need to split one column into 2, where the resulting columns contain the numeric or character portions of the original column.
df <- data.frame(myCol = c("24 hours", "36days", "1month", "2 months +"))
myCol
24 hours
36days
1month
2 months +
result should be:
alpha numeric
hours 24
days 36
month 1
months + 2
Note the inconsistent formatting of the original dataframe (sometimes with spaces, sometimes without).
tidy or base solutions are fine
Thanks
One solution could be:
library(tidyverse)
df %>%
separate(myCol,
into = c("numeric", "alpha"),
sep = "(?=[a-z +]+)(?<=[0-9])"
)
Which returns:
numeric alpha
1 24 hours
2 36 days
3 1 month
4 2 months +
You could do:
library(stringr)
df$numeric <- str_extract(df$myCol, "[0-9]+")
df$alpha <- str_remove(df$myCol, df$numeric)
Or with base functions
df$numeric <- regmatches(df$myCol, regexpr("[0-9]+", df$myCol))
df$alpha <- gsub("[0-9]+", "", df$myCol)
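One thing to watch with the extract-and-remove approaches is that the text part can keep a leading space for inputs such as "24 hours", and the numeric part stays a character vector. A small base R sketch that trims and converts, assuming every value starts with a whole number:

```r
df <- data.frame(myCol = c("24 hours", "36days", "1month", "2 months +"),
                 stringsAsFactors = FALSE)

# Leading digits become a real number; the remainder is trimmed of whitespace
df$numeric <- as.numeric(sub("^\\s*([0-9]+).*$", "\\1", df$myCol))
df$alpha   <- trimws(sub("^\\s*[0-9]+", "", df$myCol))
df
```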
I found a pet adoption dataset that includes the age of a pet when adopted. However, the age variable contains strings like "3 months" or "4 years" or "3 weeks" all in the same column. The dataset is otherwise tidy. How can I convert these variables into year values?
I've tried something like this:
for(i in i:nrow(Pet_Train$AgeuponOutcome)){
if(grepl(i, "month") == TRUE)
Pet_Train$Age_in_Years[i] == "0"
}
But I have little experience with loops/if statements/this "grepl" function I just looked up. I do have experience with tidy functions like mutate() and filter() but I'm not sure how to apply those with these many of possible argument combinations.
Since there are 27,000 instances, I'd rather not go through this by hand.
Edit:
I figured out how to use the grepl function to replace instances containing "month" with "less than a year." But is there a way to take the exact number of months and convert them into the year as a decimal?
The first two use only base R and the third uses dplyr and tidyr.
1) Use read.table to split the input column into the numeric and units parts and then multiply the numeric part by the fraction of a year that the units part represents.
PT <- data.frame(Age = c("3 months", "4 years", "3 weeks")) # input
transform(cbind(PT, read.table(text = as.character(PT$Age))),
Years = V1 * (7 / 365.25 * (V2 == "weeks") + 1/12 * (V2 == "months") + (V2 == "years")))
giving:
Age V1 V2 Years
1 3 months 3 months 0.25000000
2 4 years 4 years 4.00000000
3 3 weeks 3 weeks 0.05749487
2) Alternately the last line could be written in terms of switch:
transform(cbind(PT, read.table(text = as.character(PT$Age), as.is = TRUE)),
Years = V1 * sapply(V2, switch, weeks = 7 / 365.25, months = 1 / 12, years = 1))
3) This uses dplyr and tidyr:
library(dplyr)
library(tidyr)
PT %>%
separate(Age, c("No", "Units")) %>%
mutate(No = as.numeric(No),
Years = No * case_when(Units == "weeks" ~ 7 / 365.25,
Units == "months" ~ 1 / 12,
Units == "years" ~ 1))
giving:
No Units Years
1 3 months 0.25000000
2 4 years 4.00000000
3 3 weeks 0.05749487
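Real data of this kind often mixes singular and plural units ("1 year", "2 days"). As a hedged sketch, the unit can be normalised before a lookup; years_per is a name invented for this example, and the "1 year" and "2 days" rows are made up to show the idea.

```r
age <- c("3 months", "4 years", "3 weeks", "1 year", "2 days")

parts <- strsplit(age, " ", fixed = TRUE)
value <- as.numeric(vapply(parts, `[`, character(1), 1))
unit  <- sub("s$", "", vapply(parts, `[`, character(1), 2))  # "months" -> "month"

# Assumed conversion factors, using the same 365.25-day year as above
years_per <- c(day = 1 / 365.25, week = 7 / 365.25, month = 1 / 12, year = 1)
years <- value * unname(years_per[unit])
years
```

Any unit missing from years_per yields NA, which makes unexpected values easy to spot.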
lubridate-based solution:
library(tidyverse)
library(lubridate)
dat <- tibble(age_text = c("3 months", "4 years", "3 weeks"))
dat %>% mutate(age_in_years = duration(age_text) / dyears(1))
The answer of David Rubinger uses the lubridate package to coerce character strings to objects of class Duration.
The as.duration() function seems to recognize a variety of strings, e.g.,
age_text <- c("3 months", "4 years", "3 weeks", "52 weeks", "365 days 6 hours")
lubridate::as.duration(age_text)
[1] "7889400s (~13.04 weeks)" "126230400s (~4 years)" "1814400s (~3 weeks)"
[4] "31449600s (~52 weeks)" "31557600s (~1 years)"
However, the OP has requested to convert the strings into year values rather than seconds.
This can be achieved by using the as.numeric() function which takes a units parameter to specify the desired conversion:
as.numeric(lubridate::as.duration(age_text), units = "years")
[1] 0.25000000 4.00000000 0.05749487 0.99657769 1.00000000
Other units can be chosen as well:
as.numeric(lubridate::as.duration(age_text), units = "months")
[1] 3.0000000 48.0000000 0.6899384 11.9589322 12.0000000
as.numeric(lubridate::as.duration(age_text), units = "weeks")
[1] 13.04464 208.71429 3.00000 52.00000 52.17857
Just to expand on the comment I left, you could use ifelse. First though, here's a reproducible example of your data (always very useful for you to provide this when asking a question):
df <- data.frame("Duration" = c("3 months", "4 years", "3 weeks"))
You can then split out the units and values from this using string split:
df$Value <- as.numeric(vapply(strsplit(as.character(df$Duration), split = " "), `[`, 1, FUN.VALUE=character(1)))
df$Units <- vapply(strsplit(as.character(df$Duration), split = " "), `[`, 2, FUN.VALUE=character(1))
Finally, use nested ifelse calls, which tell R what to do if the data in a column match a condition and what to do otherwise. Here, if the unit is weeks, the value is divided by 52.18 (the number of weeks per year, 365.25/7); if it is months, it is divided by 12; otherwise it is left as years.
df$Years <- ifelse(df[,'Units']=="weeks", df[,'Value']/(365.25/7), ifelse(df[,'Units']=="months", df[,'Value']/12, df[,'Value']))
And the successful output:
> df
Duration Value Units Years
1 3 months 3 months 0.25000000
2 4 years 4 years 4.00000000
3 3 weeks 3 weeks 0.05749487
Note: It would be more appropriate to do this with "days" as your unit of time, which could be done if you had dates for the first and second event (birth and adoption dates of the animal). This is because years and months are variable length units - December is longer than February, 2016 was longer than 2015 and 2017.
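To illustrate the note above, here is a small sketch of computing age in days from two dates. The birth and outcome dates are invented for the example; the dataset in the question may not contain such columns.

```r
# Hypothetical dates: both animals are "one or three calendar months" old,
# yet the day counts differ because month lengths differ
birth   <- as.Date(c("2015-01-10", "2016-02-01"))
outcome <- as.Date(c("2015-04-10", "2016-03-01"))

age_days <- as.numeric(difftime(outcome, birth, units = "days"))
age_days
# [1] 90 29
```

The second value is 29 rather than 28 because February 2016 falls in a leap year.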
Here my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames, each with different fake years of 365 days. This means that each of the 10,000 data.frames needs to have a different start and end of year.
In total df has got 14,965 days which, divided by 365 days = 41 years. In other words, df needs to be grouped 10,000 times differently by 41 years (of 365 days each one).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc... and the remaining dates at the end df need to be recycled with the starting one.
The grouped fake years need to appear in a 3rd col of the data.frames.
I would put all the data.frames into a list but I don't know how to create the function which generates 10,000 different year's start dates and subsequently group each data.frame with a 365 days window 41 times.
Can anyone help me?
@gringer gave a good answer, but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of numbers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this while still producing Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
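To make the earlier remark about simplify=FALSE concrete, the sketch below shows what the simplified result looks like and one way to turn a column back into dates. The small 100-day pool and the seed are arbitrary choices for the illustration.

```r
set.seed(42)
days <- seq(as.Date("1974-01-01"), by = "day", length.out = 100)

# With the default simplify = TRUE the Date class is dropped:
# the result is a plain numeric matrix of day counts since 1970-01-01
m <- replicate(3, seq(sample(days, 1), length.out = 5, by = "day"))
class(m[1, 1])   # "numeric"
dim(m)           # 5 rows, 3 columns

# Converting one column back needs the origin spelled out
back <- as.Date(m[, 1], origin = "1970-01-01")
back
```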
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, each result column is a subset of the original data frame (where the arithmetic is already done): it starts at a random position in the first half of the frame and takes the next 14965 values. I'm using nrow(df)/2 rather than the literal 14965 to keep the code more generic:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
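Neither approach above adds the grouping column the question asks for. As a rough sketch for a single data frame, the wraparound can also be done with modular indexing instead of doubling the vector, and the fake-year label is then just a repeated counter. The column name fake_year and the seed are assumptions made for this example.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(start_day, end_day, by = "day")
n <- length(date_vec)                      # 14965 = 41 * 365

set.seed(1)
start <- sample.int(n, 1)                  # random starting position

# Walk n steps from `start`, wrapping back to position 1 after position n
idx <- ((start - 1 + seq_len(n) - 1) %% n) + 1

one_df <- data.frame(date = date_vec[idx],
                     fake_year = rep(seq_len(n %/% 365), each = 365))
head(one_df, 3)
```

Wrapping this in a function of start and calling it with lapply over 10,000 sampled start positions would give the list of data frames described in the question.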