Convert a month character value into a date month value in R

I have a data frame with a month name character column and I need to convert it into a month name date column. I tried this code:
data1$month <- as.Date(as.character(data1$month), "%B")
but all the values turned into NA. What am I doing wrong?
Thanks.
Some more info on the data:
head(data1)
  month     year impressions clicks conversions    cost revenue month_num
  <chr>    <dbl>       <int>  <int>       <int>   <dbl>   <dbl>     <dbl>
1 April     2018    18737558 107063         291 117505. 145745.         4
2 August    2018    23247068 126523         439 118631. 143217.         8
3 February  2018    20119465 117370         320 146965. 114594.         2
4 January   2018    23905450 148205         382 155756. 145513.         1
5 July      2018    11963956  92740         297 106249. 138354.         7
6 June      2018     6845841  52294         253  53205.  91740.         6

You can match the first three characters of the month name with the predefined vector month.abb:
data1$month_num <- match(substr(data1$month, 1, 3), month.abb)
Hope it helps.
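For context on the NA values: as.Date() needs a complete date, and a format of "%B" alone supplies only a month, so the parse fails. Below is a minimal sketch, assuming the goal is a real Date column and that the first day of each month is an acceptable placeholder. The `date` column name and the three-row sample are made up for illustration.

```r
# Small sample mirroring the question's month and year columns
data1 <- data.frame(
  month = c("April", "August", "February"),
  year  = c(2018, 2018, 2018)
)

# Month name -> month number via the built-in month.abb vector
data1$month_num <- match(substr(data1$month, 1, 3), month.abb)

# Build an ISO "YYYY-MM-01" string and parse it; the day (01) is an assumption
data1$date <- as.Date(sprintf("%d-%02d-01", as.integer(data1$year), data1$month_num))

data1$date
```

Because the lookup goes through month.abb rather than "%B", it does not depend on the session locale, but it does assume English month names.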

Related

Creating a subset of a dataset based on multiple conditions in R

I want to extract the past 3 weeks' data for each household_id, channel combination. These past 3 weeks are counted back from mala_fide_week and mala_fide_year, and must fall before that point for each household_id and channel combination.
Below is the dataset:
For example, for Household_id 100 and channel A, the mala_fide_week is 42 and the mala_fide_year is 2021. So the past three records are the ones before week 42 of the year 2021. This is calculated from the week and year columns.
For the Household_id 100 and channel B combination, there are only two records, both well before mala_fide_week and mala_fide_year.
For Household_id 101 and channel C, two years are involved: 2019 and 2020.
The final dataset will be as below
Household_id 102 is not considered, as its week and year are greater than mala_fide_week and mala_fide_year.
I have tried multiple options without success. Any help is much appreciated!
sample dataset:
data <- data.frame(
  Household_id = c(100,100,100,100,100,100,101,101,101,101,102,102),
  # the ninth channel value was a lowercase "c"; "C" matches the expected groups
  channel = c("A","A","A","A","B","B","C","C","C","C","D","D"),
  duration = c(12,34,567,67,34,67,98,23,56,89,73,76),
  mala_fide_week = c(42,42,42,42,42,42,5,5,5,5,30,30),
  mala_fide_year = c(2021,2021,2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
  week = c(36,37,38,39,22,23,51,52,1,2,38,39),
  year = c(2021,2021,2021,2021,2020,2020,2019,2019,2020,2020,2021,2021))
I think you first need to obtain an absolute week number, week + 52 * year, for both the record and the mala fide date, then filter on that. slice_tail then gets the last three rows of each group.
library(dplyr)
data |>
  filter(week + 52 * year <= mala_fide_week + 52 * mala_fide_year) |>
  group_by(Household_id, channel) |>
  arrange(year, week, .by_group = TRUE) |>
  slice_tail(n = 3)
# A tibble: 8 x 7
# Groups:   Household_id, channel [3]
  Household_id channel duration mala_fide_week mala_fide_year  week  year
         <dbl> <chr>      <dbl>          <dbl>          <dbl> <dbl> <dbl>
1          100 A             34             42           2021    37  2021
2          100 A            567             42           2021    38  2021
3          100 A             67             42           2021    39  2021
4          100 B             34             42           2021    22  2020
5          100 B             67             42           2021    23  2020
6          101 C             23              5           2020    52  2019
7          101 C             56              5           2020     1  2020
8          101 C             89              5           2020     2  2020
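To make the week arithmetic visible, here is a hedged variant that stores the index in helper columns and uses a strict `<`, on the assumption that "past weeks" should exclude the mala fide week itself. The column names `week_index` and `cutoff_index` and the small sample are invented for illustration. It also assumes every year has exactly 52 weeks, which is an approximation: under ISO week numbering some years have a week 53.

```r
library(dplyr)

# Tiny made-up sample: one record falls exactly on the mala fide week
sample_data <- data.frame(
  Household_id   = c(100, 100, 101, 101, 101),
  channel        = c("A", "A", "C", "C", "C"),
  mala_fide_week = c(42, 42, 5, 5, 5),
  mala_fide_year = c(2021, 2021, 2020, 2020, 2020),
  week           = c(41, 42, 52, 1, 5),
  year           = c(2021, 2021, 2019, 2020, 2020)
)

result <- sample_data %>%
  mutate(
    week_index   = week + 52 * year,                      # absolute week of the record
    cutoff_index = mala_fide_week + 52 * mala_fide_year   # absolute mala fide week
  ) %>%
  filter(week_index < cutoff_index) %>%   # strictly before the mala fide week
  group_by(Household_id, channel) %>%
  arrange(week_index, .by_group = TRUE) %>%
  slice_tail(n = 3) %>%                   # at most the last three weeks per group
  ungroup()

result
```

With the strict comparison, the records on week 42 of 2021 and week 5 of 2020 are dropped; switching back to `<=` keeps them.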

Better ways to combine Year and Month into Date object using mapply and lubridate

(I actually came up with a solution but that didn't satisfy my desire for simplicity and intuitiveness, therefore here I state my question and solution while waiting for a nice and neat solution.)
I have a dataset with one column for Year and another for Month, where the month is stored as a string:
  Country   Month     Year Type
  <fct>     <chr>    <dbl> <fct>
1 Argentina June      1975 Currency
2 Argentina February  1981 Currency
3 Argentina July      1982 Currency
I am trying to combine the Month and Year columns into a single Date column of class Date.
First Try
My first try was to use mapply, with the help of lubridate and a little function of mine that transforms the month from string to int.
library(lubridate)  # provides make_date()
months = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
month_num = c(1:12)
names(month_num) = months
crisis$Date = mapply(function(y, m){
  m = month_num[m]      # look up the month number by name
  d = make_date(y, m)   # first day of that month
  return(d)
}, crisis$Year, crisis$Month)
However, this didn't turn out to be what I wanted, because the Date column comes back as a list:
  Country   Month      Year Type     Date
  <fct>     <chr>     <dbl> <fct>    <list>
1 Argentina June       1975 Currency <date [1]>
2 Argentina February   1981 Currency <date [1]>
3 Argentina July       1982 Currency <date [1]>
4 Argentina September  1986 Currency <date [1]>
Some Googling
With some help from this post, and by unlisting the result and converting it back to a Date object, I managed to get the result I want:
crisis$Date = as_date(unlist(mapply(function(y, m){
  m = month_num[m]
  d = make_date(y, m)
  return(d)
}, crisis$Year, crisis$Month, SIMPLIFY = FALSE)))
The result is
  Country   Month      Year Type     Date
  <fct>     <chr>     <dbl> <fct>    <date>
1 Argentina June       1975 Currency 1975-06-01
2 Argentina February   1981 Currency 1981-02-01
3 Argentina July       1982 Currency 1982-07-01
4 Argentina September  1986 Currency 1986-09-01
This works, but I believe there are better solutions.
You can convert month to a number, and then from there to a date:
library(dplyr)
df %>%
  mutate(
    Month = base::match(Month, base::month.name),
    Date = as.Date(paste(Year, '-', Month, '-01', sep=''))
  ) %>%
  select(-c(Month, Year))
# A tibble: 3 x 3
#   Country   Type     Date
#   <chr>     <chr>    <date>
# 1 Argentina Currency 1975-06-01
# 2 Argentina Currency 1981-02-01
# 3 Argentina Currency 1982-07-01
Does this help?
I provided the dataframe below:
library(tibble)
df <- tibble(
  Country = 'Argentina',
  Month = c('June', 'February', 'July'),
  Year = c(1975, 1981, 1982),
  Type = 'Currency'
)
df$Date <- lubridate::myd(paste(df$Month, df$Year, "1"))
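The two ideas above can also be combined without any loop, since lubridate's make_date() accepts whole vectors. The sketch below assumes English month names (so that a lookup in the built-in month.name works) and uses a plain data.frame copy of the sample data.

```r
library(lubridate)

# Sample data copied from the answer above
df <- data.frame(
  Country = "Argentina",
  Month   = c("June", "February", "July"),
  Year    = c(1975, 1981, 1982),
  Type    = "Currency"
)

# match() against month.name gives the month number;
# make_date() is vectorized and defaults the day to 1, so mapply() is not needed
df$Date <- make_date(year = df$Year, month = match(df$Month, month.name))

df$Date
```

A month name that is misspelled or not in English would give NA from match(), and therefore an NA date, so it may be worth checking anyNA(df$Date) afterwards.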
So after the help from @Gram and @det, I came up with my solution.
I am new to R, so I wasn't familiar with the R-ish style of handling data and tried to do everything in a single line of code. Thanks to some tips from Gram's answer, I learned to clean up my code by adding auxiliary columns instead (which is similar to Excel).
Since there might be situations in the future where the correspondence is not simply from 1:12 to months, and to keep things general for future use, I created a new data.frame just to store all the information about months:
month_ref = data.frame(
  num = 1:12,
  Month = c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", "December")
)
  num    Month
1   1  January
2   2 February
3   3    March
4   4    April
Now the idea is to "combine" the two data frames, matching each value in the Month column to its month number. This is exactly like the VLOOKUP function in Excel, and with help from this post, I now have a data frame with a column of numbers:
# printed without reassigning, so the full pipeline further down can run on the original crisis
crisis %>%
  inner_join(month_ref, by = "Month")
  Country   Month      Year Type       num
  <fct>     <chr>     <dbl> <fct>    <int>
1 Argentina June       1975 Currency     6
2 Argentina February   1981 Currency     2
3 Argentina July       1982 Currency     7
4 Argentina September  1986 Currency     9
I can then work with a neat numeric month column, which is much easier and more readable than handling the parsing in a custom function inside mutate().
crisis = crisis %>%
  inner_join(month_ref, by = "Month") %>%
  mutate(
    Date = lubridate::ymd(paste(Year, num, "01", sep = "-"))
  ) %>%
  select(-c(num, Month, Year))
  Country   Type     Date
  <fct>     <fct>    <date>
1 Argentina Currency 1975-06-01
2 Argentina Currency 1981-02-01
3 Argentina Currency 1982-07-01

In R, How do I extract certain rows from a list of data sets?

> str(pc)
'data.frame': 562 obs. of 9 variables:
$ id : int 1 2 3 4 5 10 12 17 19 22 ...
$ gender : chr "M" "F" "M" "M" ...
$ birth_year: int 1973 1974 1937 1943 1958 1958 1940 1973 1971 1950 ...
$ type : chr "spontaneous SAH" "traumatic SAH" "spontaneous SAH" "traumatic SAH" ...
$ admit_year: int 2011 2011 2016 2012 2018 2017 2010 2018 2016 2018 ...
$ admit_date: chr "2011-06-22" "2011-12-19" "2016-12-06" "2012-10-28" ...
$ admitage : int 38 37 79 69 60 59 70 45 45 68 ...
$ death_date: chr NA NA NA "2012-10-28" ...
$ death_year: int NA NA NA 2012 NA NA NA NA 2016 NA ...
Hello. I have a data frame that looks like this. The column "id" holds patient IDs, but some rows share the same ID because some patients were admitted to the hospital several times. How do I delete the duplicates and leave one row per ID?
I tried this
c <- unique(pc$id)
to extract the "id" numbers, but I don't know what to do next.
I'm a beginner, so I would appreciate it if you could explain it to me with simple code!
EDIT: I want to make one list containing the rows with each patient's initial admission date, and another list containing the rows with each patient's final admission date.
How can I do that? The data is in ID order, but if a patient was admitted multiple times, the dates are not necessarily in chronological order. I'd like to know how I can achieve that just by using !duplicated.
Something like this should work: pc[!duplicated(pc$id),]. By default it keeps the first occurrence of each id.
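Since !duplicated keeps the first occurrence in whatever order the rows happen to be in, getting the earliest or latest admission needs a sort first. Here is a minimal base R sketch, assuming admit_date is stored as "YYYY-MM-DD" text as in the str() output. The five-row sample and the names `pc_sorted`, `first_admit` and `last_admit` are invented for illustration.

```r
# Made-up sample with repeated ids and dates out of chronological order
pc <- data.frame(
  id = c(1, 2, 1, 3, 2),
  admit_date = c("2016-12-06", "2011-12-19", "2011-06-22", "2012-10-28", "2018-01-05")
)

# Convert the text dates so they sort chronologically
pc$admit_date <- as.Date(pc$admit_date)

# Sort by id, then by admission date within each id
pc_sorted <- pc[order(pc$id, pc$admit_date), ]

# First occurrence per id after sorting = earliest admission
first_admit <- pc_sorted[!duplicated(pc_sorted$id), ]

# fromLast = TRUE scans from the bottom, so this keeps the latest admission
last_admit <- pc_sorted[!duplicated(pc_sorted$id, fromLast = TRUE), ]

first_admit
last_admit
```

A patient with a single admission, like id 3 here, appears in both results with the same row.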
library(tidyverse)
data <- tibble::tribble(
  ~id, ~gender, ~birth_year, ~admit_year,
  1, "M", 1973, 2014,
  2, "F", 1974, 2016,
  3, "M", 1958, 2013,
  2, "F", 1974, 2017,
  1, "M", 1973, 2011,
  1, "M", 1973, 2020,
  1, "M", 1973, 2018,
  2, "F", 1974, 2009,
)
data
# A tibble: 8 x 4
     id gender birth_year admit_year
  <dbl> <chr>       <dbl>      <dbl>
1     1 M            1973       2014
2     2 F            1974       2016
3     3 M            1958       2013
4     2 F            1974       2017
5     1 M            1973       2011
6     1 M            1973       2020
7     1 M            1973       2018
8     2 F            1974       2009
To keep the first and last row (first admit year and last admit year) by id:
df <- data %>%
  # sort by admit year, earliest first
  arrange(admit_year) %>%
  # group by id
  group_by(id) %>%
  # keep the first and last row (first and last admit year) by id;
  # unique() avoids repeating the row when an id has only one admission
  slice(unique(c(1, n())))
df
# A tibble: 5 x 4
# Groups:   id [3]
     id gender birth_year admit_year
  <dbl> <chr>       <dbl>      <dbl>
1     1 M            1973       2011
2     1 M            1973       2020
3     2 F            1974       2009
4     2 F            1974       2017
5     3 M            1958       2013
To keep the last row (last admit year) by id:
df2 <- data %>%
  # sort by admit year, earliest first
  arrange(admit_year) %>%
  # group by id
  group_by(id) %>%
  # keep the last row (last admit year) by id
  slice(n())
df2
# A tibble: 3 x 4
# Groups:   id [3]
     id gender birth_year admit_year
  <dbl> <chr>       <dbl>      <dbl>
1     1 M            1973       2020
2     2 F            1974       2017
3     3 M            1958       2013
To keep the first row (first admit year) by id:
df3 <- data %>%
  # sort by admit year, earliest first
  arrange(admit_year) %>%
  # group by id
  group_by(id) %>%
  # keep the first row (first admit year) by id
  slice(1)
df3
# A tibble: 3 x 4
# Groups:   id [3]
     id gender birth_year admit_year
  <dbl> <chr>       <dbl>      <dbl>
1     1 M            1973       2011
2     2 F            1974       2009
3     3 M            1958       2013

How to replace numeric month with a month's full name

I want to change a column holding the month as a number into the full month name using the tidyverse package. Please bear in mind that even though this data has only four months here, my real dataset contains all months of the year.
I am new to tidyverse.
mydata <- tibble(camp = c("Platinum 2018-03", "Reboarding 2018", "New Acct Auto Jul18", "Loan2019-4"),
                 Acct = c(1, 33, 6, 43),
                 Balance = c(222, 7744, 949, 123),
                 Month = c(1, 4, 6, 8))
I expect the output to be
January, April, June, August etc. Thanks for your help.
R comes with a month.name vector which should be ok as long as you only need English names.
mydata %>% mutate(MonthName = month.name[Month])
giving:
# A tibble: 4 x 5
  camp                 Acct Balance Month MonthName
  <chr>               <dbl>   <dbl> <dbl> <chr>
1 Platinum 2018-03        1     222     1 January
2 Reboarding 2018        33    7744     4 April
3 New Acct Auto Jul18     6     949     6 June
4 Loan2019-4             43     123     8 August
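The lookup itself is plain vector indexing, so it can be tried in base R without any packages. A small sketch (the variable names are only for illustration) showing the full names, the abbreviated names from month.abb, and the reverse lookup with match():

```r
month_num <- c(1, 4, 6, 8)

# Index the built-in constants by month number
full_names  <- month.name[month_num]
short_names <- month.abb[month_num]

# match() goes the other way: name -> number
back_to_num <- match(full_names, month.name)

full_names
short_names
back_to_num
```

Both constants are English only, which is why other languages need the locale-based approach.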
Other Languages
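If you need other languages use this code (or omit as.character to get ordered factor output). Locale names are platform-specific: "French" works on Windows, while Linux and macOS typically need something like "fr_FR.UTF-8".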
library(lubridate)
Sys.setlocale(locale = "French")
mydata %>% mutate(MonthName = as.character(month(Month, label = TRUE, abbr = FALSE)))
giving:
# A tibble: 4 x 5
  camp                 Acct Balance Month MonthName
  <chr>               <dbl>   <dbl> <dbl> <chr>
1 Platinum 2018-03        1     222     1 janvier
2 Reboarding 2018        33    7744     4 avril
3 New Acct Auto Jul18     6     949     6 juin
4 Loan2019-4             43     123     8 août
A dplyr-lubridate solution:
mydata %>%
  mutate(Month = lubridate::month(Month, label = TRUE, abbr = FALSE))
# A tibble: 4 x 4
  camp                 Acct Balance Month
  <chr>               <dbl>   <dbl> <ord>
1 Platinum 2018-03        1     222 January
2 Reboarding 2018        33    7744 April
3 New Acct Auto Jul18     6     949 June
4 Loan2019-4             43     123 August

Grouping by sector then aggregating by fiscal year

I have a dataset with the fields isic (International Standard Industrial Classification), date, and cash. I would like to first group it by sector and then get the sum by fiscal year.
# Here's a look at the data (cpt1). All the dates follow the format "%Y-%m-01"
    Cash Date       isic
1 373165 2014-06-01 K
2 373165 2014-12-01 K
3 373165 2017-09-01 K
4     NA <NA>       K
5   4789 2015-05-01 K
6 982121 2013-07-01 K
.
.
.
# I was able to group them by sector and sum them
cpt_by_sector = cpt1 %>%
  mutate(sector = recode_factor(isic,
    'A'='Agriculture','B'='Industry','C'='Industry','D'='Industry',
    'E'='Industry','F'='Industry', .default = 'Services',
    .missing = 'Services')) %>%
  group_by(sector) %>%
  summarise_if(is.numeric, sum, na.rm = T)
# Here's the result
  sector              `Cash`
  <fct>                <dbl>
1 Agriculture    2094393819.
2 Industry      53699068183.
3 Services     223995196357.
# Below is what I would like to get. I would like to take into account the fiscal year, i.e. from July to June.
  Sector      `2009/10` `2010/11` `2011/12` `2012/13` `2013/14` `2014/15` `2015/16` `2016/17`
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 Agriculture      2.02      3.62      3.65      6.26      7.04      8.36      11.7      11.6
2 Industry         87.8      117.      170.      163.      185.      211.      240.      252.
3 Services         271.      343.      479.      495.      584.      664.      738.      821.
4 Total            361.      464.      653.      664.      776.      883.      990.     1085.
PS: I changed the date column to date format.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  # FY is the year of the date, plus 1 if the month is July or later.
  # FY_label makes the requested format, by combining the prior year,
  # a slash, and digits 3&4 of the FY.
  mutate(FY = year(Date) + if_else(month(Date) >= 7, 1, 0),
         FY_label = paste0(FY-1, "/", substr(FY, 3, 4))) %>%
  mutate(sector = recode_factor(isic,
    'A'='Agriculture','B'='Industry','C'='Industry','D'='Industry',
    'E'='Industry','F'='Industry', 'K'='Mystery Sector')) %>%
  filter(!is.na(FY)) %>%  # Exclude rows with missing FY
  group_by(FY_label, sector) %>%
  summarise(Cash = sum(Cash)) %>%
  spread(FY_label, Cash)
# A tibble: 1 x 4
  sector         `2013/14` `2014/15` `2017/18`
  <fct>              <int>     <int>     <int>
1 Mystery Sector   1355286    377954    373165
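The fiscal-year rule in the comments above can be checked in isolation. Below is a small base R sketch of the same July-to-June labelling. The helper name `fy_label` is hypothetical, not part of any package, and the dates are taken from the question's sample rows.

```r
# Hypothetical helper: label a date with its July-to-June fiscal year,
# e.g. 2014-12-01 falls in fiscal year "2014/15"
fy_label <- function(date) {
  date <- as.Date(date)
  yr <- as.integer(format(date, "%Y"))
  mo <- as.integer(format(date, "%m"))
  fy_end <- yr + (mo >= 7L)  # July or later belongs to the FY ending next year
  paste0(fy_end - 1L, "/", substr(fy_end, 3, 4))
}

labels <- fy_label(c("2014-06-01", "2014-12-01", "2017-09-01", "2013-07-01"))
labels
```

June 2014 and July 2013 land in the same fiscal year, 2013/14, which is why the two corresponding Cash values are summed into one column in the output above. Missing dates are not handled specially here, so they would need filtering first, as the pipeline does with filter(!is.na(FY)).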
