Change a column with month in number to the actual month name in full using tidyverse package. Please, bear in mind that even though this data has only four months here, my real dataset contains all actual month of the year.
I am new to tidyverse
mydata <- tibble(camp = c("Platinum 2018-03","Reboarding 2018","New Acct Auto Jul18", "Loan2019-4"),
Acct = c(1, 33, 6, 43),
Balance = c(222, 7744, 949, 123),
Month = c(1,4,6,8))
I expect the output to be
January, April, June, August etc. Thanks for your help.
R comes with a month.name vector which should be ok as long as you only need English names.
mydata %>% mutate(MonthName = month.name[Month])
giving:
# A tibble: 4 x 5
camp Acct Balance Month MonthName
<chr> <dbl> <dbl> <dbl> <chr>
1 Platinum 2018-03 1 222 1 January
2 Reboarding 2018 33 7744 4 April
3 New Acct Auto Jul18 6 949 6 June
4 Loan2019-4 43 123 8 August
Other Languages
If you need other languages use this code (or omit as.character to get ordered factor output):
library(lubridate)
Sys.setlocale(locale = "French")
mydata %>% mutate(MonthName = as.character(month(Month, label = TRUE, abbr = FALSE)))
giving:
# A tibble: 4 x 5
camp Acct Balance Month MonthName
<chr> <dbl> <dbl> <dbl> <chr>
1 Platinum 2018-03 1 222 1 janvier
2 Reboarding 2018 33 7744 4 avril
3 New Acct Auto Jul18 6 949 6 juin
4 Loan2019-4 43 123 8 août
A dplyr-lubridate solution:
mydata %>%
mutate(Month = lubridate::month(Month, label = TRUE, abbr = FALSE))
# A tibble: 4 x 4
camp Acct Balance Month
<chr> <dbl> <dbl> <ord>
1 Platinum 2018-03 1 222 January
2 Reboarding 2018 33 7744 April
3 New Acct Auto Jul18 6 949 June
4 Loan2019-4 43 123 August
Related
Below is the sample data. This is how it comes from the current population survey. There are 115 columns in the original. Below is just a subset. At the moment, I simply append a new row each month and leave it as is. However, there has been a new request that it be made longer and parsed a bit.
For some context, the first character is the race, a = all, b=black, w=white, and h= hispanic. The second character is the gender, x = all, m = male, and f= female. The third variable, which does not appear in all columns is the age. These values are 2024 for ages 20-24, 3039 or 30-39, and so on. Each one will end in the terms, laborforce unemp or unemprate.
stfips <- c(32,32,32,32,32,32,32,32)
areatype <- c(01,01,01,01,01,01,01,01)
periodyear <- c(2021,2021,2021,2021,2021,2021,2021,2021)
period <- (01,02,03,04,05,06,07,08)
xalaborforce <- c(1210.9,1215.3,1200.6,1201.6,1202.8,1209.3,1199.2,1198.9)
xaunemp <- c(55.7,55.2,65.2,321.2,77.8,88.5,92.4,102.6)
xaunemprate <- c(2.3,2.5,2.7,2.9,3.2,6.5,6.0,12.5)
walaborforce <- c(1000.0,999.2,1000.5,1001.5,998.7,994.5,999.2,1002.8)
waunemp <- c(50.2,49.5,51.6,251.2,59.9,80.9,89.8,77.8)
waunemprate <- c(3.4,3.6,3.8,4.0,4.2,4.5,4.1,2.6)
balaborforce <- c (5.5,5.7,5.2,6.8,9.2,2.5,3.5,4.5)
ba2024laborforce <- c(1.2,1.4,1.2,1.3,1.6,1.7,1.4,1.5)
ba2024unemp <- c(.2,.3,.2,.3,.4,.5,.02,.19))
ba2024lunemprate <- c(2.1,2.2,3.2,3.2,3.3,3.4,1.2,2.5)
test2 <- data.frame (stfips,areatype,periodyear, period, xalaborforce,xaunemp,xaunemprate,walaborforce, waunemp,waunemprate,balaborforce,ba2024laborforce,ba2024unemp,ba2024unemprate)
Desired result
stfips areatype periodyear period race gender age laborforce unemp unemprate
32 01 2021 01 x a all 1210.9 55.7 2.3
32 01 2021 02 x a all 1215.3 55.2 2.5
.....(the other six rows for race = x and gender = a
32 01 2021 01 w a all 1000.0 50.2 3.4
32 01 2021 02 w a all 999.2 49.5 3.6
....(the other six rows for race = w and gender = a
32 01 2021 01 b a 2024 1.2 .2 2.1
Edit -- added handling for columns with age prefix. Mostly there, but would be nice to have a concise way to add the - to make 2024 into 20-24....
test2 %>%
pivot_longer(xalaborforce:ba2024laborforce) %>%
separate(name, c("race", "gender", "stat"), sep = c(1,2)) %>%
mutate(age = coalesce(parse_number(stat) %>% as.character, "all"),
stat = str_remove_all(stat, "[0-9]")) %>%
pivot_wider(names_from = stat, values_from = value)
# A tibble: 32 × 10
stfips areatype periodyear period race gender age laborforce unemp unemprate
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 32 1 2021 1 x a all 1211. 55.7 2.3
2 32 1 2021 1 w a all 1000 50.2 3.4
3 32 1 2021 1 b a all 5.5 NA NA
4 32 1 2021 1 b a 2024 1.2 NA NA
5 32 1 2021 2 x a all 1215. 55.2 2.5
6 32 1 2021 2 w a all 999. 49.5 3.6
7 32 1 2021 2 b a all 5.7 NA NA
8 32 1 2021 2 b a 2024 1.4 NA NA
9 32 1 2021 3 x a all 1201. 65.2 2.7
10 32 1 2021 3 w a all 1000. 51.6 3.8
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
I want to extract the past 3 weeks' data for each household_id, channel combination. These past 3 weeks will be calculated from mala_fide_week and mala_fide_year and it will be less than that for each household_id and channel combination.
Below is the dataset:
for e.g. Household_id 100 for channel A: the mala_fide_week is 42 and mala_fide_year 2021. So past three records will be less than week 42 of the year 2021. This will be calculated from the week and year columns.
For the Household_id 100 and channel B combination, there are only two records much less than mala_fide_week and mala_fide_year.
For Household_id 101 and channel C, there are two years involved in 2019 and 2020.
The final dataset will be as below
Household_id 102 is not considered as week and year is greater than mala_fide_week and mala_fide_year.
I am trying multiple options but not getting through. Any help is much appreciated!
sample dataset:
data <- data.frame(Household_id =
c(100,100,100,100,100,100,101,101,101,101,102,102),
channel = c("A","A","A","A","B","B","C","C","c","C","D","D"),
duration = c(12,34,567,67,34,67,98,23,56,89,73,76),
mala_fide_week = c(42,42,42,42,42,42,5,5,5,5,30,30),
mala_fide_year =c(2021,2021,2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
week =c(36,37,38,39,22,23,51,52,1,2,38,39),
year = c(2021,2021,2021,2021,2020,2020,2019,2019,2020,2020,2021,2021))
I think you first need to obtain the absolute number of weeks week + year * 52, then filter accordingly. slice_tail gets the last three rows of each group.
library(dplyr)
data |>
filter(week + 52*year <= mala_fide_week + 52 *mala_fide_year) |>
group_by(Household_id, channel) |>
arrange(year, week, .by_group = TRUE) |>
slice_tail(n = 3)
# A tibble: 8 x 7
# Groups: Household_id, channel [3]
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 A 34 42 2021 37 2021
2 100 A 567 42 2021 38 2021
3 100 A 67 42 2021 39 2021
4 100 B 34 42 2021 22 2020
5 100 B 67 42 2021 23 2020
6 101 C 23 5 2020 52 2019
7 101 C 56 5 2020 1 2020
8 101 C 89 5 2020 2 2020
> str(pc)
'data.frame': 562 obs. of 9 variables:
$ id : int 1 2 3 4 5 10 12 17 19 22 ...
$ gender : chr "M" "F" "M" "M" ...
$ birth_year: int 1973 1974 1937 1943 1958 1958 1940 1973 1971 1950 ...
$ type : chr "spontaneous SAH" "traumatic SAH" "spontaneous SAH" "traumatic SAH" ...
$ admit_year: int 2011 2011 2016 2012 2018 2017 2010 2018 2016 2018 ...
$ admit_date: chr "2011-06-22" "2011-12-19" "2016-12-06" "2012-10-28" ...
$ admitage : int 38 37 79 69 60 59 70 45 45 68 ...
$ death_date: chr NA NA NA "2012-10-28" ...
$ death_year: int NA NA NA 2012 NA NA NA NA 2016 NA ...
Hello. I have a list that looks like this. The column "id" indicates patients IDs. But there are rows with the same ID because some patients got admitted to the hospital several times. How do I delete the duplicates and leave one row per ID?
I tried this
c <- unique(pc$id)
to extract the "id" numbers, but I don't know what to do next.
I'm a beginner, so I would appreciate it if you could explain it to me with easy codes!
EDIT: I want to make one list containing the ones with the initial admitted dates of the patients, and another list containing the ones with the final admitted dates?
How can I do that? This list is in ID order, but if one patient got admitted multiple times, the date is not necessarily in chronological order. I'd like to know how I can achieve that just by using !duplicated.
Something like this should work : pc[!duplicated(pc$id),]. It will by default keep the first occurence.
library(tidyverse)
data <- tibble::tribble(
~id, ~gender, ~birth_year, ~admit_year,
1, "M", 1973, 2014,
2, "F", 1974, 2016,
3, "M", 1958, 2013,
2, "F", 1974, 2017,
1, "M", 1973, 2011,
1, "M", 1973, 2020,
1, "M", 1973, 2018,
2, "F", 1974, 2009,
)
data
# A tibble: 8 x 4
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2014
2 2 F 1974 2016
3 3 M 1958 2013
4 2 F 1974 2017
5 1 M 1973 2011
6 1 M 1973 2020
7 1 M 1973 2018
8 2 F 1974 2009
to keep the first and last row (first admit year and last admit year) by id
df <- data %>%
# I will keep the patient with the last admit year
arrange(admit_year) %>%
# I group by id
group_by(id) %>%
# to keep the first and last row (first admit year and last admit year) by id
slice(unique(c(1, n())))
df
# A tibble: 5 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2011
2 1 M 1973 2020
3 2 F 1974 2009
4 2 F 1974 2017
5 3 M 1958 2013
to keep the last row (last admit year) by id
df2 <- data %>%
# I will keep the patient with the last admit year
arrange(admit_year) %>%
# I group by id
group_by(id) %>%
# to keep the last row (last admit year) by id
slice(n())
df2
# A tibble: 3 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2020
2 2 F 1974 2017
3 3 M 1958 2013
to keep the first row (first admit year) by id
df3 <- data %>%
# I will keep the patient with the last admit year
arrange(admit_year) %>%
# I group by id
group_by(id) %>%
# to keep the first row (first admit year) by id
slice(1)
df3
# A tibble: 3 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2011
2 2 F 1974 2009
3 3 M 1958 2013
I have a dataset with fields comprising of isic (International Standard Industrial Classification), date, and cash. I would like to first group it by sector then get the sum by fiscal year.
#Here's a look at the data(cpt1). All the dates follow the following format "%Y-%m-01"
Cash Date isic
1 373165 2014-06-01 K
2 373165 2014-12-01 K
3 373165 2017-09-01 K
4 NA <NA> K
5 4789 2015-05-01 K
6 982121 2013-07-01 K
.
.
.
#I was able to group to group them by sector and sum them
cpt_by_sector=cpt1 %>% mutate(sector=recode_factor(isic,
'A'='Agriculture','B'='Industry','C'='Industry','D'='Industry',
'E'='Industry','F'='Industry',.default = 'Services',
.missing = 'Services')) %>%
group_by(sector) %>% summarise_if(is.numeric, sum, na.rm=T)
#here's the result
sector `Cash`
<fct> <dbl>
1 Agriculture 2094393819.
2 Industry 53699068183.
3 Services 223995196357.
#Below is what I would like to get. I would like to take into account the fiscal year i.e. from july to june.
Sector `2009/10` `2010/11` `2011/12` `2012/13` `2013/14` `2014/15` `2015/16` `2016/17`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agriculture 2.02 3.62 3.65 6.26 7.04 8.36 11.7 11.6
2 Industry 87.8 117. 170. 163. 185. 211. 240. 252.
3 Services 271. 343. 479. 495. 584. 664. 738. 821.
4 Total 361. 464. 653. 664. 776. 883. 990. 1085.
PS:I changed the date column to date format
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
# FY is the year of the date, plus 1 if the month is July or later.
# FY_label makes the requested format, by combining the prior year,
# a slash, and digits 3&4 of the FY.
mutate(FY = year(Date) + if_else(month(Date) >= 7, 1, 0),
FY_label = paste0(FY-1, "/", substr(FY, 3, 4))) %>%
mutate(sector = recode_factor(isic,
'A'='Agriculture','B'='Industry','C'='Industry','D'='Industry',
'E'='Industry','F'='Industry', 'K'='Mystery Sector')) %>%
filter(!is.na(FY)) %>% # Exclude rows with missing FY
group_by(FY_label, sector) %>%
summarise(Cash = sum(Cash)) %>%
spread(FY_label, Cash)
# A tibble: 1 x 4
sector `2013/14` `2014/15` `2017/18`
<fct> <int> <int> <int>
1 Mystery Sector 1355286 377954 373165
I have a data frame with a month name character column and I need to convert it into a month name date column. I try this code
data1$month <- as.Date(as.character(data1$month), "%B")
but all the values are turned into NA. what am I doing wrong?
Thanks.
some more info on the data
head(data1)
month year impressions clicks conversions cost revenue month_num
<chr> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
1 April 2018 18737558 107063 291 117505. 145745. 4
2 August 2018 23247068 126523 439 118631. 143217. 8
3 February 2018 20119465 117370 320 146965. 114594. 2
4 January 2018 23905450 148205 382 155756. 145513. 1
5 July 2018 11963956 92740 297 106249. 138354. 7
6 June 2018 6845841 52294 253 53205. 91740. 6
You can match the first three characters of the month name with the predefined vector month.abb:
data1$month_num <- match(substr(data1$month, 1, 3), month.abb)
Hope it helps.