Add a column in R with season time

I have a dataset, and I want to add a column with the season, like this:
Month Year Region Season
January 2019 NY Winter
February 2019 NY Winter
March 2019 NY Spring
September 2019 NY Fall
How can I write R code that automatically adds a column in which January, February and December are Winter; March, April and May are Spring; and so on?
Thanks a lot for helping
season <- c(data, Spring = "March", Spring = "April")

We can create a key/value dataset and do a join:
library(dplyr)
keydat <- tibble(Month = month.name,
Season = rep(c("Winter", "Spring", "Summer", "Fall", "Winter"),
c(2, 3, 3, 3, 1)))
df1 <- left_join(df1, keydat, by = "Month")
-output
df1
Month Year Region Season
1 January 2019 NY Winter
2 February 2019 NY Winter
3 March 2019 NY Spring
4 September 2019 NY Fall
data
df1 <- structure(list(Month = c("January", "February", "March", "September"
), Year = c(2019L, 2019L, 2019L, 2019L), Region = c("NY", "NY",
"NY", "NY")), class = "data.frame", row.names = c(NA, -4L))
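The key/value idea above can also be sketched without a join, using a named vector as the lookup. This is a minimal base R sketch, not part of the original answer; the name `season_lookup` is made up for illustration, and the data frame is rebuilt here so the snippet runs on its own.

```r
# Sample data mirroring the df1 used in this answer
df1 <- data.frame(
  Month  = c("January", "February", "March", "September"),
  Year   = 2019L,
  Region = "NY",
  stringsAsFactors = FALSE
)

# Named vector: names are the month names, values are the seasons
# (season_lookup is a hypothetical name chosen for this sketch)
season_lookup <- setNames(
  rep(c("Winter", "Spring", "Summer", "Fall", "Winter"), c(2, 3, 3, 3, 1)),
  month.name
)

# Index the lookup by month name; unname() drops the month names from the result
df1$Season <- unname(season_lookup[df1$Month])
df1
```

A month name that is misspelled or missing from `month.name` would come back as NA rather than raising an error, so it may be worth checking `anyNA(df1$Season)` afterwards.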

In base R you could do:
df1$Season <- c('Winter', 'Spring', 'Summer', 'Fall')[
1 + (match(df1$Month, month.name) %/% 3) %% 4]
Which results in:
df1
#> Month Year Region Season
#> 1 January 2019 NY Winter
#> 2 February 2019 NY Winter
#> 3 March 2019 NY Spring
#> 4 September 2019 NY Fall
(Using akrun's reproducible data)
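To see why the index arithmetic in the base R answer works, it may help to expand it for all twelve months. The sketch below only tabulates the intermediate values; the column names in `breakdown` are chosen here for illustration.

```r
seasons <- c("Winter", "Spring", "Summer", "Fall")
m <- seq_along(month.name)        # month numbers 1..12, as match() would return

quotient <- m %/% 3               # integer division: 0 0 1 1 1 2 2 2 3 3 3 4
idx <- 1 + quotient %% 4          # %% 4 wraps December's 4 back to 0

breakdown <- data.frame(
  Month    = month.name,
  quotient = quotient,
  index    = idx,
  Season   = seasons[idx],
  stringsAsFactors = FALSE
)
breakdown
```

The modulo step is what sends December (quotient 4) back to index 1, so it shares "Winter" with January and February.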

Related

Creating a new column using scores from past years (which is in the same dataframe)

I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country year score
France 2020 10
France 2019 9
Germany 2020 15
Germany 2019 14
I would like to add a new column called previous_year_score that, for each row, looks up the "score" of the same country for "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
i <- match(paste(df[[1]],df[[2]]-1),paste(df[[1]],df[[2]]))
df$prev_score <- df[i,3]
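The match() solution above is compact, so here is the same idea spelled out with named intermediate values. This is a sketch using column names instead of positions; `key_now` and `key_prev` are names introduced here for illustration.

```r
df <- data.frame(
  country = c("France", "France", "Germany", "Germany"),
  year    = c(2020L, 2019L, 2020L, 2019L),
  score   = c(10L, 9L, 15L, 14L),
  stringsAsFactors = FALSE
)

# One key per row ("France 2020"), and the key each row wants to look up
key_now  <- paste(df$country, df$year)
key_prev <- paste(df$country, df$year - 1L)

# Position of the previous-year row for each row, NA when there is none
i <- match(key_prev, key_now)

# Indexing with NA yields NA, which is exactly the desired fallback
df$previous_year_score <- df$score[i]
df
```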
You can use the following solution:
library(dplyr)
df %>%
group_by(country) %>%
arrange(year) %>%
mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
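Both dplyr answers check `year - lag(year) == 1` so that a gap in the years does not pull in the wrong score. Another way to get the same guarantee is a self-join on (country, year + 1). The following is a base R sketch of that idea, not taken from the answers; the Germany 2017 row is a hypothetical addition to show the gap case.

```r
df <- data.frame(
  country = c("France", "France", "Germany", "Germany", "Germany"),
  year    = c(2020L, 2019L, 2020L, 2019L, 2017L),  # 2017 row is hypothetical
  score   = c(10L, 9L, 15L, 14L, 12L),
  stringsAsFactors = FALSE
)

# Shift every score forward one year so it lines up with the row that needs it
prev <- data.frame(
  country = df$country,
  year    = df$year + 1L,
  previous_year_score = df$score,
  stringsAsFactors = FALSE
)

# Left join: rows without a matching previous year get NA
out <- merge(df, prev, by = c("country", "year"), all.x = TRUE)
out
```

Here Germany 2019 gets NA because there is no Germany 2018 row, even though an older 2017 row exists. Note that merge() sorts the result by the join keys.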

Identify data on the last available date in r

> dput(df1)
structure(list(X = c("cc_China", "bb_China", "dd_China", "cc_Egypt",
"bb_Egypt", "dd_Egypt"), Country = c("China", "China", "China",
"Egypt", "Egypt", "Egypt"), May = c(2, 3, 8, 2, 4, 1), Jun = c(2,
2, 5, 5, 5, 5), Jul = c(3, NA, NA, 3, 2, NA), Aug = c(4, 6, 3,
2, 3, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I have a dataset like the one above; I have extracted the country from the X column into the Country column. For each country, I want the latest month, and its values, in which all three rows (cc, bb and dd) are non-NA. For China, that is Aug, where cc, bb and dd all have values. For Egypt, it is Jun, the latest month in which all three values are available. Thanks.
> df1
X Country May Jun Jul Aug
cc_China China 2 2 3 4
bb_China China 3 2 NA 6
dd_China China 8 5 NA 3
cc_Egypt Egypt 2 5 3 2
bb_Egypt Egypt 4 5 2 3
dd_Egypt Egypt 1 5 NA NA
I wish to get this
Month X Value
Aug cc_China 4
Aug bb_China 6
Aug dd_China 3
Jun cc_Egypt 5
Jun bb_Egypt 5
Jun dd_Egypt 5
Get the data in long format and, for each Country, keep only the rows for months in which all values are non-NA. Then, for each Country, keep only the latest of those months.
Since we cannot compare character month names directly, I have converted them to numbers using the built-in vector month.abb.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -c(X, Country), names_to = 'Month') %>%
mutate(month_num = match(Month, month.abb)) %>%
group_by(Country, Month) %>%
filter(all(!is.na(value))) %>%
group_by(Country) %>%
filter(month_num == max(month_num)) %>%
ungroup %>% select(-month_num, -Country)
# X Month value
# <chr> <chr> <dbl>
#1 cc_China Aug 4
#2 bb_China Aug 6
#3 dd_China Aug 3
#4 cc_Egypt Jun 5
#5 bb_Egypt Jun 5
#6 dd_Egypt Jun 5
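The same logic can be sketched in base R without reshaping, by checking each month column per country. This is only an illustration: the helper name `latest_complete` is made up here, and it assumes every country has at least one month with no NA values.

```r
df1 <- data.frame(
  X = c("cc_China", "bb_China", "dd_China", "cc_Egypt", "bb_Egypt", "dd_Egypt"),
  Country = rep(c("China", "Egypt"), each = 3),
  May = c(2, 3, 8, 2, 4, 1),
  Jun = c(2, 2, 5, 5, 5, 5),
  Jul = c(3, NA, NA, 3, 2, NA),
  Aug = c(4, 6, 3, 2, 3, NA),
  stringsAsFactors = FALSE
)

# Month columns in calendar order (intersect keeps the order of month.abb)
month_cols <- intersect(month.abb, names(df1))

# Hypothetical helper: for one country's rows, find the last month with no NA
latest_complete <- function(d) {
  complete <- colSums(is.na(d[month_cols])) == 0
  last <- tail(month_cols[complete], 1)  # assumes at least one complete month
  data.frame(Month = last, X = d$X, Value = d[[last]], stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(split(df1, df1$Country), latest_complete))
rownames(result) <- NULL
result
```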

Applying a function iteratively in a grouped dplyr dataframe to create a column in R

Suppose I'm given the following input dataframe:
ID Date
1 20th May, 2020
1 21st May, 2020
1 28th May, 2020
1 29th May, 2020
2 20th May, 2020
2 1st June, 2020
I want to generate the following dataframe:
ID Date Delta
1 20th May, 2020 0
1 21st May, 2020 1
1 28th May, 2020 7
1 29th May, 2020 1
2 20th May, 2020 0
2 1st June, 2020 12
The idea is to group by ID first. Then, within each ID, I iterate over the dates and subtract the previous date from the current one; the first date in each group gets a delta of 0.
I have been using dplyr, but I am unsure how to do this per group and iteratively.
My goal is to filter the deltas and retain 0 and anything larger than 7, but the 'preceding date' logic must be applied within each ID.
library(dplyr)
dat %>%
mutate(Date = as.Date(gsub("[a-z]{2} ", " ", Date), format = "%d %b, %Y")) %>%
group_by(ID) %>%
mutate(Delta = c(0, diff(Date))) %>%
ungroup()
# # A tibble: 6 x 3
# ID Date Delta
# <dbl> <date> <dbl>
# 1 1 2020-05-20 0
# 2 1 2020-05-21 1
# 3 1 2020-05-28 7
# 4 1 2020-05-29 1
# 5 2 2020-05-20 0
# 6 2 2020-06-01 12
Steps:
1. remove the ordinal suffix from the day numbers, so that we can
2. convert them to proper Date-class objects, then
3. diff them within ID groups.
Data
dat <- structure(list(ID = c(1, 1, 1, 1, 2, 2), Date = c(" 20th May, 2020", " 21st May, 2020", " 28th May, 2020", " 29th May, 2020", " 20th May, 2020", " 1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
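For readers who want to inspect each of the steps listed above separately, here is a base R sketch with one line per step. It is an illustration rather than the answer's code: it uses a slightly different regular expression, `%B` for the month name, and ave() in place of a grouped mutate. Month-name parsing depends on the locale, so an English (or C) locale is assumed.

```r
dat <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2),
  Date = c(" 20th May, 2020", " 21st May, 2020", " 28th May, 2020",
           " 29th May, 2020", " 20th May, 2020", " 1st June, 2020"),
  stringsAsFactors = FALSE
)

# Step 1: drop leading whitespace and the ordinal suffix after the day number
clean <- sub("^\\s*([0-9]+)(st|nd|rd|th)", "\\1", dat$Date)

# Step 2: parse to Date (assumes English month names in the current locale)
dat$Date <- as.Date(clean, format = "%d %B, %Y")

# Step 3: difference to the previous date within each ID, 0 for the first row
dat$Delta <- ave(as.numeric(dat$Date), dat$ID,
                 FUN = function(x) c(0, diff(x)))
dat
```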
Similar logic to @r2evans's answer, but with different functions.
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Delta = as.integer(Date - lag(Date, default = first(Date)))) %>%
ungroup
# ID Date Delta
# <int> <date> <int>
#1 1 2020-05-20 0
#2 1 2020-05-21 1
#3 1 2020-05-28 7
#4 1 2020-05-29 1
#5 2 2020-05-20 0
#6 2 2020-06-01 12
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L), Date = c("20th May, 2020",
"21st May, 2020", "28th May, 2020", "29th May, 2020", "20th May, 2020",
"1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
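Neither answer shows the final filtering step the question mentions (retain deltas of 0 and anything larger than 7). A minimal base R sketch of that step follows. It starts from already-parsed dates so it runs independently of the locale, and it reads "larger than 7" strictly, which is an assumption about the asker's intent.

```r
dat <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2),
  Date = as.Date(c("2020-05-20", "2020-05-21", "2020-05-28",
                   "2020-05-29", "2020-05-20", "2020-06-01"))
)

# Per-ID difference to the previous date, 0 for the first date of each ID
dat$Delta <- ave(as.numeric(dat$Date), dat$ID,
                 FUN = function(x) c(0, diff(x)))

# Keep the first row of each ID (Delta == 0) and gaps strictly above 7 days
kept <- dat[dat$Delta == 0 | dat$Delta > 7, ]
kept
```

With this strict reading, the 7-day gap on 2020-05-28 is dropped; using `>= 7` instead would keep it.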

R test if value is lowest from group, add 'yes'/'no' in new column if value is lowest from group

I'm relatively new to R and running into a problem I can't seem to solve. My apologies if this question has been asked before, but answers related to 'finding lowest' I'm running into here seem to focus on extracting the lowest value, I haven't found much about using it as a condition to add new values to a column.
A simplified example of what I'm trying to achieve is below. I have a list of building names and the years they have been in use, and I want to add to the column first_year "yes" and "no" depending on if the year the building is in use is the first year or not.
building_name year_inuse first_year
office 2020 yes
office 2021 no
office 2022 no
office 2023 no
house 2020 yes
house 2021 no
house 2022 no
house 2023 no
retail 2020 yes
retail 2021 no
retail 2022 no
retail 2023 no
I grouped the data by the building names, and now I'm thinking about doing something like:
data_new <- data %>% mutate(first_year = if_else(...., "yes", "no"))
so add a condition in the if_else that tests if the year is the lowest from the group, and if so add a yes, otherwise add a no. However, I can't seem to figure out how to do this and if this is even the best approach.
Help is much appreciated.
Once you've grouped, you can get the min value for the group, and use that in your comparison, like this:
library(dplyr)
data <- tibble::tribble(
~building_name, ~year_inuse,
"office", 2020,
"office", 2021,
"office", 2022,
"office", 2023,
"house", 2020,
"house", 2021,
"house", 2022,
"house", 2023,
"retail", 2020,
"retail", 2021,
"retail", 2022,
"retail", 2023
)
data %>%
group_by(building_name) %>%
mutate(first_year = if_else(year_inuse == min(year_inuse), 'yes', 'no')) %>%
ungroup()
Which gives
# A tibble: 12 x 3
building_name year_inuse first_year
<chr> <dbl> <chr>
1 office 2020 yes
2 office 2021 no
3 office 2022 no
4 office 2023 no
5 house 2020 yes
6 house 2021 no
7 house 2022 no
8 house 2023 no
9 retail 2020 yes
10 retail 2021 no
11 retail 2022 no
12 retail 2023 no
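If dplyr is not available, the same group-wise comparison can be sketched in base R with ave(), which returns the group minimum aligned to every row. This is an alternative illustration, not part of the answer above.

```r
data <- data.frame(
  building_name = rep(c("office", "house", "retail"), each = 4),
  year_inuse    = rep(2020:2023, times = 3),
  stringsAsFactors = FALSE
)

# Minimum year per building, repeated for each row of that building
group_min <- ave(data$year_inuse, data$building_name, FUN = min)

# Flag the rows whose year equals their group's minimum
data$first_year <- ifelse(data$year_inuse == group_min, "yes", "no")
data
```

Like the dplyr version, this does not depend on the row order, and tied minimum years within a building would all be flagged "yes".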
If 'year_inuse' is not ordered, arrange by 'building_name' and 'year_inuse' first. Then create a logical vector with duplicated(), negate it, convert it to a numeric index (1 + ...), and use that index to pick from a vector of values, i.e. 'no', 'yes'.
library(dplyr)
data_new <- data %>%
arrange(building_name, year_inuse) %>%
mutate(first_year = c("no", "yes")[1 + !duplicated(building_name)])
-output
# building_name year_inuse first_year
#1 house 2020 yes
#2 house 2021 no
#3 house 2022 no
#4 house 2023 no
#5 office 2020 yes
#6 office 2021 no
#7 office 2022 no
#8 office 2023 no
#9 retail 2020 yes
#10 retail 2021 no
#11 retail 2022 no
#12 retail 2023 no
data
data <- structure(list(building_name = c("office", "office", "office",
"office", "house", "house", "house", "house", "retail", "retail",
"retail", "retail"), year_inuse = c(2020L, 2021L, 2022L, 2023L,
2020L, 2021L, 2022L, 2023L, 2020L, 2021L, 2022L, 2023L)),
row.names = c(NA,
-12L), class = "data.frame")
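The indexing trick in this answer is easier to follow when the intermediate vectors are printed. The short sketch below uses a small made-up vector of building names, already sorted, purely for illustration.

```r
building <- c("house", "house", "office", "office", "office")

# TRUE for the first occurrence of each name, FALSE for repeats
is_first <- !duplicated(building)

# Logical is coerced to 0/1, so adding 1 gives index 2 for firsts, 1 for repeats
idx <- 1 + is_first

# Position 1 is "no", position 2 is "yes"
labels <- c("no", "yes")[idx]
labels
```

Because duplicated() marks the first occurrence in row order, the result only identifies the earliest year if the rows are sorted by year within each building, which is why the arrange step comes first.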

In old data frame, order by two columns and store first of each row into new data frame

I have a data frame that contains 3 columns, and I'd like to use the columns date and location to obtain the most recent observation for each location and store it in a new data frame.
> old.data
date location amount
2014 NY 1
2015 NJ 2
2016 NY 3
2015 NM 4
2013 NY 5
2014 NJ 6
2016 NM 7
2016 NJ 8
2015 NY 9
> new.data
date location amount
2016 NJ 8
2016 NM 7
2016 NY 3
Using dplyr:
library(dplyr)
new.data <- old.data %>% arrange(desc(date), location) %>% group_by(location) %>% slice(1)
new.data
Source: local data frame [3 x 2]
Groups: location [3]
date location
<int> <fctr>
1 2016 NJ
2 2016 NM
3 2016 NY
Using data.table:
library(data.table)
# Code updated by Arun
setDT(old.data)[order(-date, location), .(date = date[1L]), by = location]
location date
1: NJ 2016
2: NM 2016
3: NY 2016
Data
old.data <- structure(list(date = c(2014L, 2015L, 2016L, 2015L, 2013L, 2014L,
2016L, 2016L, 2015L), location = structure(c(3L, 1L, 3L, 2L,
3L, 1L, 2L, 1L, 3L), .Label = c("NJ", "NM", "NY"), class = "factor")), .Names = c("date",
"location"), class = "data.frame", row.names = c(NA, -9L))
Update (as OP changed the original dataframe)
The dplyr solution is still valid.
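For data.table, this is the only way I could think of (note that filtering on max(date) only works here because every location has an observation in the latest year):
setDT(old.data)[order(-date, location), colnames(old.data), with = FALSE][date == max(date)]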
date location amount
1: 2016 NJ 8
2: 2016 NM 7
3: 2016 NY 3
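Filtering on the overall maximum date gives the expected result for this data because every location has a 2016 row. The base R sketch below uses a modified copy of the data, with the NJ 2016 row removed as a hypothetical case, to show how a global filter and a per-location pick can differ. It is written in base R rather than data.table so that it runs without extra packages.

```r
# Hypothetical variant of old.data: the (2016, NJ, 8) row is left out
old.data <- data.frame(
  date     = c(2014L, 2015L, 2016L, 2015L, 2013L, 2014L, 2016L, 2015L),
  location = c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NY"),
  amount   = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L),
  stringsAsFactors = FALSE
)

# Global filter: keeps only rows from the overall latest year, so NJ disappears
global <- old.data[old.data$date == max(old.data$date), ]

# Per-location pick: sort by location and descending date, keep first of each
ordered <- old.data[order(old.data$location, -old.data$date), ]
latest  <- ordered[!duplicated(ordered$location), ]
rownames(latest) <- NULL
latest
```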
Using .SD and .SDcols as suggested by Arun
# adding more data
old.data$amount <- 1:9
old.data$a <- 10:18
# Retain all columns
keep_cols <- colnames(old.data)[-2] # Remove the column which is mentioned in by
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = keep_cols]
# or assigning colnames to .SDcols directly:
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = (colnames(old.data)[-2])]
location date amount a
1: NJ 2016 8 17
2: NM 2016 7 16
3: NY 2016 3 12
What about this:
library(dplyr)
date <- c(2014, 2015, 2016, 2015, 2013, 2014, 2016, 2016, 2015)
location <- c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY")
old.data <- data.frame(date, location)
new.data <- group_by(old.data, location)
new.data <- summarise(new.data, year = max(date))
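Summarising with max(date) returns only the location and the year, so the amount column from the updated question is lost. One possible way to bring it back, sketched here in base R, is to compute the latest date per location and join that back onto the original rows. The variable name `latest` is introduced for this example.

```r
old.data <- data.frame(
  date     = c(2014, 2015, 2016, 2015, 2013, 2014, 2016, 2016, 2015),
  location = c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY"),
  amount   = 1:9,
  stringsAsFactors = FALSE
)

# Latest date per location (one row per location)
latest <- aggregate(date ~ location, data = old.data, FUN = max)

# Inner join on both keys keeps only the matching rows, with all their columns
new.data <- merge(latest, old.data, by = c("location", "date"))
new.data
```

If a location had several rows sharing its latest date, all of them would be returned, so a further tie-breaking rule would be needed in that case.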
Using the data.table package:
library(data.table)
setDT(old.data)[order(-date), .SD[1L], by = location]
# location date
# 1: NY 2016
# 2: NM 2016
# 3: NJ 2016
