I am trying to write a function get_value <- function(gyear, gmonth) that selects a row from an existing dataframe, including when the requested month falls between two time points. The dataframe is e.g.
df <- read.table(header = TRUE, text = "
year month var1 var2
2022 1 123 987
2021 4 234 876
2021 1 345 765
2020 7 456 654
2020 3 567 543
2020 1 678 432
2019 1 789 321")
For example, in year 2021 the row
year month var1 var2
2021 1 345 765
is valid for months 1, 2, 3; then there is a change and the next row
year month var1 var2
2021 4 234 876
is valid for months 4 through 12.
If the year & month are already in the dataframe, then I can get the row like this:
get_value <- function(gyear, gmonth){
  library(tidyverse)
  temp <- df %>% filter(year == gyear & month == gmonth)
  temp
}
get_value(gyear = 2020, gmonth = 1)
but I also want to be able to handle months that fall between the months included in the dataframe. For example, I would like to be able to call
get_value(gyear = 2021, gmonth = 5)
and have it return the row
year month var1 var2
2021 4 234 876
because in year 2021 that month falls in the range 4 to 12.
Thanks in advance for your help!
You could first create a new column month2 that indicates the last month for which each row is valid.
library(dplyr)
df2 <- df %>%
  group_by(year) %>%
  arrange(month, .by_group = TRUE) %>%
  mutate(month2 = lead(month - 1, default = 12), .after = month) %>%
  ungroup()
# # A tibble: 7 × 5
#    year month month2  var1  var2
#   <int> <int>  <dbl> <int> <int>
# 1  2019     1     12   789   321
# 2  2020     1      2   678   432
# 3  2020     3      6   567   543
# 4  2020     7     12   456   654
# 5  2021     1      3   345   765
# 6  2021     4     12   234   876
# 7  2022     1     12   123   987
Then write a small filter function that takes a dataframe as input and extracts the rows where gmonth >= month & gmonth <= month2.
get_value <- function(data, gyear, gmonth){
  data %>%
    filter(year == gyear & gmonth >= month & gmonth <= month2)
}
I don't recommend moving the code that builds month2 into this function: if you did, the data would be re-processed every time get_value() is run, which is inefficient.
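If you do want everything bundled in one object, a function factory is a middle ground. This is a sketch, not part of the original answer: it does the month2 preparation once and returns a lookup function that closes over the prepared data.
library(dplyr)

make_get_value <- function(data) {
  # do the month2 preparation once, up front
  prepared <- data %>%
    group_by(year) %>%
    arrange(month, .by_group = TRUE) %>%
    mutate(month2 = lead(month - 1, default = 12), .after = month) %>%
    ungroup()
  # return a lookup function that closes over the prepared data
  function(gyear, gmonth) {
    prepared %>%
      filter(year == gyear & gmonth >= month & gmonth <= month2)
  }
}

get_value2 <- make_get_value(df)
get_value2(gyear = 2021, gmonth = 5)  # same row as get_value(df2, gyear = 2021, gmonth = 5) below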
Output
get_value(df2, gyear = 2020, gmonth = 1)
# # A tibble: 1 × 5
#    year month month2  var1  var2
#   <int> <int>  <dbl> <int> <int>
# 1  2020     1      2   678   432
get_value(df2, gyear = 2021, gmonth = 5)
# # A tibble: 1 × 5
#    year month month2  var1  var2
#   <int> <int>  <dbl> <int> <int>
# 1  2021     4     12   234   876
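For completeness, the same lookup can also be done in base R without precomputing month2, using findInterval() on the sorted change-point months. A sketch, assuming the original df:
get_value_base <- function(data, gyear, gmonth) {
  rows <- data[data$year == gyear, ]
  rows <- rows[order(rows$month), ]
  idx <- findInterval(gmonth, rows$month)  # index of the last change point <= gmonth
  if (idx == 0) return(rows[0, ])          # gmonth precedes every change point
  rows[idx, ]
}

get_value_base(df, gyear = 2021, gmonth = 5)
# returns the 2021 / month 4 row (var1 = 234, var2 = 876)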
I have a dataframe like so:
id year month val
1 2020 1 50
1 2020 7 80
1 2021 1 40
1 2021 7 70
.
.
Now, I want to index all the values using Jan 2020 as the base period for each id. Essentially, group by id, then divide val by the val at Jan 2020 and multiply by 100. The final dataframe would look something like this:
id year month val
1 2020 1 100
1 2020 7 160
1 2021 1 80
1 2021 7 140
.
.
This is what I have tried so far:
df %>% group_by(id) %>% mutate(val = 100*val/[val at Jan 2020])
I can separately get val at Jan 2020 like so:
df %>% filter(year==2020, month==1) %>% select(val)
But it doesn't work together:
df %>% group_by(id) %>% mutate(val = 100*val/(df %>% filter(year==2020, month==1) %>% select(val)))
The above throws an error.
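(For reproducibility, the rows shown above for id 1 can be rebuilt like this; integer/numeric column types assumed.)
df <- data.frame(
  id    = c(1L, 1L, 1L, 1L),
  year  = c(2020L, 2020L, 2021L, 2021L),
  month = c(1L, 7L, 1L, 7L),
  val   = c(50, 80, 40, 70)
)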
A dplyr approach
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(val = val / val[year == 2020 & month == 1] * 100) %>%
  ungroup()
# A tibble: 4 × 4
id year month val
<int> <int> <int> <dbl>
1 1 2020 1 100
2 1 2020 7 160
3 1 2021 1 80
4 1 2021 7 140
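If some id could be missing a Jan 2020 row, val[year == 2020 & month == 1] has length zero inside that group and mutate() errors. A defensive variant (a sketch, not from the original answer) keeps those groups as NA by falling back to a default baseline:
df %>%
  group_by(id) %>%
  mutate(
    # first() returns its default when the Jan 2020 subset is empty
    baseline = first(val[year == 2020 & month == 1], default = NA_real_),
    val = val / baseline * 100
  ) %>%
  select(-baseline) %>%
  ungroup()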
Base R
do.call(
  rbind,
  lapply(
    split(df, df$id),
    function(x){
      cbind(
        subset(x, select = -c(val)),
        "val" = x$val / x$val[x$year == 2020 & x$month == 1] * 100
      )
    }
  )
)
id year month val
1.1 1 2020 1 100
1.2 1 2020 7 160
1.3 1 2021 1 80
1.4 1 2021 7 140
I have a question similar to Calculate proportion of positive values by group, but grouping the average fraction by several columns. I'd like to get the proportion of non-zero values in "num" by "Year" and "season", with something that works for any number of columns, no matter where they are in the df relative to each other.
My data:
> head(df)
# A tibble: 6 x 6
Year Month Day Station num season
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 2017 1 3 266 4 DRY
2 2018 1 3 270 2 DRY
3 2018 1 3 301 1 DRY
4 2018 1 4 314 0 DRY
5 2018 2 4 402 0 DRY
6 2018 1 4 618 0 WET
I thought something like this would work, but I get a warning message:
> aggregate(df$num > 0 ~ df[, c(1, 6)], FUN = mean) # Average proportion of num > 0 per year & season
Error in model.frame.default(formula = df$num > 0 ~ df[, :
  invalid type (list) for variable 'df[, c(1, 6)]'
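(For reproducibility, the six rows from head(df) can be rebuilt like this; factor and numeric types taken from the printed tibble.)
library(tibble)

df <- tibble(
  Year    = factor(c(2017, 2018, 2018, 2018, 2018, 2018)),
  Month   = c(1, 1, 1, 1, 2, 1),
  Day     = c(3, 3, 3, 4, 4, 4),
  Station = c(266, 270, 301, 314, 402, 618),
  num     = c(4, 2, 1, 0, 0, 0),
  season  = factor(c("DRY", "DRY", "DRY", "DRY", "DRY", "WET"))
)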
With dplyr, I think this is what you want:
library(dplyr)
df %>%
  group_by(Year, season) %>%
  summarize(prop_gt_0 = mean(num > 0), .groups = "drop")
# # A tibble: 3 × 3
# Year season prop_gt_0
# <fct> <fct> <dbl>
# 1 2017 DRY 1
# 2 2018 DRY 0.5
# 3 2018 WET 0
It's usually better to refer to columns by name rather than by number, so that, as you say, it works "no matter where they are in the df".
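If the set of grouping columns needs to stay flexible (the question asks for something that works for any number of columns), the grouping can also be driven by a character vector of names. A sketch, where group_cols is a hypothetical variable:
group_cols <- c("Year", "season")  # any number of column names

df %>%
  group_by(across(all_of(group_cols))) %>%
  summarize(prop_gt_0 = mean(num > 0), .groups = "drop")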
You can still use aggregate(); I prefer the formula interface for working with column names:
aggregate(num ~ Year + season, data = df, FUN = \(x) mean(x > 0))
# Year season num
# 1 2017 DRY 1.0
# 2 2018 DRY 0.5
# 3 2018 WET 0.0
I have a dataset with at least three columns:
year sex value
1 2019 M 10
2 2019 F 20
3 2020 M 50
4 2020 F 20
I would like to group by the first column, year, and then add another level to sex that corresponds to the total value in column 3, that is, I would like something like this:
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
Any help is appreciated, especially in dplyr.
Here is just another way of doing this:
library(dplyr)
library(purrr)
library(tibble)  # for add_row()

df %>%
  group_split(year) %>%
  map_dfr(~ add_row(.x, year = first(.x$year), sex = "Total", value = sum(.x$value)))
# A tibble: 6 x 3
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
You can summarise the data for each year and bind it to the original dataset.
library(dplyr)
df %>%
  group_by(year) %>%
  summarise(sex = 'Total',
            value = sum(value)) %>%
  bind_rows(df) %>%
  arrange(year, sex)
# year sex value
# <int> <chr> <dbl>
#1 2019 F 20
#2 2019 M 10
#3 2019 Total 30
#4 2020 F 20
#5 2020 M 50
#6 2020 Total 70
Or in base R:
aggregate(value ~ year, df, sum) |>
  transform(sex = 'Total') |>
  rbind(df)
data
df <- data.frame(year = rep(2019:2020, each = 2),
                 sex = c('M', 'F'), value = c(10, 20, 50, 20))
I have a large dataset with thousands of dates in ymd format. I want to convert this column into three separate columns: year, month, and day. There are literally thousands of dates, so I am trying to do this with a single command for the entire dataset.
You can use the year(), month(), and day() extractors in lubridate for this. Here's an example:
library('dplyr')
library('tibble')
library('lubridate')
## create some data
df <- tibble(date = seq(ymd(20190101), ymd(20191231), by = '7 days'))
which yields
> df
# A tibble: 53 x 1
date
<date>
1 2019-01-01
2 2019-01-08
3 2019-01-15
4 2019-01-22
5 2019-01-29
6 2019-02-05
7 2019-02-12
8 2019-02-19
9 2019-02-26
10 2019-03-05
# … with 43 more rows
Then mutate df using the relevant extractor functions:
df <- mutate(df,
             year = year(date),
             month = month(date),
             day = day(date))
This results in:
> df
# A tibble: 53 x 4
date year month day
<date> <dbl> <dbl> <int>
1 2019-01-01 2019 1 1
2 2019-01-08 2019 1 8
3 2019-01-15 2019 1 15
4 2019-01-22 2019 1 22
5 2019-01-29 2019 1 29
6 2019-02-05 2019 2 5
7 2019-02-12 2019 2 12
8 2019-02-19 2019 2 19
9 2019-02-26 2019 2 26
10 2019-03-05 2019 3 5
# … with 43 more rows
If you only want the new three columns, use transmute() instead of mutate().
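For instance, a minimal sketch with the same extractors, keeping only the three new columns:
df %>%
  transmute(year = year(date),
            month = month(date),
            day = day(date))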
Using lubridate but without having to specify a separator:
library(tidyverse)
df <- tibble(d = c('2019/3/18','2018/10/29'))
df %>%
  mutate(
    date = lubridate::ymd(d),
    year = lubridate::year(date),
    month = lubridate::month(date),
    day = lubridate::day(date)
  )
Note that you can swap ymd() for another lubridate parser (e.g. dmy() or mdy()) to fit other input formats.
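For example, for day/month/year input (a sketch with made-up dates):
tibble(d = c('18/3/2019', '29/10/2018')) %>%
  mutate(
    date = lubridate::dmy(d),
    year = lubridate::year(date),
    month = lubridate::month(date),
    day = lubridate::day(date)
  )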
A slightly different tidyverse solution that requires less code could be:
Code
tibble(date = "2018-05-01") %>%
mutate_at(vars(date), lst(year, month, day))
Result
# A tibble: 1 x 4
date year month day
<chr> <dbl> <dbl> <int>
1 2018-05-01 2018 5 1
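mutate_at() still works but is superseded in current dplyr; an equivalent with across() could look like this (a sketch, assuming lubridate is attached so year, month and day are available):
library(dplyr)
library(lubridate)

tibble(date = "2018-05-01") %>%
  mutate(across(date, list(year = year, month = month, day = day),
                .names = "{.fn}"))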
# Data
d <- data.frame(date = c("2019-01-01", "2019-02-01", "2012/03/04"))

library(lubridate)
cbind(d,
      read.table(header = FALSE,
                 sep = "-",
                 text = as.character(ymd(d$date))))
# date V1 V2 V3
#1 2019-01-01 2019 1 1
#2 2019-02-01 2019 2 1
#3 2012/03/04 2012 3 4
OR
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
  mutate(date2 = as.character(ymd(date))) %>%
  separate(date2, c("year", "month", "day"), "-")
# date year month day
#1 2019-01-01 2019 01 01
#2 2019-02-01 2019 02 01
#3 2012/03/04 2012 03 04
dat <- data.frame(loc.id = rep(1:2, each = 3),
                  year = rep(1981:1983, times = 2),
                  prod = c(200, 300, 400, 150, 450, 350),
                  yld = c(1200, 1250, 1200, 3000, 3200, 3200))
If I want to select for each loc.id distinct values of yld, I do this:
dat %>% group_by(loc.id) %>% distinct(yld)
loc.id yld
<int> <dbl>
1 1200
1 1250
2 3000
2 3200
However, what I want to do is, for each loc.id, if multiple years have the same yld, keep the one with the lower prod value. My final dataframe should look like the following, i.e. I want the prod and year columns included as well:
loc.id year prod yld
1 1981 200 1200
1 1982 300 1250
2 1981 150 3000
2 1983 350 3200
We can arrange by 'prod' and then slice the first observation in each loc.id/yld group:
dat %>%
  arrange(loc.id, prod) %>%
  group_by(loc.id, yld) %>%
  slice(1)
# A tibble: 4 x 4
# Groups: loc.id, yld [4]
# loc.id year prod yld
# <int> <int> <dbl> <dbl>
#1 1 1981 200 1200
#2 1 1982 300 1250
#3 2 1981 150 3000
#4 2 1983 350 3200
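With dplyr 1.0.0 or later, slice_min() expresses the same idea and makes the tie-breaking explicit; a sketch:
dat %>%
  group_by(loc.id, yld) %>%
  slice_min(prod, n = 1, with_ties = FALSE) %>%  # keep the lowest prod per loc.id/yld
  ungroup() %>%
  arrange(loc.id, year)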