I have a data frame like the following:
Frequency Period Period No. Year
Monthly 1 1 2018
Quarterly Q1 3 2018
YTD YTD-Feb 2 2019
Based on these columns, I'd like to add a min. date and max. date column so that the data frame looks like this:
Frequency Period Period No. Year Min. Date Max. Date
Monthly 1 1 2018 1/1/2018 1/31/2018
Quarterly Q1 3 2018 1/1/2018 3/31/2018
YTD YTD-Feb 2 2019 1/1/2019 2/28/2019
If the min and max should be based on the 'PeriodNo.' column, create a sequence of dates by month starting from January 1 of 'Year', then take the first date as the minimum and the last day of the final month as the maximum:
library(dplyr)
library(purrr)
library(lubridate)
library(stringr)
df1 %>%
mutate(date = map2(as.Date(str_c(Year, '-01-01')),
PeriodNo., ~ seq(.x, length.out = .y, by = '1 month')),
Min.Date = do.call(c, map(date, min)),
Max.Date = do.call(c, map(date, ~ceiling_date(max(.x), 'month')-1))) %>%
select(-date)
# Frequency Period PeriodNo. Year Min.Date Max.Date
#1 Monthly 1 1 2018 2018-01-01 2018-01-31
#2 Quarterly Q1 3 2018 2018-01-01 2018-03-31
#3 YTD YTD-Feb 2 2019 2019-01-01 2019-02-28
Or an option with Map
lst1 <- Map(function(x, y) seq(as.Date(paste0(x, "-01-01")),
length.out = y, by = '1 month'), df1$Year, df1$PeriodNo.)
df1$Min.Date <- do.call(c, lapply(lst1, min))
df1$Max.Date <- do.call(c, lapply(lst1, function(x) (max(x) + months(1) -1)) )
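Note that `months(1)` in the last line relies on lubridate being loaded; in base R, `months()` is a different function that extracts month names from dates. As a minimal sketch of the same idea without any packages, the following computes both columns directly from `Year` and `PeriodNo.`. The helper name `month_end` is hypothetical and not part of the original answer.

```r
# Same sample data as in the answer above
df1 <- data.frame(
  Frequency = c("Monthly", "Quarterly", "YTD"),
  Period    = c("1", "Q1", "YTD-Feb"),
  PeriodNo. = c(1L, 3L, 2L),
  Year      = c(2018L, 2018L, 2019L)
)

# month_end() is a hypothetical helper: last day of the month containing d
month_end <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))
  # Step one month forward from the 1st, then go back one day
  ends <- lapply(first, function(f) seq(f, by = "month", length.out = 2)[2] - 1)
  do.call(c, ends)
}

# The period always starts on January 1 of the given year
df1$Min.Date <- as.Date(sprintf("%d-01-01", df1$Year))

# First day of the last month in the period, e.g. 2018-03-01 for PeriodNo. = 3
last_month <- as.Date(sprintf("%d-%02d-01", df1$Year, df1$PeriodNo.))
df1$Max.Date <- month_end(last_month)
df1
```

Because the month end is derived from the first day of the following month, leap years need no special handling.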
data
df1 <- structure(list(Frequency = c("Monthly", "Quarterly", "YTD"),
Period = c("1", "Q1", "YTD-Feb"), PeriodNo. = c(1L, 3L, 2L
), Year = c(2018L, 2018L, 2019L)), class = "data.frame",
row.names = c(NA,
-3L))
Related
I would like to replace the 'month' value by looking at the 'week' value. If it is week 52 then the month should be 12. How to do that across the data?
Example data:
year month week
2010 1 52
2010 12 52
2011 1 52
2011 12 52
2012 1 52
2012 12 52
expected data:
year month week
2010 12 52
2010 12 52
2011 12 52
2011 12 52
2012 12 52
2012 12 52
As @MrSmithGoesToWashington pointed out, this is not possible if you look at it from a calendar perspective. But if you are just asking how to change a value based on the value in another column, you can do something like this.
library(dplyr)
df <- data.frame(year = c(2010, 2010),
month = c(1, 12),
week = c(52, 52))
df %>% mutate(month = ifelse(week == 52, 12, month))
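The same replacement can also be written with logical indexing instead of `ifelse`. This is only a sketch and assumes, as in the snippet above, that week 52 should always map to month 12.

```r
df <- data.frame(year  = c(2010, 2010),
                 month = c(1, 12),
                 week  = c(52, 52))

# Overwrite month only in the rows where week is 52
df$month[df$week == 52] <- 12
df
```

Rows with any other week number keep their recorded month.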
Here is a base R way.
If the year/week are between the year/week of the first and last days of December, the month is 12 else it's the recorded month.
yw <- with(df1, paste(year, week))
yy01 <- paste(df1$year, 12, 1, sep = "-")
yy31 <- paste(df1$year, 12, 31, sep = "-")
yy01 <- format(as.Date(yy01), "%Y %U")
yy31 <- format(as.Date(yy31), "%Y %U")
ifelse(yy01 <= yw & yw <= yy31, 12, df1$month)
#[1] 12 12 12 12 12 12
Then assign this value to the column month.
df1$month <- ifelse(yy01 <= yw & yw <= yy31, 12, df1$month)
Data
df1 <- read.table(text = "
year month week
2010 1 52
2010 12 52
2011 1 52
2011 12 52
2012 1 52
2012 12 52
", header = TRUE)
# Import data: df1 => data.frame
df1 <- structure(list(year = c(2010L, 2010L, 2011L, 2011L, 2012L, 2012L
), week = c(52L, 52L, 52L, 52L, 52L, 52L)), class = "data.frame",
row.names = c(NA, -6L))
# Generate a sequence of dates, store as a data.frame:
# date_range => data.frame
date_range <- data.frame(
date = seq(
from = as.Date(
paste(
min(df1$year),
"01-01",
sep = "-"
)
),
to = as.Date(
paste(
max(df1$year),
"12-31",
sep = "-"
)
),
by = "days"
)
)
# Derive the month: month_no => integer vector
date_range$month_no <- as.integer(
strftime(
date_range$date,
"%m"
)
)
# Derive the week: week_no => integer vector
date_range$week_no <- as.integer(
strftime(
date_range$date,
"%V"
)
)
# Derive the year: year_no => integer vector
date_range$year_no <- as.integer(
strftime(
date_range$date,
"%Y"
)
)
# Create a lookup table: year_mon_week_lkp => data.frame
year_mon_week_lkp <- transform(
aggregate(
month_no ~ year_no+week_no,
data = date_range,
FUN = max
),
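# Weeks 52/53 count as December. ISO week 1 can contain the last days of
# December, so force it to January instead of taking the max month.
month_no = ifelse(week_no >= 52, 12, ifelse(week_no == 1, 1, month_no))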
)
# Resolve the month using the week_no and the year:
# month => integer vector
df1$month <- with(
df1,
year_mon_week_lkp$month_no[
match(
paste0(
year,
week
),
paste0(
year_mon_week_lkp$year_no,
year_mon_week_lkp$week_no
)
)
]
)
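Note that the two base R approaches above number weeks differently: the first formats dates with `%U` (weeks start on Sunday, and days before the first Sunday of the year fall in week 00), while the lookup table uses `%V` (ISO 8601 weeks). The two can disagree around the turn of the year, which matters if the `week` column comes from another system. A small sketch for checking which convention the data follows; the sample dates are chosen only for illustration:

```r
d <- as.Date(c("2010-12-31", "2012-12-31", "2013-01-01"))

wk <- data.frame(
  date     = d,
  week_U   = format(d, "%U"),  # Sunday-based week of the calendar year
  week_V   = format(d, "%V"),  # ISO 8601 week number
  iso_year = format(d, "%G")   # year that the ISO week belongs to
)
wk
```

For 2012-12-31, `%U` gives 53 while `%V` gives 01, because that Monday already belongs to ISO week 1 of 2013.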
I have a list of 83 csv files with three variables.
I have created new date columns including, month and year.
One of my dataframes from the list looks like this:
> head(estaciones$AeropuertodeBocas_93002)
Date Tx2m Tn2m Pr year month day
1 1988-01-01 27.4 23.1 41.3 1988 1 1
2 1988-01-02 29.8 24.0 0.3 1988 1 2
3 1988-01-03 30.4 24.0 0.4 1988 1 3
4 1988-01-04 30.0 24.2 2.4 1988 1 4
5 1988-01-05 29.6 23.2 9.1 1988 1 5
6 1988-01-06 30.0 23.1 5.2 1988 1 6
I would like to create a new file with the percentage of NA values per variable and per month and year. For example Jun 1988: 2% of missing values for variable "Pr" and dataframe "x".
I have tried using:
na_by_month <- map(estaciones, ~ .x %>%
mutate(Month=month(Date), Mis = rowSums(is.na(.))) %>%
group_by(Month) %>%
summarise(Sum=sum(Mis), Percentage=mean(Mis)))
This is only calculating missing values percentage for each month for the whole series and not per year.
Data (one of several dfs):
df <- structure(list(Date = structure(c(6574,
6575, 6576, 6577, 6578, 6579), class = "Date"),
Tx2m = c(27.4, 29.8, 30.4, 30, 29.6, 30),
Tn2m = c(23.1, 24, 24, 24.2, 23.2, 23.1),
Pr = c(41.3, 0.3, 0.4, 2.4, 9.1, 5.2),
year = c(1988, 1988, 1988, 1988, 1988, 1988 ),
month = c(1, 1, 1, 1, 1, 1), day = 1:6),
row.names = c(NA, 6L), class = "data.frame")
How can I create a new file containing percentage of missing values for each of my data frames inside the list, per month and per year? Thank You
If you're trying to calculate the percentage of missing values by month/year and just by year you could write a function that you can then map to your list of dataframes:
library(dplyr)
library(purrr)
library(openxlsx)
library(rlang)
ldf <- list(df, df, df)
f <- function(data, ...){
v <- enquos(...)
data %>%
group_by(!!! v) %>%
summarize(across(Tx2m:Pr,
list(missing = ~ mean(is.na(.))),
.names = paste0("{.col}_{.fn}_", quo_name(v[[1]]))),
.groups = "drop")
}
miss <- imap(ldf, ~ left_join(f(.x, month, year), f(.x, year), by = "year"))
write.xlsx(miss, "output.xlsx")
How it works
You provide the function f your dataframe and the variables you want to group by and it will calculate the percentage of missing values for those group by variables. For example, f(df, month, year) will group your data by month and year and calculate the percentage of missing values for each variable in the range Tx2m:Pr.
f(df, month, year)
month year Tx2m_missing_month Tn2m_missing_month Pr_missing_month
<int> <int> <dbl> <dbl> <dbl>
1 1 1988 0 0 0
f(df, year)
year Tx2m_missing_year Tn2m_missing_year Pr_missing_year
<int> <dbl> <dbl> <dbl>
1 1988 0 0 0
Note: the order of your grouping variables matters here. The first grouping variable is used to construct the output variable names (e.g. Tn2m_missing_month).
If you want the share of missing values by month/year and by year for each element of your list, then we can apply this function using imap and merge the results by year.
left_join(f(df, month, year), f(df, year), by = "year")
month year Tx2m_missing_month Tn2m_missing_month Pr_missing_month
<int> <int> <dbl> <dbl> <dbl>
1 1 1988 0 0 0
# ... with 3 more variables: Tx2m_missing_year <dbl>,
# Tn2m_missing_year <dbl>, Pr_missing_year <dbl>
Note: The missing by year will be repeated for each month within the year.
Lastly, write.xlsx will write a list of dataframes to an Excel workbook, where each sheet will be an element of your list.
If I've misunderstood your post and you only want the percentage missing by month within year then you can simplify this to:
miss <- imap(ldf, ~ f(.x, month, year))
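If an Excel workbook is not required, the list can also be written as one CSV file per data frame using only base R. This is a minimal sketch: `miss_demo` stands in for the `miss` list built above, and the element names and output directory are hypothetical.

```r
# Stand-in for the list of summaries; the element names are made up
miss_demo <- list(
  station_a = data.frame(month = 1, year = 1988, Pr_missing_month = 0),
  station_b = data.frame(month = 1, year = 1988, Pr_missing_month = 0.02)
)

out_dir <- file.path(tempdir(), "missing_summaries")
dir.create(out_dir, showWarnings = FALSE)

# One file per list element, named after the element
for (nm in names(miss_demo)) {
  write.csv(miss_demo[[nm]],
            file = file.path(out_dir, paste0(nm, ".csv")),
            row.names = FALSE)
}

list.files(out_dir)
```

This only works if the list is named, so an unnamed list such as `ldf` would need names first (for example the station identifiers).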
Plot
To plot you could do something like this:
library(ggplot2)
library(tidyr)
library(scales)
library(lubridate)
plots <- imap(miss, ~ .x %>%
select(ends_with("year")) %>%
distinct() %>%
pivot_longer(cols = -year,
names_pattern = "(.*?)_(.*)",
names_to = c("var", NA)) %>%
mutate(date = ymd(year, truncated = 2L)) %>%
ggplot(aes(x = date, y = value, color = var, group = var)) +
geom_point() +
geom_line() +
scale_y_continuous(labels = percent_format()) +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y")
)
plots[[1]]
This gives one plot per data frame, where each variable is a line, its y-axis value is the percent missing, and the x-axis is the year.
Note: with the given data in the example, the graphic is not that interesting and gives a warning about there being only one point. Additionally, all the points are overlapping on the same (x,y) coordinate with the given data.
df <- structure(list(Date = structure(c(6574, 6575, 6576, 6577, 6578, 6579), class = "Date"),
Tx2m = c(27.4, 29.8, 30.4, 30, 29.6, 30), Tn2m = c(23.1, 24, 24, 24.2, 23.2, 23.1),
Pr = c(41.3, 0.3, 0.4, 2.4, 9.1, 5.2),
year = c(1988, 1988, 1988, 1988, 1988, 1988 ),
month = c(1, 1, 1, 1, 1, 1), day = 1:6),
row.names = c(NA, 6L), class = "data.frame")
library(dplyr)
# All columns except the grouping variables get a missing ratio
nongroup_vars <- setdiff(colnames(df),c('year','month'))
nongroup_vars_mr <- paste0(nongroup_vars,'_missing_ratio')
df %>%
group_by(month,year) %>%
summarise_all(function(x) mean(is.na(x))) %>%
ungroup %>%
rename_with(~nongroup_vars_mr,all_of(nongroup_vars))
This gives the missing ratio of every non-grouping column for each month/year group.
Output:
# A tibble: 1 × 7
month year Date_missing_ratio Tx2m_missing_ratio Tn2m_missing_ratio Pr_missing_ratio day_missing_ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1988 0 0 0 0 0
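The same per-group ratio can be sketched in base R with `aggregate()`. The toy data below is invented so that some values are actually missing; only the column names follow the question.

```r
toy <- data.frame(
  year  = c(1988, 1988, 1988, 1988),
  month = c(1, 1, 2, 2),
  Tx2m  = c(27.4, NA, NA, 30.0),
  Pr    = c(NA, 0.3, 0.4, 2.4)
)

# Share of NA values per variable within each year/month group
na_share <- aggregate(
  toy[c("Tx2m", "Pr")],
  by  = toy[c("year", "month")],
  FUN = function(x) mean(is.na(x))
)
na_share
```

Multiply the result columns by 100 to report percentages instead of ratios.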
In an excel file, there are two columns labelled "id" and "date" as in the following data frame:
df <-
structure(
list(
id = c(1L, 2L, 3L, 4L,5L),
date = c("10/2/2013", "-5/3/2015", "-11/-4/2019", "3/10/2019","")
),
.Names = c("id", "date"),
class = "data.frame",
row.names = c(NA,-5L)
)
The "date" column contains valid dates (e.g. 10/2/2013), non-date entries (e.g. -5/3/2015 and -11/-4/2019), and blank cells. I am looking for a way to read the Excel file into R such that both the dates and the non-date entries are preserved and the blank cells are replaced by NA.
I have tried to use the function "read_excel" and argument "col_types" as follows:
df1<- data.frame(read_excel("df.xlsx", col_types = c("numeric", "date")))
However, this reads the dates and replaces the non-dates with NAs. I have tried other options of col_types e.g. "guess" and "skip" but these did not work for me. Any help on this is much appreciated.
Here's an approach using tidyr::separate and dplyr to filter out negative months so that only positive months are converted to "yearmon" data with zoo:
library(tidyverse)
df %>%
separate(date, c("day", "month", "year"),
sep = "/", remove = FALSE, convert = TRUE) %>%
mutate(month = if_else(month < 0, NA_integer_, month)) %>%
mutate(date2 = zoo::as.yearmon(paste(year, month, sep = "-")))
# id date day month year date2
#1 1 10/2/2013 10 2 2013 Feb 2013
#2 2 -5/3/2015 -5 3 2015 Mar 2015
#3 3 -11/-4/2019 -11 NA 2019 <NA>
#4 4 3/10/2019 3 10 2019 Oct 2019
#5 5 NA NA NA <NA>
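A base R variant of the same idea is sketched below: keep the raw strings, turn blanks into NA, and add a parsed column that is NA wherever the entry is not a valid day/month/year date. The column name `date_parsed` is an assumption for illustration, and the sketch works on the data frame shown in the question rather than on the Excel file itself.

```r
df <- data.frame(
  id   = 1:5,
  date = c("10/2/2013", "-5/3/2015", "-11/-4/2019", "3/10/2019", ""),
  stringsAsFactors = FALSE
)

# Blank cells become NA; every other entry is kept exactly as typed
df$date[df$date == ""] <- NA

# Entries with a leading minus sign do not match %d or %m, so they parse to NA
df$date_parsed <- as.Date(df$date, format = "%d/%m/%Y")
df
```

The original `date` column still holds the non-date entries, so nothing is lost by adding the parsed column next to it.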
I have a table with the following headers and example data
Lat Long Date Value.
30.497478 -87.880258 01/01/2016 10
30.497478 -87.880258 01/02/2016 15
30.497478 -87.880258 01/05/2016 20
33.284928 -85.803608 01/02/2016 10
33.284928 -85.803608 01/03/2016 15
33.284928 -85.803608 01/05/2016 20
I would like to average the value column on a monthly basis for each location.
So example output would be
Lat Long Month Avg Value
30.497478 -87.880258 January 15
A solution using dplyr and lubridate.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Date = mdy(Date), Month = month(Date)) %>%
group_by(Lat, Long, Month) %>%
summarise(`Avg Value` = mean(Value))
dt2
# A tibble: 2 x 4
# Groups: Lat, Long [?]
Lat Long Month `Avg Value`
<dbl> <dbl> <dbl> <dbl>
1 30.49748 -87.88026 1 15
2 33.28493 -85.80361 1 15
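The requested output shows the month as a name ("January"). One way to get that, sketched here in base R, is to index the built-in `month.name` vector, which does not depend on the session locale. The data frame `dt` below is rebuilt from the table in the question, with `Value` assumed as the column name.

```r
dt <- data.frame(
  Lat   = rep(c(30.497478, 33.284928), each = 3),
  Long  = rep(c(-87.880258, -85.803608), each = 3),
  Date  = c("01/01/2016", "01/02/2016", "01/05/2016",
            "01/02/2016", "01/03/2016", "01/05/2016"),
  Value = c(10, 15, 20, 10, 15, 20)
)

# Parse month/day/year strings, then map month numbers 1-12 to English names
month_no <- as.integer(format(as.Date(dt$Date, "%m/%d/%Y"), "%m"))
dt$Month <- month.name[month_no]

res <- aggregate(Value ~ Lat + Long + Month, data = dt, FUN = mean)
res
```

If the data spans several years, add a year column to the formula as well so that, say, January 2016 and January 2017 are not averaged together.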
You can try the following, but it first modifies the data frame adding an extra column, Month, using package zoo.
library(zoo)
dat$Month <- as.yearmon(as.Date(dat$Date, "%m/%d/%Y"))
aggregate(Value. ~ Lat + Long + Month, dat, mean)
# Lat Long Month Value.
#1 30.49748 -87.88026 jan 2016 15
#2 33.28493 -85.80361 jan 2016 15
If you don't want to change the original data, make a copy dat2 <- dat and change the copy.
DATA
dat <-
structure(list(Lat = c(30.497478, 30.497478, 30.497478, 33.284928,
33.284928, 33.284928), Long = c(-87.880258, -87.880258, -87.880258,
-85.803608, -85.803608, -85.803608), Date = structure(c(1L, 2L,
4L, 2L, 3L, 4L), .Label = c("01/01/2016", "01/02/2016", "01/03/2016",
"01/05/2016"), class = "factor"), Value. = c(10L, 15L, 20L, 10L,
15L, 20L)), .Names = c("Lat", "Long", "Date", "Value."), class = "data.frame", row.names = c(NA,
-6L))
EDIT.
If you want to compute several statistics, you can define a function that computes them and returns a named vector and call it in aggregate, like the following.
stat <- function(x){
c(Mean = mean(x), Median = median(x), SD = sd(x))
}
agg <- aggregate(Value. ~ Lat + Long + Month, dat, stat)
agg <- cbind(agg[1:3], as.data.frame(agg[[4]]))
agg
# Lat Long Month Mean Median SD
#1 30.49748 -87.88026 jan 2016 15 15 5
#2 33.28493 -85.80361 jan 2016 15 15 5
I am trying to aggregate a table using ddply().
My table looks like this:
Year Month Count
2000 Jan 1
2000 Jan 2
2001 Feb 2
2001 Feb 1
I want to sum up the counts based on year and month. So I would have 2000, Jan, 3 and 2001, Feb, 3.
My code is
ddply(df,???,sum(Count))
I am not sure how to add in multiple variables.
We group by the variables 'Year' and 'Month' and get the sum of 'Count', explicitly calling summarise from plyr.
Using plyr
library(plyr)
ddply(df, .(Year, Month), plyr::summarise, Count=sum(Count))
# Year Month Count
#1 2000 Jan 3
#2 2001 Feb 3
Or we can use the formula method of aggregate from base R.
aggregate(Count~., df, FUN=sum)
# Year Month Count
#1 2001 Feb 3
#2 2000 Jan 3
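In `Count ~ .`, the dot stands for every column other than `Count`, so the grouping silently changes if `df` gains columns. A minimal sketch that names the grouping columns explicitly, using the same data as this answer:

```r
df <- data.frame(
  Year  = c(2000L, 2000L, 2001L, 2001L),
  Month = c("Jan", "Jan", "Feb", "Feb"),
  Count = c(1L, 2L, 2L, 1L)
)

# Sum Count within each Year/Month combination
totals <- aggregate(Count ~ Year + Month, data = df, FUN = sum)
totals
```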
Or with dplyr, we group by the variables and summarise
library(dplyr)
df %>%
group_by(Year, Month) %>%
dplyr::summarise(Count=sum(Count))
# Year Month Count
# (int) (chr) (int)
#1 2000 Jan 3
#2 2001 Feb 3
Or we convert the 'data.frame' to 'data.table' (setDT(df)), group by the columns, and get the sum of 'Count'.
library(data.table)
setDT(df)[, list(Count=sum(Count)), .(Year, Month)]
# Year Month Count
#1: 2000 Jan 3
#2: 2001 Feb 3
NOTE: When packages that export functions with the same name are loaded together (here plyr and dplyr both export summarise), it is safer to call them as packagename::function (plyr::summarise, dplyr::summarise).
data
df <- structure(list(Year = c(2000L, 2000L, 2001L, 2001L),
Month = c("Jan",
"Jan", "Feb", "Feb"), Count = c(1L, 2L, 2L, 1L)), .Names = c("Year",
"Month", "Count"), class = "data.frame",
row.names = c(NA, -4L))