I have a data frame that looks like this
Region 2000 2001 2002 2003 2004 2005
Australia 15.6 18.4 19.2 20.2 39.1 50.2
Norway 19.05 20.2 15.3 10 10.1 5.6
and basically I need a quick way to add extra columns in-between the currently existing columns that contain interpolated values of the surrounding columns.
Think of it like this: say you don't want columns for every year, but rather columns for every quarter. Then, for every pair of years (like 2000 and 2001), we would need to add 3 extra columns in-between these years.
The values of these columns will just be interpolated values. So, for Australia, the value in 2000 is 15.6 and in 2001 it is 18.4. So we calculate (18.4 - 15.6)/4 = 0.7, and then the values should now be 15.6, 16.3, 17, 17.7, and finally 18.4.
I have a working solution that builds up the new dataframe from scratch using a for loop. It is EXTREMELY slow. How to speed this up?
This is how I did it when I had a similar problem. Not the most sophisticated solution but it works.
Australia=c( 15.6, 18.4, 19.2, 20.2, 39.1, 50.2)
library(zoo)
midpoints=rollmean(Australia, 2)
biyearly=c(rbind(Australia,midpoints))
midpoints=rollmean(biyearly, 2)
quarterly=c(rbind(biyearly,midpoints))
quarterly
#[1] 15.600 16.300 17.000 17.700 18.400 18.600 18.800 19.000 19.200 19.450 19.700
#[12] 19.950 20.200 24.925 29.650 34.375 39.100 41.875 44.650 47.425 50.200 33.600
#[23] 17.000 16.300
Here is one way with tidyverse:
library(tidyverse)
df %>%
#get data in long format
pivot_longer(cols = -Region) %>%
#group by Region
group_by(Region) %>%
#Create 4 number sequence between every 2 value
summarise(temp = list(unlist(map2(value[-n()], value[-1], seq, length.out = 4)))) %>%
#Get data in long format
unnest(temp) %>%
group_by(Region) %>%
#Create column name
mutate(col = paste0(rep(names(df)[-c(1, ncol(df))], each = 4), "Q", 1:4)) %>%
#Spread data in wide format
pivot_wider(names_from = col, values_from = temp)
# A tibble: 2 x 21
# Groups: Region [2]
# Region `2000Q1` `2000Q2` `2000Q3` `2000Q4` `2001Q1` `2001Q2` `2001Q3` `2001Q4` `2002Q1`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Austr… 15.6 16.5 17.5 18.4 18.4 18.7 18.9 19.2 19.2
#2 Norway 19.0 19.4 19.8 20.2 20.2 18.6 16.9 15.3 15.3
# … with 11 more variables: `2002Q2` <dbl>, `2002Q3` <dbl>, `2002Q4` <dbl>,
# `2003Q1` <dbl>, `2003Q2` <dbl>, `2003Q3` <dbl>, `2003Q4` <dbl>, `2004Q1` <dbl>,
# `2004Q2` <dbl>, `2004Q3` <dbl>, `2004Q4` <dbl>
data
df <- structure(list(Region = structure(1:2, .Label = c("Australia",
"Norway"), class = "factor"), `2000` = c(15.6, 19.05), `2001` = c(18.4,
20.2), `2002` = c(19.2, 15.3), `2003` = c(20.2, 10), `2004` = c(39.1,
10.1), `2005` = c(50.2, 5.6)), class = "data.frame", row.names = c(NA, -2L))
Here is a solution using dplyr. Should be more consistent and much faster than a loop:
# dummy data
df <- tibble(Region = LETTERS[1:5],
`2000` = 1:5,
`2001` = 3:7,
`2002` = 10:14)
# function to calculate quarterly values
into_quarter <- function(x) x / 4
df %>%
# create new variables that contain quarterly values
mutate_at(vars(starts_with("200")),
.funs = list("Q1" = into_quarter,
"Q2" = into_quarter,
"Q3" = into_quarter,
"Q4" = into_quarter)) %>%
# sort them approriatly.
# can also be done with base R and order(names), depending on desired result
select(Region,
starts_with("2000"),
starts_with("2001"),
starts_with("2002"),
# in case there are also other variables and to not loose any information
everything())
Related
I'm unsure how to structure my pivot longer command when I have both annual and monthly data. For example I have:
wide <- data.frame(region_name = character(), # Create empty data frame
total_population_2019 = numeric(),
total_population_2020 = numeric(),
mean_temperature_2019_1 = numeric(),
mean_temperature_2019_2 = numeric(),
mean_temperature_2020_1 = numeric(),
mean_temperature_2020_2 = numeric(),
stringsAsFactors = FALSE)
wide[1, ] <- list("funville", 50000, 51250, 26.3, 24.6, 25.7, 24.9)
region_name total_population_2019 total_population_2020 mean_temperature_2019_1 mean_temperature_2019_2 mean_temperature_2020_1 mean_temperature_2020_2
funville 50000 51250 26.3 24.6 25.7 24.9
I'm able to pivot on the monthly columns using spread:
long <- pivot_longer(wide, cols = 4:7, names_to = c("layer" ,"year", "month"),
names_pattern = "(.*)_(.*)_?_(.*)") %>%
group_by(layer) %>%
mutate(n = 1:n()) %>%
spread(layer, value) %>%
select(-n)
which gives
region_name total_population_2019 total_population_2020 year month mean_temperature
1 funville 50000 51250 2019 1 26.3
2 funville 50000 51250 2019 2 24.6
3 funville 50000 51250 2020 1 25.7
4 funville 50000 51250 2020 2 24.9
I'd like to now have a population column where the values are attributed for each row/month that falls in that year, ideally would look like:
desired.df <- data.frame(region_name = c("funville", "funville", "funville", "funville"),
year = c("2019", "2019", "2020", "2020"),
month = c("1", "2", "1", "2"),
population = c("50000", "50000", "51250", "51250"),
mean_temperature = c("26.3", "24.6", "25.7", "24.9"))
which gives
region_name year month population mean_temperature
1 funville 2019 1 50000 26.3
2 funville 2019 2 50000 24.6
3 funville 2020 1 51250 25.7
4 funville 2020 2 51250 24.9
Does anyone have a solution? Thanks in advance
One option would be to use the names_pattern argument and the special .value. To make this work I first add a helper month to your population columns. Additionally I use tidyr::fill to fill up the population column:
library(dplyr)
library(tidyr)
wide |>
rename_with(~ paste(.x, 1, sep = "_"), starts_with("total")) |>
pivot_longer(-region_name,
names_to = c(".value", "year", "month"),
names_pattern = "^(.*?)_(\\d+)_(\\d+)$") |>
group_by(year) |>
fill(total_population) |>
arrange(year)
#> # A tibble: 4 × 5
#> # Groups: year [2]
#> region_name year month total_population mean_temperature
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 funville 2019 1 50000 26.3
#> 2 funville 2019 2 50000 24.6
#> 3 funville 2020 1 51250 25.7
#> 4 funville 2020 2 51250 24.9
I have the following sample data:
samplesize=100
df <- data.frame(sex = sample(c("M", "F"), size = samplesize, replace = TRUE),
agegrp = sample(c("old", "middle", "young"), size = samplesize, replace = TRUE),
duration1 = runif(samplesize, min = 1, max = 100),
duration2 = runif(samplesize, min = 1, max = 100),
country = sample(c("USA", "CAN"), size = samplesize, replace = TRUE))
df
My goal is to plot a table like this that displays the median values [median(na.rm = TRUE) as there might be missing values]
USA CAN
total old middle young M F total old middle young M F
duration1 10.2 12.2 13.1 10.2 13.0 13.9 ... ... ... ... ... ...
duration2 10.4 13.2 13.2 10.0 13.1 14.0 ... ... ... ... ... ...
The way I would usually calculate such a table is to calculate the median values columnwise:
df %>%
group_by(country, agegrp) %>%
summarise(dur1 = median(duration1, na.rm = TRUE),
dur2 = median(duration, na.rm = TRUE)
And finally I put all the columns together. Unfortunately, as the number of combinations gets bigger, this methods becomes very cumbersome. So my question is:
Is there any function like table() that let's me calculate means or medians (instead of frequencies) using specific combinations of variables?
It would also be fine if it was just a two-dimensional table with multi-dimensional variable names like:
USA_total USA_old USA_middle USA_young USA_m USA_f CAN_total ...
duration1
duration2
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(sex, agegrp, country), names_to = "parameters") %>%
group_by(agegrp, country, parameters) %>%
summarise(mean = mean(value, na.rm=TRUE)) %>%
pivot_wider(names_from = c(country, agegrp), values_from = mean)
Returns:
parameters CAN_middle USA_middle CAN_old USA_old CAN_young USA_young
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 duration1 48.6 62.6 31.5 40.0 43.0 50.5
2 duration2 60.9 54.0 53.1 58.9 45.1 55.6
Edit
Including M and F:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = c(sex, agegrp), names_to = "groupings_names", values_to="groupings") %>%
select(-groupings_names) %>%
pivot_longer(cols = -c(groupings, country), names_to = "parameters") %>%
group_by(groupings, country, parameters) %>%
summarise(mean = mean(value, na.rm=TRUE)) %>%
pivot_wider(names_from = c(country, groupings), values_from = mean)
parameters CAN_F USA_F CAN_M USA_M CAN_middle USA_middle CAN_old USA_old CAN_young USA_young
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 duration1 63.3 59.0 50.9 47.7 57.9 46.1 56.8 60.6 59.5 49.1
2 duration2 60.6 59.0 54.9 48.3 65.0 45.6 48.5 49.5 55.8 62.4
I have a list of 83 csv files with three variables.
I have created new date columns including, month and year.
One of my dataframes from the list looks like this:
> head(estaciones$AeropuertodeBocas_93002)
Date Tx2m Tn2m Pr year month day
1 1988-01-01 27.4 23.1 41.3 1988 1 1
2 1988-01-02 29.8 24.0 0.3 1988 1 2
3 1988-01-03 30.4 24.0 0.4 1988 1 3
4 1988-01-04 30.0 24.2 2.4 1988 1 4
5 1988-01-05 29.6 23.2 9.1 1988 1 5
6 1988-01-06 30.0 23.1 5.2 1988 1 6
I would like to create a new file with the percentage of NA values per variable and per month and year. For example Jun 1988: 2% of missing values for variable "Pr" and dataframe "x".
I have tried using:
na_by_month <- map(estaciones, ~ .x %>%
mutate(Month=month(Date), Mis = rowSums(is.na(.))) %>%
group_by(Month) %>%
summarise(Sum=sum(Mis), Percentage=mean(Mis)))
This is only calculating missing values percentage for each month for the whole series and not per year.
Data (one of several dfs):
df <- structure(list(Date = structure(c(6574,
6575, 6576, 6577, 6578, 6579), class = "Date"),
Tx2m = c(27.4, 29.8, 30.4, 30, 29.6, 30),
Tn2m = c(23.1, 24, 24, 24.2, 23.2, 23.1),
Pr = c(41.3, 0.3, 0.4, 2.4, 9.1, 5.2),
year = c(1988, 1988, 1988, 1988, 1988, 1988 ),
month = c(1, 1, 1, 1, 1, 1), day = 1:6),
row.names = c(NA, 6L), class = "data.frame")
How can I create a new file containing percentage of missing values for each of my data frames inside the list, per month and per year? Thank You
If you're trying to calculate the percentage of missing values by month/year and just by year you could write a function that you can then map to your list of dataframes:
library(dplyr)
library(purrr)
library(openxlsx)
library(rlang)
ldf <- list(df, df, df)
f <- function(data, ...){
v <- enquos(...)
data %>%
group_by(!!! v) %>%
summarize(across(Tx2m:Pr,
list(missing = ~ mean(is.na(.))),
.names = paste0("{.col}_{.fn}_", quo_name(v[[1]]))),
.groups = "drop")
}
miss <- imap(ldf, ~ left_join(f(.x, month, year), f(.x, year), by = "year"))
write.xlsx(miss, "output.xlsx")
How it works
You provide the function f your dataframe and the variables you want to group by and it will calculate the percentage of missing values for those group by variables. For example, f(df, month, year) will group your data by month and year and calculate the percentage of missing values for each variable in the range Tx2m:Pr.
f(df, month, year)
month year Tx2m_missing_month Tn2m_missing_month Pr_missing_month
<int> <int> <dbl> <dbl> <dbl>
1 1 1988 0 0 0
f(df, year)
year Tx2m_missing_year Tn2m_missing_year Pr_missing_year
<int> <dbl> <dbl> <dbl>
1 1988 0 0 0
Note: the order of your grouping variables matters here. The first group by variable is used to construct the output variable names (eg Tn2m_missing_month).
If you want the number of missing by month/year and by year for each element of your list, then we can apply this function using imap and merge the results by year.
left_join(f(df, month, year), f(df, year), by = "year")
month year Tx2m_missing_month Tn2m_missing_month Pr_missing_month
<int> <int> <dbl> <dbl> <dbl>
1 1 1988 0 0 0
# ... with 3 more variables: Tx2m_missing_year <dbl>,
# Tn2m_missing_year <dbl>, Pr_missing_year <dbl>
Note: The missing by year will be repeated for each month within the year.
Lastly, write.xlsx will write a list of dataframes to an Excel workbook, where each sheet will be an element of your list.
If I've misunderstood your post and you only want the percentage missing by month within year then you can simplify this to:
miss <- imap(ldf, ~ f(.x, month, year))
Plot
To plot you could do something like this:
library(ggplot2)
library(tidyr)
library(scales)
library(lubridate)
plots <- imap(miss, ~ .x %>%
select(ends_with("year")) %>%
distinct() %>%
pivot_longer(cols = -year,
names_pattern = "(.*?)_(.*)",
names_to = c("var", NA)) %>%
mutate(date = ymd(year, truncated = 2L)) %>%
ggplot(aes(x = date, y = value, color = var, group = var)) +
geom_point() +
geom_line() +
scale_y_continuous(labels = percent_format()) +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y")
)
plots[[1]]
where each variable is a line, it's y-axis value is the percent missing, and the x-axis is the year.
Note: with the given data in the example, the graphic is not that interesting and gives a warning about there being only one point. Additionally, all the points are overlapping on the same (x,y) coordinate with the given data.
df <- structure(list(Date = structure(c(6574, 6575, 6576, 6577, 6578, 6579), class = "Date"),
Tx2m = c(27.4, 29.8, 30.4, 30, 29.6, 30), Tn2m = c(23.1, 24, 24, 24.2, 23.2, 23.1),
Pr = c(41.3, 0.3, 0.4, 2.4, 9.1, 5.2),
year = c(1988, 1988, 1988, 1988, 1988, 1988 ),
month = c(1, 1, 1, 1, 1, 1), day = 1:6),
row.names = c(NA, 6L), class = "data.frame")
nongroup_vars <- setdiff(colnames(df),c('year','month'))
nongroup_vars_mr <- paste0(nongroup_vars,'_missing_ratio')
df %>%
group_by(month,year) %>%
summarise_all(function(x) mean(is.na(x))) %>%
ungroup %>%
rename_with(~nongroup_vars_mr,all_of(nongroup_vars))
it says missing ratios for each group.
output;
# A tibble: 1 × 7
month year Date_missing_ratio Tx2m_missing_ratio Tn2m_missing_ratio Pr_missing_ratio day_missing_ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1988 0 0 0 0 0
i´m currently working with a large dataframe of 75 columns and round about 9500 rows. This dataframe contains observations for every day from 1995-2019 for several observation points.
Edit: The print from dput(head(df))
> dput(head(df))
structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
The dataframe looks like this sample from it:
date x1 x2 x3 x4 x5 xn year month day
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995-01-01 50.8 62.2 90.2 60 NA 53.2 1995 1 1
2 1999-08-02 62.6 58.7 NA 72 NA 61.1 1999 8 2
3 2001-09-03 57.2 49.9 70.1 68.4 NA 56.6 2001 9 3
4 2008-05-04 56.6 56.4 75.8 65.5 NA 58.6 2008 5 4
5 2012-04-05 36.8 43.2 83.3 63.2 NA 36.2 2012 4 5
6 2019-12-31 39.1 41.6 98.5 55.9 NA 44.4 2019 12 31
str(df)
tibble [9,131 x 75] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:9131], format: "1995-01-01" "1995-01-02" ...
$ x1 : num [1:9131] 50.8 62.6 57.2 56.6 36.8 ...
$ x2 : num [1:9131] 62.2 58.7 49.9 56.4 43.2 ...
xn
$ year : num [1:9131] 1995 1995 1995 1995 1995 ...
$ month : num [1:9131] 1 1 1 1 1 1 1 1 1 1 ...
$ day : num [1:9131] 1 2 3 4 5 6 7 8 9 10 ...
My goal is to get for every observation point xn the count of all observations which cross a certain limit per year.
So far i tried to reach this with the Aggregate function.
To get the mean of every year i used the following command:
aggregate(list(df), by=list(year=df$year), mean, na.rm=TRUE)
this works perfect, i get the mean for every year for every observation point.
To get the sum of one station i used the following code
aggregate(list(x1=df$x1), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
which results in this print:
year x1
1 1995 52
2 1996 43
3 1997 44
4 1998 42
5 1999 38
6 2000 76
7 2001 52
8 2002 58
9 2003 110
10 2004 34
11 2005 64
12 2006 46
13 2007 46
14 2008 17
15 2009 41
16 2010 30
17 2011 40
18 2012 47
19 2013 40
20 2014 21
21 2015 56
22 2016 27
23 2017 45
24 2018 22
25 2019 45
So far, so good. I know i could expand the code by adding (..,x2=data$x2, x3=data$x3,..xn) to the list argument in code above. which i tried and they work.
But how do I get them all at once?
I tried the following codes:
aggregate(.~(date, year, month, day), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Fehler: Unerwartete(s) ',' in "aggregate(.~(date,"
aggregate(.~date+year+month+day, by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Fehler in as.data.frame.default(data, optional = TRUE) :
cannot coerce class ‘"function"’ to a data.frame
aggregate(. ~ date + year + month + day, data = df,by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Fehler in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) :
Argumente müssen dieselbe Länge haben
But unfortunately none of them works. Could someone please give me a hint where my mistake is?
Here is an answer that uses base R, and since none of the data in the example data is above 120, we set a criterion of above 70.
data <- structure(
list(
date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"),
x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125),
x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625),
x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375),
year = c(1995, 1995, 1995, 1995,
1995, 1995),
month = c(1, 1, 1, 1, 1, 1),
day = c(1, 2, 3,
4, 5, 6)
),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl",
"data.frame"
))
First, we create a subset of the data that contains all columns containing x, and set them to TRUE or FALSE based on whether the value is greater than 70.
theCols <- data[,colnames(data)[grepl("x",colnames(data))]]
Second, we cbind() the year onto the matrix of logical values.
x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))
Finally, we use aggregate across all columns other than year and sum the columns.
aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)
...and the output:
Group.1 x1 x2 x3 x4 x5 xn
1 1995 0 0 5 1 0 0
>
Note that by using colnames() to extract the columns that start with x and nrow() in the aggregate() function, we make this a general solution that will handle a varying number of x locations.
Two tidyverse solutions
A tidyverse solution to the same problem is as follows. It includes the following steps.
Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0, which is currently in development but due for production release the week of May 25th.
Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.
Use pivot_wider() to convert the data back to one column for each x measurement.
...and the code is:
devtools::install_github("tidyverse/dplyr") # needed for across()
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
mutate(.,across(starts_with("x"),~if_else(. > 70,TRUE,FALSE))) %>%
select(-year,-month,-day) %>% group_by(date) %>%
pivot_longer(starts_with("x"),names_to = "measure",values_to = "value") %>%
mutate(year = year(date)) %>% group_by(year,measure) %>%
select(-date) %>%
summarise(value = sum(value,na.rm=TRUE)) %>%
pivot_wider(id_cols = year,names_from = "measure",
values_from = value)
...and the output, which matches the Base R solution that I originally posted:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: year [1]
year x1 x2 x3 x4 x5 xn
<dbl> <int> <int> <int> <int> <int> <int>
1 1995 0 0 5 1 0 0
>
...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year,name) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)
...and the output:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups: year [1]
year x1 x2 x3 x4 x5
<dbl> <int> <int> <int> <int> <int>
1 1995 0 0 0 0 0
>
Conclusions
As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.
That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.
This should solve your problem
library(tidyverse)
library(lubridate)
df_example <- structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year(date)) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE))
I have the following weekly data frame:
df <- data.frame( Date = c("2017-08-01","2017-08-08","2017-08-15", "2017-08-22", "2017-08-29", "2017-09-05"), item1 = c(1.6,1.8,1.6, 2.0, 1.4, 1.5), item2 = c(38.6,35.1,42.6, 43.1, 42, 41), item3 = c(16.9, 17.6, 18.5, 19.8, 17, 18))
> df
Date item1 item2 item3
1 2017-08-01 1.6 38.6 16.9
2 2017-08-08 1.8 35.1 17.6
3 2017-08-15 1.6 42.6 18.5
4 2017-08-22 2.0 43.1 19.8
5 2017-08-29 1.4 42.0 17.0
6 2017-09-05 1.5 41.0 18.0
Then I convert my df to a monthly dataframe using aggregate function.
df_monthly <- round(aggregate(zoo(df[,-1], as.Date(df$Date)), by=month, FUN=sum),0)
> df_monthly
item1 item2 item3
2017.08 8 201 90
2017.09 2 41 18
No I need to add a date column into df_monthly that shows the month and the year before I write the df_monthly into a csv file. I tried few different methods, but didn't work.
df_monthly$Date<-data.frame(seq(as.Date("2017/08/01"), as.Date("2017/09/05"), "months"))
> df_monthly
item1 item2 item3 Date
1 8 201 90 Numeric,2
6 2 41 18 Numeric,2
Any help will be highly appreciated.
library(dplyr)
library(lubridate)
df <- data.frame( Date = c("2017-08-01","2017-08-08","2017-08-15", "2017-08-22", "2017-08-29", "2017-09-05"), item1 = c(1.6,1.8,1.6, 2.0, 1.4, 1.5), item2 = c(38.6,35.1,42.6, 43.1, 42, 41), item3 = c(16.9, 17.6, 18.5, 19.8, 17, 18))
df %>%
mutate(year = year(Date), # get the year as a variable
month = month(Date)) %>% # get the month as a variable
group_by(year, month) %>% # group by year and month
summarise(item1 = sum(item1), # get the sums for each item
item2 = sum(item2),
item3 = sum(item3)) %>%
ungroup() # forget the grouping
# # A tibble: 2 x 5
# year month item1 item2 item3
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2017 8 8.4 201.4 89.8
# 2 2017 9 1.5 41.0 18.0
In case you have more than 3 items as columns and you want a more general solution you can use this
library(dplyr)
library(lubridate)
library(tidyr)
df <- data.frame( Date = c("2017-08-01","2017-08-08","2017-08-15", "2017-08-22", "2017-08-29", "2017-09-05"), item1 = c(1.6,1.8,1.6, 2.0, 1.4, 1.5), item2 = c(38.6,35.1,42.6, 43.1, 42, 41), item3 = c(16.9, 17.6, 18.5, 19.8, 17, 18))
df %>%
gather(item, value, -Date) %>% # reshape dataset
mutate(year = year(Date), # get the year as a variable
month = month(Date)) %>% # get the month as a variable
group_by(year, month, item) %>% # group by year, month and item
summarise(value = sum(value)) %>% # get the sum of the values
spread(item, value) %>% # reshape dataset back to initial form
ungroup() # forget the grouping
# # A tibble: 2 x 5
# year month item1 item2 item3
# * <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2017 8 8.4 201.4 89.8
# 2 2017 9 1.5 41.0 18.0