I am working in R on a data analysis. I have a data frame that stores the data for each month of a year. For certain months of a particular year the data is missing. The data frame I am currently using is shown below.
How can I reshape this data into another data frame laid out in the manner shown below?
The column Year is of type yearmon and n is of type int.
Solution using tidyverse
library(tidyverse)
##Recreate data
df <- tibble(
Year = c("Dec-13", "Jan-14","Feb-14","Mar-14",
"Apr-14", "May-14","Jun-15","Jul-14",
"Aug-15","Sep-18"),
n = c(1,8,2,4,8,9,2,1,1,1)
)
##convert to character, spread, and fill
df_2 <- df %>%
mutate(Year = parse_character(Year)) %>%
separate(Year, into = c("Month", "Year")) %>%
mutate(Year = paste0("20",Year)) %>%
spread(Year,n, fill = "-") %>%
mutate(Month = factor(Month, levels = c("Dec","Jan","Feb", "Mar","Apr",
"May","Jun","Jul", "Aug",
"Sep"))) %>%
arrange(Month)
df_2
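For comparison, here is a minimal sketch of the same reshape using pivot_wider(), which supersedes spread(). It is not the original answer's method, and it makes two assumptions of its own: missing cells are left as NA instead of "-" so the columns stay numeric, and the months are ordered with the built-in month.abb (Jan first) instead of the Dec-first order used above.

```r
library(dplyr)
library(tidyr)

# Same data as in the answer above
df <- tibble(
  Year = c("Dec-13", "Jan-14", "Feb-14", "Mar-14", "Apr-14",
           "May-14", "Jun-15", "Jul-14", "Aug-15", "Sep-18"),
  n = c(1, 8, 2, 4, 8, 9, 2, 1, 1, 1)
)

df_wide <- df %>%
  # Split "Dec-13" into Month = "Dec" and Year = "13"
  separate(Year, into = c("Month", "Year"), sep = "-") %>%
  mutate(Year = paste0("20", Year),
         # Assumption: calendar order via month.abb, not the Dec-first order
         Month = factor(Month, levels = month.abb)) %>%
  # One column per year; missing month/year combinations become NA
  pivot_wider(names_from = Year, values_from = n) %>%
  arrange(Month)

df_wide
```

If a placeholder is needed for display, the NA values could be replaced at the very end, after converting the year columns to character.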
Related
I have data in 6-month intervals (ID, 6-month-start-date, outcome value), but for some IDs, there are half years where the outcome is missing. Simplified example:
id = c("aa", "aa", "ab", "ab", "ab")
date = as.Date(c("2021-07-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"))
col3 = c(1,2,1,2,1)
df <- data.frame(id, date, col3)
For similar datasets where the date is monthly, I used complete(date = seq.Date(start_date, end_date, by = "month")) to fill the missing months and add 0 to the outcome field in the 3rd column.
I could do the following and expand the data to monthly, then create a new 6-month-start-date column, group by it and ID, and sum col3.
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="month")) %>%
mutate (col3 = replace_na(col3, 0))
df_complete_6mth <- df_complete %>% mutate(
halfyear = ifelse(as.integer(format(date, '%m')) <= 6,
paste0(format(date, '%Y'), '-01-01'),
paste0(format(date, '%Y'), '-07-01'))) %>%
group_by(id, halfyear) %>%
summarise(col3_halfyear = sum(col3))
However, is there a solution where the "by =" argument specifies 6 months? I tried
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="months(6)")) %>%
mutate (col3 = replace_na(col3, 0))
but it didn't work.
From the help for seq.Date:
by can be specified in several ways.
A number, taken to be in days.
A object of class difftime
A character string, containing one of "day", "week", "month",
"quarter" or "year". This can optionally be preceded by a (positive or
negative) integer and a space, or followed by "s".
So I expect you want:
library(dplyr); library(tidyr)
df %>%
group_by(id) %>%
complete(date = seq.Date(min(date), max(date), by="6 month"),
fill = list(col3 = 0))
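As a quick check, here is a sketch that runs this idea on the example data from the question. The by = "6 months" string is the "integer, space, unit" form described in the help text; the expected result noted in the comments follows from the five example rows.

```r
library(dplyr)
library(tidyr)

# A "6 months" step moves in calendar half-years
half_years <- seq(as.Date("2021-07-01"), as.Date("2022-07-01"), by = "6 months")
half_years
# "2021-07-01" "2022-01-01" "2022-07-01"

# Example data from the question
df <- data.frame(
  id   = c("aa", "aa", "ab", "ab", "ab"),
  date = as.Date(c("2021-07-01", "2022-07-01",
                   "2021-07-01", "2022-01-01", "2022-07-01")),
  col3 = c(1, 2, 1, 2, 1)
)

df_complete <- df %>%
  group_by(id) %>%
  # Fill in every half-year between each id's first and last date
  complete(date = seq(min(date), max(date), by = "6 months"),
           fill = list(col3 = 0)) %>%
  ungroup()

df_complete
# id "aa" gains a row for 2022-01-01 with col3 = 0
```

Because the step is already six months, the max(date) %m+% months(5) padding from the monthly version should not be needed here.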
Could you do something like this? You make a sequence of dates by month and then take every sixth one, starting from the first.
library(lubridate)
dates <- seq(mdy("01-01-2020"), mdy("01-01-2023"), by="month")
dates[seq(1, length(dates), by=6)]
#> [1] "2020-01-01" "2020-07-01" "2021-01-01" "2021-07-01" "2022-01-01"
#> [6] "2022-07-01" "2023-01-01"
Created on 2023-02-08 by the reprex package (v2.0.1)
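A related sketch, in case the dates do not already sit on a half-year boundary: lubridate's floor_date() accepts a "halfyear" unit, which could replace the ifelse()/paste0() bucketing shown in the question. The example dates below are made up for illustration.

```r
library(lubridate)

# Hypothetical dates falling anywhere inside a half-year
dates <- as.Date(c("2021-03-15", "2021-07-01", "2021-12-31"))

# Round each date down to the start of its half-year (Jan 1 or Jul 1)
half_start <- floor_date(dates, unit = "halfyear")
half_start
# "2021-01-01" "2021-07-01" "2021-07-01"
```

The result stays a Date vector, so it can be used directly in group_by() without converting back from character.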
I have a data frame stored with daily data within a year and I want to compute monthly averages as well as day of the week averages and add those values as additional columns.
Here is a MWE of my data frame
df <- tibble(Date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 365),
Daily_sales = rnorm(365, 2, 1))
df <- df %>%
mutate(month = lubridate::month(Date), #Month
dow = lubridate::wday(Date, week_start = 1), #Day of the week
dom = lubridate::day(Date)) #Day of the month
My problem is as follows: I know how to compute the monthly averages, e.g.
df %>% group_by(month) %>% summarize(Monthly_avg = mean(Daily_sales))
but I don't know how to add this as an additional column in which every row in January holds the January average, every row in February holds the February average, and so on. E.g. if the average for January is 2.22, then the new column should contain 2.22 for all dates in January. I have the same problem with the day-of-the-week average.
Instead of summarize()ing an entire group into one row, we can mutate() all rows to add the group mean:
result <- df %>%
group_by(month) %>% mutate(monthly_avg = mean(Daily_sales)) %>%
group_by(dow) %>% mutate(dow_avg = mean(Daily_sales)) %>%
group_by(dom) %>% mutate(dom_avg = mean(Daily_sales)) %>%
ungroup()
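To make the difference concrete, here is a small sketch on made-up numbers (not the random data from the question). summarise() collapses each group to one row, while mutate() keeps every row and repeats the group mean.

```r
library(dplyr)

# Hypothetical toy data: two months, two days each
sales <- tibble(
  month       = c(1, 1, 2, 2),
  Daily_sales = c(1, 3, 10, 20)
)

# One row per month
per_month <- sales %>%
  group_by(month) %>%
  summarise(monthly_avg = mean(Daily_sales))

# Same number of rows as the input, group mean repeated within each month
with_avg <- sales %>%
  group_by(month) %>%
  mutate(monthly_avg = mean(Daily_sales)) %>%
  ungroup()

with_avg$monthly_avg
# 2 2 15 15
```

Base R's ave(sales$Daily_sales, sales$month) gives the same repeated group means if a dplyr-free version is preferred.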
I have two chunks of code below. In the first chunk I am trying to get my data ready to display in a gt table. My goal is to display a table with month abbreviations in calendar order (Jan, Feb, Mar, etc.) in column 1. Currently, each time I run the second chunk of code to create the table, it sorts the month data alphabetically. I have looked into the lubridate package and tried the 'month.abb' and 'month.name' constants, but the months still come out in alphabetical order. See the packages used and the two chunks below.
library(tidyverse)
library(quantmod)
library(tidyquant)
library(xts)
library(rvest)
library(stringr)
library(forcats)
library(lubridate)
library(tidyr)
library(plotly)
library(purrr)
library(PerformanceAnalytics)
library(gt)
library(paletteer)
library(readr)
library(janitor)
library(scales)
library(lubridate)
Obtain and Clean Data To Feed Into gt (Chunk 1)
start <- as.Date("2020-01-01")
end <- as.Date("2020-11-30")
StockMonthlyReturns <- c("XLC", "XLC", "XLP", "XLE","XLF", "XLV", "XLI","XLB", "XLRE", "XLK", "XLU") %>%
tq_get(get = "stock.prices",
from = start,
to = end) %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
col_rename = "StockMonthlyReturns")
stockdata <- StockMonthlyReturns %>% group_by(symbol, Month = floor_date(date, "month")) %>% summarise(Amount = sum(StockMonthlyReturns))
monthnum <- month(stockdata$Month)
stockdata <- cbind(stockdata, monthnum)
stockdata <- stockdata %>% rename(Month_Num = ...4)
stockdata$Month_Num <- month.abb[stockdata$Month_Num]
stockdata <- stockdata %>%
select(-Month)
stockdata <- stockdata %>% rename(Month = "Month_Num")
#stockdata$Month <- months(as.Date(stockdata$Month))
stockdata <- adorn_rounding(stockdata, digits= 4) # Rounding Returns
stockdata$Amount <- percent(stockdata$Amount, accuracy = 0.01) # Changing Decimals to Percent
Using dplyr to Graph Data using gt (Chunk 2)
stockdata %>%
tidyr::spread(key="symbol", value = Amount) %>%
gt(rowname_col = "Month") %>%
tab_header(title = "Monthly Return of SPDR ETF's")
You can try:
library(tidyverse)
library(gt)
stockdata %>%
mutate(Month = factor(Month, month.abb)) %>%
tidyr::spread(key=symbol, value = Amount) %>%
#tidyr::pivot_wider(names_from = symbol, values_from= Amount) %>%
gt(rowname_col = "Month") %>%
tab_header(title = "Monthly Return of SPDR ETF's")
Also note that spread has been retired in favor of pivot_wider.
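Here is a minimal sketch of why the factor() call helps, using made-up return strings in place of the downloaded stock data. A character column sorts alphabetically, whereas a factor with levels = month.abb sorts in calendar order. The sketch uses pivot_wider() followed by an explicit arrange(), since as far as I know pivot_wider() keeps rows in order of first appearance and does not sort them.

```r
library(dplyr)
library(tidyr)

# Character months sort alphabetically
sort(c("Mar", "Jan", "Feb", "Apr"))
# "Apr" "Feb" "Jan" "Mar"

# Hypothetical data standing in for stockdata
returns <- tibble(
  symbol = c("XLB", "XLB", "XLB", "XLE", "XLE", "XLE"),
  Month  = c("Mar", "Jan", "Feb", "Mar", "Jan", "Feb"),
  Amount = c("3%", "1%", "2%", "6%", "4%", "5%")
)

wide <- returns %>%
  # Calendar order comes from the factor levels
  mutate(Month = factor(Month, levels = month.abb)) %>%
  pivot_wider(names_from = symbol, values_from = Amount) %>%
  arrange(Month)

wide
# Rows appear as Jan, Feb, Mar
```

The resulting data frame can then be passed to gt(rowname_col = "Month") as in the answer above.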
I have multiple columns that have missing values. I want to fill the missing data in each column with the mean of the same day across all years. For example, DF is my fake data, where the two columns (A & X) have missing values.
library(lubridate)
library(tidyverse)
library(naniar)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1985-01-01"), to = as.Date("1987-12-31"), by = "day"),
A = sample(1:10,1095, replace = T), X = sample(5:15,1095, replace = T)) %>%
replace_with_na(replace = list(A = 2, X = 5))
To fill in column A, I use the following code:
Fill_DF_A <- DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Year, Day) %>%
mutate(A = ifelse(is.na(A), mean(A, na.rm=TRUE), A))
I have many columns in my data.frame. How can I generalize this to fill in the missing values for all of the columns?
We can use na.aggregate from zoo
library(dplyr)
library(lubridate) # for year(), month(), day()
library(zoo)
DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Year, Day) %>%
mutate(across(A:X, na.aggregate))
Or if we prefer to use conditional statements
DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Year, Day) %>%
mutate(across(A:X, ~ case_when(is.na(.)
~ mean(., na.rm = TRUE), TRUE ~ as.numeric(.))))
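One thing to check: both versions group by Year and Day, as in the question's own code, which averages the same day-of-month within one year. If the goal is really "the same calendar day across all years", grouping by Month and Day may be closer. Here is a sketch under that assumption, on a small made-up data set and using coalesce() in place of na.aggregate().

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: Jan 1 and Jan 2 in three different years
toy <- tibble(
  Date = as.Date(c("1985-01-01", "1986-01-01", "1987-01-01",
                   "1985-01-02", "1986-01-02", "1987-01-02")),
  A = c(2, NA, 4, 10, 20, NA),
  X = c(NA, 6, 8, 5, NA, 7)
)

filled <- toy %>%
  # Assumption: group by calendar day so the mean is taken across years
  group_by(Month = month(Date), Day = day(Date)) %>%
  # Replace each NA with the group mean of the non-missing values
  mutate(across(c(A, X), ~ coalesce(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()

filled$A
# 2 3 4 10 20 15
```

If every value for a given day is missing, mean(..., na.rm = TRUE) returns NaN, so those cells would still need separate handling.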
I would like to summarise a grouped data.frame without knowing the name of the column. What I do know is that the feature is always in column 3 of this data.frame. Is that possible?
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
group = rep(c("A", "B"), 3),
temperature = runif(6, -10, 30),
percipitation = runif(6, 0,5)
)
parameter <- "perc"
df1 <- df %>%
select(date, group, starts_with(parameter)) %>%
group_by(group) %>%
summarise(
avg = mean(percipitation)
)
In this example the code works, but of course only for the parameter 'perc' and not for 'temp' or so.
avg = mean(df[[3]])
or something like this doesn't work. Any suggestions?
You could keep just the grouping variable and the third column using select(group, 3). The function summarise_all() can then be used to calculate the mean.
df %>%
select(group, 3) %>%
group_by(group) %>%
summarise_all(mean) # funs() is deprecated; pass the function directly
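Another option, sketched here on made-up numbers, is to look up the name of the third column first and then refer to it through the .data pronoun. The likely reason mean(df[[3]]) gives an unexpected result inside summarise() is that df[[3]] refers to the whole, ungrouped column, so every group gets the overall mean. The names weather and target are only for illustration.

```r
library(dplyr)

# Hypothetical data with the same layout as in the question
weather <- tibble(
  date          = rep(c("2017-01-01", "2017-01-02"), each = 2),
  group         = c("A", "B", "A", "B"),
  temperature   = c(10, 20, 30, 40),
  percipitation = c(1, 2, 3, 4)
)

# Name of whatever column sits at position 3
target <- names(weather)[3]

by_group <- weather %>%
  group_by(group) %>%
  # .data[[target]] is evaluated within each group
  summarise(avg = mean(.data[[target]]))

by_group
# A: 20, B: 30
```

This keeps the output column name fixed at avg regardless of which feature is in position 3.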