I have a list of date entries where I need to convert every date format to "DDMMYYYY"
Example:
a <- c("31 aug 1953", "1953", "aug 1953")
Desired output:
"31081953", "00001953", "00081953"
As there are different formats, one option would be to extract into 'day', 'month', and 'year', then paste them together after replacing the missing values with "00" and padding with str_pad
library(dplyr)
library(tidyr)
library(stringr)
data.frame(a) %>%
extract(a, into = c('day', 'month', 'year'), "(\\d{2})*\\s*([a-z]*)\\s*(\\d{4})") %>%
mutate(month = match(toupper(month), toupper(month.abb))) %>%
mutate(across(everything(), ~ str_pad(replace(.x, is.na(.x), "00"), width = 2, pad = "0"))) %>%
unite(newcol, day, month, year, sep="") %>%
pull(newcol)
#[1] "31081953" "00001953" "00081953"
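If you would rather avoid the tidyverse dependencies, the same idea can be sketched in base R. This is only a minimal sketch: the helper name to_ddmmyyyy is hypothetical, and it assumes every entry ends in a four-digit year, uses a three-letter English month abbreviation, and separates the day from the month with a single space.

```r
# Hypothetical helper: normalise "31 aug 1953", "aug 1953" or "1953" to DDMMYYYY.
to_ddmmyyyy <- function(x) {
  x <- tolower(trimws(x))

  # Day: one or two leading digits followed by a space, otherwise 0.
  day <- ifelse(grepl("^[0-9]{1,2} ", x),
                sub("^([0-9]{1,2}) .*$", "\\1", x),
                "0")

  # Month: first run of three letters, matched against month.abb, otherwise 0.
  mon_name <- ifelse(grepl("[a-z]{3}", x),
                     sub("^.*?([a-z]{3}).*$", "\\1", x, perl = TRUE),
                     NA_character_)
  month <- match(mon_name, tolower(month.abb))
  month[is.na(month)] <- 0

  # Year: the last four digits of the entry.
  year <- sub("^.*([0-9]{4})$", "\\1", x)

  sprintf("%02d%02d%s", as.integer(day), as.integer(month), year)
}

a <- c("31 aug 1953", "1953", "aug 1953")
to_ddmmyyyy(a)
#[1] "31081953" "00001953" "00081953"
```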
Related
I have data in 6-month intervals (ID, 6-month-start-date, outcome value), but for some IDs, there are half years where the outcome is missing. Simplified example:
id = c("aa", "aa", "ab", "ab", "ab")
date = as.Date(c("2021-07-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"))
col3 = c(1,2,1,2,1)
df <- data.frame(id, date, col3)
For similar datasets where the date is monthly, I used complete(date = seq.Date(start date, end date, by = "month")) to fill the missing months and add 0 to the outcome field in the 3rd column.
I could do the following and expand the data to monthly, then create a new 6-month-start-date column, group by it and ID, and sum col3.
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="month")) %>%
mutate (col3 = replace_na(col3, 0))
df_complete_6mth <- df_complete %>% mutate(
halfyear = ifelse(as.integer(format(date, '%m')) <= 6,
paste0(format(date, '%Y'), '-01-01'),
paste0(format(date, '%Y'), '-07-01'))) %>%
group_by(id, halfyear) %>%
summarise(col3_halfyear = sum(col3))
However, is there a solution where the "by =" argument specifies 6 months? I tried
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="months(6)")) %>%
mutate (col3 = replace_na(col3, 0))
but it didn't work.
From the help for seq.Date:
by can be specified in several ways.
A number, taken to be in days.
A object of class difftime
A character string, containing one of "day", "week", "month",
"quarter" or "year". This can optionally be preceded by a (positive or
negative) integer and a space, or followed by "s".
So I expect you want:
library(dplyr); library(tidyr)
df %>%
group_by(id) %>%
complete(date = seq.Date(min(date), max(date), by="6 month"),
fill = list(col3 = 0))
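For comparison, here is a rough base R sketch of the same half-year fill, using the example data from the question. Note one assumption that differs from the grouped version above: the half-year grid is built from the overall minimum and maximum date rather than per id, which happens to be the same range for both ids in this sample. The names half_years, grid and filled are just illustrative.

```r
id   <- c("aa", "aa", "ab", "ab", "ab")
date <- as.Date(c("2021-07-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"))
col3 <- c(1, 2, 1, 2, 1)
df   <- data.frame(id, date, col3)

# "6 months" is an integer, a space, and one of the allowed unit strings.
half_years <- seq(min(df$date), max(df$date), by = "6 months")

# Every id paired with every half-year start date (overall range, see note above).
grid <- expand.grid(id = unique(df$id), date = half_years,
                    stringsAsFactors = FALSE)

# Left join onto the observed data, then replace the missing outcomes with 0.
filled <- merge(grid, df, by = c("id", "date"), all.x = TRUE)
filled$col3[is.na(filled$col3)] <- 0
filled <- filled[order(filled$id, filled$date), ]
filled
```

In this sample the only row that gets added is id "aa" on 2022-01-01, with col3 set to 0.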
Could you do something like this? You make a sequence of dates by month and then take every sixth one, starting from the first.
library(lubridate)
dates <- seq(mdy("01-01-2020"), mdy("01-01-2023"), by="month")
dates[seq(1, length(dates), by=6)]
#> [1] "2020-01-01" "2020-07-01" "2021-01-01" "2021-07-01" "2022-01-01"
#> [6] "2022-07-01" "2023-01-01"
Created on 2023-02-08 by the reprex package (v2.0.1)
I have multiple columns that have missing values. I want to fill the missing data in each column with the mean of the same day across all years. For example, DF is my fake data, where two columns (A & X) have missing values.
library(lubridate)
library(tidyverse)
library(naniar)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1985-01-01"), to = as.Date("1987-12-31"), by = "day"),
A = sample(1:10,1095, replace = T), X = sample(5:15,1095, replace = T)) %>%
replace_with_na(replace = list(A = 2, X = 5))
To fill in column A, I use the following code
Fill_DF_A <- DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Month, Day) %>%  # same calendar day across all years
mutate(A = ifelse(is.na(A), mean(A, na.rm=TRUE), A))
I have many columns in my data.frame. How can I generalize this to fill in the missing values for all of them?
We can use na.aggregate from zoo
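library(dplyr)
library(lubridate)  # year(), month(), day()
library(zoo)
DF %>%
  mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
  # group by Month and Day so the mean is taken over the same calendar day in every year
  group_by(Month, Day) %>%
  # na.aggregate replaces each NA with the group mean by default
  mutate(across(A:X, na.aggregate))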
Or if we prefer to use conditional statements
DF %>%
  mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
  # group by Month and Day so the mean is taken over the same calendar day in every year
  group_by(Month, Day) %>%
  mutate(across(A:X, ~ case_when(is.na(.) ~ mean(., na.rm = TRUE),
                                 TRUE ~ as.numeric(.))))
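The same fill can also be sketched in base R with ave(), which may be handy if you do not want extra packages. This is a toy illustration, not the data from the question: the small data frame, the key vector and the fill_cols vector are all made up for the example, and the grouping key is the month-day part of the date so that the mean is taken over the same calendar day in every year.

```r
# Toy data: two calendar days observed over three years, with some NAs.
df <- data.frame(
  Date = as.Date(c("1985-01-01", "1986-01-01", "1987-01-01",
                   "1985-01-02", "1986-01-02", "1987-01-02")),
  A = c(2, NA, 4, 10, 20, NA),
  X = c(NA, 6, 8, 1, NA, 3)
)

# Grouping key: month and day only, so all years share a group.
key <- format(df$Date, "%m-%d")

# Columns to fill; extend this vector for a wider data frame.
fill_cols <- c("A", "X")

for (col in fill_cols) {
  # Group mean (ignoring NAs) repeated for every row of the group.
  grp_mean <- ave(df[[col]], key, FUN = function(v) mean(v, na.rm = TRUE))
  df[[col]] <- ifelse(is.na(df[[col]]), grp_mean, df[[col]])
}
df
```

Here the missing A on 1986-01-01 becomes 3 (the mean of 2 and 4), and the missing X on 1986-01-02 becomes 2 (the mean of 1 and 3).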
I'm trying to calculate the difference between dates in a single column, grouped by another column's value.
This is the result I'm looking for
Try this
library('dplyr')
df <- data.frame(id = c(1, 2, 3, 1, 2, 3),
Date = c('1/1/2020', '1/3/2020','1/1/2020','1/7/2020','1/6/2020','1/5/2020'))
df %>% mutate(Date = as.Date(Date, format='%m/%d/%Y')) %>%
group_by(id) %>%
mutate(DIFF = Date - lag(Date))
Here is a way using dplyr and lubridate (needed to make the dates behave when subtracting). It looks like you want the calculation to determine the number of days between the dates in a group by ID and the earliest date for that ID.
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Diff = Date - min(Date))
If you want to have NA instead of 0, you can do the following:
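df %>%
  mutate(Date = dmy(Date)) %>%
  group_by(ID) %>%
  # convert the difftime to numeric days so both if_else branches have the same type
  mutate(Diff = if_else(Date == min(Date), NA_real_, as.numeric(Date - min(Date), units = "days")))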
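For what it's worth, the "days since the earliest date per ID" calculation can also be sketched in base R with ave(). This assumes the sample data from the first answer (lower-case id column, dates in month/day/year order), since the original data was not posted as text; adjust the format string if your dates are day/month/year.

```r
df <- data.frame(id = c(1, 2, 3, 1, 2, 3),
                 Date = c('1/1/2020', '1/3/2020', '1/1/2020',
                          '1/7/2020', '1/6/2020', '1/5/2020'))
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Dates as day counts, so the per-group minimum can be subtracted directly.
days <- as.numeric(df$Date)
df$Diff <- days - ave(days, df$id, FUN = min)

# Optional: show NA instead of 0 for the earliest date in each group.
df$Diff[df$Diff == 0] <- NA
df
```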
I am working in R on a data analysis. I have a data frame that stores data for each month of a year. For certain months of a particular year the data is missing. The data frame I am currently using is shown below.
How to modify the data in the Dataframe to be stored in another dataframe in this below manner?
The column Year is of type yearmon and n is of type int.
Solution using tidyverse
library(tidyverse)
##Recreate data
df <- tibble(
Year = c("Dec-13", "Jan-14","Feb-14","Mar-14",
"Apr-14", "May-14","Jun-15","Jul-14",
"Aug-15","Sep-18"),
n = c(1,8,2,4,8,9,2,1,1,1)
)
##convert to character, spread, and fill
df_2 <- df %>%
mutate(Year = parse_character(Year)) %>%
separate(Year, into = c("Month", "Year")) %>%
mutate(Year = paste0("20",Year)) %>%
spread(Year,n, fill = "-") %>%
mutate(Month = factor(Month, levels = c("Dec","Jan","Feb", "Mar","Apr",
"May","Jun","Jul", "Aug",
"Sep"))) %>%
arrange(Month)
df_2
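A base R sketch of the same month-by-year reshape, in case it helps to see the idea without tidyr. It assumes, like the answer above, that Year has already been turned into "Mon-YY" strings; the three-row data frame and the names Month, Yr and wide are made up for illustration. Months are ordered by the built-in month.abb, and combinations with no data come out as NA rather than "-".

```r
df <- data.frame(Year = c("Dec-13", "Jan-14", "Feb-14"),
                 n = c(1, 8, 2),
                 stringsAsFactors = FALSE)

# Split "Dec-13" into month and two-digit year.
parts <- strsplit(df$Year, "-")
df$Month <- factor(sapply(parts, `[`, 1), levels = month.abb)
df$Yr    <- paste0("20", sapply(parts, `[`, 2))

# Months as rows, years as columns; empty cells are NA.
wide <- tapply(df$n, list(df$Month, df$Yr), sum)
wide
```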
library(dplyr)
library(plotly)
library(lubridate)
googlesearch <- read.csv("multiTimeline.csv", header = FALSE, stringsAsFactors = FALSE)
googlesearch2 <- googlesearch [-1, ]
googlesearch2 <- googlesearch2 [-1, ]
colnames(googlesearch2)[1] <- 'Date'
colnames(googlesearch2)[2] <- 'NumberofSearch'
googlesearch2$Date <- as.Date(googlesearch2$Date)
googlesearch2 <- googlesearch2 %>%
filter(Date > "2015-01-04" & Date < "2018-05-27")
googlesearch3 <- googlesearch2 %>%
transform(googlesearch2$Date, Date = as.Date(as.character(Date), "%Y-%m-%d"))
googlesearch3 <- googlesearch2 %>%
mutate(month = format(Date, "%m"), year = format(Date, "%Y")) %>%
group_by(Date, yearMon = as.yearmon(Date, "%m-%d-%Y"))
googlesearch3$Date <- as.numeric(googlesearch3$NumberofSearch)
googlesearch3 <- googlesearch3 %>%
mutate(month = format(Date, "%m"), year = format(Date, "%Y")) %>%
group_by(Date, yearMon = as.yearmon(Date, "%m-%d-%Y")) %>%
summarise(NumberofSearch_sum = sum(NumberofSearch))
data <- tbl_df(googlesearch3)
data %>%
group_by(yearMon) %>%
summarise(NumberofSearch_mon = sum(NumberofSearch))
I know this is messy.
I'm getting this error and I don't know why. Adding the sample code.
Error in summarise_impl(.data, dots) :
Evaluation error: invalid 'type' (character) of argument.
Lacking a reproducible example, try replacing the last code chunk of your sample code with:
library(hablar)
data %>%
retype() %>%
group_by(yearMon) %>%
summarise(NumberofSearch_mon = sum(NumberofSearch))
Maybe it works :)
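The error message suggests that sum() is being called on a character column, which is plausible here because the file is read with header = FALSE and stringsAsFactors = FALSE. That is a guess without seeing the data. A minimal base R sketch of the explicit conversion, using a made-up three-row stand-in for the real file:

```r
# Made-up stand-in for the imported data: counts arrive as character strings.
searches <- data.frame(
  yearMon = c("Jan 2015", "Jan 2015", "Feb 2015"),
  NumberofSearch = c("10", "15", "7"),
  stringsAsFactors = FALSE
)

# sum(searches$NumberofSearch) would fail here with:
#   invalid 'type' (character) of argument

# Convert to numeric first, then sum per month.
searches$NumberofSearch <- as.numeric(searches$NumberofSearch)
monthly <- aggregate(NumberofSearch ~ yearMon, data = searches, FUN = sum)
monthly
```

If as.numeric() produces NAs with a warning, some rows probably still contain header text or other non-numeric values and need to be dropped first.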