How to replace non-numerical values in a dataset with R

I have a dataset that looks like this:
Date Electricity
janv-90 23
juin-90 24
juil-90 34
janv-91 42
juin-91 27
juil-91 13
But I want it to look like this:
Date Electricity
190 23
690 24
790 34
191 42
691 27
791 13
Note that my dataset goes from 90 to 10 (namely 1990 to 2010).

Since your months are in French, this takes a slightly longer route; for English month names, R already provides the constants month.abb and month.name.
# first I create a look-up vector
month.abb.french <- c("janv", "fevr", "mars", "avril",
"mai", "juin", "juil", "aout", "sept",
"oct", "nov", "dec")
# extract the months
month <- unlist(strsplit(df$Date, "-"))[c(TRUE, FALSE)]
# similarly extract the years
year <- unlist(strsplit(df$Date, "-"))[c(FALSE, TRUE)]
# month
#[1] "janv" "juin" "juil" "janv" "juin" "juil"
# year
#[1] "90" "90" "90" "91" "91" "91"
df$newcol <- paste0(match(month, month.abb.french), year)
# Date Electricity newcol
#1: janv-90 23 190
#2: juin-90 24 690
#3: juil-90 34 790
#4: janv-91 42 191
#5: juin-91 27 691
#6: juil-91 13 791

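We can just use match, sub and paste0 to get the expected output. The lookup has to be the French vector month.abb.french defined in the answer above, because the built-in month.abb holds English abbreviations and would return NA here.
df$Date <- as.numeric(paste0(match(sub("-.*", "", df$Date), month.abb.french), sub(".*-", "", df$Date)))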
df
# Date Electricity
# 1 190 23
# 2 690 24
# 3 790 34
# 4 191 42
# 5 691 27
# 6 791 13
Or using tidyverse by separating the 'Date' column into two columns ('Date' and 'val') by the - delimiter, then match the 'Date' with the mon_ab from the locale() and finally unite the 'Date' and 'val' columns together
library(dplyr)
library(tidyr)
library(readr)
separate(df, Date, into = c("Date", "val")) %>%
mutate(Date = match(Date, sub("\\.$", "", locale("fr")[[1]]$mon_ab))) %>%
unite(Date, Date, val, sep="")
# Date Electricity
#1 190 23
#2 690 24
#3 790 34
#4 191 42
#5 691 27
#6 791 13
data
df <- structure(list(Date = c("janv-90", "juin-90", "juil-90", "janv-91",
"juin-91", "juil-91"), Electricity = c(23L, 24L, 34L, 42L, 27L,
13L)), .Names = c("Date", "Electricity"), class = "data.frame", row.names = c(NA,
-6L))
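The lookup-and-paste idea from the answers above can be wrapped in a small helper. This is only a sketch: the function name encode_month_year and the unaccented lookup vector month_abb_fr are illustrative assumptions, not part of the original answers. Splitting on the hyphen instead of fixed character positions means that abbreviations of any length ("mai", "avril") are handled too.

```r
# Hypothetical lookup vector (unaccented French abbreviations, as in the first answer)
month_abb_fr <- c("janv", "fevr", "mars", "avril", "mai", "juin",
                  "juil", "aout", "sept", "oct", "nov", "dec")

# Hypothetical helper: "janv-90" -> 190, "mai-91" -> 591
encode_month_year <- function(x, lookup = month_abb_fr) {
  x   <- as.character(x)
  mon <- sub("-.*$", "", x)    # text before the hyphen
  yr  <- sub("^.*-", "", x)    # text after the hyphen
  idx <- match(mon, lookup)    # month number, NA if not found
  out <- rep(NA_real_, length(x))
  ok  <- !is.na(idx)
  out[ok] <- as.numeric(paste0(idx[ok], yr[ok]))
  out
}

encode_month_year(c("janv-90", "mai-91", "avril-05"))
```

Unrecognised month labels come back as NA rather than raising an error, which makes it easy to spot spellings that are missing from the lookup vector.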

Related

How to create and populate dummy rows in tidyverse?

I am working with some monthly data and I would like to convert it to daily data by creating and populating some dummy rows, as the question suggests.
For example, say I have the following data:
date index
2013-04-30 232
2013-05-31 232
2013-06-30 233
Is there an "easy" way, preferably through tidyverse, that I could convert the above data into daily data, assuming I keep the index constant throughout the month? For example, I would like to create another 29 rows for April, ranging from 2013-04-01 to 2013-04-29 with the index of the last day of the month which would be 232 for April. The same should be applied to the rest of months (I have more data than just those three months).
Any intuitive suggestions will be greatly appreciated :)
Using complete and fill from tidyr you could do:
dat <- structure(list(
date = structure(c(15825, 15856, 15886), class = "Date"),
index = c(232L, 232L, 233L)
), class = "data.frame", row.names = c(
NA,
-3L
))
library(tidyr)
dat |>
complete(date = seq(as.Date("2013-04-01"), as.Date("2013-06-30"), "day")) |>
fill(index, .direction = "up")
#> # A tibble: 91 × 2
#> date index
#> <date> <int>
#> 1 2013-04-01 232
#> 2 2013-04-02 232
#> 3 2013-04-03 232
#> 4 2013-04-04 232
#> 5 2013-04-05 232
#> 6 2013-04-06 232
#> 7 2013-04-07 232
#> 8 2013-04-08 232
#> 9 2013-04-09 232
#> 10 2013-04-10 232
#> # … with 81 more rows
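For comparison, the same expansion can be sketched in base R. This assumes, as in the question, that there is exactly one row per month and that every day of a month should receive that month's index; the object names days, key and daily are illustrative.

```r
dat <- data.frame(
  date  = as.Date(c("2013-04-30", "2013-05-31", "2013-06-30")),
  index = c(232L, 232L, 233L)
)

# every day from the first day of the first month to the last month-end
first_day <- as.Date(format(min(dat$date), "%Y-%m-01"))
days <- seq(first_day, max(dat$date), by = "day")

# look up each day's year-month among the month-end rows
key <- match(format(days, "%Y-%m"), format(dat$date, "%Y-%m"))
daily <- data.frame(date = days, index = dat$index[key])

head(daily, 3)
nrow(daily)
```

Matching on the year-month string avoids any fill direction: a month that is missing from dat would simply get NA rather than silently borrowing a neighbouring month's value.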

How can I convert daily data to weekly and monthly data in R [duplicate]

I have daily data for 7 years. I want to group this into weekly data (based on the actual date) and sum the frequency.
Date Frequency
1 2014-01-01 179
2 2014-01-02 82
3 2014-01-03 89
4 2014-01-04 109
5 2014-01-05 90
6 2014-01-06 66
7 2014-01-07 75
8 2014-01-08 106
9 2014-01-09 89
10 2014-01-10 82
What is the best way to achieve that? Thank you
These solutions all use base R and differ only in the definition and labelling of weeks.
1) cut the dates into weeks and then aggregate over those. Weeks start on Monday but you can add start.on.monday=FALSE to cut to start them on Sunday if you prefer.
Week <- as.Date(cut(DF$Date, "week"))
aggregate(Frequency ~ Week, DF, sum)
## Week Frequency
## 1 2013-12-30 549
## 2 2014-01-06 418
2) If you prefer to define a week as 7 days starting with DF$Date[1] and label them according to the first date in that week then use this. (Add 6 to Week if you prefer the last date in the week.)
weekno <- as.numeric(DF$Date - DF$Date[1]) %/% 7
Week <- DF$Date[1] + 7 * weekno
aggregate(Frequency ~ Week, DF, sum)
## Week Frequency
## 1 2014-01-01 690
## 2 2014-01-08 277
3) or if you prefer to label it with the first date existing in DF in that week then use this. This and the last Week definition give the same result if there are no missing dates as is the case here. (If you want the last existing date in the week rather than the first then replace match with findInterval.)
weekno <- as.numeric(DF$Date - DF$Date[1]) %/% 7
Week <- DF$Date[match(weekno, weekno)]
aggregate(Frequency ~ Week, DF, sum)
## Week Frequency
## 1 2014-01-01 690
## 2 2014-01-08 277
Note
The input in reproducible form is assumed to be:
Lines <- "Date Frequency
1 2014-01-01 179
2 2014-01-02 82
3 2014-01-03 89
4 2014-01-04 109
5 2014-01-05 90
6 2014-01-06 66
7 2014-01-07 75
8 2014-01-08 106
9 2014-01-09 89
10 2014-01-10 82"
DF <- read.table(text = Lines)
DF$Date <- as.Date(DF$Date)
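The question title also mentions monthly data. A sketch of the same cut + aggregate pattern as in solution 1, applied to months instead of weeks, could look like this (the data are rebuilt here so the snippet runs on its own):

```r
DF <- data.frame(
  Date = as.Date("2014-01-01") + 0:9,
  Frequency = c(179, 82, 89, 109, 90, 66, 75, 106, 89, 82)
)

# label each row with the first day of its month, then sum within months
Month <- as.Date(cut(DF$Date, "month"))
monthly <- aggregate(Frequency ~ Month, DF, sum)
monthly
##        Month Frequency
## 1 2014-01-01       967
```

All ten sample dates fall in January 2014, so there is a single row whose total equals the sum of the two weekly totals shown above.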
I would use library(lubridate).
df <- read.table(header = TRUE,text = "date Frequency
2014-01-01 179
2014-01-02 82
2014-01-03 89
2014-01-04 109
2014-01-05 90
2014-01-06 66
2014-01-07 75
2014-01-08 106
2014-01-09 89
2014-01-10 82")
You can use base R or library(dplyr):
base R:
to be sure that the date is really a date:
df$date <- ymd(df$date)
df$week <- week(df$date)
or short:
df$week <- week(ymd(df$date))
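or dplyr (with a summarise step added to sum the frequency per week):
library(dplyr)
library(lubridate)
df %>%
mutate(week = week(ymd(date))) %>%
group_by(week) %>%
summarise(Frequency = sum(Frequency))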
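Unless you have a good reason not to, use ISO weeks so that your aggregation intervals are all the same size. With several years of data, also group by the ISO year so that the same week number from different years is not pooled.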
data.table makes this work like so:
library(data.table)
setDT(myDF) # convert to data.table
myDF[ , .(weekly_freq = sum(Frequency)), by = isoweek(Date)]
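A base R sketch of ISO-week grouping that keeps the year attached uses the %G (ISO year) and %V (ISO week) output formats of format(). The column name iso_week is an illustrative choice, and this assumes your platform's format() supports %G and %V on output, which current R versions document.

```r
DF <- data.frame(
  Date = as.Date("2014-01-01") + 0:9,
  Frequency = c(179, 82, 89, 109, 90, 66, 75, 106, 89, 82)
)

# ISO year and ISO week in one label, e.g. "2014-W01"
DF$iso_week <- format(DF$Date, "%G-W%V")
weekly <- aggregate(Frequency ~ iso_week, DF, sum)
weekly
```

Because the label contains the ISO year, week 1 of 2014 and week 1 of 2015 stay in separate groups when the data cover several years.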
Maybe you can try the base R code with aggregate + format, i.e.,
dfout <- aggregate(Frequency ~ yearweek,within(df,yearweek <- format(Date,"%Y,%W")),sum)
such that
> dfout
yearweek Frequency
1 2014,00 549
2 2014,01 418
DATA
df <- structure(list(Date = structure(c(16071, 16072, 16073, 16074,
16075, 16076, 16077, 16078, 16079, 16080), class = "Date"), Frequency = c(179L,
82L, 89L, 109L, 90L, 66L, 75L, 106L, 89L, 82L)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")
The new package slider from RStudio addresses this problem directly including the specification of the start of the weekly periods. Suppose the weekly periods were to start on a Monday so that the beginning of the first week would be Monday, 2013-12-30. Then the slider solution would be
library(slider)
slide_period_dfr(.x = DF, .i=as.Date(DF$Date),
.period = "week",
.f = ~data.frame(week_ending = tail(.x$Date,1),
week_freq = sum(.x$Frequency)),
.origin = as.Date("2013-12-30"))
with the result
week_ending week_freq
1 2014-01-05 549
2 2014-01-10 418

How do I group my date variable into month/year in R?

I have a "date" vector, that contains dates in mm/dd/yyyy format:
head(Entered_Date,5)
[1] 1/5/1998 1/5/1998 1/5/1998 1/5/1998 1/5/1998
I am trying to plot a frequency variable against the date, but I want to group the dates that it is by month or year. As it is now, there is a frequency per day, but I want to plot the frequency by month or year. So instead of having a frequency of 1 for 1/5/1998, 1 for 1/7/1998, and 3 for 1/8/1998, I would like to display it as 5 for 1/1998. It is a relatively large data set, with dates from 1998 to present, and I would like to find some automated way to accomplish this.
> dput(head(Entered_Date))
structure(c(260L, 260L, 260L, 260L, 260L, 260L), .Label = c("1/1/1998",
"1/1/1999", "1/1/2001", "1/1/2002", "1/10/2000", "1/10/2001",
"1/10/2002", "1/10/2003", "1/10/2005", "1/10/2006", "1/10/2007",
"1/10/2008", "1/10/2011", "1/10/2012", "1/10/2013", "1/11/1999",
"1/11/2000", "1/11/2001", "1/11/2002", "1/11/2005", "1/11/2006",
"1/11/2008", "1/11/2010", "1/11/2011", "1/11/2012", "1/11/2013",
"1/12/1998", "1/12/1999", "1/12/2001", "1/12/2004", "1/12/2005", ...
The floor_date() function from the lubridate package does this nicely.
data %>%
group_by(month = lubridate::floor_date(date, "month")) %>%
summarize(summary_variable = sum(value))
Thanks to Roman Cheplyaka
https://ro-che.info/articles/2017-02-22-group_by_month_r
See more on how to use the function: https://lubridate.tidyverse.org/reference/round_date.html
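If you would rather not load lubridate, flooring to the month can be sketched in base R by formatting each date as the first day of its month. The small Entered_Date factor below is made-up sample data in the question's mm/dd/yyyy layout, not the asker's real vector.

```r
# made-up sample in the question's mm/dd/yyyy format
Entered_Date <- factor(c("1/5/1998", "1/7/1998", "1/8/1998",
                         "1/8/1998", "1/8/1998", "2/3/1998"))

# factor -> character -> Date
d <- as.Date(as.character(Entered_Date), format = "%m/%d/%Y")

# base R counterpart of floor_date(d, "month")
month_start <- as.Date(format(d, "%Y-%m-01"))

# number of entries per month
monthly <- as.data.frame(table(month = format(month_start, "%Y-%m")),
                         responseName = "n")
monthly
```

The "%Y-%m" label sorts chronologically as text, which a label such as "1/1998" would not do once several years are involved.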
Here is an example using dplyr. You simply use the corresponding date format string for month %m or year %Y in the format statement.
set.seed(123)
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
head(df)
date value
1 1998-01-01 2
2 1998-01-02 4
3 1998-01-03 3
4 1998-01-04 5
5 1998-01-05 5
6 1998-01-06 1
library(dplyr)
df %>%
mutate(month = format(date, "%m"), year = format(date, "%Y")) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Source: local data frame [25 x 3]
Groups: month [?]
month year total
(chr) (chr) (int)
1 01 1998 105
2 01 1999 91
3 01 2000 3
4 02 1998 74
5 02 1999 77
6 03 1998 96
7 03 1999 86
8 04 1998 91
9 04 1999 95
10 05 1998 93
.. ... ... ...
Just to add to @cdeterman's answer, you can use lubridate along with dplyr to make this even easier:
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
library(dplyr)
library(lubridate)
df %>%
mutate(month = month(date), year = year(date)) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Maybe you can just add a column to your data like this (the dates are in mm/dd/yyyy format, so the format string is "%m/%d/%Y"):
Year <- format(as.Date(Entered_Date, "%m/%d/%Y"), "%Y")
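You don't need dplyr. Look at ?as.POSIXlt. Note that $mon is zero-based (0 is January) and $year counts years since 1900, which is why the labels in the output below look like 0/98.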
df$date<-as.POSIXlt(df$date)
mon<-df$date$mon
yr<-df$date$year
monyr<-as.factor(paste(mon,yr,sep="/"))
df$date<-monyr
You don't need to use ggplot2, but it's nice for this kind of thing.
library(ggplot2)
p <- ggplot(df, aes(factor(date)))
p + geom_bar()
If you want to see the actual numbers:
df2 <- aggregate(. ~ date, data = df, FUN = length)
df2
date value
1 0/98 31
2 0/99 31
3 1/98 28
4 1/99 28
5 10/98 30
6 10/99 30
7 11/97 1
8 11/98 31
9 11/99 31
10 2/98 31
11 2/99 31
12 3/98 30
13 3/99 30
14 4/98 31
15 4/99 31
16 5/98 30
17 5/99 30
18 6/98 31
19 6/99 31
20 7/98 31
21 7/99 31
22 8/98 30
23 8/99 30
24 9/98 31
25 9/99 31
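The labels in that output are hard to read because POSIXlt stores months as 0 to 11 and years as an offset from 1900. A sketch of human-readable labels, assuming you want an "mm/yyyy" layout (the two sample dates are made up):

```r
lt <- as.POSIXlt(as.Date(c("1998-01-05", "1999-12-31")))

lt$mon          # 0-based month: 0 11
lt$year         # years since 1900: 98 99

# shift both components before pasting them together
labels <- sprintf("%02d/%d", lt$mon + 1L, lt$year + 1900L)
labels
# [1] "01/1998" "12/1999"
```

Zero-padding the month also keeps the labels in a sensible order within a year when they are sorted as text or used as factor levels.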
There is a super easy way using the cut() function:
list = as.Date(c("1998-5-2", "1993-4-16", "1998-5-10"))
cut(list, breaks = "month")
and you will get this:
[1] 1998-05-01 1993-04-01 1998-05-01
62 Levels: 1993-04-01 1993-05-01 1993-06-01 1993-07-01 1993-08-01 ... 1998-05-01
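The 62 levels appear because cut() creates one level for every month between the earliest and latest date, including months with no data. A short sketch of how that plays out when counting, using droplevels() if the empty months are not wanted:

```r
dates <- as.Date(c("1998-05-02", "1993-04-16", "1998-05-10"))
m <- cut(dates, breaks = "month")

nlevels(m)                  # every month in the range, empty or not
counts <- table(droplevels(m))
counts                      # only months that actually occur
```

Keeping the empty levels is useful for plots where gaps should show up as zero; dropping them is more convenient for compact summary tables.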
Another solution is slider::slide_period:
library(slider)
library(dplyr)
monthly_summary <- function(data) summarise(data, date = format(max(date), "%Y-%m"), value = sum(value))
slide_period_dfr(df, df$date, "month", monthly_summary)
date value
1 1998-01 92
2 1998-02 82
3 1998-03 113
4 1998-04 94
5 1998-05 92
6 1998-06 74
7 1998-07 89
8 1998-08 92
9 1998-09 91
10 1998-10 100
...
There is also group_by(month_yr = cut(date, breaks = "1 month")), which uses base R's cut() inside dplyr's group_by(), without needing lubridate or other date packages.

How to subset data.frame by weeks and then sum?

Let's say I have several years worth of data which look like the following
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, in particular the part that defines %U as the week number.
EDIT:
Given the modified data and requirement, one way is to integer-divide the number of days since the epoch (1970-01-01) by 7 to get a number identifying the week. (If date is stored as POSIXct rather than Date, divide by the number of seconds in a week, 3600*24*7, instead.)
In code:
df$week <- as.Date("1970-01-01") + 7 * (as.numeric(as.Date(df$date)) %/% 7)
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
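The offset can be worked out from the fact that 1970-01-01 was a Thursday, so weeks counted from the epoch start on Thursdays. A sketch of the Sunday-to-Saturday version, assuming date is of class Date; the income values are copied from the question's table so the snippet does not depend on the random seed, and week_start and week_ending are illustrative column names.

```r
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46,
            72, 93, 26, 46, 94, 98, 12, 47, 56, 90)
df <- data.frame(date, income)

days <- as.numeric(df$date)   # days since 1970-01-01 (a Thursday)

# shift by 4 days so that week boundaries fall on Sundays
df$week_start <- as.Date(7 * ((days + 4) %/% 7) - 4, origin = "1970-01-01")

weekly <- aggregate(income ~ week_start, df, sum)
weekly$week_ending <- weekly$week_start + 6
weekly
```

This reproduces the week-ending totals from the question (487, 387 and 443) without relying on week numbers, so it behaves the same across year boundaries.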
This is now simple using dplyr. I would also suggest using cut(date, "week") rather than format() to cut the dates into weeks; add start.on.monday = FALSE for weeks that run from Sunday to Saturday.
library(dplyr)
df %>% group_by(week = cut(date, "week", start.on.monday = FALSE)) %>% summarise(weekly_income = sum(income))
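The same grouping can be sketched without dplyr. By default cut() starts weeks on Monday; passing start.on.monday = FALSE gives the Sunday-to-Saturday weeks the question asks for. The income values are copied from the question's table.

```r
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46,
            72, 93, 26, 46, 94, 98, 12, 47, 56, 90)

# factor whose levels are the Sunday that starts each week
week <- cut(date, "week", start.on.monday = FALSE)

weekly <- tapply(income, week, sum)
weekly
```

Each element of weekly is named after the first day of its week, and the totals match the 487, 387 and 443 computed by hand in the question.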
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there to use format(date, format = "%y%U").
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
which(df$wday %in% 7)),
week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
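The non-overlapping 7-day window behind rollapply(..., width = 7, by = 7) can also be sketched in base R. This assumes, as in the sample data, that there are no missing days, that the first row is a Sunday and that the number of rows is a multiple of 7.

```r
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46,
            72, 93, 26, 46, 94, 98, 12, 47, 56, 90)

# matrix() fills column by column, so each column holds one week
weekly <- colSums(matrix(income, nrow = 7))
weekly
# [1] 487 387 443
```

If days can be missing, a date-based grouping is the safer choice, because position-based windows silently drift out of alignment with the calendar.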
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
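In Python with pandas rather than R (assuming df is a pandas DataFrame whose date column is already a datetime):
df = df.set_index('date') # use the date variable as the index
df.resample('W-SAT')['income'].sum() # sum per week ending on Saturday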
With dplyr:
df %>%
arrange(date) %>%
mutate(week = as.numeric(date - date[1])%/%7) %>%
group_by(week) %>%
summarise(weekincome= sum(income))
Instead of date[1] you can have any date from when you want to start your weekly study.
