R to create a data frame a specific way - r

I am trying to create a data frame in which each date in the date column is repeated 4 times before the next date appears. For example:
1983-01-01
1983-01-01
1983-01-01
1983-01-01
1983-01-02
1983-01-02
etc.
for 10 years.
I used this command, but I don't have the needed format.
data=data.frame(date=as.Date("1983-01-01") +seq(n))
head(data)
date
1 1983-01-02
2 1983-01-03
3 1983-01-04
4 1983-01-05
5 1983-01-06
6 1983-01-07

Here's one way to create the data frame:
library(zoo)
start_date <- as.Date("1983-01-01")
stop_date <- as.Date(as.yearmon(start_date) + 10) - 1
# [1] "1992-12-31"
dat <- data.frame(date = rep(seq(start_date, stop_date, by = 1), each = 4))
Update (based on comment):
dates <- lapply(seq(0, 9), function(x)
  rep(as.Date((as.yearmon(start_date) + x) + (0:11)/12), each = 3) + c(0,10,20))
dat <- do.call(rbind, lapply(dates,
  function(x) data.frame(date = rep(x, each = 4))))
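If zoo is not available, the first approach can probably be sketched in base R alone. This is only an illustration, assuming the series should run daily from 1983-01-01 through 1992-12-31 with four rows per date; the name reps is introduced here and is not part of the original answer.

```r
# Base R sketch: every calendar day for 10 years, each date repeated 4 times
start_date <- as.Date("1983-01-01")

# The second element of this sequence is start_date + 10 years;
# stepping back one day gives the last day of the tenth year.
stop_date <- seq(start_date, by = "10 years", length.out = 2)[2] - 1

reps <- 4  # assumed number of rows per date (hypothetical name)
dat <- data.frame(date = rep(seq(start_date, stop_date, by = "day"), each = reps))
head(dat, 6)
```

Changing reps is then enough to get a different number of rows per date.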

Related

Change Duration from (Years and Month) to (Month) in R

I have some data frame like below. I need to add a "Duration" column beside the "Years and Month" column and convert the "Years and Month" to Month as duration.
For instance, I need to change the 2Y3M to 27M.
I have searched for it and didn't succeed. How do I do that? Thanks in advance.
Years and Month   Percentage Change
2Y3M              13%
3Y4M              23%
Here are a few approaches. See Note at the end for the input x.
1) Convert to yearmon class which stores its input internally as year+(month-1)/12. We can get the internal number by converting it to numeric, then multiply by 12 and add back the 1.
library(zoo)
ym <- as.yearmon(x, "%YY%mM")
12 * as.numeric(ym) + 1
## [1] 27 40
This could be written as a one-liner like this:
12 * as.numeric(as.yearmon(x, "%YY%mM")) + 1
1a) Using ym from above this would also work where as.integer extracts the year and cycle gets the month:
12 * as.integer(ym) + cycle(ym)
## [1] 27 40
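To make the arithmetic in 1) concrete, here is a small base R sketch that reproduces the year + (month - 1)/12 representation by hand, without zoo. The vectors year, month and internal are illustrative names, not part of the answer.

```r
# Worked arithmetic for "2Y3M" and "3Y4M", done by hand
year  <- c(2, 3)
month <- c(3, 4)

# The same number a yearmon would store internally: year + (month - 1) / 12
internal <- year + (month - 1) / 12

# Multiplying by 12 gives 12 * year + month - 1, so adding 1 restores the month count
total_months <- round(12 * internal + 1)
total_months
```

The round() call is only a guard against floating point noise from the division by 12.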
2) A base solution would be to read in x splitting it into a 2 column data frame which is converted to a matrix. matrix multiply that by c(12, 1) to get the result.
d <- read.table(text = x, sep = "Y", comment.char = "M")
c(as.matrix(d) %*% c(12, 1))
## [1] 27 40
This could also be written as a one-liner:
c(as.matrix(read.table(text = x, sep = "Y", comment.char = "M")) %*% c(12, 1))
Note
The input x in reproducible form is
x <- c("2Y3M", "3Y4M")
Assuming your dataframe is called df and column as ym you can use strcapture to extract year and month value.
result <- transform(strcapture('(\\d+)Y(\\d+)M', df$ym,
proto = list(year = integer(), month = integer())),
yearmonth = (year * 12) + month)
result
# year month yearmonth
#1 2 3 27
#2 3 4 40
To assign the value to same column.
df$ym <- transform(strcapture('(\\d+)Y(\\d+)M', df$ym,
proto = list(year = integer(), month = integer())),
yearmonth = (year * 12) + month)$yearmonth
Assuming your "Years and Month" column is a character type, I would extract the years and months separately, then figure out how many months it is.
library(tidyverse)
your_df <- tibble(`Years and Month` = c("2Y3M", "3Y4M"))
your_df %>%
  mutate(years = str_extract(`Years and Month`, "^\\d+(?=Y)"),
         months = str_extract(`Years and Month`, "(?<=Y)\\d+")) %>%
  mutate(total_months = as.numeric(years)*12 + as.numeric(months))
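The question asks for a "Duration" column in the form 27M placed beside the original column. A minimal base R sketch of that last step could look like the following, assuming the column is a character vector of the form <years>Y<months>M. The data frame df and the helper vectors are made up for illustration.

```r
# Hypothetical input mirroring the table in the question
df <- data.frame(`Years and Month` = c("2Y3M", "3Y4M"),
                 `Percentage Change` = c("13%", "23%"),
                 check.names = FALSE, stringsAsFactors = FALSE)

ym <- df[["Years and Month"]]
years  <- as.integer(sub("Y.*$", "", ym))             # digits before the Y
months <- as.integer(sub("^.*Y(\\d+)M$", "\\1", ym))  # digits between Y and M

# Total months, written back in the same "<n>M" style as the question
df$Duration <- paste0(12 * years + months, "M")
df
```

Strings that do not match the assumed pattern would come back as NA from as.integer (with a warning), so real data may need validation first.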

Sum across multiple time frames using R

I have two data frames, x and y. The data frame x has a range of dates while data frame y has individual dates. I want to get the sum of the individual date values for the time ranges in data frame x. Thus id "a" would have the sum of all the values from 2019/1/1 through 2019/3/1.
id <- c("a","b","c")
start_date <- as.Date(c("2019/1/1", "2019/2/1", "2019/3/1"))
end_date <- as.Date(c("2019/3/1", "2019/4/1", "2019/5/1"))
x <- data.frame(id, start_date, end_date)
dates <- seq(as.Date("2019/1/1"),as.Date("2019/5/1"),1)
values <- runif(121, min=0, max=7)
y <- data.frame(dates, values)
Desired output
id start_date end_date sum
a 2019/1/1 2019/3/1 221.8892
One base R option is using apply
x$sum <- apply(x, 1, function(v) sum(subset(y,dates >= v["start_date"] & dates<=v["end_date"])$values))
such that
> x
id start_date end_date sum
1 a 2019-01-01 2019-03-01 196.0311
2 b 2019-02-01 2019-04-01 185.6970
3 c 2019-03-01 2019-05-01 173.6429
Data
set.seed(1234)
id <- c("a","b","c")
start_date <- as.Date(c("2019/1/1", "2019/2/1", "2019/3/1"))
end_date <- as.Date(c("2019/3/1", "2019/4/1", "2019/5/1"))
x <- data.frame(id, start_date, end_date)
dates <- seq(as.Date("2019/1/1"),as.Date("2019/5/1"),1)
values <- runif(121, min=0, max=7)
y <- data.frame(dates, values)
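A variant of the same idea, sketched with vapply over row indices instead of apply. The point of the variation is that the Date columns are indexed directly rather than first passing through the character matrix that apply builds. It assumes x and y are set up as in the Data block, which is repeated here so the snippet runs on its own.

```r
set.seed(1234)
id <- c("a", "b", "c")
start_date <- as.Date(c("2019/1/1", "2019/2/1", "2019/3/1"))
end_date <- as.Date(c("2019/3/1", "2019/4/1", "2019/5/1"))
x <- data.frame(id, start_date, end_date)
dates <- seq(as.Date("2019/1/1"), as.Date("2019/5/1"), 1)
values <- runif(121, min = 0, max = 7)
y <- data.frame(dates, values)

# For each row of x, sum the y values whose date falls inside [start_date, end_date]
x$sum <- vapply(seq_len(nrow(x)), function(i) {
  in_range <- y$dates >= x$start_date[i] & y$dates <= x$end_date[i]
  sum(y$values[in_range])
}, numeric(1))
x
```

Both endpoints are treated as inclusive, matching the comparison used in the apply version.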
There are many ways of doing this. One possibility would be:
library(data.table)
x <- setDT(x)
# create a complete series for each id
x <- x[, .(dates = seq(start_date, end_date, 1)), by=id]
# merge the data
m <- merge(x, y, by="dates")
# get the sums
m[, .(sum = sum(values)), by=id]
id sum
1: a 196.0311
2: b 185.6970
3: c 173.6429
You can call set.seed before you create the random variables to replicate the numbers exactly:
set.seed(1234)

How can I use PAD function (from PADR() package) for multiple data frames?

I have 24 files (1 for each hour of the day, HR_NBR = Hour Number) and I've to pad the dates in each of the files.
AS-IS data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
03/07/2016 1 10
TO-BE data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
02/07/2016 NA NA
03/07/2016 1 10
I can use the pad function for each file, like this:
chil_bev1_1 = pad (chil_bev1_1, interval= "day") # Hour1
chil_bev1_2 = pad (chil_bev1_2, interval= "day") # Hour2
and so on.
And it works. But I want to use a loop or LAPPLY.
I tried several variations of these 2 codes, but none of them worked:
df1 = data.frame (chil_bev1_1)
df2 = data.frame (chil_bev1_2)
dflist = c("df1","df2")
CODE1:
x = function(df) {df %>% pad}
allpad = lapply(dflist,x)
CODE2:
x = function(df) {pad (df)}
allpad = lapply(dflist,x)
The error is
"x must be a data frame".
I'm new to R. Any help would be greatly appreciated.
Thank you.
I managed to figure it out. Here's the answer:
library(dplyr); library(tidyr)  # for %>% and complete()
hour_list = list(chil_bev1_1, chil_bev1_2)
chil_bev1n = lapply(hour_list, function(x) {x %>% complete(CLNDR_DT = seq.Date(min(CLNDR_DT), max(CLNDR_DT), by = "day"), fill = list(QTY = 0))})
Notes:
The fill = list(QTY = 0) argument replaces the NAs in QTY with 0s.
CLNDR_DT is the name of the column that contains dates.
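The "x must be a data frame" error in the question most likely comes from dflist being a character vector of names, so lapply hands each function a string rather than a data frame. The following base R sketch shows a list of actual data frames being padded in one pass. It assumes the dates are day-first (dd/mm/yyyy), invents a second data frame df2, and uses a hypothetical helper pad_days in place of padr::pad so that it runs without extra packages.

```r
# Two small example data frames (df2 is invented for illustration)
df1 <- data.frame(CLNDR_DT = as.Date(c("01/07/2016", "03/07/2016"), format = "%d/%m/%Y"),
                  HR_NBR = 1, QTY = c(6, 10))
df2 <- data.frame(CLNDR_DT = as.Date(c("01/07/2016", "04/07/2016"), format = "%d/%m/%Y"),
                  HR_NBR = 2, QTY = c(3, 8))

# A list of data frames, not c("df1", "df2"), which would be a list of strings
dflist <- list(df1 = df1, df2 = df2)

# Hypothetical helper: left-join each data frame onto a complete daily sequence,
# so missing days appear as rows of NA
pad_days <- function(df) {
  all_days <- data.frame(CLNDR_DT = seq(min(df$CLNDR_DT), max(df$CLNDR_DT), by = "day"))
  merge(all_days, df, by = "CLNDR_DT", all.x = TRUE)
}

padded <- lapply(dflist, pad_days)
padded$df1
```

If the data frames already exist under known names, mget(c("df1", "df2")) is one way to collect them into such a list.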

Fill a data frame with increasing date objects in R

I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)){
date <- date + 7
df[i,] <- as.Date(date)
}
I then want to bind it to my data-frame. However, the values inside df are non-dates. If I perform the date +7 calculation it works (e.g. once it goes to 2010-01-08), but if I assign it to the df it turns into weird numerical values.
Appreciate any help.
Try the following:
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29
Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get in the habit of not giving your variables names that are already used by functions, such as data and date.
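As for why the loop in the question produced "weird numerical values": assigning a Date into a numeric container keeps only the underlying day count. The sketch below illustrates that, and shows that a loop still works if the container is a Date vector to begin with. The names num_col, loop_dates and d are made up for this example.

```r
# Assigning a Date into a numeric vector drops the class and stores days since 1970-01-01
num_col <- numeric(3)
num_col[1] <- as.Date("2010-01-08")
num_col[1]  # a plain number, no longer a Date

# Preallocating a Date vector keeps the class during the loop
n <- 398
loop_dates <- rep(as.Date(NA), n)
d <- as.Date("2010-01-01")
for (i in seq_len(n)) {
  loop_dates[i] <- d
  d <- d + 7
}

# The vectorised version from the answers gives the same dates
seq_dates <- seq(as.Date("2010-01-01"), by = "week", length.out = n)
all(loop_dates == seq_dates)
```

The vectorised seq call remains the simpler choice; the loop is shown only to explain the behavior.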

R - Aggregate Dates

When aggregating an R data frame, the dates are converted to integers.
For instance, if I want to take the maximum date for every id in the following data frame:
> df1 <- data.frame(id = rep(c(1, 2), 2), b = as.Date(paste("01/01/", 2000:2003, sep=''), format = "%d/%m/%Y"))
> df1
id b
1 1 2000-01-01
2 2 2001-01-01
3 1 2002-01-01
4 2 2003-01-01
> aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
id b
1 1 11688
2 2 12053
Why does R behave this way? (And what's the best way to keep a Date class column in the returned data frame?)
Thanks for your help,
That works for me in R version 3. Perhaps the behavior changed in an update, so I recommend updating R :)
With your current version of R, have you tried calling as.Date() after aggregating?
In your example, it would look like this:
dtf2 <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
dtf2$b <- as.Date(dtf2$b)
You can also pass the 'origin' argument to as.Date, like this:
as.Date(dtf2$b, origin = '1970-01-01')
UPD: When R looks at dates as integers, its origin is January 1, 1970.
Hope that will help.
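Putting the suggestion together, a defensive sketch might convert back only when the class was actually lost, so the same code behaves the same way whether or not the installed R version keeps the Date class. The result name res is chosen here for illustration.

```r
df1 <- data.frame(id = rep(c(1, 2), 2),
                  b = as.Date(paste("01/01/", 2000:2003, sep = ""), format = "%d/%m/%Y"))

res <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")

# Convert back only if aggregate returned plain day counts
if (!inherits(res$b, "Date")) {
  res$b <- as.Date(res$b, origin = "1970-01-01")
}
res

# The integers shown in the question are days since the origin
as.Date(11688, origin = "1970-01-01")
```

The last line converts the 11688 from the question's output back to a date, which corresponds to the maximum for id 1.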
