How to apply a function to every nth month in data frame? - r

I have a data frame like this:
Month Amount
1/31/2014 793
2/28/2014 363
3/31/2014 857
4/30/2014 621
5/31/2014 948
6/30/2014 385
I would like to apply a function (x*0.5) to the third and sixth rows in this data frame. The results will overwrite the data currently in the data frame. So the end result would look like this:
Month Amount
1/31/2014 793
2/28/2014 363
3/31/2014 428.5
4/30/2014 621
5/31/2014 948
6/30/2014 192.5
I've tried the rollapply() function, but that function seems to start at the first row only, without an option to force it to start at the third.
I really appreciate any help around this. Thanks in advance.

Suppose your data.frame is named DT:
DT$Amount[c(3,6)] <- 0.5 * DT$Amount[c(3,6)]
If you have a lot of data, use data.table:
library(data.table)  # provides setDT(), month() and :=
setDT(DT)
DT[
month(as.Date(Month, format = "%m/%d/%Y")) %% 3 == 0,
Amount := 0.5 * Amount
]

If the rows follow a pattern then %% can be used to select every x rows
df1$Amount[seq_len(nrow(df1)) %% 3 == 0] <- df1$Amount[seq_len(nrow(df1)) %% 3 == 0] * 0.5
Month Amount
1 1/31/2014 793.0
2 2/28/2014 363.0
3 3/31/2014 428.5
4 4/30/2014 621.0
5 5/31/2014 948.0
6 6/30/2014 192.5
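As a minimal sketch of the same idea, the modulo test can be wrapped in a small helper. The name scale_every_nth and its arguments are made up for illustration and do not come from any package; the numbers are the Amount values from the question.

```r
# Hypothetical helper (not from any package): multiply every nth element by `factor`.
scale_every_nth <- function(x, n, factor = 0.5) {
  idx <- seq_along(x) %% n == 0   # TRUE at positions n, 2n, 3n, ...
  x[idx] <- x[idx] * factor
  x
}

amount <- c(793, 363, 857, 621, 948, 385)
result <- scale_every_nth(amount, 3)
print(result)
```

Using seq_along(x) %% n rather than seq(n, length(x), by = n) keeps the helper safe for vectors shorter than n, where it simply changes nothing.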

An alternative for detecting particular months in bigger datasets is to use month() from the lubridate package
month ammount
1 1/31/2014 793
2 2/28/2014 363
3 3/31/2014 857
4 4/30/2014 621
5 5/31/2014 948
6 6/30/2014 385
library(dplyr)      # for %>% and mutate()
library(lubridate)  # for month()
df %>% mutate(month = as.Date(month, "%m/%d/%Y"),
date_month = month(month),
new_ammount = ifelse(date_month %in% c(3,6), ammount*0.5, ammount))
Which provides
month ammount date_month new_ammount
1 2014-01-31 793 1 793.0
2 2014-02-28 363 2 363.0
3 2014-03-31 857 3 428.5
4 2014-04-30 621 4 621.0
5 2014-05-31 948 5 948.0
6 2014-06-30 385 6 192.5
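If adding packages is not an option, the same month-based selection can be sketched in base R. This is only an illustration that reuses the column names month and ammount from the answer above; it overwrites the column in place, as the question asked.

```r
# Rebuild the sample data from the question (column names follow the answer above).
df <- data.frame(
  month   = c("1/31/2014", "2/28/2014", "3/31/2014",
              "4/30/2014", "5/31/2014", "6/30/2014"),
  ammount = c(793, 363, 857, 621, 948, 385)
)

# Parse the dates and pull out the month number with format().
dates     <- as.Date(df$month, format = "%m/%d/%Y")
month_num <- as.integer(format(dates, "%m"))

# Halve the amount only in March and June, overwriting in place.
sel <- month_num %in% c(3, 6)
df$ammount[sel] <- df$ammount[sel] * 0.5
print(df)
```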

Related

How can I sort by date a column that only has month and day with R?

My data frame looks a bit like this:
x freq
1-Apr 892
1-Aug 1221
1-Dec 923
1-Feb 880
1-Jan 889
...
And I can't seem to sort them in order
You could do:
df[order(as.Date(df$x, "%d-%b")), ]
x freq
5 1-Jan 889
4 1-Feb 880
1 1-Apr 92
2 1-Aug 1221
3 1-Dec 923
Data
df <- read.table(text = "
x freq
1-Apr 92
1-Aug 1221
1-Dec 923
1-Feb 880
1-Jan 889",
header = TRUE)
Attempts
Going off of Alexlok's answer,
df %>%
mutate(x = as.Date(x, format = "%d-%b"))
x freq
2020-04-01 92
2020-08-01 1221
2020-12-01 923
2020-02-01 880
2020-01-01 889
However, you can see this adds in the year (e.g., 1-Jan is 2020-01-01).
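This post is helpful: formatting the date back to a month-day string drops the year, although the column then becomes character rather than Date. Because the zero-padded %m-%d strings sort in calendar order, you are still able to sort by date.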
df %>%
mutate(x = format(as.Date(x, format = "%d-%b"), format = "%m-%d")) %>%
arrange(x)
x freq
01-01 889
02-01 880
04-01 92
08-01 1221
12-01 923
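Both approaches above rely on as.Date() with %b, which reads month abbreviations in the current locale and fills in the current year, so a value such as 29-Feb may fail to parse in a non-leap year. A rough alternative, assuming English abbreviations as in the sample data, is to split the string and look the month up in the built-in month.abb vector. This keeps the original labels and involves no year at all.

```r
# Sample data from the answer above.
df <- data.frame(
  x    = c("1-Apr", "1-Aug", "1-Dec", "1-Feb", "1-Jan"),
  freq = c(92, 1221, 923, 880, 889)
)

# Split "day-Mon" into its two parts.
parts <- strsplit(as.character(df$x), "-", fixed = TRUE)
day   <- as.integer(vapply(parts, `[`, character(1), 1))
mon   <- match(vapply(parts, `[`, character(1), 2), month.abb)  # Jan = 1, ..., Dec = 12

# Order by month first, then by day within the month.
sorted <- df[order(mon, day), ]
print(sorted)
```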

Iterating over Dates by Group in R using FOR loops

I'm trying to populate "FinalDate" based on "ExpectedDate" and "ObservedDate".
The rules are: for each group, if observed date is greater than previous expected date and less than the next expected date then final date is equal to observed date, otherwise final date is equal to expected date.
How can I modify the code below to make sure that:
FinalDate is filled in by Group
Iteration numbers don't skip any rows
library(dplyr)  # for %>%, mutate(), group_by(), if_else()
set.seed(2)
dat<-data.frame(Group=sample(LETTERS[1:10], 100, replace=TRUE),
Date=sample(seq(as.Date('2013/01/01'), as.Date('2020/01/01'), by="day"), 100))%>%
mutate(ExpectedDate=Date+sample(10:200, 100, replace=TRUE),
ObservedDate=Date+sample(10:200, 100, replace=TRUE))%>%
group_by(Group)%>%
arrange(Date)%>%
mutate(n=row_number())%>%arrange(Group)%>%ungroup()%>%
as.data.frame()
#generate some missing values in "ObservedDate"
dat[sample(nrow(dat),20), "ObservedDate"]<-NA
dat$FinalDate<-NA
for (i in 1:nrow(dat)){
dat[i, "FinalDate"]<-if_else(!is.na(dat$"ObservedDate")[i] &&
dat[i, "ObservedDate"] > dat[i-1, "ExpectedDate"] &&
dat[i, "ObservedDate"] < dat[i+1, "ExpectedDate"],
dat[i, "ObservedDate"],
dat[i,"ExpectedDate"])
}
dat$FinalDate<-as.Date(dat$FinalDate) # convert numeric to Date format
e.g. in output below:
at i=90, the code looks for previous ExpectedDate within letter I
we want it to look for ExpectedDate only within letter J. If there is no previous expected date for a group and ObservedDate is greater than ExpectedDate but less than the next ExpectedDate then FinalDate should be filled with ExpectedDate.
at i=100, the code generates NA because there is no next observation available
we want this value to be filled in such that for last observation in each group, FinalDate=ObservedDate if ObservedDate is greater than this last ExpectedDate within group, else ExpectedDate.
Group Date ExpectedDate ObservedDate n FinalDate
88 I 2015-09-07 2015-12-05 <NA> 7 2015-12-05
89 I 2018-08-02 2018-11-01 2018-08-13 8 2018-11-01
90 J 2013-07-24 2013-08-30 2013-08-12 1 2013-08-30
91 J 2013-11-22 2014-01-02 2014-04-05 2 2014-04-05
92 J 2014-11-03 2015-03-23 2015-05-10 3 2015-05-10
93 J 2015-08-30 2015-12-09 2016-02-04 4 2016-02-04
94 J 2016-04-18 2016-09-03 <NA> 5 2016-09-03
95 J 2016-10-10 2017-01-29 2017-04-14 6 2017-04-14
96 J 2017-02-14 2017-07-05 <NA> 7 2017-07-05
97 J 2017-04-21 2017-10-01 2017-08-26 8 2017-08-26
98 J 2017-10-01 2018-01-27 2018-02-28 9 2018-02-28
99 J 2018-08-03 2019-01-31 2018-10-20 10 2018-10-20
100 J 2019-04-25 2019-06-23 2019-08-16 11 <NA>
We can drop the for loop and use group_by, lag and lead from dplyr:
library(dplyr)
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(ObservedDate > lag(ExpectedDate) &
ObservedDate < lead(ExpectedDate), ObservedDate, ExpectedDate))
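We can also do this with data.table::between. Note that between() includes its bounds by default (incbounds = TRUE), so pass incbounds = FALSE if you need strict comparisons: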
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(data.table::between(ObservedDate,
lag(ExpectedDate), lead(ExpectedDate)), ObservedDate, ExpectedDate))
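One thing to watch with the lag()/lead() version: the first and last row of each group have no neighbour, and some ObservedDate values are missing, so the comparison evaluates to NA and if_else() then returns NA. The sketch below is only an illustration on made-up toy data. It assumes that an undecidable comparison should fall back to ExpectedDate, which is a simplification and does not implement the special first-row and last-row rules described in the question; those could be encoded through the default argument of lag() and lead().

```r
library(dplyr)

# Toy data, invented for illustration only.
toy <- data.frame(
  Group        = c("A", "A", "A", "B", "B"),
  ExpectedDate = as.Date(c("2020-01-10", "2020-03-10", "2020-06-10",
                           "2020-02-01", "2020-05-01")),
  ObservedDate = as.Date(c("2020-01-05", "2020-04-01", NA,
                           "2020-02-15", "2020-06-01"))
)

out <- toy %>%
  group_by(Group) %>%
  mutate(
    # NA whenever a neighbour or the observed date is missing
    in_window = ObservedDate > lag(ExpectedDate) &
                ObservedDate < lead(ExpectedDate),
    # Assumption: treat an NA comparison as "not in window"
    FinalDate = if_else(coalesce(in_window, FALSE), ObservedDate, ExpectedDate)
  ) %>%
  ungroup()

print(out)
```

With this fallback, FinalDate is never NA; only the middle row of group A takes its observed date.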

ggplot sort order treatment of NA values

My goal is to create a scatter plot of requests for service.
The X axis will be the date the request was made.
X values will show dates from oldest to newest, left to right.
The Y axis will show the priority assigned to the request.
I wish to order the Y values from highest priority at the top (i.e., 1) to lowest.
Requests which haven't been prioritized have NA in that column.
Here is a sample data set (NOTE - the original data file is tab-separated, with no values in the positions where "NA" is shown below for clarity's sake):
ID Priority DateCreated
549 NA 2018-02-15
548 NA 2018-02-15
547 3 2018-02-13
537 1 2018-01-17
536 5 2018-01-17
518 NA 2017-12-21
509 3 2017-11-27
500 2 2017-11-16
486 NA 2017-10-04
477 3 2017-08-08
475 1 2017-09-14
448 2 2017-07-21
444 5 2017-07-14
431 5 2017-06-30
425 1 2017-06-21
407 2 2017-05-26
395 4 2017-05-09
394 4 2017-05-09
374 4 2017-04-27
368 2 2017-04-21
352 NA 2017-04-03
328 4 2017-02-28
308 NA 2017-02-28
272 2 2016-10-05
213 4 2016-05-19
212 5 2016-05-19
200 2 2016-04-26
188 NA 2016-03-17
After loading ggplot2 and data.table, I create the plot with this code:
bl <- fread("backlog.txt")
bl$DateCreated <- as.Date(bl$DateCreated, "%Y-%m-%d")
bl$Priority <- as.integer(bl$Priority)
ggplot(bl, aes(x = DateCreated, y = reorder(Priority, -Priority))) +
geom_text(aes(label = ID))
If you reproduce this plot, you will see that the items with a priority of NA appear at the top. For presentation to my customer, it is much clearer if they appear at the bottom.
I suppose I could replace the NAs with a "magic number" (e.g., 11), but I'd prefer a less kludgey solution.
Anyone dealt with a similar issue already?
Thanks.
This is a bit of a workaround as well but I think more acceptable than setting a 'magic number'
bl$DateCreated <- as.Date(bl$DateCreated, "%Y-%m-%d")
bl$Priority[is.na(bl$Priority)] <- "No Data Available"
bl$Priority <- factor(bl$Priority,levels=c("No Data Available","1","2","3","4","5"))
ggplot(bl, aes(x = DateCreated, y = Priority)) + geom_text(aes(label = ID))
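A discrete axis in ggplot2 draws factor levels from the bottom up, so the level order decides where each priority lands. As a small sketch of building that order programmatically (the label "Not prioritized" and the variable names are placeholders chosen here), the levels can be listed with the NA label first and the numeric priorities in descending order, which would put priority 1 at the top:

```r
# A few priorities, including unprioritized (NA) requests.
priority <- c(NA, 3, 1, 5, NA, 2)

# Replace NA with an explicit label so it becomes a real factor level.
labels <- ifelse(is.na(priority), "Not prioritized", as.character(priority))

# Levels are drawn bottom to top: NA label first, then 5, ..., 1.
known <- sort(unique(priority[!is.na(priority)]), decreasing = TRUE)
lvls  <- c("Not prioritized", as.character(known))

priority_f <- factor(labels, levels = lvls)
print(levels(priority_f))
```

The resulting factor can then be mapped to y in aes() in place of the raw Priority column.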

How to calculate the sequential date diff in a dataframe and make it as another column for further analysis?

Please read my question carefully before marking it as a duplicate!
I am new to R and am trying to figure out how to calculate the sequential date difference, in weeks, between one row and the next, and store it in a new column so that I can make a graph from it.
There are a couple of answers here (Q1, Q2, Q3), but none specifically covers taking differences within one column sequentially between rows, say from top to bottom.
Below is the example and the expected results:
Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234
Expected
Date Var1 week
2/6/2017 493 0
2/20/2017 558 2
3/6/2017 595 4
3/20/2017 636 6
4/6/2017 697 8
4/20/2017 566 10
5/5/2017 234 12
You can use a similar approach to that in your first linked answer by saving the difftime result as a new column in your data frame.
library(dplyr)  # provides first()
# Set up data
df <- read.table(text = "Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234", header = TRUE)
df$Date <- as.Date(as.character(df$Date), format = "%m/%d/%Y")
# Create exact week variable
df$week <- difftime(df$Date, first(df$Date), units = "weeks")
# Create rounded week variable
df$week2 <- floor(difftime(df$Date, first(df$Date), units = "weeks"))
df
# Date Var1 week week2
# 2017-02-06 493 0.000000 weeks 0 weeks
# 2017-02-20 558 2.000000 weeks 2 weeks
# 2017-03-06 595 4.000000 weeks 4 weeks
# 2017-03-20 636 6.000000 weeks 6 weeks
# 2017-04-06 697 8.428571 weeks 8 weeks
# 2017-04-20 566 10.428571 weeks 10 weeks
# 2017-05-05 234 12.571429 weeks 12 weeks
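The answer above measures every date against the first one. If the gap between consecutive rows is also of interest, diff() gives it directly. The following base R sketch uses the dates from the question; the variable names are chosen here for illustration. Accumulating the gaps reproduces the whole-week column.

```r
dates <- as.Date(c("2/6/2017", "2/20/2017", "3/6/2017", "3/20/2017",
                   "4/6/2017", "4/20/2017", "5/5/2017"),
                 format = "%m/%d/%Y")

# Days elapsed since the previous row (0 for the first row).
gap_days <- c(0, as.numeric(diff(dates), units = "days"))

# Running total converted to whole weeks since the first date.
week <- floor(cumsum(gap_days) / 7)

print(data.frame(Date = dates, gap_days = gap_days, week = week))
```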

Identifying incorrectly transformed data cells

I have a massive Excel spreadsheet full of dates in %m/%d/%Y format. In R, I convert them to Date format using as.Date. The problem is that some of the dates in Excel were manually entered incorrectly, as in the section below, where 214 was entered instead of 2014.
...
235 2014-01-20
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27
...
For individual columns, I can use the function which(dataframe$colname_X<1900) which will give me the row number. This is easy because I already know which column it is.
My question is, how can I do the same to the entire dataframe, so that I get both the row and column numbers of the faulty cells?
Starting with:
# rd.txt() is the answerer's own helper that runs read.table() on a text string
dat <- rd.txt("235 2014-01-20
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27")
dat <- cbind(dat,dat)
dat[] <- lapply(dat, as.Date, origin="1970-01-01")
> dat
X235 X2014.01.20 X235 X2014.01.20
1 1970-08-25 2014-03-03 1970-08-25 2014-03-03
2 1970-08-26 2014-01-24 1970-08-26 2014-01-24
3 1970-08-27 2014-03-07 1970-08-27 2014-03-07
4 1970-08-28 0214-05-23 1970-08-28 0214-05-23
5 1970-08-29 2014-01-31 1970-08-29 2014-01-31
6 1970-08-30 2014-02-19 1970-08-30 2014-02-19
7 1970-08-31 2014-03-27 1970-08-31 2014-03-27
Now use which with arr.ind=TRUE (you need to convert to a numeric matrix first)
which( sapply(dat,as.numeric) < (as.numeric(as.Date("1900-01-01") ) ), arr.ind=TRUE)
row col
[1,] 4 2
[2,] 4 4
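For a self-contained version of the same technique, here is a small sketch on made-up data that also maps the column index back to a column name. The cutoff date and the names dat, bad and report are choices made for this illustration.

```r
# Toy data frame with one mistyped year (214 instead of 2014).
dat <- data.frame(
  a = as.Date(c("2014-01-20", "0214-05-23", "2014-01-31")),
  b = as.Date(c("2014-03-03", "2014-03-07", "2014-03-27"))
)

cutoff <- as.Date("1900-01-01")

# Convert every column to days since 1970 so the comparison runs on a numeric matrix.
bad <- which(sapply(dat, as.numeric) < as.numeric(cutoff), arr.ind = TRUE)

# Report row numbers together with column names instead of column indices.
report <- data.frame(row = bad[, "row"], column = names(dat)[bad[, "col"]])
print(report)
```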
One potential solution is to identify all errors using apply:
results <- apply(df, 2, function(x) which(x<1900))
This will return a list with each column as an element of the list. As you don't care about those that are empty (i.e. no errors) you could contract the list to only keep those with errors:
results[lapply(results,length)>0]
