Dates are not keeping specified format in R data frame - r

Simply put, I'm grabbing dates for events that meet certain conditions in df1 and putting them in a new data frame (df2). The formatting of dates in df2 should be the same formatting in df1 ("2000-09-12", or %Y-%m-%d). However, the dates in df2 read "11212", "11213", etc.
To generate the data:
"Date"<-c("2000-09-08", "2000-09-11","2000-09-12","2000-09-13","2000-09-14","2000-09-15","2000-09-18","2000-09-19","2000-09-20","2000-09-21", "2000-09-22","2000-09-25")
"Event"<-c("A","N","O","O","O","O","N","N","N","N","N","A")
df1<-data.frame(Date,Event)
df1
Date Event
1 2000-09-08 A
2 2000-09-11 N
3 2000-09-12 O
4 2000-09-13 O
5 2000-09-14 O
6 2000-09-15 O
7 2000-09-18 N
8 2000-09-19 N
9 2000-09-20 N
10 2000-09-21 N
11 2000-09-22 N
12 2000-09-25 A
Here is the code:
"df2"<-data.frame()
"tmp"<-data.frame(1,2)
i<-c(1:4)
for (x in i)
{
date1<- df1$Date[df1$Event=="O"][x]
date2<- df1$Date[df1$Event=="A" & df1$Date >= date1] [1]
as.numeric(difftime(date2, date1))->tmp[1,2]
as.Date(as.character(df1$Date[df1$Event=="O"][x]), "%Y-%m-%d")->tmp[1,1] ##the culprit
rbind(df2, tmp)->df2
}
Loop output looks like this:
X1 X2
1 11212 13
2 11213 12
3 11214 11
4 11215 10
I want it to look like this:
X1 X2
1 "2000-09-12" 13
2 "2000-09-13" 12
3 "2000-09-14" 11
4 "2000-09-15" 10

If I understand correctly, the OP wants to find for each "O" event the difference in days to the next following "A" event.
This can be solved using a rolling join. We extract the "O" events and the "A" events into two separate data.tables and join both on date.
This avoids all the hassle with the date format and also works if df1 is not already ordered by Date.
library(data.table)
setDT(df1)[Event == "A"][df1[Event == "O"],
on = "Date", roll = -Inf, .(Date, x.Date - i.Date)]
Date V2
1: 2000-09-12 13 days
2: 2000-09-13 12 days
3: 2000-09-14 11 days
4: 2000-09-15 10 days
Note that roll = -Inf rolls backwards (next observation carried backward (NOCB)) because the date of the next "A" event is required.
Data
Date <- as.Date(c("2000-09-08", "2000-09-11","2000-09-12","2000-09-13","2000-09-14","2000-09-15",
"2000-09-18","2000-09-19","2000-09-20","2000-09-21", "2000-09-22","2000-09-25"))
Event <- c("A","N","O","O","O","O","N","N","N","N","N","A")
df1 <- data.frame(Date,Event)
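As for why the original loop prints 11212 rather than a date: a Date is stored as the number of days since 1970-01-01, and assigning it into a cell of a numeric column keeps only that number. The base R sketch below reproduces the effect and then builds the same result without a loop. The names o_dates, a_dates and idx are illustrative and do not come from the original post.

```r
Date <- as.Date(c("2000-09-08", "2000-09-11", "2000-09-12", "2000-09-13",
                  "2000-09-14", "2000-09-15", "2000-09-18", "2000-09-19",
                  "2000-09-20", "2000-09-21", "2000-09-22", "2000-09-25"))
Event <- c("A", "N", "O", "O", "O", "O", "N", "N", "N", "N", "N", "A")
df1 <- data.frame(Date, Event)

# The column created by data.frame(1, 2) is numeric, so the Date class is dropped
tmp <- data.frame(1, 2)
tmp[1, 1] <- as.Date("2000-09-12")
class(tmp[[1]])   # "numeric": the cell now holds the day count 11212

# Loop-free alternative: build the Date column as a whole so its class survives
o_dates <- df1$Date[df1$Event == "O"]
a_dates <- sort(df1$Date[df1$Event == "A"])

# index of the first "A" date on or after each "O" date
idx <- vapply(seq_along(o_dates),
              function(k) which(a_dates >= o_dates[k])[1],
              integer(1))

df2 <- data.frame(Date = o_dates,
                  Days = as.numeric(a_dates[idx] - o_dates))
df2
```

Because the Date column is created in one piece, df2 keeps the %Y-%m-%d display without any further formatting.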

Related

Filter by Condition occurring Consecutively in R

I'm hoping to see if there is a dplyr solution to this problem as I'm building a survival dataset.
I am looking to create my 'event' coding that would satisfy a particular condition if it occurs twice consecutively. In this case, the event condition would be if Var was > 21 for two consecutive dates. For example, in the following dataset:
ID Date Var
1 1/1/20 22
1 1/3/20 23
2 1/2/20 23
2 2/10/20 18
2 2/16/20 21
3 12/1/19 16
3 12/6/19 14
3 12/20/19 22
In this case, patient 1 should remain, and patients 2 and 3 should be filtered out because > 21 did not happen consecutively. I'd then like to simply take the maximum date for each ID so that I can easily calculate the time to the event.
Final result:
ID Date Var
1 1/3/20 23
Thank you
As long as the dates are sorted within each ID (later dates further down the table), this should work. It uses data.table, since I don't use dplyr much, but a dplyr version should be pretty similar.
library(data.table)
setDT(df)
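# flag rows where Var > 21 on this date and on the previous date of the same ID
df[, hit := Var > 21 & shift(Var > 21, fill = FALSE), by = ID]
# keep the flagged rows, then only the last one per ID
df <- unique(df[hit == TRUE], by = "ID", fromLast = TRUE)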
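Since the question asked for dplyr, here is a rough equivalent as a sketch. It assumes the Date column is parsed with the format "%m/%d/%y" and that "consecutive" means two successive rows of the same ID after sorting by date; the object name result is illustrative.

```r
library(dplyr)

df <- data.frame(
  ID   = c(1, 1, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c("1/1/20", "1/3/20", "1/2/20", "2/10/20", "2/16/20",
                   "12/1/19", "12/6/19", "12/20/19"), format = "%m/%d/%y"),
  Var  = c(22, 23, 23, 18, 21, 16, 14, 22)
)

result <- df %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  # keep a row only if this visit and the previous visit of the same ID exceed 21
  filter(Var > 21 & lag(Var > 21, default = FALSE)) %>%
  # latest qualifying date per ID
  slice_max(Date, n = 1) %>%
  ungroup()

result
```

Grouping before lag() keeps the comparison from leaking across patients, which is the main thing to watch for in either package.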

Extract the values from the dataframes created in a loop for further analysis (I am not sure, how to sum up the question in one line)

My raw dataset has multiple product Id, monthly sales and corresponding date arranged in a matrix format. I wish to create individual dataframes for each product_id along with the sales value and dates. For this, I am using a for loop.
base is the base dataset.
x is the variable that contains the unique product_id and the corresponding no of observation points.
for(i in 1:nrow(x)){
n <- paste("df", x$vars[i], sep = "")
assign(n, base[base[,1] == x$vars[i],])
print(n)}
This is a part of the output:
[1] "df25"
[1] "df28"
[1] "df35"
[1] "df37"
[1] "df39"
So all the dataframe names are saved in n. This, I think is a string vector.
When I write df25 outside the loop, I get the dataframe I want:
> df25
# A tibble: 49 x 3
ID date Sales
<dbl> <date> <dbl>
1 25 2014-01-01 0
2 25 2014-02-01 0
3 25 2014-03-01 0
4 25 2014-04-01 0
5 25 2014-05-01 0
6 25 2014-06-01 0
7 25 2014-07-01 0
8 25 2014-08-01 0
9 25 2014-09-01 0
10 25 2014-10-01 0
# ... with 39 more rows
Now, I want to use each of these dataframes separately to perform a forecast analysis. To do this, I need to get at the values in the individual dataframes. This is what I have tried:
for(i in 1:4) {print(paste0("df", x$vars[i]))}
[1] "df2"
[1] "df3"
[1] "df5"
[1] "df14"
But I am unable to refer to individual dataframes.
I am looking for help on how to access these dataframes and their values for further analysis. Since there are more than 200 products, I am looking for a function that deals with all the dataframes.
First, I wish to convert it to a TS, using year and month values from the date variable and then use ets or forecast, etc.
SAMPLE DATASET:
set.seed(354)
df <- data.frame(Product_Id = rep(1:10, each = 50),
Date = seq(from = as.Date("2010/1/1"), to = as.Date("2014/2/1") , by = "month"),
Sales = rnorm(100, mean = 50, sd= 20))
df <- df[-c(251:256, 301:312) ,]
As always, any suggestion would be highly appreciated.
I think this is one way to get an access to the individual dataframes. If there is a better method, please let me know:
(Var <- get(paste0("df",x$vars[i])))
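An alternative to assign() and get() is to keep the per-product data frames in a named list, which can then be processed with lapply(). The sketch below is base R only and makes two assumptions: the data set is called base as in the question, and every product has a gap-free monthly series (the sample data here is generated complete for that reason). The helper to_ts is hypothetical.

```r
set.seed(354)
base <- data.frame(
  Product_Id = rep(1:10, each = 50),
  Date = rep(seq(as.Date("2010-01-01"), as.Date("2014-02-01"), by = "month"),
             times = 10),
  Sales = rnorm(500, mean = 50, sd = 20)
)

# one data frame per product, named by product id ("1", "2", ...)
by_product <- split(base, base$Product_Id)

# hypothetical helper: turn one product's rows into a monthly ts object
to_ts <- function(d) {
  d <- d[order(d$Date), ]
  first_year  <- as.integer(format(d$Date[1], "%Y"))
  first_month <- as.integer(format(d$Date[1], "%m"))
  ts(d$Sales, start = c(first_year, first_month), frequency = 12)
}

series <- lapply(by_product, to_ts)

# access a single product by name
series[["3"]]
```

Any forecasting function can then be applied to every element of series with another lapply() call, without referring to 200 separately named objects.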

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous row
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
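summarise(retest = n() > 1 && test_date[n()] < result_date[1])
Students who took the test only once get FALSE from the n() > 1 check; without it, their single test date would be compared with their own result date and would always come out TRUE. The next line is only a safeguard for missing dates, which would produce NA. You can retain the NA instead, as it does contain information.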
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
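For comparison, the flag can also be placed on the earlier exam, as in the desired output in the question (re_test = 1 on the row whose result had not been released when the next test was taken). This is a sketch using a grouped lead() in dplyr; it assumes the dates are converted with as.Date, and the object name flagged is illustrative.

```r
library(dplyr)

data1 <- data.frame(
  id = c(1, 1, 3, 4, 5, 6, 7, 7, 8, 9, 9),
  test_date = as.Date(c("2012-06-27", "2012-07-10", "2013-07-04", "2012-03-24",
                        "2012-07-22", "2013-09-16", "2012-06-21", "2013-10-18",
                        "2013-04-21", "2012-02-16", "2012-03-15")),
  result_date = as.Date(c("2012-07-29", "2012-09-02", "2013-08-01", "2012-04-25",
                          "2012-09-01", "2013-10-20", "2012-07-01", "2013-10-31",
                          "2013-05-17", "2012-03-17", "2012-04-20"))
)

flagged <- data1 %>%
  arrange(id, test_date) %>%
  group_by(id) %>%
  # 1 if the same student's next test happened before this exam's result date
  mutate(re_test = as.integer(!is.na(lead(test_date)) &
                                lead(test_date) < result_date)) %>%
  ungroup()

flagged
```

Grouping by id before calling lead() is what keeps different students from being mixed up, which was the problem with the ungrouped mutate() attempt.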

Replace values based on months in a dataframe with values in another column in r, using apply functions

I am working with a time series of precipitation data and attempting to use the median imputation method to replace all 0 value data points with the median of all data points for the corresponding month that that 0 value was recorded.
I have two data frames, one with the original precipitation data:
> head(df.m)
prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
And one with the median monthly values:
> medians
Group.1 x
1 01 135.90680
2 02 123.52613
3 03 113.09841
4 04 98.10044
5 05 75.21976
6 06 57.47287
7 07 54.16667
8 08 45.57653
9 09 77.87740
10 10 103.25179
11 11 124.36795
12 12 131.30695
Below is the current solution that I have come up with utilizing the 1st answer here:
df.m[,"prcp"] <- sapply(df.m[,"prcp"], function(y) ifelse(y==0, medians$x,y))
This has not worked as it only applies the first value of the df medians$Group.1, which is the month of January (01). How can I get the values so that correct median will be applied from the corresponding month?
Another way I have attempted a solution is via the below:
df.m[,"prcp"] <- sapply(medians$Group.1, function(y)
ifelse(df.m[format.Date(df.m$date, "%m") == y &
df.m$prcp == 0, "prcp"], medians[medians$Group.1 == y,"x"],
df.m[,"prcp"]))
Description of the above function - this function tests and returns the amount of zeros for every month that there is a zero value in df.m[,"prcp"]
Same issue here as the 1st solution, but it does return all of the 0 values by month (if just executing the sapply() portion).
How can I replace all 0 in df.m$prcp with their corresponding medians from the medians df based on the month of the data?
Apologies if this is a basic question, I'm somewhat of a newbie here. Any and all help would be greatly appreciated.
Consider merging the two dataframes by month/group and then calculating with ifelse:
# MERGE TWO FRAMES
df.m$month <- format(df.m$date, "%m")
df.merge <- merge(df.m, medians, by.x="month", by.y="Group.1")
# CONDITIONAL CALCULATION
df.merge$prcp <- ifelse(df.merge$prcp == 0, df.merge$x, df.merge$prcp)
# RETURN BACK TO ORIGINAL STRUCTURE (merge() sorts by month, so restore date order)
df.m <- df.merge[order(df.merge$date), c("prcp", "date")]
A dplyr version, which does not rely on the original order. This uses slightly modified test data to show replacement of zeroes across multiple years.
require(dplyr)
## test data with zeroes - extended for additional years
df.m <- read.delim(text="
i prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
7 0 1976-06-30
8 0 1976-07-31
9 70 1976-08-31
", sep="", stringsAsFactors = FALSE)
medians <- read.delim(text="
i month x
1 01 135.90680
2 02 123.52613
3 03 113.09841
4 04 98.10044
5 05 75.21976
6 06 57.47287
7 07 54.16667
8 08 45.57653
9 09 77.87740
10 10 103.25179
11 11 124.36795
12 12 131.30695
", sep = "", stringsAsFactors = FALSE, strip.white = TRUE)
# extract the month as integer
df.m$month = as.integer(substr(df.m$date,6,7))
# match to medians by joining
result <- df.m %>%
inner_join(medians, by='month') %>%
mutate(prcp = ifelse(prcp == 0, x, prcp)) %>%
select(prcp, date)
result
yields
prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
7 57.47287 1976-06-30
8 54.16667 1976-07-31
9 70.00000 1976-08-31
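I created small datasets with some zero values and added one line of code. Note that the assignment is positional, so it assumes df and grp have the same rows in the same order: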
#create sample data
prcp <- c(1.5,0.0,0.0,2.1)
date <- c(01,02,03,04)
x <- c(1.11,2.22,3.33,4.44)
df <- data.frame(prcp,date)
grp <- data.frame(x,date)
#Make the assignment
df[df$prcp == 0,]$prcp <- grp[df$prcp == 0,]$x
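When the two data frames do not line up row for row (for example, several years of monthly data against twelve medians), a lookup by month with match() avoids both the positional assumption and the reordering that merge() introduces. This is a minimal base R sketch with made-up sample rows; the column names follow the question, and the helper vectors month and zero are illustrative.

```r
df.m <- data.frame(
  prcp = c(121.00485, 0, 82.74026, 0),
  date = as.Date(c("1975-01-31", "1975-02-28", "1975-03-31", "1976-02-29"))
)
medians <- data.frame(
  Group.1 = c("01", "02", "03"),
  x = c(135.90680, 123.52613, 113.09841)
)

# two-digit month of every observation, in the same form as medians$Group.1
month <- format(df.m$date, "%m")
zero  <- df.m$prcp == 0

# look up the median for each zero row by month; row order is untouched
df.m$prcp[zero] <- medians$x[match(month[zero], medians$Group.1)]
df.m
```

Both February zeroes, from 1975 and 1976, receive the February median, and the rows stay in their original order.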

R - How to sum a column based on date range? [duplicate]

Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 on previous 3 days of 7/27/2015 (not including 7/27).
Q2: Sum of Var1 on previous 3 days of 7/25/2015 (This is not last row), basically I choose anyday as reference day, and then calculate rolling sum.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d=seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by='days'),
x=sample(20, size=45, replace=T))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
head(zoo::rollsum(df$x, k=k), n=-1))
df
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The 0, cumsum(...) is to pre-populate the first two rows that are ignored (rollsum(x, k) returns a vector of length length(x)-k+1). The head(..., n=-1) discards the last element, because you said that the nth entry should sum the previous 3 and not its own row.
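If only a single reference day is needed, as in Q1 and Q2, the sum can also be taken directly from a date window without a rolling function. This is a base R sketch; the function sum_prev_days is hypothetical, the sample rows are the last four from the question, and "previous 3 days" is taken to mean the three calendar days before the reference day, excluding the day itself.

```r
df1 <- data.frame(
  Date = as.Date(c("07/24/2015", "07/25/2015", "07/26/2015", "07/27/2015"),
                 format = "%m/%d/%Y"),
  Var1 = c(1, 6, 23, 15)
)

# hypothetical helper: sum Var1 over the k calendar days before ref
sum_prev_days <- function(data, ref, k = 3) {
  ref <- as.Date(ref)
  in_window <- data$Date >= ref - k & data$Date < ref
  sum(data$Var1[in_window])
}

sum_prev_days(df1, "2015-07-27")  # 1 + 6 + 23 = 30
sum_prev_days(df1, "2015-07-25")  # only 07/24 is present in this sample: 1
```

Unlike the row-based rollsum, this version is driven by the dates themselves, so missing days simply contribute nothing to the sum.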
