I am trying to count the appearances of a value (across 2 columns) consecutively over the previous days. In the example this would be counting the consecutive days a team made an appearance (either in Hteam or Ateam) prior to that date. The aim would be to produce additional columns for both the home and away teams that showed these new values.
Test data:
data<- data.frame(
Date= c("2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"),
Hteam= c("A","D","B","A","C","A"),
Ateam= c("B","C","A","C","B","C"))
Date Hteam Ateam
1 2018-01-01 A B
2 2018-01-01 D C
3 2018-01-02 B A
4 2018-01-03 A C
5 2018-01-04 C B
6 2018-01-05 A C
The aim would end up looking like:
Date Hteam Ateam Hdays Adays
1 2018-01-01 A B 0 0
2 2018-01-01 D C 0 0
3 2018-01-02 B A 1 1
4 2018-01-03 A C 2 0
5 2018-01-04 C B 1 0
6 2018-01-05 A C 0 2
In my searching I haven't found an example close enough that I am able to adapt to this situation. I feel like I should be using a rollapply or dplyr grouping, but I can't get close to a solution.
Thanks.
Maybe the following gives what you wanted assuming that data is sorted by Date and missing days are not considered.
t1 <- unique(unlist(data[-1]))
t2 <- do.call(rbind, lapply(split(data[-1], data$Date), function(x) t1 %in% unlist(x)))
t3 <- apply(t2, 2, function(x) ave(x, cumsum(!x), FUN=cumsum))-1
data.frame(data
, Hdays=t3[cbind(match(data$Date, rownames(t3)), match(data$Hteam, t1))]
, Adays=t3[cbind(match(data$Date, rownames(t3)), match(data$Ateam, t1))])
# Date Hteam Ateam Hdays Adays
#1 2018-01-01 A B 0 0
#2 2018-01-01 D C 0 0
#3 2018-01-02 B A 1 1
#4 2018-01-03 A C 2 0
#5 2018-01-04 C B 1 0
#6 2018-01-05 A C 0 2
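The key step in the answer above is the ave(x, cumsum(!x), FUN=cumsum) idiom, which restarts a running count every time a team misses a day. A minimal sketch of that idiom on a single made-up vector may help; `played`, `run_id`, `streak` and `days_before` are hypothetical names used only for illustration and do not appear in the answer.

```r
# Hypothetical appearance record for one team, one entry per day
played <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Every FALSE starts a new run, so cumsum(!played) labels the runs
run_id <- cumsum(!played)

# Running count of appearances within each run
streak <- ave(as.integer(played), run_id, FUN = cumsum)

# Consecutive days played before the current day (-1 marks a day not played)
days_before <- streak - 1L
```

In the answer the same computation is applied to every team's column of the matrix t2, and only the cells for days on which a team actually played are looked up, so the -1 values are never used.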
I think your expected output is incorrect. Namely, row 5's "C" occurs twice above it, but has a 1.
Here's a tidyverse version:
library(dplyr)
library(tidyr)
data %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(Date, rn), names_to = "x", values_to = "team") %>%
mutate(x = gsub("team$", "", x)) %>%
group_by(team) %>%
mutate(days = row_number() - 1) %>%
ungroup() %>%
pivot_wider(id_cols = c(Date, rn), names_from = x, values_from = c(team, days)) %>%
select(-rn)
# # A tibble: 6 x 5
# Date team_H team_A days_H days_A
# <chr> <chr> <chr> <dbl> <dbl>
# 1 2018-01-01 A B 0 0
# 2 2018-01-01 D C 0 0
# 3 2018-01-02 B A 1 1
# 4 2018-01-03 A C 2 1
# 5 2018-01-04 C B 2 2
# 6 2018-01-05 A C 3 3
Related
Thank you, experts for previous answers (How to filter by range of dates in R?)
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that exceed 3 "units" within a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm/yyyy]: (a) if there are already 3 observations between 12/01/2021 and 12/02/2021, it must be deleted; (b) if there are fewer than 3, it must remain.
My expected result is:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
df <- df %>%
mutate(day = dmy(date)) %>%
group_by(id) %>%
arrange(day, .by_group = TRUE) %>%
mutate(diff = day - first(day)) %>%
mutate(row = row_number()) %>%
filter(row <= 3 | !diff < 30)
But the result is:
P Q DATE DAY DIFF ROW
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in periods of 30 days from the last day of the previous 30-days period - not since the first observation day.
Any help? Thanks
Using floor_date it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
group_by(floor = floor_date(date, '30 days')) %>%
slice_head(n = 3) %>%
ungroup() %>%
select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
I have a big dataset of about 4 million rows.
The columns are:
Idx - dog serial number
date - date of event, YYYY-MM-DD (2016 to 2021)
Is_sterilized - 1 if the dog was sterilized and 0 if not sterilized.
Each dog can appear many times in a year.
It can appear in 2016 and 2020 but not in 2017-2019.
I want to count how many dogs were sterilized each year. That is, if a dog changes from Is_sterilized == 0 to Is_sterilized == 1 in a year, I count it as sterilized that year; the first year in which it appears sterilized counts as its year of sterilization.
The issue is that my database is not clean, and some dogs go from sterilized to not sterilized. This cannot happen, since sterilization is a one-way surgery.
It can happen that a dog appears sterilized for 3 consecutive years, then unsterilized for one year by mistake, and then sterilized again for 2 years.
What I'm asking is whether there is logic I can use to estimate/count how many dogs go in the wrong direction.
And if so, how can I remove those dogs from my dataset?
In the example data, Idx = A and C make sense, but B and D do not.
df_test <- data.frame(Idx=c( 'A', 'B', 'B', 'B','A', 'A', 'C', 'C', 'D','D','D','D','D','D','C', 'C','A' ),
YEAR_date=as.Date(c("2016-01-01","2016-01-29","2017-01-01","2016-05-01","2016-05-06","2016-05-01","2016-03-03","2016-04-22","2018-05-05","2017-02-01","2021-11-12","2019-09-13","2019-11-12","2019-08-17","2011-09-01","2011-07-05","2021-01-05")),
Is_sterilized =c(0,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1)
)
df_test[,c( "Idx" ,"YEAR_date", "Is_sterilized")] %>% arrange(Idx ,YEAR_date )
Idx YEAR_date Is_sterilized
1 A 2016-01-01 0
2 A 2016-05-01 1
3 A 2016-05-06 1
4 A 2021-01-05 1
5 B 2016-01-29 1
6 B 2016-05-01 1
7 B 2017-01-01 0
8 C 2011-07-05 1
9 C 2011-09-01 1
10 C 2016-03-03 1
11 C 2016-04-22 1
12 D 2017-02-01 1
13 D 2018-05-05 1
14 D 2019-08-17 1
15 D 2019-09-13 1
16 D 2019-11-12 0
17 D 2021-11-12 0
I have more columns, so if you think anything else is relevant, please write and I'll check whether I have it.
Any hint or idea will be helpful.
Thank you in advance.
Here's some dplyr code to identify instances where a dog's sterilization status went from 1 to 0. The rows are sorted by date within each dog first, because lag() compares each row with the row immediately before it:
library(dplyr)
df_test %>%
arrange(Idx, YEAR_date) %>%
group_by(Idx) %>%
mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup()
# A tibble: 2 x 4
Idx YEAR_date Is_sterilized change
<chr> <date> <dbl> <dbl>
1 B 2017-01-01 0 -1
2 D 2019-11-12 0 -1
If you want to count how many times each dog in that list changed, add %>% count(Idx) at the end.
df_test %>%
arrange(Idx, YEAR_date) %>%
group_by(Idx) %>%
mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup() %>%
count(Idx, name = "times_desterilized")
# A tibble: 2 x 2
Idx times_desterilized
<chr> <int>
1 B 1
2 D 1
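The original question also asks how many dogs were sterilized each year. One possible sketch, under the assumption that a dog's first record with Is_sterilized == 1 marks its sterilization year and that any later 0 is a data error, is shown below. The small data frame and the names `first_sterilized`, `first_year` and `per_year` are made up for illustration.

```r
library(dplyr)

# Small illustrative sample (not the asker's full data)
df_test <- data.frame(
  Idx = c("A", "A", "B", "B", "C"),
  YEAR_date = as.Date(c("2016-01-01", "2016-05-01", "2016-01-29",
                        "2017-01-01", "2011-07-05")),
  Is_sterilized = c(0, 1, 1, 0, 1)
)

# Assumption: the earliest sterilized record defines the sterilization year
first_sterilized <- df_test %>%
  filter(Is_sterilized == 1) %>%
  group_by(Idx) %>%
  summarise(first_year = as.integer(format(min(YEAR_date), "%Y")),
            .groups = "drop")

# Number of dogs first seen as sterilized in each year
per_year <- count(first_sterilized, first_year)
```

Because only the earliest sterilized record is used, a later wrong-direction row (such as B in 2017) does not affect the yearly count.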
I'm a complete beginner to R and I just need to do some quick cleaning of my data. But I ran into a problem I can't wrap my head around.
So I have a Postgres db with time series. Columns are ID, DATE and VALUE (temperature). Each ID is a new measuring station, so I have a time series for each ID (around 2000 unique IDs, 4m rows). The dates span 1915-2016; some series overlap, some do not. If a measurement is missing for a week, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(Date.seq) creates NA values for all weeks between 1915 and 2016; I clearly understand why it happens. How can I make it fill values only between the actual start and end date of each specific time series? I want a moving min and max that depend on the start and end date of each specific ID, and then fill the missing dates between the start and end date of each ID.
library("RPostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I don't know why this works and your whole code does not, but possibly in your code, the data structure is not what is needed. If so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't drawing from the package you need. Using tidyr::complete works on the sample.
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
filter(ID == 1)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df2 <- df %>%
filter(ID == 2)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df3 <- df %>%
filter(ID == 3)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
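The per-ID filtering above can probably be avoided by grouping first, since tidyr::complete() operates within each group of a grouped data frame. The following is a sketch on the same sample data, not a tested solution for the full 4m-row table; `filled` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

df <- read.table(header = TRUE, text = "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1")

filled <- df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%
  # min() and max() are evaluated per ID, so each series is only
  # completed between its own first and last date
  complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%
  ungroup()
```

Because ID is the grouping column, it is kept for the inserted rows, and only VALUE is NA for the missing weeks.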
I am trying to count the number of positive events over a 12 month rolling window.
I can create 365 rows of missing data per year and use zoo::rollapply to sum the number of events per 365 rows of data, but my data frame is really big and I want to do this on a bunch of variables, so this takes forever to run.
I can get the correct output with this:
data <- data.frame(id = c("a","a","a","a","a","b","b","b","b","b"),
date = c("20-01-2011","20-04-2011","20-10-2011","20-02-2012",
"20-05-2012","20-01-2013","20-04-2013","20-10-2013",
"20-02-2014","20-05-2014"),
event = c(0,1,1,1,0,1,0,0,1,1))
library(lubridate)
library(dplyr)
library(tidyr)
library(zoo)
data %>%
group_by(id) %>%
mutate(date = dmy(date),
cumsum = cumsum(event)) %>%
complete(date = full_seq(date, period = 1), fill = list(event = 0)) %>%
mutate(event12 = rollapplyr(event, width = 365, FUN = sum, partial = TRUE)) %>%
drop_na(cumsum)
Which is this:
id date event cumsum event12
<fct> <date> <dbl> <dbl> <dbl>
a 2011-01-20 0 0 0
a 2011-04-20 1 1 1
a 2011-10-20 1 2 2
a 2012-02-20 1 3 3
a 2012-05-20 0 3 2
b 2013-01-20 1 1 1
b 2013-04-20 0 1 1
b 2013-10-20 0 1 1
b 2014-02-20 1 2 1
b 2014-05-20 1 3 2
But I want to see if there's a more efficient way: how would I make the width in rollapply count dates rather than rows?
This can be done without filling out the missing dates using a complex self join and a single sql statement after converting the dates to Date class:
library(sqldf)
data2 <- transform(data, date = as.Date(date, "%d-%m-%Y"))
sqldf("select a.*, sum(b.event) as event12
from data2 as a
left join data2 as b on a.id = b.id and b.date between a.date - 365 and a.date
group by a.rowid
order by a.rowid")
giving:
id date event event12
1 a 2011-01-20 0 0
2 a 2011-04-20 1 1
3 a 2011-10-20 1 2
4 a 2012-02-20 1 3
5 a 2012-05-20 0 2
6 b 2013-01-20 1 1
7 b 2013-04-20 0 1
8 b 2013-10-20 0 1
9 b 2014-02-20 1 1
10 b 2014-05-20 1 2
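For readers who want to see the window definition without SQL, the same "same id, within the previous 365 days" condition can be written in base R. This is only a sketch to make the logic explicit: it compares every row with every other row, so it is unlikely to be fast on a really big data frame.

```r
data <- data.frame(id = rep(c("a", "b"), each = 5),
                   date = c("20-01-2011", "20-04-2011", "20-10-2011", "20-02-2012",
                            "20-05-2012", "20-01-2013", "20-04-2013", "20-10-2013",
                            "20-02-2014", "20-05-2014"),
                   event = c(0, 1, 1, 1, 0, 1, 0, 0, 1, 1))

data2 <- transform(data, date = as.Date(date, "%d-%m-%Y"))

# For each row, sum the events of the same id dated within [date - 365, date]
data2$event12 <- vapply(seq_len(nrow(data2)), function(i) {
  in_window <- data2$id == data2$id[i] &
    data2$date >= data2$date[i] - 365 &
    data2$date <= data2$date[i]
  sum(data2$event[in_window])
}, numeric(1))
```

On the sample data this gives the same event12 column as the sqldf query.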
I want to insert rows between two dates by group. My current approach is rather complicated: I insert the missing values by carrying the last observation forward and then merge. I was wondering whether there is an easier way to achieve it.
# sample data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
dt<-data.frame(user,dummy,date)
user dummy date
1 A 1 2017-01-03
2 A 1 2017-01-06
3 B 1 2016-05-01
4 B 1 2016-05-03
5 B 1 2016-05-05
Desired output
Using dplyr and tidyr (a one-line solution):
library(dplyr)
library(tidyr)
dt %>% group_by(user) %>% complete(date=full_seq(date,1),fill=list(dummy=0))
# A tibble: 9 x 3
# Groups: user [2]
user date dummy
<fctr> <date> <dbl>
1 A 2017-01-03 1
2 A 2017-01-04 0
3 A 2017-01-05 0
4 A 2017-01-06 1
5 B 2016-05-01 1
6 B 2016-05-02 0
7 B 2016-05-03 1
8 B 2016-05-04 0
9 B 2016-05-05 1
you can try this
library(data.table)
setDT(dt)
tmp <- dt[, .(date = seq.Date(min(date), max(date), by = '1 day')), by = 'user']
dt <- merge(tmp, dt, by = c('user', 'date'), all.x = TRUE)
dt[, dummy := ifelse(is.na(dummy), 0, dummy)]
We can use the tidyverse to achieve this task.
library(tidyverse)
dt2 <- dt %>%
group_by(user) %>%
do(date = seq(from = min(.$date), to = max(.$date), by = 1)) %>%
unnest(cols = date) %>%
left_join(dt, by = c("user", "date")) %>%
replace_na(list(dummy = 0)) %>%
select(colnames(dt))
dt2
# A tibble: 9 x 3
user dummy date
<fctr> <dbl> <date>
1 A 1 2017-01-03
2 A 0 2017-01-04
3 A 0 2017-01-05
4 A 1 2017-01-06
5 B 1 2016-05-01
6 B 0 2016-05-02
7 B 1 2016-05-03
8 B 0 2016-05-04
9 B 1 2016-05-05
The simplest way that I have found to do this is with the padr library.
library(padr)
library(tidyr) # provides replace_na() and re-exports %>%
dt_padded <- pad(dt, group = "user", by = "date") %>%
replace_na(list(dummy=0))
A Base R (not quite as elegant) solution:
# Data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
df1 <-data.frame(user,dummy,date)
# Solution
do.call(rbind, lapply(split(df1, df1$user), function(df) {
dff <- data.frame(user=df$user[1], dummy=0, date=seq.Date(min(df$date), max(df$date), 'day'))
dff[dff$date %in% df$date, "dummy"] <- df$dummy[1]
dff
}))
# user dummy date
# A 1 2017-01-03
# A 0 2017-01-04
# A 0 2017-01-05
# A 1 2017-01-06
# B 1 2016-05-01
# B 0 2016-05-02
# B 1 2016-05-03
# B 0 2016-05-04
# B 1 2016-05-05
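Assuming your data is called df1 and you want to add the dates between two days, try this:
library(dplyr)
# left_join() needs data frames, so wrap the date sequence in one
df2 <- data.frame(date = seq.Date(as.Date("2015-01-03"), as.Date("2015-01-06"), by = "day"))
left_join(df2, df1, by = "date")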
If you're simply trying to add a new record, I suggest using rbind.
rbind()
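A minimal sketch of the rbind() suggestion, assuming a data frame with the same user, dummy and date columns as in the question; `new_row` and `dt_new` are hypothetical names:

```r
# Two existing records for user A, as in the question's sample data
dt <- data.frame(user = c("A", "A"),
                 dummy = c(1, 1),
                 date = as.Date(c("2017-01-03", "2017-01-06")))

# The new record must have the same column names as dt
new_row <- data.frame(user = "A", dummy = 0, date = as.Date("2017-01-04"))

# Append the record, then restore the user/date ordering
dt_new <- rbind(dt, new_row)
dt_new <- dt_new[order(dt_new$user, dt_new$date), ]
```

This only suits adding a handful of known records; filling every missing date per group is better served by the complete() or seq.Date() approaches above.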