My data frame looks like this
> data
Date Dummy
1 2020-01-01 1
2 2020-01-02 0
3 2020-01-03 0
4 2020-01-04 0
5 2020-01-05 1
6 2020-01-06 1
7 2020-01-07 1
8 2020-01-08 0
9 2020-01-09 1
10 2020-01-10 1
11 2020-01-11 0
I want to create a new column which gives the cumulative frequency of dummy values but conditional on whether the dummy was present or not. The final data set looks like this
> data
Date Dummy Modified
1 2020-01-01 1 1
2 2020-01-02 0 1
3 2020-01-03 0 1
4 2020-01-04 0 1
5 2020-01-05 1 2
6 2020-01-06 1 3
7 2020-01-07 1 4
8 2020-01-08 0 4
9 2020-01-09 1 5
10 2020-01-10 1 6
11 2020-01-11 0 6
How can I achieve this in R, preferably with dplyr? Any help will be greatly appreciated.
We can just do a cumsum
cumsum(data$Dummy)
#[1] 1 1 1 1 2 3 4 4 5 6 6
This can be implemented within the %>% chain
library(dplyr)
data %>%
mutate(Modified = cumsum(Dummy))
I am trying to find a way to subset or filter my dataset (repeated measures of individuals) using a conditional statement on the first measure. In other words, I want to filter the dataset to only include data for all time points for the individuals which have a specific condition present at time point 1.
Example Data:
Puck_Number <- c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
Date <- as.Date(c('2020-07-29','2020-07-29','2020-07-29','2020-07-29','2020-07-29','2020-07-29','2020-09-07','2020-09-07','2020-09-07','2020-09-07','2020-09-07','2020-09-07','2020-09-22','2020-09-22','2020-09-22','2020-09-22','2020-09-22','2020-09-22'))
Bleached <- c(1,0,1,1,0,1,1,0,1,1,0,1,0,0,0,1,0,1)
Alive <- c(1,1,1,1,1,1,1,1,1,1,0,1,0,1,0,1,0,1)
Data <- data.frame(Puck_Number, Date, Bleached, Alive)
Which will produce the following:
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
2 2 2020-07-29 0 1
3 3 2020-07-29 1 1
4 4 2020-07-29 1 1
5 5 2020-07-29 0 1
6 6 2020-07-29 1 1
7 1 2020-09-07 1 1
8 2 2020-09-07 0 1
9 3 2020-09-07 1 1
10 4 2020-09-07 1 1
11 5 2020-09-07 0 0
12 6 2020-09-07 1 1
13 1 2020-09-22 0 0
14 2 2020-09-22 0 1
15 3 2020-09-22 0 0
16 4 2020-09-22 1 1
17 5 2020-09-22 0 0
18 6 2020-09-22 1 1
I want to keep only the individuals that have a 1 in the Bleached column on '2020-07-29', together with all of their repeated measures across the entire dataset.
So I am looking for the data to look like this:
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
3 3 2020-07-29 1 1
4 4 2020-07-29 1 1
6 6 2020-07-29 1 1
7 1 2020-09-07 1 1
9 3 2020-09-07 1 1
10 4 2020-09-07 1 1
12 6 2020-09-07 1 1
13 1 2020-09-22 0 0
15 3 2020-09-22 0 0
16 4 2020-09-22 1 1
18 6 2020-09-22 1 1
The puck number is a unique identifier for each individual (repeated for each measurement) and I suspect that it may help in this filtering, but I haven't come across a way to accomplish this with the R skill set I have.
Try this
with(Data, Data[Puck_Number %in% Puck_Number[Date == as.Date("2020-07-29") & Bleached], ])
Output
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
3 3 2020-07-29 1 1
4 4 2020-07-29 1 1
6 6 2020-07-29 1 1
7 1 2020-09-07 1 1
9 3 2020-09-07 1 1
10 4 2020-09-07 1 1
12 6 2020-09-07 1 1
13 1 2020-09-22 0 0
15 3 2020-09-22 0 0
16 4 2020-09-22 1 1
18 6 2020-09-22 1 1
Or a tidyverse way
library(tidyverse)
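bleached_first <- Data %>%
  filter(Date == as.Date("2020-07-29"), Bleached == 1) %>%
  select(Puck_Number) %>%
  left_join(Data, by = "Puck_Number") %>%
  arrange(Date, Puck_Number) # left_join returns rows puck by puck; sort to match the output shown
> bleached_first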
Puck_Number Date Bleached Alive
1 1 2020-07-29 1 1
2 3 2020-07-29 1 1
3 4 2020-07-29 1 1
4 6 2020-07-29 1 1
5 1 2020-09-07 1 1
6 3 2020-09-07 1 1
7 4 2020-09-07 1 1
8 6 2020-09-07 1 1
9 1 2020-09-22 0 0
10 3 2020-09-22 0 0
11 4 2020-09-22 1 1
12 6 2020-09-22 1 1
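A grouped filter is another option that avoids the join and leaves the rows in their original order. This is a minimal sketch that rebuilds the Data frame from the question; the names first_date and kept are introduced here for illustration only.

```r
library(dplyr)

# Example data from the question, written compactly.
Puck_Number <- rep(1:6, times = 3)
Date <- as.Date(rep(c("2020-07-29", "2020-09-07", "2020-09-22"), each = 6))
Bleached <- c(1,0,1,1,0,1, 1,0,1,1,0,1, 0,0,0,1,0,1)
Alive <- c(1,1,1,1,1,1, 1,1,1,1,0,1, 0,1,0,1,0,1)
Data <- data.frame(Puck_Number, Date, Bleached, Alive)

first_date <- as.Date("2020-07-29")  # illustrative name, not from the question

# Keep every row of a puck if that puck was bleached on the first date.
kept <- Data %>%
  group_by(Puck_Number) %>%
  filter(any(Date == first_date & Bleached == 1)) %>%
  ungroup()
```

Because the condition is evaluated once per Puck_Number, all time points of a qualifying individual are kept or dropped together.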
I have data in which subjects completed multiple ratings per day over 6-7 days. The number of ratings per day varies. The data set includes subject ID, date, and the ratings. I would like to create a new variable that recodes the dates for each subject into "study day" --- so 1 for first day of ratings, 2 for second day of ratings, etc.
For example, I would like to take this:
id Date Rating
1 10/20/2018 2
1 10/20/2018 3
1 10/20/2018 5
1 10/21/2018 1
1 10/21/2018 7
1 10/21/2018 9
1 10/22/2018 4
1 10/22/2018 5
1 10/22/2018 9
2 11/15/2018 1
2 11/15/2018 3
2 11/15/2018 4
2 11/16/2018 3
2 11/16/2018 1
2 11/17/2018 0
2 11/17/2018 2
2 11/17/2018 9
And end up with this:
id Day Date Rating
1 1 10/20/2018 2
1 1 10/20/2018 3
1 1 10/20/2018 5
1 2 10/21/2018 1
1 2 10/21/2018 7
1 2 10/21/2018 9
1 3 10/22/2018 4
1 3 10/22/2018 5
1 3 10/22/2018 9
2 1 11/15/2018 1
2 1 11/15/2018 3
2 1 11/15/2018 4
2 2 11/16/2018 3
2 2 11/16/2018 1
2 3 11/17/2018 0
2 3 11/17/2018 2
2 3 11/17/2018 9
I was going to look into setting up some kind of loop, but I thought it would be worth asking if there is a more efficient way to pull this off. Are there any functions that would allow me to automate this sort of thing? Thanks very much for any suggestions.
Since you want to reset the count for every id, this question is a bit different.
Using only base R, we can split the Date based on id and then create a count of each distinct group.
df$Day <- unlist(sapply(split(df$Date, df$id), function(x) match(x,unique(x))))
df
# id Date Rating Day
#1 1 10/20/2018 2 1
#2 1 10/20/2018 3 1
#3 1 10/20/2018 5 1
#4 1 10/21/2018 1 2
#5 1 10/21/2018 7 2
#6 1 10/21/2018 9 2
#7 1 10/22/2018 4 3
#8 1 10/22/2018 5 3
#9 1 10/22/2018 9 3
#10 2 11/15/2018 1 1
#11 2 11/15/2018 3 1
#12 2 11/15/2018 4 1
#13 2 11/16/2018 3 2
#14 2 11/16/2018 1 2
#15 2 11/17/2018 0 3
#16 2 11/17/2018 2 3
#17 2 11/17/2018 9 3
I don't know how I missed this, but thanks to @thelatemail, who reminded me that this is basically the same as
library(dplyr)
df %>%
group_by(id) %>%
mutate(Day = match(Date, unique(Date)))
and
df$Day <- as.numeric(with(df, ave(Date, id, FUN = function(x) match(x, unique(x)))))
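The core of these variants is match(x, unique(x)), which numbers the distinct values of x in order of first appearance. The sketch below uses a small made-up data frame (not the question's data) with interleaved ids, to show that the ave() form does not need the rows to be sorted by id; day_index is a hypothetical helper name.

```r
# Made-up example with the two subjects' rows interleaved.
df <- data.frame(
  id   = c(1, 2, 1, 2, 1, 2),
  Date = c("10/20/2018", "11/15/2018", "10/20/2018",
           "11/16/2018", "10/21/2018", "11/16/2018"),
  stringsAsFactors = FALSE
)

# Position of each value among the distinct values, in order of appearance.
day_index <- function(x) match(x, unique(x))

# ave() applies day_index within each id and puts results back in row order.
# Date is character here, so ave() returns character and needs as.numeric().
df$Day <- as.numeric(ave(df$Date, df$id, FUN = day_index))
df$Day
# [1] 1 1 1 2 2 2
```

The split()/unlist() version, in contrast, returns the values grouped by id, so it lines up with the rows only when the data are already sorted by id.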
If you want a slightly hacky dplyr version, you can convert the date column to a numeric date and then manipulate that number to give the desired result:
library(tidyverse)
library(lubridate)
df <- tibble(id=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
Date= c('10/20/2018', '10/20/2018', '10/20/2018', '10/21/2018', '10/21/2018', '10/21/2018',
'10/22/2018', '10/22/2018', '10/22/2018','11/15/2018', '11/15/2018', '11/15/2018',
'11/16/2018', '11/16/2018', '11/17/2018', '11/17/2018', '11/17/2018'),
Rating=c(2,3,5,1,7,9,4,5,9,1,3,4,3,1,0,2,9))
df %>%
group_by(id) %>%
mutate(
Date = mdy(Date),
Day = as.numeric(Date),
Day = Day-min(Day)+1)
# A tibble: 17 x 4
# Groups: id [2]
id Date Rating Day
<dbl> <date> <dbl> <dbl>
1 1 2018-10-20 2 1
2 1 2018-10-20 3 1
3 1 2018-10-20 5 1
4 1 2018-10-21 1 2
5 1 2018-10-21 7 2
6 1 2018-10-21 9 2
7 1 2018-10-22 4 3
8 1 2018-10-22 5 3
9 1 2018-10-22 9 3
10 2 2018-11-15 1 1
11 2 2018-11-15 3 1
12 2 2018-11-15 4 1
13 2 2018-11-16 3 2
14 2 2018-11-16 1 2
15 2 2018-11-17 0 3
16 2 2018-11-17 2 3
17 2 2018-11-17 9 3
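One caveat with the numeric-date approach: it counts calendar days elapsed since the first date, so it agrees with the "study day" numbering only when a subject's rating days are consecutive. A minimal base R sketch with made-up dates that skip two days (the variable names are illustrative):

```r
# Made-up dates for one subject, with a gap between the first and second rating day.
dates <- as.Date(c("10/20/2018", "10/20/2018", "10/23/2018"), format = "%m/%d/%Y")

# Calendar days since the first date (the numeric-date approach).
elapsed <- as.numeric(dates) - min(as.numeric(dates)) + 1
elapsed
# [1] 1 1 4

# Ordinal study day: first, second, ... distinct date.
ordinal <- match(dates, unique(dates))
ordinal
# [1] 1 1 2
```

Which of the two is wanted depends on whether skipped days should count toward the study day.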
I have a huge data frame like this:
df <- read.table(text="
id date
1 1 2016-12-01
2 2 2016-12-02
3 4 2017-01-03
4 6 2016-11-04
5 7 2017-11-05
6 9 2017-12-06", header=TRUE)
I randomly generate a 1 or 0 for each id with this code:
set.seed(5)
df %>%
arrange(id) %>%
mutate(
rn = runif(id),
discount = if_else(rn < 0.5, 0, 1)
)
It works perfectly until I add new rows to my data frame. Then my random numbers are different.
What I need is not just a random number for each id: that number also has to remain the same even if new rows are added.
That means:
id date discount
1 1 2016-12-01 1
2 2 2016-12-02 0
3 4 2017-01-03 0
4 6 2016-11-04 1
5 7 2017-11-05 1
6 9 2017-12-06 1
When new rows are added
id date discount
1 1 2016-12-01 1
2 2 2016-12-02 0
3 4 2017-01-03 0
4 6 2016-11-04 1
5 7 2017-11-05 1
6 9 2017-12-06 1
7 12 2017-12-06 0
8 13 2017-12-06 1
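You need to set the same seed again before running the code on the "new" data.frame. Because the random numbers are drawn in row order, the existing values are preserved only as long as the new ids sort after the old ones: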
set.seed(5) # first call
df %>%
arrange(id) %>%
mutate(
rn = runif(id),
discount = if_else(rn < 0.5, 0, 1)
)
# id date rn discount
# 1 1 2016-12-01 0.2002145 0
# 2 2 2016-12-02 0.6852186 1
# 3 4 2017-01-03 0.9168758 1
# 4 6 2016-11-04 0.2843995 0
# 5 7 2017-11-05 0.1046501 0
# 6 9 2017-12-06 0.7010575 1
set.seed(5) # added two rows, reset the seed
df2 %>%
arrange(id) %>%
mutate(
rn = runif(id),
discount = if_else(rn < 0.5, 0, 1)
)
# id date rn discount
# 1 1 2016-12-01 0.2002145 0
# 2 2 2016-12-02 0.6852186 1
# 3 4 2017-01-03 0.9168758 1
# 4 6 2016-11-04 0.2843995 0
# 5 7 2017-11-05 0.1046501 0
# 6 9 2017-12-06 0.7010575 1
# 7 12 2017-12-06 0.5279600 1
# 8 13 2017-12-06 0.8079352 1
Data:
df <- read.table(text="
id date
1 1 2016-12-01
2 2 2016-12-02
3 4 2017-01-03
4 6 2016-11-04
5 7 2017-11-05
6 9 2017-12-06", header=TRUE)
df2 <- read.table(text="
id date
1 1 2016-12-01
2 2 2016-12-02
3 4 2017-01-03
4 6 2016-11-04
5 7 2017-11-05
6 9 2017-12-06
7 12 2017-12-06
8 13 2017-12-06", header=TRUE)
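If new ids can land anywhere in the sort order, one possible alternative is to tie each draw to the id itself rather than to the row position. The following is only a sketch under that assumption: discount_for_id is a hypothetical helper, and adding the id to a base seed is an illustrative choice, not something from the answer above.

```r
# Hypothetical helper: each id gets its own seed, so its draw never depends
# on which other rows exist or where they are.
discount_for_id <- function(id, seed = 5) {
  vapply(id, function(i) {
    set.seed(seed + i)            # per-id seed; this overwrites the global RNG state
    if (runif(1) < 0.5) 0 else 1
  }, numeric(1))
}

old_ids <- c(1, 2, 4, 6, 7, 9)
new_ids <- c(12, 3, old_ids, 13)  # new rows added in arbitrary positions

old_discount <- discount_for_id(old_ids)
new_discount <- discount_for_id(new_ids)

# The original ids keep their values after the new rows are added.
identical(old_discount, new_discount[match(old_ids, new_ids)])
# [1] TRUE
```

Inside a pipeline this could be used as mutate(discount = discount_for_id(id)). Note that it reseeds the generator once per row, so any later random draws in the same session are affected.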
I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable to stay at 1 for all later rows of a cross section once a 1 has appeared in that cross section for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
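The reason cummax works here is that the running maximum of a 0/1 vector stays at 0 until the first 1 and can never drop back afterwards. A self-contained base R sketch on the question's data; the column name dummy_filled is introduced only so the original column stays visible for comparison.

```r
# Panel data from the question.
id    <- rep(1:5, each = 3)
time  <- rep(1:3, times = 5)
dummy <- c(0,0,1, 1,0,0, 0,1,0, 0,0,1, 0,1,0)
df <- data.frame(id, time, dummy)

# Running maximum within a single cross section.
cummax(c(0, 1, 0))
# [1] 0 1 1

# Apply it per id; ave() returns the values in the original row order.
df$dummy_filled <- ave(df$dummy, df$id, FUN = cummax)
df$dummy_filled
# [1] 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1
```

This relies on the rows being ordered by time within each id, as they are in the example.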
I have a dataset as below:
The outcome has no relationship with contact_date. When a subscriber responds to a cold call, we mark it as a successful contact attempt (1), otherwise as (0). The count is the number of times we have called the subscriber.
subscriber_id outcome contact_date queue multiple_number count
(int) (int) (date) (fctr) (int) (int)
1 1 1 2015-01-29 2 1 1
2 1 0 2015-02-21 2 1 2
3 1 0 2015-03-29 2 1 3
4 1 1 2015-04-30 2 1 4
5 2 0 2015-01-29 2 1 1
6 2 0 2015-02-21 2 1 2
7 2 0 2015-03-29 2 1 3
8 2 0 2015-04-30 2 1 4
9 2 1 2015-05-31 2 1 5
10 2 1 2015-08-25 5 1 6
11 2 0 2015-10-30 5 1 7
12 2 0 2015-12-14 5 1 8
13 3 1 2015-01-29 2 1 1
I would like to get the count number of the first outcome == 1 for each subscriber. Could you please tell me how I can get it? The final data set I would like is:
(Please note that some subscribers may not have any successful call; in that case, I would like to mark first_success as 0.)
subscriber_id first_success
1 1
2 5
3 1
...
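library(dplyr)
# Keep successful calls only, then take each subscriber's earliest one by date.
# Subscribers with no successful call are dropped here rather than set to 0.
data %>% group_by(subscriber_id) %>% filter(outcome==1) %>%
slice(which.min(contact_date)) %>% data.frame() %>%
select(subscriber_id, first_success = count)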
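To also return 0 for subscribers who never had a successful call, the filter can be replaced by a per-group summary. This is a minimal sketch on a reduced copy of the question's data; the data frame name calls and subscriber 4 (who has no success) are made up for illustration.

```r
library(dplyr)

# Reduced version of the question's data; subscriber 4 is a made-up case
# with no successful call.
calls <- data.frame(
  subscriber_id = c(rep(1L, 4), rep(2L, 8), 3L, rep(4L, 2)),
  outcome       = c(1L,0L,0L,1L, 0L,0L,0L,0L,1L,1L,0L,0L, 1L, 0L,0L),
  count         = c(1:4, 1:8, 1L, 1:2)
)

# Smallest count among successful calls, or 0 when there is none.
result <- calls %>%
  group_by(subscriber_id) %>%
  summarise(first_success = if (any(outcome == 1)) min(count[outcome == 1]) else 0L)

result$first_success
# [1] 1 5 1 0
```

This picks the first success by call count rather than by contact_date; the two agree whenever the count increases with the date, as in the example.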