I have a data frame that contains id (with duplicates), date (with duplicates), and value. The values are recorded on consecutive days. I want to group the data frame by id and by date (in blocks of n consecutive days) and find the mean of value for each block, returning NA if the last block does not contain n days.
id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2
. . .
. . .
. . .
20 2012-2-6 10
Desired output with n consecutive days set to 3:
id date value group_n_consecutive_days mean_n_consecutive_days
1 2016-10-5 2 1 2
1 2016-10-6 3 1 2
1 2016-10-7 1 1 2
1 2016-10-8 2 2 NA
1 2016-10-9 5 2 NA
2 2013-10-6 2 1 4
.
.
.
.
20 2012-2-6 10 6 25
The data in the question is sorted and consecutive within id, so we assume that is the case. Also, when the question refers to duplicate dates, we assume this means that different id values can share a date, but that within an id the dates are unique and consecutive. Now, using the data shown reproducibly in Note 2 at the end, group by id and compute the group numbers using gl. Then, grouping by id and group_no, take the mean of each group of 3, or NA for smaller groups.
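library(dplyr)
DF %>%
  group_by(id) %>%
  # as.integer() gives plain integer codes; c() on a factor returns a factor in R >= 4.1
  mutate(group_no = as.integer(gl(n(), 3, n()))) %>%
  # .add replaces the deprecated add argument (dplyr >= 1.0.0)
  group_by(group_no, .add = TRUE) %>%
  mutate(mean = if (n() == 3) mean(value) else NA_real_) %>%
  ungroup()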
giving:
# A tibble: 6 x 5
id date value group_no mean
<int> <date> <int> <int> <dbl>
1 1 2016-10-05 2 1 2
2 1 2016-10-06 3 1 2
3 1 2016-10-07 1 1 2
4 1 2016-10-08 2 2 NA
5 1 2016-10-09 5 2 NA
6 2 2013-10-06 2 1 NA
Note 1
An alternative to gl(...) could be cumsum(rep(1:3, length = n()) == 1), and an alternative to if (n() == 3) mean(value) else NA could be mean(head(c(value, NA, NA), 3)).
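As a quick check of Note 1, the following base R sketch compares the two ways of numbering the groups and the two ways of taking the mean, using the five values of id 1. The helper names mean_if and mean_pad are made up for this illustration, and the padding trick assumes no group is longer than 3 rows.

```r
n <- 5                      # number of rows for id 1 in the example data
value <- c(2, 3, 1, 2, 5)   # the values recorded for id 1

# Two ways to number blocks of 3 consecutive rows
grp_gl <- as.integer(gl(n, 3, n))
grp_cumsum <- cumsum(rep(1:3, length.out = n) == 1)
identical(grp_gl, grp_cumsum)

# Two ways to return the mean only for complete blocks (hypothetical helpers)
mean_if <- function(v) if (length(v) == 3) mean(v) else NA_real_
mean_pad <- function(v) mean(head(c(v, NA, NA), 3))

full <- value[grp_gl == 1]    # complete block of 3
short <- value[grp_gl == 2]   # incomplete last block of 2
c(mean_if(full), mean_pad(full))
c(mean_if(short), mean_pad(short))
```

Padding with NA works because mean() returns NA as soon as any of the three retained elements is missing.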
Note 2
The input data in reproducible form was assumed to be:
Lines <- "id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)
Suppose I have the following data frame:
year subject grade study_time
1 1 a 30 20
2 2 a 60 60
3 1 b 30 10
4 2 b 90 100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
group_by(subject) %>%
mutate(RN = row_number()) %>%
mutate(study_time = study_time/study_time[RN ==1],
grade = grade/grade[RN==1]) %>%
select(-RN)
I would get the following output
year subject grade study_time
1 1 a 1 1
2 2 a 2 3
3 1 b 1 1
4 2 b 3 10
It's fairly easy to do when I know what the variable names are. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables that I need to mutate; I'll only know the names of the variables not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to divide each selected column by its first element within the group.
library(dplyr)
df %>%
  group_by(subject) %>%
  # columns are selected by name; funs() is deprecated, so a formula lambda is used
  mutate_at(vars(grade, study_time), ~ . / first(.))
# A tibble: 4 x 4
# Groups: subject [2]
# year subject grade study_time
# <int> <chr> <dbl> <dbl>
#1 1 a 1 1
#2 2 a 2 3
#3 1 b 1 1
#4 2 b 3 10
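Since the question asks for a function that only knows which columns not to mutate, here is a minimal sketch of that idea, assuming dplyr 1.0.0 or later, where across() is available and skips grouping columns inside mutate(). The function name divide_by_first and its arguments group_col and keep are hypothetical, introduced only for this illustration.

```r
library(dplyr)

df <- data.frame(
  year = c(1L, 2L, 1L, 2L),
  subject = c("a", "a", "b", "b"),
  grade = c(30, 60, 30, 90),
  study_time = c(20, 60, 10, 100)
)

# Hypothetical helper: divide every column except the grouping column and
# the columns named in `keep` by its first value within each group.
divide_by_first <- function(data, group_col, keep) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    mutate(across(-all_of(keep), ~ .x / first(.x))) %>%
    ungroup()
}

res <- divide_by_first(df, group_col = "subject", keep = "year")
res
```

Passing the column names as character vectors keeps the helper usable when the names are only known at run time.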
I'm trying to restructure my data to recode a variable ('Event') so that I can determine the number of days between events. Essentially, I want to count the number of days that pass between events. Importantly, I only want to start the count after the first event has occurred for each person. Here is a sample data frame:
Day = c(1:8,1:8)
Event = c(0,0,1,NA,0,0,1,0,0,1,NA,NA,0,1,0,1)
Person = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
sample <- data.frame(Person,Day,Event);sample
I would like it to end up like this:
NewEvent = c(NA,NA,0,1,2,3,0,1,NA,0,1,2,3,0,1,0)
sample2 <- data.frame(Person,Day,NewEvent);sample2
I'm new to R, unfamiliar with loops or if statements, and I could not find a thread which already answered this type of issue, so any help would be greatly appreciated. Thank you!
One approach is to group on Person and number the events with cumsum(!is.na(Event) & Event == 1), which treats NA as "no event". Then group on both Person and EventNum and count the days passed since the most recent event. The solution is:
library(dplyr)
sample %>% group_by(Person) %>%
mutate(EventNum = cumsum(!is.na(Event) & Event == 1)) %>%
group_by(Person, EventNum) %>%
mutate(NewEvent = ifelse(EventNum ==0, NA, row_number() - 1)) %>%
ungroup() %>%
select(Person, Day, NewEvent) %>%
as.data.frame()
# Person Day NewEvent
# 1 1 1 NA
# 2 1 2 NA
# 3 1 3 0
# 4 1 4 1
# 5 1 5 2
# 6 1 6 3
# 7 1 7 0
# 8 1 8 1
# 9 2 1 NA
# 10 2 2 0
# 11 2 3 1
# 12 2 4 2
# 13 2 5 3
# 14 2 6 0
# 15 2 7 1
# 16 2 8 0
Note: If the data is not sorted by Day within each Person, add arrange(Person, Day) before the group_by() in the code above.
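To see how the cumsum step works, here is a small base R sketch for Person 1 only. It is an illustration of the same idea rather than a replacement for the dplyr code; the variable names is_event and days_since are made up for this example.

```r
Event <- c(0, 0, 1, NA, 0, 0, 1, 0)   # Person 1 from the sample data

# TRUE only where an event is recorded; NA counts as "no event"
is_event <- !is.na(Event) & Event == 1

# Running count of events: 0 before the first event, then 1, 2, ...
EventNum <- cumsum(is_event)

# Position within each EventNum block, starting from 0 on the event day
days_since <- ave(seq_along(Event), EventNum, FUN = seq_along) - 1

# No count is wanted before the first event
NewEvent <- ifelse(EventNum == 0, NA, days_since)
NewEvent
```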
I am having trouble working out how to recover the individual values from a running mean in an R data frame.
I have an R dataframe:
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
where Mean is the running mean of the first x measurements for the specific ID in the data frame.
To find the individual values at each x value rather than the mean, I was thinking that I needed to apply a recursive function on the dataframe and group by the ID. How could I do this in a dataframe while grouping by one of the values when any apply function wouldn't have access to the previous entry in the dataframe?
When completed and appended to the dataframe, I am hoping it to look like this:
x ID Mean IndivValues
1 1 1 1
1 2 5 5
2 1 3 5
2 2 6 7
It's much easier to calculate this by going from running totals to individual observations, as below:
Example data.frame:
df <- read.table(text='
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
', header=T)
Solution:
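library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(
    # running total = running mean * number of observations so far
    total = Mean * x,
    # each observation is the increase in the running total
    ind_value = total - lag(total, default = 0))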
## A tibble: 4 x 5
## Groups: ID [2]
# x ID Mean total ind_value
# <int> <int> <int> <int> <int>
#1 1 1 1 1 1
#2 1 2 5 5 5
#3 2 1 3 6 5
#4 2 2 6 12 7
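The reasoning behind the total column can be checked with a round trip in base R: start from known individual values, build the running mean, then recover the values. This sketch uses the two values for ID 1 and assumes x counts the observations 1, 2, ... with none missing.

```r
values <- c(1, 5)          # individual values for ID 1
x <- seq_along(values)     # observation counter: 1, 2

# Running mean of the first x observations
running_mean <- cumsum(values) / x

# Invert it: running total is mean * x, and each value is the
# difference between consecutive running totals
running_total <- running_mean * x
recovered <- diff(c(0, running_total))
recovered
```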
I have a data frame with two columns: Patient_id and time (the time of each visit to the doctor).
I would like to add a new column "timestart" that is 0 in the first row for each Patient_id and, in the other rows with the same id, holds the previous value of the time column.
I thought of doing this with a for loop, but I am a new R user and I don’t know how.
Thanks in advance.
We can group by 'Patient_id' and create the new column with the lag of 'time'
library(dplyr)
df1 %>%
group_by(Patient_id) %>%
mutate(timestart = lag(time, default = 0))
# Patient_id time timestart
# <int> <int> <int>
#1 1 1 0
#2 1 2 1
#3 1 3 2
#4 2 1 0
#5 2 2 1
#6 2 3 2
data
df1 <- data.frame(Patient_id = rep(1:2, each = 3), time = 1:3)
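For readers who prefer not to load dplyr, the same shift can be sketched in base R with ave(). This is an alternative illustration rather than the method used above, and it assumes the rows are already ordered by time within each Patient_id.

```r
df1 <- data.frame(Patient_id = rep(1:2, each = 3), time = 1:3)

# Within each patient, put 0 first and shift the remaining times down by one row
df1$timestart <- ave(df1$time, df1$Patient_id,
                     FUN = function(t) c(0, head(t, -1)))
df1
```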