Extracting row in time series in R

I'm trying to extract the rows from a data frame containing the lowest value in a specific column:
income = c(2, 3, 5, 5, -15, 2, 1)
balance = c(15, 17, 20, 25, 30, 15, 17)
date = as.Date(c("2016/02/11", "2016/02/14", "2017/02/16", "2016/03/01", "2017/03/12", "2016/04/11", "2017/04/24"))
df = data.frame(income, balance, date)
Now I want to get the rows containing the minimum "balance" value from each month, so that the outcome would be a data frame looking like this:
income balance date
1 2 15 2016-02-11
2 5 25 2016-03-01
3 2 15 2016-04-11
I have tried the aggregate function:
bymonth = aggregate(balance~months(date), data=df,FUN=min)
print(bymonth)
But this gives me the following output:
months(date) balance
1 April 15
2 Februar 15
3 Marts 25
Help!

We can do this with dplyr. After grouping by the month of 'date', we slice the row that has the minimum 'balance' and remove the 'mth' column using select:
library(dplyr)
df %>%
group_by(mth = months(date)) %>%
slice(which.min(balance)) %>%
ungroup() %>%
select(-mth)
# A tibble: 3 x 3
# income balance date
# <dbl> <dbl> <date>
#1 2 15 2016-04-11
#2 2 15 2016-02-11
#3 5 25 2016-03-01
Note that if there are ties for the minimum 'balance', use filter(balance == min(balance)) in place of slice to keep all tied rows.
Or use ave from base R to create a logical vector and use that to subset the rows of 'df':
df[with(df, ave(balance, months(date), FUN = min)==balance),]
# income balance date
#1 2 15 2016-02-11
#4 5 25 2016-03-01
#6 2 15 2016-04-11
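One thing to watch with the sample data: the dates span 2016 and 2017, and months(date) returns only the month name, so February 2016 and February 2017 land in the same group (the month names also depend on the locale). If each calendar month of each year should be treated separately, a year-month key can be used instead. The following is a minimal base R sketch under that assumption; the names ym and res are only illustrative.

```r
income  <- c(2, 3, 5, 5, -15, 2, 1)
balance <- c(15, 17, 20, 25, 30, 15, 17)
date <- as.Date(c("2016/02/11", "2016/02/14", "2017/02/16", "2016/03/01",
                  "2017/03/12", "2016/04/11", "2017/04/24"))
df <- data.frame(income, balance, date)

# Year-month key such as "2016-02"; numeric, so it does not depend on the locale
ym <- format(df$date, "%Y-%m")

# Keep every row whose balance equals the minimum of its year-month group
res <- df[ave(df$balance, ym, FUN = min) == df$balance, ]
res
```

With this key the 2017 rows form their own groups, so the result has six rows rather than three.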

Related

R: creating a longitudinal dataset using tidyr

I am looking to generate a longitudinal dataset. I have generated my pat numbers and treatment groups:
library(dplyr)
set.seed(420)
Pat_TNO <- 1001:1618
data.frame(Pat_TNO = Pat_TNO) %>%
rowwise() %>%
mutate(
trt = rbinom(1, 1, 0.5)
)
My timepoints (in days) are:
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
How can I pivot this dataset using the vector timepoint_days, so that I have 10 rows per participant and the column names Pat_TNO, trt, timepoint_days?
You can use the unnest function from tidyr to achieve what you want.
Here is the code
library(dplyr)
library(tidyr)
set.seed(420)
Pat_TNO <- 1001:1618
x <- data.frame(Pat_TNO = Pat_TNO) %>%
rowwise() %>%
mutate(
trt = rbinom(1, 1, 0.5)
)
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
x %>%
mutate(timepoint_days = list(timepoint_days)) %>%
unnest(cols = timepoint_days)
Output
# A tibble: 6,180 × 3
Pat_TNO trt timepoint_days
<int> <int> <dbl>
1 1001 1 14
2 1001 1 28
3 1001 1 42
4 1001 1 56
5 1001 1 70
6 1001 1 84
7 1001 1 112
8 1001 1 140
9 1001 1 168
10 1001 1 364
# … with 6,170 more rows
Here I used the mutate function to add a list column holding timepoint_days in every row. unnest then expands each row into 10 rows, one per timepoint, for each participant.
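The same shape can probably also be reached without a list column, by taking the Cartesian product of the participants and the timepoints. This is a base R sketch, not the approach from the answer above; it draws trt in one vectorised rbinom call, and the names pats and long are made up for illustration.

```r
set.seed(420)
pats <- data.frame(Pat_TNO = 1001:1618)
pats$trt <- rbinom(nrow(pats), 1, 0.5)   # one treatment draw per participant

timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks

# merge() with by = NULL returns the Cartesian product of the two data frames
long <- merge(pats, data.frame(timepoint_days = timepoint_days), by = NULL)

# Sort so that each participant's 10 rows sit together in time order
long <- long[order(long$Pat_TNO, long$timepoint_days), ]
head(long, 10)
```

Each participant keeps a single trt value across all 10 rows because the treatment is drawn before the product is taken.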

How to summarize `Number of days since first date` and `Number of days seen` by ID and for a large data frame

The dataframe df1 summarizes detections of individuals (ID) through the time (Date). As a short example:
library(lubridate) # provides ymd()
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Date= ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days since the first detection of the individual (Ndays) and the number of distinct days on which the individual has been detected since the first detection (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays by Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using various summarising functions in dplyr:
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
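df12 <- aggregate(Date~ID, df1, function(x) c(Ndays = as.integer(max(x) - min(x)), Ndifdays = length(unique(x))))
df12 <- do.call(data.frame, df12) # flatten the matrix column returned by aggregate
names(df12) <- c("ID", "Ndays", "Ndifdays")
df12$Prop <- df12$Ndifdays/df12$Ndays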
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', count the unique 'Date' values with n_distinct to get 'Ndifdays', and divide Ndifdays by Ndays to get 'Prop':
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
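For comparison, here is a package-free sketch of the same summary using split and lapply. It builds the dates with as.Date instead of ymd, and the helper name summarise_id is hypothetical.

```r
df1 <- data.frame(
  ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  Date = as.Date(c("2016-08-21", "2016-08-24", "2016-08-23", "2016-08-29",
                   "2016-08-27", "2016-09-02", "2016-09-01", "2016-09-09",
                   "2016-09-01", "2016-09-10"))
)

# Hypothetical helper: summarise the detections of one individual
summarise_id <- function(d) {
  ndays <- as.integer(max(d$Date) - min(d$Date))  # span from first to last detection
  ndif  <- length(unique(d$Date))                 # distinct detection days
  data.frame(ID = d$ID[1], Ndays = ndays, Ndifdays = ndif, Prop = ndif / ndays)
}

res <- do.call(rbind, lapply(split(df1, df1$ID), summarise_id))
res
```

Whichever version is used, an individual detected on a single day only has Ndays equal to 0, so Prop becomes Inf; such cases may need separate handling in a large data frame.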

Combine rows with consecutive dates into single row with start and end dates

I have a dataframe of events that looks something like this:
EVENT DATE LONG LAT TYPE
1 1/1/2000 23 45 A
2 2/1/2000 23 45 B
3 3/1/2000 23 45 B
3 5/2/2000 22 56 A
4 6/2/2000 19 21 A
I'd like to collapse this so that any events that occur on consecutive days at the same location (as defined by LONG, LAT) are collapsed into a single event with a START and END date and a concatenated column of the TYPES involved.
Thus the above table would become:
EVENT START-DATE END-DATE LONG LAT TYPE
1 1/1/2000 3/1/2000 23 45 ABB
2 5/2/2000 5/2/2000 22 56 A
3 6/2/2000 6/2/2000 19 21 A
Any advice on how to best approach this would be greatly appreciated.
Here's a modified version of Ronak Shah's solution, taking non-consecutive events at the same location as separate event periods.
# expanded data sample
df <- data.frame(
DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
"2000-02-05", "2000-02-06", "2000-02-07"), format = "%Y-%m-%d"),
LONG = c(23, 23, 23, 23, 22, 19, 22),
LAT = c(45, 45, 45, 45, 56, 21, 56),
TYPE = c("A", "B", "B", "A", "A", "B", "A")
)
library(dplyr)
df %>%
group_by(LONG, LAT) %>%
arrange(DATE) %>%
mutate(DATE.diff = c(1, diff(DATE))) %>%
mutate(PERIOD = cumsum(DATE.diff != 1)) %>%
ungroup() %>%
group_by(LONG, LAT, PERIOD) %>%
summarise(START_DATE = min(DATE),
END_DATE = max(DATE),
TYPE = paste(TYPE, collapse = "")) %>%
ungroup()
# A tibble: 5 x 6
LONG LAT PERIOD START_DATE END_DATE TYPE
<dbl> <dbl> <int> <date> <date> <chr>
1 19 21 0 2000-02-06 2000-02-06 B
2 22 56 0 2000-02-05 2000-02-05 A
3 22 56 1 2000-02-07 2000-02-07 A
4 23 45 0 2000-01-01 2000-01-03 ABB
5 23 45 1 2000-01-05 2000-01-05 A
Edit to add explanation for what's going on with the "PERIOD" variable.
For simplicity, let's consider some sequential consecutive & non-consecutive events at the same location, so we can skip the group_by(LONG, LAT) & arrange(DATE) steps:
# sample dataset of 10 events at the same location.
# first 3 are on consecutive days, next 2 are on consecutive days,
# next 4 are on consecutive days, & last 1 is on its own.
df2 <- data.frame(
DATE = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
"2001-01-05", "2001-01-06",
"2001-02-01", "2001-02-02", "2001-02-03", "2001-02-04",
"2001-04-01"), format = "%Y-%m-%d"),
LONG = rep(23, 10),
LAT = rep(45, 10),
TYPE = LETTERS[1:10]
)
As an intermediate step, we create some helper variables:
"DATE.diff" counts the difference between current row's date & previous row's date. Since the first row has no date before "2001-01-01", we default the difference to 1.
"non.consecutive" indicates whether the calculated date difference is not 1 (i.e. not consecutive from previous day), or 1 (i.e. consecutive from previous day). If you need to account for same-day events at the same location in the dataset, you can change the calculation from DATE.diff != 1 to DATE.diff > 1 here.
"PERIOD" keeps track of the number of TRUE results in the "non.consecutive" variable. Starting from the first row, every time a row is non-consecutive from the previous row, "PERIOD" increments by 1.
As a result of the helper variables, "PERIOD" takes on a different value for each group of consecutive dates.
df2.intermediate <- df2 %>%
mutate(DATE.diff = c(1, diff(DATE))) %>%
mutate(non.consecutive = DATE.diff != 1) %>%
mutate(PERIOD = cumsum(non.consecutive))
> df2.intermediate
DATE LONG LAT TYPE DATE.diff non.consecutive PERIOD
1 2001-01-01 23 45 A 1 FALSE 0
2 2001-01-02 23 45 B 1 FALSE 0
3 2001-01-03 23 45 C 1 FALSE 0
4 2001-01-05 23 45 D 2 TRUE 1
5 2001-01-06 23 45 E 1 FALSE 1
6 2001-02-01 23 45 F 26 TRUE 2
7 2001-02-02 23 45 G 1 FALSE 2
8 2001-02-03 23 45 H 1 FALSE 2
9 2001-02-04 23 45 I 1 FALSE 2
10 2001-04-01 23 45 J 56 TRUE 3
We can then treat "PERIOD" as a grouping variable in order to find the start / end date & events within each period:
df2.intermediate %>%
group_by(PERIOD) %>%
summarise(START_DATE = min(DATE),
END_DATE = max(DATE),
TYPE = paste(TYPE, collapse = "")) %>%
ungroup()
# A tibble: 4 x 4
PERIOD START_DATE END_DATE TYPE
<int> <date> <date> <chr>
1 0 2001-01-01 2001-01-03 ABC
2 1 2001-01-05 2001-01-06 DE
3 2 2001-02-01 2001-02-04 FGHI
4 3 2001-04-01 2001-04-01 J
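The difference between the two conditions mentioned above (DATE.diff != 1 versus DATE.diff > 1) only shows up when two events share a date. A small base R sketch on a made-up date vector illustrates it; the names gap, period_strict and period_loose are illustrative only.

```r
# Four events at one location; the first two fall on the same day
dates <- as.Date(c("2001-01-01", "2001-01-01", "2001-01-02", "2001-01-04"))

# Day difference to the previous event, with the first row defaulted to 1
gap <- c(1, as.numeric(diff(dates)))

# != 1 treats a same-day event (gap 0) as the start of a new period
period_strict <- cumsum(gap != 1)

# > 1 starts a new period only after a gap of more than one day
period_loose <- cumsum(gap > 1)

data.frame(dates, gap, period_strict, period_loose)
```

With != 1 the same-day pair is split across periods 0 and 1, while with > 1 the first three events stay together in period 0 and only the last event starts period 1.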
With dplyr, we can group by LAT and LONG and select the maximum and minimum DATE for each group and paste the TYPE column together.
library(dplyr)
df %>%
group_by(LONG, LAT) %>%
summarise(start_date = min(as.Date(DATE, "%d/%m/%Y")),
end_date = max(as.Date(DATE, "%d/%m/%Y")),
type = paste0(TYPE, collapse = ""))
# LONG LAT start_date end_date type
# <int> <int> <date> <date> <chr>
#1 19 21 2000-02-06 2000-02-06 A
#2 22 56 2000-02-05 2000-02-05 A
#3 23 45 2000-01-01 2000-01-03 ABB

Counting the number of observations by groups with conditions in R

I would like to count the number of observations within each group using conditions in R.
For example, I would like to count how many observations there are for ID "A" in every 10-day window.
ID (A,A,A,A,A,A,A,A)
Day (7,14,17,25,35,37,42,57)
X (9,20,14,24,23,30,20,40)
Output Image
(In the first 10 days, we have one observation for ID "A". Days:7
In the next 10 days, we have two observations for ID "A". Days:14,17)
ID (A,A,A,A,A,A,A,A)
Day_10 (1,2,3,4,5,6)
Count_10 (1,2,1,2,1,1)
Also it would be great if I can calculate the number of observations before and after the certain values. For the given X value, I would like to know how many observation between [X-10, X+10] within ID "A".
The output image would be as follows:
ID (A,A,A,A,A,A,A,A)
X (9,20,14,24,23,30,40,50)
Count_X10 (3,3,3,3,3,3,2,1)
Count_X10: for a given X(=9) there are three observations within ID "A" [-1,19]
Here are the data loaded as a data.frame to keep the observations connected. Note that I added a second group to show how to handle that:
df <-
data.frame(
ID = rep(c("A","B"), each = 8)
, Day = c(7,14,17,25,35,37,42,57)
, X = c(9,20,14,24,23,30,20,40)
)
Then, I used dplyr to pass the data through a series of steps. First, I split by the ID column, then used lapply to run a function on each of those ID groups, including calculating two columns of interest (then returning the whole data.frame). Finally, I stitch the rows back together with bind_rows
library(dplyr) # for %>%, between() and bind_rows()
df %>%
split(.$ID) %>%
lapply(function(x){
x$nextTen <- sapply(x$Day, function(thisDay){
sum(between(x$Day, thisDay, thisDay + 10))
})
x$plusMinusTen <- sapply(x$Day, function(thisDay){
sum(between(x$Day, thisDay - 10, thisDay + 10))
})
return(x)
}) %>%
bind_rows()
The result is
ID Day X nextTen plusMinusTen
1 A 7 9 3 3
2 A 14 20 2 3
3 A 17 14 2 4
4 A 25 24 2 3
5 A 35 23 3 4
6 A 37 30 2 3
7 A 42 20 1 3
8 A 57 40 1 1
9 B 7 9 3 3
10 B 14 20 2 3
11 B 17 14 2 4
12 B 25 24 2 3
13 B 35 23 3 4
14 B 37 30 2 3
15 B 42 20 1 3
16 B 57 40 1 1
But any condition you are interested in could be added to that lapply step.
Your sample data :
df = data.frame(
ID = rep('A', 8),
Day = c(7, 14, 17, 25, 35, 37, 42, 57),
X = c(9, 20, 14, 24, 23, 30, 40, 50),
stringsAsFactors = FALSE)
Note: You give two different values for vector X. I suppose it is c(9, 20, 14, 24, 23, 30, 40, 50), and not c(9, 20, 14, 24, 23, 30, 20, 40).
First calculation:
library(dplyr)
output1 = df %>%
mutate(Day_10 = ceiling(Day/10)) %>%
group_by(ID, Day_10) %>%
summarise(Count_10 = n())
The mutate step creates the 10-day bins by rounding Day/10 up with ceiling. Then we group by ID and Day_10 and count the number of observations within each group.
> output1
ID Day_10 Count_10
<chr> <dbl> <int>
1 A 1 1
2 A 2 2
3 A 3 1
4 A 4 2
5 A 5 1
6 A 6 1
Second calculation:
output2 = df %>%
group_by(ID) %>%
mutate(Count_X10 = sapply(X, function(x){sum(Day >= x-10 & Day <= x+10)})) %>%
select(-Day)
We group by ID, and for each X we count the number of days with this ID that are between X-10 and X+10.
> output2
ID X Count_X10
<chr> <dbl> <int>
1 A 9 3
2 A 20 3
3 A 14 3
4 A 24 3
5 A 23 3
6 A 30 3
7 A 40 3
8 A 50 2
Note: I suppose there's a mistake in the desired output you give, because for instance, when X = 50, there are 2 observations within [40, 60] with ID "A": days 42 and 57.
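For a single ID, both counts can also be checked in base R without sapply. The sketch below assumes the second version of X from the question and uses outer to compare every X against every Day; the names within10, Count_X10 and Count_10 are illustrative.

```r
Day <- c(7, 14, 17, 25, 35, 37, 42, 57)
X   <- c(9, 20, 14, 24, 23, 30, 40, 50)

# Matrix with one row per X and one column per Day:
# TRUE where the day lies within [X - 10, X + 10]
within10 <- abs(outer(X, Day, "-")) <= 10

# Number of days inside the window for each X
Count_X10 <- rowSums(within10)
Count_X10

# Observations per 10-day bin; ceiling() puts Day 10 in bin 1 and Day 11 in bin 2
Count_10 <- table(ceiling(Day / 10))
Count_10
```

The outer matrix has length(X) * length(Day) cells, so for many observations per ID it would have to be applied group by group, as in the dplyr version.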

R: Replacing NA values by mean of hour with dplyr

I'm learning the dplyr package in R and I really like it. But now I'm dealing with NA values in my data.
I would like to replace any NA by the average of the corresponding hour, for example with this very easy example:
#create an example
day = c(1, 1, 2, 2, 3, 3)
hour = c(8, 16, 8, 16, 8, 16)
profit = c(100, 200, 50, 60, NA, NA)
shop.data = data.frame(day, hour, profit)
#calculate the average for each hour
library(dplyr)
mean.profit <- shop.data %>%
group_by(hour) %>%
summarize(mean=mean(profit, na.rm=TRUE))
> mean.profit
Source: local data frame [2 x 2]
hour mean
1 8 75
2 16 130
Can I use the dplyr transform command to replace the NA's of day 3 in the profit with 75 (for 8:00) and 130 (for 16:00)?
Try
shop.data %>%
group_by(hour) %>%
mutate(profit= ifelse(is.na(profit), mean(profit, na.rm=TRUE), profit))
# day hour profit
#1 1 8 100
#2 1 16 200
#3 2 8 50
#4 2 16 60
#5 3 8 75
#6 3 16 130
Or you could use replace
shop.data %>%
group_by(hour) %>%
mutate(profit= replace(profit, is.na(profit), mean(profit, na.rm=TRUE)))
A (less elegant) approach with base functions:
transform(shop.data,
profit = ifelse(is.na(profit),
ave(profit, hour, FUN = function(x) mean(x, na.rm = TRUE)),
profit))
# day hour profit
# 1 1 8 100
# 2 1 16 200
# 3 2 8 50
# 4 2 16 60
# 5 3 8 75
# 6 3 16 130
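One edge case applies to all of the approaches above: if every value for an hour is NA, mean(..., na.rm = TRUE) returns NaN, so those rows are filled with NaN rather than a number. The sketch below adds a made-up hour 20 that has no observed profit to show this; the names hour_mean and filled are illustrative.

```r
# Example data extended with an hour (20) that has no observed profit at all
day    <- c(1, 1, 2, 2, 3, 3, 3)
hour   <- c(8, 16, 8, 16, 8, 16, 20)
profit <- c(100, 200, 50, 60, NA, NA, NA)
shop.data <- data.frame(day, hour, profit)

# Per-hour mean, repeated for every row of that hour
hour_mean <- ave(shop.data$profit, shop.data$hour,
                 FUN = function(x) mean(x, na.rm = TRUE))

# Replace missing profits by the mean of their hour
filled <- ifelse(is.na(shop.data$profit), hour_mean, shop.data$profit)
filled
```

Rows that come back as NaN can be detected with is.nan() and handled separately, for example by falling back to an overall mean.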
