Create a sequence of values by group between a min and max interval using dplyr - r

This is surely a basic question, but I couldn't find a way to solve it.
I need to create a sequence of values from a minimum (dds_min) to a maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <- as.numeric(c("47.2", "40"))
dds_max <- as.numeric(c("122", "105"))
dds_min.max <- as.data.frame(cbind(fs, dds_min, dds_max))
And this is what I did....
dss_levels <- dds_min.max %>%
  group_by(fs) %>%
  mutate(dds = seq(dds_min, dds_max, length.out = 100))
I intended to create a new variable (dds) that is 100 values long and starts and ends at different values depending on "fs". My expectation was to end up with another data frame (dss_levels) with two columns (fs and dds) and 200 values in it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!

I make the sequence length 5 for illustrative purposes; you can change it to 100.
library(purrr)
library(tidyr)
dds_min.max %>%
  mutate(dds = map2(dds_min, dds_max, seq, length.out = 5)) %>%
  unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <- c(47.2, 40)
dds_max <- c(122, 105)
dds_min.max <- data.frame(fs, dds_min, dds_max)
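As an aside, a newer dplyr feature handles this directly without map2/unnest; a minimal sketch, assuming dplyr >= 1.1.0 (where reframe() may return any number of rows per group) and the corrected numeric data above:
library(dplyr)
dss_levels <- dds_min.max %>%
  group_by(fs) %>%
  reframe(dds = seq(dds_min, dds_max, length.out = 100))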

Related

Using Rolling Average to Calculate over Window of Values

I am trying to calculate rolling averages of Heart Rate over 15 second intervals. I have millisecond data for many participants, so the millisecond values can potentially be repeated multiple times, and because the time readings are inconsistent, creating the intervals by row count is not viable.
Below is a small sample of the data for one participant. Data for another participant would obviously feature different millisecond data taken at different intervals.
Ideal output would involve a new column with the rolling average for each value of millisecond data.
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(MS, HR)
I have tried a few packages (namely Zoo's suite of rolling functions) but have had trouble applying them to this problem.
Thank you!
rollapplyr in zoo accepts a vector of widths, and findInterval can be used to calculate the index in MS 15 seconds ago, so if we subtract that from 1:n we get w, the number of positions to average. Exactly which intervals to produce is not discussed in the question, so we will assume that the right-hand edge of each interval is at an input point.
library(zoo)
w <- with(df, seq_along(MS) - findInterval(MS - 15000, MS))
transform(df, roll = rollapplyr(HR, w, mean, fill = NA))
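If the data contains several participants, the same width-vector idea could be applied within each participant; a sketch, assuming a hypothetical ID column and using dplyr for the grouping:
library(dplyr)
library(zoo)
df %>%
  group_by(ID) %>%
  mutate(w = seq_along(MS) - findInterval(MS - 15000, MS),
         roll = rollapplyr(HR, w, mean, fill = NA)) %>%
  ungroup()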
An option using non-equi join in data.table which also handles an ID:
library(data.table)
setDT(df)[, avgHR :=
  df[.(ID = ID, start = MS - 15000, end = MS), on = .(ID, MS >= start, MS <= end),
     by = .EACHI, mean(HR)]$V1
]
output:
ID MS HR avgHR
1: 1 36148 84 84.00000
2: 1 36753 84 84.00000
3: 1 37364 84 84.00000
4: 1 38062 84 84.00000
5: 1 38737 84 84.00000
6: 1 39580 96 86.00000
7: 1 40029 84 85.71429
8: 1 40387 84 85.50000
9: 1 41208 96 86.66667
10: 1 42006 84 86.40000
11: 1 42796 84 86.18182
12: 1 43533 96 87.00000
13: 1 44274 84 86.76923
14: 1 44988 84 86.57143
15: 1 45696 96 87.20000
16: 1 46398 84 87.00000
17: 1 47079 84 86.82353
18: 1 47742 84 86.66667
19: 1 48429 84 86.52632
20: 1 49135 84 86.40000
21: 1 49861 84 86.28571
22: 1 50591 84 86.18182
23: 1 51324 84 86.18182
24: 1 52059 84 86.18182
ID MS HR avgHR
data:
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84)
df <- data.frame(ID=1, MS, HR)
I'm not totally sure how you want to apply the 15 s rolling average, but here is one way to go about what I think you're looking for. First we subset the data between 7.5 s before and 7.5 s after each point, then we take the average. This, however, will have an edge effect, since there is no 7.5 s of data before the first value.
library(tidyverse)
roll_vec <- c()
for (i in 1:nrow(df)) {
  ref <- df$MS[[i]]
  val <- df %>%
    filter(MS <= ref + 7500 & MS >= ref - 7500) %>%
    pull(HR) %>%
    mean
  roll_vec[[i]] <- val
}
df %>%
mutate(roll_15s = roll_vec)
#> MS HR roll_15s
#> 1 36148 84 87.00000
#> 2 36753 84 87.00000
#> 3 37364 84 86.76923
#> 4 38062 84 86.57143
#> 5 38737 84 86.57143
#> 6 39580 96 86.57143
#> 7 40029 84 86.57143
#> 8 40387 84 86.57143
#> 9 41208 96 86.57143
#> 10 42006 84 86.57143
#> 11 42796 84 86.57143
#> 12 43533 96 86.57143
#> 13 44274 84 87.00000
#> 14 44988 84 87.27273
#> 15 4569 96 96.00000
df %>%
  mutate(roll_15s = roll_vec) %>%
  ggplot(aes(MS, HR)) +
  geom_line() +
  geom_line(aes(y = roll_15s), color = "blue")
Notice that in the plot, the black line is the raw data and the blue line is the 15s rolling average.
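The same centered-window average can also be written without an explicit loop; a minimal base R sketch of the identical calculation:
df$roll_15s <- sapply(df$MS, function(t)
  mean(df$HR[df$MS >= t - 7500 & df$MS <= t + 7500]))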
One possible solution:
library(magrittr)
library(data.table)  # needed for %between%
start_range <- df$MS[df$MS < max(df$MS) - 15000]
lapply(start_range, function(t) {
  data.frame(MS = mean(df$MS[df$MS %between% c(t, t + 15000)]),
             HR = mean(df$HR[df$MS %between% c(t, t + 15000)]))
}) %>% Reduce(rbind, .)
MS HR
1 43218.00 86.18182
2 43907.82 86.18182
3 44603.55 86.18182
4 44948.29 86.28571
5 45673.38 86.33333
I added some points to your data (with the data you give, start_range contained only two points):
MS <- c(36148, 36753,37364,38062,38737,39580,40029,40387,41208,42006,42796, 43533,44274,44988,45696,46398,47079,47742,48429,49135,49861,50591,51324,52059,53289,54424)
HR <- c(84,84,84,84,84,96,84,84,96,84,84,96,84,84,96,84,84,84,84,84,84,84,84,84,85,88)
df <- data.frame(MS, HR)
The idea here is to calculate, for each MS value, the mean of HR and of the time MS over all points whose time lies between this value (t in lapply) and 15 s after it. I restrict this to the range where a full 15 s window is available: the start_range vector.

How to effectively determine the maximum difference between the variable value in each row and same variable subsequent row values in data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within group (Y) in data.table in R?
Example:
library(data.table)
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
                 Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you advise a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert it to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between the cumulative max of X and X, ungroup, and rearrange in the original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
  as_data_frame() %>%
  mutate(orig_row = row_number()) %>%
  group_by(Y) %>%
  arrange(-orig_row) %>%
  mutate(max_diff2 = cummax(X) - X) %>%
  ungroup() %>%
  arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as #david-arenburg suggests in the comments, this can be done lightning-fast in data.table with an elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
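A quick sanity check (a sketch; it assumes both max_diff from the original code and max_diff2 from the one-liner have been computed on the same dt) should return TRUE:
all.equal(dt$max_diff, dt$max_diff2)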

R - Sum range over lookback period, divided by sum of look back - excel to R

I am looking to work out a percentage total over a look back range in R.
I know how to do this in Excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from today looking back 3 lines. It then divides this sum by the total sum of columns B + C, again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code to which I could perhaps add the look back range:
perct <- apply(data.matrix[, c('adv', 'dec')], 1, function(x) { x[1] / (x[1] + x[2]) })
If I could get x[1] to sum the previous 3-line range, and x[2] to also sum the previous 3-line range.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only base R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
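To see what this one-liner is doing, the same steps can be split apart (a sketch of the identical calculation):
padded <- rbind(NA, NA, DF)                        # two NA rows so the first two windows are incomplete
sums <- sapply(lapply(padded, embed, 3), rowSums)  # trailing 3-row sums for each column
DF$frac <- prop.table(sums, 1)[, "adv"]            # row proportions; keep the "adv" share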
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
  ifelse(i >= 3, sum(DF$adv[(i-2):i]) / (sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
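Another option (a sketch, assuming the slider package is installed) expresses the same trailing 3-row calculation with slider::slide2_dbl:
library(dplyr)
library(slider)
DF %>%
  mutate(frac = slide2_dbl(adv, dec, ~ sum(.x) / sum(.x, .y),
                           .before = 2, .complete = TRUE))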

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the number of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
library(plyr)
data <- ddply(df, c("site"), summarise,
              N = length(values),
              mean = mean(values),
              sd = sd(values),
              se = sd / sqrt(N),
              sum = sum(values)
)
data
My question is: how can I use the script without having to stack my data frame?
Thanks.
A slight variation on #docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
  melt(variable.name = "site") %>%
  group_by(site) %>%
  summarise_each(funs(n(), mean, sd, se = sd(.)/sqrt(n()), sum), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The analogous function in the tidyr package is gather() (or pivot_longer() in current tidyr).
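For reference, a modern tidyverse equivalent (a sketch, assuming tidyr >= 1.0 and dplyr >= 1.0, where summarise_each has since been superseded):
library(tidyr)
library(dplyr)
DF %>%
  pivot_longer(everything(), names_to = "site", values_to = "value") %>%
  group_by(site) %>%
  summarise(n = n(),
            mean = mean(value),
            sd = sd(value),
            se = sd / sqrt(n),
            sum = sum(value))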

R - setting equiprobability over a specific variable when sampling

I have a data set with more than 2 million entries, which I load into a data frame.
I'm trying to grab a subset of the data. I need around 10000 entries, but I need the entries to be picked with equal probability with respect to one variable.
This is what my data looks like with str(data):
'data.frame': 2685628 obs. of 3 variables:
$ category : num 3289 3289 3289 3289 3289 ...
$ id: num 8064180 8990447 747922 9725245 9833082 ...
$ text : chr "text1" "text2" "text3" "text4" ...
You'll notice that I have 3 variables: category, id and text.
I have tried the following :
> sample_data <- data[sample(nrow(data),10000,replace=FALSE),]
Of course this works, but the sampling probability is not equal across categories. Here is the output of count(sample_data$category):
x freq
1 3289 707
2 3401 341
3 3482 160
4 3502 243
5 3601 1513
6 3783 716
7 4029 423
8 4166 21
9 4178 894
10 4785 31
11 5108 121
12 5245 2178
13 5637 387
14 5946 1484
15 5977 117
16 6139 664
Update: Here is the output of count(data$category) :
x freq
1 3289 198142
2 3401 97864
3 3482 38172
4 3502 59386
5 3601 391800
6 3783 201409
7 4029 111075
8 4166 6749
9 4178 239978
10 4785 6473
11 5108 32083
12 5245 590060
13 5637 98785
14 5946 401625
15 5977 28769
16 6139 183258
But when I try setting the probabilities I get the following error:
> catCount <- length(unique(data$category))
> probabilities <- rep(c(1/catCount),catCount)
> train_set <- data[sample(nrow(data),10000,prob=probabilities),]
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I understand that the sample function picks randomly among row numbers, but I can't figure out how to associate that with a probability for each category.
Question : How can I sample my data over an equal probability for the category variable?
Thanks in advance.
I guess you could do this with some simple base R operations, though you should remember that you are using probabilities within sample, so you won't get exactly the same count per category with this method, though you can get close enough for a large enough sample.
Here's an example data
set.seed(123)
data <- data.frame(category = sample(rep(letters[1:10], seq(1000, 10000, by = 1000)), 55000))
Then
probs <- 1/prop.table(table(data$category)) # Calculating relative probabilities
data$probs <- probs[match(data$category, names(probs))] # Matching them to the correct rows
set.seed(123)
train_set <- data[sample(nrow(data), 1000, prob = data$probs), ] # Sampling
table(train_set$category) # Checking frequencies
# a b c d e f g h i j
# 94 103 96 107 105 99 100 96 107 93
Edit: So here's a possible data.table equivalent
library(data.table)
setDT(data)[, probs := .N, category][, probs := .N/probs]
train_set <- data[sample(.N, 1000, prob = probs)]
Edit #2: Here's a very nice solution using the dplyr package contributed by #Khashaa and #docendodiscimus
The nice thing about this solution is that it returns the exact sample size within each group
library(dplyr)
train_set <- data %>%
  group_by(category) %>%
  sample_n(1000)
Edit #3:
It seems that the data.table equivalent of dplyr::sample_n would be
library(data.table)
train_set <- setDT(data)[data[, sample(.I, 1000), category]$V1]
This will also return the exact sample size within each group.
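For completeness, in current dplyr (>= 1.0) sample_n() has been superseded by slice_sample(); an equivalent sketch:
library(dplyr)
train_set <- data %>%
  group_by(category) %>%
  slice_sample(n = 1000) %>%
  ungroup()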
