R - Finding multiple max values

I have the following sample data set:
Time <- c(1,2,3,4,5,6,7,8,9,10,11,12)
Value <- c(0,1,2,3,2,1,2,3,2,1,2,3)
Data <- data.frame(Time, Value)
I would like to automatically find each maximum for the Value column and create a new data frame with only the Value and associated Time. In this example, maximum values occur every fourth time interval. I would like to group the data into bins and find the associated max value.
I kept my example simple for illustrative purposes, however, keep in mind:
Each max value in my data set will be different
Each max value is not guaranteed to occur at equal intervals but rather, I can guarantee that each max value will occur within a range (i.e. a bin) of time values.
Thank you for any help with this process!

You could find the local maxima by finding the points where the diff of the sign of the diff of the Value column is negative.
Data[which(diff(sign(diff(Data$Value))) < 0) + 1,]
#> Time Value
#> 4 4 3
#> 8 8 3
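To unpack why this works, here is a sketch of the intermediate steps on the example data:
d1 <- diff(Data$Value) # +1 while rising, -1 while falling
s1 <- sign(d1)         # the sign of each step
d2 <- diff(s1)         # -2 exactly where a rise turns into a fall (a peak)
which(d2 < 0) + 1      # the + 1 realigns indices, since each diff() shortens the vector by one
#> [1] 4 8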
We can see that this works in a more general case too:
Time <- seq(0, 10, 0.1)
Value <- sin(Time)
Data <- data.frame(Time, Value)
plot(Data$Time, Data$Value)
Data2 <- Data[which(diff(sign(diff(Data$Value))) < 0) + 1,]
abline(v = Data2$Time, col = 'red')
Edit
Following more info from the OP, it seems we are looking for the maxima within a 120-second window. This being the case, we can get the solution more easily like this:
library(dplyr)
bin_size <- 4 # Used for example only, will be 120 in real use case
Data %>%
  mutate(Bin = floor((Time - 1) / bin_size)) %>%
  group_by(Bin) %>%
  filter(Value == max(Value))
#> # A tibble: 3 x 3
#> # Groups: Bin [3]
#> Time Value Bin
#> <dbl> <dbl> <dbl>
#> 1 4 3 0
#> 2 8 3 1
#> 3 12 3 2
Obviously in the real data, change bin_size to 120.
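The same binning works in base R if you prefer (a sketch; note that which.max() keeps only the first row of a tie, while filter(Value == max(Value)) keeps all tied rows):
Bin <- floor((Data$Time - 1) / bin_size)
do.call(rbind, lapply(split(Data, Bin), function(d) d[which.max(d$Value), ]))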

Maybe this one?
library(dplyr)
Data %>%
  slice_max(Value)
Time Value
1 4 3
2 8 3
3 12 3
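Note that slice_max() keeps all rows tied for the maximum by default, so this returns three rows only because each local peak happens to equal the global maximum (3). If the peaks differ in height, the bin grouping is still needed; a sketch combining the two ideas (assumes dplyr >= 1.1.0 for the by argument):
Data %>%
  mutate(Bin = floor((Time - 1) / 4)) %>% # 4 = bin size from the example
  slice_max(Value, n = 1, by = Bin)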

Related

Conditionally mutate dataframe based on multiple conditions R

I have seen some similar questions, but none of them was exactly the same as what I want to do, which is why I am asking.
I have a dataframe (dummy_data) which contains indices of some observations (obs) regarding given subjects (ID). The dataframe contains only the meaningful data (in other words: the desired conditions are met). The last column in this example data contains the total number of observations (total_obs).
ID <-c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
obs <- c(1,2,3,5,6,3,4,5,7,8,9,12,16,1,2,4,5,6,7,8,2,4,6,7,8,10,13,14,15,3,4,6,7,11)
total_obs <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
dummy_data <- data.frame(ID, obs, total_obs)
I would like to create a new column (interval) with 3 possible values: "start", "center", "end", based on the following condition:
it should split the total number of observations (total_obs) into 3 groups (based on indices, from the 1st to the last, which is the value stored in the total_obs column) and assign the interval value according to the indices stored in the obs column.
Here is the expected output:
ID <- c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
segment <- c(1,2,3,5,6, 3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8, 2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_segments <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
interval <- c("start","start","center","end","end","start","start","start","center","center","center","end","end","start","start","center","center","center","end","end","start","start","start","center","center","center","end","end","end", "start","start","center","center","end")
wanted_data <- data.frame(ID, segment, total_segments, interval)
I would like to use dplyr::ntile() with dplyr::mutate() and dplyr::case_when(), but I could not make my code function properly. Any solutions?
You just need dplyr::mutate() and dplyr::case_when().
The following should give you something to work off of.
dummy_data %>%
  mutate(interval = case_when(obs < (total_obs / 3) ~ "start",
                              obs < 2 * (total_obs / 3) ~ "center",
                              TRUE ~ "end"))
# TRUE ~ "end" is the 'else' case when everything else is false
Which gives slightly different results.
I think more careful deliberation should be made regarding where the endpoints are for each interval, but if you know what you are doing, a combination of <=, %/%, and ceiling() should give you the result you desire.
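For instance, a rough sketch of an explicit-endpoint variant (assuming the intervals should be equal thirds of the index range, closed on the right):
dummy_data %>%
  mutate(interval = case_when(obs <= ceiling(total_obs / 3) ~ "start",
                              obs <= ceiling(2 * total_obs / 3) ~ "center",
                              TRUE ~ "end"))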
First, because dummy_data$obs is identical to wanted_data$segment, and dummy_data$total_obs is identical to wanted_data$total_segments, you just need to rename these columns.
For the interval column, here is one approach of creating it:
group the data based on segment column
create a column, say tile, and fill it with ntile(segment) results.
create the interval column, and use case_when() to fill it with the category labels derived from tile: fill interval with "start" when tile is 1, "center" when 2, and "end" when 3.
drop the tile column.
wanted_data <- dummy_data %>%
  rename(segment = obs, total_segments = total_obs) %>%
  group_by(total_segments) %>%
  mutate(tile = ntile(segment, 3)) %>%
  mutate(interval = case_when(tile == 1 ~ "start",
                              tile == 2 ~ "center",
                              tile == 3 ~ "end")) %>%
  select(-tile)
wanted_data
# A tibble: 34 × 4
# Groups: total_segments [5]
ID segment total_segments interval
<chr> <dbl> <dbl> <chr>
1 item_001 1 6 start
2 item_001 2 6 start
3 item_001 3 6 center
4 item_001 5 6 center
5 item_001 6 6 end
6 item_452 3 16 start
7 item_452 4 16 start
8 item_452 5 16 start
9 item_452 7 16 center
10 item_452 8 16 center
# … with 24 more rows
It's slightly different from the wanted_data$interval you showed, but per your comment, the division into categories should be exactly as dplyr::ntile() does it.
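For reference, a quick way to see how dplyr::ntile() forms the groups (larger groups come first), shown here for 16 indices:
ntile(1:16, 3)
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3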

purrr::compose or any alternative to reduce the run time for a long nested function in R?

Problem: I have several (10+) custom functions, each defining a step in the workflow. I want to run a nested function of these steps over a large data frame for n (50+) periods iteratively. My current function achieves the result but it is too slow and not very elegant.
Example Input
id x_1975 z_1975
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 NA
6 6 6 NA
Step 1:
Compare initial x values (x_1975) against a cutoff=3. If x is greater than 3, then the z value should be "Y".
Step 2:
If the z value is "Y", then the x value in the next year should be x times 2; otherwise, it should be x times 5. Although the z values can be skipped altogether, I need the categorical column to create summary stats.
Note:
The data set I am working with has 20 variables that need to be calculated based on similar logic.
Desired Output
id x_1975 z_1975 x_1976 z_1976 x_1977 z_1977 x_1978
1 1 1 <NA> 5 Y 10 Y 20
2 2 2 <NA> 10 Y 20 Y 40
3 3 3 <NA> 15 Y 30 Y 60
4 4 4 Y 8 Y 16 Y 32
5 5 5 Y 10 Y 20 Y 40
6 6 6 Y 12 Y 24 Y 48
What I have tried:
Tried setting the data in long format. But found it complicated to iterate over rows.
Pre-allocated all columns with appropriate class. That reduced run time a little although not enough.
Have been trying to use purrr::compose to nest all the functions. But I am not being able to make it work.
Reproducible Example
library(dplyr)
library(purrr)
# Create Data Frame
n <- 6
dat <- data.frame(id = 1:n,
                  x_1975 = seq(1, 6, 1),
                  z_1975 = NA)
cut_off <- 3
# Functions
# Set a value for "z_" variables in period t by comparing the "x_" value in period t against the cut_off value.
func_1 <- function(dat, yr){
  # pre-define variables
  z <- paste0("z_", yr)
  x <- paste0("x_", yr)
  # Calculate values for "z_" in period t
  dat <- dat %>% mutate(!!sym(z) :=
                          case_when(!!sym(x) > cut_off ~ "Y",
                                    TRUE ~ as.character(NA)))
}
# Calculate the value for "x_" variables in period t+1 based on "z_" variables in period t.
func_2 <- function(dat, yr){
  # pre-define variables
  x <- paste0("x_", yr + 1)
  x_lag <- paste0("x_", yr)
  z <- paste0("z_", yr)
  # Calculate "x_" value for t+1
  dat <- dat %>% mutate(!!sym(x) := case_when(
    !!sym(z) == "Y" ~ !!sym(x_lag) * 2,
    TRUE ~ !!sym(x_lag) * 5
  ))
}
# Join function 1 and function 2 together. The joined function needs to iterate over the `dat` from beginning year to ending year
joined_func <- function(dat, beginning, ending){
  for (year in seq(beginning, ending, 1)){
    dat <- func_1(dat, year)
    # Output of step 1 is used as input for step 2
    dat <- func_2(dat, year)
  }
  return(dat)
}
# Run the code from 1975 to 1977 here (the real use case runs to 2025). data_output has the desired output, but the runtime needs to come down.
data_output <- joined_func(dat,1975,1977)
# Tried to use the compose function from purrr, but getting an error.
my_funs <- c(func_1, func_2)
f1 <- invoke(compose, my_funs)
joined_func_2 <- function(dat, beginning, ending){
  for (year in seq(beginning, ending, 1)){
    dat <- f1(dat, year = year)
  }
}
data_output_2 <- joined_func_2(dat,1975,1977)
# Error message:
# Error in f1(dat, year = year) : unused argument (year = year).
Questions
a) how do I make purrr::compose work? b) any other way to achieve efficiency?
Would really appreciate if someone could help me on this!
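Two things stand out in the compose() attempt. First, the immediate error: purrr's compose() passes the extra arguments only to the first function it calls, and both func_1() and func_2() take an argument named yr, not year, so year = year matches neither signature, hence "unused argument". Second, even with matching names, the composed chain threads a single value between the steps, so one shared year cannot reach both functions. A minimal sketch of an alternative, reusing func_1() and func_2() as defined above: fix the year inside one combined step and fold it over the years with purrr::reduce():
# One step = func_1 then func_2 for the same year
step <- function(dat, yr) func_2(func_1(dat, yr), yr)
# reduce() threads dat through as the accumulator: step(step(dat, 1975), 1976), ...
data_output_3 <- purrr::reduce(1975:1977, step, .init = dat)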

How to sum every nth (200) observation in a data frame using R [duplicate]

This question already has answers here:
calculating mean for every n values from a vector
(3 answers)
Closed 4 years ago.
I am new to R so any help is greatly appreciated!
I have a data frame of 278,800 observations for each of my 10 variables, and I am trying to create an 11th variable that sums every 200 observations (or rows) of a specific variable/column (sum of rows 1:200, 201:400, 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest, with the aim of adding a new variable that continuously sums every 200 rows, but I cannot figure it out. I understand my new "variable" will produce 1,394 data points (278,800/200). I have tried to use the rollapply function; however, the output does not sum in blocks of 200, it sums 1:200, 2:201, 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
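A base R alternative (a sketch, using the same toy data): build a block index and sum within blocks with tapply():
g <- (seq_along(DF$x) - 1) %/% n # 0 0 0 1 1 1 2 2 2 3 3 3
tapply(DF$x, g, sum)
##  0  1  2  3
##  6 15 24 33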
First let's generate some data and get a label for each group:
library(tidyverse)
df <-
  rnorm(1000) %>%
  as_tibble() %>%
  mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
value grp
<dbl> <dbl>
1 -1.06 1
2 0.668 1
3 -2.02 1
4 1.21 1
...
1000 0.78 5
This creates 1000 random N(0,1) values, turns them into a data frame, and then adds an incrementing numeric label for each group of 200.
df %>%
  group_by(grp) %>%
  summarize(grp_sum = sum(value))
# A tibble: 5 x 2
grp grp_sum
<dbl> <dbl>
1 1 9.63
2 2 -12.8
3 3 -18.8
4 4 -8.93
5 5 -25.9
That group-by operation on the grp column sums the values within each block. You can use the pull() operation to get a vector of the results:
df %>%
  group_by(grp) %>%
  summarize(grp_sum = sum(value)) %>%
  pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
I created a vector with 278800 observations (a):
a <- rnorm(278800)
b <- NULL # initializing the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)){
  b[j] <- sum(a[i:(i + 199)]) # b is your column of interest; the parentheses matter: a[i:i+199] would select a single element
  j <- j + 1
}
View(b)

Round midpoint values based on proportion table in R

I've got a numeric variable in the range from 1 (min) to 5 (max). The values come from 8 different variables, so the first row would look like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
4 4 1 4 5 4 4 1
I've computed (row-wise) a median value for each row across the 8 variables. Occasionally, the median will be a midpoint value, for example 4.5 (since there is an even number of variables). Therefore the resulting row might look like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Median
1 2 3 4 5 5 5 5 4.5
When I call table() on the median values stored in the Median variable, I get this:
table(df$Median)
1 1.5 2 2.5 3 3.5 4 4.5 5
2 3 10 5 25 17 75 53 87
The issue I am trying to overcome is that I wish to "get rid" of the midpoint/decimal values by folding them into the nearest non-decimal values; however, if I simply use round(), I end up biasing the values (since, by definition, 4.5 is exactly in between), like this:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87
What I was thinking of doing is rounding the midpoint values based on the proportions of the non-decimal numbers in the table (excluding the midpoint values).
So I would get the proportion of non-decimal numbers using dplyr's filter():
df %>% filter(Median %% 1 == 0) %>%
  select(Median) %>% table() %>% prop.table()
To get:
1 2 3 4 5
0.01005025 0.05025126 0.12562814 0.37688442 0.43718593
The next step requires constructing a function that takes all midpoint values in the Median variable and rounds them to their nearest non-decimal values while keeping the proportions of the non-decimal values intact, or close to the original. For example, 4.5 sits between 4 and 5, so it would have a chance of becoming 4 in proportion to 0.37688442 and 5 in proportion to 0.43718593. This way I would transform the midpoint values to whole numbers without the bias of simply using round().
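A sketch of how that weighted rounding could look (round_by_prop is a hypothetical helper; props is assumed to hold the named proportion table computed above):
round_by_prop <- function(m, props) {
  lo <- floor(m)
  hi <- ceiling(m)
  # chance of rounding up, renormalized over the two neighbouring whole values
  p_up <- props[as.character(hi)] / (props[as.character(lo)] + props[as.character(hi)])
  ifelse(m %% 1 == 0, m, ifelse(runif(length(m)) < p_up, hi, lo))
}
df$Median_weighted <- round_by_prop(df$Median, props)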
An alternative approach is to split the value equally between 4 and 5. So 50% of variables with value 4.5 will go to 4, 50% will go to 5.
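And a sketch of that deterministic 50/50 split (alternating midpoints up and down; assumes the row order carries no meaning):
mids <- which(df$Median %% 1 != 0)      # positions of the midpoint medians
up   <- mids[seq_along(mids) %% 2 == 1] # every other midpoint rounds up
down <- setdiff(mids, up)               # the rest round down
df$Median_split <- df$Median
df$Median_split[up]   <- ceiling(df$Median[up])
df$Median_split[down] <- floor(df$Median[down])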
I am thankful for any suggestions that would help me to solve this problem or get to the point I can start developing the function.
Edit1. Provided my own attempt to answer this question.
Edit2. Provided data.
dput(head(df, 15))
structure(list(uniqueID = c("R_AtXpiwxKPvILFv3", "R_2xwP4iz6UAu1fTj",
"R_b8IXGRKHP58x7GR", "R_ZelynHN8PCxxYyt", "R_PNjIc7h4dHebRgR",
"R_2bTZvYLUuKNC22D", "R_3iLqwuDs493HstB", "R_291dITimLKjYXeL",
"R_YWWGleFLxlIYzrX", "R_3st91vjNWNXlTHt", "R_3Mm8P52gaaxIpwD",
"R_3MxHXTnrncpgnB8", "R_1LqDx1uxReOQHvO", "R_vJEGJDmbqdfO7qF",
"R_3q8Wl8qys6nqxBH"), Median = c(4, 4.5,
1, 4, 5, 4.5, 4, 1.5, 4.5, 4, 3.5, 2, 4.5, 4.5, 3.5)), .Names = c("uniqueID",
"Median"), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
I'd implement it like this:
round_randomly = function(x, tolerance = 1e-6) {
  round(x + sample(c(-tolerance, tolerance), size = length(x), replace = TRUE))
}
Calling your sample data dd,
table(round_randomly(dd$Median))
# 1 2 4 5
# 1 2 8 4
Any tolerance value less than 0.5 will work the same if your data contains only integers and 0.5 midpoints. If you have more continuous data, a smaller tolerance is better (to prevent, say, 4.4 from being jittered up to 4.51 and rounded to 5). I set the default to 1e-6, which seems reasonable; a value above 4.499999 might get rounded up to 5.
Your answer goes to quite a bit of trouble to add a random value only to the midpoints. This isn't necessary because of the rounding: if the original value is 4, then 4.000001 will still round to 4 (even with the tolerance set to 0.4, 4.4 still rounds to 4).
My method makes no guarantees about rounding exactly 50% of midpoints up and 50% down, but each midpoint is rounded up and down with equal probability. Unless you have very little data and an unusually skewed random draw, that should be close enough.
Following suggestions from the comments, I've attempted to create a function that randomly adds 0.1 to, or subtracts 0.1 from, all median midpoint values. It's not the most elegant function ever, but it does the job. One caveat of the approach is that the randomization works by sampling a fraction of the dataset and adding 0.1 to it, so the remaining unsampled fraction automatically gets 0.1 subtracted. It would be more elegant to do this for every value individually, but I would have to explore that option.
The function:
randomize_midpoint <- function(dataset, new_random_median) {
  # Prepare variable for mutate
  new_random_median <- enquo(new_random_median)
  # Get sample A
  sample_A <- dataset %>%
    filter(Median %% 1 != 0) %>%       # get midpoint values
    sample_frac(0.5, replace = F) %>%  # randomly sample 50% of them
    select(uniqueID, Median)           # anti_join will need some unique identifier
  # Get sample B by anti_join
  sample_B <- dataset %>%
    filter(Median %% 1 != 0) %>%
    anti_join(sample_A) %>%            # anti_join automatically uses uniqueID
    select(uniqueID, Median)
  # Create opposite of %in%
  "%w/o%" <- Negate("%in%")
  # Mutate median according to conditions in case_when()
  dataset %>% mutate(
    !!quo_name(new_random_median) := case_when(
      uniqueID %in% sample_A$uniqueID ~ round(Median + 0.1),
      uniqueID %in% sample_B$uniqueID ~ round(Median - 0.1),
      uniqueID %w/o% c(sample_A$uniqueID, sample_B$uniqueID) ~ Median
    )
  )
}
The output of the function to compare with previous table():
randomize_midpoint(dataset = df, new_random_median = random_med) %>%
  select(random_med) %>%
  table()
Will return:
Joining, by = c("uniqueID", "Median")
1 2 3 4 5
2 16 36 110 113
Previous table:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87

ACF by group in R

I would like to calculate the acf of a time series grouped by a grouping variable. Specifically, I have a data frame containing a single time series (variable a) and a grouping variable (e.g. weekday, variable b). Here is an example:
data <- data.frame(a=rnorm(1:150), b=rep(rep(1:3, each=5), 10))
Now, I would like to calculate the acf for the different values of the grouping variable. For example, for lag 2 and group 1 I would like to get the correlation between t and t-2 calculated only over time points t with b=1 (the value of b for t-2 does not matter). I know that the function acf can easily calculate the acf but I don't find a way to include the grouping variable.
I could manually calculate the desired correlation but as I have a large data set and a lot of lags and values for the grouping variables, I would hope that there is a more elegant and faster way. Here is the manual calculation for the example above (lag 2, b=1):
sel <- which(data$b==1)
cor(data$a[sel[sel > 2]], data$a[sel[sel>2] - 2])
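That manual calculation generalizes directly. A sketch of a small helper (acf_by_group is a hypothetical name) for an arbitrary lag and group value:
acf_by_group <- function(x, g, lag, group) {
  sel <- which(g == group)
  sel <- sel[sel > lag] # drop points without a t - lag partner
  cor(x[sel], x[sel - lag])
}
acf_by_group(data$a, data$b, lag = 2, group = 1) # same as the manual version above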
If the time series object is a tsibble, the following works for me (ACF() and autoplot() here come from the feasts package). Assuming the data frame is called df, the grouping variable is Region, and the variable you are interested in is var; you can additionally specify the maximum lag:
library(dplyr)
library(feasts)
df %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()
I'm not sure I understand exactly what information you are looking for, but if you just want the acf values for multiple groups, this should accomplish that. Some people have mentioned creating a tidy solution; this one uses dplyr, tidyr, and purrr to do grouped calculations.
library(dplyr)
library(tidyr)
library(purrr)
sample_data <- dplyr::data_frame(group = sample(c("a", "b", "c"), size = 100, replace = T), value = sample.int(30, size = 100, replace = T))
head(sample_data)
#> # A tibble: 6 × 2
#> group value
#> <chr> <int>
#> 1 c 28
#> 2 c 9
#> 3 c 13
#> 4 c 11
#> 5 a 9
#> 6 c 9
grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#> group acf_values lag
#> <chr> <dbl> <int>
#> 1 c 1.00000000 0
#> 2 c -0.20192774 1
#> 3 c 0.07191805 2
#> 4 c -0.18440489 3
#> 5 c -0.31817935 4
#> 6 c 0.06368096 5
You can have a look at split to separate your data.frame into buckets and then lapply to apply your function to each group. Something like:
groups_data <- split(data, data$b)
groups_acf <- lapply(groups_data, acf, ...)
Then you have to extract the required information from the output list, for instance with sapply(groups_acf, function(acfobject) acfobject$acf).
For group computations, I would also definitely go with the new ways "à la" Hadley Wickham, with the %>% operator and group_by; studying that is on my todo list...
