Grouping data in R based on specific column values

I have a set of data in a csv file that I need to group based on transitions of one column. I'm new to R and I'm having trouble finding the right way to accomplish this.
Simplified version of data:
Time Phase Pressure Speed
1 0 0.015 0
2 25 0.015 0
3 25 0.234 0
4 25 0.111 0
5 0 0.567 0
6 0 0.876 0
7 75 0.234 0
8 75 0.542 0
9 75 0.543 0
Each phase actually lasts longer than shown above (I shortened everything to keep it readable), and this pattern continues on and on. What I'm trying to do is calculate the mean of Pressure and Speed for each instance where Phase is non-zero. For the sample above, the output would have two lines: one with the average of the three rows where Phase is 25, and one with the average of the three rows where Phase is 75. The same numeric value of Phase can show up more than once, and I need to treat each occurrence separately. That is, if Phase is 0, 0, 25, 25, 25, 0, 0, 0, 25, 25, 0, I would need to record the first group and the second group of 25s as separate events, as well as any other non-zero groups.
What I've tried:
`csv <- read.csv("c:\\test.csv")`
`ins <- subset(csv,csv$Phase == 25)`
`exs <- subset(csv,csv$Phase == 75)`
`mean(ins$Pressure)`
`mean(exs$Pressure)`
This obviously returns the average over the entire file for the rows where Phase is 25 or 75, but I need to somehow split those rows into groups using the leading and trailing 0s. Any help is appreciated.

Super quick:
df <- read.csv("your_file_name.csv")
cbind(aggregate(Pressure ~ Phase, df[df$Phase != 0,], FUN = mean),
aggregate(Speed ~ Phase, df[df$Phase != 0,], FUN = mean)[2])
The cbind is a bit of a shortcut; depending on the distribution of values of Phase, you may need to merge instead.
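For instance, a sketch of the merge version (press and speed are just intermediate names):
press <- aggregate(Pressure ~ Phase, df[df$Phase != 0,], FUN = mean)
speed <- aggregate(Speed ~ Phase, df[df$Phase != 0,], FUN = mean)
merge(press, speed, by = "Phase")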

EDITED: Based on feedback from the asker, they really want to aggregate across runs of numbers (i.e. the first group of consecutive 25s, then the second group of consecutive 25s, and so on). Because of that, I suggest using rle, the run-length encoding function, to build a group number that you can use in the aggregate command.
I've modified the original data so that it contains two runs of 25, just for illustrative purposes, but it should work regardless. Using rle we get the encoded runs of data, and then we create a group number for each row: we number the runs sequentially and use the rep function to repeat each run's number by that run's length.
After this is done, we can use the same basic aggregation command again.
df_example <- data.frame(Time = 1:9,
                         Phase = c(0,25,25,25,0,0,25,25,0),
                         Pressure = c(0.015,0.015,0.234,0.111,0.567,0.876,0.234,0.542,0.543),
                         Speed = rep(x = 0, times = 9))
encoded_runs <- rle(x = df_example$Phase)
df_example$Group_No <- rep(x = 1:length(x = encoded_runs$lengths),
                           times = encoded_runs$lengths)
aggregate(x = df_example[df_example$Phase != 0, c("Pressure","Speed")],
          by = list(Group_No = df_example[df_example$Phase != 0, "Group_No"],
                    Phase = df_example[df_example$Phase != 0, "Phase"]),
          FUN = mean)
Group_No Phase Pressure Speed
1 2 25 0.120 0
2 4 25 0.388 0

Building upon the comment by Solos and the answer by Cheesman, try:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
df_example = csv
aggregate(x = df_example[df_example$Phase != 0,c("Pressure","Speed")],
by = list(Phase = df_example[df_example$Phase != 0,"block"]),
FUN = mean)
Actually, plyr would be handy:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
require(plyr)
ddply(csv[csv$Phase!=0,], .(block), summarize,
mean.Pressure=mean(Pressure), mean.Speed=mean(Speed))
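For completeness, a dplyr-only sketch of the same run-grouping idea (assuming dplyr 1.0 or later for the .groups argument):
library(dplyr)
csv %>%
  mutate(block = cumsum(c(1, diff(Phase) != 0))) %>%
  filter(Phase != 0) %>%
  group_by(block, Phase) %>%
  summarise(mean.Pressure = mean(Pressure), mean.Speed = mean(Speed), .groups = "drop")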

Related

Filtering (temporally similar) duplicates while keeping max value based on another column in R

I'm hoping to clean up a time series dataset so that only the maximum value of each event is retained. To start, I filtered the data so that only values above a certain threshold are kept, but there are still values that, while separated by only a millisecond or two, act as duplicates and will throw off later analysis.
My initial dataset has >100,000 rows and a few more columns, but here is the top of a smaller version.
head(shortfilter)
Time (Sec) ECG (Channel 6)
1 5534.023 1.371761
2 5534.024 1.232424
3 5534.152 1.414432
4 5534.153 1.359914
5 5534.272 1.639033
6 5534.396 1.476161
Explained: I don't have a concrete time window within which readings should be considered duplicates, but the rest of the data is similar to this in that duplicate readings are generally within 0.003 s of each other.
Time (Sec) ECG (Channel 6)
1 5534.023 1.371761 #<-- Higher value (keep)
2 5534.024 1.232424
3 5534.152 1.414432 #<-- Higher value (keep)
4 5534.153 1.359914
5 5534.272 1.639033 #<-- Only value (keep)
6 5534.396 1.476161 #<-- Only value (keep)
Ideal:
Time (Sec) ECG (Channel 6)
1 5534.023 1.371761
2 5534.152 1.414432
3 5534.272 1.639033
4 5534.396 1.476161
5 ____.___ _.______
6 ____.___ _.______
I'll add my initial attempt at some conditionals to do what I was hoping, but keep in mind I'm new to coding in general, so I know it isn't remotely correct; I just wanted to get some ideas out there and give some additional info on what I hope to do. I'm positive the formatting and syntax are complete gibberish, but I'm sure many of you will understand what I was going for lol...
for (i in shortfilter$`Time (Sec)`){
for (j in shortfilter$`ECG (Channel 6)`){
if ((i+1)-i > 0.01 && j > j+1){
remove(j+1)
} else if ((i+1)-i > 0.01 && j < j+1){
remove(j)
}
}
}
Welcome to StackOverflow! My solution compares each value to the next value and finds the difference, then adjusts the predicted grouping number based on those values. Currently it can handle up to five consecutive duplicated numbers, but you can easily add more if you would like.
library(tidyverse)
tibble::tribble(
~`Time`, ~`ECG`,
5534.023, 1.371761,
5534.024, 1.232424,
5534.025, 1.27,
5534.026, 1.28,
5534.152, 1.414432,
5534.153, 1.359914,
5534.272, 1.639033,
5534.396, 1.476161
) %>%
arrange(Time) %>%
mutate(sim_val = if_else(!is.na(lead(Time)), lead(Time) - Time, 5),
Num = if_else(sim_val <= 0.03, row_number() + 1, as.numeric(row_number())),
Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num),
Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num),
Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num)) %>%
arrange(Num, desc(ECG)) %>%
group_by(Num) %>%
slice_head(n = 1) %>%
ungroup() %>%
select(Time, ECG)
Also, feel free to fine-tune the threshold of 0.03 to your data. Let me know if this works!
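An alternative that handles runs of any length is to start a new group whenever the gap to the previous reading exceeds the threshold, then keep the maximum reading per group. A sketch, assuming dplyr 1.0 or later and the column names from the question (grp is just a helper column):
library(dplyr)
shortfilter %>%
  arrange(`Time (Sec)`) %>%
  # new group whenever the gap to the previous reading exceeds the threshold
  mutate(grp = cumsum(c(TRUE, diff(`Time (Sec)`) > 0.03))) %>%
  group_by(grp) %>%
  slice_max(`ECG (Channel 6)`, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-grp)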

row index of "looked at" row case_when in R

I'm currently struggling with a coding task involving a case_when statement in R.
In short, I would like to use the row index of the row currently being evaluated by case_when in the assignment part.
A short explanation of the data: I have a large data.frame with a date column, a geo-layer column, and some numeric columns with the numbers for the calculations.
The data.frame isn't sorted, and not every geo layer is necessarily present for every point in time. Sadly I can't provide a real data set due to legal issues.
The task at hand is, on the one hand, to compute simple mathematical operations within the same point in time and, on the other hand, to compute operations across different points in time for the same geo layer and numeric value.
The mathematical operations vary, as does the interval between the time points.
For instance, I need to calculate a change rate relative to the last quarter and the last year of the value:
((current_value - last_quarter_value) / current_value)*100
This is how I'd like to code it.
library(tidyverse)
test_dataframe <- data.frame(
times = c(rep(as.Date("2021-03-01"),2),rep(as.Date("2020-12-01"),2)),
geo_layer = rep(c("001001001", "001001002"),2),
numeric_value_a = 1:4,
numeric_value_b = 4:1,
numeric_value_c = c(1,NA,3,1)
)
check_comparison_times <- unique(test_dataframe$times)
test_dataframe <- test_dataframe %>%
  mutate(
    normale_calculation = case_when(
      !is.na(numeric_value_c) ~ (numeric_value_a + numeric_value_b) / numeric_value_c,
      TRUE ~ Inf
    ),
    time_comparison = case_when(
      is.na(numeric_value_c) ~ Inf,
      (times - months(3)) %in% check_comparison_times ~
        test_dataframe[
          which(
            test_dataframe[, "times"] ==
              (test_dataframe[row_index_of_current_looked_at_row, "times"] - months(3)) &
            test_dataframe[, "geo_layer"] ==
              test_dataframe[row_index_of_current_looked_at_row, "geo_layer"]
          ),
          "numeric_value_c"] - test_dataframe[row_index_of_current_looked_at_row, "numeric_value_c"],
      TRUE ~ -Inf
    )
  )
With this desired outcome:
times geo_layer numeric_value_a numeric_value_b numeric_value_c normal_calculation time_comparison
1 2021-03-01 001001001 1 4 1 5.000000 2
2 2021-03-01 001001002 2 3 NA Inf Inf
3 2020-12-01 001001001 3 2 3 1.666667 -Inf
4 2020-12-01 001001002 4 1 1 5.000000 -Inf
Currently I solve the problem with a triple loop in which I first pair the values by time, then by geo_layer, and then execute the mathematical operation. Since my data set is much, much larger than this, that solution is very inefficient.
Thanks for your help.
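One way this kind of quarter-over-quarter comparison is usually handled without row indices is a self-join on a shifted date. A rough sketch of the time_comparison part against the test_dataframe above (comparison_time and previous_value_c are just illustrative names; %m-% is lubridate's month subtraction):
library(dplyr)
library(lubridate)
test_dataframe %>%
  # the date each row should be compared against (three months earlier)
  mutate(comparison_time = times %m-% months(3)) %>%
  # bring in numeric_value_c from the same geo_layer three months earlier
  left_join(
    test_dataframe %>%
      select(times, geo_layer, previous_value_c = numeric_value_c),
    by = c("comparison_time" = "times", "geo_layer" = "geo_layer")
  ) %>%
  mutate(time_comparison = case_when(
    is.na(numeric_value_c)  ~ Inf,
    is.na(previous_value_c) ~ -Inf,
    TRUE                    ~ previous_value_c - numeric_value_c
  ))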

Use first digit as factor to standardize values in R

I have a large data frame, tocalculate, from a survey (the original data frame is brfss2013), where one of the variables represents the number of times a person checks blood glucose levels. The data is coded in 3 digits:
The first digit tells you whether the measurements are per day (1), per week (2), per month (3), or per year (4). The second and third digits represent the actual value.
Example: 101 is once ( _01) per day (1 _ _), 202 is twice per week, etc.
I want to standardize everything to a number of times per year, so I will multiply the value in the 2nd and 3rd digits by 365, 52.143, 12, or 1 (days, weeks, months, year).
I think I would be able to "select" the digits to use, but I'm not sure how to write something that applies a different set of instructions to different rows.
EDIT:
Adding my attempt and sample data.
tocalculate <- brfss2013 %>%
filter(nchar(bldsugar) > 2)
bldsugar2 <- sapply(tocalculate$bldsugar, function(x) {
if (substr(x,1,1) == 1) {x*365}
if (substr(x,1,1) == 2) {x*52}
if (substr(x,1,1) == 3) {x*12}
if (substr(x,1,1) == 4) {x*365}
})
I'm getting a lot of NULL values though...
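The NULLs come from the shape of that function: its value is the value of its last expression, i.e. the final if, which returns NULL whenever the first digit isn't 4 (and even when it matches, it multiplies the whole three-digit code rather than just the count). A corrected sketch of the same sapply idea, using the multipliers from the question:
bldsugar2 <- sapply(tocalculate$bldsugar, function(x) {
  first <- substr(x, 1, 1)                 # measurement frequency code
  count <- as.numeric(substr(x, 2, 3))     # number of checks
  if (first == "1") {
    count * 365
  } else if (first == "2") {
    count * 52.143
  } else if (first == "3") {
    count * 12
  } else {
    count * 1
  }
})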
Since you're already using dplyr, recode is a handy function here. I use %/% to see how many times 100 goes into each bldsugar value and %% to get the remainder when divided by 100.
# sample data
brfss_sample = data.frame(bldsugar = c(101, 102, 201, 202, 301, 302, 401, 402))
library(dplyr)
mutate(
brfss_sample,
mult = recode(
bldsugar %/% 100,
`1` = 365.25,
`2` = 52.143,
`3` = 12,
`4` = 1
),
checks_per_year = bldsugar %% 100 * mult
)
# bldsugar mult checks_per_year
# 1 101 365.250 365.250
# 2 102 365.250 730.500
# 3 201 52.143 52.143
# 4 202 52.143 104.286
# 5 301 12.000 12.000
# 6 302 12.000 24.000
# 7 401 1.000 1.000
# 8 402 1.000 2.000
You could, of course, remove the mult column (or combine the definitions so it is never created in the first place).
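For instance, the combined form would look something like this (a sketch):
mutate(
  brfss_sample,
  checks_per_year = bldsugar %% 100 *
    recode(bldsugar %/% 100, `1` = 365.25, `2` = 52.143, `3` = 12, `4` = 1)
)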
#Data
set.seed(42)
x = sample(101:499, 100, replace = TRUE)
#1st digit
as.factor(floor((x/100)))
#Values
((x/100) %% 1) * 100
Perhaps the first thing you can do is to split the 3-digit variable into two variables. The first variable is a single digit showing the measurement frequency, and the second variable shows the number of measurements.
In R, substr or substring can select the string by specifying the first and last position to subset.
# Create example data frame
ex_data <- data.frame(var = c("101", "202", "204"))
# Split the variable to create two new columns
ex_data$var1 <- substring(ex_data$var, first = 1, last = 1)
ex_data$var2 <- substring(ex_data$var, first = 2, last = 3)
# Remove the original variable
ex_data$var <- NULL
After this, you can manipulate your data frame. Perhaps convert var1 to factor and var2 to numeric for further manipulation and analysis.
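To finish the standardization from there, one possible sketch (per_year is just an illustrative name, and the multipliers are the ones discussed above):
# treat var1 as a numeric index into the per-period multipliers
ex_data$var1 <- as.numeric(ex_data$var1)
ex_data$var2 <- as.numeric(ex_data$var2)
ex_data$per_year <- ex_data$var2 * c(365, 52.143, 12, 1)[ex_data$var1]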

Simulating a timeseries in dplyr instead of using a for loop

So, while lag and lead in dplyr are great, I want to simulate a timeseries of something like population growth. My old school code would look something like:
tdf <- data.frame(time=1:5, pop=50)
for(i in 2:5){
tdf$pop[i] = 1.1*tdf$pop[i-1]
}
which produces
time pop
1 1 50.000
2 2 55.000
3 3 60.500
4 4 66.550
5 5 73.205
I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).
But, something like
tdf <- data.frame(time=1:5, pop=50) %>%
mutate(pop = 1.1*lag(pop))
which would have been my first guess just produces
time pop
1 1 NA
2 2 55
3 3 55
4 4 55
5 5 55
I feel like I'm missing something obvious.... what is it?
Note - this is a trivial example - my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so, the tidyverse is proving to be a powerful tool in bringing my simulations together.
Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
or with purrr,
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = accumulate(pop, ~.x * 1.1))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
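If the per-step growth rates live in their own column instead of being hard-coded, the same idea works through accumulate's .init argument (a sketch):
library(dplyr)
library(purrr)
data.frame(time = 1:5, rate = 1.1) %>%
  # accumulate over the rates after the first row, starting from the initial population
  mutate(pop = accumulate(rate[-1], `*`, .init = 50))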
If the starting value of pop is, say, 50, then 50 * 1.1^(0:4) gives you the whole series (the starting value plus the next four). With your code, you could do:
data.frame(time=1:5, pop=50) %>%
mutate(pop = pop * 1.1^(1:n() - 1))
Or,
base = 50
data.frame(time=1:5) %>%
mutate(pop = base * 1.1^(1:n()-1))
Purrr's accumulate function can handle time-varying indices, if you pass them
to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick here is that accumulate() can work on list as well as vector columns. You can use the tidyr function nest() to group columns into a list vector containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo, simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this to simulate multiple replicates for a given model using dpylr+purrr+tidyr.
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
# Declare the population growth function. Note: the first two arguments
# have to be .x (the prior vector of populations and parameters) and .y,
# the current parameter value and population vector.
# This example function is a Ricker population growth model.
logistic_growth = function(.x, .y, growth, comp) {
pop = .x$pop[1]
growth = .y$growth[1]
comp = .y$comp[1]
# Note: this uses the state from .x, and the parameter values from .y.
# The first observation will use the first entry in the vector for .x and .y
new_pop = pop*exp(growth - pop*comp)
.y$pop[1] = new_pop
return(.y)
}
# Starting parameters the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps = 100
pop_init = 1
growth = 0.5
comp = 0.05
#First test: fixed growth rates
test1 = data_frame(time = 1:n_steps,pop = pop_init,
growth=growth,comp =comp)
# here, the combination of nest() and group_by() split the data into individual
# time points and then groups all parameters into a new vector called state.
# ungroup() removes the grouping structure, then accumulate runs the function
#on the vector of states. Finally unnest transforms it all back to a
#data frame
out1 = test1 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = data_frame(time = 1:n_steps,pop = pop_init,
growth=rnorm(n_steps, growth,0.1),comp=comp)
out2 = test2 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This demonstrates how to use this approach to simulate replicates using dplyr
# Note the crossing function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps,pop = pop_init, comp=comp) %>%
mutate(growth=rnorm(n_steps*10, growth,0.1))
out3 = test3 %>%
group_by(rep)%>%
group_by(rep,time)%>%
nest(pop, growth, comp,.key = state)%>%
group_by(rep)%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
print(qplot(time, pop, data=out1)+
geom_line() +
geom_point(data= out2, col="red")+
geom_line(data=out2, col="red")+
geom_point(data=out3, col="red", alpha=0.1)+
geom_line(data=out3, col="red", alpha=0.1,aes(group=rep)))
The problem here is that dplyr is running this as a set of vector operations rather than evaluating the terms one at a time. Here, 1.1*lag(pop) is being interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop = 50, the lagged values for all the steps were 50.
dplyr does have some helper functions for sequential evaluation; the standard functions cumsum, cumprod, etc., and a few newer ones (see ?cummean), all work within dplyr. In your example, you could simulate the model with:
tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1, times=4))) %>%
  mutate(pop = pop*cumprod(growth_rate))
time pop growth_rate
1 50.000 1.0
2 55.000 1.1
3 60.500 1.1
4 66.550 1.1
5 73.205 1.1
Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:
tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
  mutate(pop = pop*cumprod(lag(growth_rate, default = 1)))
Here the lag (with a default of 1 for the first row) keeps the starting population untouched, so each row's growth_rate is the per-step rate applied going into the next time step.
There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.
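For example, a time-varying rate supplied as a column (the rates here are made up) drops straight into the same pattern:
library(dplyr)
data.frame(time = 1:5,
           pop = 50,
           growth_rate = c(1, 1.10, 1.05, 1.20, 0.95)) %>%
  mutate(pop = pop * cumprod(growth_rate))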
What about the map functions, i.e.
tdf <- data.frame(time = 1:5)
tdf %>% mutate(pop = map_dbl(time, function(x) 50 * 1.1^(x - 1)))

Building dummy variable with many conditions (R)

My dataset looks something like this
ID YOB ATT94 GRADE94 ATT96 GRADE96 ATT98 .....
1 1975 1 12 0 NA
2 1985 1 3 1 5
3 1977 0 NA 0 NA
4 ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that = 1 if an individual is attending school when they are 19/20 years old, e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!) so I struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong, I changed your code a bit just to add another column to the long format data table. Here is what I did in the end:
df %>%
melt(id = c("ID", "DOB")) %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
so it looks something like e.g.
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
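For what it's worth, the culprit is operator precedence: %in% binds more tightly than binary -, so value - DOB %in% c(19,20) is parsed as value - (DOB %in% c(19,20)), which is non-zero (and therefore treated as TRUE) whenever value isn't 0. Wrapping the subtraction in parentheses fixes it:
df %>%
  melt(id = c("ID", "DOB")) %>%
  tbl_df() %>%
  # parentheses force the subtraction to happen before %in%
  mutate(dummy = ifelse((value - DOB) %in% c(19, 20), 1, 0))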
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column but when the condition does not hold it will be equal to NA
As @Greg Snow mentions, this approach assumes that the column was already created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package does exactly that. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
df %>%
melt(id = "DOB") %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
@Warner shows a way to create the variable (or at least the 1's; the assumption is the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)
