Calculate mean after last time of specific column specification - r

example.df <- data.frame(GY = sample(300:600, 200, replace = TRUE),
                         sacc = rep("f", each = 100),
                         trial.number = rep(1:2, each = 100),
                         stringsAsFactors = FALSE)
example.df$sacc[50:70] <- "s"
example.df$sacc[164:170] <- "s"
I have data that looks similar to this. I would like to calculate the mean of GY over all rows after the last appearance of "s", i.e. the remaining rows where sacc is "f". In this example I could of course just average over rows 71:100, but in the real data the positions are not known in advance.
What I tried after the comment of Ronak (thanks!):
library(dplyr)
example.df %>%
group_by(trial.number) %>%
summarise(mean_tr = mean(GY[(max(which(sacc == "s")) + 1) : n()]))
%>%
data.frame()
I can't get it to work. Can someone help me out? My original data.frame has 70k rows and consists of a lot of variables. class = data.frame.

Update
As we need to do this by group, we can split it on trial.number and then apply the same operation to each group.
sapply(split(example.df, example.df$trial.number), function(x)
mean(x$GY[(max(which(x$sacc == "s")) + 1) : nrow(x)]))
# 1 2
#446.2333 471.7000
The same using dplyr could be achieved by
library(dplyr)
example.df %>%
group_by(trial.number) %>%
summarise(mean_tr = mean(GY[(max(which(sacc == "s")) + 1) : n()])) %>%
data.frame()
# trial.number mean_tr
#1 1 446.2333
#2 2 471.7000
Confirming again,
mean(example.df$GY[71:100])
#[1] 446.2333
mean(example.df$GY[171:200])
#[1] 471.7
Original Answer
We could do
mean(example.df$GY[(max(which(example.df$sacc == "s")) + 1) : nrow(example.df)])
#[1] 443.6667
Here, we first get all the indices where sacc is "s" then take max of it to get last occurrence. We get the mean of GY values from that index to end of the dataframe (nrow(example.df)).
To confirm,
mean(example.df$GY[71:100])
#[1] 443.6667
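A caveat on the approach above, offered as a sketch rather than a definitive fix: if a group contains no "s" at all, `max(which(...))` returns `-Inf` with a warning, and if the last "s" is the final row of the group, `(n + 1):n` counts backwards. The helper below returns `NA` in both cases. The name `mean_after_last` and the toy data are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: mean of `value` over the rows after the last `marker`.
mean_after_last <- function(value, flag, marker = "s") {
  hits <- which(flag == marker)
  if (length(hits) == 0) return(NA_real_)       # marker never occurs
  last <- max(hits)
  if (last == length(value)) return(NA_real_)   # nothing follows the marker
  mean(value[(last + 1):length(value)])
}

# Toy data: trial 1 has an "s" in row 2, trial 2 has none.
toy <- data.frame(
  GY    = c(10, 20, 30, 40, 50, 60, 70, 80),
  sacc  = c("f", "s", "f", "f", "f", "f", "f", "f"),
  trial = c(1, 1, 1, 1, 2, 2, 2, 2),
  stringsAsFactors = FALSE
)

res <- sapply(split(toy, toy$trial),
              function(x) mean_after_last(x$GY, x$sacc))
res
#  1  2
# 35 NA
```

The same helper could be called inside `summarise()` after `group_by(trial.number)` if you prefer dplyr.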

Related

More efficient way to compute mean for subset

In this dataframe:
df <- data.frame(
comp = c("pre",rep("story",4), rep("x",2), rep("story",3)),
hbr = c(101:110)
)
let's say I need to compute the mean of hbr for the first stretch where comp == "story". How would I do that more efficiently than the following, which seems bulky and long-winded and requires me to specify manually the grp I want to compute the mean for:
library(dplyr)
library(data.table)
df %>%
mutate(grp = rleid(comp)) %>%
summarise(M = mean(hbr[grp==2]))
M
1 103.5
I'm not sure if this is any better, but at least you only need to specify that you want the first run of 'story':
df %>%
mutate(grp = ifelse(comp == 'story', rleid(comp), NA)) %>%
filter(grp == min(grp, na.rm = TRUE)) %>%
summarise(M = mean(hbr))
#> M
#> 1 103.5
In base R, you can select the desired rows using cumsum and diff, and then choosing which group you need (here it's the first, so 1), and then compute the mean on those rows. With this option, you don't need to get the group you need manually and you don't require any additional packages.
idx <- which(df$comp == "story")
first <- idx[cumsum(c(1, diff(idx) != 1)) == 1]
#[1] 2 3 4 5
mean(df$hbr[first])
#[1] 103.5
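Another base R possibility, shown here only as a sketch, is to use `rle()` to locate the first run of "story" directly. The variable names `runs`, `k`, `start` and `end` are mine; the data is the example from the question.

```r
df <- data.frame(
  comp = c("pre", rep("story", 4), rep("x", 2), rep("story", 3)),
  hbr  = 101:110,
  stringsAsFactors = FALSE
)

runs  <- rle(as.character(df$comp))       # run lengths: pre 1, story 4, x 2, story 3
k     <- which(runs$values == "story")[1] # position of the first "story" run
end   <- cumsum(runs$lengths)[k]          # last row of that run
start <- end - runs$lengths[k] + 1        # first row of that run

m <- mean(df$hbr[start:end])
m
# [1] 103.5
```

Changing the `[1]` to `[2]` would pick the second run of "story" instead.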

Remove observation before certain row

I have a data frame and I want to compute the mean of the variable value over the whole period, excluding the observation where crisis is 1 and the two observations before and after it (I don't care about missing values). The calculation should be done by country (even though the example below has only one country). Example:
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
mean(df$value[df$crisis == 0], na.rm=TRUE)
# expected result
exp_mean <- (1+2+3+9+10)/5
exp_mean
edit:
I would like to get a general case where we take into account other possible 1 in the dataset, for instance if we have
crisis[10] = 1
the result should be (3+9)/2
in order not to consider the periods that come after the first crisis but actually fall within the window of the second crisis. Any idea?
Another base R solution, using outer + c + unique to filter out rows, i.e.,
r <- mean(na.omit(df[-unique(c(outer(which(df$crisis==1),-2:2,"+"))),"value"]))
such that
> r
[1] 5
We can write a function which excludes the observations that lie within +- 2 rows of a row where crisis = 1.
custom_mean <- function(c, v) {
inds <- which(c == 1)
mean(v[-unique(c(sapply(inds, `+`, -2:2)))], na.rm = TRUE)
}
sapply is used assuming there could be multiple crisis = 1 situations for a country.
We can then apply this function for each country.
library(dplyr)
df %>% group_by(country) %>% summarise(exp_mean = custom_mean(crisis, value))
# A tibble: 1 x 2
# country exp_mean
# <fct> <dbl>
#1 AT 5
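One thing to watch with `custom_mean` above: when a crisis sits in the first or last two rows, the window produces indices below 1 or beyond the length of the vector, and negative indexing then mixes signs. The variant below is a sketch that clamps the window to valid rows and also handles the case of no crisis at all. The name `safe_custom_mean` and the `window` argument are my own additions for illustration.

```r
# Hypothetical variant of custom_mean that clamps the window to valid rows.
safe_custom_mean <- function(crisis, value, window = 2) {
  hits <- which(crisis == 1)
  if (length(hits) == 0) return(mean(value, na.rm = TRUE))  # no crisis: plain mean
  drop <- unique(as.vector(outer(hits, -window:window, "+")))
  drop <- drop[drop >= 1 & drop <= length(value)]           # keep valid row numbers only
  mean(value[-drop], na.rm = TRUE)
}

value  <- 1:10
crisis <- c(0, 0, 0, NA, 0, 1, 0, NA, 0, 0)
r_mid  <- safe_custom_mean(crisis, value)       # drops rows 4:8
r_mid
# [1] 5

crisis_edge <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
r_edge <- safe_custom_mean(crisis_edge, value)  # drops rows 1:3
r_edge
# [1] 7
```

It can be used in the `summarise()` call in place of `custom_mean`.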
This solution using base R works as long as there is only one row with crisis == 1 and as long as there are always two rows before and after that row:
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
df[(which(df$crisis == 1) - 2):(which(df$crisis == 1) + 2), ]
This solution does not work for this data:
country <- rep("AT",11)
value <- seq(1,11,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0,1)
df2 <- data.frame(country, value, crisis)
df2[(which(df2$crisis == 1) - 2):(which(df2$crisis == 1) + 2), ]
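The reason the last line breaks is that `which(df2$crisis == 1)` now returns two positions, and `:` only uses the first element of each operand. As a sketch of one way around it (variable names `hits`, `rows` and `around` are mine), you can build the window for every crisis row at once and drop row numbers that fall outside the data:

```r
crisis <- c(0, 0, 0, NA, 0, 1, 0, NA, 0, 0, 1)
df2 <- data.frame(country = "AT", value = 1:11, crisis = crisis)

hits <- which(df2$crisis == 1)                          # rows 6 and 11
rows <- sort(unique(as.vector(outer(hits, -2:2, "+")))) # 4:8 and 9:13 combined
rows <- rows[rows >= 1 & rows <= nrow(df2)]             # drop 12 and 13

around <- df2[rows, ]     # rows inside any crisis window
nrow(around)
# [1] 8
```

Using `df2[-rows, ]` instead gives the rows outside every window, which is what the mean in the question is computed on.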

R - Extracting values from other rows

As suggested by the title, I would like to extract values from other rows.
In particular, as an example please consider the following dataset:
id.in.group <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
group <- c(1,1,1,2,2,2,3,3,3,4,4,4,1,1,1,2,2,2,3,3,3,4,4,4,1,1,1,2,2,2,3,3,3,4,4,4)
trial <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
subject <- c("s7","s11","s3","s6","s9","s4","s12","s10","s1","s8","s2","s5","s5","s9","s6","s10","s1","s3","s4","s7","s2","s8","s12","s11","s5","s3","s9","s12","s11","s10","s1","s6","s7","s4","s2","s8")
df <- data.frame(group, id.in.group, trial, subject)
df$other1.id <- 0
df$other2.id <- 0
df$other1.id <- ifelse(df$id.in.group == "1" , 2, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "1" , 3, df$other2.id)
df$other1.id <- ifelse(df$id.in.group == "2" , 1, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "2" , 3, df$other2.id)
df$other1.id <- ifelse(df$id.in.group == "3" , 1, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "3" , 2, df$other2.id)
View(df)
Given the group number (df$group) and the id of the others in the group (df$other1.id and df$other2.id), I would like to create two further variables showing, for each trial and each subject, the value of the other 2 subjects rather than their relative id.in.group, so as to get the two following columns
df$other1.subject<-c("s11","s7","s7","s9","s6","s6","s10","s12","s12","s2","s8","s8","s9","s5","s5","s1","s10","s10","s7","s4","s4","s12","s8","s8", "s3","s5","s5","s11","s12","s12","s6","s1","s1","s2","s4","s4")
df$other2.subject<-c("s3","s3","s11","s4","s4","s9","s1","s1","s10","s5","s5","s2","s6","s6","s9","s3","s3","s1","s2","s2","s7","s11","s11","s12","s9","s9","s3","s10","s10","s11","s7","s7","s6","s8","s8","s2")
View(df)
For instance, if trial = 1 and id.in.group = 1 (or alternatively, subject = s7), then other1.subject = s11 while other2.subject = s3. I would like to extract such values for each id.in.group (or each subject) or for each row.
I apologize for not providing any previous attempt but, honestly, I have no clue how to tackle the problem. I am happy to provide any further clarification.
Many thanks for all your help!
You need to left join df with itself two times - one for other1, second for other2:
library(dplyr)
df %>%
left_join(
df %>%
select(group, trial, other1.id = id.in.group, other1.subject = subject),
by = c("group", "trial", "other1.id")
) %>%
left_join(
df %>%
select(group, trial, other2.id = id.in.group, other2.subject = subject),
by = c("group", "trial", "other2.id")
)
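If you would rather avoid the joins, a base R lookup with `match()` on a combined key should give the same result. This is only a sketch on a cut-down version of the data (trial 1, groups 1 and 2); the `key` variable and the vector-indexing shortcut for the `other*.id` columns are my own.

```r
df <- data.frame(
  group       = c(1, 1, 1, 2, 2, 2),
  id.in.group = c(1, 2, 3, 1, 2, 3),
  trial       = 1,
  subject     = c("s7", "s11", "s3", "s6", "s9", "s4"),
  stringsAsFactors = FALSE
)

# ids of the two other group members, looked up by own id
df$other1.id <- c(2, 1, 1)[df$id.in.group]
df$other2.id <- c(3, 3, 2)[df$id.in.group]

# one key per (group, trial, id); match() finds the row holding the other subject
key <- paste(df$group, df$trial, df$id.in.group)
df$other1.subject <- df$subject[match(paste(df$group, df$trial, df$other1.id), key)]
df$other2.subject <- df$subject[match(paste(df$group, df$trial, df$other2.id), key)]

df$other1.subject
# [1] "s11" "s7"  "s7"  "s9"  "s6"  "s6"
df$other2.subject
# [1] "s3"  "s3"  "s11" "s4"  "s4"  "s9"
```

These values agree with the first six entries of the expected columns in the question.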

Using apply to replace nested for loop

My goal is to go through various signals and ignore any 1's that are not part of a series (at least two 1's in a row). The data is an xts time series with 180K+ columns and 84 months; I've provided a small, simplified data set below. I used a nested for loop, which works but is horribly inefficient and takes far too long on the entire data set.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
library(xts)
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over Signals to change 1's to 0's for one month events
for(col in 1:ncol(mod_sig)) {
for(row in 1:nrow(mod_sig)) {
val <- ifelse(mod_sig[row,col] == 1,
ifelse(mod_sig[row-1,col] == 0,
ifelse(mod_sig[row+1,col] == 0,0,1),1),0)
Signals_Fin[row, col] <- val
}
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr. I only made slight modifications based on what was given and some help from colleagues.
Answer code:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
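library(dplyr)
library(xts)
# keep a 1 only if the previous or the next value is also 1
Signals_fin <- mod_sig %>%
mutate_all(~ ifelse(. == 1 & (lag(.) == 1 | lead(.) == 1), 1, 0)) %>%
# first/last rows have no neighbour on one side and can come out as NA; set those to 0
mutate_all(~ ifelse(is.na(.), 0, .))
Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))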
Here's a stab from a dplyr perspective. I converted your row names to a column, but you can just as easily convert them back to row names with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), function(x){
if_else(x == 1 &
(lag(x, order_by = .$months) == 1 |
lead(x, order_by = .$months) == 1),
1,
0)
})
As suggested by @Ryan, his mutate_at call is more elegant. It's important that everything is already sorted, though:
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
as.data.frame() %>%
mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
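With 180K+ columns, it may also be worth trying plain matrix arithmetic, since the whole operation can be done on shifted copies of the matrix without any per-column function calls. The following is a sketch on two of the example columns; the names `prev`, `nxt` and `out` are mine, and rows before the first and after the last month are treated as 0. For an xts object, `coredata(mod_sig)` should give you the underlying matrix to start from.

```r
m <- cbind(
  a = c(0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1),
  e = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
)

prev <- rbind(0, m[-nrow(m), , drop = FALSE])  # value one row earlier (0 before the start)
nxt  <- rbind(m[-1, , drop = FALSE], 0)        # value one row later (0 after the end)

# keep a 1 only where at least one neighbour is also 1
out <- m * ((prev + nxt) > 0)

out[, "a"]
# [1] 0 0 0 0 0 1 1 0 0 0 0 0 1 1
out[, "e"]
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
```

The result can be wrapped back into xts with the original index if needed.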

Add time strata variable and change format in r

I have tried to get an answer to this with no luck. Hopefully someone out there can assist me. I have a data set of patients.
PatientID <- c('1', "1", "1","1", "2","2","2","2","3","3","3","3")
admission.duration.minutes <- c(0,0.5,1.2,2,0,2.5,3.6,8,0,4,22,24)
has.fever <- c(1,1,NA,0,1,NA,1,1,NA,0,1,NA)
on.ventilator<-c(1,0,1,1,0,1,0,1,NA,1,0,NA)
high.bloodpressure<-c(1,0,1,0,1,0,1,1,1,1,NA,1)
df <- data.frame(PatientID, admission.duration.minutes, has.fever,on.ventilator,high.bloodpressure)
I want to change the dataset so I have one line per patient and I want to calculate how many patients had fever in hour 1, on ventilator in hour 1, high blood pressure in hour 1, combinations of fever and ventilator and blood pressure in hour 1. The same for hour 2, 3, etc.
So I believe I first need to add a time strata variable that defines hour 1, 2, 3 etc. So Hour 1 = 0.0 - 1.0 and Hour 2 is >1.0 to 2.0. And then do a conditional count or something like that.
I have tried with the publish package, but cannot get the output right.
The output from the new data frame should look something like this:
PatientID hour1.fev hour1.vent hour1.BP hour1.fev&vent hour1.fev&BP
1 1 1 1 1 1
hour1.vent&BP hour2.fev hour2.vent hour2.BP hour2.fev&vent hour2.fev&BP
1 0 1 0 1 1
hour2.vent&BP
1
Can you help me?
As an initial approach I would propose the following way. First of all, group the data by the patients and the time spans
library("dplyr")
# definition of time spans
df$strata <- if_else(df$admission.duration.minutes == 0, 1, ceiling(df$admission.duration.minutes))
# note that NA measurements are silently treated as zeros here
df_groupped <- df %>% group_by(PatientID, strata) %>% summarise_at(vars(has.fever:high.bloodpressure),
sum, na.rm = TRUE)
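To make the strata rule concrete, here is a small base R sketch of the same mapping on a few of the times from the question. It uses `pmax(1, ceiling(x))`, which should be equivalent to the `if_else` above for non-negative times; the vector names are mine. Note that the column is called admission.duration.minutes, but both the question and this answer treat the values as hours.

```r
hours  <- c(0, 0.5, 1.2, 2, 2.5, 3.6, 8)

# time 0 goes into hour 1; otherwise (k - 1, k] maps to hour k
strata <- pmax(1, ceiling(hours))
strata
# [1] 1 1 2 2 3 4 8
```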
If we want to process NA in another way, the solution may be
# the result is NA only if all values of a parameter in the stratum are NA
df_groupped <- df %>% group_by(PatientID, strata) %>%
summarise_at(.vars = vars(has.fever:high.bloodpressure),
.funs = ~ if (all(is.na(.))) NA_real_ else sum(., na.rm = TRUE))
So, we obtain the grouped data frame in a long format
# transform numbers of measurements to booleans
df_groupped <- df_groupped %>% mutate(
has.fever = as.integer(as.logical(has.fever)),
on.ventilator = as.integer(as.logical(on.ventilator)),
high.bloodpressure = as.integer(as.logical(high.bloodpressure)),
# ".and." means `*` instead of `+`
fev.and.BP = as.integer(as.logical(has.fever * high.bloodpressure)),
fev.and.vent = as.integer(as.logical(has.fever * on.ventilator))
)
Then create a function to generate a data frame of a desired structure:
fill_form <- function(periods, df_Patient, n_param){
# obtain names of the measured parameters & the first column
long_col_names <- names(df_Patient)[-(1:2)]
# build "hour<p>.<parameter>" names for every period p
long_df_names <- sapply(periods, function(p) paste0("hour", p, ".", long_col_names))
# add the names of the first column with the Patient's ID
long_df_names <- c(names(df_Patient)[1], long_df_names)
long_df <- as.data.frame(matrix(NA, nrow = 1, ncol = 1 + length(periods) * n_param))
names(long_df) <- long_df_names
long_df[, 1] <- as.character(df_Patient[1, 1])
for (i in seq(along.with = periods)) {
if (nrow(filter(df_Patient, strata == periods[i])) > 0) {
# period i occupies exactly n_param columns, starting right after the ID column
long_df[ ,(2 + n_param * (i - 1)):(1 + n_param * i)] <- filter(df_Patient, strata == periods[i])[-(1:2)]
}
}
return(long_df)
}
And then finally apply this function to the data of each individual patient
# the ID's of the patients extracted from the initial df
PatientIDs_names <- unique(unlist(lapply(df["PatientID"], as.character)))
n_of_patients <- length(PatientIDs_names)
n_monit_param <- (ncol(df_groupped) - 2)
# outputted periods are restricted for demonstration purposes
hours_to_monitor <- c(1:5)
records <- lapply(function(i) fill_form(periods = hours_to_monitor,
df_Patient = filter(df_groupped, PatientID == PatientIDs_names[i]), n_param = n_monit_param),
X = seq(along.with = PatientIDs_names))
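`records` is a list with one single-row data frame per patient. Since they all share the same columns, they can be stacked into one wide data frame with `do.call(rbind, ...)`. The sketch below uses a made-up two-patient list with only two hour columns, just to show the shape of the result; the values are not taken from the question.

```r
# Illustrative stand-in for the `records` list (values are invented).
records <- list(
  data.frame(PatientID = "1", hour1.has.fever = 1, hour2.has.fever = 0,
             stringsAsFactors = FALSE),
  data.frame(PatientID = "2", hour1.has.fever = 1, hour2.has.fever = NA,
             stringsAsFactors = FALSE)
)

wide <- do.call(rbind, records)  # one row per patient
dim(wide)
# [1] 2 3
```

Column sums of `wide` (with `na.rm = TRUE`) would then give the number of patients per hour and parameter.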
Hope this helps. However, I'm not sure about two things:
1) Both hour2.fev and hour2.BP are 0 in your example output, so why is hour2.fev&vent 1?
2) Why is high.bloodpressure 0 for PatientID == 1 in the second time span? There is a high.bloodpressure == 1 at time 1.2 hours, which should fall into the second time span (hour 2, between 1 and 2), shouldn't it?
