I have a data frame and I want to compute the mean of the variable value over the whole period, excluding a window of ±2 observations around each observation where crisis is 1 (I don't care about missing values). The calculation should be done by country (even though the example below has only one country). Example:
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
mean(df$value[df$crisis == 0], na.rm=TRUE)
# expected result
exp_mean <- (1+2+3+9+10)/5
exp_mean
edit:
I would like to get a general case where we take into account other possible 1 in the dataset, for instance if we have
crisis[10] = 1
the result should be (3+9)/2
so that periods which come after the first crisis but still fall within the window of the second crisis are not considered either. Any idea?
Another base R solution, using outer + c + unique to filter out rows, i.e.,
r <- mean(na.omit(df[-unique(c(outer(which(df$crisis==1),-2:2,"+"))),"value"]))
such that
> r
[1] 5
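To make the one-liner easier to follow, here is a minimal step-by-step sketch of how the index set is built, using the crisis vector from the question. The intermediate names (hits, window, drop, kept) are introduced here for illustration only.

```r
# Crisis flags from the question's example
crisis <- c(0, 0, 0, NA, 0, 1, 0, NA, 0, 0)
value  <- 1:10

# Positions where crisis == 1 (which() ignores NA)
hits <- which(crisis == 1)

# One row per crisis position, one column per offset in -2:2
window <- outer(hits, -2:2, "+")

# Flatten the matrix and remove duplicates: these positions are dropped
drop <- unique(c(window))

# Everything outside the window is kept and averaged
kept <- value[-drop]
mean(kept)
```

Here drop is 4:8, so the mean is taken over 1, 2, 3, 9 and 10.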
We can write a function which excludes the observations that lie within ±2 positions of a crisis = 1 observation.
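custom_mean <- function(crisis, value) {
  # positions where a crisis is flagged
  inds <- which(crisis == 1)
  # drop every position within +/- 2 of a crisis, then average the rest
  mean(value[-unique(c(sapply(inds, `+`, -2:2)))], na.rm = TRUE)
}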
sapply is used assuming there could be multiple crisis = 1 situations for a country.
We can then apply this function for each country.
library(dplyr)
df %>% group_by(country) %>% summarise(exp_mean = custom_mean(crisis, value))
# A tibble: 1 x 2
# country exp_mean
# <fct> <dbl>
#1 AT 5
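Both index-based answers rely on negative indexing, which can misbehave when a crisis sits within two rows of the start of a group (the window then contains 0 or negative positions) or when a group has no crisis at all (an empty negative index selects nothing). A minimal base R sketch that guards against both cases is shown below; mean_outside_crisis and the second country "BE" are hypothetical and only serve the illustration.

```r
# Hypothetical helper: mean of `value` outside a +/- k window around crisis == 1
mean_outside_crisis <- function(crisis, value, k = 2) {
  hits <- which(crisis == 1)
  # All positions that fall inside any window
  drop <- unique(c(outer(hits, -k:k, "+")))
  # Keep only positions that actually exist in this group
  drop <- drop[drop >= 1 & drop <= length(value)]
  # A logical mask also works when there is nothing to drop
  keep <- !(seq_along(value) %in% drop)
  mean(value[keep], na.rm = TRUE)
}

df <- data.frame(
  country = rep(c("AT", "BE"), each = 10),
  value   = rep(1:10, 2),
  crisis  = c(0, 0, 0, NA, 0, 1, 0, NA, 0, 0,  # AT: crisis at position 6
              1, 0, 0, 0, 0, 0, 0, 0, 0, 0)    # BE: crisis at position 1
)

# One mean per country
res <- sapply(split(df, df$country),
              function(g) mean_outside_crisis(g$crisis, g$value))
res
```

For "AT" this reproduces the expected mean of 5; for "BE" the window is clipped to positions 1 to 3, leaving the mean of 4 to 10.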
This base R solution works as long as there is only one row with 'crisis == 1' and as long as there are always two rows before and after that row:
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
df[(which(df$crisis == 1) - 2):(which(df$crisis == 1) + 2), ]
This solution does not work for this data:
country <- rep("AT",11)
value <- seq(1,11,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0,1)
df2 <- data.frame(country, value, crisis)
df2[(which(df2$crisis == 1) - 2):(which(df2$crisis == 1) + 2), ]
I'm trying to create multiple data frames within a list within another list from one original database using two for loops.
The first loop runs over the original database and uses the levels of the Site factor as index to group the data by site, creating a list of sites.
The second loop (the one I'm having problems with) should create data frames within each site's list, grouped by year.
set.seed(100)
N <- sample(50, 100, replace = TRUE)
Year <- as.factor(sample(rep(2011:2020, each = 5)))
Site <- as.factor(sample(rep(c('S1', 'S2', 'S3', 'S4', 'S5'), each = 10)))
Species <- sample(rep(c('spp1', 'spp2', 'spp3', 'spp4', 'spp5'), each = 10))
DataBase <- data.frame(Year, Site, Species, N)
Ind <- list()
Ind_year <- list ()
for (i in levels(DataBase$Site)) {
Ind[[i]] <- DataBase %>%
filter (Site == as.character(i)) %>%
group_by(Year, Species) %>%
count() %>%
droplevels()
for(j in levels(Ind[[(i)]]$Year)) {
Ind_year[[j]] <- as.data.frame(Ind[[i]] %>%
filter (Year == as.character(j)) %>%
group_by(Year, Species) %>%
droplevels())
}
}
No error detected, but the result within the first list is this:
Site 1
Site 2
Site 3
.
.
.
Year 1
Year 2
Year 3
For example, I want the Site 1 list within the Ind list to contain the data frames of Year 1...Year n.
Any help would be appreciated.
You seem to be very close to the solution. If I understood your problem correctly, only two more lines are needed, and I cleaned up your code a little. One slightly unfortunate aspect is that the year is a number: when a number is used directly as a list index, you get an entry at that list position instead of a named entry, so I converted the years to text before running the loop:
set.seed(100)
library(dplyr)
# Your dummy data - we do not need factors, but having the year as character is very helpful
DataBase <- data.frame(Year = as.character(sample(rep(2011:2020, each = 5))),
Site = sample(rep(c('S1', 'S2', 'S3', 'S4', 'S5'), each = 10)),
Species = sample(rep(c('spp1', 'spp2', 'spp3', 'spp4', 'spp5'), each = 10)),
N = sample(50, 100, replace = TRUE))
Ind <- list()
Ind_year <- list()
for (i in unique(DataBase$Site)) {
Ind[[i]] <- DataBase %>%
dplyr::filter(Site == i) %>%
dplyr::count(Year, Species)
for(j in unique(Ind[[i]]$Year)) {
Ind_year[[j]] <- Ind[[i]] %>%
dplyr::filter(Year == j) %>%
dplyr::group_by(Year, Species)
}
# put the inner loop list where the result of the corresponding first loop resides
Ind[[i]] <- Ind_year
# out of precaution we set the result to nothing so that there is no risk of reusing the result from the prior site
Ind_year <- NULL
}
Ind$S1$`2012`
# A tibble: 2 x 3
# Groups: Year, Species [2]
Year Species n
<chr> <chr> <int>
1 2012 spp3 2
2 2012 spp5 2
I hope this is what you were looking for.
You can split by multiple columns :
result <- split(DataBase, list(DataBase$Site, DataBase$Year))
Or if you want a nested list you can use split with lapply :
result <- lapply(split(DataBase, DataBase$Site), function(x) split(x, x$Year))
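As a small illustration of the nested variant, here is a sketch on a tiny made-up data set (the values below are not from the question) showing how the resulting list is accessed by site and then by year.

```r
# Tiny illustrative data set (made-up values)
DataBase <- data.frame(
  Year    = c(2011, 2011, 2012, 2011, 2012),
  Site    = c("S1", "S1", "S1", "S2", "S2"),
  Species = c("spp1", "spp2", "spp1", "spp3", "spp3"),
  N       = c(5, 3, 8, 2, 7)
)

# Outer split by site, inner split by year
nested <- lapply(split(DataBase, DataBase$Site), function(x) split(x, x$Year))

names(nested)         # one entry per site
names(nested$S1)      # one entry per year, named by the year as text
nested$S1[["2011"]]   # the S1 rows from 2011
```

If Year is a factor, as in the question, the inner split keeps an empty data frame for every unused level; passing drop = TRUE to split should remove those.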
I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
NHS_Trust = sample(1:30,20,T),
Week = sample(1:10,20,T),
Region = sample(1:15,20,T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
NHS_Trust = rep(1:30, each=10),
Week = rep(seq(1,10),30),
Region = rep(as.integer(runif(30,1,15)),1,each = 10),
Jobs = rpois(10*30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group, count and assign to a new column in a single expression. The syntax for data.tables is dt[i, j, by]. Here i selects (and optionally orders) the rows to work on; it is empty in this case, so all rows are used in their original order. The j says what is to be done: here it counts the number of rows with .N and assigns the result to the new column count using the assignment operator :=. The by takes a list of variables, and the j operation is performed within each group they define.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
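For comparison, roughly the same "count per group, stored as a new column" step can be sketched in base R with ave(). This is an alternative to the data.table call, not a description of how data.table works internally, and the small data frame is made up for the illustration.

```r
# Small made-up example with repeated Trust/Week/Region combinations
df1 <- data.frame(
  NHS_Trust = c(1, 1, 2, 2, 2),
  Week      = c(1, 1, 1, 2, 2),
  Region    = c(3, 3, 4, 4, 4)
)

# ave() applies FUN within each combination of the grouping vectors
# and returns a vector aligned with the original rows
df1$count <- ave(seq_len(nrow(df1)),
                 df1$NHS_Trust, df1$Week, df1$Region,
                 FUN = length)
df1
```

As with := in data.table, every row keeps its place and simply gains the size of its group.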
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
group_by(NHS_Trust, Week, Region) %>%
count()
You can use count to count number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
count(Region, NHS_Trust, Week, name = 'Jobs') %>%
tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
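One thing to watch when filling in combinations: each NHS Trust belongs to one region, so the grid should cross the observed (Region, NHS_Trust) pairs with the weeks rather than treat all three columns independently. As far as I can tell, completing only Region and Week leaves NHS_Trust missing in the filled rows; tidyr offers nesting() for use inside complete() for this purpose. The same idea is sketched below in base R on made-up data, with weeks 1 to 3 assumed as the range of interest.

```r
# Small made-up example
df1 <- data.frame(
  NHS_Trust = c(1, 1, 2, 2, 2),
  Week      = c(1, 1, 1, 3, 3),
  Region    = c(3, 3, 4, 4, 4)
)

# Observed job counts per Region / NHS_Trust / Week
counts <- aggregate(Jobs ~ Region + NHS_Trust + Week,
                    data = transform(df1, Jobs = 1), FUN = sum)

# Every observed (Region, NHS_Trust) pair crossed with every week of interest
pairs <- unique(df1[c("Region", "NHS_Trust")])
grid  <- merge(pairs, data.frame(Week = 1:3), by = NULL)

# Left join the counts onto the grid; combinations without postings get zero
full <- merge(grid, counts, all.x = TRUE)
full$Jobs[is.na(full$Jobs)] <- 0
full[order(full$NHS_Trust, full$Week), ]
```

The result has one row per trust and week, which is the balanced layout a longitudinal count model usually expects.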
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count(); colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
for (j in 1:length(unique(df2$NHS_Trust))){
for (k in 1:length(unique(df2$Week))){
curr_combo <- paste0(unique(df2$Region)[i], "_",
unique(df2$NHS_Trust)[j], "_",
unique(df2$Week)[k])
if(!curr_combo %in% df2$combo){
curdat <- data.frame(unique(df2$Region)[i],
unique(df2$NHS_Trust)[j],
unique(df2$Week)[k],
0,
curr_combo,
stringsAsFactors = FALSE)
#cat(curdat)
names(curdat) <- names(df2)
df2 <- rbind(as.data.frame(df2), curdat)
}
}
}
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am pretty sure the people here can come up with something more elegant than this.
I have a dataframe with two columns for year and age, e.g.:
df <- data.frame(year = 1980:2000, age = c(40:45, 31:40, 32:36))
I need to create a categorical variable that identifies each age sequence. That would look something like this:
df$seq <- as.character(c(rep(1,6), rep(2,10), rep(3,5)))
Any ideas how to do this efficiently? I have managed to create a dummy for sequence breaks
require(dplyr)
df <- df %>% mutate(brk = case_when(age - lag(age) != 1 ~ 1, T ~ 0))
but I'm struggling with filling in the rest.
You have almost done it already. You just need to create a cumulative sum (cumsum) of your brk column:
df %>% mutate(brk = cumsum(case_when(age - lag(age) != 1 ~ 1, T ~ 0)))
You can add 1 to the whole vector if you want to start the first sequence from 1 instead of 0.
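The same idea can be sketched without dplyr, using diff() to flag the breaks and adding 1 so that the first sequence is labelled 1. The name is_break is introduced here for illustration.

```r
df <- data.frame(year = 1980:2000, age = c(40:45, 31:40, 32:36))

# TRUE wherever age does not continue the previous row's run;
# the first row has no predecessor, so it is never a break
is_break <- c(FALSE, diff(df$age) != 1)

# Running count of breaks, shifted so the first run is labelled 1
df$seq <- as.character(cumsum(is_break) + 1)

table(df$seq)
```

The table shows three runs of lengths 6, 10 and 5, matching the target variable in the question.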
I have tried to get an answer to this with no luck. Hopefully someone out there can assist me. I have a data set of patients.
PatientID <- c('1', "1", "1","1", "2","2","2","2","3","3","3","3")
admission.duration.minutes <- c(0,0.5,1.2,2,0,2.5,3.6,8,0,4,22,24)
has.fever <- c(1,1,NA,0,1,NA,1,1,NA,0,1,NA)
on.ventilator<-c(1,0,1,1,0,1,0,1,NA,1,0,NA)
high.bloodpressure<-c(1,0,1,0,1,0,1,1,1,1,NA,1)
df <- data.frame(PatientID, admission.duration.minutes, has.fever,on.ventilator,high.bloodpressure)
I want to change the dataset so I have one line per patient and I want to calculate how many patients had fever in hour 1, on ventilator in hour 1, high blood pressure in hour 1, combinations of fever and ventilator and blood pressure in hour 1. The same for hour 2, 3, etc.
So I believe I first need to add a time strata variable that defines hour 1, 2, 3 etc. So Hour 1 = 0.0 - 1.0 and Hour 2 is >1.0 to 2.0. And then do a conditional count or something like that.
I have tried with the publish package, but cannot get the output right.
The output from the new data frame should look something like this:
PatientID hour1.fev hour1.vent hour1.BP hour1.fev&vent hour1.fev&BP
1 1 1 1 1 1
hour1.vent&BP hour2.fev hour2.vent hour2.BP hour2.fev&vent hour2.fev&BP
1 0 1 0 1 1
hour2.vent&BP
1
Can you help me?
As an initial approach I would propose the following way. First of all, group the data by the patients and the time spans
library("dplyr")
# definition of time spans
df$strata <- if_else(df$admission.duration.minutes == 0, 1, ceiling(df$admission.duration.minutes))
# note that NA measurements are silently treated as zeros here
df_groupped <- df %>% group_by(PatientID, strata) %>% summarise_at(vars(has.fever:high.bloodpressure),
sum, na.rm = TRUE)
If we want to process NA in another way, the solution may be
# the result is NA only if all parameters in the strata are NA
df_groupped <- df %>% group_by(PatientID, strata) %>%
summarise_at(.vars = vars(has.fever:high.bloodpressure),
.funs = funs(if (all(is.na(.))) NA else sum(., na.rm = TRUE)),
na.rm = FALSE)
So, we obtain the grouped data frame in a long format
# transform counts of measurements to 0/1 indicators
df_groupped <- df_groupped %>% mutate(
has.fever = as.integer(as.logical(has.fever)),
on.ventilator = as.integer(as.logical(on.ventilator)),
high.bloodpressure = as.integer(as.logical(high.bloodpressure)),
# ".and." means `*` instead of `+`
fev.and.BP = as.integer(as.logical(has.fever * high.bloodpressure)),
fev.and.vent = as.integer(as.logical(has.fever * on.ventilator))
)
Then create a function to generate a data frame of a desired structure:
fill_form <- function(periods, df_Patient, n_param){
# obtain names of the measured parameters & the first column
long_col_names <- names(df_Patient)[-(1:2)]
long_df_names <- sapply(function(i) paste("hour", periods[i], ".", long_col_names, sep =""), X = periods)
# add the names of the first column with the Patient's ID
long_df_names <- c(names(df_Patient)[1], long_df_names)
long_df <- as.data.frame(matrix(NA, nrow = 1, ncol = 1 + length(periods) * n_param))
names(long_df) <- long_df_names
long_df[, 1] <- as.character(df_Patient[1, 1])
for (i in seq(along.with = periods)) {
if (nrow(filter(df_Patient, strata == periods[i])) > 0) {
long_df[ ,(2 + n_param * (i - 1)):(1 + n_param * i)] <- filter(df_Patient, strata == periods[i])[-(1:2)]
}
}
return(long_df)
}
And then finally apply this function to the data of each individual patient
# the ID's of the patients extracted from the initial df
PatientIDs_names <- unique(unlist(lapply(df["PatientID"], as.character)))
n_of_patients <- length(PatientIDs_names)
n_monit_param <- (ncol(df_groupped) - 2)
# outputted periods are restricted for demonstration purposes
hours_to_monitor <- c(1:5)
records <- lapply(function(i) fill_form(periods = hours_to_monitor,
df_Patient = filter(df_groupped, PatientID == PatientIDs_names[i]), n_param = n_monit_param),
X = seq(along.with = PatientIDs_names))
I hope this is helpful. However, I'm not sure about two things:
1) Both hour2.fev and hour2.BP are 0 in your output example, so why is hour2.fev&vent equal to 1?
2) Why is high.bloodpressure 0 for PatientID == 1 in the second time span? There is a high.bloodpressure == 1 at time 1.2 hours. That time should fall into the second time span (Hour 2, between 1 and 2), shouldn't it?
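As a more compact alternative to the hand-written fill_form function, the one-row-per-patient layout can also be sketched in base R with aggregate() and reshape(). This is only a sketch under assumptions: it uses a reduced version of the question's data (two patients, two flags), treats admission.duration.minutes as hours like the answer above does, and the helper any_flag and the column naming scheme are introduced here.

```r
df <- data.frame(
  PatientID = c("1", "1", "1", "1", "2", "2", "2", "2"),
  admission.duration.minutes = c(0, 0.5, 1.2, 2, 0, 2.5, 3.6, 8),
  has.fever     = c(1, 1, NA, 0, 1, NA, 1, 1),
  on.ventilator = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Hour 1 covers [0, 1], hour 2 covers (1, 2], and so on
df$hour <- pmax(1, ceiling(df$admission.duration.minutes))

# 1 if the flag was set at least once in that patient-hour, else 0 (NAs ignored)
any_flag <- function(x) as.integer(any(x == 1, na.rm = TRUE))
long <- aggregate(cbind(has.fever, on.ventilator) ~ PatientID + hour,
                  data = df, FUN = any_flag, na.action = na.pass)

# Combination flag: both conditions present in the same hour
long$fev.and.vent <- long$has.fever * long$on.ventilator

# One row per patient, one block of columns per hour
wide <- reshape(long, idvar = "PatientID", timevar = "hour",
                direction = "wide", sep = ".hour")
wide
```

Hours in which a patient has no record at all come out as NA rather than 0, which keeps "not measured" distinct from "measured and absent".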
I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group
group var total
1 1 1.3697267 4.74936
2 1 1.5263502 4.74936
3 1 0.4065596 4.74936
4 1 1.4467237 4.74936
5 2 NA 0.00000
6 2 NA 0.00000
7 2 NA 0.00000
8 2 NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
if( all(x$total == 0) ) { x$var <- df[df$group==1, 2] } ; x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First, you need to make sure df is a data.frame. Second, your function CheckDay(x) uses both the local variable x (to which you pass df) and the global variable df itself; it's better to keep everything inside the function local. Finally, your call to do() is missing the (.) part: it should be do(df, CheckDay(.)). Try this, it should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
1) Generate df as previously.
2) Create df.shift, a copy of df with groups 1, 1, 2... etc., i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
3) Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
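A minimal base R sketch of that shift-and-copy idea follows. It assumes contiguous groups of equal size (4 rows here) and uses a third, made-up group to show that non-empty groups are left alone; var_shift stands in for the df.shift copy described above.

```r
set.seed(42)
n  <- 4  # rows per group (assumed equal for every group)
df <- data.frame(group = rep(1:3, each = n),
                 var   = c(abs(rnorm(n)), rep(NA, n), abs(rnorm(n))))

# Group totals, repeated on every row of the group
df$total <- ave(df$var, df$group, FUN = function(x) sum(x, na.rm = TRUE))

# Copy of var shifted down by one whole group; group 1 has no predecessor
var_shift <- c(rep(NA, n), head(df$var, -n))

# Overwrite only the rows whose group total is zero
idx <- df$total == 0
df$var[idx] <- var_shift[idx]
df
```

If two empty groups follow each other, a single pass fills only the first of them, so the shift would have to be repeated (or the totals recomputed) in that case.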