Vectorising linear interpolation function for use with mutate - r

I have a data frame that looks like this:
# Load dplyr for %>%, mutate and filter
library(dplyr)
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
region = sample(c("Here", "There", "Everywhere")),
dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not an integer, I can't just look up the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
# Filter to day and region
df_tmp <- df_ref %>% filter(region == re, day == da)
# Approximate answer from distance
approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
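A likely explanation, sketched below in base R under the assumption that df_ref is built as in the question: inside mutate, lookup is called once with the whole region, dist and day columns. The comparisons region == re and day == da are then done element by element (recycling the shorter vector) instead of selecting a single region and day, so approx interpolates over rows from a mix of regions and days. Calling the function once per row, here with mapply, avoids that. The helper name lookup_one is hypothetical.

```r
# Rebuild a reference table like the one in the question (base R only)
set.seed(33550336)
grid <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df_ref <- do.call(rbind, lapply(c("Here", "There", "Everywhere"),
                                function(r) cbind(grid, region = r)))
df_ref$value <- runif(nrow(df_ref))

# Scalar lookup: expects ONE region, ONE distance and ONE day (hypothetical name)
lookup_one <- function(re, di, da) {
  df_tmp <- df_ref[df_ref$region == re & df_ref$day == da, ]
  approx(df_tmp$dist, df_tmp$value, xout = di)$y
}

df_obs <- data.frame(day = c(6, 7, 10),
                     region = c("Everywhere", "There", "Here"),
                     dist = c(68.77991, 57.78280, 85.71628),
                     stringsAsFactors = FALSE)

# mapply calls lookup_one once per row, so each call sees scalars
df_obs$ref <- mapply(lookup_one, df_obs$region, df_obs$dist, df_obs$day,
                     USE.NAMES = FALSE)
df_obs
```

Inside a pipeline the same idea could be written as mutate(ref = mapply(lookup_one, region, dist, day)), or by wrapping the scalar function with Vectorize().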

Related

What is the best way to rewrite (simplify) the same logic to produce the same result as the code below in R?

I need to extract a sample that is equally distributed across experience-level groups. For your information, there are 4 groups in total (1, 2, 3, and 4 years of experience) and 8 people in total (A, B, C, D, E, F, G, H) in this example scenario. I was trying to come up with a function that uses loops, but I don't know how. Please help me out! Thank you! :)
library(tidyverse)
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4), pre_year_exp = year_exp - 1)
data_0 <- data %>% filter(year_exp == max(year_exp) - 0) %>% sample_n(2)
data_1 <- data %>% filter(year_exp == max(year_exp) - 1) %>% anti_join(data_0, by = 'id') %>% sample_n(2)
data_2 <- data %>% filter(year_exp == max(year_exp) - 2) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% sample_n(2)
data_3 <- data %>% filter(year_exp == max(year_exp) - 3) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% anti_join(data_2, by = 'id')
#Result Table
result <- data_0 %>% bind_rows(data_1, data_2, data_3)
result
The below produces the same output as your code and extends the idea to allow for an arbitrary number of values of year_exp using a for loop.
Please note that because this simply extends your code, it must share the following (possibly-undesirable) features with your code:
The code moves sequentially through groups, sampling from the members of later groups who were not sampled for early groups. Accordingly, there is a risk that the code throws an error because it tries to sample from groups whose members were already sampled from previous, other groups.
The probabilities of selection are not uniformly distributed across members of a group. Accordingly, the samples drawn from each group are not representative of that group.
If the data were instead a balanced panel, there would be much simpler and more efficient ways to accomplish this.
library(tibble)
library(dplyr)
set.seed(123)
# Create original data
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
pre_year_exp = year_exp - 1)
# Assign values to parameters used by/in the loop.
J <- data$id %>% unique %>% length # unique units/persons (8)
K <- data$year_exp %>% unique %>% length # unique groups/years (4)
N <- 2 # sample size per group (2)
# Initialize objects loop will modify
samples_list <- vector(mode = "list", length = K) # stores each sample
used_ids <- rep(NA_character_, J) # stores used ids
index <- 1:N # initial indices for used ids
# For-loop solution
for (k in 1:K) {
# Identifier for current group
cur_group <- 1 + K - k
# Sample from persons in current group who were not previously sampled
one_sample <- data %>%
filter(year_exp == cur_group, !(id %in% used_ids)) %>%
slice_sample(n = N)
# Save sample and the id values for those sampled
samples_list[[k]] <- one_sample
used_ids[index] <- one_sample$id
index <- index + N
}
# Bind into a single data.frame
bind_rows(samples_list)
#> # A tibble: 8 x 3
#> id year_exp pre_year_exp
#> <chr> <dbl> <dbl>
#> 1 H 4 3
#> 2 D 4 3
#> 3 G 3 2
#> 4 E 3 2
#> 5 C 2 1
#> 6 B 2 1
#> 7 F 1 0
#> 8 A 1 0
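The risk described above, a later group running out of unsampled members, can be guarded against by capping the draw at the number of eligible rows. The following base R sketch reuses the data from the answer; the names eligible and n_draw are illustrative, and capping the sample size is an assumption about the desired behaviour rather than part of the original code.

```r
set.seed(123)
data <- data.frame(
  id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E",
         "F","F","G","G","G","H","H","H","H"),
  year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
  stringsAsFactors = FALSE
)
N <- 2                      # requested sample size per group
used_ids <- character(0)    # ids already drawn for an earlier group
samples <- list()

# Work from the most experienced group down, as in the loop above
for (g in sort(unique(data$year_exp), decreasing = TRUE)) {
  eligible <- data[data$year_exp == g & !(data$id %in% used_ids), ]
  n_draw <- min(N, nrow(eligible))          # never ask for more rows than exist
  picked <- eligible[sample(seq_len(nrow(eligible)), n_draw), ]
  samples[[length(samples) + 1]] <- picked
  used_ids <- c(used_ids, picked$id)
}
result <- do.call(rbind, samples)
result
```

With this data every group still has at least two eligible members, so the cap never triggers; it only matters when earlier draws exhaust a later group.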

Add the number in every row and take the sum

Having a dataframe like this
data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
How is it possible to take the sum of every row of the column num, taking the minus signs into account?
Example of expected output:
data.frame(id = c(1,2), sum = c(32, 30))
Using Base R you could do the following:
# data
df <- data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
# split by ",", convert to numeric and then sum
df[, 2] <- sapply(strsplit(as.character(df$num), ","), function(x){
sum(as.numeric(x))
})
# result
df
# id num
# 1 1 32
# 2 2 30
If you can use packages, the tidyverse packages make this easy. They follow tidy data principles, which are quick and easy to apply once you get used to thinking this way.
library(tidyr)
library(dplyr)
df %>%
# Convert the string of numbers to a tidy dataframe
# with one number per row with the id column for grouping
separate_rows(num,sep = ",") %>%
# Convert the text to a number so we can sum
mutate(num = as.numeric(num)) %>%
# Perform the calculation for each id
group_by(id) %>%
# Sum the number
summarise(sum = sum(num,na.rm = TRUE)) %>%
# Ungroup for further use of the data
ungroup()
# A tibble: 2 x 2
# id sum
# <dbl> <dbl>
# 1 1 32
# 2 2 30
Another option is an explicit loop using stringr:
library(stringr)
df <- data.frame(id = c(1,2), num = c("30, 4, -2","10, 20"))
df$sum <- NA
for (i in 1:nrow(df)) {
temp <- as.character(df[i,2])
n_num <- str_count(temp, '[0-9.]+')
total <- 0
for (j in 1:n_num) {
digit <- strsplit(temp, ',')[[1]][j]
total <- total + as.numeric(digit)
temp <- sub(digit, '', temp)
}
df[i, 'sum'] <- total
}
print(df)
id num sum
1 1 30, 4, -2 32
2 2 10, 20 30
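For comparison, a more compact base R sketch extracts every signed number with a regular expression and sums the matches per row. The pattern assumes plain integers or decimals with an optional leading minus sign; anything else (for example scientific notation) would need a different pattern.

```r
df <- data.frame(id = c(1, 2), num = c("30, 4, -2,", "10, 20"),
                 stringsAsFactors = FALSE)

# Pull out each optionally negative number, ignoring commas and spaces
matches <- regmatches(df$num, gregexpr("-?[0-9]+(\\.[0-9]+)?", df$num))

# Convert each row's matches to numeric and add them up
df$sum <- vapply(matches, function(x) sum(as.numeric(x)), numeric(1))
df
```

Because only the matched numbers are used, a trailing comma or extra whitespace in num does not need special handling.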

dplyr: ignore grouping variables for function input

I am trying to use tidyverse tools (instead of for loops) on some groups to be evaluated with procedures from the mvabund package.
Basically, for the procedure I need a dataframe with just numeric columns (species abundances) first and then grouping variables for a downstream procedure.
But if I want to do this on multiple groupings, I need to include grouping variables. However, when using group_by these non-numeric variables are still present and the procedure will not run.
How can I use dplyr to pass the numeric variables to a (mvabund) function?
If I were to do this for just one group, the process would be as follows:
library(tidyverse)
library(mvabund)
df <- data.frame(Genus.species1 = rep(c(0, 1), each = 10),
Genus.species2 = rep(c(1, 0), each = 10),
Genus.species3 = sample(1:100,20,replace=T),
Genus.species4 = sample(1:100,20,replace=T),
GroupVar1 = rep(c("Site1", "Site2"), each=2, times=5),
GroupVar2 = rep(c("AA", "BB"), each = 10),
GroupVar3 = rep(c("A1", "B1"), times=10))
df1 <- filter(df, GroupVar2 == "AA" & GroupVar3 == "A1") # get desired subset/group
df2 <- select(df1, -GroupVar1, -GroupVar2, -GroupVar3) # retain numeric variables
MVA.fit <- mvabund(df2) # run procedure
MVA.model <- manyglm(MVA.fit ~ df1$GroupVar1, family="negative binomial") # here I need to bring back GroupVar1 for this procedure
MVA.anova <- anova(MVA.model, nBoot=1000, test="wald", p.uni="adjusted")
MVA.anova$table[2,] # desired result
I have tried using map, do, nest, etc to no avail.
Without groupings this works
df.t <- as_tibble(df)
nest.df <- df.t %>% nest(-GroupVar1, -GroupVar2, -GroupVar3)
mva.tt <- nest.df %>%
mutate(mva.tt = map(data, ~ mvabund(.x)))
but this next step does not
mva.tt %>% mutate(MANY = map(data, ~ manyglm(.x ~ GroupVar1, family="negative binomial")))
Moreover, once I try to remove columns that sum to zero or include groupings, everything fails.
Is there a smart way to do this with dplyr and pipes? Or is a for loop the answer?
Edit:
Originally, I also asked: "Also, when broken into groups, the dataframe will contain columns that are all zeroes; normally I'd remove these. Can I have dplyr groupings that vary in the number of variables?" The comments revealed that this is not possible given my proposed setup, so I am still interested in the question above.
Copied the steps into a function. Also added group information to differentiate in the last line.
fun <- function(df) {
df1 <- select(df, -GroupVar1, -GroupVar2, -GroupVar3)
df3 <- df1 %>% select_if(~sum((.)) > 0)
MVA.fit <- mvabund(df3)
MVA.model <- manyglm(MVA.fit ~ df$GroupVar1, family="negative binomial")
MVA.anova <- anova(MVA.model, nBoot=1000, test="wald", p.uni="adjusted")
cbind(Group2 = df$GroupVar2[1], Group3 = df$GroupVar3[1], MVA.anova$table[2,])
}
Split the dataframe into groups and apply the function
library(tidyverse)
library(mvabund)
df %>%
group_split(GroupVar2, GroupVar3) %>%
map_dfr(fun)
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
# Group2 Group3 Res.Df Df.diff wald Pr(>wald)
#1 AA A1 3 1 1.028206 0.7432567
#2 AA B1 3 1 2.979169 0.1608392
#3 BB A1 3 1 2.330708 0.2137862
#4 BB B1 3 1 1.952617 0.2567433
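The same split-apply-combine pattern can be sketched in base R. Because mvabund may not be installed, the per-group function below is a hypothetical stand-in (summarise_group) that only drops the grouping columns and the all-zero columns and reports how many species columns remain; the mvabund and manyglm calls from the answer would go in its place.

```r
set.seed(1)
df <- data.frame(Genus.species1 = rep(c(0, 1), each = 10),
                 Genus.species2 = rep(c(1, 0), each = 10),
                 Genus.species3 = sample(1:100, 20, replace = TRUE),
                 GroupVar1 = rep(c("Site1", "Site2"), each = 2, times = 5),
                 GroupVar2 = rep(c("AA", "BB"), each = 10),
                 GroupVar3 = rep(c("A1", "B1"), times = 10),
                 stringsAsFactors = FALSE)

# Hypothetical stand-in for the model-fitting function
summarise_group <- function(d) {
  group_cols <- c("GroupVar1", "GroupVar2", "GroupVar3")
  num <- d[, !(names(d) %in% group_cols), drop = FALSE]
  num <- num[, colSums(num) > 0, drop = FALSE]   # drop all-zero species columns
  data.frame(Group2 = d$GroupVar2[1], Group3 = d$GroupVar3[1],
             n_rows = nrow(d), n_species = ncol(num))
}

# One data frame per GroupVar2 x GroupVar3 combination
groups <- split(df, list(df$GroupVar2, df$GroupVar3), drop = TRUE)
out <- do.call(rbind, lapply(groups, summarise_group))
out
```

Since each group is processed independently, the number of retained species columns is free to differ between groups.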

Add time strata variable and change format in r

I have tried to get an answer to this with no luck. Hopefully someone out there can assist me. I have a data set of patients.
PatientID <- c('1', "1", "1","1", "2","2","2","2","3","3","3","3")
admission.duration.minutes <- c(0,0.5,1.2,2,0,2.5,3.6,8,0,4,22,24)
has.fever <- c(1,1,NA,0,1,NA,1,1,NA,0,1,NA)
on.ventilator<-c(1,0,1,1,0,1,0,1,NA,1,0,NA)
high.bloodpressure<-c(1,0,1,0,1,0,1,1,1,1,NA,1)
df <- data.frame(PatientID, admission.duration.minutes, has.fever,on.ventilator,high.bloodpressure)
I want to change the dataset so I have one line per patient and I want to calculate how many patients had fever in hour 1, on ventilator in hour 1, high blood pressure in hour 1, combinations of fever and ventilator and blood pressure in hour 1. The same for hour 2, 3, etc.
So I believe I first need to add a time strata variable that defines hour 1, 2, 3 etc. So Hour 1 = 0.0 - 1.0 and Hour 2 is >1.0 to 2.0. And then do a conditional count or something like that.
I have tried with the publish package, but cannot get the output right.
The output from the new data frame should look something like this:
PatientID hour1.fev hour1.vent hour1.BP hour1.fev&vent hour1.fev&BP
1 1 1 1 1 1
hour1.vent&BP hour2.fev hour2.vent hour2.BP hour2.fev&vent hour2.fev&BP
1 0 1 0 1 1
hour2.vent&BP
1
Can you help me?
As an initial approach, I would propose the following. First, group the data by patient and time span:
library("dplyr")
# definition of time spans
df$strata <- if_else(df$admission.duration.minutes == 0, 1, ceiling(df$admission.duration.minutes))
# note that NA measurements are silently treated as zeros here
df_groupped <- df %>% group_by(PatientID, strata) %>% summarise_at(vars(has.fever:high.bloodpressure),
sum, na.rm = TRUE)
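As a small check of the strata rule (hour 1 covers 0 to 1, hour 2 covers values above 1 up to 2, and so on), the same assignment can be written in base R with pmax and ceiling, or with cut. The durations are taken from the question; the variable names strata_a and strata_b are illustrative.

```r
duration <- c(0, 0.5, 1.2, 2, 0, 2.5, 3.6, 8, 0, 4, 22, 24)

# ceiling() maps (k-1, k] to k; pmax() moves the value 0 into stratum 1
strata_a <- pmax(1, ceiling(duration))

# Equivalent formulation with right-closed intervals [0,1], (1,2], ...
breaks <- 0:ceiling(max(duration))
strata_b <- as.integer(cut(duration, breaks = breaks, include.lowest = TRUE))

strata_a
```

Both versions place a value that falls exactly on a boundary, such as 2, in the lower stratum.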
If we want to process NA in another way, the solution may be
# the result is NA only if all parameters in the strata are NA
df_groupped <- df %>% group_by(PatientID, strata) %>%
summarise_at(.vars = vars(has.fever:high.bloodpressure),
.funs = funs(if (all(is.na(.))) NA else sum(., na.rm = TRUE)),
na.rm = FALSE)
So, we obtain the grouped data frame in a long format
# transform numbers of measurements to booleans
df_groupped <- df_groupped %>% mutate(
has.fever = as.integer(as.logical(has.fever)),
on.ventilator = as.integer(as.logical(on.ventilator)),
high.bloodpressure = as.integer(as.logical(high.bloodpressure)),
# ".and." means `*` instead of `+`
fev.and.BP = as.integer(as.logical(has.fever * high.bloodpressure)),
fev.and.vent = as.integer(as.logical(has.fever * on.ventilator))
)
Then create a function to generate a data frame of a desired structure:
fill_form <- function(periods, df_Patient, n_param){
# obtain names of the measured parameters & the first column
long_col_names <- names(df_Patient)[-(1:2)]
long_df_names <- sapply(function(i) paste("hour", periods[i], ".", long_col_names, sep =""), X = periods)
# add the names of the first column with the Patient's ID
long_df_names <- c(names(df_Patient)[1], long_df_names)
long_df <- as.data.frame(matrix(NA, nrow = 1, ncol = 1 + length(periods) * n_param))
names(long_df) <- long_df_names
long_df[, 1] <- as.character(df_Patient[1, 1])
for (i in seq(along.with = periods)) {
if (nrow(filter(df_Patient, strata == periods[i])) > 0) {
# each period fills exactly n_param columns
long_df[ ,(2 + n_param * (i - 1)):(1 + n_param * i)] <- filter(df_Patient, strata == periods[i])[-(1:2)]
}
}
return(long_df)
}
And then finally apply this function to the data of each individual patient
# the ID's of the patients extracted from the initial df
PatientIDs_names <- unique(unlist(lapply(df["PatientID"], as.character)))
n_of_patients <- length(PatientIDs_names)
n_monit_param <- (ncol(df_groupped) - 2)
# outputted periods are restricted for demonstration purposes
hours_to_monitor <- c(1:5)
records <- lapply(function(i) fill_form(periods = hours_to_monitor,
df_Patient = filter(df_groupped, PatientID == PatientIDs_names[i]), n_param = n_monit_param),
X = seq(along.with = PatientIDs_names))
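For comparison, base R's reshape can also produce one row per patient from a long table that has one row per patient and stratum. This is a sketch on a small made-up table, not the method used above; the column suffix .hour and the toy values are assumptions for illustration.

```r
# Toy long table: one row per patient and hourly stratum (values are made up)
long <- data.frame(
  PatientID = c("1", "1", "2", "2", "3"),
  strata = c(1, 2, 1, 3, 1),
  has.fever = c(1, 0, 1, 1, NA),
  on.ventilator = c(1, 1, 0, 1, NA),
  stringsAsFactors = FALSE
)

# Spread the measurement columns across strata: one row per patient,
# one column per measurement and hour (e.g. has.fever.hour2)
wide <- reshape(long, idvar = "PatientID", timevar = "strata",
                direction = "wide", sep = ".hour")
wide
```

Strata that a patient never reached are filled with NA, which keeps them distinguishable from a measured 0.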
I hope this helps. However, I'm not sure about two things:
1) Both hour2.fev and hour2.BP are 0 in your example output, so why is hour2.fev&vent 1?
2) Why is high.bloodpressure 0 for PatientID == 1 in the second time span? There is a high.bloodpressure == 1 at 1.2 hours, which should fall into the second time span (hour 2, between 1 and 2), shouldn't it?

Compute variable according to factor levels

I am kind of new to R and programming in general. I am currently struggling with a piece of code for data transformation and hope someone can take a little bit of time to help me.
Below is a reproducible example:
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
Goal : Compute all values (a,b) using a reference value. Calculation should be : a/a_ref with a_ref = a when f2=0 depending on the family (f1 can be X,Y or Z).
I tried to solve this by using this code :
test <- filter(dt, f2!=0) %>% group_by(f1) %>%
mutate("a/a_ref"=a/(filter(dt, f2==0) %>% group_by(f1) %>% distinct(a) %>% pull))
I get:
[image: test results]
As you can see, a is divided by a_ref, but my script seems to recycle the reference values (a_ref) regardless of the family f1.
Do you have any suggestions so that a is computed with respect to the family (f1)?
Thank you for reading !
EDIT
I found a way to do it 'manually':
filter(dt, f1=="X") %>% mutate("a/a_ref"=a/(filter(dt, f1=="X" & f2==0) %>% distinct(a) %>% pull()))
f1 f2 a b a/a_ref
1 X 0 21.77605 24.53115 1.0000000
2 X 1 20.17327 24.02512 0.9263973
3 X 50 19.81482 25.58103 0.9099366
4 X 100 19.90205 24.66322 0.9139422
The problem is that I'd have to update the code for each variable and family, so this is not a clean way to do it.
# use this to reproduce the same dataset and results
set.seed(5)
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
dt %>%
group_by(f1) %>% # for each f1 value
mutate(a_ref = a[f2 == 0], # get the a_ref and add it in each row
"a/a_ref" = a/a_ref) %>% # divide a and a_ref
ungroup() %>% # forget the grouping
filter(f2 != 0) # remove rows where f2 == 0
# # A tibble: 9 x 6
# f1 f2 a b a_ref `a/a_ref`
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 X 1 21.38436 24.84247 19.15914 1.1161437
# 2 X 50 18.74451 23.92824 19.15914 0.9783583
# 3 X 100 20.07014 24.86101 19.15914 1.0475490
# 4 Y 1 19.39709 22.81603 21.71144 0.8934042
# 5 Y 50 19.52783 25.24082 21.71144 0.8994260
# 6 Y 100 19.36463 24.74064 21.71144 0.8919090
# 7 Z 1 20.13811 25.94187 19.71423 1.0215013
# 8 Z 50 21.22763 26.46796 19.71423 1.0767671
# 9 Z 100 19.19822 25.70676 19.71423 0.9738257
You can do this for more than one variable using:
dt %>%
group_by(f1) %>%
mutate_at(vars(a:b), funs(./.[f2 == 0])) %>%
ungroup()
Or generally use vars(a:z) to use all variables between a and z as long as they are one after the other in your dataset.
Another solution could be using mutate_if like:
dt %>%
group_by(f1) %>%
mutate_if(is.numeric, funs(./.[f2 == 0])) %>%
ungroup()
Where the function will be applied to all numeric variables you have. The variables f1 and f2 will be factor variables, so it just excludes those ones.
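Without dplyr, the same per-family normalisation can be sketched in base R by looking up each row's reference row with match. This assumes exactly one row with f2 == 0 per family, as in the example data; the names ref, idx and ratio are illustrative.

```r
set.seed(5)
dt <- data.frame(f1 = factor(rep(c("X", "Y", "Z"), each = 4)),
                 f2 = factor(rep(c(0, 1, 50, 100), 3)),
                 a = rnorm(12, 20),
                 b = rnorm(12, 25))

# Reference rows: one per family, where f2 is the reference level "0"
ref <- dt[dt$f2 == "0", ]

# For every row, find the position of its family's reference row
idx <- match(dt$f1, ref$f1)

# Divide each numeric column by the matching reference value
ratio <- dt
ratio[c("a", "b")] <- dt[c("a", "b")] / ref[idx, c("a", "b")]

# Drop the reference rows themselves, as in the answer above
ratio[ratio$f2 != "0", ]
```

The match step is what ties each row to its own family; dividing by the reference values directly would recycle them across families, which is the problem described in the question.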
