Replace value and previous value if condition is met in R

I have a dataset that looks like this:
df <- matrix(c(0.2, 0.5, 1, 3.1, 0.5, 0.3, 0.1, 4, 0.3, 1.2), nrow=5, ncol=2)
(This is a simplified example)
I would like to write a function or loop that checks whether each value (t) or its previous value (t-1) is bigger than 3, and that replaces both t and t-1 with NA if either one of them is bigger than 3.
The desired outcome would thus look something like this:
Thanks in advance.

Here is a base R solution which should yield your desired outcome. Note that since there is no "wrap-around" in your desired output, I turned the matrix into a data.frame so that each column is recoded separately.
# your data
df <-matrix(c(0.2, 0.5, 1, 3.1, 0.5, 0.3, 0.1, 4, 0.3, 1.2), nrow=5, ncol=2)
# needs to be converted to a data.frame
df <- as.data.frame(df)
# recode function
recode_df <- function(x) {
x2 <- c(NA,x[-length(x)])
x3 <- c(x[-1], NA)
replace(x, (x > 3 | x2 > 3 | x3 > 3), NA)
}
# apply recode function on all columns
as.data.frame(lapply(df, recode_df))
#> V1 V2
#> 1 0.2 0.3
#> 2 0.5 NA
#> 3 NA NA
#> 4 NA NA
#> 5 NA 1.2
Created on 2020-05-23 by the reprex package (v0.3.0)
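If you would rather keep the matrix, the same idea can be sketched with index arithmetic instead of shifted copies. This is a minimal sketch, not part of the answer above: `mask_neighbours` and its `threshold` argument are hypothetical names, and it assumes (as the printed output suggests) that the value before and the value after each hit are blanked along with the hit itself.

```r
# Example data from the question, kept as a matrix
m <- matrix(c(0.2, 0.5, 1, 3.1, 0.5, 0.3, 0.1, 4, 0.3, 1.2), nrow = 5, ncol = 2)

# Hypothetical helper: set every value above `threshold`, plus its
# immediate neighbours within the same vector, to NA
mask_neighbours <- function(x, threshold = 3) {
  hit <- which(x > threshold)              # positions exceeding the threshold
  idx <- unique(c(hit - 1, hit, hit + 1))  # add previous and next positions
  idx <- idx[idx >= 1 & idx <= length(x)]  # drop positions outside the vector
  x[idx] <- NA
  x
}

# apply() over columns (MARGIN = 2) avoids any wrap-around between columns
res <- apply(m, 2, mask_neighbours)
res
```

Because `apply` handles each column on its own, a large value at the bottom of one column does not affect the top of the next.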


Pooled Mean Matching MICE

For MICE imputations I need to constrain the predictions so that the predicted values have the same mean as a measured value. We are dealing with pooled blood serum samples (individual blood samples are pooled together), so each measured value represents the mean of the individuals in that pool. I am trying to predict what the concentration of x was in those individuals based on the measured mean and covariate data. You'll notice in my dummy dataset that there are 3 individuals (Individual_id) for each pool (Pool_id). So when imputing these values to the individuals we need the average of those 3 individuals to equal the Pool_mean.
How can we constrain the MICE algorithm to still predict based on covariate data, but have the means match exactly (any method can be chosen; "cart" in this case)? Could this conceptually be done through a MICE squeeze constraint with inputs from the mean?
The code is below:
library(mice)
library(dplyr)
#create demo data table as an example
Pool_id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
Pool_mean <- c(15, 15, 15, 35, 35, 35, 42, 42, 42)
Individual_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
concentration <- c(10, 20, NA, 30, NA,NA, NA, NA, 70)
co_variate <- c(0.1, 0.2, 0.1, 0.2, 0.3, 0.1, 0.1, 0.2, 0.3)
df <- data.frame(Pool_id, Pool_mean, Individual_id, concentration, co_variate)
#run mice to impute missing data
initial_imputed <- mice(df, m = 5, maxit = 10, meth = "cart", seed = 3985)
completed <- complete(initial_imputed)
I know that we can constrain mice using post-processing and maybe a custom function like the vec_squeeze below. However, I need to constrain the values based on a mean. How could I update this function to do this?
vec_squeeze <- function(x, bounds) {
stopifnot(length(x) == nrow(bounds))
pmin(pmax(x, bounds[,1]), bounds[,2])
}
Here's an example of how to use passive imputation on the 3rd variable to force the mean of the imputations to be equal to pool_mean from the data. First generate some data in 'wide' format.
set.seed(123)
# Using larger example data to avoid issues with imputation models
n <- 20
pool_id <- rep(1:n, each = 3)
ind_id <- rep(1:3, times = n)
cov_1 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_2 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_3 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
conc_1 <- round(rnorm(n*3, mean = 20 + 5*cov_1, sd = 5))
conc_2 <- round(rnorm(n*3, mean = 20 + 5*cov_2, sd = 5))
conc_3 <- round(rnorm(n*3, mean = 20 + 5*cov_3, sd = 5))
pool_mean <- apply(cbind(conc_1, conc_2, conc_3), FUN = mean, MARGIN = 1)
df <- data.frame(pool_id, ind_id, pool_mean, conc_1, conc_2,
conc_3, cov_1, cov_2, cov_3)
df[which(rbinom(n*3, 1, prob = 0.5) == 1), "conc_3"] <- NA
df[which(rbinom(n*3, 1, prob = 0.2) == 1), "conc_2"] <- NA
df[which(is.na(df$conc_2)),"conc_3"] <- NA
head(df)
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 NA NA 0.3 0.1 0.2
#> 4 2 1 25.00000 25 NA NA 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 NA 0.2 0.3 0.3
I forced missing values into the third position to avoid re-arranging. I also have ind_id repeated within each pool_id instead of unique, but that's not important for what follows.
The key part of the passive imputation is meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)". If (A+B+C)/3 = D, then 3D - A - B = C.
library(mice)
ini <- mice(df, maxit = 0, printFlag = FALSE)
# Limit the variables used in prediction, to avoid co-linearity
pred <- ini$predictorMatrix
pred[,] <- 0
pred["conc_1", c("pool_mean","cov_1")] <- 1
pred["conc_2", c("pool_mean","conc_1","cov_2")] <- 1
# Set the imputation methods. Use passive imputation for conc_3
meth <- ini$method
meth["conc_2"] <- "pmm"
meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)"
# Control the visit sequence to ensure that conc_3 is updated
# after conc_2. Add other missing variables if needed.
visit_seq <- c("conc_2", "conc_3")
imps <- mice(df, method = meth,
predictorMatrix = pred,
visitSequence = visit_seq,
printFlag = FALSE)
head(complete(imps, action = 1))
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 18 5 0.3 0.1 0.2
#> 4 2 1 25.00000 25 23 27 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 24 0.2 0.3 0.3
Created on 2022-11-20 with reprex v2.0.2
The imputation procedure has correctly replaced row 6's conc_3 value with 24. The other rows have received a random imputation for conc_2 and then had conc_3 passively imputed. In general conc_3 has no other restrictions on it. In row 3 conc_3 = 5, which may be questionably low. It could even become negative in some situations. Better modelling of conc_2 would help.
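The identity behind the passive step, and the caveat about implausible values, can be checked without running mice. The sketch below uses made-up numbers (not output from the code above); `completed`, `row_means` and `negative_rows` are illustrative names only.

```r
# Toy "completed" data: pool_mean is measured, conc_1 and conc_2 are
# observed or imputed. All values are invented for illustration.
completed <- data.frame(
  pool_mean = c(18, 22, 10),
  conc_1    = c(14, 23, 20),
  conc_2    = c(16, 19, 15)
)

# Passive rule: if (A + B + C) / 3 = D, then C = 3 * D - A - B
completed$conc_3 <- 3 * completed$pool_mean - completed$conc_1 - completed$conc_2

# The three concentrations now average to pool_mean in every row
row_means <- rowMeans(completed[, c("conc_1", "conc_2", "conc_3")])
mean_matches <- isTRUE(all.equal(unname(row_means), completed$pool_mean))

# Nothing keeps conc_3 in a plausible range: flag negative values
negative_rows <- which(completed$conc_3 < 0)
```

In the third toy row the two known concentrations already exceed three times the pool mean, so the derived `conc_3` is negative. A check like this on each completed dataset is one way to spot rows where the model for `conc_2` needs improvement.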

Extract dataframes from a large list based on matching values given by another dataframe

I am new to R and despite having researched the site I can't seem to solve this:
I have a very large list of dataframes (of historical climate data from different sites) and I need to make a new list that would only contain those dataframes with specific values (longitudes and latitudes) given by one independent dataframe.
So I have:
df1 <- data.frame("x" = c(1, 1, 1), "y" = c(-2, -2, -2), "a" = c(3, 6, 9), "b" = c(4, 5, 3))
df2 <- data.frame("x" = c(1.2, 1.2, 1.2), "y" = c(-2, -2, -2), "a" = c(3, 4, 78), "b" = c(12, 5, 8))
df3 <- data.frame("x" = c(1.3, 1.3, 1.3), "y" = c(-2.1, -2.1, -2.1), "a" = c(19, 5, 5), "b" = c(7, 7, 20))
my_list <- list(df1, df2, df3)
Each dataframe corresponds to one site, with a specific lon and lat given in two columns (the values are constant across all rows, since it's the same site).
> df1
x y a b
1 1 -2 3 4
2 1 -2 6 5
3 1 -2 9 3
> df2
x y a b
1 1.2 -2 3 12
2 1.2 -2 4 5
3 1.2 -2 78 8
> df3
x y a b
1 1.3 -2.1 19 7
2 1.3 -2.1 5 7
3 1.3 -2.1 5 20
Then, I have an independent dataframe with the longitudes and latitudes for which I would like to obtain the climate data.
df_xy <- data.frame("x"= c(1, 1.3), "y" = c(-2, -2.1))
> df_xy
x y
1 1.0 -2.0
2 1.3 -2.1
Ideally I would obtain a new list with the dataframes that match the x, y specified.
output_list <- list(df1, df3)
As much as I have tried with lapply and Map(merge, my_list, df_xy) I can't seem to make it work out. Thank you very much for any help!
You could do:
output_list <- lapply(my_list, function(x) merge(x, df_xy))
In output_list, the second element will be an empty dataframe, because df2 has no matching x, y pair.
Optionally, based on How do I remove empty data frames from a list?, you could then drop the empty dataframes from output_list using Filter(function(x) dim(x)[1], output_list)
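As an alternative sketch that returns the original dataframes untouched (no merge, so row order and columns are preserved), you can build a logical index over the list. It assumes, as the question states, that x and y are constant within each dataframe, and that the coordinates match exactly; with computed coordinates you may need a tolerance instead of `==`. The names `matches_site` and `keep` are illustrative.

```r
# Data from the question
df1 <- data.frame(x = c(1, 1, 1), y = c(-2, -2, -2), a = c(3, 6, 9), b = c(4, 5, 3))
df2 <- data.frame(x = c(1.2, 1.2, 1.2), y = c(-2, -2, -2), a = c(3, 4, 78), b = c(12, 5, 8))
df3 <- data.frame(x = c(1.3, 1.3, 1.3), y = c(-2.1, -2.1, -2.1), a = c(19, 5, 5), b = c(7, 7, 20))
my_list <- list(df1, df2, df3)
df_xy <- data.frame(x = c(1, 1.3), y = c(-2, -2.1))

# TRUE if the site's (x, y), read from its first row, appears in df_xy
matches_site <- function(d) any(d$x[1] == df_xy$x & d$y[1] == df_xy$y)

# vapply() guarantees one logical value per dataframe
keep <- vapply(my_list, matches_site, logical(1))
output_list <- my_list[keep]
```

Here `output_list` holds df1 and df3 exactly as they were in `my_list`.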

How to call up different dataframes in loops in r and add to them different computational results?

How can I automate the steps below?
I have the following example of what I would like to do - in the end get a dataframe made up of smaller dataframes that are generated automatically in earlier steps. These smaller dataframes need also calculations done in them before they are aggregated. I can do all manually with a long script, but can't seem to figure out how to combine properly list(), apply() or for() loops to get the result I wanted (not sure those are the best option here).
Please advise.
Thank you!
########### MY QUESTION IN DETAILED CODE
# DATASET
a <- c(2.0, 2.4, 2.1, 2.2, 2.3)
b <- c(4.0, 0, 4.5, 4.4, 4.8)
c <- c(0.3, 0.2, 2.0, 2.1, 2.3)
d <- c(5.0, 4.8, 4.8, 4.9, 5.0)
test.data <- data.frame(rbind(a,b,c,d))
#STEP 1: create separate dfs and do different calculations by column in each
#LONG WAY, MANUAL
# calculates % difference between each value with respect to first value in row
# in df1, then second value in row for df2, etc.
nc <- ncol(test.data)
df1 <- (test.data[,1:nc] - test.data[[1]])/(test.data[[1]])*100
df2 <- (test.data[,1:nc] - test.data[[2]])/(test.data[[2]])*100
df3 <- (test.data[,1:nc] - test.data[[3]])/(test.data[[3]])*100
df4 <- (test.data[,1:nc] - test.data[[4]])/(test.data[[4]])*100
df5 <- (test.data[,1:nc] - test.data[[5]])/(test.data[[5]])*100
# some results from above give Inf (since divided by zero), so set those to NA
df1[df1==Inf] <- NA
df2[df2==Inf] <- NA
df3[df3==Inf] <- NA
df4[df4==Inf] <- NA
df5[df5==Inf] <- NA
#next will filter each calculated %-value by the specified percent difference filter
# and save the results in separate associated dataframes.
percent.diff <- 30
df.A1 <- data.frame(ifelse(df1 > -percent.diff & df1 < percent.diff, 1, 0))
df.A2 <- data.frame(ifelse(df2 > -percent.diff & df2 < percent.diff, 1, 0))
df.A3 <- data.frame(ifelse(df3 > -percent.diff & df3 < percent.diff, 1, 0))
df.A4 <- data.frame(ifelse(df4 > -percent.diff & df4 < percent.diff, 1, 0))
df.A5 <- data.frame(ifelse(df5 > -percent.diff & df5 < percent.diff, 1, 0))
#next add ID columns to each of the newly created dataframes
obs <- 4
#add row and df ID variables to each of the above
df.A1["df.cat"] <- 1
df.A1["row"] <- 1:obs
df.A2["df.cat"] <- 2
df.A2["row"] <- 1:obs
df.A3["df.cat"] <- 3
df.A3["row"] <- 1:obs
df.A4["df.cat"] <- 4
df.A4["row"] <- 1:obs
df.A5["df.cat"] <- 5
df.A5["row"] <- 1:obs
#combine the individual dataframes with IDs into a single dataframe.
Combo.df <-list(df.A1, df.A2, df.A3, df.A4, df.A5)
All.df <- Reduce(rbind, Combo.df)
FINAL OUTPUT SHOULD LOOK LIKE THIS (only first few rows shown)
X1 X2 X3 X4 X5 df.cat row
a 1 1 1 1 1 1 1
b 1 0 1 1 1 1 2
c 1 0 0 0 0 1 3
d 1 1 1 1 1 1 4
a1 1 1 1 1 1 2 1
b1 1 1 1 1 1 2 2
c1 0 1 0 0 0 2 3
d1 1 1 1 1 1 2 4
a2 1 1 1 1 1 3 1
b2 1 0 1 1 1 3 2
c2 0 0 1 1 1 3 3
d2 1 1 1 1 1 3 4
FAILED ATTEMPT TO TRY TO AUTOMATE ABOVE STEPS
# a) created the number of dataframes I will need
num.reps <- 5
obs <- 4
n.cols <- 5
lst <- replicate(num.reps,data.frame(matrix(NA, nrow = obs, ncol = n.cols)), simplify=FALSE)
names(lst) <- paste0('df', 1:num.reps)
list2env(lst, envir = .GlobalEnv)
# b) fill dataframes (not sure how to call up dataframe by sequential names in loop)
# THIS DOES NOT WORK
f.diff.calc <- function(i)
{df[[i]] <-(df[,1:nc] - df[[i]])/(df[[i]])*100}
diff.calc.list <- replicate(5, f.diff.calc(list))
#Error in `[.data.frame`(df, , 1:nc) : undefined columns selected
This is a simplification of your code and as far as I can see it does what you want.
fun1 <- function(col, DF = test.data){
res <- 100*(DF - DF[[col]])/DF[[col]]
is.na(res) <- is.infinite(as.matrix(res))
res
}
fun2 <- function(DF, percent.diff = 30){
data.frame(ifelse(-percent.diff < DF & DF < percent.diff, 1, 0))
}
df_list <- lapply(seq_len(ncol(test.data)), fun1)
names(df_list) <- paste0("df", seq_along(df_list))
#next will filter each calculated %-value by the specified percent difference filter
# and save the results in a list of dataframes.
percent.diff <- 30
df.A_list <- lapply(df_list, fun2)
#next add ID columns to each of the newly created dataframes
tmp <- names(df.A_list)
df.A_list <- lapply(seq_along(df.A_list), function(i){
df.A_list[[i]][["df.cat"]] <- i
df.A_list[[i]][["row"]] <- seq_len(nrow(df.A_list[[i]]))
df.A_list[[i]]
})
names(df.A_list) <- tmp
# combine the results in one dataframe
All.df <- do.call(rbind, df.A_list)
Well I sincerely think with a bit more research you could have solved it. Also I cannot recreate the exact output you were getting, but I was able to match the output I was getting using your code.
Here is the automated version of the code.
a <- c(2.0, 2.4, 2.1, 2.2, 2.3)
b <- c(4.0, 0, 4.5, 4.4, 4.8)
c <- c(0.3, 0.2, 2.0, 2.1, 2.3)
d <- c(5.0, 4.8, 4.8, 4.9, 5.0)
test.data <- data.frame(rbind(a,b,c,d))
#STEP 1: create separate dfs and do different calculations by column in each
#LONG WAY, MANUAL
# calculates % difference between each value with respect to first value in row
# in df1, then second value in row for df2, etc.
nc <- ncol(test.data)
calc<-function(x,percent.diff=30,i){
x[x==Inf] <- NA
obs<-4
x.A<- data.frame(ifelse(x > -percent.diff & x < percent.diff, 1, 0))
x.A$df.cat<-i
x.A$row<-1:obs
return(x.A)
}
output<-data.frame()
for(i in 1:5){
assign(paste('df',i,sep=""),(test.data[,1:nc] - test.data[[i]])/(test.data[[i]])*100)
}
for(i in 1:5){
output<-rbind.data.frame(output,calc(x = get(paste('df',i,sep="")),percent.diff = 30,i=i))
}

How can I vectorize this task in R?

For a specific task, I have written the following R script:
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))
cut <- unique(pred)
cut_n <- length(cut)
n <- length(pred)
class_1 <- numeric(cut_n)
class_2 <- numeric(cut_n)
curr_cut <- cut[1]
class_1_c <- 0
class_2_c <- 0
j <- 1
for (i in 1:n){
if (curr_cut != pred[i]) {
j <- j + 1
curr_cut <- pred[i]
}
if (grp[i] == levels(grp)[1])
class_1_c <- class_1_c + 1
else
class_2_c <- class_2_c + 1
class_1[j] <- class_1_c
class_2[j] <- class_2_c
}
cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
My goal above was to compute the cumulative number of times the factors in grp appear for each unique value in pred. For example, I get the following output for above:
index: 0.1 0.2 0.3
class1: 2 3 5
class2: 1 2 2
I am a beginner in R and I have few questions about this:
How can I make this code faster and simpler?
Is it possible to vectorize this and avoid the for loop?
Is there a different "R-esque" way of doing this?
Any help would be greatly appreciated. Thanks!
You can start by getting the unique group/pred counts using a table
table(grp, pred)
# pred
# grp 0.1 0.2 0.3
# 1 2 1 2
# 2 1 1 0
Of course this isn't exactly what you wanted. You want cumulative totals, so we can adjust this result by applying a cumulative sum across each row (transposed to better match your data layout)
t(apply(table(grp, pred), 1, cumsum))
# grp 0.1 0.2 0.3
# 1 2 3 5
# 2 1 2 2
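To recover the same three vectors the original loop printed, you can pull the rows out of that matrix. This is a small sketch built on the data from the question; `cum_counts` is an illustrative name. Note that `table()` orders the columns by sorted `pred` value, whereas the loop relied on `pred` already being grouped in order.

```r
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp  <- as.factor(c(1, 1, 2, 2, 1, 1, 1))

# Rows: levels of grp; columns: sorted unique pred values
cum_counts <- t(apply(table(grp, pred), 1, cumsum))

cut     <- as.numeric(colnames(cum_counts))  # the unique pred values
class_1 <- unname(cum_counts[1, ])           # cumulative counts, first level
class_2 <- unname(cum_counts[2, ])           # cumulative counts, second level

cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
```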

R Missing Value Replacement Function

I have a table with missing values and I'm trying to write a function that will replace the
missing values with a calculation based on the nearest two non-zero values.
Example:
X Tom
1 4.3
2 5.1
3 NA
4 NA
5 7.4
For X = 3, Tom = 5.1 + (7.4-5.1)/2.
For X = 4, Tom = (5.1 + (7.4-5.1)/2) + (7.4-5.1)/2
Does this function already exist? If not, any advice would be greatly appreciated.
A more usual way to do this (but not equivalent to the question) is to use linear interpolation:
library(zoo)
df <- data.frame(X = 1:5, Tom = c(4.3, 5.1, NA, NA, 7.4))
na.approx(df)
or spline interpolation:
na.spline(df)
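For reference, the same kind of linear interpolation can be sketched in base R with `approx()` from the stats package, without loading zoo. This assumes the small data frame from above; `ok` and `filled` are illustrative names.

```r
df <- data.frame(X = 1:5, Tom = c(4.3, 5.1, NA, NA, 7.4))

# Interpolate Tom linearly over X, using only the observed points
ok <- !is.na(df$Tom)
filled <- approx(x = df$X[ok], y = df$Tom[ok], xout = df$X)$y

df$Tom_filled <- filled
df
```

The two gaps are filled in equal steps of (7.4 - 5.1) / 3 between the neighbouring observations, which is why the result differs from the half-step rule in the question. With the default `rule = 1`, any leading or trailing NA values are left as NA.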
Actually the imputeTS package (I am the maintainer) offers a good solution for this.
Replacement with the Moving Average
na_ma(x, k = 2)
x is your input object
k is the moving average window
k of 1 means you only consider the values before and after
k of 2 means you consider the 2 values before and the 2 values after
This function is probably the closest to the required calculation.
The only difference is that the imputeTS method does not jump over NA values (as the question requires).
But especially for long NA streaks this makes perfect sense.
1, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14, 15, 16
(taking the average of 2 and 14 for the NA at position 3 would not be a good idea)
Furthermore, Last Observation Carried Forward (as mentioned by 42 in the comments)
imputeTS::na_locf(x)
or Interpolation (as also mentioned by G. Grothendieck)
imputeTS::na_interpolation(x)
are also missing data replacement options that go a little bit in the same direction.
Here is an introduction to the imputeTS package in the R Journal if you are interested.
Just use a loop in this scenario; other approaches are much harder.
for (i in seq_len(nrow(df))) {
if (is.na(df[i, 'Tom']))
df[i, 'Tom'] <- ((tmp <- c(0, df$Tom[!is.na(df$Tom)], 0))[i+1] + tmp[i]) / 2 + tmp[i]
}
Example
df <- data.frame(X = seq_len(100), Tom = ifelse(runif(100, 0, 1) > 0.5, NA, round(runif(100, 0, 10), 1)))
head(df)
# X Tom
# 1 1 NA
# 2 2 1.4
# 3 3 NA
# 4 4 3.9
# 5 5 NA
for (i in seq_len(nrow(df))) { if (is.na(df[i, 'Tom'])) df[i, 'Tom'] <- ((tmp <- c(0, df$Tom[!is.na(df$Tom)], 0))[i+1] + tmp[i]) / 2 + tmp[i] }
head(df)
# X Tom
# 1 1 0.70
# 2 2 1.40
# 3 3 4.05
# 4 4 3.90
# 5 5 9.05
