Replace matching column values in multiple dataframes with NA in R - r

I have a dataframe like the following:
dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
target = rep(c("1", "2", "3"), times = 2),
eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 0.75 0.09
I have run some code to identify outliers in the eucDist and eucDist2 columns, which I have saved in separate dataframes. Examples of these can be seen here:
outliers1 <- data.frame(participant = c("01", "02"),
target = c("1", "3"),
eucDist = c(0.06, 0.75),
eucDist2 = c(0.09, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 02 3 0.75 0.09
outliers2 <- data.frame(participant = "01",
target = "1",
eucDist = 0.06,
eucDist2 = 0.09)
participant target eucDist eucDist2
1 01 1 0.06 0.09
The rows shown in outliers1 indicate outliers in the eucDist column of dat, and the row shown in outliers2 indicates an outlier in the eucDist2 column.
I would like to replace the outlier values in the eucDist and eucDist2 columns of dat with NA. I do not want to remove whole rows because in many cases either the eucDist or the eucDist2 value is usable, and removing the entire row would remove both variables.
Here is what I would like:
participant target eucDist eucDist2
1 01 1 NA NA
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 NA 0.09
I have been attempting this using conditional %in% statements, but can't quite get the phrasing correct and would really appreciate some help. Here is my non-working code:
library(naniar)
dat1 <- if (dat$eucDist %in% Outliers1$eucDist) {
replace_with_na_all(dat$eucDist)
}

As you have the data frames in this format, you can set the values of the required data to NA in the new data frames, and then update the rows of the original data frames with these values using dplyr::rows_update(). This assumes you have at least dplyr v1.0.0.
library(dplyr)
outliers1$eucDist <- NA
outliers2$eucDist2 <- NA
dat |>
rows_update(outliers1, by = c("participant", "target")) |>
rows_update(select(outliers2, -eucDist), by = c("participant", "target"))
# participant target eucDist eucDist2
# 1 01 1 NA NA
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 1.45
# 6 02 3 NA 0.09

I would recommend using the case_when function in the dplyr package. Note that this matches on the values themselves rather than on participant and target, so any row that shares a value with an outlier is also set to NA (here, eucDist2 in row 6).
library(dplyr)
dat_final <- dat %>%
  mutate(eucDist = case_when(
    eucDist %in% outliers1$eucDist ~ NA_real_,
    TRUE ~ eucDist
  ),
  eucDist2 = case_when(
    eucDist2 %in% outliers2$eucDist2 ~ NA_real_,
    TRUE ~ eucDist2
  ))
> str(dat_final)
'data.frame': 6 obs. of 4 variables:
$ participant: chr "01" "01" "01" "02" ...
$ target : chr "1" "2" "3" "1" ...
$ eucDist : num NA 0.16 0.89 0.1 0.11 NA
$ eucDist2 : num NA 0.04 0.03 0.05 1.45 NA
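Matching on values alone can blank a row that merely shares a value with an outlier: eucDist2 is 0.09 in both row 1 and row 6 of dat. Below is a minimal base R sketch that matches on the participant/target pair instead. Only the key columns of the outlier frames are needed, and the helper name row_key is made up for this illustration.

```r
dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
                  target = rep(c("1", "2", "3"), times = 2),
                  eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
                  eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))
outliers1 <- data.frame(participant = c("01", "02"), target = c("1", "3"))
outliers2 <- data.frame(participant = "01", target = "1")

# Hypothetical helper: one string key per participant/target pair
row_key <- function(df) paste(df$participant, df$target, sep = "_")

dat_keyed <- dat
# Blank eucDist only in rows whose key appears in outliers1
dat_keyed$eucDist[row_key(dat) %in% row_key(outliers1)] <- NA
# Blank eucDist2 only in rows whose key appears in outliers2
dat_keyed$eucDist2[row_key(dat) %in% row_key(outliers2)] <- NA
dat_keyed
```

With the example data, row 6 keeps its eucDist2 value of 0.09 because its key ("02_3") is not in outliers2.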

Combine the two outlier frames with rbind, adding the column number of the respective eucDist column in dat (3 for eucDist, 4 for eucDist2). Then run a short for loop over the rows.
outliers <- rbind(cbind(outliers1[1:2], j=3), cbind(outliers2[1:2], j=4))
for (i in seq_len(nrow(outliers))) {
dat[dat$participant == outliers$participant[i] & dat$target == outliers$target[i], outliers$j[i]] <- NA
}
# participant target eucDist eucDist2
# 1 01 1 NA NA
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 1.45
# 6 02 3 NA 0.09

Related

Operations within rows in R

Consider the following dataset, where:
Variables 1-7 (var1-7) are linear measurements taken from five lizards (indvA-E);
Variable 8 (var8) is the number of variables, for each lizard, that contain values that are not equal to NA;
Variable 9 (var9) is the sum of variables 1-7;
data <- data.frame(var1 = c(0.13,0.08,0.05,0.11,0.09),
var2 = c(0.17,0.09,0.07,0.15,0.13),
var3 = c(0.19,0.11,0.19,0.17,0.14),
var4 = c(NA,0.11,0.31,0.38,0.17),
var5 = c(NA,NA,0.39,0.41,0.19),
var6 = c(NA,NA,0.40,0.75,NA),
var7 = c(NA,NA,0.45,0.79,NA))
row.names(data) <- c("indv.A","indv.B","indv.C","indv.D","indv.E")
data[,"var8"] <- rowSums(!is.na(data))
data[,"var9"] <- rowSums(data[,1:7], na.rm = TRUE)
data
# var1 var2 var3 var4 var5 var6 var7 var8 var9
# indv.A 0.13 0.17 0.19 NA NA NA NA 3 0.49
# indv.B 0.08 0.09 0.11 0.11 NA NA NA 4 0.39
# indv.C 0.05 0.07 0.19 0.31 0.39 0.40 0.45 7 1.86
# indv.D 0.11 0.15 0.17 0.38 0.41 0.75 0.79 7 2.76
# indv.E 0.09 0.13 0.14 0.17 0.19 NA NA 5 0.72
I'd like to create a new variable, named var10, that can be described as either "var8 divided by (var9 minus the last non-NA value of variables 1-7)" or "var8 divided by the sum of all but the last non-NA value of variables 1-7".
For the above dataset, this new variable will contain:
# var1-9 var10
# indv.A [...] 10.00
# indv.B [...] 14.29
# indv.C [...] 4.96
# indv.D [...] 3.55
# indv.E [...] 9.43
I just don't know how to write in R the formula to obtain this variable. Any help will be greatly appreciated.
1) For the first phrasing, take the last non-NA value from var1 to var7 and subtract it from var9:
v1 <- data[cbind(seq_len(nrow(data)), max.col(!is.na(data[1:7]), "last"))]
data$var10 <- data$var8/(data$var9 - v1)
2) For the second phrasing, sum all but the last non-NA value directly:
data$var10 <- data$var8/
apply(data[1:7], 1, \(x) sum(head(x[!is.na(x)], -1)))
> data$var10
[1] 10.000000 14.285714 4.964539 3.553299 9.433962
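As a quick check that the two phrasings in the question describe the same quantity, here is a small self-contained sketch that computes var10 both ways and compares them. The intermediate names (vals, last_val, sum_but_last, by_subtraction, by_partial_sum) are illustrative and not part of the original answers.

```r
data <- data.frame(var1 = c(0.13, 0.08, 0.05, 0.11, 0.09),
                   var2 = c(0.17, 0.09, 0.07, 0.15, 0.13),
                   var3 = c(0.19, 0.11, 0.19, 0.17, 0.14),
                   var4 = c(NA, 0.11, 0.31, 0.38, 0.17),
                   var5 = c(NA, NA, 0.39, 0.41, 0.19),
                   var6 = c(NA, NA, 0.40, 0.75, NA),
                   var7 = c(NA, NA, 0.45, 0.79, NA))
row.names(data) <- c("indv.A", "indv.B", "indv.C", "indv.D", "indv.E")

vals <- data[1:7]
var8 <- rowSums(!is.na(vals))          # count of non-NA measurements
var9 <- rowSums(vals, na.rm = TRUE)    # sum of non-NA measurements

# Last non-NA value in each row
last_val <- apply(vals, 1, function(x) tail(x[!is.na(x)], 1))
# Sum of all but the last non-NA value in each row
sum_but_last <- apply(vals, 1, function(x) sum(head(x[!is.na(x)], -1)))

by_subtraction <- var8 / (var9 - last_val)
by_partial_sum <- var8 / sum_but_last

all.equal(by_subtraction, by_partial_sum)
round(by_partial_sum, 2)   # the question expects 10.00 14.29 4.96 3.55 9.43
```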

How to update values of certain columns of a dataframe with values from another dataframe in r

I am struggling to write R code for the following problem:
df1 and df2 are two dataframes.
> df1 <- read.csv(file = 'Indx.csv')
> df1
St_Name I1 I2 I3 I4
1 TN 0.10 0.15 0.20 0.25
2 AZ 0.30 0.35 0.40 0.45
3 TX 0.50 0.55 0.60 0.65
4 KS 0.70 0.75 0.80 0.85
5 KY 0.90 0.95 0.11 0.12
6 MN 0.13 0.14 0.16 0.17
> df2 <- as.data.frame(fromJSON(file = "NewIndx.json"))
> df2
St_Name I1 I3
1 KS 100 200
# The output should be
> df1
St_Name I1 I2 I3 I4
1 TN 0.10 0.15 0.20 0.25
2 AZ 0.30 0.35 0.40 0.45
3 TX 0.50 0.55 0.60 0.65
4 KS 100 0.75 200 0.85
5 KY 0.90 0.95 0.11 0.12
6 MN 0.13 0.14 0.16 0.17
>
What is the best way to achieve this?
We could use this slightly modified function coalesce_join provided by Edward Visel:
library(tidyverse)
# the function:
coalesce_join <- function(x, y,
by = NULL, suffix = c(".y", ".x"),
join = dplyr::full_join, ...) {
joined <- join(y, x, by = by, suffix = suffix, ...)
# names of desired output
cols <- union(names(y), names(x))
to_coalesce <- names(joined)[!names(joined) %in% cols]
suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
# remove suffixes and deduplicate
to_coalesce <- unique(substr(
to_coalesce,
1,
nchar(to_coalesce) - nchar(suffix_used)
))
coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
joined[[paste0(.x, suffix[1])]],
joined[[paste0(.x, suffix[2])]]
))
names(coalesced) <- to_coalesce
dplyr::bind_cols(joined, coalesced)[cols]
}
# apply
coalesce_join(df1, df2, by = 'St_Name')
St_Name I1 I3 I2 I4
1 KS 100.00 200.00 0.75 0.85
2 TN 0.10 0.20 0.15 0.25
3 AZ 0.30 0.40 0.35 0.45
4 TX 0.50 0.60 0.55 0.65
5 KY 0.90 0.11 0.95 0.12
6 MN 0.13 0.16 0.14 0.17
Kindly let me know if this is what you were anticipating.
library(reshape2) # melt()
library(tidyr) # spread()
id <- "St_Name"
# reshape both data frames to long format: one row per state/variable pair
df_1 <- melt(df1, id.vars = id, measure.vars = setdiff(colnames(df1), id))
df_2 <- melt(df2, id.vars = id, measure.vars = setdiff(colnames(df2), id))
result <- merge(df_1, df_2, by = c("St_Name", "variable"), no.dups = TRUE, all.x = TRUE)
# where df2 supplies a value, overwrite the df1 value
result$value.x[which(!is.na(result$value.y))] <- result$value.y[which(!is.na(result$value.y))]
result <- result[, -4]
# reshape back to wide format
result <- spread(result, variable, value.x)
We could use {powerjoin} with its conflict argument. coalesce_yx gives priority to data from the right-hand table and falls back to the left-hand table where the right-hand value is missing.
data
df1 <- tibble::tribble(
~St_Name, ~I1, ~I2, ~I3, ~I4,
"TN", 0.10, 0.15, 0.20, 0.25,
"AZ", 0.30, 0.35, 0.40, 0.45,
"TX", 0.50, 0.55, 0.60, 0.65,
"KS", 0.70, 0.75, 0.80, 0.85,
"KY", 0.90, 0.95, 0.11, 0.12,
"MN", 0.13, 0.14, 0.16, 0.17)
df2 <- tibble::tribble(
~St_Name, ~I1, ~I3,
"KS", 100, 200
)
solution
library(powerjoin)
power_left_join(df1, df2, by = "St_Name", conflict = coalesce_yx)
#> # A tibble: 6 × 5
#> St_Name I2 I4 I1 I3
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 TN 0.15 0.25 0.1 0.2
#> 2 AZ 0.35 0.45 0.3 0.4
#> 3 TX 0.55 0.65 0.5 0.6
#> 4 KS 0.75 0.85 100 200
#> 5 KY 0.95 0.12 0.9 0.11
#> 6 MN 0.14 0.17 0.13 0.16
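The join-based answers above return the columns (and in one case the rows) in a different order from the requested output. If keeping df1's original layout matters, one option is a plain base R update by matched row index. This is a minimal sketch under two assumptions: every St_Name in df2 occurs exactly once in df1, and every value column of df2 exists in df1. The names key, value_cols, idx and updated are illustrative.

```r
df1 <- data.frame(St_Name = c("TN", "AZ", "TX", "KS", "KY", "MN"),
                  I1 = c(0.10, 0.30, 0.50, 0.70, 0.90, 0.13),
                  I2 = c(0.15, 0.35, 0.55, 0.75, 0.95, 0.14),
                  I3 = c(0.20, 0.40, 0.60, 0.80, 0.11, 0.16),
                  I4 = c(0.25, 0.45, 0.65, 0.85, 0.12, 0.17))
df2 <- data.frame(St_Name = "KS", I1 = 100, I3 = 200)

key <- "St_Name"
value_cols <- setdiff(names(df2), key)

# Row of df1 that each row of df2 refers to
idx <- match(df2[[key]], df1[[key]])
# Guard the assumptions stated above
stopifnot(!anyNA(idx), value_cols %in% names(df1))

updated <- df1
# Overwrite only the matched rows and only the columns df2 provides
updated[idx, value_cols] <- df2[value_cols]
updated
```

Row and column order are unchanged; only the KS row's I1 and I3 are replaced.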

Dplyr mutate many columns with each new column conditional on two columns

My data frame is like the simple one below with many more columns and rows.
My goal is to add a new column for every "model" based on some ifelse() output using the matching pval and value_IC column.
Here, models are linear, beta and emax.
The closest problem I have found so far was here
https://community.rstudio.com/t/how-to-mutate-at-mutate-if-multiple-columns-using-condition-on-other-column-outside-vars-dplyr/17506/2
where there is always the same "second" column used.
data <- data.frame(pval.linear.orig = c(0.01, 0.06, 0.02),
pval.beta.orig = c(0.06, 0.02, 0.01),
pval.emax.orig = c(0.03, 0.01, 0.07),
value_IC.linear.orig = c(-5, NA, -4),
value_IC.beta.orig = c(NA, NA, -10),
value_IC.emax.orig = c(NA, -11, NA))
pval.linear.orig pval.beta.orig pval.emax.orig value_IC.linear.orig value_IC.beta.orig value_IC.emax.orig
1 0.01 0.06 0.03 -5 NA NA
2 0.06 0.02 0.01 NA NA -11
3 0.02 0.01 0.07 -4 -10 NA
If I only wanted it for one model, let's say beta, I would do this:
library(dplyr)
data_new <- data %>% mutate(conv.beta.orig = case_when(
pval.beta.orig > 0.025~ NA,
pval.beta.orig <= 0.025 & !(is.na(value_IC.beta.orig)) ~ TRUE,
pval.beta.orig <= 0.025 & is.na(value_IC.beta.orig) ~ FALSE))
data_new
pval.linear.orig pval.beta.orig pval.emax.orig value_IC.linear.orig value_IC.beta.orig value_IC.emax.orig conv.beta.orig
1 0.01 0.06 0.03 -5 NA NA NA
2 0.06 0.02 0.01 NA NA -11 FALSE
3 0.02 0.01 0.07 -4 -10 NA TRUE
to get the conv.beta.orig column. The column name does not have to be exactly in this format.
My problem now is to do so with all models I have each using the pval.MODEL.orig and value_IC.MODEL.orig column as above.
Thank you very much for your help!
This is the first question I have ever posted, so let me know if I should reformulate something, missed something, or overlooked an existing question on this problem.
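One possible approach, sketched here in base R rather than taken from an accepted answer, is to loop over the model names and build each pair of column names with paste0. It assumes every model has both a pval.MODEL.orig and a value_IC.MODEL.orig column; the models vector and the 0.025 cutoff are copied from the question, and the other variable names are illustrative.

```r
data <- data.frame(pval.linear.orig = c(0.01, 0.06, 0.02),
                   pval.beta.orig = c(0.06, 0.02, 0.01),
                   pval.emax.orig = c(0.03, 0.01, 0.07),
                   value_IC.linear.orig = c(-5, NA, -4),
                   value_IC.beta.orig = c(NA, NA, -10),
                   value_IC.emax.orig = c(NA, -11, NA))

models <- c("linear", "beta", "emax")

data_new <- data
for (m in models) {
  pval <- data[[paste0("pval.", m, ".orig")]]
  ic <- data[[paste0("value_IC.", m, ".orig")]]
  # NA when pval is above the cutoff; otherwise TRUE if value_IC is present
  data_new[[paste0("conv.", m, ".orig")]] <- ifelse(pval > 0.025, NA, !is.na(ic))
}
data_new[paste0("conv.", models, ".orig")]
```

For the beta model this reproduces the conv.beta.orig column shown in the question (NA, FALSE, TRUE).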

Remove leading zeros in numbers *within a data frame*

Edit: For anyone coming later: THIS IS NOT A DUPLICATE, since it explicitly concerns work on data frames, not single variables/vectors.
I have found several sites describing how to drop leading zeros in numbers or strings, including vectors. But none of the descriptions I found seem applicable to data frames.
One candidate is the f_num function in the numform package. It handles "[a] vector of numbers (or string equivalents)", but does not seem to solve unwanted leading zeros in a data frame.
I am relatively new to R but understand that I could develop some (in my mind) complex code to drop leading zeros by subsetting vectors from a data frame and then combining those vectors into a full data frame. I would like to avoid that.
Here is a simple data frame:
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26,
-0.23), low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3,
-0.28), up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17
)), row.names = c(NA, 8L), class = "data.frame")
Which gives
df
est low2.5 up2.5
1 0.05 0.01 0.09
2 -0.16 -0.20 -0.12
3 -0.02 -0.05 0.00
4 0.00 -0.03 0.04
5 -0.11 -0.20 -0.01
6 0.15 0.10 0.20
7 -0.26 -0.30 -0.22
8 -0.23 -0.28 -0.17
I would want
est low2.5 up2.5
1 .05 .01 .09
2 -.16 -.20 -.12
3 -.02 -.05 .00
4 .00 -.03 .04
5 -.11 -.20 -.01
6 .15 .10 .20
7 -.26 -.30 -.22
8 -.23 -.28 -.17
Is that possible with relatively simple code for a whole data frame?
Edit: An incorrect link has been removed.
I interpret your question as asking how to convert each numeric cell in the data.frame into a "pretty-printed" string, which is possible with string substitution and a simple regular expression. (A good question, by the way: I do not know of any way to configure the output of numeric data to suppress leading zeros without converting it to a string.)
df2 <- data.frame(lapply(df,
function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", as.character(x)))),
stringsAsFactors = FALSE)
df2
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.2 -.12
# 3 -.02 -.05 0
# 4 0 -.03 .04
# 5 -.11 -.2 -.01
# 6 .15 .1 .2
# 7 -.26 -.3 -.22
# 8 -.23 -.28 -.17
str(df2)
# 'data.frame': 8 obs. of 3 variables:
# $ est : chr ".05" "-.16" "-.02" "0" ...
# $ low2.5: chr ".01" "-.2" "-.05" "-.03" ...
# $ up2.5 : chr ".09" "-.12" "0" ".04" ...
If you want to get a fixed number of digits after the decimal point (as shown in the expected output but not asked for explicitly) you could use sprintf or format:
df3 <- data.frame(lapply(df, function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", sprintf("%.2f", x)))), stringsAsFactors = FALSE)
df3
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.20 -.12
# 3 -.02 -.05 .00
# 4 .00 -.03 .04
# 5 -.11 -.20 -.01
# 6 .15 .10 .20
# 7 -.26 -.30 -.22
# 8 -.23 -.28 -.17
Note: This solution is not robust against a different decimal separator (other locales); it always expects a decimal point.
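The two nested gsub calls can also be folded into a single substitution with an optional captured minus sign. The following is a minimal sketch of that variant, not the answerer's code; the function name drop_leading_zero and the result name df_chr are made up for illustration, and like the original it assumes a "." decimal separator.

```r
df <- data.frame(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26, -0.23),
                 low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3, -0.28),
                 up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17))

# Hypothetical helper: format to two decimals, then drop a leading "0"
# that sits directly before the decimal point (keeping any minus sign)
drop_leading_zero <- function(x) {
  sub("^(-?)0\\.", "\\1.", sprintf("%.2f", x))
}

df_chr <- df
# df_chr[] <- ... keeps the data frame structure while replacing every column
df_chr[] <- lapply(df, drop_leading_zero)
df_chr
```

The result is a data frame of character columns, so it is suitable for display or export but not for further arithmetic.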

Mutate group of columns based on values in other group of columns

I am trying to convert values in one column to NA based on if the values in another corresponding column are NA. I need to do this for two large groups of corresponding columns so I cannot mutate each column one by one.
For example, below, 2002 inflationNext2Years turns to NA since 2002 realReturnNext2Years is NA.
year <- c(2000, 2001, 2002)
realReturnNext1Years <- c(.1,.2,.3)
realReturnNext2Years <- c(.15,.25, NA)
realReturnNext3Years <- c(.45, NA, NA)
inflationNext1Years <- c(.02, .03, .07)
inflationNext2Years <- c(.03, .05, .08)
inflationNext3Years <- c(.04, .06, .09)
data <- data.frame(year, realReturnNext1Years, realReturnNext2Years, realReturnNext3Years, inflationNext1Years, inflationNext2Years, inflationNext3Years)
data
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
1 2000 0.1 0.15 0.45 0.02 0.03 0.04
2 2001 0.2 0.25 NA 0.03 0.05 0.06
3 2002 0.3 NA NA 0.07 0.08 0.09
I am trying to convert data into:
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
2000 0.1 0.15 0.45 0.02 0.03 0.04
2001 0.2 0.25 NA 0.03 0.05 NA
2002 0.3 NA NA 0.07 NA NA
Since I have many columns, I cannot do this one column at a time. I tried to use mutate_at with an ifelse() but was not sure how to test if the number of years lined up.
I have a vector of the realReturn column names and another vector of the inflation column names. I am trying to change the inflation columns to NA if their corresponding realReturnColumn is NA, but keep the inflation column the same if the realReturnColumn is not NA.
We can collect the indices of the "realReturnNext" columns using grep, get the positions of their NAs, and set the corresponding positions in the "inflationNext" columns to NA. This relies on both groups of columns appearing in the same order:
real_cols <- grep("^realReturnNext", colnames(data))
inflation_cols <- grep("^inflationNext", colnames(data))
data[inflation_cols][is.na(data[real_cols])] <- NA
# year realReturnNext1Years realReturnNext2Years realReturnNext3Years
#1 2000 0.1 0.15 0.45
#2 2001 0.2 0.25 NA
#3 2002 0.3 NA NA
# inflationNext1Years inflationNext2Years inflationNext3Years
#1 0.02 0.03 0.04
#2 0.03 0.05 NA
#3 0.07 NA NA
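The positional approach above depends on the two column groups being in the same order. If that is not guaranteed, the columns can be paired by name instead. This is a minimal sketch assuming every realReturnNext column has an inflationNext counterpart with the same suffix; the names real_names, inflation_name and result are illustrative.

```r
data <- data.frame(year = c(2000, 2001, 2002),
                   realReturnNext1Years = c(.1, .2, .3),
                   realReturnNext2Years = c(.15, .25, NA),
                   realReturnNext3Years = c(.45, NA, NA),
                   inflationNext1Years = c(.02, .03, .07),
                   inflationNext2Years = c(.03, .05, .08),
                   inflationNext3Years = c(.04, .06, .09))

real_names <- grep("^realReturnNext", names(data), value = TRUE)

result <- data
for (real_name in real_names) {
  # Derive the partner column name, e.g. realReturnNext2Years -> inflationNext2Years
  inflation_name <- sub("^realReturn", "inflation", real_name)
  # Blank the inflation value wherever the matching real return is NA
  result[[inflation_name]][is.na(data[[real_name]])] <- NA
}
result
```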
