Replace several values and keep others same efficiently in R - r

I have a dataframe like the following:
combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7
Each column has two single-digit values and two double-digit values composed of the single-digit values in each possible order.
I am trying to determine how to replace certain values in the dataframe so that there is only one version of the double-digit value. For example, all values of 21 in the first column should be 12. All values of 32 in the second column should become 23.
I know I can do something like this using the following code:
df <- df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17))
The problem with this is that it gives me a dataframe that contains the correct values when specified but leaves all the other values as NA. The resulting dataframe only contains values where 21, 32, 41, and 71 were. I know I could address this by specifying each value, like x == 1 ~ 1. However, I have many values and would prefer to only specify the ones that I am trying to change.
How can I replace several values in a dataframe without all the other values becoming NA? Is there a way for me to replace the values I want to replace while holding the other values the same without directly specifying those values?

You can use TRUE ~ x at the end of your case_when() sequence:
df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17, TRUE ~ x))
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7
Another option that may be more efficient would be data.table's fcase() function.
Data:
df = read.table(header = TRUE, text = "combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7")
df[] = lapply(df, as.double) # side-note: tidyverse has become very stict about types

One dplyr and stringi option may be:
df %>%
mutate(across(everything(),
~ if_else(. %in% c(21, 32, 41, 71), as.integer(stri_reverse(.)), .)))
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7

Using mapply:
df1[] <- mapply(function(d, x1, x2){ ifelse(d == x1, x2, d) },
d = df1,
x1 = c(21, 32, 41, 71),
x2 = c(12, 23, 14, 17))
df1
# combo_2 combo_4 combo_7 combo_9
# 1 12 23 14 17
# 2 12 23 14 17
# 3 2 3 1 7
# 4 1 2 4 1
# 5 12 23 14 17
# 6 2 23 1 7

Related

create a new variable based on other factors using R

So I have this dataframe and I aim to add a new variable based on others:
Qi
Age
c_gen
1
56
13
2
43
15
5
31
6
3
67
8
I want to create a variable called c_sep that if:
Qi==1 or Qi==2 c_sep takes a random number between (c_gen + 6) and Age;
Qi==3 or Qi==4 c_sep takes a random number between (Age-15) and Age;
And 0 otherwise,
so my data would look something like this:
Qi
Age
c_gen
c_sep
1
56
13
24
2
43
15
13
5
31
6
0
3
67
8
40
Any ideas please
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = T)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1,2)] <- apply(dat[dat$Qi %in% c(1,2),], 1, \(row) sample(
(row["c_gen"]+6):row["Age"], 1
)
)
dat$c_sep[dat$Qi %in% c(3,4)] <- apply(dat[dat$Qi %in% c(3,4),], 1, \(row) sample(
(row["Age"]-15):row["Age"], 1
)
)
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice you might want to put this in a function - depending on your requirements.
Try this
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2 ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$c_gen[x] + 6, df$Age[x]) ,1)) ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$Age[x] - 15, df$Age[x]) ,1)) , 0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
~Qi, ~Age, ~c_gen,
1, 56, 13,
2, 43, 15,
5, 31, 6,
3, 67, 8
)
df |>
rowwise() |>
mutate(c_sep = case_when(
Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
TRUE ~ 0
)) |>
ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)

using intervals in a column to populate values for another column

I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can get the diffrence of range (returns min/max as a vector) and divide by the constant passed into i.e. 3 as the breaks in cut). Then, use integer index (labels = FALSE) to get the corresponding LETTER from the LETTERS builtin R constant
library(dplyr)
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
If the grouping should be based on adjacent values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above
library(data.table)
dataframe %>%
group_by(grp = rleid(Condition)) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
Here's a one-liner using my santoku package. The rleid line is the same as mentioned in #akrun's solution.
dataframe %<>%
group_by(grp = data.table::rleid(Condition)) %>%
mutate(
Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
)

Loop over certain columns to replace NAs with 0 in a dataframe

I have spent a lot of time trying to write a loop to replace NAs with zeros for certain columns in a data frame and have not yet succeeded. I have searched and can't find similar question.
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
B = c(NA, 10, 12, 14, NA, 16),
C = c(20, NA, 22, 24, 26, NA),
D = c(30, NA, NA, 32, 34, 36))
df
Gives me:
A B C D
1 2 NA 20 30
2 4 10 NA NA
3 6 12 22 NA
4 NA 14 24 32
5 8 NA 26 34
6 10 16 NA 36
I want to set NAs to 0 for only columns B and D. Using separate code lines, I could:
df$B[is.na(df$B)] <- 0
df$D[is.na(df$D)] <- 0
However, I want to use a loop because I have many variables in my real data set.
I cannot find a way to loop over only columns B and D so I get:
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
Essentially, I want to apply a loop using a variable list to a data frame:
varlist <- c("B", "D")
How can I loop over only certain columns in the data frame using a variable list to replace NAs with zeros?
here is a tidyverse aproach:
library(tidyverse)
df %>%
mutate_at(.vars = vars(B, D), .funs = funs(ifelse(is.na(.), 0, .)))
#output:
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
basically you say vars B and D should change by a defined function. Where . corresponds to the appropriate column.
Here's a base R one-liner
df[, varlist][is.na(df[, varlist])] <- 0
using the zoo package we can fill the selected columns.
library(zoo)
df[varlist]=na.fill(df[varlist],0)
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
In base R we can have
df[varlist]=lapply(df[varlist],function(x){x[is.na(x)]=0;x})
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36

How to perform complex multicolumn match in R /

I wish to match two dataframes based on conditionals on more than one column but cannot figure out how. So if there are my data sets:
df1 <- data.frame(lower=c(0,5,10,15,20), upper=c(4,9,14,19,24), x=c(12,45,67,89,10))
df2 <- data.frame(age=c(12, 14, 5, 2, 9, 19, 22, 18, 23))
I wish to match age from df2 that falls into the range between lower and upper in df1 with the aim to add an extra column to df2 containing the value of x in df1 where age lies between upper and lower. i.e. I want df2 to look like
age x
12 67
14 67
5 45
....etc.
How can I achieve such a match ?
I would go with a simple sapply and a "anded" condition in the df1$x selection like this:
df2$x <- sapply( df2$age, function(x) { df1$x[ x >= df1$lower & x <= df1$upper ] })
which gives:
> df2
age x
1 12 67
2 14 67
3 5 45
4 2 12
5 9 45
6 19 89
7 22 10
8 18 89
9 23 10
For age 12 for example the selection inside the brackets gives:
> 12 >= df1$lower & 12 <= df1$upper
[1] FALSE FALSE TRUE FALSE FALSE
So getting df1$x by this logical vector is easy as your ranges don't overlap
Using foverlaps from data.table is what you are looking for:
library(data.table)
setDT(df1)
setDT(df2)[,age2:=age]
setkey(df1,lower,upper)
foverlaps(df2, df1, by.x = names(df2),by.y=c("lower","upper"))[,list(age,x)]
# age x
# 1: 12 67
# 2: 14 67
# 3: 5 45
# 4: 2 12
# 5: 9 45
# 6: 19 89
# 7: 22 10
# 8: 18 89
# 9: 23 10
Here's another vectorized approach using findInterval on a melted data set
library(data.table)
df2$x <- melt(setDT(df1), "x")[order(value), x[findInterval(df2$age, value)]]
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10
The idea here is to
First, tidy up you data so lower and upper will be in the same column and x will have corresponding values to that new column,
Then, sort the data according to these ranges (necessary for findInterval).
Finally, run findInterval within the x column in order to find the correct incidences
And here's a possible dplyr/tidyr version
library(tidyr)
library(dplyr)
df1 %>%
gather(variable, value, -x) %>%
arrange(value) %>%
do(data.frame(x = .$x[findInterval(df2$age, .$value)])) %>%
cbind(df2, .)
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10

dplyr- renaming sequence of columns with select function

I'm trying to rename my columns in dplyr. I found that doing it with select function. however when I try to rename some selected columns with sequence I cannot rename them the format that I want.
test = data.frame(x = rep(1:3, each = 2),
group =rep(c("Group 1","Group 2"),3),
y1=c(22,8,11,4,7,5),
y2=c(22,18,21,14,17,15),
y3=c(23,18,51,44,27,35),
y4=c(21,28,311,24,227,225))
CC <- paste("CC",seq(0,3,1),sep="")
aa<-test%>%
select(AC=x,AR=group,CC=y1:y4)
head(aa)
AC AR CC1 CC2 CC3 CC4
1 1 Group 1 22 22 23 21
2 1 Group 2 8 18 18 28
3 2 Group 1 11 21 51 311
4 2 Group 2 4 14 44 24
5 3 Group 1 7 17 27 227
6 3 Group 2 5 15 35 225
the problem is even I set CC value from CC0, CC1, CC2, CC3 the output gives automatically head names starting from CC1.
how can I solve this issue?
I think you'll have an easier time crating such an expression with the select_ function:
library(dplyr)
test <- data.frame(x=rep(1:3, each=2),
group=rep(c("Group 1", "Group 2"), 3),
y1=c(22, 8, 11, 4, 7, 5),
y2=c(22, 18, 21, 14, 17, 15),
y3=c(23, 18, 51, 44, 27, 35),
y4=c(21, 28, 311,24, 227, 225))
# build out our select "translation" named vector
DQ <- paste0("y", 1:4)
names(DQ) <- paste0("DQ", seq(0, 3, 1))
# take a look
DQ
## DQ0 DQ1 DQ2 DQ3
## "y1" "y2" "y3" "y4"
test %>%
select_("AC"="x", "AR"="group", .dots=DQ)
## AC AR DQ0 DQ1 DQ2 DQ3
## 1 1 Group 1 22 22 23 21
## 2 1 Group 2 8 18 18 28
## 3 2 Group 1 11 21 51 311
## 4 2 Group 2 4 14 44 24
## 5 3 Group 1 7 17 27 227
## 6 3 Group 2 5 15 35 225

Resources