How to recode values in a sequence of columns in R

How can I recode 0 to 1 and 1 to 0 for columns i1:i3 in the below sample dataset?
df <- data.frame(id = c(11,22,33),
i1 = c(0,1,NA),
i2 = c(1,1,0),
i3 = c(0,NA,1))
> df
id i1 i2 i3
1 11 0 1 0
2 22 1 1 NA
3 33 NA 0 1
I have tens of columns starting with i, so I need an indexing condition that applies only to those columns. The desired output would be:
> df1
id i1 i2 i3
1 11 1 0 1
2 22 0 0 NA
3 33 NA 1 0

You could approach this by indexing, which works fine if all variables beyond the id column begin with i, as in the question.
df[, 2:4] <- ifelse(df[, 2:4] == 0, 1, 0)
# or more succinctly, following the examples of others, and still using `ifelse`
df[-1] <- ifelse(df[-1] == 0, 1, 0)
df
#> id i1 i2 i3
#> 1 11 1 0 1
#> 2 22 0 0 NA
#> 3 33 NA 1 0
Created on 2022-10-10 with reprex v2.0.2

We can just negate and coerce
df[-1] <- +(!df[-1])
-output
> df
id i1 i2 i3
1 11 1 0 1
2 22 0 0 NA
3 33 NA 1 0
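To see why negating and coercing works, here is a minimal sketch on a single vector (the vector x is made up for illustration). `!` treats 0 as FALSE and any non-zero number as TRUE before flipping, NA stays NA, and unary `+` turns the logical result back into integers. The trick therefore assumes the columns hold only 0, 1, and NA.

```r
x <- c(0, 1, NA)   # hypothetical column values

# !x is TRUE, FALSE, NA; unary + coerces the logical vector to integer
flipped <- +(!x)
flipped
```

Any other non-zero value (2, for instance) would also be mapped to 0, so this is only a safe recode for strictly binary columns.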

We can simply use -
> df[-1] <- 1 - df[-1]
> df
id i1 i2 i3
1 11 1 0 1
2 22 0 0 NA
3 33 NA 1 0

We can mutate only the columns beginning with i followed by a number using across and matches from dplyr and we can change values as you've specified using recode.
library(dplyr)
df %>%
mutate(across(matches('^i\\d+'), recode, '1' = 0, '0' = 1))
Alternatively, in base R you can do this:
i_cols <- grepl('^i\\d+', colnames(df))
df[,i_cols] <- ifelse(df[,i_cols] == 0, 1, 0)
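A self-contained sketch of the pattern-based selection using only base R. It combines `grepl` with the `1 - x` flip from the earlier answer. The extra column income is hypothetical and only there to show that a name starting with i but not followed by digits is left alone.

```r
df <- data.frame(id = c(11, 22, 33),
                 i1 = c(0, 1, NA),
                 i2 = c(1, 1, 0),
                 i3 = c(0, NA, 1),
                 income = c(10, 20, 30))  # hypothetical extra column

# TRUE for columns named "i" followed by digits only
i_cols <- grepl("^i\\d+$", names(df))

# 1 - x maps 0 -> 1 and 1 -> 0 and keeps NA as NA
df[i_cols] <- 1 - df[i_cols]
df
```

Anchoring the pattern with `$` is a design choice here: without it, a name such as i1_total would also match.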

Related

Ifelse across multiple columns matching on similar attributes

I need to create a binary variable called dum, (perhaps using an ifelse statement) matching on the number of the column names.
ifelse f[number] %in% c(4:6) & l[number]==1, 1, else 0
f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)
mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)
dum is 1 if f1 contains values between 4, 5, 6 AND l1 contains a value of 1, OR f2 contains values between 4, 5, 6 AND l2 contains a value of 1, and so on.
In essence, the expected output should be
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
I can only think of doing it in a very long way such as
mutate(dum=ifelse(f1 %in% c(4:6) & l1==1, 1,
ifelse(f2 %in% c(4:6) & l2==1, 1,
ifelse(f3 %in% c(4:6) & l3==1, 1,
ifelse(f4 %in% c(4:6) & l4==1, 1, 0)))))
But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.
Is there a more efficient way to do this?
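Here is one suggestion, although the requirement is not entirely clear. Assuming you want a single column dum that indicates whether, in that row, any column holds the number that appears in its own column name (note that this reading gives dum = 1 for row 2, unlike the expected output in the question):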
library(dplyr)
library(readr)
mydata %>%
mutate(across(f1:l4, ~case_when(. == parse_number(cur_column()) ~ 1,
TRUE ~ 0), .names = 'new_{col}')) %>%
mutate(sumNew = rowSums(.[9:16])) %>%
mutate(dum = ifelse(sumNew >=1, 1, 0)) %>%
select(1:8, dum)
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 1
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
Here is one option with across: loop across the 'f' columns with the first condition, loop across the 'l' columns with the second condition, and join the two results with & to get a logical matrix. Then take the row-wise sum (TRUE counts as 1 and FALSE as 0), check whether that sum is greater than 0 (i.e. whether the row contains any TRUE), and coerce the logical to binary with + or as.integer.
library(dplyr)
mydata %>%
mutate(dum = +(rowSums(across(starts_with('f'), ~.x %in% 4:6) &
across(starts_with('l'), ~ .x %in% 1)) > 0))
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
We could also use base R
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 &
y %in% 1, mydata[startsWith(names(mydata), "f")],
mydata[startsWith(names(mydata), "l")])))
Here's an approach that multiplies the results of two mapply calls, with the columns identified by grep, and then checks rowSums > 0. If you set na.rm=F, rows containing NA would return NA instead.
as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6))*
mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm=T) > 0)
# [1] 1 0 1 0 0
If f* and l* each aren't consecutive, rather use sort(grep(., value=T)).
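The ordering caveat can be sketched in base R as follows. The shuffled column order is made up for illustration, and the sketch assumes every f column has an l column with the same numeric suffix, so that sorting both name vectors lines the pairs up.

```r
mydata <- data.frame(f1 = c(3, 2, 1, 6, 5),   f2 = c(4, 1, 5, NA, NA),
                     f3 = c(5, 3, 4, NA, NA), f4 = c(1, 2, 4, NA, NA),
                     l1 = c(1, 0, 1, 0, 0),   l2 = c(1, 1, 1, NA, NA),
                     l3 = c(1, 0, 0, NA, NA), l4 = c(0, 0, 0, NA, NA))

# Shuffle the columns so the f and l columns are no longer in matching order
shuffled <- mydata[c("l3", "f2", "l1", "f4", "f1", "l4", "f3", "l2")]

# Sorting both name vectors restores the f1/l1, f2/l2, ... pairing
f_names <- sort(grep("^f", names(shuffled), value = TRUE))
l_names <- sort(grep("^l", names(shuffled), value = TRUE))

# Test both conditions for each pair; %in% returns FALSE for NA
hits <- Map(function(f, l) f %in% 4:6 & l %in% 1,
            shuffled[f_names], shuffled[l_names])

# A row gets dum = 1 if any pair satisfies both conditions
shuffled$dum <- as.integer(Reduce(`|`, hits))
shuffled$dum
```

`sort` orders names as text, so f10 comes before f2, but the l names are sorted the same way and the pairs still align as long as both groups carry the same suffixes.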

Return 3 next and previous rows if value occurs in particular row

I have data frame like this:
Input = (" v1 v2
1 A1 0
2 B1 0
3 C1 0
4 D1 1
5 E1 0
6 F1 0
7 G1 0
8 H1 0
9 I1 0
10 J1 0
11 K1 0
12 A2 1
13 B2 0
14 C2 0
15 D2 0
16 E2 0
17 F2 0
18 G2 0
19 H2 0
20 I2 0
21 J2 0
22 K2 0
")
df = as.data.frame(read.table(textConnection(Input), header = T, row.names=1))
And I'd like to keep only rows with 1 in v2 and 3 previous and next rows around each 1, so desired output is:
v1 v2
A1 0
B1 0
C1 0
D1 1
E1 0
F1 0
G1 0
I1 0
J1 0
K1 0
A2 1
B2 0
C2 0
D2 0
So we have all 1-rows (in this case 2) and 6 corresponding neighbor rows (3 lower, 3 upper).
In the original dataset I have 100k+ rows and only a few 1-rows spread across the whole dataset.
I tried to do this with simple ifelse() in apply for prevs and next rows separately and then combine everything together but it doesn't work.
prev <- as.data.frame(apply(df, 1, function(x) ifelse(x[1]==1,x-1:3,0)))
next <- as.data.frame(apply(df, 1, function(x) ifelse(x[1]==1,x+1:3,0)))
I was thinking of using lag() and lead(), but I don't know how to lag or lead n=3 rows only around the rows with 1 in v2. Could you please help me out?
One possible solution (maybe a bit lengthy but very interesting for other purposes as well) is to create multiple lags and leads of the variable of interest and then filter for any variable that has value equal to 1.
We first create two functions that produce n lags and n leads, respectively, starting from a dataframe:
lags <- function(data, variable, n){
require(dplyr)
require(purrr)
variable <- enquo(variable)
indices <- seq_len(n)
quosures <- map(indices, ~quo(lag(!!variable, !!.x))) %>%
set_names(sprintf("lag_%02d", indices))
mutate(data, !!!quosures)
}
leads <- function(data, variable, n){
require(dplyr)
require(purrr)
variable <- enquo(variable)
indices <- seq_len(n)
quosures <- map(indices, ~quo(lead(!!variable, !!.x))) %>%
set_names(sprintf("lead_%02d", indices))
mutate(data, !!!quosures)
}
Then we apply them to our dataframe and filter the observations that contain a 1:
library(dplyr)
df %>%
lags(v2, n = 3) %>%
leads(v2, n = 3) %>%
filter_all(any_vars(. == 1)) %>%
select(v1, v2)
# v1 v2
# 1 A1 0
# 2 B1 0
# 3 C1 0
# 4 D1 1
# 5 E1 0
# 6 F1 0
# 7 G1 0
# 8 I1 0
# 9 J1 0
# 10 K1 0
# 11 A2 1
# 12 B2 0
# 13 C2 0
# 14 D2 0
We can find out indices where v2 = 1 occurs and use sapply to generate row numbers -3 to +3 of each index.
#get row index where v2 = 1
inds <- which(df$v2 == 1)
#unique to remove overlapping row index
inds2 <- unique(c(sapply(inds, `+`, -3:3)))
#remove non-positive values or values which are greater than number of rows in df
inds2 <- inds2[inds2 > 0 & inds2 <= nrow(df)]
#select rows.
df[inds2, ]
# v1 v2
#1 A1 0
#2 B1 0
#3 C1 0
#4 D1 1
#5 E1 0
#6 F1 0
#7 G1 0
#9 I1 0
#10 J1 0
#11 K1 0
#12 A2 1
#13 B2 0
#14 C2 0
#15 D2 0
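The index approach generalizes to any window width. A minimal sketch as a helper follows; the name window_rows and its argument k are hypothetical, not part of the answer above.

```r
window_rows <- function(flag, k = 3) {
  hits <- which(flag == 1)
  # Every offset -k..k around every hit, flattened to one vector
  idx <- as.vector(outer(-k:k, hits, `+`))
  # Drop duplicates from overlapping windows and keep row order
  idx <- sort(unique(idx))
  # Drop positions that fall outside the data
  idx[idx >= 1 & idx <= length(flag)]
}

# The v2 column from the question: 1s at rows 4 and 12, 22 rows in total
v2 <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, rep(0, 10))
rows <- window_rows(v2)
rows
```

With the question's data frame this would be used as df[window_rows(df$v2), ]. Only the few hit positions are expanded, so the cost depends on the number of 1-rows rather than on the 100k+ total rows.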

How to paste values in a variable that are based conditionally on values in another different variable in r?

I have searched on stack overflow for examples similar to the problem I am facing and am stuck, so any help would be appreciated! I have a dataframe that is similar to the one below:
df <- data.frame( "ID" = c(rep(1,6), rep(2,6), rep(3,5), rep(4,5)), "A" = c(0, rep(0,4),1 ,rep(0, 5), 1, rep(0,3), 1, rep(0,2), 1, rep(0,3)), "count" = NA)
and I would like to edit the "count" variable so the dataframe looks like this:
df2 <- data.frame( "ID" = c(rep(1,6), rep(2,6), rep(3,5), rep(4,5)), "A" = c(0, rep(0,4),1 ,rep(0, 5), 1, rep(0,3), 1, rep(0,2), 1, rep(0,3)), "count" = c(NA, NA, -3:-1,1, NA, NA, -3:-1,1, -3:-1, 1:2, -1, 1:3, NA ))
Within each df$ID, when df$A = 1 I need df$count = 1. Additionally, I need df$count to count forward from 1:3 and count backwards from -1:-3, omitting zero so df2 is produced. Any help is appreciated!
You can write a function which gives you the desired sequence :
library(dplyr)
add_num <- function(x) {
#Get the index of 1
inds <- which(x == 1)
#Create a sequence with that index as 0
num <- lapply(inds, function(i) {
num <- seq_along(x) - i
#Add 1 to values greater than or equal to 0
num[num >= 0] <- num[num >= 0] + 1
num[num < -3 | num > 3] <- NA
num
})
#Select the first non-NA values from the sequence
do.call(coalesce, num)
}
Now apply this function to each ID:
df %>% group_by(ID) %>% mutate(count = add_num(A))
# ID A count
#1 1 0 NA
#2 1 0 NA
#3 1 0 -3
#4 1 0 -2
#5 1 0 -1
#6 1 1 1
#7 1 0 2
#8 1 0 3
#9 1 0 NA
#...
#...
#46 4 0 NA
#47 4 0 NA
#48 4 0 -3
#49 4 0 -2
#50 4 0 -1
#51 4 1 1
#52 4 0 2
#53 4 0 3
#54 4 0 NA
#55 4 0 -3
#56 4 0 -2
#57 4 0 -1
#58 4 1 1
#59 4 0 2
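For comparison, here is a base R sketch of the same idea that avoids coalesce. The function name signed_offset and the window argument k are hypothetical. Like add_num, it lets the earliest 1 in a group win where windows overlap. It is run here on the df from the question.

```r
df <- data.frame(ID = c(rep(1, 6), rep(2, 6), rep(3, 5), rep(4, 5)),
                 A  = c(0, rep(0, 4), 1, rep(0, 5), 1, rep(0, 3), 1,
                        rep(0, 2), 1, rep(0, 3)))

signed_offset <- function(a, k = 3) {
  out <- rep(NA_real_, length(a))
  # Visit the hits last-to-first so the earliest hit overwrites later ones
  for (i in rev(which(a == 1))) {
    d <- seq_along(a) - i
    d[d >= 0] <- d[d >= 0] + 1   # skip zero: the hit itself becomes 1
    keep <- d >= -k & d <= k     # only positions within the window
    out[keep] <- d[keep]
  }
  out
}

# ave() applies the function within each ID and returns one aligned vector
df$count <- ave(df$A, df$ID, FUN = signed_offset)
df$count
```

A group without any 1 simply stays all NA in this version, because the loop body never runs.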

How to subset rows that have values larger than other values for multiple columns in R

I have the following data.table
library(data.table)
dt <- data.table(V1=c(1,3,1,0,NA,0),
V2=c(1,0,1,0,1,3),
Q1=c(3,5,10,14,0,3),
Q2=c(0,1,8,NA,0,NA))
and I want to add a new column that has value 1 if both of the following hold:
any of the columns V1, V2 has a value larger than 2,
and
any of the columns Q1, Q2 has a value larger than 0.
So in the end I want to end up with something like this:
> dt
V1 V2 Q1 Q2 new
1: 1 1 3 0 0
2: 3 0 5 1 1
3: 1 1 10 8 0
4: 0 0 14 NA 0
5: NA 1 0 0 0
6: 0 3 3 NA 1
EDIT
In principle I would like to have 2 vectors of column names, something like v_columms <- names(dt)[names(dt) %like% "V"] and q_columms <- names(dt)[names(dt) %like% "Q"], and use these.
We can use melt to process multiple columns by specifying the patterns in measure to convert it to 'long' format and then apply the condition
dt[, new := melt(dt, measure = patterns("V", "Q"))[,
+(any(value1 > 2) & any(value2 > 0)),rowid(variable)]$V1]
dt
# V1 V2 Q1 Q2 new
#1: 1 1 3 0 0
#2: 3 0 5 1 1
#3: 1 1 10 8 0
#4: 0 0 14 NA 0
#5: NA 1 0 0 0
#6: 0 3 3 NA 1
Or without melt, if there are only two groups of columns, then
vs <- grep("V", names(dt))
qs <- grep("Q", names(dt))
dt[, new := +(Reduce(`|`, lapply(.SD[, ..vs], `>`, 2)) &
Reduce(`|`, lapply(.SD[, ..qs], `>`, 0)))]
Using dplyr and either case_when or if_else:
library(dplyr)
dt %>%
mutate(new = case_when((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0) ~ 1,
TRUE ~ 0))
dt %>%
mutate(new = if_else((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0), 1 , 0))
V1 V2 Q1 Q2 new
1 1 1 3 0 0
2 3 0 5 1 1
3 1 1 10 8 0
4 0 0 14 NA 0
5 NA 1 0 0 0
6 0 3 3 NA 1
Here's another approach with some helper functions:
foo <- function(.dt, cols, vals, na.rm = TRUE) {
rowSums(.dt[, cols, with=FALSE] > vals, na.rm = na.rm) > 0
}
bar <- function(.dt, cols_list, vals_list) {
as.integer(Reduce("&", Map(function(cols, vals) foo(.dt, cols, vals), cols_list, vals_list)))
}
dt[, new := bar(.SD, list(v_columms, q_columms), list(2, 0))]
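A base R sketch of the same helper idea on a plain data frame, building the two vectors of column names that the question's edit asks for. The names dat, any_above, v_cols, and q_cols are hypothetical. With na.rm = TRUE, an NA counts as not exceeding the threshold.

```r
dat <- data.frame(V1 = c(1, 3, 1, 0, NA, 0),
                  V2 = c(1, 0, 1, 0, 1, 3),
                  Q1 = c(3, 5, 10, 14, 0, 3),
                  Q2 = c(0, 1, 8, NA, 0, NA))

# Column groups selected by name prefix
v_cols <- grep("^V", names(dat), value = TRUE)
q_cols <- grep("^Q", names(dat), value = TRUE)

# TRUE for rows where at least one of the given columns exceeds the threshold
any_above <- function(d, cols, threshold) {
  rowSums(d[cols] > threshold, na.rm = TRUE) > 0
}

# Both group conditions must hold for new = 1
dat$new <- as.integer(any_above(dat, v_cols, 2) & any_above(dat, q_cols, 0))
dat$new
```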

changing the values in many r variables

I want to do the equivalent of find and replace 1=0;2=0;3=0;4=1;5=2;6=3 for many different variables in my data set.
Things I've tried:
making 1=0;2=0;3=0;4=1;5=2;6=3 into a function and using sapply. I changed the ; to , and changed the = to <- and no combination of these were recognized as a function. I tried creating a function with that definition and putting it into sapply and it didn't work.
I tried using recode and it did not work:
wdata[ ,cols2] = recode(wdata[ ,cols2], 1=0;2=0;3=0;4=1;5=2;6=3)
Assuming you are working with a data.frame or matrix you can use direct indexing:
# Sample data
set.seed(2017);
df <- as.data.frame(matrix(sample(1:6, 20, replace = T), ncol = 4));
df;
#V1 V2 V3 V4
#1 6 5 5 3
#2 4 1 1 3
#3 3 3 1 5
#4 2 3 3 6
#5 5 2 3 5
df[df == 1 | df == 2 | df == 3] <- 0;
df[df == 4] <- 1;
df[df == 5] <- 2;
df[df == 6] <- 3;
df;
# V1 V2 V3 V4
#1 3 2 2 0
#2 1 0 0 0
#3 0 0 0 2
#4 0 0 0 3
#5 2 0 0 2
Note that the order of the substitutions matters. For example, df[df == 4] <- 1; df[df == 1] <- 0; gives a different result from df[df == 1] <- 0; df[df == 4] <- 1; because in the first sequence the values that were just changed from 4 to 1 are changed again to 0.
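The effect of the ordering is easy to see on a tiny vector. This is a minimal sketch with made-up values, applying the same two substitutions in both orders:

```r
x <- c(1, 4, 5)

# 4 -> 1 first, then 1 -> 0: the freshly written 1 is remapped a second time
a <- x
a[a == 4] <- 1
a[a == 1] <- 0

# 1 -> 0 first, then 4 -> 1: each original value is remapped only once
b <- x
b[b == 1] <- 0
b[b == 4] <- 1

a
b
```

Here a ends up as 0 0 5 while b ends up as 0 1 5. A lookup-based recode that reads from the original values and writes to a new vector avoids the problem altogether.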
Alternative solution using recode from dplyr with sapply or mutate_all:
set.seed(2017);
df <- as.data.frame(matrix(sample(1:6, 20, replace = T), ncol = 4));
df
library(dplyr)
f = function(x) recode(x, `1`=0, `2`=0, `3`=0, `4`=1, `5`=2, `6`=3)
sapply(df, f)
# V1 V2 V3 V4
# [1,] 3 2 2 0
# [2,] 1 0 0 0
# [3,] 0 0 0 2
# [4,] 0 0 0 3
# [5,] 2 0 0 2
df %>% mutate_all(f)
# V1 V2 V3 V4
# 1 3 2 2 0
# 2 1 0 0 0
# 3 0 0 0 2
# 4 0 0 0 3
# 5 2 0 0 2
A looping alternative with lapply and match is as follows:
dat[] <- lapply(dat, function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
This uses a lookup table on the vector c(0,0,0,1,2,3) with match selecting the indices. Using the data.frame created by Maurits Evers, we get
dat
V1 V2 V3 V4
1 3 2 2 0
2 1 0 0 0
3 0 0 0 2
4 0 0 0 3
5 2 0 0 2
To do this for a subset of the columns, just select them on each side, like
dat[, cols2] <-
lapply(dat[, cols2], function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
or
dat[cols2] <- lapply(dat[cols2], function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
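One detail of the lookup approach worth knowing: match returns NA for values outside 1:6, so such values become NA rather than being left unchanged. A minimal sketch follows; the sample vector and the helper names recode_lookup and keep_unmatched are made up for illustration.

```r
# Same lookup as above, wrapped in a function
recode_lookup <- function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)]

x <- c(6, 4, 3, NA, 7)   # 7 is not in the lookup table
recoded <- recode_lookup(x)
recoded

# Variant that keeps unmatched values as they are instead of returning NA
keep_unmatched <- function(x) {
  new <- recode_lookup(x)
  ifelse(is.na(new), x, new)
}
kept <- keep_unmatched(x)
kept
```

Which behavior is preferable depends on the data: turning unexpected codes into NA makes them easy to spot, while keeping them preserves the original information.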

Resources