I need to create a binary variable called dum, (perhaps using an ifelse statement) matching on the number of the column names.
ifelse f[number] %in% c(4:6) & l[number]==1, 1, else 0
f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)
mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)
dum is 1 if f1 contains values between 4, 5, 6 AND l1 contains a value of 1, OR f2 contains values between 4, 5, 6 AND l2 contains a value of 1, and so on.
In essence, the expected output should be
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
I can only think of doing it in a very long way such as
mutate(dum=ifelse(f1 %in% c(4:6 & l1==1, 1,
ifelse(f2 %in% c(4:6) & l2==1, 1,
ifelse(f3 %in% c(4:6) & l3==1, 1,
ifelse(f4 %in% c(4:6) & l4==1, 1, 0))))
But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.
Is there a more efficient way to do this?
Here is one suggestion. Again it is not exactly clear. Assuming you want one column with dum that indicates the presences of the number in the column names in that row in any of the columns:
library(dplyr)
library(readr)
mydata %>%
mutate(across(f1:l4, ~case_when(. == parse_number(cur_column()) ~ 1,
TRUE ~ 0), .names = 'new_{col}')) %>%
mutate(sumNew = rowSums(.[9:16])) %>%
mutate(dum = ifelse(sumNew >=1, 1, 0)) %>%
select(1:8, dum)
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 1
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
Here is one option with across - loop across the 'f' columns, use the first condition, loop across the 'l' columns' with the second condition applied, join them together with & to return a logical matrix, get the row wise sum of the columns (TRUE -> 1 and FALSE -> 0), check if that sum is greater than 0 (i.e. if there are any TRUE in that row), and coerce the logical to binary with + or as.integer
library(dplyr)
mydata %>%
mutate(dum = +(rowSums(across(starts_with('f'), ~.x %in% 4:6) &
across(starts_with('l'), ~ .x %in% 1)) > 0))
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
We could also use base R
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 &
y %in% 1, mydata[startsWith(names(mydata), "f")],
mydata[startsWith(names(mydata), "l")])))
Here's an approach multiplying two mapplys together, columns identified with grep, then calculating rowSums > 0. If you set na.rm=F you could get NAs in respective rows.
as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6))*
mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm=T) > 0)
# [1] 1 0 1 0 0
If f* and l* each aren't consecutive, rather use sort(grep(., value=T)).
Related
How can I recode 0 to 1 and 1 to 0 for columns i1:i3 in the below sample dataset?
df <- data.frame(id = c(11,22,33),
i1 = c(0,1,NA),
i2 = c(1,1,0),
i3 = c(0,NA,1))
> df
id i1 i2 i3
1 11 0 1 0
2 22 1 1 NA
3 33 NA 0 1
I have tens of columns starting with i... So I need a indexing condition to apply only for those columns. The desired output would be:
> df1
id i1 i2 i3
1 11 1 0 1
2 22 0 0 NA
3 33 NA 1 0
You could approach this by indexing; would work fine if all variables beyond the id column begin with i as in the question.
df[, 2:4] <- ifelse(df[, 2:4] == 0, 1, 0)
# or more succinctly, following the examples of others, and still using `ifelse`
df[-1] <- ifelse(df[-1] == 0, 1, 0)
df
#> id i1 i2 i3
#> 1 11 1 0 1
#> 2 22 0 0 NA
#> 3 33 NA 1 0
Created on 2022-10-10 with reprex v2.0.2
We can just negate and coerce
df[-1] <- +(!df[-1])
-output
> df
id i1 i2 i3
1 11 1 0 1
2 22 0 0 NA
3 33 NA 1 0
We can simply use -
> df[-1] <- 1 - df[-1]
> df
id i1 i2 i3
1 11 1 0 1
2 22 0 0 NA
3 33 NA 1 0
We can mutate only the columns beginning with i followed by a number using across and matches from dplyr and we can change values as you've specified using recode.
library(dplyr)
df %>%
mutate(across(matches('^i\\d+'), recode, '1' = 0, '0' = 1))
Alternatively, in base R you can do this
i_cols <- str_detect(colnames(df), '^i\\d+')
df[,i_cols] <- ifelse(df[,i_cols] == 0, 1, 0)
Consider the vector:
use = c(1,1,2,2,5,1,2,1,2,5,1)
I'm trying to replace all the numbers different from 5 to NA before the first number 5 shows up in the sequence:
ifelse(use != 5,NA,1).
After that the condition should be
ifelse(use != 5,0,1).
The output would be:
after = c(NA,NA,NA,NA,1,0,0,0,0,1,0)
Any tips?
You should try:
`is.na<-`(match(use, 5, 0), seq(match(5, use) - 1))
[1] NA NA NA NA 1 0 0 0 0 1 0
Here is a base R solution
after <- replace(v<- ifelse(use !=5,NA,1),
which(head(which(v==1),1)<seq_along(v) & is.na(v)),
0)
such that
> after
[1] NA NA NA NA 1 0 0 0 0 1 0
Weird subsetting:
c(NA[!cumsum(use == 5)], +(use[!!cumsum(use == 5)] == 5))
#[1] NA NA NA NA 1 0 0 0 0 1 0
We can use match
replace(use, seq_len(match(5, use) - 1), NA)
#[1] NA NA NA NA 5 1 2 1 2 5 1
Or as #M-- commented, this can be changed to binary with
+(replace(use, seq_len(match(5, use) - 1), NA)==5)
This will work if there's only one 5 in your vector
use = c(1,1,2,2,5,1,2,2,2)
use <- findInterval(use,5)*5
i <- which(use > 0)
if(i > 1) use[1:(i-1)] <- NA
Here is another variation. I through in some error handling in case there are no 5's in the vector.
test1 <- c(1,1,1,1,2,3,3)
test2 <- c(5,1,1,2,5,1,2,7,8)
test3 <- c(1,1,3,5,6,7,8,2)
test4 <- c(1,2,3,4,5,5,1,5,5,5,1,1,7,8,1)
find_and_replace <- function(vec, target){
tryCatch(
ifelse( seq_along(vec) %in% 1:{(which(vec == target)[[1]])-1}, NA, ifelse(vec == 5, 1, 0)),
error = function(x) {
warning(paste("Warning: No", target))
vec
}
)
}
find_and_replace(test1, 5)
#> Warning: No 5
#> [1] 1 1 1 1 2 3 3
find_and_replace(test2, 5)
#> [1] NA 0 0 0 1 0 0 0 0
find_and_replace(test3, 5)
#> [1] NA NA NA 1 0 0 0 0
find_and_replace(test4, 5)
#> [1] NA NA NA NA 1 1 0 1 1 1 0 0 0 0 0
The following code solves the problem:
use[1:(which(use == 5)[1]-1)] = NA
use[(which(use == 5)[1]+1):length(use)] = 0
use[which(use == 5)[1]] = 1
use
[1] NA NA NA NA 1 0 0 0 0
You can use which to find the location of the target, and then case_when
use <- c(1,1,2,2,5,1,2,1,2)
first_five <- min(which(use == 5))
dplyr::case_when(
seq_along(use) < first_five ~ NA_real_,
seq_along(use) == first_five ~ 1,
TRUE ~ 0
)
#> [1] NA NA NA NA 1 0 0 0 0
use
#> [1] 1 1 2 2 5 1 2 1 2
Created on 2020-01-14 by the reprex package (v0.3.0)
You could detect the first 5,
first_pos <- which(use==5)
and, if such elements exist, set all entries before the first occurence to NA:
if(length(first_pos)>0) {
use[seq(1,first_pos[1]-1)] <- NA
use[seq(1,first_pos[1])] <- 1
use[seq(first_pos[1]+1, length(use)] <- 0
}
Note that first_pos[1] is called in case there are more than one 5.
I have a dataframe (fbwb) with multiple assessments of bullying (1-6) using multiple measures (1-3) in a group of participants. The df looks like this:
fbwb <- read.table(text="id year bully1 bully2 bully3 cbully bully_ever
100 1 NA 1 NA 1 1
100 2 1 1 NA 1 1
100 3 NA 0 NA 0 1
101 1 NA NA 1 1 1
102 1 NA 1 NA 1 1
102 2 NA NA NA NA 1
102 3 NA 1 1 1 1
102 4 0 0 0 0 1
103 1 NA 1 NA 1 1
103 2 NA 0 0 0 1", header=TRUE)
Where bully1, bully2, and bully3 are binary variables that each = 1 if bullying was reported on the respective measure.
cbully is binary and = 1 if any of the 3 bullying variables = 1 for a given year.
bully_ever is binary and = 1 if bullying was reported on any measure in any year for a given participant.
I want to create a new binary variable in my df called bully_past. bully_past represents the case when cbully = 1 in ANY PAST YEAR. This is subtly different from bully_ever. For example, if a participant has been assessed 4 times:
bully_past should use info from years 3, 2, and 1 AT YEAR 4.
bully_past should use info from years 2 and 1 AT YEAR 3.
bully_past should use info from year 1 AT YEAR 2.
bully_past should be NA at year 1.
I have tried quite a few things, but the most recent rendition is the following:
fbwb <- fbwb %>%
dplyr::group_by(id) %>%
dplyr::mutate(bully_past = case_when(cbully == 1 & year == (year - 1) |
cbully == 1 & year == (year - 2) |
cbully == 1 & year == (year - 3) |
cbully == 1 & year == (year - 4) |
cbully == 1 & year == (year - 5) ~ 1,
(is.na(cbully) & year == (year - 1) &
is.na(cbully) & year == (year - 2) &
is.na(cbully) & year == (year - 3) &
is.na(cbully) & year == (year - 4) &
is.na(cbully) & year == (year - 5)) ~ NA_real_,
TRUE ~ 0)) %>%
dplyr::ungroup()
This does not work because the syntax for indicating which years to use is not correct - so it generates a column of NA values. I have made other attempts, but I have not been able to manage to take into account observations from ALL PREVIOUS YEARS.
It can be done in Stata using this code:
gen bullyingever = bullying
sort iid time
replace bullyingever = 1 if bullying[_n - 1]==1 & iid[_n - 1]==iid
replace bullyingever = 1 if bullying[_n - 2]==1 & iid[_n - 2]==iid
replace bullyingever = 1 if bullying[_n - 3]==1 & iid[_n - 3]==iid
replace bullyingever = 1 if bullying[_n - 4]==1 & iid[_n - 4]==iid
replace bullyingever = 1 if bullying[_n - 5]==1 & iid[_n - 5]==iid
I appreciate any input on how to accomplish this in R, preferably using dplyr.
Here we can write a helper function that can look at previous events both using cumsum (to keep a cumulative account of events which lets you look into the past) and lag() in order to look exclusively behind the current value. So we have
had_previous_event <- function(x) {
lag(cumsum(!is.na(x) & x==1)>0)
}
You can then use that with your dplyr chain
fbwb %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(bully_past = had_previous_event(cbully))
This returns TRUE/FALSE but if you want zero/one you can change that to
mutate(bully_past = as.numeric(had_previous_event(cbully)))
One solution can be using dplyr and ifelse as:
library(dplyr)
fbwb %>% group_by(id) %>%
arrange(id, year) %>%
mutate(bully_past_year = ifelse(is.na(lag(cbully)), 0L, lag(cbully))) %>%
mutate(bully_past = ifelse(cumsum(bully_past_year)>0L, 1L, 0 )) %>%
select(-bully_past_year) %>% as.data.frame()
# id year bully1 bully2 bully3 cbully bully_ever bully_past
# 1 100 1 NA 1 NA 1 1 0
# 2 100 2 1 1 NA 1 1 1
# 3 100 3 NA 0 NA 0 1 1
# 4 101 1 NA NA 1 1 1 0
# 5 102 1 NA 1 NA 1 1 0
# 6 102 2 NA NA NA NA 1 1
# 7 102 3 NA 1 1 1 1 1
# 8 102 4 0 0 0 0 1 1
# 9 103 1 NA 1 NA 1 1 0
# 10 103 2 NA 0 0 0 1 1
There is an alternative approach which aggregates in a non-equi self-join. This approach has the benefit that it works even with unordered data.
library(data.table)
# coerce to data.table
bp <- setDT(fbwb)[
# non equi self-join and aggregate within the join
fbwb, on = .(id, year < year), as.integer(any(cbully)), by = .EACHI][]
# append new column
fbwb[, bully_past := bp$V1][]
id year bully1 bully2 bully3 cbully bully_ever bully_past
1: 100 1 NA 1 NA 1 1 NA
2: 100 2 1 1 NA 1 1 1
3: 100 3 NA 0 NA 0 1 1
4: 101 1 NA NA 1 1 1 NA
5: 102 1 NA 1 NA 1 1 NA
6: 102 2 NA NA NA NA 1 1
7: 102 3 NA 1 1 1 1 1
8: 102 4 0 0 0 0 1 1
9: 103 1 NA 1 NA 1 1 NA
10: 103 2 NA 0 0 0 1 1
The non-equi join condition considers only previous years. So, the first year for each id is NA as requested by the OP.
The any() function returns TRUE if at least one of the values is TRUE (after coersion to type logical). In R, the integer value 1L corresponds to the logical value TRUE.
I have a set of variables that contain data about whether or not a person has ever had certain health conditions. For example, "have you ever had a heart attack?"
If they say "yes" at observation 2, then the answer is still yes at observations 3 and 4. But, it is not necessarily yes at observation 1. The heart attack could have occurred between observation 1 and 2.
If they say "no" at observation 2, then the answer is no at observations 1. But, it is not necessarily no at observations 3 or 4.
Here is a reproducible example:
df <- tibble(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA)
)
df
id obs mi_ever
1 1 1 NA
2 1 2 0
3 1 3 1
4 1 4 NA
5 2 1 NA
6 2 2 0
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 1
11 3 3 NA
12 3 4 NA
It's trivial to carry my 0's (No's) backward or carry my 1's (Yes's) forward using zoo::na.locf. However, I'm not sure how to carry 0's backward and 1's forward. Ideally, I'd like the following result:
id obs mi_ever mi_ever_2
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
I've checked out the following posts, but none seem to cover exactly what I'm asking here.
Carry last Factor observation forward and backward in group of rows in R
Forward and backward fill data frame in R
making a "dropdown" function in R
Any help is appreciated.
Basically I'm marking the items in sequence after the first 1 to become 1 and the ones before the last 0 to become 0.
ever <- function (x) min( which( x == 1))
NA_1 <- function(x) seq_along(x) > ever(x) #could have done in one function
# check to see if working
ave(df$mi_ever, df$id, FUN= function(x){ x[NA_1(x) ] <- 1; x})
[1] NA 0 1 1 NA 0 NA NA NA 1 1 1
NA_0 <- function(x) seq_along(x) < not_yet(x)
not_yet <- function(x){ max( which( x==0)) }
# make temporary version of 1-modified column
temp1 <- ave(df$mi_ever, df$id, FUN= function(x){ x[NA_1(x) ] <- 1; x})
df$ever2 <- ave(temp1, df$id, FUN= function(x){ x[NA_0(x) ] <- 0; x})
# then make final version; could have done it "in place" I suppose.
df
# A tibble: 12 x 4
id obs mi_ever ever2
<int> <int> <dbl> <dbl>
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
If you need to suppress the warnings that should be possible.
I took the answer from #42- above (Thank you!), and tweaked it a little bit to further suit my needs. Specifically, I:
Took care of the warning "no non-missing arguments to min; returning Infno non-missing arguments to max; returning -Inf".
Combined the separate functions into a single function (although the separate functions were extremely useful for learning).
Added an optional check_logic argument. When TRUE, the function will return 9's if a 0 comes after a 1. This represents a data error or logic flaw that warrants further investigation.
Added an example of using the function with data.table, and on multiple variables at once. This more accurately represents how I'm using the function in real life, and I thought it may be useful to others.
The function:
distribute_ever <- function(x, check_logic = TRUE, ...) {
if (check_logic) {
if (length(which(x == 1)) > 0 & length(which(x == 0)) > 0) {
if (min(which(x == 1)) < max(which(x == 0))) {
x <- 9 # Set x to 9 if zero comes after 1
}
}
}
ones <- which(x == 1) # Get indices for 1's
if (length(ones) > 0) { # Prevents warning
first_1_by_group <- min(which(x == 1)) # Index first 1 by group
x[seq_along(x) > first_1_by_group] <- 1 # Set x at subsequent indices to 1
}
zeros <- which(x == 0) # Get indices for 0's
if (length(zeros) > 0) { # Prevents warning
last_0_by_group <- max(which(x == 0)) # Index last 0 by group
x[seq_along(x) < last_0_by_group] <- 0 # Set x at previous indices to 0
}
x
}
A new reproducible example with multiple "ever" variables and some cases with 0 after 1:
dt <- data.table(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA),
diab_ever = c(0, NA, NA, 1, 1, NA, NA, 0, 0, NA, NA, NA)
)
Iterate over multiple variables quickly using data.table (with by group processing):
ever_vars <- c("mi_ever", "diab_ever")
dt[, paste0(ever_vars, "_2") := lapply(.SD, distribute_ever),
.SDcols = ever_vars,
by = id][]
Results:
id obs mi_ever diab_ever mi_ever_2 diab_ever_2
1: 1 1 NA 0 0 0
2: 1 2 0 NA 0 NA
3: 1 3 1 NA 1 NA
4: 1 4 NA 1 1 1
5: 2 1 NA 1 0 9
6: 2 2 0 NA 0 9
7: 2 3 NA NA NA 9
8: 2 4 NA 0 NA 9
9: 3 1 NA 0 NA 0
10: 3 2 1 NA 1 NA
11: 3 3 NA NA 1 NA
12: 3 4 NA NA 1 NA
For each input "ever" variable, we have:
Created a new variable with "_2" appended to the end of the input variable name. You could also edit "in place" as 42- pointed out, but I like being able to double check my data.
Zeroes are carried backward and ones are carried forward in time.
NA's after zeros and before ones (within id) are returned unchanged.
If there is a 0 (No, I've never had ...) after a 1 (Yes, I've had ...), as is the case with person 2's responses regarding diabetes, then the function returns 9's.
If we were to set check_logic to FALSE, then 1's would win out and replace 0's
I would like to replace all consecutive NA values per row with zero but only if the number of consecutive NAs is less than a parmeter maxgap.
This is very similar to the function zoo::na.locf
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
zoo::na.locf(x, maxgap = 2, na.rm = FALSE)
gives
[1] NA 1 2 3 3 3 5 6 7 NA NA NA
There are two things different from my aim:
I would like to replace the leading NA too and I would like to replace the 2 consecutive NAs with 0 and not the last non-NA value.
I would like to get
0 1 2 3 0 0 5 6 7 NA NA NA
How can I do this in R. Can I use functions from the tidyverse?
If y is the result of the na.locf line then if y[i] is not NA but x[i] is NA then it was replaced so assign 0 to it. Also if it is a leading NA which occurs when the cumsum(...) term below is 0 then replace it as well.
replace(y, (!is.na(y) & is.na(x)) | cumsum(!is.na(y)) == 0, 0)
## [1] 0 1 2 3 0 0 5 6 7 NA NA NA
We can use rle to do this
f1 <- function(vec){
rl <- rle(is.na(vec))
lst <- within.list(rl, {
i1 <- seq_along(values)==1
i2 <- seq_along(values) != length(values)
values[!((lengths==2 & values & i2)|
(values & i1))] <- FALSE
})
vec[inverse.rle(lst)] <- 0
vec
}
f1(x)
#[1] 0 1 2 3 0 0 5 6 7 NA NA NA
You could e.g. do this:
require(data.table)
require(dplyr)
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
my_replace <- function(x, n, maxgap){
if(is.na(x[1]) && n <= maxgap){
x <- 0
}
x
}
data.frame(x, y=x) %>%
group_by(data.table::rleid(x)) %>%
mutate(x = my_replace(x, n(), 2), y = my_replace(y, n(), 1)) %>%
ungroup() %>%
select(x,y)
This allows you to set the maxgap columnwise: for x 2 for y 1.
This results in:
# A tibble: 12 × 2
x y
<dbl> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 0 NA
6 0 NA
7 5 5
8 6 6
9 7 7
10 NA NA
11 NA NA
12 NA NA