I need to create a binary variable called dum (perhaps using an ifelse statement), matching on the number in the column names.
ifelse(f[number] %in% c(4:6) & l[number] == 1, 1, 0)
f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)
mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)
dum should be 1 if f1 contains a value of 4, 5, or 6 AND l1 contains a value of 1, OR f2 contains a value of 4, 5, or 6 AND l2 contains a value of 1, and so on.
In essence, the expected output should be
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
I can only think of doing it in a very long way such as
mydata %>%
  mutate(dum = ifelse(f1 %in% c(4:6) & l1 == 1, 1,
                ifelse(f2 %in% c(4:6) & l2 == 1, 1,
                 ifelse(f3 %in% c(4:6) & l3 == 1, 1,
                  ifelse(f4 %in% c(4:6) & l4 == 1, 1, 0)))))
But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.
Is there a more efficient way to do this?
Here is one suggestion, although the goal is not entirely clear. Assuming you want a single dum column that indicates whether the number in the column name appears in that row in any of the columns:
library(dplyr)
library(readr)
mydata %>%
  mutate(across(f1:l4, ~ case_when(. == parse_number(cur_column()) ~ 1,
                                   TRUE ~ 0), .names = 'new_{col}')) %>%
  mutate(sumNew = rowSums(.[9:16])) %>%
  mutate(dum = ifelse(sumNew >= 1, 1, 0)) %>%
  select(1:8, dum)
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 1
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
Here is one option with across: loop across the 'f' columns and apply the first condition, loop across the 'l' columns with the second condition, join the two with & to return a logical matrix, take the row-wise sum of its columns (TRUE -> 1 and FALSE -> 0), check whether that sum is greater than 0 (i.e. whether there is any TRUE in that row), and coerce the logical to binary with + or as.integer.
library(dplyr)
mydata %>%
  mutate(dum = +(rowSums(across(starts_with('f'), ~ .x %in% 4:6) &
                         across(starts_with('l'), ~ .x %in% 1)) > 0))
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
We could also use base R
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 & y %in% 1,
                                mydata[startsWith(names(mydata), "f")],
                                mydata[startsWith(names(mydata), "l")])))
Here's an approach that multiplies two mapply() calls together, with the columns identified by grep(), and then checks rowSums(...) > 0. If you set na.rm = FALSE you would get NA in the corresponding rows.
as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6)) *
                   mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm = TRUE) > 0)
# [1] 1 0 1 0 0
If the f* and l* columns aren't in matching, consecutive order, sort the matched names first, e.g. sort(grep('^f', names(mydata), value = TRUE)).
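A minimal sketch of that adjustment (assuming the paired columns share matching numeric suffixes, e.g. f3 pairs with l3, so that sorting keeps the two sets aligned):
f_cols <- sort(grep('^f', names(mydata), value = TRUE))
l_cols <- sort(grep('^l', names(mydata), value = TRUE))
# same computation as above, but subsetting by the sorted names so that
# each f column lines up with its l counterpart
as.integer(rowSums(mapply(`%in%`, mydata[f_cols], list(4:6)) *
                   mapply(`==`, mydata[l_cols], 1), na.rm = TRUE) > 0)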
I have a data frame which looks like this:
df
colA colB
0 0
1 1
0 1
0 1
0 1
1 0
0 0
1 1
0 1
I would like to convert a certain proportion of the 0s in colA to NA and a certain proportion of the 1s in colB to NA.
If I do this:
df["colA"][df["colA"] == 0] <- NA
all the 0s in column A will be converted to NA; however, I just want half of them to be converted.
Similarly, for colB I want only a third of the 1s to be converted:
df["colB"][df["colB"] == 1] <- NA
Expected output:
colA colB
0 0
1 1
NA 1
0 1
NA 1
1 0
0 0
1 NA
NA NA
One way:
tmp <- which(df["colA"] == 0)
df$colA[sample(tmp, round(length(tmp)/2))] <- NA
Similarly for colB:
tmp <- which(df["colB"] == 1)
df$colB[sample(tmp, round(length(tmp)/3))] <- NA
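If this has to be repeated for several column/value/proportion combinations, the same idea can be wrapped in a small helper (a hypothetical function, not part of the original answer):
# Hypothetical helper: set a given proportion of the rows where `col` equals
# `value` to NA, using which() and sample() exactly as above
# (assumes more than one matching row).
na_proportion <- function(df, col, value, prop) {
  idx <- which(df[[col]] == value)
  df[[col]][sample(idx, round(length(idx) * prop))] <- NA
  df
}
df <- na_proportion(df, "colA", 0, 1/2)
df <- na_proportion(df, "colB", 1, 1/3)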
You can use prodNA from the missForest package
set.seed(1)
library(missForest)
df[df$colA == 0, "colA"] <- prodNA(df[df$colA == 0, "colA", drop=F], noNA = 0.5)
df[df$colB == 1, "colB"] <- prodNA(df[df$colB == 1, "colB", drop=F], noNA = 1/3)
df
colA colB
1 NA 0
2 1 NA
3 0 NA
4 NA 1
5 NA 1
6 1 0
7 0 0
8 1 1
9 0 1
I'll contribute a tidyverse approach here.
library(tidyverse)
df %>% mutate(id_colA = ifelse(colA == 1, NA, 1:n()),
              colA = ifelse(id_colA %in% sample(na.omit(id_colA), sum(!is.na(id_colA))/2), NA, colA),
              id_colB = ifelse(colB == 0, NA, 1:n()),
              colB = ifelse(id_colB %in% sample(na.omit(id_colB), sum(!is.na(id_colB))/3), NA, colB)) %>%
  select(-starts_with("id_"))
I have two measures for the same object. The measure is binary (1,0) but many observations are also missing, such that the possible options are: 1, 0, NA.
Data Have:
Source1 Source2
NA NA
NA 0
NA 1
0 NA
0 0
0 1
1 NA
1 0
1 1
(Sources can contradict each other, ignore that for now).
I would like to create a third composite variable that summarizes the two variables, such that IF EITHER of the two sources = 1, then the composite variable should be equal to 1. Otherwise, if either of the sources is not missing, then the composite variable should be equal to zero. Lastly, only if both sources are missing, the composite variable should be set to missing.
Data Want:
Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1
I have tried different approaches but continue to have the same issue.
Attempt 1:
df <- df %>% mutate(combined = ifelse(df$source1 == 1 | df$source2 == 1, 1,
                               ifelse(df$source1 == 0 | df$source2 == 0, 0, NA)))
Attempt 2:
df2 <- df %>% mutate(combined = ifelse(is.na(df$source1) & is.na(df$source2), NA,
                                ifelse(df$source1 == 1 | df$source2 == 1, 1, 0)))
Attempt 3:
df3 <- df %>% mutate(combined = ifelse(df$source1 == 1, 1,
                                ifelse(df$source1 == 0 & df$source2 == 1, 1,
                                ifelse(df$source1 == 0 & df$source2 == 0, 0,
                                ifelse(df$source1 == 0 & is.na(df$source2), 0,
                                ifelse(is.na(df$source1) & df$source2 == 1, 1,
                                ifelse(is.na(df$source1) & df$source2 == 0, 0, NA)))))))
The code identifies whether there is a 1 in either source, but every other value comes back as missing, regardless of whether there is a 0 or not.
Actual Output:
Source1 Source2 Composite
NA NA NA
NA 0 NA
NA 1 1
0 NA NA
0 0 NA
0 1 1
1 NA 1
1 0 1
1 1 1
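The root cause is that any comparison involving NA evaluates to NA, and ifelse() propagates that NA; a minimal illustration:
NA == 1                          # NA
NA == 1 | 0 == 1                 # NA | FALSE is still NA
ifelse(NA == 1 | 0 == 1, 1, 0)   # so a (NA, 0) row comes out as NA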
Assuming both the Source1 and Source2 columns are composed of 0s, 1s, and NAs (as you noted), you could use this base R solution. It uses do.call() to call pmax() over the relevant columns of your data frame.
cols = paste0("Source", 1:2)
df$newcol = do.call(pmax, c(df[cols], na.rm = TRUE))
# equivalent to: pmax(df$Source1, df$Source2, na.rm = TRUE)
df
Source1 Source2 Composite newcol
1 NA NA NA NA
2 NA 0 0 0
3 NA 1 1 1
4 0 NA 0 0
5 0 0 0 0
6 0 1 1 1
7 1 NA 1 1
8 1 0 1 1
9 1 1 1 1
Data:
df = read.table(header = TRUE, text = "Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1")
One approach is to use case_when rather than if-else. It seems simplest to check for missing variables first, and then check the non-missing cases afterwards:
library(tidyverse)
df %>%
  mutate(S1Miss = is.na(Source1),
         S2Miss = is.na(Source2)) %>%
  mutate(Composite = case_when(
    S1Miss & S2Miss ~ NA_real_,
    Source1 == 1 | Source2 == 1 ~ 1,
    TRUE ~ 0
  )) %>%
  select(Source1, Source2, Composite)
Note that I made it "easier to read" by first storing the missingness indicators in one call to mutate and then removing these intermediate columns with select.
This was fun, but I wouldn't recommend doing it like this.
source1<-c(NA, NA, NA, 0, 0, 0, 1, 1, 1)
source2<-c(NA, 0, 1, NA, 0, 1, NA, 0, 1)
df<-data.frame(source1, source2)
df$composite <- ifelse(test = is.na(df$source1) & is.na(df$source2), yes = NA,
                no = ifelse(test = is.na(df$source1) & !is.na(df$source2), yes = df$source2,
                no = ifelse(test = is.na(df$source2) & !is.na(df$source1), yes = df$source1,
                no = ifelse(test = df$source1 > df$source2, yes = df$source1,
                no = df$source2))))
source1 source2 composite
1 NA NA NA
2 NA 0 0
3 NA 1 1
4 0 NA 0
5 0 0 0
6 0 1 1
7 1 NA 1
8 1 0 1
9 1 1 1
Consider the vector:
use = c(1,1,2,2,5,1,2,1,2,5,1)
I'm trying to replace all the numbers other than 5 with NA before the first 5 appears in the sequence:
ifelse(use != 5, NA, 1)
After that point the condition should be
ifelse(use != 5, 0, 1)
The output would be:
after = c(NA,NA,NA,NA,1,0,0,0,0,1,0)
Any tips?
You should try:
`is.na<-`(match(use, 5, 0), seq(match(5, use) - 1))
[1] NA NA NA NA 1 0 0 0 0 1 0
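Unpacked into two steps, the one-liner does the following (a sketch of the same logic):
out <- match(use, 5, nomatch = 0)         # 1 at every 5, 0 elsewhere
is.na(out) <- seq_len(match(5, use) - 1)  # blank out everything before the first 5
out
# [1] NA NA NA NA 1 0 0 0 0 1 0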
Here is a base R solution
after <- replace(v <- ifelse(use != 5, NA, 1),
                 which(head(which(v == 1), 1) < seq_along(v) & is.na(v)),
                 0)
such that
> after
[1] NA NA NA NA 1 0 0 0 0 1 0
Weird subsetting:
c(NA[!cumsum(use == 5)], +(use[!!cumsum(use == 5)] == 5))
#[1] NA NA NA NA 1 0 0 0 0 1 0
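Unpacked, the cumulative sum acts as a "seen a 5 yet" flag (a sketch of the same logic with an intermediate name):
seen <- cumsum(use == 5) > 0     # FALSE before the first 5, TRUE from there on
c(NA[!seen], +(use[seen] == 5))  # leading NAs, then 1 at each 5 and 0 otherwise
#[1] NA NA NA NA 1 0 0 0 0 1 0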
We can use match
replace(use, seq_len(match(5, use) - 1), NA)
#[1] NA NA NA NA 5 1 2 1 2 5 1
Or as #M-- commented, this can be changed to binary with
+(replace(use, seq_len(match(5, use) - 1), NA)==5)
This will work if there's only one 5 in your vector:
use <- c(1,1,2,2,5,1,2,2,2)
use <- findInterval(use, 5)   # 1 where the value is >= 5, otherwise 0
i <- which(use > 0)
if (i > 1) use[1:(i - 1)] <- NA
Here is another variation. I threw in some error handling in case there are no 5s in the vector.
test1 <- c(1,1,1,1,2,3,3)
test2 <- c(5,1,1,2,5,1,2,7,8)
test3 <- c(1,1,3,5,6,7,8,2)
test4 <- c(1,2,3,4,5,5,1,5,5,5,1,1,7,8,1)
find_and_replace <- function(vec, target){
  tryCatch(
    ifelse(seq_along(vec) %in% 1:((which(vec == target)[[1]]) - 1), NA,
           ifelse(vec == target, 1, 0)),
    error = function(x) {
      warning(paste("Warning: No", target))
      vec
    }
  )
}
find_and_replace(test1, 5)
#> Warning: No 5
#> [1] 1 1 1 1 2 3 3
find_and_replace(test2, 5)
#> [1] NA 0 0 0 1 0 0 0 0
find_and_replace(test3, 5)
#> [1] NA NA NA 1 0 0 0 0
find_and_replace(test4, 5)
#> [1] NA NA NA NA 1 1 0 1 1 1 0 0 0 0 0
The following code solves the problem:
use[1:(which(use == 5)[1]-1)] = NA
use[(which(use == 5)[1]+1):length(use)] = 0
use[which(use == 5)[1]] = 1
use
[1] NA NA NA NA 1 0 0 0 0
You can use which to find the location of the target, and then case_when
use <- c(1,1,2,2,5,1,2,1,2)
first_five <- min(which(use == 5))
dplyr::case_when(
  seq_along(use) < first_five ~ NA_real_,
  seq_along(use) == first_five ~ 1,
  TRUE ~ 0
)
#> [1] NA NA NA NA 1 0 0 0 0
use
#> [1] 1 1 2 2 5 1 2 1 2
Created on 2020-01-14 by the reprex package (v0.3.0)
You could detect the first 5,
first_pos <- which(use==5)
and, if such an element exists, set all entries before the first occurrence to NA, the occurrence itself to 1, and everything after it to 0:
if (length(first_pos) > 0) {
  use[seq_len(first_pos[1] - 1)] <- NA
  use[first_pos[1]] <- 1
  use[seq(first_pos[1] + 1, length(use))] <- 0
}
Note that first_pos[1] is used in case there is more than one 5.
I have the following data.table
library(data.table)
dt <- data.table(V1=c(1,3,1,0,NA,0),
V2=c(1,0,1,0,1,3),
Q1=c(3,5,10,14,0,3),
Q2=c(0,1,8,NA,0,NA))
and I want to add a new column that will have value 1:
if any of the columns V1, V2 has a value larger than 2,
and
if any of the columns Q1, Q2 has a value larger than 0.
So in the end I want to end up with something like this:
> dt
V1 V2 Q1 Q2 new
1: 1 1 3 0 0
2: 3 0 5 1 1
3: 1 1 10 8 0
4: 0 0 14 NA 0
5: NA 1 0 0 0
6: 0 3 3 NA 1
EDIT
In principle I would like to have 2 vectors of column names, so something like v_columms <- names(dt)[names(dt) %like% "V"] and q_columms <- names(dt)[names(dt) %like% "Q"], and use these.
We can use melt to process multiple columns by specifying the patterns in measure to convert it to 'long' format and then apply the condition
dt[, new := melt(dt, measure = patterns("V", "Q"))[,
     +(any(value1 > 2) & any(value2 > 0)), rowid(variable)]$V1]
dt
# V1 V2 Q1 Q2 new
#1: 1 1 3 0 0
#2: 3 0 5 1 1
#3: 1 1 10 8 0
#4: 0 0 14 NA 0
#5: NA 1 0 0 0
#6: 0 3 3 NA 1
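For reference, the intermediate 'long' data built by melt() pairs each V column with its Q column (a sketch; the comments describe its shape rather than printing it):
long <- melt(dt, measure = patterns("V", "Q"))
# 'variable' identifies the pair: 1 for (V1, Q1) and 2 for (V2, Q2);
# 'value1' holds the V values and 'value2' the Q values, giving 12 rows
# (6 original rows x 2 pairs).
# rowid(variable) recovers the original row number within each pair, so
# grouping by it evaluates the condition on both pairs of the same row at once.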
Or without melt, if there are only two groups of columns, then
vs <- grep("V", names(dt))
qs <- grep("Q", names(dt))
dt[, new := +(Reduce(`|`, lapply(.SD[, ..vs], `>`, 2)) &
              Reduce(`|`, lapply(.SD[, ..qs], `>`, 0)))]
Using dplyr and either case_when or if_else:
dt %>%
  mutate(new = case_when((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0) ~ 1,
                         TRUE ~ 0))
dt %>%
  mutate(new = if_else((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0), 1, 0))
V1 V2 Q1 Q2 new
1 1 1 3 0 0
2 3 0 5 1 1
3 1 1 10 8 0
4 0 0 14 NA 0
5 NA 1 0 0 0
6 0 3 3 NA 1
Here's another approach with some helper functions:
foo <- function(.dt, cols, vals, na.rm = TRUE) {
  rowSums(.dt[, cols, with = FALSE] > vals, na.rm = na.rm) > 0
}
bar <- function(.dt, cols_list, vals_list) {
  as.integer(Reduce("&", Map(function(cols, vals) foo(.dt, cols, vals),
                             cols_list, vals_list)))
}
dt[, new := bar(.SD, list(v_columms, q_columms), list(2, 0))]
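The v_columms and q_columms objects here are the column-name vectors from the question's edit; for completeness, one way to define them (a sketch; note that %like% is case-sensitive, so match the capitalised prefixes):
v_columms <- names(dt)[names(dt) %like% "V"]   # "V1" "V2"
q_columms <- names(dt)[names(dt) %like% "Q"]   # "Q1" "Q2"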