Consider the vector:
use = c(1,1,2,2,5,1,2,1,2,5,1)
I'm trying to replace every number other than 5 with NA before the first 5 shows up in the sequence:
ifelse(use != 5,NA,1)
After that, the condition should be
ifelse(use != 5,0,1)
The output would be:
after = c(NA,NA,NA,NA,1,0,0,0,0,1,0)
Any tips?
You should try:
`is.na<-`(match(use, 5, 0), seq_len(match(5, use) - 1))
[1] NA NA NA NA 1 0 0 0 0 1 0
Here is a base R solution
after <- replace(v<- ifelse(use !=5,NA,1),
which(head(which(v==1),1)<seq_along(v) & is.na(v)),
0)
such that
> after
[1] NA NA NA NA 1 0 0 0 0 1 0
Weird subsetting:
c(NA[!cumsum(use == 5)], +(use[!!cumsum(use == 5)] == 5))
#[1] NA NA NA NA 1 0 0 0 0 1 0
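The subsetting above relies on cumsum(use == 5) staying at 0 until the first 5 appears. A minimal sketch of the same idea, spelled out step by step (the names seen and after are only illustrative):

```r
use <- c(1, 1, 2, 2, 5, 1, 2, 1, 2, 5, 1)

# FALSE until the first 5 is reached, TRUE from there on
seen <- cumsum(use == 5) > 0

# Before the first 5: NA. Afterwards: 1 for a 5, 0 for anything else
after <- ifelse(seen, as.numeric(use == 5), NA)
after
#[1] NA NA NA NA  1  0  0  0  0  1  0
```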
We can use match
replace(use, seq_len(match(5, use) - 1), NA)
#[1] NA NA NA NA 5 1 2 1 2 5 1
Or as @M-- commented, this can be changed to binary with
+(replace(use, seq_len(match(5, use) - 1), NA)==5)
This will work if there's only one 5 in your vector
use = c(1,1,2,2,5,1,2,2,2)
use <- findInterval(use,5)*5
i <- which(use > 0)
if(i > 1) use[1:(i-1)] <- NA
Here is another variation. I threw in some error handling in case there are no 5's in the vector.
test1 <- c(1,1,1,1,2,3,3)
test2 <- c(5,1,1,2,5,1,2,7,8)
test3 <- c(1,1,3,5,6,7,8,2)
test4 <- c(1,2,3,4,5,5,1,5,5,5,1,1,7,8,1)
find_and_replace <- function(vec, target){
tryCatch(
ifelse( seq_along(vec) %in% seq_len(which(vec == target)[[1]] - 1), NA, ifelse(vec == target, 1, 0)),
error = function(x) {
warning(paste("Warning: No", target))
vec
}
)
}
find_and_replace(test1, 5)
#> Warning: No 5
#> [1] 1 1 1 1 2 3 3
find_and_replace(test2, 5)
#> [1] 1 0 0 0 1 0 0 0 0
find_and_replace(test3, 5)
#> [1] NA NA NA 1 0 0 0 0
find_and_replace(test4, 5)
#> [1] NA NA NA NA 1 1 0 1 1 1 0 0 0 0 0
The following code solves the problem:
use[1:(which(use == 5)[1]-1)] = NA
use[(which(use == 5)[1]+1):length(use)] = 0
use[which(use == 5)[1]] = 1
use
[1] NA NA NA NA 1 0 0 0 0
You can use which to find the location of the target, and then case_when
use <- c(1,1,2,2,5,1,2,1,2)
first_five <- min(which(use == 5))
dplyr::case_when(
seq_along(use) < first_five ~ NA_real_,
seq_along(use) == first_five ~ 1,
TRUE ~ 0
)
#> [1] NA NA NA NA 1 0 0 0 0
use
#> [1] 1 1 2 2 5 1 2 1 2
Created on 2020-01-14 by the reprex package (v0.3.0)
You could detect the first 5,
first_pos <- which(use==5)
and, if such elements exist, set all entries before the first occurrence to NA and flag the rest:
if(length(first_pos)>0) {
rest <- seq(first_pos[1], length(use)) # from the first 5 to the end
use[rest] <- as.numeric(use[rest] == 5) # 1 for each 5, 0 otherwise
use[seq_len(first_pos[1]-1)] <- NA # entries before the first 5
}
Note that first_pos[1] is used in case there is more than one 5.
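Several of the answers above behave differently when the first 5 sits at position 1 or when there is no 5 at all. Here is a small sketch that handles both cases. The function name flag_after_first is made up for illustration, and returning all NA when the target never occurs is my assumption, since the question does not say what should happen then:

```r
# Hypothetical helper: NA before the first `target`, then 1 for each
# `target` and 0 for everything else.
flag_after_first <- function(vec, target) {
  first <- match(target, vec)            # NA when target never occurs
  if (is.na(first)) {
    return(rep(NA_real_, length(vec)))   # assumption: nothing to flag yet
  }
  out <- as.numeric(vec == target)
  out[seq_len(first - 1)] <- NA          # seq_len(0) is empty, so position 1 is safe
  out
}

flag_after_first(c(1, 1, 2, 2, 5, 1, 2, 1, 2, 5, 1), 5)
#[1] NA NA NA NA  1  0  0  0  0  1  0
flag_after_first(c(5, 1, 5), 5)
#[1] 1 0 1
```

seq_len() is used rather than 1:(first - 1) because 1:0 counts down to c(1, 0) instead of giving an empty index.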
I need to create a binary variable called dum, (perhaps using an ifelse statement) matching on the number of the column names.
ifelse f[number] %in% c(4:6) & l[number]==1, 1, else 0
f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)
mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)
dum is 1 if f1 is one of 4, 5, 6 AND l1 is 1, OR f2 is one of 4, 5, 6 AND l2 is 1, and so on.
In essence, the expected output should be
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
I can only think of doing it in a very long way such as
mutate(dum=ifelse(f1 %in% c(4:6) & l1==1, 1,
ifelse(f2 %in% c(4:6) & l2==1, 1,
ifelse(f3 %in% c(4:6) & l3==1, 1,
ifelse(f4 %in% c(4:6) & l4==1, 1, 0)))))
But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.
Is there a more efficient way to do this?
Here is one suggestion. Again, it is not exactly clear what you want. Assuming you want one column, dum, that indicates whether the number in a column's name is present in that row in any of the columns:
library(dplyr)
library(readr)
mydata %>%
mutate(across(f1:l4, ~case_when(. == parse_number(cur_column()) ~ 1,
TRUE ~ 0), .names = 'new_{col}')) %>%
mutate(sumNew = rowSums(.[9:16])) %>%
mutate(dum = ifelse(sumNew >=1, 1, 0)) %>%
select(1:8, dum)
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 1
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
Here is one option with across: loop across the 'f' columns with the first condition, loop across the 'l' columns with the second condition, join the two with & to get a logical matrix, take the row-wise sum (TRUE -> 1, FALSE -> 0), check whether that sum is greater than 0 (i.e. whether there is any TRUE in that row), and coerce the logical to binary with + or as.integer.
library(dplyr)
mydata %>%
mutate(dum = +(rowSums(across(starts_with('f'), ~.x %in% 4:6) &
across(starts_with('l'), ~ .x %in% 1)) > 0))
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
We could also use base R
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 &
y %in% 1, mydata[startsWith(names(mydata), "f")],
mydata[startsWith(names(mydata), "l")])))
Here's an approach multiplying two mapplys together, columns identified with grep, then calculating rowSums > 0. If you set na.rm=F you could get NAs in respective rows.
as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6))*
mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm=T) > 0)
# [1] 1 0 1 0 0
If f* and l* each aren't consecutive, rather use sort(grep(., value=T)).
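If the column order cannot be trusted, another way is to pair the columns by their numeric suffix instead of by position. This is only a sketch in base R, assuming every f<i> has a matching l<i>; the names idx and hits are illustrative:

```r
mydata <- data.frame(
  f1 = c(3, 2, 1, 6, 5),   f2 = c(4, 1, 5, NA, NA),
  f3 = c(5, 3, 4, NA, NA), f4 = c(1, 2, 4, NA, NA),
  l1 = c(1, 0, 1, 0, 0),   l2 = c(1, 1, 1, NA, NA),
  l3 = c(1, 0, 0, NA, NA), l4 = c(0, 0, 0, NA, NA)
)

# Numeric suffixes of the f columns: "1", "2", ...
idx <- sub("^f", "", grep("^f[0-9]+$", names(mydata), value = TRUE))

# One logical vector per (f<i>, l<i>) pair; %in% treats NA as "no match"
hits <- lapply(idx, function(i) {
  mydata[[paste0("f", i)]] %in% 4:6 & mydata[[paste0("l", i)]] %in% 1
})

# A row gets 1 if any pair matched
mydata$dum <- as.integer(Reduce(`|`, hits))
mydata$dum
# [1] 1 0 1 0 0
```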
I have two measures for the same object. The measure is binary (1,0) but many observations are also missing, such that the possible options are: 1, 0, NA.
Data Have:
Source1 Source2
NA NA
NA 0
NA 1
0 NA
0 0
0 1
1 NA
1 0
1 1
(Sources can contradict each other, ignore that for now).
I would like to create a third composite variable that summarizes the two variables, such that IF EITHER of the two sources = 1, then the composite variable should be equal to 1. Otherwise, if either of the sources is not missing, then the composite variable should be equal to zero. Lastly, only if both sources are missing, the composite variable should be set to missing.
Data Want:
Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1
I have tried different approaches but continue to have the same issue.
Attempt 1:
df<- df %>% mutate(combined = ifelse(df$source1==1 | df$source2==1, 1,
ifelse(df$source1==0 | df$source2==0, 0, NA)))
Attempt 2:
df2<- df %>% mutate(combined = ifelse(is.na(df$source1) & is.na(df$source2), NA,
ifelse(df$source1 == 1 | df$source2 ==1, 1, 0)))
Attempt 3:
df3<- df %>% mutate(combined = ifelse(df$source1==1, 1,
ifelse(df$source1==0 & df$source2==1, 1,
ifelse(df$source1==0 & df$source2==0, 0,
ifelse(df$source1==0 & is.na(df$source2), 0,
ifelse(is.na(df$source1) & df$source2==1, 1,
ifelse(is.na(df$source1) & df$source2==0, 0, NA)))))))
The code identifies whether there is a 1 in either source, but the remaining values all come out missing, regardless of whether there is a 0 or not.
Actual Output:
Source1 Source2 Composite
NA NA NA
NA 0 NA
NA 1 1
0 NA NA
0 0 NA
0 1 1
1 NA 1
1 0 1
1 1 1
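The missing values come from how R compares against NA: NA == 1 is NA, NA | FALSE is NA, and ifelse() returns NA wherever its test is NA. A minimal base R sketch of the effect and one way around it (vector names are illustrative, and this does not try to reproduce the table above row for row):

```r
source1 <- c(NA, NA, NA, 0, 0, 0, 1, 1, 1)
source2 <- c(NA, 0, 1, NA, 0, 1, NA, 0, 1)

# Comparing with == propagates NA unless the other side is TRUE
any_one <- source1 == 1 | source2 == 1
any_one
#[1]    NA    NA  TRUE    NA FALSE  TRUE  TRUE  TRUE  TRUE

# %in% never returns NA, so the test is always TRUE or FALSE
any_one_safe <- source1 %in% 1 | source2 %in% 1
both_missing <- is.na(source1) & is.na(source2)

composite <- ifelse(both_missing, NA, as.numeric(any_one_safe))
composite
#[1] NA  0  1  0  0  1  1  1  1
```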
Assuming both the Source1 and Source2 columns are composed of 0's, 1's, and NA's (as you noted), you could use this base R solution. It uses do.call() to call pmax() over each of the relevant columns in your data frame. With na.rm = TRUE, pmax() returns NA only where every input is NA.
cols = paste0("Source", 1:2)
df$newcol = do.call(pmax, c(df[cols], na.rm = TRUE))
# equivalent to: pmax(df$Source1, df$Source2, na.rm = TRUE)
df
Source1 Source2 Composite newcol
1 NA NA NA NA
2 NA 0 0 0
3 NA 1 1 1
4 0 NA 0 0
5 0 0 0 0
6 0 1 1 1
7 1 NA 1 1
8 1 0 1 1
9 1 1 1 1
Data:
df = read.table(header = TRUE, text = "Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1")
One approach is to use case_when rather than if-else. It seems simplest to check for missing variables first, and then check the non-missing cases afterwards:
library(tidyverse)
df %>%
mutate(S1Miss = is.na(Source1),
S2Miss = is.na(Source2)) %>%
mutate(Composite = case_when(
S1Miss & S2Miss ~ NA_real_, # both sources missing
Source1 == 1 | Source2 == 1 ~ 1, # at least one source is 1
TRUE ~ 0 # otherwise at least one source is 0
)) %>%
select(Source1, Source2, Composite)
Note that I made it "easier to read" by first storing the missingness flags in one call to mutate and then removing these intermediate results with select.
This was fun, but I wouldn't recommend doing it like this.
source1<-c(NA, NA, NA, 0, 0, 0, 1, 1, 1)
source2<-c(NA, 0, 1, NA, 0, 1, NA, 0, 1)
df<-data.frame(source1, source2)
df$composite<-ifelse(test = is.na(df$source1) & is.na(df$source2), yes = NA,
no = ifelse(test = is.na(df$source1) & !is.na(df$source2), yes = df$source2,
no = ifelse(is.na(df$source2) & !is.na(df$source1), yes = df$source1,
no = ifelse(df$source1 > df$source2, yes = df$source1,
no = df$source2))))
source1 source2 composite
1 NA NA NA
2 NA 0 0
3 NA 1 1
4 0 NA 0
5 0 0 0
6 0 1 1
7 1 NA 1
8 1 0 1
9 1 1 1
I would like to replace all consecutive NA values per row with zero, but only if the number of consecutive NAs is less than or equal to a parameter maxgap.
This is very similar to the function zoo::na.locf
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
zoo::na.locf(x, maxgap = 2, na.rm = FALSE)
gives
[1] NA 1 2 3 3 3 5 6 7 NA NA NA
There are two things different from my aim:
I would like to replace the leading NA too and I would like to replace the 2 consecutive NAs with 0 and not the last non-NA value.
I would like to get
0 1 2 3 0 0 5 6 7 NA NA NA
How can I do this in R? Can I use functions from the tidyverse?
If y is the result of the na.locf line then if y[i] is not NA but x[i] is NA then it was replaced so assign 0 to it. Also if it is a leading NA which occurs when the cumsum(...) term below is 0 then replace it as well.
replace(y, (!is.na(y) & is.na(x)) | cumsum(!is.na(y)) == 0, 0)
## [1] 0 1 2 3 0 0 5 6 7 NA NA NA
We can use rle to do this
f1 <- function(vec){
rl <- rle(is.na(vec))
lst <- within.list(rl, {
i1 <- seq_along(values)==1
i2 <- seq_along(values) != length(values)
values[!((lengths==2 & values & i2)|
(values & i1))] <- FALSE
})
vec[inverse.rle(lst)] <- 0
vec
}
f1(x)
#[1] 0 1 2 3 0 0 5 6 7 NA NA NA
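The function above hard-codes a run length of exactly 2. Below is a sketch of the same rle idea with maxgap as a parameter. The name fill_short_gaps is made up, and two choices are my assumptions rather than something stated in the question: a leading NA run is always filled, and a trailing run is filled whenever it is short enough.

```r
# Hypothetical helper: replace NA runs of length <= maxgap (and any
# leading NA run) with 0, leave longer runs as NA.
fill_short_gaps <- function(vec, maxgap) {
  runs <- rle(is.na(vec))
  is_leading <- seq_along(runs$values) == 1
  short <- runs$lengths <= maxgap
  # Keep TRUE only for the NA runs that should be filled
  runs$values <- runs$values & (short | is_leading)
  vec[inverse.rle(runs)] <- 0
  vec
}

x <- c(NA, 1, 2, 3, NA, NA, 5, 6, 7, NA, NA, NA)
fill_short_gaps(x, maxgap = 2)
#[1]  0  1  2  3  0  0  5  6  7 NA NA NA
```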
You could e.g. do this:
require(data.table)
require(dplyr)
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
my_replace <- function(x, n, maxgap){
if(is.na(x[1]) && n <= maxgap){
x <- 0
}
x
}
data.frame(x, y=x) %>%
group_by(data.table::rleid(x)) %>%
mutate(x = my_replace(x, n(), 2), y = my_replace(y, n(), 1)) %>%
ungroup() %>%
select(x,y)
This allows you to set maxgap per column: 2 for x and 1 for y.
This results in:
# A tibble: 12 × 2
x y
<dbl> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 0 NA
6 0 NA
7 5 5
8 6 6
9 7 7
10 NA NA
11 NA NA
12 NA NA
This question is slightly similar to this question with a more theoretical component.
Given df below:
varA <- c(1,0,0,NA,NA)
varB <- c(NA,NA,NA,1,0)
df <- data.frame(varA, varB)
varA varB
1 NA
0 NA
0 NA
NA 1
NA 0
What's the most elegant method to generate var (with consideration given to NA) which combines the information from varA and varB?
varA varB var
1 NA 1
0 NA 0
0 NA 0
NA 1 1
NA 0 0
My approach, right now, is as follows:
df$var[df$varA == 1 | df$varB == 1] <- 1
df$var[df$varA == 0 | df$varB == 0] <- 0
As a side question, how does R handle NA in ifelse statements? For example, if I write the following code, it does not produce the output I intended.
df$var <- ifelse(df$varA == 1 | df$varB == 1, 1,
ifelse(df$varA == 0 | df$varB == 0, 0, NA))
combines the information from varA and varB
Seems like you are looking for coalesce:
library(dplyr)
df %>% mutate(var = coalesce(varA, varB))
# varA varB var
#1 1 NA 1
#2 0 NA 0
#3 0 NA 0
#4 NA 1 1
#5 NA 0 0
For your purposes, NA is equivalent to 0, so why not convert them to 0?
df[is.na(df)] <- 0
df$var <- with(df, as.integer(varA | varB))
> df
varA varB var
1 1 0 1
2 0 0 0
3 0 0 0
4 0 1 1
5 0 0 0
We can use pmax
df$var <- do.call(pmax, c(df, na.rm = TRUE))
df$var
#[1] 1 0 0 1 0
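On the side question: ifelse() returns NA wherever its test is NA, and a comparison such as varA == 0 is NA when varA is missing. Testing with is.na(), which is never NA itself, avoids that. A small base R sketch that behaves like coalesce for two vectors:

```r
varA <- c(1, 0, 0, NA, NA)
varB <- c(NA, NA, NA, 1, 0)

# A missing test gives a missing result
ifelse(NA, "yes", "no")
#[1] NA

# is.na() is always TRUE or FALSE: take varB only where varA is missing
var <- ifelse(is.na(varA), varB, varA)
var
#[1] 1 0 0 1 0
```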
I am working with a matrix containing a large number of NA. I would like to record the length of each sequence of NA in a new matrix.
The following example should make this clearer.
#Generating a random 5x5 population matrix with 15 NA
M=matrix(sample(1:9,25,T),5)
M[sample(1:length(M),15,F)]=NA
dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
M
Year1 Year2 Year3 Year4 Year5
City1 2 NA NA NA NA
City2 NA NA NA 6 8
City3 1 NA NA 6 NA
City4 NA 5 NA NA 1
City5 8 NA 1 NA 2
The desired output is the following. e.g. 4 4 4 4 denotes a sequence of 4 consecutive NA.
Year1 Year2 Year3 Year4 Year5
City1 0 4 4 4 4
City2 3 3 3 0 0
City3 0 2 2 0 1
City4 1 0 2 2 0
City5 0 1 0 1 0
Do you have an idea of how I could go about that?
Not the most efficient code ever:
r1=c(1,1,NA,1,1)
r2=c(1,NA,NA,1,1)
r3=c(1,NA,NA,NA,1)
r4=c(NA,NA,1,1,1)
r5=c(1,1,1,NA,NA)
M=rbind(r1,r2,r3,r4,r5)
As @Pascal pointed out, your approach will convert the entire matrix to characters, so you can assign the 1s to 0s instead and do this:
M[M == 1] <- 0
(xx <- t(apply(M, 1, function(x) {
s <- sum(is.na(x))
if (is.na(x[1])) x[is.na(x)] <- rep(4, s) else
if (is.na(tail(x, 1))) x[is.na(x)] <- rep(5, s) else
x[is.na(x)] <- s
x
})))
# [,1] [,2] [,3] [,4] [,5]
# r1 0 0 1 0 0
# r2 0 2 2 0 0
# r3 0 3 3 3 0
# r4 4 4 0 0 0
# r5 0 0 0 5 5
This is your desired output. If you don't believe me, convert the 0s back to 1s and assign the letters based on the integers
xx[xx > 0] <- letters[xx[xx > 0]]
xx[xx == '0'] <- 1
r1=c(1,1,"a",1,1)
r2=c(1,"b","b",1,1)
r3=c(1,"c","c","c",1)
r4=c("d","d",1,1,1)
r5=c(1,1,1,"e","e")
R=rbind(r1,r2,r3,r4,r5)
identical(R, xx)
# [1] TRUE
This is another basis for a function that would be applied over each row. I tried, but couldn't avoid a for loop:
x = c(1,NA,1,NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA)
#Find the Start and End of each sequence of NA's (Vectorized)
(start <- is.na(x) * c(T,!is.na(x[-length(x)])))
#> [1] 0 1 0 1 0 0 1 0 0 0 1 0 0 0
(end <- is.na(x) * c(!is.na(x[-1]),T))
#> [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
# The difference between the start and end of the sequence +1 is the sequence length
wStart <- which(!!start)
wEnd <- which(!!end)
sequenceLength <- wEnd - wStart + 1
# replace each sequence of NA's with its class
for(i in seq_along(wStart))
x[`:`(wStart[i],wEnd[i])] <- letters[sequenceLength[i]]
x
#> [1] "1" "a" "1" "b" "b" "1" "c" "c" "c" "1" "d" "d" "d" "d"
as in:
(xx <- t(apply(M, 1, function(x) {
wStart <- which(!!(is.na(x) * c(T,!is.na(x[-length(x)]))))
wEnd <- which(!!(is.na(x) * c(!is.na(x[-1]),T)))
for(i in seq_along(wStart))
x[`:`(wStart[i],wEnd[i])] <- letters[wEnd[i] - wStart[i] + 1]
return(x)
})))
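For the numeric output asked for in the question, rle() can do the run detection without an explicit loop. This is a sketch rather than a tuned solution. The helper name na_run_lengths is made up, and the matrix is typed in from the example in the question, since the original was generated randomly:

```r
# Hypothetical helper: for each element, the length of the NA run it
# belongs to, or 0 if the element is not NA.
na_run_lengths <- function(x) {
  runs <- rle(is.na(x))
  rep(ifelse(runs$values, runs$lengths, 0L), runs$lengths)
}

M <- rbind(
  City1 = c(2, NA, NA, NA, NA),
  City2 = c(NA, NA, NA, 6, 8),
  City3 = c(1, NA, NA, 6, NA),
  City4 = c(NA, 5, NA, NA, 1),
  City5 = c(8, NA, 1, NA, 2)
)
colnames(M) <- paste0("Year", 1:5)

# apply() returns one column per row of M, so transpose back
out <- t(apply(M, 1, na_run_lengths))
dimnames(out) <- dimnames(M)
out
#       Year1 Year2 Year3 Year4 Year5
# City1     0     4     4     4     4
# City2     3     3     3     0     0
# City3     0     2     2     0     1
# City4     1     0     2     2     0
# City5     0     1     0     1     0
```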