Conditionally replace columns with NA [duplicate]

Here is an example of my data:
m <- data.frame(swim = c(0,1,0,0), time1 = c(1,2,3,4), time2 = c(2,3,4,5))
I want to replace all numbers in columns time1 and time2 with NA after the row where there is a 1 in m$swim. It should look like this:
n <- data.frame(swim = c(0,1,0,0), time1 = c(1,2,NA,NA), time2 = c(2,3,NA,NA))
Thank you!

In dplyr you can do:
library(dplyr)
m %>%
  mutate(across(starts_with('time'),
                ~replace(., row_number() > match(1, swim), NA)))
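Here match(1, swim) returns the index of the first 1 (row 2), so row_number() > match(1, swim) is TRUE only for the rows after it. A roughly equivalent dplyr-only sketch of my own, using cumany() and lag() rather than row numbers:
library(dplyr)
# cumany(lag(...)) is TRUE from the row after the first swim == 1 onwards
m %>%
  mutate(across(starts_with('time'),
                ~replace(., cumany(lag(swim == 1, default = FALSE)), NA)))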
A base R option, however, would be more efficient:
cols <- grep('time', names(m))
inds <- match(1, m$swim)
m[(inds + 1):nrow(m), cols] <- NA
m
# swim time1 time2
#1 0 1 2
#2 1 2 3
#3 0 NA NA
#4 0 NA NA
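Note that match(1, m$swim) returns NA when swim contains no 1, in which case the subsetting above errors. A defensive sketch (my addition, assuming you want the data left untouched in that case):
cols <- grep('time', names(m))
inds <- match(1, m$swim)
# only replace when a 1 was found and it is not in the last row
if (!is.na(inds) && inds < nrow(m)) {
  m[(inds + 1):nrow(m), cols] <- NA
}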

A base R solution would be:
#Data
m <- data.frame(swim = c(0,1,0,0), time1 = c(1,2,3,4), time2 = c(2,3,4,5))
#Detect position
index <- min(which(m$swim==1))
#Replace
m[(index+1):dim(m)[1],-1] <- NA
Output:
swim time1 time2
1 0 1 2
2 1 2 3
3 0 NA NA
4 0 NA NA

Using data.table, you could do it as follows:
library(data.table)
setDT(m)
#Start after the row with the 1
stop.here <- which(m$swim == 1)+1
these_rows <- seq(stop.here,length(m$swim),1)
m <- m[these_rows,time1:=NA]
m <- m[these_rows,time2:=NA]
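With many time columns, the per-column assignments can be collapsed into one; a sketch (my variant, assuming the relevant column names all start with "time" and that m still holds the original data):
library(data.table)
setDT(m)
time_cols <- grep("^time", names(m), value = TRUE)
first_one <- which(m$swim == 1)[1]      # first row containing a 1
# assumes a 1 exists and is not in the last row
m[(first_one + 1):nrow(m), (time_cols) := NA]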

Related

How to verify that when one column is NA the other is not?

I have a dataframe with two columns. I need to check whether, where one column is NA, the other is not.
Edit: I would like to know, for each row of the dataframe, whether both columns are not NA.
You can use the following code to check which rows have no NA values:
df <- data.frame(x = c(1, NA),
                 y = c(2, NA))
which(rowSums(is.na(df)) == 0)
Output:
[1] 1
As you can see, the first row has no NA values, so in that row both columns are non-NA.
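Base R's complete.cases() performs the same check directly; a quick sketch on the same df:
complete.cases(df)          # TRUE where a row has no NA in any column
# [1]  TRUE FALSE
which(complete.cases(df))   # rows where both columns are non-NA
# [1] 1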
Here's a simple way to generate a column with the NA count for each row:
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)
df$NA_Count <- apply(df, 1, function(x) sum(is.na(x)))
df
x y NA_Count
1 NA 1 1
2 NA NA 2
3 1 NA 1
4 1 NA 1
5 NA NA 2
6 1 NA 1
7 1 1 0
8 1 1 0
9 1 1 0
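The apply() call can also be replaced by a vectorised rowSums() over is.na(); a sketch on the same df:
# rowSums() on the logical matrix is.na() counts the NAs per row in one step
df$NA_Count <- rowSums(is.na(df[c("x", "y")]))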

Find unique max values for each row in DF with equal numbers

I have a data frame that looks like this:
A <- rep(1, times = 3)
B <- 1:3
C <- c(1,3,2)
DF <- data.frame(A,B,C)
Which makes:
> DF
A B C
1 1 1 1
2 1 2 3
3 1 3 2
I would like to create a new column that indicates the column name in which the max value for each row can be found, but only if that max is unique; otherwise I would like to give it an NA.
I have tried various options; this one, for example, always uses the name of the first column in which the max value was found:
DF$max <- colnames(DF)[max.col(DF, ties.method = "first")]
Resulting in:
A C B
I would like to have
NA C B
You can count the number of max values in each row using rowSums and turn the result to NA where the count is more than 1.
col <- colnames(DF)[max.col(DF)]
col[rowSums(DF == do.call(pmax, DF)) > 1] <- NA
DF$max <- col
DF
# A B C max
#1 1 1 1 <NA>
#2 1 2 3 C
#3 1 3 2 B
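To see why this works: do.call(pmax, DF) is the row-wise maximum, and DF == do.call(pmax, DF) flags every cell equal to it (evaluated on the original three-column DF, before the max column is added):
do.call(pmax, DF)                  # row-wise maxima: 1 3 3
rowSums(DF == do.call(pmax, DF))   # cells hitting the max per row: 3 1 1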
You can test if the result of ties.method = "first" is equal to the result when ties.method = "last" is used.
i <- max.col(DF, ties.method = "first")
j <- max.col(DF, ties.method = "last")
DF$max <- colnames(DF)[i]
DF$max[i != j] <- NA
DF
# A B C max
#1 1 1 1 <NA>
#2 1 2 3 C
#3 1 3 2 B
We can also use pmap for this purpose:
library(dplyr)
library(purrr)
DF %>%
  mutate(Max = pmap_chr(DF, ~ {
    x <- c(...)
    if (sum(x == max(x, na.rm = TRUE)) > 1) {
      NA_character_
    } else {
      names(DF)[which(x == max(x, na.rm = TRUE))]
    }
  }))
A B C Max
1 1 1 1 <NA>
2 1 2 3 C
3 1 3 2 B
We can use
DF$max <- names(DF)[max.col(DF, "first")*NA^(rowSums(DF == do.call(pmax, DF)) > 1)]
DF$max
[1] NA "C" "B"
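The exponent trick works because NA^0 is 1 and NA^1 is NA, and indexing by NA returns NA; a quick illustration:
NA^c(FALSE, TRUE)     # [1]  1 NA : keeps the column index or turns it into NA
names(DF)[c(1, NA)]   # [1] "A" NA : an NA index yields NA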

Dispatch values in list column to separate columns

I have a data.table with a list column "c":
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
df
a c
1: 1 1
2: 2 1,2
3: 3 1,2,3
I want to create separate columns for the values in "c".
I create a set of new columns F_1, F_2, F_3:
mmax <- max(df$a)
flux <- paste("F", 1:mmax, sep = "_")
df[, (flux) := 0]
df
a c F_1 F_2 F_3
1: 1 1 0 0 0
2: 2 1,2 0 0 0
3: 3 1,2,3 0 0 0
I want to dispatch values in "c" to columns F_1, F_2, F_3 like this:
df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
What I have tried:
comp_vect <- function(vec, mmax){
vec <- vec %>% unlist()
n <- length(vec)
answr <- c(vec, rep(0, l = mmax -n))
}
df[ , ..flux := mapply(comp_vect, c, mmax)]
The expected data.table is:
> df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
I followed a radically different approach: I rbinded the list column and then dcasted it, obtaining the desired result. The last part is to set the names.
library(data.table)
df <- data.table(a = 1:3, d = list(1L, c(1L, 2L), c(1L, 2L, 3L)))
df2 <- df[, rbind(d), by = a][, dcast(.SD, a ~ V1, fill = 0)]
setnames(df2, 2:4, flux)[]
a F_1 F_2 F_3
1: 1 1 0 0
2: 2 1 2 0
3: 3 1 2 3
where flux is the vector of names that you defined in your question.
Please notice that I avoided using the column name c, as it may be confused with the function c().
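As a side note, the ..flux := form attempted in the question does not assign to those columns; columns named in a variable are assigned with (flux) :=. Keeping the original pad-with-zeros idea, a sketch of my own using data.table's transpose() and the renamed list column d:
library(data.table)
df <- data.table(a = 1:3, d = list(1L, 1:2, 1:3))
mmax <- max(lengths(df$d))                    # longest vector in the list column
flux <- paste("F", seq_len(mmax), sep = "_")
# pad each element with zeros to the common length, then transpose the list
# so that position i becomes column F_i
df[, (flux) := transpose(lapply(d, function(x) c(x, rep(0L, mmax - length(x)))))]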
Solution:
for(idx in seq(max(sapply(df$c, length)))){ # maximum number of values according to all the elements of the list
set(x = df,
i = NULL,
j = paste0("F_",idx), # column's name
value = sapply(df$c, function(x){
if(is.na(x[idx])){
return(0) # 0 instead of NA
} else {
return(x[idx])
}
})
)
}
Explanation:
We can extract the values from a list like this:
sapply(df$c, function(ll) return(ll[1])) # first value
[1] 1 1 1
sapply(df$c, function(ll) return(ll[2])) # second value
[1] NA 2 2
sapply(df$c, function(ll) return(ll[3])) # third value
[1] NA NA 3
We see that if there is no value at that position, we get an NA.
We need an iterator to extract all values at the position idx. For that, we'll find the number of values in each element of df$c (the list) and keep the maximum.
max(sapply(df$c, length))
[1] 3
If we want zeros instead of NAs, we need a function inside the sapply to convert them:
vec <- c(NA, 5, 1, NA)
sapply(vec, function(x) if(is.na(x)) return(0) else return(x))
[1] 0 5 1 0
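For completeness, a loop-free sketch (my own variant) that fills a zero matrix by (row, position) index and then attaches it as the F_ columns:
library(data.table)
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
mmax <- max(lengths(df$c))
mat <- matrix(0L, nrow = nrow(df), ncol = mmax)
# (row, position) index pairs for every value stored in the list column
idx <- cbind(rep(seq_len(nrow(df)), lengths(df$c)),
             unlist(lapply(df$c, seq_along)))
mat[idx] <- unlist(df$c)
df[, (paste0("F_", seq_len(mmax))) := as.data.table(mat)]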

Remove rows with zero-variance in R

I have a dataframe of survey responses (rows = participants, columns = question responses). Participants responded to 50 questions on a 5-point Likert scale. I would like to remove participants who answered 5 across all 50 questions, as they have zero variance and are likely to bias my results.
I have seen the nearZeroVar() function, but I was wondering if there's a way to do this in base R.
Many thanks,
R
If you had this dataframe:
df <- data.frame(col1 = rep(1, 10),
col2 = 1:10,
col3 = rep(1:2, 5))
You could calculate the variance of each column and select only the columns where the variance is not 0, or is greater than or equal to a certain threshold, which is close to what nearZeroVar() would do:
df[, sapply(df, var) != 0]
df[, sapply(df, var) >= 0.3]
If you wanted to exclude rows, you could do something similar, but loop through the rows instead and then subset:
df[apply(df, 1, var) != 0, ]
df[apply(df, 1, var) >= 0.3, ]
Assuming you have data like this.
survey <- data.frame(participants = c(1:10),
q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
You can do the following.
idx <- which(apply(survey[,-1], 1, function(x) all(x == 5)) == T)
survey[-idx,]
This will remove rows where all values equal 5.
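One caveat: if no participant answered 5 everywhere, idx is empty and survey[-idx, ] returns zero rows, so a guard is advisable; a sketch:
idx <- which(apply(survey[, -1], 1, function(x) all(x == 5)))
# survey[-integer(0), ] would drop every row, so only subset when idx is non-empty
if (length(idx) > 0) survey <- survey[-idx, ]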
# Dummy data:
df <- data.frame(
matrix(
sample(1:5, 100000, replace =TRUE),
ncol = 5
)
)
names(df) <- paste0("likert", 1:5)
df$id <- 1:nrow(df)
head(df)
likert1 likert2 likert3 likert4 likert5 id
1 1 2 4 4 5 1
2 5 4 2 2 1 2
3 2 1 2 1 5 3
4 5 1 3 3 2 4
5 4 3 3 5 1 5
6 1 3 3 2 3 6
dim(df)
[1] 20000 6
# Clean out rows where all likert values are 5
df <- df[rowSums(df[grepl("likert", names(df))] == 5) != 5, ]
nrow(df)
[1] 19995
Stealing #AshOfFire's data, with a small modification since you say you only have answer columns and no participant column:
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
survey[!apply(survey==survey[[1]],1,all),]
# q1 q2 q3
# 1 1 1 3
# 4 5 5 4
# 6 1 1 5
# 10 2 3 5
The equality test builds a matrix of logical values; then, with apply, we keep the rows that aren't all TRUE.
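For instance, printing the comparison shows what apply() reduces over; the constant rows are all-TRUE and get dropped:
comp <- survey == survey[[1]]   # TRUE where a cell equals its row's q1 value
apply(comp, 1, all)             # TRUE for the zero-variance rows (here rows 2, 3, 5, 7, 8, 9)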

Replacing 0 values with NA in data.frame conditionally

dat <- data.frame(A=c("name1", "name2", "name3"),
B=c(0,1,0), C=c(0,0,5), D= c(4,4,0), E=c(1,0,0), F=c(4,0,0) )
desiredresult <- data.frame(A=c("name1", "name2", "name3"),
B=c(NA,1,NA), C=c(NA,0,5), D= c(4,4,0), E=c(1,0,NA), F=c(4,NA,NA))
I want to replace 0 values with NA in every row until a positive value is encountered (there are no negative values in the dataset). In addition, I want to replace trailing zeros with NA, leaving the first 0 after the last positive value in place, e.g. 5,0,0,0 -> 5,0,NA,NA.
I have provided example data with the desired result (desiredresult). I was trying an approach like the one below, but it would need 5+ conditions to cover all cases. Is there a better way to do this? Maybe with data.table?
dat$B[dat$B == 0 & (dat$C!=0 | dat$D!=0)] <- NA
dat$C[dat$C == 0 & dat$D!=0 & is.na(dat$B)] <- NA
Using the data.table package, you could approach this as follows:
cols <- names(dat)[2:6]
library(data.table)
setDT(dat)[, (cols) := {x <- unlist(.SD);
x[cumsum(x)==0] <- NA;
l <- c(tail(cumsum(rev(x)),-1),1) == 0;
x[rev(l)] <- NA;
names(x) <- cols;
as.list(x)},
by = A]
you get:
> dat
A B C D E F
1: name1 NA NA 4 1 4
2: name2 1 0 4 0 NA
3: name3 NA 5 0 NA NA
The same kind of thinking, but with base R:
dl <- as.data.frame(t(dat[,-1]))
idx1 <- cumsum(dl) == 0
idx2 <- sapply(dl, function(x) {
l <- c(tail(cumsum(rev(x)),-1),1) == 0
l[is.na(l)] <- FALSE
rev(l)
})
dl[idx1 | idx2] <- NA
dat[,-1] <- t(dl)
which will get you the same result:
> dat
A B C D E F
1 name1 NA NA 4 1 4
2 name2 1 0 4 0 NA
3 name3 NA 5 0 NA NA
New example data:
dat <- data.frame(A=c("name1", "name2", "name3"),
B=c(0,1,0), C=c(0,0,5), D=c(4,0,0), E=c(1,4,0), F=c(4,0,0) )
This should work:
#Apply the first rule: convert 0 to NA until we find a positive value
res1<-t(apply(dat[,-1], 1, function(x) {
xc <- cumsum(x) #cumulative sum
x[xc==0]<-NA #NA where cumulative sum is 0
x
}))
# Apply the second rule
res2<-t(apply(res1, 1, function(x) {
xc <- cumsum(rev(x)) #reverse the sum
xc<-c(tail(xc,-1),1) # shift the sum
res<-rev(x) #reverse the vector
res[xc==0]<-NA
rev(res)
}))
#Reconstruct the data frame
cbind(data.frame(name=dat[,1]),res2)
# name B C D E F
#1 name1 NA NA 4 1 4
#2 name2 1 0 4 0 NA
#3 name3 NA 5 0 NA NA
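Both answers apply the same two rules row by row, so they can also be wrapped in one helper; a sketch (mask_zeros is a made-up name, and it assumes every row has at least one positive value, as in the example data):
mask_zeros <- function(x) {
  x[cumsum(x) == 0] <- NA                     # rule 1: leading zeros become NA
  last_pos <- max(which(!is.na(x) & x > 0))   # position of the last positive value
  if (last_pos + 1 < length(x))               # rule 2: keep one trailing zero, NA the rest
    x[(last_pos + 2):length(x)] <- NA
  x
}
dat <- data.frame(A = c("name1", "name2", "name3"),
                  B = c(0,1,0), C = c(0,0,5), D = c(4,4,0), E = c(1,0,0), F = c(4,0,0))
dat[, -1] <- t(apply(dat[, -1], 1, mask_zeros))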
