Combining Survey Items in R / Recoding NAs

I have two lists (from a multi-wave survey) that look like this:
X1 X2
1 NA
NA 2
NA NA
How can I easily combine this into a third item, where the third column always takes the non-NA value of column X1 or X2, and codes NA when both values are NA?

Combining Gavin's use of within and Prasad's use of ifelse gives us a simpler answer.
within(df, x3 <- ifelse(is.na(x1), x2, x1))
Nested ifelse calls are not needed: when x1 is NA, x2 is taken as it is, so the result is NA exactly when both values are NA.
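To make the single-ifelse idea concrete, here is a minimal sketch on the three rows from the question. The lowercase column names x1 and x2 follow the answer's code, not the question's X1 and X2, and the data frame is rebuilt here only for illustration.

```r
# Three rows from the question: only x1, only x2, neither.
df <- data.frame(x1 = c(1, NA, NA), x2 = c(NA, 2, NA))

# Take x2 wherever x1 is missing; if x2 is missing too, the NA carries over.
df <- within(df, x3 <- ifelse(is.na(x1), x2, x1))
print(df)
```

Note that R is case-sensitive, so the column names in the call must match the data exactly.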

Another way using ifelse:
df <- data.frame(x1 = c(1, NA, NA, 3), x2 = c(NA, 2, NA, 4))
> df
x1 x2
1 1 NA
2 NA 2
3 NA NA
4 3 4
> transform(df, x3 = ifelse(is.na(x1), ifelse(is.na(x2), NA, x2), x1))
x1 x2 x3
1 1 NA 1
2 NA 2 2
3 NA NA NA
4 3 4 3

This needs a little extra finessing because both X1 and X2 can be NA, but this function can be used to solve your problem:
foo <- function(x) {
  if (all(nas <- is.na(x))) {
    NA
  } else {
    x[!nas]
  }
}
We use the function foo by applying it to each row of your data (here I have your data in an object named dat):
> apply(dat, 1, foo)
[1] 1 2 NA
So this gives us what we want. To include this inside your object, we do this:
> dat <- within(dat, X3 <- apply(dat, 1, foo))
> dat
X1 X2 X3
1 1 NA 1
2 NA 2 2
3 NA NA NA
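One caveat with foo: if a row has valid values in both columns, x[!nas] returns two values and apply no longer yields a plain vector. A possible variant, sketched here under the assumption that the first non-NA value should win (first_non_na is a hypothetical name, and the fourth row is added only to show the case):

```r
# Hypothetical helper: first non-missing value in a row, or NA if there is none.
first_non_na <- function(x) {
  present <- x[!is.na(x)]
  if (length(present) == 0) NA else present[1]
}

# Same data as above plus one row where both items were answered.
dat <- data.frame(X1 = c(1, NA, NA, 3), X2 = c(NA, 2, NA, 4))
dat$X3 <- apply(dat[, c("X1", "X2")], 1, first_non_na)
print(dat)
```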

You didn't say what you wanted done when both were valid numbers, but you can use either pmax or pmin with the na.rm argument:
pmax(df$x1, df$x2, na.rm=TRUE)
# [1] 1 2 NA 4
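The choice matters only for rows where both items are valid. A small comparison, using the same example vectors as the ifelse answer above, shows how the options differ on such a row:

```r
x1 <- c(1, NA, NA, 3)
x2 <- c(NA, 2, NA, 4)

larger  <- pmax(x1, x2, na.rm = TRUE)   # larger value when both are present
smaller <- pmin(x1, x2, na.rm = TRUE)   # smaller value when both are present
first   <- ifelse(is.na(x1), x2, x1)    # always prefers x1 when it is present

print(cbind(x1, x2, larger, smaller, first))
```

With na.rm = TRUE, both pmax and pmin return NA only when every input for that position is NA.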


Select columns, excluding some which are all NA

Suppose I have this dataframe
df <- data.frame(keep = c(1, NA, 2),
also_want = c(NA, NA, NA),
maybe = c(1, 2, NA),
maybe_2 = c(NA, NA, NA))
Edit: In the actual dataframe there are many columns I'd like to keep, so spelling them all out isn't viable. These columns are all the columns that do not start with maybe. The maybe columns, instead, do have a common naming like maybe, maybe_1 etc. that could work with grep or stringr::str_detect
I want to select keep, and also_want. I also want any of the maybe columns that have values other than NA
desired_df
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
I can use select_if to get all columns that have non-NA values, but then I lose also_want
library(dplyr)
df %>%
select_if(~sum(!is.na(.)) > 0)
keep maybe
1 1 1
2 NA 2
3 2 NA
Thoughts?
With dplyr 1.0.0 you can use the where function inside a select statement to test for conditions that your variables have to satisfy, but first you specify the variables you also want to keep.
EDIT
I've restricted the non-NA condition to the "maybe" variables; every column that does not start with "maybe" is selected unconditionally.
df %>%
select(!starts_with("maybe"), starts_with("maybe") & where(~sum(!is.na(.)) > 0))
Output
# keep also_want maybe
# 1 1 NA 1
# 2 NA NA 2
# 3 2 NA NA
Following your comments, in base R we can use
df[, !apply(
  rbind(
    grepl("maybe", colnames(df)),
    !apply(df, 2, function(x) !all(is.na(x)))
  ),
  2, all)]
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
Or if you prefer seeing the same code all on 1 line:
df[,!apply(rbind(grepl("maybe",colnames(df)),!apply(df, 2, function(x) !all(is.na(x)))),2,all)]
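The double negation in the base R version can be hard to follow. A sketch of the same selection with named intermediate steps (is_maybe and all_na are illustrative names, and the "^maybe" pattern assumes the prefix convention described in the question):

```r
df <- data.frame(keep = c(1, NA, 2),
                 also_want = c(NA, NA, NA),
                 maybe = c(1, 2, NA),
                 maybe_2 = c(NA, NA, NA))

is_maybe <- grepl("^maybe", names(df))   # columns that are only kept conditionally
all_na   <- colSums(!is.na(df)) == 0     # columns with no non-NA value at all

# Drop a column only if it is a "maybe" column and entirely NA.
result <- df[, !(is_maybe & all_na), drop = FALSE]
print(result)
```

drop = FALSE keeps the result a data frame even if only one column survives.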
I eventually figured this out: use str_detect to select all non-maybe columns, then a one-liner inside sapply to also select any other columns (i.e. any maybe columns) that have non-NA values.
library(dplyr)
library(stringr)
df %>%
  select_if(str_detect(names(.), "maybe", negate = TRUE) |
              sapply(., function(x) sum(!is.na(x)) > 0))
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA

Is there a way to recode an SPSS function in R to create a single new variable?

Can somebody please help me with a recode from SPSS into R?
SPSS code:
RECODE variable1
(1,2=1)
(3 THRU 8 =2)
(9, 10 =3)
(ELSE = SYSMIS)
INTO variable2
I can create new variables with the different values. However, I'd like it to be in the same variable, as SPSS does.
Many thanks.
x <- y<- 1:20
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y[x %in% (1:2)] <- 1
y[x %in% (3:8)] <- 2
y[x %in% (9:10)] <- 3
y[!(x %in% (1:10))] <- NA
y
[1] 1 1 2 2 2 2 2 2 3 3 NA NA NA NA NA NA NA NA NA NA
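Because the three RECODE bands are contiguous, cut() may also be an option. This is a sketch that assumes the variable holds whole numbers only; a value such as 2.5 would fall into band 2 here, which the SPSS syntax would not do.

```r
x <- 1:20

# Bands (0,2], (2,8], (8,10]; anything outside them becomes NA,
# which plays the role of ELSE = SYSMIS.
y <- cut(x, breaks = c(0, 2, 8, 10), labels = FALSE)
print(y)
```

labels = FALSE makes cut() return the band number as an integer instead of a factor.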
I wrote a function whose syntax is very similar to the SPSS RECODE command.
variable1 <- -1:11
recodeR(variable1, c(1, 2, 1), c(3:8, 2), c(9, 10, 3), else_do= "missing")
NA NA 1 1 2 2 2 2 2 2 3 3 NA
The function also works for other examples. It is defined as follows:
recodeR <- function(vec_in, ..., else_do) {
  l <- list(...)
  # extract the "from" values (all but the last element of each argument)
  from_vec <- unlist(lapply(l, function(x) x[-length(x)]))
  # extract the "to" values (the last element, repeated once per "from" value)
  to_vec <- unlist(lapply(l, function(x) rep(x[length(x)], length(x) - 1)))
  # recode the variable; plyr is required for mapvalues
  vec_out <- plyr::mapvalues(vec_in, from_vec, to_vec)
  # if else_do is "missing", every value that is not listed as a "from" value
  # becomes NA; otherwise unlisted values stay the same
  if (else_do == "missing") {
    vec_out[!(vec_in %in% from_vec)] <- NA
  }
  # return the resulting vector
  return(vec_out)
}

Subtract columns in R data frame but keep values of var1 or var2 when the other is NA

I wanted to subtract one column from the other in R and this turned out more complicated than I thought.
Suppose this is my data (columns a and b) and column c is what I want, namely a - b, but keeping a when b is NA and vice versa:
a b c
1 2 1 1
2 2 NA 2
3 NA 3 3
4 NA NA NA
Now I tried different things but most of the time it returned NA when at least one column was NA. For example:
matrixStats::rowDiffs(data, na.rm=T) # only works for matrix-format, and returns NA's
dat$c <- dat$a - dat$b + ifelse(is.na(dat$b),dat$a,0) + ifelse(is.na(dat$a),dat$b,0) # seems like a desperately basic solution, but not even this does the trick as it also returns NA's
apply(dat[,(1:2)], MARGIN = 1,FUN = diff, na.rm=T) # returns NA's
dat$b<-dat$b*(-1)
dat$c<-rowSums(dat,na.rm=T) # this kind of works but it's a really ugly workaround
Also, if you can think of a dplyr solution, please share your knowledge. I didn't even know what to try.
Will delete this question if you think it's a duplicate of an existing one, though none of the existing threads were particularly helpful.
Try this (Base R Solution):
If df$b is NA, take the value of df$a; otherwise, if df$a is NA, take the value of df$b; otherwise compute df$a - df$b.
df$c=ifelse(is.na(df$b),df$a,ifelse(is.na(df$a),df$b,df$a-df$b))
Output:
df
a b c
1 2 1 1
2 2 NA 2
3 NA 3 3
4 NA NA NA
You may try using the coalesce function from the dplyr package:
library(dplyr)
dat <- data.frame(a=c(2, 2, NA, NA), b=c(1, NA, 3, NA))
dat$c <- coalesce(dat$a - coalesce(dat$b, 0), dat$b)
dat
a b c
1 2 1 1
2 2 NA 2
3 NA 3 3
4 NA NA NA
The idea here is to take a minus b, or a alone if b is NA. If that entire expression is still NA, then a must also be NA, in which case we take b.
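The two coalesce steps can be traced one at a time. The sketch below uses a small base R stand-in, coalesce2, which is a hypothetical helper written for this illustration and not part of dplyr; it only mimics the two-argument case.

```r
# Hypothetical two-argument stand-in for dplyr::coalesce.
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

a <- c(2, 2, NA, NA)
b <- c(1, NA, 3, NA)

b_or_zero  <- coalesce2(b, 0)            # 1 0 3 0: a missing b counts as 0
difference <- a - b_or_zero              # 1 2 NA NA: still NA where a is missing
result     <- coalesce2(difference, b)   # fall back to b where a was missing

print(result)
```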
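Here is one option with base R: replace the NA elements with 0, Reduce the columns to a single vector by taking the row-wise difference, and set the rows where all elements are NA back to NA. Note that abs() is what turns 0 - b into b when a is NA, so this also returns |a - b| rather than a - b when both values are present.
df1$c <- abs(Reduce(`-`, replace(df1, is.na(df1), 0))) *
  NA^(!rowSums(!is.na(df1)))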
df1$c
#[1] 1 2 3 NA
Or using a similar method with data.table
library(data.table)
setDT(df1)[!is.na(a) | !is.na(b), c := abs(Reduce(`-`,
replace(.SD, is.na(.SD), 0)))]
data
df1 <- structure(list(a = c(2L, 2L, NA, NA), b = c(1L, NA, 3L, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
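The abs() in the Reduce-based code gives the requested result for the example data because a is never smaller than b there. If negative differences need to be preserved, a sign-preserving variant might look like the sketch below; the data are altered (a = 1, b = 3 in the first row) purely to expose the difference, and the variable names are illustrative.

```r
df1 <- data.frame(a = c(1, 2, NA, NA), b = c(3, NA, 3, NA))

zero_filled <- replace(df1, is.na(df1), 0)
both_na     <- rowSums(!is.na(df1)) == 0

# abs-based approach: the first row comes out as 2, not -2.
with_abs <- abs(zero_filled$a - zero_filled$b)
with_abs[both_na] <- NA

# Sign-preserving approach: use b alone only when a is missing.
signed <- ifelse(is.na(df1$a), df1$b, df1$a - zero_filled$b)

print(cbind(df1, with_abs, signed))
```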

Set values greater than index to be NA, per row

In R, I have a data frame df and a vector invalidAfterIndex. For the ith row, I would like to set all elements with column index greater than invalidAfterIndex[i] to NA. For example:
> df <- data.frame(matrix(rnorm(20), nrow=5))
> df
X1 X2 X3 X4
1 2.124042819 -0.2862224 0.1686977 2.14838198
2 0.777763004 0.2949123 -0.4331421 -0.81278586
3 -0.003226624 -0.2326152 -1.5779695 -1.23193913
4 0.165975919 -0.1879981 -0.8214994 -1.40267202
5 1.299195865 -0.9418217 -1.5302512 0.03164781
> invalidAfterIndex <- c(2,3,1,4,1)
I would like to have:
X1 X2 X3 X4
1 2.124042819 -0.2862224 NA NA
2 0.777763004 0.2949123 -0.4331421 NA
3 -0.003226624 NA NA NA
4 0.165975919 -0.1879981 -0.8214994 -1.40267202
5 1.299195865 NA NA NA
How can I do this without a for loop?
You can do
is.na(df) <- col(df) > invalidAfterIndex
Or, as @digEmAll suggested
df[col(df) > invalidAfterIndex] <- NA
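Both forms rely on recycling: col() returns a matrix of column numbers, and the comparison walks down each column in turn, so invalidAfterIndex[i] is always compared against row i. A small deterministic sketch (a 5 x 4 integer matrix instead of the random data above, and assuming the vector has exactly one entry per row):

```r
m <- matrix(1:20, nrow = 5)
invalidAfterIndex <- c(2, 3, 1, 4, 1)

# TRUE wherever the column number exceeds that row's cutoff.
mask <- col(m) > invalidAfterIndex
m[mask] <- NA

print(m)
```

After masking, the number of remaining values in each row equals that row's cutoff.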
Here is an option with Map, with which you can pass the column and column position to a function where you can replace elements with index surpassing the invalidAfterIndex with NA:
df[] <- Map(function(col, index) replace(col, index > invalidAfterIndex, NA), df, seq_along(df))
df
# X1 X2 X3 X4
#1 2.124042819 -0.2862224 NA NA
#2 0.777763004 0.2949123 -0.4331421 NA
#3 -0.003226624 NA NA NA
#4 0.165975919 -0.1879981 -0.8214994 -1.402672
#5 1.299195865 NA NA NA

passing positive results from multiple columns into a single new column in r

I am trying to work out a way to create a single column from multiple columns in R. What I want to do is for R to go through all rows for multiple columns and if it finds a positive result in one of those columns, to pass that result into an 'amalgam' column (sorry I don't know a better word for it).
See the toy dataset below
x <- c(NA, NA, NA, NA, NA, 1)
y <- c(NA, NA, 1, NA, NA, NA)
z <- c(NA, 1, NA, NA, NA, NA)
df <- data.frame(cbind(x, y, z))
df[, "compCol"] <- NA
df
x y z compCol
1 NA NA NA NA
2 NA NA 1 NA
3 NA 1 NA NA
4 NA NA NA NA
5 NA NA NA NA
6 1 NA NA NA
I need to pass positive results from each of the columns into the compCol column while changing negative results to 0. So that it looks like this.
x y z compCol
1 NA NA NA 0
2 NA NA 1 3
3 NA 1 NA 2
4 NA NA NA 0
5 NA NA NA 0
6 1 NA NA 1
I know it probably requires an if else statement nested inside a for loop but all the ways I have tried result in errors that I don't understand.
I tried the following just for a single column
for (i in 1:length(x)) {
if (df$x[i] == 1) {
df$compCol[i] <- df$x[i]
}
}
But it didn't work at all.
I got the message 'Error in if (df$x[i] == 1) { : missing value where TRUE/FALSE needed'
And that makes sense but I can't see where to put the TRUE/FALSE statement
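The error arises because NA == 1 evaluates to NA, and if() needs a definite TRUE or FALSE. A minimal sketch of the loop approach, guarding each comparison with is.na() and assuming "positive" means a value greater than 0 (the vectorised answers below are more idiomatic):

```r
df <- data.frame(x = c(NA, NA, NA, NA, NA, 1),
                 y = c(NA, NA, 1, NA, NA, NA),
                 z = c(NA, 1, NA, NA, NA, NA))

compCol <- rep(0, nrow(df))            # default when no column is positive
for (j in seq_along(df)) {             # column position: 1 = x, 2 = y, 3 = z
  for (i in seq_len(nrow(df))) {
    value <- df[i, j]
    # is.na() is checked first, so the comparison never sees an NA.
    if (!is.na(value) && value > 0) {
      compCol[i] <- j
    }
  }
}
df$compCol <- compCol
print(df)
```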
You can also use reshaping with NA removal
library(dplyr)
library(tidyr)
df.id = df %>% mutate(ID = 1:n() )
df.id %>%
gather(variable, value,
x, y, z,
na.rm = TRUE) %>%
left_join(df.id)
We can use max.col. Create a logical matrix by checking whether the selected columns are greater than 0 and are not NA ('ind'). We use max.col to get the column index for each row and multiply it by the rowSums of 'ind', so that a row with no TRUE values gets 0.
ind <- df > 0 & !is.na(df)
df$compCol <- max.col(ind) *rowSums(ind)
df$compCol
#[1] 0 3 2 0 0 1
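Multiplying by rowSums(ind) gives the column index only when each row has at most one positive value; with two hits the index would be doubled. A sketch of a variant that multiplies by a 0/1 flag instead and breaks ties deterministically, under the assumption that the first positive column should win:

```r
df <- data.frame(x = c(NA, NA, NA, NA, NA, 1),
                 y = c(NA, NA, 1, NA, NA, NA),
                 z = c(NA, 1, NA, NA, NA, NA))

ind <- df > 0 & !is.na(df)             # TRUE where a value is present and positive
has_hit <- rowSums(ind) > 0            # does the row contain any positive value?

# ties.method = "first" picks the leftmost TRUE column in each row.
compCol <- max.col(ind, ties.method = "first") * has_hit
print(unname(compCol))
```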
Or another option is pmax after multiplying with the col(df)
do.call(pmax,col(df)*replace(df, is.na(df), 0))
#[1] 0 3 2 0 0 1
NOTE: I used the dataset before creating the 'compCol' in the OP's post.
