At some point in time, I encountered this problem...and solved it. However, as it is a recurring problem and I've now forgotten the solution, hopefully this question will offer clarification to others as well as me :)
I am creating a variable that is based on answers to several questions. Each question can have three values: 1, 2, or NA. 1's and 2's are mutually exclusive for each observation.
I simply want to create a variable that is a composite of the choice coded with "1" for each person, and give it a value based on that code.
So let's say I have this df:
ID var1 var2 var3 var4
1 1 2 NA NA
2 NA NA 2 1
3 2 1 NA NA
4 2 NA 1 NA
I then try to recode based on the following statement:
df$var <-
  ifelse(
    as.numeric(df$var1) == 1,
    "Gut instinct",
    ifelse(
      as.numeric(df$var2) == 1,
      "Data",
      ifelse(
        as.numeric(df$var3) == 1,
        "Science",
        ifelse(
          as.numeric(df$var4) == 1,
          "Philosophy",
          NA
        )
      )
    )
  )
However, this code only PARTIALLY codes based on the ifelse. For example, df$var might have observations of 'Gut instinct' and 'Philosophy', but the codings for rows where var2 or var3 == 1 are still NA.
Any thoughts on why this might be happening?
An alternative that will be quicker than apply (using @MrFlick's data):
vals <- c("Gut", "Data", "Science", "Phil")
intm <- dd[-1]==1 & !is.na(dd[-1])
dd$resp <- NA
dd$resp[row(intm)[intm]] <- vals[col(intm)[intm]]
How much quicker? On 1 million rows:
#row/col assignment
user system elapsed
0.99 0.02 1.02
#apply
user system elapsed
11.98 0.04 12.30
And giving the same results when tried on identical datasets:
identical(flick$resp,latemail$resp)
#[1] TRUE
This is because ifelse (and ==) has special behavior for NA. Specifically, R doesn't want to tell you that NA is different from 1 (or anything else), because often NA is used to represent a value that could be anything, maybe even 1.
> 1 == NA
[1] NA
> ifelse(NA == 1, "yes", "no")
[1] NA
With your code, if an NA occurs before a 1 (like for ID 2), then that ifelse statement will just return NA, and the nested FALSE ifelse will never be called.
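If you want to keep the nested-ifelse style, one sketch (using the question's data) is to compare with %in% instead of ==, since NA %in% 1 returns FALSE rather than NA:

```r
# %in% treats NA as "no match", so an NA can no longer poison the nesting
df <- data.frame(ID = 1:4,
                 var1 = c(1, NA, 2, 2),
                 var2 = c(2, NA, 1, NA),
                 var3 = c(NA, 2, NA, 1),
                 var4 = c(NA, 1, NA, NA))

df$var <- ifelse(df$var1 %in% 1, "Gut instinct",
          ifelse(df$var2 %in% 1, "Data",
          ifelse(df$var3 %in% 1, "Science",
          ifelse(df$var4 %in% 1, "Philosophy", NA))))
df$var
# [1] "Gut instinct" "Philosophy"   "Data"         "Science"
```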
Here's a way to do this without the nested ifelse statements.
#your data
dd<-data.frame(ID = 1:4,
var1 = c(1, NA, 2, 2),
var2 = c(2, NA, 1, NA),
var3 = c(NA, 2, NA, 2),
var4 = c(NA, 1, NA, NA)
)
resp <- c("Gut","Data","Sci","Phil")[apply(dd[,-1]==1,1,function(x) which(x)[1])]
cbind(dd, resp)
I use apply to scan across the rows to find the first 1 and use that index to subset the response values. Using which helps to deal with the NA values.
To answer your question, it is due to the NAs in your data. This should sort your problem out:
df <- data.frame( ID=1:4, var1= c(1, NA, 2, 2), var2= c(2, NA, 1, NA),
var3=c(NA,2,NA,2), var4=c(NA, 1, NA, NA))
df$var <- ifelse(as.numeric(df$var1) == 1 & !is.na(df$var1), "Gut instinct",
          ifelse(as.numeric(df$var2) == 1 & !is.na(df$var2), "Data",
          ifelse(as.numeric(df$var3) == 1 & !is.na(df$var3), "Science",
          ifelse(as.numeric(df$var4) == 1 & !is.na(df$var4), "Philosophy", NA))))
However, I would find it easier to reshape the data into long format rather than a wide table and work on it as a single vector.
data <- df
library(reshape2)
long <- melt(data, id.vars="ID")
long
This gives you the data in long format. Convert the var titles to something more meaningful.
library(stringr)
long$variable <- str_replace(long$variable, "var1", "Gut Instinct")
long$variable <- str_replace(long$variable, "var2", "Data")
long$variable <- str_replace(long$variable, "var3", "Science")
long$variable <- str_replace(long$variable, "var4", "Philosophy")
And now you can decide what to do based on each result
long$var <- ifelse(long$value==1, long$variable, NA)
and convert it back to something like the original if you want it that way
reshape(data=long, timevar="ID",idvar=c("var", "variable"), v.names = "value", direction="wide")
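If all you need is the composite variable, you can also just filter the long data. A sketch on the same example data, relying on the fact that subset() drops rows whose condition evaluates to NA:

```r
library(reshape2)

df <- data.frame(ID = 1:4,
                 var1 = c(1, NA, 2, 2), var2 = c(2, NA, 1, NA),
                 var3 = c(NA, 2, NA, 2), var4 = c(NA, 1, NA, NA))
long <- melt(df, id.vars = "ID")

# keep only the rows where the respondent chose "1"; NA comparisons are dropped
subset(long, value == 1)[, c("ID", "variable")]
```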
HTH
Related
Probably a stupid question, but I clearly can't see it and would appreciate your help.
Here is a fictional dataset:
dat <- data.frame(ID = c(101, 202, 303, 404),
var1 = c(1, NA, 0, 1),
var2 = c(NA, NA, 0, 1))
Now I need to create a variable that sums the values up, per subject. The following works, but it returns 0 rather than NA when both var1 and var2 are NA:
try1 <- apply(dat[,c(2:3)], MARGIN=1, function(x) {sum(x==1, na.rm=TRUE)})
I would like the script to write NA if both var1 and var2 are NA, but if one of the two variables has an actual value, I'd like the script to treat the NA as 0. I have tried this:
check1 <- apply(dat[,2:3], MARGIN=1, function(x)
{ifelse(x== is.na(dat$var1) & is.na(dat$var2), NA, {sum(x==1, na.rm=TRUE)})})
This, however, produces a 4x4 matrix (int[1:4,1:4]). The real dataset has hundreds of observations so that just became a mess...Does anybody see where I go wrong?
Thank you!
Here's a working version:
apply(dat[,2:3], MARGIN=1, function(x)
{
if(all(is.na(x))) {
NA
} else {
sum(x==1, na.rm=TRUE)
}
}
)
#[1] 1 NA 0 2
Issues with yours:
Inside your function(x), x is the var1 and var2 values for a particular row. You don't want to go back and reference dat$var1 and dat$var2, which is the whole column! Just use x.
x== is.na(dat$var1) & is.na(dat$var2) is strange. It's trying to check whether x is the same as is.na(dat$var1)?
For a given row, we want to check whether all the values are NA. ifelse is vectorized and will return a vector - but we don't want a vector, we want a single TRUE or FALSE indicating whether all values are NA. So we use all(is.na()). And if() instead of ifelse.
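The distinction is easy to see on a single row's worth of values:

```r
x <- c(NA, NA)

all(is.na(x))            # one TRUE/FALSE for the whole row, which is what if() needs
# [1] TRUE

ifelse(is.na(x), NA, 0)  # ifelse() is vectorized: one result per element
# [1] NA NA
```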
So I'm trying to remove rows that have missing data in some columns, but not those that have missing data in all columns.
Using rowSums alongside !is.na() gave me thousands of rows of NA at the bottom of my dataset. The top answer here provided a good way of solving my issue using complete.cases:
Remove rows with all or some NAs (missing values) in data.frame
i.e.
data_set1 <- data_set1[complete.cases(data_set1[11:103]), ]
However, that only allows me to remove rows with any missing data in the specified columns. I'm struggling to get complete.cases to play along with rowSums and stop it from removing rows with all missing data.
Any advice very much appreciated!
Try using rowSums like:
cols <- 11:103
vals <- rowSums(is.na(data_set1[cols]))
data_set2 <- data_set1[!(vals > 0 & vals < length(cols)), ]
Or with complete.cases and rowSums
data_set1[complete.cases(data_set1[cols]) |
rowSums(is.na(data_set1[cols])) == length(cols) , ]
With reproducible example,
df <- data.frame(a = c(1, 2, 3, NA, 1), b = c(NA, 2, 3, NA, NA), c = 1:5)
cols <- 1:2
vals <- rowSums(is.na(df[cols]))
df[!(vals > 0 & vals < length(cols)), ]
# a b c
#2 2 2 2
#3 3 3 3
#4 NA NA 4
This might be slightly silly, but I would appreciate a better way to deal with this problem. I have a data frame like the following:
a <- matrix(1,5,3)
a[1:2,2] <- NA
a[1,c(1,3)] <- NA
a[3:5,2] <- 2
a[2:5,3] <- 3
a <- data.frame(a)
colnames(a) = c("First", "Second", "Third")
I want to sum only some of the columns, but I would like to keep the NAs when all elements in the summed columns are NA. In short, if I sum the First and Second columns I want to get something like
mySum <- c(NA, 1, 3, 3, 3)
Neither of the two options below provides what I want
rowSums(a[, c("First", "Second")])
rowSums(a[, c("First", "Second")], na.rm=TRUE)
but on the positive side I have resolved this by using a combination of is.na and all
mySum <- rowSums(a[, c("First", "Second")], na.rm=TRUE)
iNA = apply(a[, c("First", "Second")], 2, is.na)
iAllNA = apply(iNA, 1, all)
mySum[iAllNA] = NA
This feels slightly awkward though so I was wondering if there is a smarter way to handle this.
Using apply with MARGIN = 1: for every row, if all the row's elements are NA we return NA, or else we return their sum.
apply(a[c("First", "Second")], 1, function(x)
ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE)))
#[1] NA 1 3 3 3
Alternatively, the same result with replace:
mycols = c("First", "Second")
replace(x = rowSums(a[mycols], na.rm = TRUE),
list = rowSums(is.na(a[mycols])) == length(mycols),
values = NA)
#[1] NA 1 3 3 3
I'm a newbie to R and data.table, but I'm trying to collapse a customer data set that takes the following format - although it extends across 90 columns:
frame <- data.frame(
customer_id = c(123, 123, 123),
time = c(1, 2, 3),
rec_type = c('contact', 'appointment', 'sale'),
variable_1 = c('Yes', NA, "Yes"),
variable_2 = c(NA, 'No', NA),
variable_3 = c(NA, NA, 'complete'),
variable_4 = NA, stringsAsFactors = FALSE)
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
123 1 contact Yes NA NA NA
123 2 appointment NA No NA NA
123 3 sale Yes NA complete NA
I asked before - What's the best way to collapse sparse data into single rows in R? - how to collapse the data for each customer into a single row and got two useful answers in data.table and dplyr.
However, those answers couldn't handle multiple values such as the rec_type field, or cases where there are multiple instances of the same value in variable_1.
I'd like to lapply a function which works across columns and returns a row vector in which each field is either the single unique value for that field, NA if all column values are blank, or 'multiple'.
In this case: my output would be
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
123 multiple multiple Yes No complete NA
I worked out how to count the unique values across columns:
unique_values <- function(x){
uniques <- dt[contact_no == x,][,lapply(.SD, uniqueN)]
uniques
}
lapply(dt$contact_no, unique_values)
But I couldn't work out how to use the results from uniques to return the results I'd like.
Can anyone suggest an approach I can use?
Is there a simpler way of tackling the problem?
Here is one data.table method.
setDT(frame)[, lapply(.SD, function(x) {
    x <- unique(x[!is.na(x)])
    if (length(x) == 1) as.character(x)
    else if (length(x) == 0) NA_character_
    else "multiple"
  }),
  by = customer_id]
The idea is to use lapply to apply an anonymous function to all variables and construct the function in a manner that returns the desired results. This function strips out NA values and duplicates and then checks the length of the resulting vector. The output of each is cast as a character in order to comply with the possibility of "multiple" occurring for another customer_id.
This returns:
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
1: 123 multiple multiple Yes No complete NA
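For comparison, a dplyr sketch of the same collapse (assuming dplyr >= 1.0 for across(); the anonymous function is unchanged):

```r
library(dplyr)

frame <- data.frame(
  customer_id = c(123, 123, 123),
  time = c(1, 2, 3),
  rec_type = c('contact', 'appointment', 'sale'),
  variable_1 = c('Yes', NA, 'Yes'),
  variable_2 = c(NA, 'No', NA),
  variable_3 = c(NA, NA, 'complete'),
  variable_4 = NA, stringsAsFactors = FALSE)

# summarise(across(...)) applies the function to every non-grouping column
frame %>%
  group_by(customer_id) %>%
  summarise(across(everything(), function(x) {
    x <- unique(x[!is.na(x)])
    if (length(x) == 1) as.character(x)
    else if (length(x) == 0) NA_character_
    else "multiple"
  }))
```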
Given the example data set below:
df <- as.data.frame(matrix( c(1, 2, 3, NA, 5, NA,
7, NA, 9, 10, NA, NA), nrow=2, ncol=6))
names(df) <- c( "varA", "varB", "varC", "varD", "varE", "varF")
print(df)
varA varB varC varD varE varF
1 1 3 5 7 9 NA
2 2 NA NA NA 10 NA
I'd like to be able to use kmeans(...) on data sets without having to manually check or delete variables that contain NA anywhere within the variable. While I'm asking right now for kmeans(...) I'll be using a similar process for other things, so a kmeans(...) specific answer won't totally answer my question.
The manual version of what I'd like is:
kmeans_model <- kmeans(df[, -c(2:4, 6)], 10)
And the pseudo-code would be:
kmeans_model <- kmeans(df[, -c(colnames(is.na(df)))], 10)
Also, I don't want to delete the data from df. Thanks in advance.
(Obviously kmeans(...) wouldn't work on this example data set but I can't recreate the real data set)
Here are two options without sapply:
kmeans_model <- kmeans(df[, !colSums(is.na(df))], 10)
Or
kmeans_model <- kmeans(df[, colSums(is.na(df)) == 0], 10)
Explanation:
colSums(is.na(df)) counts the number of NAs per column, resulting in:
colSums(is.na(df))
#varA varB varC varD varE varF
# 0 1 1 1 0 2
And then
colSums(is.na(df)) == 0 # converts to logical TRUE/FALSE
#varA varB varC varD varE varF
#TRUE FALSE FALSE FALSE TRUE FALSE
is the same as
!colSums(is.na(df))
#varA varB varC varD varE varF
#TRUE FALSE FALSE FALSE TRUE FALSE
Both methods can be used to subset only those columns where the logical value is TRUE
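On the example df, either form keeps only the NA-free columns:

```r
df <- data.frame(varA = c(1, 2), varB = c(3, NA), varC = c(5, NA),
                 varD = c(7, NA), varE = c(9, 10), varF = c(NA, NA))

# subset to columns with zero NAs
df[, colSums(is.na(df)) == 0]
#   varA varE
# 1    1    9
# 2    2   10
```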
This is the generic approach that I use for listing column names and their count of NAs:
sort(colSums(is.na(df)), decreasing = TRUE)
If you want to use sapply, you can refer to this code snippet as well (here flights is the data frame being checked):
flights_NA_cols <- sapply(flights, function(x) sum(is.na(x)))
flights_NA_cols[flights_NA_cols>0]