This might be slightly silly, but I would appreciate a better way to deal with this problem. I have a data frame like the following:
a <- matrix(1,5,3)
a[1:2,2] <- NA
a[1,c(1,3)] <- NA
a[3:5,2] <- 2
a[2:5,3] <- 3
a <- data.frame(a)
colnames(a) = c("First", "Second", "Third")
I want to sum only some of the columns, but keep NA wherever all elements in the summed columns are NA. In short, if I sum the First and Second columns, I want to get something like
mySum <- c(NA, 1, 3, 3, 3)
Neither of the two options below gives what I want (the first returns NA for row 2 as well; the second returns 0 instead of NA for row 1):
rowSums(a[, c("First", "Second")])
rowSums(a[, c("First", "Second")], na.rm=TRUE)
but on the positive side I have resolved this by using a combination of is.na and all
mySum <- rowSums(a[, c("First", "Second")], na.rm=TRUE)
iNA = apply(a[, c("First", "Second")], 2, is.na)
iAllNA = apply(iNA, 1, all)
mySum[iAllNA] = NA
This feels slightly awkward though so I was wondering if there is a smarter way to handle this.
Using apply with MARGIN = 1, we check every row: if all the row elements are NA we return NA; otherwise we return the sum of the non-NA elements.
apply(a[c("First", "Second")], 1, function(x)
ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE)))
#[1] NA 1 3 3 3
mycols = c("First", "Second")
replace(x = rowSums(a[mycols], na.rm = TRUE),
list = rowSums(is.na(a[mycols])) == length(mycols),
values = NA)
#[1] NA 1 3 3 3
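If this comes up repeatedly, the replace idea above can be wrapped in a small helper. This is only a sketch: `sum_keep_all_na` is a hypothetical name, not a base R function, and the data frame is rebuilt here from the question so the block runs on its own.

```r
# Hypothetical helper (not part of base R): row sums over `cols` that
# ignore NAs but return NA for rows where every selected value is NA.
sum_keep_all_na <- function(df, cols) {
  sub <- df[cols]
  out <- rowSums(sub, na.rm = TRUE)
  out[rowSums(!is.na(sub)) == 0] <- NA   # no observed value in this row
  out
}

# Same data as in the question
a <- data.frame(First  = c(NA, 1, 1, 1, 1),
                Second = c(NA, NA, 2, 2, 2),
                Third  = c(NA, 3, 3, 3, 3))
res <- sum_keep_all_na(a, c("First", "Second"))
print(res)
```

For the example data the values are NA, 1, 3, 3, 3, matching the desired mySum.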
I am trying to replace NA values by column with values predetermined from a vector. For example, I have vector containing the values (1,5,3) and a dataframe df, and want to replace all NA values from column one of df with 1, column two NA's with 5, and column three NA's with 3.
I tried a formula I saw that took
df[is.na(df)] = vector
but it didn't seem to work due to a "wrong length" error, even though the length of the vector equals the number of columns in df.
You can use which with arr.ind = TRUE to get the row/column indices of the NA values and replace them directly.
mat <- which(is.na(df), arr.ind = TRUE)
df[mat] <- vector[mat[, 2]]
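To see what the index matrix looks like, here is a small self-contained sketch. It assumes the example data posted further down in this thread and uses the name `vec` instead of `vector` (an assumption, chosen to avoid shadowing the base function).

```r
df  <- data.frame(col1 = c(1, 3, NA), col2 = c(NA, 2, NA), col3 = c(2, NA, NA))
vec <- c(1, 5, 3)                        # one fill value per column

idx <- which(is.na(df), arr.ind = TRUE)  # two-column matrix: row and column of each NA
print(idx)

df[idx] <- vec[idx[, 2]]                 # pick the fill value by column index
print(df)
```

Each NA is replaced by the element of `vec` whose position matches the NA's column, so no recycling across columns is involved.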
We can use Map to replace the NA values in each column of the dataset with the corresponding value in the vector. This works in nearly all cases, and it is a concise, single-step replacement.
df[] <- Map(function(x, y) replace(x, is.na(x), y), df, vec)
df
# col1 col2 col3
#1 1 5 2
#2 3 2 3
#3 1 5 3
Or another option is to make the lengths the same and then use pmax. Note that this compares non-NA cells with 0, so it assumes the values are non-negative.
df[] <- pmax(as.matrix(df), is.na(df) * vec[col(df)], na.rm = TRUE)
or another option with replace
df <- replace(df, is.na(df), rep(vec, colSums(is.na(df))))
NOTE: All the solutions above are one-liners.
Or using data.table with set
library(data.table)
setDT(df)
for(j in seq_along(df)) set(df, i = which(is.na(df[[j]])), j = j, value = vec[j])
data
df <- data.frame(col1 = c(1, 3, NA), col2 = c(NA, 2, NA), col3 = c(2, NA, NA))
vec <- c(1, 5, 3)
So I'm trying to remove rows that have missing data in some columns, but not those that have missing data in all columns.
Using rowSums alongside !is.na() gave me thousands of rows of NA at the bottom of my dataset. The top answer here provided a good way of solving my issue using complete.cases:
Remove rows with all or some NAs (missing values) in data.frame
i.e.
data_set1 <- data_set1[complete.cases(data_set1[11:103]), ]
However, that only allows me to remove rows with any missing data in the specified columns. I'm struggling to get complete.cases to play along with rowSums and stop it from removing rows with all missing data.
Any advice very much appreciated!
Try using rowSums like this:
cols <- 11:103
vals <- rowSums(is.na(data_set1[cols]))
data_set2 <- data_set1[!(vals > 0 & vals < length(cols)), ]
Or with complete.cases and rowSums
data_set1[complete.cases(data_set1[cols]) |
rowSums(is.na(data_set1[cols])) == length(cols) , ]
With a reproducible example:
df <- data.frame(a = c(1, 2, 3, NA, 1), b = c(NA, 2, 3, NA, NA), c = 1:5)
cols <- 1:2
vals <- rowSums(is.na(df[cols]))
df[!(vals > 0 & vals < length(cols)), ]
# a b c
#2 2 2 2
#3 3 3 3
#4 NA NA 4
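As a quick sanity check, the two filters above can be compared on the same reproducible example. This is a sketch; the names `n_na`, `keep1`, and `keep2` are illustrative only.

```r
df   <- data.frame(a = c(1, 2, 3, NA, 1), b = c(NA, 2, 3, NA, NA), c = 1:5)
cols <- c("a", "b")

n_na  <- rowSums(is.na(df[cols]))             # number of NAs per row in the checked columns
keep1 <- !(n_na > 0 & n_na < length(cols))    # drop rows that are partly (not fully) missing
keep2 <- complete.cases(df[cols]) | n_na == length(cols)  # fully observed or fully missing

print(df[keep1, ])
print(all(keep1 == keep2))
```

Both conditions keep rows 2, 3, and 4 here: a row survives when it has either no NA or only NA in the checked columns.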
Imagine an array of numbers called A. At each level of A, you want to find the most recent item with a matching value. You could easily do this with a for loop as follows:
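A = c(1, 1, 2, 2, 1, 2, 2)
answer = rep(NA, length(A))  # preallocate the result
for(i in seq_along(A)){
  prev = A[seq_len(i - 1)]  # elements before position i (empty when i == 1)
  if(any(prev == A[i])){
    answer[i] = max(which(prev == A[i]))
  }else{
    answer[i] = NA
  }
}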
But I want to vectorize this for loop (because I'll be applying this principle to a very large data set). I tried using sapply:
answer = sapply(A, FUN = function(x){max(which(A == x))})
As you can see, I need some way of reducing the array to only values that come before x. Any advice?
We can use seq_along to loop over the index of each element, subset the elements before it, and take the max index where the value last occurred.
c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
#[1] NA 1 -Inf 3 2 4 6
If needed, we can change the -Inf to NA:
inds <- c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
inds[is.infinite(inds)] <- NA
inds
#[1] NA 1 NA 3 2 4 6
The above method gives a warning, because max is called on an empty vector. To avoid the warning, we can add a check of the length:
c(NA, sapply(seq_along(A)[-1], function(x) {
inds <- which(A[1:(x-1)] == A[x])
if (length(inds) > 0)
max(inds)
else
NA
}))
#[1] NA 1 NA 3 2 4 6
Here's an approach with dplyr which is more verbose, but easier for me to grok. We start with recording the row_number, make a group for each number we encounter, then record the prior matching row.
library(dplyr)
A2 <- A %>%
as_tibble() %>%
mutate(row = row_number()) %>%
group_by(value) %>%
mutate(last_match = lag(row)) %>%
ungroup()
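The same grouping idea can be sketched in base R with ave, avoiding the per-element scan over all earlier positions. This is an alternative sketch, not part of the answer above; `last_match` is an illustrative name.

```r
A <- c(1, 1, 2, 2, 1, 2, 2)

# Group the positions by value. Within each group, shift the positions down
# by one so every element sees the position of its previous occurrence
# (NA for the first occurrence of a value).
last_match <- ave(seq_along(A), A, FUN = function(i) c(NA, head(i, -1)))
print(last_match)
```

Because each position is only touched once within its own group, this should scale better on long vectors than comparing every element against all earlier ones.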
You can do:
sapply(seq_along(A)-1, function(x)ifelse(any(a<-A[x+1]==A[sequence(x)]),max(which(a)),NA))
#[1] NA 1 NA 3 2 4 6
Here's a function that I made (based upon Ronak's answer):
lastMatch = function(A){
  uniqueItems = unique(A)
  # position of the first occurrence of each value; these have no earlier match
  firstInstances = sapply(uniqueItems, function(x){min(which(A == x))})
  notFirstInstances = setdiff(seq_along(A), firstInstances)
  lastMatch_notFirstInstances = sapply(notFirstInstances, function(x) max(which(A[1:(x-1)] == A[x])))
  X = rep(NA, length(A))  # first instances stay NA
  X[notFirstInstances] = lastMatch_notFirstInstances
  return(X)
}
I have this data:
x = c(1,1,3, 3, 2)
y = c(1,2,1, 1, 2)
z = c(1,1,2, 3, 7)
data <- data.frame(x, y, z)
I would like to get a vector indicating the column number of the highest value in each row, with ties either removed or marked with NA.
I have tried which.max:
HighestIncludingTies <- apply(data, 1, which.max)
Although this does not mark ties with NA (or something similar).
Thanks a lot for any help or guidance!
Here's an attempt using max.col:
HighsNoTies <- max.col(data,"first")
replace(HighsNoTies, HighsNoTies != max.col(data,"last"), NA)
#[1] NA 2 1 NA 3
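An alternative sketch makes the tie test explicit by counting how many columns reach the row maximum. The names `m`, `row_max`, `tied`, and `res` are illustrative only.

```r
data <- data.frame(x = c(1, 1, 3, 3, 2),
                   y = c(1, 2, 1, 1, 2),
                   z = c(1, 1, 2, 3, 7))

m       <- as.matrix(data)
row_max <- do.call(pmax, data)          # per-row maximum across the columns
tied    <- rowSums(m == row_max) > 1    # more than one column reaches the maximum
res     <- ifelse(tied, NA, max.col(m, ties.method = "first"))
print(unname(res))
```

Rows 1 and 4 have a tied maximum, so they come out as NA, while the other rows give columns 2, 1, and 3.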
At some point in time, I encountered this problem...and solved it. However, as it is a recurring problem and I've now forgotten the solution, hopefully this question will offer clarification to others as well as me :)
I am creating a variable that is based on answers to several questions. Each question can have three values: 1, 2, or NA. 1's and 2's are mutually exclusive for each observation.
I simply want to create a variable that is a composite of the choice coded with "1" for each person, and give it a value based on that code.
So let's say I have this df:
ID var1 var2 var3 var4
1 1 2 NA NA
2 NA NA 2 1
3 2 1 NA NA
4 2 NA 1 NA
I then try to recode based on the following statement:
df$var <-
ifelse(
as.numeric(df$var1) == 1,
"Gut instinct",
ifelse(
as.numeric(df$var2) == 1,
"Data",
ifelse(
as.numeric(df$var3) == 1,
"Science",
ifelse(
as.numeric(df$var4) == 1,
"Philosophy",
NA
)
)
)
)
However, this code only PARTIALLY codes based on the "ifelse". For example, df$var might have observation of 'Gut instinct' and 'Philosophy', but the codings for when var2 and var3==1 are still NA.
Any thoughts on why this might be happening?
An alternative that will be quicker than apply (using @MrFlick's data):
vals <- c("Gut", "Data", "Science", "Phil")
intm <- dd[-1]==1 & !is.na(dd[-1])
dd$resp <- NA
dd$resp[row(intm)[intm]] <- vals[col(intm)[intm]]
How much quicker? On 1 million rows:
#row/col assignment
user system elapsed
0.99 0.02 1.02
#apply
user system elapsed
11.98 0.04 12.30
And giving the same results when tried on identical datasets:
identical(flick$resp,latemail$resp)
#[1] TRUE
This is because ifelse (and ==) have special behavior for NA. Specifically, R doesn't want to tell you that NA is different from 1 (or anything else), because often NA is used to represent a value that could be anything, maybe even 1.
> 1 == NA
[1] NA
> ifelse(NA == 1, "yes", "no")
[1] NA
With your code, if an NA occurs before a 1 (like for ID 2), then the outer ifelse returns NA for that row, and the result of the nested ifelse in the FALSE branch is never used.
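One way to sidestep the NA propagation described above is to test with %in% instead of ==, since %in% returns FALSE rather than NA for a missing value. A minimal sketch with a made-up vector `x`:

```r
x <- c(1, NA, 2)

eq_result <- x == 1      # the comparison itself is NA where x is NA
in_result <- x %in% 1    # NA is treated as "no match", so the result is never NA

res <- ifelse(in_result, "yes", "no")
print(eq_result)
print(in_result)
print(res)
```

With a test that is never NA, each nested ifelse can fall through to its FALSE branch as intended.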
Here's a way to do it without the nested ifelse statements
#your data
dd<-data.frame(ID = 1:4,
var1 = c(1, NA, 2, 2),
var2 = c(2, NA, 1, NA),
var3 = c(NA, 2, NA, 2),
var4 = c(NA, 1, NA, NA)
)
resp <- c("Gut","Data","Sci","Phil")[apply(dd[,-1]==1,1,function(x) which(x)[1])]
cbind(dd, resp)
I use apply to scan across the rows to find the first 1 and use that index to subset the response values. Using which helps to deal with the NA values.
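The role of which can be seen on two single rows. This is a small sketch with made-up rows `row_a` and `row_b`, not taken from the data above.

```r
vals <- c("Gut", "Data", "Sci", "Phil")

row_a <- c(NA, NA, 2, 1) == 1     # NA NA FALSE TRUE
row_b <- c(2, NA, NA, NA) == 1    # FALSE NA NA NA

first_a <- which(row_a)[1]        # which() keeps only TRUE positions, skipping NA and FALSE
first_b <- which(row_b)[1]        # no TRUE at all: integer(0)[1] is NA

labels <- vals[c(first_a, first_b)]
print(labels)
```

A row with no 1 therefore ends up as NA in the result instead of raising an error.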
To answer your question: it is due to the NAs in your data. This should sort your problem out
df <- data.frame( ID=1:4, var1= c(1, NA, 2, 2), var2= c(2, NA, 1, NA),
var3=c(NA,2,NA,2), var4=c(NA, 1, NA, NA))
df$var<-ifelse(as.numeric(df$var1)==1&!is.na(df$var1),"Gut instinct",
ifelse(as.numeric(df$var2)==1&!is.na(df$var2),"Data",
ifelse(as.numeric(df$var3)==1&!is.na(df$var3),"Science",
ifelse(as.numeric(df$var4)==1&!is.na(df$var4),"Philosophy",NA))))
However, I would find it easier to reshape the data into long format rather than a wide table and work on it as a vector.
data <- df
library(reshape2)
long <- melt(data, id.vars="ID")
long
This gives you a long-format data frame. Next, convert the variable names to something more meaningful.
library(stringr)
long$variable <- str_replace(long$variable, "var1", "Gut Instinct")
long$variable <- str_replace(long$variable, "var2", "Data")
long$variable <- str_replace(long$variable, "var3", "Science")
long$variable <- str_replace(long$variable, "var4", "Philosophy")
And now you can decide what to do based on each result
long$var <- ifelse(long$value==1, long$variable, NA)
and convert it back to something like the original if you want it that way
reshape(data=long, timevar="ID",idvar=c("var", "variable"), v.names = "value", direction="wide")
HTH