Why are NAs in newly created data.frame when using logical selection?

Why are NAs in newly created data.frame when using logical selection? - r

I'm trying to get rid of NAs in an R data.frame. I was trying to create a new df that included only rows whose cluster was "texas" in this example.
> newdf <- df[df$cluster == "texas",]
> summary(newdf$cluster)
texas oklahoma NA's
510 0 719
I had found other questions that address getting rid of NAs, but in this case, I was only selecting those whose "cluster" column is equal to "texas" -- how did an NAs come along for the ride?
Is there a better way of doing what I want?

As #MrFlick suggests above, NA values are handled in slightly (subtly?) different ways depending on how you index.
Test data:
dd <- data.frame(cluster=c("oklahoma","texas",NA))
logical indexing: a TRUE value in the index vector selects the corresponding value, FALSE drops it, and NA results in NA.
dd$cluster=="oklahoma"
## [1] TRUE FALSE NA
summary(dd[dd$cluster=="oklahoma",])
## oklahoma texas NA's
## 1 0 1
In principle you could use dd$cluster=="oklahoma" & !is.na(dd$cluster) as your criterion - since FALSE & NA is FALSE - but that's rather awkward. (Since we have specified a single-column data frame, without saying drop=FALSE, the result gets simplified to a vector before being summarized.)
subset: although it is sometimes deprecated for non-interactive use, subset has the convenient property that it drops values where the criterion evaluates to NA. (Also, subset always returns a data frame even if the result is only one column wide.)
summary(subset(dd,cluster=="oklahoma"))
## cluster
## oklahoma:1
## texas :0
which:
which() only returns indices for TRUE values, not for NA values:
which(dd$cluster=="oklahoma")
## [1] 1
summary(dd[which(dd$cluster=="oklahoma"),])
## oklahoma texas
## 1 0

Related

how to fill missing values in a vector with the mean of value before and after the missing one

Currently I am trying to impute values in a vector in R. The conditions
of the imputation are.
Find all NA values
Then check if they have an existing value before and after them
Also check if the value which follows the NA is larger than
the value before the NA
If the conditions are met, calculate a mean taking the values before
and after.
Replace the NA value with the imputed one
# example one
input_one = c(1,NA,3,4,NA,6,NA,NA)
# example two
input_two = c(NA,NA,3,4,5,6,NA,NA)
# example three
input_three = c(NA,NA,3,4,NA,6,NA,NA)
I started out to write code to detect the values which can
be imputed. But I got stuck with the following.
# incomplete function to detect the values
sapply(split(!is.na(input[c(rbind(which(is.na(c(input)))-1, which(is.na(c(input)))+1))]),
rep(1:(length(!is.na(input[c(which(is.na(c(input)))-1, which(is.na(c(input)))+1)]))/2), each = 2)), all)
This however only detects the NAs which might be
imputable and it only works with example one. It is incomplete and
unfortunately super hard to read and understand.
Any help with this would be highly appreciated.

We can use dplyrs lag and lead functions for that:
input_three = c(NA,NA,3,4,NA,6,NA,NA)
library(dplyr)
ifelse(is.na(input_three) & lead(input_three) > lag(input_three),
(lag(input_three) + lead(input_three))/ 2,
input_three)
Retrurns:
[1] NA NA 3 4 5 6 NA NA
Edit
Explanation:
We use ifelse which is the vectorized version of if. I.e. everything within ifelse will be applied to each element of the vectors.
First we test if the elements are NA and if the following element is > than the previous. To get the previous and following element we can use dplyr lead and lag functions:
lag offsets a vector to the right (default is 1 step):
lag(1:5)
Returns:
[1] NA 1 2 3 4
lead offsets a vector to the left:
lead(1:5)
Returns:
[1] 2 3 4 5 NA
Now to the 'test' clause of ifelse:
is.na(input_three) & lead(input_three) > lag(input_three)
Which returns:
[1] NA NA FALSE FALSE TRUE FALSE NA NA
Then if the ifelse clause evaluates to TRUE we want to return the sum of the previous and following element divided by 2, othrwise return the original element

Here's an example using the imputeTS library. It takes account of more than one NA in the sequence, ensures that the mean is calculated if the next valid observation is greater than the last valid observation and also ignores NA at the beginning and end.
library(imputeTS)
myimpute <- function(series) {
# Find where each NA is
nalocations <- is.na(series)
# Find the last and the previous observation for each row
last1 <- lag(series)
next1 <- lead(series)
# Carry forward the last and next observations over sequences of NA
# Each row will then get a last and next that can be averaged
cflast <- na_locf(last1, na_remaining = 'keep')
cfnext <- na_locf(next1, option = 'nocb', na_remaining = 'keep')
# Make a data frame
df <- data.frame(series, nalocations, last1, cflast, next1, cfnext)
# Calculate the mean where there is currently a NA
# making sure that the next is greater than the last
df$mean <- ifelse(df$nalocations, ifelse(df$cflast < df$cfnext, (df$cflast+df$cfnext)/2, NA), NA)
imputedseries <- ifelse(df$nalocations, ifelse(!is.na(df$mean), df$mean, NA), series)
#list(df, imputedseries) # comment this in and return it to see the intermediate data frame for debugging
imputedseries
}
myimpute(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))
# [1] NA NA 3 4 5 5 6 7 7 8 NA 7 8 8 9 10 11 NA NA

There is also the na_ma function in the imputeTS package for imputing moving averages.
In your case this would be with the following settings:
na_ma(x, k = 1, weighting = "simple")
k = 1 (meaning 1 value before and 1 after the NA are taken into account)
weighting = "simple" (the mean of these two values is calculated)
This can be applied quite easy with basically 1 line of code:
library(imputeTS)
na_ma(yourData, k = 1, weighting = "simple")
You could also choose to take more values before and after the NA into account e.g. k=3. Interesting feature if you take more than 1 value to each side into account is the possibility to choose a different weighting e.g. with weighting = "linear" weights decrease in arithmetical progression (a Linear Weighted Moving Average) - meaning the further they values are away from the NA the less impact they have.

Why I get NA when I do indexing a vector (or dataframe) that do not match my condition?

When I do indexing a vector or dataframe in R, I sometimes get an empty vector (e.g. numeric(0), integer(0), or factor(0)...), and sometimes get NA.
I guess that I get NA when the vector or dataframe I deal with contains NA.
For example,
iris_test = iris
iris_test$Sepal.Length[1] = NA
iris[iris$Sepal.Length < 0, "Sepal.Length"] # numeric(0)
iris_test[iris_test$Sepal.Length < 0, "Sepal.Length"] # NA
It's intuitive for me to get numeric(0) when I find values that do not match my condition
(no search result --> no element in the resulted vector --> numeric(0)).
However, why I get NA rather than numeric(0)?

Your assumption is kind of correct that is you get NA values when there is NA in the data.
The comparison yields NA values
iris_test$Sepal.Length < 0
#[1] NA FALSE FALSE FALSE.....
When you subset a vector with NA it returns NA. See for example,
iris$Sepal.Length[c(1, NA)]
#[1] 5.1 NA
This is what the second case returns. For first case, all the values are FALSE so you get numeric(0)
iris$Sepal.Length[FALSE]
#numeric(0)

Adding to #Ronak's
The discussion of NA at R for Data Science makes it easy for me to understand NA. NA stands for Not Available which is a representation for an unknown values. According to the book linked above, missing values are "contagious"; almost any operation involving an unknown (NA) value will also be unknown. Here are some examples:
# Is unknown greater than 0? Result is unknown (NA)
NA > 0
#NA
# Is unknown less than 0? Output is unknown (NA).
NA < 0
# NA
# Is unknown equal to unknown? Output is unknown(NA).
NA == NA
# NA
Getting back to your data, when you do:
iris_test$Sepal.Length[1] = NA, you are assigning the value of iris_test$Sepal.Length[1] as "unknown" (NA).
The question is "Is unknown less than 0?".
The answer will be unknown and that is why you'r subsetting returns NA as output. The value is unknown (NA).
There is a function called is.na() which I'm sure you're aware of to handle missing values.
Hope that adds some insight to your question.

R subset exclusion based on string creates extra column

I have a data set as such below
salaries <- read.csv('salaries.csv', header=TRUE)
print(salaries)
Name Job Salary CompanyExperience IndustryExperience
John Engineer 50000 3 12
Adam Manager 55000 6 7
Alice Manager #N/A 6 6
Bob Engineer 65000 5 #N/A
Carl Engineer 70000 #N/A 10
I would like to plot some of this information, however I would need to exclude any data points with "#N/A" by removing any rows where there is an "#N/A" text string (produced by MS Excel spreadsheet exported to CSV) to make a plot of Salary ~ CompanyExperience.
My code to subset is as follows:
salaries <-salaries[salaries$CompanyExperience!="#N/A" &
salaries$Salary!="#N/A",]
#write.csv(salaries, "salaries2.csv")
#salaries <- read.csv('salaries2.csv', header=TRUE)
print(salaries)
Now this seems to work without any issue, producing:
Name Job Salary CompanyExperience IndustryExperience
1 John Engineer 50000 3 12
2 Adam Manager 55000 6 7
4 Bob Engineer 65000 5 #N/A
Which seems fine, however as soon as I try to put this data subset into a linear regression, I get an error:
> salarylinear <- lm(salaries$CompanyExperience ~ salaries$Salary)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
Now if I've done some experimenting and have found that if I subset the data using things like "!=10000" or "<50", I dont get this error. Also, I've found that when I write this new subset into a CSV file and read it again (by removing the # tags in the code above, the data set will have added a mysterious "X" column at the front and wont have the error when trying to run a linear regression:
X Name Job Salary CompanyExperience IndustryExperience
1 1 John Engineer 50000 3 12
2 2 Adam Manager 55000 6 7
3 4 Bob Engineer 65000 5 #N/A
I've searched the web and cant find any reason why this is happening. Is there a way I can produce a useable subset by excluding "#N/A" strings without having to resort to writing the data to disk and reading into memory again?

Most likely what is happening is that columns of data that you think are numeric are not in fact numeric. Two things are leading to this:
read.csv() doesn't know that "#N/A" means "missing" and as a result, it is reading in "#N/A" as a string (not a number), causing it to think that the whole columns of Salary, CompanyExperience, and IndustryExperience are string variables.
read.csv() has a notorious default to read in strings as factors. If you're unfamiliar with factors, one good resource is this.
This combination of events is why lm() thinks your dependent variable is a factor and is throwing an error.
The solution is to add na.strings = "#N/A" as an argument to read.csv(). Then your data will be read in as numeric. You can proceed straight to running your regression because lm() will drop rows with NA's automatically.
However, to be a bit more explicit, you may also want to add stringsAsFactors = FALSE as an argument to read.csv() just in case you have any other things that mean "missing" but are coded as, say, a blank. And, if you want to handle the NAs manually before running your regression, you can drop rows with NAs using complete.cases() or something like salaries[!is.na(Salary),]

Follow-up to our discussion in the comments about what happens when you subset a data.frame with a matrix:
First, we create a 3x2 dataframe to work with:
df <- data.frame(x=1:3, y=4:6)
Then, let's create a vector of TRUE/FALSE for the rows we want to keep when we subset our dataframe.
v <- c(T,T,F)
Here, v has 2 TRUEs followed by 1 FALSE so if we subset our 3-row dataframe with v, we will be selecting the first 2 rows and omitting the 3rd row:
df[v,]
x y
1 1 4
2 2 5
Great, that works as expected. But what about if we subset with a matrix? We create matrix m that has the same 3x2 dimensions as our dataframe. m is full of TRUEs except for 2 FALSEs in cells (1,1) and (3,2).
m <- matrix(c(F,T,T,T,T,F), ncol=2)
m
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE TRUE
[3,] TRUE FALSE
Now, if we try to subset our dataframe with m, we might at first think that we're gong to only get row 2 back, because m has a FALSE in its first and third row. That, of course, isn't what happens.
df[m,]
x y
2 2 5
3 3 6
NA NA NA
NA.1 NA NA
The trick to understanding this is to know that a matrix in R is just a vector with a dimension attribute. The dimension is as expected, because we created m:
dim(m)
[1] 3 2
But as a vector, what does m look like:
as.vector(m)
[1] FALSE TRUE TRUE TRUE TRUE FALSE
We see that m-as-a-vector is just the columns of m, repeated one after the other (because R "fills in" matrices column-wise). Let me re-write m with the original cells identified, in case my description isn't clear:
[1] FALSE TRUE TRUE TRUE TRUE FALSE
(1,1) (2,1) (3,1) (1,2) (2,2) (3,2)
So when we try to subset our dataframe with m, it's like using this length-6 vector, and this length-6 vector says to select rows 2:5. So when we write df[m, ] R faithfully selects rows 2 and 3, and then when it tries to select rows 4 and 5, they don't "exist" so R fills them in with NAs. This is why we get more rows in our subset than in our original dataframe.
Lastly, we saw that df[m, ] has funny rownames like NA.1. Rownames must be unique, so R calls the row 4 of the "subset" 'NA' and it calls row 5 of the subset 'NA.1'.
I hope this clears it up for you. Happy coding!

Different results for 2 subset data methods in R

I'm subseting my data, and I'm getting different results for the following codes:
subset(df, x==1)
df[df$x==1,]
x's type is integer
Am I doing something wrong?
Thank you in advance

Without example data, it is difficult to say what your problem is. However, my hunch is that the following probably explains your problem:
df <- data.frame(quantity=c(1:3, NA), item=c("Coffee", "Americano", "Espresso", "Decaf"))
df
quantity item
1 Coffee
2 Americano
3 Espresso
NA Decaf
Let's subset with [
df[df$quantity == 2,]
quantity item
2 Americano
NA <NA>
Now let's subset with subset:
subset(df, quantity == 2)
quantity item
2 Americano
We see that there is a difference in sub-setting output depending on how NA values are treated. I think of this as follows: With subset, you are explicitly stating you want the subset for which the condition is verifiably true. df$quantity==2 produces a vector of true/false-statements, but where quantity is missing, it is impossible to assign TRUE or FALSE. This is why we get the following output with an NA at the end:
df$quantity==2
[1] FALSE TRUE FALSE NA
The function [ takes this vector but does not understand what to do with NA, which is why instead of NA Decaf we get NA <NA>. If you prefer using [, you could use the following instead:
df[which(df$quantity == 2),]
quantity item
2 Americano
This translates the logical condition df$quantity == 2 into a vector or row numbers where the logical condition is "verifiably" satisfied.

Using R to count patterns in columns

I have a matrix in R containing 1000 columns and 4 rows. Each cell in the matrix contains an integer between 1-4. I want to know two things:
1) What is the number of columns that contain a "1", "2", "3", and "4" in any order? Ideally, I would like the code to not require that I input each possible combination of 1,2,3,4 to perform its count.
2) What is the number of columns that contain 3 of the possible integers, but not all 4?

Solution 1
The most obvious approach is to run apply() over the columns and test for the required tabulation of the column vector using tabulate(). This requires first building a factor() out of the column vector to normalize its storage representation to an integer vector based from 1. And since you don't care about order, we must run sort() before comparing it against the expected tabulation.
For the "4 of 4" problem the expected tabulation will be four 1s, while for the "3 of 4" problem the expected tabulation will be two 1s and one 2.
## generate data
set.seed(1L); NR <- 4L; NC <- 1e3L; m <- matrix(sample(1:4,NR*NC,T),NR);
sum(apply(m,2L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x))))));
## [1] 107
sum(apply(m,2L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x))))));
## [1] 545
Solution 2
v <- c(1L,2L,4L,8L);
sum(colSums(matrix(v[m],nrow(m)))==15L);
## [1] 107
v <- c(1L,3L,9L,27L);
s3 <- c(14L,32L,38L,16L,34L,22L,58L,46L,64L,42L,48L,66L);
sum(colSums(matrix(v[m],nrow(m)))%in%s3);
## [1] 545
Here's a slightly weird solution.
I was looking into how to use colSums() or colMeans() to try to find a quick test for columns that have 4 of 4 or 3 of 4 of the possible cell values. The problem is, there are multiple combinations of the 4 values that sum to the same total. For example, 1+2+3+4 == 10, but 1+1+4+4 == 10 as well, so just getting a column sum of 10 is not enough.
I realized that one possible solution would be to change the set of values that we're summing, such that our target combinations would sum to unambiguous values. We can achieve this by spreading out the original set from 1:4 to something more diffuse. Furthermore, the original set of values of 1:4 is perfect for indexing a precomputed vector of values, so this seemed like a particularly logical approach for your problem.
I wasn't sure what degree of diffusion would be required to make unique the sums of the target combinations. Some ad hoc testing seemed to indicate that multiplication by a fixed multiplier would not be sufficient to disambiguate the sums, so I moved up to exponentiation. I wrote the following code to facilitate the testing of different bases to identify the minimal bases necessary for this disambiguation.
tryBaseForTabulation <- function(N,tab,base) {
## make destination value set, exponentiating from 0 to N-1
x <- base^(seq_len(N)-1L);
## make a matrix of unique combinations of the original set
g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
## get the indexes of combinations that match the required tabulation
good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
## get the sums of good and bad combinations
hs <- rowSums(g[good,,drop=F]);
ns <- rowSums(g[-good,,drop=F]);
## return the number of ambiguous sums; we need to get zero!
sum(hs%in%ns);
}; ## end tryBaseForTabulation()
The function takes the size of the set (4 for us), the required tabulation (as returned by tabulate()) in sorted order (as revealed earlier, this is four 1s for the "4 of 4" problem, two 1s and one 2 for the "3 of 4" problem), and the test base. This is the result for a base of 2 for the "4 of 4" problem:
tryBaseForTabulation(4L,rep(1L,4L),2L);
## [1] 0
So we get the result we need right away; a base of 2 is sufficient for the "4 of 4" problem. But for the "3 of 4" problem, it takes one more attempt:
tryBaseForTabulation(4L,c(1L,1L,2L),2L);
## [1] 7
tryBaseForTabulation(4L,c(1L,1L,2L),3L);
## [1] 0
So we need a base of 3 for the "3 of 4" problem.
Note that, although we are using exponentiation as the tool to diffuse the set, we don't actually need to perform any exponentiation at solution run-time, because we can simply index a precomputed vector of powers to transform the value space. Unfortunately, indexing a vector with a matrix returns a flat vector result, losing the matrix structure. But we can easily rebuild the matrix structure with a call to matrix(), thus we don't lose very much with this idiosyncrasy.
The last step is to derive the destination value space and the set of sums that satisfy the problem condition. The value spaces are easy; we can just compute the power sequence as done within tryBaseForTabulation():
2L^(1:4-1L);
## [1] 1 2 4 8
3L^(1:4-1L);
## [1] 1 3 9 27
The set of sums was computed as hs in the tryBaseForTabulation() function. Hence we can write a new similar function for these:
getBaseSums <- function(N,tab,base) {
## make destination value set, exponentiating from 0 to N-1
x <- base^(seq_len(N)-1L);
## make a matrix of unique combinations of the original set
g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
## get the indexes of combinations that match the required tabulation
good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
## return the sums of good combinations
rowSums(g[good,,drop=F]);
}; ## end getBaseSums()
Giving:
getBaseSums(4L,rep(1L,4L),2L);
## [1] 15
getBaseSums(4L,c(1L,1L,2L),3L);
## [1] 14 32 38 16 34 22 58 46 64 42 48 66
Now that the solution is complete, I realize that the cost of the vector index operation, rebuilding the matrix, and the %in% operation for the second problem may render it inferior to other potential solutions. But in any case, it's one possible solution, and I thought it was an interesting idea to explore.
Solution 3
Another possible solution is to precompute an N-dimensional lookup table that stores which combinations match the problem condition and which don't. The input matrix can then be used directly as an index matrix into the lookup table (well, almost directly; we'll need a single t() call, since its combinations are laid across columns instead of rows).
For a large set of values, or for long vectors, this could easily become impractical. For example, if we had 8 possible cell values with 8 rows then we would need a lookup table of size 8^8 == 16777216. But fortunately for the sizing given in the question we only need 4^4 == 256, which is completely manageable.
To facilitate the creation of the lookup table, I wrote the following function, which stands for "N-dimensional combinations":
NDcomb <- function(N,f) {
x <- seq_len(N);
g <- do.call(expand.grid,rep(list(x),N));
array(apply(g,1L,f),rep(N,N));
}; ## end NDcomb()
Once the lookup table is computed, the solution is easy:
v <- NDcomb(4L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 107
v <- NDcomb(4L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 545

We can use colSums. Loop over 1:4, convert the matrix to a logical matrix, get the colSums, check whether it is not equal to 0 and sum it.
sapply(1:4, function(i) sum(colSums(m1==i)!=0))
#[1] 6 6 9 5
If we need the number of columns that contain 3 and not have 4
sum(colSums(m1!=4)!=0 & colSums(m1==3)!=0)
#[1] 9
data
set.seed(24)
m1 <- matrix(sample(1:4, 40, replace=TRUE), nrow=4)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Why are NAs in newly created data.frame when using logical selection? - r

Related

how to fill missing values in a vector with the mean of value before and after the missing one

Why I get NA when I do indexing a vector (or dataframe) that do not match my condition?

R subset exclusion based on string creates extra column

Different results for 2 subset data methods in R

Using R to count patterns in columns

Categories

Resources