The Problem
Here is a simple tapply example:
z=data.frame(s=as.character(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity)
On R (Another Canoe) v3.3.3 (2017-03-06) for Windows, it returns:
# 1 2
# 1 NA NA
# 2 NA NA
On R (You Stupid Darkness) v3.4.0 (2017-04-21) for Windows, it returns:
# 1 2
# 1 NA NA
# 2 NA ""
R News References
According to
the NEWS file for R 3.4.0:
tapply() gets new option default = NA allowing to change the previously hardcoded value.
In this instance, however, it seems to default to an empty string instead.
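On 3.4.0 and later the new argument can at least be passed explicitly; a minimal sketch (whether an explicit default avoids the coercion observed above may depend on the exact 3.4.x version):
tapply(z$s, list(z$rows, z$cols), identity, default = NA_character_)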
Inconsistencies Among Data Types
The new behavior is inconsistent with the numeric or logical version, where one still gets all NAs:
z=data.frame(s=as.numeric(NA), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
tapply(z$s, list(z$rows, z$cols), identity)
# 1 2
# 1 NA NA
# 2 NA NA
The same holds for s=NA, which is the same as s=as.logical(NA).
An Even Worse Case
In a more realistic context the character vector s in z has several values including NAs.
z=data.frame(s=c('a', NA, 'c'), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m
# s rows cols
# 1 a 1 1
# 2 <NA> 2 1
# 3 c 1 2
# 1 2
# 1 "a" "c"
# 2 NA ""
In general, we might fix this by setting missing values for combinations with no values:
m[!nzchar(m)]=NA; m
# 1 2
# 1 "a" "c"
# 2 NA NA
Now, when there is no value, as in cell (2,2), one correctly gets an NA, as in the old versions.
But what if the input of tapply already has some empty strings?
z=data.frame(s=c('a', NA, ''), rows=c(1,2,1), cols=c(1,1,2), stringsAsFactors=FALSE)
m=tapply(z$s, list(z$rows, z$cols), identity)
z;m
# s rows cols
# 1 a 1 1
# 2 <NA> 2 1
# 3 1 2
# 1 2
# 1 "a" ""
# 2 NA ""
Now there is no way to distinguish the legitimate empty string in (1,2) from the one the new tapply artificially added in (2,2) in place of the NA, so the fix above cannot be applied.
Questions
Is the new behavior really the correct one?
That is, if there is no string for rows=2 and cols=2, why is this not reported as a missing value (NA), and why does this happen only for character data types?
Can we rewrite the code above so as to get consistent behavior across R versions?
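One portable workaround, offered as a sketch that does not depend on what any given tapply version fills into empty cells: count the observations per cell first, then overwrite the cells that received none.
n <- tapply(z$s, list(z$rows, z$cols), length)   # per-cell observation counts
m <- tapply(z$s, list(z$rows, z$cols), identity)
m[is.na(n)] <- NA   # cells with zero observations are truly missing, on any R version
m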
Related
So I have this data frame df:
df<-data.frame(a=runif(10),b=sample(10),c=c(1,2,3,NA,1,2,1,4,5,3))
> head(df,2)
a b c
1 0.503718016 4 1
2 0.253538589 10 2
So for the case where:
>df$a_new<-NA
> head(df,2)
a b c a_new
1 0.503718016 4 1 NA
2 0.253538589 10 2 NA
I then thought of a quick and dirty solution: creating objects with peculiar names such as df$XXX_new and assigning values via assign() (of course, XXX represents a variable that runs over a vector of names, namely names(df)):
for (ll in names(df))
  assign(paste0("df$", ll, "_new"), NA)
I was expecting new columns to appear in my old df. This is not the case.
>head(df)
a b c a_new
1 0.503718016 4 1 NA
2 0.253538589 10 2 NA
Is there an explanation as to why this occurs?
In assign(), the first argument x is a variable name given as a character string, and it is taken literally: it cannot contain the $ extraction operator, so assign("df$a_new", NA) does not add a column to df but instead creates a new object whose name contains the characters "df$a_new".
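To see what the failing loop actually did (a quick sketch; these objects live in the global environment, not in df):
exists("df$a_new")   # TRUE: an object literally named "df$a_new" was created
get("df$a_new")      # NA
With assign(), the first argument must therefore be just "df", and the value the whole data frame with the new column added: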
for(ll in names(df)) assign("df", `[<-`(df, paste0(ll, "_new"), value = NA))
head(df, 2)
# a b c a_new b_new c_new
#1 0.2925740 7 1 NA NA NA
#2 0.2248911 4 2 NA NA NA
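A simpler alternative that avoids assign() entirely (a sketch, not part of the original answer): starting from the original df, before any _new columns exist, index the data frame by the new column names and assign in one vectorized step.
df[paste0(names(df), "_new")] <- NA   # adds a_new, b_new, c_new at once
head(df, 2)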
data
set.seed(24)
df<-data.frame(a=runif(10),b=sample(10),c=c(1,2,3,NA,1,2,1,4,5,3))
I have a census dataset with some missing values indicated by a ?.
When checking for incomplete cases in R, it says there are none, because R takes the ? as a valid character. Is there any way to change all the ?s to NAs? I would like to run multiple imputation using the mice package to fill in the missing data afterwards.
For data frames: you may need to fiddle with the quotation marks, and I have not tested this.
df[df == "?"] <- NA
Creating data frame df
df <- data.frame(A=c("?",1,2),B=c(2,3,"?"))
df
# A B
# 1 ? 2
# 2 1 3
# 3 2 ?
I. Using the replace() function
replace(df,df == "?",NA)
# A B
# 1 <NA> 2
# 2 1 3
# 3 2 <NA>
II. When importing a file containing ?
data <- read.table("xyz.csv", sep = ",", header = TRUE, na.strings = c("?", "NA"))
data
# A B
# 1 1 NA
# 2 2 3
# 3 3 4
# 4 NA NA
# 5 NA NA
# 6 4 5
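One caveat, as a hedged aside: if the columns were read in as factors (the default for read.table before R 4.0.0), the "?" level lingers after the replacement; droplevels() removes it. A sketch:
df <- data.frame(A = c("?", 1, 2), B = c(2, 3, "?"), stringsAsFactors = TRUE)
df[df == "?"] <- NA
df <- droplevels(df)   # drop the now-unused "?" factor level
str(df)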
Looking for a quick-and-easy solution to a problem which I have only been able to solve inelegantly, by looping. I have an ID vector which looks something like this:
id<-c(NA,NA,1,1,1,NA,1,NA,2,2,2,NA,3,NA,3,3,3)
The NAs that fall inside a run of a single number (id[6], id[14]) need to be replaced by that number. However, the NAs that don't meet this condition (those between runs of two different numbers, or at the ends) need to be left alone (i.e., id[1], id[2], id[8], id[12]). The target vector is therefore:
id.target<-c(NA,NA,1,1,1,1,1,NA,2,2,2,NA,3,3,3,3,3)
This is not difficult to do by looping through each value, but I am looking to do this to many very long vectors, and was hoping for a neater solution. Thanks for any suggestions.
This seems to work. The idea is to use zoo::na.locf to fill the NAs, and then to reinsert NAs where they sit between different numbers:
id.target <- zoo::na.locf(id, na.rm = FALSE)                # carry the last value forward
id.target[(c(diff(id.target), 1L) > 0L) & is.na(id)] <- NA  # undo the fill where the value changes
id.target
## [1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
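If zoo is unavailable, here is a base R sketch of the same idea with an ad hoc locf() helper (the helper name is mine, not from any package): fill forward and backward, and keep a filled value only where both directions agree.
locf <- function(x) {             # last observation carried forward
  i <- cumsum(!is.na(x))          # index of the last non-NA seen so far
  c(NA, x[!is.na(x)])[i + 1L]     # index 0 maps to a leading NA
}
f <- locf(id)                     # forward fill
b <- rev(locf(rev(id)))           # backward fill
id.target <- ifelse(!is.na(f) & !is.na(b) & f == b, f, id)
id.target
## [1] NA NA  1  1  1  1  1 NA  2  2  2 NA  3  3  3  3  3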
Here is a base R option
# For each id value, span the full range of indices it covers
d1 <- do.call(rbind, lapply(split(seq_along(id), id), function(x) {
  i1 <- min(x):max(x)
  data.frame(val = unique(id[x]), i1)
}))
id[seq_along(id) %in% d1$i1 ] <- d1$val
id
#[1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
I'm a newbie to R, but really like it and want to improve constantly. Now, after searching for a while, I need to ask you for help.
This is the given case:
I have sentences (sentence.1 and sentence.2; all words are already lower-case) and create sorted frequency lists of their words:
sentence.1 <- "bob buys this car, although his old car is still fine." # saves the sentence into sentence.1
sentence.2 <- "a car can cost you very much per month."
sentence.1.list <- strsplit(sentence.1, "\\W+", perl=T) # split the sentence at non-word characters (these commands are thanks to Stefan Gries)
sentence.2.list <- strsplit(sentence.2, "\\W+", perl=T)
sentence.1.vector <- unlist(sentence.1.list) # then we create a vector of the list
sentence.2.vector <- unlist(sentence.2.list) # vectorizes the list
sentence.1.freq <- table(sentence.1.vector) # and finally create the frequency lists
sentence.2.freq <- table(sentence.2.vector)
These are the results:
sentence.1.freq:
although bob buys car fine his is old still this
1 1 1 2 1 1 1 1 1 1
sentence.2.freq:
a can car cost month much per very you
1 1 1 1 1 1 1 1 1
Now, how could I combine these two frequency lists so that I get the following:
a although bob buys can car cost fine his is month much old per still this very you
NA 1 1 1 NA 2 NA 1 1 1 NA NA 1 NA 1 1 NA NA
1 NA NA NA 1 1 1 NA NA NA 1 1 NA 1 NA NA 1 1
Thus, this "table" should be "flexible" so that in case of entering a new sentence with the word, e.g. "and", the table would add the column with the label "and" between "a" and "although".
I thought of just adding new sentences into a new row and putting all not word that are not yet in the list column-wise (here, "and" would be to the right of "you") and sort the list again. However, I haven't managed this as already the sorting of the new sentence's words' frequencies according to the existing labels haven't been working (when there is e.g., "car" again, the new sentence's frequency of car should be written into the new sentence's row and the column of "car", but when there is e.g. "you" for the 1st time, its frequency should be written into the new sentence's row and a new column labeled "you").
This isn't exactly what you describe, but what you're aiming for makes more sense to me organized by row, rather than by column (and R handles data organized this way a bit more easily anyway).
#Convert tables to data frames
a1 <- as.data.frame(sentence.1.freq)
a2 <- as.data.frame(sentence.2.freq)
#There are other options here, see note below
colnames(a1) <- colnames(a2) <- c('word','freq')
#Then merge
merge(a1,a2,by = "word",all = TRUE)
word freq.x freq.y
1 although 1 NA
2 bob 1 NA
3 buys 1 NA
4 car 2 1
5 fine 1 NA
6 his 1 NA
7 is 1 NA
8 old 1 NA
9 still 1 NA
10 this 1 NA
11 a NA 1
12 can NA 1
13 cost NA 1
14 month NA 1
15 much NA 1
16 per NA 1
17 very NA 1
18 you NA 1
You can then keep using merge to add more sentences. I converted the column names for simplicity, but there are other options. Using the by.x and by.y arguments instead of just by in merge can indicate the specific columns to merge on if the names aren't the same in each data frame. Also, the suffixes argument in merge controls how the count columns are given unique names. The default is to append .x and .y, but you can change that.
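For example, a sketch of both (the names freq.list, .s1, and .s2 are mine):
# Per-sentence column names via the suffixes argument
merge(a1, a2, by = "word", all = TRUE, suffixes = c(".s1", ".s2"))
# Combine any number of sentences: rename each count column first,
# then fold merge over the list so column names never collide
freq.list <- list(s1 = a1, s2 = a2)   # append further frequency data frames here
for (nm in names(freq.list)) names(freq.list[[nm]])[2] <- paste0("freq.", nm)
Reduce(function(x, y) merge(x, y, by = "word", all = TRUE), freq.list)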
Consider the following code. When you don't explicitly test for NA in your condition, that code will fail at some later date, when your data changes.
> # A toy example
> a <- as.data.frame(cbind(col1=c(1,2,3,4),col2=c(2,NA,2,3),col3=c(1,2,3,4),col4=c(4,3,2,1)))
> a
col1 col2 col3 col4
1 1 2 1 4
2 2 NA 2 3
3 3 2 3 2
4 4 3 4 1
>
> # Bummer, there's an NA in my condition
> a$col2==2
[1] TRUE NA TRUE FALSE
>
> # Why is this a good thing to do?
> # It NA'd the whole row, and kept it
> a[a$col2==2,]
col1 col2 col3 col4
1 1 2 1 4
NA NA NA NA NA
3 3 2 3 2
>
> # Yes, this is the right way to do it
> a[!is.na(a$col2) & a$col2==2,]
col1 col2 col3 col4
1 1 2 1 4
3 3 2 3 2
>
> # Subset seems designed to avoid this problem
> subset(a, col2 == 2)
col1 col2 col3 col4
1 1 2 1 4
3 3 2 3 2
Can someone explain why the behavior you get without the is.na check would ever be good or useful?
I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly designed to notify the user of NA or NaN values. See ?"==" for more information. It states:
Missing values ('NA') and 'NaN' values are regarded as
non-comparable even to themselves, so comparisons involving them
will always result in 'NA'.
In other words, a missing value isn't comparable using a binary operator (because it's unknown).
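A one-line illustration:
NA == NA   # NA: whether two unknown values are equal is itself unknown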
Beyond is.na(), you could also do:
which(a$col2==2) # tests explicitly for TRUE
Or
a$col2 %in% 2 # only checks for 2
%in% is defined as using the match() function:
'"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
This is also covered in "The R Inferno".
Checking for NA values in your data is crucial in R, because many important operators don't handle them the way you expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there were an NA here" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".
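A few quick demonstrations of NA propagating through these operations:
x <- c(1, NA, 3)
sum(x)                # NA: the total is unknown
sum(x, na.rm = TRUE)  # 4: explicitly drop the missing value
NA < 1                # NA
NA & FALSE            # FALSE: the result is known regardless of the NA
NA & TRUE             # NA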
Update: How is NA handled when there are multiple logical conditions?
NA is a logical constant and you might get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE is TRUE). These truth tables from ?Logic may provide a useful illustration:
outer(x, x, "&") ## AND table
# <NA> FALSE TRUE
#<NA> NA FALSE NA
#FALSE FALSE FALSE FALSE
#TRUE NA FALSE TRUE
outer(x, x, "|") ## OR table
# <NA> FALSE TRUE
#<NA> NA NA TRUE
#FALSE NA FALSE TRUE
#TRUE TRUE TRUE TRUE