median() coming back as NA for df of integers?

I have a data frame of numerical data and am using apply with median along the columns. I'm getting NA for the median even though there are some non-zero entries in the columns. I ran str(df) to ensure all of the df is integer, and it is. What does it mean when R says the median is NA? Thanks.
v1 v2 v3.....
1 3 4
0 0 0
. . .
Also, I got a bunch of warnings like this:
"1: In mean.default(sort(x, partial = half + 0L:1L)[half + ... :
argument is not numeric or logical: returning NA"

My solution is trivial, but maybe there are some NAs you did not see. Try calling apply with na.rm = TRUE as the last argument (passed through the ellipsis).
Using the code provided by akrun.
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:5, 10*5, replace=TRUE), ncol=5))
apply(df1, 2, median)
I add an NA:
df1[3, "V2"] <- NA
and then use sapply (which works the same way here, since a data frame is a type of list):
sapply(df1, median, na.rm = TRUE)
edit:
note that str(df1) still returns int even though there is an NA at row 3, column V2.
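A quick way to spot NAs you may have missed (a small sketch, using the df1 from above):
# count missing values per column; any non-zero count explains an NA median
colSums(is.na(df1))
#V1 V2 V3 V4 V5
# 0  1  0  0  0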

Related

Creating/Populating Empty Data Frames in R

I am working with R. I found this link here on creating empty data frames in R: Create an empty data.frame.
I tried to do something similar:
df <- data.frame(Date = as.Date(character()),
                 country = factor(),
                 total = numeric(),
                 stringsAsFactors = FALSE)
Yet, when I try to populate it:
df$total = 7
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, total, value = 7) :
replacement has 1 row, data has 0
df[1, "total"] <- rnorm(100,100,100)
Error in `[<-.data.frame`(`*tmp*`, 1, "total", value = c(-79.4584309347689, :
replacement has 100 rows, data has 1
Does anyone know how to fix this error?
Thanks
An option is to specify the row index
df[1, "total"] <- 7
Output:
str(df)
#'data.frame': 1 obs. of 3 variables:
# $ Date : Date, format: NA
# $ country: Factor w/ 0 levels: NA
# $ total : num 7
The issue is that when we select a single column and assign into a 0-row dataset, the rows for the other columns are not automatically expanded. By specifying the row index, the other columns are automatically filled with the default NA.
Regarding the second (updated) question: a standard data.frame column is a vector, and the length of the replacement vector should match the number of rows we are indexing. Suppose we want to expand to 100 rows; change the index accordingly:
df[1:100, "total"] <- rnorm(100, 100, 100) # length is 100 here
dim(df)
#[1] 100 3
Or, if we need to cram everything into a single row, wrap the rnorm output in a list:
df[1, "total"] <- list(rnorm(100, 100, 100))
In short, the lhs should be of the same length as the rhs. Another case is when we are assigning from a different dataset:
df[seq_along(aa$bb), "total"] <- aa$bb
This can also be done without initialization, i.e.
df <- data.frame(total = aa$bb)
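To make the aa$bb snippets runnable, note that aa is only a placeholder for some other dataset; a hypothetical example:
# hypothetical source data frame, purely so the snippets above can be executed
aa <- data.frame(bb = c(0.5, 1.2, 3.4))
str(data.frame(total = aa$bb))
#'data.frame': 3 obs. of 1 variable:
# $ total: num 0.5 1.2 3.4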

filter variables whose column names contain a pattern

I am trying to filter out NA, NaN and Inf values from a tbl using dplyr's filter function.
The trick is that I only want to apply the filter to columns whose names contain a specific pattern. The pattern is: r1, r2, r3, etc.
I have tried to combine grep and filter to achieve this, but can't get it to work. My current code looks like this:
filter_(!is.na(grep("r[1-9]", colnames(DF), value = TRUE))
& !is.infinite(grep("r[1-9]", colnames(DF), value = TRUE))
& !is.nan(grep("r[1-9]", colnames(DF), value = TRUE)))
However, this code returns a warning message: "Truncating vector to length 1."
And the data returned is unfiltered.
I suspect that it's the is.na functions here that are causing the problem, because I've seen an example online where grep was applied inside filter using a normal condition (i.e. condition == value) rather than a condition based on is.na.
dplyr provides matches(), which is useful for this.
Example 1: how does matches() work?
library(dplyr)
# drop columns whose names match the regex "mp" (in mtcars, only mpg)
mtcars %>% select(-matches("mp"))
# keep columns whose names match "mp"
mtcars %>% select(matches("mp"))
Example 2: Using matches() in the context of your request but using a MWE
# Create a dummy dataset
data <- tibble(id = c("John", "Paul", "George", "Ringo"),
               r1 = c(1, 2, NA, NA),
               r2 = c(1, 2, NA, 4),
               s1 = c(1, NA, 3, 4))
# Drop rows with an NA in any column named r followed by a digit
data %>% filter_at(vars(matches("r[0-9]")), all_vars(!is.na(.)))
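In newer dplyr versions (1.0.4 or later), the same filter can be written with if_all(); a sketch of the equivalent call:
# same filter in the newer syntax: keep rows where every matched column is non-NA
data %>% filter(if_all(matches("r[0-9]"), ~ !is.na(.x)))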
Here is a base R method to filter rows, comparing specific columns.
# sample data
set.seed(1234)
dat <- data.frame(r1 = c(NA, 1, NaN, 5, Inf), r2 = c(NA, 1, NaN, NA, Inf), d = rnorm(5))
This data set looks like:
dat
r1 r2 d
1 NA NA -1.2070657
2 1 1 0.2774292
3 NaN NaN 1.0844412
4 5 NA -2.3456977
5 Inf Inf 0.4291247
We will check the first two columns and ignore the third column. Notice that the only row that should remain is row 2.
dat[Reduce("&", lapply(dat[grep("^r", names(dat))], is.finite)),]
r1 r2 d
2 1 1 0.2774292
Here, a data.frame subset using grep to select the appropriate columns (1 and 2) is fed to lapply. The regex "^r" says to include only variables whose names start with "r". In the lapply loop, each vector is checked with is.finite, which returns FALSE for NA, NaN, and Inf. The resulting list of logical vectors is fed to Reduce, which returns a logical vector with one element per row of the data.frame, TRUE if and only if every element in that row is finite.
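Decomposed step by step, the same one-liner looks like this (a sketch using the dat above):
cols <- dat[grep("^r", names(dat))]  # columns whose names start with "r"
flags <- lapply(cols, is.finite)     # one logical vector per column
keep <- Reduce("&", flags)           # TRUE only where every column is finite
dat[keep, ]
# r1 r2 d
#2 1 1 0.2774292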
With dplyr, you can use the filter_at function:
dat %>% filter_at(vars(matches("^r[1-9]")), all_vars(is.finite(.)))
Using @lmo's sample data, the result is:
r1 r2 d
1 1 1 0.2774292

Importing Excel csv file into RStudio & converting factors to numeric, I get either NAs or new data; tried eliminating commas but still get NAs

The Excel csv data file (called ff) has 54 columns and 788 rows of normalized data between 0 and 1, which looks like this: 0.39 0.16 0.27 0.60 ...
> str(ff)
'data.frame': 788 obs. of 54 variables:
$ V1 : Factor w/ 66 levels " - "," 0.05 ",..: 25 36 33 44 36 37 39 20
> dd <- as.numeric(as.character(ff))
Warning message:
NAs introduced by coercion
> dd <- gsub(".","",ff)
> de <- as.numeric(as.character(dd))
> str(de)
num [1:54] NA NA NA NA NA NA NA NA NA NA ...
I'm at a loss. I saw that lots of folks (perhaps beginners like me) have posted somewhat similar questions, please accept my apologies for raising this matter again. My gratitude in advance for your suggestions.
I think one problem you're having is that you're running the as.numeric(as.character(.)) call on an entire data frame, rather than a particular column. The result is a vector, the length of which equals the number of columns in your data frame (note your output is a vector of length 54, rather than 788 like you'd be hoping for from a column of your original data frame). Here's why:
When you convert a data frame to character, you get a vector back:
df <- data.frame( V1 = c(1,2,3), V2 = c(4,5,6) )
as.character( df )
[1] "c(1, 2, 3)" "c(4, 5, 6)"
Note that each vector element is not a character vector (ie: c("1","2","3")), but is in fact the vector representing that column, converted to a character string (ie: "c(1, 2, 3)"). So when you apply as.numeric to that vector, you'll get a vector back (not a data frame), and since each element can't be converted to a number (or even a numeric vector), you get NAs back:
as.numeric( as.character( df ) )
[1] NA NA
What you're more likely looking for is a conversion for a single column, rather than the entire data frame. Try:
ff$V1 <- as.numeric( as.character( ff$V1 ) )
This way you're converting a vector to a vector, which should give you the result you're after. You can do this over every column using lapply, something like:
df <- lapply( df, function(x) as.numeric( as.character( x ) ) )
df <- as.data.frame( df )
(or better yet, set the colClasses when you read the file, as per @s.brunel's comment, so that you don't need to worry about this conversion at all)
NOTE also @akrun's comment. You should expect a warning when converting a vector in which some values can't be converted to the class you want. In your case, you've got some " - " values, which can't be converted to numeric, so you'll get NAs in their place.
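For completeness, a sketch of the colClasses route (the file name here is a placeholder): declaring the NA marker up front lets read.csv build numeric columns directly, so no factor conversion is needed afterwards.
# "ff.csv" is a hypothetical file name; list every non-numeric marker in na.strings
ff <- read.csv("ff.csv", na.strings = c("NA", " - "), colClasses = "numeric")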

Using apply to find max in a data frame with missing values and strings

I have the following data set:
df <- data.frame(read.table(header = TRUE, text = "
ID N1 N2 N3 N4
1 2 3 4 5
11 NA -12 14 55
21 12 SON 34 14"))
I want to find out what is the max entry in each row. This would be, for example, 5 in the first row. Obviously, the situation is more complicated because of missing values ('NA') and a string ('SON').
I first tried the following command:
df$Result<-apply(df,1, max, na.rm= TRUE)
The results are [5, 55, SON]! Not what I wanted. I therefore tried:
checkd<- function(x) if(is.integer(x)== TRUE)max(x)
df$Result<-apply(df,1, checkd)
Funnily, it removed the last column df$Result. Does anyone know what I did wrong? Also, what would be the solution to my problem?
Also, if I try the following code:
checkd<- function(x) if(is.integer(x)== TRUE)max(x)
df$Result<-apply(df,1, checkd, na.rm= TRUE)
it gives me Error in FUN(newX[, i], ...) : unused argument (na.rm = TRUE)! Why is that? My function checkd does not generally seem to cause R any problems. Why does R reject na.rm = TRUE when I use checkd but not when I use max in apply?
Thanks,
Dom
One of the points of using a data frame is that everything in a column must have the same class. If you want to treat your data as numeric, then run as.numeric() on each column and the strings, like "SON", will be converted to NA.
Data frames are also focused on column-wise operations. If you want to go row-wise, a matrix probably makes more sense:
mat = sapply(df, function(x) as.numeric(as.character(x)))
# as.numeric(as.character()) is necessary when starting with a factor
mat
# ID N1 N2 N3 N4
# [1,] 1 2 3 4 5
# [2,] 11 NA -12 14 55
# [3,] 21 12 NA 34 14
apply(mat, 1, max, na.rm = T)
# [1] 5 55 34
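As an aside, this also explains why the original apply(df, 1, max, na.rm = TRUE) returned "SON": apply() converts the data frame to a matrix before looping, and since a matrix holds a single type, the "SON" entry forces the whole matrix (and hence every row) to character, so max() compares strings. A small check:
typeof(as.matrix(df))  # "character" -- one shared type for the whole matrix
apply(df, 1, class)    # each row reaches the function as a character vector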
"Why does R reject na.rm = TRUE when I use checkd but not when I use max in apply?"
After the first three arguments, (X, MARGIN, FUN), apply just passes arguments on through to the function you pass to FUN. If you look at the help for ?max, you'll see that it is defined to take an argument called na.rm. Your definition for checkd has no such argument. If you want to add an na.rm argument to your function, you could do it like this:
checkd <- function(x, na.rm = TRUE) if(is.integer(x)) max(x, na.rm = na.rm)
# or even this
checkd <- function(x, ...) if(is.integer(x)) max(x, ...)
Note that this function probably doesn't do what you want: it checks whether the vector you give it (a whole row in your example) consists only of integers, and if so it returns the max. Since a vector can only have one type, if there is any non-integer in there, is.integer(x) will be FALSE and the max won't be calculated.
I also deleted your == TRUE, which doesn't do anything.

dplyr join define NA values

Can I define a "fill" value for NA in a dplyr join? For example, can I specify in the join that all NA values should be 1?
require(dplyr)
lookup <- data.frame(cbind(c("USD","MYR"),c(0.9,1.1)))
names(lookup) <- c("rate","value")
fx <- data.frame(c("USD","MYR","USD","MYR","XXX","YYY"))
names(fx)[1] <- "rate"
left_join(x=fx,y=lookup,by=c("rate"))
The above code creates NA for the values "XXX" and "YYY". In my case I am joining a large number of columns and there will be a lot of non-matches. All non-matches should get the same value. I know I can do it in several steps, but can it all be done in one?
Thanks!
First off, I would like to recommend not to use the combination data.frame(cbind(...)). Here's why: cbind creates a matrix by default if you only pass atomic vectors to it, and matrices in R can only hold one type of data (think of a matrix as a vector with a dimension attribute, i.e. the numbers of rows and columns). Therefore, your code
cbind(c("USD","MYR"),c(0.9,1.1))
creates a character matrix:
str(cbind(c("USD","MYR"),c(0.9,1.1)))
# chr [1:2, 1:2] "USD" "MYR" "0.9" "1.1"
although you probably expected a final data frame with a character or factor column (rate) and a numeric column (value). But what you get is:
str(data.frame(cbind(c("USD","MYR"),c(0.9,1.1))))
#'data.frame': 2 obs. of 2 variables:
# $ X1: Factor w/ 2 levels "MYR","USD": 2 1
# $ X2: Factor w/ 2 levels "0.9","1.1": 1 2
because strings (characters) are converted to factors by default when using data.frame (you can circumvent this by specifying stringsAsFactors = FALSE in the data.frame() call).
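A quick illustration under the pre-R-4.0 default (from R 4.0 on, data.frame() already defaults to stringsAsFactors = FALSE):
str(data.frame(x = c("USD","MYR")))
#'data.frame': 2 obs. of 1 variable:
# $ x: Factor w/ 2 levels "MYR","USD": 2 1
str(data.frame(x = c("USD","MYR"), stringsAsFactors = FALSE))
#'data.frame': 2 obs. of 1 variable:
# $ x: chr "USD" "MYR"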
I suggest the following alternative approach to create the sample data (also note that you can easily specify the column names in the same call):
lookup <- data.frame(rate = c("USD","MYR"),
                     value = c(0.9,1.1))
fx <- data.frame(rate = c("USD","MYR","USD","MYR","XXX","YYY"))
Now, for your actual question: if I understand correctly, you want to replace all NAs with a 1 in the joined data. If that's correct, here's a custom function using left_join and mutate_each to do that:
library(dplyr)
left_join_NA <- function(x, y, ...) {
  left_join(x = x, y = y, by = ...) %>%
    mutate_each(funs(replace(., which(is.na(.)), 1)))
}
Now you can apply it to your data like this:
> left_join_NA(x = fx, y = lookup, by = "rate")
# rate value
#1 USD 0.9
#2 MYR 1.1
#3 USD 0.9
#4 MYR 1.1
#5 XXX 1.0
#6 YYY 1.0
#Warning message:
#joining factors with different levels, coercing to character vector
Note that you end up with a character column (rate) and a numeric column (value) and all NAs are replaced by 1.
str(left_join_NA(x = fx, y = lookup, by = "rate"))
#'data.frame': 6 obs. of 2 variables:
# $ rate : chr "USD" "MYR" "USD" "MYR" ...
# $ value: num 0.9 1.1 0.9 1.1 1 1
If you're using dplyr anyway, you might as well take advantage of dplyr::coalesce, using the dplyr syntax to pass it a 1 or 0. I think this looks nice...
... %>%
mutate_if(is.numeric,coalesce,0)
Here, the 0 is the argument passed to dplyr::coalesce to replace NAs.
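coalesce() itself fills NAs element-wise, recycling a length-one replacement, e.g.:
dplyr::coalesce(c(1, NA, 3), 0)
#[1] 1 0 3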
In the example in the question, the data frames contain factors. I feel confident one would not have FX rates as factors (or any other vector in which you'd replace NA with zero), so I add a conversion step below just to make the answer executable after the provided example.
# replace NAs with zeros for all numeric columns
#
# ... code from question above
left_join(x = fx, y = lookup, by = c("rate")) %>%
  # the value column is a factor only because this is a toy example
  mutate(value = as.numeric(as.character(value))) %>%
  # the good stuff here
  mutate_if(is.numeric, coalesce, 0)
I stumbled on the same problem with dplyr and wrote a small function that solved it (the solution requires tidyr and dplyr):
left_join0 <- function(x, y, fill = 0L){
  z <- left_join(x, y)
  tmp <- setdiff(names(z), names(x))
  z <- replace_na(z, setNames(as.list(rep(fill, length(tmp))), tmp))
  z
}
Originally answered at: R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NA's in Left Table
A tidyverse solution is to use tidyr::replace_na after the join:
left_join(x = fx, y = lookup, by = c("rate")) %>%
replace_na(list(value = 0))
Or, for more general cases:
left_join(x = fx, y = lookup, by = c("rate")) %>%
mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
