I'm trying to use an ifelse on an array called "OutComes" but it's giving me some trouble.
     PersonNumber Risk_Factor OC_Death OnsetAge Clinical CS_Death Cure AC_Death
[1,]            1           1 99.69098       NA       NA       NA   NA       NA
[2,]            2           1 60.68009       NA       NA       NA   NA       NA
[3,]            3           0 88.67483       NA       NA       NA   NA       NA
[4,]            4           0 87.60846       NA       NA       NA   NA       NA
[5,]            5           0 78.23118       NA       NA       NA   NA       NA
Now I want to use apply to examine this table's Risk_Factor column and apply one of two functions, row by row, to replace the OnsetAge column's NAs. I've been using an apply call:
apply(OutComes, 1, function(x) ifelse(OutComes[, "Risk_Factor"] == 1,
                                      HighOnsetFunction(x), OnsetFunction(x)))
However this doesn't work, the error being:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
I'm not sure what's going on in this ifelse or what the x and y lengths are.
There is a mistake in your apply function. You are applying a function with argument x (one row of OutComes), but within ifelse you use the vector OutComes[, "Risk_Factor"], which is a whole column of the original matrix, not a single number. One simple solution is to do
apply(OutComes, 1, function(x) ifelse(x["Risk_Factor"] == 1,
                                      HighOnsetFunction(x), OnsetFunction(x)))
But when dealing with a scalar there is no real need for ifelse, so it may be more efficient to write
apply(OutComes, 1, function(x) if (x["Risk_Factor"] == 1) HighOnsetFunction(x) else OnsetFunction(x))
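A quick runnable sketch of the corrected call, with toy stand-ins for HighOnsetFunction and OnsetFunction (the real functions are not shown in the question, so these are just hypothetical placeholders):

```r
# Hypothetical placeholders for the real onset functions:
HighOnsetFunction <- function(x) x[["OC_Death"]] - 10
OnsetFunction     <- function(x) x[["OC_Death"]] - 20

OutComes <- cbind(PersonNumber = 1:3,
                  Risk_Factor  = c(1, 1, 0),
                  OC_Death     = c(99.7, 60.7, 88.7),
                  OnsetAge     = NA)

# if/else on the scalar x["Risk_Factor"], one row at a time
OutComes[, "OnsetAge"] <- apply(OutComes, 1, function(x)
  if (x[["Risk_Factor"]] == 1) HighOnsetFunction(x) else OnsetFunction(x))

OutComes[, "OnsetAge"]  # 89.7 50.7 68.7
```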
I have a problem with NA values: the columns of my dataset imported from Excel are not all the same length, so the shorter ones are padded with NA. The similarity index function Psicalc in the RInSp package then deletes every row containing an NA when it does its calculation.
B F
4 7
5 6
6 8
7 5
NA 4
NA 3
NA 2
Do you know how to handle or remove the NAs without deleting whole rows or upsetting the package? Besides, when I import the data with RInSp I get the message
In if (class(filename) == "character") { :
the condition has length > 1 and only the first element will be used
Thank you so much
Many R functions (particularly in base R) have an na.rm argument, which is FALSE by default. That means if you omit this argument and your data contain NA, your calculation will return NA. To drop the NAs from the calculation, set na.rm = TRUE.
Example:
x <- c(4,5,6,7,NA,NA)
mean(x) # Oops!
[1] NA
mean(x, na.rm=TRUE)
[1] 5.5
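A small base-R illustration of both approaches, ignoring NAs per call versus dropping the NA entries up front (note this only cleans your input; it does not change how Psicalc itself treats NAs internally):

```r
x <- c(4, 5, 6, 7, NA, NA)

# Per-call: keep the NAs in the data, ignore them in each computation
mean(x, na.rm = TRUE)  # 5.5
sum(x,  na.rm = TRUE)  # 22

# Up-front: drop the NA entries themselves
x_clean <- x[!is.na(x)]  # equivalent: as.vector(na.omit(x))
x_clean                  # 4 5 6 7
```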
The if_else() function in dplyr requires that the TRUE and FALSE branches are of the same class.
I wish to return NA from my if_else() statement.
But e.g.
if_else(mtcars$cyl > 5, NA, 1)
returns
Error: false has type 'double' not 'logical'
This is because a bare NA is logical, while 1 is numeric (double).
Wrapping as.numeric() around the NA works fine, e.g.
if_else(mtcars$cyl > 5, as.numeric(NA), 1)
returns
1 NA NA 1 NA NA NA NA 1 1 NA NA NA NA NA NA NA NA 1 1 1 1 NA NA NA NA 1 1 1 NA NA NA 1
which is what I am hoping for.
But this feels kinda silly/unnecessary. Is there a better way of inputting NA as a "numeric NA" than wrapping it like this?
NB this only applies to the stricter dplyr::if_else not base::ifelse.
You can use NA_real_:
if_else(mtcars$cyl > 5, NA_real_, 1)
Try the base function:
ifelse(mtcars$cyl > 5, NA, 1)
Or you can use if_else_ from the hablar package. It is as rigid as dplyr's if_else about types, but allows a generic NA. See:
library(hablar)
if_else_(mtcars$cyl > 5, NA, 1)
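To see why the types matter, here is a small base-R illustration of the typed NA constants (and of why base ifelse, which coerces its branches, does not complain):

```r
# NA comes in typed flavours; a bare NA is logical
class(NA)            # "logical"
class(NA_real_)      # "numeric"
class(NA_integer_)   # "integer"
class(NA_character_) # "character"

# base::ifelse coerces its branches, so a bare NA is accepted there
ifelse(c(6, 4) > 5, NA, 1)  # NA  1
```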
I have a dataset archivo containing the rates of bonds for every duration of the government auctions since 2003. The first few rows are:
Fecha       1  2  3  4  5       6  7  8       9 10 11 12 18 24
2003-01-02 NA NA NA NA NA 44.9999 NA NA 52.0002 NA NA NA NA NA
2003-01-03 NA NA NA NA NA      NA NA NA      NA NA NA NA NA NA
2003-01-06 NA NA NA NA NA      NA NA NA      NA NA NA NA NA NA
2003-01-07 NA NA NA NA NA 40.0000 NA NA 45.9900 NA NA NA NA NA
2003-01-08 NA NA NA NA NA      NA NA NA      NA NA NA NA NA NA
2003-01-09 NA NA NA NA NA 37.0000 NA NA 41.9999 NA NA NA NA NA
Every column named 1 to 24 corresponds to a different duration. (1 month, 2 months, ..., 24 months). Not all durations are sold on the auction date. That's why I have NAs.
I need to compute the missing (NA) rates with a log fitting curve for every row that has more than one value. For the rows that are all NA I can just reuse the curve constructed for the preceding row.
I'm aware I could run a code like:
x <- as.numeric(colnames(archivo[,-1])) # the durations, as numbers
y<-t(archivo[1,-1])
estimacion<-lm(y ~ log(x))
param<-estimacion$coefficients
and get the coefficients for the first row. Then run a loop and do it for every row.
Is there any way to do it directly with the entire dataset and obtain the parameters of every row (every log fitting) without doing a loop?
Hope the question is clear enough.
Thanks in advance!
Try:
dat <- as.data.frame(t(archivo[,-1])) ## transpose your data frame
## a function to fit a model `y ~ log(x)` for response vector `y`
fit_model <- function (y) {
  non_NA <- which(!is.na(y))  ## indices of non-NA rows
  if (length(non_NA) > 1) {
    ## there are at least 2 data points, so a linear model is possible
    lm.fit(cbind(1, log(non_NA)), y[non_NA])$coef
  } else {
    ## not enough data; return c(NA, NA)
    c(NA, NA)
  }
}
## fit linear model column-by-column
result <- sapply(dat, FUN = fit_model)
Note that I am using lm.fit(), the kernel fitting routine called by lm(). Have a read of ?lm.fit if you are not familiar with it. It takes 2 essential arguments:
The first is the model matrix. The model matrix for your model y ~ log(x), is matrix(c(rep(1,24), log(1:24)), ncol = 2). You can also construct it via model.matrix(~log(x), data = data.frame(x = 1:24)).
The second is the response vector. For your problem it is a column of dat.
Unlike lm(), which can handle NA, lm.fit() cannot, so we must remove the NA rows from the model matrix and response vector ourselves; the non_NA index does exactly that. Note that your model y ~ log(x) has 2 parameters / coefficients, so at least 2 data points are required for fitting. With fewer, fitting is impossible and we return c(NA, NA).
Finally, I use sapply() to fit a linear model column by column, retaining only the coefficients via $coef.
Test
I am using the example rows you posted in your question. Using the above code, I get:
#          V1 V2 V3       V4 V5       V6
# x1 14.06542 NA NA 13.53005 NA 14.90533
# x2 17.26486 NA NA 14.77316 NA 12.33127
Each column gives coefficients for each column of dat (or each row of archivo).
Update
Originally I used matrix(c(rep(1, 24), log(1:24)), ncol = 2)[non_NA, ] as the model matrix in lm.fit(). That is inefficient, though: it generates the complete model matrix and only then drops the NA rows. On second thought, cbind(1, log(non_NA)) is better.
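As a self-contained sanity check (with made-up data, since archivo is not available here), lm.fit() with an explicit model matrix gives the same coefficients as lm(), which drops the NA rows itself:

```r
set.seed(1)
x <- 1:24                               # durations, like the column names
y <- 3 + 2 * log(x) + rnorm(24, sd = 0.1)
y[c(2, 5, 9)] <- NA                     # simulate durations not auctioned

non_NA <- which(!is.na(y))              # here x[non_NA] == non_NA
co1 <- lm.fit(cbind(1, log(non_NA)), y[non_NA])$coef
co2 <- coef(lm(y ~ log(x)))             # lm() removes NA rows via na.omit

all.equal(unname(co1), unname(co2))     # TRUE
```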
This question already has answers here:
There is pmin and pmax each taking na.rm, why no psum?
(3 answers)
Closed 6 years ago.
I'd just like to understand a (for me) weird behavior of the function rowSums. Imagine I have this super simple data frame:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
a b
1 NA 2
2 NA NA
3 3 2
and now I want a third column that is the sum of the other two. I cannot simply use + because of the NAs:
df$c <- df$a + df$b
df
a b c
1 NA 2 NA
2 NA NA NA
3 3 2 5
but if I use rowSums, the rows that are all NA come out as 0, while rows with only one NA work fine:
df$d <- rowSums(df, na.rm=T)
df
a b c d
1 NA 2 NA 2
2 NA NA NA 0
3 3 2 5 10
Am I missing something?
Thanks to all!
One option with rowSums is to take the row sums with na.rm=TRUE and multiply by NA^!rowSums(!is.na(df)), which is NA for rows that are all NA and 1 otherwise (since NA^0 is 1):
rowSums(df, na.rm = TRUE) * NA^!rowSums(!is.na(df))
#[1] 2 NA 10
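The one-liner works because NA^0 is 1 in R (x^0 is 1 for any x, even NA). A small demonstration:

```r
df <- data.frame(a = c(NA, NA, 3), b = c(2, NA, 2))

NA^0  # 1
NA^1  # NA

# rowSums(!is.na(df)) counts the non-NA entries per row: 1 0 2.
# Negating it is TRUE (treated as 1) only for the all-NA row,
# so NA^... is NA exactly there and 1 everywhere else.
NA^!rowSums(!is.na(df))  # 1 NA 1
```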
Because
sum(numeric(0))
# 0
Once you use na.rm = TRUE in rowSums, the second row reduces to numeric(0), and its sum is 0.
If you want to retain NA for all NA cases, it would be a two-stage work. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
  if (is.data.frame(x)) x <- as.matrix(x)
  z <- base::rowSums(x, na.rm = TRUE)
  z[!base::rowSums(!is.na(x))] <- NA
  z
}
my_rowSums(df)
# [1] 2 NA 10
This is particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether its input is a matrix; if it gets a data frame, it converts it to a matrix first, and that type conversion is in fact more costly than the actual row-sum computation. Since we call base::rowSums twice, we make sure x is a matrix beforehand so the conversion happens only once.
For #akrun's "hacking" answer, I suggest:
akrun_rowSums <- function(x) {
  if (is.data.frame(x)) x <- as.matrix(x)
  rowSums(x, na.rm = TRUE) * NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10
I'm following the swirl tutorial, and one of the parts has a vector x defined as:
> x
[1] 1.91177824 0.93941777 -0.72325856 0.26998371 NA NA
[7] -0.17709161 NA NA 1.98079386 -1.97167684 -0.32590760
[13] 0.23359408 -0.19229380 NA NA 1.21102697 NA
[19] 0.78323515 NA 0.07512655 NA 0.39457671 0.64705874
[25] NA 0.70421548 -0.59875008 NA 1.75842059 NA
[31] NA NA NA NA NA NA
[37] -0.74265585 NA -0.57353603 NA
Then when we type x[is.na(x)] we get a vector of all NA's
> x[is.na(x)]
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Why does this happen? My confusion is that is.na(x) itself returns a logical vector of length 40, with TRUE or FALSE in each entry depending on whether that entry is NA. Why does "wrapping" this vector in x[ ] suddenly subset to the NAs themselves?
This is called logical indexing. It's a very common and neat R idiom.
Yes, is.na(x) gives a boolean ("logical") vector of the same length as your vector.
Using that logical vector for indexing is called logical indexing.
Obviously x[is.na(x)] accesses the all-NA entries of x, and that is pointless unless you intend to reassign them some other value, e.g. to impute the median (or anything else):
x[is.na(x)] <- median(x, na.rm=T)
Notes:
whereas x[!is.na(x)] accesses all non-NA entries in x
or compare also to the na.omit(x) function, which is way more clunky
The way R's builtin functions historically do (or don't) handle NAs (by default or customizably) is a patchwork-quilt mess; that's why the x[is.na(x)] idiom is so crucial.
many useful functions (mean, median, sum, sd, cor) are NA-aware, i.e. they support an na.rm=TRUE option to ignore NA values.
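Putting the idiom together on a small vector (a minimal sketch, not the actual swirl data):

```r
x <- c(1.9, NA, -0.7, NA, 0.3)

is.na(x)      # FALSE  TRUE FALSE  TRUE FALSE
x[!is.na(x)]  # the non-NA entries: 1.9 -0.7 0.3

# Median imputation via the idiom from the answer:
x[is.na(x)] <- median(x, na.rm = TRUE)
x             # 1.9 0.3 -0.7 0.3 0.3
```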