I'm trying to understand why this R code does a certain transformation.
Df[,"cutoff"] = as.numeric(levels(Df[,"cutoff"]))[Df[,"cutoff"]]
Previously, Df[,"cutoff"] is a factor with 49 levels and now after this operation, it's a vector. I just don't understand this syntax at all. Is there an explanation behind what having as.numeric(levels(Df[,"cutoff"])) does to a factor?
Thanks!
If for any reason your numbers are stored as a factor, some R functions do not treat them as numbers even though they print as numbers. For example, summary() counts the cases per level instead of returning the usual six-number summary.
See:
Df=data.frame(cutoff=factor(rep(c(2:6),2)),y=runif(10,12,15))
str(Df)
summary(Df[,"cutoff"])
2 3 4 5 6
2 2 2 2 2
#If you want the levels as numbers
Df[,"cutoff"] = as.numeric(levels(Df[,"cutoff"]))[Df[,"cutoff"]]
summary(Df[,"cutoff"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 3 4 4 5 6
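The detour through levels() matters because calling as.numeric() directly on a factor returns the internal integer codes, not the numbers you see printed. A minimal sketch with a made-up factor f (not from the question) shows the difference:

```r
# Hypothetical factor whose labels are numbers stored as text
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                # internal codes: 1 2 1 3
values <- as.numeric(levels(f))[f]     # printed values: 10 20 10 30
same   <- as.numeric(as.character(f))  # equivalent, but converts every element

print(codes)
print(values)
```

The levels() form converts each distinct level once, while the as.character() form converts every element; both give the same result.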
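The result is a vector of NA if the factor's levels are not numbers stored as text.
df <- data.frame(cutoff = factor(letters[1:26])) # factor() is needed in R >= 4.0, where strings are not converted to factors by default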
as.numeric(levels(df[,"cutoff"]))[df[,"cutoff"]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA ...
# Warning message:
# NAs introduced by coercion
Let's break it down. This shows you the levels of the factor, returned as a character vector:
levels(df[,"cutoff"])
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" ...
This tries to convert the character vector to numeric (which it can't, so it returns NA):
as.numeric(levels(df[,"cutoff"]))
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA ...
# Warning message:
# NAs introduced by coercion
Now, adding the last element, [df[,"cutoff"]]: all this does is index the converted levels by the factor df[,"cutoff"]. A factor used as an index acts through its integer codes, so each observation receives the converted value of its own level. Since every element is NA here, you won't see any difference.
as.numeric(levels(df[,"cutoff"]))[df[,"cutoff"]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA ...
# Warning message:
# NAs introduced by coercion
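The same three steps are easier to follow on a factor whose levels do look like numbers. This sketch uses a small made-up factor g (an assumption for illustration) and suggests that indexing by a factor behaves like indexing by its integer codes:

```r
g <- factor(c("0.5", "0.1", "0.5", "0.9"))  # hypothetical data

lv  <- levels(g)        # step 1: character vector "0.1" "0.5" "0.9"
num <- as.numeric(lv)   # step 2: numeric vector 0.1 0.5 0.9
idx <- as.integer(g)    # the factor's integer codes: 2 1 2 3

by_factor <- num[g]     # step 3: index by the factor itself
by_codes  <- num[idx]   # same result when indexing by the codes
print(by_factor)
```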
I've created a formula that calculates the exponential moving average of data:
myEMA <- function(price, n) {
ema <- c()
data_start <- which(!is.na(price))[1]
ema[1:(data_start+n-2)] <- NA
ema[data_start+n-1] <- mean(price[data_start:(data_start+n-1)])
beta <- 2/(n+1)
for(i in (data_start+n):length(price)) {
ema[i] <- beta*price[i] +
(1-beta)*ema[i-1]
}
ema <- reclass(ema,price)
return(ema)
}
The data I'm using is:
pricesupdated <- data.frame(a = seq(1,100), b = seq(1,200,2), c = c(NA,NA,NA,seq(1,97)))
I would like to create a dataframe where I apply the formula to each variable in my above data.frame. My attempt was:
frameddata <- data.frame(myEMA(pricesupdated,12))
But the error message that I get is:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'mean': undefined columns selected
I'm able to print the answer that I want, but not create a dataframe...
Can you help me?
First of all, myEMA() is a function, not a formula. Check out help("function") and help("formula") for details on the distinction.
The myEMA() function takes a numeric vector as its first argument and returns a numeric vector of the same length.
A data.frame object is basically just a list of vectors with a special class attribute. The most common way to repeat a function call across each element of a list is to use one of the *apply family of functions. For example, you can use lapply(), which calls myEMA() once on each variable in pricesupdated and returns a list with one element per call, each containing that call's return value (a numeric vector). This list can easily be converted back with as.data.frame(), since all its elements have the same length:
results <- lapply(pricesupdated, myEMA, n = 12)
# look at the structure of the results object
> str(results)
List of 3
$ a: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ b: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ c: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
frameddata <- as.data.frame(results)
# look at the top 15 records in this object
> head(frameddata, 15)
a b c
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
11 NA NA NA
12 6.5 12 NA
13 7.5 14 NA
14 8.5 16 NA
15 9.5 18 6.5
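The original call most likely failed because the whole data frame was passed as price, so price[data_start:(data_start+n-1)] tried to select columns 1 to 12 of a three-column data frame. The column-wise approach can also be written with vapply(), which checks the shape of each result. This is only a sketch: ema_vec is a hypothetical, simplified stand-in for myEMA() that returns a plain numeric vector and skips reclass(), and it assumes each column has at least n values after its first non-missing entry.

```r
# Hypothetical simplified version of myEMA(): same recursion, no reclass()
ema_vec <- function(price, n) {
  start <- which(!is.na(price))[1]        # first non-missing observation
  first <- start + n - 1                  # position of the first EMA value
  ema <- rep(NA_real_, length(price))
  ema[first] <- mean(price[start:first])  # seed with a simple average
  beta <- 2 / (n + 1)
  if (first < length(price)) {
    for (i in (first + 1):length(price)) {
      ema[i] <- beta * price[i] + (1 - beta) * ema[i - 1]
    }
  }
  ema
}

pricesupdated <- data.frame(a = seq(1, 100), b = seq(1, 200, 2),
                            c = c(NA, NA, NA, seq(1, 97)))

# vapply() insists that every column yields a numeric vector of this length
ema_mat <- vapply(pricesupdated, ema_vec, numeric(nrow(pricesupdated)), n = 12)
frameddata <- as.data.frame(ema_mat)
print(frameddata[12:15, ])
```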
The question is likely a duplicate, ...
but the apply-family might help, e.g.
sapply(pricesupdated, myEMA, n=12)
For reproducibility, it would be beneficial to add require(pec).
I have a dataset archivo containing the rates of bonds for every duration of the government auctions since 2003. The first few rows are:
Fecha 1 2 3 4 5 6 7 8 9 10 11 12 18 24
2003-01-02 NA NA NA NA NA 44.9999 NA NA 52.0002 NA NA NA NA NA
2003-01-03 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-06 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-07 NA NA NA NA NA 40.0000 NA NA 45.9900 NA NA NA NA NA
2003-01-08 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-09 NA NA NA NA NA 37.0000 NA NA 41.9999 NA NA NA NA NA
Every column named 1 to 24 corresponds to a different duration (1 month, 2 months, ..., 24 months). Not all durations are sold on every auction date; that's why I have NAs.
I need to estimate the missing (NA) rates with a log fitting curve for every row that has more than one value. For the rows that are all NA, I can just use the preceding fitted curve.
I'm aware I could run a code like:
x<-as.numeric(colnames(archivo[,-1])) # keep the durations as numbers so that log(x) works
y<-t(archivo[1,-1])
estimacion<-lm(y ~ log(x))
param<-estimacion$coefficients
and get the coefficients for the first row. Then run a loop and do it for every row.
Is there any way to do it directly with the entire dataset and obtain the parameters of every row (every log fitting) without doing a loop?
Hope the question is clear enough.
Thanks in advance!
Try:
dat <- as.data.frame(t(archivo[,-1])) ## transpose your data frame
## a function to fit a model `y ~ log(x)` for response vector `y`
fit_model <- function (y) {
non_NA <- which(!is.na(y)) ## non-NA rows index
if (length(non_NA) > 1) {
## there are at least 2 data points, a linear model is possible
lm.fit(cbind(1, log(non_NA)), y[non_NA])$coef
} else {
## not sufficient number of data, return c(NA, NA)
c(NA, NA)
}
}
## fit linear model column-by-column
result <- sapply(dat, FUN = fit_model)
Note that I am using lm.fit(), the kernel fitting routine called by lm(). Have a read of ?lm.fit if you are not familiar with it. It takes 2 essential arguments:
The first is the model matrix. The model matrix for your model y ~ log(x) is matrix(c(rep(1,24), log(1:24)), ncol = 2). You can also construct it via model.matrix(~log(x), data = data.frame(x = 1:24)).
The second is the response vector. For your problem it is a column of dat.
Unlike lm(), which can handle NA, lm.fit() cannot. So we need to remove NA rows from the model matrix and the response vector ourselves. The non_NA variable does this. Note that your model y ~ log(x) involves 2 parameters / coefficients, so at least 2 data points are required for fitting. If there are not enough data, model fitting is impossible and we return c(NA, NA).
Finally, I use sapply() to fit a linear model column by column, retaining coefficients only by $coef.
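One caveat: log(non_NA) uses the column position as x. In the question the columns run from 1 to 12 and then jump to 18 and 24, so position and duration differ for the last two columns. A hedged sketch that passes the durations explicitly, assuming the column names are the durations in months (fit_row and the small rates matrix are made up for illustration):

```r
durations <- c(1:12, 18, 24)

# Made-up stand-in for archivo[,-1], holding three of the posted rows
rates <- matrix(NA_real_, nrow = 3, ncol = length(durations),
                dimnames = list(NULL, as.character(durations)))
rates[1, c("6", "9")] <- c(44.9999, 52.0002)
rates[3, c("6", "9")] <- c(40.0000, 45.9900)

# Hypothetical helper: fit y ~ log(x) on the non-missing points of one row
fit_row <- function(y, x) {
  ok <- !is.na(y)
  if (sum(ok) > 1) {
    lm.fit(cbind(1, log(x[ok])), y[ok])$coefficients
  } else {
    c(NA_real_, NA_real_)   # not enough points for two coefficients
  }
}

# One row of coefficients (intercept, slope) per row of rates
coefs <- t(apply(rates, 1, fit_row, x = durations))
print(coefs)
```

For rows whose observed durations are all 12 or below, this gives the same coefficients as the position-based version.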
Test
I am using the example rows you posted in your question. Using the above code, I get:
# V1 V2 V3 V4 V5 V6
# x1 14.06542 NA NA 13.53005 NA 14.90533
# x2 17.26486 NA NA 14.77316 NA 12.33127
Each column gives coefficients for each column of dat (or each row of archivo).
Update
Originally I used matrix(c(rep(1,24), log(1:24)), ncol = 2)[non_NA, ] as the model matrix in lm.fit(). This is not efficient, though: it first generates the complete model matrix and then drops the rows with NA. On second thought, this is better: cbind(1, log(non_NA)).
I assigned a matrix to a name which varies with j:
j <- 2L
assign(paste0("pca", j,".FAVAR_fcst", sep=""), matrix(ncol=24, nrow=12))
This works very neatly. Then I try to access a column of that matrix:
paste0("pca", j,".FAVAR_fcst", sep="")[,2]
and get the following error:
Error in paste0("pca", j, ".FAVAR_fcst", sep = "")[, 2] :
incorrect number of dimensions
I've tried several variations and combinations with cat(), print() and capture.output(), but nothing seems to work. I'm not sure what I have to search exactly for and couldn't find a solution. Can you help me?
You can use get():
# paste0() already concatenates without a separator, so no sep argument is needed
get(paste0("pca", j, ".FAVAR_fcst")) # for the matrix
get(paste0("pca", j, ".FAVAR_fcst"))[,2] # for the column
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
Another solution would be to combine eval() and as.symbol():
eval(as.symbol(paste0("pca", j, ".FAVAR_fcst")))[,2]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
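A different technique from get() and assign() is to keep the matrices in a named list, so the generated name is used as a list key rather than as a variable name. A minimal sketch, where fcst is a hypothetical container name not taken from the question:

```r
j <- 2L
fcst <- list()                           # hypothetical container
name <- paste0("pca", j, ".FAVAR_fcst")  # "pca2.FAVAR_fcst"
fcst[[name]] <- matrix(NA_real_, nrow = 12, ncol = 24)

second_col <- fcst[[name]][, 2]  # read a column
fcst[[name]][, 2] <- 0           # write a column in place, no assign() needed
print(names(fcst))
```

With a list, writing into a column works directly, whereas get() only returns a copy of the object.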
I'm following the swirl tutorial, and one of the parts has a vector x defined as:
> x
[1] 1.91177824 0.93941777 -0.72325856 0.26998371 NA NA
[7] -0.17709161 NA NA 1.98079386 -1.97167684 -0.32590760
[13] 0.23359408 -0.19229380 NA NA 1.21102697 NA
[19] 0.78323515 NA 0.07512655 NA 0.39457671 0.64705874
[25] NA 0.70421548 -0.59875008 NA 1.75842059 NA
[31] NA NA NA NA NA NA
[37] -0.74265585 NA -0.57353603 NA
Then when we type x[is.na(x)] we get a vector of all NA's
> x[is.na(x)]
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Why does this happen? My confusion is that is.na(x) itself returns a vector of length 40 with TRUE or FALSE in each entry, depending on whether that entry is NA or not. Why does "wrapping" this vector with x[ ] suddenly subset to the NAs themselves?
This is called logical indexing. It's a very common and neat R idiom.
Yes, is.na(x) gives a boolean ("logical") vector of the same length as your vector.
Using that logical vector for indexing is called logical indexing: x[i] keeps exactly those elements of x where i is TRUE.
So x[is.na(x)] accesses all the NA entries in x, which is pointless unless you intend to reassign them to some other value, e.g. to impute the median (or anything else):
x[is.na(x)] <- median(x, na.rm=TRUE)
Notes:
x[!is.na(x)], by contrast, accesses all non-NA entries in x.
Compare also the na.omit(x) function, which is way more clunky.
The way R's built-in functions historically do (or don't) handle NAs (by default or customizably) is a patchwork-quilt mess; that's why the x[is.na(x)] idiom is so crucial.
Many useful functions (mean, median, sum, sd) are NA-aware, i.e. they support an na.rm=TRUE option to ignore NA values; cor() uses its use= argument instead. See here. Also for how to define table_, mode_, clamp_
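The pieces above can be tried on a short vector. This is a small sketch with made-up data, not the swirl vector:

```r
x <- c(1.5, NA, -0.5, NA, 2)   # made-up data

mask      <- is.na(x)     # FALSE TRUE FALSE TRUE FALSE
na_part   <- x[mask]      # the two NA entries
present   <- x[!mask]     # 1.5 -0.5 2.0
positions <- which(mask)  # 2 4

# Impute: the median of 1.5, -0.5 and 2 is 1.5
x[mask] <- median(x, na.rm = TRUE)
print(x)
```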
I cannot seem to add two columns in R.
When I try
dat$V1 + dat$V2
I get
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
Lots of other questions suggest doing what I have done; however, as you can see, this does not work for me. What is the problem?
Try converting your factor columns to numeric. If V1 and V2 are the first two columns:
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
For example:
dat <- data.frame(V1= factor(1:5), V2= factor(6:10))
dat$V1+dat$V2
#[1] NA NA NA NA NA
#Warning message:
#In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
#[1] 7 9 11 13 15
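If the columns were never meant to be factors, it can help to check which levels fail to convert before coercing. A small sketch with a made-up factor v; the "n/a" entry is only an assumption about the kind of value that causes this:

```r
# Made-up column that became a factor because of one non-numeric entry
v <- factor(c("1", "2", "n/a", "4"))

converted  <- suppressWarnings(as.numeric(levels(v)))
bad_levels <- levels(v)[is.na(converted)]  # levels that are not numbers
print(bad_levels)

v_num <- converted[v]  # numeric column, NA where the entry was not a number
print(v_num)
```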