How is as.numeric used here? - r

I've been trying to figure out how to mimic, in R, a piecewise linear regression model developed in the pricing software Emblem. I did that using @Roland's answer in the post below.
https://stats.stackexchange.com/questions/61805/standard-error-of-slopes-in-piecewise-linear-regression-with-known-breakpoints
So to get the slopes, thanks to @Roland, I used as.numeric((variable < X)) to get the slope of the second segment in the predictor variables.
What is going on here? Why does the "as.numeric" give me the correct answer? I can't find documentation on it and I would like to understand why this works.

It converts a boolean (TRUE / FALSE) value to numeric (1 / 0).
(The R-y name for boolean is "logical": is.logical(TRUE) returns TRUE.)
x < 10 # TRUE if x is less than 10, FALSE if x is 10 or more
as.numeric(x<10) # 1 if x is less than 10, 0 if x is 10 or more
This being said, you don't really need an as.numeric there. What you could do instead is:
# will also work:
mod2 <- lm(y~I((x<9.6)*x)+(x<9.6)+I((x>=9.6)*x)+(x>=9.6)-1)
This version uses the boolean values directly -- lm treats them like factors, and a factor within lm is converted into k-1 dichotomous (dummy) variables, where k is the number of levels. That's why, if you use the code above, you'll see variable names like x < 9.6TRUE in the lm output.
Then again, technically, as.numeric is a hack, and a more transparent way to do it may be something like ifelse(x<9.6,1,0). But hacks are not necessarily bad, so you might also prefer a hackier hack such as (x<9.6)*1. That won't work as-is within a formula, because * has a special meaning in formulas, so you'd have to wrap it in I(): I((x<9.6)*1). I'd say as.numeric looks cleaner.
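A small sketch of the equivalences mentioned above, using a made-up vector x and the 9.6 breakpoint from the formula. The names below and seg_below are only illustrative and do not come from the original model.

```r
# Made-up predictor values around the 9.6 breakpoint (illustrative only)
x <- c(5, 9.6, 12)

below <- x < 9.6                 # logical: TRUE FALSE FALSE
as.numeric(below)                # numeric: 1 0 0

# The three spellings discussed above give the same numeric vector
identical(as.numeric(below), ifelse(below, 1, 0))   # TRUE
identical(as.numeric(below), below * 1)             # TRUE

# Multiplying by x keeps x below the breakpoint and zeroes it elsewhere,
# which is what lets lm() estimate a separate slope for that segment
seg_below <- as.numeric(below) * x                  # 5 0 0
```

Note that 9.6 itself is not below the breakpoint, so it lands in the x >= 9.6 segment.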

Related

Implementing equations in R

I am new to R (also not too good at math) and I am trying to calculate this equation in R with some difficulties:
X is some integer data I have, with 550 samples.
Any help is appreciated since I am unsure how to do this. I think I have to use a for loop and the sum() function, but other than that I don't know.
R supports vectorisation, which means you very rarely need to implement for loops.
For example, you can solve your equation like so:
## I'm just making up a long numerical vector for x - obviously you can use anything
x <- 1:1000
solution <- sum(20/x)^0.5
Unless the brackets denote the integral, rather than the sum? In which case:
solution <- sum( (20/x)^0.5 )
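To make the vectorisation point concrete, here is a minimal sketch comparing an explicit for loop with the vectorised first form. It assumes the same made-up x as in the answer; since the original equation is not shown, this only illustrates the technique, not the asker's actual formula.

```r
# Same made-up vector as in the answer
x <- 1:1000

# Explicit loop: accumulate 20/x term by term, then take the square root
total <- 0
for (xi in x) {
  total <- total + 20 / xi
}
loop_result <- total^0.5

# Vectorised equivalent: 20/x is computed for every element at once
vec_result <- sum(20 / x)^0.5

# The two agree up to floating-point tolerance
all.equal(loop_result, vec_result)
```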

Subselection of a variable

I have a problem with selecting a variable that should contain a certain range of values. I want to split up my variable into 3 categories, namely small, medium and big. Some context: I have a variable named obj_hid_woonopp, which is the size in m2 and ranges from 16 to 375. My dataset is called datalogitvar.
I'm sorry I have no reproducible code. But since I think it's a rather simple question, I hope it can be answered nonetheless. The code that I'm using is as follows:
datalogitvar$size_small<- as.numeric(obj_hid_WOONOPP>="15" & obj_hid_WOONOPP<="75" )
datalogitvar$size_medium<- as.numeric(obj_hid_WOONOPP>="76" & obj_hid_WOONOPP<="100" )
datalogitvar$size_large<- as.numeric(obj_hid_WOONOPP>="101")
When I run this, I do get a result, just not the result I'm hoping for. For example, the small category also contains very high numbers. It seems that (since I define "75") it also takes values of "175", since it contains "75". I've been thinking about it and I feel it reads my data as text and not as numbers. However, I do say as.numeric, so I'm a bit confused. Can someone explain to me how I make sure I create these 3 variables with the proper range? I feel I'm close, but the result is useless so far.
Thank you so much for helping.
For a question like this you can replicate your problem with a publicly available dataset like mtcars.
And regarding your code
1) You need to name the dataset on the right-hand side of your code as well, i.e. DATASET$obj_hid_WOONOPP.
2) Why are you using quotes around your numeric values? The quotes prevent the numbers from being treated as numbers; they are treated as strings instead.
I think you want to use something like the code I've written below.
mtcars$mpg_small <- as.numeric(mtcars$mpg >= 15 & mtcars$mpg <= 20)
mtcars$mpg_medium <- as.numeric(mtcars$mpg > 20 & mtcars$mpg <= 25)
mtcars$mpg_large <- as.numeric(mtcars$mpg > 25)
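Just to illustrate your problem:
a <- "75"
b <- "175"
a > b
## [1] TRUE    (whereas the numeric comparison 75 > 175 is FALSE)
a < b
## [1] FALSE   (whereas the numeric comparison 75 < 175 is TRUE)
Strings are compared as text, not as numbers, so they don't compare as you'd expect them to.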
Two ideas come to mind, though an example of code would be helpful.
First, look into the documentation for cut(), which can be used to convert numeric vector into factors based on cut-points that you set.
Second, as @MrFlick points out, your code could be rewritten so that as.numeric() is first run on the character vector containing the strings you want to convert to numeric values, and THEN the Boolean comparisons such as > or & are performed.
To build on @Joe's answer:
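As a rough sketch of the cut() idea, assuming made-up sizes within the 16-375 range from the question and the cut-points 75 and 100 implied by the asker's categories (the names size and size_cat are hypothetical):

```r
# Made-up sizes in m2, covering the 16-375 range from the question
size <- c(16, 75, 76, 100, 101, 175, 375)

# Intervals are (15,75], (75,100], (100,Inf]: closed on the right by default
size_cat <- cut(size,
                breaks = c(15, 75, 100, Inf),
                labels = c("small", "medium", "large"))

# Count how many observations fall in each category
table(size_cat)
```

Because the intervals share their boundaries, a non-integer value such as 75.5 would be classed as medium rather than falling into a gap between 75 and 76.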
mtcars$mpg_small <- (as.numeric(mtcars$mpg) >= 15 &
(as.numeric(mtcars$mpg) <= 20))
Also be careful, if your vector of strings obj_hid_WOONOPP contains some values that cannot be coerced into numerics, they will become NA.
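A minimal sketch of both pitfalls, assuming the sizes were read in as text. The vector raw and the "n/a" entry are made up for illustration.

```r
# Made-up character data, as it might look if the column was read as text
raw <- c("16", "75", "175", "n/a")

# Comparing strings with strings: "175" sorts before "75" as text
text_small <- raw >= "15" & raw <= "75"
text_small[3]                      # TRUE, so "175" is wrongly counted as small

# Convert first, then compare numbers with numbers
# (suppressWarnings hides the "NAs introduced by coercion" warning)
size <- suppressWarnings(as.numeric(raw))
is.na(size)                        # FALSE FALSE FALSE TRUE

size_small <- as.numeric(size >= 15 & size <= 75)
size_small                         # 1 1 0 NA
```

The NA propagates into the indicator, so it is worth checking is.na() on the converted vector before building the categories.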

t-test doesn't work in function - variable lengths differ

Bit of an R novice here, so it might be a very simple problem.
I've got a dataset with GENDER (being a binary variable) and a whole lot of numerical variables. I wanted to write a simple function that checks for equality of variance and then performs the appropriate t-test.
So my first attempt was this:
genderttest<-function(x){ # x = outcome variable
attach(Dataset)
on.exit(detach(Dataset))
VARIANCE<-var.test(Dataset[GENDER=="Male",x], Dataset[GENDER=="Female",x])
if(VARIANCE$p.value<0.05){
t.test(x~GENDER)
}else{
t.test(x~GENDER, var.equal=TRUE)
}
}
This works well outside of a function (replacing the x, of course), but gave me an error here because variable lengths differ.
So I thought it might be handling the NA cases strangely and I should clean up the dataset first and then perform the tests:
genderttest<-function(x){ # x = outcome variable
Dataset2v<-subset(Dataset,select=c("GENDER",x))
Dataset_complete<-na.omit(Dataset2v)
attach(Dataset_complete)
on.exit(detach(Dataset_complete))
VARIANCE<-var.test(Dataset_complete[GENDER=="Male",x], Dataset_complete[GENDER=="Female",x])
if(VARIANCE$p.value<0.05){
t.test(x~GENDER)
}else{
t.test(x~GENDER, var.equal=TRUE)
}
}
But this gives me the same error.
I'd appreciate if anyone could point out my (probably stupid) mistake.
I believe the problem is that t.test(x~GENDER) takes the formula literally: it looks for a variable that is actually named x, not for the column whose name you stored in x. So it's trying to compare values of x between the two genders, and is confused because Dataset doesn't have a variable called x in it.
A solution that should work is to call:
# unequal variances (Welch test, the default):
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), data=Dataset))
# equal variances:
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), var.equal=TRUE, data=Dataset))
which will call t.test() and pass the value of x as part of the formula argument rather than the literal name x (i.e. score ~ GENDER instead of x ~ GENDER).
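The reason for the particular error you saw is that the two sides of the formula have different lengths: Dataset$GENDER has length equal to the number of rows in Dataset, while Dataset$x has length 0 (and the x inside your function is just a single character string).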
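Putting this together, one possible sketch of the whole function without attach(). This is not the asker's code: gender_ttest, its arguments and the toy data are hypothetical, and reformulate() is swapped in for as.formula(paste0(...)) to build the formula from the column name.

```r
# Hypothetical rewrite: `data` and `outcome` are illustrative argument names
gender_ttest <- function(data, outcome) {
  # Build e.g. score ~ GENDER from the column name stored in `outcome`
  f <- reformulate("GENDER", response = outcome)
  # F test for equality of variances between the two groups
  equal_var <- var.test(f, data = data)$p.value >= 0.05
  # Welch test if variances differ, pooled-variance test otherwise
  t.test(f, data = data, var.equal = equal_var)
}

# Made-up data with a binary GENDER column and one numeric outcome
set.seed(1)
toy <- data.frame(
  GENDER = rep(c("Male", "Female"), each = 20),
  score  = rnorm(40)
)
res <- gender_ttest(toy, "score")
res$data.name
```

Passing the data frame explicitly through data = keeps the lookup of both variables in one place, so the lengths always match.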

How to use negative values in which() statement in R?

I want to exclude the rows in which x has values less than or equal to -10, so I wrote this:
newdata <- data[which(data$x> -10), ]
Is this right, or do I need to put -10 in double quotation marks?
Thank you.
(Decided to upgrade this from a comment to an answer.)
Using double quotation marks is not wise: it will mess you up in some quite surprising ways. For example, 1 > "-10" is FALSE (!!) because of the way in which R compares strings.
R's use of <- for assignment may get you in trouble; if you want x<-10 to do the comparison rather than assign the value 10 to x, you need either spaces x < -10 or parentheses (x<(-10)). However, this doesn't arise with the > comparison.
You can always use parentheses if you're worried (x > (-10)); the only drawback is that things get harder to read if you use too many (e.g., data[(which(((data$x)>(-10)))),]).
As pointed out in the comments, R is an interactive environment; if you can't figure something like this out from the documentation or other help sources, you should just try a small example and convince yourself that it works.
For example:
x <- c(-20,-15,-10,-4,0)
x[x>-10]
## -4 0
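Extending that small example into a sketch of the spacing pitfall described above (the names below, y and keep are illustrative only):

```r
x <- c(-20, -15, -10, -4, 0)

# Spaces (or parentheses) make this a comparison with -10
below <- x < -10          # TRUE TRUE FALSE FALSE FALSE

# Without the space the same characters parse as an assignment
y <- x
y<-10                     # y is now the single number 10

# which() returns the positions where the condition holds
keep <- which(x > -10)    # 4 5
x[keep]                   # -4 0
```

The value -10 itself is dropped by x > -10, which matches the goal of excluding values less than or equal to -10.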

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the non-parametric package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, when I have the same number of elements in each vector?
Second, I would like to treat the data as ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic? It is just a vector with natural numbers and NAs.
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If x is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting the data. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they take a data = (your data) argument, in which case they work with column names.
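A short sketch of the subsetting difference and of aligning x and y when they have NAs in different rows. The data frame dat is made up, and npregbw is deliberately not called here, so this only shows how the inputs could be prepared.

```r
# Made-up data with NAs in different rows (illustrative only)
dat <- data.frame(x = c(3, 4, NA, 2), y = c(1, NA, 2, 0))

is.atomic(dat$x)          # TRUE: a plain numeric vector
is.atomic(dat[, 1])       # TRUE
is.data.frame(dat["x"])   # TRUE: still a data frame
is.data.frame(dat[1])     # TRUE

# Keep only rows where both variables are observed, so x and y stay aligned
ok <- complete.cases(dat)
x_clean <- dat$x[ok]      # 3 2
y_clean <- dat$y[ok]      # 1 0
```

Dropping incomplete rows jointly, rather than removing NAs from each vector separately, is what keeps the two vectors the same length.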
