t-test doesn't work in function - variable lengths differ - r

Bit of an R novice here, so it might be a very simple problem.
I've got a dataset with GENDER (being a binary variable) and a whole lot of numerical variables. I wanted to write a simple function that checks for equality of variance and then performs the appropriate t-test.
So my first attempt was this:
genderttest <- function(x) { # x = outcome variable
  attach(Dataset)
  on.exit(detach(Dataset))
  VARIANCE <- var.test(Dataset[GENDER == "Male", x], Dataset[GENDER == "Female", x])
  if (VARIANCE$p.value < 0.05) {
    t.test(x ~ GENDER)
  } else {
    t.test(x ~ GENDER, var.equal = TRUE)
  }
}
This works well outside of a function (replacing the x, of course), but gave me an error here because variable lengths differ.
So I thought it might be handling the NA cases strangely and I should clean up the dataset first and then perform the tests:
genderttest <- function(x) { # x = outcome variable
  Dataset2v <- subset(Dataset, select = c("GENDER", x))
  Dataset_complete <- na.omit(Dataset2v)
  attach(Dataset_complete)
  on.exit(detach(Dataset_complete))
  VARIANCE <- var.test(Dataset_complete[GENDER == "Male", x], Dataset_complete[GENDER == "Female", x])
  if (VARIANCE$p.value < 0.05) {
    t.test(x ~ GENDER)
  } else {
    t.test(x ~ GENDER, var.equal = TRUE)
  }
}
But this gives me the same error.
I'd appreciate it if anyone could point out my (probably stupid) mistake.

I believe the problem is that when you call t.test(x~GENDER), it's evaluating the variable x within the scope of Dataset rather than the scope of your function. So it's trying to compare values of x between the two genders, and is confused because Dataset doesn't have a variable called x in it.
A solution that should work is to call one of the following (the first for unequal variances, the second for equal variances):
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), data=Dataset))
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), var.equal=TRUE, data=Dataset))
which will call t.test() and pass the value of x as part of the formula argument rather than the literal name x (i.e. score ~ GENDER instead of x ~ GENDER).
The reason for the particular error you saw is that Dataset$GENDER has length equal to the number of rows in Dataset, while Dataset$x has length 0 (there is no column called x), and the x inside your function is just a single character string.
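Putting the pieces together, a minimal sketch of the whole function might look like the following. It assumes the outcome is passed as a column name in a string, and it uses made-up data (the column name score and the values are hypothetical, since the real Dataset is not shown). It relies on the formula methods of var.test() and t.test(), so no attach() is needed.

```r
# Hypothetical example data; the real Dataset is not shown in the question.
set.seed(1)
Dataset <- data.frame(
  GENDER = rep(c("Male", "Female"), each = 20),
  score  = c(rnorm(20, mean = 10), rnorm(20, mean = 11))
)

genderttest <- function(x, data = Dataset) {  # x = outcome column name (string)
  f <- as.formula(paste0(x, " ~ GENDER"))     # e.g. score ~ GENDER
  variance <- var.test(f, data = data)        # F test for equality of variances
  # Use the pooled-variance t-test only when the variances
  # do not differ significantly; otherwise fall back to Welch.
  t.test(f, data = data, var.equal = variance$p.value >= 0.05)
}

res <- genderttest("score")
res
```

Passing data = explicitly also lets the formula methods handle missing values per variable, which is what the second attempt in the question was trying to do by hand.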

Related

Different variable lengths when looping over a string vector which corresponds to data frame columns

I am new to writing loops and I have some difficulties there. I already looked through other questions, but didn't find the answer to my specific problem.
So let's just create a random dataset, give it column names and set the variables as character:
d<-data.frame(replicate(4,sample(1:9,197,rep=TRUE)))
colnames(d)<-c("variable1","variable2","trait1","trait2")
d$variable1<-as.character(d$variable1)
d$variable2<-as.character(d$variable2)
Now I define the vector over which I want to loop. It corresponds to trait 1 and trait 2:
trt.nm <- names(d[c(3,4)])
Now I want to apply the following model for trait 1 and trait 2 (which should now be as column names in trt.nm) in a loop:
library(lme4)
for (trait in trt.nm)
{
  lmer(trait ~ 1 + variable1 + (1|variable2), data = d)
}
Now I get the error that variable lengths differ. How could this be explained?
If I apply the model without loop for each trait, I get a result, so the problem has to be somewhere in the loop, I think.
trait is a string, so you'll have to convert it to a formula to work; see http://www.cookbook-r.com/Formulas/Creating_a_formula_from_a_string/ for more info.
Try this (you'll have to add a print statement or save the result to actually see what it does, but this will run without errors):
for (trait in trt.nm) {
  lmer(as.formula(paste(trait, " ~ 1 + variable1 + (1|variable2)")), data = d)
}
Another suggestion would be to use a list and lapply or purrr::map instead. Good luck!
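The lapply variant could be sketched roughly as below. To keep it runnable with base R only, this sketch swaps lmer for lm and drops the (1|variable2) random effect; with lme4 loaded, the same pattern should work by putting the lmer call and the full formula string from above inside the function. The name fits is just an illustrative choice.

```r
set.seed(42)
d <- data.frame(replicate(4, sample(1:9, 197, replace = TRUE)))
colnames(d) <- c("variable1", "variable2", "trait1", "trait2")
d$variable1 <- as.character(d$variable1)
trt.nm <- c("trait1", "trait2")

# One model per trait, stored in a named list so results are kept
fits <- lapply(setNames(trt.nm, trt.nm), function(trait) {
  f <- as.formula(paste(trait, "~ 1 + variable1"))  # string -> formula
  lm(f, data = d)                                   # lm stands in for lmer here
})

names(fits)
```

Unlike the for loop, lapply returns the fitted models, so there is no need for an extra print or assignment to see the results.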

How to counter the 'non-numeric matrix extent' error in R?

I'm trying to generate a data frame of simulated values from the Student's t distribution using the standard stochastic equation. The function I use is as follows:
matgen<-function(means,chi,covariancematrix)
{
  cols<-ncol(means);
  normals<-mvrnorm(n=500,mu=means,Sigma = covariancematrix);
  invgammas<-rigamma(n=500,alpha=chi/2,beta=chi/2);
  gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=500));
  i<-1;
  while(i<=500)
  {
    gen[i,]<-t(means)+normals[i,]*sqrt(invgammas[i]);
    i<=i+1;
  }
  return(gen);
}
If it's not clear, I'm trying to create an empty data frame, that takes in values in cols number of columns and 500 rows. The values are numeric, of course, and R tells me that in the 9th row:
gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=500));
There's an error: 'non-numeric matrix extent'.
I remember using as.data.frame() to convert matrices into data frames in the past, and it worked quite smoothly. Even with numbers. I have been out of touch for a while, though, and can't seem to recollect or find online a solution to this problem. I tried is.numeric(), as.numeric(), 0s instead of NA there, but nothing works.
As Roland pointed out, one problem is that cols doesn't seem to be numeric. Please check whether means is a data frame or matrix, e.g. with str(means). If it is, your code should not result in the error 'non-numeric matrix extent'.
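A quick way to see how this can go wrong, assuming means was passed as a plain numeric vector (the variable names here are made up for illustration): ncol() of a vector is NULL, and matrix() rejects a NULL extent.

```r
means_vec <- c(1, 2, 3)   # hypothetical: means given as a plain vector

ncol(means_vec)           # NULL, because a vector has no dim attribute
length(means_vec)         # 3

# Passing NULL as ncol makes matrix() fail; capture the message instead of stopping
res <- tryCatch(
  matrix(data = NA, ncol = ncol(means_vec), nrow = 5),
  error = function(e) conditionMessage(e)
)
res

# A one-row matrix version does have a column count
ncol(t(means_vec))        # 3
```

If that is the situation, using length(means) or converting means to a one-row matrix first would give cols a numeric value.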
You also have some other issues in your code. I created a simplified example and pointed out the bugs I found as comments in the code:
library(MASS)
library(LearnBayes)
means <- cbind(c(1,2,3),c(4,5,6))
chi <- 10
matgen <- function(means, chi, covariancematrix)
{
  cols <- ncol(means) # if means is a dataframe or matrix, this should work
  normals <- rnorm(n=20, mean=100, sd=10) # changed example for simplification
  # normals <- mvrnorm(n=20, mu=means, Sigma=covariancematrix)
  # input to mu of mvrnorm should be a vector, see ?mvrnorm; but this means that ncol(means) is always 1 !?
  invgammas <- rigamma(n=20, a=chi/2, b=chi/2) # changed alpha= to a and beta= to b
  gen <- as.data.frame(matrix(data=NA, ncol=cols, nrow=20))
  i <- 1
  while (i <= 20)
  {
    gen[i,] <- t(means) + normals[i]*sqrt(invgammas[i]) # changed normals[i,] to normals[i], because it is a vector
    i <- i+1 # changed <= to <-
  }
  return(gen)
}
matgen(means, chi, covariancematrix)
I hope this helps.
P.S. You don't need ";" at the end of every line in R
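For comparison, here is a loop-free sketch of what the original function seems to be aiming for, under several assumptions that are mine rather than the asker's: means is a plain numeric vector, the target is the usual construction mean + sqrt(W) * Z with Z a zero-mean normal and W inverse-gamma(df/2, df/2), and base R stands in for the package calls (chol() instead of mvrnorm, 1/rgamma() instead of rigamma). The function name rmvt_sketch is hypothetical.

```r
# Sketch: n draws from a multivariate t with df degrees of freedom.
rmvt_sketch <- function(n, means, df, Sigma) {
  p <- length(means)                      # means is assumed to be a numeric vector
  # Zero-mean normal draws with covariance Sigma, one draw per row
  z <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% chol(Sigma)
  # Inverse-gamma(df/2, df/2) mixing weights, one per row
  w <- 1 / rgamma(n, shape = df / 2, rate = df / 2)
  # Scale row i by sqrt(w[i]), then add means[j] to column j
  gen <- sweep(z * sqrt(w), 2, means, "+")
  as.data.frame(gen)
}

set.seed(123)
sim <- rmvt_sketch(500, means = c(1, 2, 3), df = 10, Sigma = diag(3))
dim(sim)
colMeans(sim)
```

Working on the whole matrix at once avoids the while loop and its counter, which is where the i<=i+1 slip crept in.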

applying lm to multiple datasets

Below are 4 datasets (I've just created them randomly for the sake of providing reproducible code). I created a list of these so I could apply "lm" to these multiple datasets at once:
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C<-data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R<-data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E<-data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
dsets<-list(H,C,R,E)
models<-lapply(dsets,function(x)lm(X1~.,data=x))
lapply(models,summary)
The variables in each of the datasets are different (in count as well as names; however, if you run the code they will all be X1, X2 and so on). The first column/variable in each will be the response and the rest will be the independent variables.
This code works but not on my actual dataset. Since my datasets have actual names for variables, I used the position of the variable instead as below:
dsets<-list(H,C,R,E)
models<-lapply(dsets,function(x)lm(x[,1]~.,data=x))
lapply(models,summary)
Using the above, the results are messed up. It also includes the response variable as the independent variable.
Could anyone assist?
EDIT: I realized that x[,1] is calling the whole column and not the column name
models<-lapply(dsets,function(x)lm(colnames(x)[1]~.,data=x))
lapply(models,summary)
but this doesn't work either. I get the following error
Error in model.frame.default(formula = colnames(H[1]) ~ ., data = H, drop.unused.levels = TRUE) :
variable lengths differ (found for 'Var1')
models <- lapply(dsets,
  function(data) {
    lm(reformulate(termlabels=".", response=names(data)[1]), data)
  })
reformulate allows you to construct a formula from character strings.
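As a small illustration with named columns (the datasets A and B and their column names are invented for this sketch), the response is taken from the first column name and everything else ends up on the right-hand side:

```r
set.seed(7)
# Hypothetical datasets with real column names; response is column 1
A <- data.frame(yield = rnorm(10), rain = rnorm(10), temp = rnorm(10))
B <- data.frame(price = rnorm(10), size = rnorm(10))
dsets <- list(A = A, B = B)

models <- lapply(dsets, function(data) {
  # Builds e.g. yield ~ . from the first column name
  f <- reformulate(termlabels = ".", response = names(data)[1])
  lm(f, data = data)
})

# The response no longer appears among the predictors
lapply(models, function(m) names(coef(m)))
```

Because the formula refers to the response by name, the "." expands to the remaining columns only, which is what x[,1] ~ . could not do.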

Extract Group Regression Coefficients in R w/ PLYR

I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
  lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
  lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
  3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As @Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and same for $squ..) and that will help.
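A corrected version could be sketched like this. The data are simulated as a stand-in for SEALED.csv, and base R's split() and lapply() are used in place of dlply/ldply so the example runs without plyr; the function name fit_one is hypothetical. The key change is that the function refers to df (the chunk it receives) instead of Sealed.

```r
set.seed(99)
# Hypothetical stand-in for SEALED.csv
Sealed <- data.frame(
  zipcode        = rep(c("10001", "10002", "10003"), each = 15),
  square_footage = runif(45, 500, 3000)
)
Sealed$hhincome <- 20000 + 30 * Sealed$square_footage + rnorm(45, sd = 5000)

# Fit on the piece passed in (df), not on the whole of Sealed
fit_one <- function(df) lm(hhincome ~ square_footage, data = df)

regressions  <- lapply(split(Sealed, Sealed$zipcode), fit_one)
coefficients <- do.call(rbind, lapply(regressions, coef))  # one row per zipcode
coefficients
```

With plyr, passing the same corrected function to dlply(Sealed, .(zipcode), ...) should give the per-zipcode fits in the same way.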

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R. The variables take values in the natural numbers plus zero and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting the data. Note the difference between the various ways of subsetting: data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they take a data = (your data) argument, in which case they work with column names.
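Both points can be checked on a toy data frame (dat and its values are made up for illustration). The last part shows one base R way, complete.cases(), to keep only the rows where both variables are observed before handing the vectors to a function such as npregbw.

```r
# Hypothetical data frame with missing values in both columns
dat <- data.frame(x = c(3, 4, NA, 2, 5), y = c(1, NA, 2, 0, 4))

is.data.frame(dat["x"])    # single-bracket subsetting keeps a data frame
is.data.frame(dat$x)       # $ returns a plain vector
is.numeric(dat[, "x"])     # so does [ , "x"]

# Keep only rows where both x and y are observed, so the lengths match
ok <- complete.cases(dat$x, dat$y)
x <- dat$x[ok]
y <- dat$y[ok]
length(x) == length(y)
```

Dropping the incomplete rows jointly, rather than removing NAs from x and y separately, is what keeps the two vectors aligned.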
