applying lm to multiple datasets - r

Below are four datasets (randomly generated so that the code is reproducible). I put them in a list so I could apply lm to all of them at once:
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
C<-data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
R<-data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
E<-data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
dsets<-list(H,C,R,E)
models<-lapply(dsets,function(x)lm(X1~.,data=x))
lapply(models,summary)
The variables in each dataset differ, in count as well as in name (although if you run the code above they will all be named X1, X2, and so on). The first column in each dataset is the response and the rest are the independent variables.
This code works but not on my actual dataset. Since my datasets have actual names for variables, I used the position of the variable instead as below:
dsets<-list(H,C,R,E)
models<-lapply(dsets,function(x)lm(x[,1]~.,data=x))
lapply(models,summary)
Using the above, the results are messed up. It also includes the response variable as the independent variable.
Could anyone assist?
EDIT: I realized that x[,1] is calling the whole column and not the column name
models<-lapply(dsets,function(x)lm(colnames(x)[1]~.,data=x))
lapply(models,summary)
but this doesn't work either. I get the following error
Error in model.frame.default(formula = colnames(H[1]) ~ ., data = H, drop.unused.levels = TRUE) :
variable lengths differ (found for 'Var1')

models <- lapply(dsets, function(data) {
  lm(reformulate(termlabels=".", response=names(data)[1]), data)
})
reformulate allows you to construct a formula from character strings.
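To see why this fixes the original problem: with x[,1] ~ . the dot expands to every column of the data, because x[,1] is not recognised as one of the column names, so the response also shows up as a predictor. A minimal sketch with made-up data sets (the names trees, cars2 and their columns are hypothetical) shows that the reformulate version leaves the response out of the right-hand side:

```r
# Hypothetical data sets with "real" column names; the first column is the response.
set.seed(1)
trees <- data.frame(height = rnorm(20), girth = rnorm(20), age = rnorm(20))
cars2 <- data.frame(mpg = rnorm(20), weight = rnorm(20))
dsets <- list(trees = trees, cars2 = cars2)

models <- lapply(dsets, function(d) {
  # Build e.g. height ~ . from the name of the first column
  f <- reformulate(termlabels = ".", response = names(d)[1])
  lm(f, data = d)
})

lapply(models, formula)  # inspect the formula each model was fitted with
lapply(models, coef)     # only the non-response columns appear as predictors
```

Naming the list elements is optional, but it makes the output of lapply easier to read.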

Related

Is there a way to run a wilcoxon test for variables with different lengths?

I am trying to run a wilcox.test() on two subsets of data from a data frame. They are not of equal length (48 vs. 260). I want to see if there is a difference between the dbh (diameter at breast height) of live oak trees and water oak trees.
Pine_stand <- read.csv("Pine_stand.csv")
live_oaks <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_oaks <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
wilcox.test(live_oaks~water_oaks,conf.int=T,correct=F)
Error in model.frame.default(formula = live_oaks ~ water_oaks) :
invalid type (list) for variable 'live_oaks'
That was my first attempt. Then I tried this:
Pine_stand <- read.csv("Pine_stand.csv")
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
oaks<-c(live_dbh,water_dbh)
wilcox.test(dbh~Species,data=oaks)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 48, 260
and received that error. I have tried vectorizing the two groups and appending and tapply ... I know there is a simple answer I am overlooking, I just can't get it to work. All of the examples I am reading are comparing two vectors with the same length. I know I can do the Wilcoxon test by hand when there are different numbers, so there should be a way. Any advice is welcome.
Yes, you can run a wilcox.test for variables of different length. As stated in http://www.r-tutor.com/elementary-statistics/non-parametric-methods/mann-whitney-wilcoxon-test
“Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to follow
the normal distribution.”
It is therefore a non-parametric alternative to the two-sample t-test, which we can use when the assumptions of the t-test are not met (for example, when the data are not normally distributed). The two samples do not need to be the same size.
The problem in your code is that with these two statements:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"))
you are creating two one-column data frames that contain only the dbh values, so you lose the labels (Species). Therefore you should write:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh", "Species"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh", "Species"))
Secondly, when you try to merge the two sets with this code:
oaks<-c(live_dbh,water_dbh)
you create a list instead of a data frame. Why does that happen? As the documentation for c() says, its purpose is to “Combine Values into a Vector or List”. You have probably already used it to merge two vectors into one. However, subset() returns a one-column data frame here, not a vector. So live_dbh and water_dbh are data frames (and, with the Species label added, they now have two columns).
For one-column data frames you can use c() with the recursive argument set to TRUE to flatten them into a single vector:
total<-c(one_column_df1, one_column_df2, recursive=TRUE)
However, it is usually safer to use rbind() (short for "row bind"), which keeps the result a data frame and also works when the data frames have more than one column.
oaks<-rbind(live_dbh,water_dbh)
Now you should be able to run a wilcox.test:
wilcox.test(dbh~Species,data=oaks)
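The difference between c() and rbind() can be checked on a small example. This is only a sketch: the Pine_stand data frame below is a made-up stand-in for the CSV file, with deliberately unequal group sizes (8 and 20) and a third species added as an assumption.

```r
# Hypothetical stand-in for Pine_stand.csv; group sizes are deliberately unequal.
set.seed(42)
Pine_stand <- data.frame(
  Species = rep(c("live oak", "water oak", "loblolly pine"), times = c(8, 20, 5)),
  dbh     = runif(33, 5, 60)
)

live_dbh  <- subset(Pine_stand, Species == "live oak",  select = c("dbh", "Species"))
water_dbh <- subset(Pine_stand, Species == "water oak", select = c("dbh", "Species"))

class(c(live_dbh, water_dbh))   # "list": c() strings the columns together as list elements
oaks <- rbind(live_dbh, water_dbh)
class(oaks)                     # "data.frame"
dim(oaks)                       # 28 rows, 2 columns

res <- wilcox.test(dbh ~ Species, data = oaks)
res
```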
How about
wilcox.test(dbh~Species, data=Pine_stand,
            subset=(Species %in% c("live oak", "water oak")))
? (If these are the only two species in your data set, you don't need the subset argument.)
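For completeness, wilcox.test() also accepts two plain numeric vectors, and they can have different lengths. A short sketch, again on made-up data standing in for Pine_stand.csv, comparing that call with the subset approach:

```r
# Hypothetical stand-in for Pine_stand.csv
set.seed(7)
Pine_stand <- data.frame(
  Species = rep(c("live oak", "water oak", "loblolly pine"), times = c(8, 20, 5)),
  dbh     = runif(33, 5, 60)
)

# Formula interface, restricted to the two oak species
by_formula <- wilcox.test(dbh ~ Species, data = Pine_stand,
                          subset = Species %in% c("live oak", "water oak"))

# Two plain numeric vectors of different lengths (8 and 20 values)
live  <- Pine_stand$dbh[Pine_stand$Species == "live oak"]
water <- Pine_stand$dbh[Pine_stand$Species == "water oak"]
by_vectors <- wilcox.test(live, water)

# Both calls describe the same test, so the p-values agree
all.equal(by_formula$p.value, by_vectors$p.value)
```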

R bnlearn - parameter learning with naive.bayes() check.data() error

I have a graph structure, determined from another method, and I want to do parameter learning. The bnlearn methods, however, seem to do parameter learning directly on the dataset (strictly in a dataframe). I have two questions: how do I do parameter learning from an igraph or graphNEL structure with bnlearn?
Second question: I am getting a check.data() error when I try to do parameter learning using my dataset. Their example code works, and I can't understand why my dataset does not. See their example code and my reproducible example below.
Here is their example code:
require(bnlearn)
require(Rgraphviz)
data(learning.test)
bn <- naive.bayes(learning.test, "A")
pred <- predict(bn, learning.test)
table(pred, learning.test[,"A"])
My reproducible example (errors on naive.bayes() call):
require(bnlearn)
require(Rgraphviz)
data <- matrix(sample.int(200, 61*252, TRUE), nrow=252, ncol=61)
data <- as.data.frame(matrix(as.numeric(as.matrix(data)), ncol=ncol(data),
byrow=TRUE))
bn <- naive.bayes(data, names(data)[1])
Error message:
Error in check.data(data, allowed.types = discrete.data.types) :
valid data types are:
* all variables must be unordered factors.
* all variables must be ordered factors.
* variables can be either ordered or unordered factors.
I do not think this error comes from detecting integers, because when I cast my data to a dataframe, I first cast it to numeric, because other methods in bnlearn require numeric or factored data. This dataset IS count data, but I want to use the method assuming I am using continuous datasets. Does this make sense?
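The error message lists only factor types, so casting the columns to numeric will not satisfy check.data() here. One possible workaround, sketched below, is to bin each column into a factor before calling naive.bayes(). This is a sketch under assumptions, not a recommendation: the choice of four equal-width bins is arbitrary, and it does not treat the data as continuous, which is what the question asks for. The bnlearn call is guarded so that the rest of the code runs without the package.

```r
set.seed(1)
# Same shape as the question's data: 252 rows, 61 integer-valued columns
data <- as.data.frame(matrix(sample.int(200, 61 * 252, TRUE), nrow = 252, ncol = 61))

# Assumption: binning into four equal-width intervals is acceptable for the analysis
binned <- as.data.frame(lapply(data, cut, breaks = 4), stringsAsFactors = TRUE)
all(vapply(binned, is.factor, logical(1)))  # every column is now a factor

if (requireNamespace("bnlearn", quietly = TRUE)) {
  # Same call as in the question, now on factor columns
  bn <- bnlearn::naive.bayes(binned, names(binned)[1])
}
```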

specify model with selected terms using lm

A pretty straightforward question for those with intimate knowledge of R.
full <- lm(hello~., hellow)
In the above specification, linear regression is being used and hello is being modeled against all variables in dataset hellow.
I have 33 variables in hellow; I wish to specify some of those as independent variable. These variables have names that carry a meaning so I really don't want to rename them to x1 x2 etc.
How can I, without having to type the individual names of the variables (since that is pretty tedious), specify a select number of variables from the whole bunch?
I tried
full <- lm(hello~hellow[,c(2,5:9)], hellow)
but it gave me an error: "Error in model.frame.default(formula = hello ~ hellow[, : invalid type (list) for variable 'hellow[, c(2, 5:9)]'"
reformulate will construct a formula given the names of the variables, so something like:
(Construct data first):
set.seed(101)
hellow <- setNames(as.data.frame(matrix(rnorm(1000),ncol=10)),
c("hello",paste0("v",1:9)))
Now run the code:
ff <- reformulate(names(hellow)[c(2,5,9)],response="hello")
full <- lm(ff, data=hellow)
should work. (Works fine with this example.)
An easier solution just occurred to me; just select the columns/variables you want first:
hellow_red <- hellow[,c(1,2,5,9)]
full2 <- lm(hello~., data=hellow_red)
all.equal(coef(full),coef(full2)) ## TRUE

t-test doesn't work in function - variable lengths differ

Bit of an R novice here, so it might be a very simple problem.
I've got a dataset with GENDER (being a binary variable) and a whole lot of numerical variables. I wanted to write a simple function that checks for equality of variance and then performs the appropriate t-test.
So my first attempt was this:
genderttest<-function(x){ # x = outcome variable
  attach(Dataset)
  on.exit(detach(Dataset))
  VARIANCE<-var.test(Dataset[GENDER=="Male",x], Dataset[GENDER=="Female",x])
  if(VARIANCE$p.value<0.05){
    t.test(x~GENDER)
  }else{
    t.test(x~GENDER, var.equal=TRUE)
  }
}
This works well outside of a function (replacing the x, of course), but gave me an error here because variable lengths differ.
So I thought it might be handling the NA cases strangely and I should clean up the dataset first and then perform the tests:
genderttest<-function(x){ # x = outcome variable
  Dataset2v<-subset(Dataset,select=c("GENDER",x))
  Dataset_complete<-na.omit(Dataset2v)
  attach(Dataset_complete)
  on.exit(detach(Dataset_complete))
  VARIANCE<-var.test(Dataset_complete[GENDER=="Male",x], Dataset_complete[GENDER=="Female",x])
  if(VARIANCE$p.value<0.05){
    t.test(x~GENDER)
  }else{
    t.test(x~GENDER, var.equal=TRUE)
  }
}
But this gives me the same error.
I'd appreciate if anyone could point out my (probably stupid) mistake.
I believe the problem is that when you call t.test(x~GENDER), the formula refers to a variable literally named x, not to the column whose name is stored in x. Inside your function, x is a single character string (the column name), while GENDER is found in the attached Dataset, so t.test() tries to model a length-1 character vector against a full-length column.
A solution that should work is to build the formula from the string and pass the data explicitly, for the unequal-variance and equal-variance branches respectively:
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), data=Dataset))
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), var.equal=TRUE, data=Dataset))
which will call t.test() with the value of x as part of the formula rather than the symbol x (i.e. score ~ GENDER instead of x ~ GENDER).
The reason for the particular error you saw is that GENDER has length equal to the number of rows in Dataset, while x has length 1.
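Putting this together, the whole function can be written without attach(). The following is a minimal sketch rather than a drop-in replacement: the Dataset and its score column are made up for illustration, and it uses reformulate() and the formula method of var.test() instead of the do.call() form shown here.

```r
# Hypothetical data: GENDER plus one numeric outcome called "score"
set.seed(123)
Dataset <- data.frame(
  GENDER = rep(c("Male", "Female"), times = c(30, 25)),
  score  = rnorm(55, mean = 50, sd = 10)
)

genderttest <- function(x, data = Dataset) {  # x = name of the outcome column, as a string
  f <- reformulate("GENDER", response = x)    # e.g. score ~ GENDER
  variance <- var.test(f, data = data)
  # Pooled-variance t-test unless the F test rejects equal variances at 0.05
  t.test(f, data = data, var.equal = variance$p.value >= 0.05)
}

res <- genderttest("score")
res
```

Because the data are passed explicitly, rows with missing values are dropped by the default na.action, so the separate na.omit() step is not needed.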

Extract Group Regression Coefficients in R w/ PLYR

I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
  lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
  lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
The issue is not with plyr but rather with the definition of the function. The function takes an argument, but its body never uses it.
As an analogy,
myFun <- function(x) {
  3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage parts look as though they should vary with the input.
But what needs to vary is the part that comes before the $. As @Joran correctly pointed out, replace Sealed$hhincome with df$hhincome (and likewise for $square_footage) and that will help.
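The corrected function can be checked without plyr. The sketch below uses base R's split() and lapply() in place of dlply() and ldply(), and a made-up Sealed data frame standing in for SEALED.csv, so the zip codes and numbers are purely illustrative:

```r
# Hypothetical stand-in for SEALED.csv
set.seed(99)
Sealed <- data.frame(
  zipcode        = rep(c("10001", "60601", "94105"), each = 15),
  square_footage = runif(45, 500, 3000)
)
Sealed$hhincome <- 20000 + rep(c(10, 25, 40), each = 15) * Sealed$square_footage +
  rnorm(45, sd = 5000)

fit_one <- function(df) {
  # df is the chunk for a single zipcode, so use its columns, not Sealed's
  lm(hhincome ~ square_footage, data = df)
}

regressions  <- lapply(split(Sealed, Sealed$zipcode), fit_one)
coefficients <- do.call(rbind, lapply(regressions, coef))
coefficients  # one row per zipcode, each with its own intercept and slope
```

With plyr loaded, the same fit_one function should be usable as the last argument to dlply(Sealed, .(zipcode), fit_one).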

Resources