Subset linear regression in R

I have the following formula:
Reg_Total<- In_Bigdata2 %>%
lm(log(This_6) ~ This_1+This_2+This_3+This_4+
This_5+This_7+This_8+
This_12+This_13+This_14+This_15+This_16+This_17,This_18,data = .)
With that data
With only the variable This_18 as a subset, do you know why it gives me a perfect regression with an r2 of 1?

OK, this was a good puzzle.
You have to dig a little bit to find out what the subset= argument does, as it gets passed to the model.frame() function inside lm(). From ?model.frame():
subset: a specification of the rows to be used: defaults to all rows.
This can be any valid indexing vector (see ‘[.data.frame’)
for the rows of ‘data’ or if that is not supplied, a data
frame made up of the variables used in ‘formula’.
(emphasis added). Usually people specify a logical expression for subset= (e.g. This_5>2) to restrict the regression to particular cases. If you put in an integer vector, lm()/model.frame() will select the rows corresponding to those integers.
So ... what lm()/model.frame() have done is to construct a data set for the linear model that consists of rows of the original data set indexed by This_18. In other words, since the first few elements of This_18 are (2,3,4,3,3,2, ...), the first row of the new data set will be row 2 of the original data set; the second row will be row 3; the third row will be row 4; the fourth row will be another copy of row 3; and so on ...
head(model.frame(This_6~.-This_18, data=dd, subset=This_18))
## This_6 This_1 This_2 This_3 This_4 This_5 This_7 This_8 This_9 This_10 ...
## 2 2 5 3 3 3 3 3 2 3 1 ...
## 3 3 3 3 3 3 3 3 4 4 4 ...
## 4 1 3 3 3 3 3 3 2 1 2 ...
## 3.1 3 3 3 3 3 3 3 4 4 4 ...
## 3.2 3 3 3 3 3 3 3 4 4 4 ...
## 2.1 2 5 3 3 3 3 3 2 3 1 ...
(you can also get this object by running model.frame(fitted_model)).
Therefore, since the only values of This_18 are the integers 1-6, the regression is run only on multiple copies of rows 1-6 of the original data set. It's not surprising that you get a perfect fit: there are only 6 unique combinations of response and predictors, fewer than the number of coefficients being estimated, so the fitted model can generally pass through every point exactly.
The remaining question is ... what did you intend to do by using subset=This_18 ... ? "subset" refers to a subset of observations, not a subset of predictors.
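To make the distinction concrete, here is a minimal sketch on simulated data. The data frame dd and its columns (y, x1, x2, x3, idx) are made up for illustration and are not the asker's data; idx simply plays the role of This_18. It compares an integer subset=, which indexes rows, with a logical subset=, which filters them.

```r
set.seed(1)
# Simulated stand-in data; idx holds small integers, like This_18
dd <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50),
                 idx = sample(1:3, 50, replace = TRUE))

# Integer subset: rows are *indexed* by idx, so only copies of rows 1-3 are used
fit_int <- lm(y ~ x1 + x2 + x3, data = dd, subset = idx)
n_unique_int <- nrow(unique(model.frame(fit_int)))
# summary() may warn about an essentially perfect fit, hence suppressWarnings()
r2_int <- suppressWarnings(summary(fit_int)$r.squared)

# Logical subset: keeps only the rows where the condition is TRUE
fit_lgl <- lm(y ~ x1 + x2 + x3, data = dd, subset = idx == 1)
n_lgl <- nrow(model.frame(fit_lgl))
r2_lgl <- summary(fit_lgl)$r.squared

c(unique_rows_integer = n_unique_int, r2_integer = r2_int,
  rows_logical = n_lgl, r2_logical = r2_lgl)
```

The integer version ends up with at most three distinct rows and four coefficients to estimate, which is the same situation as in the question on a smaller scale.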
If you want to do best subset regression (i.e. find the subset of predictors that maximizes some criterion), there is not a single easy answer (and in fact there are some potential statistical pitfalls if you are interested in inference rather than prediction). Googling "R best subset regression" should help you, or searching for those keywords on Stack Overflow. (Or see the glmulti package, or the leaps package, or the stepAIC function in the MASS package, or the MuMIn package, or ...)
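As one possible starting point, the sketch below uses stepAIC() from MASS on the built-in mtcars data, which is only a stand-in for the asker's data. Note that this is a stepwise search scored by AIC, a heuristic rather than an exhaustive best-subset search, and the inference caveats mentioned above still apply.

```r
library(MASS)  # recommended package, normally installed alongside R

# Start from the full model on the built-in mtcars data (stand-in example)
full <- lm(mpg ~ ., data = mtcars)

# Stepwise search in both directions, scored by AIC; trace = FALSE hides the log
sel <- stepAIC(full, direction = "both", trace = FALSE)

formula(sel)                               # predictors retained by the search
c(full = AIC(full), selected = AIC(sel))   # the selected model should not be worse
```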

Related

kNN algorithm predicts only one group

I am trying to make a model that will predict the group of a city according to the development level of it. I mean, the cities in the 1st group are the most developed cities and the ones in the 6th group are the least developed ones. I have 10 numerical variables in my data about each city.
First, I normalized them using max-min normalization. Then I generated the training and test sets. I have 81 cities. The dimensions of the training and test sets are 61x10 and 20x10, respectively. I excluded the target variable from them. Then I made labels for them as training labels and test labels, with dimensions 61x1 and 20x1.
Then I run the knn function like this
knn(train = Data.training, test = Data.test, cl = Data.trainLabels , k = 3)
its output is this
[1] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Levels: 1 2 3 4 5 6
But if I set the argument use.all to FALSE, I get this output, and it changes every time I run the code
[1] 1 4 2 2 2 3 5 4 3 5 5 6 5 6 5 6 4 5 2 2
Levels: 1 2 3 4 5 6
I can't find the reason why my code gives the same prediction in the first place and what use.all has got to do with it.
As explained in the knn documentation :
use.all controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.
In your case, all points have the same distances, so they all win as 'best neighbour' (use.all = TRUE) or the algorithm picks k winners at random (use.all = FALSE).
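The tie handling can be reproduced on a deliberately degenerate toy example. This is only a sketch: the matrices and labels below are made up so that every training point is exactly the same distance from each test point, which is an assumption about what might be happening in the asker's data, not something the question confirms.

```r
library(class)  # provides knn()

# Six identical training points, so each test point is equidistant from all of them
train <- matrix(0.5, nrow = 6, ncol = 2)
cl <- factor(c(1, 2, 3, 3, 3, 3))
test <- matrix(c(0.1, 0.9, 0.4, 0.2, 0.8, 0.6), nrow = 3, ncol = 2)

# use.all = TRUE: all tied neighbours vote, so the most frequent class wins everywhere
pred_all <- knn(train, test, cl, k = 3, use.all = TRUE)

# use.all = FALSE: exactly k of the tied neighbours are drawn at random,
# so the predictions can differ from one run to the next
pred_rand <- knn(train, test, cl, k = 3, use.all = FALSE)

pred_all
pred_rand
```

With use.all = TRUE every test point receives the majority label of the whole training set, which matches the "all 6" pattern in the question.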
The problem seems to be in how you trained the algorithm or in the data itself. Since you did not post a sample of your data, I cannot help with that, but I suggest that you re-check it. You can also compute a few distances by hand, to see what is going on.
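Computing a few distances by hand could look like the sketch below. The matrices are random stand-ins with the same shapes as Data.training and Data.test, since the real data was not posted; the check for constant or non-finite columns is just one thing worth ruling out after normalization.

```r
set.seed(1)
# Random stand-ins for Data.training (61 x 10) and Data.test (20 x 10)
train <- matrix(runif(61 * 10), nrow = 61)
test <- matrix(runif(20 * 10), nrow = 20)

# Euclidean distance from the first test row to every training row
d_hand <- sqrt(colSums((t(train) - test[1, ])^2))

# Cross-check against dist(): row 1 of the distance matrix, without the self-distance
d_ref <- as.matrix(dist(rbind(test[1, ], train)))[1, -1]

# If this is 1, every training point is tied for this test row
n_distinct <- length(unique(round(d_hand, 8)))

# Columns that are constant or contain NaN/Inf after normalization
bad_cols <- which(apply(train, 2, function(col)
  any(!is.finite(col)) || diff(range(col)) == 0))

summary(d_hand)
```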
Also, check that you randomised your data before splitting it into training and testing sets. For example, say that the dataset is ordered by the label (the target variable). If you use the first 20 points to train the algorithm, it is likely that the algorithm will never see some of the labels during the training phase and therefore it will perform poorly on those during the testing phase.
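The effect of splitting ordered data can be seen in a small sketch. The labels below are hypothetical (81 cities spread roughly evenly over 6 groups and sorted by group); only the sizes 81, 61 and 20 come from the question.

```r
set.seed(42)
n <- 81
# Hypothetical labels, sorted by group as in a data set ordered by the target
labels <- factor(sort(rep(1:6, length.out = n)))

# Naive split: the first 61 rows of the ordered data become the training set
train_ordered <- labels[1:61]
table(train_ordered)    # group 6 never appears in training

# Shuffled split: draw the 61 training rows at random
train_idx <- sample(n, 61)
train_shuffled <- labels[train_idx]
test_shuffled <- labels[-train_idx]
table(train_shuffled)   # every group is represented
```

The same train_idx would then be used to subset both the predictor matrix and the label vector, so that rows and labels stay aligned.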

Error with predict and newdata, dependent on number of predictor variable in model

I am trying to use predict to apply my model to data from one time period to see what might be the values for another time period. I did this
successfully for one dataset, and then tried on another with identical code
and got the following error:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
The only difference between the two datasets was that my predictor model for the first dataset had two predictor variables and my model for the second dataset had only one. Why would this make a difference?
My dougfir.csv contains just two columns with thirty numbers in each,
labeled height and dryshoot.
my linear model is:
fitdougfir <- lm(dryshoot~height,data=dougfir)
It gets a little complicated (and messy, sorry! I am new to R) because I
then made a second .csv - the one I used to make my model contained values
from just June. My new .csv (called alldatadougfir.csv) includes values
from October as well, and also contains a date column that labels the
values either "june" or "october".
I did the following to separate the height data by date:
alldatadougfir[alldatadougfir$date=="june",c("height")]->junedatadougfir
alldatadougfir[alldatadougfir$date=="october",c("height")]->octoberdatadougfir
I then want to use my June model to predict my October dryshoots using
height as my variable and I did the following:
predict(fitdougfir, newdata=junedatadougfir)
predict(fitdougfir, newdata=octoberdatadougfir)
Again, I did this with an identical dataset successfully - the only
difference was that my model in the successful dataset had two predictor
variables instead of the one variable (height) I have in this dataset.
This is essentially a variation on R: numeric 'envir' arg not of length one in predict() , but it might not be obvious why. What's happening is that by selecting a single column of your data frame, you are triggering R's (often annoying/unwanted) default behaviour of collapsing the data frame to a numeric vector. This triggers issue #2 from the linked answer:
The predictor variable needs to be passed in as a named column in a data frame, so that predict() knows what the numbers [it's] been handed represent ... [emphasis added]
Watch this:
dd <- data.frame(x=1:20,y=1:20)
str(dd[dd$x<10,"y"]) ## select some rows and a single column
## int [1:9] 1 2 3 4 5 6 7 8 9
You could specify drop=FALSE, which gives you a data frame with a single column rather than just the column itself:
str(dd[dd$x<10,"y",drop=FALSE])
## 'data.frame': 9 obs. of 1 variable:
## $ y: int 1 2 3 4 5 6 7 8 9
Alternatively, you don't have to drop the other columns (such as the response variable) when selecting new data; predict() will just ignore the ones it doesn't need.
str(dd[dd$x<10,])
## 'data.frame': 9 obs. of 2 variables:
## $ x: int 1 2 3 4 5 6 7 8 9
## $ y: int 1 2 3 4 5 6 7 8 9
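Putting this together with predict(), a minimal sketch might look as follows. The data frame dd and the model are made up for illustration, with x standing in for height and y for dryshoot.

```r
# Toy data: x is the predictor, y the response
dd <- data.frame(x = 1:20, y = 3 + 2 * (1:20) + rep(c(-1, 1), 10))
fit <- lm(y ~ x, data = dd)

new_vec <- dd[dd$x < 10, "x"]                 # collapses to a plain vector
new_df <- dd[dd$x < 10, "x", drop = FALSE]    # stays a one-column data frame

# Passing the bare vector fails; capture the error message instead of stopping
err <- tryCatch(predict(fit, newdata = new_vec),
                error = function(e) conditionMessage(e))

# Either a one-column data frame or the full set of columns works
pred <- predict(fit, newdata = new_df)
pred_all_cols <- predict(fit, newdata = dd[dd$x < 10, ])

err
pred
```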

How to change the names of confidence levels per variable in linear regression

I got the confidence levels per variable in linear regression. I wanted to use the results for sorting variables, so I kept the result set as a data frame. However, when I tried to call str() on one of the variables I got an error (written below). How can I store the result data set so I'll be able to work on it?
df <- read.table(text = "target birds wolfs
1 9 7
1 8 4
0 2 8
1 2 3 3
0 1 2
1 7 1
0 1 5
1 9 7
1 8 7
0 2 7
0 2 3
1 6 3
0 1 1
0 3 9
0 1 1 ",header = TRUE)
model<-lm(target~birds+wolfs,data=df)
confint(model)
                  2.5 %     97.5 %
(Intercept) -0.23133823 0.36256052
birds        0.10102771 0.18768505
wolfs       -0.09698902 0.00812353
s<-as.data.frame(confint(model))
str(s$2.5%)
Error: unexpected numeric constant in "str(s$2.5"
The expression behind the $ operator must be a valid R identifier. 2.5% isn’t a valid R identifier, but there’s a simple way of making it one: put it into backticks: `2.5%` (see footnote 1). In addition, you need to pay attention that the column name matches exactly (or at least its prefix does). In other words, you need to add a space before the %:
str(s$`2.5 %`)
In general, a$b is the same as a[['b']] (with some subtleties; refer to the documentation). So you can also write:
str(s[['2.5 %']])
Alternatively, you could provide different column names for the data.frame that are valid identifiers, by just assigning different column names. Beware of make.names though: it makes your strings into valid R names, but at the cost of mangling them in ways that are not always obvious. Relying on it risks confusing readers of the code, because previously undeclared identifiers suddenly appear in the code. In the same vein, you should always specify check.names = FALSE with data.frame, otherwise R once again mangles your column names.
Footnote 1: In fact, R also accepts single quotes here (s$'2.5 %'). However, I suggest you forget this immediately; it’s a historical accident of the R language, and treating identifiers and strings the same (especially since it’s done inconsistently) does more harm than good.
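For the sorting use case mentioned in the question, assigning explicit column names is probably the most convenient route. The sketch below uses the built-in mtcars data as a stand-in, and the names lower and upper are arbitrary choices for illustration.

```r
# Stand-in model on built-in data
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- as.data.frame(confint(model))
names(s)                        # "2.5 %" "97.5 %"

# Both access styles return the same column
lower_backtick <- s$`2.5 %`
lower_bracket <- s[["2.5 %"]]

# Rename to valid identifiers, then sort by the lower bound
names(s) <- c("lower", "upper")
s_sorted <- s[order(s$lower), ]
str(s_sorted$lower)
```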

finding set of multinomial combinations

Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
1. Building up the matrix from nothing using for loops, and attempting things with the reshape package (thinking there might be something there for this with melt()).
2. Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or removing duplicated rows within groups.
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive, but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Does anyone have any ideas how I should proceed?
Here is how you can implement your "working backwards" approach:
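# Column pairs that make up each of the three groups
gps <- list(1:2, 3:4, 5:6)
get.col <- function(x, j) x[, j]
# TRUE for rows whose entries never decrease from left to right
is.ordered <- function(x) !colSums(diff(t(x)) < 0)
# Keep a permutation only if every group is in ascending order
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))
dat <- dat[is.valid, ]
nrow(dat)
# [1] 90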
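If the full permutation matrix becomes too large, the groups can also be built up directly. The following is a sketch of a different technique from the filtering above: a recursive helper based on combn() from base R. The function name multinomial_groups is hypothetical and not from any package.

```r
# Hypothetical helper: split w into consecutive groups of the given sizes,
# returning one row per distinct assignment
multinomial_groups <- function(w, sizes) {
  stopifnot(sum(sizes) == length(w))
  # Last group: whatever is left over
  if (length(sizes) == 1) return(matrix(w, nrow = 1))
  # All ways to choose the positions of the first group
  first <- combn(length(w), sizes[1])
  out <- lapply(seq_len(ncol(first)), function(j) {
    idx <- first[, j]
    # Recurse on the remaining elements for the remaining groups
    rest <- multinomial_groups(w[-idx], sizes[-1])
    cbind(matrix(w[idx], nrow = nrow(rest), ncol = length(idx), byrow = TRUE), rest)
  })
  do.call(rbind, out)
}

res <- multinomial_groups(1:6, c(2, 2, 2))
dim(res)
head(res, 5)
```

Because it never materialises the 720 permutations, this approach should scale somewhat better to other group sizes, e.g. multinomial_groups(1:7, c(3, 2, 2)).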

ChoiceModelR - Hierarchical Bayes Multinomial Logit Model

I hope that some of you are a bit experienced with the R package ChoiceModelR by Sermas and Colias, to estimate a Hierarchical Bayes Multinomial Logit Model. Actually, I am quite a newbie on both R and Hierarchical Bayes. However, I tried to get some estimates by using the script provided by Sermas and Colias in the help file. I have a data set in the same structure as they use (ID, choice set, alternative, independent variables, and choice variable). I have four independent variables all of them binary coded as categorical variables, none of them restricted. I have eight choice sets with three alternatives within each set as well as one no-choice-option as fourth alternative. I tried the following script:
library (ChoiceModelR)
data <- read.delim("Z:/KLU/CSR/CBC/mp3_vio.txt")
xcoding=c(0,0,0,0)
mcmc = list(R = 10, use = 10)
options = list(none=FALSE, save=TRUE, keep=1)
attlevels=c(2,2,2,2)
c1=matrix(c(0,0,0,0),2,2)
c2=matrix(c(0,0,0,0),2,2)
c3=matrix(c(0,0,0,0),2,2)
c4=matrix(c(0,0,0,0),2,2)
constraints = list(c1, c2, c3, c4)
out = choicemodelr(data, xcoding, mcmc = mcmc, options = options, constraints = constraints)
and have got the following error message:
Error in 1:nalts[i] : result would be too long a vector
In addition: There were 50 or more warnings (use warnings() to see the first 50). The mentioned warnings are of the following:
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
Actually, I have no idea what went wrong, since I used the same data structure, even though I have more independent variables, more choice sets, and more alternatives within a choice set. It would be fantastic if anybody could shed some light on this.
I know that this may not be helpful since you posted so long ago, but if it comes up again in the future, this could prove useful.
One of the most common reasons for this error (in my experience) has been that either the scenario variable or the alternative variable is not in ascending order within your data.
id scenario alt x1 ... y
 1        1   1  4     1
 1        1   2  1     0
 1        3   1  4     2
 1        3   2  5     0
 2        1   4  3     1
 2        1   5  1     0
 2        2   1  4     2
 2        2   2  3     0
This dataset will give you errors since the scenario and alternative variables must be ascending, and they must not skip any values. Just to fully reiterate what I mean, the scenario and alt variables must be reordered as follows in order to work:
id scenario alt x1 ... y
 1        1   1  4     1
 1        1   2  1     0
 1        2   1  4     2
 1        2   2  5     0
 2        1   1  3     1
 2        1   2  1     0
 2        2   1  4     2
 2        2   2  3     0
I work with ChoiceModelR quite frequently, and this is what has caused these errors for me in the past. If you have a github account, you can also post your data (or modified data) there if you end up wanting to have other users take a look.
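The reordering described above can be sketched in base R. The column names id, scenario and alt are assumptions taken from the example tables, and the data frame reproduces the first (failing) table. Whether relabelling alternatives is appropriate for a given design is a judgment call, so treat this as a starting point rather than a general fix.

```r
# The failing example from above (columns assumed to be named id, scenario, alt)
dat <- data.frame(
  id       = c(1, 1, 1, 1, 2, 2, 2, 2),
  scenario = c(1, 1, 3, 3, 1, 1, 2, 2),
  alt      = c(1, 2, 1, 2, 4, 5, 1, 2),
  x1       = c(4, 1, 4, 5, 3, 1, 4, 3),
  y        = c(1, 0, 2, 0, 1, 0, 2, 0)
)

# Sort so that scenario and alt are ascending within each respondent
dat <- dat[order(dat$id, dat$scenario, dat$alt), ]

# Renumber scenarios 1, 2, ... within each id, closing any gaps
dat$scenario <- ave(dat$scenario, dat$id,
                    FUN = function(s) match(s, sort(unique(s))))

# Renumber alternatives 1, 2, ... within each id/scenario combination
dat$alt <- ave(dat$alt, dat$id, dat$scenario,
               FUN = function(a) match(a, sort(unique(a))))

dat
```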
