I am trying to build a model that predicts the group of a city according to its development level: the cities in the 1st group are the most developed and the ones in the 6th group are the least developed. I have 10 numerical variables about each city in my data.
First, I normalized them using max-min normalization. Then I generated the training and test sets from my 81 cities. The dimensions of the training and test sets are 61x10 and 20x10, respectively (I excluded the target variable from both). Then I made the training labels and test labels, with dimensions 61x1 and 20x1.
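Roughly, the normalization step looked like this (a minimal sketch, not my exact code; Data and the column positions are placeholders):
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
Data.norm <- as.data.frame(lapply(Data[, 1:10], min_max))  # max-min normalization of the 10 numeric variables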
Then I ran the knn function like this:
knn(train = Data.training, test = Data.test, cl = Data.trainLabels, k = 3)
Its output is this:
[1] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Levels: 1 2 3 4 5 6
But if I set the argument use.all to FALSE, I get this output instead, and it changes every time I run the code:
[1] 1 4 2 2 2 3 5 4 3 5 5 6 5 6 5 6 4 5 2 2
Levels: 1 2 3 4 5 6
I can't figure out why my code gives the same prediction for every city in the first case, or what use.all has to do with it.
As explained in the knn documentation:
use.all controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.
In your case, all points have the same distances, so they all win as 'best neighbour' (use.all = TRUE), or the algorithm picks k winners at random (use.all = FALSE).
The problem seems to be in how you trained the algorithm or in the data itself. Since you did not post a sample of your data, I cannot help with that, but I suggest that you re-check it. You can also compute a few distances by hand, to see what is going on.
Also, check that you randomised your data before splitting it into training and testing sets. For example, say that the dataset is ordered by the label (the target variable). If you use the first rows in that order to train the algorithm, it is likely that the algorithm will never see some of the labels during the training phase and therefore it will perform poorly on those during the testing phase.
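For example, something along these lines (a rough sketch only; I'm assuming your full data frame is called Data, with the 10 predictors in the first 10 columns and the group label in a column called group):
set.seed(42)                            # for reproducibility
shuffled <- Data[sample(nrow(Data)), ]  # randomise the row order first
Data.training    <- shuffled[1:61, 1:10]
Data.test        <- shuffled[62:81, 1:10]
Data.trainLabels <- shuffled$group[1:61]
Data.testLabels  <- shuffled$group[62:81]
# Sanity check: distances between one test city and a few training cities
# should NOT all be identical
dist(rbind(Data.test[1, ], Data.training[1:3, ]))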
I have the following formula:
Reg_Total<- In_Bigdata2 %>%
lm(log(This_6) ~ This_1+This_2+This_3+This_4+
This_5+This_7+This_8+
This_12+This_13+This_14+This_15+This_16+This_17,This_18,data = .)
With that data, and with only the variable This_18 given as the subset, do you know why it gives me a perfect regression with an R² of 1?
OK, this was a good puzzle.
You have to dig a little bit to find out what the subset= argument does, as it gets passed to the model.frame() function inside lm(). From ?model.frame():
subset: a specification of the rows to be used: defaults to all rows.
This can be any valid indexing vector (see ‘[.data.frame’)
for the rows of ‘data’ or if that is not supplied, a data
frame made up of the variables used in ‘formula’.
(emphasis added). Usually people specify a logical expression for subset= (e.g. This_5>2) to restrict the regression to particular cases. If you put in an integer vector, lm()/model.frame() will select the rows corresponding to those integers.
So ... what lm()/model.frame() have done is to construct a data set for the linear model that consists of rows of the original data set indexed by This_18. In other words, since the first few elements of This_18 are (2,3,4,3,3,2, ...), the first row of the new data set will be row 2 of the original data set; the second row will be row 3; the third row will be row 4; the fourth row will be another copy of row 3; and so on ...
head(model.frame(This_6~.-This_18, data=dd, subset=This_18))
## This_6 This_1 This_2 This_3 This_4 This_5 This_7 This_8 This_9 This_10 ...
## 2 2 5 3 3 3 3 3 2 3 1 ...
## 3 3 3 3 3 3 3 3 4 4 4 ...
## 4 1 3 3 3 3 3 3 2 1 2 ...
## 3.1 3 3 3 3 3 3 3 4 4 4 ...
## 3.2 3 3 3 3 3 3 3 4 4 4 ...
## 2.1 2 5 3 3 3 3 3 2 3 1 ...
(you can also get this object by running model.frame(fitted_model)).
Therefore, since the only values of This_18 are the integers 1-6, you get a regression run only on multiple copies of rows 1-6 of the original data set. Thus it's not surprising that you get a perfect fit, since there are only 6 unique response/sets of predictors.
The remaining question is ... what did you intend to do by using subset=This_18 ... ? "subset" refers to a subset of observations, not a subset of predictors.
If you want to do best subset regression (i.e. find the subset of predictors that maximizes some criterion) there is not a single easy answer (and in fact there are some potential statistical pitfalls if you are interested in inference rather than prediction). Googling "R best subset regression" should help you, or searching for those keywords on Stack Overflow. (Or see the glmulti package, or the leaps package, or the stepAIC function in the MASS package, or the MuMIn package, or ...)
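For reference, a minimal sketch of the two different operations (same made-up variable names as above, with dd standing in for the data frame):
# subset= selects ROWS (observations), e.g. restrict the fit to particular cases:
fit_rows <- lm(log(This_6) ~ This_1 + This_2 + This_3, data = dd, subset = This_5 > 2)
# Leaving a PREDICTOR out is done in the formula itself, not via subset=:
fit_cols <- lm(log(This_6) ~ This_1 + This_2 + This_3 + This_4 + This_5 +
                 This_7 + This_8 + This_12 + This_13 + This_14 +
                 This_15 + This_16 + This_17, data = dd)  # This_18 simply omitted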
I am quite familiar with R, but I have never had this requirement before, where I need to create an exactly equal random data partition using createDataPartition in R.
library(caret)
index <- createDataPartition(final_ts$SAR, p = 0.5, list = FALSE)
final_test_data <- final_ts[index, ]
final_validation_data <- final_ts[-index, ]
This code creates two datasets with 1396 and 1398 observations, respectively.
I am surprised that p = 0.5 doesn't do what it is supposed to do. Does it have something to do with the resulting datasets not having an odd number of observations by default?
Thanks in advance!
It has to do with the number of cases of the response variable (final_ts$SAR in your case).
For example:
y <- rep(c(0,1), 10)
table(y)
y
0 1
10 10
# even number of cases
Now we split:
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 10 obs.
train
0 1
5 5
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
If instead we build an example with an odd number of cases per class:
y <- rep(c(0,1), 11)
table(y)
y
0 1
11 11
We have:
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 12 obs.
train
0 1
6 6
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
Here is another thread which explains why the number returned by createDataPartition might seem to be "off" to us, even though it is consistent with what the function is actually trying to do.
So, it depends on what you have in final_ts$SAR and the spread of the data.
If it is a categorical variable, e.g. T and F, and you have 100 observations in total, 55 of them T and 45 F, then invoking it the way you do in your code will return 51 indices, because:
55 * 0.5 = 27.5 and 45 * 0.5 = 22.5; rounding each result up gives 28 + 23 = 51.
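For instance, a small sketch with a made-up T/F factor, just to reproduce that arithmetic:
library(caret)
y <- factor(rep(c("T", "F"), times = c(55, 45)))  # 55 T and 45 F, 100 in total
set.seed(1)
idx <- createDataPartition(y, p = 0.5, list = FALSE)
length(idx)    # 51 rather than 50: 28 + 23, per the rounding-up logic above
table(y[idx])  # 28 T, 23 F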
You can refer to the thread below, which has a great explanation of what happens when the values you want to split are numeric:
R - caret createDataPartition returns more samples than expected
I want to perform a survival analysis which includes time-varying covariates, using the aalen() function from an R package called timereg. However, I am still confused as to how the data should be presented in a dataframe, and how the model formula should be specified.
Here's a made up data set:
subject_id  survival_time  weight  height  outcome_indicator
         1              3      65     1.8                  0
         1              4      68     1.8                  0
         1              7      70     1.8                  1
         2              2      55     1.6                  0
         2              9      53     1.6                  0
         3              2      62     1.7                  0
         3              3      65     1.7                  0
         3              5      64     1.7                  0
         3              6      66     1.7                  0
And here are some interpretations:
There are 3 study subjects, identified by the subject_id variable, and they were followed up 3, 2 and 4 times, respectively.
weight is a time-varying covariate.
height is independent of time and so for each subject, it remained the same at each follow up.
Suppose the unit of survival_time is years; then the event of interest happened to subject 1 at year 7.
Both subject 2 and 3 are right censored cases.
Each follow up that belongs to the same subject can be ordered by survival_time.
Finally, a list of my questions (please don't hesitate to leave a comment even if you don't have all the answers, or if my solution is correct):
Am I right about the presentation of survival data that includes time-varying covariates?
If the answer to the first question is "no", then can you please point out what the problems are and provide some explanations?
Assuming the data set is alright, then how do I specify the model formula and fit the aalen model (or any other model that includes time-varying covariates)? Is it something like:
aalen(formula = Surv(survival_time, outcome_indicator) ~ const(height) + weight, data = data_set, id = data_set$subject_id)
where the Surv() function is used to combine the two outcome-related variables; const() is used to mark covariates whose effect is assumed constant over time, leaving the other covariates with time-varying effects; data_set is the name of the dataframe; and the id parameter is used to associate different rows of the same subject?
This is likely not the right way to represent these data. Judging from the ordering of the variable survival_time, these are the cohort times at which the covariate changes. You need a lagged time variable to indicate the "start" of each observation interval, set to 0 for a subject's first record. The way you have formatted the data now inflates the denominator time, reduces the incidence, and attenuates the hazard ratios toward the null.
Take the first participant: they are in fact observed from 0 to 7. The first record covers 0 to 3, the next 3 to 4, and the last 4 to 7. Where have you told R this explicitly? R does not know these records belong to the same individual. R now believes there are 3 people followed for a cumulative 3 + 4 + 7 = 14 years with 1 event, rather than one person followed for 7 years with 1 event (the incidence rate drops from 1 event per 7 person-years to 1 per 14).
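To make the layout concrete, here is a rough sketch (my own illustration, not the only way to do it) of turning the posted data into counting-process (start, stop] records with a lagged start time per subject; the fit is shown with coxph() from the survival package, and you should check the timereg documentation for whether aalen() accepts the same Surv(start, stop, event) outcome:
library(dplyr)
library(survival)
data_set <- data_set %>%
  group_by(subject_id) %>%
  arrange(survival_time, .by_group = TRUE) %>%
  mutate(tstart = lag(survival_time, default = 0),  # lagged start of each interval
         tstop  = survival_time) %>%                # end of each interval
  ungroup()
# Subject 1 now contributes the intervals (0,3], (3,4], (4,7], i.e. 7 years in total,
# with outcome_indicator = 1 only on the last interval.
fit <- coxph(Surv(tstart, tstop, outcome_indicator) ~ weight + height + cluster(subject_id),
             data = data_set)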
I'm using SVD in R (the svd() function) and I'm able to reduce the dimensionality of my matrix by replacing the smallest singular values with 0. But when I recompose my matrix I still have the same number of features; I could not find out how to effectively delete the most useless features of the source matrix in order to reduce its number of columns.
For example, here is what I'm doing at the moment:
This is my source matrix A:
A B C D
1 7 6 1 6
2 4 8 2 4
3 2 3 2 3
4 2 3 1 3
If I do:
s <- svd(A)
s$d[3:4] <- 0  # replace the 2 smallest singular values with 0
A_prime <- s$u %*% diag(s$d) %*% t(s$v)  # this is A'
I get A', which has the same dimensions (4x4), was reconstructed with only 2 "components" and is an approximation of A (containing a little bit less information, maybe less noise, etc.):
[,1] [,2] [,3] [,4]
1 6.871009 5.887558 1.1791440 6.215131
2 3.799792 7.779251 2.3862880 4.357163
3 2.289294 3.512959 0.9876354 2.386322
4 2.408818 3.181448 0.8417837 2.406172
What I want is a sub matrix with less columns but reproducing the distances between the different rows, something like this (obtained using PCA, let's call it A''):
PC1 PC2
1 -3.588727 1.7125360
2 -2.065012 -2.2465708
3 2.838545 0.1377343 # The similarity between rows 3
4 2.815194 0.3963005 # and 4 in A is conserved in A''
Here is the code to get A'' with PCA:
p <- prcomp(A)
A_pca <- p$x[, 1:2]  # this is A''
The final goal is to reduce the number of columns in order to speed up clustering algorithms on huge datasets.
Thank you in advance if someone can guide me :)
I would check out this chapter on dimensionality reduction or this cross-validated question. The idea is that the entire data set can be reconstructed using less information. It's not like PCA in the sense that you might only choose to keep 2 out of 10 principal components.
When you do the kind of trimming you did above, you're really just taking out some of the "noise" in your data. The data still has the same dimensions.
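To make the difference concrete, here is a small sketch using the objects from the question (assuming A is a numeric matrix; note that prcomp() centres the columns, so to reproduce the PCA scores, up to sign, the SVD has to be taken on the centred matrix):
dim(A_prime)  # still 4 x 4: zeroing singular values does not drop columns
# To actually get fewer columns, project onto the leading singular vectors instead:
A_centred <- scale(A, center = TRUE, scale = FALSE)
s_c <- svd(A_centred)
scores <- s_c$u[, 1:2] %*% diag(s_c$d[1:2])  # 4 x 2, comparable to p$x[, 1:2]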
I am dealing with a dataset that involves 9 different genotypes, arranged into 3 different classes. About 20 size measurements from each genotype have been recorded.
I tried running a two-way anova (after a one-way anova established that size differed significantly between genotypes), in order to analyse the differences between the 3 classes as well as the different genotypes.
I used the function aov(size~genotype*class,data=x)
The summary table I obtained only has a row for genotype, and I can't see class or genotype:class anywhere, as I would have expected.
The table I obtain is identical to the one I get when I just run aov(size~genotype, data=x)
What have I done wrong?
Even if class and genotype:class wouldn't change the significance of the results, shouldn't they still show up in the ANOVA summary table?
In short - you can't fit the model you're trying to fit... at least if I'm understanding your data correctly. My understanding is that you have something similar to this:
dat <- data.frame(size = rnorm(27), genotype = gl(9, 3), class = gl(3, 9))
dat
size genotype class
1 1.44189249 1 1
2 1.05766532 1 1
3 0.08133568 1 1
4 0.36642288 2 1
5 0.93266571 2 1
6 -0.64031787 2 1
7 0.33361892 3 1
8 0.53315507 3 1
9 0.26851394 3 1
10 0.05062280 4 2
11 -0.30924511 4 2
12 -0.61460429 4 2
13 -0.18901238 5 2
14 0.58881858 5 2
15 0.58625502 5 2
16 0.52002793 6 2
17 1.23862937 6 2
18 -2.02333160 6 2
19 -0.09918607 7 3
20 0.65947932 7 3
21 -0.65440238 7 3
22 0.10923036 8 3
23 0.76845484 8 3
24 -0.24804574 8 3
25 -0.30890950 9 3
26 -2.82056870 9 3
27 0.56828147 9 3
(The main thing I'm looking at is how genotype and class relate - not the actual values for size or the sample sizes for each genotype*class combination)
If each genotype is entirely contained within a single class then you can't separate the genotype effect from the class effect. Hopefully this makes sense to you - if not, let me illustrate with a smaller example. First off - since each genotype is entirely in one class, we can't fit an interaction; that just doesn't make sense at all. The interaction would only be useful if a genotype could be part of at least two classes, because it would allow us to attribute a different genotype effect depending on which class the observation is in. But since each genotype is only in one class... fitting a model with interactions is out.
Now to see why we can't fit a class effect, just consider class 1, which contains genotypes 1-3. The thing to recognize is that with linear models (and ANOVA is just a special case of a linear model) the thing we're modeling is the conditional means in the different groups - and we try to partition these into certain effects if possible. So any model that gives us the same group means is essentially equivalent. Pretend for a second that the effect for class 1 is c, and the effects for genotypes 1-3 are x, y, and z (respectively). Then the mean for the group genotype1/class1 = c+x, for genotype2/class1 = c+y, and for genotype3/class1 = c+z. But notice that we could just as easily have said the class1 effect is 0 and that the effects for genotypes 1-3 are c+x, c+y, c+z (respectively). So class is completely useless in this situation. There is no way to separate the class effect, since the genotypes are completely nested inside class. So we can only fit a model that has separate effects for genotypes, if we want to fit a completely fixed-effects model.
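You can check this directly with the toy data above:
fit <- aov(size ~ genotype * class, data = dat)
summary(fit)  # only a genotype row (plus Residuals) appears; class and
              # genotype:class are completely aliased and are dropped
alias(fit)    # shows the aliasing structure: the class terms are linear
              # combinations of the genotype terms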