K-fold cross-validation using cv.lm() - r

I am new to R and am trying to do K-fold cross-validation using cv.lm().
Reference: http://www.statmethods.net/stats/regression.html
I am getting an error saying that the lengths of my variables differ. However, when I checked with length(), the lengths were in fact the same.
Below is a minimal data set that reproduces the problem:
X Y
277 5.20
285 5.17
297 4.96
308 5.26
308 5.11
263 5.27
278 5.20
283 5.16
268 5.17
250 5.20
275 5.18
274 5.09
312 5.03
294 5.21
279 5.29
300 5.14
293 5.09
298 5.16
290 4.99
273 5.23
289 5.32
279 5.21
326 5.14
293 5.22
256 5.15
291 5.09
283 5.09
284 5.07
298 5.27
269 5.19
I used the code below to do the cross-validation:
# K-fold cross-validation, with K=10
sampledata <- read.table("H:/sample.txt", header=TRUE)
y.1 <- sampledata$Y
x.1 <- sampledata$X
fit=lm(y.1 ~ x.1)
library(DAAG)
cv.lm(df=sampledata, fit, m=10)
The error on the terminal,
Error in model.frame.default(formula = form, data = df[rows.in, ], drop.unused.levels = TRUE) :
variable lengths differ (found for 'x.1')
Verification,
> length(x.1)
[1] 30
> length(y.1)
[1] 30
The above confirms that the lengths are the same.
> str(x.1)
int [1:30] 277 285 297 308 308 263 278 283 268 250 ...
> str(y.1)
num [1:30] 5.2 5.17 4.96 5.26 5.11 5.27 5.2 5.16 5.17 5.2 ...
> is(y.1)
[1] "numeric" "vector"
> is(x.1)
[1] "integer" "numeric" "vector" "data.frameRowLabels"
A further check, shown above, indicates that one variable is integer and the other is numeric. But even when I convert numeric to integer or integer to numeric, the same error appears, again pointing to a length problem.
Can you tell me what I should do to correct the error?
I have been stuck on this for two days and did not find any good lead by searching the internet.
Additional related query:
I see that the fit works if we use the column headers of the data set in the formula,
fit=lm(Y ~ X, data=sampledata)
a) What is the difference between the above syntax and
fit1=lm(sampledata$Y ~ sampledata$X)
I thought they were the same. In the code below,
#fit 1 works
fit1=lm(Y ~ X, data=sampledata)
cv.lm(df=sampledata, fit1, m=10)
#fit 2 does not work
fit2=lm(sampledata$Y ~ sampledata$X)
cv.lm(df=sampledata, fit2, m=10)
The problem is with df=sampledata, since the column "sampledata$Y" does not exist in it; only Y does. I tried changing the cv.lm call to the following, but it does not work either:
cv.lm(fit2, m=10)
b) If we want to transform the variables, how can we use them in cv.lm()? For example:
y.1 <- (sampledata$Y/sampledata$X)
x.1 <- (1/sampledata$X)
#fit 4 problem
fit4=lm(y.1 ~ x.1)
cv.lm(df=sampledata, fit4, m=10)
Is there a way I could reference y.1 and x.1 instead of the header Y ~ X in the function?
Thanks.

I'm not sure exactly why this happens, but I noticed that you do not pass the data argument to lm(), so that was my first guess:
fit=lm(Y ~ X, data=sampledata)
Since the error is gone, this may be a sufficient answer.
UPD: The reason for the error is that y.1 and x.1 do not exist in sampledata, which is passed as the df argument to cv.lm. cv.lm refits the model on subsets of df (the data = df[rows.in, ] in the error message), so the formula y.1 ~ x.1 cannot be resolved against those subsets.
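For part (b) of the question, one option is to store the transformed variables as columns of the data frame, so that any function that refits the model on subsets of the data can find them. The sketch below is illustrative rather than definitive: it uses the data from the question, and the manual 5-fold loop in base R is only a stand-in for the kind of refitting cv.lm performs, not DAAG's actual implementation. The names folds, sq_err and cv_mse are made up for this example.

```r
# Data from the question
sampledata <- data.frame(
  X = c(277, 285, 297, 308, 308, 263, 278, 283, 268, 250,
        275, 274, 312, 294, 279, 300, 293, 298, 290, 273,
        289, 279, 326, 293, 256, 291, 283, 284, 298, 269),
  Y = c(5.20, 5.17, 4.96, 5.26, 5.11, 5.27, 5.20, 5.16, 5.17, 5.20,
        5.18, 5.09, 5.03, 5.21, 5.29, 5.14, 5.09, 5.16, 4.99, 5.23,
        5.32, 5.21, 5.14, 5.22, 5.15, 5.09, 5.09, 5.07, 5.27, 5.19)
)

# Store the derived variables as columns instead of free-standing vectors
sampledata$y.1 <- sampledata$Y / sampledata$X
sampledata$x.1 <- 1 / sampledata$X

# Fit with data = so the formula refers to columns of sampledata
fit4 <- lm(y.1 ~ x.1, data = sampledata)

# Every variable in the formula can now be found in the data frame
vars_found <- all.vars(formula(fit4)) %in% names(sampledata)
print(vars_found)

# Manual 5-fold cross-validation (illustrative stand-in, base R only)
set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(sampledata)))
sq_err <- numeric(nrow(sampledata))
for (i in seq_len(k)) {
  held_out <- folds == i
  # Refit on the training rows; this only works because y.1 and x.1
  # are columns of the subset passed as data
  fit_i <- lm(y.1 ~ x.1, data = sampledata[!held_out, ])
  pred_i <- predict(fit_i, newdata = sampledata[held_out, ])
  sq_err[held_out] <- (sampledata$y.1[held_out] - pred_i)^2
}
cv_mse <- mean(sq_err)
print(cv_mse)

# With DAAG installed, fit4 can be passed to cv.lm in the same way as fit1,
# e.g. cv.lm(sampledata, fit4, m = 10). The name of the first argument may
# differ between DAAG versions, so it is passed by position here.
```

The key point is the data = sampledata in the lm() call: the formula then names columns, and those columns travel with every subset of the data frame.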

Related

Save survdiff output

I have run a logrank test with survdiff like below:
survdiff(formula = Surv(YearsToEvent, Event) ~ Cat, data = RegressionData)
I get the following output:
N Observed Expected (O-E)^2/E (O-E)^2/V
0 30913 487 437.9 5.50 11.9
1 3755 56 23.2 46.19 48.0
2 3322 36 45.2 1.89 2.0
3 15796 260 332.6 15.85 27.3
Chisq= 71.9 on 3 degrees of freedom, p= 0.000000000000002
How can I save this (especially the p-value) to a .txt file? I am looping over a bunch of regressions like this and want to save them all to one .txt file.
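One possible approach, sketched under some assumptions: the object returned by survdiff() stores the test statistic in its chisq component and the group sizes in n, so the p-value can be recomputed with pchisq() and appended to a text file with cat(). The example uses the lung data set shipped with the survival package as a stand-in for RegressionData, writes to a temporary file (out_file is a hypothetical path), and assumes a test without strata, so that the degrees of freedom equal the number of groups minus one.

```r
library(survival)

# Stand-in for the model in the question; lung ships with survival
fit <- survdiff(Surv(time, status) ~ sex, data = lung)

# Recompute the p-value from the chi-square statistic
# (assumes no strata: degrees of freedom = number of groups - 1)
deg_free <- length(fit$n) - 1L
p_value <- pchisq(fit$chisq, df = deg_free, lower.tail = FALSE)

out_file <- tempfile(fileext = ".txt")  # hypothetical output path

# Append one summary line per model; inside a loop, repeat this call
cat(sprintf("chisq=%.3f df=%d p=%.3g\n", fit$chisq, deg_free, p_value),
    file = out_file, append = TRUE)

# The full printed table can be appended as well
cat(capture.output(print(fit)), sep = "\n", file = out_file, append = TRUE)

print(readLines(out_file))
```

Using append = TRUE lets each iteration of a loop add its result to the same file instead of overwriting it.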

Warnings and errors in RStudio when trying to compute the Akaike Information Criterion

I am new to RStudio and Stack and am trying to calculate the AIC for my model with 8 variables (one dummy) but only 10 observations (not sure if this is the issue).
The code I have run so far is:
library("MASS")
energy_data <- read.table('energy.txt', header = T)
full <- lm(y~., data = energy_data)
Following this, I get the warning
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
I have looked at other forum threads but I am not sure if they apply in my case.
I have also tried to run this as a CSV file but in this case I get the error:
Error in model.frame.default(formula = y ~ ., data = Energy, drop.unused.levels = TRUE) :
variable lengths differ (found for 'GSP') where GSP is x1
Again, this has been asked before, but I am unsure how to address it in my case.
.txt data
CSV data
Thanks in advance!
The problem is the comma (the thousands separator) in the y column, which causes the column to be read as a factor (or, in R 4.0 and later, as a character vector) rather than as a number. You can fix it by removing the commas with gsub, but to do that you first have to convert the column to character. Once the commas are removed, convert it to numeric.
library(MASS)
energy_data <- read.table(text="x1 x2 x3 x4 x5 x6 x7 x8 y
361 79.7 2.983013699 1.683972603 4.35 2573.2 2048 0 77,465.59
369 86.9 2.814684932 1.701424658 4.7 2599 2094.5 0 76,705.04
375 100 2.780956284 2.006202186 3.7 2634.2 2141 0.5 72,704.99
384 119.3 3.594931507 1.448 2.75 2669.4 2266 1 70,061.93
392 123.5 3.354246575 1.522657534 2.5 2704.6 2391 0.5 69,587.52
402 114.5 3.17438562 1.653780822 2.1 2739.8 2418 0 70,138.85
410 109.3 3.691912568 1.346284153 1.75 2775 2445 0 70,405.99
418 120.8 3.725232877 1.818520548 1.5 2859.7 2586 0 70,879.50
421 136.6 3.681369863 1.79890411 1.5 2944.4 2726 0 70,436.26
431 135.2 3.960821918 1.959369863 1.2 3029.1 2867 0 70,161.83", header=TRUE)  # default separator: any whitespace
# Strip the thousands separators, then convert to numeric
energy_data$y <- as.numeric(gsub(",","",as.character(energy_data$y)))
full <- lm(y~., data = energy_data)
You can also solve it by editing the file you read in to remove the commas.
energy_data <- read.table(text="x1 x2 x3 x4 x5 x6 x7 x8 y
361 79.7 2.983013699 1.683972603 4.35 2573.2 2048 0 77465.59
369 86.9 2.814684932 1.701424658 4.7 2599 2094.5 0 76705.04
375 100 2.780956284 2.006202186 3.7 2634.2 2141 0.5 72704.99
384 119.3 3.594931507 1.448 2.75 2669.4 2266 1 70061.93
392 123.5 3.354246575 1.522657534 2.5 2704.6 2391 0.5 69587.52
402 114.5 3.17438562 1.653780822 2.1 2739.8 2418 0 70138.85
410 109.3 3.691912568 1.346284153 1.75 2775 2445 0 70405.99
418 120.8 3.725232877 1.818520548 1.5 2859.7 2586 0 70879.50
421 136.6 3.681369863 1.79890411 1.5 2944.4 2726 0 70436.26
431 135.2 3.960821918 1.959369863 1.2 3029.1 2867 0 70161.83", header=TRUE)  # default separator: any whitespace
full <- lm(y~., data = energy_data)
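To make the conversion step concrete, here is a minimal sketch on three of the y values from the question. It also shows why the detour through character matters when the column is a factor: calling as.numeric() directly on a factor returns the internal level codes, not the numbers. The variable names are made up for illustration.

```r
# Values as they appear in the y column of the question
y_raw <- c("77,465.59", "76,705.04", "72,704.99")

# Strip the thousands separators, then convert
y_num <- as.numeric(gsub(",", "", y_raw))
print(y_num)

# Pitfall: as.numeric() on a factor returns the internal level codes
y_factor <- factor(y_raw)
level_codes <- as.numeric(y_factor)
print(level_codes)

# Going through as.character() first gives the real numbers
y_from_factor <- as.numeric(gsub(",", "", as.character(y_factor)))
print(y_from_factor)
```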

Reverse Johnson transformation

I want to perform a regression and I have a data set with a left-skewed target variable (Murder) like this:
data("USArrests")
str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
hist(USArrests$Murder)
Since the data is left-skewed, I can log-transform the target in order to improve the performance of the model.
train = USArrests[1:30,]
train$Murder = log(train$Murder)
test = USArrests[31:50,]
If I want to apply this model to the test set, I have to reverse the transformation to get the actual result. I can do this with exp.
fit = lm(Murder~., data = train)
pred = predict(fit, test)
exp(pred)
However, in my case the log transformation is not enough to make the target normally distributed, so I used the Yeo-Johnson transformation.
library(bestNormalize)
train$Murder = yeojohnson(train$Murder)$x.t
Is it possible to reverse this transformation, as with the log transformation above?
As noted by Rui Barradas, the predict function can be used here. Instead of directly pulling out x.t from the yeojohnson function, you can do the following:
# Store the transformation object
yj_obj <- yeojohnson(train$Murder)
# Perform transformation
yj_vals <- predict(yj_obj)
# Reverse transformation
orig_vals <- predict(yj_obj, newdata = yj_vals, inverse = TRUE)
# Should be the same as the original values
all.equal(orig_vals, train$Murder)
The same workflow can be done with the log and exponentiation transformation via the log_x function (together with the predict function and the inverse = TRUE argument).
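Applied to the regression from the question, the same idea might look like the sketch below: fit on the transformed target, predict on the test set, then map the predictions back with inverse = TRUE. This is a hedged illustration, not a definitive recipe. It assumes the bestNormalize package is installed (the code is skipped otherwise), and train_t, pred_t and pred are names invented for the example.

```r
train <- USArrests[1:30, ]
test  <- USArrests[31:50, ]

if (requireNamespace("bestNormalize", quietly = TRUE)) {
  # Keep the transformation object so it can be inverted later
  yj_obj <- bestNormalize::yeojohnson(train$Murder)

  # Fit the model on the transformed target
  train_t <- train
  train_t$Murder <- predict(yj_obj)
  fit <- lm(Murder ~ ., data = train_t)

  # Predictions come out on the transformed scale ...
  pred_t <- unname(predict(fit, newdata = test))

  # ... and are mapped back to the original scale with inverse = TRUE
  pred <- predict(yj_obj, newdata = pred_t, inverse = TRUE)
  print(head(pred))
}
```

The important design choice is to keep yj_obj rather than only its x.t component, because the object carries the fitted parameters needed for the inverse.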

Sorting pairs by first coordinate

I have the following vectors:
X<-c(140,140,130,109,124,114,65,162,150,0)
Y<-c(30.65,6.45,17.74,11.29,3.23,3.23,3.23,8.06,14.52,1.61)
What I would like to do is assign each entry in X to the corresponding entry in Y, and then order them by X. For example, if I had
J<-c(10,40,20)
K<-c(9,9,2)
I would like it to give me
Jo = (10,20,40)
Ko = (9,2,9)
How do I do this in R? Thanks for the help.
Use the order() function:
X <- c(140,140,130,109,124,114,65,162,150,0)
Y <- c(30.65,6.45,17.74,11.29,3.23,3.23,3.23,8.06,14.52,1.61)
ord <- order(X)
(X2 <- X[ord])
## [1] 0 65 109 114 124 130 140 140 150 162
(Y2 <- Y[ord])
## [1] 1.61 3.23 11.29 3.23 3.23 17.74 30.65 6.45 14.52 8.06
(Don't really need to save ord if you re-order Y first; could use Y2 <- Y[order(X)]; X2 <- sort(X) instead.)
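Applied to the small J and K example from the question, the same approach gives exactly the requested result. The second half of the sketch is an optional variant, not part of the answer above: keeping the pairs in a data frame (pairs_df is a made-up name) so both columns are reordered together.

```r
J <- c(10, 40, 20)
K <- c(9, 9, 2)

# Permutation that sorts J; apply it to both vectors
ord <- order(J)
Jo <- J[ord]
Ko <- K[ord]
print(Jo)  # 10 20 40
print(Ko)  # 9 2 9

# Variant: keep the pairs in a data frame so rows move as a unit
pairs_df <- data.frame(J = J, K = K)
pairs_sorted <- pairs_df[order(pairs_df$J), ]
print(pairs_sorted)
```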

remove character columns from a numeric data frame

I have a data frame like the one you see here.
DRSi TP DOC DN date Turbidity Anions
158 5.9 3371 264 14/8/06 5.83 2246.02
217 4.7 2060 428 16/8/06 6.04 1632.29
181 10.6 1828 219 16/8/06 6.11 1005.00
397 5.3 1027 439 16/8/06 5.74 314.19
2204 81.2 11770 1827 15/8/06 9.64 2635.39
307 2.9 1954 589 15/8/06 6.12 2762.02
136 7.1 2712 157 14/8/06 5.83 2049.86
1502 15.3 4123 959 15/8/06 6.48 2648.12
1113 1.5 819 195 17/8/06 5.83 804.42
329 4.1 2264 434 16/8/06 6.19 2214.89
193 3.5 5691 251 17/8/06 5.64 1299.25
1152 3.5 2865 1075 15/8/06 5.66 2573.78
357 4.1 5664 509 16/8/06 6.06 1982.08
513 7.1 2485 586 15/8/06 6.24 2608.35
1645 6.5 4878 208 17/8/06 5.96 969.32
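Before I got here, I used the following code to remove the columns that contain no values at all or some NAs.
# Collect the indices of columns that contain at least one NA
rem = NULL
for(col.nr in seq_len(ncol(E.3))){
if(anyNA(E.3[, col.nr])){
rem = c(rem, col.nr)
}
}
# Drop those columns; setdiff() also works when rem is still NULL
E.4 <- E.3[, setdiff(seq_len(ncol(E.3)), rem)]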
Now I need to remove the "date" column but not based on its column name, rather based on the fact that it's a character string.
I've seen here (Remove an entire column from a data.frame in R) already how to simply set it to NULL and other options but I want to use a different argument.
First use is.character to find all columns with class character. However, make sure that your date is really a character, not a Date or a factor. Otherwise use is.factor, or a check such as inherits(x, "Date"), instead of is.character (base R has no is.Date function; the lubridate package provides one).
Then just subset the columns that are not characters in the data.frame, e.g.
df[, !sapply(df, is.character)]
I had a similar problem, but the answer above did not resolve it for Date columns (which is what I needed), so I found another solution:
df[,-grep ("Date|factor|character", sapply (df, class))]
This returns your df without Date, character and factor columns.
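Both answers can be tried on a few rows of the data from the question. The sketch below is only an illustration with made-up variable names. It uses a logical mask with drop = FALSE rather than negative indices, because negative indexing with an empty match (for example when grep finds no such columns) selects no columns at all instead of leaving the data frame unchanged.

```r
# A few rows from the question; date starts out as character
df <- data.frame(
  DRSi = c(158, 217, 181),
  TP   = c(5.9, 4.7, 10.6),
  date = c("14/8/06", "16/8/06", "16/8/06"),
  stringsAsFactors = FALSE
)

# Drop character columns; drop = FALSE keeps the result a data frame
keep_non_char <- !sapply(df, is.character)
df_no_char <- df[, keep_non_char, drop = FALSE]
print(names(df_no_char))

# If the column is a real Date, is.character no longer catches it
df$date <- as.Date(df$date, format = "%d/%m/%y")

# inherits() covers Date, factor and character columns in one test
is_unwanted <- sapply(df, function(col) {
  inherits(col, c("Date", "factor", "character"))
})
df_clean <- df[, !is_unwanted, drop = FALSE]
print(names(df_clean))
```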
