Error when running boxcox on response variable - r

I'm using the following code to try to transform my response variable for regression. Seems to need a log transformation.
bc = boxCox(auto.tf.lm)
lambda.mpg = bc$x[which.max(bc$y)]
auto.tf.bc <- with(auto_mpg, data.frame(log(mpg), as.character(cylinders), displacement**.2, log(as.numeric(horsepower)), log(weight), log(acceleration), model_year))
auto.tf.bc.lm <- lm(log(mpg) ~ ., data = auto.tf.bc)
view(auto.tf.bc)
I am receiving this error though.
Error in Math.data.frame(mpg) :
non-numeric variable(s) in data frame: manufacturer, model, trans, drv, fl, class
Not sure how to resolve this. The data is in a data frame, not csv.
Here's the output from str(auto.tf.bc). Sorry for such bad question formatting.
'data.frame': 392 obs. of 7 variables:
$ log.mpg. : num 2.89 2.71 2.89 2.77 2.83 ...
$ as.character.cylinders.: chr "8" "8" "8" "8" ...
$ displacement.0.2 : num 3.14 3.23 3.17 3.14 3.13 ...
$ log.horsepower. : num 4.87 5.11 5.01 5.01 4.94 ...
$ log.weight. : num 8.16 8.21 8.14 8.14 8.15 ...
$ log.acceleration. : num 2.48 2.44 2.4 2.48 2.35 ...
$ model_year : num 70 70 70 70 70 70 70 70 70 70 ...
Removing the cylinders column doesn't change anything.
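One possible reading of the error, offered as a guess rather than a confirmed diagnosis: the columns it names (manufacturer, model, trans, drv, fl, class) are the non-numeric columns of the mpg dataset that ships with ggplot2. The str() output shows the transformed column is now called log.mpg., so log(mpg) in the formula may not find a column mpg in auto.tf.bc and may fall through to that data frame instead. A minimal sketch of the alternative, with mtcars standing in for auto_mpg and made-up column names:

```r
# mtcars is only a stand-in for auto_mpg (assumption); column names are illustrative
auto.tf.bc <- with(mtcars, data.frame(
  log_mpg = log(mpg),
  cyl     = as.character(cyl),
  log_hp  = log(hp),
  log_wt  = log(wt)
))

# The response is already logged, so refer to it by its new column name
auto.tf.bc.lm <- lm(log_mpg ~ ., data = auto.tf.bc)
print(names(coef(auto.tf.bc.lm)))
```

Naming the columns inside data.frame() avoids the auto-generated names like log.mpg. and keeps the formula from applying log() a second time.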

Related

R - sapply Over Columns, then lapply Over Elements

Likely because I've spent an hour on this, I'm curious if it is possible - I am trying to transform each element of each column in a dataframe, where the transformation applied to each element depends upon the mean and standard deviation of the column that the element is in. I wanted to use nested lapply or sapply to do this, but ran into some unforeseen issues. My current "solution" (although it does not work as expected) is:
scale_variables <- function(dframe, columns) {
means <- colMeans(dframe[sapply(dframe, is.numeric)])
sds <- colSds(as.matrix(dframe[sapply(dframe, is.numeric)]))
new_dframe <- lapply(seq_along(means), FUN = function(m) {
sapply(dframe[ , columns], FUN = function(x) {
sapply(x, FUN = helper_func, means[[m]], sds[m])
})
})
return(new_dframe)
}
So, I calculate the column means and SDs beforehand; then, I seq_along the index of each mean in means, then each of the columns with the first sapply, and then each element in the second sapply. I get the mean and SD of this particular column using index m, then pass the current element, mean, and SD to the helper function to work on.
Running this on the numeric variables in the iris dataset yields this monstrosity:
'data.frame': 150 obs. of 16 variables:
$ Sepal.Length : num -0.898 -1.139 -1.381 -1.501 -1.018 ...
$ Sepal.Width : num -2.83 -3.43 -3.19 -3.31 -2.71 ...
$ Petal.Length : num -5.37 -5.37 -5.49 -5.25 -5.37 ...
$ Petal.Width : num -6.82 -6.82 -6.82 -6.82 -6.82 ...
$ Sepal.Length.1: num 4.69 4.23 3.77 3.54 4.46 ...
$ Sepal.Width.1 : num 1.0156 -0.1315 0.3273 0.0979 1.245 ...
$ Petal.Length.1: num -3.8 -3.8 -4.03 -3.57 -3.8 ...
$ Petal.Width.1 : num -6.56 -6.56 -6.56 -6.56 -6.56 ...
$ Sepal.Length.2: num 0.76 0.647 0.534 0.477 0.704 ...
$ Sepal.Width.2 : num -0.1462 -0.4294 -0.3161 -0.3727 -0.0895 ...
$ Petal.Length.2: num -1.34 -1.34 -1.39 -1.28 -1.34 ...
$ Petal.Width.2 : num -2.02 -2.02 -2.02 -2.02 -2.02 ...
$ Sepal.Length.3: num 5.12 4.86 4.59 4.46 4.99 ...
$ Sepal.Width.3 : num 3.02 2.36 2.62 2.49 3.15 ...
$ Petal.Length.3: num 0.263 0.263 0.132 0.394 0.263 ...
$ Petal.Width.3 : num -1.31 -1.31 -1.31 -1.31 -1.31 ...
I assume I am applying each mean in means to each column of the dataframe in turn, when I only want to use it for elements in the column it refers to, so I'm not sure that nesting apply functions in this way will do what I need - but can it be done like this?
I'm not sure what your helper_func is, but I've made a toy example below:
helper_func <- function(x,m,sd) (x-m)/sd
You can then adjust your scale_variables() function like this:
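scale_variables <- function(dframe, columns) {
  # per-column mean and SD, computed only for the requested columns
  means <- apply(dframe[columns], 2, mean, na.rm = TRUE)
  sds <- apply(dframe[columns], 2, sd, na.rm = TRUE)
  # scale each column with its own mean and SD (looked up by column name)
  sapply(columns, \(col) helper_func(dframe[[col]], m = means[col], sd = sds[col]))
}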
And call it like this:
scale_variables(iris,names(iris)[sapply(iris, is.numeric)])
Output: (first 6 of 150 rows)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 -1.01843718 1.24503015 -1.33575163 -1.3110521482
6 -0.53538397 1.93331463 -1.16580868 -1.0486667950
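As a cross-check, and assuming helper_func really is the plain z-score from the toy example, base R's scale() should give the same numbers. The sketch below redefines both functions so it runs on its own; it uses function(col) instead of the \(col) shorthand only so it also runs on R versions before 4.1.

```r
# Toy helper from the answer: a plain z-score
helper_func <- function(x, m, sd) (x - m) / sd

scale_variables <- function(dframe, columns) {
  means <- sapply(dframe[columns], mean, na.rm = TRUE)
  sds   <- sapply(dframe[columns], sd, na.rm = TRUE)
  # each column is scaled with its own mean and SD, looked up by name
  sapply(columns, function(col) {
    helper_func(dframe[[col]], m = means[[col]], sd = sds[[col]])
  })
}

num_cols <- names(iris)[sapply(iris, is.numeric)]
manual   <- scale_variables(iris, num_cols)
builtin  <- scale(iris[num_cols])   # centers and scales column by column

# Largest absolute difference between the two results
max_diff <- max(abs(as.vector(manual) - as.vector(builtin)))
print(max_diff)
```

If helper_func does something other than (x - m) / sd, scale() is not a drop-in replacement, but the column-by-name lookup still applies.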

PLM is not recognizing my id variable name

I'm doing a regression analysis with fixed effects using plm() from package plm. I have selected the twoways method to account for both time and individual effects. However, after running the code below I keep receiving this message:
Error in pdata.frame(data, index) :
variable id does not exist (individual index)
Here is the code:
pdata <- DATABASE[,c(2:4,13:21)]
pdata$id <- group_indices(pdata,ISO3.p,Productcode)
coutnin <- dcast.data.table(pdata,ISO3.p+Productcode~.,value.var = "id")
setcolorder(pdata,neworder=c("id","Year"))
pdata <- pdata.frame(pdata,index=c("id","Year"))
reg <- plm(pdata,diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), effect="twoways", model="within", index = c("id","Year"))
Note that the structure of pdata shows that the id variable has multiple levels and is numeric. I initially tried a string variable instead, but got the same outcome:
Classes ‘data.table’ and 'data.frame': 1211800 obs. of 13 variables:
$ id : int 4835 6050 13158 15247 17164 18401 19564 23553 24895 27541 ...
$ Year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ Productcode: chr "101" "101" "101" "101" ...
$ ISO3.p : Factor w/ 171 levels "ABW","AFG","AGO",..: 8 9 20 22 27 28 29 34 37 40 ...
$ e : num 0.245 -0.238 1.624 0.693 0.31 ...
$ RERcp : num -0.14073 -0.16277 1.01262 0.03908 -0.00243 ...
$ RERpp : num -0.1712 NA NA NA -0.0952 ...
$ RER_GVC : num -3.44 NaN NA NA NaN ...
$ GDPR.p : num 27.5 26.6 23.5 20.3 27.8 ...
$ GDPR.r : num 30.4 30.4 30.4 30.4 30.4 ...
$ GVCPos : num 0.141 0.141 0.141 0.141 0.141 ...
$ GVCPar : num 0.436 0.436 0.436 0.436 0.436 ...
$ TV : num 17.1 17.1 17.1 17.1 17.1 ...
- attr(*, ".internal.selfref")=<externalptr>
When I convert the data.table into a pdata.frame I do not receive any warning; the error appears only after I run the plm function. Running View(table(index(pdata), useNA = "ifany")) displays no value larger than 1, so I assume I have no duplicate observations in my data.
Try putting the data argument in second position in the plm call. If pdata has already been converted to a pdata.frame, also leave out the index argument, i.e., try this:
reg <- plm(diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), data = pdata, effect = "twoways", model = "within")
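A side note on the formula itself, separate from the argument-order fix: in R's formula syntax a minus sign removes a term from the model rather than subtracting a variable, so - diff(GDPR.r,1) as written would leave GDPR.r out of the regression. Whether that is intended is not clear from the question. A small base-R sketch with a simplified stand-in formula (no diff(), illustration only):

```r
# Simplified stand-in for the formula in the question (illustration only)
f <- TV ~ RERcp + GDPR.p - GDPR.r

# Terms that would actually enter the model: GDPR.r is not among them
model_terms <- attr(terms(f), "term.labels")
print(model_terms)

# If GDPR.r is meant to be a regressor, add it with "+"
f_plus <- TV ~ RERcp + GDPR.p + GDPR.r
plus_terms <- attr(terms(f_plus), "term.labels")
print(plus_terms)
```

If a literal arithmetic difference of two variables is wanted as a single regressor, it has to be wrapped in I().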

PLSR in R with "pls" package

I'm trying to fit a PLSR model, but I'm doing something wrong. Below, you can see how I created the data frame and its structure.
reflektance <- read_excel("data/reflektance.xlsx", na = "NA")
reflektance <- dput(reflektance)
pH <- read_excel("data/rijen2016.xls", na = "NA")
pH <- na.omit(pH)
pH <- dput(pH)
reflektance<-aggregate(reflektance[, 2:753], list(reflektance$Vzorek), mean)
colnames(reflektance)[colnames(reflektance)=='Group.1']<-'Vzorek'
datapH <- merge(pH, reflektance, by="Vzorek")
datasetpH <- data.frame(pH=datapH[,2], ref=I(as.matrix(datapH[, 3:754], 22, 752)))
The problem is with "plsr", which results in this error:
ph1<-plsr(pH ~ ref, ncomp = 5, data=datasetpH)
Error in pls::mvr(ref ~ pH, ncomp = 5, data = datasetpH, method = "kernelpls") :
Invalid number of components, ncomp
dput(reflectance):
https://jpst.it/RyyS
Here you can see structure of table datapH:
'data.frame': 22 obs. of 754 variables:
$ Vzorek: chr "5 - P01" "5 - P02" "5 - P03" "5 - R1 - A1" ...
$ pH/H2O: num 6.96 6.62 7.02 5.62 5.97 6.12 5.64 5.81 5.61 5.47 ...
$ 325 : num 0.017 0.0266 0.0191 0.0241 0.016 ...
$ 326 : num 0.021 0.0263 0.0154 0.0264 0.0179 ...
$ 327 : num 0.0223 0.0238 0.0147 0.028 0.0198 ...
...
And here is the structure of table datasetpH:
'data.frame': 22 obs. of 2 variables:
$ pH : num 6.96 6.62 7.02 5.62 5.97 6.12 5.64 5.81 5.61 5.47 ...
$ ref: AsIs [1:22, 1:752] 0.016983.... 0.026556.... 0.019059.... 0.024097.... 0.016000.... ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "325" "326" "327" "328" ...
Do you have any advice and solution? Thank you
The problem seems to come from one of your columns containing only NA's.
The last line of the output of names(df) gives:
[745] "1068" "1069" "1070" "1071" "1072" "1073" "1074" "1075" NA
Using your data + some randomly generated values for pH (which isn't in the reflektance dataframe, named df here):
test=data.frame(pH=rnorm(23,5,2), ref=I(as.matrix(df[, 2:753], 22, 752)))
pls::plsr(pH ~ ref, data=test)
Error in matrix(0, ncol = ncomp, nrow = npred) :
invalid 'ncol' value (< 0)
Note that the indexing is a bit different from yours. I didn't have the second column in df (the one that contains pH in yours).
If I remove the last column, which contains only NAs:
test=data.frame(pH=rnorm(23,5,2), ref=I(as.matrix(df[, 2:752], 22, 751)))
pls::plsr(pH ~ ref, data=test)
Partial least squares regression , fitted with the kernel algorithm.
Call:
plsr(formula = pH ~ ref, data = test)
Let me know if that fixes it.
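To find such columns without reading through names(df), something like the following may help. It is a minimal base-R sketch on a made-up three-column data frame, not the reflektance data:

```r
# Toy data (hypothetical): column "c" contains only NA
df <- data.frame(a = c(1, 2, 3), b = c(0.5, 0.1, 0.9), c = NA_real_)

# TRUE for columns that have no non-missing value at all
all_na <- colSums(!is.na(df)) == 0
print(all_na)

# Keep only the columns that contain at least one value
df_clean <- df[, !all_na, drop = FALSE]
print(names(df_clean))
```

Selecting by this logical vector rather than by position also sidesteps the off-by-one indexing when the NA column is the last one.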

Analysis of PCA

I'm using the rela package to check whether I can use PCA in my data.
paf.neur2 <- paf(neur2)
summary(paf.neur2)
# [1] "Your dataset is not a numeric object."
I want to see the KMO (the Kaiser-Meyer-Olkin measure of sampling adequacy). How can I do that?
Output of str(neur2)
'data.frame': 1457 obs. of 66 variables:
$ userid : int 200 387 458 649 931 991 1044 1075 1347 1360 ...
$ funct : num 3.73 3.79 3.54 3.04 3.81 ...
$ pronoun: num 2.26 2.55 2.49 1.98 2.71 ...
.
.
.
$ time : num 1.68 1.87 1.51 1.03 1.74 ...
$ work : num 0.7419 0.2311 -0.1985 -1.6094 -0.0619 ...
$ achieve: num 0.174 0.2469 0.1823 -0.478 -0.0513 ...
$ leisure: num 0.2852 0.0296 0.0583 -0.3567 -0.0408 ...
$ home : num -0.844 -0.58 -0.844 -2.207 -1.079 ...
.
Variables are all numeric.
According to ?paf, object is a numeric dataset (usually a coerced matrix from a prior data frame)
So you need to turn your data.frame neur2 into a matrix: as.matrix(neur2).
Here is a reproduction of your problem using the Seatbelts dataset:
library(rela)
Belts <- Seatbelts[,1:7]
class(Belts)
# [1] "mts" "ts" "matrix"
Belts <- as.data.frame(Belts)
class(Belts)
# [1] "data.frame"
paf.belt <- paf(Belts)
# [1] "Your dataset is not a numeric object."
Belts <- as.matrix(Belts)
class(Belts)
# [1] "matrix"
paf.belt <- paf(Belts) # Works
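The same distinction shows up in base R without rela. Assuming paf() tests its input with something like is.numeric() (I have not checked the package source), this sketch shows why a data frame of all-numeric columns still fails that kind of check:

```r
# A data frame is a list of columns, so is.numeric() is FALSE
# even when every single column is numeric
df_is_numeric <- is.numeric(mtcars)
cols_are_numeric <- all(sapply(mtcars, is.numeric))

# After coercion to a matrix, the object itself is numeric
m <- as.matrix(mtcars)
matrix_is_numeric <- is.numeric(m)

print(c(df_is_numeric, cols_are_numeric, matrix_is_numeric))
```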
For the KMO itself, there are two options:
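kmo_DIY <- function(df) {
  library(corpcor)  # provides cor2pcor()
  # sum of squared correlations over the off-diagonal pairs
  csq <- cor(df)^2
  csumsq <- (sum(csq) - dim(csq)[1]) / 2
  # same sum for the squared partial correlations
  pcsq <- cor2pcor(cor(df))^2
  pcsumsq <- (sum(pcsq) - dim(pcsq)[1]) / 2
  kmo <- csumsq / (csumsq + pcsumsq)
  return(kmo)
}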
or use the function KMO() from the psych package.
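If installing corpcor is not an option, the same overall KMO can be sketched in base R by taking the partial correlations from the inverse of the correlation matrix. This assumes the correlation matrix is invertible; kmo_base is a hypothetical name, and mtcars is used only as example data:

```r
kmo_base <- function(df) {
  r <- cor(df)
  q <- solve(r)                             # inverse correlation matrix
  # partial correlation of each pair, given all other variables
  p <- -q / sqrt(outer(diag(q), diag(q)))
  off <- upper.tri(r)                       # each off-diagonal pair once
  sum(r[off]^2) / (sum(r[off]^2) + sum(p[off]^2))
}

k <- kmo_base(mtcars)
print(k)
```

Because only squared values are summed, the sign convention of the partial correlations does not affect the result.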

Dealing with Zero Values in Principal Component Analysis

I've really been struggling to get my PCA working and I think it is because there are zero values in my data set. But I don't know how to resolve the issue.
The first problem is, the zero values are not missing values (they are areas with no employment in a certain sector), so I should probably keep them in there. I feel uncomfortable that they might be excluded because they are zero.
Secondly, even when I try to remove all missing data I still get the same error message.
Starting with the following code, I get the following error message:
urban.pca.cov <- princomp(urban.cov, cor = TRUE)
Error in cov.wt(z) : 'x' must contain finite values only
Also, I can do this:
urban.cut<- na.omit(urban.cut)
> sum(is.na(urban.cut))
[1] 0
And then run it again and get the same issue.
urban.pca.cov <- princomp(urban.cov, cor = TRUE)
Error in cov.wt(z) : 'x' must contain finite values only
Is this a missing data issue? I've log transformed all of my variables according to this PCA tutorial. Here is the structure of my data.
> str(urban.cut)
'data.frame': 5490 obs. of 13 variables:
$ median.lt : num 2.45 2.57 2.53 2.6 2.31 ...
$ p.nga.lt : num 0.547 4.587 4.529 4.605 4.564 ...
$ p.mbps2.lt : num 1.66 4.17 4 3.9 4.2 ...
$ density.lt : num 3.24 3.44 3.85 3.21 4.28 ...
$ p_m_s.lt : num 4.54 4.61 4.56 4.61 4.61 ...
$ p_m_l.lt : num 1.87 -Inf 1.44 -Inf -Inf ...
$ p.tert.lt : num 4.59 4.61 4.55 4.61 4.61 ...
$ p.kibs.lt : num 4.25 3.05 3.12 3 3.03 ...
$ p.edu.lt : num 4.14 2.6 2.9 2.67 2.57 ...
$ p.non.white.lt : num 3.06 3.56 3.82 2.94 3.52 ...
$ p.claim.lt : num 0.459 1.287 1.146 1.415 1.237 ...
$ d.connections.lt: num 2.5614 0.6553 5.2573 0.9562 -0.0252 ...
$ SAM.KM.lt2 : num 1.449 1.081 1.071 1.246 0.594 ...
Thank you in advance for your help.
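Sounds to me like R wants finite values. -Inf is not finite; it is minus infinity, which is what log(0) returns. Since -Inf is not NA, na.omit() does not remove those rows. If you really need to log transform your data, perhaps you should use log(data + 1) so that you never take the log of 0.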
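A small base-R sketch of what is going on, using a made-up vector rather than the urban data:

```r
x <- c(0, 1, 10)          # toy values (hypothetical); the zero is a real observation

lx <- log(x)
print(lx)                 # the zero becomes -Inf

# -Inf is not NA, so na.omit() and is.na() do not catch it
any_na     <- any(is.na(lx))
all_finite <- all(is.finite(lx))

# log1p(x) computes log(1 + x) and stays finite at zero
lx1 <- log1p(x)
print(lx1)

# Count of non-finite entries, e.g. to check a column before princomp()
n_bad <- sum(!is.finite(lx))
```

Running sum(!is.finite(col)) on each column of the data frame (for example with sapply) would show which variables contain the -Inf values, p_m_l.lt being one visible in the str() output.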
