Dealing with Zero Values in Principal Component Analysis - r

I've really been struggling to get my PCA working, and I think it is because there are zero values in my data set, but I don't know how to resolve the issue.
The first problem is that the zero values are not missing values (they are areas with no employment in a certain sector), so I should probably keep them in there. I feel uncomfortable that they might be excluded because they are zero.
Secondly, even when I try to remove all missing data I still get the same error message.
Running the following code, I get this error message:
urban.pca.cov <- princomp(urban.cov, cor = T)
Error in cov.wt(z) : 'x' must contain finite values only
Also, I can do this:
urban.cut<- na.omit(urban.cut)
> sum(is.na(urban.cut))
[1] 0
And then run it again and get the same issue.
urban.pca.cov <- princomp(urban.cov, cor = T)
Error in cov.wt(z) : 'x' must contain finite values only
Is this a missing data issue? I've log-transformed all of my variables according to this PCA tutorial. Here is the structure of my data.
> str(urban.cut)
'data.frame': 5490 obs. of 13 variables:
$ median.lt : num 2.45 2.57 2.53 2.6 2.31 ...
$ p.nga.lt : num 0.547 4.587 4.529 4.605 4.564 ...
$ p.mbps2.lt : num 1.66 4.17 4 3.9 4.2 ...
$ density.lt : num 3.24 3.44 3.85 3.21 4.28 ...
$ p_m_s.lt : num 4.54 4.61 4.56 4.61 4.61 ...
$ p_m_l.lt : num 1.87 -Inf 1.44 -Inf -Inf ...
$ p.tert.lt : num 4.59 4.61 4.55 4.61 4.61 ...
$ p.kibs.lt : num 4.25 3.05 3.12 3 3.03 ...
$ p.edu.lt : num 4.14 2.6 2.9 2.67 2.57 ...
$ p.non.white.lt : num 3.06 3.56 3.82 2.94 3.52 ...
$ p.claim.lt : num 0.459 1.287 1.146 1.415 1.237 ...
$ d.connections.lt: num 2.5614 0.6553 5.2573 0.9562 -0.0252 ...
$ SAM.KM.lt2 : num 1.449 1.081 1.071 1.246 0.594 ...
Thank you in advance for your help.

Sounds to me like R wants finite values. -Inf is not finite; it is minus infinity. If you really need to log-transform your data, perhaps you should be doing log(data + 1) rather than taking the log of 0.
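A minimal sketch of that fix on a toy vector (the names here are illustrative, not from the question):

x <- c(0, 10, 250, 3000)   # stand-in for a sector share that is zero somewhere
log(x)                     # first entry is -Inf, which cov.wt()/princomp() rejects
log1p(x)                   # log(x + 1): zero maps to 0 and every value stays finite

Applying log1p() (or log(x + 1)) column-wise before princomp() keeps the zero-employment areas in the data set instead of turning them into -Inf, which is exactly what na.omit() cannot catch (since -Inf is not NA).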

Related

R - sapply Over Columns, then lapply Over Elements

Likely because I've spent an hour on this, I'm curious whether it is possible: I am trying to transform each element of each column in a data frame, where the transformation applied to each element depends on the mean and standard deviation of the column that the element is in. I wanted to use nested lapply or sapply calls to do this, but ran into some unforeseen issues. My current "solution" (although it does not work as expected) is:
scale_variables <- function(dframe, columns) {
  means <- colMeans(dframe[sapply(dframe, is.numeric)])
  # colSds() is not in base R; it comes from the matrixStats package
  sds <- colSds(as.matrix(dframe[sapply(dframe, is.numeric)]))
  new_dframe <- lapply(seq_along(means), FUN = function(m) {
    sapply(dframe[, columns], FUN = function(x) {
      sapply(x, FUN = helper_func, means[[m]], sds[m])
    })
  })
  return(new_dframe)
}
So, I calculate the column means and SDs beforehand; then I seq_along the indices of the means, loop over each of the columns with the first sapply, and over each element within the second sapply. I look up the mean and SD of the current column using index m, then pass the current element, mean, and SD to the helper function to work on.
Running this on the numeric variables in the iris dataset yields this monstrosity:
'data.frame': 150 obs. of 16 variables:
$ Sepal.Length : num -0.898 -1.139 -1.381 -1.501 -1.018 ...
$ Sepal.Width : num -2.83 -3.43 -3.19 -3.31 -2.71 ...
$ Petal.Length : num -5.37 -5.37 -5.49 -5.25 -5.37 ...
$ Petal.Width : num -6.82 -6.82 -6.82 -6.82 -6.82 ...
$ Sepal.Length.1: num 4.69 4.23 3.77 3.54 4.46 ...
$ Sepal.Width.1 : num 1.0156 -0.1315 0.3273 0.0979 1.245 ...
$ Petal.Length.1: num -3.8 -3.8 -4.03 -3.57 -3.8 ...
$ Petal.Width.1 : num -6.56 -6.56 -6.56 -6.56 -6.56 ...
$ Sepal.Length.2: num 0.76 0.647 0.534 0.477 0.704 ...
$ Sepal.Width.2 : num -0.1462 -0.4294 -0.3161 -0.3727 -0.0895 ...
$ Petal.Length.2: num -1.34 -1.34 -1.39 -1.28 -1.34 ...
$ Petal.Width.2 : num -2.02 -2.02 -2.02 -2.02 -2.02 ...
$ Sepal.Length.3: num 5.12 4.86 4.59 4.46 4.99 ...
$ Sepal.Width.3 : num 3.02 2.36 2.62 2.49 3.15 ...
$ Petal.Length.3: num 0.263 0.263 0.132 0.394 0.263 ...
$ Petal.Width.3 : num -1.31 -1.31 -1.31 -1.31 -1.31 ...
I assume I am applying each mean in means to each column of the data frame in turn, when I only want to use it for elements in the column it refers to. So I'm not sure that nesting apply functions in this way will do what I need, but can it be done like this?
I'm not sure what your helper_func is, but I've made a toy example below:
helper_func <- function(x,m,sd) (x-m)/sd
You can then adjust your scale_variables() function like this:
scale_variables <- function(dframe, columns) {
  means <- apply(dframe[columns], 2, mean, na.rm = TRUE)
  sds <- apply(dframe[columns], 2, sd, na.rm = TRUE)
  sapply(columns, \(col) helper_func(dframe[[col]], m = means[col], sd = sds[col]))
}
And call it like this:
scale_variables(iris,names(iris)[sapply(iris, is.numeric)])
Output: (first 6 of 150 rows)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 -1.01843718 1.24503015 -1.33575163 -1.3110521482
6 -0.53538397 1.93331463 -1.16580868 -1.0486667950
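For what it's worth, when helper_func is the z-score (x - m)/sd as above, base R's scale() does the same job without any explicit looping:

scaled <- scale(iris[sapply(iris, is.numeric)])   # center = TRUE, scale = TRUE by default
head(as.data.frame(scaled))                       # same values as the output above

scale() also stores the column means and SDs as attributes ("scaled:center" and "scaled:scale"), which is useful if you later need to undo the scaling.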

Error when running boxcox on response variable

I'm using the following code to try to transform my response variable for regression. It seems to need a log transformation.
bc = boxCox(auto.tf.lm)
lambda.mpg = bc$x[which.max(bc$y)]
auto.tf.bc <- with(auto_mpg, data.frame(log(mpg), as.character(cylinders), displacement**.2, log(as.numeric(horsepower)), log(weight), log(acceleration), model_year))
auto.tf.bc.lm <- lm(log(mpg) ~ ., data = auto.tf.bc)
view(auto.tf.bc)
I am receiving this error though.
Error in Math.data.frame(mpg) :
non-numeric variable(s) in data frame: manufacturer, model, trans, drv, fl, class
Not sure how to resolve this. The data is in a data frame, not a csv.
Here's the output from str(auto.tf.bc). Sorry for such bad question formatting.
'data.frame': 392 obs. of 7 variables:
$ log.mpg. : num 2.89 2.71 2.89 2.77 2.83 ...
$ as.character.cylinders.: chr "8" "8" "8" "8" ...
$ displacement.0.2 : num 3.14 3.23 3.17 3.14 3.13 ...
$ log.horsepower. : num 4.87 5.11 5.01 5.01 4.94 ...
$ log.weight. : num 8.16 8.21 8.14 8.14 8.15 ...
$ log.acceleration. : num 2.48 2.44 2.4 2.48 2.35 ...
$ model_year : num 70 70 70 70 70 70 70 70 70 70 ...
Removing the cylinders column doesn't change anything.
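One observation that may explain this (a hedged guess, not a confirmed diagnosis): the variables listed in the error (manufacturer, model, trans, drv, fl, class) are exactly the columns of ggplot2's mpg dataset. That suggests log(mpg) is being evaluated against that whole data frame rather than against a numeric column of auto_mpg, which is what happens inside with() when auto_mpg has no column literally named mpg. A quick check:

"mpg" %in% names(auto_mpg)   # if FALSE, the bare name mpg falls through to ggplot2::mpg
str(auto_mpg$mpg)            # should be a numeric vector, not NULL

If the column is named something else, using that exact name (or auto_mpg$mpg explicitly) inside the data.frame() call should avoid the error.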

I'm getting an error while trying to create a confusion matrix

I'm getting the following error while trying to generate the confusion matrix; this used to work.
str(credit_test)
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
                            newdata = credit_test,
                            type = "class")
class(class_prediction)
class(credit_test$ACCURACY)
# Calculate the confusion matrix for the test set
confusionMatrix(data=class_prediction, reference=credit_test$ACCURACY)
'data.frame': 20 obs. of 4 variables:
$ ACCURACY : Factor w/ 2 levels "win","lose": 1 1 1 2 2 1 1 1 1 1 ...
$ PM_HIGH : num 5.7 5.12 10.96 7.99 1.73 ...
$ OPEN_PRICE: num 4.46 3.82 9.35 7.77 1.54 5.17 1.88 2.65 5.71 4.09 ...
$ PM_VOLUME : num 0.458 0.676 1.591 3.974 1.785 ...
[1] "factor"
[1] "factor"
Error in confusionMatrix(data=class_prediction, reference=credit_test$ACCURACY) :
  unused arguments (data=class_prediction, reference=credit_test$ACCURACY)
For some reason I had to run it this way; something has changed:
caret::confusionMatrix(data=class_prediction,reference=credit_test$ACCURACY)
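That "unused arguments" error is the classic symptom of another attached package masking caret's confusionMatrix() with a function of the same name but a different signature. A few ways to see what the bare name actually resolves to (a sketch):

find("confusionMatrix")        # every attached environment that defines the name
environment(confusionMatrix)   # the namespace an unqualified call would use
conflicts(detail = TRUE)       # all masked names, grouped by package

The caret:: prefix, as in the line above, sidesteps the masking regardless of package load order.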

“length of 'dimnames' [2] not equal to array extent”

So I have seen questions regarding this error code before, but the suggested troubleshooting that worked for those authors didn't help me diagnose my problem. I'm self-learning R and new to Stack Overflow, so please give me constructive feedback on how to better ask my question, and I will do my best to provide the necessary information. I've seen many similar questions put on hold, so I want to help you to help me. I'm sure the error probably stems from my lack of experience in data prep.
I'm trying to run a panel data model on data loaded from a .csv, and this error is returned when the model is run:
fixed = plm(Y ~ X, data=pdata, model = "within")
Error in `colnames<-`(`*tmp*`, value = "1") :
length of 'dimnames' [2] not equal to array extent
Running str() on my dataset shows that ID and Time are factors with 162 levels and 7 levels, respectively.
str(pdata)
Classes ‘plm.dim’ and 'data.frame': 1127 obs. of 11 variables:
$ ID : Factor w/ 162 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
$ Time : Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7 1 2 3 ...
$ Online.Service.Index : num 0.083 0.131 0.177 0.268 0.232 ...
$ Eparticipation : num 0.0345 0.0328 0.0159 0.0454 0.0571 ...
$ CPI : num 2.5 2.6 2.5 1.5 1.4 0.8 1.2 2.5 2.5 2.4 ...
$ GE.Est : num -1.178 -0.883 -1.227 -1.478 -1.466 ...
$ RL.Est : num -1.67 -1.71 -1.72 -1.95 -1.9 ...
$ LN.Pop : num 16.9 17 17 17.1 17.1 ...
$ LN.GDP.Cap : num 5.32 5.42 5.55 5.95 6.35 ...
$ Human.Capital.Index : num 0.268 0.268 0.268 0.329 0.364 ...
$ Telecommunication.Infrastructure.Index: num 0.0016 0.00173 0.00202 0.01576 0.03278 ...
Still, I don't see how that would create this error. I've tried converting the data to a plain data frame and to a matrix, with the same result (I got desperate, and it had worked for some people).
dim() yields
[1] 1127 11
I have some NA values, but I understand that these shouldn't cause a problem. Again, I'm self-taught and new here, so please take it easy on me! I hope I explained the problem well.
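Not a definitive fix, but two things worth ruling out with this error. The plm.dim class in the str() output suggests the panel was built with the deprecated plm.data(); pdata.frame() is the current interface, and duplicate ID-Time pairs in particular are known to produce confusing dimension errors inside plm. A sketch (raw_data below stands for your original, pre-plm data frame):

any(duplicated(raw_data[c("ID", "Time")]))               # TRUE means duplicate ID-Time rows
pdata <- pdata.frame(raw_data, index = c("ID", "Time"))  # rebuild the panel structure
fixed <- plm(Y ~ X, data = pdata, model = "within")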

Truncate a Time-Series in R

I'm using continuous Morlet wavelet transform (cwt) analysis on a time series with the R package dplR. The time series is 15-min data (gam_15min) of length 7968 (corresponding to 83 days of measurements).
I have the following output:
cwtGamma=morlet(gam_15min,x1=seq_along(gam_15min),p2=NULL,dj=0.1,siglvl=0.95)
str(cwtGamma)
List of 9
$ y : Time-Series [1:7968] from 1 to 1993: 672 674 673 672 672 ...
$ x : int [1:7968] 1 2 3 4 5 6 7 8 9 10 ...
$ wave : cplx [1:7968, 1:130] -0.00332+0.0008i 0.00281-0.00181i -0.00194+0.00234i ...
$ coi : num [1:7968] 0.73 1.46 2.19 2.92 3.65 ...
$ period: num [1:130] 1.03 1.11 1.19 1.27 1.36 ...
$ Scale : num [1:130] 1 1.07 1.15 1.23 1.32 ...
$ Signif: num [1:130] 0.000382 0.001418 0.005197 0.018514 0.062909 ...
$ Power : num [1:7968, 1:130] 1.17e-05 1.11e-05 9.26e-06 7.09e-06 5.54e-06 ...
$ siglvl: num 0.95
In my analysis I want to truncate the time series (I suppose $wave) by removing 1 period length at the beginning and 1 period length at the end. How do I do that? Maybe it's easy, but I'm not seeing how. Thanks.
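A sketch of one way to do the truncation, assuming "1 period length" means the longest period in the decomposition (cwtGamma$period is in the same units as the 15-min time steps):

n.cut <- ceiling(max(cwtGamma$period))              # samples to drop at each end
keep <- (n.cut + 1):(length(cwtGamma$x) - n.cut)    # indices to retain
wave.trunc <- cwtGamma$wave[keep, ]                 # rows of $wave are the time axis
power.trunc <- cwtGamma$Power[keep, ]               # keep $Power, $coi, $x, $y consistent
coi.trunc <- cwtGamma$coi[keep]
x.trunc <- cwtGamma$x[keep]

If you instead want to drop one period per scale, cwtGamma$period gives the period for each of the 130 columns, so the same indexing can be applied column by column.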
