Apply a loop to a data frame - r

I'm trying to apply the function RE.Johnson from the Johnson package to a whole data frame df that contains 157 observations of 16 variables, and I'd like to loop through the whole data frame instead of doing it manually.
I've tried the following code but it doesn't work.
lapply(df[1:16], function(x) RE.Johnson(x))
I know it might seem easy for you guys, but I'm just starting with R.
Thanks
EDIT
R gives me the error "Error in RE.ADT(xsl[, i]) : object 'p' not found" and the data are not transformed.
Here is a summary of the data:
'data.frame': 157 obs. of 16 variables:
$ X : num 786988 781045 777589 775266 786843 ...
$ Y : num 486608 488691 490089 489293 488068 ...
$ Z : num 182 128 191 80 131 ...
$ pH : num 7.93 7.69 7.49 7.66 7.92 7.08 7.24 7.19 7.44 7.37 ...
$ CE : num 0.775 3.284 3.745 4.072 0.95 ...
$ Nitrate : int 21 14 18 83 30 42 47 101 85 15 ...
$ NP : num 19.6 43.6 31.7 18.6 31.7 ...
$ Cl : num 1.9 21.3 2.56 21.5 3.2 ...
$ HCO3 : num 6.65 4.85 4.4 7.72 4.1 ...
$ CO3 : num 0 0 0 0 0.0736 ...
$ Ca : num 4.12 7.52 3.48 7.58 4.8 10 4.4 4.6 4.2 7.4 ...
$ Mg : num 3.94 8.92 2.34 7.1 2.5 ...
$ K : num 0.1442 0.0759 0.0709 0.3691 0.07 ...
$ Na : num 2.41 34.55 2.51 44.01 2.1 ...
$ SO4 : num 1.45 23.6 1.2 26.66 2 ...
$ Residu_sec: num 0.496 2.102 2.397 2.606 0.608 ...

Not a complete solution, just some information for others.
I tried Johnson::RE.Johnson manually on the columns of the iris data frame. It seems to work fine for Sepal.Length and Petal.Length only:
lapply(iris[c(1,3)], Johnson::RE.Johnson)
... and it returns the error you mentioned for Sepal.Width and Petal.Width.
lapply(iris[c(2,4)], Johnson::RE.Johnson)
Error in RE.ADT(xsl[, i]) : object 'p' not found
This seems odd because all of those columns have a data type of num. The iris data frame doesn't appear to have any missing values or extra character values hidden anywhere, so I'm not sure why the calculation is working for those columns but not others.
Without understanding too much about what Johnson::RE.Johnson does to the data, it looks like it is unable to calculate a value for p and so cannot complete the iteration for those columns.
From exploring the source code, the function appears to break down at this point:
if (xsb.valida[1, i] == 0)
xsb.adtest[1, i] <- (Johnson::RE.ADT(xsb[, i])$p) # succeeds
if (xsl.valida[1, i] == 0)
xsl.adtest[1, i] <- (Johnson::RE.ADT(xsl[, i])$p) # fails
if (xsu.valida[1, i] == 0)
xsu.adtest[1, i] <- (Johnson::RE.ADT(xsu[, i])$p) # fails
The function attempts to run Johnson::RE.ADT on xsl, which at this point is a vector of just 0's. The RE.ADT returns the same error with the p value not being found.
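Until the underlying cause is fixed, one way to keep lapply from aborting on the first failing column is to wrap each call in tryCatch. This is only a sketch: safe_transform is a hypothetical helper, and a toy function stands in for RE.Johnson so the example runs without the Johnson package.

```r
# Hypothetical helper: apply f to every column, returning NULL where f errors
safe_transform <- function(df, f) {
  lapply(df, function(col) {
    tryCatch(f(col), error = function(e) {
      message("skipped column: ", conditionMessage(e))
      NULL
    })
  })
}

# Toy stand-in for RE.Johnson (assumption): fails on constant input
toy_f <- function(x) {
  if (length(unique(x)) < 2) stop("constant vector")
  (x - mean(x)) / sd(x)
}

toy_df <- data.frame(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5))
res <- safe_transform(toy_df, toy_f)

# Names of the columns that could not be transformed
failed <- names(res)[vapply(res, is.null, logical(1))]
```

With the real package the call would presumably be safe_transform(df, Johnson::RE.Johnson), and failed would list the columns that trigger the error.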

The problem arises when the function tries to perform the Anderson-Darling test on a vector of identical values. If you do this, you get the error:
library(Johnson)
x <- rep(1, times = 100)
RE.ADT(x)
To solve this, you could check for that case in the if statements inside the function RE.Johnson:
if (xsb.valida[1, i] == 0 && any(xsb[, i] != xsb[1, i])) {
  xsb.adtest[1, i] <- (RE.ADT(xsb[, i])$p)
} else {
  xsb.adtest[1, i] <- 0
}
if (xsl.valida[1, i] == 0 && any(xsl[, i] != xsl[1, i])) {
  xsl.adtest[1, i] <- (RE.ADT(xsl[, i])$p)
} else {
  xsl.adtest[1, i] <- 0
}
if (xsu.valida[1, i] == 0 && any(xsu[, i] != xsu[1, i])) {
  xsu.adtest[1, i] <- (RE.ADT(xsu[, i])$p)
} else {
  xsu.adtest[1, i] <- 0
}

Related

Loop in R not working, generating single Value

I have some metabolomics data I am trying to process (validate the compounds that are actually present).
'data.frame': 544 obs. of 48 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ No. : int 2 32 34 95 114 141 169 234 236 278 ...
$ RT..min. : num 0.89 3.921 0.878 2.396 0.845 ...
$ Molecular.Weight : num 70 72 72 78 80 ...
$ m.z : num 103 145 114 120 113 ...
$ HMDB.ID : chr "HMDB0006804" "HMDB0031647" "HMDB0006112" "HMDB0001505" ...
$ Name : chr "Propiolic acid" "Acrylic acid" "Malondialdehyde" "Benzene" ...
$ Formula : chr "C3H2O2" "C3H4O2" "C3H4O2" "C6H6" ...
$ Monoisotopic_Mass: num 70 72 72 78 80 ...
$ Delta.ppm. : num 1.295 0.833 1.953 1.023 0.102 ...
$ X1 : num 288.3 16.7 1130.9 3791.5 33.5 ...
$ X2 : num 276.8 13.4 1069.1 3228.4 44.1 ...
$ X3 : num 398.6 19.3 794.8 2153.2 15.8 ...
$ X4 : num 247.6 100.5 1187.5 1791.4 33.4 ...
$ X5 : num 98.4 162.1 1546.4 1646.8 45.3 ...
I tried to write a loop so that if the Delta.ppm value is larger than (m/z - molecular weight)/molecular weight, the entire row is deleted in the subsequent dataframe.
for (i in 1:nrow(rawdata)) {
ppm <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
rawdata$Molecular.Weight[i]
if (ppm > rawdata$Delta.ppm[i]) {
filtered_data <- rbind(filtered_data, rawdata[i,])
}
}
Instead of giving me a new df with the validated compounds, under the 'Values' section, it generates a single number for 'ppm'.
Still very new to R, any help is super appreciated!
No need to do this row by row; we can remove all undesired rows in one operation:
## base R
good <- with(rawdata, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
newdat <- rawdata[good, ]
## dplyr
library(dplyr)
newdat <- filter(rawdata, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
Iteratively adding rows to a frame using rbind(old, newrow) works in practice but scales horribly; see "Growing Objects" in The R Inferno. For each row added, it makes a complete copy of all rows in old, which slows down considerably as the frame grows. It is far better to collect the new rows in a list and then rbind them all at once, e.g.:
out <- list()
for (...) {
# ... newrow ...
out <- c(out, list(newrow))
}
alldat <- do.call(rbind, out)
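A runnable version of that pattern, as a sketch on made-up data (toy and the threshold of 1 are illustrative, not taken from the question):

```r
# Made-up data standing in for the real frame
toy <- data.frame(id = 1:5, value = c(0.2, 1.5, 0.7, 2.1, 0.1))

# Pre-allocate a list with one slot per row and fill only the slots that pass
out <- vector("list", nrow(toy))
for (i in seq_len(nrow(toy))) {
  if (toy$value[i] < 1) {
    out[[i]] <- toy[i, ]
  }
}

# Drop the empty slots, then bind all kept rows in a single call
alldat <- do.call(rbind, Filter(Negate(is.null), out))

# The loop gives the same rows as the vectorised subset
vectorised <- toy[toy$value < 1, ]
```

Pre-allocating the list avoids copying on every iteration; when the condition can be written on whole columns, as here, the vectorised subset is still the simpler choice.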
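# Alternative: keep the loop, but initialise both objects before it
# and store one ppm value per row
ppm <- numeric(nrow(rawdata))
filtered_data <- rawdata[0, ]  # empty frame with the same columns
for (i in seq_len(nrow(rawdata))) {
  ppm[i] <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
    rawdata$Molecular.Weight[i]
  if (ppm[i] > rawdata$Delta.ppm.[i]) {
    filtered_data <- rbind(filtered_data, rawdata[i, ])
  }
}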

Error when running boxcox on response variable

I'm using the following code to try to transform my response variable for regression. Seems to need a log transformation.
bc = boxCox(auto.tf.lm)
lambda.mpg = bc$x[which.max(bc$y)]
auto.tf.bc <- with(auto_mpg, data.frame(log(mpg), as.character(cylinders), displacement**.2, log(as.numeric(horsepower)), log(weight), log(acceleration), model_year))
auto.tf.bc.lm <- lm(log(mpg) ~ ., data = auto.tf.bc)
view(auto.tf.bc)
I am receiving this error though.
Error in Math.data.frame(mpg) :
non-numeric variable(s) in data frame: manufacturer, model, trans, drv, fl, class
Not sure how to resolve this. The data is in a data frame, not csv.
Here's the output from str(auto.tf.bc). Sorry for such bad question formatting.
'data.frame': 392 obs. of 7 variables:
$ log.mpg. : num 2.89 2.71 2.89 2.77 2.83 ...
$ as.character.cylinders.: chr "8" "8" "8" "8" ...
$ displacement.0.2 : num 3.14 3.23 3.17 3.14 3.13 ...
$ log.horsepower. : num 4.87 5.11 5.01 5.01 4.94 ...
$ log.weight. : num 8.16 8.21 8.14 8.14 8.15 ...
$ log.acceleration. : num 2.48 2.44 2.4 2.48 2.35 ...
$ model_year : num 70 70 70 70 70 70 70 70 70 70 ...
Removing the cylinders column doesn't change anything.
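One plausible reading of the error: the str() output shows that the response column in auto.tf.bc is named log.mpg., not mpg, so log(mpg) in the formula cannot be resolved inside the data frame. R then finds another object called mpg, and the column names in the message (manufacturer, model, trans, drv, fl, class) match the mpg data set from ggplot2. A minimal sketch of the fix is to name the columns explicitly and use the already transformed name in the formula. mtcars is used here as a stand-in for auto_mpg, which is an assumption.

```r
# Stand-in data (assumption): mtcars plays the role of auto_mpg
auto.tf.bc <- with(mtcars, data.frame(
  log_mpg    = log(mpg),            # explicit names instead of log.mpg.
  cylinders  = as.character(cyl),
  log_weight = log(wt)
))

# The response is already logged, so refer to it by its new name
fit <- lm(log_mpg ~ ., data = auto.tf.bc)
coef_names <- names(coef(fit))
```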

Box-Cox Transformation Error: object 'x' not found

Hopefully a relatively easy one for those more experienced than me!
Trying to perform a Box-Cox transformation using the following code:
fit <- lm(ABOVEGROUND_BIO ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)
bc <- boxcox(fit)
lambda<-with(bc, x[which.max(y)])
MYCORRHIZAL_VARIANCE$bc <- ((x^lambda)-1/lambda)
boxplot(bc ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)
however when I run it, I get the following error message:
Error: object 'x' not found. (on line 4)
For context, here's the str of my dataset:
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 24 obs. of 14 variables:
$ TREATMENT : Factor w/ 2 levels "Mycorrhizal",..: 1 1 1 1 1 1 1 1 1 1 ...
$ P_LEVEL : Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 2 2 2 2 ...
$ REP : int 1 2 3 4 5 6 1 2 3 4 ...
$ ABOVEGROUND_BIO : num 7.5 6.8 5.3 6 6.7 7 12 12.7 12 10.2 ...
$ BELOWGROUND_BIO : num 3 2.4 2 4 2.7 3.6 7.9 8.8 9.5 9.2 ...
$ ROOT_SHOOT : num 0.4 0.35 0.38 0.67 0.4 0.51 0.66 0.69 0.79 0.9 ...
$ ROOT_SHOOT.log : num -0.916 -1.05 -0.968 -0.4 -0.916 ...
$ ABOVEGROUND_BIO.log : num 2.01 1.92 1.67 1.79 1.9 ...
$ ABOVEGROUND_BIO.sqrt : num 2.74 2.61 2.3 2.45 2.59 ...
$ ABOVEGROUND_BIO.cubert: num 1.96 1.89 1.74 1.82 1.89 ...
$ BELOWGROUND_BIO.log : num 1.099 0.875 0.693 1.386 0.993 ...
$ BELOWGROUND_BIO.sqrt : num 1.73 1.55 1.41 2 1.64 ...
$ BELOWGROUND_BIO.cubert: num 1.44 1.34 1.26 1.59 1.39 ...
$ TOTAL_BIO : num 10.5 9.2 7.3 10 9.4 10.6 19.9 21.5 21.5 19.4 ...
- attr(*, "spec")=
.. cols(
.. TREATMENT = col_factor(levels = c("Mycorrhizal", "Non-mycorrhizal"), ordered = FALSE, include_na = FALSE),
.. P_LEVEL = col_factor(levels = c("Low", "High"), ordered = FALSE, include_na = FALSE),
.. REP = col_integer(),
.. ABOVEGROUND_BIO = col_number(),
.. BELOWGROUND_BIO = col_number(),
.. ROOT_SHOOT = col_number()
.. )
I understand there's no variable named bc in the MYCORRHIZAL_VARIANCE dataset, but I'm just following basic instructions given to me on performing a Box-Cox, and I guess I'm confused as to what 'x' should actually be denoted as, since I thought 'x' was being defined in line 3? Any suggestions as to how to fix this error?
Thanks in advance!
"I thought 'x' was being defined in line 3?"
Line 3 is lambda<-with(bc, x[which.max(y)]). It doesn't define x, it defines lambda. It does use x, which it looks for within the bc environment. If you're using boxcox() from the MASS package, bc should indeed include x and y components, so bc$x shouldn't give you the same error message. I'd expect an error about the replacement lengths. Because...
bc$x are the potential lambda values tried by boxcox. You're using the default seq(-2, 2, 1/10), and it would be an unlikely coincidence if your data had the multiple of 41 rows needed to avoid an error when assigning 41 values to a new column.
Line 3 picks out the lambda value that maximizes the likelihood, so you shouldn't need the rest of the values in bc ever again. I'd expect you to use that lambda value to transform your response variable, as that's what the Box-Cox transformation is for. ((x^lambda)-1/lambda) doesn't make any statistical or programmatic sense. Use this instead:
MYCORRHIZAL_VARIANCE$bc <- (MYCORRHIZAL_VARIANCE$ABOVEGROUND_BIO ^ lambda - 1) / lambda
(Note that I also corrected the parentheses. You want (y ^ lambda - 1) / lambda, not (y ^ lambda) - 1 / lambda.)
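The transformation can also be wrapped in a small function, which makes the parentheses harder to get wrong. This is a sketch: box_cox is a hypothetical helper, and it assumes the usual convention that lambda = 0 means log(y). The sample values are the first ABOVEGROUND_BIO entries from the str() output above.

```r
# Hypothetical helper: Box-Cox transform of a strictly positive vector
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) {
    log(y)                      # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda     # note: the whole numerator is divided
  }
}

y <- c(7.5, 6.8, 5.3, 6.0, 6.7, 7.0)
shifted <- box_cox(y, 1)   # lambda = 1 only shifts the data by -1
logged  <- box_cox(y, 0)   # lambda = 0 gives log(y)
```

With the fitted lambda this would be used as MYCORRHIZAL_VARIANCE$bc <- box_cox(MYCORRHIZAL_VARIANCE$ABOVEGROUND_BIO, lambda).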

Analysis of PCA

I'm using the rela package to check whether I can use PCA in my data.
paf.neur2 <- paf(neur2)
summary(paf.neur2)
# [1] "Your dataset is not a numeric object."
I want to see the KMO (the Kaiser-Meyer-Olkin measure of sampling adequacy). How can I do that?
Output of str(neur2)
'data.frame': 1457 obs. of 66 variables:
$ userid : int 200 387 458 649 931 991 1044 1075 1347 1360 ...
$ funct : num 3.73 3.79 3.54 3.04 3.81 ...
$ pronoun: num 2.26 2.55 2.49 1.98 2.71 ...
.
.
.
$ time : num 1.68 1.87 1.51 1.03 1.74 ...
$ work : num 0.7419 0.2311 -0.1985 -1.6094 -0.0619 ...
$ achieve: num 0.174 0.2469 0.1823 -0.478 -0.0513 ...
$ leisure: num 0.2852 0.0296 0.0583 -0.3567 -0.0408 ...
$ home : num -0.844 -0.58 -0.844 -2.207 -1.079 ...
.
Variables are all numeric.
According to ?paf, object is a numeric dataset (usually a coerced matrix from a prior data frame)
So you need to turn your data.frame neur2 into a matrix: as.matrix(neur2).
Here is a reproduction of your problem using the Seatbelts dataset:
library(rela)
Belts <- Seatbelts[,1:7]
class(Belts)
# [1] "mts" "ts" "matrix"
Belts <- as.data.frame(Belts)
class(Belts)
# [1] "data.frame"
paf.belt <- paf(Belts)
# [1] "Your dataset is not a numeric object."
Belts <- as.matrix(Belts)
class(Belts)
# [1] "matrix"
paf.belt <- paf(Belts) # Works
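The difference is visible in base R alone. The sketch below assumes that paf rejects anything for which a check like is.numeric() is false; that is a guess about its internals, not something taken from the package source.

```r
# A data frame is a list of columns, so it is not "numeric" as a whole
df <- as.data.frame(Seatbelts[, 1:7])
df_is_numeric <- is.numeric(df)

# Coercing to a matrix gives a single numeric object
m <- as.matrix(df)
m_is_numeric <- is.numeric(m)

# Caveat: one character column turns the whole matrix into character
df$label <- "a"
mixed_is_numeric <- is.numeric(as.matrix(df))
```

So before calling as.matrix(neur2), it is worth dropping identifier or text columns such as userid if they should not enter the analysis.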
Two options for computing the KMO:
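library(corpcor)  # for cor2pcor()
kmo_DIY <- function(df) {
  # sum of squared correlations, off-diagonal, counted once
  csq <- cor(df)^2
  csumsq <- (sum(csq) - dim(csq)[1]) / 2
  # sum of squared partial correlations, off-diagonal, counted once
  pcsq <- cor2pcor(cor(df))^2
  pcsumsq <- (sum(pcsq) - dim(pcsq)[1]) / 2
  kmo <- csumsq / (csumsq + pcsumsq)
  return(kmo)
}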
or the function KMO() from the psych package.
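If neither package is installed, the same overall KMO can be sketched in base R by taking the partial correlations from the inverse of the correlation matrix. kmo_base is a hypothetical name, and the mtcars columns are only example data.

```r
# Hypothetical helper: overall KMO from a numeric data frame or matrix
kmo_base <- function(x) {
  r <- cor(as.matrix(x))
  r_inv <- solve(r)
  # Partial correlations: -r_inv[i, j] / sqrt(r_inv[i, i] * r_inv[j, j])
  p <- -r_inv / sqrt(outer(diag(r_inv), diag(r_inv)))
  off <- row(r) != col(r)   # off-diagonal mask
  sum(r[off]^2) / (sum(r[off]^2) + sum(p[off]^2))
}

kmo_value <- kmo_base(mtcars[, c("mpg", "disp", "hp", "wt")])
```

This only returns the overall measure; psych::KMO() also reports a value per variable, and solve() will fail if the correlation matrix is singular.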

classification in R

I am trying to do naive Bayes classification in R. I have seen this example at the following link:
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Na%C3%AFve_Bayes
There are only two lines: first build the classifier, then predict.
> classifier<-naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
This code works fine on the iris dataset, but when I apply the same code to my own dataset, I get errors.
My dataset contains 4 attributes, and the 4th attribute is the class attribute.
> str(data1)
'data.frame': 1370 obs. of 4 variables:
$ TenScore : num 85 84.2 67.2 91.5 79.3 ...
$ TwelthScore : num 69 87.9 67.5 82.7 72.4 ...
$ GDegreeScore : num 63.3 70.7 61.3 78.2 62.1 ...
$ Got_Admission: chr "No" "No" "No" "No" ...
So, I tried this.
> classifier<-naiveBayes(data1[,1:3], data1[,4])
> table(predict(classifier, data1[,-4]), data1[,4])
Error in table(predict(classifier, data1[, -4]), data1[, 4]) :
all arguments must have the same length
I get the above error when I execute the command. When I just use predict, it gives me the following output.
> predict(classifier, data1[,-4])
factor(0)
Levels:
Please explain what the error is about and how to solve it.
I can produce the same error by changing the 5th column of iris to character:
> iris[ , 5] <- as.character(iris[ , 5] )
> classifier<-naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
Error in table(predict(classifier, iris[, -5]), iris[, 5]) :
all arguments must have the same length
# The fix -------->
iris[ , 5] <- factor(as.character(iris[ , 5] ))
classifier<-naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5])
# ---- output--------
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
So you should probably do this:
data1$Got_Admission <- factor(data1$Got_Admission)
If your 'Got_Admission' column is not in good order, you will get confusing results (the GIGO effect). You should first look at its contents with:
table(data1$Got_Admission)
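As an illustration of what "not in good order" can look like, here is a sketch on made-up labels (the values are invented, not from the question): stray whitespace and inconsistent case create extra classes that table() reveals and that should be cleaned before calling factor().

```r
# Made-up labels with typical inconsistencies
admit <- c("No", "Yes", "no", "Yes ", "No")
n_raw_classes <- length(unique(admit))   # more classes than intended

# Normalise whitespace and case, then convert with explicit levels
clean <- trimws(admit)
clean <- paste0(toupper(substr(clean, 1, 1)), tolower(substring(clean, 2)))
admit_f <- factor(clean, levels = c("No", "Yes"))

counts <- table(admit_f)
```

Giving the levels explicitly means any label that is still unexpected becomes NA, which is easy to spot with sum(is.na(admit_f)).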
