Writing a formula and trying to loop it in R

A2.DM19C.MICSw… A2.DM19C.MICSw… A2.IF12C.MICSwm… A2.DM12C.MICSwm… A2.HA12C.MICSwm…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.131 0.0516 -0.294 1.29 0.144
2 -0.175 -0.0250 -0.183 1.31 0.146
3 -0.128 0.0691 -0.294 1.31 0.0224
4 -0.175 0.0359 -0.294 1.31 0.136
5 -0.142 0.0169 -0.295 1.31 0.0239
6 -0.252 -0.0918 -0.272 1.33 -0.0263
I have data whose head looks like this; the dataset is called data_LOG. I want to z-score all of these columns. Because there are over 1000 columns, I want to loop the formula so that I can quickly convert all of the values to z-scores. The equation for a z-score is (y - mean(y)) / sd(y), so I made a function called 'zscore'.
zscore <- function(r){
Cal <- (r-mean(r))/sd(r)
return(Cal)
}
This works just fine when tested against the first column. I want the z-scored data to be in a new data frame that I call dataZ.
dataZ <- data_log
However, when I attempt to loop the formula, I get an error.
for (i in 1:ncol(data_log)) {
dataZ[,i] <- zscore(data_log[,i])
}
Error in is.data.frame(x) :
'list' object cannot be coerced to type 'double'
In addition: Warning message:
In mean.default(r) :
I am unsure what this means and how to fix it. Please help!

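If you want to keep your approach, try this (df stands for your data frame):
dataZ <- NULL
for (i in seq_len(ncol(df))) {
z <- zscore(df[[i]]) # [[i]] extracts the column as a plain numeric vector
dataZ <- cbind(dataZ, z) # append on the right to keep the original column order
}
dataZ <- as.data.frame(dataZ)
names(dataZ) <- names(df) # restore the original column names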

You could also use apply() in combination with standardize() or scale():
dataZ <- apply(data_LOG, 2, scale) # MARGIN = 2 applies the function to each column; the result is a matrix
HTH :)
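To illustrate the two answers above, here is a minimal sketch on a made-up data frame (toy and its columns are hypothetical stand-ins for data_LOG). The original loop most likely failed because data_log is a tibble, and `[, i]` on a tibble returns a one-column tibble (a list) rather than a numeric vector. Iterating over columns with lapply(), or extracting them with `[[i]]`, avoids that.

```r
# Toy data frame standing in for data_LOG (values are made up for illustration)
set.seed(42)
toy <- data.frame(a = rnorm(6), b = rnorm(6, mean = 5, sd = 2), c = runif(6))

zscore <- function(r) (r - mean(r)) / sd(r)

# lapply() visits each column as a plain vector; assigning into toy_z[]
# keeps the data-frame shape and the column names
toy_z <- toy
toy_z[] <- lapply(toy, zscore)

# scale() performs the same centering and scaling, but returns a matrix,
# so it is converted back to a data frame here
toy_scaled <- as.data.frame(scale(toy))

# Each z-scored column should have mean ~0 and standard deviation 1
round(colMeans(toy_z), 10)
sapply(toy_z, sd)
```

Both routes should give the same numbers up to floating-point tolerance; the lapply() version has the advantage of never leaving the data-frame type.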

Related

r plot correlation matrix from file with correlation

I have tried to find a solution, without success.
I have a .txt file with a correlation matrix which was previously created from other records. It looks like this:
       CXCL9  IL2RG  TAP1
CXCL9  1
IL2RG  0.828  1
TAP1   0.605  0.631  1
CD274  0.564  0.57   0.679
LAG3   0.624  0.676  0.681
I am trying to generate a correlogram, and for that I've done this:
m <- read.table("file.txt", sep="\t", header=TRUE, check.names = FALSE)
mymatrix <- as.matrix(m)
corrplot(mymatrix, type = "lower", method="number")
And I get this message:
Error in corrplot(mymatrix, type = "lower") : The matrix is not in [-1, 1]!
How can I do a simple correlogram with this data? (maybe doing a heatmap?)
The desired output:

Using glmnet on binomial data error

I imported some data as follows
surv <- read.table("http://www.stat.ufl.edu/~aa/glm/data/Student_survey.dat",header = T)
x <- as.matrix(select(surv,-ab))
y <- as.matrix(select(surv,ab))
glmnet::cv.glmnet(x,y,alpha=1,family="binomial",type.measure = "auc")
and I am getting the following error.
NAs introduced by coercion
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : NA/NaN/Inf in foreign function call (arg 5)
What is a good fix for this?
The documentation of the glmnet package has the information that you need:
surv <- read.table("http://www.stat.ufl.edu/~aa/glm/data/Student_survey.dat", header = T, stringsAsFactors = T)
x <- surv[, -which(colnames(surv) == 'ab')] # remove the 'ab' column
y <- surv[, 'ab'] # the 'binomial' family takes a factor as input (too)
xfact = sapply(x, is.factor) # flag which columns are factors (the rest are numeric)
xfactCols = model.matrix(~.-1, data = x[, xfact, drop = FALSE]) # one option is to build dummy variables from the factors (the other option is to convert to numeric)
xall = as.matrix(cbind(x[, !xfact], xfactCols)) # cbind() numeric and dummy columns
fit = glmnet::cv.glmnet(xall,y,alpha=1,family="binomial",type.measure = "auc") # run glmnet error free
str(fit)
List of 10
$ lambda : num [1:89] 0.222 0.202 0.184 0.168 0.153 ...
$ cvm : num [1:89] 1.12 1.11 1.1 1.07 1.04 ...
$ cvsd : num [1:89] 0.211 0.212 0.211 0.196 0.183 ...
$ cvup : num [1:89] 1.33 1.32 1.31 1.27 1.23 ...
$ cvlo : num [1:89] 0.908 0.9 0.89 0.874 0.862 ...
$ nzero : Named int [1:89] 0 2 2 3 3 3 4 4 5 6 ...
.....
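As a small illustration of why the original call failed, here is a sketch on a made-up two-column data frame (toy, age and sex are hypothetical names, not columns of the survey data). as.matrix() on mixed numeric and factor columns produces a character matrix, which is the likely source of "NAs introduced by coercion", whereas model.matrix() expands factors into numeric dummy columns.

```r
# Hypothetical toy data: one numeric and one factor predictor
toy <- data.frame(age = c(21, 35, 48, 52),
                  sex = c("f", "m", "m", "f"),
                  stringsAsFactors = TRUE)

# as.matrix() on mixed columns falls back to a character matrix
m_chr <- as.matrix(toy)
typeof(m_chr)

# model.matrix() turns the factor into dummy columns and stays numeric
m_num <- model.matrix(~ . - 1, data = toy)
typeof(m_num)
colnames(m_num)
```

A numeric matrix like m_num is the kind of input that the x argument of cv.glmnet() expects.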
I have come across the same problem of mixed data types of numeric and character/factor. For converting the predictors, I recommend using a function that comes with the glmnet package for exactly this mixed data type problem: glmnet::makeX(). It handles the dummy creation and is even able to perform a simple imputation in case of missing data.
x <- glmnet::makeX(surv[, -which(colnames(surv) == 'ab')])
or more tidy-ish:
library(tidyverse)
x <-
surv %>%
select(-ab) %>%
glmnet::makeX()

neural network: in neurons[[i]] %*% weights[[i]] : requires numeric/complex matrix/vector arguments

I am trying to apply the neural network method to my data and I am stuck.
I always get the message:
in neurons[[i]] %*% weights[[i]] : requires numeric/complex matrix/vector arguments
The facts are:
I am reading my data using read.csv.
I am adding a link to a file with some of my data; I hope it helps:
https://www.dropbox.com/s/b1btx0cnhmj229p/collineardata0.4%287.2.2017%29.csv?dl=0
I have no NA in my data (I checked twice).
The outcome of str(data) is:
'data.frame': 20 obs. of 457 variables:
$ X300.5_alinine.sulphate : num 0.351 0.542 0.902 0.656 1 ...
$ X300.5_bromocresol.green : num 0.435 0.603 0.749 0.314 0.922 ...
$ X300.5_bromophenol.blue : num 0.415 0.662 0.863 0.345 0.784 ...
$ X300.5_bromothymol.blue : num 0.2365 0.0343 0.4106 0.3867 0.8037 ...
$ X300.5_chlorophenol.red : num 0.465 0.1998 0.7786 0.0699 1 ...
$ X300.5_cresol.red : num 0.534 0.311 0.678 0.213 0.821 ...
continued
I have tried using model.matrix.
The code was tried on different datasets (e.g. iris) and worked fine there.
Can anyone suggest what is wrong with my data or with how I read it?
The code is:
require(neuralnet)
require(MASS)
require(grid)
require(nnet)
#READ IN DATA
data<-read.table("data.csv", sep=",", dec=".", head=TRUE)
dim(data)
# Create Vector of Column Max and Min Values
maxs <- apply(data[,3:459], 2, max)
mins <- apply(data[,3:459], 2, min)
# Use scale() and convert the resulting matrix to a data frame
scaled.data <- as.data.frame(scale(data[,3:459],center = mins, scale = maxs - mins))
# Check out results
print(head(scaled.data,2))
#create formula
feats <- names(scaled.data)
# Concatenate strings
f <- paste(feats,collapse=' + ')
f <- paste('data$Type ~',f)
# Convert to formula
f <- as.formula(f)
f
#creating neural net
nn <- neuralnet(f,model,hidden=c(21,15),linear.output=FALSE)
str(scaled.data)
apply(scaled.data,2,function(x) sum(is.na(x)))
There are multiple things wrong with your code.
1. Your dependent variable Type is a factor with multiple levels. neuralnet only accepts numeric input, so you must convert it to a binary indicator matrix with model.matrix.
y <- model.matrix(~ Type + 0, data = data[,1,drop=FALSE])
# fix up names for as.formula
y_feats <- gsub(" |\\+", "", colnames(y))
colnames(y) <- y_feats
scaled.data <- cbind(y, scaled.data)
# Concatenate strings
f <- paste(feats,collapse=' + ')
y_f <- paste(y_feats,collapse=' + ')
f <- paste(y_f, '~',f)
# Convert to formula
f <- as.formula(f)
2. You never passed scaled.data to the neuralnet call; you passed model, which is not defined in your code.
nn <- neuralnet(f,scaled.data,hidden=c(21,15),linear.output=FALSE)
The function will run now, but you will need to look into the multiclass problem (beyond the scope of this question). This package does not output probabilities directly, so you must be cautious.
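The formula construction from step 1 can be sketched on the built-in iris data, where Species plays the role of Type (an assumption made only for illustration; train and feats are names chosen here). The sketch stops before calling neuralnet, so it runs in base R alone.

```r
# iris stands in for the asker's data; Species plays the role of Type
y <- model.matrix(~ Species + 0, data = iris)   # one 0/1 column per class
colnames(y) <- gsub(" |\\+", "", colnames(y))   # make names formula-safe

feats <- names(iris)[1:4]                       # the numeric predictors
train <- cbind(y, iris[, feats])                # responses and predictors together

# Multi-response formula: indicator columns on the left, predictors on the right
f <- as.formula(paste(paste(colnames(y), collapse = " + "), "~",
                      paste(feats, collapse = " + ")))
f
```

Every name in f is a column of train, so train is the data frame that would be passed as the second argument of neuralnet().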

How to export results from bootstrapping in R?

I have a time series of 540 observations which I resample 999 times using the following code:
boot.mean = function(x,i){boot.mean = mean(x[i])}
z1 = boot(x1, boot.mean, R=999)
z1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = x1, statistic = boot.mean, R = 999)
Bootstrap Statistics :
original bias std. error
t1* -0.009381397 -5.903801e-05 0.002524366
Trying to export the results gives me the following error:
write.csv(z1, "z1.csv")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class "boot" to a data.frame
How can I export the results to a .csv file?
I am expecting to obtain a file with 540 observations 999 times. The goal is to apply the approx_entropy function from the pracma package to obtain 999 values of approximate entropy and plot their distribution in LaTeX.
First, please make sure that your example is reproducible. You can do so by generating a small x1 object, or by generating a random x1 vector:
> x1 <- rnorm(540)
Now, from your question:
I am expecting to obtain a file with 540 observations 999 times
However, this is not what you will get. You are generating 999 repetitions of the mean of the resampled data. That means that every bootstrap replicate is actually a single number.
From Heroka's comment:
Hint: look at str(z1).
The function str shows you the actual data inside the z1 object, without the pretty formatting.
> str(z1)
List of 11
$ t0 : num 0.0899
$ t : num [1:999, 1] 0.1068 0.1071 0.0827 0.1413 0.0914 ...
$ R : num 999
$ data : num [1:540] 1.02 1.27 1.82 -2.92 0.68 ...
(... lots of irrelevant stuff here ...)
- attr(*, "class")= chr "boot"
So your original data is stored as z1$data, and the data that you have bootstrapped, which is the mean of each resample, is stored in z1$t. Notice how it tells you the dimension of each slot: z1$t is 999 x 1.
Now, what you probably want to do is replace the boot.mean function with a boot.identity function, which simply returns the resampled data. It goes like this:
> library(boot)
> boot.identity = function(x,i){x[i]}
> z1 = boot(x1, boot.identity, R=999)
> str(z1)
List of 11
$ t0 : num [1:540] 1.02 1.27 1.82 -2.92 0.68 ...
$ t : num [1:999, 1:540] -0.851 -0.434 -2.138 0.935 -0.493 ...
$ R : num 999
$ data : num [1:540] 1.02 1.27 1.82 -2.92 0.68 ...
(... etc etc etc ...)
And you can save this data with write.csv(z1$t, "z1.csv").
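For comparison, the same 999 x 540 layout can be sketched without the boot package, using plain resampling with replacement in base R. This is a stand-in, not what boot() does internally, and sd() below is only a placeholder for the per-resample statistic (the asker would use approx_entropy from pracma there).

```r
set.seed(1)
x1 <- rnorm(540)   # random stand-in for the asker's series
R  <- 999

# One resample per row: replicate() returns 540 x 999, so transpose it
resamples <- t(replicate(R, sample(x1, replace = TRUE)))
dim(resamples)

# Write the resamples to a CSV file (a temporary file in this sketch)
out <- tempfile(fileext = ".csv")
write.csv(resamples, out, row.names = FALSE)
back <- read.csv(out)

# Placeholder statistic computed once per resample, giving 999 values
stat <- apply(resamples, 1, sd)
length(stat)
```

The vector stat then holds one value per resample, which is the shape needed to plot a distribution.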

variable lengths differ in R

I am getting the error above when trying to use the cv.lm function. Please see my code:
sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.string="NA")
sample1<-sample[2:2000,3:131]
samplex<-sample[2:50,3:131]
y<-as.numeric(sample1[1,])
y<-as.numeric(sample1[2:50,2])
x1<-as.numeric(sample1[2:50,3])
x2<-as.numeric(sample1[2:50,4])
x11<-x1[!is.na(y)]
x12<-x2[!is.na(y)]
y<-y[!is.na(y)]
fit1 <- lm(y ~ x11 + x12, data=sample)
fit1
x3<-as.numeric(sample1[2:50,5])
x4<-as.numeric(sample1[2:50,6])
x13<-x3[!is.na(y)]
x14<-x4[!is.na(y)]
fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
anova(fit1,fit2)
install.packages("DAAG")
library("DAAG")
cv.lm(df=samplex, fit1, m=10) # 10-fold cross-validation
Any insight will be appreciated.
Example of data
ID       peak height  LCA001  LCA002  LCA003
N001786  32391.111    0.397   0.229   -0.281
N005356  32341.473    0.397   -0.655  -1.301
N002416  32215.474    -0.703  -0.214  -0.901
GS239    31949.777    0.354   0.118   0.272
N016343  31698.853    0.226   0.04    -0.006
N003255  31604.978    0.024   NA      -0.534
N004358  31356.597    -0.252  -0.022  -0.407
N000122  31168.09     -0.487  -0.533  -0.134
GS10564  31106.103    -0.156  -0.141  -1.17
GS17987  31043.876    NA      0.253   0.553
N003674  30876.207    0.109   0.093   0.07
Please see the example of the data above
First, you are using lm(...) incorrectly, or at least in a very unconventional way. The purpose of the data=sample argument is to let the formula refer to columns of sample. Generally, it is very bad practice to reference free-standing vectors in the formula.
So try this:
## not tested...
sample <- read.csv(...)
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
cv.lm(df=na.omit(sample[2:50,]),fit1,m=10)
This will give columns 2:6 the appropriate names and then use those in the formula. The argument na.action=na.omit tells the lm(...) function to exclude all rows where there is an NA value in any of the relevant columns. This is actually the default, so it is not needed in this case, but included for clarity.
Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:
cv.lm(df=samplex, fit1, m=10)
is equivalent to:
cv.lm(df=samplex,y~x11+x12,m=10)
Since there are (presumably) no columns named x11 and x12 in samplex, and since you define these vectors externally, cv.lm(...) throws the error you are getting.
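The difference between column references and free-standing vectors can be reproduced in base R with made-up data (dat, x_free and the column names are hypothetical). The sketch fits one model entirely from data-frame columns, then shows that a free-standing vector of a different length triggers the "variable lengths differ" error.

```r
set.seed(1)
# Hypothetical data frame; the column names y, x1, x2 are illustrative
dat <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
dat$x1[3] <- NA

# All variables come from `dat`, so lm() drops the incomplete row consistently
fit <- lm(y ~ x1 + x2, data = dat)
nobs(fit)

# A free-standing vector of another length cannot be lined up with `dat`
x_free <- rnorm(15)
msg <- tryCatch(
  { lm(y ~ x_free, data = dat); "no error" },
  error = function(e) conditionMessage(e)
)
msg
```

Keeping every model variable inside the data frame means that any function which refits the model on a subset of rows, as cv.lm does, can find all of them.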
