Matrix assignment in R as a result of logical operation - r

In R, I want to build a matrix whose columns correspond to the epsilons and whose rows correspond to the input data. However, when I try to assign values to the matrix, an error appears:
"Error in results[, j] <- (probabilities > epsilons[j]) :
replacement has length zero"
I have tried many approaches but I am stuck. Please note that this problem happens when Oracle R objects are in use. The short script below reproduces the problem:
library(ORE)
ore.connect(user="XXXX", service_name="XXXXX", host="XXXXXXXX", password="XXXXX", port=XXXX, all=TRUE)
ore.sync('MYDATABASE')
ore.attach()
ore.pull(MY_TABLE)
trainingset <- MY_TABLE$MY_COLUMN[1:1000]
crossvalidationset <- MY_TABLE$MY_COLUMN[1001:2000]
# Training
my_column_avg <- mean(trainingset)
my_column_std <- sd(trainingset)
# Validation
probabilities <- dnorm(crossvalidationset,my_column_avg,my_column_std)
epsilons <- c(0.01,0.05,0.1,0.25,0.5,0.75,0.8)
num_rows <- length(probabilities)
num_cols <- length(epsilons)
results <- matrix(TRUE, num_rows, num_cols)
# Anomaly detection results for several epsilons
for(j in 1:num_cols)
{
results[,j] <- (probabilities > epsilons[j])
}

The object MY_TABLE is an Oracle table object, not a data frame, and so is probabilities, because it was derived from MY_TABLE.
The error occurred when a value derived from those objects was assigned into an ordinary R matrix, in this line:
results[,j] <- (probabilities > epsilons[j])
The cause was the use of the Oracle R library (ORE).
If ordinary R data structures are used from the start, for instance by replacing the Oracle object MY_TABLE with a data frame, the problem does not occur.
It is therefore good practice to convert Oracle R objects to R data frames whenever possible.
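To illustrate the point with ordinary R objects, here is a minimal sketch that assumes the column has already been brought into a plain numeric vector. The data are simulated with rnorm as a stand-in for MY_TABLE$MY_COLUMN, and my_column and results_outer are names made up for this example. It also shows that outer can build the same matrix without a loop.

```r
set.seed(1)
# Simulated stand-in for MY_TABLE$MY_COLUMN held as an ordinary numeric vector
my_column <- rnorm(2000, mean = 10, sd = 2)
trainingset <- my_column[1:1000]
crossvalidationset <- my_column[1001:2000]

# Training and validation, as in the question
probabilities <- dnorm(crossvalidationset, mean(trainingset), sd(trainingset))
epsilons <- c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.8)

# Loop version: one column per epsilon
results <- matrix(TRUE, length(probabilities), length(epsilons))
for (j in seq_along(epsilons)) {
  results[, j] <- probabilities > epsilons[j]
}

# Vectorised alternative: compare every probability with every epsilon
results_outer <- outer(probabilities, epsilons, ">")
```

With plain vectors the assignment in the loop goes through, and both versions give the same logical matrix.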

Related

cohen.d not recognizing numeric values

I have a sample dataset:
Score <- c(1,2,3,4,5,6,7,8)
Condition <- c(rep(1,each=5),rep(2,each=3))
Test <- data.frame(Condition,Score)
I tried running cohen.d from the effsize package using the following code:
cohen.d(Test,group="Condition")
but I obtained this error:
Error in cohen.d.default(Test, group = "Condition") : First parameter must be a numeric type
even though both columns are numeric (I checked in the workspace and also used as.numeric).
What did I do wrong? I am aware that someone has solved this issue before (here's the link), but I fail to understand what she did.
Thank you
Convert the Condition column to a factor.
Test$Condition <- factor(Test$Condition)
There are two ways in which you can apply the function.
Using values.
library(effsize)
cohen.d(Test$Score, Test$Condition)
Using formula syntax.
cohen.d(Score~Condition, Test)
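As a cross-check that does not need the effsize package, Cohen's d with a pooled standard deviation can be computed by hand in base R. This is only a sketch of the textbook formula, on the assumption (not verified here) that it corresponds to what cohen.d reports with its default settings. The names g1, g2, pooled_sd and d_manual are made up for the illustration.

```r
Score <- c(1, 2, 3, 4, 5, 6, 7, 8)
Condition <- factor(c(rep(1, 5), rep(2, 3)))  # grouping variable as a factor

# Split the scores by group
g1 <- Score[Condition == "1"]
g2 <- Score[Condition == "2"]
n1 <- length(g1)
n2 <- length(g2)

# Pooled standard deviation of the two groups
pooled_sd <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))

# Standardised mean difference (group 1 minus group 2)
d_manual <- (mean(g1) - mean(g2)) / pooled_sd
```

The sign depends on which group is subtracted from which, so compare magnitudes if the factor levels are ordered differently.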

How to transfer multiple columns into numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I run them, I get the warning "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert the columns into the numeric type I need for cor(): I receive the "'y' must be numeric" error when I run it. I tried passing several arguments with and without '' or "", without success.
When I run str(cor.condition.cols) I only get character strings, which makes me think that my function somehow misuses as.numeric. Any suggestions for how else I could iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- sapply(res[cor.condition.cols], as.numeric) # numeric matrix, one column per gene (as.numeric on a whole matrix would drop its dimensions)
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
Or, possibly, the short variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
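Since the question asks for the coefficients listed separately, here is a small self-contained sketch on simulated data. The data frame res_demo, with only three gene columns stored as character strings, is a made-up stand-in for the real file, and coef_table is a hypothetical name for the result.

```r
set.seed(42)
# Made-up stand-in for the real data: gene columns stored as character strings
res_demo <- data.frame(
  Condition   = rnorm(30),
  Genes1_Acc4 = as.character(round(rnorm(30), 3)),
  Genes2_Acc4 = as.character(round(rnorm(30), 3)),
  Genes3_Acc4 = as.character(round(rnorm(30), 3)),
  stringsAsFactors = FALSE
)

# Build the column names, then convert each column to numeric in place
cols <- paste0("Genes", 1:3, "_Acc4")
res_demo[cols] <- lapply(res_demo[cols], as.numeric)

# One Pearson coefficient per gene column against Condition
coefs <- cor(res_demo[cols], res_demo$Condition, use = "complete.obs")

# Tidy listing: one row per column
coef_table <- data.frame(gene = rownames(coefs), r = coefs[, 1], row.names = NULL)
```

The key difference from the code in the question is that the column names are used to index the data frame, rather than being passed to as.numeric as text.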

Removing dataframe outliers in R with `boxplot.stats`

I'm relatively new at R, so please bear with me.
I'm using the Ames dataset (full description of dataset here; link to dataset download here).
I'm trying to create a subset data frame that will allow me to run a linear regression analysis, and I'm trying to remove the outliers using the boxplot.stats function. I created a frame that will include my samples using the following code:
regressionFrame <- data.frame(subset(ames_housing_data[,c('SalePrice','GrLivArea','LotArea')] , BldgType == '1Fam'))
My next objective was to remove the outliers, so I tried to subset using a which() function:
regressionFrame <- regressionFrame[which(regressionFrame$GrLivArea != boxplot.stats(regressionFrame$GrLivArea)$out),]
Unfortunately, that produced the
longer object length is not a multiple of shorter object length
error. Does anyone know a better way to approach this, ideally using the which() subsetting function? I'm assuming it would include some form of lapply(), but for the life of me I can't figure out how. (I figure I can always learn fancier methods later, but this is the one I'm going for right now since I already understand it.)
Nice use of boxplot.stats.
You cannot SAFELY test with != if boxplot.stats returns more than one outlier in $out. An analogy here is 1:5 != 1:3. You probably want !(1:5 %in% 1:3) instead.
regressionFrame <- subset(regressionFrame,
subset = !(GrLivArea %in% boxplot.stats(GrLivArea)$out))
What I mean by SAFELY is that 1:5 != 1:3 gives a wrong result with a warning, but 1:6 != 1:3 gives a wrong result without any warning. The warning comes from the recycling rule. In the latter case, 1:3 can be recycled to the length of 1:6 (the length of 1:6 is a multiple of the length of 1:3), so you are actually testing 1:6 != c(1:3, 1:3).
A simple example.
x <- c(1:10/10, 101, 102, 103) ## has three outliers: 101, 102 and 103
out <- boxplot.stats(x)$out ## `boxplot.stats` has picked them out
x[x != out] ## this gives a warning and wrong result
x[!(x %in% out)] ## this removes them from x
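Because the question asked for a which()-based version, here is a sketch of the same idea applied to a data frame. The data frame demo is made up from the toy vector above, and keep, cleaned and silent are illustrative names.

```r
# Toy data frame built from the example vector; only GrLivArea mirrors the question
demo <- data.frame(id = 1:13, GrLivArea = c(1:10 / 10, 101, 102, 103))

# Values flagged as outliers by boxplot.stats
out <- boxplot.stats(demo$GrLivArea)$out

# Row positions whose value is NOT among the outliers
keep <- which(!(demo$GrLivArea %in% out))
cleaned <- demo[keep, ]

# The silent recycling case: 1:3 is reused as c(1:3, 1:3), with no warning
silent <- 1:6 != 1:3
```

which() is optional here, since the logical vector can index the rows directly, but it does no harm.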

Analyzing the relationship between cost and gamma for the svm, (NAs are not allowed) error message

I have been trying to analyse the relationship between the cost parameter, C, and the gamma parameter using the ksvm function from the kernlab package. The function I have written is as follows:
function (data)
{
library(kernlab)
p<-ncol(data)
y<-data[,p]
x<-data[,-p]
Rad.gamma<-matrix(seq(exp(-10),exp(1),length=20))
Con.c<-matrix(c(0.1,0.5,1.5),nrow=1)
mat<-expand.grid(Rad.gamma,Con.c)
Output<-data.frame(0,nrow=80,ncol=2)
for(i in 1:80)
{
Gamma<-mat[i,1]
CC<-mat[i,2]
Svm<-ksvm(y~.,data=as.data.frame(x),
kernel="rbfdot",kpar=list(sigma=Gamma),
cross=5, C=CC, type='C-svc',prob.model=FALSE)
Output[i,1]<-error(Svm)
Output[i,2]<-cross(Svm)
Output[i,3]<-nSV(Svm)/nrow(data)
}
Output<-data.frame(Output)
results<-cbind(mat,Output)
colnames(results)<-c("C","Train","Cross","SVs")
results
}
The error I obtain is:
Error in votematrix[i, ret < 0] <- votematrix[i, ret < 0] + 1 :
NAs are not allowed in subscripted assignments
I have attempted to check stackoverflow for a solution but the best answer I could find is that data.frame needs to come before cbind when there are missing values. I have been testing this function with the iris data set and there are no missing values. I would like to plot the results and analyze the patterns of the output matrix's contents; that should be simple enough. The problem is getting the results table to use for the plotting.
Any help would be greatly appreciated.
The mat created by expand.grid has 60 rows (20 gamma values times 3 cost values), but the loop indexes up to row 80, so mat[i,1] and mat[i,2] are NA for every i above 60. This should work:
data(iris)
library(kernlab)
p<-ncol(iris)
y<-iris[,p]
x<-iris[,-p]
Rad.gamma<-matrix(seq(exp(-10),exp(1),length=20))
Con.c<-matrix(c(0.1,0.5,1.5),nrow=1)
mat<-expand.grid(Rad.gamma,Con.c)
Output<-matrix(NA_real_,nrow=60,ncol=3) # one row per grid point, one column per metric
for(i in 1:60){
Gamma<-mat[i,1]
CC<-mat[i,2]
Svm<-ksvm(y~.,data=as.data.frame(x),
kernel="rbfdot",kpar=list(sigma=Gamma),
cross=5, C=CC, type='C-svc',prob.model=FALSE)
Output[i,1]<-error(Svm)
Output[i,2]<-cross(Svm)
Output[i,3]<-nSV(Svm)/nrow(iris)
}
Output<-data.frame(Output)
results<-cbind(mat,Output)
colnames(results)<-c("C","Train","Cross","SVs")
results
Additionally, results has 5 columns, so perhaps:
colnames(results)<-c("gamma", "C","Train", "Cross","SVs")
May I also suggest using apply instead of the for loop. That way you do not need to worry about where to store the results:
out = apply(mat, 1, function(p){
Gamma<-p[1]
CC<-p[2]
Svm<-ksvm(y~.,data=as.data.frame(x),
kernel="rbfdot",kpar=list(sigma=Gamma),
cross=5, C=CC, type='C-svc',prob.model=FALSE)
out = data.frame(error(Svm), cross(Svm), nSV(Svm)/nrow(iris))
colnames(out) = c("train", "Cross","SVs")
return(out)
})
out = do.call(rbind, out)
out = data.frame(mat, out)
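The row-count mismatch can be checked without fitting any model. The sketch below uses base R only: fit_stub is a hypothetical placeholder for the ksvm call, and n_grid, beyond and out are illustrative names. The loop bound is taken from nrow(mat) instead of being hard-coded.

```r
# Same grid as in the question: 20 gamma values crossed with 3 cost values
Rad.gamma <- seq(exp(-10), exp(1), length.out = 20)
Con.c <- c(0.1, 0.5, 1.5)
mat <- expand.grid(gamma = Rad.gamma, C = Con.c)

n_grid <- nrow(mat)          # 60 rows, not 80
beyond <- mat[n_grid + 1, ]  # indexing past the last row yields NA values

# Hypothetical stand-in for the model fit; it just echoes its inputs
fit_stub <- function(gamma, C) c(train = gamma, cross = C)

# Pre-allocate the result and loop over exactly the rows that exist
out <- matrix(NA_real_, nrow = n_grid, ncol = 2,
              dimnames = list(NULL, c("train", "cross")))
for (i in seq_len(n_grid)) {
  out[i, ] <- fit_stub(mat$gamma[i], mat$C[i])
}
```

Deriving the bound from the grid keeps the loop correct if the number of gamma or cost values changes later.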

R saving results from a nested for-loop

I am trying to write the following loop over an empirical data set where
each ID replicate has a different number of observations for each sample period.
Any suggestions would be greatly appreciated!
a <- unique(bma$ID)
t <- unique(bma$Sample.period)
# empty list to hold the data
dens.data <- vector(mode='list', length = length(a) * length(t))
tank1 <- double(length(a))
index = 0
for (i in 1:length(a)){
for (j in 1:length(t)){
index = index + 1
tank1[index] = a[index] ### building an ID column
temp.tank <- subset(bma, bma$ID == a[i])
time.tank <- subset(temp.tank, temp.tank$Sample.period == t[j])
temp1 <- unique(temp.tank$Sample.period)
temp.tank <- data.frame(temp.tank, temp1)
dens.1 <- density(time.tank$Biomass_.adults_mgC.mm.3, na.rm = T)
# extract the y-values from the pdf function - these need to be separated by each Replicate and Sample Period
dens.data[[index]] <- dens.1$y
}
}
#### extract the data and place into a dataframe
dens.new<- data.frame(dens.data)
dens.new
colnames(dens.new) <- c("Treatment","Sample Period","pdf/density for biomass")
all<- list(dens.new)
all
### create new spreadsheet with all the data from the loop
dens.new.data<- write.csv(dens.new, "New.density.csv") ## export file to excel spreadsheet
Calling dens.new <- data.frame(dens.data) yields the following error message:
Error in data.frame(c(...) :
arguments imply differing number of rows: 512, 0
The loop seems to work for dens.data[[1]] but returns NULL for dens.data[[i]] with i > 1.
Without a minimal example, it is hard to guess what the original data frame looks like. The error message does make one thing clear, though: your for-loop fails to assign values to the list dens.data for indices greater than 1.
My guess is that index is not being updated by index = index + 1. You could try changing the equals sign = to the standard R assignment operator <- and see whether the whole list gets filled.
I have heard that using the equals sign for assignment could cause problems in older versions of R, but I am not sure whether that is what you are facing. In any case, <- is the conventional assignment operator in R and the generally recommended one.
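Whatever the cause turns out to be, two things may help: checking which list slots stayed NULL, and storing each density as a small data frame so the pieces can be stacked in long format instead of being passed to data.frame() as columns of unequal length. The following is a sketch on simulated data, not a diagnosis of the original problem. bma_demo, its Biomass column, dens.list and dens.long are names made up for the example.

```r
set.seed(7)
# Simulated stand-in for bma: 2 IDs x 2 sample periods, 10 observations each
bma_demo <- data.frame(
  ID = rep(c("A", "B"), each = 20),
  Sample.period = rep(rep(1:2, each = 10), times = 2),
  Biomass = rexp(40)
)

ids <- unique(bma_demo$ID)
periods <- unique(bma_demo$Sample.period)
dens.list <- vector("list", length(ids) * length(periods))

index <- 0
for (i in seq_along(ids)) {
  for (j in seq_along(periods)) {
    index <- index + 1
    rows <- bma_demo$ID == ids[i] & bma_demo$Sample.period == periods[j]
    d <- density(bma_demo$Biomass[rows], na.rm = TRUE)
    # Keep the identifiers next to the density values
    dens.list[[index]] <- data.frame(ID = ids[i], Sample.period = periods[j],
                                     x = d$x, y = d$y)
  }
}

# Which slots, if any, were never filled?
empty_slots <- which(vapply(dens.list, is.null, logical(1)))

# Stack everything into one long data frame
dens.long <- do.call(rbind, dens.list)
```

If empty_slots is not empty on the real data, the loop stopped or skipped at those positions, which is the place to look next.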
