rdirichlet distribution isn't working in R

I am attempting to run some code I wrote over a year ago, and for some reason it is not working this time. Previously I had the matrix variables alpha and prob typed into R. This time, however, I am importing them as CSV files so that it's easier for me to modify them with each iteration. I'm currently running R x64 2.12.2. I loaded the gtools, splines and stats4 packages.
alpha <- read.csv("AlphaOriginal.csv", header = FALSE)
prior <- array(0,c(45,45))
for (j in 1:45) prior[j,] <- rdirichlet(1,alpha[j,])
prob <- read.csv("ProbOriginal.csv", header = FALSE)
data <- array(0,c(45,45))
for (i in 1:45) data[i,] <- rmultinom(1,1,prob[i,])
posterior <- data%*%prior
write.table(posterior, file = "PosteriorOriginal.csv", row.names = FALSE, na = "", col.names = FALSE, sep = ",")
After the rdirichlet line I get the following error.
Error in rgamma (1*n, alpha) : invalid arguments
Does anyone know what this error means and how to fix it? Thx

If you take a look at the 'rdirichlet' help page you will find that the 'alpha' argument must be a vector, not a data.frame. So you should instead do:
rdirichlet(1, unlist(alpha[j,]))
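To see why the original call fails, it may help to inspect what row indexing returns. The sketch below uses only base R and a small made-up data frame; the values and the names row_df, row_vec and alpha_mat are illustrative and do not come from the original CSV file.

```r
# Hypothetical stand-in for read.csv("AlphaOriginal.csv", header = FALSE)
alpha <- data.frame(V1 = c(1, 2), V2 = c(3, 4), V3 = c(5, 6))

row_df <- alpha[1, ]            # indexing a row keeps the data frame (a list of columns)
is.data.frame(row_df)
is.numeric(row_df)

row_vec <- unlist(alpha[1, ])   # flattens the row into a named numeric vector
is.numeric(row_vec)

# Alternative: convert once, so that every row is already a numeric vector
alpha_mat <- as.matrix(alpha)
is.numeric(alpha_mat[1, ])
```

Converting once with as.matrix() avoids repeating unlist() inside the loop. The same conversion may be worth applying to prob before the rmultinom line, although the question does not show that line failing.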

Related

KNN function in R producing NA/NaN/Inf in foreign function call (arg 6) error

I'm working on a project where I need to construct a knn model using R. The professor provided an article with step-by-step instructions (link to article) and some datasets to choose from (link to the data I'm using). I'm getting stuck on step 3 (creating the model from the training data).
Here's my code:
data <- read.delim("data.txt", header = TRUE, sep = "\t", dec = ".")
set.seed(2)
part <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
training_data <- data[part==1,]
testing_data <- data[part==2,]
outcome <- training_data[,2]
model <- knn(train = training_data, test = testing_data, cl = outcome, k=10)
Here's the error message I'm getting (the one quoted in the title): NA/NaN/Inf in foreign function call (arg 6)
I checked and found that training_data, testing_data, and outcome all look correct, the issue seems to only be with the knn model.
The issue is with your data: the knn function you are using can't handle character or factor variables.
We can force this to work by doing something like this first:
library(tidyverse)
data <- data %>%
  mutate(Seeded = as.numeric(as.factor(Seeded)) - 1) %>%
  mutate(Season = as.numeric(as.factor(Season)))
But this is a bad idea in general, since Season has no natural order. A better approach is to treat it as a set of dummy variables instead.
See this link for examples:
R - convert from categorical to numeric for KNN
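As a rough sketch of the dummy-variable approach, base R's model.matrix() can expand a factor into 0/1 indicator columns. The data below is made up; only the column name Season comes from the answer above, and Score is a hypothetical numeric predictor.

```r
# Made-up data; only the column name Season is taken from the answer above
df <- data.frame(
  Season = factor(c("Winter", "Spring", "Summer", "Fall", "Winter")),
  Score  = c(1.2, 3.4, 2.2, 0.7, 1.9)
)

# One 0/1 indicator column per level; "- 1" drops the intercept so no level is left out
dummies <- model.matrix(~ Season - 1, data = df)

# Combine with the numeric predictors into one all-numeric feature matrix
features <- cbind(dummies, Score = df$Score)
colnames(features)
```

A matrix like features contains only numbers, so it is the kind of input a distance-based method such as knn can work with. Scaling the numeric columns is a separate decision that this sketch does not cover.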

R nlstools Regression, preview function doesn't take variables

I'm quite new to R but wanted to use the packages "nls" and "nlstools", since they have nice tools for analysis and evaluation.
The code I use is:
conB1_2015 = read.csv("C:\\Path_to_File\\conB1_2015.csv")
conB1_2015 = na.omit(conB1_2015)
tRef <- mean(conB1_2015$Mean_Soil_Temp_V2..C., na.rm=TRUE)
rRef <- conB1_2015$Lin_Flux..mymol.m.2.s.1.[which.min(abs(conB1_2015$Mean_Soil_Temp_V2..C.-tRef))]
rMax <- max(conB1_2015$Lin_Flux..mymol.m.2.s.1., na.rm=TRUE)
half <- rMax/2
half_SM <- conB1_2015$Soil_Moist_V3[which.min(abs(conB1_2015$Lin_Flux..mymol.m.2.s.1.-half))]
form <- as.formula(Lin_Flux..mymol.m.2.s.1. ~ (rRef)*a*exp(b*Mean_Soil_Temp_V2..C.)*Soil_Moist_V3/(half_SM)+Soil_Moist_V3)
preview(form, data = conB1_2015, start = c(a = -1.98, b = -0.05), variable = 1)
The problem is that I get this error when running the code:
Error in data.frame(value, row.names = rn, check.names = FALSE) :
row names supplied are of the wrong length
When I change the variables in
form <- as.formula(Lin_Flux..mymol.m.2.s.1. ~ (rRef)*a*exp(b*Mean_Soil_Temp_V2..C.)*Soil_Moist_V3/(half_SM)+Soil_Moist_V3)
to
form <- as.formula(Lin_Flux..mymol.m.2.s.1.~(rRef<-4.41)*a*exp(b*Mean_Soil_Temp_V2..C.)*Soil_Moist_V3/(half_SM<-7.19)+Soil_Moist_V3)
the function works fine.
I wanted to automate the script to run over several CSV files to test different models on different data. Is it really not possible to pass variables into the preview function, or am I missing something? There can't be a problem with the headers or the data table, since it works fine in the second example.
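One possible way to automate the second, working form is to substitute the computed numbers into the formula before passing it on, so that the formula no longer refers to rRef and half_SM by name. The sketch below uses base R's bquote(). The short names y, temp and moist are placeholders for the real column names, and the approach has not been checked against preview() itself.

```r
# Values that would normally be computed from the data (taken from the question)
rRef <- 4.41
half_SM <- 7.19

# .( ) inserts the current numeric value into the expression;
# eval() turns the resulting `~` call into a formula object
form <- eval(bquote(y ~ .(rRef) * a * exp(b * temp) * moist / .(half_SM) + moist))

form
all.vars(form)  # rRef and half_SM no longer appear as variables
```

Because the numbers are baked into the formula, the same lines can be rerun inside a loop over files after recomputing rRef and half_SM for each data set.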

R function not working in code

I'm starting to learn time series forecasting in R.
I keep getting an error when running the following code (the last two lines are the error).
Any ideas?
mydata = read.csv("Sample data.csv", header = FALSE, sep = ",") # read csv file
Revenue_Data <- ts(mydata, frequency=12, start=c(2015,1))
library("forecast")
library("ggplot2")
logRevDataTimeSeries <- log(Revenue_Data)
RevForecasts <- HoltWinters(logRevDataTimeSeries)
RevForecasts$SSE
[1] 0.141991
RevForecasts2 <- forecast.HoltWinters(RevForecasts, h=48)
Error in forecast.HoltWinters(RevForecasts, h = 48) :
could not find function "forecast.HoltWinters"
You need to call forecast, not forecast.HoltWinters. The name forecast.HoltWinters only indicates that the forecast function can be applied to a HoltWinters object; forecast() picks that method for you when it is given such an object.
See the help page for further details: https://www.rdocumentation.org/packages/forecast/versions/8.1/topics/forecast.HoltWinters
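A minimal sketch of the dispatch, using the built-in co2 series in place of the asker's revenue data (an assumption made for reproducibility). The forecast() call is guarded because the forecast package may not be installed. predict() is the base R counterpart and is shown only for comparison; it is not what the answer above proposes.

```r
# Fit Holt-Winters on a built-in monthly series (stand-in for Revenue_Data)
fit <- HoltWinters(co2)
class(fit)

# Base R: predict() dispatches to the HoltWinters method based on class(fit)
base_pred <- predict(fit, n.ahead = 48)

# With the forecast package: call the generic, not forecast.HoltWinters
if (requireNamespace("forecast", quietly = TRUE)) {
  fc <- forecast::forecast(fit, h = 48)
}
```

In both cases the generic is called by its plain name and R selects the method from the class of the fitted object.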
I found a simple way to get the output with Holt-Winters. Try this:
holtwinnter_fit <- holt(training_data$predictor)

All NaNs in RMA normalization of GSE31312 using Brainarray custom CDFs

I'm trying to RMA normalize a particular gene expression dataset concerning diffuse large B-cell lymphoma using custom gene-level annotation CDF (chip definition file) files from Brainarray.
Unfortunately, the RMA normalized expression matrix is all NaNs, and I don't understand why.
The dataset (GSE31312) is freely available at the GEO website and uses the Affymetrix HG-U133 Plus 2.0 array platform. I'm using the affy package to perform the RMA normalization.
Since the problem is specific to the dataset, the following R code to reproduce the problem is unfortunately quite cumbersome (2 GB download, 8.8 GB unpacked).
Set the working directory.
setwd("~/Desktop/GEO")
Load the needed packages. Uncomment to install the packages.
#source("http://bioconductor.org/biocLite.R")
#biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils"))
library("GEOquery") # To automatically download the data
library("affy")
library("R.utils") # For file handling
Download the array data to the work dir.
files <- getGEOSuppFiles("GSE31312")
Untar the data in a dir called CEL
#Sys.setenv(TAR = '/usr/bin/tar') # For (some) OS X uncomment this line
untar(tarfile = "GSE31312/GSE31312_RAW.tar", exdir = "CEL")
Unzip all the .gz files
gz.files <- list.files("CEL", pattern = "\\.gz$",
                       ignore.case = TRUE, full.names = TRUE)
for (file in gz.files)
  gunzip(file, skip = TRUE, remove = TRUE)
cel.files <- list.files("CEL", pattern = "\\.cel$",
                        ignore.case = TRUE, full.names = TRUE)
Download, install, and load the custom Brainarray Ensembl ENSG gene annotation package
download.file(paste0("http://brainarray.mbni.med.umich.edu/Brainarray/",
                     "Database/CustomCDF/18.0.0/ensg.download/",
                     "hgu133plus2hsensgcdf_18.0.0.tar.gz"),
              destfile = "hgu133plus2hsensgcdf_18.0.0.tar.gz")
install.packages("hgu133plus2hsensgcdf_18.0.0.tar.gz",
                 repos = NULL, type = "source")
library(hgu133plus2hsensgcdf)
Perform the RMA normalization with and without the custom CDF.
affy.rma <- justRMA(filenames = cel.files, verbose = TRUE)
ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE,
                    cdfname = "HGU133Plus2_Hs_ENSG")
As can be seen, the function returns without warning, yet every entry of the expression matrix exprs(ensg.rma) is NaN.
sum(is.nan(exprs(ensg.rma))) # == prod(dim(ensg.rma)) == 9964482
Interestingly, there are quite a few NaNs in exprs(affy.rma) when using the standard CDF.
# Show some NaNs
na.rows <- apply(is.na(exprs(affy.rma)), 1, any)
exprs(affy.rma)[which(na.rows)[1:10], 1:4]
# A particular bad probeset:
exprs(affy.rma)["1553575_at", ]
# There are relatively few NaNs in total (but there really should be none)
sum(is.na(exprs(affy.rma))) # == 12305
# Probesets with all NaNs
sum(apply(is.na(exprs(affy.rma)), 1, all))
There really should be no NaNs at all. I've tried using the expresso function to perform background correction only (with no normalization and summarization), which also yields NaNs. So the problem appears to stem from the background correction. However, one might worry that one or more bad arrays are the cause. Can anybody help me track down the origin of the NaNs and get some useful expression values?
Thanks, and best regards
Anders
Edit: A single file appears to be the issue (but not quite)
I decided to check what happens if each .CEL file is normalized independently. What actually happens underneath the RMA hood when justRMA is given a single array I'm not sure of. But I imagine that the quantile normalization step becomes the identity function and the summarization (median polish) simply stops after the first iteration.
Anyway, to perform the one-by-one RMA normalization we run:
ensg <- vector("list", length(cel.files)) # Initialize a list
names(ensg) <- basename(cel.files)
for (file in cel.files) {
  ensg[[basename(file)]] <- justRMA(filenames = file, verbose = TRUE,
                                    cdfname = "HGU133Plus2_Hs_ENSG")
  cat("File", which(file == cel.files), "done.\n\n")
}
# Extract the expression matrices for each file and combine them
X <- as.data.frame(do.call(cbind, lapply(ensg, exprs)))
By looking at head(X) it appears that GSM776462.CEL is all NaNs. Indeed, it almost is:
sum(!is.nan(X$GSM776462.CEL)) # == 18
Only 18 of the 20009 values are not NaN. Next, I count the number of NaNs appearing elsewhere
sum(is.na(X[, -grep("GSM776462.CEL", colnames(X))])) # == 0
which is 0. Hence GSM776462.CEL appears to be the culprit.
But the regular CDF annotation does not give any problems:
affy <- justRMA(filenames = "CEL/GSM776462.CEL")
any(is.nan(exprs(affy))) # == FALSE
It is also weird that using the regular CDF gives seemingly randomly scattered NaNs in the expression matrix. I still don't quite know what to make of this.
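The per-sample check in this edit can be condensed into a column-wise count of NaNs. The sketch below runs on a small made-up matrix; the sample names A.CEL, B.CEL and C.CEL and the values are hypothetical. For the combined data frame X from above, as.matrix(X) would be needed first, since is.nan() does not accept a data frame.

```r
# Toy expression matrix: rows are probesets, columns are samples (made-up values)
X_toy <- matrix(c(1, 2, 3,
                  NaN, NaN, 6,
                  7, 8, 9),
                nrow = 3,
                dimnames = list(paste0("g", 1:3), c("A.CEL", "B.CEL", "C.CEL")))

# Number of NaN entries in each sample (column)
nan_per_sample <- colSums(is.nan(X_toy))
nan_per_sample

# Names of the samples that contain at least one NaN
suspects <- names(which(nan_per_sample > 0))
suspects
```

Sorting nan_per_sample in decreasing order gives a quick ranking of suspicious arrays when more than one file is affected.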
Edit2: NaNs vanish when excluding sample but not with standard CDF
For what it is worth, when I exclude the GSM776462.CEL file and RMA normalize with and without the custom CDF files the NaNs only disappear in the custom CDF case.
# Trying with all CEL files other than GSM776462.CEL
cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE)
affy.rma.no.776462 <- justRMA(filenames = cel.files2)
ensg.rma.no.776462 <- justRMA(filenames = cel.files2, verbose = TRUE,
                              cdfname = "HGU133Plus2_Hs_ENSG")
sum(is.na(exprs(affy.rma.no.776462))) # == 12275
sum(is.na(exprs(ensg.rma.no.776462))) # == 0
Puzzling.
Edit3: No NAs or NaNs in "raw data"
For what it is worth, I've tried to read in the raw probe-level expression values to check if they contain NAs or NaNs. The following goes through the .CEL-files one-by-one and checks for any missing values.
for (file in cel.files) {
  affybtch <- suppressWarnings(read.affybatch(filenames = file))
  tmp <- exprs(affybtch)
  cat(file, "done.\n")
  if (any(is.na(tmp)))
    stop(paste("NAs or NaNs are present in", file))
}
No NAs or NaNs are found.

KNN in R: 'train and class have different lengths'?

Here is my code:
train_points <- read.table("kaggle_train_points.txt", sep="\t")
train_labels <- read.table("kaggle_train_labels.txt", sep="\t")
test_points <- read.table("kaggle_test_points.txt", sep="\t")
#uses package 'class'
library(class)
knn(train_points, test_points, train_labels, k = 5);
dim(train_points) is 42000 x 784
dim(train_labels) is 42000 x 1
I don't see the issue, but I'm getting the error :
Error in knn(train_points, test_points, train_labels, k = 5) :
'train' and 'class' have different lengths.
What's the problem?
Without access to the data, it's really hard to help. However, I suspect that train_labels should be a vector. So try
cl = train_labels[,1]
knn(train_points, test_points, cl, k = 5)
Also double check:
dim(train_points)
dim(test_points)
length(cl)
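A small reproduction of the likely cause, using the built-in iris data as a stand-in for the Kaggle files (an assumption, since the original data is not available). A one-column data frame has length 1, because length() counts columns, which does not match the number of training rows.

```r
library(class)  # provides knn()

# iris as stand-in data: 120 training rows, 30 test rows
train_idx <- c(1:40, 51:90, 101:140)
train_points <- iris[train_idx, 1:4]
test_points  <- iris[-train_idx, 1:4]

# Labels kept as a one-column data frame, as read.table() would return them
train_labels <- iris[train_idx, 5, drop = FALSE]
length(train_labels)   # number of columns, not number of rows

# Extract the column as a vector (here a factor) before calling knn()
cl <- train_labels[, 1]
length(cl)             # now equals nrow(train_points)

pred <- knn(train_points, test_points, cl, k = 5)
table(pred)
```

Passing train_labels directly instead of cl is what would trigger the length mismatch in this setup.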
I had the same issue when trying to apply knn to the Wisconsin breast cancer diagnosis dataset. I found that the problem was that the cl argument needs to be a vector (a factor). My mistake was to write cl = labels: I thought this was the vector to be predicted, but it was in fact a data frame with one column. The solution was to use the following syntax: knn(train, test, cl = labels$diagnosis, k = 21), where diagnosis is the header of the single column in the data frame labels. It worked well.
Hope this helps!
I have recently encountered a very similar issue.
I wanted to give only a single column as a predictor. In such cases, when selecting the column, you have to remember the drop argument and set it to FALSE. The knn() function accepts only matrices or data frames as the train and test arguments, not vectors.
knn(train = trainSet[, 2, drop = FALSE], test = testSet[, 2, drop = FALSE], cl = trainSet$Direction, k = 5)
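The effect of drop can be checked directly on any data frame. This sketch uses the built-in iris data; the names one_col_vec and one_col_df are illustrative.

```r
# Default behaviour: selecting one column from a data frame returns a bare vector
one_col_vec <- iris[, 1]
is.null(dim(one_col_vec))   # a vector has no dimensions

# drop = FALSE keeps the two-dimensional structure (150 rows, 1 column)
one_col_df <- iris[, 1, drop = FALSE]
is.data.frame(one_col_df)
dim(one_col_df)
```

So drop = FALSE is the setting for train and test when there is a single predictor, while the default (or drop = TRUE) is the setting for extracting cl as a vector.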
Try converting the data into a data frame using as.data.frame(). I was having the same problem, and afterwards it worked fine:
train_pointsdf <- as.data.frame(train_points)
train_labelsdf <- as.data.frame(train_labels)
test_pointsdf <- as.data.frame(test_points)
Simply set drop = TRUE when extracting cl from the data frame; this drops the extra dimension, so the single column is returned as a vector:
cl = train_labels[,1, drop = TRUE]
knn(train_points, test_points, cl, k = 5)
I had a similar error when I was reading the data into a tibble (read_csv); when I switched to read.csv, the code worked.
I followed the code as given in the book, but it throws an error because of mismatched lengths (one argument is a data frame, the other is the vector returned by knn). Nothing on this page worked exactly, but the answers gave me the idea that vectors were needed for the comparison.
This throws an error:
gmodels::CrossTable(x = wbcd_test_labels,  # actuals
                    y = wbcd_test_pred,    # predicted
                    prop.chisq = FALSE)
The following works:
gmodels::CrossTable(x = wbcd_test_labels$diagnosis,  # actuals
                    y = wbcd_test_pred,              # predicted
                    prop.chisq = FALSE)
Using $ for x makes it a vector, so the lengths match.
Additionally, when running knn, the cl parameter should also be a vector. Save the labels in a vector, or use labelDF$Class_label; otherwise there will be a length mismatch:
wbcd_test_pred <- knn(train = wbcd_train,
                      test = wbcd_test,
                      cl = wbcd_train_labels$diagnosis,  # note this
                      k = 21)
Hope this helps beginners like me.
Uninstall previous versions of R and install R version > 4.0. It will work.
