I'm trying to use DESeq2's PCAPlot function in a meta-analysis of data.
Most of the files I have received are raw counts pre-normalization. I'm then running DESeq2 to normalize them, then running PCAPlot.
One of the files I received does not have raw counts or even the FASTQ files, just the data that has already been normalized by DESeq2.
How could I go about importing this data (non-integers) as a DESeqDataSet object after it has already been normalized?
Consensus in vignettes and other comments seems to be that objects can only be constructed from matrices of integers.
I was mostly concerned with getting the format the same between plots. Ultimately, I just used a workaround to get the plots looking the same via ggfortify.
If anyone is curious, I just ended up doing this. Note, the "names" file is just organized like the meta file for colData for building a DESeq object from DESeqDataSetFrom Matrix, but I changed the name of the design column from "conditions" to "group" so it would match the output of PCAplot. Should look identical.
library(ggfortify)
data<-read.csv('COUNTS.csv',sep = ",", header = TRUE, row.names = 1)
names<-read.csv("NAMES.csv")
PCA<-prcomp(t(data))
autoplot(PCA, data = names, colour = "group", size=3)
I am trying to do Hartigan's diptest in R, however, I get the following error: 'x' must be numeric.
Apologies for such a basic question, but how do I ensure that the data that I load is numeric?
If I make up a set of values as follows (code below), the diptest works without problems:
library(diptest)
x = c(1,2,3,4,4,4,4,4,4,4,5,6,7,8,9,9,9,9,9,9,9,9,9)
hist(x)
dip.test(x)
But for example, when the same values are saved in an Excel file/tab delimited .txt file (saved as one column of values), and imported into R, when I run the diptest the 'x' must be numeric error occurs.
x <- read.csv("x.csv") # comes back in as a data frame
hist(x)
dip.test(x)
Is there a way to check what format the imported data from an Excel/.txt file is in R, and subsequently change it to be numeric? Thanks again.
Any help will be much appreciated thank you.
Here's what's happening. If you run the code that you know works, it's working because the data class is numeric as it should be. When you read it back in it's a data.frame, however. So you need to point to the numeric element of the data.frame:
library(diptest)
x = c(1,2,3,4,4,4,4,4,4,4,5,6,7,8,9,9,9,9,9,9,9,9,9)
write.csv(x, "x.csv", row.names=F)
x <- read.csv("x.csv") # comes back in as a data frame
hist(x$x)
dip.test(x$x)
Hartigans' dip test for unimodality / multimodality
data: x$x
D = 0.15217, p-value = 2.216e-05
alternative hypothesis: non-unimodal, i.e., at least bimodal
If you were to save the file to a .RDS instead of .csv then you could avoid this problem.
You could also check if your data frame contains any non-numeric characters as follows:
which(!grepl('^[0-9]',your_data_frame[[1]]))
I am trying to read data out of a csv-file.
The data consists of small integer numbers (53, 98 ...)
The csv was made with OpenOffice, the data stood there in the first column
one number in each row.
reading data was simple (no problem at all):
BirthNumbers <- read.csv(“/Users/.../RawData.csv”, header=FALSE)
Now I try to calculate mean(BirthNumbers) (for example),
but it is not possible, the error message:
x is not numeric
Where is my mistake?
Thanks for all help
Norbert
It's probably being read in as characters.
Try mean(as.numeric(BirthNumbers))
As per https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html (see Value section), read.csv returns a data frame.
You should be calling mean on the column of the data frame. Since you have no headers (given your header = FALSE), most likely the column is called V1 (verify by doing head(BirthNumbers) or colnames(BirthNumbers)), so you should do mean(BirthNumbers$V1).
I'm still in the process of learning R using Swirl and RStudio, and a goal I've set for myself is to recreate this graph. I have a small dataset that I will link below (it's saved as a plain text CSV file that I import into R with headings enabled).
If I try to plot that dataset without changing anything, I get this, which is obviously not the goal.
At first I thought the problem would be in the class of my imported dataset, defined as kt. After class(kt) turned out to be data.frame I figured that wasn't the problem. Should I be trying to rewrite the table to something that R can plot instantly, or should I be trying to extract each species individually, plot them separately and then combining the different plots into one graph? Perhaps there is something wrong with my dates, I know that R handles dates in a specific way. Maybe these solutions are not even needed and I'm just doing something stupidly simple wrong, but I can't find it myself.
Your help is much appreciated.
Dataset:
Species,week 0,week 1,week 2,week 3,week 4,week 5,week 6,week 7,week 8,week 9,week 10,week 11,week 12,week 13,week 14,week 15,week 16,week 17,week 18
Caesalpinia coriaria,0.0%,24.0%,28.0%,28.0%,32.0%,37.0%,40.0%,46.0%,52.0%,56.0%,63.0%,64.0%,68.0%,71.0%,72.0%,,,,
Coccoloba swartzii,0.0%,0.0%,1.0%,10.0%,19.0%,31.0%,33.0%,39.0%,43.0%,48.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,55.0%,
Cordia dentata,0.0%,5.0%,18.0%,21.0%,24.0%,26.0%,27.0%,30.0%,32.0%,32.0%,32.0%,32.0%,32.0%,32.0%,33.0%,33.0%,33.0%,34.0%,35.0%
Guaiacum officinale,0.0%,0.0%,0.0%,0.0%,4.0%,5.0%,5.0%,5.0%,7.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,,
Randia aculeata,0.0%,0.0%,0.0%,4.0%,13.0%,14.0%,18.0%,19.0%,21.0%,21.0%,21.0%,21.0%,21.0%,22.0%,22.0%,22.0%,22.0%,,
Schoepfia schreberi,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,1.0%,4.0%,8.0%,11.0%,13.0%,21.0%,21.0%,24.0%,24.0%,25.0%,27.0%,,
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%,,,,,,,,,,,,,,,
Something like this??
# get rid of "%" signs
df <- data.frame(sapply(df,function(x)gsub("%","",x,fixed=T)))
# convert cols 2:20 to numeric
df[,2:20] <- sapply(df[,2:20],function(x)as.numeric(as.character(x)))
library(reshape2)
library(ggplot2)
gg <- melt(df,id="Species")
ggplot(gg,aes(x=variable,y=value,color=Species,group=Species)) +
geom_line()+
theme_bw()+
theme(legend.position="bottom", legend.title=element_blank())
There are lots of problems here.
First, if your dataset really has those % signs, then R interprets the data as character and imports it as factors. So first we have to get rid of the % (using gsub(...), and then we have to convert what's left to numeric. With factors, you have to convert to character first, then numeric, so: as.numeric(as.character(...)). All of this could have been avoided if you exported the data without the % signs!!!
Plotting multiple curves with different colors is something the ggplot package was designed for (among many other things), so we use that. ggplot prefers data in "long" format - all the data in one column, with a second column distinguishing different datasets. Your data is in "wide" format - data in different columns. So we convert to long using melt(...) from the reshape2 package. The result, gg has three columns: Species, variable and value. value contains the actual data and variable contains the week number.
So now we create a ggplot object, setting the x-axis to the variable column, the y-axis to the value column, with color mapped to Species, and we tell ggplot to plot lines (using geom_line(...)).
The rest is to position the legend at the bottom, and turn off some of the ggplot default formatting.