R - [DESeq2] - Making a DESeqDataSet object from a CSV of already normalized counts

I'm trying to use DESeq2's plotPCA function in a meta-analysis of data.
Most of the files I have received contain raw counts, pre-normalization. I run DESeq2 to normalize them and then call plotPCA.
One of the files I received does not have raw counts or even the FASTQ files, just the data that has already been normalized by DESeq2.
How could I go about importing this data (non-integers) as a DESeqDataSet object after it has already been normalized?

The consensus in the vignettes and other comments seems to be that DESeqDataSet objects can only be constructed from matrices of integer counts.
I was mostly concerned with getting the format the same between plots. Ultimately, I just used a workaround to get the plots looking the same via ggfortify.
If anyone is curious, I just ended up doing this. Note that the "names" file is organized like the metadata file passed as colData when building a DESeq object with DESeqDataSetFromMatrix, but I changed the name of the design column from "conditions" to "group" so it would match the output of plotPCA. The plots should look identical.
library(ggfortify)

# normalized counts: genes in rows, samples in columns
data <- read.csv("COUNTS.csv", header = TRUE, row.names = 1)
# sample metadata, organized like the colData file, with a "group" column
names <- read.csv("NAMES.csv")

# PCA on samples (transpose so samples are rows), then plot coloured by group
PCA <- prcomp(t(data))
autoplot(PCA, data = names, colour = "group", size = 3)

Related

How do I export a textstat_simil object without losing observations or variables?

I'm new to quanteda and I am having issues exporting my results. I am comparing two dfms: "dfm_latam", with more than 27,000 documents, and "dfm_cosines", which is built from two corpora whose texts are to be compared against each of the 27,000+ documents of the dfm_latam database.
corpus_cosine_2 <- corpus(cosine_2_pdf)
corpus_cosines <- corpus_cosine_1 + corpus_cosine_2
dfm_cosines <- dfm(corpus_cosines, case_insensitive = TRUE)
corpus_latam <- corpus(latam_review)
docvars(corpus_latam, "Text") <- names(corpus_latam$text)
dfm_latam <- dfm(corpus_latam, case_insensitive = TRUE)
simil_latam <- textstat_simil(dfm_latam, dfm_cosines, method = "cosine", margin = "documents", case_insensitive = TRUE)
view(simil_latam)
The view() function in R shows me the first 1,000 rows and everything is fine: both numeric variables from dfm_cosines show up. But when I try to export it as an Excel document, the output looks completely different from the view() preview. One of the variables is missing, and the .xlsx output only shows the results for "corpus_cosine_1", even though "dfm_cosines" is built from both "corpus_cosine_1" and "corpus_cosine_2". Why does this happen when I export it?
openxlsx::write.xlsx(simil_latam, file = "F:\\path\\simil_latam.xlsx")
So, I tried exporting along with the view() function:
openxlsx::write.xlsx(view(simil_latam), file = "F:\\path\\simil_latam.xlsx")
For this write.xlsx(view()) call, the variables that show up are right, but I only export 1,000 observations out of the 27,000+ I have. How do I export all of the observations of the table, with all variables showing up?
You need to convert the textstat_simil object to something more spreadsheet-like. Try
as.matrix(simil_latam)
before you call write.xlsx(), or, if you prefer this format,
as.data.frame(simil_latam)
I suggest you inspect both coerced objects before exporting them, and also see the help pages for these coercion methods (found in the quanteda.textstats package).
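For example, a minimal sketch of the export step, assuming the object names and file paths from the question (as I recall, as.data.frame() on a textstat_simil object returns one row per document pair, with the similarity in a column named after the method):
library(openxlsx)

# Full similarity matrix: one row per dfm_latam document, one column per dfm_cosines document
simil_mat <- as.matrix(simil_latam)
simil_mat_df <- data.frame(document = rownames(simil_mat), simil_mat,
                           check.names = FALSE, row.names = NULL)
write.xlsx(simil_mat_df, file = "F:\\path\\simil_latam_matrix.xlsx")

# Long format: one row per document pair (roughly document1, document2, cosine)
simil_long <- as.data.frame(simil_latam)
write.xlsx(simil_long, file = "F:\\path\\simil_latam_pairs.xlsx")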

How to use corrplot with is.corr=FALSE

I previously made a beautiful, functional, and perfect actual correlation plot with corrplot (my plot). Now I have to present the underlying data in the same look. So my goal is to have triangular similarity matrices in the same colours as my correlation plot; imagine it like conditional formatting in Excel.
My data: a similarity matrix exported from Excel (a CSV file is linked in the original post).
It is loaded in as a CSV and R can read the CSV perfectly.
My code: corrplot(Phylogeny, is.corr = FALSE, method = "number", cl.lim = c(0, 1))
The error it throws: Error in if (any(corr < cl.lim[1]) || any(corr > cl.lim[2])) { : missing value where TRUE/FALSE needed
I made sure all columns are numeric.
I made sure to fill the missing bits with NAs (because that was a problem somewhere before).
I made sure all my values are between 0 and 1, as I want the limits to be (at one point it told me that my values were not within the limits, when I was trying things out).
The error does not change when I change the limits.
The error does not change when I take is.corr=FALSE out (the default is TRUE).
I played around with corrplot.mixed and it is still not working.
I have been referencing information from the corrplot introduction vignette.
I have looked into condformat, but I am not really sure whether it can fill each cell with one colour according to an overall gradient like the one I used for my correlation plot.
What am I missing here that it does not want to give me my table back with pretty colours?
I had the same error, but I was able to fix it by converting my data.frame to a matrix. I ended up with corrplot(as.matrix(df), is.corr = FALSE).
If I am understanding correctly, your posted data are already a correlation-like matrix, although not a fully symmetrical one of the sort that would be produced by calling cor() on raw data.
In that case, the problem is just that you have the variable names (Species) as a column in your data. Move this column into the row names, drop it from the data itself, and call corrplot as user9536160 suggests:
# load the needed packages
library(readr)
library(dplyr)
library(corrplot)

# read in your data
phyl <- as.data.frame(read_csv("Phylogeny.csv"))

# move the species names into the row names and drop the column from the data itself
row.names(phyl) <- phyl$Species
phyl <- phyl %>%
  select(-Species)

# call corrplot
corrplot(as.matrix(phyl), is.corr = FALSE)
The result: (plot omitted)

What is option_description used for in the build_dict function of the dataMeta package in R?

I have a dataset with some 100,000 tweets and their sentiment scores. The original dataset has just two columns: one for the tweets and one for their sentiment scores.
I am trying to build a data dictionary for it using the dataMeta package. Here is the code that I have written so far:
library(dataMeta)
library(knitr)

# Data dictionary
var_desc <- c("Sentiment score: 0 for negative sentences and 4 for positive sentences",
              "The tweets collected")
var_type <- c(0, 1)

# Create the linker data frame
linker <- build_linker(tweets_train, variable_description = var_desc, variable_type = var_type)
linker

# Build the data dictionary
dict <- build_dict(my.data = tweets_train, linker = linker,
                   option_description = NULL, prompt_varopts = FALSE)
kable(dict, format = "html", caption = "Data dictionary for the training dataset")
My problem is that in the data dictionary I have provided the variable name and the variable description, but in the variable options column it seems to be trying to print all 100,000 tweets, which I want to avoid. Is it possible for me to set that column up manually as well? Would the option_description argument of the build_dict function be of any help here?
I tried to find information about this online, but to no avail. Here is the link that I have followed so far:
https://cran.r-project.org/web/packages/dataMeta/vignettes/dataMeta_Vignette.html
This is the first time I am trying to build a data dictionary, hence the struggle. Any suggestions would be greatly appreciated. Thanks in advance.

Retain SPSS value labels when working with data

I am analysing student-level data from PISA 2015. The data is available in SPSS format here.
I can load the data into R using the read_sav function from the haven package. I need to be able to edit the data in R and then save/export it in SPSS format with the original value labels from the SPSS download intact. The code I have used is:
library(haven)
student <- read_sav("CY6_MS_CMB_STU_QQQ.sav", user_na = TRUE)
student2 <- data.frame(student)
# some edits to the data
write_sav(student2, "testdata1.sav")
When my colleague (who works in SPSS) tries to open "testdata1.sav", the value labels are missing. I've read through the haven documentation and can't find a solution for this. I have also tried read.spss/write.spss from the foreign package, but I have issues loading the dataset that way.
I am using R version 3.4.0 and the latest build of haven.
Does anyone know if there is a solution for this? I'd be very grateful for your help. Please let me know if you require any additional information to answer this.
library(foreign)
df <- read.spss("spss_file.sav", to.data.frame = TRUE)
This may not be exactly what you are looking for, because it uses the labels as the data. So if you have an SPSS file with 0 for "Male" and 1 for "Female", you will get a data frame whose values are all "Male" and "Female". It gets you one step further, but perhaps isn't the whole solution. I'm working on the same problem and will let you know what else I find.
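If you would rather keep the underlying numeric codes instead of the labels, read.spss() also takes a use.value.labels argument; here is a small sketch of both variants, using the same hypothetical file name as above:
library(foreign)

# default: labelled values are converted to factors with the labels as levels ("Male"/"Female")
df_labels <- read.spss("spss_file.sav", to.data.frame = TRUE, use.value.labels = TRUE)

# keep the raw coded values (0/1) instead of the labels
df_codes <- read.spss("spss_file.sav", to.data.frame = TRUE, use.value.labels = FALSE)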
library ("sjlabelled")
student <- sjlabelled::read_spss("CY6_MS_CMB_STU_QQQ.sav")
student2 <-student
write_spss(student2,"testdata1.sav")
I did not try this myself, but I hope it works. The sjlabelled package handles non-ASCII characters (such as German umlauts) well.
Keep in mind, though, that R stores the labels as attributes. These attributes can be lost when doing some data transformations (subsetting data, for example), and once they are lost in R they won't show up in SPSS either, of course. The sjlabelled::copy_labels function is helpful in those cases:
student2 <- copy_labels(student2, student)  # after data transformations and before export to SPSS
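A small sketch of that flow (the subsetting step and the CNT column are only illustrative assumptions):
library(sjlabelled)

student <- sjlabelled::read_spss("CY6_MS_CMB_STU_QQQ.sav")

# a transformation of the kind warned about above, which can strip the label attributes
student2 <- student[student$CNT == "DEU", ]  # CNT column assumed for illustration

# copy the labels back from the untouched original, then export
student2 <- copy_labels(student2, student)
write_spss(student2, "testdata1.sav")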
I think you need to recover the value labels in the data frame after importing the dataset into R, and then write that data frame to a .sav file.
# load libraries
library(haven)
library(purrr)
library(dplyr)

# load dataset
student <- read_sav("CY6_MS_CMB_STU_QQQ.sav", user_na = TRUE)

# map to find the class of each column
map_dataset <- map(student, function(x) attr(x, "class"))

# loop to identify all haven-labelled columns
factor_variable <- c()
for (i in 1:length(map_dataset)) {
  if (!is.null(map_dataset[[i]])) {
    name <- names(map_dataset[i])
    factor_variable <- c(factor_variable, name)
  }
}

# convert all haven-labelled variables into factors
student2 <- student %>%
  mutate_at(vars(factor_variable), as_factor)

# write dataset
write_sav(student2, "testdata1.sav")
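For what it's worth, with newer versions of dplyr and haven the same conversion can be written more compactly; a sketch, assuming dplyr::across() and haven::is.labelled() are available:
library(haven)
library(dplyr)

# convert every haven-labelled column to a factor in one step
student2 <- student %>%
  mutate(across(where(is.labelled), as_factor))

write_sav(student2, "testdata1.sav")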

Transform a matrix txt file into spectra data for the ChemoSpec package

I want to use ChemoSpec with mass spectra of about 60,000 data points each.
I already have them in one txt file as a matrix (an X column + 90 samples = 91 columns; 60,000 rows).
How can I adapt this file as spectra data without exporting each sample again as a single CSV file (which is quite slow in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which, as the question indicates, requires one CSV file for each sample.
Creating 90 CSV files from the 91-column, 60,000-row file described may be somewhat slow and tedious in R, but it could be done with a standalone application, whether an existing utility or an ad-hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straightforward.
Don't expect such a solution to be sizzling fast, but it should, in any case, compare favourably with the time it takes to run getManyCsv(), and it avoids having to create and manage the many files, so overall it would be faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix after you read it in with read.csv("file.txt"), so you can use it to manually create a Spectra object. In the R console, type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. The rest of the data matrix will go into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
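To make that concrete, here is a minimal sketch of such a manual construction, assuming the matrix has been read in with read.csv("file.txt"), that the mass axis is in a column named X, and that the remaining 90 columns are the samples (group names, colours, and units below are placeholder assumptions; the slot names follow ?Spectra):
library(ChemoSpec)

raw <- read.csv("file.txt")

my.spectra <- list(
  freq    = raw$X,                                  # mass axis goes into the freq slot
  data    = t(as.matrix(raw[, -1])),                # samples in rows, one column per mass value
  names   = colnames(raw)[-1],                      # sample names
  groups  = factor(rep("unknown", ncol(raw) - 1)),  # one group assignment per sample
  colors  = rep("black", ncol(raw) - 1),
  sym     = rep(1, ncol(raw) - 1),
  alt.sym = rep("a", ncol(raw) - 1),
  unit    = c("m/z", "intensity"),
  desc    = "Mass spectra imported from a single matrix"
)
class(my.spectra) <- "Spectra"
chkSpectra(my.spectra)  # check that the object is internally consistent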
Many years have passed and I am not sure whether anybody is still interested in this topic, but I had the same problem and wrote a little workaround to convert my data to class 'Spectra' by extracting the information from the data itself:
# Assumption: the data are stored in a numeric data.frame whose column names
# identify the samples and whose row names hold the domain (frequency) axis.
dataframe2Spectra <- function(Spectrum_df,
                              freq = as.numeric(rownames(Spectrum_df)),
                              data = as.matrix(t(Spectrum_df)),
                              names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
                              groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
                              colors = rainbow(dim(Spectrum_df)[2]),
                              sym = 1:dim(Spectrum_df)[2],
                              alt.sym = letters[1:dim(Spectrum_df)[2]],
                              unit = c("a.u.", "Domain"),
                              desc = "Some signal. Describe it with 'desc'") {
  features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
  Spectrum_chem <- vector("list", length(features))
  names(Spectrum_chem) <- features
  Spectrum_chem$freq <- freq
  Spectrum_chem$data <- data
  Spectrum_chem$names <- names
  Spectrum_chem$groups <- groups
  Spectrum_chem$colors <- colors
  Spectrum_chem$sym <- sym
  Spectrum_chem$alt.sym <- alt.sym
  Spectrum_chem$unit <- unit
  Spectrum_chem$desc <- desc

  # important step: assign the Spectra class
  class(Spectrum_chem) <- "Spectra"

  # some sanity warnings
  if (length(freq) != dim(data)[2]) print("Dimension of data is NOT #samples x length of freq")
  if (length(names) > dim(data)[1]) print("Too many names")
  if (length(names) < dim(data)[1]) print("Too few names")
  if (length(groups) > dim(data)[1]) print("Too many groups")
  if (length(groups) < dim(data)[1]) print("Too few groups")
  if (length(colors) > dim(data)[1]) print("Too many colors")
  if (length(colors) < dim(data)[1]) print("Too few colors")
  if (!is.matrix(data)) print("'data' is not a matrix or it is not numeric")

  return(Spectrum_chem)
}

Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)
