I am running a DESeq analysis through importing counts from FeatureCounts. I generate the counts matrix but when I get to running the DESeqMatrixFromDataSet, I get the following error. I have checked my counts several times and I don't see any negative results.
I will appreciate any help with this.
library(purrr)
library(tidyverse)
f_files<- list.files("C:/Users/cash/Desktop/DESeq analysis 2", pattern = "\\.txt$", full.names = T)
read_in_feature_counts<- function(file){
cnt<- read_tsv(file, col_names =T, comment = "#")
cnt<- cnt %>% dplyr::select(-Chr, -Start, -End, -Strand, -Length)
return(cnt)
}
raw_counts<- map(f_files, read_in_feature_counts)
raw_counts_df<- purrr::reduce(raw_counts, inner_join)
head(raw_counts_df)
# Assign condition (first four are controls, second four contain the expansion)
condition <- factor(c("Donor1-1_S1_R2_001", "Donor1-2_S2_R2_001", "Donor2-1_S10_R2_001", "Donor2-2_S11_R2_001","Donor3-1_S1_R2_001", "Donor3-2_S2_R2_001", "Donor4-1_S10_R2_001", "Donor4-2_S11_R2_001"))
library(DESeq2)
# Create a coldata frame and instantiate the DESeqDataSet. See ?DESeqDataSetFromMatrix
coldata <- data.frame(row.names=colnames(raw_counts_df))
dds <- DESeqDataSetFromMatrix(countData=raw_counts_df, colData=coldata, design=~condition)
**Error in DESeqDataSet(se, design = design, ignoreRank) :
some values in assay are negative**
Here is the new error:
> dds <- DESeqDataSetFromMatrix(countData=raw_counts_df, colData=coldata, design=~condition)
converting counts to integer mode
Error in DESeqDataSet(se, design = design, ignoreRank) :
all variables in design formula must be columns in colData
Related
I'm using R to create a descriptive actor graph based on Twitter data acquired with vosonSML. I'm trying to use the "edge_cleanup" command (using code that's been provided to me) to clean up self-loops (replies to self). When I do, I receive the following error message:
Error in data.frame(name = as.character(V(actorGraph)$name), label = as.character(V(actorGraph)$label)) :
arguments imply differing number of rows: 913, 0
Can anyone tell me why I'm getting this error message, and how to resolve it? The full excerpted code is below
actorGraph <- twitterData %>%
Create("actor") %>%
Graph
## clean up the graph data removing self-loop
edge_cleanup <- function(graph = actorGraph){
library(igraph)
df <- get.data.frame(actorGraph)
names_list <- data.frame('name' = as.character(V(actorGraph)$name),
'label' = as.character(V(actorGraph)$label))
df$from <- sapply(df$from, function(x) names_list$label[match(x,names_list$name)] %>% as.character())
df$to <- sapply(df$to, function(x) names_list$label[match(x,names_list$name)] %>% as.character())
nodes <- data.frame(sort(unique(c(df$from,df$to))))
links <- df[,c('from','to')]
net <- graph.data.frame(links, nodes, directed=T)
E(net)$weight <- 1
net <- igraph::simplify(net,edge.attr.comb="sum")
return(net)
}
It's after this last command (edge_cleanup <- function(graph = actorGraph) that I receive the above error message.
I desperately need help!
I am trying to predict drug use based on 5 characteristics: Age, Gender, Education, Ethnicity, Country. I already build a tree model in R with rpart
DrugTree3 <- rpart(formula = DrugUser ~ Age+Gender+Education+Ethnicity+Country, data = traindata)
, a logistic regression model
DrugLog <- glm(formula = DrugUser ~ Age+Gender+Ethnicity+Education+Country,data = traindata, family = binomial)
, and a knn model
KnnModel <- train(form = DrugUser~., data = ModelData,method ='knn',tuneGrid=expand.grid(.k=1:100),metric='Accuracy',trControl=trainControl(method='repeatedcv',number=10,repeats=10)) .
I saved those as RDS files and uploaded them successfully in Power BI.
I then created tables for each characterization and created okviz filters for them.
Then I tried to predict whether a customer gets predicted as a drug user or a non-drug user based on the selections in the okviz filters. This is when everything went horribly wrong:
I created a custom R visual vor each model prediction and inserted the following code in each visual:
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class = %s) = %s",myclass,as.character(round(myprob,2))),cex=3.5)
Error: Can't determine relationship between fields.
What has gone wrong here?
When I then clicked on the diagonal arrow to get to R Studio, this happens: Unable to construct R script data for use in external R IDE.
I need help as I am literally going crazy over this and I don't know how to resolve the issue! I would be really happy if you can help me
enter image description here
You made a error in line 34, and line 25.
Below is a fixed version of your code.
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class =
Good Luck!
I am trying to create a loop to go through and perform a correlation (and in future a partial correlation) using ppcor function on variables stored within a data frame. The first variable (A) will remain the same for all correlations, whilst the second variable (B) will be the next variable along in the next column within my data frame. I have around 1000 variables.
I show the mtcars dataset below as an example, as it is in the same layout as my data.
I've been able to complete the operation successfully when performed manually using cbind to bind 2 columns (the 2 variables of interest) prior to running ppcor on the array ("tmp_df"). I have then been able to bind the output from correlation operation ("mpg_cycl"), ("mpg_disp") into a single object. However I can't get any of this operation to work in a loop. Any ideas please?
library("MASS")
install.packages("ppcor")
library("ppcor")
mtcars_df <- as.data.frame(mtcars)
tmp_df = cbind(mtcars_df$mpg, mtcars_df$cycl)
mpg_cycl <- pcor(as.matrix(tmp_df), method = 'spearman')
tmp_df1= cbind(mtcars_df$mpg, mtcars_df$disp)
mpg_disp <- pcor(as.matrix(tmp_df1), method = 'spearman')
combined_table <- do.call(cbind, lapply(list("mpg_cycl" = mpg_cycl,
mpg_disp" = mpg_disp), as.data.frame, USE.NAMES = TRUE))
attempting to loop above operation ## (ammended after last reviewer's comments:
for (i in mtcars_df[2:7]){
tmp_df = (cbind(i, mtcars_df$mpg)
i <- pcor(as.matrix(tmp_df), method = 'spearman')
write.csv(i, file = paste0("MyDataOutput",i[1],".csv")
}
I expected the loop to output two of the correlations results to MyDataOutput csv file. But this generates an error message, I thought i was in the correct place?:
Error: unexpected symbol in:
" tmp_df = (cbind(i, mtcars_df$mpg)
i"
Even adding a curly bracket at the end does not resolve issue so I have left this out as it introduces another error message '}'
I have redone some of your code and fixed missing ), }, ". The for cyckle now outputs file with name + name of the variable. Hope this will help.
library("MASS")
#install.packages("ppcor")
library("ppcor")
mtcars_df <- as.data.frame(mtcars)
tmp_df = cbind(mtcars_df$mpg, mtcars_df$cycl)
mpg_cycl <- pcor(as.matrix(tmp_df), method = 'spearman')
tmp_df1= cbind(mtcars_df$mpg, mtcars_df$disp)
mpg_disp <- pcor(as.matrix(tmp_df1), method = 'spearman')
combined_table <- do.call(cbind, lapply(list("mpg_cycl" = mpg_cycl,
"mpg_disp" = mpg_disp), as.data.frame, USE.NAMES = TRUE))
for(i in colnames(mtcars_df[2:7])){
tmp_df = mtcars_df[c(i,"mpg")]
i_resutl <- pcor(as.matrix(tmp_df), method = 'spearman')
write.csv(i_resutl, file = paste0("MyDataOutput_",i,".csv"))
}
for merging before saving:
dta <- c()
for(i in colnames(mtcars_df[2:7])){
tmp_df = mtcars_df[c(i,"mpg")]
i_resutl <- pcor(as.matrix(tmp_df), method = 'spearman')
dta <- rbind(dta,c(i,(unlist( i_resutl))))
}
I am using this example to conduct sentiment analysis of a collection of txt documents in R. The code is:
library(tm)
library(tidyverse)
library(tidytext)
library(glue)
library(stringr)
library(dplyr)
library(wordcloud)
require(reshape2)
files <- list.files(inputdir,pattern="*.txt")
GetNrcSentiment <- function(file){
fileName <- glue(inputdir, file, sep = "")
fileName <- trimws(fileName)
fileText <- glue(read_file(fileName))
fileText <- gsub("\\$", "", fileText)
tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
# get the sentiment from the first text:
sentiment <- tokens %>%
inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
count(sentiment) %>% # count the # of positive & negative words
spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
mutate(sentiment = positive - negative) %>% # positive - negative
mutate(file = file) %>% # add the name of our file
mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
mutate(city = str_match(file, "(.*?).2")[2])
return(sentiment)
}
The .txt files are stored in inputdirand have names AB-City.0000, where AB is an abbreviation of a country, City is a city name and 0000 is year (ranges from 2000 to 2017).
The function works for a single file as expected, i.e. GetNrcSentiment(files[1]) gives me a tibble with proper counts per sentiment. However, when i try to run it for the whole set, i.e.
nrc_sentiments <- data_frame()
for(i in files){
nrc_sentiments <- rbind(nrc_sentiments, GetNrcSentiment(i))
}
I get the following error message:
Joining, by = "word"
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The exact same code works well with longer documents, but gives an error when dealing with shorter texts. It seems that not all sentiments are found in small documents and as a result the number of columns vary for each document, which might lead to this error, but I am not sure. I would appreciate any advice on how to fix the problem. If a sentiment is not found, I would want the entry to be equal to zero (if it is the cause of my problem).
As an aside, bing sentiment function runs through about two dozen of files and gives a different error, which seems to point to the same problem (negative sentiment not found?):
GetBingSentiment <- function(file){
fileName <- glue(inputdir, file, sep = "")
fileName <- trimws(fileName)
fileText <- glue(read_file(fileName))
fileText <- gsub("\\$", "", fileText)
tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
# get the sentiment from the first text:
sentiment <- tokens %>%
inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
count(sentiment) %>% # count the # of positive & negative words
spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
mutate(sentiment = positive - negative) %>%
mutate(file = file) %>% # add the name of our file
mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
mutate(city = str_match(file, "(.*?).2")[2])
# return our sentiment dataframe
return(sentiment)
}
Error in mutate_impl(.data, dots) :
Evaluation error: object 'negative' not found.
EDIT: Following the recommendation by David Klotz I edited the code to
for(i in files){ nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i)) }
As a result, instead of throwing an error the nrc generates NA if words from a certain sentiment are not found, however after 22 joinings i get a different error:
Error in mutate_impl(.data, dots) : Evaluation error: object 'negative' not found.
The same error shows up when run the bing function with dplyr. Both dataframes by the time the functions reaches 22nd document contain columns for all sentiments. What may cause the error and how to can diagnose it?
dplyr's bind_rows function is more flexible than rbind, at least when it comes to missing columns:
nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
The input might be missing the "negative" column that is used in the expression
I have an object (variable rld) which looks a bit like a "data.frame" (see further down the post for details) in that it has columns that can be accessed using $ or [[]].
I have a vector groups containing names of some of its columns (3 in example below).
I generate strings based on combinations of elements in the columns as follows:
paste(rld[[groups[1]]], rld[[groups[2]]], rld[[groups[3]]], sep="-")
I would like to generalize this so that I don't need to know how many elements are in groups.
The following attempt fails:
> paste(rld[[groups]], collapse="-")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) :
attempt to extract more than one element
Here is how I would do in functional-style with a python dictionary:
map("-".join, zip(*map(rld.get, groups)))
Is there a similar column-getter operator in R ?
As suggested in the comments, here is the output of dput(rld): http://paste.ubuntu.com/23528168/ (I could not paste it directly, since it is huge.)
This was generated using the DESeq2 bioinformatics package, and more precisely, doing something similar to what is described page 28 of this document: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf.
DESeq2 can be installed from bioconductor as follows:
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
Reproducible example
One of the solutions worked when running in interactive mode, but failed when the code was put in a library function, with the following error:
Error in do.call(function(...) paste(..., sep = "-"), colData(rld)[groups]) :
second argument must be a list
After some tests, it appears that the problem doesn't occur if the function is in the main calling script, as follows:
library(DESeq2)
library(test.package)
lib_names <- c(
"WT_1",
"mut_1",
"WT_2",
"mut_2",
"WT_3",
"mut_3"
)
file_names <- paste(
lib_names,
"txt",
sep="."
)
wt <- "WT"
mut <- "mut"
genotypes <- rep(c(wt, mut), times=3)
replicates <- c(rep("1", times=2), rep("2", times=2), rep("3", times=2))
sample_table = data.frame(
lib = lib_names,
file_name = file_names,
genotype = genotypes,
replicate = replicates
)
dds_raw <- DESeqDataSetFromHTSeqCount(
sampleTable = sample_table,
directory = ".",
design = ~ genotype
)
# Remove genes with too few read counts
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ]
dds$group <- factor(dds$genotype)
design(dds) <- ~ replicate + group
dds <- DESeq(dds)
test_do_paste <- function(dds) {
require(DESeq2)
groups <- head(colnames(colData(dds)), -2)
rld <- rlog(dds, blind=F)
stopifnot(all(groups %in% names(colData(rld))))
combined_names <- do.call(
function (...) paste(..., sep = "-"),
colData(rld)[groups]
)
print(combined_names)
}
test_do_paste(dds)
# This fails (with the same function put in a package)
#test.package::test_do_paste(dds)
The error occurs when the function is packaged as in https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
Data used in the example:
WT_1.txt
WT_2.txt
WT_3.txt
mut_1.txt
mut_2.txt
mut_3.txt
I posted this issue as a separate question: do.call error "second argument must be a list" with S4Vectors when the code is in a library
Although I have an answer to my initial question, I'm still interested in alternative solutions for the "column extraction using a vector of column names" issue.
We may use either of the following:
do.call(function (...) paste(..., sep = "-"), rld[groups])
do.call(paste, c(rld[groups], sep = "-"))
We can consider a small, reproducible example:
rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1,3,5,6,8)]
do.call(paste, c(rld[groups], sep = "-"))
#[1] "21-160-3.9-2.62-0" "21-160-3.9-2.875-0" "22.8-108-3.85-2.32-1"
#[4] "21.4-258-3.08-3.215-1" "18.7-360-3.15-3.44-0"
Note, it is your responsibility to ensure all(groups %in% names(rld)) is TRUE, otherwise you get "subscript out of bound" or "undefined column selected" error.
(I am copying your comment as a follow-up)
It seems the methods you propose don't work directly on my object. However, the package I'm using provides a colData function that makes something more similar to a data.frame:
> class(colData(rld))
[1] "DataFrame"
attr(,"package")
[1] "S4Vectors"
do.call(function (...) paste(..., sep = "-"), colData(rld)[groups]) works, but do.call(paste, c(colData(rld)[groups], sep = "-")) fails with an error message I fail to understand (as too often with R...):
> do.call(paste, c(colData(rld)[groups], sep = "-"))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mcols’ for signature ‘"character"’