RNAseq - Plotting log2foldchange-basemean but has weird data points - r

I am new to processing RNA-seq data and am practicing by trying to reproduce a published figure. This is the paper, and Fig. 2A is what I'm trying to achieve.
In brief, I downloaded the data with recount3 and subset the samples into the groups I want (control vs condition 1, control vs condition 2, etc.). Then I ran the following code:
library(DESeq2)
dds_4uM_30min <- DESeqDataSetFromMatrix(countData = ha_4uM_30min_data,
                                        colData = ha_4uM_30min_meta,
                                        design = ~ type)
dds2_4uM_30min <- DESeq(dds_4uM_30min)
res_4uM_30min <- results(dds2_4uM_30min, tidy = FALSE)
(type is the column I created to record whether a sample is control or condition 1.)
This is the figure I get, which confuses me because it looks nothing like the original figure.
I suspect the authors did additional processing of the data, but I have no idea what the common or reasonable approaches are.
Furthermore, some data points form lines (as can be seen in the figure above), which do not appear in the original figure. I am wondering what causes this kind of distribution and how to get rid of it.
Thanks in advance for any opinion or suggestion.
I have been trying to use the function lfcShrink but the figure still has this weird line.
Any suggestions on how to further process RNA seq data?
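One possible explanation, offered as a sketch rather than a diagnosis of this particular dataset: curved lines on an MA plot often come from genes with very low counts, because small integer counts can only produce a limited set of fold-change values at a given mean. The example below pre-filters such genes before running DESeq2 and then shrinks the fold changes. It uses simulated data from makeExampleDESeqDataSet; the filter threshold (at least 10 reads in at least 3 samples) and the choice of type = "normal" are assumptions for illustration, not values taken from the paper.

```r
library(DESeq2)

# Simulated stand-in for the real data: 1000 genes, 6 samples, design ~ condition
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Drop genes with too few reads (hypothetical threshold: >= 10 reads in >= 3 samples)
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

dds <- DESeq(dds)

# Shrink log2 fold changes for the second coefficient (condition B vs A here)
coef_name <- resultsNames(dds)[2]
res_shrunk <- lfcShrink(dds, coef = coef_name, type = "normal")

# MA plot: mean of normalized counts vs shrunken log2 fold change
plotMA(res_shrunk, ylim = c(-3, 3))
```

With real data, the coefficient name would follow the design column (type in the question), and resultsNames() shows the available choices.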

How do I use prodlim function with a non-binary variable in formula?

I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works when dividing the data into 2 groups, but when I try to adapt it to 4 groups, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and sadly I haven't understood much of what I've looked up about it :/ But the documentation seems to indicate that it supports continuous or categorical variables, and it doesn't seem limited to binary ones either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary value =1 when a member of this population is taking it, and dz is a disease that some portion develop.
this does not:
(either of these get the error as below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it.. :/
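A likely cause, stated here as an assumption based on the prodlim documentation rather than on the actual data: prodlim treats a numeric covariate as continuous once it has more unique values than its discrete.level argument (default 3), and then smooths over neighborhoods of the covariate, which is where KernSmooth::dpik gets called. A binary 0/1 variable stays under that limit; an integer variable with four values does not. Converting the grouping variable to a factor asks for one curve per level instead. The data below is simulated, and strength_f is a column name introduced only for this sketch.

```r
library(prodlim)

# Simulated stand-in for medpop: event time, event status, 4-level strength
set.seed(1)
n <- 200
medpop <- data.frame(
  dz_time   = rexp(n, rate = 0.1),
  dz_status = rbinom(n, 1, 0.6),
  strength  = sample(1:4, n, replace = TRUE)
)

# Make the grouping variable an explicit factor so it is treated as categorical
medpop$strength_f <- factor(medpop$strength)

N <- prodlim(Hist(dz_time, dz_status) ~ strength_f, data = medpop)

# One curve per strength level
plot(N)
```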

Rstudio - how to write smaller code

I'm brand new to programming and am picking up RStudio as a stats tool.
I have a dataset which includes multiple questionnaires divided by weeks, and I'm trying to organize the data into meaningful chunks.
Right now this is what my code looks like:
w1a=table(qwest1,talm1)
w2a=table(qwest2,talm2)
w3a=table(quest3,talm3)
Where quest and talm are the names of the variable and the number denotes the week.
Is there a way to compress all those lines into one line of code so that I could make w1a,w2a,w3a... each their own object with the corresponding questionnaire added in?
Thank you for your help, I'm very new to coding and I don't know the etiquette or all the vocabulary.
This might do what you wanted (but not what you asked for):
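# SIMPLIFY = FALSE keeps the result a list even when all tables have the same shape
tbl_list <- mapply(table, list(qwest1, qwest2, quest3),
                   list(talm1, talm2, talm3), SIMPLIFY = FALSE)
names(tbl_list) <- c('w1a', 'w2a', 'w3a')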
You are committing a fairly typical new-R-user error in creating multiple similarly named and structured objects but not putting them in a list. This is my effort at pushing you in that direction. Could also have been done via:
qwest_lst <- list(qwest1, qwest2, quest3)
talm_lst <- list(talm1, talm2, talm3)
tbl_lst <- mapply(table, qwest_lst, talm_lst, SIMPLIFY = FALSE)
names(tbl_lst) <- paste0('w', 1:3, 'a')
There are other ways to programmatically access objects with character vectors, using get or mget.
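For illustration, here is a minimal sketch of looking the weekly variables up by name with mget and building all the tables in one pass. The toy vectors and the consistent quest/talm naming are assumptions made for the example, not the asker's real data.

```r
# Toy stand-ins for the weekly questionnaire variables
quest1 <- c("yes", "no", "yes"); talm1 <- c("a", "a", "b")
quest2 <- c("no", "no", "yes");  talm2 <- c("b", "a", "b")
quest3 <- c("yes", "yes", "no"); talm3 <- c("a", "b", "b")

weeks <- 1:3

# mget() fetches several objects by name and returns them as a named list
quest_lst <- mget(paste0("quest", weeks))
talm_lst  <- mget(paste0("talm", weeks))

# Map() always returns a list, one cross-table per week
tbl_lst <- Map(table, quest_lst, talm_lst)
names(tbl_lst) <- paste0("w", weeks, "a")

tbl_lst$w2a
```

Keeping the tables in one list means later steps can loop over tbl_lst instead of repeating code per week.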

why is this R code for a t-sne analysis running so slowly

I am trying to perform a t-sne analysis on a file with 39772 columns and 170 rows.
I first used the "Rtsne" package, but that package seems to have a limit of 10,000 columns as R keeps aborting every time I run the code with the entire file.
Because of this, I changed the package to "tsne" instead of "Rtsne" but now the code is taking FOREVER to run (like over 2 hours). This is what I have so far...I've read other posts but nothing seems to apply to my problem. I'd appreciate any ideas on what I can do to fix this and actually see an output.
CODE USING "TSNE" PACKAGE (taking 2+ hours to run...still haven't seen an output):
exp =read.csv("tsnedata.csv")
library(tsne)
exp1=t(exp)
exp2=matrix(as.numeric(unlist(exp1)),nrow=nrow(exp1))
exp3=data.matrix(exp2)
cols=rainbow(10)
ecb=function(x,y){plot(x, t='n'); text(x, col=cols);}
tsne_res=tsne(exp3, epoch_callback=ecb, perplexity=50, epoch=50)
ORIGINAL CODE USING "RTSNE" PACKAGE (this is the code that immediately causes R to abort unless I run the code using only the first 10,000 columns of the data):
exp<- read.csv("tsnedata.csv")
library(Rtsne)
exp1=t(exp)
exp2=matrix(as.numeric(unlist(exp1)),nrow=nrow(exp1))
exp3 <- data.matrix(exp2)
tsne <- Rtsne(as.matrix(exp3), check_duplicates = FALSE, pca = FALSE, perplexity=30, theta=0.5, dims=2)
cols <- rainbow(10)
plot(tsne$Y, t='n')
text(tsne$Y, col=cols)
If you are dealing with scRNA-seq data and want each cell to appear as one dot in the t-SNE plot, here are my thoughts:
1. Make sure your input is a cell-by-gene expression matrix.
2. Do dimension reduction first (e.g. PCA), and feed only the first few principal components into Rtsne.
Rtsne is based on the Barnes-Hut implementation, so it is much faster than the original t-SNE implementation; it is also the better choice for t-SNE analysis, as it fixed some bugs from the original tsne package. However, in my experience, tsne produces cuter (rounder, ball-like) visualizations than Rtsne.
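The two steps can be sketched roughly as follows, assuming the rows are the samples (or cells) to plot and the columns are features. The matrix is random data sized to keep the example quick, and keeping 30 principal components is an arbitrary choice for illustration.

```r
library(Rtsne)

# Simulated stand-in: 170 samples (rows) x 2000 features (columns)
set.seed(42)
expr <- matrix(rnorm(170 * 2000), nrow = 170)

# Step 1: reduce dimensionality with PCA and keep the leading components
pca <- prcomp(expr, center = TRUE)
top_pcs <- pca$x[, 1:30]

# Step 2: run Barnes-Hut t-SNE on the components only;
# pca = FALSE because the PCA step was already done above
tsne_out <- Rtsne(top_pcs, dims = 2, perplexity = 30, theta = 0.5,
                  pca = FALSE, check_duplicates = FALSE)

# One dot per sample
plot(tsne_out$Y, xlab = "t-SNE 1", ylab = "t-SNE 2", pch = 19)
```

Rtsne requires 3 * perplexity to be smaller than the number of rows minus one, so a perplexity of 30 is fine for 170 samples but would need lowering for much smaller inputs.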

Creating vectors between specific values in a dataset with R

I have a rather special dataset and a specific thing I want to do with it. To make the question understandable, I need to briefly describe the background:
I have a sensor producing data, which needs maintenance every now and then. Between maintenance sessions the data has a decreasing trend that I want to get rid of, and since maintenance is carried out quite often, I want to automate this procedure.
The sensor is turned off during maintenance, but the telemetry system still produces readings marked with "*". The subsets of data to be detrended can therefore be easily spotted between batches of "*" readings.
I have been (unsuccessfully) trying to create a vector (on which I can then carry out a detrending procedure) by looping through the data and selecting the desired values with conditional statements. To begin selecting the values I used the following statement:
if((tryp[i-2,2]="*")&(tryp[i-1,2]="*")&(tryp[i,2]!="*"))
and to finish the selection (exit the loop):
if((tryp[i-2,2]!="*")&(tryp[i-1,2]!="*")&(tryp[i,2]="*"))
However, this last statement gives an error of "argument is of length zero" and the first statement doesn't seem to be working properly either.
This is what the data looks like
So for example, one subset of data that I would like to select for detrending lies between data points 9686 and 9690. Obviously this is a very small subset, but it shows well what I am trying to communicate.
I would really appreciate it if someone could suggest an elegant way of doing this, including anything very different from what I was originally trying to do.
Many thanks!
library(dplyr)
my_df <- data.frame(a = LETTERS[1:10], b = c('+','*','*', '+', '*', '*', '+', '+', '*', '*'))
my_df %>% filter(b != '*')
Assuming the '+' signs are your data points, you can easily get rid of the '*' signs by keeping only the rows that do not contain them.
And of course a solution without the dplyr-package:
my_df[which(my_df$b!='*'),]
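If the goal is to keep each stretch between maintenance periods as its own piece rather than only dropping the '*' rows, the runs can be labeled and split. The following is a sketch on made-up data: the column names time and reading are hypothetical, and the linear fit per segment is just one assumed way to detrend.

```r
# Toy stand-in for tryp: column 2 holds readings, "*" marks maintenance
tryp <- data.frame(
  time    = 1:12,
  reading = c("*", "*", "5.0", "4.7", "4.6", "*", "*", "*",
              "6.0", "5.8", "5.4", "*"),
  stringsAsFactors = FALSE
)

is_data <- tryp$reading != "*"

# Increase the run id each time a "*" row is followed by a data row
starts <- c(is_data[1], diff(as.integer(is_data)) == 1)
run_id <- cumsum(starts)

# One data frame per uninterrupted run of real readings
segments <- split(tryp[is_data, ], run_id[is_data])

# Detrend each run by removing a straight-line fit (assumed trend shape)
detrended <- lapply(segments, function(seg) {
  seg$reading <- as.numeric(seg$reading)
  seg$detrended <- residuals(lm(reading ~ time, data = seg))
  seg
})

detrended[["1"]]
```

This avoids the explicit loop and the i-2 / i-1 indexing, which fails at the first rows where those indices do not exist.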

Plot going off graph in gvisMotionChart

I have created a plot in R using googleVis, specifically gvisMotionChart, plotting a number of variables.
I am primarily using the line graph, and all is well when I view the graph with all variables. However, when I select some of the individual variables, it zooms in such that part of the plot for that variable is no longer on the graph. I know it should zoom in to show just this variable and may exclude the others (which is a good feature), but it zooms in too far, so the variable I am after is not entirely on the graph.
This doesn't happen with all variables, and I can get around it by also selecting other variables either side of the one which I want to view, but it would be good if I could fix this. Has anyone come across a similar problem before and know a way around it?
Thanks in advance
EDIT: I have an example of this using the Batting data from the Lahman package. (I know nothing about baseball, so the analysis probably doesn't make sense; in fact, looking at the results, it almost certainly doesn't, but it illustrates my point.) If you run the following code:
library(googleVis)  # provides gvisMotionChart
library(Lahman)
recent <- subset(Batting, yearID > 2000)
homeruns <- aggregate(HR ~ stint + yearID, data = recent, FUN = sum)
avgHR <- mean(homeruns$HR)
homeruns$HR <- homeruns$HR - avgHR
m <- gvisMotionChart(data = homeruns, idvar = "stint", timevar = "yearID")
plot(m)
Then select the line graph and subset on number 2: the top part of the graph is cut off.
It seems to be a bug on Google's side. I could even reproduce the same error in their "Visualization Playground" (https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart) by making part of the data negative.
I've already reported the issue as a bug: https://code.google.com/p/google-visualization-api-issues/issues/detail?id=1479
May the force be with them!
I just had the same problem w/ a Sankey plot. I resolved it by deleting entries with value==0. However, I just tried to reproduce your example and could not reproduce your bug, so perhaps this has already been solved?
