R design.matrix issue -- dropped column in design matrix? - r

I'm having an odd problem while trying to set up a design matrix to do downstream pairwise differential expression analysis on RNAseq data.
For the design matrix, I have both the donor information and each condition:
group<-factor(y$samples$group) #44 samples, 6 different conditions
sample<-factor(y$samples$samples) #44 samples, 11 different donors.
design<- model.matrix(~0+sample+group)
head(design)
Donor11.CD8 Donor12.CD8 Donor14.CD8 Donor15.CD8 Donor16.CD8
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Donor17.CD8 Donor18.CD8 Donor19.CD8 Donor20.CD8 Donor3.CD8
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
Donor4.CD8 Treatment2 Treatment3 Treatment4 Treatment5
1 0 0 0 0 0
2 0 0 0 0 1
3 0 0 0 1 0
4 0 0 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
Treatment6
1 1
2 0
3 0
4 0
5 0
6 0
The issue is that I seem to be losing a condition (treatment 1) when I form the design matrix, and I'm not sure why.
Many thanks, in advance, for your help!

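That's not a problem. Treatment 1 is indicated by zeros in all of the treatment columns of the design matrix. Look at row 4: it has zero for Treatments 2 through 6, which means it is Treatment 1. This is called a "treatment contrast" because the coefficients in the model contrast each named treatment against the "base" level, which in this case is Treatment1. Removing the intercept with ~0 gives a column for every level of the first factor only (sample); later factors such as group still drop their first level, because a column for every level of both factors would make the design matrix rank-deficient.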
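To see the same behaviour on a small scale, here is a minimal sketch with made-up data (three hypothetical donors and three hypothetical treatments, not the poster's 44 samples). The column names are the defaults that model.matrix() generates, which differ from the renamed columns in the question.

```r
# Toy data (hypothetical): 3 donors, each measured under 3 treatments
sample <- factor(rep(c("D1", "D2", "D3"), each = 3))
group  <- factor(rep(c("Treatment1", "Treatment2", "Treatment3"), times = 3))

design <- model.matrix(~ 0 + sample + group)

# Every donor gets a column, but the first treatment level does not
colnames(design)

# The baseline is simply the first factor level
baseline <- levels(group)[1]

# Rows for the baseline treatment are all zero in the treatment columns
design[group == baseline, grep("^group", colnames(design))]

# To use a different baseline, relevel the factor before building the matrix
group2  <- relevel(group, ref = "Treatment3")
design2 <- model.matrix(~ 0 + sample + group2)
colnames(design2)
```

If a different treatment should act as the reference, relevel() as in the last lines is usually simpler than editing the matrix by hand.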

Related

R - merge/combine columns with same name but some data values equal zero

First of all, I have a matrix of features and a data.frame of features from two separate text sources. On each of those, I have performed different text mining methods. Now, I want to combine them but I know some of them have columns with identical names like the following:
> dtm.matrix[1:10,66:70]
cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
> dim(dtm.matrix)
[1] 14300 6543
And the second set looks like this:
> data1.sub[1:10,c(1,37:40)]
Data number cough coughing up blood dehydration dental abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0
> dim(data1.sub)
[1] 14300 168
I got this code from this topic but I'm new to R and I still need some help with it:
`data1.sub.merged <- dcast.data.table(merge(
## melt the first data.frame and set the key as ID and variable
setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
## melt the second data.frame
melt(as.data.table(dtm.matrix), id.vars = "Data number"),
## you'll have 2 value columns...
all = TRUE)[, value := ifelse(
## ... combine them into 1 with ifelse
(value.x == 0), value.y, value.x)],
## This is the reshaping formula
"Data number" ~ variable, value.var = "value")`
When I run this code, it returns a 1 x 6667 matrix and doesn't merge "cough" (or any other shared column) from the two data sets. I'm confused. Could you help me understand how this works?
There are many ways to do that, e.g. using base R, data.table or dplyr. The choice depends on the volume of your data. If you work with very large matrices (which is usually the case with natural language processing and bag-of-words representations), you may need to try several approaches and profile them to find the quickest one.
I did what you wanted via dplyr. This is a bit ugly but it works. I merge the two data frames, then loop over the variables that exist in both: sum the two copies (variable.x and variable.y) into one column and then delete the suffixed columns. Note that I changed your column names slightly for reproducibility, but that shouldn't have any impact. Please let me know if that works for you.
df1 <- read.table(text =
' cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0')
df2 <- read.table(text =
' Data_number cough coughing_up_blood dehydration dental_abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0')
# Check what variables are common
common <- intersect(names(df1),names(df2))
# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))
# Merge dataframes
df <- merge(df1, df2,by = "ID")
# Sum and clean common variables left in merged dataframe
library(dplyr)
for (variable in common) {
  # Create a summed variable
  df[[variable]] <- df %>% select(starts_with(paste0(variable, "."))) %>% rowSums()
  # Delete columns with .x and .y suffixes
  df <- df %>% select(-one_of(c(paste0(variable, ".x"), paste0(variable, ".y"))))
}
df
ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1 1 0 0 0 0 1 0 0 0 1
2 2 0 0 0 0 3 0 0 0 2
3 3 0 0 0 0 6 0 0 0 0
4 4 0 0 0 0 8 0 0 0 0
5 5 0 0 0 0 9 0 0 0 0
6 6 0 0 0 0 11 0 0 0 2
7 7 0 0 0 0 12 0 0 0 0
8 8 0 0 0 0 13 0 0 0 0
9 9 0 0 0 0 15 0 0 0 0
10 10 0 0 0 0 16 0 0 0 1
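Note that summing turns a term found by both sources into 2. If the merged columns should stay 0/1 indicators (which is what the ifelse() in the question appears to aim for), a minimal base R sketch could use pmax() instead. It assumes, like the merge by ID above, that row i of both tables describes the same record; the small data frames here are made up for illustration.

```r
# Hypothetical miniature versions of the two feature tables
df1 <- data.frame(cough = c(1, 1, 0), nasal = c(0, 0, 1))
df2 <- data.frame(Data_number = c(1, 3, 6),
                  cough       = c(0, 1, 0),
                  dehydration = c(0, 0, 1))

common <- intersect(names(df1), names(df2))

# Start with the columns that appear in only one table
combined <- cbind(df2[setdiff(names(df2), common)],
                  df1[setdiff(names(df1), common)])

# For shared columns keep the element-wise maximum, so 1 means "found in either source"
for (v in common) {
  combined[[v]] <- pmax(df1[[v]], df2[[v]])
}

combined
```

Swapping pmax() for `+` reproduces the summed version shown above.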

sum or group specific columns based on clusters in r

So I have a data set of species and abundances, here is a sample of it:
aca.qua aca.bah aca.chi achi.lin alb.vul alu.mon ani.vir arc.rho asp.lun aux.roc bag.bag bag.mar bal.cap cal.cal cal.pen
1 0 0 0 0 5 0 57 0 0 0 0 0 0 0 16
2 0 0 1 0 2 0 3 0 0 0 0 8 0 0 0
3 0 0 0 0 1 0 3 0 0 0 0 0 0 0 3
4 0 0 0 0 5 0 0 0 22 0 0 94 0 0 0
5 0 0 0 0 1 0 0 0 0 2 3 2 0 0 1
6 0 0 0 0 0 0 0 1 0 0 2 2 0 0 0
I ran a cluster analysis on some of the species traits and came up with clusters in which each species should be included:
aca.qua aca.bah aca.chi achi.lin alb.vul alu.mon ani.vir arc.rho asp.lun aux.roc bag.bag bag.mar bal.cap cal.cal cal.pen
1 1 1 2 3 1 4 4 1 5 4 4 1 1 1
"aca.qua" should be in cluster 1, as well as "aca.bah", "aca.chi" and "alu.mon", etc. "achi.lin" in cluster two and so on.
I was trying to come up with code that uses the references in the second data frame to group the columns by cluster and sum them. I tried to do so with dplyr, mutate and some loops, but I never found a good way of doing it. I tried adding the clusters as a row, then using t() to transpose and select(), then transposing back, etc., but it was getting way too complicated.
Is there any way that I can use the vector containing the names of the species and their clusters as a reference to sum the respective columns of each cluster?
The idea is to end up with something like this, but for all the clusters:
V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 cluster1
1 1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1
4 1 0 0 0 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 0 22
6 0 1 0 0 0 0 0 0 0 0 0
Here I used the following code:
teste4 <- teste3 %>%
filter(V1 == 1) %>%
select(-1)
teste5 <- teste4 %>%
mutate(cluster1 = rowSums(teste4[, 1:rowSums(teste4)]))
The point here is that I will also try several different cluster methods and models. Therefore, I need to make this more automatic when I come up with new cluster combinations, instead of manually selecting each column (the original dataset is much larger).
Try summing the columns that belong to each cluster with rowSums. We can wrap it in an lapply call to cycle through each unique cluster:
lst <- lapply(1:max(df2[1, ]), function(x) rowSums(df1[, df2[1, ] == x, drop = FALSE]))
setNames(data.frame(lst), paste0("clust", seq_along(lst)))
# clust1 clust2 clust3 clust4 clust5
# 1 16 0 5 57 0
# 2 1 0 2 11 0
# 3 3 0 1 3 0
# 4 22 0 5 94 0
# 5 1 0 1 5 2
# 6 0 0 0 5 0
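Since the cluster assignment will change with every method you try, another option is to keep it as a named vector and let rowsum() do the grouping. This is only a sketch on made-up data; `abund` and `clusters` are hypothetical names standing in for the abundance table and the species-to-cluster lookup.

```r
# Hypothetical abundance table: rows are sites, columns are species
abund <- data.frame(aca.qua  = c(0, 2),
                    aca.bah  = c(1, 0),
                    achi.lin = c(0, 1),
                    alb.vul  = c(5, 2),
                    ani.vir  = c(57, 3),
                    arc.rho  = c(0, 4))

# Hypothetical lookup: species name -> cluster number
clusters <- c(aca.qua = 1, aca.bah = 1, achi.lin = 2,
              alb.vul = 3, ani.vir = 4, arc.rho = 4)

# rowsum() adds up rows by group, so transpose to put species in rows,
# group them by cluster, then transpose back to sites x clusters
sums <- t(rowsum(t(as.matrix(abund)), group = clusters[colnames(abund)]))
colnames(sums) <- paste0("clust", colnames(sums))

sums
```

Indexing `clusters` by `colnames(abund)` means the lookup does not have to be in the same order as the columns, so a new clustering only requires replacing the `clusters` vector.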

Finding structural holes constraint, efficiency, ego density and effective size in R

I am working with an adjacency matrix and want to compute these measures with the egonet package. But when I run the command index.egonet, it gives me an error.
My adjacency matrix "p2":
p2
1 2 3 4 5 7 8 9 6
1 0 1 1 1 1 0 0 0 0
2 1 0 0 0 1 1 1 1 0
3 1 0 0 0 0 1 0 1 1
4 1 0 0 0 0 0 0 0 0
5 1 1 0 0 0 0 0 0 0
7 0 1 1 0 0 0 0 0 0
8 0 1 0 0 0 0 0 0 0
9 0 1 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0
I apply this command to the adjacency matrix to get the desired results, but it gives me an error:
index.egonet(p2)
Error in dati[ego.name, y] : subscript out of bounds
Any alternative or solution to the current error would be highly appreciated.
The ego name must be "EGO" in capital letters, as far as I could understand from working with that function.
colnames(p2) <- rownames(p2) <- c("EGO", 2:ncol(p2))
index.egonet(p2)
This should work.

Vertex names by creating a network object via an edgelist (R package: network)

I want to create a network object, representing a directed network on basis of an edgelist. The first column contains some unique ID of project leaders, the second project partners, let's say:
library("network")
x <- cbind(rbind(1,1,2,2,3), rbind(3,7,10,9,6))
y.nw <- network(x, matrix="edgelist", directed=TRUE, loops=FALSE)
Now my problem is: I need all vertices to have the right ID, since after creating the network object I have to convert it back to an adjacency matrix with the right corresponding firm IDs. However, I am not sure in which order I should assign them, since I sorted the data frame by column 1 (project leaders), and the project leaders do not always show up as project partners as well.
If your ids are sequential integers as in your example, you can produce the adjacency matrix corresponding to the edgelist in your example with:
> as.sociomatrix(y.nw)
1 2 3 4 5 6 7 8 9 10
1 0 0 1 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 1 1
3 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
But maybe you have a different type of id system in your real input?
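If the real IDs are not sequential, one way to keep control of the labels is to build the adjacency matrix directly from the edgelist in base R. This is a sketch that sidesteps the network package rather than showing its own vertex-naming mechanism; the column names `leader` and `partner` are made up for illustration.

```r
# Edgelist from the question, with hypothetical column names
edges <- data.frame(leader  = c(1, 1, 2, 2, 3),
                    partner = c(3, 7, 10, 9, 6))

# Every ID that occurs in either column, in a fixed and known order
ids <- sort(unique(c(edges$leader, edges$partner)))

# Empty adjacency matrix labelled with the real IDs
adj <- matrix(0, nrow = length(ids), ncol = length(ids),
              dimnames = list(as.character(ids), as.character(ids)))

# Fill in directed ties: row = leader, column = partner
adj[cbind(match(edges$leader, ids), match(edges$partner, ids))] <- 1

adj
```

Here only the seven IDs that actually appear get a row and column, and the row and column names are the firm IDs themselves, so no reordering step is needed afterwards.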

How can I calculate an empirical CDF in R?

I'm reading a sparse table from a file which looks like:
1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1
Note row lengths are different.
Each row represents a single simulation. The value in the i-th column in each row says how many times value i-1 was observed in this simulation. For example, in the first simulation (first row), we got a single result with value '0' (first column), 7 results with value '2' (third column) etc.
I wish to create an average cumulative distribution function (CDF) for all the simulation results, so I could later use it to calculate an empirical p-value for true results.
To do this I can first sum up each column, but I need to treat the missing (undefined) entries in the shorter rows as zeros.
How do I read such a table with different row lengths? How do I sum up the columns, replacing the 'undef' values with 0? And finally, how do I create the CDF? (I can do this manually, but I guess there is some package which can do it.)
This will read the data in:
dat <- textConnection("1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1")
df <- data.frame(scan(dat, fill = TRUE, what = as.list(rep(1, 29))))
names(df) <- paste("Val", 1:29)
close(dat)
Resulting in:
> head(df)
Val 1 Val 2 Val 3 Val 4 Val 5 Val 6 Val 7 Val 8 Val 9 Val 10 Val 11 Val 12
1 1 0 7 0 0 1 0 0 0 5 0 0
2 1 0 0 1 0 0 0 3 0 0 0 0
3 0 0 0 1 0 0 0 2 0 0 0 0
4 1 0 0 1 0 3 0 0 0 0 1 0
5 0 0 0 1 0 0 0 2 0 0 0 0
....
If the data are in a file, provide the file name instead of dat. This code presumes that there are a maximum of 29 columns, as per the data you supplied. Alter the 29 to suit the real data.
We get the column sums using
df.csum <- colSums(df, na.rm = TRUE)
The ecdf() function then generates an ECDF of those column sums:
df.ecdf <- ecdf(df.csum)
and we can plot it using the plot() method:
plot(df.ecdf, verticals = TRUE)
You can use the ecdf() (in base R) or Ecdf() (from the Hmisc package) functions.
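One caveat: ecdf() applied to the column sums treats those totals themselves as the observations. If the columns are counts of the values 0, 1, 2, ... as described in the question, the CDF of the simulated values can be built from the counts directly. The following is a minimal sketch with made-up counts; in the real case `counts` would be the vector of column sums.

```r
# Hypothetical pooled counts: how often the values 0, 1, 2, 3 were observed
counts <- c(3, 0, 7, 2)
values <- seq_along(counts) - 1     # column i holds the count for value i - 1

# Cumulative proportion of observations less than or equal to each value
cdf <- cumsum(counts) / sum(counts)

# Right-continuous step function: 0 below the smallest value, then cdf
cdf_fun <- stepfun(values, c(0, cdf))

cdf_fun(2)       # proportion of simulated results <= 2
1 - cdf_fun(2)   # proportion strictly greater than 2

# Same thing via ecdf(), by expanding the counts back into raw observations
cdf_alt <- ecdf(rep(values, counts))
```

Expanding with rep() is the simplest route when the totals are small; the cumsum() version avoids creating a long vector when they are not.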
