Creating a pairwise product matrix in R

I need to take the following data frame and create a 3x3 matrix with all pairwise products of the prop variable. Here is the data I am starting with...
> example
Parasite prop
1 Hel_1.1 0.06818182
2 Hel_11 0.18181818
3 Hel_13 0.02272727
> dput(example)
structure(list(Parasite = structure(1:3, .Label = c("Hel_1.1",
"Hel_11", "Hel_13", "Hel_14", "Hel_2", "Hel_3", "Hel_4", "Hel_4.5",
"Hel_5", "Hel_6", "Hel_7", "Hel_9", "Pro_1", "Pro_2", "Hel_1.4"
), class = "factor"), prop = c(0.0681818181818182, 0.181818181818182,
0.0227272727272727)), .Names = c("Parasite", "prop"), row.names = c(NA,
3L), class = "data.frame")
I would like to obtain a matrix that looks like this (the pairwise product values are slightly off because I computed them by hand without rounding uniformly):
Hel_1.1 Hel_11 Hel_13
Hel_1.1 .0046 .0122 .0015
Hel_11 .0122 .0324 .0039
Hel_13 .0015 .0039 .0004
I would appreciate any help.

You can try this:
prop <- example$prop
names(prop) <- example$Parasite
prop %o% prop
# Hel_1.1 Hel_11 Hel_13
#Hel_1.1 0.004648760 0.012396694 0.0015495868
#Hel_11 0.012396694 0.033057851 0.0041322314
#Hel_13 0.001549587 0.004132231 0.0005165289
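As a small self-contained check of the answer above: `%o%` is shorthand for `outer()` with its default multiplication, so the same matrix can be built by calling `outer()` directly. The sketch below rebuilds the question's data with `data.frame()` (an assumption made only so the block runs on its own) and confirms the two forms agree.

```r
# Rebuild the example data from the question (values taken from the dput output)
example <- data.frame(
  Parasite = c("Hel_1.1", "Hel_11", "Hel_13"),
  prop = c(0.0681818181818182, 0.181818181818182, 0.0227272727272727)
)

# Named vector: the names become the row and column names of the result
prop <- example$prop
names(prop) <- as.character(example$Parasite)

# outer() multiplies every element of the first vector by every element of the second
m <- outer(prop, prop)
m
```

Because the same vector is used on both sides, the result is symmetric and its diagonal holds the squared proportions.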

Related

Are there any functions to subtract all features(rows) to a particular value(row) in the same data file?

I am new to programming in R and Python, but I know some basics. I have a technical question about computation: is there a function that subtracts one particular value (row) from all features (rows) in the same data set? I would like to obtain Output_value1 as shown in the file linked below, and then multiply it by -1 to obtain Output_value2.
data file link: https://www.dropbox.com/s/m5rsi6ru419f5bf/Template_matrixfile.xlsx?dl=0
Please let me know if you need more details.
I have tried performing the same operation in MS Excel, but it is very tedious and time-consuming.
I have many large datasets with several hundred rows and columns, which makes doing this manually in Excel impractical, so I would prefer to write code to obtain the desired outputs.
Here is the example data. The inputs are the Feature and Value columns, and the outputs are the Output_value1 and Output_value2 columns.
| Feature | Value | Output_value1 | Output_value2 |
| Gene_1 | 14.25633934 | 0.80100922 | -0.80100922 |
| Gene_2 | 16.88394578 | 3.42861566 | -3.42861566 |
| Gene_3 | 16.01 | 2.55466988 | -2.55466988 |
| Gene_4 | 13.82329514 | 0.36796502 | -0.36796502 |
| Gene_5 | 12.96382949 | -0.49150063 | 0.49150063 |
| Normalizer | 13.45533012 | 0 | 0 |
dput(head(Exampledata))
structure(list(Feature = structure(1:6, .Label = c("Gene_1", "Gene_2",
"Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"), Value =
c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012), Output_value1 = c(0.80100922, 3.42861566, 2.55466988,
0.36796502, -0.49150063, 0), Output_value2 = c(-0.80100922,
-3.42861566, -2.55466988, -0.36796502, 0.49150063, 0)), row.names = c(NA, 6L), class = "data.frame")
Assuming you'll only have one row where Feature == "Normalizer", in R you can take the Value of that row and subtract it from every row.
Exampledata$Output_value1 <- Exampledata$Value -
Exampledata$Value[Exampledata$Feature == "Normalizer"]
Exampledata$Output_value2 <- Exampledata$Output_value1 * -1
Exampledata
# Feature Value Output_value1 Output_value2
#1 Gene_1 14.25634 0.8010092 -0.8010092
#2 Gene_2 16.88395 3.4286157 -3.4286157
#3 Gene_3 16.01000 2.5546699 -2.5546699
#4 Gene_4 13.82330 0.3679650 -0.3679650
#5 Gene_5 12.96383 -0.4915006 0.4915006
#6 Normalizer 13.45533 0.0000000 0.0000000
EDIT
For multiple such columns (here data is a data frame whose value columns all have names starting with "Value"), we can do
cols <- grep("^Value", names(data))
inds <- which(data$Feature == "Normalizer")
data[paste0("Output", seq_along(cols))] <- data[cols] - data[rep(inds, nrow(data)),cols]
data[paste0("Output_inverted", seq_along(cols))] <- data[grep("Output", names(data))] * -1
data
Exampledata <- structure(list(Feature = structure(1:6, .Label = c("Gene_1",
"Gene_2", "Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"),
Value = c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012)), row.names = c(NA, 6L), class = "data.frame")
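For the multi-column case, an alternative to repeating the Normalizer row is `sweep()`, which subtracts one statistic per column. This is only a sketch under assumptions: the data frame `dat`, the column names `Value1`/`Value2`, and the numbers in `Value2` are hypothetical and not from the question.

```r
# Toy data: Value1 reuses numbers from the question, Value2 is made up
dat <- data.frame(
  Feature = c("Gene_1", "Gene_2", "Normalizer"),
  Value1 = c(14.25633934, 16.88394578, 13.45533012),
  Value2 = c(10, 12, 11)  # hypothetical second measurement column
)

cols <- grep("^Value", names(dat))              # positions of the value columns
norm_row <- which(dat$Feature == "Normalizer")  # row holding the reference values
stopifnot(length(norm_row) == 1)                # the approach assumes one Normalizer row

# Subtract each column's Normalizer value from that whole column
diffs <- sweep(as.matrix(dat[cols]), 2, unlist(dat[norm_row, cols]))

dat[paste0("Output", seq_along(cols))] <- as.data.frame(diffs)
dat[paste0("Output_inverted", seq_along(cols))] <- as.data.frame(-diffs)
dat
```

The explicit `stopifnot()` guards the single-Normalizer assumption stated in the answer, so a file with zero or several Normalizer rows fails loudly instead of recycling values silently.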

R Data Frame Filter Not Working

I am trying to filter the output from RNA-seq data analysis. I want to generate a list of genes that fit the specified criteria in at least one experimental condition (dataframe).
For example, the data is output as a .csv, so I read in the whole directory, as follows.
readList = list.files("~/Path/To/File/", pattern = "*.csv")
files = lapply(readList, read.csv, row.names = 1)
#row.names = 1 sets rownames as gene names
This reads in 3 .csv files, A, B and C. The data look like this
A = files[[1]]
B = files[[2]]
C = files[[3]]
head(A)
logFC logCPM LR PValue FDR
YER037W -1.943616 6.294092 34.30835 0.000000004703583 0.00002276064
YJL184W -1.771273 5.840774 31.97088 0.000000015650144 0.00003786552
YFR053C 1.990102 10.107793 30.55576 0.000000032440747 0.00005232692
YDR342C 2.096877 6.534761 28.08635 0.000000116021451 0.00014035695
YGL062W 1.649138 8.940714 23.32097 0.000001370968319 0.00132682314
YFR044C 1.992810 9.302504 22.91553 0.000001692786468 0.00132736130
I then try to filter all of these to generate a list of genes (rownames) where two conditions must be met in at least one dataset.
1. logFC > 1 or < -1
2. FDR < 0.05
So I loop through the dataframes like so
genesKeep = ""
for (i in 1:length(files)) {
F = data.frame(files[i])
sigGenes = rownames(F[F$FDR<0.05 & abs(F$logFC>1), ])
genesKeep = append(genesKeep, values = sigGenes)
}
This gives me a list of genes, however, when I sanity check these against the data some of the genes listed do not pass these thresholds, whilst other genes that do pass these thresholds are not present in the list.
e.g.
df = cbind(A,B,C)
genesKeep = unique(genesKeep)
logicTest = rownames(df) %in% genesKeep
dfLogic = cbind(df, logicTest)
Whilst the majority of genes do in fact pass the criteria I set, I see some discrepancies for a few genes. For example:
A.logFC A.FDR B.logFC B.FDR C.logFC C.FDR logicTest
YGR181W -0.8050325 0.1462688 -0.6834184 0.2162317 -1.1923744 0.04049870 FALSE
YOR185C 0.8321432 0.1462919 0.7401477 0.2191413 -0.9616989 0.04098177 TRUE
The first gene (YGR181W) passes the criteria in condition C, where logFC < -1 and FDR < 0.05. However, the gene is not reported in the genesKeep list.
Conversely, the second gene (YOR185C) does not pass these criteria in any condition, but the gene is present in the genesKeep list.
I'm unsure where I'm going wrong here, but if anyone has any ideas they would be much appreciated.
Thanks.
Using merge as suggested by akash87 solved the problem.
Turns out cbind was causing the rownames to not be assigned correctly.
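Independently of the cbind issue, one detail in the loop may be worth checking: `abs(F$logFC>1)` takes the absolute value of a logical vector, so it keeps only rows with logFC > 1 and never those with logFC < -1. If the intent was the magnitude test, the parenthesis would need to close before the comparison. The sketch below uses small made-up data frames (the names `A` and `C` and all values are illustrative, loosely based on the rows shown in the question).

```r
# Toy stand-ins for the per-condition result tables; values are illustrative only
files <- list(
  A = data.frame(logFC = c(-1.94, 0.83), FDR = c(0.00002, 0.146),
                 row.names = c("YER037W", "YOR185C")),
  C = data.frame(logFC = c(-1.19, -0.96), FDR = c(0.0405, 0.041),
                 row.names = c("YGR181W", "YOR185C"))
)

# Keep a gene if it passes both thresholds in at least one table.
# Note abs(f$logFC) > 1, not abs(f$logFC > 1).
genesKeep <- unique(unlist(lapply(files, function(f) {
  rownames(f)[f$FDR < 0.05 & abs(f$logFC) > 1]
}), use.names = FALSE))
genesKeep
```

Collecting the names with `lapply()` and `unlist()` also avoids seeding the result with an empty string, which the original `genesKeep = ""` would leave in the list.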
I'm not exactly sure what your desired output is here, but it might be possible to simplify a bit and use the dplyr library to filter all your outputs at once, assuming the format of your data is consistent. Using some modified versions of your data as an example:
A <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-1.943616, -1.771273, 0, 2.096877, 1.649138, 1.99281
), logCPM = c(6.294092, 5.840774, 10.107793, 6.534761, 8.940714,
9.302504), LR = c(34.30835, 31.97088, 30.55576, 28.08635,
23.32097, 22.91553), PValue = c(4.703583e-09, 1.5650144e-08,
3.2440747e-08, 1.16021451e-07, 1.370968319e-06, 1.692786468e-06
), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05, 0.00014035695,
0.00132682314, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
B <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-0.4, -0.3, 0, 2.096877, 1.649138, 1.99281), logCPM = c(6.294092,
5.840774, 10.107793, 6.534761, 8.940714, 9.302504), LR = c(34.30835,
31.97088, 30.55576, 28.08635, 23.32097, 22.91553), PValue = c(4.703583e-09,
1.5650144e-08, 3.2440747e-08, 1.16021451e-07, 1.370968319e-06,
1.692786468e-06), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05,
0.00014035695, 0.1, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
Use rbind to create a single dataframe to work with:
library(dplyr)  # provides filter(), distinct() and %>%
AB <- rbind(A, B)
Then filter this whole thing based on your criteria. Note that duplicates can occur, so you can use distinct to only return unique genes that qualify:
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene)
gene
1 YER037W
2 YJL184W
3 YDR342C
4 YGL062W
Or, to keep all the rows for those genes as well:
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene, .keep_all = TRUE)
gene logFC logCPM LR PValue FDR
1 YER037W -1.943616 6.294092 34.30835 4.703583e-09 2.276064e-05
2 YJL184W -1.771273 5.840774 31.97088 1.565014e-08 3.786552e-05
3 YDR342C 2.096877 6.534761 28.08635 1.160215e-07 1.403570e-04
4 YGL062W 1.649138 8.940714 23.32097 1.370968e-06 1.326823e-03

What is the functional form of the assignment operator, [<-?

Is there a functional form of the assignment operator? I would like to be able to call assignment with lapply, and if that's a bad idea I'm curious anyways.
Edit:
This is a toy example, and obviously there are better ways to go about doing this:
Let's say I have a list of data.frames, dat, each corresponding to one run of an experiment. I would like to be able to add a new column, "subject", and give it a sham name. The way I was thinking of it was something like
lapply(1:3, function(x) assign(data.frame = dat[[x]], column="subject", value=x))
The output could either be a list of modified data frames, or the modification could be purely a side effect.
dput of the starting list:
list(structure(list(V1 = c(-1.16664504687199, -0.429499924318301, 2.15470735901367, -0.287839633854442, -0.850578353982526, 0.211636723222015, -0.184714165752958, -0.773553182015158, 0.801811848828454, 1.39420292299319 ), V2 = c(-0.00828185523886259, -0.0215669898046275, 0.743065397283645, -0.0268464140141802, 0.168027242784788, -0.602901928341917, 0.0740511186398372, 0.180307494696194, 0.131160421341309, -0.924995634374182)), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = "data.frame"), structure(list( V1 = c(1.81912921386885, 1.17011641727415, 0.692247839769473, 0.0323050362633069, 1.35816977313292, -0.437475434344363, -0.270255715332778, 0.96140963297774, 0.914691132220417, -1.8014509598977), V2 = c(1.45082316226241, 2.05135744606495, -0.787250759618171, 0.288104852581324, -0.376868533959846, 0.531872044490353, -0.750375220117567, -0.459592764008714, 0.991667163481123, 1.31280356980115)), .Names = c("V1", "V2" ), row.names = c(NA, -10L), class = "data.frame"), structure(list( V1 = c(0.528912899341174, 0.464615157920766, -0.184211714281637, 0.526909095449027, -0.371529800682086, -0.483772861751781, -2.02134822661341, -1.30841566046747, -0.738493559993166, -0.221463545903242), V2 = c(-1.44732101816006, -0.161730785376045, 1.06294520132753, 1.22680614207705, -0.721565979363022, -0.438309438404104, -0.0243401435910825, 0.624227513999603, 0.276605218579759, -0.965640602482051)), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = "data.frame"))
Maybe I don't get it but as stated in "The Art of R programming":
Any assignment statement in which the left side is not just an
identifier (meaning a variable name) is considered a replacement
function.
and so in fact you can always translate this:
names(x) <- c("a","b","ab")
to this:
x <- "names<-"(x,value=c("a","b","ab"))
The general rule is just "function_name<-"(<object>, value = c(...))
Edit to the comment:
It works with the " too:
> x <- c(1:3)
> x
[1] 1 2 3
> names(x) <- c("a","b","ab")
> x
a b ab
1 2 3
> x
a b ab
1 2 3
> x <- c(1:3)
> x
[1] 1 2 3
> x <- "names<-"(x,value=c("a","b","ab"))
> x
a b ab
1 2 3
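Applied to the lapply case from the question, the same idea can be sketched with the replacement function `[[<-` called by name. The list `dat` below is a small hypothetical stand-in for the asker's list of data frames, not the dput data.

```r
# Hypothetical stand-in for the list of per-run data frames
dat <- list(
  data.frame(V1 = c(0.1, 0.2)),
  data.frame(V1 = c(0.3, 0.4)),
  data.frame(V1 = c(0.5, 0.6))
)

# `[[<-`(x, "subject", value = i) is the functional form of x[["subject"]] <- i.
# It returns the modified copy, so lapply() collects a list of new data frames.
out <- lapply(seq_along(dat), function(i) `[[<-`(dat[[i]], "subject", value = i))

out[[2]]
```

Nothing is modified in place here: `dat` keeps its original columns, and `out` holds the versions with the extra `subject` column.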
There is the assign function. I don't see any problems with using it but you have to be aware of what environment you want to assign to. See the help ?assign for syntax.
Read this chapter carefully to understand the ins and outs of environments in detail. http://adv-r.had.co.nz/Environments.html
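A minimal sketch of `assign()` with an explicit environment, to illustrate the point about knowing where the value ends up. The environment `e` and the variable names `run1` to `run3` are made up for the example.

```r
# A dedicated environment keeps the created variables out of the global workspace
e <- new.env()

# Create run1, run2, run3 inside e; assign() takes the variable name as a string
for (i in 1:3) {
  assign(paste0("run", i), i^2, envir = e)
}

ls(e)                   # names that now exist in e
get("run2", envir = e)  # retrieve one of them by name
```

Without the `envir` argument, `assign()` writes into the environment it is called from, which inside a function passed to `lapply()` is that function's own temporary environment.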

How do I plot boxplots of two different series?

I have two data frames sharing the same row IDs but with different columns.
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where the x axis shows chrom and coord combined (so 3 points), and each x value has 2 boxplots side by side corresponding to the two data frames.
What is the best way of doing this? Should I somehow merge the two data frames into one and loop over the boxplots, rendering 3 columns at a time?
Any idea how this can be done?
The problem is that the two data frames have the same number of rows but can differ in the number of columns:
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the data frames in order to get the same number of columns, but got lost on how to do this properly.
Thanks in advance
UPDATE
This is what I tried:
I merged the chrom and coord columns to create a single ID.
I used reshape to melt the data frames.
I merged the two melted data frames into a single one.
This gives two variables, A2 and A4, corresponding to the two data frames.
Then I created a boxplot using this:
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what I'd do
library(reshape2)  # melt()
library(ggplot2)   # ggplot(), used for the plot below
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
# stack both tables, tagging each row with the table it came from
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")

Combining two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
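One possible base R sketch of the rule described above: build every pairing of a row of g with a row of l, then keep the pairings where either endpoint of g lies inside the start-end range of l. This is an illustration under assumptions, not a tuned solution. The data are re-entered with `data.frame()` so the block runs on its own, and the names `pairs`, `inside` and `res` are made up.

```r
# Same values as the dput output in the question
g <- data.frame(
  start_position = c(22926178L, 22887317L, 22876403L, 22862447L, 22822490L),
  end_position   = c(22928035L, 22889471L, 22884442L, 22866319L, 22827551L)
)
l <- data.frame(
  name  = c("GRMZM2G001024", "GRMZM2G575546", "GRMZM2G441511",
            "AC184765.4_FG005", "GRMZM2G396835"),
  start = c(11149187L, 24382534L, 22762447L, 26282236L, 10009264L),
  end   = c(11511198L, 24860958L, 23762447L, 26682919L, 10402790L)
)

# by = NULL makes merge() return the Cartesian product of the two data frames
pairs <- merge(g, l, by = NULL)

# TRUE where start_position OR end_position falls within [start, end]
inside <- with(pairs,
  (start_position >= start & start_position <= end) |
  (end_position   >= start & end_position   <= end))

res <- pairs[inside, ]
res
```

The cross join has nrow(g) * nrow(l) rows, so this only suits modest sizes. For large genomic tables a dedicated overlap join, such as `foverlaps()` in data.table, would likely scale better.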
