Apply function on a list of tables by column - r

I have a list of table objects, as follows:
list(X1A.1145442 = structure(c(0.3204, 0.6796, 0.3645, 0.6355, 0.1615, 0.8385, 0.3266, 0.6734, 0.2884, 0.7116, 0.3042, 0.6958), .Dim = c(2L, 6L), class = "table", .Dimnames = list(x = c("1", "2"), c("ES1-5", "ES14-26", "ES27-38", "ES6-13", "SA1-13", "SA14-25"))),
X1A.1158042 = structure(c(0.4437, 0.5563, 0.4264, 0.5736, 0.2308, 0.7692, 0.3896, 0.6104, 0.2997, 0.7003, 0.3148, 0.6852), .Dim = c(2L, 6L), class = "table", .Dimnames = list(x = c("1", "2"), c("ES1-5", "ES14-26", "ES27-38", "ES6-13", "SA1-13", "SA14-25"))))
The list looks like this:
$`X1A.1145442`
x ES1-5 ES14-26 ES27-38 ES6-13 SA1-13 SA14-25
1 0.3204 0.3645 0.1615 0.3266 0.2884 0.3042
2 0.6796 0.6355 0.8385 0.6734 0.7116 0.6958
$X1A.1158042
x ES1-5 ES14-26 ES27-38 ES6-13 SA1-13 SA14-25
1 0.4437 0.4264 0.2308 0.3896 0.2997 0.3148
2 0.5563 0.5736 0.7692 0.6104 0.7003 0.6852
I would like to obtain the column-wise minimum for each table in the list.
I tried something with lapply but without success. Could someone help me with that, please?
Regards,
Alex

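It is a list of matrices (a two-dimensional table behaves like a matrix), so the basic unit inside each list element is the individual cell. If we call lapply directly on a matrix, it loops over every single element rather than over columns, unless the matrix is first converted to a data.frame. Here, we can use lapply to loop over the list and, inside it, apply with MARGIN specified as 2 to loop over the columns: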
lapply(lst1, function(x) apply(x, 2, min))
Another option is colMins from the matrixStats package:
library(matrixStats)
lapply(lst1, colMins)
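For reference, here is a minimal base R sketch of the lapply/apply approach on a cut-down version of the data. Only the first two columns of each table are used to keep it short, and the do.call(rbind, ...) step is an optional extra that is not part of the answer above.

```r
# Cut-down version of the question's list: first two columns of each table
dn <- list(x = c("1", "2"), c("ES1-5", "ES14-26"))
lst1 <- list(
  X1A.1145442 = as.table(matrix(c(0.3204, 0.6796, 0.3645, 0.6355),
                                nrow = 2, dimnames = dn)),
  X1A.1158042 = as.table(matrix(c(0.4437, 0.5563, 0.4264, 0.5736),
                                nrow = 2, dimnames = dn))
)

# Outer lapply walks the list; inner apply(., 2, min) walks the columns
col_mins <- lapply(lst1, function(x) apply(x, 2, min))

# Optional: stack the results into one matrix, one row per list element
min_matrix <- do.call(rbind, col_mins)
min_matrix
```

Each element of col_mins is a named numeric vector, so the stacked matrix keeps the column labels of the original tables.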

Related

Are there any functions to subtract all features(rows) to a particular value(row) in the same data file?

I am new to programming in R and Python, but I know some basics. I have a technical question about a computation: are there any functions for subtracting a particular value (one row) from all features (rows) in the same data set? I would like to obtain Output_value1 as shown in the file linked below, and then multiply it by (-1) to obtain Output_value2.
data file link: https://www.dropbox.com/s/m5rsi6ru419f5bf/Template_matrixfile.xlsx?dl=0
Please let me know if you need more details.
I have tried performing the same operation in MS Excel, but it is very tedious and time-consuming.
I have many large datasets with several hundred rows and columns, which makes doing this manually in MS Excel even more cumbersome. Hence, I would prefer to write code to obtain the desired outputs.
Here is the example data. Inputs are the Feature and Value columns; outputs are the Output_value1 and Output_value2 columns.
| Feature | Value | Output_value1 | Output_value2 |
| Gene_1 | 14.25633934 | 0.80100922 | -0.80100922 |
| Gene_2 | 16.88394578 | 3.42861566 | -3.42861566 |
| Gene_3 | 16.01 | 2.55466988 | -2.55466988 |
| Gene_4 | 13.82329514 | 0.36796502 | -0.36796502 |
| Gene_5 | 12.96382949 | -0.49150063 | 0.49150063 |
| Normalizer | 13.45533012 | 0 | 0 |
dput(head(Exampledata))
structure(list(Feature = structure(1:6, .Label = c("Gene_1", "Gene_2",
"Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"), Value =
c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012), Output_value1 = c(0.80100922, 3.42861566, 2.55466988,
0.36796502, -0.49150063, 0), Output_value2 = c(-0.80100922,
-3.42861566, -2.55466988, -0.36796502, 0.49150063, 0)), row.names = c(NA, 6L), class = "data.frame")
Assuming you'll only have one row where Feature == "Normalizer", in R you can take the Value of that row and subtract it from every row.
Exampledata$Output_value1 <- Exampledata$Value -
Exampledata$Value[Exampledata$Feature == "Normalizer"]
Exampledata$Output_value2 <- Exampledata$Output_value1 * -1
Exampledata
# Feature Value Output_value1 Output_value2
#1 Gene_1 14.25634 0.8010092 -0.8010092
#2 Gene_2 16.88395 3.4286157 -3.4286157
#3 Gene_3 16.01000 2.5546699 -2.5546699
#4 Gene_4 13.82330 0.3679650 -0.3679650
#5 Gene_5 12.96383 -0.4915006 0.4915006
#6 Normalizer 13.45533 0.0000000 0.0000000
EDIT
For multiple such columns (all with names starting with "Value", in a data frame called data), we can do:
cols <- grep("^Value", names(data))
inds <- which(data$Feature == "Normalizer")
data[paste0("Output", seq_along(cols))] <- data[cols] - data[rep(inds, nrow(data)),cols]
data[paste0("Output_inverted", seq_along(cols))] <- data[grep("Output", names(data))] * -1
data
Exampledata <- structure(list(Feature = structure(1:6, .Label = c("Gene_1",
"Gene_2", "Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"),
Value = c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012)), row.names = c(NA, 6L), class = "data.frame")
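As an alternative for the multi-column case, here is a small sketch using sweep(). The data frame d and its columns Value1 and Value2 are made-up illustration values, not taken from the linked file, and the Output1/Output2 names simply mirror the naming used in the EDIT above.

```r
# Hypothetical data with two value columns (illustration only)
d <- data.frame(
  Feature = c("Gene_1", "Gene_2", "Normalizer"),
  Value1  = c(14, 16, 13),
  Value2  = c(10, 9, 12)
)

cols <- grep("^Value", names(d))

# Normalizer value for each column, as a named numeric vector
norm <- unlist(d[d$Feature == "Normalizer", cols])

# Subtract the matching normalizer value from every entry of each column
out <- sweep(as.matrix(d[cols]), 2, norm, "-")
colnames(out) <- paste0("Output", seq_along(cols))

inv <- -out
colnames(inv) <- paste0("Output_inverted", seq_along(cols))

d <- cbind(d, out, inv)
d
```

sweep() avoids repeating the normalizer row nrow(d) times, and the Normalizer row comes out as zero in every output column.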

R Data Frame Filter Not Working

I am trying to filter the output from RNA-seq data analysis. I want to generate a list of genes that fit the specified criteria in at least one experimental condition (dataframe).
For example, the data is output as a .csv, so I read in the whole directory, as follows.
readList = list.files("~/Path/To/File/", pattern = "*.csv")
files = lapply(readList, read.csv, row.names = 1)
#row.names = 1 sets rownames as gene names
This reads in 3 .csv files, A, B and C. The data look like this
A = files[[1]]
B = files[[2]]
C = files[[3]]
head(A)
logFC logCPM LR PValue FDR
YER037W -1.943616 6.294092 34.30835 0.000000004703583 0.00002276064
YJL184W -1.771273 5.840774 31.97088 0.000000015650144 0.00003786552
YFR053C 1.990102 10.107793 30.55576 0.000000032440747 0.00005232692
YDR342C 2.096877 6.534761 28.08635 0.000000116021451 0.00014035695
YGL062W 1.649138 8.940714 23.32097 0.000001370968319 0.00132682314
YFR044C 1.992810 9.302504 22.91553 0.000001692786468 0.00132736130
I then try to filter all of these to generate a list of genes (rownames) where two conditions must be met in at least one dataset.
1. logFC > 1 or < -1
2. FDR < 0.05
So I loop through the dataframes like so
genesKeep = ""
for (i in 1:length(files)) {
F = data.frame(files[i])
sigGenes = rownames(F[F$FDR<0.05 & abs(F$logFC>1), ])
genesKeep = append(genesKeep, values = sigGenes)
}
This gives me a list of genes, however, when I sanity check these against the data some of the genes listed do not pass these thresholds, whilst other genes that do pass these thresholds are not present in the list.
e.g.
df = cbind(A,B,C)
genesKeep = unique(genesKeep)
logicTest = rownames(df) %in% genesKeep
dfLogic = cbind(df, logicTest)
Whilst the majority of genes do in fact pass the criteria I set, I see discrepancies for a few genes. For example:
A.logFC A.FDR B.logFC B.FDR C.logFC C.FDR logicTest
YGR181W -0.8050325 0.1462688 -0.6834184 0.2162317 -1.1923744 0.04049870 FALSE
YOR185C 0.8321432 0.1462919 0.7401477 0.2191413 -0.9616989 0.04098177 TRUE
The first gene (YGR181W) passes the criteria in condition C, where logFC < -1 and FDR < 0.05. However, the gene is not reported in the genesKeep list.
Conversely, the second gene (YOR185C) does not pass these criteria in any condition, but the gene is present in the genesKeep list.
I'm unsure where I'm going wrong here, but if anyone has any ideas they would be much appreciated.
Thanks.
Using merge as suggested by akash87 solved the problem.
Turns out cbind was causing the rownames to not be assigned correctly.
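Separately from the cbind issue, note that in the loop above the comparison sits inside abs(): abs(F$logFC > 1) is the absolute value of a logical, so it only ever selects logFC > 1 and never logFC < -1. A minimal sketch of the filter with the parenthesis moved, using made-up stand-ins for the three data frames (values loosely based on the two genes shown in the question):

```r
# Hypothetical stand-ins for the data frames read from the CSV files
genes <- c("YGR181W", "YOR185C")
files <- list(
  A = data.frame(logFC = c(-0.805, 0.832),  FDR = c(0.1463, 0.1463), row.names = genes),
  B = data.frame(logFC = c(-0.683, 0.740),  FDR = c(0.2162, 0.2191), row.names = genes),
  C = data.frame(logFC = c(-1.192, -0.962), FDR = c(0.0405, 0.0410), row.names = genes)
)

genesKeep <- character(0)
for (f in files) {
  # abs() is closed before the comparison, so both directions are caught
  keep <- f$FDR < 0.05 & abs(f$logFC) > 1
  genesKeep <- union(genesKeep, rownames(f)[keep])
}
genesKeep
```

Starting from character(0) instead of "" also avoids carrying an empty string in the result, and union() removes duplicates as it goes.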
I'm not exactly sure what your desired output is here, but it might be possible to simplify a bit and use the dplyr library to filter all your outputs at once, assuming the format of your data is consistent. Using some modified versions of your data as an example:
A <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-1.943616, -1.771273, 0, 2.096877, 1.649138, 1.99281
), logCPM = c(6.294092, 5.840774, 10.107793, 6.534761, 8.940714,
9.302504), LR = c(34.30835, 31.97088, 30.55576, 28.08635,
23.32097, 22.91553), PValue = c(4.703583e-09, 1.5650144e-08,
3.2440747e-08, 1.16021451e-07, 1.370968319e-06, 1.692786468e-06
), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05, 0.00014035695,
0.00132682314, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
B <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-0.4, -0.3, 0, 2.096877, 1.649138, 1.99281), logCPM = c(6.294092,
5.840774, 10.107793, 6.534761, 8.940714, 9.302504), LR = c(34.30835,
31.97088, 30.55576, 28.08635, 23.32097, 22.91553), PValue = c(4.703583e-09,
1.5650144e-08, 3.2440747e-08, 1.16021451e-07, 1.370968319e-06,
1.692786468e-06), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05,
0.00014035695, 0.1, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
Use rbind to create a single dataframe to work with:
library(dplyr)  # provides filter(), distinct() and %>% used below
AB <- rbind(A, B)
Then filter this whole thing based on your criteria. Note that duplicates can occur, so you can use distinct to only return unique genes that qualify:
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene)
gene
1 YER037W
2 YJL184W
3 YDR342C
4 YGL062W
Or, to keep all the rows for those genes as well:
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene, .keep_all = TRUE)
gene logFC logCPM LR PValue FDR
1 YER037W -1.943616 6.294092 34.30835 4.703583e-09 2.276064e-05
2 YJL184W -1.771273 5.840774 31.97088 1.565014e-08 3.786552e-05
3 YDR342C 2.096877 6.534761 28.08635 1.160215e-07 1.403570e-04
4 YGL062W 1.649138 8.940714 23.32097 1.370968e-06 1.326823e-03
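If you would rather not load dplyr, the same filter can be sketched in base R. The small AB data frame below reuses a few rows from the example A and B above and is only for illustration.

```r
# A few rows borrowed from the example data frames A and B above
AB <- data.frame(
  gene  = c("YER037W", "YFR053C", "YFR044C", "YER037W"),
  logFC = c(-1.943616, 0, 1.99281, -0.4),
  FDR   = c(2.276064e-05, 5.232692e-05, 0.06, 2.276064e-05)
)

# Keep rows passing both thresholds, then reduce to unique gene names
hits <- AB[abs(AB$logFC) > 1 & AB$FDR < 0.05, ]
sig_genes <- unique(as.character(hits$gene))
sig_genes
```

Here only YER037W qualifies: YFR053C fails the logFC threshold and YFR044C fails the FDR threshold.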

Creating a pairwise product matrix in R

I need to take the following data frame and create a 3x3 matrix with all pairwise products of the prop variable. Here is the data I am starting with...
> example
Parasite prop
1 Hel_1.1 0.06818182
2 Hel_11 0.18181818
3 Hel_13 0.02272727
> dput(example)
structure(list(Parasite = structure(1:3, .Label = c("Hel_1.1",
"Hel_11", "Hel_13", "Hel_14", "Hel_2", "Hel_3", "Hel_4", "Hel_4.5",
"Hel_5", "Hel_6", "Hel_7", "Hel_9", "Pro_1", "Pro_2", "Hel_1.4"
), class = "factor"), prop = c(0.0681818181818182, 0.181818181818182,
0.0227272727272727)), .Names = c("Parasite", "prop"), row.names = c(NA,
3L), class = "data.frame")
I would like to obtain a matrix that looks like this (The pairwise product values are a little off because I computed them by hand without rounding uniformly)
Hel_1.1 Hel_11 Hel_13
Hel_1.1 .0046 .0122 .0015
Hel_11 .0122 .0324 .0039
Hel_13 .0015 .0039 .0004
I would appreciate any help.
You can try this:
prop <- example$prop
names(prop) <- example$Parasite
prop %o% prop
# Hel_1.1 Hel_11 Hel_13
#Hel_1.1 0.004648760 0.012396694 0.0015495868
#Hel_11 0.012396694 0.033057851 0.0041322314
#Hel_13 0.001549587 0.004132231 0.0005165289
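The %o% operator is shorthand for outer() with multiplication as the default function, so the same matrix can be built in one step. A small sketch, rebuilding the example data by hand:

```r
example <- data.frame(
  Parasite = c("Hel_1.1", "Hel_11", "Hel_13"),
  prop     = c(0.0681818181818182, 0.181818181818182, 0.0227272727272727)
)

# Name the vector so the result carries row and column labels
prop <- setNames(example$prop, as.character(example$Parasite))

# outer() multiplies every pair of elements; same result as prop %o% prop
m <- outer(prop, prop)
round(m, 4)
```

The diagonal holds the squared proportions, and the matrix is symmetric because multiplication is commutative.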

Comparing pairs of rows in a list of data frames

I have a list that's 1314 elements long. Each element is a data frame consisting of two rows and four columns.
Game.ID Team Points Victory
1 201210300CLE CLE 94 0
2 201210300CLE WAS 84 0
I would like to use the lapply function to compare points for each team in each game, and change Victory to 1 for the winning team.
I'm trying to use this function:
test_vic <- lapply(all_games, function(x) {if (x[1,3] > x[2,3]) {x[1,4] = 1}})
But the result it produces is a list 1314 elements long with just the Game ID and either a 1 or a null, a la:
$`201306200MIA`
[1] 1
$`201306160SAS`
NULL
How can I fix my code so that each data frame maintains its shape? (I'm guessing solving the null part involves if-else, but I need to figure out the right syntax.)
Thanks.
Try
lapply(all_games, function(x) {x$Victory[which.max(x$Points)] <- 1; x})
Or another option would be to convert the list to data.table by using rbindlist and then do the conversion
library(data.table)
rbindlist(all_games)[,Victory:= +(Points==max(Points)) ,Game.ID][]
data
all_games <- list(structure(list(Game.ID = c("201210300CLE",
"201210300CLE"
), Team = c("CLE", "WAS"), Points = c(94L, 84L), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
class = "data.frame", row.names = c("1",
"2")), structure(list(Game.ID = c("201210300CME", "201210300CME"
), Team = c("CLE", "WAS"), Points = c(90, 92), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
row.names = c("1", "2"), class = "data.frame"))
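To see why the original attempt returned 1 or NULL: the anonymous function returned the value of the if expression itself (the assigned 1, or NULL when the condition was false) instead of the modified data frame. A sketch of the if/else version that returns x, where mark_winner is a hypothetical helper name and the data is a simplified copy of the list above:

```r
all_games <- list(
  data.frame(Game.ID = "201210300CLE", Team = c("CLE", "WAS"),
             Points = c(94L, 84L), Victory = 0L),
  data.frame(Game.ID = "201210300CME", Team = c("CLE", "WAS"),
             Points = c(90L, 92L), Victory = 0L)
)

# Hypothetical helper: flag the higher-scoring row, then return the data frame
mark_winner <- function(x) {
  if (x$Points[1] > x$Points[2]) {
    x$Victory[1] <- 1L
  } else if (x$Points[2] > x$Points[1]) {
    x$Victory[2] <- 1L
  }
  x  # the last expression is the return value, so the shape is preserved
}

result <- lapply(all_games, mark_winner)
result
```

Unlike the which.max version, this sketch leaves both rows at 0 when the points are tied.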
You could try dplyr:
library(dplyr)
all_games %>%
bind_rows() %>%
group_by(Game.ID) %>%
mutate(Victory = row_number(Points)-1)
Which gives:
#Source: local data frame [4 x 4]
#Groups: Game.ID
#
# Game.ID Team Points Victory
#1 201210300CLE CLE 94 1
#2 201210300CLE WAS 84 0
#3 201210300CME CLE 90 0
#4 201210300CME WAS 92 1

Combining two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describes genomic regions and contains some other data. I have to combine the two if the region described in one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
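One possible base R sketch, assuming the data is small enough to afford a full cross join: build every g/l pair with merge(..., by = NULL) and keep the pairs where either position falls inside the start-end range. The data below is re-entered from the question, with l's original row names dropped for brevity.

```r
g <- data.frame(
  start_position = c(22926178L, 22887317L, 22876403L, 22862447L, 22822490L),
  end_position   = c(22928035L, 22889471L, 22884442L, 22866319L, 22827551L)
)
l <- data.frame(
  name  = c("GRMZM2G001024", "GRMZM2G575546", "GRMZM2G441511",
            "AC184765.4_FG005", "GRMZM2G396835"),
  start = c(11149187L, 24382534L, 22762447L, 26282236L, 10009264L),
  end   = c(11511198L, 24860958L, 23762447L, 26682919L, 10402790L)
)

# by = NULL gives the Cartesian product: every row of g paired with every row of l
pairs <- merge(g, l, by = NULL)

# Keep pairs where start_position OR end_position lies within [start, end]
hit <- with(pairs,
            (start_position >= start & start_position <= end) |
            (end_position   >= start & end_position   <= end))
merged <- pairs[hit, ]
merged
```

The cross join grows as nrow(g) * nrow(l), so for large genomic tables an interval-aware tool such as data.table's foverlaps() or the Bioconductor GenomicRanges package is likely a better fit.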
