for loops and if conditional applied to data frames - r

I have two data frames, data1 and data2, read from CSV files, of dimensions n1*n2 and m1*m2. I would like to create a new data frame consisting of differences: if (and only if)
data1[i,1] = data2[j,1] & data1[i,3] = data2[j,3]
then I want to consider
difference[i,z] <- abs(data1[i,x]-data2[j,y])
Is it possible to do this in a simple manner, for instance using for/if?
difference <- matrix(nrow = max(n1, m1), ncol = 3)
for (i in 1:n1) {
  for (j in 1:m1) {
    if (data1[i,1] == data2[j,1] & data1[i,3] == data2[j,3]) {
      difference[i,1] = data1[i,1]
      difference[i,2] = data1[i,3]
      difference[i,3] = data1[i,7] - data2[j,6]  # price is column 7 in data1 and column 6 in data2
    }
  }
}
This code is obviously far from being complete and I have several issues:
(1) I don't know if it is realizable using for loops/if conditional. If yes, being unfamiliar with R, I'm not sure if I need to put a 'print(something)' at the end of the loops.
(2) data1/2[i,1] is of type character. Hence I'm not sure if
data1[i,1] == data2[j,1] & data1[i,3] == data2[j,3]
is well-defined.
(3) The 'difference' matrix/frame should have as many rows as the number of i's and j's where
data1[i,1] = data2[j,1] & data1[i,3] = data2[j,3]
I do not know what this number is. Therefore I cannot really specify the size of 'difference'.
EDIT:
data1 = read.csv("path/to/data1.csv") ## Prices of 157 products each at
## 122 time points; (column1=Product, column3=date, column7=price)
data2 = read.csv("path/to/data2.csv") ## Prices of 118 products each at
## 122 time points; (column1=Product, column3=date, column6=price)
## the 122 time points are the same for both frames
## But: data1 contains some products data2 doesn't and vice versa
## I want to compare prices of the same products at the same time
So far, I've done it manually for product X1:
priceX1 = as.data.frame(data1[1:122,7]) ## rows 1 to 122, i.e. all 122 time points
priceX2 = as.data.frame(data2[5:126,6]) ## Product X2 starts at row 5
differenceX1 <- abs(priceX1 - priceX2)
The problem is I'd have to repeat this for all products contained in both data1 and data2.
RE-EDIT: dput(data1) returns
...), class = "factor"),
COMMENT = c(NA, ..., NA)), .Names = c("PRODUCT", "QUALIFIER_I",
"DATE", "QUALIFIER_II", "QUOTATION_DATE", "PROD_DATE", "PRICE",
"TYPE", "ID", "COMMENT"), row.names = c(NA, 14400L), class
= "data.frame")
"..." stands for a long list of products that I omitted because it would not fit here.
dput(data2) returns
..., NA, NA, NA)), .Names = c("PRODUCT", "QUALIFIER_II",
"DATE", "QUALIFIER_I", "Data2_source", "PRICE"), row.names = c(NA,
19161L), class = "data.frame")
"..." stands for a huge list of prices that I omitted because it would not fit here.

You can find all pairs (i,j) which satisfy your condition by merging the two data.frames:
differences = merge(data1, data2, by=c('PRODUCT','DATE'))
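This avoids for loops entirely, and the result has exactly one row per matching pair, so you do not need to know its size in advance. Because both data frames have a PRICE column, merge suffixes them as PRICE.x (from data1) and PRICE.y (from data2), so you can easily define the new column: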
differences$Diff = abs(differences$PRICE.x - differences$PRICE.y)
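To make the behaviour of merge concrete, here is a minimal sketch on toy data. The product names, dates, and prices below are made up for illustration; only the column names PRODUCT, DATE, and PRICE come from the dput output in the question.

```r
# Toy stand-ins for data1 and data2 (values are illustrative assumptions)
data1 <- data.frame(
  PRODUCT = c("X1", "X1", "X2"),
  DATE    = c("2020-01-01", "2020-01-02", "2020-01-01"),
  PRICE   = c(10, 11, 20),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  PRODUCT = c("X1", "X1", "X3"),
  DATE    = c("2020-01-01", "2020-01-02", "2020-01-01"),
  PRICE   = c(9.5, 12, 30),
  stringsAsFactors = FALSE
)

# merge keeps only (PRODUCT, DATE) pairs present in both frames (an inner join)
differences <- merge(data1, data2, by = c("PRODUCT", "DATE"))

# Non-key columns with the same name get the suffixes .x (data1) and .y (data2)
differences$Diff <- abs(differences$PRICE.x - differences$PRICE.y)
print(differences)
```

Products that appear in only one of the two frames (X2 and X3 here) simply drop out of the result, which is the "same product at the same time" comparison the question asks for.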

How to make new columns across multiple lists in R?

I am new to R and I don't quite know what structures to use and the correct syntax for them.
I have lists (that are more like tables with columns and column names). I would like to apply the same operations to several of these lists, and I assumed for loops would be a reasonable way to do it.
My functions are
1) use a column to calculate a new column. (calculate fold change from log2foldchange)
2) make a new list using a subset of the old list and name it adjusting the name of the original list name
Here are the lines of code that worked for these tables individually.
#take values from the log2FoldChange column and calculate Fold Change
resCondition_anno$FoldChange <- 2^resCondition_anno$log2FoldChange
#subset my dataset based on the values for each row in the padj column
resCondition_anno_padj05 <- subset(resCondition_anno, padj <= 0.05)
I would like to do these functions to multiple tables.
When I tried to do it in a for loop
resfiles1 <- c(resCondition_anno,resVirus_anno,resInter_anno)
for (i in resfiles1){
i$FoldChange <- 2^i$log2FoldChange # I was trying to calculate a new column based on log2FoldChange column
i_with_padj05 <- paste(i,"_padj05") # I was trying to create a new name like resCondition_anno_padj05
i_with_padj05 <- subset(i, i[[padj]] <= 0.05) # I was trying to subset my dataset based on values in the padj column
}
I tried to access the columns of my tables with $ and that gave me
Error: $ operator is invalid for atomic vectors
I tried to access the columns of my tables with [padj], I get
Error in subset.default(i, i[padj] <= 0.05) : object 'padj' not found
When I tried to access the columns of my table with [[padj]], I got the following error
Error in subset.default(i, i[[padj]] <= 0.05) : object 'padj' not found
Am I going about this completely the wrong way? Are for loops a reasonable way to approach my goals? I know the apply functions exist, but I had such a hard time getting output files out of them when I tried to feed them multiple files that I wanted to give for loops a try.
I would appreciate code that works for an arbitrary table and does these things; then I can figure out whether my tables are weird.
dput(head(resCondition_anno))
structure(list(ensembl = c("ENSMUSG00000051951", "ENSMUSG00000102331",
"ENSMUSG00000025902", "ENSMUSG00000104238", "ENSMUSG00000102269",
"ENSMUSG00000096126"), baseMean = c(2.34691358937965, 0.169507902147731,
49.4591642836684, 0.253911076708937, 3.27439052075304, 0.258178295608587
), log2FoldChange = c(1.04699290132002, 1.89907052894015, 0.629095304499277,
0.0597400040882164, -0.291997327218544, 1.97984690635658), lfcSE = c(1.09309963258445,
4.36961772602319, 0.291712394209747, 4.37647193807779, 1.21524080418346,
4.3263845102792), stat = c(0.95782019324678, 0.434607933236415,
2.15656008104662, 0.0136502655411644, -0.240279396654017, 0.457621577937096
), pvalue = c(0.338153434807336, 0.66384703564954, 0.0310399577136823,
0.989109002094381, 0.810113666298446, 0.647224338296786), padj = c(NA,
NA, 0.106540309680362, NA, 0.911344697137259, NA), mgi_symbol = c("Xkr4",
"Gm19938", "Sox17", "Gm37587", "Gm7357", "Gm22307"), gene_biotype = c("protein_coding",
"sense_intronic", "protein_coding", "processed_transcript", "processed_pseudogene",
"snRNA")), class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x0000027bef7e1ef0>)
Expected results for the aim 1
> dput(head(resCondition_anno))
structure(list(ensembl = c("ENSMUSG00000051951", "ENSMUSG00000102331",
"ENSMUSG00000025902", "ENSMUSG00000104238", "ENSMUSG00000102269",
"ENSMUSG00000096126"), baseMean = c(2.34691358937965, 0.169507902147731,
49.4591642836684, 0.253911076708937, 3.27439052075304, 0.258178295608587
), log2FoldChange = c(1.04699290132002, 1.89907052894015, 0.629095304499277,
0.0597400040882164, -0.291997327218544, 1.97984690635658), lfcSE = c(1.09309963258445,
4.36961772602319, 0.291712394209747, 4.37647193807779, 1.21524080418346,
4.3263845102792), stat = c(0.95782019324678, 0.434607933236415,
2.15656008104662, 0.0136502655411644, -0.240279396654017, 0.457621577937096
), pvalue = c(0.338153434807336, 0.66384703564954, 0.0310399577136823,
0.989109002094381, 0.810113666298446, 0.647224338296786), padj = c(NA,
NA, 0.106540309680362, NA, 0.911344697137259, NA), mgi_symbol = c("Xkr4",
"Gm19938", "Sox17", "Gm37587", "Gm7357", "Gm22307"), gene_biotype = c("protein_coding",
"sense_intronic", "protein_coding", "processed_transcript", "processed_pseudogene",
"snRNA"), FoldChange = c(2.0662186086592, 3.72972827627808, 1.54659483966075,
1.0422779093498, 0.816770504282921, 3.94451221821964)), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000027bef7e1ef0>)
Expected results for aim2
> dput(head(resCondition_anno_padj05))
structure(list(ensembl = c("ENSMUSG00000103922", "ENSMUSG00000025907",
"ENSMUSG00000061024", "ENSMUSG00000025911", "ENSMUSG00000025935",
"ENSMUSG00000025937"), baseMean = c(7.45083924607695, 1035.42915800337,
756.089939474399, 1510.50670239711, 2014.55644970672, 5206.99654662079
), log2FoldChange = c(3.31157886392159, -0.345358245876914, 0.340037961752993,
-0.637902858828505, 0.592795289538968, 0.59912370697665), lfcSE = c(0.984296895396084,
0.131191642000487, 0.0967702378760271, 0.120687031774959, 0.114283891072725,
0.161639505766009), stat = c(3.36441055479404, -2.63247140298489,
3.51386923517349, -5.28559572181691, 5.18704153292907, 3.70654255676794
), pvalue = c(0.000767073434065771, 0.00847661586751943, 0.000441630160084079,
1.25296333033368e-07, 2.13661093734535e-07, 0.000210107944374613
), padj = c(0.00522376704325313, 0.0385092726153939, 0.00325683272694307,
2.17721401368104e-06, 3.51690667040699e-06, 0.00168321660710376
), mgi_symbol = c("Gm6123", "Rb1cc1", "Rrs1", "Adhfe1", "Tram1",
"Lactb2"), gene_biotype = c("processed_pseudogene", "protein_coding",
"protein_coding", "protein_coding", "protein_coding", "protein_coding"
), FoldChange = c(9.92852128160573, 0.787112498791522, 1.26578990036559,
0.642646438673565, 1.5081660610658, 1.51479619975327)), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000027bef7e1ef0>)
For aim 1:
library(dplyr)
resCondition_anno_dumb <- resCondition_anno # make a second, similar table for the example
resCondition_anno_dumb$log2FoldChange <- resCondition_anno$log2FoldChange*3 # make some changes
list_t <- list(resCondition_anno, resCondition_anno_dumb) # put your data frames in a list() here, not c()
# mutate adds a column to an existing data set; lapply applies the function to each element of the list
new_list <- lapply(list_t, function(x){x %>% mutate(FoldChange=2^log2FoldChange)})
For aim 2, something like:
new_list <- lapply(list_t, function(x){x %>% filter(padj<=0.05)}) # filter also drops rows where padj is NA
Or you can pipe them together:
new_list <- lapply(list_t, function(x){x %>% mutate(FoldChange=2^log2FoldChange) %>% filter(padj <= 0.05)})
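As background to the errors in the question: c() applied to data frames concatenates their columns into one long list, so the for loop iterates over individual columns (atomic vectors), which is why $ fails. A minimal base-R sketch of the same two aims with a named list follows. The tables res_a and res_b and their values are hypothetical stand-ins, and the "_padj05" suffix mirrors the naming the question was aiming for.

```r
# Hypothetical small result tables (values are illustrative only)
res_a <- data.frame(gene = c("g1", "g2", "g3"),
                    log2FoldChange = c(1, -1, 2),
                    padj = c(0.01, NA, 0.2))
res_b <- data.frame(gene = c("g1", "g2", "g3"),
                    log2FoldChange = c(0, 3, 1),
                    padj = c(0.04, 0.03, NA))

# A named list keeps each table whole and remembers its name
res_list <- list(res_a = res_a, res_b = res_b)

# Aim 1: add a FoldChange column to every table
res_list <- lapply(res_list, function(d) {
  d$FoldChange <- 2^d$log2FoldChange
  d
})

# Aim 2: keep rows with padj <= 0.05; which() drops rows where padj is NA
sig_list <- lapply(res_list, function(d) d[which(d$padj <= 0.05), ])

# Derive the new names from the old ones (paste0 adds no space, unlike paste)
names(sig_list) <- paste0(names(sig_list), "_padj05")
print(sig_list)
```

Individual results can then be pulled out by name, for example sig_list$res_a_padj05, instead of creating separate variables inside a loop.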

Are there any functions to subtract all features(rows) to a particular value(row) in the same data file?

I am new to programming in R and Python, but I know some basics. I have a technical question about a computation: is there a function for subtracting one particular value (row) from all features (rows) of the same data set? I would like to obtain Output_value1 as shown in the file linked below and then multiply it by -1 to obtain Output_value2.
data file link: https://www.dropbox.com/s/m5rsi6ru419f5bf/Template_matrixfile.xlsx?dl=0
Please let me know if you need more details.
I have tried performing the same operation in the MS Excel, this is very tedious and time consuming.
I have many large datasets with several hundred rows and columns which becomes more complex to manually perform the same in MS Excel. Hence, I would prefer to write a code and obtain the desired outputs.
Here is the example data. The inputs are the Feature and Value columns, and the outputs are the Output_value1 and Output_value2 columns.
| Feature | Value | Output_value1 | Output_value2 |
| Gene_1 | 14.25633934 | 0.80100922 | -0.80100922 |
| Gene_2 | 16.88394578 | 3.42861566 | -3.42861566 |
| Gene_3 | 16.01 | 2.55466988 | -2.55466988 |
| Gene_4 | 13.82329514 | 0.36796502 | -0.36796502 |
| Gene_5 | 12.96382949 | -0.49150063 | 0.49150063 |
| Normalizer | 13.45533012 | 0 | 0 |
dput(head(Exampledata))
structure(list(Feature = structure(1:6, .Label = c("Gene_1", "Gene_2",
"Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"), Value =
c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012), Output_value1 = c(0.80100922, 3.42861566, 2.55466988,
0.36796502, -0.49150063, 0), Output_value2 = c(-0.80100922,
-3.42861566, -2.55466988, -0.36796502, 0.49150063, 0)), row.names = c(NA, 6L), class = "data.frame")
Assuming you'll only have one row where Feature == "Normalizer", in R you can get the Value of that row and subtract it from every row.
Exampledata$Output_value1 <- Exampledata$Value -
Exampledata$Value[Exampledata$Feature == "Normalizer"]
Exampledata$Output_value2 <- Exampledata$Output_value1 * -1
Exampledata
# Feature Value Output_value1 Output_value2
#1 Gene_1 14.25634 0.8010092 -0.8010092
#2 Gene_2 16.88395 3.4286157 -3.4286157
#3 Gene_3 16.01000 2.5546699 -2.5546699
#4 Gene_4 13.82330 0.3679650 -0.3679650
#5 Gene_5 12.96383 -0.4915006 0.4915006
#6 Normalizer 13.45533 0.0000000 0.0000000
EDIT
For multiple such columns (here data is a data frame with several columns whose names start with "Value"), we can do
cols <- grep("^Value", names(data))
inds <- which(data$Feature == "Normalizer")
data[paste0("Output", seq_along(cols))] <- data[cols] - data[rep(inds, nrow(data)),cols]
data[paste0("Output_inverted", seq_along(cols))] <- data[grep("Output", names(data))] * -1
data
Exampledata <- structure(list(Feature = structure(1:6, .Label = c("Gene_1",
"Gene_2", "Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"),
Value = c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012)), row.names = c(NA, 6L), class = "data.frame")
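For the multi-column case, a self-contained sketch using sweep may be easier to follow. The data frame dat, its columns Value1 and Value2, and the "Output_" prefix are assumptions made for illustration, not part of the original data.

```r
# Hypothetical data with two value columns (numbers are illustrative)
dat <- data.frame(
  Feature = c("Gene_1", "Gene_2", "Normalizer"),
  Value1  = c(14, 16, 13),
  Value2  = c(10, 9, 8),
  stringsAsFactors = FALSE
)

cols     <- grep("^Value", names(dat))          # positions of the value columns
norm_row <- which(dat$Feature == "Normalizer")  # row holding the reference values
stopifnot(length(norm_row) == 1)                # the approach assumes exactly one such row

# One normalizer value per column, then subtract it column by column
norm_vals <- unlist(dat[norm_row, cols])
out <- sweep(as.matrix(dat[cols]), 2, norm_vals, "-")

new_names <- paste0("Output_", names(dat)[cols])
dat[new_names] <- as.data.frame(out)
dat[paste0(new_names, "_inverted")] <- -dat[new_names]
print(dat)
```

Naming the output columns after their source columns keeps the pairing obvious when there are many value columns.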

R Data Frame Filter Not Working

I am trying to filter the output from RNA-seq data analysis. I want to generate a list of genes that fit the specified criteria in at least one experimental condition (dataframe).
For example, the data is output as a .csv, so I read in the whole directory, as follows.
readList = list.files("~/Path/To/File/", pattern = "*.csv")
files = lapply(readList, read.csv, row.names = 1)
#row.names = 1 sets rownames as gene names
This reads in 3 .csv files, A, B and C. The data look like this
A = files[[1]]
B = files[[2]]
C = files[[3]]
head(A)
logFC logCPM LR PValue FDR
YER037W -1.943616 6.294092 34.30835 0.000000004703583 0.00002276064
YJL184W -1.771273 5.840774 31.97088 0.000000015650144 0.00003786552
YFR053C 1.990102 10.107793 30.55576 0.000000032440747 0.00005232692
YDR342C 2.096877 6.534761 28.08635 0.000000116021451 0.00014035695
YGL062W 1.649138 8.940714 23.32097 0.000001370968319 0.00132682314
YFR044C 1.992810 9.302504 22.91553 0.000001692786468 0.00132736130
I then try to filter all of these to generate a list of genes (rownames) where two conditions must be met in at least one dataset.
1.logFC > 1 or < -1
2.FDR < 0.05
So I loop through the dataframes like so
genesKeep = ""
for (i in 1:length(files)) {
F = data.frame(files[i])
sigGenes = rownames(F[F$FDR<0.05 & abs(F$logFC>1), ])
genesKeep = append(genesKeep, values = sigGenes)
}
This gives me a list of genes, however, when I sanity check these against the data some of the genes listed do not pass these thresholds, whilst other genes that do pass these thresholds are not present in the list.
e.g.
df = cbind(A,B,C)
genesKeep = unique(genesKeep)
logicTest = rownames(df) %in% genesKeep
dfLogic = cbind(df, logicTest)
Whilst the majority of genes do in fact pass the criteria I set, I see discrepancies for a few genes. For example:
A.logFC A.FDR B.logFC B.FDR C.logFC C.FDR logicTest
YGR181W -0.8050325 0.1462688 -0.6834184 0.2162317 -1.1923744 0.04049870 FALSE
YOR185C 0.8321432 0.1462919 0.7401477 0.2191413 -0.9616989 0.04098177 TRUE
The first gene (YGR181W) passes the criteria in condition C, where logFC < -1 and FDR < 0.05. However, the gene is not reported in the genesKeep list.
Conversely, the second gene (YOR185C) does not pass these criteria in any condition, but the gene is present in the genesKeep list.
I'm unsure where I'm going wrong here, but if anyone has any ideas they would be much appreciated.
Thanks.
Using merge as suggested by akash87 solved the problem.
Turns out cbind was causing the rownames to not be assigned correctly.
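Separately from the cbind issue, the filter expression in the loop is worth a second look: abs(F$logFC>1) takes the absolute value of a logical vector, so it behaves like F$logFC > 1 and never selects genes with logFC < -1, which would be consistent with YGR181W being missed. Below is a minimal sketch of a corrected per-file filter. The data frames A and C, the gene names g1 to g3, and all values are made up for illustration.

```r
# Hypothetical per-condition tables with gene names as row names
A <- data.frame(logFC = c(-1.9, 0.8, 2.1), FDR = c(0.001, 0.15, 0.2),
                row.names = c("g1", "g2", "g3"))
C <- data.frame(logFC = c(-0.5, -1.2, 2.0), FDR = c(0.3, 0.04, 0.01),
                row.names = c("g1", "g2", "g3"))
files <- list(A = A, C = C)

# abs() must wrap logFC itself, not the comparison
sig_per_file <- lapply(files, function(d) {
  rownames(d)[which(abs(d$logFC) > 1 & d$FDR < 0.05)]
})

# Genes passing in at least one condition
genesKeep <- Reduce(union, sig_per_file)
print(genesKeep)

# For comparison, the original expression misses g2 in C (logFC = -1.2)
buggy <- rownames(C)[C$FDR < 0.05 & abs(C$logFC > 1)]
print(buggy)
```

Starting from the per-file results and combining them with union also avoids the empty string that genesKeep = "" puts into the final list.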
I'm not exactly sure what your desired output is here, but it might be possible to simplify a bit and use the dplyr library to filter all your outputs at once, assuming the format of your data is consistent. Using some modified versions of your data as an example:
A <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-1.943616, -1.771273, 0, 2.096877, 1.649138, 1.99281
), logCPM = c(6.294092, 5.840774, 10.107793, 6.534761, 8.940714,
9.302504), LR = c(34.30835, 31.97088, 30.55576, 28.08635,
23.32097, 22.91553), PValue = c(4.703583e-09, 1.5650144e-08,
3.2440747e-08, 1.16021451e-07, 1.370968319e-06, 1.692786468e-06
), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05, 0.00014035695,
0.00132682314, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
B <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-0.4, -0.3, 0, 2.096877, 1.649138, 1.99281), logCPM = c(6.294092,
5.840774, 10.107793, 6.534761, 8.940714, 9.302504), LR = c(34.30835,
31.97088, 30.55576, 28.08635, 23.32097, 22.91553), PValue = c(4.703583e-09,
1.5650144e-08, 3.2440747e-08, 1.16021451e-07, 1.370968319e-06,
1.692786468e-06), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05,
0.00014035695, 0.1, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
Load dplyr, then use rbind to create a single data frame to work with:
library(dplyr)
AB <- rbind(A, B)
Then filter this whole thing based on your criteria. Note that duplicates can occur, so you can use distinct to only return unique genes that qualify:
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene)
gene
1 YER037W
2 YJL184W
3 YDR342C
4 YGL062W
Or, to keep all the rows for those genes as well:
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene, .keep_all = TRUE)
gene logFC logCPM LR PValue FDR
1 YER037W -1.943616 6.294092 34.30835 4.703583e-09 2.276064e-05
2 YJL184W -1.771273 5.840774 31.97088 1.565014e-08 3.786552e-05
3 YDR342C 2.096877 6.534761 28.08635 1.160215e-07 1.403570e-04
4 YGL062W 1.649138 8.940714 23.32097 1.370968e-06 1.326823e-03

How to make an assignment fill a subset by row

While cleaning up a data frame I found out that assignment into a subset works by columns and not by rows. This is unfortunate for dataset cleanup, where you typically search for problem cases and then apply the same correction across multiple rows.
# example table
releves <- structure(list(cult2015 = c("bp", "bp"), prec2015 = c("?", "?"
)), .Names = c("cult2015", "prec2015"), row.names = c(478L, 492L
), class = "data.frame")
# assignment to a subset
iBad2 <- which(releves$cult2015 == "bp" & releves$prec2015 == "?")
releves[iBad2,c("cult2015","prec2015")] <- c("b","p")
I understand that matrices are filled by columns, so the provided vector is recycled down each column. Is there any option to get "b", "p" on each row instead of:
> releves
cult2015 prec2015
478 b b
492 p p
I wrote the following function that does the job, at least in the cases I faced:
# allows assignment of newVals to a subset spanning multiple rows
AssignToSubsetByRow <- function(dat,rows,cols,newVals){
  if(is.null(dim(newVals)) && length(rows)*length(cols) > length(newVals)){
    # repeat each value once per row, so that column-wise filling yields one value per column
    fullRep <- rep(newVals,each=length(rows))
  }else{
    fullRep <- newVals
  }
  dat[rows,cols] <- fullRep
  return(dat)
}
It does the job fine:
releves <- AssignToSubsetByRow(releves,iBad2,c("cult2015","prec2015"),c("b","p"))
> releves
cult2015 prec2015
478 b p
492 b p
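An alternative that avoids a helper function, sketched under the assumption that the replacement supplies exactly one value per column: build the replacement as a matrix filled with byrow = TRUE, so that each row of the subset receives the full vector.

```r
# Same example table as above
releves <- data.frame(cult2015 = c("bp", "bp"),
                      prec2015 = c("?", "?"),
                      row.names = c(478L, 492L),
                      stringsAsFactors = FALSE)

iBad2   <- which(releves$cult2015 == "bp" & releves$prec2015 == "?")
cols    <- c("cult2015", "prec2015")
newVals <- c("b", "p")

# One matrix row per affected data row; byrow = TRUE lays newVals out across the columns
releves[iBad2, cols] <- matrix(newVals,
                               nrow = length(iBad2),
                               ncol = length(cols),
                               byrow = TRUE)
print(releves)
```

This keeps the intent visible at the call site, at the cost of spelling out the matrix dimensions each time.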

Rank columns and put the column with the highest score as first and so on

Let's say I have a very large data.frame containing scores per column, for example:
MA0001.1 AGL3 MA0003.1 TFAP2A MA0004.1 Arnt MA0005.1 AG MA0006.1 Arnt::Ahr
7.789524e-09 0.4012127249 3.771518e-03 1.892011e-06 0.002733200
5.032498e-07 0.0001873801 9.947449e-05 3.284222e-05 0.001367041
1.194487e-06 0.0009357406 6.943634e-05 1.589373e-05 0.002551519
4.833494e-06 0.0150703600 1.003488e-04 1.197928e-03 0.001431416
6.865040e-05 0.0000732607 3.857193e-04 5.388744e-03 0.001363706
R data.frame:
testfr<-structure(list(`MA0001.1 AGL3` = c(7.78952366977488e-09, 5.03249791215203e-07,
1.19448739380034e-06, 4.83349413748598e-06, 6.86504034402563e-05
), `MA0003.1 TFAP2A` = c(0.401212724871542, 0.000187380067026448,
0.000935740631438077, 0.0150703600158589, 7.32607018758816e-05
), `MA0004.1 Arnt` = c(0.00377151826447817, 9.94744903768433e-05,
6.94363387424972e-05, 0.000100348764966112, 0.00038571926458373
), `MA0005.1 AG` = c(1.89201084302835e-06, 3.2842217133538e-05,
1.58937284554136e-05, 0.00119792816070882, 0.00538874414923338
), `MA0006.1 Arnt::Ahr` = c(0.00273319966783363, 0.00136704060025893,
0.00255151921946167, 0.00143141576426544, 0.00136370552325235
)), .Names = c("MA0001.1 AGL3", "MA0003.1 TFAP2A", "MA0004.1 Arnt",
"MA0005.1 AG", "MA0006.1 Arnt::Ahr"), class = "data.frame", row.names = c(4L,
2L, 5L, 1L, 3L))
Now I want to find the column containing the highest value and place that column first, and so on for the remaining columns.
The values of each column should stay under the same column name; entire columns should be reordered by rank.
I tried the following:
ranked<-unlist(lapply(testfr,rank))
testranked<-testfr[ranked, ]
This produces a data frame with 2259 obs. of 459 variables, while the original was 5*459.
Note that testfr is a data.frame derived from a function which scores sequences against a list of matrices and returns the scores as a data.frame in which the rows are the sequences and the columns are the matrices.
I know I am doing something wrong with the indexing or unlisting, but I don't have a clue how to fix it. Any help is appreciated.
How about this?
> testfr[rev(order(sapply(testfr, max, na.rm = TRUE)))]
Breakdown:
sapply(testfr, max, na.rm = TRUE) # get max of each column (after removing NA)
order(.) # get the order of these values in increasing order
rev(.) # reverse the order so that the index of the highest value comes first
testfr[.] # get the columns back in this order
I would use apply for readability:
testfr[order(apply(testfr, 2, max, na.rm = TRUE), decreasing = TRUE)]
Here apply takes the max over margin 2 (the columns), and order(..., decreasing = TRUE) then sorts the columns from highest to lowest maximum.
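A small runnable sketch of the same idea follows. The data frame scores and its columns a, b, c are hypothetical, and vapply is used here only as a type-checked variant of sapply. Note that the original attempt indexed rows (testfr[ranked, ]) with a long vector of ranks, which is why the number of rows grew; reordering columns needs a single index per column.

```r
# Hypothetical score table: rows are sequences, columns are matrices
scores <- data.frame(a = c(0.1, 0.2),
                     b = c(0.9, 0.05),
                     c = c(0.3, 0.4))

# One summary value per column: its maximum
col_max <- vapply(scores, max, numeric(1), na.rm = TRUE)

# Reorder whole columns from highest to lowest maximum
ranked <- scores[order(col_max, decreasing = TRUE)]
print(names(ranked))
```

Swapping max for another summary such as mean or sum changes the ranking criterion without touching the rest of the code.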