R: For loop to subset Granges objects by list

R: For loop to subset Granges objects by list - r

I have a txf_df which I subset by gene.list$entrez and then found the list of unique number of transcripts. The txf_df is then converted to txf_grange.
Now, I want to create a for loop of the 15 unique genes, where upon each iteration, subset the txf_grange objects by only the specific gene.
Code:
# Subset by the Entrez IDs
txf_df <- txf_df %>% filter(geneName %in% gene.list$entrez)
# Find the number of common transcripts
unique <- unique(txf_df$geneName)
length(unique)
# Recast this dataframe back to a GRanges object
txf_grange <- makeGRangesFromDataFrame(txf_df, keep.extra.columns=T)
# For each of the 15 genes, subset the Granges objects by only the gene
for (i in gene.list["entrez"]) {
for (j in txf_grange$geneName) {
if (i==j) {
assign(paste0("gene.", i), 1:j) <- txf_grange[j,]
}
}
}
Data:
> dput(head(txf_df))
structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "16", class = "factor"),
start = c(12058964L, 12059311L, 12059311L, 12060052L, 12060198L,
12060198L), end = c(12059311L, 12060052L, 12061427L, 12060198L,
12060877L, 12061427L), width = c(348L, 742L, 2117L, 147L,
680L, 1230L), strand = structure(c(1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("+", "-", "*"), class = "factor"), type = structure(c(3L,
1L, 1L, 2L, 1L, 1L), .Label = c("J", "I", "F", "L", "U"), class = "factor"),
txName = structure(list(c("uc002dbv.3", "uc010buy.3", "uc010buz.3"
), c("uc002dbv.3", "uc010buy.3"), "uc010buz.3", c("uc002dbv.3",
"uc010buy.3"), "uc010buy.3", "uc002dbv.3"), class = "AsIs"),
geneName = structure(list("608", "608", "608", "608", "608",
"608"), class = "AsIs")), row.names = c(NA, 6L), class = "data.frame")
> dput(head(gene.list))
structure(list(Name = c("AQP8", "CLCA1", "GUCA2B", "ZG16", "CA4",
"CA1"), Pvalue = c(3.24077275512836e-22, 2.57708986670727e-21,
5.53491656902485e-21, 4.14482213350182e-20, 2.7795892896524e-19,
1.23890644641685e-18), adjPvalue = c(8.3845272720681e-18, 6.66744690314504e-17,
1.43199361473811e-16, 1.07234838237959e-15, 7.19135341018869e-15,
3.20529875816967e-14), logFC = c(-3.73323340223377, -2.96422555675244,
-3.34493724166712, -2.87787132076412, -2.87670608798164, -3.15664667432159
), entrez = c(AQP8 = "343", CLCA1 = "1179", GUCA2B = "2981",
ZG16 = "653808", CA4 = "762", CA1 = "759")), row.names = c(NA,
6L), class = "data.frame")

Related

Conditionally adding characters to new column based on separate dataset

Hello all and thank you in advance.
I would like to add a new column to my pre-existing data frame where the values sourced from a second data frame based on certain conditions. The dataset I wish to add the new column to ("data_melt") has many different sample IDs (sample.#) under the variable column. Using a second dataset ("metadata") I want to add the pond names to the "data_melt" new column based on the sample-ids. The sample IDs are the same in both datasets.
My gut tells me there's an obvious solution but my head is pretty fried. Here is a toy example of my data_melt df (since its 25,000 observations):
> dput(toy)
structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
And here is a toy example of my metadata df:
> dput(toy)
structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
Thank you again!

We can use match from base R to create a numeric index to replace the values
toy$pond <- with(toy, out$pond[match(variable, out$sample)])

I believe merge will work here.
sss <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
ss <- structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
ssss <- merge(sss, ss, by.x = "variable", by.y = "sample")

You can use left_join() from the dplyr package after renaming sample to variable in the metadata data frame.
library(tidyverse)
data_melt <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"),
process = structure(c(1L, 1L, 1L, 1L),
.Label = "energy",
class = "factor"),
category = structure(c(1L, 1L, 1L, 1L),
.Label = "metabolism",
class = "factor"),
ko = structure(1:4,
.Label = c("K00058", "K00093", "K00125", "K00148"),
class = "factor"),
variable = structure(c(1L, 2L, 3L, 3L),
.Label = c("sample.10", "sample.19", "sample.72"),
class = "factor"),
value = c(0.00116, 2.77e-05, 1.84e-05, 0.0125)),
row.names = c(NA, -4L),
class = "data.frame")
metadata <- structure(list(sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
pond = structure(c(2L, 2L, 1L, 1L),
.Label = c("lower", "upper"),
class = "factor")),
row.names = c(NA, -4L),
class = "data.frame") %>%
# Renaming the column, so we can join the two data sets together
rename(variable = sample)
data_melt <- data_melt %>%
left_join(metadata, by = "variable")

Extract common values for more than one column between several R dataframes

Imagine I have this 4 data frames:
abc_df
abc_ID . abc_classification
a . neutral
b . deletereous
c . benign
def_df
def_ID . def_classification
f . neutral
a . neutral
c . benign
ghi_df
ghi_ID . ghi_classification
f . deletereous
c . benign
k . neutral
vmk_df
vmk_ID . vmk_classification
c . benign
k . deletereous
a . neutral
As you can see, the columns "dfname_ID" and "dfname_classification" are not contiguous (the dots represent another columns in the data frame) and have not the same colnames. So, I would like to extract the common rows between all data frames for these 2 columns, using the index of the columns, and not their names.
The output should be this:
ID . classification
c . benign
I am trying to use intersect, lapply(mget(c('abc_df', 'def_df', 'ghi_df', 'vmk_df'))), but I don't know how to specify the correct command. Do you know how can I solve this?

Might need to use purrr, so the conversion to character might not be necessary since intersect forces it to change:
library(purrr)
library(magrittr)
COLUMNS = c(1,2,3)
list(abc_df,def_df,ghi_df,vmk_df) %>%
map(~mutate_if(.x[,COLUMNS],is.factor, as.character)) %>%
map(~set_colnames(.x,c("id",".","classification"))) %>%
reduce(intersect)
id . classification
1 c . benign
Your data:
abc_df = structure(list(abc_ID = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), . = structure(c(1L, 1L, 1L), .Label = ".", class = "factor"),
abc_classification = structure(3:1, .Label = c("benign",
"deletereous", "neutral"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
def_df = structure(list(def_ID = structure(c(3L, 1L, 2L), .Label = c("a",
"c", "f"), class = "factor"), . = structure(c(1L, 1L, 1L), .Label = ".", class = "factor"),
def_classification = structure(c(2L, 2L, 1L), .Label = c("benign",
"neutral"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
ghi_df = structure(list(ghi_ID = structure(c(2L, 1L, 3L), .Label = c("c",
"f", "k"), class = "factor"), . = structure(c(1L, 1L, 1L), .Label = ".", class = "factor"),
ghi_classification = structure(c(2L, 1L, 3L), .Label = c("benign",
"deletereous", "neutral"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
vmk_df = structure(list(vmk_ID = structure(c(2L, 3L, 1L), .Label = c("a",
"c", "k"), class = "factor"), . = structure(c(1L, 1L, 1L), .Label = ".", class = "factor"),
vmk_classification = structure(1:3, .Label = c("benign",
"deletereous", "neutral"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))

For the data you provided you could use:
library(dplyr)
abc_df %>%
rename(ID = abc_ID, classification = abc_classification) %>%
inner_join(def_df, by = c("ID" = "def_ID",
"classification" = "def_classification")) %>%
inner_join(ghi_df, by = c("ID" = "ghi_ID",
"classification" = "ghi_classification")) %>%
inner_join(jkl_df, by = c("ID" = "jkl_ID",
"classification" = "jkl_classification"))

For loop & if else working for less data but not working for more data

Calculation inside for loop & ifelse is working when I have 100-200 rows but not working when I have 20000 rows.
Can someone help me with the FOR loop and IFELSE if something is wrong or if there is some timeout happening in R studio when running for & if-else loop
Code:
#FROM HERE IT IS NOT WORKING WHEN WE HAVE 20000 ROWS OF DATA IN FINAL DATFRAME.
#WE ARE CREATING FINAL_V1 WHICH IS POPULATING ONLY 1 ROW
#New Dataframe with Null values
Final <- structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "0S1576", class = "factor"),
LC = structure(1:6, .Label = c("MW92", "OY01", "RM11", "RS11",
"WK14", "WK15"), class = "factor"), Fiscal.Week = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "2019-W24", class = "factor"),
SS = c(15L, 7L, 5L, 9L, 2L, 2L), Freq = c(3, 6, 1, 2, 1,
1), agg = c(1, 1, 1, 1, 0, 0)), row.names = c(NA, -6L), class = "data.frame")
lctolc <- structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "0S1576", class = "factor"),
LC = structure(c(1L, 2L, 2L, 3L, 3L), .Label = c("MW92",
"OY01", "RM11"), class = "factor"), ToLC = structure(1:5, .Label = c("OY01",
"RM11", "RS11", "WK14", "WK15"), class = "factor")), row.names = c(NA,
-5L), class = "data.frame")
df <- as.data.frame(unique(Final$Item))
Final_v1 <- NA
j <- 1
i <- 1
#SS computations
#For 1 to no of rows in df(which is having no of unique items
for(j in 1:nrow(df)) {
#copying the data from Final to Final_v1(with charater type)
Final_v1 <- Final[Final$Item == as.character(df[j,1]),]
#for 1 to the no of rows in Final_v1
for(i in 1:nrow(Final_v1)) {
if(Final_v1[i,6] <= 0)
{
Final_v1[i,7] = Final_v1[i,4]}
else
{
if(Final_v1[i,5] == '1')
{
Final_v1[i,7]=0
}
else
{
Final_v1[i,7]=Final_v1[i,4]
}
SSNew <- Final_v1[i,7]
#Leftover distribution
LCS <- lctolc$ToLC[Final_v1$Item[i] == lctolc$Item & Final_v1$LC[i] == lctolc$LC]
inds <- Final_v1$LC %in% LCS
if (any(inds))
{ Final_v1$SS[inds]<- if (SSNew == 0) {Final_v1$SS[inds]==0} else {Final_v1$SS[inds]=Final_v1$SS[inds]} }
}
}
names(Final_v1)[7] <- "SSNew"
}
Can someone help why it is not performing for 20000rows

Credibility interval with respect two factors using ggplot2 in r

I have problem ploting credibility interval like this:
My data structure is following，L1,L2,M,U1,U2 stand for 0.025quant,0.25quant,0.5quant,0.75quant,0.975quant,respectively.
`
structure(list(approach = structure(c(1L, 2L, 1L, 2L, 1L, 2L), class = "factor", .Label = c("INLA",
"rjags")), param = structure(c(1L, 2L, 3L, 1L, 2L, 3L), class = "factor", .Label = c("alpha",
"beta", "sig2")), L1 = c(0.0844546867936143, 1.79242348175439,
0.163143886545317, 0.0754165380733685, 1.79067991488052, 3.66675821267498
), L2 = c(0.60090835904286, 1.95337968870806, 0.898159977552433,
0.606017177641373, 1.95260448314298, 4.07080184844179), M = c(0.870204161297956,
2.03768437879748, 2.20651061559405, 0.87408237273113, 2.03725552264872,
4.32531027636171), U2 = c(1.13905085248391, 2.12210930874551,
4.26836270504725, 1.66260576926063, 2.28900567640091, 5.10063756831338
), U1 = c(1.65214011950274, 2.28396345192398, 4.9109804477583,
1.1450384685802, 2.12117799328209, 4.55657971279654), AP = structure(c(1L,
4L, 5L, 2L, 3L, 6L), .Label = c("INLA.alpha", "rjags.alpha",
"INLA.beta", "rjags.beta", "INLA.sig2", "rjags.sig2"), class = "factor")), .Names = c("approach",
"param", "L1", "L2", "M", "U2", "U1", "AP"), row.names = c(NA,
-6L), class = "data.frame")`
I referenced this answerenter link description here,but 'fill' seems only work for boxplot case.the code I tried so far is:
CI$AP=interaction(CI$approach,CI$param)
p=ggplot(CI,aes(y=AP))+geom_point(aes(x=M))
p=p+geom_segment(aes(x=L1,xend=U1,y=AP,yend=AP))
p=p+geom_segment(aes(x=L2,xend=U2,y=AP,yend=AP),size=1.5)
It is far away from what I want.
Many thanks!

How about the following:
ggplot(df, aes(x = param, y = M, colour = approach)) +
geom_point(position = position_dodge2(width = 0.3), size = 3) +
geom_linerange(
aes(ymin = L2, ymax = U2, x = param),
position = position_dodge2(width = 0.3),
size = 2) +
geom_linerange(
aes(ymin = L1, ymax = U1, x = param),
position = position_dodge2(width = 0.3),
size = 1) +
coord_flip() +
labs(x = "Parameter", y = "Estimate")
Sample data
df <- structure(list(approach = structure(c(1L, 2L, 1L, 2L, 1L, 2L), class = "factor", .Label = c("INLA",
"rjags")), param = structure(c(1L, 2L, 3L, 1L, 2L, 3L), class = "factor", .Label = c("alpha",
"beta", "sig2")), L1 = c(0.0844546867936143, 1.79242348175439,
0.163143886545317, 0.0754165380733685, 1.79067991488052, 3.66675821267498
), L2 = c(0.60090835904286, 1.95337968870806, 0.898159977552433,
0.606017177641373, 1.95260448314298, 4.07080184844179), M = c(0.870204161297956,
2.03768437879748, 2.20651061559405, 0.87408237273113, 2.03725552264872,
4.32531027636171), U2 = c(1.13905085248391, 2.12210930874551,
4.26836270504725, 1.66260576926063, 2.28900567640091, 5.10063756831338
), U1 = c(1.65214011950274, 2.28396345192398, 4.9109804477583,
1.1450384685802, 2.12117799328209, 4.55657971279654), AP = structure(c(1L,
4L, 5L, 2L, 3L, 6L), .Label = c("INLA.alpha", "rjags.alpha",
"INLA.beta", "rjags.beta", "INLA.sig2", "rjags.sig2"), class = "factor")), .Names = c("approach",
"param", "L1", "L2", "M", "U2", "U1", "AP"), row.names = c(NA,
-6L), class = "data.frame")

Using a lookup table in R with continous values

I have a lookup table in R that I am trying to figure out how to implement. The challenge for me is that it involves continuous values or ranges of data. If the value falls inbetween I'd like it to pick the right value.
I want to use the two continuous 'GRADE', 'SAT' variables plus the categorical 'TYPE' value to assign a 'GROUP' value. This big block of code looks intimidating but these are tiny tiny tables.
Any advice is appreciated!!!!
#lookup table code for recreating dataframe
structure(list(Type = structure(c(1L, 2L, 1L, 1L), .Label = c("A",
"B"), class = "factor"), min_grade = c(93L, 85L, 93L, 80L), max_grade = c(100L,
93L, 100L, 92L), min_sat = c(600L, 700L, 400L, 600L), max_sat = c(800L,
800L, 599L, 800L), Group = structure(c(1L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("Type", "min_grade",
"max_grade", "min_sat", "max_sat", "Group"), class = "data.frame", row.names = c(NA,
-4L))
#example ----- desired value is in the 'GROUP' column so this would be NULL before I used the lookup table
structure(list(Name = structure(c(3L, 1L, 2L, 4L), .Label = c("Jack",
"James", "John", "Jordan"), class = "factor"), Grade = c(95L,
95L, 92L, 93L), Sat = c(701L, 500L, 800L, 800L), Type = structure(c(1L,
1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), Group = structure(c(1L,
2L, 3L, 1L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("Name",
"Grade", "Sat", "Type", "Group"), class = "data.frame", row.names = c(NA,
-4L))

how abt this?
ltab <- structure(list(Type = structure(c(1L, 2L, 1L, 1L), .Label = c("A",
"B"), class = "factor"), min_grade = c(93L, 85L, 93L, 80L), max_grade = c(100L,
93L, 100L, 92L), min_sat = c(600L, 700L, 400L, 600L), max_sat = c(800L,
800L, 599L, 800L), Group = structure(c(1L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("Type", "min_grade",
"max_grade", "min_sat", "max_sat", "Group"), class = "data.frame", row.names = c(NA,
-4L))
dat <- structure(list(Name = structure(c(3L, 1L, 2L, 4L), .Label = c("Jack",
"James", "John", "Jordan"), class = "factor"), Grade = c(95L,
95L, 92L, 93L), Sat = c(701L, 500L, 800L, 800L), Type = structure(c(1L,
1L, 1L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("Name",
"Grade", "Sat", "Type"), class = "data.frame", row.names = c(NA,
-4L))
library(plyr)
mdat <- adply(merge(dat, ltab, by="Type", all=T), 1, function(x) {
c(FallsIn=x$Grade > x$min_grade & x$Grade <= x$max_grade & x$Sat > x$min_sat & x$Sat <= x$max_sat)
})
mdat[mdat$FallsIn,]
thinking about generalizing, are there going to be more continuous variables that you need to check?
EDIT: could not edit OP post so taking OP's comment into account is how I would tackle an example of "categorizing multidimensional continuous random variables"
so that these keywords will flag up in future searches
breaks <- list(Var1=c(0, 0.25, 1),
Var2=c(0, 0.5, 1),
Var3=c(0, 0.25, 0.75, 1))
#generate this on the fly
genIntv <- function(x) {
ret <- paste0("(", x[1:(length(x)-1)],", ",x[2:length(x)], "]")
names(ret) <- 1:(length(x)-1)
ret
}
lookupTbl <- data.frame(expand.grid(lapply(breaks, genIntv), stringsAsFactors=F),
Group=LETTERS[1:12])
lookupTbl2 <- data.frame(expand.grid(lapply(breaks, function(x) 1:(length(x)-1)), stringsAsFactors=F),
Group=LETTERS[1:12])
#data set
dat <- data.frame(Var1=c(0.1, 0.76), Var2=c(0.5, 0.75), Var3=c(0.25,0.9))
binDat <- do.call(cbind, setNames(lapply(1:ncol(dat), function(k)
.bincode(dat[,k], breaks[[k]], T, T)),colnames(dat)))
merge(binDat, lookupTbl2, all.x=T, all.y=F)
good to learn if someone else has better approaches

If you have small data, a full join should be fine.
library(dplyr)
result =
example %>%
select(-Type) %>%
full_join(look_up) %>%
filter(min_grade < Grade & Grade <= max_grade &
min_sat < Sat & Sat <= max_sat)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R: For loop to subset Granges objects by list - r

Related

Conditionally adding characters to new column based on separate dataset

Extract common values for more than one column between several R dataframes

For loop & if else working for less data but not working for more data

Credibility interval with respect two factors using ggplot2 in r

Using a lookup table in R with continous values

Categories

Resources