R: How to use IF correctly?

I have two columns of values like this:
>bb
GDis BDis
1 12.291488 8.009909
2 11.283319 13.625103
3 6.674549 8.629232
4 13.493121 17.175888
5 9.550731 9.867878
6 9.193895 9.785301
7 10.541702 10.941371
8 9.849527 9.496284
9 8.682287 8.133774
10 8.439381 4.335260
I need to add an extra column called Index that holds the ratio GDis/BDis if GDis is bigger, and BDis/GDis if BDis is bigger.
How do I do that?

You can use pmax and pmin.
transform(bb, Index = pmax(GDis, BDis) / pmin(GDis, BDis))
You can also use arithmetic:
transform(bb, Index = (GDis / BDis) ^ (1 - 2 * (BDis > GDis)))
The result:
GDis BDis Index
1 12.291488 8.009909 1.534535
2 11.283319 13.625103 1.207544
3 6.674549 8.629232 1.292856
4 13.493121 17.175888 1.272937
5 9.550731 9.867878 1.033207
6 9.193895 9.785301 1.064326
7 10.541702 10.941371 1.037913
8 9.849527 9.496284 1.037198
9 8.682287 8.133774 1.067436
10 8.439381 4.335260 1.946684
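A quick check of why the arithmetic version matches pmax/pmin: the comparison BDis > GDis is coerced to 0 or 1, so the exponent is 1 when GDis is the larger value and -1 otherwise. The sketch below uses only the first three rows of bb, and the names expo, by_minmax and by_power are chosen here for illustration.

```r
# First three rows of the example data
bb <- data.frame(GDis = c(12.291488, 11.283319, 6.674549),
                 BDis = c(8.009909, 13.625103, 8.629232))

# Exponent is +1 where GDis >= BDis and -1 where BDis > GDis
expo <- 1 - 2 * (bb$BDis > bb$GDis)

by_minmax <- pmax(bb$GDis, bb$BDis) / pmin(bb$GDis, bb$BDis)
by_power  <- (bb$GDis / bb$BDis)^expo

expo                             # 1 -1 -1
all.equal(by_minmax, by_power)   # compares up to floating-point tolerance
```

Raising the ratio to the power -1 simply inverts it, which is why both versions give the larger value divided by the smaller one.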

Try
transform(bb, Index=ifelse(GDis>BDis, GDis/BDis, BDis/GDis))
# GDis BDis Index
#1 12.291488 8.009909 1.534535
#2 11.283319 13.625103 1.207544
#3 6.674549 8.629232 1.292856
#4 13.493121 17.175888 1.272937
#5 9.550731 9.867878 1.033207
#6 9.193895 9.785301 1.064326
#7 10.541702 10.941371 1.037913
#8 9.849527 9.496284 1.037198
#9 8.682287 8.133774 1.067436
#10 8.439381 4.335260 1.946684

Not as nice as the other answers but how about this?
bb$RATIO <- ifelse(bb$GDis > bb$BDis, bb$GDis / bb$BDis, bb$BDis / bb$GDis)

Related

extract word from string and create new column in r

my data looks like this:
try=data.frame("histones"= c("encode3Ren_limb_H3K27me3_E10","encode3Ren_facial_prominence_H3K27me3_E10", "encode3Ren_liver_H3K27me3_E12", "encode3Ren_neural_tube_H3K27me3_E14", "encode3Ren_neural_tube_H3K4me1_E12" ,"encode3Ren_neural_tube_H3K27me3_E11", "encode3Ren_neural_tube_H3K4me1_E15", "encode3Ren_neural_tube_H3K4me2_E13" ), "a"= c(1,2,3,4,5,6,7,8))
try
histones a
1 encode3Ren_limb_H3K27me3_E10 1
2 encode3Ren_facial_prominence_H3K27me3_E10 2
3 encode3Ren_liver_H3K27me3_E12 3
4 encode3Ren_neural_tube_H3K27me3_E14 4
5 encode3Ren_neural_tube_H3K4me1_E12 5
6 encode3Ren_neural_tube_H3K27me3_E11 6
7 encode3Ren_neural_tube_H3K4me1_E15 7
8 encode3Ren_neural_tube_H3K4me2_E13 8
and I would like to extract from the column "histones" only the histone mark (i.e. H3K27me3, H3K4me2), putting it in a new column. I'm not able to use regular expressions, so any help is very much appreciated.
Please check str_extract from stringr (the pipe and mutate come from dplyr):
library(dplyr); library(stringr)
try %>% mutate(hist=str_extract(histones, '\\w\\d\\w\\d+.*\\d(?=\\_)'))
Created on 2023-01-21 with reprex v2.0.2
histones a hist
1 encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3 encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4 encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5 encode3Ren_neural_tube_H3K4me1_E12 5 H3K4me1
6 encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7 encode3Ren_neural_tube_H3K4me1_E15 7 H3K4me1
8 encode3Ren_neural_tube_H3K4me2_E13 8 H3K4me2
A base R option using gsub
cbind(try, mod = gsub(".*_([H\\d+])|_[Ee]\\d+$", "\\1", try$histones))
histones a mod
1 encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3 encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4 encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5 encode3Ren_neural_tube_H3K4me1_E12 5 H3K4me1
6 encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7 encode3Ren_neural_tube_H3K4me1_E15 7 H3K4me1
8 encode3Ren_neural_tube_H3K4me2_E13 8 H3K4me2
Well actually regular expressions are a good choice here:
try$mark <- str_extract(try$histones, "(?<=_)H\\d+K\\d+\\w+?(?=_)")
If you really can't use regex for some reason, here is an option using base R string functions:
x <- "encode3Ren_facial_prominence_H3K27me3_E10"
mark <- tail(unlist(strsplit(x, "_")), 2)[-2]
mark
[1] "H3K27me3"
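The snippet above handles a single string. A sketch of the same idea applied to several strings at once, assuming the mark is always the second-to-last underscore-separated field; histones, second_last and marks are names chosen here for illustration:

```r
histones <- c("encode3Ren_limb_H3K27me3_E10",
              "encode3Ren_facial_prominence_H3K27me3_E10",
              "encode3Ren_neural_tube_H3K4me1_E12")

# Split every string on "_" (fixed = TRUE treats "_" literally, not as a regex)
parts <- strsplit(histones, "_", fixed = TRUE)

# Take the second-to-last piece of each split
second_last <- function(p) p[length(p) - 1]
marks <- vapply(parts, second_last, character(1))
marks
```

If the column is stored as a factor, wrap it in as.character() first, since strsplit() only accepts character input.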

purrr map / lapply / sapply across groups of multiple (n > 1) elements at a time?

Suppose we have a vector, we can easily enough lapply, sapply or map across 1 element at a time.
Is there a way to do the same across groups of (>1) elements of the vector?
Example
Suppose we are constructing API calls by appending comma-separated user_identifiers to the URL, like so:
user_identifiers <- c("0011399", "0011400", "0013581", "0013769", "0013770", "0018374",
"0018376", "0018400", "0018401", "0018410", "0018415", "0018417",
"0018419", "0018774", "0018775", "0018776", "0018777", "0018778",
"0018779", "0021627", "0023492", "0023508", "0023511", "0023512",
"0024120", "0025672", "0025673", "0025675", "0025676", "0028226",
"0028227", "0028266", "0028509", "0028510", "0028512", "0028515",
"0028518", "0028520", "0028523", "0029160", "0033141", "0034586",
"0035035", "0035310", "0035835", "0035841", "0035862", "0036503",
"0036580", "0036583", "0036587", "0037577", "0038582", "0038583",
"0038587", "0039727", "0039729", "0039731", "0044703", "0044726"
)
get_data <- function(user_identifier) {
url <- paste0("https://www.myapi.com?userIdentifier=",
paste0(user_identifier, collapse=","))
fromJSON(url)
}
In the above, get_data(user_identifiers) would return the API's response for all 60 user_identifiers in one single request.
But suppose the API accepts a maximum of 10 identifiers at a time (so we cannot do all 60 at once).
A simple solution would be to map/lapply/sapply over each element, e.g. sapply(user_identifiers, get_data). This would work fine, but we would make 60 API calls when all we really need is 6. If we could map/lapply/sapply over groups of 10 at a time, that would be ideal.
Question
Is there an elegant way to map/lapply/sapply over groups of n elements at a time (where n>1)?
We can split user_identifiers in groups of 10 and use sapply/map/lapply
sapply(split(user_identifiers, gl(length(user_identifiers)/10, 10)), get_data)
where gl creates groups from 1 to 6 each of length 10.
gl(length(user_identifiers)/10, 10)
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
#Levels: 1 2 3 4 5 6
The same groups can be created with rep
rep(1:ceiling(length(user_identifiers)/10), each = 10)
As @thelatemail mentioned, we can use cut and specify the number of groups to cut the data into
sapply(split(user_identifiers, cut(seq_along(user_identifiers),6)), get_data)
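The gl and rep groupings fit this example because 60 is a multiple of 10. A sketch of a common idiom that also copes with a leftover partial group, using ceiling(seq_along(x) / n); the identifiers are made up and fake_get_data is a hypothetical stand-in for get_data so that the example runs without an API:

```r
# 25 made-up identifiers, deliberately not a multiple of 10
ids <- sprintf("%07d", 1:25)
n <- 10

# Group index: 1 for the first n elements, 2 for the next n, and so on
groups <- split(ids, ceiling(seq_along(ids) / n))

# Hypothetical stand-in for get_data(): builds the query string only
fake_get_data <- function(chunk) paste0(chunk, collapse = ",")

queries <- lapply(groups, fake_get_data)
lengths(groups)   # group sizes: 10, 10 and 5
```

The last group simply holds whatever is left over, so no identifier is dropped or recycled.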

R - Skip columns in pmax command if they do not exist

I'd like to use the pmax command to create a new column. My code looks like this:
Master <- Master %>%
mutate(RAM = pmax(RAM1, RAM2, RAM3, RAM4, RAM5, RAM6, RAM7, RAM8, RAM9, RAM10,
RAM11, RAM12, RAM13, RAM14, RAM15, RAM16, RAM17, RAM18,
RAM19, RAM20, RAM21, RAM22, RAM23, RAM24, RAM25, RAM26,
RAM27, RAM28, RAM29, RAM30, RAM31, RAM32, RAM33, RAM34,
RAM35, RAM36, RAM37, RAM38, RAM39, RAM40, RAM41, RAM42,
RAM43, RAM44, RAM45, RAM46, RAM47, RAM48, RAM49, RAM50,
RAM51, RAM52, RAM53, RAM54, RAM55, RAM56, RAM57, RAM58,
RAM59, RAM60, RAM61, RAM62, RAM63, RAM64, RAM65, RAM66,
RAM67, RAM68, RAM69, RAM70, RAM71, RAM72, RAM73, RAM74,
RAM75, RAM76, RAM77, RAM78, RAM79, RAM80, RAM81, RAM82,
RAM83, RAM84, RAM85, RAM86, RAM87, RAM88, RAM89, RAM90,
RAM91, RAM92, na.rm =T))
In my current data, however, only the columns RAM1 to RAM8 exist. In this case, I want R to skip all the other columns mentioned in the statement and use only columns RAM1 to RAM8 (it is okay if R displays an error message, but I don't want the program to stop running the code).
Any ideas how to do so?
Thanks!
One way to do this would be as follows:
Set up some data to make a reproducible example
set.seed(0)
Master <- data.frame(Other=100,RAM1=1:10, RAM2=1:10, RAM3=1:10, RAM4=1:10,
RAM5=1:10, RAM6=1:10, RAM7=1:10, RAM8=rnorm(10)+5)
Master[5,5] <- NA
Select required columns of the dataframe:
Master[colnames(Master) %in% paste0("RAM",1:92)]
Use do.call to run pmax using the selected columns as arguments, and adding the argument na.rm=TRUE
Master$RAM <- do.call(pmax, c(Master[colnames(Master) %in% paste0("RAM",1:92)], na.rm=TRUE))
Sample output:
Master
# Other RAM1 RAM2 RAM3 RAM4 RAM5 RAM6 RAM7 RAM8 RAM
#1 100 1 1 1 1 1 1 1 6.262954 6.262954
#2 100 2 2 2 2 2 2 2 4.673767 4.673767
#3 100 3 3 3 3 3 3 3 6.329799 6.329799
#4 100 4 4 4 4 4 4 4 6.272429 6.272429
#5 100 5 5 5 NA 5 5 5 5.414641 5.414641
#6 100 6 6 6 6 6 6 6 3.460050 6.000000
#7 100 7 7 7 7 7 7 7 4.071433 7.000000
#8 100 8 8 8 8 8 8 8 4.705280 8.000000
#9 100 9 9 9 9 9 9 9 4.994233 9.000000
#10 100 10 10 10 10 10 10 10 7.404653 10.000000
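A variant of the same idea, sketched under the assumption that every column of interest is named RAM followed only by digits: pick the columns by pattern instead of listing RAM1 to RAM92. The small data frame below is made up for illustration.

```r
Master <- data.frame(Other = 100,
                     RAM1 = c(1, 5, NA),
                     RAM2 = c(4, 2, 3),
                     RAMX = c(9, 9, 9))  # not matched: no digits after "RAM"

# Names of the RAM<number> columns that actually exist
ram_cols <- grep("^RAM[0-9]+$", names(Master), value = TRUE)

# Row-wise maximum over those columns, ignoring NA
Master$RAM <- do.call(pmax, c(Master[ram_cols], na.rm = TRUE))
Master$RAM   # 4 5 3
```

Because the column list is built from names(Master), columns that do not exist are never referenced and no error is raised.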

Stack columns row by row

I have a dataframe which contains 2 columns, such as
Name Seq
1 ENSE00000789668:ENSE00000789668 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGCGATCCATTTTCTCAGCCTATTAAATTTC
2 ENSE00000789668:ENSE00000814448 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTTTCAGCGGATGTTCTCTCCAGCTTTCAAC
3 ENSE00000789668:ENSE00000814452 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGTTTTGCTGGGCCTGCGTGATACTAGCGAT
4 ENSE00000789668:ENSE00001021870 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTGTCCCGTTTCCGGACCCGTCTCTATGGTG
5 ENSE00000789668:ENSE00001316145 CTCAAAATTTGCTGCAGCAGAAATTACTGAGATTCTCCTATGTGTGTCGTCTGCAGCCATC
6 ENSE00000789668:ENSE00001445604 CTCAAAATTTGCTGCAGCAGAAATTACTGAGCTGCTTGGCTTTGAGGAAGAGTGGCAGTAC
I wish to stack one column onto another row by row to give:
ENSE00000789668:ENSE00000789668
CTCAAAATTTGCTGCAGCAGAAATTACTGAGGCGATCCATTTTCTCAGCCTATTAAATTTC
ENSE00000789668:ENSE00000814448
CTCAAAATTTGCTGCAGCAGAAATTACTGAGTTTCAGCGGATGTTCTCTCCAGCTTTCAAC
ENSE00000789668:ENSE00000814452
CTCAAAATTTGCTGCAGCAGAAATTACTGAGGTTTTGCTGGGCCTGCGTGATACTAGCGAT
ENSE00000789668:ENSE00001021870
CTCAAAATTTGCTGCAGCAGAAATTACTGAGTGTCCCGTTTCCGGACCCGTCTCTATGGTG
ENSE00000789668:ENSE00001316145
CTCAAAATTTGCTGCAGCAGAAATTACTGAGATTCTCCTATGTGTGTCGTCTGCAGCCATC
ENSE00000789668:ENSE00001445604
CTCAAAATTTGCTGCAGCAGAAATTACTGAGCTGCTTGGCTTTGAGGAAGAGTGGCAGTAC
How do I do this?
You can try
data.frame(Col1=c(t(df)))
# Col1
#1 ENSE00000789668:ENSE00000789668
#2 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGCGATCCATTTTCTCAGCCTATTAAATTTC
#3 ENSE00000789668:ENSE00000814448
#4 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTTTCAGCGGATGTTCTCTCCAGCTTTCAAC
#5 ENSE00000789668:ENSE00000814452
#6 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGTTTTGCTGGGCCTGCGTGATACTAGCGAT
#7 ENSE00000789668:ENSE00001021870
#8 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTGTCCCGTTTCCGGACCCGTCTCTATGGTG
#9 ENSE00000789668:ENSE00001316145
#10 CTCAAAATTTGCTGCAGCAGAAATTACTGAGATTCTCCTATGTGTGTCGTCTGCAGCCATC
#11 ENSE00000789668:ENSE00001445604
#12 CTCAAAATTTGCTGCAGCAGAAATTACTGAGCTGCTTGGCTTTGAGGAAGAGTGGCAGTAC
Or
library(reshape2)
melt(t(df))[3]
Or maybe this too
data.frame(Col1=as.matrix(df)[c(matrix(seq(prod(dim(df))), nrow=2, byrow=TRUE))])
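Why c(t(df)) interleaves the two columns: t() turns each row of the data frame into a column of a character matrix, and c() reads a matrix column by column. A small sketch with shortened, made-up values:

```r
df <- data.frame(Name = c("id1", "id2", "id3"),
                 Seq  = c("AAA", "CCC", "GGG"),
                 stringsAsFactors = FALSE)

m <- t(df)   # 2 x 3 character matrix: one column per original row
dim(m)

# Flattening column by column gives Name, Seq, Name, Seq, ...
stacked <- data.frame(Col1 = c(m), stringsAsFactors = FALSE)
stacked$Col1
```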

Short(er) notation of selecting a part of a data.frame or other objects in R

I always get angry at my R code when I have to process dataframes, i.e. filtering out certain rows. The code gets very illegible as I tend to choose meaningful, but long, names for my objects. An example:
all.mutations.extra.large.name <- read.delim(filename)
head(all.mutations.extra.large.name)
id gene pos aa consequence V
ENSG00000105732 ZN574_HUMAN 81 x/N missense_variant 3
ENSG00000125879 OTOR_HUMAN 7 V/3 missense_variant 2
ENSG00000129194 SOX15_HUMAN 20 N/T missense_variant 3
ENSG00000099204 ABLM1_HUMAN 33 H/R missense_variant 2
ENSG00000103335 PIEZ1_HUMAN 11 Q/R missense_variant 3
ENSG00000171533 MAP6_HUMAN 39 A/G missense_variant 3
all.mutations.extra.large.name <- all.mutations.extra.large.name[which(all.mutations.extra.large.name$gene == "ZN574_HUMAN"), ]
So, to drop all the rows I am not interested in, I need to reference the object all.mutations.extra.large.name three times. Repeating this kind of step for different columns makes the code really difficult to understand.
Therefore my question: Is there a way to filter out rows by a criterion without referencing the object 3 times? Something like this would be beautiful: myobj[,gene=="ZN574_HUMAN"]
You can use subset for that:
subset(all.mutations.extra.large.name, gene == "ZN574_HUMAN")
Several options:
all.mutations.extra.large.name <- data.frame(a=1:5, b=2:6)
within(all.mutations.extra.large.name, a[a < 3] <- 0)
a b
1 0 2
2 0 3
3 3 4
4 4 5
5 5 6
transform(all.mutations.extra.large.name, b = b^2)
a b
1 1 4
2 2 9
3 3 16
4 4 25
5 5 36
Also check ?attach if you would like to avoid repetitive typing like all.mutations.extra.large.name$foo.
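A small sketch of subset() on a made-up data frame (muts and its values are illustrative, not the asker's data). The object name appears once per call, and the optional select argument keeps only some columns:

```r
muts <- data.frame(gene = c("ZN574_HUMAN", "OTOR_HUMAN", "ZN574_HUMAN"),
                   pos  = c(81, 7, 20),
                   V    = c(3, 2, 3),
                   stringsAsFactors = FALSE)

# Rows where gene matches; columns are referred to by bare name
hits <- subset(muts, gene == "ZN574_HUMAN")

# Same filter, keeping only two columns
hits_small <- subset(muts, gene == "ZN574_HUMAN", select = c(gene, pos))

nrow(hits)
names(hits_small)
```

Note that ?subset describes it as a convenience function for interactive use; inside functions, standard indexing with [ is usually the safer choice.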
