How to replace characters in a vector in R? [duplicate] - r

This question already has answers here:
How to reverse complement a DNA strand using R
(1 answer)
Reverse complementary Base
(2 answers)
Closed 1 year ago.
I have been tasked with writing an R function capable of taking a DNA string (s) as input and returning the complementary string on the reverse strand (e.g. "ACGT" returns "TGCA"). The result should look something like this:
> s <- "CCCTTAG"
> reverse.dna(s)
[1] "CTAAGGG"
I currently have the following functions for converting a string to a vector and vice versa. However, every attempt I have made to use replace() or switch() to substitute the complementary bases into either the string or the vector has been unsuccessful.
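# Split a string into a vector of single characters
string.to.vec <- function(s) {
  strsplit(s, "")[[1]]
}
# Join a character vector back into one string
vec.to.string <- function(v) {
  paste(v, collapse = "")
}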
As I have very limited experience using R I was wondering if anyone would be able to help me by suggesting the simplest method to implement this functionality into my function. Thanks!

We may use chartr() in combination with stri_reverse() from the stringi package:
library(stringi)
chartr("ACGT", "TGCA", stri_reverse(s))
[1] "CTAAGGG"
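If installing stringi is not an option, the same idea can be sketched in base R alone. This is a minimal sketch, assuming the input contains only the upper-case letters A, C, G and T; the function name reverse.dna is taken from the question, while the intermediate variable names are illustrative.

```r
# Reverse complement in base R: split, reverse, re-join, then swap bases.
reverse.dna <- function(s) {
  bases <- strsplit(s, "")[[1]]                 # "CCCTTAG" -> "C" "C" "C" "T" "T" "A" "G"
  reversed <- paste(rev(bases), collapse = "")  # "GATTCCC"
  chartr("ACGT", "TGCA", reversed)              # A<->T, C<->G
}

reverse.dna("CCCTTAG")
# [1] "CTAAGGG"
```

chartr() translates character by character, so it avoids the chained-replacement problem that replace() or repeated gsub() calls run into (for example, turning A into T and then that T back into A).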

The Bioconductor Biostrings package has functions for DNA strings. We convert s from the question to a DNAString object, run reverseComplement and then convert back to character class. Omit the last conversion if you want to keep it as a DNAString.
library(Biostrings)
s |> DNAString() |> reverseComplement() |> as.character()
## [1] "CTAAGGG"

Related

Convert column of Hex into Text in R

I am new to R, please have mercy. I imported a table from an Access database via odbc:
df <- select(dbReadTable(accdb_path, name ="accdb_table"),"Col_1","Col_2","Col_3")
For
> typeof(df$Col_3)
I get
[1] "list"
Using library(dplyr.teradata), I converted the blob to a string (maybe I am already on the wrong path here):
df$Hex <- blob_to_string(df$Col_3)
and now end up with a column (typeof = character) full of Hex:
df[1,4]
[1] 49206765742061206c6f74206f662048657820616e642068617665207468652069737375652077697468207370656369616c2063687261637465727320696e204765726d616e206c616e6775616765206c696b65206e2b4150592d7
My question is how to convert each value in Col_3 into proper text (if possible, respecting German special characters like ü, ö, ä and ß).
I am aware of the solution in "How to convert a hex string to text in R?", but I can't apply it properly:
df$Text <- rawToChar(as.raw(strtoi(df$Hex, 16L)))
Error in rawToChar(as.raw(strtoi(BinData$Hex, 16L))) :
string '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
Thx!
If I understand this correctly, what you want to do is apply a function to each element of a list so that it returns a character vector (which you can add to a data frame, if you so wish).
This can be easily accomplished with the purrr family of functions. The following takes each element of df$Col_3 and runs the function on it (with each element being the x in the given function):
purrr::map_chr(.x = df$Col_3,
.f = function(x) {rawToChar(as.raw(strtoi(x,16L)))})
You could probably achieve the same with base R functions such as lapply() followed by unlist(), or sapply(), but with purrr it is often easier to spot inconsistent results, because map_chr() fails unless every element yields a single string.
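For the already-converted hex column (df$Hex in the question), strtoi() cannot parse a whole long string at once; the string has to be cut into two-digit pairs first. The following is a minimal sketch, assuming each value has an even number of hex digits, contains no zero bytes, and holds UTF-8 encoded text. The helper name hex_to_text is hypothetical and does not come from any package.

```r
# Hypothetical helper: decode one hex string ("48616c6c6f") into text.
hex_to_text <- function(hex) {
  starts <- seq(1, nchar(hex), by = 2)          # positions of each byte's first digit
  pairs  <- substring(hex, starts, starts + 1)  # "48" "61" "6c" ...
  txt    <- rawToChar(as.raw(strtoi(pairs, 16L)))
  Encoding(txt) <- "UTF-8"                      # assumption: the bytes are UTF-8
  txt
}

hex_to_text("48616c6c6f")
# [1] "Hallo"

# Applied to a whole column (one string per row), for example:
# df$Text <- vapply(df$Hex, hex_to_text, character(1), USE.NAMES = FALSE)
```

If the source database stores the text in a different encoding, the Encoding() line would need to be replaced by an explicit iconv() call with the appropriate source encoding.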

Get Certain Patterns of Characters in a String in R

How can I predefine, in R, the patterns that I would like to keep in a string and then in a column of a data frame?
g <- c("3+kk120", "3+1121", "1+170", "1+kk5")
# I want to get
c("3+kk", "3+1", "1+1", "1+kk")
I am not quite sure if I understand you, but after replacing the two characters (kk) with one (X) you can use substr(). Then you can put the original characters back as follows:
sub("X","kk",substr(sub("kk","X",g),1,3))
gives,
# [1] "3+kk" "3+1" "1+1" "1+kk"
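An alternative that states the wanted pattern directly is to extract it with a regular expression. This is a sketch under the assumption that every string starts with one digit, a plus sign, and then either "kk" or a single digit; the variable name kept is illustrative.

```r
g <- c("3+kk120", "3+1121", "1+170", "1+kk5")

# Keep: one leading digit, "+", then either "kk" or one more digit.
kept <- regmatches(g, regexpr("^[0-9]\\+(kk|[0-9])", g))
kept
# [1] "3+kk" "3+1"  "1+1"  "1+kk"
```

Note that regmatches() silently drops elements without a match, so the result can be shorter than the input when some strings do not fit the pattern.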

Modify a substring matched by regex and place it back [duplicate]

This question already has answers here:
Applying a function to a backreference within gsub in R
(4 answers)
Closed 3 years ago.
I have an arbitrary string, say "1a2 2a1 3a2 10a5". I want to apply an arbitrary mathematical operation, say doubling, to some of the numbers, say any number followed by an "a".
I can extract the numbers I want with relative ease
string = "1a2 2a1 3a2 10a5"
numbers = stringr::str_extract_all(string,'[0-9]+(?=a)')[[1]]
and obviously, doubling them is trivial
numbers = 2*(as.integer(numbers))
But I am having problems placing the new results in their old positions to get the output "2a2 4a1 6a2 20a5". I feel like there should be a single function that accomplishes this, but all I can think of is recording the original indexes of the matches using gregexpr and manually placing the new results at those coordinates.
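We can use str_replace_all from stringr to match numbers followed by "a" and then multiply them by 2 in a callback function. The result is converted back to character, which recent stringr versions require of the callback.
stringr::str_replace_all(string, "\\d+(?=a)", function(m) as.character(as.integer(m) * 2))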
#[1] "2a2 4a1 6a2 20a5"
gsubfn is like gsub except that the second argument can be a string, function (possibly expressed in formula notation), a list or a proto object. If it is a function the capture groups are input into it and the match is replaced with the output of the function.
library(gsubfn)
string <- "1a2 2a1 3a2 10a5"
gsubfn("(\\d+)(?=a)", ~ 2L * as.integer(..1), string)
## [1] "2a2 4a1 6a2 20a5"
This variation also works. backref=0 says to pass the whole match into the function rather than the capture groups.
gsubfn("\\d+(?=a)", ~ 2 * as.integer(x), string, backref = 0)
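The gregexpr idea mentioned in the question can also be done in base R without tracking positions by hand, because regmatches() has a replacement form. This is a minimal sketch; the function name double_before_a is hypothetical.

```r
# Hypothetical helper: double every number that is directly followed by "a".
double_before_a <- function(string) {
  m <- gregexpr("[0-9]+(?=a)", string, perl = TRUE)  # lookahead needs perl = TRUE
  new_values <- lapply(
    regmatches(string, m),
    function(x) as.character(2L * as.integer(x))
  )
  regmatches(string, m) <- new_values                # write results back in place
  string
}

double_before_a("1a2 2a1 3a2 10a5")
# [1] "2a2 4a1 6a2 20a5"
```

The replacement values must be character, and there must be one list element per input string, which is exactly what lapply() over regmatches() returns.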

Extracting substring from string if pattern varies [duplicate]

This question already has answers here:
Function to extract domain name from URL in R
(5 answers)
Closed 4 years ago.
I would like to extract names from weblinks using substr(). My problem is that the patterns vary slightly, so I am not sure how to account for the variances. Here is a sample:
INPUT:
list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
What I want as Output
c("gatcoin", "appcoins", "pareto", "betbox", "aidcoin")
In my understanding I need to specify the start and end of the string to be extracted, but sometimes start would be "https://", while other times it would be "https://www."
How could I solve this?
You can do this easily with stringr...
library(stringr)
str_match(list, "\\/(www\\.)*(\\w+)\\.")[,3]
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"
The regex extracts the first run of word characters that follows a slash (plus an optional www.) and precedes the next dot.
The equivalent in base R is slightly messier...
sub(".+?\\/(?:www\\.)*(\\w+)\\..+", "\\1", list)
This adds the start and end of the string as well, replacing the whole lot with just the capture group you want. It sets the optional www. as a non-capturing group, as sub and str_match behave differently if the first group is not found.
list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
pattern <- c("https://", "www.", "http://")
for(p in pattern) list <- gsub(p, "", list, fixed = TRUE)  # fixed = TRUE: treat "." literally
unlist(lapply(strsplit(list, "[.]"), function(x) x[1]))
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"
You could use Regular Expressions. However, that is reinventing the wheel. People have thought about how to split URLs before so use an already existing function.
For example parse_url of the httr package. Or google "R parse URL" for alternatives.
urls <- list("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
Use lapply to use parse_url for every element of urls
parsed <- lapply(urls, httr::parse_url)
Now you have a list of lists. Each element of the list parsed has multiple elements itself, which contain the parts of the URL.
Extract all the elements parsed[[...]]$hostname:
hostname <- sapply(parsed, function(e) e$hostname)
Split those by the dot and take the second-to-last element:
domain <- strsplit(hostname, "\\.")
domain <- sapply(domain, function(d) d[length(d)-1])
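The same steps (isolate the hostname, split on dots, take the second-to-last part) can be sketched in base R when httr is not available. The function name extract_domain is hypothetical, and the sketch assumes every URL has a scheme and a hostname with at least two dot-separated parts; it will not handle multi-part suffixes such as .co.uk correctly.

```r
# Hypothetical helper: hostname -> second-to-last dot-separated label.
extract_domain <- function(urls) {
  host <- sub("^[a-zA-Z]+://", "", urls)   # drop the scheme (http://, https://)
  host <- sub("[/:?#].*$", "", host)       # drop port, path, query and fragment
  labels <- strsplit(host, ".", fixed = TRUE)
  vapply(labels, function(d) d[length(d) - 1], character(1))
}

urls <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
          "https://appcoins.io/pdf/appcoins_whitepaper.pdf",
          "https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
          "http://betbox.ai/BetBoxBizWhitepaper.pdf",
          "https://www.aidcoin.co/assets/documents/whitepaper.pdf")
extract_domain(urls)
# [1] "gatcoin"  "appcoins" "pareto"   "betbox"   "aidcoin"
```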
This regex captures the word after :// and an optional www. (with the dot escaped so that it matches literally):
(?::\/\/(?:www\.)?)(\w+)
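To apply a pattern of this kind in R, the backslashes have to be doubled inside the string literal, and the capture group can be pulled out with regexec() and regmatches(). This is a sketch with the dot after www escaped; the variable names links, m and site_names are illustrative.

```r
links <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
           "https://appcoins.io/pdf/appcoins_whitepaper.pdf",
           "http://betbox.ai/BetBoxBizWhitepaper.pdf")

# regexec() returns the full match plus each capture group per element.
m <- regmatches(links, regexec("://(?:www\\.)?(\\w+)", links, perl = TRUE))

# Element 1 is the full match, element 2 is the first capture group.
site_names <- vapply(m, `[`, character(1), 2)
site_names
# [1] "gatcoin"  "appcoins" "betbox"
```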

Meaning of Symbol %>% in R [duplicate]

This question already has an answer here:
What does %>% mean in R [duplicate]
(1 answer)
Closed 8 years ago.
I am an entry-level R user. Maybe this question sounds easy, but it would be great if someone could help.
What is the meaning of this symbol in R code?
%>%
Thank you
%>% is most commonly seen as an operator in code that uses the popular dplyr package (it is defined in magrittr and re-exported by dplyr).
It can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.
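As a small illustration of such chaining, the following sketch assumes the magrittr package is installed; the numbers and the variable name total are made up for the example.

```r
library(magrittr)

# Take a vector, then take square roots, then sum them.
total <- c(1, 4, 9) %>% sqrt() %>% sum()
total
# [1] 6

# The nested equivalent reads inside-out instead of left-to-right:
sum(sqrt(c(1, 4, 9)))
# [1] 6
```

Each %>% passes the value on its left as the first argument of the function call on its right, so no intermediate variables are needed.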
%>% means whatever you want it to mean, in Base R anyway:
> %>%
Error: unexpected SPECIAL in "%>%"
(which means that symbol is not defined.)
Binary operators are ones that have an input from the left and from the right of the operator, just like *, + etc. You use them as you would mathematically, like a * b, which R turns into the call '*'(a, b). R allows you to add your own binary operators via the %foo% syntax, with foo replaced by whatever you want, as long as it hasn't already been used by R, which includes %*% and %/% for example.
`%foo%` <- function(x, y) paste("foo", x, "and foo", y)
> 1 %foo% 2
[1] "foo 1 and foo 2"
%>% takes on a specific and well-defined meaning once you load the magrittr R package, for example, where it is used much as a pipe operator is in a Unix shell: to chain together a series of function calls.
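To make the connection between user-defined operators and piping concrete, here is a toy pipe built with the same %foo% mechanism. The operator name %then% is hypothetical and chosen to avoid clashing with magrittr; it is a deliberately simplified sketch, not how magrittr implements %>%.

```r
# Toy pipe: pass the left-hand value to the function on the right.
`%then%` <- function(lhs, f) f(lhs)

result <- c(1, 4, 9) %then% sqrt %then% sum
result
# [1] 6
```

User-defined %op% operators are evaluated left to right, so the expression above is read as (c(1, 4, 9) %then% sqrt) %then% sum. Unlike magrittr's %>%, this toy version only accepts a bare function on the right, not a call with extra arguments.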
