paste specific text to strings that do not have it - r

I would like to prepend "miR-" to the strings that do not already start with "miR", and skip those that do.
paste("miR", ....)
Input:
c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
Expected output:
c("miR-26b", "miR-26a", "miR-1297", "miR-4465", "miR-26b", "miR-26a")

One way is to remove "miR-" from the beginning of each string with sub() if it is present, and then paste it onto every string regardless.
paste0("miR-", sub("^miR-","", x))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
Data:
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
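A variation on the same idea, sketched with base R's startsWith() and ifelse(): it tests for the prefix explicitly instead of stripping and re-adding it. The helper name add_prefix is hypothetical and not part of the original answer.

```r
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")

# Hypothetical helper: prepend `prefix` only where it is missing.
add_prefix <- function(x, prefix = "miR-") {
  ifelse(startsWith(x, prefix), x, paste0(prefix, x))
}

result <- add_prefix(x)
```

Note that startsWith() does a literal comparison, so no regex escaping is needed if the prefix ever contains special characters.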

vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
sub("^(?!miR)(.*)$", "miR-\\1", vec, perl = TRUE)
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
If you want to learn more:
Type ?sub into the R console.
Learn regex, and take a closer look at negative lookahead and capturing groups.
I used perl = TRUE because the pattern throws an error without it: lookaheads such as (?!miR) are not supported by R's default regex engine.
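To see what the negative lookahead selects, here is a small sketch that applies it with grepl() and compares it with a lookahead-free test. The variable names needs_prefix, needs_prefix_plain and out are made up for illustration.

```r
vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")

# TRUE where the string does NOT start with "miR": the lookahead (?!miR)
# succeeds only if "miR" does not follow the start-of-string anchor.
needs_prefix <- grepl("^(?!miR)", vec, perl = TRUE)

# The same selection without a lookahead, using the default regex engine.
needs_prefix_plain <- !grepl("^miR", vec)

# Prepend the prefix only to the selected elements.
out <- vec
out[needs_prefix] <- paste0("miR-", vec[needs_prefix])
```

The second form avoids perl = TRUE altogether, which may be easier to read if the lookahead syntax is unfamiliar.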

Related

Generate full width character string in R

In R, how can I make the following:
convert this string: "my test string"
to something like this (a full-width character string): "ｍｙ　ｔｅｓｔ　ｓｔｒｉｎｇ"
Is there a way to do this through hexadecimal character encodings?
Thanks for your help, I'm really not sure how to even start. Perhaps something with {stringr}.
I'm trying to get an output similar to what I would expect from this online conversion tool:
http://www.linkstrasse.de/en/%EF%BD%86%EF%BD%95%EF%BD%8C%EF%BD%8C%EF%BD%97%EF%BD%89%EF%BD%84%EF%BD%94%EF%BD%88%EF%BC%8D%EF%BD%83%EF%BD%8F%EF%BD%8E%EF%BD%96%EF%BD%85%EF%BD%92%EF%BD%94%EF%BD%85%EF%BD%92
Here is a possible solution using the han2zen function from the archived Nippon package.
x <- "my test string"
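han2zen <- function(s){
  stopifnot(is.character(s))
  # Full-width digits, upper-case letters and lower-case letters
  zenEisu <- paste0(intToUtf8(65295 + 1:10), intToUtf8(65312 + 1:26),
                    intToUtf8(65344 + 1:26))
  # Full-width code points for the punctuation characters listed below
  zenKigo <- c(65281, 65283, 65284, 65285, 65286, 65290, 65291,
               65292, 12540, 65294, 65295, 65306, 65307, 65308,
               65309, 65310, 65311, 65312, 65342, 65343, 65372,
               65374)
  s <- chartr("0-9A-Za-z", zenEisu, s)
  # 65312 is the full-width "@", so "@" (not a second "#") goes in position 18
  s <- chartr('!#$%&*+,-./:;<=>?@^_|~', intToUtf8(zenKigo), s)
  # Replace ASCII spaces with the ideographic space U+3000
  s <- gsub(" ", intToUtf8(12288), s)
  return(s)
}
han2zen(x)
# [1] "ｍｙ　ｔｅｓｔ　ｓｔｒｉｎｇ"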
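On the question of doing this through code points: the full-width forms of printable ASCII (U+FF01 to U+FF5E) sit at a fixed offset of 65248 (0xFEE0) from their ASCII counterparts, so a rough sketch can shift the code points directly. The function name to_fullwidth is hypothetical and is not part of the Nippon package. Unlike han2zen above, it maps "-" to U+FF0D rather than to 12540.

```r
# Hypothetical helper: shift printable ASCII into the full-width block.
to_fullwidth <- function(s) {
  stopifnot(is.character(s))
  convert_one <- function(el) {
    if (is.na(el)) return(NA_character_)
    codes <- utf8ToInt(el)
    # Printable ASCII except space: 33..126 -> 65281..65374
    ascii <- codes >= 33L & codes <= 126L
    codes[ascii] <- codes[ascii] + 65248L
    # ASCII space -> ideographic space U+3000
    codes[codes == 32L] <- 12288L
    intToUtf8(codes)
  }
  vapply(s, convert_one, character(1), USE.NAMES = FALSE)
}

res <- to_fullwidth("my test string")
```

Characters outside the ASCII range are left as they are, since only code points 32 to 126 are touched.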

Read txt file as a numeric array in R

I am using RStudio on a Mac (macOS 10.14.6), and I am trying to read a text file that looks like this:
5:[0.12126984126984124, 0.11682539682539679, 0.14666666666666664, 0.07269841269841269, 0.06984126984126983, 0.0911111111111111, 0.1092063492063492, 0.12253968253968253, 0.08698412698412696, 0.09523809523809523, 0.12222222222222222, 0.10761904761904759]
I've used several iterations of "read", "read.delim", and "read.csv" and all pretty much do the same
> data.matrix(read.delim("data.txt",sep=','))
X5..0.12126984126984124 X0.11682539682539679 X0.14666666666666664 X0.07269841269841269 X0.06984126984126983
X0.0911111111111111 X0.1092063492063492 X0.12253968253968253 X0.08698412698412696 X0.09523809523809523
X0.12222222222222222 X0.10761904761904759.
Using "unlist", "as.numeric", "as.character" does not yield anything most likely due to the presence of the X in front of each number. Does anyone have ideas to read this file properly?
If you are only interested in the numbers, first delete the 5:[ at the beginning and the ] at the end, then read the rest using scan() with sep = ','. Here string is assumed to hold the line of text, for example as returned by readLines("data.txt").
scan(text=gsub("^.*\\[|\\]", "", string), sep = ",")
Read 12 items
[1] 0.12126984 0.11682540 0.14666667 0.07269841 0.06984127 0.09111111
[7] 0.10920635 0.12253968 0.08698413 0.09523810 0.12222222 0.10761905
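A self-contained sketch of the same steps, using only the first three values from the question so that it runs without the file. The variable names body, values and label are made up for illustration.

```r
# One line in the format shown in the question (first three values only).
string <- "5:[0.12126984126984124, 0.11682539682539679, 0.14666666666666664]"
# For a real file, something like: string <- readLines("data.txt")

# Drop everything up to and including "[" as well as the closing "]".
body <- gsub("^.*\\[|\\]", "", string)

# Parse the comma-separated numbers; quiet = TRUE suppresses "Read n items".
values <- scan(text = body, sep = ",", quiet = TRUE)

# The label before the colon, in case it is needed as well.
label <- as.integer(sub(":.*$", "", string))
```

The result is a plain numeric vector, so no "X" prefixes appear: those came from read.delim() treating the single line as a header row.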

Make string different

My data is as follows:
“Louis Hamilton”
“Tiger Wolf”
“Sachin Tendulkar”
“Lebron James”
“Michael Shoemaker”
“Hollywood – Career as an Actor”
I need to extract all the characters until a space or a dash(-) is reached
I need to extract no more than 10 characters
My desired output is
“Louis”
“Tiger”
“Sachin”
“Lebron”
“Michael”
“Hollywood”
I tried using the function below and it worked perfectly:
sub("^([^- ]+).*", "\\1", v1)
Now, how can I manipulate these names so that the output is as follows?
“Louis Wolf”
“Tiger:5”
“Sachin James”
“Lebron Tendulkar”
“Michael – Actor”
“Hollywood: Shoemaker”
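For the first step described in the question (everything up to a space or dash, capped at 10 characters), the sub() call on its own does not enforce the length limit. A minimal sketch that adds the cap with substr() follows. It uses an ASCII hyphen instead of the dash in the sample data, and the last entry is a hypothetical long name added only to show the truncation.

```r
v1 <- c("Louis Hamilton", "Tiger Wolf", "Sachin Tendulkar", "Lebron James",
        "Michael Shoemaker", "Hollywood - Career as an Actor",
        "Schwarzenegger Arnold")  # last entry is a hypothetical long name

# Keep the leading run of characters that are neither "-" nor a space.
first_word <- sub("^([^- ]+).*", "\\1", v1)

# Enforce the 10-character limit.
short <- substr(first_word, 1, 10)
```

This covers only the extraction step, not the second, rearranged output listed above.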

How to sub matching words with bracketed words?

Trying to create a function to bracket reserved words in Access for a SQL query:
library(dplyr)
tester <- data.frame(names=c("Add", "Date", "Test", "DOB"))
bracket_access <- function(x) {x %>% gsub(c("ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|
AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|
COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|
CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|
CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|
DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|
DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|
FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|
GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|
IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|
INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|
JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|
LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|
NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|
OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|
PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|
Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|
Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|
SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|
StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|
TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|
VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO"), paste0("[",.,"]"), ignore.case = T)
}
bracket_access(tester)
I get a numeric output and I don't really understand why:
> bracket_access(tester)
[1] "[c(1, 2, 4, 3)]"
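The unexpected output appears because x %>% gsub(...) inserts x as the first argument of gsub (the pattern), which shifts the other arguments, and because the whole data frame rather than the names column is coerced to a single string. You may fix the current approach by matching and capturing the strings equal to one of the alternatives you provided, replacing each match with [\\1], and passing the names column (tester$names) to the function: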
bracket_access <- function(x) {
gsub("^(ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO)$",
"[\\1]",
x,
ignore.case = TRUE)
}
bracket_access(tester$names)
## => [1] "[Add]" "[Date]" "Test" "DOB"
Here, the gsub pattern looks like ^(word1|word2|...|wordN)$ and once there is a match, the whole string is wrapped with [...] and put back (the \\1 is a placeholder for the capturing group #1 in the pattern, and there is one defined with a pair of unescaped parentheses).
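The ^(word1|word2|...|wordN)$ pattern can also be assembled from a character vector, which may be easier to maintain than one long string. This is a sketch with a short illustrative subset of the reserved words. The names reserved and bracket_reserved are hypothetical.

```r
# Short illustrative subset of the reserved words from the question.
reserved <- c("ADD", "DATE", "SELECT", "GROUP BY")

# Hypothetical helper: wrap x in [...] when it equals a reserved word.
bracket_reserved <- function(x, words = reserved) {
  # Build "^(ADD|DATE|SELECT|GROUP BY)$" from the vector.
  pattern <- paste0("^(", paste(words, collapse = "|"), ")$")
  gsub(pattern, "[\\1]", x, ignore.case = TRUE)
}

res <- bracket_reserved(c("Add", "Date", "Test", "DOB"))
# c("[Add]", "[Date]", "Test", "DOB")
```

This assumes the words contain no regex metacharacters. If they might, a comparison such as toupper(x) %in% toupper(words) avoids regular expressions entirely.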

trying to prune down a list of files

I have a list of files and I'm trying to extract all layer1_*.grd files. Is there a way of doing this in one grep expression?
lof <- c("layer1_1.grd", "layer1_1.gri", "layer1_2.grd", "layer1_2.gri",
"layer1_3.grd", "layer1_3.gri", "layer1_4.grd", "layer1_4.gri",
"layer1_5.grd", "layer1_5.gri", "layer2_1.grd", "layer2_1.gri",
"layer2_2.grd", "layer2_2.gri", "layer2_3.grd", "layer2_3.gri",
"layer2_4.grd", "layer2_4.gri", "layer2_5.grd", "layer2_5.gri",
"layer3_1.grd", "layer3_1.gri", "layer3_2.grd", "layer3_2.gri",
"layer3_3.grd", "layer3_3.gri", "layer3_4.grd", "layer3_4.gri",
"layer3_5.grd", "layer3_5.gri", "layer4_1.grd", "layer4_1.gri",
"layer4_2.grd", "layer4_2.gri", "layer4_3.grd", "layer4_3.gri",
"layer4_4.grd", "layer4_4.gri", "layer4_5.grd", "layer4_5.gri")
I tried doing this in two steps:
list.of.files <- list.files(pattern = c("1_"))
list.of.files <- list.of.files[grep(".grd", list.of.files)]
Can someone enlighten me how to do this with grep in one step? I naively tried passing list() and c() to the grep but, as you can imagine, it doesn't work.
list.of.files <- list.files()
list.of.files <- list.of.files[grep(list("1_", ".grd"), list.of.files)]
This should work for you:
> lof[grep("layer1_.*.grd", lof)]
[1] "layer1_1.grd" "layer1_2.grd" "layer1_3.grd" "layer1_4.grd" "layer1_5.grd"
Also, just to clarify your terminology: your list of files is not really a list; it's a character vector.
The stringr alternative is lof[str_detect(lof, "layer1_.*.grd")].
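In fact, in this case you can be even more specific about the characters between the prefix and the extension, so "layer1_[[:digit:]].grd" would work as the pattern here, and might be faster if lof is very long. Note that the unescaped . before grd matches any character; write \\. in the R string to match a literal dot.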
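A stricter version of the pattern, sketched on a subset of lof: it anchors both ends, escapes the dot, and uses value = TRUE so that grep() returns the matching names directly. The file name "layer1_1_grd" is a hypothetical example used only to show what the unescaped dot lets through.

```r
lof <- c("layer1_1.grd", "layer1_1.gri", "layer1_2.grd",
         "layer2_1.grd", "layer2_1.gri")

# Anchored pattern with a literal dot before the extension.
grd_layer1 <- grep("^layer1_[[:digit:]]+\\.grd$", lof, value = TRUE)

# The looser pattern also accepts names without a real ".grd" extension.
loose  <- grepl("layer1_.*.grd", "layer1_1_grd")                   # hypothetical name
strict <- grepl("^layer1_[[:digit:]]+\\.grd$", "layer1_1_grd")
```

The same regular expression can be passed straight to list.files(pattern = ...), which does the filtering in one step.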
