start and end positions of a character; R - r

I am trying to get
start and end positions of "-" character in column V1
and its corresponding characters at these positions in column V2
Then length of it
Any help will be appreciated!
ip <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz")), class = "data.frame", row.names = c(NA,
-3L))
I tried stringi_locate but it outputs for individual position. For example, For this "ab---cdef" instead of 3-5 it outputs 3-3, 4-4, 5-5.
Expected output:
op <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"), output = c("x:x-3:5-3",
"x:y-5:7-3", "x:x-2:3-2; y-z:6:7-2")), class = "data.frame", row.names = c(NA,
-3L))
the output column must have
The characters in V2 column with respect to start and end of "-" in V1
Then start and end position
Then its length
V1 V2 output
ab---cdef xxxxxxxyy x:x-3:5-3
Thanks!

Here's an example using grepexpr to get all the matches in a string.
x <- gregexpr("-+", ip$V1)
mapply(function(m, s, r) {
start <- m
len <- attr(m, "match.length")
end <- start + len-1
part <- mapply(substr, r, start, end)
paste0(part, "-", start, ":", end, "-", len, collapse=";")
}, x, ip$V1, ip$V2)
# [1] "xxx-3:5-3"
# [2] "xyy-5:7-3"
# [3] "xx-2:3-2;yz-6:7-2"
I'm not sure what your logic was for turning xxx into x:x or xyy to x-y or how that generalized to other sequences so feel free to change that part. But you can get the start and length of the matches using the attributes of the returned match object. It's just important to use -+ as the pattern so you match a run of dashes rather than just a single dash.

Related

Extract numbers from a character vector and adding leading zeros

I have a character-vector with the following structure:
GDM3
PER.1.1.1_1
PER.1.10.2_1
PER.1.1.32_1
PER.1.1.4_1
PER.1.1.5_1
PER.11.29.1_1
PER.1.2.2_1
PER.31.2.3_1
PER.1.2.44_1
PER.5.2.25_1
I want to extract the three numbers in the middle of middle of that ID and add leading numbers if they are only single digits. The finale vector can be a character vector again. In the end the result should look like this:
GDM3
010101
011002
010132
010104
010105
112901
010202
310203
010244
050225
tmp <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", X$GDM3,
proto = list(a=0L, b=0L, c=0L)) |>
lapply(sprintf, fmt = "%02i")
do.call(paste0, tmp)
# [1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"
Explanation:
strcapture extracts the known patterns into a data.frame, with names and classes defined in proto (the actual values in proto are not used);
lapply(sprintf, fmt="%02i") zero-pads to 2 digits all columns of the frame
do.call(paste, tmp) concatenates each row of the frame into a single string.
Data
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
Assuming GDM3 shown in the Note at the end, read it creating a data frame and the use sprintf to create the result.
with( read.table(text = GDM3, sep = ".", comment.char = "_"),
sprintf("%02d%02d%02d", V2, V3, V4) )
giving:
[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203"
[9] "010244" "050225"
Note
GDM3 <- c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1",
"PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1",
"PER.1.2.44_1", "PER.5.2.25_1")
Another solution:
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
strsplit(X$GDM3, "\\.|_") |>
sapply(function(x) paste0(sprintf("%02i", as.numeric(x[2:4])), collapse = ""))
#[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"

extract until next "_" if contains

Is there a way to extract part of string, when there is a match (everything up to the next underscore) "_"?
From: mycampaign_s22uhd4k_otherinfo I need: s22uhd4k.
From: my_campaign_otherinfo_s22jumpto_otherinfo , I would need: s22jumpto
data:
df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo", "my_campaign_otherinfo_s22jumpto_otherinfo"
), b = c(1, 2)), class = "data.frame", row.names = c(NA, -2L))
Thanks Omar, based on your update/comment, this regex will solve your problem:
df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo",
"my_campaign_otherinfo_s22jumpto_otherinfo",
"e220041_pe_mx_aon_aonjulio_conversion_shop_facebook-network_ppl_primaria_s22test512gb_hotsale_20220620"
), b = c(1, 2, 3)), class = "data.frame", row.names = c(NA, -3L))
gsub(df$a, pattern = ".*(s22[^_]+(?=_)).*", replacement = "\\1", perl = TRUE)
#> [1] "s22uhd4k" "s22jumpto" "s22test512gb"
Created on 2022-07-17 by the reprex package (v2.0.1)
Explanation:
.*(s22[^_]+(?=_)).*
.* match all characters up until the first capture group
(s22 the first capture group starts with "s22"
[^_]+ after "s22", match any character except "_"
(?=_) until the next "_" (positive look ahead)
) close the first capture group
.* match all remaining characters
Then, the replacement = "\\1" means to just print the captured text (the part you want)

odd behavior when substituting parts of a string within a for loop

I'm trying to replace a series of numbers in a character string with information that comes from a dataframe.
My string comes from a text file that I imported using the readr package as follows: read_file("Human.txt")
I've checked the class, it is character. The string contains the following information (I've named it treeString):
"(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
My dataframe (labels.csv) was originally in factor format, but I changed the format of the second column to character using the following command: labels[,2] = as.character(labels[,2]). It looks like this
v1 v2
1 1 name1
2 2 name2
3 3 name3
My goal is to substitute every number in the string with the corresponding name (i.e. V2) in the dataframe. This should result in the following:
"(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Here is the code I am using to accomplish this:
for(i in 1:nrow(labels)){
gsub(as.character(i), labels[i,2], treeString)
}
The weird thing is that if I run the gsub() command on its own (with specified numbers - eg. 2) it does the substitution, however, when I run it in a loop it does not substitute the numbers.
As pointed out by Kumar Manglam in the comments, you forgot to assign the result of gsub() back to treeString.
There is something else you should be aware of: The way you specified the regular expression in your question it will also replace patterns like "(241)" with "(name24name1)". To avoid this behaviour, you should check whether the numbers you want to replace are preceded by a comma or opening parenthesis and succeeded by a comma or closing parenthesis:
# Option1
for(i in 1:nrow(labelnames)){
reg_pattern <- paste0("(?<=[(,])(", i, ")(?=[),])")
treeString <- gsub(reg_pattern, labelnames$v2[i], treeString, perl=T)
}
Another, nicer, option is drop the for-loop and do it all at once:
# Option2
reg_pattern <- paste0("(?<=[(,])([1-", nrow(labelnames), "])(?=[),])")
treeString <- gsub(reg_pattern, "name\\1", treeString, perl=T)
# Result
treeString
# "(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Data
treeString <- "(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
labelnames <- structure(list(v1 = 1:3, v2 = c("name1", "name2", "name3")), .Names = c("v1", "v2"), class = "data.frame", row.names = c(NA, -3L))

Paste value after certain delimiter

I have data in the following format:
In Column A:
String1__String2__String3
In Column B:
Value
I would like to paste the Value into the String after the first delimiter like this:
String1__Value__String2__String3
The crucial part of the code I am using now (where I paste the value) is the following line:
df2 <-cbind(df[1],apply(df[,2:ncol(df)],2,function(i)ifelse(is.na(i), NA, paste(df[,1],i,sep="_"))))
With this code it append the value after the string, like this:
String1__String2__String3__Value
Is there an easy way to rearrange this so the Values will be pasted at the correct place. Or do I have to redo the complete code ?
Thanks
Update, Example:
Column A:
Jennifer__DoesSomething__inaCity
Column B:
2
Result now:
Jennifer__DoesSomething__inaCity__2
Desired result:
Jennifer__2__DoesSomething__inaCity
The strings Jennifer, DoesSomething, inaCity change and are not the same length. Only the delimiter stays the same. I want to paste after the first delimiter.
Thanks !
Here is an idea. Using sub we only replace the first seen pattern. So using mapply we replace all the numbers in one column with their corresponding strings on the second column.
mapply(function(x, y) sub('__', paste0('__', y, '__'), x), df$v1, df$v2)
# atsfs__dsfgg__sdgsdg eeee__FFFF__GGGG
#"atsfs__3__dsfgg__sdgsdg" "eeee__5__FFFF__GGGG"
DATA
dput(df)
structure(list(v1 = c("atsfs__dsfgg__sdgsdg", "eeee__FFFF__GGGG"
), v2 = c(3, 5)), .Names = c("v1", "v2"), row.names = c(NA, -2L
), class = "data.frame")

extract string using substring in column data.frame using another column's integer as argument

If I have a data.frame how can I use the v2 values to substring v1.
df <- data.frame(v1 = c("jsdlfkjs", "fjdslkkkkfj", "jdkskksjdjslak"),
v2 = c(3,4,2))
What to apply something like this :
res <- substring(df$v1, start = df$v2-1, stop = df$v2+1)
and get
res
# [1] "sdl" "dsl" "jdk"
You're using the wrong arguments for substring. Look at ?substring for more information. You want to use first, last not start, stop
res <- substring(df$v1, first = df$v2-1, last = df$v2+1)

Resources