Is there a way to extract part of a string when there is a match, taking everything from the match up to the next underscore ("_")?
From: mycampaign_s22uhd4k_otherinfo I need: s22uhd4k.
From: my_campaign_otherinfo_s22jumpto_otherinfo, I would need: s22jumpto
data:
df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo", "my_campaign_otherinfo_s22jumpto_otherinfo"
), b = c(1, 2)), class = "data.frame", row.names = c(NA, -2L))
Thanks Omar, based on your update/comment, this regex will solve your problem:
df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo",
"my_campaign_otherinfo_s22jumpto_otherinfo",
"e220041_pe_mx_aon_aonjulio_conversion_shop_facebook-network_ppl_primaria_s22test512gb_hotsale_20220620"
), b = c(1, 2, 3)), class = "data.frame", row.names = c(NA, -3L))
gsub(df$a, pattern = ".*(s22[^_]+(?=_)).*", replacement = "\\1", perl = TRUE)
#> [1] "s22uhd4k" "s22jumpto" "s22test512gb"
Created on 2022-07-17 by the reprex package (v2.0.1)
Explanation:
.*(s22[^_]+(?=_)).*
.* match any characters before the first capture group
(s22 open the first capture group, which must start with "s22"
[^_]+ after "s22", match one or more characters that are not "_"
(?=_) stop just before the next "_" (positive lookahead, which requires perl = TRUE)
) close the first capture group
.* match all remaining characters
Then replacement = "\\1" replaces the whole string with the captured text (the part you want).
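One caveat with the gsub() approach: when a string contains no "s22" token, the pattern does not match and gsub() returns the input unchanged. A minimal sketch of an alternative, assuming you would rather get NA in that case, uses regexpr() and regmatches(). The third input string and the names `a`, `m` and `codes` are made up for illustration.

```r
a <- c("mycampaign_s22uhd4k_otherinfo",
       "my_campaign_otherinfo_s22jumpto_otherinfo",
       "no_match_here")                      # hypothetical input without a token

# Locate the first "s22..." token that is followed by an underscore
m <- regexpr("s22[^_]+(?=_)", a, perl = TRUE)

# regmatches() returns only the matched pieces, so fill a vector of NAs
codes <- rep(NA_character_, length(a))
codes[m != -1] <- regmatches(a, m)
codes
#> [1] "s22uhd4k"  "s22jumpto" NA
```

Note that regexpr() reports the first token in each string, whereas the greedy leading .* in the gsub() pattern keeps the last one; the two agree whenever a string holds a single token.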
Related
I have a character-vector with the following structure:
GDM3
PER.1.1.1_1
PER.1.10.2_1
PER.1.1.32_1
PER.1.1.4_1
PER.1.1.5_1
PER.11.29.1_1
PER.1.2.2_1
PER.31.2.3_1
PER.1.2.44_1
PER.5.2.25_1
I want to extract the three numbers in the middle of each ID and add leading zeros if they are only single digits. The final vector can be a character vector again. In the end the result should look like this:
GDM3
010101
011002
010132
010104
010105
112901
010202
310203
010244
050225
tmp <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", X$GDM3,
proto = list(a=0L, b=0L, c=0L)) |>
lapply(sprintf, fmt = "%02i")
do.call(paste0, tmp)
# [1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"
Explanation:
strcapture extracts the known patterns into a data.frame, with names and classes defined in proto (the actual values in proto are not used);
lapply(sprintf, fmt = "%02i") zero-pads every column to 2 digits, returning a list of character vectors;
do.call(paste0, tmp) concatenates the padded values element-wise, giving one string per original row.
Data
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
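To make the intermediate steps of the strcapture approach easier to follow, here is a small sketch on two of the IDs. It assumes a zero-row data.frame as the proto (the form shown in the strcapture documentation); the names `ids`, `parts`, `padded` and `result` are illustrative only.

```r
ids <- c("PER.1.1.1_1", "PER.11.29.1_1")

# Step 1: each capture group becomes an integer column named after proto
parts <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", ids,
                    proto = data.frame(a = integer(), b = integer(),
                                       c = integer()))
str(parts)   # a data.frame with 2 rows and integer columns a, b, c

# Step 2: zero-pad every column; lapply() returns a list of character vectors
padded <- lapply(parts, sprintf, fmt = "%02i")

# Step 3: paste0() joins the columns element-wise, with no separator
result <- do.call(paste0, padded)
result
# [1] "010101" "112901"
```

Using paste0 rather than paste matters here: paste would insert a space between the three padded pieces.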
Assuming GDM3 is as shown in the Note at the end, read it into a data frame and then use sprintf to create the result.
with( read.table(text = GDM3, sep = ".", comment.char = "_"),
sprintf("%02d%02d%02d", V2, V3, V4) )
giving:
[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203"
[9] "010244" "050225"
Note
GDM3 <- c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1",
"PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1",
"PER.1.2.44_1", "PER.5.2.25_1")
Another solution:
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
strsplit(X$GDM3, "\\.|_") |>
sapply(function(x) paste0(sprintf("%02i", as.numeric(x[2:4])), collapse = ""))
#[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"
I have the following vector:
a <- c("teste3/Nova pasta3/texto33.txt", "teste3/texto3.txt", "teste3/Nova pasta3",
"teste3")
In some cases I have a data frame rather than a vector:
structure(list(filename = c("teste1/", "teste1/Nova pasta1/",
"teste1/Nova pasta1/texto11.txt", "teste1/texto1.txt", "teste1/New Folder/"
)), class = "data.frame", row.names = c(NA, -5L))
I would like to get the names that are between slashes (/*/).
In this case that is just the name (Nova pasta3) for the vector and the name (Nova pasta1) for the data frame.
Thanks
I am trying to get the start and end positions of each run of "-" characters in column V1, the characters at those positions in column V2, and the length of the run.
Any help will be appreciated!
ip <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz")), class = "data.frame", row.names = c(NA,
-3L))
I tried the locate functions from stringi, but they report each dash individually. For example, for "ab---cdef" the output is 3-3, 4-4, 5-5 instead of 3-5.
Expected output:
op <- structure(list(V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"), output = c("x:x-3:5-3",
"x:y-5:7-3", "x:x-2:3-2; y-z:6:7-2")), class = "data.frame", row.names = c(NA,
-3L))
The output column must contain, in order:
1. the characters in the V2 column at the start and end of each "-" run in V1,
2. the start and end positions,
3. the length of the run.
V1 V2 output
ab---cdef xxxxxxxyy x:x-3:5-3
Thanks!
Here's an example using gregexpr to get all the matches in a string.
x <- gregexpr("-+", ip$V1)
mapply(function(m, s, r) {
start <- m
len <- attr(m, "match.length")
end <- start + len-1
part <- mapply(substr, r, start, end)
paste0(part, "-", start, ":", end, "-", len, collapse=";")
}, x, ip$V1, ip$V2)
# [1] "xxx-3:5-3"
# [2] "xyy-5:7-3"
# [3] "xx-2:3-2;yz-6:7-2"
I'm not sure what your logic was for turning xxx into x:x or xyy into x:y, or how that generalizes to other sequences, so feel free to change that part. But you can get the start and length of the matches using the attributes of the returned match object. It's just important to use -+ as the pattern so you match a run of dashes rather than just a single dash.
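If the intended format is "first character:last character-start:end-length" for each run (an assumption read off the expected output in the question), the same gregexpr idea can be adapted as follows. The helper name `describe_runs` is hypothetical.

```r
ip <- data.frame(
  V1 = c("ab---cdef", "abcd---ef", "a--bc--def"),
  V2 = c("xxxxxxxyy", "xxxxxyyyy", "xxxyyyzzzz"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: describe every run of dashes in v1 using characters of v2
describe_runs <- function(v1, v2) {
  m <- gregexpr("-+", v1)[[1]]
  if (m[1] == -1) return(NA_character_)    # no dashes in this string
  start <- as.integer(m)
  len   <- attr(m, "match.length")
  end   <- start + len - 1L
  first <- substring(v2, start, start)     # character of v2 at the run start
  last  <- substring(v2, end, end)         # character of v2 at the run end
  paste0(first, ":", last, "-", start, ":", end, "-", len, collapse = "; ")
}

ip$output <- unname(mapply(describe_runs, ip$V1, ip$V2))
ip$output
# [1] "x:x-3:5-3"            "x:y-5:7-3"            "x:x-2:3-2; y:z-6:7-2"
```

substring() is vectorized over the start and end positions, so a string with several runs is handled without an inner loop.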
I am doing some data analysis with R. I read in a csv file. I would like to eliminate 000,000,000 from each cell. How can I get rid of only the 000 entries? I tried to use grep(), but it dropped rows.
This is the dataframe:
You can try this. I have included dummy data based on your screenshot (and please pay attention to the comment by @andrew_reece):
#Code
df$NewVar <- trimws(gsub('000','',df$VIOLATIONS_RAW),whitespace = ',')
Output:
VIOLATIONS_RAW NewVar
1 202,403,506,000 202,403,506
2 213,145,123 213,145,123
3 212,000 212
4 123,000,000,000 123
Some data used:
#Data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
We could also do this in a more general way, removing any field that consists only of 0s, however many there are:
df$VIOLATIONS_RAW <- trimws(gsub("(?<=,)0+(?=(,|$))", "",
df$VIOLATIONS_RAW, perl = TRUE), whitespace=",")
df$VIOLATIONS_RAW
#[1] "202,403,506" "213,145,123" "212" "123"
data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
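Both regex answers rely on trimws(), which only strips commas from the ends of the string, so a zero field in the middle (for example "101,000,202") would leave an empty field behind ("101,,202"). A sketch of an alternative, assuming the cells are plain comma-separated fields, splits each cell, drops the all-zero fields and rejoins the rest. The fifth row and the helper name `drop_zero_fields` are made up for illustration.

```r
df <- data.frame(
  VIOLATIONS_RAW = c("202,403,506,000", "213,145,123", "212,000",
                     "123,000,000,000",
                     "101,000,202"),       # hypothetical row: zeros in the middle
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove fields made up entirely of zeros
drop_zero_fields <- function(x) {
  fields <- strsplit(x, ",", fixed = TRUE)
  vapply(fields,
         function(f) paste(f[!grepl("^0+$", f)], collapse = ","),
         character(1))
}

df$NewVar <- drop_zero_fields(df$VIOLATIONS_RAW)
df$NewVar
# [1] "202,403,506" "213,145,123" "212"         "123"         "101,202"
```

Because whole fields are compared, a value such as "1000" is left untouched, which a plain gsub('000', '', ...) would shorten to "1".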
I have a dataframe
A | B
A_Long_String | 7.1123
Another_String | 1234124
and I need to change it into a long string to feed into sql proc.
So the format needs to be
' "A_Long_String": "7.1123" \n "Another_String": "1234124" \n '
etc etc
What is the best way of doing this? Do I just have to loop?
The dataframe can be upwards of 100,000 rows.
Thanks
Try
df1[] <- lapply(df1, dQuote)
paste(do.call(paste, c(df1, sep=": ")), '\n', collapse=' ')
#[1] "“A_Long_String”: “7.1123” \n “Another_String”: “1234124” \n"
data
df1 <- structure(list(A = c("A_Long_String", "Another_String"),
B = c(7.1123, 1234124)), .Names = c("A", "B"), class = "data.frame",
row.names = c(NA, -2L))
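Note that dQuote() produced curly quotes in the output above, which depends on the useFancyQuotes option and may not be what a SQL procedure expects. A minimal vectorized sketch that always emits straight double quotes, assuming the values need no escaping, uses sprintf(); no loop is required, so it should scale to 100,000 rows. The names `pairs` and `result` are illustrative.

```r
df1 <- data.frame(A = c("A_Long_String", "Another_String"),
                  B = c(7.1123, 1234124),
                  stringsAsFactors = FALSE)

# Build one "key": "value" pair per row; %s converts B as as.character() would
pairs <- sprintf('"%s": "%s"', df1$A, df1$B)

# Append a newline to each pair and join everything into a single string
result <- paste(pairs, "\n", collapse = " ")
cat(result)
```

If a key or value can itself contain a double quote, it would need escaping before being passed to sprintf().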