Split and re-concatenate a string

Split and re-concatenate a string - r

I am trying to get the host of an IP address from a list of strings.
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
I want to get the first 2 digits from the ips. output:
ips <- c('140.112', '132.212', '31.2', '7.112')
This is the code that I wrote to convert them:
cat(unlist(strsplit(ips, "\\.", fixed = FALSE))[1:2], sep = ".")
When I check the type of individual ips in the end I get something like this:
140.112 NULL
Not sure what I am doing wrong. If you have some other ideas completely different from this that is completely fine too.

With sub:
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
sub('\\.\\d+\\.\\d+$', '', ips)
# [1] "140.112" "132.212" "31.2" "7.112"
With str_extract from stringr:
library(stringr)
str_extract(ips, '^\\d+\\.\\d+')
# [1] "140.112" "132.212" "31.2" "7.112"
With strsplit + sapply:
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.'))
# [1] "140.112" "132.212" "31.2" "7.112"
With read.table + apply:
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.')
#[1] "140.112" "132.212" "31.2" "7.112"
Notes:
sub('\\.\\d+\\.\\d+$', '', ips):
i. \\.\\d+\\.\\d+$ matches a literal dot, a digit one or more times, a literal dot again, and a digit one or more times at the end of the string
ii. sub removes the above match from the string
str_extract(ips, '^\\d+\\.\\d+'):
i. ^\\d+\\.\\d+ matches a digit one or more times, a literal dot and a digit one or more times in the beginning of the string
ii. str_extract extracts the above match from the string
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.')):
i. strsplit(ips, '\\.') splits each ip using a literal dot as the delimiter. This returns a list of vectors after the split
ii. With sapply, paste(x[1:2], collapse = '.') is applied to every element of the list, thus taking only the first two numbers from each vector, and collapsing them with a dot as the separator. sapply then coerces the list to a vector, thus returning a vector of the desired ips.
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.'):
i. read.table(textConnection(ips), sep='.')[1:2] treats ips as text input and reads it in with dot as a delimiter. Only taking the first two columns.
ii. apply enables paste to be operated on each row, and collapses with a dot.

Could you please try following.
gsub("([0-9]+.[0-9]+)(.*)","\\1",ips)
Explanation: Using gsub function and putting regex there to match digits then DOT then digits in memory's 1st place holder and keeping .* everything after it in 2nd place holder of memory. Then substituting these with \\1 with first regex's value which will be first 2 fields.

One solution is the following:
vapply(strsplit(ips, ".", fixed = TRUE),
function(x) paste(x[1:2], collapse = "."),
character(1L))
vapply applies function(x) to each element of the output of strsplit
strsplit produces a list where each element of the list is the components of the IP addresses separated by "."; setting fixed = TRUE requests to split using the exact value of the splitting string (i.e., "."), not using regex
function(x) takes the first two elements (x[1:2]) of each item coming out of strsplit and pastes them together, seperated by "."
character(1L) tells vapply that each element of the output (i.e., returned from function(x) should be a string of length 1.
Edit: #useR posted this solution right before me (using sapply).

substr is vectorised on the stop argument, so you can use this with a vector of positions before the second dot. regexpr gives the positions of the first match, so if you sub out the first one you can match on the second - which will be conveniently one before it's true position as needed (since you removed the first one).
substr(ips,1,regexpr("\\.",sub("\\.","",ips)))
[1] "140.112" "132.212" "31.2" "7.112"

We can convert the ip addresses to numeric_version class and then format using this base R one-liner that employs no regular expressions:
format(numeric_version(ips)[, 1:2])
[1] "140.112" "132.212" "31.2" "7.112"

Related

R splitting string on predefined location

I have string, which should be split into parts from "random" locations. Split occurs always from next comma after colon.
My idea was to find colons with
stringr::str_locate_all(test, ":") %>%
unlist()
then find commas
stringr::str_locate_all(test, ",") %>%
unlist()
and from there to figure out position where it should be split up, but could not find suitable way to do it. Feels like there is always 6 characters after colon before the comma, but I can't be sure about that for whole data.
Here is example string:
dput(test)
"AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
Here is what result should be
dput(result)
c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037",
"ATs:0.0024", "ATo:0.5678")

Perehaps we can use regmatches like below
> regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]
[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
[3] "AK,AQ,AJs,AJo:0.9037" "ATs:0.0024"
[5] "ATo:0.5678"

here is one option with strsplit - replace the , after the digit followed by the . and one or more digits (\\d+) with a new delimiter using gsub and then split with strsplit in base R
result1 <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]
-checking
> identical(result, result1)
[1] TRUE
If the number of characters are fixed, use a regex lookaround
result1 <- strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]

How can I paste a comma (,) in a string of numbers in R Statistics?

I'm quite newbie at R Statistics. I have a vector with multiple objects inside (numbers), and I want to put a comma between the first and second number for the whole objects.
x gives this result:
[8] -8196110 -7681989 -8042092 -8196660 -7606310 -7217828 -7634887
[15] -7401244 -7211947 -7636932 -7606444 -7598894 -7398965```
My question is how to automatically put a comma in all those objects between the first and the second numbers. The desired output would be:
```[1] -8,385772 -7,390682 -8,019960 -8,300000 -8,069984 -8,786782 -7,414995
[8] -8,196110 -7,681989 -8,042092 -8,196660 -7,606310 -7,217828 -7,634887
[15] -7,401244 -7,211947 -7,636932 -7,606444 -7,598894 -7,398965```

We can use sub to capture the first digit from the start (^) of the string and replace with the backreference (\\1) followed by the,
sub("^(-?\\d)", "\\1,", x)
-output
[1] "-8,196110" "-7,681989" "-8,042092" "-8,196660" "-7,606310" "-7,217828" "-7,634887" "-7,401244" "-7,211947" "-7,636932" "-7,606444" "-7,598894" "-7,398965"
data
x <- c(-8196110, -7681989, -8042092, -8196660, -7606310, -7217828,
-7634887, -7401244, -7211947, -7636932, -7606444, -7598894, -7398965
)

We can use strsplit to split our numeric vector into a list where each element has the first digit and then the rest of the number. Then pass that into an sapply call that inserts a comma in the right spot:
x_split = strsplit(as.character(x), split = '')
sapply(x_split, function(k){paste0(c(k[1], ',',k[2:length(k)]), collapse = '')})

Cut every element of a vector of strings at the second occurrence of a pattern in R

I have a chr vector:
> head(strings)
[1] "10_88517_0" "10_88521_1" "10_88542_2" "10_280230_3" "10_280258_4" "10_280310_5"
I want to create a new vector of substrings, obtained by cutting each element of this vector at the second _. E.g.:
> head(cut_strings)
[1] "10_88517" "10_88521" "10_88542" "10_280230" "10_280258" "10_280310"
My idea was to first grep for the position of second _ in each string:
cut_pts <- sapply(stringr::str_locate_all(strings, "_"), "[", 2)
All I can come up with though is an awkward for loop that goes through the strings vector and calls substr for each element, e.g.:
cut_strings <- strings
for(i in 1:length(strings)){
string <- strings[i]
cut_pt <- cut_pts[i]
string <- substr(string, 1, cut_pt-1)
cut_strings[i] <- string
}
I'm thinking maybe there's a way to use apply in this context, to cut each element of strings based on the appropriate element of cut_pts?

We could capture those characters in sub and remove the substring afterwards i..e below pattern matches the one or more characters not an underscore ([^_]+) followed by an underscore, then characters not an underscore and remove the character starting from second underscore by not including in the capture group ((...)). Note that we specified the start of the string (^). In the replacement, use the backreference (\\1) of the captured group
sub("^([^_]+_[^_]+)_.*", "\\1", strings)

How would I remove the text before the initial period, the initial period itself and text after final period in a string?

I need to remove the text before the leading period (as well as the leading period) and the text following the last period from a string.
Given this string for example:
"ABCD.EF.GH.IJKL.MN"
I'd like to get the output:
[1] "IJKL"
I have tried the following:
split_string <- sub("^.*?\\.","", string)
split_string <- sub("^\\.+|\\.[^.]*$", "", string)
I believe I have it working for the period and text after for that string output I want. However, the first line needs to be executed multiple times to remove the text before that period in question e.g. '.I'.

One option in base R is to capture as a group ((...)) the word followed by the dot (\\.) and the word (\\w+) till the end ($) of the string. In the replacement, use the backreference (\\1) of the captured word
sub(".*\\.(\\w+)\\.\\w+$", "\\1", str1)
#[1] "IJKL"
Here, we match characters (.*) till the . (\\. - escaped to get the literal value because . is a metacharacter that will match any character if not escaped), followed by the word captured ((\\w+)), followed by a dot and another word at the end ($)of the string. The replacement part is mentioned above
Or another option is regmatches/regexpr from base R
regmatches(str1, regexpr("\\w+(?=\\.\\w+$)", str1, perl = TRUE))
#[1] "IJKL"
Or another option is word from stringr
library(stringr)
word(str1, -2, sep="[.]")
#[1] "IJKL"
data
str1 <- "ABCD.EF.GH.IJKL.MN"

Here is a janky dplyr version in case the other values are of importance and you want to select them later on, just include them in the "select".
df<- data.frame(x=c("ABCD.EF.GH.IJKL.MN"))
df2<-df %>%
separate(x, into=c("var1", "var2","var3","var4","var5")) %>%
select("var4")

Split into groups at period and take the second one from last.
sapply(strsplit(str1, "\\."), function(x) x[length(x) - 1])
#[1] "IJKL"
Get indices of the periods and use substr to extract the relevant portion
sapply(str1, function(x){
ind = gregexpr("\\.", x)[[1]]
substr(x, ind[length(ind) - 1] + 1, ind[length(ind)] - 1)
}, USE.NAMES = FALSE)
#[1] "IJKL"

These alternatives all use no packages or regular expressions.
1) basename/dirname Assuming the test input s shown in the Note at the end convert the dots to slashes and then use dirname and basename.
basename(dirname(chartr(".", "/", s)))
## [1] "IJKL" "IJKL"
2) strsplit Using strsplit split the strings at dot creating a list of character vectors, one vector per input string, and then for each such vector take the last 2 elements using tail and the first of those using indexing.
sapply(strsplit(s, ".", fixed = TRUE), function(x) tail(x, 2)[1])
## [1] "IJKL" "IJKL"
3) read.table It is not clear from the question what the general case is but if all the components of s have the same number of dot separated fields then we can use read.table to create a data.frame with one row per input string and one column per dot-separated component. Then take the column just before the last.
dd <- read.table(text = s, sep = ".", as.is = TRUE)
dd[[ncol(dd)-1]]
## [1] "IJKL" "IJKL"
4) substr Again, the general case is not clear but if the string of interest is always at character positions 12-15 then a simple solution is:
substr(s, 12, 15)
## [1] "IJKL" "IJKL"
Note
s <- c("ABCD.EF.GH.IJKL.MN", "ABCD.EF.GH.IJKL.MN")

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!

You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"

You can do this with gsub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
>
I used sub instead of gsub to assure I only got the first match. .? says any single character and its optional (similar to just . but then it wouldn't match the case without the leading P). The () signify a group that I reference in the replacement '\\1'. If there were multiple sets of () I could reference them too with '\\2'. Inside the group, and you had the syntax correct, I want only numbers and I want exactly 4 of them. The final piece says zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.

This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces:
What's going on in the above line is as follows:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)

Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Split and re-concatenate a string - r

We can convert the ip addresses to numeric_version class and then format using this base R one-liner that employs no regular expressions: format(numeric_version(ips)[, 1:2]) [1] "140.112" "132.212" "31.2" "7.112"

Related

R splitting string on predefined location

How can I paste a comma (,) in a string of numbers in R Statistics?

Cut every element of a vector of strings at the second occurrence of a pattern in R

How would I remove the text before the initial period, the initial period itself and text after final period in a string?

Extract first X Numbers from Text Field using Regex

Categories

Resources