How to separate values in a string only after the second space - r

I have a string of names, for example:
st <- 'IKE IROEGBU NIMROD LEVI KYLE GIBSON CHAVAUGHN LEWIS BRYCE WASHINGSON'
and I want the output to be a vector like this:
c('IKE IROEGBU', 'NIMROD LEVI', 'KYLE GIBSON', 'CHAVAUGHN LEWIS', 'BRYCE WASHINGSON')
how can I do this?

You can do:
st <- 'IKE IROEGBU NIMROD LEVI KYLE GIBSON CHAVAUGHN LEWIS BRYCE WASHINGSON'
c(stringr::str_match_all(st, "\\S+\\s\\S+")[[1]])
#> [1] "IKE IROEGBU" "NIMROD LEVI" "KYLE GIBSON" "CHAVAUGHN LEWIS"
#> [5] "BRYCE WASHINGSON"

Another way, without a regex:
sst <- strsplit(st, " ")[[1]]
paste(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
# [1] "IKE IROEGBU" "NIMROD LEVI" "KYLE GIBSON" "CHAVAUGHN LEWIS" "BRYCE WASHINGSON"

Another way:
words <- unlist(strsplit(st, " "))
i <- 0
while (i < length(words) / 2) {
  # Join words 1-2, then 3-4, and so on
  print(paste0(words[1:2 + i * 2], collapse = " "))
  i <- i + 1
}
# [1] "IKE IROEGBU"
# [1] "NIMROD LEVI"
# [1] "KYLE GIBSON"
# [1] "CHAVAUGHN LEWIS"
# [1] "BRYCE WASHINGSON"
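
The pairing idea in the answers above extends to groups of any size. Here is a minimal base R sketch, assuming the words are separated by one or more spaces. The function name `group_words` and its `n` argument are hypothetical and do not come from the original answers:

```r
# Hypothetical helper: join every n consecutive words into one string.
group_words <- function(x, n = 2) {
  words <- strsplit(x, " +")[[1]]
  words <- words[nzchar(words)]              # drop the empty piece left by a leading space
  groups <- ceiling(seq_along(words) / n)    # 1, 1, 2, 2, ... for n = 2
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

st <- 'IKE IROEGBU NIMROD LEVI KYLE GIBSON CHAVAUGHN LEWIS BRYCE WASHINGSON'
group_words(st)
```

If the number of words is not a multiple of `n`, the last group is simply shorter.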

Related

Apply regmatches function to a list of chr in R

I have this list of character stored in a variable called x:
x <-
c(
"images/logos/france2.png",
"images/logos/cnews.png",
"images/logos/lcp.png",
"images/logos/europe1.png",
"images/logos/rmc-bfmtv.png",
"images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
pattern <- "images/logos/\\s*(.*?)\\s*.png"
regmatches(x, regexec(pattern, x))[[1]][2]
I wish to extract a portion of each chr string according to a pattern, like this function does, which works fine but only for the first item in the list.
pattern <- "images/logos/\\s*(.*?)\\s*.png"
y <- regmatches(x, regexec(pattern, x))[[1]][2]
Only returns:
"france2"
How can I apply the regmatches function to all items in the list in order to get a result like this?
[1] "france2" "europe1" "sudradio"
[4] "cnews" "rmc-bfmtv" "franceinfo"
[7] "lcp" "rmc" "lcp"
FYI this is a list of src tags that comes from a scraper
Try gsub
gsub(
".*/(.*)\\.png", "\\1",
c(
"images/logos/france2.png", "images/logos/cnews.png",
"images/logos/lcp.png", "images/logos/europe1.png",
"images/logos/rmc-bfmtv.png", "images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
)
which gives
[1] "france2" "cnews" "lcp" "europe1" "rmc-bfmtv"
[6] "sudradio" "franceinfo"
Output of regmatches(..., regexec(...)) is a list. You may use sapply to extract the 2nd element from each element of the list.
sapply(regmatches(x, regexec(pattern, x)), `[[`, 2)
#[1] "france2" "cnews" "lcp" "europe1" "rmc-bfmtv"
#[6] "sudradio" "franceinfo"
You may also combine basename() with file_path_sans_ext() from the tools package, which gives the required output directly.
tools::file_path_sans_ext(basename(x))
#[1] "france2" "cnews" "lcp" "europe1" "rmc-bfmtv"
#[6] "sudradio" "franceinfo"
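
One caveat with the sapply approach: regmatches() returns character(0) for a string that does not match, so picking the second element fails on such input. A possible sketch that returns NA instead is shown below. The function name `extract_logo_name` and the stricter pattern with an escaped dot are assumptions for illustration, not part of the original answers:

```r
# Hypothetical helper: capture the file stem, or NA when the pattern does not match.
extract_logo_name <- function(paths, pattern = "images/logos/(.*)\\.png$") {
  matches <- regmatches(paths, regexec(pattern, paths))
  vapply(
    matches,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
}

extract_logo_name(c("images/logos/france2.png", "images/logos/rmc-bfmtv.png", "other/file.txt"))
```

Using vapply rather than sapply also guarantees a character vector, even for empty input.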
A possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
strings = c("images/logos/france2.png","images/logos/cnews.png",
"images/logos/lcp.png","images/logos/europe1.png",
"images/logos/rmc-bfmtv.png","images/logos/sudradio.png",
"images/logos/franceinfo.png")
)
df %>%
mutate(strings = str_remove(strings, "images/logos/") %>%
str_remove("\\.png"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo
Or even simpler:
library(tidyverse)
df %>%
mutate(strings = str_extract(strings, "(?<=images/logos/)(.*)(?=\\.png)"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo

Pattern Matching & Replacement / Cleaning of Data in R

I'm looking to plot geospatial data, thus I require coordinates. The information I've been provided is very messy and I need a good system to convert a vector of coordinates in multiple formats into one useful format as per below:
Input:
- lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6",
"38.720647")
- lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0",
"-9.22522")
Output:
- lat <- c(41.122333, 40.39156, 41.10589, 38.31106, 38.720647)
- lon <- c(8.194066, 7.523195, 8.37832, 8.54170, -9.22522)
Does anyone have a creative solution? Any response is much appreciated!
lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0", "-9.22522")
gsub(" ", "", sub("\\s", ".", gsub("º|\\'|\\.", " ", lat)))
[1] "41.122333" "40.39156" "41.10589" "38.31106" "38.720647"
gsub(" ", "", sub("\\s", ".", gsub("º|\\'|\\.", " ", lon)))
[1] "8.194066" "7.523195" "8.37832" "8.54170" "-9.22522"
1. Replace all º, ' and . with a space.
2. Replace the first space with a decimal point.
3. Remove all remaining spaces so the pieces are pasted together again.
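
The three steps can be wrapped in a small function that also converts the result to numeric, as in the requested output. This is only a sketch: the name `squash_coordinates` is hypothetical, and it reproduces the digit concatenation the question asks for rather than doing degree/minute/second arithmetic:

```r
# Hypothetical helper implementing the three steps described above.
squash_coordinates <- function(x) {
  spaced <- gsub("º|'|\\.", " ", x)    # step 1: separators become spaces
  dotted <- sub("\\s", ".", spaced)    # step 2: first space becomes the decimal point
  as.numeric(gsub(" ", "", dotted))    # step 3: drop remaining spaces, convert to numeric
}

lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0", "-9.22522")
squash_coordinates(lat)
squash_coordinates(lon)
```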
With base R, you could try the following:
lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
for (i in lat) {
  i <- sub("º| |\\.", "#", i)    # mark the first separator
  i <- gsub("º|'| |\\.", "", i)  # drop every remaining separator
  i <- sub("#", ".", i)          # turn the marker into the decimal point
  print(i)
}
The output is:
[1] "41.122333"
[1] "40.39156"
[1] "41.10589"
[1] "38.31106"
[1] "38.720647"
This function will also work:
# DATA
lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0", "-9.22522")
# FUNCTION
convert_coordinates <- function(x) {
  # Split on the unwanted punctuation; add more separators by appending them with |
  splits <- strsplit(x, "º| |[.]|'")
  # Remove any empty strings left over by adjacent separators
  splits <- lapply(splits, function(s) s[s != ""])
  output <- character(length(splits))
  for (i in seq_along(splits)) {
    # The first piece is the integer part; the remaining pieces become the decimals
    output[i] <- paste0(splits[[i]][1], ".", paste0(splits[[i]][-1], collapse = ""))
  }
  return(output)
}
# RESULTS
convert_coordinates(lat)
# [1] "41.122333" "40.39156" "41.10589" "38.31106" "38.720647"
convert_coordinates(lon)
# [1] "8.194066" "7.523195" "8.37832" "8.54170" "-9.22522"

In R, how do I wrap text around all words in a string, but a specific one(going from left to right)? Iteration and string manipulation

I know my question is a little vague, so I have an example of what I'm trying to do.
input <- c('I go to school')
#Output
'"I " * phantom("go to school")'
'phantom("I ") * "go" * phantom("to school")'
'phantom("I go ") * "to" * phantom("school")'
'phantom("I go to ") * "school"'
I've written a function, but I'm having a lot of trouble figuring out how to make it applicable to strings with different numbers of words and I can't figure out how I can include iteration to reduce copied code. It does generate the output above though.
Right now my function only works on strings with 4 words. It also includes no iteration.
My main questions are: How can I include iteration into my function? How can I make it work for any number of words?
add_phantom <- function(stuff){
strings <- c()
stuff <- str_split(stuff, ' ')
strings[1] <- str_c('"', stuff[[1]][[1]], ' "', ' * ',
'phantom("', str_c(stuff[[1]][[2]], stuff[[1]][[3]], stuff[[1]][[4]], sep = ' '), '")')
strings[2] <- str_c('phantom("', stuff[[1]][[1]], ' ")',
' * "', stuff[[1]][[2]], '" * ',
'phantom("', str_c(stuff[[1]][[3]], stuff[[1]][[4]], sep = ' '), '")')
strings[3] <- str_c('phantom("', str_c(stuff[[1]][[1]], stuff[[1]][[2]], sep = ' '), ' ")',
' * "', stuff[[1]][[3]], '" * ',
'phantom("', stuff[[1]][[4]], '")')
strings[4] <- str_c('phantom("', str_c(stuff[[1]][[1]], stuff[[1]][[2]], stuff[[1]][[3]], sep = ' '), ' ")',
' * "', stuff[[1]][[4]], '"')
return(strings)
}
This is a bit of butcher work, but it gives close to the expected output (the visible word is not wrapped in quotes):
input <- c('I go to school')
library(purrr)
inp <- c(list(NULL),strsplit(input," ")[[1]])
phantomize <- function(x,leftside = T){
if(length(x)==1) return("")
if(leftside)
ph <- paste0('phantom("',paste(x[-1],collapse=" "),' ") * ') else
ph <- paste0(' * phantom("',paste(x[-1],collapse=" "),'")')
ph
}
map(1:(length(inp)-1),
~paste0(phantomize(inp[1:.x]),
inp[[.x+1]],
phantomize(inp[(.x+1):length(inp)],F)))
# [[1]]
# [1] "I * phantom(\"go to school\")"
#
# [[2]]
# [1] "phantom(\"I \") * go * phantom(\"to school\")"
#
# [[3]]
# [1] "phantom(\"I go \") * to * phantom(\"school\")"
#
# [[4]]
# [1] "phantom(\"I go to \") * school"
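
For comparison, here is a base R sketch that iterates over the word positions and reproduces the exact strings from the question, including the quotes around the visible word, for any number of words. The name `add_phantom_any` is hypothetical, and it assumes words are separated by single spaces:

```r
# Hypothetical helper: one plotmath string per word, with all other words wrapped in phantom().
add_phantom_any <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  n <- length(words)
  vapply(seq_len(n), function(i) {
    # Words to the left of position i (NULL when i is the first word)
    before <- if (i > 1) {
      sprintf('phantom("%s ")', paste(words[seq_len(i - 1)], collapse = " "))
    }
    # The visible word; the first word keeps a trailing space, as in the question
    visible <- if (i == 1 && n > 1) sprintf('"%s "', words[i]) else sprintf('"%s"', words[i])
    # Words to the right of position i (NULL when i is the last word)
    after <- if (i < n) {
      sprintf('phantom("%s")', paste(words[(i + 1):n], collapse = " "))
    }
    paste(c(before, visible, after), collapse = " * ")
  }, character(1))
}

add_phantom_any("I go to school")
```

Because c() silently drops NULL, the first and last positions need no special-case branches when the pieces are joined.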
This is a bit of a hack, but I think it gets at what you're trying to do:
library(corpus)
input <- 'I go to school'
types <- text_types(input, collapse = TRUE) # all word types
(loc <- text_locate(input, types)) # locate all word types, get context
## text before instance after
## 1 1 I go to school
## 2 1 I go to school
## 3 1 I go to school
## 4 1 I go to school
The return value is a data frame, with columns of type corpus_text. This approach seems crazy, but it doesn't actually allocate new strings for the before and after contexts (both of which have type corpus_text)
Here's the output you wanted:
paste0("phantom(", loc$before, ") *", loc$instance, "* phantom(", loc$after, ")")
## [1] "phantom() *I* phantom( go to school)"
## [2] "phantom(I ) *go* phantom( to school)"
## [3] "phantom(I go ) *to* phantom( school)"
## [4] "phantom(I go to ) *school* phantom()"
If you want to really get crazy and ignore punctuation:
phantomize <- function(input, ...) {
types <- text_types(input, collapse = TRUE, ...)
loc <- text_locate(input, types, ...)
paste0("phantom(", loc$before, ") *", loc$instance, "* phantom(",
loc$after, ")")
}
phantomize("I! go to school (?)...don't you?", drop_punct = TRUE)
## [1] "phantom() *I* phantom(! go to school (?)...don't you?)"
## [2] "phantom(I! ) *go* phantom( to school (?)...don't you?)"
## [3] "phantom(I! go ) *to* phantom( school (?)...don't you?)"
## [4] "phantom(I! go to ) *school* phantom( (?)...don't you?)"
## [5] "phantom(I! go to school (?)...) *don't* phantom( you?)"
## [6] "phantom(I! go to school (?)...don't ) *you* phantom(?)"
I would suggest something like this
library(tidyverse)
library(glue)
test_string <- "i go to school"
str_split(test_string, " ") %>%
map(~str_split(test_string, .x, simplify = T)) %>%
flatten() %>%
map(str_trim) %>%
keep(~.x != "") %>%
map(~glue("phantom({string})", string = .x))
This snippet can easily be wrapped in a function; it returns the following output:
[[1]]
phantom(i)
[[2]]
phantom(i go)
[[3]]
phantom(i go to)
[[4]]
phantom(go to school)
[[5]]
phantom(to school)
[[6]]
phantom(school)
I might have misinterpreted your question; I am not quite sure whether you really want the output in the same format as your example output.

How to break a string up into overlapping sets of 3

If I have multiple strings like:
skhdsfiiuwkncyeuhrsl
sdskkjheocbsill
sldkjflsdkjb
How can I program the output to be the overlapping triplets, for example, I want it to output:
skh, khd, hds, ..., rsl
sds, dsk, skk, ..., ill
sld, ldk, dkj, ..., kjb
substring works:
x = c("skhdsfiiuwkncyeuhrsl", "sdskkjheocbsill", "sldkjflsdkjb", "ab")
n = 3
lapply(x, function(z)
if ((nc <- nchar(z)) >= n)
substring(z, seq(1, nc - n + 1), seq(n, nc))
else
character(0)
)
which gives
[[1]]
[1] "skh" "khd" "hds" "dsf" "sfi" "fii" "iiu" "iuw" "uwk" "wkn" "knc" "ncy"
[13] "cye" "yeu" "euh" "uhr" "hrs" "rsl"
[[2]]
[1] "sds" "dsk" "skk" "kkj" "kjh" "jhe" "heo" "eoc" "ocb" "cbs" "bsi" "sil"
[13] "ill"
[[3]]
[1] "sld" "ldk" "dkj" "kjf" "jfl" "fls" "lsd" "sdk" "dkj" "kjb"
[[4]]
character(0)
Taking inspiration from this answer, here's a one-liner (note that it assumes every string has at least 3 characters):
strings <- c("skhdsfiiuwkncyeuhrsl",
"sdskkjheocbsill",
"sldkjflsdkjb")
sapply(strings, function(x) substring(x, seq(1,nchar(x)-2,1), seq(3,nchar(x),1)))
# $skhdsfiiuwkncyeuhrsl
# [1] "skh" "khd" "hds" "dsf" "sfi" "fii" "iiu" "iuw" "uwk" "wkn" "knc" "ncy" "cye" "yeu" "euh"
# [16] "uhr" "hrs" "rsl"
# $sdskkjheocbsill
# [1] "sds" "dsk" "skk" "kkj" "kjh" "jhe" "heo" "eoc" "ocb" "cbs" "bsi" "sil" "ill"
# $sldkjflsdkjb
# [1] "sld" "ldk" "dkj" "kjf" "jfl" "fls" "lsd" "sdk" "dkj" "kjb"
a <- "skhdsfiiuwkncyeuhrsl"
b <- "sdskkjheocbsill"
c <- "sldkjflsdkjb"
make_triplets <- function(X) {
  nTriplets <- length(2:(nchar(X) - 1))
  triplets <- character(nTriplets)
  for (i in 2:(nchar(X) - 1)) {
    triplets[i - 1] <- substr(X, i - 1, i + 1)
  }
  return(triplets)
}
make_triplets(a)
make_triplets(b)
make_triplets(c)
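
The question asks for comma-separated output such as "skh, khd, hds, ..., rsl", which none of the answers prints directly. A possible sketch builds on the substring approach and toString(); the name `ngram_string` is hypothetical, and returning an empty string for inputs shorter than `n` is an assumption:

```r
# Hypothetical helper: overlapping n-grams of one string, joined with ", ".
ngram_string <- function(z, n = 3) {
  nc <- nchar(z)
  if (nc < n) return("")                # too short to contain any n-gram
  starts <- seq_len(nc - n + 1)
  toString(substring(z, starts, starts + n - 1))
}

x <- c("skhdsfiiuwkncyeuhrsl", "sdskkjheocbsill", "sldkjflsdkjb", "ab")
vapply(x, ngram_string, character(1), USE.NAMES = FALSE)
```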

Issue with strsplit not storing searched field

I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to delete the field used to split the string. Is there any way I can preserve it?
Thanks
You are almost there: just append \\K\\s* to your regex and prepend the ^ start-of-string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
The \K is a match reset operator that discards the text matched so far. So after matching 1+ digits at the start of the string, optionally followed by a - surrounded by 0+ whitespace characters and 1+ more digits, this whole text is dropped from the match. Only the 0+ whitespace characters that follow end up in the match value, and those are what the string is split on.
This outputs:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
You could use lookbehinds and lookaheads to split at the space between a digit and a letter:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however, also splits at the space between "M4" and "Ln". If your addresses always have the format "number (possibly a range) followed by the rest of the address", you could extract the two parts separately (as @d.b suggested):
splitDf <- data.frame(
numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
# numberPart rest
# 1 955 - 959 Fake Street
# 2 95-99 Fake Street
# 3 4-9 M4 Ln
# 4 95 - 99 Fake Street
# 5 99 Fake Street
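
The two sub() calls above run the same pattern twice. As an alternative sketch, a single regexec() pass can fill both columns and return NA for addresses that do not start with a number. The function name `split_address` is hypothetical; the pattern is the one from the answer above with a ^ anchor added:

```r
# Hypothetical helper: split "number (or range)" from the rest of the address in one pass.
split_address <- function(x) {
  pattern <- "^(\\d+(\\s*-\\s*\\d+)?)(.*)"
  parts <- regmatches(x, regexec(pattern, x))
  # Element 1 is the full match; elements 2 and 4 are the outer capture groups
  pick <- function(i) {
    vapply(parts, function(p) if (length(p) >= i) p[i] else NA_character_, character(1))
  }
  data.frame(numberPart = pick(2), rest = trimws(pick(4)), stringsAsFactors = FALSE)
}

df <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln",
        "95 - 99 Fake Street", "99 Fake Street")
split_address(df)
```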
