string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string
[1] " A B C"
text
[1] " I love chocolate pudding"
I am trying to match the letters in "string" with the text in "text" so that the letter A corresponds to "I love", B to "chocolate", and C to "pudding". Ideally, I would like to put A, B, C in column 1, in three different rows of a data frame (or tibble), and the corresponding text in column 2. Any suggestions?
It is hard to know whether the strings you are trying to manipulate and then collate into columns of a data.frame always follow a pattern. But for the example you posted, I suggest creating a list with the strings (strings):
strings <- list(string, text)
Then use lapply() which will in turn create a list for each element in strings.
res <- lapply(strings, function(x) {
  grep(x = trimws(unlist(strsplit(x, "\\s\\s"))), pattern = "[[:alpha:]]", value = TRUE)
})
In the code above, strsplit() splits the string wherever two consecutive spaces are found (\\s\\s). The result is a list with the split pieces as inner elements, so you need unlist() before passing it to grep(). grep() then keeps only those pieces containing an alphabetic character ([[:alpha:]]), which is what you want.
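To make the chain easier to follow, here it is run step by step on string alone (same calls as above, shown purely for illustration):
strsplit(string, "\\s\\s")                    # list with one vector of pieces
trimws(unlist(strsplit(string, "\\s\\s")))    # flatten and trim the remaining blanks
grep(x = trimws(unlist(strsplit(string, "\\s\\s"))),
     pattern = "[[:alpha:]]", value = TRUE)   # keep only pieces containing a letter
# [1] "A" "B" "C"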
You can then use do.call(cbind, ...) to bind the elements of the resulting lapply() list into columns. The dimensions must match for this to work.
do.call(cbind, res)
Result:
> do.call(cbind, res)
[,1] [,2]
[1,] "A" "I love"
[2,] "B" "chocolate"
[3,] "C" "pudding"
You can wrap it in as.data.frame(), for instance, to get the desired result:
> as.data.frame(do.call(cbind, res), stringsAsFactors = FALSE)
V1 V2
1 A I love
2 B chocolate
3 C pudding
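Since the question also mentions tibbles, here is a minimal sketch of the same result as a tibble (this assumes the tibble package is installed and reuses res from above):
library(tibble)
tibble(letter = res[[1]], text = res[[2]])
# # A tibble: 3 × 2
#   letter text
#   <chr>  <chr>
# 1 A      I love
# 2 B      chocolate
# 3 C      pudding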
You can use read.fwf, deriving the field widths from the letter positions in string (found with gregexpr) and the total length of text (nchar).
read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1]
# V2 V3 V4
#1 I love chocolate pudding
If the surrounding white space should be removed, also use trimws:
trimws(read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1])
#[1] "I love" "chocolate" "pudding"
Based on your data, I came up with this workaround using the stringr package. It only works for this kind of pattern, so if your real strings are more erratic you will need to adjust it.
The output is a data.frame with two columns, one for each of your two inputs, and one row per match.
library(stringr)
string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string_nospace <- str_replace_all(string, "\\s{1,20}", " ")
string_nospace <- str_trim(string_nospace)
string_nospace <- data.frame(string = t(str_split(string_nospace, "\\s", simplify = TRUE)))
text_nospace <- str_replace_all(text, "\\s{2,20}", "_")
text_nospace <- str_sub(text_nospace, start = 2)
text_nospace <- data.frame(text = t(str_split(text_nospace, "_", simplify = TRUE)))
df <- data.frame(string = string_nospace,
                 text = text_nospace)
df
#> string text
#> 1 A I love
#> 2 B chocolate
#> 3 C pudding
Created on 2020-06-08 by the reprex package (v0.3.0)
I need help with ideas for parsing this text.
I want to do it in the most automatic way possible.
This is the text
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
I need this result:
a            b
JOHN DEERE   PMWF2126
NEW HOLLAND  441702A1
HIFI         WE 2126
CUMMINS      4907485
This is just an example; there are different brands and item IDs.
I tried:
str_split(text, " ")
[[1]]
[1] "JOHN" "DEERE:" "PMWF2126" "NEW" "HOLLAND:" "441702A1" "HIFI:" "WE" "2126"
[10] "CUMMINS:" "4907485" "CUMMINS:" "3680433" "CUMMINS:" "3680315" "CUMMINS:" "3100310"
Thanks!
Edit:
Thanks for your answers, very helpful.
But there is another case, where the value can end with a letter too:
text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 JCB: 32917302 WIX: 46490 TURBO: TR25902 HIFI: SA 16080 CATERPILLAR: 4431570 KOMATSU: Z7602BXK06 KOMATSU: Z7602BX106 KOMATSU: YM12991012501 KOMATSU: YM12991012500 KOMATSU: YM11900512571 KOMATSU: 6001851320 KOMATSU: 6001851300 KOMATSU: 3EB0234790 KOMATSU: 11900512571"
We can use separate_rows and separate from tidyr for this task:
library(tidyverse)
data.frame(text) %>%
  # separate into rows:
  separate_rows(text, sep = "(?<=\\d)\\s") %>%
  # separate into columns:
  separate(text,
           into = c("a", "b"),
           sep = ":\\s")
# A tibble: 4 × 2
  a           b
  <chr>       <chr>
1 JOHN DEERE  PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI        WE 2126
4 CUMMINS     4907485
The split point for separate_rows uses the look-behind (?<=\\d) to assert that the whitespace \\s on which the string is broken must be preceded by a digit (\\d).
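To see only that splitting step in isolation, here is the same regex applied with str_split (illustration only, on a shortened string):
stringr::str_split("JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1", "(?<=\\d)\\s")[[1]]
# [1] "JOHN DEERE: PMWF2126"  "NEW HOLLAND: 441702A1"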
Data:
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
The solution assumes (as in your sample data) that the second value always ends with a number and the first does not.
If this is not the case, you'll have to adapt the regex part (?<=[0-9] )(?=[A-Z]) so that the splitting point lies between the two parenthesised parts.
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
lapply(
  strsplit(
    unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)),
    ":"),
  trimws)
[[1]]
[1] "JOHN DEERE" "PMWF2126"
[[2]]
[1] "NEW HOLLAND" "441702A1"
[[3]]
[1] "HIFI" "WE 2126"
[[4]]
[1] "CUMMINS" "4907485"
The key part is the strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE) call.
It looks for occurrences where, after a digit followed by a space ((?<=[0-9] )), a new part starts with a capital letter ((?=[A-Z])).
These positions are then used as splitting points.
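If you then want the two-column data frame from the question, a possible follow-up (a sketch) is to rbind the pairs:
parts <- lapply(
  strsplit(
    unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)),
    ":"),
  trimws)
m <- do.call(rbind, parts)          # 4 x 2 character matrix
data.frame(a = m[, 1], b = m[, 2])
#             a        b
# 1  JOHN DEERE PMWF2126
# 2 NEW HOLLAND 441702A1
# 3        HIFI  WE 2126
# 4     CUMMINS  4907485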
Since the second field always ends in a digit and the first field does not, replace each digit followed by a space with that digit and a newline, and then use read.table with a colon separator.
text |>
  gsub("(\\d) ", "\\1\n", x = _) |>
  read.table(text = _, sep = ":", strip.white = TRUE)
giving
           V1       V2
1  JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3        HIFI  WE 2126
4     CUMMINS  4907485
If in your data the second field can contain a digit but the first cannot, and the digit is not necessarily at the end of the last word in field two but could be anywhere in that word, then we can use this variation, which gives the same result here. gsubfn is like gsub except that the second argument can also be a function (here written in formula notation) instead of a replacement string; it takes the match as input and replaces the entire match with the output of the function.
library(gsubfn)
text |>
  gsubfn("\\w+", ~ if (grepl("[0-9]", x)) paste(x, "\n") else x, x = _) |>
  read.table(text = _, sep = ":", strip.white = TRUE)
My question is: if I have the following df:
df <- data.frame(Respuestas = c("sí, acepto, a veces, no acepto",
                                "no acepto, sí, acepto, a veces, nunca",
                                "a veces, sí, acepto, nunca, bla bla"))
print(df)
Respuestas
1 sí, acepto, a veces, no acepto
2 no acepto, sí, acepto, a veces, nunca
3 a veces, sí, acepto, nunca, bla bla
I needed to extract into columns all the strings in the "Respuestas" column that matched a dictionary, so I applied @divibisan's solution here. So far so good, I get my output:
vec<-c("sí, acepto", "a veces", "no acepto")
t(apply(df, 1, function(x)
str_extract_all(x[['Respuestas']], vec, simplify = TRUE)))
[,1] [,2] [,3]
[1,] "sí, acepto" "a veces" "no acepto"
[2,] "sí, acepto" "a veces" "no acepto"
[3,] "sí, acepto" "a veces" ""
But finally I want to get a data frame with the values in the "Respuestas" column that do not match the vec dictionary, something like this:
wishDF <- data.frame(noMatch1 = c(NA,
                                  "nunca",
                                  "nunca"),
                     noMatch2 = c(NA, NA, "bla bla"))
print(wishDF)
noMatch1 noMatch2
1 <NA> <NA>
2 nunca <NA>
3 nunca bla bla
I was trying to use str_detect and invert_match from the stringr library in the same way as @divibisan's solution, but I don't get a good result. What do you recommend?
Thank you very much!
The easiest way to find the content of a string that doesn't match is simply to remove the content that does match:
str_remove(string, pattern)
But this function is vectorized on pattern, so it will remove only one entry from vec each time. We need to go to the implementation: str_remove is an alias for str_replace(string, pattern, "") which is based on the stringi package. So we can do this with:
stringi::stri_replace_all_coll(string, pattern, "", vectorize_all = FALSE)
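For example, applied to the first row of your data with your vec dictionary (just to show what one call returns):
stringi::stri_replace_all_coll("sí, acepto, a veces, no acepto", vec, "",
                               vectorize_all = FALSE)
# [1] ", , "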
Finally we want to do that for every row of Respuestas, which we can do simply with map_chr:
library(purrr)
map_chr(df$Respuestas,
        ~ stringi::stri_replace_all_coll(.x, vec, "", vectorize_all = FALSE))
# [1] ", , " ", , , nunca" ", , nunca, bla bla"
With regards to noMatch1 and noMatch2, it is possible to separate the result based on ",". But I don't know enough about your data to be sure it'll work: do you always have the same number of fields? How do you distinguish between the comma in "si, acepto" and the one between "si, acepto" and "nunca"?
Depending on your data, something like this may or may not work (and may or may not make any sense at all):
library(dplyr)
library(tidyr)
df %>%
  mutate(no_match = map_chr(Respuestas,
                            ~ stringi::stri_replace_all_coll(.x, vec, "", vectorize_all = FALSE))) %>%
  separate(col = no_match,
           into = c("first", "second", "third", "fourth"),
           sep = ",",
           extra = "merge",
           fill = "left")
This is my first time posting; please let me know if I'm making any beginner mistakes. In my specific case I have a vector of strings, and I want to collapse some adjacent entries. I have one vector indicating the starting position and one indicating the last element. How can I do this?
Here is some sample code and my approach that does not work:
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1,3)
y <- c(2,5)
result <- as.data.frame(paste(text[x:y],sep = " ",collapse = ""))
In case it's not clear, the result I want is a data frame consisting of two strings: "cat dog" and "house mouse street".
Not sure this is the best option, but it does the job:
sapply(mapply(seq, x, y), function(i)paste(text[i], collapse = ' '))
#[1] "cat dog" "house mouse street"
Either use base R with
mapply(function(.x,.y) paste(text[.x:.y],collapse = " "), x, y)
or use the purrr package as
map2_chr(x,y, ~ paste(text[.x:.y],collapse = " "))
Both yield
# [1] "cat dog" "house mouse street"
The output as a data frame depends on the structure you want: rows or columns.
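For instance, to get one row per collapsed string you could wrap either call in data.frame (a sketch):
data.frame(combined = mapply(function(.x, .y) paste(text[.x:.y], collapse = " "), x, y))
#             combined
# 1            cat dog
# 2 house mouse street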
I think you want
result <- data.frame(combined = c(paste(text[x[1]:y[1]], collapse = " "),
                                  paste(text[x[2]:y[2]], collapse = " ")))
Which gives you
result
#> combined
#> 1 cat dog
#> 2 house mouse street
Another base R solution, using parse + eval
result <- data.frame(new = sapply(paste0(x, ":", y),
                                  function(v) paste0(text[eval(parse(text = v))], collapse = " ")),
                     row.names = NULL)
such that
> result
new
1 cat dog
2 house mouse street
I have a list of phrases in which I want to replace certain words with a similar word, in case they are misspelled.
How can I search a string for a matching word and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful with full in a1.
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
                paste("\\b",
                      hunspell(a3)[[1]],
                      "\\b",
                      collapse = "", sep = ""),
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment: with your new example the issue is back to words showing up inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" will match "thin", "think", "thinking", etc., but if you bracket it with \\b the pattern is anchored to word boundaries: \\bthin\\b will only match "thin".
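A quick check of the difference (illustration only):
grepl("thin", c("thin", "think", "nothing"))
# [1] TRUE TRUE TRUE
grepl("\\bthin\\b", c("thin", "think", "nothing"))
# [1]  TRUE FALSE FALSE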
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
Create a list of the corrections, then replace them using gsubfn, which is a generalization of gsub that can also take a list, function or proto object as the replacement. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks the match up in the list's names and, if found, replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use approximate string matching (see stringdist::amatch). Here is an example:
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus) {
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)]  # you can change the distance, see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"
I am a beginner in R (I used Matlab before), and I have been searching for a solution to my problem but cannot seem to find one.
I have a very large vector with text entries, something like:
CAT06
6CAT
CAT 6
DOG3
3DOG
I would like to find a function such that: if an entry contains "CAT" and "6" (no matter the position), substitute cat6; if an entry contains "DOG" and "3" (no matter the position), substitute dog3. So the outcome should be:
cat6 cat6 cat6 dog3 dog3
Can anybody help with this? Thank you very much, I find myself a bit lost!
First remove blank spaces, i.e. turn elements like "CAT 6" into "CAT6":
sp = gsub(" ", "", c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG"))
Then use some regex magic to find any combination of "CAT", "0", "6" and replace these matches with "cat6" as follows:
sp = gsub("^(?:CAT|0|6)*$", "cat6", sp)
Same for the DOG case:
sp = gsub("^(?:DOG|0|3)*$", "dog3", sp, perl = TRUE)
The input shown in the question is ambiguous as per my comment under the question. We show how to calculate it depending on which of three assumptions was intended.
1) Vector input with embedded spaces. Remove the digits and spaces ("[0-9 ]") in the first gsub and remove the non-digits ("\\D") in the second gsub, converting to numeric to avoid leading zeros, and then paste together:
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG") # test input
paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
2) Single string. Form chars by removing all digits and scanning the result in. Then form nums by removing everything except digits and spaces and scanning the result. Finally paste these together.
x2 <- "CAT06 6CAT CAT 6 DOG3 3DOG" # test input
chars <- scan(textConnection(gsub("\\d", "", x2)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", x2)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
or, if a single output string is wanted, add this:
paste(y, collapse = " ")
3) Vector input without embedded spaces. Reduce this to case (2) and then apply (2).
x3 <- c("CAT06", "6CAT", "CAT", "6", "DOG3", "3DOG") # test input
xx <- paste(x3, collapse = " ")
chars <- scan(textConnection(gsub("\\d", "", xx)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", xx)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
Note that this actually works for all three inputs: if we replace x3 with x1 or x2 it still works. As with (2), if a single output string is wanted, add paste(y, collapse = " ").