How do I find differing words in two strings, sentence-wise?

I am comparing two similar texts. x1 is the model text and x2 is the text with mistakes (e.g. spelling errors, new characters, etc.). I am trying to remove the words found in both texts. Since my actual text is not in English, I cannot use a dictionary.
What I have tried is to step through each character of x1 and, if the same character appears in x2, delete it from x2 and move on to the next character of x1.
Code I've been working on:
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."
library(tidyverse)
x1 <- str_split(x1, "(?<=\\.)\\s")
x1 <- lapply(x1,tolower)
x2 <- str_split(x2, "(?<=\\.)\\s")
x2 <- lapply(x2,tolower)
delete_a_from_b <- function(a, b) {
  a_as_list <- str_remove_all(a, "word") %>%
    str_split(boundary("character")) %>%
    unlist()
  b_n <- nchar(b)
  b_as_list <- str_remove_all(b, "word") %>%
    str_split(boundary("character")) %>%
    unlist()
  previous_j <- 1
  for (i in 1:length(a_as_list)) {
    if (previous_j > length(b_as_list))
      break
    for (j in previous_j:length(b_as_list)) {
      if (a_as_list[[i]] == b_as_list[[j]]) {
        b_as_list[[j]] <- ""
        previous_j <- j + 1
        break
      }
    }
  }
  print(paste0(b_as_list, collapse = ""))
  paste0(b_as_list, collapse = "")
}
x3 <- delete_a_from_b(x1,x2)
x3 <- strsplit(x3,"\\s")
Output:
x3
[[1]]
[1] "text" "this" "i" "i" "d?am" "clueless" "thius" "coing.\","
[9] "\"ley’s" "dythsssoon" "as" "possibk"
What I want as a result is: 'text' 'this' 'id' 'thius' 'ley’s' 'ythis' 'possiblke'

I take it you want to compare the two strings x1 and x2 sentence by sentence, which is not really clear in the question. The previous solutions do not take this into account.
Try this:
First, split both strings into sentences:
x1_sentences <- unlist(strsplit(tolower(x1), split = "[.?!] "))
x2_sentences <- unlist(strsplit(tolower(x2), split = "[.?!] "))
length(x1_sentences) == length(x2_sentences) # Make sure same number of resulting sentences
Then, for each sentence, split the two vectors again and show the difference in words:
for (i in 1:length(x1_sentences)) {
  x1_vector <- unlist(strsplit(x1_sentences[i], split = "[ ]"))
  x2_vector <- unlist(strsplit(x2_sentences[i], split = "[ ]"))
  print(setdiff(x2_vector, x1_vector)) # The order here is important!
}
Gives (which you can easily turn into a new vector):
[1] "text"
[1] "this"
[1] "id"
[1] "thius"
[1] "ley’s" "ythis" "possiblke."
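To collect the differences into a single vector instead of printing them, a minimal sketch along the same lines:
diffs <- unlist(lapply(seq_along(x1_sentences), function(i) {
  setdiff(unlist(strsplit(x2_sentences[i], split = "[ ]")),
          unlist(strsplit(x1_sentences[i], split = "[ ]")))
}))
diffs
# [1] "text" "this" "id" "thius" "ley’s" "ythis" "possiblke."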

I think I did it. Is this what you need?
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."
x1 <- tolower(x1) # lowercase first, so case differences don't show up as extra words
x2 <- tolower(x2)
x1_w <- strsplit(x1, " ")[[1]]
x2_w <- strsplit(x2, " ")[[1]]
`%notin%` <- Negate(`%in%`)
x2_w[which(x2_w %notin% x1_w)]
# same as:
setdiff(x2_w,x1_w)
# out:
#> x2_w[which(x2_w %notin% x1_w)]
#[1] "text" "id" "thius" "ley’s" "ythis" "possiblke."


Convert a quosure with dashes to a string?

When I do:
> quo(DLX6-AS1)
The output is:
<quosure>
expr: ^DLX6 - AS1
env: global
Which inserts spaces around the dash.
When I try to convert that to a string, I get either:
quo(DLX6-AS1) %>% quo_name
"DLX6 - AS1"
or
quo(DLX6-AS1) %>% rlang::quo_name
or
quo(`DLX6-AS1`) %>% rlang::quo_name
Error: Can't convert a call to a string
How can I make it possible to use strings with dashes in my function? The function takes in a gene name and looks up that row in a dataframe, but some of the gene names contain a dash:
geneFn <- function(exp.df = seurat.object@data, gene = SOX2) {
  gene <- enquo(gene)
  exp.df <- exp.df[as_name(gene), ]
}
> geneFn(DLX6-AS1)
Thanks!
This has been asked before here: https://github.com/r-lib/rlang/issues/770, but it doesn't answer how to actually do this.
What version of rlang do you have? For me this works:
quo(`DLX6-AS1`) %>% quo_name()
#> [1] "DLX6-AS1"
You do need to use backticks when column names have special characters, otherwise they are interpreted as code.
Note that it is recommended to use either as_name() or as_label() instead of quo_name(), the latter was a misleading misnomer and might be deprecated in the future.
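For example (a quick sketch: as_name() expects a symbol or string, while as_label() deparses any expression):
library(rlang)
as_name(quo(`DLX6-AS1`))  # "DLX6-AS1" (unwraps the backticked symbol)
as_label(quo(DLX6-AS1))   # "DLX6 - AS1" (deparses the call, spaces included)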
One option would be to stick with bare row names but wrap names that aren't syntactically valid (like names with dashes) in backticks. This could be confusing if someone else is supposed to use this function.
Here's a small, reproducible example:
library(rlang)
dat = data.frame(x1 = letters[1:2],
                 x2 = LETTERS[1:2])
row.names(dat) = c("DLX6-AS1", "other")
geneFn <- function(exp.df = dat, gene = other) {
  gene <- enquo(gene)
  exp.df[as_name(gene), ]
}
geneFn(gene = other)
# x1 x2
# other b B
geneFn(gene = `DLX6-AS1`)
# x1 x2
# DLX6-AS1 a A
If you have many names like this, it may be simpler to pass quoted names instead of bare names. This also simplifies the function a bit since you don't need tidyeval.
geneFn2 <- function(exp.df = dat, gene = "other") {
  exp.df[gene, ]
}
geneFn2(gene = "other")
# x1 x2
# other b B
geneFn2(gene = "DLX6-AS1")
# x1 x2
# DLX6-AS1 a A
Another option is to make syntactically valid names row names. The make.names() function can help with this.
make.names( row.names(dat) )
[1] "DLX6.AS1" "other"
Then you could assign these new row names to replace the old and go ahead with your original function with the new names.
row.names(dat) = make.names( row.names(dat) )
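With the cleaned-up row names, bare names become syntactically valid, so the original function can be called without backticks (a sketch using the example dat from above):
geneFn(gene = DLX6.AS1)
# x1 x2
# DLX6.AS1 a A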
What about:
geneFn <- function(exp.df = seurat.object@data, gene = SOX2) {
  gene <- sub(" - ", "-", deparse(enexpr(gene)))
  exp.df <- exp.df[gene, ]
}
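To see why this works, here is the name-recovery step in isolation (a small sketch; quote() stands in for enexpr() outside a function):
gene <- quote(DLX6-AS1)
deparse(gene)
# [1] "DLX6 - AS1"
sub(" - ", "-", deparse(gene))
# [1] "DLX6-AS1"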

How do I convert a string to a number in R if the string contains a letter?

I am currently helping a friend with his research and am gathering information about different natural disasters that occurred from 2004-2016. The data can be found using this link:
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
When you import it into R it gives helpful information. However, my friend, and now I, are only interested in State, Year, Month, Event, Type, County, direct and indirect deaths and injuries, and property damage, so first I am extracting the columns I need and will later in the code combine them back together. The data is currently read in as strings, and I need the Property Damage column to be numeric, since it holds cash values. For example, a data entry in that column may look like "8.6k", and I need it converted to 8600; all the "NA" entries should be replaced with 0.
I have the code below so far, but it gives me back a vector of NAs. Can anyone think of a better way of doing this?
State<- W2004$STATE
Year<-W2004$YEAR
Month<-W2004$MONTH_NAME
Event<-W2004$EVENT_TYPE
Type<-W2004$CZ_TYPE
County<-W2004$CZ_NAME
Direct_Death<-W2004$DEATHS_DIRECT
Indirect_Death<-W2004$DEATHS_INDIRECT
Direct_Injury<-W2004$INJURIES_DIRECT
Indirect_Injury<-W2004$INJURIES_INDIRECT
W2004$DAMAGE_PROPERTY<-as.numeric(W2004$DAMAGE_PROPERTY)
Damage_Property<-W2004$DAMAGE_PROPERTY
l <- cbind(State, Year, Month, Event, Type, County, Direct_Death,
           Indirect_Death, Direct_Injury, Indirect_Injury, Damage_Property)
print(l)
We can try using a case_when() expression here to map each type of unit to a bona fide number. Going with the two examples you actually showed us:
library(dplyr)
x <- c("1.00M", "8.6k")
result <- case_when(
  grepl("\\d+k$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000,
  grepl("\\d+M$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000000,
  TRUE ~ as.numeric(sub("\\D+$", "", x))
)
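For the sample vector above this yields:
result
# [1] 1000000    8600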
You can extract the letter and use switch(), which is easily maintainable; if you want to add additional symbols later, it is very easy.
First, the setup:
options(scipen = 999) # to prevent R from printing scientific numbers
library(stringr) # to extract letters
This is the sample vector:
numbers_with_letters <- c("1.00M", "8.6k", 50, NA) # NA included so all four results below are reproduced
Use lapply() to loop through the vector: extract the letter, map it to a number with switch(), strip the letter, convert to numeric, replace NAs with 0, and multiply:
lapply(numbers_with_letters, function(x) {
  letter <- str_extract(x, "[A-Za-z]")
  letter_to_num <- switch(letter,
                          k = 1000,
                          M = 1000000,
                          1) # 1 is the default option if no letter found
  numbers_with_letters <- as.numeric(gsub("[A-Za-z]", "", x))
  # remove all NAs and replace with 0
  numbers_with_letters[is.na(numbers_with_letters)] <- 0
  return(numbers_with_letters * letter_to_num)
})
This returns:
[[1]]
[1] 1000000
[[2]]
[1] 8600
[[3]]
[1] 50
[[4]]
[1] 0
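If you want a plain numeric vector rather than a list, you can wrap the same call in unlist() (a sketch of the identical logic):
unlist(lapply(numbers_with_letters, function(x) {
  letter <- str_extract(x, "[A-Za-z]")
  letter_to_num <- switch(letter, k = 1000, M = 1000000, 1)
  n <- as.numeric(gsub("[A-Za-z]", "", x))
  ifelse(is.na(n), 0, n) * letter_to_num
}))
# [1] 1000000    8600      50       0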
Maybe I'm oversimplifying here, but . . .
library(tidyverse)
data <- tibble(property_damage = c("8.6k", "NA"))
data %>%
  mutate(
    as_number = if_else(
      property_damage != "NA",
      str_extract(property_damage, "\\d+\\.*\\d*"),
      "0"
    ),
    as_number = as.numeric(as_number)
  )
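Note that this keeps only the digits, so "8.6k" becomes 8.6 rather than 8600. A sketch extending the same pipeline to apply the multiplier as well (assuming only k and M suffixes occur; the number and multiplier helper columns are just illustrative):
data %>%
  mutate(
    number = as.numeric(str_extract(property_damage, "\\d+\\.?\\d*")),
    multiplier = case_when(
      str_detect(property_damage, "k$") ~ 1e3,
      str_detect(property_damage, "M$") ~ 1e6,
      TRUE ~ 1
    ),
    as_number = coalesce(number * multiplier, 0)
  )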

Assigning new strings with conditional match

I have an issue with conditionally replacing strings with new ones. Below is a short version of my real problem; so far it is working, but I need a better solution since there are many rows in the real data.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace strings with replace_strings: the first item in strings is replaced with the first item in replace_strings.
replace_strings <- c("A1","A2","A3","A4")
So the final result should look like:
final_strings <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I wrote a simple function assign_new:
assign_new <- function(x){
  ifelse(grepl("A33", x), gsub("A33", "A1", x),
  ifelse(grepl("A32", x), gsub("A32", "A2", x),
  ifelse(grepl("A31", x), gsub("A31", "A3", x),
  ifelse(grepl("A30", x), gsub("A30", "A4", x), x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But say I have A1000 down to A1 and want to replace them with A1 up to A1000: I would need 1000 rows of ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches<-. First obtain the locations of all the characters that follow _ using regexpr, then replace, as shown below:
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = T),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastest, but very tractable and easy to maintain:
for (i in 1:length(strings)) {
  strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per @Onyambu's comment, removing map2_chr, as paste is a vectorized function.
foo <- function(x, y){
  x <- unlist(lapply(strsplit(x, "_"), '[', 1))
  paste(x, y, sep = "_")
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, then paste the result together with the respective replace_strings element.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
  y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
  x <- unlist(lapply(strsplit(x, "_"), '[', 1))
  paste(x, y, sep = "_")
}
foo(strings)
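The same lookup also works without a data frame, e.g. via a named character vector (a sketch, assuming each string contains exactly one _):
lookup <- setNames(replace_strings, sub(".*_", "", strings))
paste0(sub("_.*", "_", strings), lookup[sub(".*_", "", strings)])
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"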
Using the dplyr package:
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
             strings = gsub("_.*",
                            paste0("_", replace_strings),
                            strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"

How to find the next space after N characters in an R character vector?

I have a few large character vectors of varying lengths that I need to break into smaller lengths for processing within spacyr. I'm currently using substr() within lapply() to split into a list where each list item is 500K characters long.
However, I would like instead to split on the next space after about 500K characters, so as to avoid chopping a word in half. I'm not sure how to amend what I've come up with thus far. My current code looks like so:
#Pretend 'text' is my list of words
chars = c(letters, " ", ".")
text<-paste0(sample(chars, 3000000, replace=TRUE), collapse="")
#split to list of smaller vectors
text_segments <- lapply(seq(1, nchar(text), 500000), function(i) substr(text, i, i + 499999))
#do something with each
for (i in unique(text_segments)) {
  parsedtxt <- spacy_parse(i)
  ...
}
Each fake word in the above example is 3 letters long, but in my real files the words vary in length.
Any suggestions about approaching the space problem would be greatly appreciated. Code speed is not a concern, but I do appreciate efficiency suggestions nonetheless.
Maybe you can get inspiration from the code below and adapt it to your dataset.
First I will make up some data.
set.seed(1) # Make the results reproducible
y1 <- paste(sample(c(letters, " "), 1e3, TRUE), collapse = "")
y2 <- paste(sample(c(letters, " "), 1e3, TRUE), collapse = "")
str_list <- list(y1, y2)
Now, the function fun does the job. It uses gregexpr to get the locations of the spaces and then returns everything from the beginning of the input string up to the first space found after the threshold.
fun <- function(x, threshold){
  blanks <- gregexpr(" +", x)[[1]]
  substr(x, 1, blanks[which(blanks > threshold)[1]] - 1)
}
thresh <- 100
sub <- lapply(str_list, fun, thresh)
lapply(sub, nchar)
#[[1]]
#[1] 103
#
#[[2]]
#[1] 154
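To split an entire string into such chunks, rather than just take the first one, a sketch along the same lines (split_at_spaces is a made-up helper; swap in the 500K threshold from the question):
split_at_spaces <- function(x, threshold) {
  chunks <- character(0)
  while (nchar(x) > threshold) {
    blanks <- gregexpr(" ", x, fixed = TRUE)[[1]]
    cut <- blanks[which(blanks > threshold)[1]]
    if (is.na(cut)) break # no space after the threshold, keep the remainder whole
    chunks <- c(chunks, substr(x, 1, cut - 1))
    x <- substr(x, cut + 1, nchar(x))
  }
  c(chunks, x)
}
text_segments <- split_at_spaces(text, 500000)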

Text argument from a dataframe

I am trying to test this nice solution using a data frame as input for your_sentences.
library(koRpus) # remove_words() relies on koRpus::treetag()
remove_words <- function(sentence, badword="blame"){
  tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en",
                         TT.options=list(path="C:\\Treetagger", preset="en"))
  # Check for bad words AND verb:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 0, 1) == "V")
  redflag <- which(cond1 & cond2)
  # If no such case, return sentence as is. If so, then remove that word:
  if(length(redflag) == 0) return(sentence)
  else{
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse=" "))
  }
}
lapply(your_sentences, remove_words)
The data frame has 1 column and 351 rows. In the lapply call I pass my data frame and the column name, and I receive this error (the same error occurs when I use the data frame without naming the column):
> dfnew <- lapply(df$text, remove_words)
Error in writeLines(text, con = conn.tempfile) : invalid 'text' argument
What can I do to solve the error?
Example data:
df = data.frame(text = c('I blame myself for what happened', 'For what happened the blame is yours', 'I will blame you if my friend removes'))
What a bummer, I hoped that it was only a typo :-). But I have a second guess: you probably stepped into the difficulties caused by stringsAsFactors = TRUE. This might have caused passing the type factor instead of character to your function. Try the following:
df = data.frame(text = c('I blame myself for what happened'
                         , 'For what happened the blame is yours'
                         , 'I will blame you if my friend removes')
                , stringsAsFactors = FALSE)
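A quick way to check whether factors are the culprit (before R 4.0 the default was stringsAsFactors = TRUE, so data.frame() silently created factors):
class(df$text)
# [1] "character"   (with the old default this would have been "factor")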
Your strings seem to be saved as factors, and therefore remove_words is supplied with factor values instead of strings. Using stringsAsFactors = FALSE as an argument will solve the issue:
df <- data.frame(text = c('I blame myself for what happened',
                          'For what happened the blame is yours',
                          'I will blame you if my friend removes'),
                 stringsAsFactors = FALSE)
Or, if you have already defined your df with factors, you can convert the columns using df[] <- lapply(df, as.character) (the df[] form keeps df a data frame).
lapply(df$text, remove_words)
[[1]]
[1] "I myself for what happened"
[[2]]
[1] "For what happened the blame is yours"
[[3]]
[1] "I will you if my friend removes"
