here is the example data:
example_sentences <- data.frame(doc_id = c(1,2,3),
sentence_id = c(1,2,3),
sentence = c("problem not fixed","i like your service and would tell others","peope are nice however the product is rubbish"))
matching_df <- data.frame(x = c("not","and","however"))
Created on 2019-01-07 by the reprex package (v0.2.1)
I want to add/insert a comma just before a certain word in a character string. For example, if my string is:
problem not fixed.
I want to convert this to
problem, not fixed.
The second data frame, matching_df, contains the words to match (these are coordinating conjunctions): if a word from matching_df$x is found in a sentence, a comma + space should be inserted before the detected word.
I have looked at the stringr package but am not sure how to achieve this.
Best,
I've no idea what the data frame you're talking about looks like, but I made a simple data frame containing some phrases here:
df <- data.frame(strings = c("problems not fixed.","Help how are you"),stringsAsFactors = FALSE)
I then made a vector of words to put a comma after:
words <- c("problems","no","whereas","however","but")
Then I put the data frame of phrases through a simple for loop, using gsub to substitute the word for a word + comma:
for (i in seq_along(df$strings)) {
  string <- df$strings[i]
  # Words from the target list that appear in this phrase
  findWords <- intersect(unlist(strsplit(string, " ")), words)
  if (length(findWords) > 0) {  # intersect() returns character(0), never NULL
    for (j in findWords) {
      # \\b keeps e.g. "no" from also matching inside "not"
      string <- gsub(paste0("\\b", j, "\\b"), paste0(j, ","), string)
    }
    df$strings[i] <- string
  }
}
Output:
df
strings
1 problems, not fixed.
2 Help how are you
The gsubfn function in the gsubfn package takes a regular expression as its first argument and a list (or certain other objects) as its second argument, where the names of the list are the strings to be matched and the values are the replacement strings.
library(gsubfn)
gsubfn("\\w+", as.list(setNames(paste0(matching_df$x, ","), matching_df$x)),
format(example_sentences$sentence))
giving:
[1] "problem not, fixed "
[2] "i like your service and, would tell others "
[3] "peope are nice however, the product is rubbish"
I have a Text column with thousands of rows of paragraphs, and I want to extract the values of "Capacity > x%". The operation sign can be >, <, =, ~, etc. I basically need the operation sign and integer value (e.g. <40%) and want to place it in a column next to the text, in the same row. I have tried removing before/after text, gsub, grep, grepl, str_extract, etc., none with good results. I am not sure if the percentage sign is throwing it off or I am just not getting the code structure right. I would appreciate your assistance.
Here are some codes I have tried (aa is the df, TEXT is col name):
str_extract(string =aa$TEXT, pattern = perl("(?<=LVEF).*(?=%)"))
gsub(".*[Capacity]([^.]+)[%].*", "\\1", aa$TEXT)
genXtract(aa$TEXT, "Capacity", "%")
gsub("%.*$", "%", aa$TEXT)
grep("^Capacity.*%$",aa$TEXT)
Since you did not provide a reproducible example, I created one myself and used it here.
We can use sub to extract everything after "Capacity" up to a number followed by the % sign.
sub(".*Capacity(.*\\d+%).*", "\\1", aa$TEXT)
#[1] " > 10%" " < 40%" " ~ 230%"
Or with str_extract
stringr::str_extract(aa$TEXT, "(?<=Capacity).*\\d+%")
data
aa <- data.frame(TEXT = c("This is a temp text, Capacity > 10%",
"This is a temp text, Capacity < 40%",
"Capacity ~ 230% more text ahead"), stringsAsFactors = FALSE)
gsub solution
I think your gsub solution was pretty close, but didn't bring along the percentage sign as it's outside the brackets. So something like this should work (the result is assigned to the capacity column):
aa$capacity <- gsub(".*[Capacity]([^.]+%).*", "\\1", aa$TEXT)
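Run against the aa data frame defined above, this captures the operator and value (with a leading space you may want to trim):
aa$capacity
#> [1] " > 10%"  " < 40%"  " ~ 230%"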
Alternative method
The gsub approach will match the whole string when there is no operator match. To avoid this, we can use the stringr package with a more specific regular expression:
library(magrittr)
library(dplyr)
library(stringr)
aa %>%
mutate(capacity = str_extract(TEXT, "(?<=Capacity\\s)\\W\\s?\\d+\\s?%")) %>%
mutate(capacity = str_squish(capacity)) # Remove excess white space
This code will give NA when there is no match, which I believe is your desired behaviour.
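As a quick illustration (a sketch assuming the aa data frame from above, plus one extra row with no measurement):
aa2 <- rbind(aa, data.frame(TEXT = "No measurement reported here",
                            stringsAsFactors = FALSE))
stringr::str_extract(aa2$TEXT, "(?<=Capacity\\s)\\W\\s?\\d+\\s?%")
#> [1] "> 10%"  "< 40%"  "~ 230%" NA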
I would like to use R to compare written text and extract sections which differ between the elements.
Consider a and b two text paragraphs. One is a modified version of the other:
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."
I want to compare the two strings and receive the part of the string which is unique to either of the two as output, preferably separate for both input strings.
Expected output:
stringdiff <- list(a = " This part is old.", b = "This string is updated. ")
> stringdiff
$a
[1] " This part is old."
$b
[1] "This string is updated. "
I've tried a solution from Extract characters that differ between two strings, but this only compares unique characters. The answer in Simple Comparing of two texts in R comes closer, but still only compares unique words.
Is there any way to get the expected output without too much of a hassle?
We split both strings at the space after the . to create a list of sentences ('lst'), find the sentences common to both elements ('common') using intersect, and then use setdiff to keep, for each element, the sentences that are not in 'common':
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)
common <- Reduce(intersect, lst)   # sentences shared by both strings
lapply(lst, setdiff, y = common)   # sentences unique to each string
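For the example strings this returns the sentences unique to each input; note that splitting consumes the separating spaces shown in the expected output:
#> $a
#> [1] "This part is old."
#>
#> $b
#> [1] "This string is updated."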
I need to write a function that finds the most common word in a string of text, where a "word" is any sequence of characters separated by spaces. It should return the most common word.
For general purposes, it is better to use boundary("word") in stringr:
library(stringr)
most_common_word <- function(s){
which.max(table(s %>% str_split(boundary("word"))))
}
sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)
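As written, most_common_word returns a named index into the table; if you want just the word itself, a small variant (a sketch) is:
most_common_word2 <- function(s){
  counts <- table(unlist(str_split(s, boundary("word"))))
  names(which.max(counts))  # return the word rather than its table position
}
most_common_word2(sentence)
#> [1] "a"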
Here is a function I designed. Notice that I split the string on white space, removed any leading or trailing white space, removed ".", and converted all upper case to lower case. Finally, if there is a tie, the first word is always reported. These are assumptions you should think about for your own analysis.
# Create example string
string <- "This is a very short sentence. It has only a few words."
library(stringr)
most_common_word <- function(string){
string1 <- str_split(string, pattern = " ")[[1]] # Split the string
string2 <- str_trim(string1) # Remove white space
string3 <- str_replace_all(string2, fixed("."), "") # Remove dot
string4 <- tolower(string3) # Convert to lower case
word_count <- table(string4) # Count the word number
return(names(word_count[which.max(word_count)][1])) # Report the most common word
}
most_common_word(string)
[1] "a"
Hope this helps:
most_common_word <- function(x){
  # Split the string into single words for counting
  splitTest <- strsplit(x, " ")
  # Count the words (unlist so table() sees a character vector, not a list)
  count <- table(unlist(splitTest))
  # Sort to select only the highest value, which is the first one
  count <- count[order(count, decreasing = TRUE)][1]
  # Return the desired character.
  # By changing this you can choose whether it shows the number of times a word repeats
  return(names(count))
}
You can use return(count) to show the word plus the number of times it repeats. This function has problems when two words are repeated the same number of times, so beware of that.
The order function puts the highest value first (when used with decreasing = TRUE); ties are then broken by the names, which table() sorts alphabetically. So if the words 'a' and 'b' are repeated the same number of times, only 'a' gets returned by most_common_word.
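If you would rather get every word tied for the top count, a hedged tweak is to keep all entries equal to the maximum:
most_common_words <- function(x){
  count <- table(unlist(strsplit(x, " ")))
  names(count[count == max(count)])
}
most_common_words("b b a a c")
#> [1] "a" "b"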
Using the tidytext package, taking advantage of established parsing functions:
library(tidytext)
library(dplyr)
word_count <- function(test_sentence) {
unnest_tokens(data.frame(sentence = test_sentence,
stringsAsFactors = FALSE), word, sentence) %>%
count(word, sort = TRUE)
}
word_count("This is a very short sentence. It has only a few words.")
This gives you a table with all the word counts. You can adapt the function to obtain just the top one, but be aware that there will sometimes be ties for first, so perhaps it should be flexible enough to extract multiple winners.
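For example, a hedged adaptation that keeps all tied winners could filter on the maximum count:
top_words <- function(test_sentence) {
  word_count(test_sentence) %>%
    filter(n == max(n))
}
top_words("one fish two fish")
#>   word n
#> 1 fish 2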
This is what my text file looks like:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
All these values run together with no delimiter, and each character position indicates something. I am unable to figure out how to separate each value into columns and specify a heading for all the new columns which will be created.
I used this code, however it does not seem to work:
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngthā€¯))
Any help would be appreciated.
Assuming that you have a clear definition for every column, you can use regular expressions to solve this in no time.
From your column names and example data, I guess that the regular expression that matches each field is:
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all this together in a regular expression that captures each of these fields
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
Note: In R we need to escape "\", which is why we write \\d instead of \d.
That said, here comes the code to solve the problem.
First, you need to read your strings
lines <- readLines("birthweighthw1.txt")
Now, we define our regular expression and use the function str_match from the package stringr to get your data into character matrix.
require(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
You can check that the first column in the matrix contains the text matched, and the following columns the data captured. So, we can get rid of the first column
captured <- captured[,-1]
and transform it into a data.frame with appropriate column names
result <- as.data.frame(captured,stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Now every column in result is of type character; you can transform each of them into an appropriate type. For example:
require(dplyr)
result <- result %>% mutate(ethnic=as.factor(ethnic),
age=as.integer(age),
smoke=as.factor(smoke),
preweight=as.numeric(preweight),
delweight=as.numeric(delweight),
breastfed=as.factor(breastfed),
brthwght=as.integer(brthwght),
brthlngth=as.numeric(brthlngth)
)
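As a quick sanity check (hedged, against the first data line from the question), the parsed first row should read:
head(result, 1)
#>   ethnic age smoke preweight delweight breastfed brthwght brthlngth
#> 1      1  24     1    105.41    129.97         Y      317     52.03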
If I have a vector of strings:
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
and want to find the entries with '2011' in the string, I can use
ifiles <- dd[grep("2011",dd)]
How do I search for entries with a combination of strings included, without using a loop?
For example, I would like to find the entries with both '2011' and 'sprd' in the string, which in this case will only return
sflxgrbfg_sprd_2011
How can this be done? I could define a variable
toMatch <- c('2011','sprd)
and then loop through the entries but I was hoping there was a better solution?
Note: to make this useful for different strings, is it also possible to determine which entries contain these strings without them being in the order shown? For example, 'sflxlgrbfg_2011_sprd'.
If you want to find more than one pattern, try indexing with logical values rather than numeric indices. That way you can create an "and" condition, where only strings containing both patterns are extracted.
ifiles <- dd[grepl("2011",dd) & grepl("sprd_",dd)]
Try
grep('2011_sprd|sprd_2011', dd, value=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
Or using an example with more patterns
grep('(?<=sprd_).*(?=2011)|(?<=2011_).*(?=sprd)', dd1,
value=TRUE, perl=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
#[3] "sfxl_2011_14334_sprd" "sprd_124334xsff_2011_1423"
data
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012",
"sflxlgrbfg_2011_sprd")
dd1 <- c(dd, "sfxl_2011_14334_sprd", "sprd_124334xsff_2011_1423")
If you want a scalable solution, you can use lapply, Reduce and intersect to:
For each expression in toMatch, find the indices of all matches in dd.
Keep only those indices that are found for all expressions in toMatch.
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
dd <- c(dd, "sflxgrbfh_sprd_2011")
toMatch <- c('bfg', '2011','sprd')
dd[Reduce(intersect, lapply(toMatch, grep, dd))]
#> [1] "sflxgrbfg_sprd_2011" "sflxgrbfg_sprd2_2011"
Created on 2018-03-07 by the reprex package (v0.2.0).
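If this is needed repeatedly, a small wrapper (a sketch) keeps the call tidy:
match_all <- function(patterns, x) x[Reduce(intersect, lapply(patterns, grep, x))]
match_all(toMatch, dd)
#> [1] "sflxgrbfg_sprd_2011"  "sflxgrbfg_sprd2_2011"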