R Exact match strings in two columns - r

I have a data frame of the following form:
Column1 = c('Elephant,Starship Enterprise,Cat','Random word','Word','Some more words, Even more words')
Column2=c('Rat,Starship Enterprise,Elephant','Ocean','No','more')
d1 = data.frame(Column1,Column2)
What I want to do is to look for and count the exact match of words in column 1 and column 2. Each column can have multiple words separated by a comma.
For example in row 1, we see there are two common words a) Starship Enterprise and b) Elephant. However, in row 4, even though the word "more" appears in both columns, the exact string (Some more words and Even more words) do not appear. The expected output would be something like this.
Any help will be appreciated.

Split columns on comma and count the intersection of words
mapply(function(x, y) length(intersect(x, y)),
strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))
#[1] 2 0 0 0
Or a tidyverse way
library(tidyverse)
d1 %>%
mutate(Common = map2_dbl(Column1, Column2, ~
length(intersect(str_split(.x, ",")[[1]], str_split(.y, ",")[[1]]))))
# Column1 Column2 Common
#1 Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant 2
#2 Random word Ocean 0
#3 Word No 0
#4 Some more words, Even more words more 0

We can do this with cSplit
library(splitstackshape)
library(data.table)
v1 <- cSplit(setDT(d1, keep.rownames = TRUE), 2:3, ",", "long")[,
length(intersect(na.omit(Column1), na.omit(Column2))), rn]$V1
d1[, Common := v1][, rn := NULL][]
# Column1 Column2 Common
#1: Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant 2
#2: Random word Ocean 0
#3: Word No 0
#4: Some more words, Even more words more 0

Related

Split columns in a dataframe into a column that contains text not numbers and a column that contains numbers not text in R

Here is a simplified version of data I am working with:
a<-c("There are 5 programs", "2 - adult programs, 3- youth programs","25", " ","there are a number of programs","other agencies run our programs")
b<-c("four", "we don't collect this", "5 from us, more from others","","","")
c<-c(2,6,5,8,2,"")
df<-cbind.data.frame(a,b,c)
df$c<-as.numeric(df$c)
I want to keep both the text and numbers from the data b/c some of the text is important
expected output:
What I think makes sense is the following:
id all columns that have text in them, perhaps in a list (because some columns are just numbers)
subset columns from step 1 to a new dataframe lets call this df1
delete the subsetted columns in df1 from df
split all the columns in df1 into 2 columns, one that keeps the text and one that has the number.
bind the new spit columns from df1 into the orginal df
What I am struggling with is steps 1-2 and 4. I am okay with the characters (e.g., - and ') being excluded or included. There is additional processing I have to do after (e.g., when there are multiple numbers in a column after splitting I will need to split and add these and also address the written numbers), but those are things I can do.
Here's a dplyr solution using regular expression:
library(stringr)
library(dplyr)
df %>%
mutate(
a.text = gsub("(^|\\s)\\d+", "", a),
a.num = str_extract_all(a, "\\d+"),
b.text = gsub("(^|\\s)\\d+", "", b),
b.num = str_extract_all(b, "\\d+")
) %>%
select(c(4:7,3))
a.text a.num b.text b.num c
1 There are programs 5 four 2
2 - adult programs,- youth programs 2, 3 we don't collect this 6
3 25 from us, more from others 5 5
4 8
5 there are a number of programs 2
6 other agencies run our programs NA
Here is what I would do with my preferred tools. The solution will work with arbitrary numbers of arbitrarily named character and non-character columns.
library(data.table) # development version 1.14.3 used here
library(magrittr) # piping used to improve readability
num <- \(x) stringr::str_extract_all(x, "\\d+", simplify = TRUE) %>%
apply(1L, \(x) sum(as.integer(x), na.rm = TRUE))
txt <- \(x) stringr::str_remove_all(x, "\\d+") %>%
stringr::str_squish()
setDT(df)[, lapply(
.SD, \(x) if (is.character(x)) data.table(txt = txt(x), num = num(x)) else x)]
which returns
a.txt a.num b.txt b.num c
<char> <int> <char> <int> <num>
1: There are programs 5 four 0 2
2: - adult programs, - youth programs 5 we don't collect this 0 6
3: 25 from us, more from others 5 5
4: 0 0 8
5: there are a number of programs 0 0 2
6: other agencies run our programs 0 0 NA
Explanation
num() is a function which uses the regular expression \\d+ to extract all strings which consist of contiguous digits (aka integer numbers), coerces them to type integer, and computes the rowwise sum of the extracted numbers (as requested in OP's last sentence).
txt() is a function which removes all strings which consist of contiguous digits (aka integer numbers), removes whitespace from start and end of the strings and reduces repeated whitespace inside the strings.
\(x) is a new shortcut for function(x) introduced with R version 4.1
The next steps implement OP's proposed approach in data.table syntax, by and large:
lapply(.SD, ...) loops over each column of df.
if the column is character both functions txt() and num() are applied. The two resulting vectors are turned into a data.table as a partial result. Note that cbind() cannot be used here as it would return a character matrix.
if the column is non-character it is returned as is.
The final result is a data.table where the column names have been renamed automagically.
This approach keeps the relative position of columns.

How to add missing zeros in a unique identifier that is missing some values using R?

I have a unique id that should in total contain 13 characters, 15 with dash. It should look like this
2005-067-000043
However some entries might be like this
2005-067-00043 or 2005-67-000043 or 2005-067-0000043
I would like a script that says between first and second dash there should be three characters, if more cut zeros in front and if less add zero in front. Same goes for the last section where it says after last dash there should be six characters if less add zero in front or if more cut zero in front.
You can split up the data into 3 columns, keep only 3 and 6 characters in 2nd and 3rd column and combine the columns into one again.
library(dplyr)
library(tidyr)
separate(df, x, paste0('col', 1:3), sep = '-') %>%
mutate(col2 = sprintf('%03s', substring(col2, nchar(col2) - 2)),
col3 = sprintf('%06s', substring(col3, nchar(col3) - 5))) %>%
unite(result, starts_with('col'), sep = '-')
# result
#1 2005-067-000043
#2 2005-067-000043
#3 2005-067-000043
#4 2005-067-000043
x <- c('2005-067-000043', '2005-067-00043', '2005-67-000043', '2005-067-0000043')
df <- data.frame(x)
df
# x
#1 2005-067-000043
#2 2005-067-00043
#3 2005-67-000043
#4 2005-067-0000043

Filter rows where any values in a vector are contained in a column

I have a dataset with a single column that contains multiple ICD-10 codes separate by spaces, eg
Identifier Codes
1 A14 R17
2 R069 D136 B08
3 C11 K71 V91
I have a vector with the ICD-10 codes that are relevant to my analysis, eg goodcodes<-c("C11","A14","R17","O80"). I want to select rows from my dataset where the Codes column contains any of the codes in my vector, but does not need to exactly match a code in my vector.
Using medicalinfo<-filter(medicalinfo, Codes %in% goodcodes) returns only rows where a single matching code is listed in the Codes column. I could also filter based on a partial string, I only know how to do that for a single partial string, not all of those in my codes vector.
Is there a way to get all the rows where any of these codes are present in the column?
One trick is to combine the goodcodes into a regular expression:
library(dplyr)
ptn <- paste0("\\b(", paste(goodcodes, collapse = "|"), ")\\b")
ptn
# [1] "\\b(C11|A14|R17|O80)\\b"
FYI, the \\b( and )\\b are absolutely necessary if there's a chance that you will have codes A10 and A101; without \\b(...)\\b, then grepl("A10", "A101") will be a false-positive. See
grepl("A10|B20", "A101")
# [1] TRUE
grepl("\\b(A10|B20)\\b", "A101")
# [1] FALSE
Finally, let's use that ptn:
dat %>%
filter(grepl(ptn, Codes))
# Identifier Codes
# 1 1 A14 R17
# 2 3 C11 K71 V91
Another way is to split the Codes column into a list of individual codes, and look for membership with %in%:
sapply(strsplit(trimws(dat$Codes), "\\s+"), function(a) any(a %in% goodcodes))
# [1] TRUE FALSE TRUE
Depending on how complex things are, a third way is to "unnest" Codes and look for matches.
dat %>%
mutate(Codes = strsplit(trimws(Codes), "\\s+")) %>%
tidyr::unnest(Codes) %>%
group_by(Identifier) %>%
filter(any(Codes %in% goodcodes)) %>%
ungroup()
# # A tibble: 5 x 2
# Identifier Codes
# <dbl> <chr>
# 1 1 A14
# 2 1 R17
# 3 3 C11
# 4 3 K71
# 5 3 V91
(If you really prefer them combined into a single space-delimited string as before, that's easy enough to do with group_by(Identifier) %>% summarize(Codes = paste(Codes, collapse = " ")). I don't recommend it, per se, since I prefer to have that type of information broken out like this, but there is likely context I don't know.)
With subset from base R. Loop over the 'goodcodes' vector, use that as pattern in grepl, Reduce the list of logical vectors into a single logical vector to subset the rows
subset(dat, Reduce(`|`, lapply(goodcodes, function(x) grepl(x, Codes))))
# Identifier Codes
#1 1 A14 R17
#3 3 C11 K71 V91
data
dat <- structure(list(Identifier = 1:3, Codes = c("A14 R17", "R069 D136 B08",
"C11 K71 V91")), class = "data.frame", row.names = c(NA, -3L))

R: Applying gsub to data frames returns NAs

I am trying to convert a data frame that contains numbers and blanks to numeric. Currently, numbers are in factor format and some have ",".
df <- data.frame(num1 = c("123,456,789", "1,234,567", "1,234", ""), num2 = c("","1,012","","202"))
df
num1 num2
1 123,456,789
2 1,234,567 1,012
3 1,234
4 202
Remove "," and convert to numeric format:
df2 = as.numeric(gsub(",","",df))
Warning message:
NAs introduced by coercion
Interestingly, if I perform the same function column by column, it worked:
df$num1 = as.numeric(gsub(",","",df$num1))
df$num2 = as.numeric(gsub(",","",df$num2))
df
num1 num2
1 123456789 NA
2 1234567 1012
3 1234 NA
4 NA 202
My questions are 1. What is the cause and if there is a way to avoid converting them column by column since the actual data frame has lots more columns; and 2. What would be the best way to remove NAs or replace them by 0s for future numeric operations? I know I can use gsub to do so but just wondering if there is a better way.
We can use replace_na after replace the , with '' (str_replace_all)
library(dplyr)
library(stringr)
df %>%
mutate_all(list(~ str_replace_all(., ",", "") %>%
as.numeric %>%
replace_na(0)))
# num1 num2
#1 123456789 0
#2 1234567 1012
#3 1234 0
#4 0 202
The issue with gsub/sub is that it works on vector as described in the ?gsub
x, text -
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.
We can loop over the columns, apply the gsub, and assign the output back to the original dataset
df[] <- lapply(df, function(x) as.numeric(gsub(",", "", x)))
df[is.na(df)] <- 0 # change the NA elements to 0

Remove first n words and take count

I have a dataframe with text column, I need to ignore or eliminate first 2 words and take count of string in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
"hi john what can I do for you?")
Expected output in dataframe 'b': how can we remove first 2 words, so that count of 'what can I do for you? = 2
You can use gsub to remove the first two words and then tapply and count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove any range of words, we can amend i1 as follows,
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
b=data.frame(text=c("hello sunitha what can I do for you?","hi john what can I do for you?"),stringsAsFactors = FALSE)
b$processed = sapply(b$text, function(x) (strsplit(x," ")[[1]]%>%.[-c(1:2)])%>%paste0(.,collapse=" "))
b$count = sapply(b$processed, function(x) length(strsplit(x," ")[[1]]))
> b
text processed count
1 hello sunitha what can I do for you? what can I do for you? 6
2 hi john what can I do for you? what can I do for you? 6
Are you looking for something like this? watch out for stringsAsFactors = FALSE else your texts will be factor type and harder to work on.

Resources