Detecting the differences between two string vectors - r

I've got a data_frame that looks like this.
df <- data_frame(name = c('john','bill','amy'),
name.2 = c('johhn','ball','ammy'))
df
# A tibble: 3 x 2
name name.2
<chr> <chr>
1 john johhn
2 bill ball
3 amy ammy
I want to add a column that shows the difference between the two name(.2) columns. Like this:
df %>%
mutate(diff = c('h','a','m'))
# A tibble: 3 x 3
name name.2 diff
<chr> <chr> <chr>
1 john johhn h
2 bill ball a
3 amy ammy m
I'd prefer to find a solution that uses elements of tidyverse and stringr if possible, but I'll take it like I get it.

Using base R we can do something like:
diffc=diag(attr(adist(df$name,df$name.2, counts = TRUE), "trafos"))
transform(df,diff=regmatches(name.2,regexpr("[^M]",diffc)))
name name.2 diff
1 john johhn h
2 bill ball a
3 amy ammy m
Breakdown:
compute approximate string distance between df[,1] and df[,2]
d=adist(df$name,df$name.2, counts = TRUE)
obtain the diagonal of the transformation matrix:
e= diag(attr(d, "trafos"))
Find the position of the first character that is deleted, substituted or inserted, i.e. not maintained ("M"):
f=regexpr("[^M]",e)
extract the values of df[,2] at those specified positions:
df$diff=regmatches(df$name.2,f)
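To make the breakdown above concrete, here is a minimal sketch for a single pair of strings. The variable names (d, trafo, pos, changed) are illustrative and not part of the answer; the idea is only to show what the "trafos" attribute looks like and how its first non-"M" position is used to index into the second string.

```r
# One pair only: "bill" -> "ball" differs by a single substitution.
d <- adist("bill", "ball", counts = TRUE)

# The "trafos" attribute is a matrix of edit strings; M = match,
# S = substitution, I = insertion, D = deletion.
trafo <- attr(d, "trafos")[1, 1]

# Position of the first edit that is not a match.
pos <- regexpr("[^M]", trafo)

# Reuse that match position to pull the character out of the second string.
changed <- regmatches("ball", pos)
changed
```

Note that this picks up only the first edited position, so it suits pairs that differ by a single character, as in the question.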

You could use the vecsets library:
library(vecsets)
df$diff <- mapply(vsetdiff, strsplit(df$name.2, split = ""),
strsplit(df$name, split = ""))
df
# name name.2 diff
#1 john johhn h
#2 bill ball a
#3 amy ammy m
Note: it looks like you only want the characters in name.2 that are not in name, which is why the first argument to mapply is the strsplit of name.2.
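If installing vecsets is not an option, the same idea (a multiset difference on characters) can be sketched in base R. The helper name char_diff is hypothetical and not from any package; it removes from name.2, one for one, every character that also occurs in name, and returns what is left.

```r
# Hypothetical helper: characters of `b` that remain after cancelling,
# one for one, each character that also appears in `a`.
char_diff <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  for (ch in a_chars) {
    hit <- match(ch, b_chars)           # first remaining occurrence, or NA
    if (!is.na(hit)) b_chars <- b_chars[-hit]
  }
  paste(b_chars, collapse = "")
}

df <- data.frame(name = c("john", "bill", "amy"),
                 name.2 = c("johhn", "ball", "ammy"),
                 stringsAsFactors = FALSE)

# Apply the helper row by row.
df$diff <- mapply(char_diff, df$name, df$name.2, USE.NAMES = FALSE)
df
```

Because the helper works on one pair at a time, the same mapply call should also fit inside a dplyr mutate() if a tidyverse pipeline is preferred.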

Related

Counting number of strings despite multiple elements in one cell

I have a vector A <- c("Tom; Jerry", "Lisa; Marc")
and am trying to identify the number of occurrences of each name.
I already used the code:
sort(table(unlist(strsplit(A, ""))), decreasing = TRUE)
However, this code is only able to create output like this:
Tom; Jerry: 1 - Lisa; Marc: 1
I am looking for a way to count every name, even though two names are present in one cell. My preferred result would be:
Tom: 1 Jerry: 1 Lisa: 1 Marc:1
The split should be ; followed by zero or more spaces (\\s*)
sort(table(unlist(strsplit(A, ";\\s*"))), decreasing = TRUE)
-output
Jerry Lisa Marc Tom
1 1 1 1
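If the spacing around the separator is inconsistent (spaces before the semicolon as well as after it), a slightly more forgiving sketch is to split on the bare ";" and trim each piece afterwards. The input vector below is made up for illustration and is not the asker's data.

```r
# Illustrative input with irregular spacing around the separator.
A <- c("Tom; Jerry", "Lisa;Marc", "Tom ;  Lisa")

# Split on ";" only, then strip leading/trailing whitespace from each name.
names_vec <- trimws(unlist(strsplit(A, ";", fixed = TRUE)))

counts <- sort(table(names_vec), decreasing = TRUE)
counts
```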
Use separate_rows to split the strings, group_by the names and summarise them:
library(tidyverse)
data.frame(A) %>%
separate_rows(A, sep = "; ") %>%
group_by(A) %>%
summarise(N = n())
# A tibble: 4 × 2
A N
<chr> <int>
1 Jerry 1
2 Lisa 1
3 Marc 1
4 Tom 1

How I can rename elements in a column vector according to the starting name in R using dplyr?

I have a data frame that looks like this:
names           value
John123abc      1
George12894xyz  2
Mary789qwe      3
I want to rename all the values in the "names" column and keep only the name itself (not the trailing digits and characters). Imagine that the suffix differs for each name and that I have 100,000 rows. I think I need something like starts_with("John") = "John".
Ideally I want the new data frame to look like this:
names   value
John    1
George  2
Mary    3
How I can do this in R using dplyr?
library(tidyverse)
names = c("John123abc","George12894xyz","Mary789qwe")
value = c(1,2,3)
dat = tibble(names,value)
Using stringr::str_remove you could do:
library(tidyverse)
names = c("John123abc","George12894xyz","Mary789qwe")
value = c(1,2,3)
dat = tibble(names,value)
dat |>
mutate(names = str_remove(names, "\\d+.*$"))
#> # A tibble: 3 × 2
#> names value
#> <chr> <dbl>
#> 1 John 1
#> 2 George 2
#> 3 Mary 3
Using base R
dat$names <- trimws(dat$names, whitespace = "\\d+.*")
-output
> dat
# A tibble: 3 × 2
names value
<chr> <dbl>
1 John 1
2 George 2
3 Mary 3

How to separate rows into columns based on variable number of pattern matches per row

I have a dataframe like this:
df <- data.frame(
id = c("A","B"),
date = c("31/07/2019", "31/07/2020"),
x = c('random stuff "A":88876, more stuff',
'something, "A":1234, more "A":456, random "A":32078, more'),
stringsAsFactors = F
)
I'd like to create as many new columns as there are matches to a pattern; the pattern is (?<="A":)\\d+(?=,), i.e., "match the number if the string "A": is on its left and a comma is on its right".
The problems: (i) the number of matches may vary from row to row and (ii) the maximum number of new columns is not known in advance.
What I've done so far is this:
df[paste("A", 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))), sep = "")] <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
While 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))) may solve the problem of unknown number of new columns, I get a warning:
`Warning message:
In `[<-.data.frame`(`*tmp*`, paste("A", 1:max(lengths(str_extract_all(df$x, :
replacement element 2 has 3 rows to replace 2 rows`
and the assignment of the values is clearly incorrect:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 1234 88876
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 88876 456 88876
The correct output would be this:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 NA NA
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 1234 456 32078
Any idea?
Here's a somewhat pedestrian stringr solution:
library(stringr)
library(dplyr)
matches <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
ncols <- max(sapply(matches, length))
matches %>%
lapply(function(y) c(y, rep(NA, ncols - length(y)))) %>%
do.call(rbind, .) %>%
data.frame() %>%
setNames(paste0("A", seq(ncols))) %>%
cbind(df, .) %>%
tibble()
#> # A tibble: 2 x 6
#> id date x A1 A2 A3
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A 31/07/20~ "random stuff \"A\":88876, more stuff" 88876 <NA> <NA>
#> 2 B 31/07/20~ "something, \"A\":1234, more \"A\":456, ran~ 1234 456 32078
Created on 2020-07-06 by the reprex package (v0.3.0)
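For comparison, the padding step can also be sketched in base R only. The sketch below assumes the same lookaround pattern as the question (which needs perl = TRUE in base R) and uses `length<-` to extend each match vector with NA; the object names (matches, padded, res) are illustrative.

```r
df <- data.frame(
  id = c("A", "B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
        'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = FALSE
)

# All matches per row; lookbehind/lookahead require perl = TRUE.
matches <- regmatches(df$x, gregexpr('(?<="A":)\\d+(?=,)', df$x, perl = TRUE))

# Pad every row's matches with NA up to the longest row, then stack them.
ncols <- max(lengths(matches))
padded <- do.call(rbind, lapply(matches, `length<-`, ncols))
colnames(padded) <- paste0("A", seq_len(ncols))

res <- cbind(df, as.data.frame(padded, stringsAsFactors = FALSE))
res[, c("id", "A1", "A2", "A3")]
```

The new columns are character; wrap them in as.numeric() if numbers are needed.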

Colsum new dataframe

With this command it is possible to have a dataframe with the sum of every column
df <- data.frame(id = c(1,2,3), stock = c(3,1,4), bill = c(1,0,1), bear = c(3,2,5))
dfsum <- data.frame(colSums(df[-1]))
However this dataframe has only one column.
How can I produce a dataframe with two columns, one with the column names and the second with the sums?
You can do:
stack(colSums(df[-1]))
values ind
1 8 stock
2 2 bill
3 10 bear
Or using tibble:
enframe(colSums(df[-1]))
name value
<chr> <dbl>
1 stock 8
2 bill 2
3 bear 10
We can use tidyverse approaches with summarise_at and pivot_longer
library(dplyr)
library(tidyr)
df %>%
summarise_at(vars(-id), sum) %>%
pivot_longer(everything())
# name value
#1 stock 8
#2 bill 2
#3 bear 10
You can simply try apply.
apply(df[-1], 2, sum)
Result
stock bill bear
8 2 10
For data.frame
(df2 <- data.frame( freq = apply(df[-1], 2, sum)))
df2$var <- rownames(df2)
Result
var freq
stock 8
bill 2
bear 10
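If the goal is a plain two-column data frame without row names, one more base R sketch is to build it directly from the named vector that colSums returns. The column names var and freq simply mirror the answer above; sums and dfsum are illustrative object names.

```r
df <- data.frame(id = c(1, 2, 3), stock = c(3, 1, 4),
                 bill = c(1, 0, 1), bear = c(3, 2, 5))

# colSums returns a named numeric vector; split it into names and values.
sums <- colSums(df[-1])
dfsum <- data.frame(var = names(sums), freq = unname(sums),
                    stringsAsFactors = FALSE)
dfsum
```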

Count Bigrams independently of order of appearance

I'm trying to count bigrams independently of order, so that 'John Doe' and 'Doe John' are counted together as 2.
I already tried some text-mining examples such as those provided at https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html but couldn't find any count that ignores the order of appearance.
library('widyr')
word_pairs <- austen_section_words %>%
pairwise_count(word, section, sort = TRUE)
word_pairs
It counts the two orders separately, like this:
<chr> <chr> <dbl>
1 darcy elizabeth 144
2 elizabeth darcy 144
It should look like this:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 288
Thanks if anyone can help me.
This code works. There is probably something more efficient out there though.
# Create sample dataframe
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'))
# Break out first and last names
library(stringr)
df$first <- word(df$name, 1)
df$second <- word(df$name, 2)
# Reorder alphabetically
df$a <- ifelse(df$first < df$second, df$first, df$second)
df$b <- ifelse(df$first > df$second, df$first, df$second)
library(dplyr)
summarize(group_by(df, a, b), n())
# Yields
# a b `n()`
# <chr> <chr> <int>
#1 darcy elizabeth 2
#2 Doe John 2
#3 Smith Steve 1
Thanks, guys.
I considered your suggestions and tried a similar approach:
library(dplyr)
#Function to order 2 variables by alphabetical order.
#I got the function below from another post; I can't remember the author.
alphabetical <- function(x,y){x < y}
#Created a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Create an empty dataframe
dfCreated <- data.frame(col1=character(),col2=character())
#for each row, I reorder the columns and append to a new dataframe
#Tks to Gregor
for(i in 1:nrow(dfSample)) {
row <- c(as.character(dfSample[i,1]), as.character(dfSample[i,2]))
if(!alphabetical(row[1],row[2])){
row <- c(row[2],row[1])
}
dfCreated<-rbind(dfCreated,c(row[1],row[2]),stringsAsFactors=FALSE)
}
colnames(dfCreated)<-c("col1","col2")
dfCreated
#tks to Monk
summarize(group_by(dfCreated, col1, col2), n())
col1 col2 `n()`
<chr> <chr> <int>
1 darcy elizabeth 4
2 doe john 2
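The row-by-row loop above can also be expressed without a loop. The sketch below is an alternative, not the posters' method: it uses base R's pmin and pmax, which work element-wise on character vectors, to put each pair into a canonical order before counting. The names pairs and counts are illustrative.

```r
col1 <- c("darcy", "elizabeth", "elizabeth", "darcy", "john", "doe")
col2 <- c("elizabeth", "darcy", "darcy", "elizabeth", "doe", "john")

# Canonical order per row: the alphabetically smaller word first.
pairs <- data.frame(a = pmin(col1, col2),
                    b = pmax(col1, col2),
                    n = 1L,
                    stringsAsFactors = FALSE)

# Sum the per-row counter within each unordered pair.
counts <- aggregate(n ~ a + b, data = pairs, FUN = sum)
counts
```

String comparison here follows the current locale's collation, so mixed-case input may need tolower() first if 'Doe John' and 'john doe' should be treated as the same pair.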

Resources