Count Bigrams independently of order of appearance - r

I'm trying to count bigrams independently of word order, so that 'John Doe' and 'Doe John' are counted together as 2.
I already tried some text-mining examples, such as those at https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html, but couldn't find any counting method that ignores order of appearance.
library('widyr')
word_pairs <- austen_section_words %>%
pairwise_count(word, section, sort = TRUE)
word_pairs
It counts each order separately, like this:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 144
2 elizabeth darcy 144
It should look like this:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 288
Thanks if anyone can help me.

This code works. There is probably something more efficient out there though.
# Create sample dataframe
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'))
# Break out first and last names
library(stringr)
df$first <- word(df$name, 1)
df$second <- word(df$name, 2)
# Reorder alphabetically
df$a <- ifelse(df$first < df$second, df$first, df$second)
df$b <- ifelse(df$first > df$second, df$first, df$second)
library(dplyr)
summarize(group_by(df, a, b), n())
# Yields
# a b `n()`
# <chr> <chr> <int>
#1 darcy elizabeth 2
#2 Doe John 2
#3 Smith Steve 1

Thanks, guys.
I considered your suggestions and tried a similar approach:
library(dplyr)
# Function that tests whether two values are in alphabetical order.
# I got this function from another post; I can't remember the author.
alphabetical <- function(x,y){x < y}
#Created a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Create an empty dataframe
dfCreated <- data.frame(col1=character(),col2=character())
# For each row, reorder the columns and append to a new dataframe
# Thanks to Gregor
for(i in 1:nrow(dfSample)) {
row <- c(as.character(dfSample[i,1]), as.character(dfSample[i,2]))
if(!alphabetical(row[1],row[2])){
row <- c(row[2],row[1])
}
dfCreated<-rbind(dfCreated,c(row[1],row[2]),stringsAsFactors=FALSE)
}
colnames(dfCreated)<-c("col1","col2")
dfCreated
# Thanks to Monk
summarize(group_by(dfCreated, col1, col2), n())
col1 col2 `n()`
<chr> <chr> <int>
1 darcy elizabeth 4
2 doe john 2
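The row-by-row loop can also be replaced by a vectorized step. The following is a minimal sketch, assuming the counts are already in a table shaped like the pairwise_count() output (columns item1, item2, n); the sample values and the helper column names a and b are made up for illustration. pmin() and pmax() put each pair into alphabetical order, after which the counts of both orders can be summed.

```r
library(dplyr)

# Hypothetical sample shaped like the pairwise_count() output above
word_pairs <- data.frame(
  item1 = c("darcy", "elizabeth", "john"),
  item2 = c("elizabeth", "darcy", "doe"),
  n = c(144, 144, 10),
  stringsAsFactors = FALSE
)

combined <- word_pairs %>%
  # a is the alphabetically smaller word of each pair, b the larger one
  mutate(a = pmin(item1, item2), b = pmax(item1, item2)) %>%
  group_by(a, b) %>%
  # add up the counts of both orderings of the same pair
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(n))

combined
```

Because pmin() and pmax() work on whole columns, no loop or rbind() is needed.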

Related

In R merge checks

For simple 1:1 merge for example:
data_merge <- merge(dataset1, dataset2, by.x = "name", by.y = "name")
Is there a way to check which values merged successfully, i.e., a count or flag of the rows that came only from dataset1, the rows that came only from dataset2, and the rows matched in both?
The tidylog package provides that for dplyr/tidyr operations. For instance,
library(dplyr); library(tidylog)
left_join(band_members, band_instruments)
Result
Joining, by = "name"
left_join: added one column (plays)
> rows only in x 1
> rows only in y (1)
> matched rows 2
> ===
> rows total 3
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones NA
2 John Beatles guitar
3 Paul Beatles bass
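If you want to stay with base merge(), one possible approach is to add a marker column to each data set before a full outer merge and then classify the rows afterwards. This is only a sketch: the sample data and the column names in_dataset1, in_dataset2 and source are invented for illustration.

```r
# Hypothetical sample data
dataset1 <- data.frame(name = c("Mick", "John", "Paul"),
                       band = c("Stones", "Beatles", "Beatles"),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(name = c("John", "Paul", "Keith"),
                       plays = c("guitar", "bass", "guitar"),
                       stringsAsFactors = FALSE)

# Marker columns record which data set each row came from
dataset1$in_dataset1 <- TRUE
dataset2$in_dataset2 <- TRUE

# all = TRUE keeps unmatched rows from both sides (full outer join)
data_merge <- merge(dataset1, dataset2, by = "name", all = TRUE)

# A marker is NA when the row had no partner in that data set
data_merge$source <- ifelse(is.na(data_merge$in_dataset1), "dataset2 only",
                     ifelse(is.na(data_merge$in_dataset2), "dataset1 only", "both"))

table(data_merge$source)
```

The source column can then be used to filter the merged data or to count each kind of row.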

Unnesting related list-columns of different size

After parsing xml files I have data looking like this:
example_df <-
tibble(id = "ABC",
wage_type = "salary",
name = c("Description","Code","Base",
"Description","Code","Base",
"Description","Code"),
value = c("wage_element_1","51B","600",
"wage_element_2","51C","740",
"wage_element_3","51D"))
example_df
# A tibble: 8 x 4
id wage_type name value
<chr> <chr> <chr> <chr>
1 ABC salary Description wage_element_1
2 ABC salary Code 51B
3 ABC salary Base 600
4 ABC salary Description wage_element_2
5 ABC salary Code 51C
6 ABC salary Base 740
7 ABC salary Description wage_element_3
8 ABC salary Code 51D
with roughly 1000 different id, and each having three possible values for wage_type.
I want to change the values in the name column into columns.
I have tried to use pivot but I am struggling to handle the resulting list-cols: since not all salary have a Base, the resulting list-cols are of different size as below:
example_df <- example_df %>%
pivot_wider(id_cols = c(id, wage_type),
names_from = name,
values_from = value)
example_df
# A tibble: 1 x 5
id wage_type Description Code Base
<chr> <chr> <list> <list> <list>
1 ABC salary <chr [3]> <chr [3]> <chr [2]>
So when I try to unnest the cols it throws an error:
example_df%>%
unnest(cols = c(Description,Code,Base))
Error: Can't recycle `Description` (size 3) to match `Base` (size 2).
I understand that this is because tidyr functions do not recycle, but I could not find a way around it or a base R solution to my problem. I have also tried to build a df with the unlist(strsplit(as.character(x))) approach from "how to split one row into multiple rows in R", but ran into a column-length issue there too.
Desired output is as follows:
desired_df <-
tibble(
id=c("ABC","ABC","ABC"),
wage_type=c("salary","salary","salary"),
Description = c("wage_element_1","wage_element_2","wage_element_3"),
Code = c("51B","51C","51D"),
Base = c("600","740",NA))
desired_df
id wage_type Description Code Base
<chr> <chr> <chr> <chr> <chr>
1 ABC salary wage_element_1 51B 600
2 ABC salary wage_element_2 51C 740
3 ABC salary wage_element_3 51D NA
I would love a tidyr solution but any help would be appreciated. Thanks.
I would suggest this approach using tidyverse functions. The issue is that id and wage_type do not uniquely identify each value, so pivot_wider() collects the repeated values into list-columns. Creating an extra id variable such as id2, which numbers the rows within each name, avoids the list output in the reshaped data:
library(tidyverse)
#Code
example_df %>%
arrange(name) %>%
group_by(id,wage_type,name) %>%
mutate(id2=1:n()) %>% ungroup() %>%
pivot_wider(names_from = name,values_from=value) %>%
select(-id2)
Output:
# A tibble: 3 x 5
id wage_type Base Code Description
<chr> <chr> <chr> <chr> <chr>
1 ABC salary 600 51B wage_element_1
2 ABC salary 740 51C wage_element_2
3 ABC salary NA 51D wage_element_3
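One caveat: numbering rows within each name (id2) pairs the k-th Description with the k-th Base, so it only lines up when the missing Base belongs to the last record. A minimal sketch of an alternative, assuming every record in the parsed XML starts with a Description row (an assumption, not something stated in the question), is to number the records with cumsum(). The column name record and the sample data, where the second record has no Base, are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: here the SECOND record is the one without a Base
example_df <- tibble(
  id = "ABC",
  wage_type = "salary",
  name = c("Description", "Code", "Base",
           "Description", "Code",
           "Description", "Code", "Base"),
  value = c("wage_element_1", "51B", "600",
            "wage_element_2", "51C",
            "wage_element_3", "51D", "740")
)

result <- example_df %>%
  group_by(id, wage_type) %>%
  # a new record starts at every "Description" row
  mutate(record = cumsum(name == "Description")) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-record)

result
```

With this numbering, the NA for Base lands on wage_element_2, the record that actually lacks it.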

R append last row to a data frame

I have a data frame (df) that shares a key column ($Name) with a list of data frames:
head(df)
# A tibble: 6 x 3 ##truncating to show first 2 rows only
Name var1 var2
<chr> <chr> <chr>
1 Tom Marks LAX ORD
2 Bob Sells MIA CHI
I have a list of data frames that contains historical data for each person contained in df$Name.
head(employees$'Tom Marks')
Name date var3
Tom Marks 2017-01-01 250
Tom Marks 2017-01-02 457
head(employees$'Bob Sells')
Name date var3
Bob Sells 2017-01-01 385
Bob Sells 2017-01-02 273
I would like to append the value in var3 from employees list to the df by the most recent date (which is always the last row in an employees list). For example, the output, after matching Tom Marks from df$Name to employees$'Tom Marks' would look like this:
head(df)
Name var1 var2 var3
<chr> <chr> <chr> <num>
1 Tom Marks LAX ORD 457
2 Bob Sells MIA CHI 273
I have spent a decent amount of time researching filtering joins, mutating joins, bind_rows, reduce() functions but have been unsuccessful in accomplishing what is probably an easy task for a decent programmer. I'm hoping someone out there can put me out of my misery and provide some better direction or better yet, an answer!
Thank you!
If you're always after the last row, you can use tail to get it:
library(tidyverse)
left_join(
df,
map_df(employees, ~ tail(.x, 1))
)
This solution relies on your data being arranged as you described, but you can easily sort each list element by date first if they are not.
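If the list elements are not guaranteed to be sorted by date, a sketch like the following sorts each one before taking its last row. The sample data is made up for illustration (Tom Marks' rows are deliberately out of order), and the name latest is just a placeholder.

```r
library(dplyr)

# Hypothetical data; Tom Marks' history is NOT in date order
df <- data.frame(Name = c("Tom Marks", "Bob Sells"),
                 var1 = c("LAX", "MIA"),
                 var2 = c("ORD", "CHI"),
                 stringsAsFactors = FALSE)
employees <- list(
  `Tom Marks` = data.frame(Name = "Tom Marks",
                           date = c("2017-01-02", "2017-01-01"),
                           var3 = c(457, 250),
                           stringsAsFactors = FALSE),
  `Bob Sells` = data.frame(Name = "Bob Sells",
                           date = c("2017-01-01", "2017-01-02"),
                           var3 = c(385, 273),
                           stringsAsFactors = FALSE)
)

# Sort each data frame by date, keep its last (most recent) row, then stack them
latest <- bind_rows(lapply(employees, function(x) {
  tail(x[order(as.Date(x$date)), ], 1)
}))

# Attach only var3 to df, matching on Name
result <- left_join(df, latest[, c("Name", "var3")], by = "Name")
result
```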
library(tidyverse)
df %>% left_join(
df_list$employees %>%
bind_rows() %>%
group_by(Name) %>%
summarise_at(vars(var3), last))
# Name var1 var2 var3
# 1 Tom Marks LAX ORD 457
# 2 Bob Sells MIA CHI 273
Data
df <- data.frame(Name = c("Tom Marks", "Bob Sells"),
var1 = c("LAX", "MIA"),
var2 = c("ORD", "CHI"))
df_list <- list(employees = list(
`Tom Marks` = data.frame(Name = "Tom Marks",
date = c("2017-01-01", "2017-01-02"),
var3 = c(250, 457)),
`Bob Sells` = data.frame(Name = "Bob Sells",
date = c("2017-01-01", "2017-01-02"),
var3 = c(385, 273))
))

Detecting the differences between two string vectors

I've got a data_frame that looks like this.
df <- data_frame(name = c('john','bill','amy'),
name.2 = c('johhn','ball','ammy'))
df
# A tibble: 3 x 2
name name.2
<chr> <chr>
1 john johhn
2 bill ball
3 amy ammy
I want to add a column that shows the difference between the two name(.2) columns. Like this:
df %>%
mutate(diff = c('h','a','m'))
# A tibble: 3 x 3
name name.2 diff
<chr> <chr> <chr>
1 john johhn h
2 bill ball a
3 amy ammy m
I'd prefer to find a solution that uses elements of tidyverse and stringr if possible, but I'll take it like I get it.
Using base R we can do something like:
diffc=diag(attr(adist(df$name,df$name.2, counts = TRUE), "trafos"))
transform(df,diff=regmatches(name.2,regexpr("[^M]",diffc)))
name name.2 diff
1 john johhn h
2 bill ball a
3 amy ammy m
Breakdown:
compute approximate string distance between df[,1] and df[,2]
d=adist(df$name,df$name.2, counts = TRUE)
obtain the diagonal of the transformation matrix:
e= diag(attr(d, "trafos"))
Find the position of those that are either deleted,substituted or inserted ie not maintained:
f=regexpr("[^M]",e)
extract the values of df[,2] at those specified positions:
df$diff <- regmatches(df$name.2, f)
You could use the vecsets library:
library(vecsets)
df$diff <- mapply(vsetdiff, strsplit(df$name.2, split = ""),
strsplit(df$name, split = ""))
df
# name name.2 diff
#1 john johhn h
#2 bill ball a
#3 amy ammy m
Note: it looks like you just want the values in name.2 that are not in name, which is why the first argument to mapply is the strsplit of name.2.
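If you would rather not install an extra package, the same idea can be sketched in base R. This assumes "difference" means the characters of name.2 left over after removing one occurrence for each character of name; the function name multiset_diff is made up for illustration.

```r
# Remove from `a` one occurrence of each character found in `b`
multiset_diff <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  for (ch in y) {
    i <- match(ch, x)              # first remaining position of ch, or NA
    if (!is.na(i)) x <- x[-i]      # drop just that one occurrence
  }
  paste(x, collapse = "")
}

df <- data.frame(name = c("john", "bill", "amy"),
                 name.2 = c("johhn", "ball", "ammy"),
                 stringsAsFactors = FALSE)

df$diff <- mapply(multiset_diff, df$name.2, df$name, USE.NAMES = FALSE)
df
```

Unlike a plain setdiff(), this keeps repeated letters, which is what makes the extra "h" in "johhn" show up.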

R, import column names

I have a csv file (myNames) with column names. It is 1 x 66. I want to use those names to rename the 66 columns I have in another dataframe. I am trying to use colnames(df)[]<-(myNames) but I get the wrong result. I have tried to do this using as.vector, as.array, as.list, without success.
Is there a more direct way to read a csv file into an array?
or
How can I get an array from my dataframe that I can use in colnames()?
Here's myNames:
v1 v2 v3 v4 v5 v6 v7
Tom Dick Harry John Paul George Ringo
I want to make Tom, Dick, Harry my new column names in mydata.
Try this:
library(tidyverse)
# Reproducible example
df <- ("Tom Dick Harry John Paul George Ringo")
df <- read.table(text = df)
# Change column names
names(df) <- as.matrix(df[1, ])
# Remove row 1
df <- df[-1, ]
# Convert to a tibble
df %>%
as_tibble() %>%
mutate_all(parse_guess) %>%
glimpse()
The code above returns:
Observations: 0
Variables: 7
$ Tom <chr>
$ Dick <chr>
$ Harry <chr>
$ John <chr>
$ Paul <chr>
$ George <chr>
$ Ringo <chr>
You could turn this into a function:
rn_to_cn <- function(dataframe){
x <- length(colnames(dataframe))
y <- length(unique(as.character(as.matrix(dataframe[1, ]))))
if(x > y){
stop("Can't have duplicate column names.")
} else {
message("It worked!")
}
names(dataframe) <- as.matrix(dataframe[1, ])
dataframe <- dataframe[-1, ]
dataframe %>%
as_tibble() %>%
mutate_all(parse_guess)
}
And then do this:
rn_to_cn(df)
# A tibble: 0 x 7
# ... with 7 variables: Tom <chr>, Dick <chr>, Harry <chr>, John <chr>,
# Paul <chr>, George <chr>, Ringo <chr>
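For the original question (names stored in a 1 x 66 csv, to be applied to another data frame), a more direct sketch is to read the csv without a header and flatten its single row into a character vector. The three-column sample below stands in for the real 66 columns, and text = is used only so the example runs without a file; with a real file you would pass its path instead.

```r
# Stand-in for the names file: one row, no header
myNames <- read.csv(text = "Tom,Dick,Harry", header = FALSE,
                    stringsAsFactors = FALSE)

# Stand-in for the data frame whose columns should be renamed
mydata <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# unlist() turns the one-row data frame into a plain character vector
new_names <- unlist(myNames[1, ], use.names = FALSE)

# Guard against a mismatch before assigning
stopifnot(length(new_names) == ncol(mydata))
colnames(mydata) <- new_names
mydata
```

The key step is unlist(): colnames() needs a character vector, and a one-row data frame is still a list of columns.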

Resources