Pattern matching in R

ca.df
id Category
1 Noun
2 Negative
3 Positive
4 adj
5 word
Each term is assigned to more than one category, so it corresponds to more than one id. In terms.df, all the ids for a term are stored in a single column.
terms.df
Terms id
Love 1 4 5 3
Hate 2 4 5
ice 1 5
The id in terms.df corresponds to the Category in ca.df. I want an output like this:
x.df
Category terms
Noun ice Love
Negative Hate
Positive Love
adj Hate Love
word ice Hate Love
How to do this?

Here's a possible data.table/splitstackshape packages solution
library(splitstackshape) ## loads `data.table` package too
terms.df <- cSplit(terms.df, "id", sep = " ", direction = "long")
setkey(terms.df, id)[ca.df, .(Category , Terms = toString(Terms)), by = .EACHI]
# id Category Terms
# 1: 1 Noun Love, ice
# 2: 2 Negative Hate
# 3: 3 Positive Love
# 4: 4 adj Love, Hate
# 5: 5 word Love, Hate, ice
Some explanations
We first split the id column on spaces and reshape to long format, so that each Terms/id pair gets its own row
Then we perform a binary join of the two data sets on the id column, keeping every row of ca.df
While joining, we concatenate the Terms column back together for each id using by = .EACHI, which evaluates the expression once per row of ca.df during the join
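The three steps above can also be traced in base R. This is only a sketch under assumptions: the data is rebuilt from the question (with the ids stored as one space-separated string per term), and a per-id lookup stands in for the keyed join, so it is not how data.table does it internally. The names id_list and long are hypothetical.

```r
# Example data from the question (ids stored as one space-separated string per term)
ca.df <- data.frame(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"),
                    stringsAsFactors = FALSE)
terms.df <- data.frame(Terms = c("Love", "Hate", "ice"),
                       id = c("1 4 5 3", "2 4 5", "1 5"),
                       stringsAsFactors = FALSE)

# Step 1: split the id strings and reshape to long format (one row per Terms/id pair)
id_list <- strsplit(terms.df$id, " ", fixed = TRUE)
long <- data.frame(Terms = rep(terms.df$Terms, lengths(id_list)),
                   id = as.integer(unlist(id_list)),
                   stringsAsFactors = FALSE)

# Steps 2 and 3: for each id in ca.df, look up its terms and collapse them
x.df <- data.frame(Category = ca.df$Category,
                   Terms = vapply(ca.df$id,
                                  function(i) toString(long$Terms[long$id == i]),
                                  character(1)),
                   stringsAsFactors = FALSE)
x.df
```

Because the lookup runs over every id in ca.df, a category with no terms would still appear, with an empty string in Terms.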

A solution using tidyr and dplyr.
library(tidyr)
library(dplyr)
ca.df$id <- as.character(ca.df$id)
terms.df %>% separate(id,into=paste0("V",1:3),sep = " ",extra = "merge") %>%
gather(var,id,-Terms) %>%
filter(!is.na(id)) %>%
left_join(ca.df,by="id") %>%
select(-var,-id) %>%
group_by(Category) %>%
summarize(Terms=paste(Terms,collapse=" "))
Output :
Source: local data frame [4 x 2]
Category Terms
1 Negative Hate
2 Noun Love ice
3 adj Love Hate
4 word ice Love Hate
Data :
ca.df <- read.table(text =
"id Category
1 Noun
2 Negative
3 Positive
4 adj
5 word",head=TRUE,stringsAsFactors=FALSE)
terms.df <- read.table(text =
"Terms id
Love '1 4 5'
Hate '2 4 5'
ice '1 5'
",head=TRUE,stringsAsFactors=FALSE)

You can use merge to combine based on id
ca.df <- data.frame(id=1:5, Category=c("Noun", "Negative", "Positive", "adj", "word"))
terms.df <- data.frame(Terms=c(rep("Love", 3), rep("Hate", 3), rep("ice", 2)),
id = c(1,4,5,2,4,5,1,5))
x.df <- merge(ca.df, terms.df, by="id")
x.df
id Category Terms
1 1 Noun Love
2 1 Noun ice
3 2 Negative Hate
4 4 adj Love
5 4 adj Hate
6 5 word Love
7 5 word Hate
8 5 word ice
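The merge above leaves one row per term and category. To get one row per category, as the question asks, the merged result can be collapsed afterwards. A minimal sketch with base R's aggregate(), reusing this answer's data (where id 3 is not assigned to any term, so Positive drops out); the name collapsed is hypothetical.

```r
ca.df <- data.frame(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"),
                    stringsAsFactors = FALSE)
terms.df <- data.frame(Terms = c(rep("Love", 3), rep("Hate", 3), rep("ice", 2)),
                       id = c(1, 4, 5, 2, 4, 5, 1, 5),
                       stringsAsFactors = FALSE)

# Inner join on id: one row per (term, category) pair
x.df <- merge(ca.df, terms.df, by = "id")

# Collapse the terms of each category into a single space-separated string
collapsed <- aggregate(Terms ~ Category, data = x.df, FUN = paste, collapse = " ")
collapsed
```

Passing all.x = TRUE to merge() would keep categories without terms, but the formula interface of aggregate() drops rows with NA by default, so those would need separate handling.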

Related

Counting number of strings despite multiple elements in one cell

I got a vector A <- c("Tom; Jerry", "Lisa; Marc")
and am trying to identify the number of occurrences of every name.
I already used the code:
sort(table(unlist(strsplit(A, ""))), decreasing = TRUE)
However, this code is only able to create output like this:
Tom; Jerry: 1 - Lisa; Marc: 1
I am looking for a way to count every name, even though two names are present in one cell. My preferred result would be:
Tom: 1 Jerry: 1 Lisa: 1 Marc:1
The split should be ; followed by zero or more spaces (\\s*)
sort(table(unlist(strsplit(A, ";\\s*"))), decreasing = TRUE)
-output
Jerry Lisa Marc Tom
1 1 1 1
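The optional whitespace in the pattern matters when the spacing after the semicolon is not consistent. A small sketch with a hypothetical vector (not the one from the question) that mixes no space, one space and two spaces:

```r
# Hypothetical input with inconsistent spacing after the semicolon
A <- c("Tom;Jerry", "Lisa;  Marc", "Tom; Lisa")

# ";\\s*" consumes the semicolon and any whitespace that follows it
names_vec <- unlist(strsplit(A, ";\\s*"))
counts <- table(names_vec)
counts
```

Splitting on the fixed string "; " instead would leave "Tom;Jerry" unsplit and keep a leading space on " Marc".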
Use separate_rows to split the strings, group_by the names and summarise them:
library(tidyverse)
data.frame(A) %>%
separate_rows(A, sep = "; ") %>%
group_by(A) %>%
summarise(N = n())
# A tibble: 4 × 2
A N
<chr> <int>
1 Jerry 1
2 Lisa 1
3 Marc 1
4 Tom 1

R: Merge rows that share same code and at least one or more strings in name-column

I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:
df <- data.frame(
id = 1:8,
name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The approach that I envision would look something like this:
use group_by on the code-column,
check if the group contains 2 or more rows,
check if there are any shared words among the different rows. If so, merge those rows and combine the information into a single row.
The final dataset would look like this:
final_df <- data.frame(
id = c("1|2|3", 4:8),
name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.
We could use an if/else condition after grouping. Extract the words from the 'name' column and check whether any word is common to every row of the group. Create a flag that is TRUE when the number of intersecting words is greater than 0 and the group size (n()) is greater than 1, then use it to decide whether to paste (str_c) the elements of the other columns together.
library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
group_by(code = factor(code, levels = unique(code))) %>%
mutate(flag = n() > 1 &
(str_extract_all(name, "\\w+") %>%
reduce(intersect) %>%
length %>%
is_greater_than(0))) %>%
summarise(across(-flag, ~ if(any(flag))
str_c(.x, collapse = "|") else as.character(.x)), .groups = 'drop') %>%
select(names(df))
-output
# A tibble: 6 × 3
id name code
<chr> <chr> <fct>
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
-OP's expected
> final_df
id name code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
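The shared-word check can be looked at in isolation. In this sketch, shares_word() is a hypothetical helper (not part of the answer) that applies the same Reduce/intersect idea to a plain character vector:

```r
# TRUE if at least one word occurs in every element of x
shares_word <- function(x) {
  word_lists <- strsplit(x, "\\s+")
  length(Reduce(intersect, word_lists)) > 0
}

shares_word(c("tiger ltd", "tiger cpy", "tiger"))     # TRUE: "tiger" is in all three
shares_word(c("rhino", "hippo"))                      # FALSE: nothing in common
shares_word(c("tiger ltd", "tiger cpy", "lion cpy"))  # FALSE: no word is in all three
shares_word("elephant")                               # TRUE: a single row trivially matches itself
```

The last two calls show two properties worth keeping in mind: the intersection requires a word common to all rows of a group, not just to some pair, and a one-row group always passes, which is why the group-size condition is checked as well.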
You can use this helper function f(), and apply it to each group:
f <- function(d) {
if(length(Reduce(intersect,strsplit(d[["name"]]," ")))>0) {
d = lapply(d,paste0,collapse="|")
}
return(d)
}
library(data.table)
setDT(df)[,id:=as.character(id)][, f(.SD),code]
Output:
code id name
<char> <char> <char>
1: 4564AB 1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC 4 rhino
3: 7845BC 5 hippo
4: 6144DE 6 elephant
5: 7845KI 7 elephant bros
6: 7845EG 8 last comp

dplyr: include all elements in filter list, even if not in data set

df1
Row Taste Quantity
#1 Vanilla 3
#2 Chocolate 1
#3 Strawberry 6
I would like to filter by a vector of flavors that contains more flavors than the data does. If a flavor in the vector doesn't exist in the Taste column, I would like a new row to be added for it.
df1 %>% filter(Taste %in% c("Chocolate", "Strawberry", "Banana"))
but this only returns the chocolate and strawberry rows. I would like it to return:
Row Taste Quantity
#2 Chocolate 1
#3 Strawberry 6
#4 Banana 0 (or could be NA)
Is there a way to append the items in the list to the results even if the data doesn't exist in df1?
# example data frame
df = read.table(text = "
Row Taste Quantity
1 Vanilla 3
2 Chocolate 1
3 Strawberry 6
", header=T)
# vector of tastes to have in output
taste_vector = c("Chocolate", "Strawberry", "Banana")
library(dplyr)
data.frame(taste_vector) %>% # start with the vector of tastes you want to have
left_join(df, by=c("taste_vector"="Taste")) %>% # join original data to see what was found and what wasn't
mutate(Row = ifelse(is.na(Row), max(Row, na.rm = T) + cumsum(is.na(Row)), Row)) # update Row column
# taste_vector Row Quantity
# 1 Chocolate 2 1
# 2 Strawberry 3 6
# 3 Banana 4 NA
You can add mutate(Quantity = coalesce(Quantity, 0L)) if you don't want NAs in your Quantity column.
Using tidyverse (dplyr, forcats and tidyr)
First create a filter object (filter_vals) of the variables you want to filter on. In a mutate (assuming the variable is not a factor), we mutate Taste into a factor and expand the factor levels with values from the filter object. Next we use complete to expand the values in the data.frame with the missing levels that are in the factor and set empty values to 0. Finally filter the data.frame with the filter object.
library(tidyverse)
filter_vals <- c("Chocolate", "Strawberry", "Banana")
df1 %>%
mutate(Taste = as_factor(Taste),
Taste = fct_expand(Taste, filter_vals)) %>%
complete(Taste, fill = list(Quantity = 0)) %>%
filter(Taste %in% filter_vals)
# A tibble: 3 x 2
Taste Quantity
<fct> <dbl>
1 Chocolate 1
2 Strawberry 6
3 Banana 0
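The same idea (start from the wanted flavors and attach whatever data exists) can be sketched without any packages. This assumes df1 as shown in the question; the names wanted and res are hypothetical.

```r
df1 <- data.frame(Row = 1:3,
                  Taste = c("Vanilla", "Chocolate", "Strawberry"),
                  Quantity = c(3L, 1L, 6L),
                  stringsAsFactors = FALSE)

# The flavors that must appear in the result, whether or not df1 has them
wanted <- data.frame(Taste = c("Chocolate", "Strawberry", "Banana"),
                     stringsAsFactors = FALSE)

# all.x = TRUE keeps every wanted flavor; flavors missing from df1 get NA
res <- merge(wanted, df1, by = "Taste", all.x = TRUE)

# Replace the missing quantities by 0
res$Quantity[is.na(res$Quantity)] <- 0L
res
```

Note that merge() sorts the result by the key column, so the rows come back ordered by Taste rather than in the order of the wanted vector, and Row stays NA for Banana.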

Match words in a data frame to a string in R

I have a data frame from a recall task where participants recall as many words as they can from a list they learned earlier. Here's a mock up of the data. Each row is a subject and each column (w1-w5) is a word recalled:
df <- data.frame(subject = 1:5,
w1 = c("screen", "toad", "toad", "witch", "toad"),
w2 = c("package", "tuna", "tuna", "postage", "dinosaur"),
w3 = c("tuna", "postage", "toast", "athlete", "ranch"),
w4 = c("toad", "witch", "tuna", "package", "NA"),
w5 = c("windwo", "mermaid", "NA", "NA", "NA")
)
Which produces the following data frame:
subject w1 w2 w3 w4 w5
1 1 screen package tuna toad windwo
2 2 toad tuna postage witch mermaid
3 3 toad tuna toast tuna NA
4 4 witch postage athlete package NA
5 5 toad dinosaur ranch NA NA
I want to match each word produced (columns w1 - w5) to a list of the correct words, which are:
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
I only want to award points for words that are spelled correctly and are not repeated. So for example, for the data above I'd like to end up with a data frame that looks like this:
subject nCorrect
1 1 4
2 2 5
3 3 3
4 4 3
5 5 2
Subject 1 would get four points because they misspelled one word.
Subject 2 would get five points.
Subject 3 would get 3 points because they repeated tuna and are missing one word.
Subject 4 would get three points because they have one incorrect word and one missing word.
Subject 5 would get two points because they have one incorrect word and two missing words.
data.frame(subject = df$subject
, nCorrect = apply(df[, -1], 1, function(x) sum(unique(x) %in% words)))
# subject nCorrect
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
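To see why unique() followed by %in% gives the scoring rule from the question, it may help to trace a single subject by hand. The sketch below uses subject 3's row; the variable names are hypothetical.

```r
words <- c("screen", "package", "tuna", "toad", "window",
           "postage", "witch", "mermaid", "toast", "dinosaur")

# Subject 3: "tuna" is repeated and the last slot holds the string "NA"
recalled <- c("toad", "tuna", "toast", "tuna", "NA")

distinct <- unique(recalled)        # the repeated "tuna" is counted once
is_correct <- distinct %in% words   # the string "NA" is not in the word list
nCorrect <- sum(is_correct)
nCorrect
```

A misspelling such as "windwo" fails the %in% test in the same way as "NA", so it earns no point.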
With data.table (same result)
setDT(df)
df[, sum(unique(unlist(.SD)) %in% words), by = subject]
Another option is to convert the data to long format, then group by subject and use dplyr::summarise to count the correct, non-repeated answers.
library(tidyverse)
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
df %>% gather(key, value, -subject) %>%
group_by(subject) %>%
summarise(nCorrect = sum(unique(value) %in% words))
# # A tibble: 5 x 2
# subject nCorrect
# <int> <int>
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2

Getting a count of specific values in a data frame that appear in another

This question may sound similar to others, but I hope it is different enough.
I want to take a specific list of values and count how often they appear in another list of values, where non-occurring values are returned as '0'.
I have a Data Frame (df1) with the following values:
Items <- c('Carrots','Plums','Pineapple','Turkey')
df1<-data.frame(Items)
>df1
Items
1 Carrots
2 Plums
3 Pineapple
4 Turkey
And a second Data Frame (df2) that contains a column called 'Thing':
> head(df2,n=10)
ID Date Thing
1 58150 2012-09-12 Potatoes
2 12357 2012-09-28 Turnips
3 50788 2012-10-04 Oranges
4 66038 2012-10-11 Potatoes
5 18119 2012-10-11 Oranges
6 48349 2012-10-14 Carrots
7 23328 2012-10-16 Peppers
8 66038 2012-10-26 Pineapple
9 32717 2012-10-28 Turnips
10 11345 2012-11-08 Oranges
I know the word 'Turkey' only appears in df1 NOT in df2. I want to return a frequency table or count of the items in df1 that appears in df2 and return '0' for the count of Turkey.
How can I summarize the values of one Data Frame column using the values from another? The closest I got was:
df2%>% count (Thing) %>% filter(Thing %in% df1$Items,)
But this returns only the items that appear in both df1 and df2, so 'Turkey' gets excluded. So close!
> df2%>% count (Thing) %>% filter(Thing %in% df1$Items,)
# A tibble: 3 x 2
Thing n
<fctr> <int>
1 Carrots 30
2 Pineapple 30
3 Plums 38
I want my output to look like this:
1 Carrots 30
2 Pineapple 30
3 Plums 38
4 Turkey 0
I am newish to R and completely new to dplyr.
I use this sort of thing all the time. I'm sure there's a more savvy way to code it, but it's what I got:
item <- vector()
count <- vector()
items <- list(unique(df1$Items))
for (i in 1:length(items)){
item[i] <- items[i]
count[i] <- sum(df2$Thing == item)
}
df3 <- data.frame(cbind(item, count))
Hope this helps!
Stephen's solution worked with a slight modification, adding the [i] to the item at the end of count[i] line. See below:
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
item[i] <- Items[i]
count[i]<- sum(df2$Thing == item[i])
}
df3 <- data.frame(cbind(item, count))
> df3
item count
1 Carrots 30
2 Plums 38
3 Pineapple 30
4 Turkey 0
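The loop can also be written without growing vectors, by applying the comparison to each item with vapply(). This is a sketch on assumed data: only the ten rows of df2 shown in the question are used, so the counts differ from the ones above, and things is a hypothetical stand-in for df2$Thing.

```r
Items <- c("Carrots", "Plums", "Pineapple", "Turkey")

# Stand-in for df2$Thing, taken from the head(df2, n = 10) output in the question
things <- c("Potatoes", "Turnips", "Oranges", "Potatoes", "Oranges",
            "Carrots", "Peppers", "Pineapple", "Turnips", "Oranges")

# One count per item; items that never occur get 0
count <- vapply(Items, function(x) sum(things == x), integer(1))

df3 <- data.frame(item = Items, count = unname(count), stringsAsFactors = FALSE)
df3
```

Building the data frame directly, rather than through cbind(), keeps count as an integer column instead of converting it to character.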
dplyr drops 0 count rows, and you have the added complication that the possible categories of Thing are different between your two datasets.
If you add the factor levels from df1 to df2, you can use complete from tidyr, which is a common way to add 0 count rows.
I'm adding the factor levels from df1 to df2 using a convenience function from package forcats called fct_expand.
library(dplyr)
library(tidyr)
library(forcats)
df2 %>%
mutate(Thing = fct_expand(Thing, as.character(df1$Items) ) ) %>%
count(Thing) %>%
complete(Thing, fill = list(n = 0) ) %>%
filter(Thing %in% df1$Items)
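The role of factor levels in producing zero counts can be seen in base R as well: table() reports every level of a factor, including levels that never occur. A sketch using only the ten rows of df2 shown in the question (things is a hypothetical stand-in for df2$Thing):

```r
items <- c("Carrots", "Plums", "Pineapple", "Turkey")
things <- c("Potatoes", "Turnips", "Oranges", "Potatoes", "Oranges",
            "Carrots", "Peppers", "Pineapple", "Turnips", "Oranges")

# Values outside `levels` become NA and are dropped by table();
# levels without any occurrence are reported with a count of 0
counts <- table(factor(things, levels = items))
counts

# As a data frame with the count in a column named n
as.data.frame(counts, responseName = "n")
```

Here the levels do the filtering and the zero-filling in one step, which is what the combination of fct_expand(), complete() and filter() achieves in the pipeline.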
A different approach is to aggregate df2 first, to right join with df1 (to pick all rows of df1), and to replace NA by zero.
library(dplyr)
df2 %>%
count(Thing) %>%
right_join(unique(df1), by = c("Thing" = "Items")) %>%
mutate(n = coalesce(n, 0L))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Plums 0
3 Pineapple 1
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same approach in data.table:
library(data.table)
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][is.na(N), N := 0L][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Note that in both implementations unique(df1) is used to avoid unintended duplicate rows after the join.
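The effect of duplicated lookup values can be shown with a small base R example. The data frames lookup and counts below are hypothetical, and merge() with all.x = TRUE stands in for the right join:

```r
counts <- data.frame(Thing = c("Carrots", "Pineapple"), n = c(1L, 1L),
                     stringsAsFactors = FALSE)

# "Carrots" is listed twice in the lookup table
lookup <- data.frame(Items = c("Carrots", "Carrots", "Turkey"),
                     stringsAsFactors = FALSE)

with_dups <- merge(lookup, counts, by.x = "Items", by.y = "Thing", all.x = TRUE)
without_dups <- merge(unique(lookup), counts, by.x = "Items", by.y = "Thing", all.x = TRUE)

nrow(with_dups)     # 3: the Carrots row appears twice
nrow(without_dups)  # 2: one row per distinct item
```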
Edit 2019-06-22:
With version 1.12.4, data.table gained the fcoalesce() function (it was called coalesce() in development version 1.12.3). So, the above statement can be written
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][, N := fcoalesce(N, 0L)][]
If df2 is large and df1 contains only a few Items it might be more efficient to join first and then to aggregate:
library(dplyr)
df2 %>%
right_join(unique(df1), by = c("Thing" = "Items")) %>%
group_by(Thing) %>%
summarise(n = sum(!is.na(ID)))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Pineapple 1
3 Plums 0
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same in data.table syntax:
library(data.table)
setDT(df2)[unique(setDT(df1)), on = .(Thing = Items)][, .(N = sum(!is.na(ID))), by = Thing][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Edit 2019-06-22: Above can be written more concisely by aggregating in a join:
setDT(df2)[setDT(df1), on = .(Thing = Items), .N, by = .EACHI]
