R - sum values if string contains part of a column

I have the following dataframe:
df1 <- data.frame(
  word = c("house, garden, flower", "flower, red", "garden, tree, forest", "house, window, door, red"),
  value = c(10, 12, 20, 5),
  stringsAsFactors = FALSE
)
Now I would like to sum up the values for each individual word, so the table should look like this:
word | value
house | 15
garden | 30
flower | 22
...
I have not been able to find a solution so far. Does anybody have one?

Here's an example using unnest_tokens from the tidytext package (it lowercases tokens by default, which is harmless here since the words are already lowercase):
library(tidyverse)
library(tidytext)
df1 %>%
  unnest_tokens(word, word) %>%
  group_by(word) %>%
  summarize(value = sum(value))

You can get all of the distinct words using strsplit, then use sapply to sum the values for each word. (Note that grep matches substrings, so in other data a short word could match inside a longer one; see the word-boundary variant sketched after the output below.)
Words <- unique(unlist(strsplit(df1$word, ",\\s*")))
sapply(Words, function(w) sum(df1$value[grep(w, df1$word)]))
 house garden flower    red   tree forest window   door
    15     30     22     17     20     20      5      5
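A slightly safer variant of the same idea, sketched with word boundaries so that a word can never match inside a longer one:
# Anchor each word so that e.g. "red" cannot match inside another word
sapply(Words, function(w) sum(df1$value[grep(paste0("\\b", w, "\\b"), df1$word)]))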

One option could be to separate the word column into multiple columns using splitstackshape::cSplit, then use tidyr::gather and process the data in long format.
library(tidyverse)
library(splitstackshape)
df1 %>%
  cSplit("word", sep = ",", stripWhite = TRUE) %>%
  mutate_at(vars(starts_with("word")), funs(as.character)) %>%
  gather(key, word, -value) %>%
  filter(!is.na(word)) %>%
  group_by(word) %>%
  summarise(value = sum(value)) %>%
  as.data.frame()
#     word value
# 1   door     5
# 2 flower    22
# 3 forest    20
# 4 garden    30
# 5  house    15
# 6    red    17
# 7   tree    20
# 8 window     5
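For completeness, tidyr::separate_rows can do the split-and-stack in one step, avoiding the wide intermediate entirely; a sketch of the same computation:
library(tidyverse)
df1 %>%
  separate_rows(word, sep = ",\\s*") %>%
  group_by(word) %>%
  summarise(value = sum(value))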

Related

Group strings that have the same words but in a different order

I have an example of a concatenated text field (please see the sample data below) that is created from two or three different fields; however, there is no guarantee that the order of the words will be the same. I would like to create a new dataset in which fields with the same words, regardless of order, are collapsed. Since I do not know in advance which words will be concatenated together, the code will have to recognize that all words in both strings match.
Code for example data:
var1 <- c("BLUE|RED", "RED|BLUE", "WHITE|BLACK|ORANGE", "BLACK|WHITE|ORANGE")
freq <- c(1, 1, 1, 1)
have <- as.data.frame(cbind(var1, freq))
Have:
var1 freq
BLUE|RED 1
RED|BLUE 1
WHITE|BLACK|ORANGE 1
BLACK|WHITE|ORANGE 1
How can I collapse the data into what I want below?
color freq
BLUE|RED 2
WHITE|BLACK|ORANGE 2
data.frame(table(sapply(strsplit(have$var1, '\\|'),
                        function(x) paste(sort(x), collapse = '|'))))
                Var1 Freq
1 BLACK|ORANGE|WHITE    2
2           BLUE|RED    2
In the world of piping (requires R >= 4.1 for the native pipe |> and the \(x) lambda shorthand):
have$var1 |>
  strsplit('\\|') |>
  sapply(\(x) paste0(sort(x), collapse = "|")) |>
  table() |>
  data.frame()
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
have %>%
  group_by(id = row_number()) %>%
  separate_rows(var1) %>%
  arrange(var1, .by_group = TRUE) %>%
  mutate(var1 = paste(var1, collapse = "|")) %>%
  slice(1) %>%
  ungroup() %>%
  count(var1, name = "freq")
  var1               freq
  <chr>             <int>
1 BLACK|ORANGE|WHITE    2
2 BLUE|RED              2
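Note that because have was built with as.data.frame(cbind(...)), the freq column is character, and all three answers above simply count rows. If freq held real counts to be added up, a sketch along the same lines would sum it instead:
library(dplyr)
have %>%
  mutate(freq = as.numeric(freq),
         var1 = sapply(strsplit(as.character(var1), "\\|"),
                       function(x) paste(sort(x), collapse = "|"))) %>%
  group_by(var1) %>%
  summarise(freq = sum(freq))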

Splitting strings into columns in R

I have a vector with text in an R data frame, such as below:
string <- c("Real estate surface: 60m2 Number of rooms: 3 Number of bedrooms: 2 Number of bathrooms: 1 Number of toilets: 0 Year of construction: 1980 Last renovation: Floor: 1/15")
and I want to split the text into an 8-column data frame with the associated values.
How can I do that?
Thanks!
An option would be to insert NA for the missing values, then use separate_rows/separate to split the string. (The NA insertion assumes real values start with a digit, and the lookbehind assumes each value is a single token of at most five word characters.)
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)
tibble(col = string) %>%
  mutate(col = str_replace_all(col, ": (?![0-9])", ": NA ")) %>%
  separate_rows(col, sep = "(?<=:\\s\\w{1,5}) ") %>%
  separate(col, into = c('col1', 'col2'), sep = ":\\s+") %>%
  deframe %>%
  as.data.frame.list(check.names = FALSE) %>%
  type.convert(as.is = TRUE)
#Real estate surface Number of rooms Number of bedrooms Number of bathrooms Number of toilets Year of construction
#1 60m2 3 2 1 0 1980
# Last renovation Floor
#1 NA 1/15
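The same idea also works in base R under the same assumptions (real values start with a digit, so missing ones are first flagged as NA); a sketch:
x  <- gsub(": (?![0-9])", ": NA ", string, perl = TRUE)
m  <- regmatches(x, gregexpr("[^:]+:\\s*\\S+", x))[[1]]
kv <- do.call(rbind, strsplit(trimws(m), ":\\s*"))
res <- as.data.frame(setNames(as.list(kv[, 2]), kv[, 1]), check.names = FALSE)
type.convert(res, as.is = TRUE)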

How can I speed up a function combining rbind and lapply?

I have a large data frame (100K rows, 19 columns). I need to count the number of cases each month that contain each possible combination of 5 items.
The following code works for a small dataset, but with my complete dataset it takes far too long. From my searching I suspect that pre-allocating a data frame is the key, but I cannot figure out how to do that.
library(dplyr)
Case <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 9, 9)
Month <- c("Jan", "Jan", "Jan", "Mar", "Mar", "Sep", "Sep", "Nov", "Nov", "Dec", "Dec", "Dec", "Apr", "Dec", "Dec", "Dec", "Dec", "Dec", "Dec")
Fruits <- c("Apple", "Orange", "Grape", "Grape", "Orange", "Apple", "Apple", "Orange", "Grape", "Apple", "Orange", "Grape", "Grape", "Apple", "Orange", "Grape", "Apple", "Orange", "Grape")
df <- data.frame(Case, Month, Fruits)
Patterns <- with(df, do.call(rbind, lapply(unique(Case), function(x) {
  y <- subset(df, Case == x)
  Date <- as.character(y$Month[1])
  Fruits <- paste(unique(y$Fruits[order(y$Fruits)]), collapse = ' / ')
  as.data.frame(unique(cbind(Case = y$Case, Date, Fruits)))
})))
Total <- Patterns %>%
  group_by(Date, Fruits) %>%
  tally()
The results I get are acceptable, but the process takes too long, and with a large dataset I run out of memory.
Over large datasets, data.table will be a lot quicker than dplyr:
library(data.table)
setDT(df)[, lapply(.SD, toString), by = c("Case", "Month")][, .N, by = c("Fruits", "Month")]
We could do all of it in one dplyr pipeline. First we group_by Case and Month to paste all Fruits together within each group; then, grouping by Month and Fruits, we count the rows in each group using tally.
library(dplyr)
df %>%
  group_by(Case, Month) %>%
  summarise(Fruits = paste(Fruits, collapse = "/")) %>%
  group_by(Month, Fruits) %>%
  tally()
# OR count()
#   Month Fruits                 n
#   <fct> <chr>              <int>
# 1 Apr   Grape                  1
# 2 Dec   Apple/Orange/Grape     3
# 3 Jan   Apple/Orange/Grape     1
# 4 Mar   Grape/Orange           1
# 5 Nov   Orange/Grape           1
# 6 Sep   Apple                  2
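If you want to confirm the speed difference on the full 100K-row data, a quick benchmark sketch (assuming the microbenchmark package is installed and df is as above; .groups = "drop" needs dplyr >= 1.0):
library(microbenchmark)
library(data.table)
library(dplyr)
microbenchmark(
  data.table = as.data.table(df)[, lapply(.SD, toString),
                                 by = c("Case", "Month")][, .N, by = c("Fruits", "Month")],
  dplyr = df %>%
    group_by(Case, Month) %>%
    summarise(Fruits = paste(Fruits, collapse = "/"), .groups = "drop") %>%
    count(Month, Fruits),
  times = 10
)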

Parsing a Hierarchy in a String Value

I am trying to create an edge list from a single character vector. My list to be processed is over 93k elements long, but as an example I will provide a small excerpt.
The character strings are part of the ICD10 code hierarchy, and the parent-child relationships exist within the string itself: a single string such as "A0101" would have "A010" as its parent.
It would look like this:
A00
A000
A001
A009
A01
A010
A0100
A0101
A02
A03
etc.
My vector does not contain any other data except the strings, but I basically need to convert
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")
into an edge list formatted as follows...
# (A00, A000)
# (A00, A001)
# (A00, A009)
# (A01, A010)
# (A010, A0100)
# (A010, A0101)
I am fairly certain there are more efficient ways to accomplish this, but this excerpt of code loads the ICD10 CM data from the icd.data package, uses the children detection system from the icd package, and then makes extensive use of the tidyverse to return an edge list. I had to get a bit creative to connect the "top" of the hierarchies, since the data do not include the chapters and sub-chapters of ICD10 as individual 2- or 1-digit codes.
Basically, sub-chapters become 2-digit codes, chapters become 1-digit codes, and then there is a root node to connect everything at the top.
library(icd.data)
icd10 <- icd10cm2016

library(icd)
code_children <- lapply(icd10$code, children)
code_vec <- sapply(code_children, paste, collapse = ",")
code_df <- as.data.frame(code_vec, stringsAsFactors = FALSE)

library(dplyr); library(stringr); library(tidyr)
code_df_new <- code_df %>%
  mutate(parent = sapply(strsplit(code_vec, ","), "[", 1)) %>%
  separate(code_vec,
           paste("code", 1:max(str_count(code_df$code_vec, ",")), sep = "."),
           ",", extra = "merge")

library(reshape2)
edgelist <- melt(code_df_new, id = "parent") %>%
  filter(!is.na(value)) %>%
  select(parent, child = value) %>%
  arrange(parent)
edgelist <- subset(edgelist, edgelist$parent != edgelist$child)
edgelist <- subset(edgelist, nchar(edgelist$child) == nchar(edgelist$parent) + 1)

subchaps <- icd10 %>%
  select(three_digit, sub_chapter, chapter) %>%
  mutate(two_digit = substr(three_digit, 1, 2)) %>%
  select(parent = two_digit, child = three_digit) %>%
  distinct()
chaps <- icd10 %>%
  select(three_digit, sub_chapter, chapter) %>%
  mutate(
    two_digit = substr(three_digit, 1, 2),
    one_digit = substr(three_digit, 1, 1)) %>%
  select(parent = one_digit, child = two_digit) %>%
  distinct()
root <- icd10 %>%
  select(three_digit) %>%
  mutate(parent = "root", child = substr(three_digit, 1, 1)) %>%
  select(parent, child) %>%
  distinct()

edgelist_final <- edgelist %>%
  bind_rows(list(chaps, subchaps, root)) %>%
  arrange(parent)
If anybody has any tips or methods to improve the efficiency of this code I am all ears. (eyes?)
On the assumption that the length of the node names in ICD10 fully defines the order (with shorter names being parents), here's an approach that connects each node with its immediate parent, if available.
While I think the logic is legible here, I'd be curious to see what a more streamlined solution would look like.
# Some longer fake data to prove that it works acceptably
# with 93k rows (took a few seconds). These are just
# numbers of different lengths, converted to characters, but they
# should suffice if the assumption about length = order is correct.
library(dplyr); library(tidyr); library(stringr)  # stringr is needed for str_length()
set.seed(42)
fake <- runif(93000, 0, 500) %>%
  magrittr::raise_to_power(3) %>%
  as.integer() %>%
  as.character()

# Step 1 - prep
fake_2 <- fake %>%
  as_data_frame() %>%
  mutate(row = row_number()) %>%
  # Step 2 - widen by level and fill in all parent nodes
  mutate(level = str_length(value)) %>%
  spread(level, value) %>%
  fill(everything()) %>%
  # Step 3 - get the two highest non-NA nodes
  gather(level, code, -row) %>%
  arrange(row, level) %>%
  filter(!is.na(code)) %>%
  group_by(row) %>%
  top_n(2, wt = level) %>%
  # Step 4 - spread once more to get pairs
  mutate(pos = row_number()) %>%
  ungroup() %>%
  select(-level) %>%
  spread(pos, code)
Output on OP data
# A tibble: 9 x 3
    row `1`   `2`
  <int> <chr> <chr>
1     1 A00   NA
2     2 A00   A000
3     3 A00   A001
4     4 A00   A009
5     5 A01   A009
6     6 A01   A010
7     7 A010  A0100
8     8 A010  A0101
9     9 A010  A0101
Output on 93k fake data
> head(fake, 10)
[1] "55174190" "50801321" "46771275" "6480673"
[5] "20447474" "879955" "4365410" "11434009"
[9] "5002257" "9200296"
> head(fake_2, 10)
# A tibble: 10 x 3
     row `1`      `2`
   <int> <chr>    <chr>
 1     1 55174190 NA
 2     2 50801321 NA
 3     3 46771275 NA
 4     4 6480673  46771275
 5     5 6480673  20447474
 6     6 6480673  20447474
 7     7 4365410  20447474
 8     8 4365410  11434009
 9     9 5002257  11434009
10    10 9200296  11434009
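To go from this two-column result to the (parent, child) edge list the question asked for, one could drop the row index and the NA rows; a minimal sketch, using the column names above:
edges <- fake_2 %>%
  select(parent = `1`, child = `2`) %>%
  filter(!is.na(child))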

Can I combine pairwise_cor and pairwise_count to get the phi coefficient AND number of occurrences for each pair of words?

I'm new to R, and I'm using widyr to do text mining. I successfully used the methods found here to get a list of co-occurring words within each section of text and their phi coefficient.
Code as follows:
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)
I understand that I can also generate a data frame with co-occurring words and the number of times they appear, using code like:
word_pairs <- review_words %>%
  pairwise_count(word, title, sort = TRUE)
What I need is a table that has both the phi coefficient and the number of occurrences for each pair of words. I've been digging into pairwise_cor and pairwise_count but still can't figure out how to combine them. If I understand correctly, joins only take one column into account when matching, so I couldn't use a regular join reliably, since multiple pairs may have the same word in the item1 column.
Is this possible using widyr? If not, is there another package that will allow me to do this?
Here is the full code:
# Load packages
pacman::p_load(XML, dplyr, stringr, rvest, httr, xml2, tidytext, tidyverse, widyr)

# Load source material
prod_reviews_df <- read_csv("SOURCE SPREADSHEET.csv")

# Split into one word per row
review_words <- prod_reviews_df %>%
  unnest_tokens(word, comments, token = "words", format = "text", drop = FALSE) %>%
  anti_join(stop_words, by = c("word" = "word"))

# Find phi coefficient
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

# Write data to CSV
write.csv(word_cors, "WORD CORRELATIONS.csv")
I want to add in pairwise_count, but I need it alongside the phi coefficient.
Thank you!
If you are getting into using tidy data principles and tidyverse tools, I would suggest GOING ALL THE WAY :) and using dplyr to do the joins you are interested in. You can use left_join to connect the calculations from pairwise_cor() and pairwise_count(), and you can just pipe from one to the other, if you like.
library(dplyr)
library(tidytext)
library(janeaustenr)
library(widyr)

austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)

austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE) %>%
  left_join(austen_section_words %>%
              pairwise_count(word, section, sort = TRUE),
            by = c("item1", "item2"))
#> # A tibble: 154,842 x 4
#> item1 item2 correlation n
#> <chr> <chr> <dbl> <dbl>
#> 1 bourgh de 0.9508501 29
#> 2 de bourgh 0.9508501 29
#> 3 pounds thousand 0.7005808 17
#> 4 thousand pounds 0.7005808 17
#> 5 william sir 0.6644719 31
#> 6 sir william 0.6644719 31
#> 7 catherine lady 0.6633048 82
#> 8 lady catherine 0.6633048 82
#> 9 forster colonel 0.6220950 27
#> 10 colonel forster 0.6220950 27
#> # ... with 154,832 more rows
I discovered and used merge today, and it appears to have used both relevant columns to merge the data. I'm not sure how to check for accuracy, but I think it worked.
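For what it's worth, merge does match on all common column names by default (here item1 and item2), so with all.x = TRUE it should agree with the left_join above. A quick check, sketched with the objects defined earlier:
merged <- merge(word_cors, word_pairs, by = c("item1", "item2"), all.x = TRUE)
joined <- left_join(word_cors, word_pairs, by = c("item1", "item2"))
# Same rows after sorting; ignore row-name/class attribute differences
all.equal(
  merged[order(merged$item1, merged$item2), ],
  as.data.frame(joined)[order(joined$item1, joined$item2), ],
  check.attributes = FALSE
)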
