I have a dataset with many rows that contain fruit descriptions e.g:
An apple hangs on an apple tree
Bananas are yellow and tasty
The apple is tasty
I need to find the unique words in these descriptions (I've already done that) and then count how many rows each of those unique words appears in.
Example:
Apple 2 (rows)
Bananas 1 (rows)
tree 1 (rows)
tasty 2 (rows)
I've done something like this:
rows <- data_frame %>%
filter(str_detect(variable, "apple"))
count_rows <- as.data.frame(nrow(rows))
But the problem is that I have too many unique words so I don't want to do it manually. Any ideas?
One dplyr, tidyr and tibble option could be:
df %>%
rowid_to_column() %>%
mutate(sentences = strsplit(sentences, " ", fixed = TRUE)) %>%
unnest(sentences) %>%
mutate(sentences = tolower(sentences)) %>%
filter(sentences %in% list_of_words) %>%
group_by(sentences) %>%
summarise_all(n_distinct)
sentences rowid
<chr> <int>
1 apple 2
2 bananas 1
3 tasty 2
4 tree 1
Sample data:
df <- data.frame(sentences = c("An apple hangs on an apple tree",
"Bananas are yellow and tasty",
"The apple is tasty"),
stringsAsFactors = FALSE)
list_of_words <- tolower(c("Apple", "Bananas", "tree", "tasty"))
In base R this can be done like the following.
r <- apply(sapply(words, function(s) grepl(s, df[[1]], ignore.case = TRUE)), 2, sum)
as.data.frame(r)
# r
#Apple 2
#Bananas 1
#tree 1
#tasty 2
Data.
x <-
"'An apple hangs on an apple tree'
'Bananas are yellow and tasty'
'The apple is tasty'"
x <- scan(textConnection(x), what = character())
df <- data.frame(x)
words <- c("Apple", "Bananas", "tree", "tasty")
A base R solution would be to use grepl with sapply or lapply:
sapply(list_of_words, function(x) sum(grepl(x, tolower(df$sentences), fixed = TRUE)))
apple bananas tree tasty
2 1 1 2
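One caveat with the grepl() and str_detect() approaches: they match substrings, so "apple" would also count a row that only contains "pineapple". Here is a minimal base R sketch that counts whole words instead. It assumes words are runs of ASCII letters, and the names words_per_row and row_counts are made up for illustration:

```r
# Sample data from the question
sentences <- c("An apple hangs on an apple tree",
               "Bananas are yellow and tasty",
               "The apple is tasty")

# Split each row into lower-case words, drop empty tokens,
# and keep each word at most once per row
words_per_row <- lapply(strsplit(tolower(sentences), "[^a-z]+"),
                        function(w) unique(w[nzchar(w)]))

# Tabulating the pooled words now gives "number of rows containing the word"
row_counts <- table(unlist(words_per_row))
row_counts[c("apple", "bananas", "tree", "tasty")]
```

Because each row contributes a word only once, the repeated "apple" in the first sentence is not double-counted.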
Given a dataframe of types and values like so:
topic   keyword
cheese  cheddar
meat    beef
meat    chicken
cheese  swiss
bread   focaccia
bread   sourdough
cheese  gouda
My aim is to make a set of dynamic regexs based on the type, but I don't know how to make the variable names from the types. I can do this individually like so:
fn_get_topic_regex <- function(targettopic,df)
{
filter_df <- df |>
filter(topic == targettopic)
regex <- paste(filter_df$keyword, collapse = "|")
}
and do things like:
cheese_regex <- fn_get_topic_regex("cheese",df)
But what I'd like to be able to do is build all these regexes automatically without having to define each one.
The intended output would be something like:
cheese_regex: "cheddar|swiss|gouda"
bread_regex: "focaccia|sourdough"
meat_regex: "beef|chicken"
Where the start of the variable name is the distinct topic.
What's the best way to do that without defining each regex individually by hand?
You can use dplyr's group_by() and summarise()
df %>%
group_by(topic) %>%
summarise(regex = paste(keyword, collapse = "|"))
# A tibble: 3 × 2
topic regex
<chr> <chr>
1 bread focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat beef|chicken
Or you can apply your function to every unique value in df$topic:
map_chr(unique(df$topic) %>% setNames(paste0(., "_regex")),
fn_get_topic_regex, df = df)
cheese_regex meat_regex bread_regex
"cheddar|swiss|gouda" "beef|chicken" "focaccia|sourdough"
Just remember to add return(regex) to the end of your function, or not to assign the last line to a variable at all. I would even put everything in a single pipe chain:
fn_get_topic_regex <- function(targettopic,df)
{
df |>
filter(topic == targettopic) |>
pull(keyword) |>
paste(collapse = "|")
}
Here is a base R solution with your intended output in a named list.
df <- structure(list(topic = c("cheese", "meat", "meat", "cheese", "bread", "bread", "cheese"),
keyword = c("cheddar", "beef", "chicken", "swiss", "focaccia", "sourdough", "gouda")),
class = "data.frame", row.names = c(NA, -7L))
#split into a list per topic
topics <- split(df, df$topic)
#collapse the keyword column
topics <- lapply(topics, function(t) {
paste(t$keyword, collapse = "|")
})
#rename
names(topics)<- paste0(names(topics), "_regex")
topics
$bread_regex
[1] "focaccia|sourdough"
$cheese_regex
[1] "cheddar|swiss|gouda"
$meat_regex
[1] "beef|chicken"
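If a named character vector is more convenient than a list, the same split-and-collapse idea can be sketched with vapply(). This is only an illustration on the sample data; the name regexes is not from the question:

```r
df <- data.frame(
  topic   = c("cheese", "meat", "meat", "cheese", "bread", "bread", "cheese"),
  keyword = c("cheddar", "beef", "chicken", "swiss", "focaccia", "sourdough", "gouda"),
  stringsAsFactors = FALSE
)

# One "a|b|c" pattern per topic, returned as a named character vector
regexes <- vapply(split(df$keyword, df$topic), paste, character(1), collapse = "|")
names(regexes) <- paste0(names(regexes), "_regex")

# Look a pattern up by name and use it
grepl(regexes[["cheese_regex"]], c("aged gouda", "beef stew"))
```

Looking patterns up by name avoids creating one global variable per topic.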
We could do something like this:
After grouping, we can use summarise() together with paste() and its collapse argument to build one regex per topic.
Then, whenever a regex is needed, we can refer to it by indexing, as in the example below:
library(dplyr)
library(stringr) #str_detect
my_regex <- df %>%
group_by(topic) %>%
summarise(regex = paste(keyword, collapse = "|"))
df %>%
mutate(new_col = ifelse(str_detect(keyword, my_regex$regex[1]), "it is bread", "it is not bread"))
A tibble: 3 × 2
topic regex
<chr> <chr>
1 bread focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat beef|chicken
> df %>%
+ mutate(new_col = ifelse(str_detect(keyword, my_regex$regex[1]), "it is bread", "it is not bread"))
topic keyword new_col
1 cheese cheddar it is not bread
2 meat beef it is not bread
3 meat chicken it is not bread
4 cheese swiss it is not bread
5 bread focaccia it is bread
6 bread sourdough it is bread
7 cheese gouda it is not bread
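Indexing with [1] relies on bread happening to sort first. A small sketch of looking the pattern up by topic name instead, using base R only. The my_regex table is re-created by hand here, and regex_by_topic is a made-up name:

```r
# The per-topic regex table from above, typed in directly for the example
my_regex <- data.frame(
  topic = c("bread", "cheese", "meat"),
  regex = c("focaccia|sourdough", "cheddar|swiss|gouda", "beef|chicken"),
  stringsAsFactors = FALSE
)

# Named vector: topic -> regex
regex_by_topic <- setNames(my_regex$regex, my_regex$topic)

keywords <- c("cheddar", "beef", "focaccia")
is_bread <- grepl(regex_by_topic[["bread"]], keywords)
is_bread
```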
I have a data frame that includes all different types of goods, for example, apples, bananas, potatoes, tuna, salmon, oranges, and many more.
All these goods are under a variable "Item".
I am looking for a solution in R that can create a new variable as "Item Category" with Fruits, Vegetables, Seafood and assign all the items according to their category.
You may prepare a list for each category and match them in a case_when statement.
library(dplyr)
df <- data.frame(item = c('apples', 'bananas', 'potatoes', 'tuna', 'salmon', 'oranges'))
df <- df %>%
mutate(item_category = case_when(item %in% c('apples', 'bananas', 'oranges') ~ 'Fruits',
item %in% c('potatoes') ~ 'Vegetables',
item %in% c('tuna', 'salmon') ~ 'SeaFood'))
df
# item item_category
#1 apples Fruits
#2 bananas Fruits
#3 potatoes Vegetables
#4 tuna SeaFood
#5 salmon SeaFood
#6 oranges Fruits
You can use fct_collapse() in forcats to collapse factor levels into manually defined groups:
library(dplyr)
# Refer to @RonakShah's example
df <- data.frame(item = c('apples', 'bananas', 'potatoes', 'tuna', 'salmon', 'oranges'))
df %>%
mutate(
item_category = forcats::fct_collapse(item,
'Fruits' = c('apples', 'bananas', 'oranges'),
'Vegetables' = c('potatoes'),
'SeaFood' = c('tuna', 'salmon'))
)
Or pass a named list of levels and splice it in with !!!:
lev <- list('Fruits' = c('apples', 'bananas', 'oranges'),
'Vegetables' = c('potatoes'),
'SeaFood' = c('tuna', 'salmon'))
df %>%
mutate(item_category = forcats::fct_collapse(item, !!!lev))
# item item_category
# 1 apples Fruits
# 2 bananas Fruits
# 3 potatoes Vegetables
# 4 tuna SeaFood
# 5 salmon SeaFood
# 6 oranges Fruits
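With many items, it may be easier to keep the item-to-category mapping in its own table rather than in code. A minimal base R sketch using match(); the lookup table here is a hypothetical one built by hand for the example:

```r
df <- data.frame(item = c('apples', 'bananas', 'potatoes', 'tuna', 'salmon', 'oranges'),
                 stringsAsFactors = FALSE)

# Hypothetical mapping table; in practice this could be read from a CSV
lookup <- data.frame(
  item          = c('apples', 'bananas', 'oranges', 'potatoes', 'tuna', 'salmon'),
  item_category = c('Fruits', 'Fruits', 'Fruits', 'Vegetables', 'SeaFood', 'SeaFood'),
  stringsAsFactors = FALSE
)

# match() gives, for each item, its row in the lookup table (NA if not listed)
df$item_category <- lookup$item_category[match(df$item, lookup$item)]
df
```

Items missing from the lookup table end up as NA, which makes unmapped goods easy to spot.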
I'd like to create a custom function to try and standardise strings in multiple different columns, in multiple different data frames, with the ultimate intention of joining data from them together.
In order to do this, I'd like to be able to pass a column name into a custom function and have the function carry out operations on that column. With the example beneath, I'd like to clean columns a and c before joining them together to look like this:
library(tidyverse)
df1 <- tibble(a = c("apple & pear", "kiwi", "plum"), b = c("cat", "dog", "cow"))
df2 <- tibble(c = c("apple and pear", "kiwi.", "plum"), d = c("car", "bike", "truck"))
full_join(df1, df2, by = c("a" = "c") )
a b d
1 apple & pear cat car
2 kiwi dog bike
3 plum cow truck
Instead, it currently turns out like this:
# A tibble: 5 x 3
a b d
1 apple & pear cat NA
2 kiwi dog NA
3 plum cow truck
4 apple and pear NA car
5 kiwi. NA bike
To do this, I know I need to build custom functions, which I'd be relatively inexperienced at doing, especially with curly-curly. The two functions beneath should change the symbols and remove the trailing punctuation, and ideally these should be combined into the one function, with the flexibility to be able to add more if necessary, like this:
add_symbol <- function(col.name){
mutate({{col.name}} = gsub(" & ", " and ", {{col.name}}))
}
rm_trail_punc <- function(col.name){
mutate({{col.name}} = gsub("[[:punct:]]$", "", {{col.name}}))
}
standardise_col <- function(df, col.name){
df %>%
add_symbol({{col.name}}) %>%
rm_trail_punc({{col.name}})
}
df1 <- standardise_col(df1)
standardise_col(df2) %>%
full_join(., df1, by = c("a" = "c"))
However, these functions can't be created, and return an error unexpected '=' because the column name can't be passed to the left-hand side of the equal sign. Is there any way of passing these values to the mutate without hard-coding them?
I think you can achieve this more simply with the following:
library(dplyr)
clean_func <- function(df){
df %>% mutate(across(everything(), ~gsub(" & ", " and ", .) %>%
gsub("[[:punct:]]$", "", .)))
}
df1 <- clean_func(df1)
df2 <- clean_func(df2)
You can make updates to the function by adding additional gsub, str_replace, or other calls as needed.
Edit:
Based on update, you can do something like this to target your variables specifically:
add_symbol <- function(col.name){
gsub(" & ", " and ", col.name)
}
rm_trail_punc <- function(col.name){
gsub("[[:punct:]]$", "", col.name)
}
standardise_col <- function(df, col.name){
col.name <- enquo(col.name)
df %>%
mutate(!!col.name := add_symbol(!!col.name),
!!col.name := rm_trail_punc(!!col.name))
}
Your code won't ever work as written, but you could do something like this:
new_df <- standardise_col(df1, a) %>%
left_join(., standardise_col(df2, c), by = c("a"="c"))
Which gives us:
# A tibble: 3 x 3
a b d
<chr> <chr> <chr>
1 apple and pear cat car
2 kiwi dog bike
3 plum cow truck
You can read up on tidy evaluation here: https://tidyeval.tidyverse.org/dplyr.html
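If tidy evaluation is not a requirement, a simpler route is to pass the column name as a string and use [[ ]]. This is a base R sketch of that alternative, not the approach shown above; it reuses the name standardise_col, and joined is a made-up name:

```r
# Clean one column, identified by its name given as a string
standardise_col <- function(df, col) {
  x <- df[[col]]
  x <- gsub(" & ", " and ", x, fixed = TRUE)  # spell out the ampersand
  x <- gsub("[[:punct:]]$", "", x)            # drop one trailing punctuation mark
  df[[col]] <- x
  df
}

df1 <- data.frame(a = c("apple & pear", "kiwi", "plum"),
                  b = c("cat", "dog", "cow"), stringsAsFactors = FALSE)
df2 <- data.frame(c = c("apple and pear", "kiwi.", "plum"),
                  d = c("car", "bike", "truck"), stringsAsFactors = FALSE)

# Full join on the cleaned key columns
joined <- merge(standardise_col(df1, "a"), standardise_col(df2, "c"),
                by.x = "a", by.y = "c", all = TRUE)
joined
```

Further cleaning steps can be added as extra gsub() lines inside the function.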
As said in the comment by @1k monkeys and a single PC, your example data are different from what you show, so the results could differ, but let's assume you have some data like this:
df1 <- tibble(a = c("apple & pear", "kiwi", "plum"),
b = c("cat","dog","cow"))
df2 <- tibble(c = c("apple and pear", "kiwi.", "orange"),
d = c("truck","bike","car"))
You can manage to use the package fuzzyjoin to merge them:
library(fuzzyjoin)
library(dplyr)
df1 %>%
stringdist_full_join(df2, by = c(a = "c") ,
max_dist = 3,
distance_col = "DIST")
# A tibble: 4 x 5
a b c d DIST
<chr> <chr> <chr> <chr> <dbl>
1 apple & pear cat apple and pear truck 3
2 kiwi dog kiwi. bike 1
3 plum cow <NA> <NA> NA
4 <NA> <NA> orange car NA
The result is different because I've based the data on your example, and "plum" and "orange" don't match (so cow and car are not aligned). With select() you can keep just the columns you need, and with rename() you can rename them.
Challenge: I have a single column with several rows. For example, the first row is "Fruit name" and the second row is "Fruit color", and this pattern repeats for each fruit. I want to take every second row (the fruit color) and put it in a new column, so that only the fruit names remain in the original column.
library(tidyverse)
df_before <- tribble(~Singlecolumn,"Apple","Red","Banana","Yellow","Kiwi","Grey","Grapes","Green")
df_before
Singlecolumn
<chr>
Apple
Red
Banana
Yellow
Kiwi
Grey
Grapes
Green
#I would like to split this like below:
df_after <- tribble(~Column1, ~Column2, "Apple","Red","Banana","Yellow","Kiwi","Grey","Grapes","Green")
df_after
Column1 Column2
Apple Red
Banana Yellow
Kiwi Grey
Grapes Green
I'm sure there is an easier way to do this using functions from the tidyverse, but I couldn't find anything even after a good deal of searching.
Would appreciate any pointers. Thanks in advance!
An easier option in base R is to convert the column to a matrix with 2 columns and then to a data.frame
as.data.frame(matrix(df_before$Singlecolumn, ncol = 2, byrow = TRUE))
But, we can also use tidyverse, where we create two groups with rep and then use pivot_wider to reshape from 'long' to 'wide' format
library(dplyr)
library(tidyr)
library(stringr) # str_c
df_before %>%
group_by(grp = str_c('Column', rep(1:2, length.out = n()))) %>%
mutate(rn = row_number()) %>%
ungroup %>%
pivot_wider(names_from = grp, values_from = Singlecolumn) %>%
select(-rn)
# A tibble: 4 x 2
# Column1 Column2
# <chr> <chr>
#1 Apple Red
#2 Banana Yellow
#3 Kiwi Grey
#4 Grapes Green
We can use vector recycling of logical values to get alternate rows from df_before.
data.frame(Column1 = df_before$Singlecolumn[c(TRUE, FALSE)],
Column2 = df_before$Singlecolumn[c(FALSE, TRUE)])
# Column1 Column2
#1 Apple Red
#2 Banana Yellow
#3 Kiwi Grey
#4 Grapes Green
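To make the recycling explicit: a logical index shorter than the vector is repeated until it covers the whole vector. A small sketch of what that expands to, using a plain character vector instead of df_before (keep_odd, fruit and colour are made-up names):

```r
x <- c("Apple", "Red", "Banana", "Yellow", "Kiwi", "Grey", "Grapes", "Green")

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE ... over the length of x
keep_odd <- rep_len(c(TRUE, FALSE), length(x))

fruit  <- x[keep_odd]    # positions 1, 3, 5, 7
colour <- x[!keep_odd]   # positions 2, 4, 6, 8

data.frame(Column1 = fruit, Column2 = colour)
```

This relies on the input having an even number of rows; with an odd number, the two pieces would have different lengths and data.frame() would fail.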
You could do it by indexing the odd and even numbered rows
# dummy data (please provide code to make a reproducible example in the future)
df1 <- data.frame(v1 = c("A", "a", "B", "b", "C", "c"))
# solution
df2 <- data.frame(
"col1" = df1[seq(1,length(df1[,1]),2), "v1"],
"col2" = df1[seq(2,length(df1[,1]),2), "v1"])
Here seq() is used to produce a vector of integers in steps of 2, running from 1 (or 2) up to the number of rows in the original data frame, e.g.
seq(2,length(df1[,1]),2)
## [1] 2 4 6
That vector is then passed as the row index inside the square brackets of df[rows, columns].
I have a csv file with a column named text like the following and would like to assign numbers to certain words and then add them.
text
I have apples oranges and mangos.
I like cats.
sports and exercise.
I've created a matrix called matrix_values with the following values.
[,1] [,2]
[1,] "apples" "1"
[2,] "mangos" "3"
[3,] "sports" "78"
Below is the code I have.
data <- read.csv(file.choose(), header = TRUE, stringsAsFactors = FALSE)
values <- c('apples', 'mangos', 'sports', 1,3,78)
matrix_values = matrix(values,nrow =3, ncol = 2)
The output should look like this
text, Value
I have apples oranges and mangos, 4
I like cats, 0
sports and exercise, 78
Notice how the values for apples and mangos from the matrix are added together, while all other words are treated as having a value of 0.
How do I do this?
If you strsplit your sentence up, you can then match to your lookup table and sum.
x <- c(
"I have apples oranges and mangos.",
"I like cats.",
"sports and exercise."
)
lkup <- data.frame(
word = c("apples", "mangos", "sports"),
value = c(1, 3, 78)
)
vapply(
strsplit(x, "\\s+|[.,]+"),
function(x) sum(lkup$value[match(x,lkup$word)], na.rm=TRUE),
FUN.VALUE = numeric(1)
)
#[1] 4 0 78
To explain the regex more:
\\s+ whitespace, repeated 1 or more times
| OR
[.,]+ a period `.` or comma `,` repeated 1 or more times
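To see what the split pattern actually produces, here is a small sketch on the first sentence. The second and third calls use a made-up input with a comma followed by a space, and the character-class pattern is an alternative suggestion, not part of the answer above:

```r
pattern <- "\\s+|[.,]+"

# Tokens for the first sentence; the trailing period leaves no empty token
tokens <- strsplit("I have apples oranges and mangos.", pattern)[[1]]
tokens

# A comma followed by a space is matched as two separate delimiters,
# which leaves an empty string between them
with_empty <- strsplit("apples, mangos", pattern)[[1]]
with_empty

# Treating any run of whitespace, periods and commas as one delimiter avoids that
no_empty <- strsplit("apples, mangos", "[[:space:].,]+")[[1]]
no_empty
```

The empty tokens are harmless for the sum, since match() returns NA for them and na.rm = TRUE drops those.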
Here's a way with dplyr and stringr. Note that this uses a cross join, so it may run into problems if your datasets are very large.
df %>%
mutate(cj = 1) %>%
inner_join(mutate(lkup, cj = 1), by = "cj") %>%
mutate(test = str_detect(text, word)) %>%
group_by(text) %>%
summarize(value = sum(value*test))
# A tibble: 3 x 2
text value
<chr> <dbl>
1 I have apples oranges and mangos. 4
2 I like cats. 0
3 sports and exercise. 78
Data (thanks to @thelatemail):
df <- read.table(text = "text
I have apples oranges and mangos.
I like cats.
sports and exercise.", header= T, stringsAsFactors = F, sep = "\t")
lkup <- tibble(
word = c("apples", "mangos", "sports"),
value = c(1, 3, 78)
)
Here is another approach, similar to @Shree's, that puts every word on its own row with separate_rows(), using @thelatemail's regex to split the text.
library(dplyr)
df %>%
mutate(row = row_number(),
text1 = text) %>%
tidyr::separate_rows(text, sep = "\\s+|[.,]+") %>%
left_join(lkup, by = c("text" = "word")) %>%
group_by(row) %>%
summarise(text = first(text1),
value = sum(value, na.rm = TRUE)) %>%
select(-row)
# text value
# <fct> <dbl>
#1 I have apples oranges and mangos. 4
#2 I like cats. 0
#3 sports and exercise. 78