Delete duplicate word, comma and whitespace - r

How can I delete all the duplicate words alongside the following comma and whitespace using Regex in R?
So far I have come up with the following regular expression, which matches the duplicate but not the comma and whitespace:
(\b\w+\b)(?=[\S\s]*\b\1\b)
An example list would be:
blue, red, blue, yellow, green, blue
The output should look like:
blue, red, yellow, green
So it would have to match two of the "blue" in this case, as well as the following comma and whitespace (if there is any).

It depends on whether your list is truly a list (vector) or a string with commas.
# your data is actually already a list/vector
v <- c("blue", "red", "blue", "yellow", "green", "blue")
unique(v)
[1] "blue" "red" "yellow" "green"
# if your data is actually a comma separated string
s <- "blue, red, blue, yellow, green, blue"
# if output needs to be a vector
unique(strsplit(s, ", ")[[1]])
[1] "blue" "red" "yellow" "green"
# if output needs to be a string again
paste(unique(strsplit(s, ", ")[[1]]), collapse = ", ")
[1] "blue, red, yellow, green"
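For completeness, here is a hedged sketch of a regex-only variant, closer to what the question attempted. It assumes the input is a single-line string. Because a lookahead can only check whether a word appears again later, this approach keeps the last occurrence of each word rather than the first, so the order differs from the requested output. The second line shows a split that tolerates irregular spacing after the commas; the names `keep_last` and `keep_first` are illustrative only.

```r
s <- "blue, red, blue, yellow, green, blue"

# Remove a word together with its trailing comma and whitespace whenever the
# same word occurs again later in the string (PCRE lookahead, so perl = TRUE).
keep_last <- gsub("\\b(\\w+)\\b,\\s*(?=.*\\b\\1\\b)", "", s, perl = TRUE)
keep_last
# [1] "red, yellow, green, blue"

# Keep the first occurrence instead: split on commas, trim, de-duplicate.
keep_first <- paste(unique(trimws(strsplit(s, ",")[[1]])), collapse = ", ")
keep_first
# [1] "blue, red, yellow, green"
```

If the first occurrence must be kept, the split-and-unique route is usually the simpler choice.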
Example based on a list column in a data.table or data.frame:
library(data.table) # must be loaded before data.table() is called
dt <- data.table(
id = 1:5,
colors = list(
c("blue", "red", "blue", "yellow", "green", "blue"),
c("blue", "blue", "yellow", "green", "blue"),
c("blue", "red", "blue", "yellow"),
c("red", "red", "yellow", "yellow", "green", "blue"),
c("black")
)
)
## using data.table
library(data.table)
setDT(dt)
# use colors instead of clean_list to just fix the existing column
dt[, clean_list := lapply(colors, function(x) unique(x))]
## using dplyr
library(dplyr)
# use colors instead of clean_list to just fix the existing column
dt %>% mutate(clean_list = lapply(colors, function(x) unique(x)))
dt
# id colors clean_list
# 1: 1 blue,red,blue,yellow,green,blue blue,red,yellow,green
# 2: 2 blue,blue,yellow,green,blue blue,yellow,green
# 3: 3 blue,red,blue,yellow blue,red,yellow
# 4: 4 red,red,yellow,yellow,green,blue red,yellow,green,blue
# 5: 5 black black
# or just simply in base
dt$colors <- lapply(dt$colors, function(x) unique(x))
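If the column holds comma-separated strings rather than list elements, the same idea can be applied per cell. The following is a minimal base R sketch under that assumption; the data frame `df` and the helper `dedupe_string` are made up for illustration.

```r
# Hypothetical data: one comma-separated string per row
df <- data.frame(
  id = 1:3,
  colors = c("blue, red, blue", "red,red", "black"),
  stringsAsFactors = FALSE
)

# Split on commas, trim whitespace, drop duplicates, and glue back together
dedupe_string <- function(s) {
  parts <- trimws(strsplit(s, ",", fixed = TRUE)[[1]])
  paste(unique(parts), collapse = ", ")
}

df$clean <- vapply(df$colors, dedupe_string, character(1), USE.NAMES = FALSE)
df$clean
# [1] "blue, red" "red" "black"
```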

We could use paste with unique and collapse:
paste(unique(string), collapse = ", ")
[1] "blue, red, yellow, green"
data:
string <- c("blue", "red", "blue", "yellow", "green", "blue")

Related

R - Finding duplicates in list entries

I am trying to figure out how to get duplicates out of list objects in R.
So my example list:
examplelist <- list(a = c("blue", "red", "yellow"),
b = c("red", "black", "green"),
c = c("black", "green", "brown"))
What I would like to get as a result:
duplicates: c("red", "black", "green")
vector of all entries, without double entries: c("blue", "red", "yellow", "black", "green", "brown")
I was not able to find a function for that other than duplicated(), which only checks the list elements as a whole and not the entries themselves.
Thank you for your help :)
You can unlist first:
unlisted <- unlist(examplelist)
unlisted[duplicated(unlisted)]
# b1 c1 c2
# "red" "black" "green"
unlisted[!duplicated(unlisted)]
# a1 a2 a3 b2 b3 c3
# "blue" "red" "yellow" "black" "green" "brown"
If you only want the vector (without the names), use unname:
unlisted <- unname(unlist(examplelist))
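One caveat worth sketching: duplicated() flags every occurrence after the first, so a value that appears three or more times is reported more than once. Wrapping the result in unique() gives each duplicated value exactly once. The list `three_way` below is a made-up example, not the data from the question.

```r
# Hypothetical list in which "red" appears three times
three_way <- list(a = c("red", "blue"), b = "red", c = c("red", "blue"))
flat <- unname(unlist(three_way))

# Every repeat is reported, so "red" shows up twice
repeats <- flat[duplicated(flat)]
repeats
# [1] "red" "red" "blue"

# Each duplicated value once
dups <- unique(repeats)
dups
# [1] "red" "blue"
```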

How to change a two-words phrase across a column in R

I have a dataframe like my_df. In the column colors, I would like to replace the content of every cell that contains the word 'blue' with just the word 'blue', and end up with a dataframe like my_df2.
clothes <- c("skirt", "jacket", "shirt")
colors <- c("light blue", "dark blue", "ice blue")
my_df <- as.data.frame(cbind(clothes, colors))
color_blue <- c("blue", "blue", "blue")
my_df2 <- as.data.frame(cbind(clothes, color_blue))
I have tried this:
my_df[grepl("blue", my_df$colors),] == "blue"
Thank you for your interest
You can subset your my_df using grepl(), which detects patterns and returns a logical vector.
clothes <- c("skirt", "jacket", "shirt", "shoes")
colors <- c("light blue", "dark blue", "ice blue", "grey")
my_df <- as.data.frame(cbind(clothes, colors))
my_df$color_blue[grepl("blue", my_df$colors)] <- "blue"
my_df
clothes colors color_blue
1 skirt light blue blue
2 jacket dark blue blue
3 shirt ice blue blue
4 shoes grey <NA>
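The approach above writes to a new column and leaves NA where there is no match. If the goal is to overwrite the existing column and keep the other values, a sketch along these lines may be closer to what was asked. It assumes the column is character rather than factor, which is why the data frame is built with stringsAsFactors = FALSE here.

```r
my_df <- data.frame(
  clothes = c("skirt", "jacket", "shirt", "shoes"),
  colors = c("light blue", "dark blue", "ice blue", "grey"),
  stringsAsFactors = FALSE
)

# Overwrite only the matching cells; non-matching values stay untouched
my_df$colors[grepl("blue", my_df$colors)] <- "blue"
my_df$colors
# [1] "blue" "blue" "blue" "grey"
```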

Write a function in R to group factor levels by frequency, then keep the 2 largest categories and pool the rest in "other" [closed]

I would like to write a function in R that takes a single factor variable and a parameter n as inputs, computes the number of cases per category in the factor variable, keeps only the n categories with the most cases, and pools all other categories into a category "other". This function must be applied to multiple variables, keeping the 2 largest categories for each variable and pooling all other categories in each variable into a category "other".
Example:
var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
"square", "circle", "circle", "circle", "circle", "square", "circle", "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
"blue", "orange", "blue", "blue", "blue", "orange", "orange", "orange", "orange", "green", "purple")
df <- data.frame(var1, var2)
Thank you so much!
forcats::fct_lump_n() exists for precisely this:
library(forcats)
library(dplyr)
df %>%
mutate_all(fct_lump_n, 2)
var1 var2
1 square orange
2 square orange
3 square orange
4 circle orange
5 square blue
6 square orange
7 circle blue
8 square blue
9 circle orange
10 circle blue
11 circle blue
12 circle blue
13 square orange
14 circle orange
15 Other orange
16 circle orange
17 circle Other
18 Other Other
You can do that with data.table. There is probably a more elegant way to do it, but this seems to work:
library(data.table)
myfunc <- function(x, n = 10){
xvar <- data.table::as.data.table('x' = x)
dt <- xvar[,.('count' = .N), by = "x"][order(-get('count'))]
dt[, "category" := as.character(get("x"))]
dt[, 'rk' := (seq_len(.N)<=n)]
dt[!get('rk'), c('category') := "other"]
dt <- merge(xvar,dt, by = "x")
return(dt$category)
}
I coerce your example dataframe to a data.table object:
var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
"square", "circle", "circle", "circle", "circle", "square", "circle", "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
"blue", "orange", "blue", "blue", "blue", "orange", "orange", "orange", "orange", "green", "purple")
df <- data.frame(var1, var2)
df2 <- as.data.table(df)
Then, the call is quite easy:
df2[,lapply(.SD, myfunc, n = 3)]
var1 var2
1: circle blue
2: circle blue
3: circle blue
4: circle blue
5: circle blue
6: circle blue
7: circle green
8: circle orange
9: circle orange
10: other orange
11: square orange
12: square orange
13: square orange
14: square orange
15: square orange
16: square orange
17: square orange
18: triangle other
A data.table object is a special data.frame, so you don't need to coerce it back to the data.frame class.
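Note that merge() sorts its result by the key column by default, which is why the output above is ordered by category and the rows of var1 and var2 no longer line up with the original data. If row order matters, a base R sketch like the following keeps it. The function name `lump_top_n` and its `other` argument are hypothetical, and the sketch assumes there are no missing values.

```r
# Keep the n most frequent categories of x and pool the rest, preserving order
lump_top_n <- function(x, n = 2, other = "other") {
  x <- as.character(x)
  tab <- table(x)
  # order() is stable, so tied categories keep table()'s alphabetical order
  top <- names(tab)[order(-as.integer(tab))][seq_len(min(n, length(tab)))]
  ifelse(x %in% top, x, other)
}

var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle",
          "triangle", "circle", "circle", "rectangle")

lumped <- lump_top_n(var1, n = 2)
which(lumped == "other")
# [1] 15 18
```

To apply it to every column of a data frame in place, `df[] <- lapply(df, lump_top_n, n = 2)` is one option.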

create a vector in R using variable names

I have a variable called school_name
I am creating a vector to define colors that I will use later in ggplot2.
colors <- c("School1" = "yellow", "School2" = "red", ______ = "Orange")
In my code I am using the variable school_name for some logic and want to add it as the name of the third element of my vector. The value changes in my for loop and cannot be hard-coded.
I have tried the following but it does not work.
colors <- c("School1" = "yellow", "School2" = "red", get("school_name") = "Orange")
Please can someone help me with this
You can use structure:
school_name = "coolSchool"
colors <- structure(c("yellow", "red", "orange"), .Names = c("School1","School2", school_name))
You can just set the names of the colors using names():
colors <- c("yellow", "red", "orange")
names(colors) <- c("School1", "School2", school_name)
This also works:
school_name <- "school3"
colors <- c("School1" = "yellow", "School2" = "red")
colors[school_name] <- "Orange"
# School1 School2 school3
# "yellow" "red" "Orange"
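Another option, sketched here with the same made-up school name, is setNames() from the stats package (attached by default), which builds the named vector in a single expression and is therefore convenient inside a loop.

```r
school_name <- "school3"

# Values first, names second; the third name comes from the variable
colors <- setNames(c("yellow", "red", "orange"),
                   c("School1", "School2", school_name))
colors
#  School1  School2  school3
# "yellow"    "red" "orange"
```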

Find Max Color & Count

I have a matrix in the following format:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] "blue" "red" "blue" "blue" "blue" "red" "green" "blue" "blue"
[2,] "green" "red" "blue" "blue" "blue" "red" "green" "blue" "blue"
[3,] "yellow" "red" "blue" "blue" "blue" "red" "green" "blue" "blue"
[4,] "red" "red" "blue" "blue" "blue" "red" "green" "blue" "blue"
[5,] "blue" "red" "green" "blue" "blue" "red" "green" "blue" "blue"
[6,] "green" "red" "green" "blue" "blue" "red" "green" "blue" "blue"
...
How do I quickly calculate the max color and its count per row?
For instance, for row 1, it would be "blue, 6". I am doing this via an apply command that calls "table".
However, my matrix has 1.9 million rows so it takes too long. How can I vectorize this?
How many different possibilities do you have for each cell of the matrix? Is it just like in your example? If yes, something like the following may be faster:
dat <- structure(c("blue", "green", "yellow", "red", "blue", "green",
"red", "red", "red", "red", "red", "red", "red", "red", "blue",
"blue", "blue", "blue", "green", "green", "red", "blue", "blue",
"blue", "blue", "blue", "blue", "red", "blue", "blue", "blue",
"blue", "blue", "blue", "blue", "red", "red", "red", "red", "red",
"red", "blue", "green", "green", "green", "green", "green", "green",
"blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue",
"blue", "blue", "blue", "blue", "blue", "blue", "green"), .Dim = c(7L,
9L))
values <- c("blue", "red", "green", "yellow")
counts <- vapply(values, function(value) rowSums(dat == value),
numeric(nrow(dat))) # Thanks to @RichardScriven for the improvement :)
counts
# blue red green yellow
# [1,] 6 2 1 0
# [2,] 5 2 2 0
# [3,] 5 2 1 1
# [4,] 5 3 1 0
# [5,] 5 2 2 0
# [6,] 4 2 3 0
# [7,] 4 4 1 0
max.value.col <- max.col(counts)
max.value <- colnames(counts)[max.value.col]
max.counts <- counts[cbind(1:nrow(counts), max.value.col)]
paste(max.value, max.counts, sep = ", ")
# [1] "blue, 6" "blue, 5" "blue, 5" "blue, 5" "blue, 5" "blue, 4"
If there is a tie and you want the names of all tied columns, the following would work, but it may take a while (I am not sure about the performance of apply in this case):
max.value.all.cols <- counts == counts[cbind(1:nrow(counts), max.value.col)]
paste(
apply(max.value.all.cols, 1, function(r) paste(paste(colnames(counts)[r],
collapse = ", "))),
max.counts, sep = ", ")
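A related detail: max.col() breaks ties at random by default (ties.method = "random"), so a row such as the seventh one above, where blue and red both occur four times, can give a different answer on each run. If a deterministic result is enough, ties.method = "first" picks the leftmost tied column. The small `counts` matrix below is a two-row excerpt typed in by hand for illustration.

```r
# Two rows of counts: a clear winner, and a blue/red tie
counts <- matrix(c(6, 2, 1, 0,
                   4, 4, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("blue", "red", "green", "yellow")))

# Deterministic tie-breaking: take the first column that reaches the maximum
first_max <- max.col(counts, ties.method = "first")
result <- paste(colnames(counts)[first_max],
                counts[cbind(seq_len(nrow(counts)), first_max)],
                sep = ", ")
result
# [1] "blue, 6" "blue, 4"
```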
Here's an actual data.table solution, I think. It leverages data.table's fast .N for counting row frequencies:
library(data.table)
flip <- data.table(t(mat))
tally <- lapply(names(flip),
function(x) {
setnames(flip[, .N, by=eval(x)][order(-N)][1,],
c('clr', 'N')) } )
do.call(rbind, tally)
# clr N
# 1: blue 6
# 2: blue 5
# 3: blue 5
# 4: blue 5
# 5: blue 5
# 6: blue 4
I take the matrix and transpose it, then do counts by each column (i.e. by each row of the original matrix). The setnames bit is required so that we can conveniently collapse the results together, but if you are happy to get the results in list form it's not required.
I used the same data as others:
mat <-
matrix(c( "blue","red","blue","blue","blue","red","green","blue","blue",
"green","red","blue","blue","blue","red","green","blue","blue",
"yellow","red","blue","blue","blue","red","green","blue","blue",
"red","red","blue","blue","blue","red","green","blue","blue",
"blue","red","green","blue","blue","red","green","blue","blue",
"green","red","green","blue","blue","red","green","blue","blue"),
ncol = 9, byrow = TRUE)
