How do you get duplicates in a list in Elixir? - functional-programming

If you want to get the duplicates rather than the unique values in a list, how would you do this in a quick, dense script that uses pattern matching?
For example, with an input of ["ash", "bob", "cat", "bob", "ash"], how could I get ["ash", "bob"]?

Since you specified you wanted a quick, dense script, I think you should consider this solution:
l = ["ash", "bob", "cat", "bob", "ash", "ash"]
# to get all duplicates
l -- Enum.uniq(l) # => ["bob", "ash", "ash"]
# to get a unique list of duplicates
Enum.uniq(l -- Enum.uniq(l)) # => ["bob", "ash"]

Here is how I would do it:
["ash", "bob", "cat", "bob", "ash"]
|> (&((&1 -- (&1 |> Enum.uniq())) |> Enum.uniq())).()
This is the same as doing:
my_list = ["ash", "bob", "cat", "bob", "ash"]
(my_list -- (my_list |> Enum.uniq())) |> Enum.uniq()
What is happening:
1. Get a list of all the unique values (the complement of what we want): my_list |> Enum.uniq()
2. Use list subtraction (--) to remove one occurrence of each of these unique values, leaving only the extra copies.
3. Use another call to Enum.uniq/1 to return these "duplicates" as a unique list.

If you want to get a unique list of all duplicates:
def get_uniq_duplicates(all) do
  all
  |> Enum.reduce({[], []}, fn val, {once, duplicates} ->
    if once |> Enum.member?(val) do
      if duplicates |> Enum.member?(val) do
        {once, duplicates}
      else
        {once, duplicates ++ [val]}
      end
    else
      {once ++ [val], duplicates}
    end
  end)
  |> elem(1)
end
If you want a list of duplicates where only one copy of each value has been removed, e.g. ["a", "b", "c", "c", "c"] -> ["c", "c"], then you can use the simpler:
def get_duplicates(all) do
  all
  |> Enum.reduce({[], []}, fn val, {once, duplicates} ->
    if once |> Enum.member?(val) do
      {once, duplicates ++ [val]}
    else
      {once ++ [val], duplicates}
    end
  end)
  |> elem(1)
end

Using Enum.group_by/2
["ash", "bob", "cat", "bob", "ash"]
|> Enum.group_by(& &1)
|> Enum.filter(fn {_value, occurrences} -> length(occurrences) > 1 end)
|> Enum.map(fn {value, _occurrences} -> value end)
# => ["ash", "bob"]
This method also supports searching structs for specific criteria, since the grouping and filtering functions can inspect any field.

Related

Automatically create data frames based on factor levels of a column

I have some fake case data with a manager id, type, and location. I'd like to automatically create data frames with the average number of cases a manager has at a given location.
# create fake data
manager_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
type <- c("A", "A", "B", "B", "B", "A", "A", "A", "C", "A", "B", "B", "C", "C", "C")
location <- c("Beach", "Beach", "Beach", "Beach", "Beach", "City", "City", "City", "Farm", "Farm", "Farm", "Farm", "Farm", "Farm", "City")
manager_id <- data.frame(manager_id)
type <- data.frame(type)
location <- data.frame(location)
df <- cbind(manager_id, type, location)
After creating fake data, I created a function that finds this average. The function works.
avgs_function <- function(dat){
  dat1 <- dat %>% group_by(manager_id) %>% summarise(total = n())
  total <- mean(dat1$total)
  total <- round(total, 0)
  total
}
I then loop through each location, create data frames using the avgs_function, and store them in a list. Then I call the data frames into my global environment. Something is going wrong here that I can't figure out. The weird thing is that it was working fine yesterday.
df_list <- unique(df$location) %>%
  set_names() %>%
  map(~avgs_function(df))
names(df_list) <- paste0(names(df_list), "_avg")
list2env(df_list, envir = .GlobalEnv)
Right now, the code is giving these values:
Beach_avg = 5
City_avg = 5
Farm_avg = 5
I would like:
Beach_avg = 5
City_avg = 2
Farm_avg = 3
I believe the issue is happening with the purrr package. Any help would be greatly appreciated!
I don't think you need purrr at all (just dplyr): this gets your desired output
result <- (df
  %>% count(manager_id, location)
  %>% group_by(location)
  %>% summarise(across(n, mean))
)
(although without the _avg added to the location names: you could add mutate(across(location, paste0, "_avg")) (or something with glue) if you wanted)
This also doesn't create the separate variables you wanted (although obviously you can add more steps, e.g. with(result, setNames(as.list(n), location)) %>% list2env(envir = .GlobalEnv)). In general, though, workflows that populate your global workspace with a bunch of differently named variables are a bad idea: collections like this can usually be handled better by keeping them inside a list/data frame/tibble.
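For what it's worth, the original purrr pipeline returns 5 for every location because the anonymous function never uses its argument: map(~avgs_function(df)) runs avgs_function on the full df once per location, ignoring .x. A minimal sketch of the fix, reusing avgs_function and df from the question:
library(dplyr)
library(purrr)

df_list <- unique(df$location) %>%
  set_names() %>%
  # .x is the current location, so subset df before averaging
  map(~ avgs_function(filter(df, location == .x)))
names(df_list) <- paste0(names(df_list), "_avg")
# df_list is now list(Beach_avg = 5, City_avg = 2, Farm_avg = 3)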

Find matching values in different row of dataframe with group_by and str_detect

Let's say I have a table:
library(tidyverse)
df <- tibble(group = c("a", "a", "b", "b", "c", "c"),
             code = c("foo", "bar", "fuz", "baz", "fiz", "boz"),
             child_code = c("bar", "", "baz", "", "biz", ""))
I'd like to group_by group and then search for each code in the group's child_code column, to get something like this:
group   code   child_code   code_in_child_code
a       foo    bar          FALSE
a       bar                 TRUE
b       fuz    baz          FALSE
b       baz                 TRUE
c       fiz    biz          FALSE
c       boz                 FALSE
I've tried:
df %>% group_by(group) %>% mutate(code_in_child_code = str_detect(child_code, code))
But (I suppose obviously) that's just looking for the child_code in the same row's code column. I want to search the child_code column for any value among the whole group's codes.
Any help would be much appreciated.
I've found an answer:
df %>%
  group_by(group) %>%
  mutate(codes = paste(code, collapse = "|"),
         code_in_child_code = str_detect(child_code, codes))
Seems fairly obvious now: of course I needed to get the group's codes into the same row as the value I was searching. Note that str_detect(child_code, codes) tests each child_code against the group's codes, so the TRUE and FALSE values land the "wrong" way round relative to the table above, but it's just as useful either way.
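For what it's worth, a simpler sketch (my suggestion, not part of the answer above) gets the column the "right" way round by testing membership directly instead of building a regex:
library(dplyr)

df %>%
  group_by(group) %>%
  # is this row's code anywhere in the group's child_code column?
  mutate(code_in_child_code = code %in% child_code)
This also avoids regex pitfalls such as partial matches (with str_detect, a code like "ba" would match the child code "bar").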

Substitute digits with strings contained in a reference dataframe

I have a df_a looking like:
df_a <- tibble::tribble(
  ~id,    ~string,
  115088, "1-3-5-13",
  678326, "1-9-13-3",
  105616, "1-3-5-13"
)
Each id is associated with the string column, which stores strings composed of digits separated by "-".
I have a reference dataframe in which each id_string is associated with a string of text.
id <- tibble::tribble(
  ~name, ~id_string,
  "aaa", 1,
  "bbb", 3,
  "ccc", 5,
  "ddd", 13,
  "eee", 9,
  "fff", 8,
  "ggg", 6
)
I would like to substitute the digits in the string column of df_a with the text stored in the reference dataframe id.
The result should be:
df_output <- tibble::tribble(
  ~id,    ~string,
  115088, "aaa-bbb-ccc-ddd",
  678326, "aaa-eee-ddd-bbb",
  105616, "aaa-bbb-ccc-ddd"
)
Yeah, you got a pretty nasty one right here; this is the type of thing I would write a dedicated C++ method for and call from R because, as I see it, it has asymmetries.
I wrote an iterative loop for you. It might work (I'm not sure), but even if it does, once your data is over 200K rows it will become a problem and might take a long time to finish.
temp <- strsplit(df_a$string, "-") %>% lapply(as.numeric)
actual.List <- list()
for (i in seq_along(temp)) {
  temp.List <- character(length(temp[[i]]))
  for (j in seq_along(temp[[i]])) {
    # find the reference row whose id_string matches this digit
    temp.List[j] <- id$name[which(id$id_string == temp[[i]][j])]
  }
  # reassemble the names in their original order
  actual.List[[i]] <- paste(temp.List, collapse = "-")
}
desired.Output <- data.frame(id = df_a$id, string = unlist(actual.List))
#cleanup
rm(temp, temp.List, actual.List)
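A vectorized alternative (my sketch, not part of the answer above) should scale much better than nested loops: build a named pattern-to-replacement vector and let stringr substitute everything at once. The \\b word boundaries stop "1" from matching inside "13".
library(dplyr)
library(stringr)

# named vector: pattern "\\b13\\b" -> replacement "ddd", and so on
lookup <- setNames(id$name, paste0("\\b", id$id_string, "\\b"))

df_output <- df_a %>%
  mutate(string = str_replace_all(string, lookup))
# string becomes "aaa-bbb-ccc-ddd", "aaa-eee-ddd-bbb", "aaa-bbb-ccc-ddd"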

Find the shortest string by categories R

I am quite a beginner in R, and I faced this problem.
I would like to find the length of the shortest string in each category of my tibble and then
truncate all strings in that category to the width of the shortest one.
ex = tibble(category = c("A", "A", "C", "B", "C", "A"),
            string = c("cat", "bird", "apple", "cloud", "banana", "elephant"))
I can see how to solve the problem in theory; however, I am not able to put it together.
ex %>%
  group_by(category) %>%
  mutate(length = lapply(ex, function(x) min(nchar(x)))) %>%
  somehow str_trunc() ?
At the end I would like to see something like this:
ex = tibble(category = c("A", "A", "C", "B", "C", "A"),
            string = c("cat", "bir", "apple", "cloud", "banan", "ele"))
This should do what you need
ex %>%
  group_by(category) %>%
  mutate(length = min(nchar(string)),
         string = str_sub(string, 1, length))
We don't need the lapply inside the mutate to find the length. We can just run that transformation on the string column directly. And here I used stringr::str_sub to get the substring with the right number of characters since you already seem to be using tidyverse functions. You could also use the base substr function instead.
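For instance, a minimal sketch of that substr variant (my addition, using the same ex tibble):
library(dplyr)

ex %>%
  group_by(category) %>%
  # min(nchar(string)) is computed per group, so each string
  # is cut to its own category's shortest length
  mutate(string = substr(string, 1, min(nchar(string))))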
You can find the length of the shortest string per category in base R with
aggregate(ex$string, list(ex$category),
          function(s) min(nchar(as.character(s))))
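And if you want the truncation itself in base R as well, a sketch using ave (my addition, assuming the same ex data):
# per-row shortest length within the row's category, then truncate to it
ex$string <- substring(ex$string, 1, ave(nchar(ex$string), ex$category, FUN = min))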

Filtering rows based on "complex" strings in a column R/dplyr

I am able to filter my dataset using the strings in a particular column; here's a sample dataset and how I did it.
ID = c(1, 2, 3, 4)
String = c("Y N No", "Y", "Y No", "Y N")
df = data.frame(ID, String)
The problem is that I want to pick only the IDs that have N in them, or, alternatively, only the IDs that don't.
df_2 <- dplyr::filter(df, !grepl('N', String))
Output: [2] [Y]
This will filter out the IDs with N, but it also removes all cases of N (including those that have 'No'). I'm new to R, so I apologize if this is just me not understanding the syntax, but I cannot figure this out.
I could also try parsing out the string into individual columns, then selecting based on that; I need to do this anyway for later analysis. Below is the code I use to achieve this.
df_2 <- df %>%
  mutate(String = gsub("\\b([A-Za-z]+)\\b", "\\11", String),
         name = str_extract_all(String, "[A-Za-z]+"),
         value = str_extract_all(String, "\\d+")) %>%
  unnest() %>%
  spread(name, value, fill = 0)
This gives me:
Output:
ID <chr>   String <chr>   N <chr>   No <chr>   Y <chr>
1          Y1 N1 No1      1         1          1
2          Y1             0         0          1
3          Y1 No1         0         1          1
4          Y1 N1          1         0          1
This way I could just select rows based on whether N is zero or one; however, R doesn't like it when I do this, and I do not understand why.
Thank you for any help you could offer.
EDIT: Here is a sample of my actual data. I might have oversimplified in my question.
m/z Column
241 C15 H22 O Na
265 C15 H15 N5
301 C16 H22 O4 Na
335 C19 H20 O4 Na
441 C26 H42 O4 Na
My goal is to filter out all of the N's in Column (they range from N, N1, N4, etc.).
ID = c(1, 2, 3, 4)
String = c("Y N No", "Y", "Y No", "Y N")
df = data.frame(ID, String)
df %>% filter(!grepl("\\bN\\d*\\b", String))
Output: [Y] [Y No]
This grepl approach (suggested by @MauritsEvers) also works for the more complicated dataset in the edit: the word-boundary pattern matches a standalone N, or an N followed by digits (like N2 or N10), but not the N in "No" or "Na". Remove the "!" to keep the rows containing "N" instead.
I think your second approach is the way to go, especially if you're going to split the columns for downstream analysis. It also (IMO) meets "tidy" requirements. I also suggest standardising the String variable; mixed codings like Y alongside N/No are asking for trouble.
The tidyr package has two nice functions for this: separate and gather.
library(dplyr)
library(tidyr)
ID = c(1, 2, 3, 4)
String = c("Y N No", "Y", "Y No", "Y N")
String <- gsub(pattern = "No", "N", String)
df = data.frame(ID, String)
#Separate the String var
df_sep <- separate(df, col = String, into = c("R1", "R2", "R3"), sep = " ", extra = "merge")
#gather the columns into long format
df_gat <- gather(df_sep, Cols, StrValue, R1:R3, -ID)
#filter: keep the "N" rows, or negate the condition to drop them
filter(df_gat, StrValue == "N")  # or: filter(df_gat, StrValue != "N")
Here is my modified answer:
library(dplyr)
library(tidyr)
#Separate the String var
df_sep <- separate(df, col = Column, into = c("E1", "E2", "E3", "E4"), sep = " ", extra = "merge")
#gather the columns into long data format
gather(df_sep, Cols, Element, E1:E4, -m.z) %>% select(m.z, Element) -> df_gat
#filter out the nitrogen tokens (N, N1, N4, ...)
filter(df_gat, !grepl("^N$|N\\d", Element))
It produces a long dataset that works well with the filter function; your data previously was (more or less) wide. I also suggest changing the symbol for sodium to something else, as you may run into trouble if Na (sodium) is converted to NA (missing).
You probably want to use sub to substitute "" for any pattern matching "N(\\d{1,3}|\\s|$)", meaning "N" followed by one of: 1-3 digits, a space, or the end of the string.
I don't think you want to use filtering, since as I understood the English description, you wanted to remove specific patterns from within character values. I was imagining that these were chemical formulas, where N is nitrogen and Na is sodium.
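For what it's worth, here is a minimal sketch of that idea (my code, not from the answer above; it assumes the sample data from the edit with the composition column named Column, and uses a word-boundary pattern so the N in Na is left alone):
# hypothetical reconstruction of the m/z sample data
df <- data.frame(
  m.z = c(241, 265, 301, 335, 441),
  Column = c("C15 H22 O Na", "C15 H15 N5", "C16 H22 O4 Na",
             "C19 H20 O4 Na", "C26 H42 O4 Na")
)
# remove standalone nitrogen tokens (N, N1, N4, ...) but keep Na,
# then tidy up any doubled spaces left behind
df$Column <- trimws(gsub("\\s+", " ", gsub("\\bN\\d*\\b", "", df$Column)))
df$Column
# [1] "C15 H22 O Na" "C15 H15" "C16 H22 O4 Na" "C19 H20 O4 Na" "C26 H42 O4 Na"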
