Concatenate 2 vectors to string by common element - r

I have a data.frame with 2 columns. If an element appears in both columns, this should be the grouping criterion. I then want to create a new column which concatenates all elements of each group into a single, sorted string.
df <- tibble::tribble(
~col1, ~col2,
"a", "b",
"b","c",
"c","b",
"d",NA,
"e","d",
"f","d",
"g","d",
"h","i",
"i","h",
"j", NA
)
outcome <- tibble::tribble(
~result,
c("a_b_c"),
c("d_e_f_g"),
c("h_i"),
c("j")
)
Any help is appreciated, since I have not yet found a starting point for solving this. Thanks!

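Treat each row as an edge of a graph, get the connected components with igraph, and paste the members of each component together. coalesce replaces an NA in col2 with the value from col1, so that elements without a partner (such as "j") are kept as self-loops.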
library(dplyr)
library(igraph)
df %>%
mutate(col2 = coalesce(col2, col1)) %>%
as.matrix %>%
graph_from_edgelist %>%
components %>%
groups %>%
sapply(paste, collapse = "_") %>%
stack
giving:
   values ind
1   a_b_c   1
2 d_e_f_g   2
3     h_i   3
4       j   4
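For readers who want to see the grouping step without igraph, here is a minimal base R sketch of the same idea. It swaps connected components for a simple label propagation loop: every element starts with its own label, and each pair repeatedly takes the smaller of its two labels until nothing changes. The variable names (nodes, label, from, to, result) are illustrative and not part of the answer above.

```r
# Data from the question, as two plain character vectors
col1 <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
col2 <- c("b", "c", "b", NA, "d", "d", "d", "i", "h", NA)

# Every distinct element starts in its own group (sorted, so output is sorted too)
nodes <- sort(unique(c(col1, col2[!is.na(col2)])))
label <- setNames(seq_along(nodes), nodes)

# Only rows with both values present link two elements
has_pair <- !is.na(col1) & !is.na(col2)
from <- col1[has_pair]
to <- col2[has_pair]

# Propagate the smaller label along every pair until no label changes
repeat {
  previous <- label
  for (k in seq_along(from)) {
    smallest <- min(label[[from[k]]], label[[to[k]]])
    label[[from[k]]] <- smallest
    label[[to[k]]] <- smallest
  }
  if (identical(previous, label)) break
}

# Paste the members of each group into one string
result <- unname(vapply(split(names(label), label), paste,
                        character(1), collapse = "_"))
```

This loop is fine for small inputs; for large data the igraph approach is likely the better choice.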

Related

Filter rows based on regex pattern and ID

I have a df like this:
df <- data.frame(
id = c("A", "A", "B", NA, "A", "B", "B", "B"),
speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)", "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well")
)
I want to keep only (1) those rows where speech consists entirely of an expression wrapped in square brackets [...], from string start to string end, AND (2) those rows by the same ID which follow such a row. I know how to filter for the rows with [...]:
df %>%
group_by(grp = rleid(id)) %>%
filter(grepl("^\\[.*?\\]$", speech))
but I don't know how to also keep the same-ID rows that follow the [...] row. The desired output is this:
df
  id speech
1  B   [uh]
2  B  [erm]
3  B  (0.4)
4  B   well
Create the grouping index with rleid as in the OP's code, then keep only the groups whose first element of 'speech' starts with [, ungroup, and drop the helper column:
library(dplyr)
library(data.table)
library(stringr)
df %>%
group_by(grp = rleid(id)) %>%
filter(str_detect(first(speech), "^\\[")) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 4 x 2
# id speech
# <chr> <chr>
#1 B [uh]
#2 B [erm]
#3 B (0.4)
#4 B well
EDIT: Based on @ChrisRuehlemann's comments
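The run-based logic can also be sketched in base R, which may help to see what rleid and first are doing. This is only an illustration under some assumptions: rle stands in for data.table's rleid (unlike rleid, rle treats each NA as its own run, which makes no difference for this data), and the question's stricter pattern is used so that the first utterance of a run must consist entirely of one [...] expression. The names runs, grp, first_speech and kept are made up for this sketch.

```r
df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)",
             "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well"),
  stringsAsFactors = FALSE
)

# Number the runs of consecutive identical ids (base R stand-in for rleid)
runs <- rle(df$id)
grp <- rep(seq_along(runs$lengths), runs$lengths)

# First utterance of each run, repeated for every row of that run
first_speech <- df$speech[!duplicated(grp)][grp]

# Keep a run only if its first utterance is entirely a [...] expression
kept <- df[grepl("^\\[.*?\\]$", first_speech), ]
```

Here kept holds the four B rows from the desired output, with their original row names.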

Convert categorical "other" value to NA in a data frame using dplyr

trial <- data.frame(c("A", "B", "C", "other"), c("a","b","Others","d"))
There are 2 categorical variables (attributes) in the data frame. I want to recode the value "other" as NA. I follow the link here: https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html
na_strings <- c("other", "Others")
trial %>%
replace_with_na_all(condition = ~.x %in% na_strings)
The "other" value does change to NA, but all other characters are turned into numbers. I want the rest of the values to remain character.
What should I do? Thanks in advance.
Here is a simple dplyr solution:
library(dplyr)
library(naniar)
trial %>%
mutate_if(is.factor,as.character) %>%
replace_with_na_all(condition = ~.x %in% na_strings)
You just need to change your variable class from factor to character before calling the replace_with_na_all function.
You can use base R:
trial[sapply(trial, `%in%`, na_strings)] <- NA
Or use only dplyr to do this:
library(dplyr)
trial %>% mutate_all(~replace(., . %in% na_strings, NA))
#   col1 col2
# 1    A    a
# 2    B    b
# 3    C <NA>
# 4 <NA>    d
data
trial <- data.frame(col1 = c("A", "B", "C", "other"),
col2 = c("a","b","Others","d"))
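With dplyr 1.0 or later, the same two steps (convert factors to character, then replace the unwanted strings) can also be written with across, since mutate_if and mutate_all are superseded there. The following is a sketch under that version assumption; stringsAsFactors = TRUE is set explicitly only to reproduce the factor columns the question ran into, and cleaned is an illustrative name.

```r
library(dplyr)

# Factor columns, as data.frame() produced by default before R 4.0
trial <- data.frame(col1 = c("A", "B", "C", "other"),
                    col2 = c("a", "b", "Others", "d"),
                    stringsAsFactors = TRUE)
na_strings <- c("other", "Others")

cleaned <- trial %>%
  # convert factors first so the remaining values stay character
  mutate(across(where(is.factor), as.character)) %>%
  # then blank out every value listed in na_strings
  mutate(across(everything(), ~ replace(.x, .x %in% na_strings, NA)))
```

Both columns of cleaned are character, with NA in place of "other" and "Others".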

dplyr: how to ignore NA in grouping variable

Using dplyr, I'm trying to group by two variables. If there is an NA in one variable but the other variable matches, I'd still like to see those rows grouped, with the NA taking on the non-NA value. So if I have a data frame like this:
variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)
If I group by variable_A and variable_B, rows 1 and 4 normally wouldn't be grouped together, but I'd like them to be, with the NA overridden by "a". How can I achieve this? The code below doesn't do the job.
df2 <- df %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
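You can group by B first and fill in only the missing A values, using the first non-missing A within each group. Then proceed with what you wanted to do:
library(dplyr)
df_filled <- df %>%
  group_by(variable_B) %>%
  # coalesce() keeps existing values (such as "f") and fills only the NAs
  mutate(variable_A = coalesce(variable_A, first(na.omit(variable_A)))) %>%
  ungroup()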
df_filled %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
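You could do the missing value imputation using base R as follows:
ii <- which(is.na(df$variable_A))
# rows with a known variable_A can donate their value
donors <- df[!is.na(df$variable_A), ]
df_filled <- df
# for each NA row, take variable_A from the first donor with the same variable_B
df_filled$variable_A[ii] <- donors$variable_A[match(df$variable_B[ii], donors$variable_B)]
Then group and summarize as planned with dplyr: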
df_filled %>%
group_by(variable_A, variable_B) %>%
dplyr::summarise(total=sum(variable_C))

Drop unused levels from a factor after filtering data frame using dplyr

I used dplyr functions to create a new data set which contains only the names that have fewer than 4 rows.
df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9,
                 stringsAsFactors = TRUE) # name is a factor, as was the default before R 4.0
aa = df %>%
group_by(name) %>%
filter(n() < 4)
But when I type
table(aa$name)
I get,
a b c
3 2 0
I would like to have my output as follow
a b
3 2
How to completely separate new frame aa from df?
To complete your answer and KoenV's comment: you can write your solution in one line, or apply factor(), which also removes the unused levels:
table(droplevels(aa$name))
table(factor(aa$name))
Or, because you are using dplyr, add droplevels at the end:
aa <- df %>%
group_by(name) %>%
filter(n() < 4) %>%
droplevels()
table(aa$name)
# Without using table
df %>%
group_by(name) %>%
summarise(count = n()) %>%
filter(count < 4)
aaNew <- droplevels(aa)
table(aaNew$name)

How to get count-of-a-count with dplyr?

Let's say we have the data frame
df <- data.frame(x = c("a", "a", "b", "a", "c"))
Using dplyr count, we get
df %>% count(x)
  x n
1 a 3
2 b 1
3 c 1
I now want to do a count on the resulting n column. If the n column were named m, the result I'm looking for is
  m n
1 1 2
2 3 1
How can this be done with dplyr?
Thank you very much!
dplyr seems to have trouble with count(n).
For instance:
d <- data.frame(n = sample(1:2, 10, TRUE), x = 1:10)
d %>% count(n)
A workaround is to rename n:
df %>% # using data defined in question
count(x) %>%
rename(m = n) %>%
count(m)
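Assuming a dplyr version in which count() accepts a name argument (0.8 or later, as far as I know), the rename step can be folded into the first count. A small sketch with the data from the question; count_of_counts is an illustrative name:

```r
library(dplyr)

df <- data.frame(x = c("a", "a", "b", "a", "c"))

count_of_counts <- df %>%
  count(x, name = "m") %>%  # first count, stored in column m instead of n
  count(m)                  # how often each value of m occurs, stored in n
```

The result has one row per distinct value of m, with n giving how many values of x share that count.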
EDIT: I was wrong. I didn't have the newest version of dplyr, so I didn't have the count function.
With dplyr, one way to count is with n(). In your example, you would do the following to obtain the first counts:
df <- data.frame(x = c("a", "a", "b", "a", "c"))
df %>% group_by(x) %>% summarise(count=n())
Then if you want to count the occurrences of particular counts you can do:
df %>% group_by(x) %>% summarise(count=n()) %>% group_by(count) %>% summarise(newCount=n())
This is a dplyr way.
sum((df %>% count(x))$n)
##[1] 5
If you are willing to give data.table a try, it could be quite straightforward.
df <- data.frame(x = c("a", "a", "b", "a", "c"))
library(data.table)
setDT(df)[, .N, by=x][, list(count_of_N=.N), by=N]
# N count_of_N
# 1: 3 1
# 2: 1 2
If you want to count:
df %>% count(x) %>% summarise(length(n))
# length(n)
#1 3
If you want the sum:
df %>% count(x) %>% summarise(sum(n))
# sum(n)
#1 5
It's not pure plyr, but this may work:
countr<-function(x){data.frame(table(x))}
t<-count(df,x)
countr(t[,2])
