Organize subgroup strings (text) - r

I am trying to convert something like this df format:
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df
first words
1 a about
2 a among
3 b blue
4 b but
5 b both
6 c cat
into the following format:
df1
first words
1 a about, among
2 b blue, but, both
3 c cat
I have tried
aggregate(words ~ first, data = df, FUN = list)
first words
1 a 1, 2
2 b 3, 5, 4
3 c 6
and tidyverse:
df %>%
group_by(first) %>%
group_rows()
Any suggestions would be appreciated!

A data.table solution:
library(data.table)
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df <- setDT(df)[, lapply(.SD, toString), by = first]
df
# first words
# 1: a about, among
# 2: b blue, but, both
# 3: c cat
# convert back to a data.frame if you want
setDF(df)

Using tidyverse, after the group_by, use summarise to either paste the words into a single string
library(dplyr)
df %>%
group_by(first) %>%
summarise(words = toString(words))
# A tibble: 3 x 2
# first words
# <fct> <chr>
#1 a about, among
#2 b blue, but, both
#3 c cat
or keep it as a list column
df %>%
group_by(first) %>%
summarise(words = list(words))
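For comparison, the aggregate() attempt from the question can also be made to work in base R. This is only a minimal sketch, assuming the goal is one comma-separated string per group: it swaps FUN = list for toString, and sets stringsAsFactors = FALSE (an assumption about how the data is built) so the words are treated as text rather than factor codes.

```r
# same data as in the question, kept as character columns
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
                 words = c("about", "among", "blue", "but", "both", "cat"),
                 stringsAsFactors = FALSE)

# collapse the words of each group into one comma-separated string
df1 <- aggregate(words ~ first, data = df, FUN = toString)
df1
```

The result should have one row per value of first, which is the layout asked for in the question.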

Related

Move subgroup under repeated main group while keeping main group once in data.frame R

I'm aware that the question is awkward. If I could phrase it better I'd probably find the solution in another thread.
I have this data structure...
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"))
... and would like to turn it into this...
data.frame(group = c("X", "F", "camel", "horse", "dog", "cat", "C", "orange", "banana"))
... which is surprisingly confusing. Also, I would prefer not using a loop.
EDIT: I updated the example to clarify that solutions that depend on sorting unfortunately do not do the trick.
Here is an (edited) answer with the new data.
Using data.table is going to help a lot. The idea is to split the df into groups and lapply() to each group what we need. We have to take care of a few things along the way.
library(data.table)
# set as data.table
setDT(df)
# to maintain the ordering, convert group to a factor;
# the levels carry the ordering information used by split
df[,':='(group = factor(group, levels =unique(df$group)))]
# split df into a list, one element per group
df_list <-split(df, df$group, sorted =FALSE)
# now you lapply to each element what you need
df_list <-lapply(df_list, function(x) data.frame(group = unique(c(as.character(x$group),x$subgroup))))
# put into a data.table and remove NAs
rbindlist(df_list)[!is.na(group)]
group
1: X
2: F
3: camel
4: horse
5: dog
6: cat
7: C
8: orange
9: banana
With the edited data we need to add another column (here row_number) to sort by:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
mutate(r_n = row_number()) %>%
group_by(value) %>% slice(1) %>%
arrange(r_n) %>%
filter(!is.na(value))
#output
# A tibble: 9 × 3
# Groups: value [9]
name value r_n
<chr> <chr> <int>
1 group X 1
2 group F 3
3 subgroup camel 4
4 subgroup horse 6
5 subgroup dog 8
6 subgroup cat 10
7 group C 11
8 subgroup orange 12
9 subgroup banana 14
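The same interleaving can be sketched in base R without a loop. This is a hedged illustration rather than either answerer's method: it assumes both columns are character (hence stringsAsFactors = FALSE), and the helper names first_of_group and vals are made up for the example.

```r
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
                 subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"),
                 stringsAsFactors = FALSE)

# keep the group label only on the first row where it occurs
first_of_group <- !duplicated(df$group)
group_once <- ifelse(first_of_group, df$group, NA)

# rbind() stacks the two vectors as rows; c() then reads the matrix
# column by column, i.e. group, subgroup, group, subgroup, ...
vals <- c(rbind(group_once, df$subgroup))

# drop the placeholders and the missing subgroup
out <- data.frame(group = vals[!is.na(vals)], stringsAsFactors = FALSE)
out
```

Because nothing is sorted, the original row order is preserved.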

Giving IDs to groups in R [duplicate]

I have a data frame in R that is divided into groups, like this:
Row Group
1   A
2   B
3   A
4   D
5   C
6   B
7   C
8   C
9   A
10  B
I would like to add a unique numeric ID to each group, so finally I would have something like this:
Row Group ID
1   A     1
2   B     2
3   A     1
4   D     4
5   C     3
6   B     2
7   C     3
8   C     3
9   A     1
10  B     2
How could I achieve this?
Thank you very much.
Here is a simple way.
df1$ID <- as.integer(factor(df1$Group))
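To see why this one-liner works, it may help to look at the two steps separately. The sketch below assumes the Group values from the question; the names f, id and lookup are only illustrative.

```r
Group <- c("A", "B", "A", "D", "C", "B", "C", "C", "A", "B")

# factor() stores the distinct values, sorted, as its levels
f <- factor(Group)
levels(f)

# as.integer() returns, for each element, the position of its level
id <- as.integer(f)

# lookup table from group label to ID
lookup <- setNames(seq_along(levels(f)), levels(f))
lookup
```

Since the levels are sorted, D gets ID 4 and C gets ID 3, as in the expected output.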
Three solutions have been posted: mine, TarJae's and akrun's. They can be timed with increasing data sizes; akrun's is the fastest.
library(microbenchmark)
library(dplyr)
library(ggplot2)
funtest <- function(x, n){
out <- lapply(seq_len(n), function(i){
for(j in seq_len(i)) x <- rbind(x, x)
cat("nrow(x):", nrow(x), "\n")
mb <- microbenchmark(
match = with(x, match(Group, sort(unique(Group)))),
dplyr = x %>% group_by(Group) %>% mutate(ID = cur_group_id()),
intfac = as.integer(factor(x$Group))
)
mb$n <- i
mb
})
out <- do.call(rbind, out)
aggregate(time ~ ., out, median)
}
df1 %>%
funtest(10) %>%
ggplot(aes(n, time, colour = expr)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:10, labels = 1:10) +
scale_y_continuous(trans = "log10") +
theme_bw()
Update
group_indices() was deprecated in dplyr 1.0.0.
Please use cur_group_id() instead.
df1 <- df %>%
group_by(Group) %>%
mutate(ID = cur_group_id())
First answer:
You can use group_indices
library(dplyr)
df1 <- df %>%
group_by(Group) %>%
mutate(ID = group_indices())
data
df <- tribble(
~Row, ~Group,
1, "A",
2, "B",
3, "A",
4, "D",
5, "C",
6, "B",
7, "C",
8, "C",
9, "A",
10,"B")
Row Group ID
<int> <chr> <int>
1 1 A 1
2 2 B 2
3 3 A 1
4 4 D 4
5 5 C 3
6 6 B 2
7 7 C 3
8 8 C 3
9 9 A 1
10 10 B 2
We can use match to look up each value of 'Group' in the sorted unique values of 'Group', which returns its position index
df1$ID <- with(df1, match(Group, sort(unique(Group))))
data
df1 <- structure(list(Row = 1:10, Group = c("A", "B", "A", "D", "C",
"B", "C", "C", "A", "B")), class = "data.frame", row.names = c(NA,
-10L))
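If the IDs should follow the order in which the groups first appear instead of alphabetical order, dropping the sort() is one option. This is a small sketch of that variation, not something the question asked for; the column name ID_seen is hypothetical.

```r
df1 <- data.frame(Row = 1:10,
                  Group = c("A", "B", "A", "D", "C", "B", "C", "C", "A", "B"),
                  stringsAsFactors = FALSE)

# sorted unique values: A = 1, B = 2, C = 3, D = 4
df1$ID <- match(df1$Group, sort(unique(df1$Group)))

# unsorted unique values: numbered by first appearance, so D = 3 and C = 4
df1$ID_seen <- match(df1$Group, unique(df1$Group))
df1
```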

How to count the frequency of unique factor across each row in r dataframe

I have a dataset like the following:
Age Monday Tuesday Wednesday
6-9 a b a
6-9 b b c
6-9 c a
9-10 c c b
9-10 c a b
Using R, I want to get the following data set/ results (where each column represents the total frequency of each of the unique factor):
Age a b c
6-9 2 1 0
6-9 0 2 1
6-9 1 0 1
9-10 0 1 2
9-10 1 1 1
Note: My data also contains missing values
A couple of quick and dirty tidyverse solutions; there should be a way to reduce the steps, though.
library(tidyverse) # install.packages("tidyverse")
input <- tribble(
~Age, ~Monday, ~Tuesday, ~Wednesday,
"6-9", "a", "b", "a",
"6-9", "b", "b", "c",
"6-9", "", "c", "a",
"9-10", "c", "c", "b",
"9-10", "c", "a", "b"
)
# pivot solution
input %>%
rowid_to_column() %>%
mutate(across(where(is.character), ~ na_if(.x, ""))) %>% # only character columns: rowid is an integer
pivot_longer(cols = -c(rowid, Age), values_drop_na = TRUE) %>%
count(rowid, Age, value) %>%
pivot_wider(id_cols = c(rowid, Age), names_from = value, values_from = n, values_fill = list(n = 0)) %>%
select(-rowid)
# manual solution (if only a, b, c are expected as options)
input %>%
unite(col = "combine", Monday, Tuesday, Wednesday, sep = "") %>%
transmute(
Age,
a = str_count(combine, "a"),
b = str_count(combine, "b"),
c = str_count(combine, "c")
)
In base R, we can replace empty values with NA, get unique values in the dataframe, and use apply row-wise and count the occurrence of values using table.
df[df == ''] <- NA
vals <- unique(na.omit(unlist(df[-1])))
cbind(df[1], t(apply(df, 1, function(x) table(factor(x, levels = vals)))))
# Age a b c
#1 6-9 2 1 0
#2 6-9 0 2 1
#3 6-9 1 0 1
#4 9-10 0 1 2
#5 9-10 1 1 1
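The base R snippet above uses a data frame df that is not defined in the answer. The following self-contained sketch builds it under the assumption that the missing value sits in the Monday column (as in the tidyverse answer), and counts with a two-way table() instead of apply(); that swap is mine, not the answerer's.

```r
df <- data.frame(Age = c("6-9", "6-9", "6-9", "9-10", "9-10"),
                 Monday = c("a", "b", NA, "c", "c"),
                 Tuesday = c("b", "b", "c", "c", "a"),
                 Wednesday = c("a", "c", "a", "b", "b"),
                 stringsAsFactors = FALSE)

# matrix of the day columns only
m <- as.matrix(df[-1])

# cross-tabulate row index against value; table() drops NA by default
counts <- table(row(m), m)

# bind the counts back to the Age column
res <- cbind(df[1], as.data.frame.matrix(counts))
res
```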

How to add new column to R dataframe based on values in multiple columns

I have created the following dataframe
df<-data.frame("A"<-(1:5), "B"<-c("A","B", "C", "B",'C' ), "C"<-c("A", "A",
"B", 'B', "B"))
names(df)<-c("A", "B", "C")
I am trying to obtain the values duplicated between columns A and C, and add the corresponding values in column B. The expected data frame should be
df2<- "B" "Dupvalues"
1 A
4 B
I am unable to do this. I request some help here
df<-data.frame(A = (1:5),
B = c("A","B", "C", "B",'C' ),
C = c("A", "A","B", 'B', "B"), stringsAsFactors = F)
library(dplyr)
df %>%
filter(B == C) %>% # keep rows when B equals C
group_by(A) %>% # for each A
transmute(DupValues = B) %>% # keep the duplicate value
ungroup() # forget the grouping
# # A tibble: 2 x 2
# A DupValues
# <int> <chr>
# 1 1 A
# 2 4 B
Note that this works if your variables are character variables, not factors.
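If the columns are factors, comparing them directly with == can fail when their level sets differ. A possible workaround, sketched here in base R under the assumption that the data was created with stringsAsFactors = TRUE, is to compare the character values; the names same and res are illustrative only.

```r
df <- data.frame(A = 1:5,
                 B = c("A", "B", "C", "B", "C"),
                 C = c("A", "A", "B", "B", "B"),
                 stringsAsFactors = TRUE)

# compare the labels, not the factor objects
same <- as.character(df$B) == as.character(df$C)

# keep A and the shared value for the matching rows
res <- data.frame(A = df$A[same],
                  DupValues = as.character(df$B[same]),
                  stringsAsFactors = FALSE)
res
```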

Count occurrence of a categorical variable, when grouping and summarising by a different variable in R

I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of col a, grouped by variable c. And I would like to have a column with the counts of the occurrence of b types within each group c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Giulia
Here's a way using a little table magic:
df %>%
group_by(c) %>%
summarise(a_mean = mean(a),
b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
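The output above lists zero counts and puts the label before the count, whereas the question asked for strings such as "2 u". A base R sketch that gets closer to that format is shown below; count_label is a hypothetical helper, and ordering the labels by first appearance within each group is an assumption read off the example df2.

```r
a <- c(10, 20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a, b, c, stringsAsFactors = FALSE)

# count the values of x in order of first appearance, formatted as "n label"
count_label <- function(x) {
  tab <- table(factor(x, levels = unique(x)))
  paste(tab, names(tab), collapse = ", ")
}

# tapply() returns one value per group of c, in sorted group order
res <- data.frame(c = sort(unique(df$c)),
                  a_m = as.vector(tapply(df$a, df$c, mean)),
                  counts_b = as.vector(tapply(df$b, df$c, count_label)),
                  stringsAsFactors = FALSE)
res
```

Only values that actually occur in a group are counted, so no "0 r" entry appears for group a.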
Here is another solution using reshape2. The output format may be more convenient to work with, each value of b has its own column with the number of occurrences.
library(reshape2)
out1 <- dcast(df, c ~ b, value.var="c", fun.aggregate=length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by="c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
