Reverse of summarise() function in dplyr [duplicate] - r

Let's consider the following data
data <- data.frame(V1 = c("A","A","A","B","B","C","C"), V2 = c("B","B","B","C","C","D","D"))
> data
V1 V2
1 A B
2 A B
3 A B
4 B C
5 B C
6 C D
7 C D
Now we aggregate data by both columns and obtain
library(dplyr)
group_by(data, V1, V2) %>% summarise(n())
V1 V2 n()
(fctr) (fctr) (int)
1 A B 3
2 B C 2
3 C D 2
Now we want to turn this data back into original data. Is there any function for this procedure?

We can use base R to do this, where data1 is the summarised data (its definition is given at the end of this answer):
data1 <- as.data.frame(data1)
data1[rep(seq_len(nrow(data1)), data1[[3]]), -3]
This is one of the cases where I would opt for base R. Having said that, there are package solutions for this type of problem, e.g. expandRows (a wrapper for the above) from splitstackshape:
library(splitstackshape)
data %>%
group_by(V1, V2) %>%
summarise(n=n()) %>%
expandRows(., "n")
Or if we want to stick to a similar option as in base R within %>%
data %>%
group_by(V1, V2) %>%
summarise(n=n()) %>%
do(data.frame(.[rep(1:nrow(.), .$n),-3]))
# V1 V2
# (fctr) (fctr)
#1 A B
#2 A B
#3 A B
#4 B C
#5 B C
#6 C D
#7 C D
data
data1 <- group_by(data, V1, V2) %>% summarise(n())
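The same round trip can also be sketched with tidyr::uncount, which is not used in the answer above; it repeats each row according to a count column and then drops that column. This is a minimal sketch assuming tidyr (>= 0.8) is installed; the names counts and restored are only illustrative.

```r
library(dplyr)
library(tidyr)

data <- data.frame(V1 = c("A", "A", "A", "B", "B", "C", "C"),
                   V2 = c("B", "B", "B", "C", "C", "D", "D"))

# Aggregate: one row per (V1, V2) pair plus a count column n
# (equivalent to group_by(V1, V2) %>% summarise(n = n()))
counts <- data %>% count(V1, V2, name = "n")

# Reverse the aggregation: repeat each row n times and drop n
restored <- uncount(counts, n)
```

Because count sorts by the grouping columns, restored matches data here only because data was already sorted; the original row order is not recoverable from the counts in general.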

Related

R: Repeating row of dataframe with respect to multiple count columns

I have a R DataFrame that has a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )
So visually the DataFrame would look like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
uncount(value) %>%
mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
Another slight variation is to use expandRows from splitstackshape in conjunction with tidyverse.
library(splitstackshape)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
expandRows("value") %>%
mutate(class = str_extract(class, "\\d+"))
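base R (using reshape2::melt for the reshaping step)
Row order (and row names) notwithstanding:
# keep the counts in their own column so that "class" holds the class number, not the count
tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), variable.name = "class", value.name = "count"), count > 0)
tmp$class <- as.integer(sub("\\D+", "", tmp$class))
tmp[rep(seq_len(nrow(tmp)), times = tmp$count), c("f1", "f2", "class")]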
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
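library(dplyr)
library(tidyr) # for pivot_longer
# keep the counts in their own column so that "class" holds the class number, not the count
df %>%
pivot_longer(-c(f1, f2), names_to = "class", values_to = "count") %>%
dplyr::filter(count > 0) %>%
slice(rep(row_number(), times = count)) %>%
transmute(f1, f2, class = as.integer(sub("\\D+", "", class)))
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <int>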
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2
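Since the question asks for efficiency and takes the number of classes C as an input, the same expansion can be sketched in base R alone, with no reshaping step. This is only a sketch under the assumption that the first C columns hold non-negative integer counts; the names counts, row_idx and class_idx are illustrative.

```r
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0),
                 f1 = c("a", "b"), f2 = c("c", "d"))
C <- 3  # number of class (count) columns at the front of df

counts <- as.matrix(df[, seq_len(C)])
# t() makes as.vector() walk the counts row by row:
# (row 1, class 1), (row 1, class 2), ..., (row 2, class 1), ...
times <- as.vector(t(counts))

# For every (row, class) cell, repeat its row index and class index `count` times
row_idx   <- rep(rep(seq_len(nrow(df)), each = C), times = times)
class_idx <- rep(rep(seq_len(C), times = nrow(df)), times = times)

df_updated <- data.frame(df[row_idx, -seq_len(C)], class = class_idx)
rownames(df_updated) <- NULL
```

Everything here is a vectorised call to rep, so it should scale comfortably to something like 1000 rows and 20 classes, and cells with a zero count drop out automatically.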

Computing ratio between elements with dplyr in R data frame?

Suppose we have a data frame (df) like this:
a b
1 2
2 4
3 6
If I want to compute the ratio of each element in vectors a and b and assign to variable c, we'd do this:
c <- df$a / df$b
However, I was wondering how the same thing could be done using the dplyr package? I.e. are there any ways that this can be achieved using functions from dplyr?
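Maybe you can try the code below (the first two variants pass all columns of df to "/", so they only give a/b when df has exactly two columns, a and b in that order)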
df %>%
mutate(c = do.call("/", .))
or
df %>%
mutate(c = Reduce("/", .))
or
df %>%
mutate(c = a/b)
An option with invoke (note that recent purrr releases deprecate invoke in favour of exec)
library(dplyr)
library(purrr)
df %>%
mutate(c = invoke('/', .))
Output
# a b c
#1 1 2 0.5
#2 2 4 0.5
#3 3 6 0.5
data
df <- data.frame(a = c(1,2,3), b= c(2,4,6))
You can use mutate function from dplyr library:
df <- data.frame(a = c(1,2,3), b= c(2,4,6))
library(dplyr)
df <- df %>%
dplyr::mutate(c = a/b)
Console output:
a b c
1 1 2 0.5
2 2 4 0.5
3 3 6 0.5
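To illustrate why naming the columns in mutate is the more robust choice, here is a small sketch with a hypothetical third column id added to the data. The by-name version is unaffected, whereas passing the whole data frame to "/" no longer works, because the operator accepts at most two arguments.

```r
library(dplyr)

# Hypothetical data: same a and b as above, plus an extra id column
df3 <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6), id = c(10, 20, 30))

# Referring to the columns by name ignores any other columns
res <- df3 %>% mutate(c = a / b)

# Passing every column to "/" fails once there are more than two of them;
# try() captures the error instead of stopping the script
positional <- try(do.call("/", df3), silent = TRUE)
failed <- inherits(positional, "try-error")
```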

Creating Nodes and Edges Dataframes from Tidy Dataframes

I have a data frame that's of this structure:
df <- data.frame(var1 = c(1,1,1,2,2,3,3,3,3),
cat1 = c("A","B","D","B","C","D","E","B","A"))
> df
var1 cat1
1 1 A
2 1 B
3 1 D
4 2 B
5 2 C
6 3 D
7 3 E
8 3 B
9 3 A
And I am looking to create both nodes and edges data frames from it, so that I can draw a network graph, using VisNetwork. This network will show the number/strength of connections between the different cat1 values, as grouped by the var1 value.
I have the nodes data frame sorted:
nodes <- data.frame(id = unique(df$cat1))
> nodes
id
1 A
2 B
3 D
4 C
5 E
What I'd like help with is how to process df in the following manner:
for each distinct value of var1 in df, tally up the group of nodes that are common to that value of var1 to give an edges dataframe that ultimately looks like the one below. Note that I'm not bothered about the direction of flow along the edges. Just that they are connected is all I need.
> edges
from to value
1 A B 2
2 A D 2
3 A E 1
4 B C 1
5 B D 2
6 B E 1
7 D E 1
With thanks in anticipation,
Nevil
Update: I found a similar problem here, and have adapted that code to give the following, which is getting close to what I want but is not quite there:
> df %>% group_by(var1) %>%
filter(n()>=2) %>% group_by(var1) %>%
do(data.frame(t(combn(.$cat1, 2,function(x) sort(x))),
stringsAsFactors=FALSE))
# A tibble: 10 x 3
# Groups: var1 [3]
var1 X1 X2
<dbl> <chr> <chr>
1 1. A B
2 1. A D
3 1. B D
4 2. B C
5 3. D E
6 3. B D
7 3. A D
8 3. B E
9 3. A E
10 3. A B
I don't know if there is already a suitable function to achieve this task. Here is a detailed procedure to do it. With this, you should be able to define your own function. Hope it helps!
# create an adjacency matrix
mat <- table(df)
mat <- t(mat) %*% mat
as.table(mat) # look at your adjacency matrix
# since the network is not directed, we can consider only the (strictly) upper triangular matrix
mat[lower.tri(mat, diag = TRUE)] <- 0
as.table(mat) # look at the new adjacency matrix
library(dplyr)
edges <- as.data.frame(as.table(mat))
edges <- filter(edges, Freq != 0)
colnames(edges) <- c("from", "to", "value")
edges <- arrange(edges, from)
edges # output
# from to value
#1 A B 2
#2 A D 2
#3 A E 1
#4 B C 1
#5 B D 2
#6 B E 1
#7 D E 1
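The procedure above can be wrapped into a function along the following lines. This is a sketch in base R only; make_edges is a hypothetical name, and it assumes the first column holds the grouping variable, the second holds the node labels, and each (group, node) pair occurs at most once, so that the incidence matrix contains only 0s and 1s.

```r
make_edges <- function(df) {
  inc <- table(df[[1]], df[[2]])        # incidence matrix: groups x nodes
  adj <- t(inc) %*% inc                 # co-occurrence counts: nodes x nodes
  adj[lower.tri(adj, diag = TRUE)] <- 0 # undirected: keep strict upper triangle
  edges <- as.data.frame(as.table(adj), stringsAsFactors = FALSE)
  names(edges) <- c("from", "to", "value")
  edges <- edges[edges$value != 0, ]    # drop pairs that never co-occur
  edges <- edges[order(edges$from, edges$to), ]
  rownames(edges) <- NULL
  edges
}

df <- data.frame(var1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 cat1 = c("A", "B", "D", "B", "C", "D", "E", "B", "A"))
edges <- make_edges(df)
```

Entry (i, j) of t(inc) %*% inc is the number of groups that contain both node i and node j, which is exactly the edge weight asked for in the question.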
Here are a couple of other ways.
In base R:
values <- unique(df$var1[duplicated(df$var1)])
do.call(rbind,
lapply(values, function(i) {
nodes <- as.character(df$cat1[df$var1 == i])
edges <- combn(nodes, 2)
data.frame(from = edges[1, ],
to = edges[2, ],
value = i,
stringsAsFactors = FALSE)
})
)
in tidyverse...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
filter(n() >= 2) %>%
mutate(cat1 = as.character(cat1)) %>%
summarise(edges = list(data.frame(t(combn(cat1, 2)), stringsAsFactors = F))) %>%
unnest(edges) %>%
select(from = X1, to = X2, value = var1)
in tidyverse using tidyr::complete...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
mutate(cat1 = as.character(cat1)) %>%
mutate(i.cat1 = cat1) %>%
complete(cat1, i.cat1) %>%
filter(cat1 < i.cat1) %>%
select(from = cat1, to = i.cat1, value = var1)
in tidyverse using tidyr::expand...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
mutate(cat1 = as.character(cat1)) %>%
expand(cat1, to = cat1) %>%
filter(cat1 < to) %>%
select(from = cat1, to, value = var1)
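The variants above return one row per pair within each var1 group, with value holding the group id. To get the tallied edge list from the question, a counting step can be added at the end. The following is a sketch based on the tidyr::expand variant; it assumes cat1 is a character column, so that from < to is an alphabetical comparison.

```r
library(dplyr)
library(tidyr)

df <- data.frame(var1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 cat1 = c("A", "B", "D", "B", "C", "D", "E", "B", "A"),
                 stringsAsFactors = FALSE)

edges <- df %>%
  group_by(var1) %>%
  expand(from = cat1, to = cat1) %>%   # all ordered pairs within each group
  ungroup() %>%
  filter(from < to) %>%                # keep each unordered pair once
  count(from, to, name = "value")      # number of groups sharing the pair
```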

selecting values of one dataframe based on partial string in another dataframe

I have two dataframes (DF1 and DF2)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
DF1
parties
A, B
C
A
C, D
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- dplyr::bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF2
party party.number
A 1
B 2
C 3
D 4
E 5
F 6
G 7
H 8
I 9
J 10
The desired result should be an additional column in DF1 which contains the party numbers taken from DF2 for each row in DF1.
Desired result (based on DF1):
parties party.numbers
A, B 1, 2
C 3
A 1
C, D 3, 4
I strongly suspect that the answer involves something like str_match(DF1$parties, DF2$party.number) or a similar regular expression, but I can't figure out how to put two (or more) party numbers into the same row (DF2$party.numbers).
One option is gsubfn: match each upper-case letter with the pattern "[A-Z]" and use a key/value list (party as the name, party number as the value) as the replacement
library(gsubfn)
DF1$party.numbers <- gsubfn("[A-Z]", setNames(as.list(DF2$party.number),
DF2$party), as.character(DF1$parties))
DF1
# parties party.numbers
#1 A, B 1, 2
#2 C 3
#3 A 1
#4 C, D 3, 4
An alternative solution using tidyverse. You can reshape DF1 to have one string per row, then join DF2 and then reshape back to your initial form:
library(tidyverse)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF1 %>%
group_by(id = row_number()) %>%
separate_rows(parties) %>%
left_join(DF2, by=c("parties"="party")) %>%
summarise(parties = paste(parties, collapse = ", "),
party.numbers = paste(party.number, collapse = ", ")) %>%
select(-id)
# # A tibble: 4 x 2
# parties party.numbers
# <chr> <chr>
# 1 A, B 1, 2
# 2 C 3
# 3 A 1
# 4 C, D 3, 4
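For comparison, the same lookup can be sketched in base R with a named vector and strsplit, without extra packages. This assumes the parties are always separated by a comma with optional whitespace; lookup is an illustrative name.

```r
DF1 <- data.frame(parties = c("A, B", "C", "A", "C, D"))
DF2 <- data.frame(party = LETTERS[1:10], party.number = 1:10)

# Named vector: names are the parties, values are the party numbers
lookup <- setNames(DF2$party.number, as.character(DF2$party))

# Split each row into its parties, look each one up, and paste the numbers back
DF1$party.numbers <- vapply(
  strsplit(as.character(DF1$parties), ",\\s*"),
  function(p) paste(lookup[p], collapse = ", "),
  character(1)
)
```

A party that is missing from DF2 would show up as "NA" in the pasted string rather than raising an error.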

Pivoting data in R with duplicate rows

Trying to do a simple pivot in R, much like you would in SQL.
I understand this question has been asked before (Pivoting data in R), however I am having trouble with duplicate rows.
Currently the data is in this format (characters are just placeholders for ease of viewing. The actual data is numerical):
V1 V2 V3 V4
A B C Sales
D E F Sales
G H I Technical
J K L Technical
And it needs to be transformed into this format:
Variable Sales Technical
V1 A G
V1 D J
V2 B H
V2 E K
V3 C I
V3 F L
I've tried both reshape and tidyr packages and they either aggregate the data in the case of reshape or throw errors for duplicate row identifiers in the case of tidyr.
I don't care about duplicate row identifiers; in fact it's necessary to identify them as factors for analysis.
Am I going about this the wrong way? Are these the correct packages to be using or can anyone suggest another method?
I hope this will work:
df %>% gather(Variable, Value, V1:V3) %>%
group_by(V4, Variable) %>%
mutate(g = row_number()) %>%
spread(V4, Value) %>% ungroup() %>%
select(-g)
# # A tibble: 6 x 3
# Variable Sales Technical
# * <chr> <chr> <chr>
# 1 V1 A G
# 2 V1 D J
# 3 V2 B H
# 4 V2 E K
# 5 V3 C I
# 6 V3 F L
Another option is melt/dcast from data.table
library(data.table)
dcast(melt(setDT(df1), id.var = 'V4'), variable + rowid(V4) ~
V4, value.var = 'value')[, V4 := NULL][]
# variable Sales Technical
#1: V1 A G
#2: V1 D J
#3: V2 B H
#4: V2 E K
#5: V3 C I
#6: V3 F L
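The gather/spread answer can also be written with pivot_longer and pivot_wider, which supersede those functions in tidyr 1.0.0 and later. This is a sketch on the placeholder data from the question; the helper column g plays the same role as above, numbering the rows within each V4/Variable combination so that the identifiers are unique before widening.

```r
library(dplyr)
library(tidyr)

df <- data.frame(V1 = c("A", "D", "G", "J"),
                 V2 = c("B", "E", "H", "K"),
                 V3 = c("C", "F", "I", "L"),
                 V4 = c("Sales", "Sales", "Technical", "Technical"),
                 stringsAsFactors = FALSE)

res <- df %>%
  pivot_longer(V1:V3, names_to = "Variable", values_to = "Value") %>%
  group_by(V4, Variable) %>%
  mutate(g = row_number()) %>%   # running index within each V4/Variable pair
  ungroup() %>%
  pivot_wider(names_from = V4, values_from = Value) %>%
  arrange(Variable, g) %>%
  select(-g)
```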
