This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 2 years ago.
I have a dataframe and I want to replicate the input of a single cell n times dependent on the input of the next cell and display it in a new cell.
My dataframe looks like this:
data <- data.frame(c(1,1,2,3,4,4,4), c("A","B","A","C","D","E","A"), c(2,1,1,3,2,1,3))
colnames(data) <- c("document number", "term", "count")
data
This is my desired result:
datanew <- data.frame(c(1,2,3,4), c("A A B", "A", "C C C", "D D E A A A"))
colnames(datanew) <- c("document number", "term")
# document number term
# 1 1 A A B
# 2 2 A
# 3 3 C C C
# 4 4 D D E A A A
So basically, I like to multiplicate the input of the term cell with the input of the corresponding count cell. Does anyone has an idea how to code it in R?
We can use rep to replicate term count times and paste the data together.
library(dplyr)
data %>%
group_by(`document number`) %>%
summarise(new = paste(rep(term, count), collapse = " "))
# A tibble: 4 x 2
# `document number` new
# <dbl> <chr>
#1 1 A A B
#2 2 A
#3 3 C C C
#4 4 D D E A A A
Similarly with data.table
library(data.table)
setDT(data)[, (new = paste(rep(term, count), collapse = " ")),
by = `document number`]
We can do this with tidyverse methods
library(dplyr)
library(tidyr)
library(stringr)
data %>%
uncount(count) %>%
group_by(`document number`) %>%
summarise(term = str_c(term, collapse=' '))
# A tibble: 4 x 2
# `document number` term
# <dbl> <chr>
#1 1 A A B
#2 2 A
#3 3 C C C
#4 4 D D E A A A
Or with data.table
library(data.table)
setDT(data)[rep(seq_len(.N), count)][, .(term =
paste(term, collapse=' ')), `document number`]
Or using base R with aggregate
aggregate(term ~ `document number`, data[rep(seq_len(nrow(data)),
data$count),], FUN = paste, collapse= ' ')
Related
I have a R DataFrame that has a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )
So visually the DataFrame would look like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
uncount(value) %>%
mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
Another slight variation is to use expandrows from splitstackshape in conjunction with tidyverse.
library(splitstackshape)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
expandRows("value") %>%
mutate(class = str_extract(class, "\\d+"))
base R
Row order (and row names) notwithstanding:
tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), value.name = "class"), class > 0, select = -variable)
tmp[rep(seq_along(tmp$class), times = tmp$class),]
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
library(dplyr)
# library(tidyr) # pivot_longer
df %>%
pivot_longer(-c(f1, f2), values_to = "class") %>%
dplyr::filter(class > 0) %>%
select(-name) %>%
slice(rep(row_number(), times = class))
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <dbl>
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2
I've been using tidyverse quite a lot and now I'm interested in the possibilities of base R.
Let's take a look at this simple data.frame
df <- data.frame(id = 1:4, nested = c("a, b, f", "c, d", "e", "e, f"))
Using dplyr, stringr and tidyr we could do
df %>%
mutate(nested = str_split(nested, ", ")) %>%
unnest(nested)
to get (let's ignore the tibble part)
# A tibble: 8 x 2
id nested
<int> <chr>
1 1 a
2 1 b
3 1 f
4 2 c
5 2 d
6 3 e
7 4 e
8 4 f
Now we want to rebuild this one using base R tools. So
transform(df, nested = strsplit(nested, ", "))
gives use the mutate-part, but how can we unnest() this data.frame? I though of using unlist() but couldn't find a satisfying way.
We could use stack on a named list in a single line
with(df, setNames(stack(setNames(strsplit(nested, ","), id))[2:1], names(df)))
-output
id nested
1 1 a
2 1 b
3 1 f
4 2 c
5 2 d
6 3 e
7 4 e
8 4 f
If we use transform, then use rep to replicate based on the lengths of the list column
out <- transform(df, nested = strsplit(nested, ", "))
data.frame(id = rep(out$id, lengths(out$nested)), nested = unlist(out$nested))
I want to aggregate a data.frame with two columns: in one column I have "num", which is an identifier number and in the other I have text. It is important that the aggregated text has a space between the individual parts. My code is this:
data_aggr <- aggregate(
x = data_aggr,
FUN = paste,
by = list(data_aggr$num)
)
I have tried the obvious with FUN = paste(collapse = " ") and
FUN = paste,
collapse = " ",
but that doesn't work. How do I need to do this?
Aggregate can be used to paste together the rows with the same value of num as follows:
data_aggr <- data.frame(num=c(1,1,1,2,2), letters=letters[1:5])
aggregate(data_aggr$letters, list(data_aggr$num), FUN=paste, collapse= " ")
# Group.1 x
# 1 1 a b c
# 2 2 d e
A dplyr solution, the idea is to create a new column with row number to be able to conduct the operation on each row.
> library(dplyr)
> df.ask <- data.frame('Num' = 1:10,
+ 'Text' = letters[1:10])
>
> df.ask %>%
+ mutate(row_num = row_number()) %>%
+ group_by(row_num) %>%
+ mutate(together = paste(Num, Text, collapse = ' ')) %>%
+ ungroup() %>%
+ select(-row_num)
# A tibble: 10 x 3
Num Text together
<int> <fct> <chr>
1 1 a 1 a
2 2 b 2 b
3 3 c 3 c
4 4 d 4 d
5 5 e 5 e
6 6 f 6 f
7 7 g 7 g
8 8 h 8 h
9 9 i 9 i
10 10 j 10 j
I have two dataframes (DF1 and DF2)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
DF1
parties
A, B
C
A
C, D
.
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
.
DF2
party party.number
A 1
B 2
C 3
D 4
E 5
F 6
G 7
H 8
I 9
J 10
The desired result should be an additional column in DF1 which contains the party numbers taken from DF2 for each row in DF1.
Desired result (based on DF1):
parties party.numbers
A, B 1, 2
C 3
A 1
C, D 3, 4
I strongly suspect that the answer involves something like str_match(DF1$parties, DF2$party.number) or a similar regular expression, but I can't figure out how to put two (or more) party numbers into the same row (DF2$party.numbers).
One option is gsubfn by matching the pattern as upper-case letter, as replacement use a key/value list
library(gsubfn)
DF1$party.numbers <- gsubfn("[A-Z]", setNames(as.list(DF2$party.number),
DF2$party), as.character(DF1$parties))
DF1
# parties party.numbers
#1 A, B 1, 2
#2 C 3
#3 A 1
#4 C, D 3, 4
An alternative solution using tidyverse. You can reshape DF1 to have one string per row, then join DF2 and then reshape back to your initial form:
library(tidyverse)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF1 %>%
group_by(id = row_number()) %>%
separate_rows(parties) %>%
left_join(DF2, by=c("parties"="party")) %>%
summarise(parties = paste(parties, collapse = ", "),
party.numbers = paste(party.number, collapse = ", ")) %>%
select(-id)
# # A tibble: 4 x 2
# parties party.numbers
# <chr> <chr>
# 1 A, B 1, 2
# 2 C 3
# 3 A 1
# 4 C, D 3, 4
I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1!=group2)
df$score <- runif(n = 12,min = 0,max = 1)
df
# A tibble: 12 × 3
group1 group2 score
<fctr> <fctr> <dbl>
1 a b 0.113703411
2 a c 0.622299405
3 a d 0.609274733
4 b a 0.623379442
5 b c 0.860915384
6 b d 0.640310605
7 c a 0.009495756
8 c b 0.232550506
9 c d 0.666083758
10 d a 0.514251141
11 d b 0.693591292
12 d c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)
# Create function for computing group IDs from data frame of groups (per column)
get_group_id <- function(groups) {
apply(groups, 1, function(row) {
paste0(sort(row), collapse = "_")
})
}
group_id <- get_group_id(select(df, -score))
# Perform the computation
df %>%
mutate(groupId = group_id) %>%
group_by(groupId) %>%
slice(which.max(score)) %>%
ungroup() %>%
select(-groupId)