Is there a base R version of tidyr's unnest() function?

I've been using tidyverse quite a lot and now I'm interested in the possibilities of base R.
Let's take a look at this simple data.frame
df <- data.frame(id = 1:4, nested = c("a, b, f", "c, d", "e", "e, f"))
Using dplyr, stringr and tidyr we could do
df %>%
mutate(nested = str_split(nested, ", ")) %>%
unnest(nested)
to get (let's ignore the tibble part)
# A tibble: 8 x 2
id nested
<int> <chr>
1 1 a
2 1 b
3 1 f
4 2 c
5 2 d
6 3 e
7 4 e
8 4 f
Now we want to rebuild this one using base R tools. So
transform(df, nested = strsplit(nested, ", "))
gives us the mutate part, but how can we unnest() this data.frame? I thought of using unlist() but couldn't find a satisfying way.

We could use stack on a named list in a single line
with(df, setNames(stack(setNames(strsplit(nested, ", "), id))[2:1], names(df)))
-output
id nested
1 1 a
2 1 b
3 1 f
4 2 c
5 2 d
6 3 e
7 4 e
8 4 f
If we use transform, then use rep to replicate based on the lengths of the list column
out <- transform(df, nested = strsplit(nested, ", "))
data.frame(id = rep(out$id, lengths(out$nested)), nested = unlist(out$nested))
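The rep()/unlist() idea above can be wrapped in a small helper. This is only a minimal sketch, not a drop-in replacement for tidyr::unnest(): the name unnest_base and its arguments are made up for illustration, it assumes a single list column, and rows whose list element is empty are dropped.

```r
# Hypothetical helper: expand one list column of a data.frame into rows.
unnest_base <- function(data, col) {
  n <- lengths(data[[col]])                      # elements per row
  keep <- setdiff(names(data), col)              # all other columns
  out <- data[rep(seq_len(nrow(data)), n), keep, drop = FALSE]
  out[[col]] <- unlist(data[[col]], use.names = FALSE)
  rownames(out) <- NULL
  out
}

df <- data.frame(id = 1:4, nested = c("a, b, f", "c, d", "e", "e, f"),
                 stringsAsFactors = FALSE)
df$nested <- strsplit(df$nested, ", ")           # turn the strings into a list column
res <- unnest_base(df, "nested")
res
```

Because the row index is repeated rather than a single column, any additional columns in the data.frame are carried along as well.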

Related

R: Repeating row of dataframe with respect to multiple count columns

I have a R DataFrame that has a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )
So visually the DataFrame would look like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
uncount(value) %>%
mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
Another slight variation is to use expandRows from splitstackshape in conjunction with tidyverse.
library(splitstackshape)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
expandRows("value") %>%
mutate(class = str_extract(class, "\\d+"))
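base R (with reshape2::melt for the reshape)
Row order (and row names) notwithstanding. The class number is taken from the column name so that it stays separate from the count:
tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), variable.name = "class", value.name = "reps"), reps > 0)
tmp$class <- as.integer(sub("var", "", tmp$class))
tmp[rep(seq_len(nrow(tmp)), times = tmp$reps), c("f1", "f2", "class")]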
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
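library(dplyr)
library(tidyr) # pivot_longer
df %>%
pivot_longer(-c(f1, f2), names_to = "class", values_to = "reps") %>%
dplyr::filter(reps > 0) %>%
mutate(class = as.numeric(sub("var", "", class))) %>% # class number from the column name, not the count
slice(rep(row_number(), times = reps)) %>%
select(-reps)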
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <dbl>
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2
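For completeness, the same expansion can be done without any packages. The following is a sketch under the question's assumptions (the first C columns are the counts, in class order); the variable names counts, row_id and class_id are chosen here for illustration.

```r
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0),
                 f1 = c("a", "b"), f2 = c("c", "d"),
                 stringsAsFactors = FALSE)
C <- 3                                              # number of count columns

# Flatten the counts row by row: row 1 classes 1..C, then row 2 classes 1..C, ...
counts   <- as.vector(t(as.matrix(df[seq_len(C)])))
row_id   <- rep(seq_len(nrow(df)), each = C)        # source row of each count
class_id <- rep(seq_len(C), times = nrow(df))       # class of each count

df_updated <- data.frame(df[rep(row_id, counts), c("f1", "f2")],
                         class = rep(class_id, counts),
                         row.names = NULL)
df_updated
```

Everything here is vectorised over a single rep() call per column, so it should scale comfortably to the 1000 rows and 20 classes mentioned in the question.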

How to replicate a String in a dataframe individually N times [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 2 years ago.
I have a dataframe and I want to replicate the input of a single cell n times dependent on the input of the next cell and display it in a new cell.
My dataframe looks like this:
data <- data.frame(c(1,1,2,3,4,4,4), c("A","B","A","C","D","E","A"), c(2,1,1,3,2,1,3))
colnames(data) <- c("document number", "term", "count")
data
This is my desired result:
datanew <- data.frame(c(1,2,3,4), c("A A B", "A", "C C C", "D D E A A A"))
colnames(datanew) <- c("document number", "term")
# document number term
# 1 1 A A B
# 2 2 A
# 3 3 C C C
# 4 4 D D E A A A
So basically, I would like to repeat the content of the term cell as many times as the corresponding count cell says. Does anyone have an idea how to code this in R?
We can use rep to replicate term count times and paste the data together.
library(dplyr)
data %>%
group_by(`document number`) %>%
summarise(new = paste(rep(term, count), collapse = " "))
# A tibble: 4 x 2
# `document number` new
# <dbl> <chr>
#1 1 A A B
#2 2 A
#3 3 C C C
#4 4 D D E A A A
Similarly with data.table
library(data.table)
setDT(data)[, .(new = paste(rep(term, count), collapse = " ")),
by = `document number`]
We can do this with tidyverse methods
library(dplyr)
library(tidyr)
library(stringr)
data %>%
uncount(count) %>%
group_by(`document number`) %>%
summarise(term = str_c(term, collapse=' '))
# A tibble: 4 x 2
# `document number` term
# <dbl> <chr>
#1 1 A A B
#2 2 A
#3 3 C C C
#4 4 D D E A A A
Or with data.table
library(data.table)
setDT(data)[rep(seq_len(.N), count)][, .(term =
paste(term, collapse=' ')), `document number`]
Or using base R with aggregate
aggregate(term ~ `document number`, data[rep(seq_len(nrow(data)),
data$count),], FUN = paste, collapse= ' ')
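A variant that avoids the formula interface is to repeat the terms and the document numbers in parallel and collapse with tapply(). This is a minimal sketch that rebuilds the question's data so it runs on its own; terms, docs and collapsed are illustrative names.

```r
data <- data.frame(c(1, 1, 2, 3, 4, 4, 4),
                   c("A", "B", "A", "C", "D", "E", "A"),
                   c(2, 1, 1, 3, 2, 1, 3),
                   stringsAsFactors = FALSE)
colnames(data) <- c("document number", "term", "count")

terms <- rep(data$term, times = data$count)                  # one entry per occurrence
docs  <- rep(data[["document number"]], times = data$count)  # matching document number
collapsed <- tapply(terms, docs, paste, collapse = " ")      # named by document number

datanew <- data.frame(as.numeric(names(collapsed)), as.vector(collapsed),
                      stringsAsFactors = FALSE)
colnames(datanew) <- c("document number", "term")
datanew
```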

revert multiple (multiple columns to one column)

I have a dataset with answers from a survey of 17 questions (10 questions use a 5-point scale and 7 use a 7-point scale). The data format gives me 5 or 7 columns for each question (True or False), in a one-hot encoding style. I want to convert these columns back to a single column per question.
To be more specific, the data I have looks like the following
Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 .... Q17.1 Q17.2 ... Q17.5
row1 T F F F F F F F T F
... ...
row2000 F T F F F F F T F F
the desired format I want to have is
Q1 Q2 .... Q17
row1 1 4 2 # with number indicating the value that the column is True
....
row2000 2 3 1 #(e.g., if Q2.4 is T, then for Q2, it is 4).
Base R approach using split.default and max.col. Using split.default we can split the columns based on the pattern in their name, so that every question is divided into a list. Assuming every question would have only one TRUE value we can use max.col to find the TRUE index.
sapply(split.default(df, sub("\\..*", "", names(df))), max.col)
# Q1 Q2
#[1,] 1 2
#[2,] 6 5
data
df <-read.table(text = "Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q2.1 Q2.2 Q2.3 Q2.4 Q2.5
T F F F F F F F T F F F
F F F F F T F F F F F T", header = T)
This assumes the columns of your data are of class "logical". If "T"/"F" are stored as character (as in @Maurits's answer), convert them to logical first.
Using the data from @Maurits Evers
df[] <- lapply(df, as.logical)
sapply(split.default(df, sub("\\..*", "", names(df))), max.col)
# Q1 Q17
#[1,] 1 2
#[2,] 2 1
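The max.col approach relies on every question having exactly one TRUE per row; for a row with no TRUE it still returns some column index. If unanswered questions can occur, a slower but explicit variant is sketched below. The data and the helper name which_true are made up for illustration.

```r
# Made-up data: row 2 has no TRUE for Q2.
df <- read.table(text = "Q1.1 Q1.2 Q1.3 Q2.1 Q2.2
T F F F T
F F T F F", header = TRUE)

question <- sub("\\..*", "", names(df))    # "Q1" "Q1" "Q1" "Q2" "Q2"

# Index of the single TRUE in each row of a block, or NA if there is not exactly one.
which_true <- function(block) {
  apply(as.matrix(block), 1, function(r) {
    hit <- which(r)
    if (length(hit) == 1) unname(hit) else NA_integer_
  })
}

res <- sapply(split.default(df, question), which_true)
res
```

The result is a matrix with one column per question, as with max.col, but missing or ambiguous answers show up as NA.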
Here is a tidyverse option:
library(tidyverse)
df %>%
rownames_to_column("row") %>%
gather(k, v, -row) %>%
separate(k, c("question", "part"), sep = "\\.") %>%
filter(v == "T") %>%
group_by(row) %>%
select(-v) %>%
spread(question, part)
## A tibble: 2 x 3
## Groups: row [2]
# row Q1 Q17
# <chr> <chr> <chr>
#1 row1 1 2
#2 row2000 2 1
I assume that your original data contains "T"/"F" as character entries. If they are in fact TRUE/FALSE, you should change filter(v == "T") to filter(v == TRUE).
Sample data
df <- read.table(text =
"Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q17.1 Q17.2 Q17.5
row1 T F F F F F F F T F
row2000 F T F F F F F T F F", colClasses = "character")

selecting values of one dataframe based on partial string in another dataframe

I have two dataframes (DF1 and DF2)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
DF1
parties
A, B
C
A
C, D
.
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
.
DF2
party party.number
A 1
B 2
C 3
D 4
E 5
F 6
G 7
H 8
I 9
J 10
The desired result should be an additional column in DF1 which contains the party numbers taken from DF2 for each row in DF1.
Desired result (based on DF1):
parties party.numbers
A, B 1, 2
C 3
A 1
C, D 3, 4
I strongly suspect that the answer involves something like str_match(DF1$parties, DF2$party.number) or a similar regular expression, but I can't figure out how to put two (or more) party numbers into the same row (DF1$party.numbers).
One option is gsubfn: match each upper-case letter with the pattern "[A-Z]" and pass a named key/value list as the replacement, so that each letter is replaced by its party number.
library(gsubfn)
DF1$party.numbers <- gsubfn("[A-Z]", setNames(as.list(DF2$party.number),
DF2$party), as.character(DF1$parties))
DF1
# parties party.numbers
#1 A, B 1, 2
#2 C 3
#3 A 1
#4 C, D 3, 4
An alternative solution using tidyverse. You can reshape DF1 to have one string per row, then join DF2 and then reshape back to your initial form:
library(tidyverse)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF1 %>%
group_by(id = row_number()) %>%
separate_rows(parties) %>%
left_join(DF2, by=c("parties"="party")) %>%
summarise(parties = paste(parties, collapse = ", "),
party.numbers = paste(party.number, collapse = ", ")) %>%
select(-id)
# # A tibble: 4 x 2
# parties party.numbers
# <chr> <chr>
# 1 A, B 1, 2
# 2 C 3
# 3 A 1
# 4 C, D 3, 4
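The split / look up / paste steps can also be written in base R with a named lookup vector. This is a sketch that assumes the parties are always separated by ", " and that every party appears in DF2; the name lookup is chosen here for illustration.

```r
DF1 <- data.frame(parties = c("A, B", "C", "A", "C, D"), stringsAsFactors = FALSE)
DF2 <- data.frame(party = LETTERS[1:10], party.number = 1:10,
                  stringsAsFactors = FALSE)

lookup <- setNames(DF2$party.number, DF2$party)   # named vector: party -> number

DF1$party.numbers <- vapply(
  strsplit(DF1$parties, ", ", fixed = TRUE),      # one character vector per row
  function(p) paste(lookup[p], collapse = ", "),  # look up, then glue back together
  character(1)
)
DF1
```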

Targeting a specific column within a dplyr pipe

I very much enjoy working with the magrittr pipes in R %>% and try to use them as often / efficiently as possible. I quite often need to target specific columns in a pipe chain, for example to change the column type. This results in me having to break the chain / my workflow because I need to target only a specific column instead of my entire dataframe.
Consider the following example:
library(tidyverse)
rm(list = ls())
a <- c(1:20)
b <- rep(c("a", "b"), 10)
df <- data_frame(a, b) %>%
rename(info = b) %>%
recode(x = df$info, "a" = "x") #I'd like to target only the df$info column here
This obviously doesn't work, because dplyr doesn't expect me to change the x = argument for a function in a pipe chain.
library(tidyverse)
rm(list = ls())
a <- c(1:20)
b <- rep(c("a", "b"), 10)
df <- data_frame(a, b) %>%
rename(info = b)
df$info <- df$info %>% #this works as expected, but is not as elegant
recode("a" = "x")
This is how I think it should be done, but I feel that it is not as efficient / elegant as I would like it to be, especially if I plan on chaining more functions together after recoding.
Is there a convenient way around this, so I can tell a command in my pipe chain to target only a specific column?
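recode() works on a vector, so call it inside mutate(), which evaluates it on the info column and writes the result back into the data frame:
tibble(a, b) %>%
rename(info = b) %>%
mutate(info = recode(info, a = "x"))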
# A tibble: 20 x 2
# a info
# <int> <chr>
# 1 1 x
# 2 2 b
# 3 3 x
# 4 4 b
# 5 5 x
# 6 6 b
# 7 7 x
# 8 8 b
# 9 9 x
#10 10 b
#11 11 x
#12 12 b
#13 13 x
#14 14 b
#15 15 x
#16 16 b
#17 17 x
#18 18 b
#19 19 x
#20 20 b
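The same pattern, a verb that takes the whole data frame and rewrites one column by name, also exists outside dplyr. As a rough sketch, assuming R 4.1 or later for the native |> pipe, base R's transform() plays the role of mutate() and the chain can continue afterwards:

```r
a <- 1:20
b <- rep(c("a", "b"), 10)

df <- data.frame(a, b, stringsAsFactors = FALSE) |>
  setNames(c("a", "info")) |>                          # rename b to info
  transform(info = ifelse(info == "a", "x", info)) |>  # touch only the info column
  subset(a > 15)                                       # keep chaining afterwards
df
```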
