Separate entries in a data frame into new rows in R [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
I have the data frame df below.
df <- data.frame(id = c(1:12),
A = c("alpha", "alpha", "beta", "beta", "gamma", "gamma", "gamma", "delta",
"epsilon", "epsilon", "zeta", "eta"),
B = c("a", "a; b", "a", "c; d; e", "e", "e", "c; f", "g", "a", "g; h", "f", "d"),
C = c(NA, 4, 2, 7, 4, NA, 9, 1, 1, NA, 3, NA),
D = c("ii", "ii", "i", "iii", "iv", "v", "viii", "v", "viii", "i", "iii", "i"))
Column 'B' contains four entries with semicolons. How can I duplicate each of these rows so that each of the separated values gets its own row in column 'B'?
The expected result df2 is:
df2 <- data.frame(id = c(1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 9, 10, 10, 11, 12),
A = c(rep("alpha", 3), rep("beta", 4), rep("gamma", 4), "delta", rep("epsilon", 3),
"zeta", "eta"),
B = c("a", "a", "b", "a", "c", "d", "e", "e", "e", "c", "f", "g", "a", "g", "h", "f", "d"),
C = c(NA, 4, 4, 2, 7, 7, 7, 4, NA, 9, 9, 1, 1, NA, NA, 3, NA),
D = c("ii", "ii", "ii", "i", "iii", "iii", "iii", "iv", "v", "viii", "viii", "v", "viii", "i", "i", "iii", "i"))
I tried this, but no luck:
df2 <- df
# split the values in column B
df2$B <- unlist(strsplit(as.character(df2$B), "; "))
# repeat the rows for each value in column B
df2 <- df2[rep(seq_len(nrow(df2)), sapply(strsplit(as.character(df1$B), "; "), length)),]
# match the number of rows in column B with the number of rows in df2
df2$id <- rep(df2$id, sapply(strsplit(as.character(df1$B), "; "), length))
# sort the dataframe by id
df2 <- df2[order(df2$id),]

We may use separate_rows here, specifying sep as ";" followed by zero or more spaces ("\\s*") to expand the rows:
library(tidyr)
df_new <- separate_rows(df, B, sep = ";\\s*")
Checking against the OP's expected output:
> all.equal(df_new, df2, check.attributes = FALSE)
[1] TRUE
In base R, we can replicate the row sequence by the lengths of the split list and then unlist the values into B:
lst1 <- strsplit(df$B, ";\\s+")
df_new2 <- transform(df[rep(seq_len(nrow(df)), lengths(lst1)),], B = unlist(lst1))
row.names(df_new2) <- NULL
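As an additional check (not shown in the original answer), the base R result can be compared against the expected df2 in the same way; it should likewise come out as TRUE:
all.equal(df_new2, df2, check.attributes = FALSE)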


Creating a vector of random strings with conditions in R

The problem is:
Create a vector of 50 random names, each made of 5 letters, out of which only the first is capital. The first, third and fifth letters are consonants and the second and fourth are vowels.
How can I do this?
Use letters and LETTERS, which are built-in constants in R, then define vowels as c(1, 5, 9, 15, 21), the alphabet positions of the vowels, which is the subset to select or remove. Use sample to get 50 of each with replacement, and paste them together:
set.seed(69)
vowels <- c(1, 5, 9, 15, 21)
paste0( sample(LETTERS[-vowels], 50, TRUE),
sample(letters[vowels], 50, TRUE),
sample(letters[-vowels], 50, TRUE),
sample(letters[vowels], 50, TRUE),
sample(letters[-vowels], 50, TRUE))
#> [1] "Valif" "Cirer" "Tuniw" "Kimil" "Qehoc" "Jemif" "Senoy" "Jazic" "Hihuy"
#> [10] "Cezor" "Fuzic" "Menas" "Covay" "Rupov" "Xanij" "Pujur" "Qimin" "Dunop"
#> [19] "Xokez" "Zacox" "Muhac" "Yitab" "Gojob" "Dedah" "Nepan" "Dinel" "Ceyaw"
#> [28] "Foxiv" "Fiven" "Zotob" "Bezug" "Pusod" "Jawad" "Suluq" "Zubic" "Minax"
#> [37] "Gowex" "Debec" "Xaqut" "Duvov" "Lalal" "Zavuv" "Xobuk" "Zugil" "Gibac"
#> [46] "Yocan" "Voyuh" "Nigeh" "Yuqew" "Humup"
Created on 2020-04-05 by the reprex package (v0.3.0)
Just use a for loop to sample the built-in vectors LETTERS and letters, with logical subsetting to restrict to vowels or consonants:
namesList <- list()
for(i in 1:50){
namesList[[i]] <- paste(c(sample(LETTERS[!(LETTERS %in% c("A", "E", "I", "O", "U"))], 1),
sample(letters[(LETTERS %in% c("A", "E", "I", "O", "U"))], 1),
sample(letters[!(LETTERS %in% c("A", "E", "I", "O", "U"))], 1),
sample(letters[(LETTERS %in% c("A", "E", "I", "O", "U"))], 1),
sample(letters[!(LETTERS %in% c("A", "E", "I", "O", "U"))], 1)), collapse = "")
}
namesVec <- unlist(namesList)
Or in a vectorised fashion (which is better):
paste(
sample(LETTERS[!(LETTERS %in% c("A", "E", "I", "O", "U"))], 50, replace = TRUE),
sample(letters[(LETTERS %in% c("A", "E", "I", "O", "U"))], 50, replace = TRUE),
sample(letters[!(LETTERS %in% c("A", "E", "I", "O", "U"))], 50, replace = TRUE),
sample(letters[(LETTERS %in% c("A", "E", "I", "O", "U"))], 50, replace = TRUE),
sample(letters[!(LETTERS %in% c("A", "E", "I", "O", "U"))], 50, replace = TRUE),
sep = "")
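A slightly tidier variant of the vectorised approach (my own sketch, not from the original answers) defines the vowel and consonant sets once and reuses them; setdiff() keeps the two sets complementary by construction:
vowels     <- c("a", "e", "i", "o", "u")
consonants <- setdiff(letters, vowels)                    # the remaining 21 letters
paste0(sample(toupper(consonants), 50, replace = TRUE),   # capital consonant
       sample(vowels,              50, replace = TRUE),
       sample(consonants,          50, replace = TRUE),
       sample(vowels,              50, replace = TRUE),
       sample(consonants,          50, replace = TRUE))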

Get the column name used for the annotation

First of all, my question is related to these other ones:
Lazy evaluation to annotations expanding function
R nested map through columns
So, I got this example data:
library(dplyr)    # %>% and tibble()
library(stringr)  # str_split() used inside f1()

t <- tibble(a = c("a", "b", "c", "d", "e", "f", "g", "h"),
b = c( 1, 1, 1, 1, 2, 2, 2, 2),
c = c( 1, 1, 2, 2, 3, 3, 4, 4),
d = c( NA, NA, NA, "D", "E", NA, NA, NA),
e = c("A", NA, "C", NA, NA, NA, "G", "H")
)
And these functions:
f1 <- function(data, group_col, expand_col){ #, return_group_col = TRUE, name_group_col = "group_col"){
data %>%
dplyr::group_by({{group_col}}) %>%
dplyr::mutate(
{{expand_col}} := dplyr::case_when(
!is.na({{expand_col}}) ~ {{expand_col}} ,
any( !is.na({{expand_col}}) ) & is.na({{expand_col}}) ~
paste(unique(unlist(str_split(na.omit({{expand_col}}), " ")) ),
collapse = " "),
TRUE ~ NA_character_
)
) %>%
dplyr::ungroup()
}
f2 <- function(data, group_col, expand_col, fun=f1){
v1 <- rlang::syms( colnames(data)[group_col] )
v2 <- rlang::syms( colnames(data)[expand_col] )
V <- tidyr::crossing( v1, v2 )
purrr::reduce2( V$v1, V$v2, fun, .init=data )
}
The function f1 uses two columns: the first, {{group_col}}, is a group identifier; the second, {{expand_col}}, may contain an annotation or NA. After a group_by on {{group_col}}, any NA in {{expand_col}} is filled with the data from the other rows of the same group. Example: f1(t, c, d).
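For reference (my own illustration, not part of the original question), that example call behaves as follows on the data above:
# f1(t, c, d): grouping by c, the single non-NA annotation in each group is
# copied onto the NA rows of that group, and all-NA groups stay NA, so
# column d should become NA, NA, "D", "D", "E", "E", NA, NA
f1(t, c, d)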
The function f2 just applies f1 over two sets of columns: the first set gives the grouping columns and the second the annotation columns.
Now I want to modify the function f1 to create (if needed) another column recording which {{group_col}} was used to fill {{expand_col}}.
That means, if you run t %>% f2(3:2, 4:5), you will get this:
structure(list(a = c("a", "b", "c", "d", "e", "f", "g", "h"),
b = c(1, 1, 1, 1, 2, 2, 2, 2), c = c(1, 1, 2, 2, 3, 3, 4,
4), d = c("D", "D", "D", "D", "E", "E", "E", "E"), e = c("A",
"A", "C", "C", "G H", "G H", "G", "H")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))
Which is the same as running:
t %>%
  f1(c, d) %>%
  f1(b, d) %>%
  f1(c, e) %>%
  f1(b, e)
You may notice that some rows were already annotated in the original data. For those rows the new column should contain 'self' or something equivalent.
Here is an example of the output I want:
structure(list(a = c("a", "b", "c", "d", "e", "f", "g", "h"),
b = c(1, 1, 1, 1, 2, 2, 2, 2),
c = c(1, 1, 2, 2, 3, 3, 4, 4),
d = c("D", "D", "D", "D", "E", "E", "E", "E"),
e = c("A", "A", "C", "C", "G H", "G H", "G", "H"),
d_fill = c("b", "b", "c", "self", "self", "c", "b", "b"),
e_fill = c("self", "c", "self", "c", "b", "b", "self", "self")
),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L))
Then I tried this unsuccessful modification:
f1 <- function(data, group_col, expand_col){ #, return_group_col = TRUE, name_group_col = "group_col"){
fill_column <- str_c(deparse(substitute(group_col)), "fill", sep = "_")
data %>%
dplyr::group_by({{group_col}}) %>%
dplyr::mutate(
{{fill_column}} := dplyr::if_else(
!is.na({{expand_col}}) & is.na({{fill_column}}) ~ "self",
is.na({{expand_col}}) & is.na({{fill_column}}) ~ deparse(substitute(group_col)),
TRUE ~ NA_character_
),
{{expand_col}} := dplyr::case_when(
!is.na({{expand_col}}) ~ {{expand_col}} ,
any( !is.na({{expand_col}}) ) & is.na({{expand_col}}) ~
paste(unique(unlist(str_split(na.omit({{expand_col}}), " ")) ),
collapse = " "),
TRUE ~ NA_character_
)
) %>%
dplyr::ungroup()
}
But when I run t %>% f1(c, d) to test it, I got this:
Error: `condition` must be a logical vector, not a `formula` object
Run `rlang::last_error()` to see where the error occurred.
25. stop(fallback)
24. signal_abort(cnd)
23. .abort(text)
22. glubort(fmt_args(args), ..., .envir = .envir)
21. bad_args("condition", "must be a logical vector, not {friendly_type_of(condition)}")
20. dplyr::if_else(!is.na(~d) & is.na(~"c_fill") ~ "self", is.na(~d) & is.na(~"c_fill") ~ deparse(substitute(group_col)), TRUE ~ NA_character_)
19. mutate_impl(.data, dots, caller_env())
18. mutate.tbl_df(., `:=`({ { fill_column } ...
17. dplyr::mutate(., `:=`({ { fill_column } ...
16. function_list[[i]](value)
15. freduce(value, `_function_list`)
14. `_fseq`(`_lhs`)
13. eval(quote(`_fseq`(`_lhs`)), env, env)
12. eval(quote(`_fseq`(`_lhs`)), env, env)
11. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
10. data %>% dplyr::group_by({ { group_col } ...
9. f1(., c, d)
8. function_list[[k]](value)
7. withVisible(function_list[[k]](value))
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. t %>% f1(c, d)
I didn't figure out what is wrong.
Thanks in advance.
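For what it is worth, the error itself appears to come from giving dplyr::if_else() formula-style branches: if_else() expects a single logical condition plus true/false values, while formula pairs belong to case_when(). A second issue is that fill_column holds a character string, so is.na({{fill_column}}) tests the string itself rather than a column. Below is a minimal sketch of my own (not a posted answer) using case_when() and the glue "{...}" := syntax; it names the new column after the annotation column to match the d_fill/e_fill names in the expected output, and it does not yet handle a *_fill column that already exists from an earlier pass of f2():
f1_sketch <- function(data, group_col, expand_col) {
  group_name  <- rlang::as_name(rlang::enquo(group_col))
  # assumption: name the fill column after the annotation column
  fill_column <- paste(rlang::as_name(rlang::enquo(expand_col)), "fill", sep = "_")

  data %>%
    dplyr::group_by({{ group_col }}) %>%
    dplyr::mutate(
      # compute the fill marker first, while expand_col still has its NAs
      "{fill_column}" := dplyr::case_when(
        !is.na({{ expand_col }}) ~ "self",
        any(!is.na({{ expand_col }})) & is.na({{ expand_col }}) ~ group_name,
        TRUE ~ NA_character_
      ),
      {{ expand_col }} := dplyr::case_when(
        !is.na({{ expand_col }}) ~ {{ expand_col }},
        any(!is.na({{ expand_col }})) & is.na({{ expand_col }}) ~
          paste(unique(unlist(stringr::str_split(na.omit({{ expand_col }}), " "))),
                collapse = " "),
        TRUE ~ NA_character_
      )
    ) %>%
    dplyr::ungroup()
}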

Find the overlap of two datasets

I have two different datasets as I've shown below: df_A and df_B.
library(tibble)   # tribble()

df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
Now I want to see the overlap of these two datasets on book_name. Namely, I want a list of the book_name values that appear in both datasets, and also a measure of how similar the two datasets are according to the book_name column.
Is there a way to do this accurately?
You can do an inner join between the two data frames, which automatically gives you the intersection of the two on book_name.
This should do the trick:
library(dplyr)
# Creating first data frame
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
# Creating second data frame
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
# Joining between the two dataframes to get the common values between the two
result <-
df_A %>%
inner_join(df_B, by = "book_name")
Here is a base R solution, where you can use intersect():
overlap <- subset(df_A,book_name %in% intersect(book_name,df_B$book_name))
such that
> overlap
# A tibble: 3 x 2
book_name sales_id
<chr> <dbl>
1 A 1
2 C 3
3 E 5
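If "how similar" should also be a single number, one option (my own addition, not from either answer) is the Jaccard index on the book_name sets: the size of the intersection divided by the size of the union.
common <- intersect(df_A$book_name, df_B$book_name)
length(common) / length(union(df_A$book_name, df_B$book_name))
# 3 shared titles out of 17 distinct titles, i.e. roughly 0.18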

Most elegant way to convert lists into igraph object for plotting

I am new to igraph and it seems to be a very powerful (and therefore also complex) package.
I tried to convert the following lists into an igraph object.
graph <- list(s = c("a", "b"),
a = c("s", "b", "c", "d"),
b = c("s", "a", "c", "d"),
c = c("a", "b", "d", "e", "f"),
d = c("a", "b", "c", "e", "f"),
e = c("c", "d", "f", "z"),
f = c("c", "d", "e", "z"),
z = c("e", "f"))
weights <- list(s = c(3, 5),
a = c(3, 1, 10, 11),
b = c(5, 3, 2, 3),
c = c(10, 2, 3, 7, 12),
d = c(15, 7, 2, 11, 2),
e = c(7, 11, 3, 2),
f = c(12, 2, 3, 2),
z = c(2, 2))
The interpretation is as follows: s is the starting node; it links to nodes a and b. The edges are weighted 3 for s to a, 5 for s to b, and so on.
I tried all kinds of functions from igraph but only got all kinds of errors. What is the most elegant and easy way to convert the above into an igraph object for plotting the graph?
Create an edgelist and then a graph from that. Assign the weights and plot it.
library(igraph)

set.seed(123)
# stack() flattens the named list into a two-column (values, ind) edge list
e <- as.matrix(stack(graph))
g <- graph_from_edgelist(e)
E(g)$weight <- stack(weights)[[1]]
plot(g, edge.label = E(g)$weight)
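An equivalent construction (my own variant, not from the original answer) builds a small edge data frame instead; graph_from_data_frame() keeps the intended node-to-neighbour direction and picks up the weight column as an edge attribute automatically:
edges <- data.frame(from   = rep(names(graph), lengths(graph)),
                    to     = unlist(graph),
                    weight = unlist(weights))
g2 <- graph_from_data_frame(edges)
plot(g2, edge.label = E(g2)$weight)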

Removing "unused" nodes in sankey network

I am trying to build a sankey network.
This is my data and code:
library(networkD3)
nodes <- data.frame(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "D", "E", "N", "O", "P", "Q", "R"))
names(nodes) <- "name"
nodes$name = as.character(nodes$name)
links <- data.frame(matrix(
c(0, 2, 318.167,
0, 3, 73.85,
0, 4, 51.1262,
0, 5, 6.83333,
0, 6, 5.68571,
0, 7, 27.4167,
0, 8, 4.16667,
0, 9, 27.7381,
1, 10, 627.015,
1, 3, 884.428,
1, 4, 364.211,
1, 13, 12.33333,
1, 14, 9,
1, 15, 37.2833,
1, 16, 9.6,
1, 17, 30.5485), nrow=16, ncol=3, byrow = TRUE))
colnames(links) <- c("source", "target", "value")
links$source = as.integer(links$source)
links$target = as.integer(links$target)
links$value = as.numeric(links$value)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
fontSize = 12, fontFamily = 'Arial', nodeWidth = 20)
The problem is that A and B have common links only to D and E. Although the links are displayed correctly, D and E are also shown again at the bottom right.
How can I avoid this?
Note: If I specify
nodes <- data.frame(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "N", "O", "P", "Q", "R"))
no network at all is created.
Node names must be unique; see the example below. I removed the repeated nodes "D" and "E", and then removed the links that reference nodes that no longer exist. We now have only 16 nodes, zero-based 0:15, while the last two rows of your links data frame reference indices 16 and 17.
Or, as @CJYetman (the networkD3 author) comments:
Another way to say it... every node that is in the nodes data frame will be plotted, even if it has the same name as another node, because the index is technically the unique id.
library(networkD3)
nodes <- data.frame(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "N", "O", "P", "Q", "R"),
ix = 0:15)
links <- data.frame(matrix(
c(0, 2, 318.167,
0, 3, 73.85,
0, 4, 51.1262,
0, 5, 6.83333,
0, 6, 5.68571,
0, 7, 27.4167,
0, 8, 4.16667,
0, 9, 27.7381,
1, 10, 627.015,
1, 3, 884.428,
1, 4, 364.211,
1, 13, 12.33333,
1, 14, 9,
1, 15, 37.2833), nrow=14, ncol=3, byrow = TRUE))
colnames(links) <- c("source", "target", "value")
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
fontSize = 12, fontFamily = 'Arial', nodeWidth = 20)
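A more general pattern (a sketch of my own, not taken from the answer) is to build the node table from the link endpoints themselves, so the zero-based indices can never drift away from the node names. Here the character labels are assumed to live in columns source_name/target_name of a hypothetical data frame raw_links:
node_names <- unique(c(raw_links$source_name, raw_links$target_name))
nodes <- data.frame(name = node_names)
links <- data.frame(
  source = match(raw_links$source_name, node_names) - 1,  # zero-based for D3
  target = match(raw_links$target_name, node_names) - 1,
  value  = raw_links$value
)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name")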
