Numbering characters in a string - R

I want to number the letters in a large dataset. Some letters occur multiple times and are numbered ("A1", "A2"), others also occur multiple times but are not numbered. There are also letters that occur only once... but maybe it's easier to look at the example data below.
The numbers in df$nr are the desired result. How can I get df$nr from df$word and df$letter?
library(tibble)
df <- tibble(word = c(rep("Amamam", 17), rep("Bobob", 14)),
letter=c("A1", "A1", "A1", "A1", "A2", "A2", "m", "m", "m", "a", "a", "m", "m", "a", "a", "m", "m",
"B1", "B1", "B2", "B2", "B3", "B3", "o", "b", "b", "b", "o", "o", "o", "b"),
nr=c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5) )

We can group by 'word', remove the digits from the 'letter' column with str_remove, and convert the result to a run-length id (rleid from data.table), which increases by one each time the value changes.
library(dplyr)
library(stringr)
library(data.table)
df1 <- df %>%
group_by(word) %>%
mutate(nr1 = rleid(str_remove(letter, "\\d+")))
all.equal(df1$nr, df1$nr1)
#[1] TRUE
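If data.table is not available, the same run-length id can be sketched in base R. This is only an illustration: the helper name run_id and the intermediate vector bare are made up here, and the sketch assumes the rows are already in the desired order within each word.

```r
# Example data from the question (only the two input columns)
df <- data.frame(
  word = c(rep("Amamam", 17), rep("Bobob", 14)),
  letter = c("A1", "A1", "A1", "A1", "A2", "A2", "m", "m", "m", "a", "a",
             "m", "m", "a", "a", "m", "m",
             "B1", "B1", "B2", "B2", "B3", "B3", "o", "b", "b", "b", "o",
             "o", "o", "b")
)

# Hypothetical helper: start at 1 and add 1 whenever the value changes
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

# Strip the digits, then compute the run id separately within each word
bare <- sub("\\d+", "", df$letter)
df$nr <- ave(seq_along(bare), df$word, FUN = function(i) run_id(bare[i]))
```

Passing row indices to ave() rather than the character vector itself keeps the result an integer vector.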

Related

Separate entries in dataframe in new rows in R [duplicate]

I have data.frame df below.
df <- data.frame(id = c(1:12),
A = c("alpha", "alpha", "beta", "beta", "gamma", "gamma", "gamma", "delta",
"epsilon", "epsilon", "zeta", "eta"),
B = c("a", "a; b", "a", "c; d; e", "e", "e", "c; f", "g", "a", "g; h", "f", "d"),
C = c(NA, 4, 2, 7, 4, NA, 9, 1, 1, NA, 3, NA),
D = c("ii", "ii", "i", "iii", "iv", "v", "viii", "v", "viii", "i", "iii", "i"))
Column 'B' contains four entries with semicolons. How can I duplicate each of these rows so that column 'B' holds one of the separate values in each copy?
The expected result df2 is:
df2 <- data.frame(id = c(1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 9, 10, 10, 11, 12),
A = c(rep("alpha", 3), rep("beta", 4), rep("gamma", 4), "delta", rep("epsilon", 3),
"zeta", "eta"),
B = c("a", "a", "b", "a", "c", "d", "e", "e", "e", "c", "f", "g", "a", "g", "h", "f", "d"),
C = c(NA, 4, 4, 2, 7, 7, 7, 4, NA, 9, 9, 1, 1, NA, NA, 3, NA),
D = c("ii", "ii", "ii", "i", "iii", "iii", "iii", "iv", "v", "viii", "viii", "v", "viii", "i", "i", "iii", "i"))
I tried this, but no luck:
df2 <- df
# split the values in column B
df2$B <- unlist(strsplit(as.character(df2$B), "; "))
# repeat the rows for each value in column B
df2 <- df2[rep(seq_len(nrow(df2)), sapply(strsplit(as.character(df1$B), "; "), length)),]
# match the number of rows in column B with the number of rows in df2
df2$id <- rep(df2$id, sapply(strsplit(as.character(df1$B), "; "), length))
# sort the dataframe by id
df2 <- df2[order(df2$id),]
We may use separate_rows here - specify the sep as ; followed by zero or more spaces (\\s*) to expand the rows
library(tidyr)
df_new <- separate_rows(df, B, sep = ";\\s*")
Checking against the OP's expected output:
> all.equal(df_new, df2, check.attributes = FALSE)
[1] TRUE
In base R, we can split 'B' into a list, repeat each row index by the length of its list element, and then fill 'B' with the unlisted values
lst1 <- strsplit(df$B, ";\\s+")
df_new2 <- transform(df[rep(seq_len(nrow(df)), lengths(lst1)),], B = unlist(lst1))
row.names(df_new2) <- NULL

Creating boxplot based on three variables

How can I create a box plot like the one shown from a data frame like the one below, with category on the x axis, age on the y axis, and gender as the grouping factor?
df <- data.frame(category = c(2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 1, 3), educational_years = c(6, 15, 16, 6, 6, 6, 12, 12, 12, 4, 12, 15, 6, 4), gender = c("M", "F", "F", "F", "M", "M", "F", "M", "F", "F", "M", "F", "M", "F"))
You don't have enough data to create a boxplot like the one shown. For example, you only have a single data point for category 1, so you will only get a single horizontal line here. You only have "M" values for category 2, so you will only get a single box here. You have only a single value for "M" in category three, so you will get a horizontal line instead of a box.
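The counts behind this can be checked with a quick cross-tabulation. A minimal sketch in base R, assuming the sample data is stored in a data frame called df (the name counts is only illustrative):

```r
# Sample data from the question
df <- data.frame(
  category = c(2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 1, 3),
  educational_years = c(6, 15, 16, 6, 6, 6, 12, 12, 12, 4, 12, 15, 6, 4),
  gender = c("M", "F", "F", "F", "M", "M", "F", "M", "F", "F", "M", "F", "M", "F")
)

# Number of observations for each category/gender combination
counts <- table(category = df$category, gender = df$gender)
print(counts)
```

Any cell with a count of 0 produces no box at all, and a cell with a count of 1 collapses to a single line.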
Assuming this is only a sample of your data, rather than the full data set, the code would look like this:
library(ggplot2)
ggplot(df, aes(factor(category), educational_years, fill = gender)) +
geom_boxplot()
At the moment, the result obtained looks like this (plot image not included):

Tidyverse: group_by, arrange, and lag across columns

I am working on a projection model for sports where, for a given team's most recent game, I need to know:
Who is their next opponent? (solved)
When is the last time their next opponent played?
A reprex is below. Using row 1 as an example, I would need to understand that "a"'s next opponent "e"'s most recent game was game_id_ 3.
library(dplyr)
game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
game_date_ <- c(rep("2021-01-29", 6), rep("2021-01-30", 6))
team_ <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_ <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, game_date_, team_, opp_)
#Next opponent
df <- df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L))
If I can provide more details, please let me know.
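We can use match to return the corresponding game_id_. Note that match returns the position of the first match only, so this picks the first row in which next_opp appears in opp_; in this example that row is the next opponent's most recent game.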
library(dplyr)
df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L)) %>%
ungroup %>%
mutate(last_time = game_id_[match(next_opp, opp_)])
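Because match() only finds the first occurrence, a longer schedule may need an explicit "latest game before the upcoming matchup" lookup. The following is a rough base R sketch of that idea, not the answerer's method; the column next_game and the anonymous lookup function are made up for illustration, and it assumes game_id_ increases over time.

```r
# Data from the question
game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
game_date_ <- c(rep("2021-01-29", 6), rep("2021-01-30", 6))
team_ <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_ <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, game_date_, team_, opp_)

df <- df[order(df$game_date_, df$game_id_, df$team_), ]

# Next opponent and the id of that upcoming game, per team (NA if none)
df$next_opp <- ave(df$opp_, df$team_, FUN = function(x) c(x[-1], NA))
df$next_game <- ave(df$game_id_, df$team_, FUN = function(x) c(x[-1], NA))

# Latest game the next opponent played before the upcoming matchup
df$last_time <- mapply(function(opp, g) {
  if (is.na(opp)) return(NA_real_)
  ids <- df$game_id_[df$team_ == opp & df$game_id_ < g]
  if (length(ids) == 0) NA_real_ else max(ids)
}, df$next_opp, df$next_game, USE.NAMES = FALSE)
```

For row 1 this gives 3, matching the example in the question: "a" next plays "e" in game 6, and the latest game "e" played before that is game 3.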

Find the overlap of two datasets

I have two different datasets as I've shown below: df_A and df_B.
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
Now, I want to see the overlap of these two datasets on book_name. Namely, I want to make a list of the book_name values that appear in both datasets, and also to measure how similar the two datasets are according to the book_name column.
Is there an accurate way to do this?
You can do an inner join between the two data frames, which keeps only the rows whose book_name appears in both of them.
This should do the trick:
library(dplyr)
# Creating first data frame
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
# Creating second data frame
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
# Joining between the two dataframes to get the common values between the two
result <-
df_A %>%
inner_join(df_B, by = "book_name")
Here is a base R solution using intersect():
overlap <- subset(df_A, book_name %in% intersect(book_name, df_B$book_name))
which gives
> overlap
# A tibble: 3 x 2
  book_name sales_id
  <chr>        <dbl>
1 A                1
2 C                3
3 E                5
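The question also asks how similar the two datasets are on book_name. One possible measure (an assumption here, since the question does not name one) is the Jaccard index: the number of shared names divided by the number of distinct names across both datasets. A minimal base R sketch, with the book names copied into plain vectors whose names (names_A, names_B) are only illustrative:

```r
# book_name columns from df_A and df_B
names_A <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
names_B <- c("A", "N", "C", "E", "K", "R", "S", "U", "Z", "Y")

# Names present in both datasets
common <- intersect(names_A, names_B)

# Jaccard index: size of the intersection over size of the union
jaccard <- length(common) / length(union(names_A, names_B))

# Share of df_A's distinct names that also occur in df_B
share_A <- length(common) / length(unique(names_A))
```

A value of 0 means no names are shared and 1 means the two sets of names are identical. With the data frames from the question, df_A$book_name and df_B$book_name can be used in place of the two vectors.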

Removing "unused" nodes in sankey network

I am trying to build a sankey network.
This is my data and code:
library(networkD3)
nodes <- data.frame(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "D", "E", "N", "O", "P", "Q", "R"))
names(nodes) <- "name"
nodes$name = as.character(nodes$name)
links <- data.frame(matrix(
c(0, 2, 318.167,
0, 3, 73.85,
0, 4, 51.1262,
0, 5, 6.83333,
0, 6, 5.68571,
0, 7, 27.4167,
0, 8, 4.16667,
0, 9, 27.7381,
1, 10, 627.015,
1, 3, 884.428,
1, 4, 364.211,
1, 13, 12.33333,
1, 14, 9,
1, 15, 37.2833,
1, 16, 9.6,
1, 17, 30.5485), nrow=16, ncol=3, byrow = TRUE))
colnames(links) <- c("source", "target", "value")
links$source = as.integer(links$source)
links$target = as.integer(links$target)
links$value = as.numeric(links$value)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
fontSize = 12, fontFamily = 'Arial', nodeWidth = 20)
The only targets that A and B have in common are D and E.
Although the links are displayed correctly, D and E are also shown a second time at the bottom right.
How can I avoid this?
Note: If I specify
nodes <- data.frame(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "N", "O", "P", "Q", "R"))
no network at all is created.
Nodes must be unique; see the example below. I removed the repeated nodes "D" and "E", and then removed the links that refer to nodes that no longer exist. There are now only 16 nodes, indexed from zero as 0:15, but the last 2 rows of your links data frame refer to 16 and 17.
Or, as @CJYetman (networkD3 author) comments:
Another way to say it... every node that is in the nodes data frame will be plotted, even if it has the same name as another node, because the index is technically the unique id.
library(networkD3)
nodes <- data.frame(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "N", "O", "P", "Q", "R"),
ix = 0:15)
links <- data.frame(matrix(
c(0, 2, 318.167,
0, 3, 73.85,
0, 4, 51.1262,
0, 5, 6.83333,
0, 6, 5.68571,
0, 7, 27.4167,
0, 8, 4.16667,
0, 9, 27.7381,
1, 10, 627.015,
1, 3, 884.428,
1, 4, 364.211,
1, 13, 12.33333,
1, 14, 9,
1, 15, 37.2833), nrow=14, ncol=3, byrow = TRUE))
colnames(links) <- c("source", "target", "value")
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
fontSize = 12, fontFamily = 'Arial', nodeWidth = 20)
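One thing to watch for: dropping the duplicate "D" and "E" entries moves every later node ("N" to "R") down by two positions, so zero-based targets that pointed past the duplicates need to shift as well if all 16 original links are to keep their labels. A minimal sketch of one way to do that by name lookup, using only base R (the helper remap and the vector old_names are made up for this illustration):

```r
# Original node names, including the duplicated "D" and "E"
old_names <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K",
               "D", "E", "N", "O", "P", "Q", "R")

# Original zero-based links from the question
links <- data.frame(
  source = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1),
  target = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 3, 4, 13, 14, 15, 16, 17),
  value  = c(318.167, 73.85, 51.1262, 6.83333, 5.68571, 27.4167, 4.16667,
             27.7381, 627.015, 884.428, 364.211, 12.33333, 9, 37.2833,
             9.6, 30.5485)
)

# Keep each node name once
nodes <- data.frame(name = unique(old_names), stringsAsFactors = FALSE)

# Hypothetical helper: old zero-based index -> name -> new zero-based index
remap <- function(ix) match(old_names[ix + 1], nodes$name) - 1L

links$source <- remap(links$source)
links$target <- remap(links$target)
```

The resulting nodes and links can then be passed to sankeyNetwork() with the same arguments as in the call above.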

Resources