Calculate combinations of several categorical variables - r

I have a data frame with mainly categorical variables. I want to count how often each combination of values occurs across three of these categorical columns.
The data in the columns looks like this:
number_arms <- c("6","8","12")
arrangements <- c("single", "paired", "ornament")
approx_position <- c("top", "middle", "bottom")
rg2 <- data.frame(number_arms, arrangements, approx_position)
Another post suggested the following code for comparing two columns:
library(dplyr)
library(stringr)
rg2 %>%
count(combination = str_c(pmin(number_arms, arrangements), ' - ',
pmax(number_arms, arrangements)), name = "count")
This is the result:
combination count
12 - single 1
16 - single 1
4 - paired 3
4 - single 4
5 - paired 4
5 - single 2
6 - ornament 1
6 - paired 81
However, the code does not give me the wanted results if I add the third column, like this:
rg2 %>%
count(combination = str_c(pmin(number_arms, arrangements, approx_position), ' - ',
pmax(number_arms, arrangements, approx_position)), name = "count")
The code still runs without error, but the results are wrong.
Do I need different code to count the combinations of three variables?

If you're looking for the count of each combination of the variables, excluding 0, you can do:
subset(data.frame(table(rg2)), Freq > 0)
number_arms arrangements approx_position Freq
1 12 ornament bottom 1
15 8 paired middle 1
26 6 single top 1
or combined:
subset(data.frame(table(rg2)), Freq > 0) |>
tidyr::unite("combn", -Freq, sep = " - ")
combn Freq
1 12 - ornament - bottom 1
15 8 - paired - middle 1
26 6 - single - top 1
data
number_arms <- c("6","8","12")
arrangements <- c("single", "paired", "ornament")
approx_position <- c("top", "middle", "bottom")
rg2 <- data.frame(number_arms, arrangements, approx_position)

Tidyverse option (updated to remove group_by):
library(dplyr)
rg2 %>%
count(number_arms, arrangements, approx_position)
Result:
number_arms arrangements approx_position n
<chr> <chr> <chr> <int>
1 12 ornament bottom 1
2 6 single top 1
3 8 paired middle 1

You can try dplyr::count() + paste():
library(dplyr)
rg2 %>%
count(combination = paste(number_arms, arrangements, approx_position, sep = " - "), name = "count")
# combination count
# 1 12 - ornament - bottom 1
# 2 6 - single - top 1
# 3 8 - paired - middle 1
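A likely reason the three-column pmin()/pmax() attempt in the question goes wrong: pmin() and pmax() return only the row-wise smallest and largest of their arguments, so with three columns the middle value is dropped from the label. The base R sketch below illustrates this on the sample data. The names lo, hi, two_of_three and combo_sorted are illustrative, and sorting within each row is just one assumed way to build an order-insensitive label, if that is what is actually wanted.

```r
number_arms <- c("6", "8", "12")
arrangements <- c("single", "paired", "ornament")
approx_position <- c("top", "middle", "bottom")
rg2 <- data.frame(number_arms, arrangements, approx_position,
                  stringsAsFactors = FALSE)

# pmin()/pmax() keep only two of the three values in each row
lo <- pmin(rg2$number_arms, rg2$arrangements, rg2$approx_position)
hi <- pmax(rg2$number_arms, rg2$arrangements, rg2$approx_position)
two_of_three <- paste(lo, hi, sep = " - ")

# Order-insensitive alternative: sort the values within each row, then count
combo_sorted <- apply(rg2, 1, function(r) paste(sort(r), collapse = " - "))
counts <- as.data.frame(table(combination = combo_sorted))
print(two_of_three)
print(counts)
```

If the order of the columns matters, the plain count() or paste() approaches above are simpler and need no sorting.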

Related

R - ggplot showing distribution of binary flag variable (0/1) over time as normalized bar chart (%)

I have a data set that looks something like this:
Date Remaining Volume ID
1990-01-01 0 1000 1
1990-01-01 1 2000 2
1990-01-01 1 5000 3
1990-02-01 0 200 4
1990-03-01 1 4000 5
1990-03-01 0 3000 6
I filter the data according to a series of conditional statements and assign the binary flag variable to the data.table. A value of 0 means that the particular row entry doesn't meet the defined requirements and will subsequently be excluded; 1-flagged rows remain in the data.table. The key is ID and is unique for each row.
I would like to show two relationships.
(1) A stacked, normalized (percentage) bar chart over the monthly time series that shows the share of entries remaining versus excluded for each month,
e.g. Jan 1990 --> 2 of 3 entries remain --> 66.7% remain vs. 33.3% are excluded
(2) A stacked, normalized (percentage) bar chart that shows the share of volume remaining versus excluded by the filtering operation for each month,
e.g. Jan 1990 --> 2k + 5k out of 8k remain --> 87.5% of volume remains vs. 12.5% is excluded
I have tried various things, e.g. computing the number of occurrences of each flag value per month and the sum of the corresponding "bucket" (0/1) volume, but all my attempts have failed so far.
# dt_1 is the original data.table
id.vec <- dt_1[ , id]
dt_2 <- dt_1
# dt_1 is filtered subsequently
id_remaining.vec <- dt_1[ , id]
dt_2 <- dt_2[id.vec %in% id_remaining.vec, REMAIN := 1]
dt_2 <- dt_2[id.vec %notin% id_remaining.vec, REMAIN := 0]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN := .N]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN_MON := .N]
# Tried the code below to no avail
ggplot(data = dt_2, aes(x = Date, y = REMAIN, color = REMAIN, fill = REMAIN)) +
geom_bar(position = "fill", stat = "identity")
Usually, I find the ggplot grammar very intuitive, but I guess I am overlooking something here, or maybe the data set is not in the right format.
Any pointer or idea highly appreciated!
Here's how I'd do it with dplyr:
library(dplyr)
dt_2 %>%
mutate(Remaining = as.character(Remaining)) %>% # just to make the charts use scale_fill_discrete by default
group_by(Date, Remaining) %>%
summarize(entries = n(),
volume = sum(Volume)) %>%
mutate(share_entries = entries / sum(entries),
share_volume = volume / sum(volume)) %>%
ungroup() -> dt_2_summary
> dt_2_summary
# A tibble: 5 x 6
Date Remaining entries volume share_entries share_volume
<chr> <chr> <int> <int> <dbl> <dbl>
1 1990-01-01 0 1 1000 0.333 0.125
2 1990-01-01 1 2 7000 0.667 0.875
3 1990-02-01 0 1 200 1 1
4 1990-03-01 0 1 3000 0.5 0.429
5 1990-03-01 1 1 4000 0.5 0.571
Then to chart:
library(ggplot2)
dt_2_summary %>%
ggplot(aes(Date, share_entries, fill = Remaining)) +
geom_col()
dt_2_summary %>%
ggplot(aes(Date, share_volume, fill = Remaining)) +
geom_col()
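As a cross-check on the shares, the same numbers can be reproduced in base R with table(), xtabs() and prop.table(). This is only a sketch that rebuilds the question's sample rows by hand (the data frame name dt is an assumption), not a replacement for the dplyr pipeline.

```r
dt <- data.frame(
  Date = c("1990-01-01", "1990-01-01", "1990-01-01",
           "1990-02-01", "1990-03-01", "1990-03-01"),
  Remaining = c(0, 1, 1, 0, 1, 0),
  Volume = c(1000, 2000, 5000, 200, 4000, 3000)
)

# Share of entries: counts per (Date, Remaining), normalized within each Date row
share_entries <- prop.table(table(dt$Date, dt$Remaining), margin = 1)

# Share of volume: summed Volume per (Date, Remaining), normalized within each Date row
share_volume <- prop.table(xtabs(Volume ~ Date + Remaining, data = dt), margin = 1)

print(round(share_entries, 3))
print(round(share_volume, 3))
```

For January 1990 this gives 2/3 of entries and 0.875 of volume remaining, matching the figures in the question.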
Just as an appendix to Jon's great solution.
I had a large project with more than 25 libraries loaded. The proposed code appeared to run, but the output of dt_2_summary was wrong: share_entries was apparently computed relative to the total number of entries rather than within each group, and share_volume contained only NAs.
After hours of troubleshooting, I identified the culprit as the plyr package, which masked some of the dplyr functions. I therefore had to state explicitly which package's version of each function I wanted to use.
The code below did the trick for me.
library(plyr) # the culprit
library(dplyr)
dt_2 %>%
dplyr::mutate(Remaining = as.character(Remaining)) %>%
group_by(Date, Remaining) %>%
dplyr::summarize(entries = n(),
volume = sum(Volume)) %>%
dplyr::mutate(share_entries = entries / sum(entries),
share_volume = volume / sum(volume)) %>%
ungroup() -> dt_2_summary
Thanks again, Jon, for your wonderful solution!
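When masking like this is suspected, it can help to check which package a bare function name currently resolves to. The sketch below uses only base R; which_pkg is a hypothetical helper name, and the example calls use stats functions so that it runs without plyr or dplyr installed.

```r
# Hypothetical helper: name of the namespace that a bare function name resolves to
which_pkg <- function(fn_name) {
  fn <- get(fn_name, mode = "function")
  environmentName(environment(fn))
}

print(which_pkg("median"))      # resolved from the search path
print(utils::find("filter"))    # every attached package that defines "filter"
print(head(conflicts()))        # names defined in more than one attached place
```

In a session with both plyr and dplyr attached, calling which_pkg("summarise") or which_pkg("mutate") should show which of the two wins, which is the situation the pkg::function prefixes above work around.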

no. of geom_point matches the value

I have an existing ggplot with geom_col and some observations from a dataframe. The dataframe looks something like :
over runs wickets
1 12 0
2 8 0
3 9 2
4 3 1
5 6 0
The geom_col represents the runs data column and now I want to represent the wickets column using geom_point in a way that the number of points represents the wickets.
I want my graph to look something like this :
As far as I know, we'll need to transform your data to have one row per point. This method requires dplyr 1.0.0 or later, where summarize() can return more than one row per group.
You can adjust the spacing of the wickets by multiplying seq(wickets), though with your sample data a spacing of 1 unit looks pretty good to me.
library(dplyr)
library(ggplot2)
wicket_data = dd %>%
filter(wickets > 0) %>%
group_by(over) %>%
summarize(wicket_y = runs + seq(wickets))
ggplot(dd, aes(x = over)) +
geom_col(aes(y = runs), fill = "#A6C6FF") +
geom_point(data = wicket_data, aes(y = wicket_y), color = "firebrick4") +
theme_bw()
Using this sample data:
dd = read.table(text = "over runs wickets
1 12 0
2 8 0
3 9 2
4 3 1
5 6 0", header = T)
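In dplyr 1.1.0 and later, summarize() results with more than one row per group are deprecated in favor of reframe(), so the step above may emit a warning on newer installations. As a version-independent alternative, the one-row-per-wicket table can also be built in base R. The sketch below assumes the same dd sample data; idx is an illustrative name.

```r
dd <- data.frame(
  over = 1:5,
  runs = c(12, 8, 9, 3, 6),
  wickets = c(0, 0, 2, 1, 0)
)

# Repeat each row index once per wicket (overs with 0 wickets drop out)
idx <- rep(seq_len(nrow(dd)), dd$wickets)

# sequence() yields 1..n for each over, stacking the points above the bar
wicket_data <- data.frame(
  over = dd$over[idx],
  wicket_y = dd$runs[idx] + sequence(dd$wickets)
)
print(wicket_data)
```

The resulting wicket_data can be passed to geom_point() exactly as in the plotting code above.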

Replace row names in a column

I have a large data.frame in R with thousands of rows and 4 columns.
For example:
Chromosome Start End Count
1 NC_031985.1 16255093 16255094 1
2 NC_031972.1 11505205 11505206 1
3 NC_031971.1 24441227 24441228 1
4 NC_031977.1 29030540 29030541 1
5 NC_031969.1 595867 595868 1
6 NC_031986.1 40147812 40147813 1
I also have this second data.frame, which maps each chromosome name to its accession:
LG1 NC_031965.1
LG2 NC_031966.1
LG3a NC_031967.1
LG3b NC_031968.1
LG4 NC_031969.1
LG5 NC_031970.1
LG6 NC_031971.1
LG7 NC_031972.1
LG8 NC_031973.1
LG9 NC_031974.1
LG10 NC_031975.1
LG11 NC_031976.1
LG12 NC_031977.1
LG13 NC_031978.1
LG14 NC_031979.1
LG15 NC_031980.1
LG16 NC_031987.1
LG17 NC_031981.1
LG18 NC_031982.1
LG19 NC_031983.1
LG20 NC_031984.1
LG22 NC_031985.1
LG23 NC_031986.1
I want to replace every accession in the Chromosome column of the large data.frame with the corresponding chromosome name listed above, to get:
Chromosome Start End Count
1 LG22 16255093 16255094 1
2 LG7 11505205 11505206 1
3 LG6 24441227 24441228 1
4 LG12 29030540 29030541 1
5 LG4 595867 595868 1
6 LG23 40147812 40147813 1
Does anybody know the least painful way to do this?
It might be easy (or not) but my experience in R is limited.
Many thanks!
As discussed in the comments here is the dplyr solution if people are looking:
library(dplyr)
df %>%
inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
select(Chromosome = V1, Start, End, Count)
This gives a warning message that the two joining columns have different factor levels. You can either ignore that and work with characters, or convert the merged column to a factor like this:
df %>%
inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
select(Chromosome = V1, Start, End, Count) %>%
mutate(Chromosome = as.factor(Chromosome))
Here is a Base R solution:
merged = merge(df, chromo_names,
by.x = "Chromosome",
by.y = "V2",
sort = FALSE)
merged = merged[c(5,2:4)]
names(merged)[1] = "Chromosome"
Result:
Chromosome Start End Count
1 LG22 16255093 16255094 1
2 LG7 11505205 11505206 1
3 LG6 24441227 24441228 1
4 LG12 29030540 29030541 1
5 LG4 595867 595868 1
6 LG23 40147812 40147813 1
Data:
df = read.table(text = " Chromosome Start End Count
1 NC_031985.1 16255093 16255094 1
2 NC_031972.1 11505205 11505206 1
3 NC_031971.1 24441227 24441228 1
4 NC_031977.1 29030540 29030541 1
5 NC_031969.1 595867 595868 1
6 NC_031986.1 40147812 40147813 1", header = TRUE)
chromo_names = read.table(text = "LG1 NC_031965.1
LG2 NC_031966.1
LG3a NC_031967.1
LG3b NC_031968.1
LG4 NC_031969.1
LG5 NC_031970.1
LG6 NC_031971.1
LG7 NC_031972.1
LG8 NC_031973.1
LG9 NC_031974.1
LG10 NC_031975.1
LG11 NC_031976.1
LG12 NC_031977.1
LG13 NC_031978.1
LG14 NC_031979.1
LG15 NC_031980.1
LG16 NC_031987.1
LG17 NC_031981.1
LG18 NC_031982.1
LG19 NC_031983.1
LG20 NC_031984.1
LG22 NC_031985.1
LG23 NC_031986.1", header = FALSE)
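Another base R option is a direct lookup with match(), which keeps the original row order and leaves NA where an accession has no entry in the lookup table. This is a sketch on a three-row subset of the data above, built with character columns for simplicity.

```r
df <- data.frame(
  Chromosome = c("NC_031985.1", "NC_031972.1", "NC_031971.1"),
  Start = c(16255093, 11505205, 24441227),
  stringsAsFactors = FALSE
)
chromo_names <- data.frame(
  V1 = c("LG6", "LG7", "LG22"),
  V2 = c("NC_031971.1", "NC_031972.1", "NC_031985.1"),
  stringsAsFactors = FALSE
)

# match() gives, for each accession, its row position in the lookup table
df$Chromosome <- chromo_names$V1[match(df$Chromosome, chromo_names$V2)]
print(df)
```

If the columns were read in as factors, wrapping them in as.character() before the lookup avoids indexing by factor codes.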

R- Trimming a string in a dataframe after a particular pattern

I am having trouble figuring out how to trim the end off of a string in a data frame.
I want to trim every name to a "base" name: a run of numbers and letters, a period, then a number. My goal is to trim everything in my dataframe to this "base" name, then sum the values that share the same "base". I was thinking it would be possible to trim, then merge and sum the values.
For example, from
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7.1 2
B0228.7.2 12
B0350.2h.1 30
B0350.2h.2 2
B0350.2i 15
2RSSE.1a 3
2RSSE.1b 10
R02F11.11 4
to
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7 14
B0350.2 47
2RSSE.1 13
R02F11.11 4
Thank you for any help!
Here is a solution using the dplyr and stringr packages. You first create a column with your extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:
library(dplyr)
library(stringr)
df2 = df %>% mutate(Gene_name = str_extract(Gene_name,"[[:alnum:]]+\\.\\d+")) %>%
group_by(Gene_name) %>% summarise(Values = sum(Values))
Gene_name Values
<chr> <int>
1 2RSSE.1 13
2 B0222.5 4
3 B0222.6 16
4 B0228.7 14
5 B0350.2 47
6 R02F11.11 4
As someone has also suggested, I would get gene names first, and then search for them in the original data.frame
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1", "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
stringsAsFactors = F)
pat <- "(^[[:alnum:]]+\\.[[:digit:]]*)"
cap.pos <- regexpr(pat, df$Gene_name)
cap.gene <- unique(substr(df$Gene_name, cap.pos, (cap.pos + attributes(cap.pos)$match.length - 1)))
do.call(rbind, lapply(cap.gene, (function(nm){
sumval <- sum(df[grepl(nm, df$Gene_name, fixed = T),]$Values, na.rm = T)
data.frame(Gene_name = nm, Value = sumval)
})))
The result tracks with your request
Gene_name Value
1 B0222.5 4
2 B0222.6 16
3 B0228.7 14
4 B0350.2 47
5 2RSSE.1 13
6 R02F11.11 4
You can also store Gene_name as a factor and rename its levels.
# coerce the column to a factor
df$Gene_name <- as.factor(df$Gene_name)
# view the levels
levels(df$Gene_name)
# rename the level B0228.7.1 to B0228.7
levels(df$Gene_name)[levels(df$Gene_name) == "B0228.7.1"] <- "B0228.7"
You can repeat this for every level that needs to change. Rows that end up sharing a level are then treated as the same category, so a grouped sum (for example with aggregate() or tapply()) adds their values together.
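Whichever way the base name is derived, the final step is a grouped sum. A minimal base R sketch of the whole trim-then-sum idea, assuming the sample data from the question and using sub() with a capture group (the column name base and the object totals are illustrative):

```r
df <- data.frame(
  Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1",
                "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
  Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
  stringsAsFactors = FALSE
)

# Keep only the leading "alphanumerics, period, digits" part of each name
df$base <- sub("^([[:alnum:]]+\\.[[:digit:]]+).*$", "\\1", df$Gene_name)

# Sum Values within each base name
totals <- aggregate(Values ~ base, data = df, FUN = sum)
print(totals)
```

Note that aggregate() returns the groups in sorted order, so the row order may differ from the order in which the names first appear.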

Plotting a dot for every n observations

I want to achieve the following plot type using ggplot:
using the following data:
t <- read.table(header=T, row.names=NULL,
colClasses=c(rep("factor",3),"numeric"), text=
"week team level n.persons
1 A 1 50
1 A 2 20
1 A 3 30
1 B 1 50
1 B 2 20
2 A 2 20
2 A 3 40
2 A 4 20
2 B 3 30
2 B 4 20")
so far, by applying this transformation
t0 <- t[ rep(1:nrow(t), t$n.persons %/% 10 ) , ]
and plotting
ggplot(t0) + aes(x=week, y=level, fill=team) +
geom_dotplot(binaxis="y", stackdir="center",
position=position_dodge(width=0.2))
I could generate
A: How can I make the dots of different teams dodge each other vertically so that they do not overlap?
B: Is it possible that the whole pack of dots is always centered, i.e.
no dodging occurs if there are only dots of one team in one place?
The following code stops the overlap:
t0 <- t[ rep(1:nrow(t), t$n.persons %/% 10 ) , ]
t0$level <- as.numeric(as.character(t0$level)) # Convert level (the y-axis variable) from a factor to its numeric value
t0$level <- ifelse(t0$team == "B", (t0$level+.1), t0$level) # Shift team 'B' up by 0.1 on the y-axis
ggplot(t0) + aes(x=week, y=level, fill=team) + geom_dotplot(binaxis="y", stackdir="center",
position=position_dodge(width=0.2))
Here is the output:
You could also subtract a value to move the dots downwards if you would prefer that.
If you want the line exactly between the dots this code should do it:
t0$level <- ifelse(t0$team == "B", (t0$level+.06), t0$level)
t0$level <- ifelse(t0$team == "A", (t0$level-.06), t0$level)
Output:
I'm not sure off the top of my head how to skip the above ifelse when there is only one team at a given coordinate. I'd imagine you'd need to do a count of unique team labels at each coordinate and only if that count was > 1 then run the code above.
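The counting idea from the last paragraph could be sketched in base R with ave(): count the distinct teams at each week/level coordinate and apply the offset only where more than one team is present. This is an untested-on-plots assumption about how to get the centering, rebuilt on the question's data with numeric columns; tab, n_teams, offset and level_adj are illustrative names.

```r
tab <- data.frame(
  week = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  team = c("A", "A", "A", "B", "B", "A", "A", "A", "B", "B"),
  level = c(1, 2, 3, 1, 2, 2, 3, 4, 3, 4),
  n.persons = c(50, 20, 30, 50, 20, 20, 40, 20, 30, 20),
  stringsAsFactors = FALSE
)
# One row per dot (one dot per 10 persons), as in the question
t0 <- tab[rep(seq_len(nrow(tab)), tab$n.persons %/% 10), ]

# Number of distinct teams at each (week, level) coordinate
n_teams <- ave(as.integer(factor(t0$team)), t0$week, t0$level,
               FUN = function(x) length(unique(x)))

# Shift only where both teams share a coordinate; otherwise leave centered
offset <- ifelse(n_teams > 1, ifelse(t0$team == "B", 0.06, -0.06), 0)
t0$level_adj <- t0$level + offset
print(unique(t0[, c("week", "team", "level", "level_adj")]))
```

Plotting level_adj instead of level on the y-axis would then leave single-team stacks on the grid line and separate the two teams elsewhere.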

Resources