I'm trying to create a Venn diagram to inspect how many variables (species) are shared between participant groups. I have a dataframe with dimensions 97 (participants) x 320. The first 2 columns are participant_id and participant_group, and the remaining 318 columns are named after the species and hold their counts. I want a Venn diagram that tells me how many species are shared between all the groups.
Here is a reproducible example:
participant_id <- c("P01","P02","P03","P04","P05","P06","P07","P08","P09","P10", "P11", "P12", "P13", "P14", "P15")
participant_group <- c("control", "responsive", "resistant", "non-responsive", "control", "responsive", "resistant", "non-responsive", "resistant", "non-responsive", "control", "responsive", "non-responsive", "control", "resistant")
A <- c (0, 54, 23, 4, 0, 2, 0, 35, 0, 0, 45, 0, 1, 99, 12)
B <- c (10, 0, 1, 0, 4, 65, 0, 1, 52, 0, 0, 15, 20, 0, 0)
C <- c (0, 0, 0, 5, 35, 0, 0, 45, 0, 0 , 0, 22, 0, 89, 50)
D <- c (0, 0, 45, 0, 1, 0, 0, 0, 56, 32, 0, 0, 40, 0, 0)
E <- c (0, 0, 40, 5, 0, 0, 0, 45, 0, 1, 76, 0, 34, 56, 31)
F <- c (0, 64, 1, 5, 0, 0, 80, 0, 0, 1, 76, 0, 34, 0, 32)
G <- c (12, 5, 0, 0, 80, 45, 0, 0, 76, 0, 0, 0, 0, 32, 11)
H <- c (0, 0, 0, 5, 0, 0, 80, 0, 0, 1, 0, 0, 34, 0, 2)
example_df <- data.frame(participant_id, participant_group, A, B, C, D, E, F, G, H)
I can see all the wonderful venn diagram packages out there, but I'm struggling to format my data correctly.
I have started with:
library(dplyr)

# Sum counts per group, then recode to presence (1) / absence (0)
example_df %>%
  group_by(participant_group) %>%
  dplyr::summarise(across(where(is.numeric), sum)) %>%
  mutate_if(is.numeric, ~1 * (. > 0))
So now I have an indication whether a species (A,B,C, etc) is present (1) or absent (0) within every group. Now, I want to see the overlap of species between the groups through a venn diagram (something like this https://statisticsglobe.com/venn-diagram-with-proportional-size-in-r ). However, I am a little bit stuck on what to do next. Does anybody have any ideas?
I hope this makes sense! Thanks for your time.
When using the code from @Paul Stafford Allen, I get this diagram, but the goal here is to show shared presence/absence of species (A, B, C, etc.) between groups, irrespective of the counts.
Using
library(VennDiagram)
library(dplyr)
library(magrittr)
I managed the following starting point:
groupSums <- example_df %>%
group_by(participant_group) %>%
summarise(across(where(is.numeric), sum))
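# Repeat each species name once per counted individual in the group, so the
# sets passed to venn.diagram() reflect counts rather than presence/absence
forVenn <- lapply(groupSums$participant_group, function(x) {
  rep(names(groupSums)[-1], times = unlist(groupSums[groupSums$participant_group == x, -1]))
})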
names(forVenn) <- groupSums$participant_group
venn.diagram(forVenn, filename = "Venn.png", force.unique = FALSE)
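Since the question asks for presence/absence rather than counts, one possible variation is to build, for each group, the set of species names whose summed count is greater than zero. This is only a sketch in base R under that assumption; the names `species_cols`, `species_sets` and `shared_by_all` are illustrative and not from the original posts.

```r
# Example data from the question (species passed as named arguments so that
# `F` does not mask the FALSE shorthand)
example_df <- data.frame(
  participant_id = sprintf("P%02d", 1:15),
  participant_group = c("control", "responsive", "resistant", "non-responsive",
                        "control", "responsive", "resistant", "non-responsive",
                        "resistant", "non-responsive", "control", "responsive",
                        "non-responsive", "control", "resistant"),
  A = c(0, 54, 23, 4, 0, 2, 0, 35, 0, 0, 45, 0, 1, 99, 12),
  B = c(10, 0, 1, 0, 4, 65, 0, 1, 52, 0, 0, 15, 20, 0, 0),
  C = c(0, 0, 0, 5, 35, 0, 0, 45, 0, 0, 0, 22, 0, 89, 50),
  D = c(0, 0, 45, 0, 1, 0, 0, 0, 56, 32, 0, 0, 40, 0, 0),
  E = c(0, 0, 40, 5, 0, 0, 0, 45, 0, 1, 76, 0, 34, 56, 31),
  F = c(0, 64, 1, 5, 0, 0, 80, 0, 0, 1, 76, 0, 34, 0, 32),
  G = c(12, 5, 0, 0, 80, 45, 0, 0, 76, 0, 0, 0, 0, 32, 11),
  H = c(0, 0, 0, 5, 0, 0, 80, 0, 0, 1, 0, 0, 34, 0, 2)
)

# Every column except the two identifier columns is a species
species_cols <- setdiff(names(example_df),
                        c("participant_id", "participant_group"))

# One character vector per group: species with a total count above zero
species_sets <- lapply(
  split(example_df[species_cols], example_df$participant_group),
  function(group_rows) species_cols[colSums(group_rows) > 0]
)

# Species present in every group (the centre of the Venn diagram)
shared_by_all <- Reduce(intersect, species_sets)
print(shared_by_all)
```

If VennDiagram is installed, `species_sets` could then be passed to `venn.diagram()` in place of `forVenn`; because each species appears at most once per group, the overlaps would count species rather than individuals.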
I am working on a project that requires me to one-hot encode a single variable, and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2 and 3 are separated into their own columns in the dataframe, each equal to either 0 or 1. E.g., if data$Ratings = 3, then the dummy for 3 would be 1. All the other columns should not change.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )
Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
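For comparison, the same idea can be sketched in base R without any package: build one 0/1 indicator column per rating value and bind the result to the data frame. This is only an illustration; the small data frame, `levels_wanted` and the `Ratings_` column prefix are assumptions made for the example, not part of the original answer.

```r
# Reduced stand-in for the question's data: only ID and Ratings are kept
ratings <- c(3, 3, 3, 3, 2, 3, 2, 3, 2, 3)
data <- data.frame(ID = seq_along(ratings), Ratings = ratings)

# Rating values that should each get their own dummy column
# (1 does not occur in the sample, so its column will be all zeros)
levels_wanted <- 1:3

# One column per level: 1 where Ratings equals that level, 0 otherwise
dummies <- sapply(levels_wanted, function(k) as.integer(data$Ratings == k))
colnames(dummies) <- paste0("Ratings_", levels_wanted)

# Original columns stay untouched; the dummies are appended
data_onehot <- cbind(data, dummies)
print(head(data_onehot))
```

Listing the levels explicitly keeps a column for a rating that happens to be absent from the current sample, which a purely data-driven encoding would omit.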
I am trying to display a data frame using the kable function. The data frame consists of 29 columns, but only 9 distinct columns are displayed and the remaining columns are repeats of them.
The data frame used is
structure(list(Name = c("Grand Total", "B", "C", "D", "E", "F"
), GrandTotal = c(3416, 297, 410, 326, 125, 29), English = c(1096,
18, 64, 0, 55, 0), Science = c(211, 5, 39, 0, 55, 0), Language = c(149,
5, 0, 0, 10, 0), Maths = c(22, 0, 0, 0, 0, 0), Social = c(0,
0, 0, 0, 0, 0), English = c(211, 5, 39, 0, 55, 0), Science = c(149,
5, 0, 0, 10, 0), Maths = c(0, 0, 0, 0, 0, 0), Social = c(22,
0, 0, 0, 0, 0), English = c(1096, 18, 64, 0, 55, 0), Science = c(211,
5, 39, 0, 55, 0), Language = c(149, 5, 0, 0, 10, 0), Maths = c(22,
0, 0, 0, 0, 0), Social = c(0, 0, 0, 0, 0, 0), English = c(211,
5, 39, 0, 55, 0), Science = c(149, 5, 0, 0, 10, 0), ACIntern = c(0,
0, 0, 0, 0, 0), PAM = c(22, 0, 0, 0, 0, 0), Maths = c(1096, 18,
64, 0, 55, 0), Social = c(211, 5, 39, 0, 55, 0), English = c(149,
5, 0, 0, 10, 0), Science = c(22, 0, 0, 0, 0, 0), Language = c(0,
0, 0, 0, 0, 0), Maths = c(211, 5, 39, 0, 55, 0), Social = c(149,
5, 0, 0, 10, 0), English = c(0, 0, 0, 0, 0, 0), Science = c(22,
0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
The code used for displaying the data frame as a table format is as follows
monthSelected <- c("April","May","June")
month1 <- paste0(monthSelected[1],' ',yearSelected)
month2 <- paste0(monthSelected[2],' ',yearSelected)
month3 <- paste0(monthSelected[3],' ',yearSelected)
myHeader <- c(" " = 2, month1 = 9, month2 = 9, month3 = 9)
names(myHeader) <- c(" ", month1, month2, month3)
kable(df[1:ncol(df)],"html") %>%
kable_styling(c("striped", "bordered")) %>%
add_header_above(c(" "=2, "IND" = 5, "US" = 4,"IND" = 5, "US" = 4,"IND" = 5, "US" = 4)) %>%
add_header_above(header = myHeader)
The output displayed is as follows
I can't figure out where I went wrong. Can anyone help me out with this issue?
In addition, is it possible to freeze the first two columns when the table is scrolled horizontally?
Thanks in advance!
Currently I have a data.table of this form:
USER active reason days # of elements by hour
4q7C0o 1 NA 28 c(0, 0, 0, 0, 0, 0, 5, 98, 167, 211, 246)
2BrKY63 1 NA 28 c(0, 0, 0, 0, 0, 0, 0, 5, 15, 24, 89, 187)
3drUy6I 1 NA 28 c(0, 0, 0, 0, 0, 0, 0, 0, 1, 112, 265, 309)
G5ALtO 1 NA 28 c(0, 0, 0, 0, 0, 0, 0, 2, 20, 153, 170)
In the column "# of elements by hour", each list is 24 elements long (I omitted the rest for clarity).
However, I don't know how to do the following two things:
1) plot all the "# of elements by hour" vectors in a single plot and label them by "user" or "active" (something that looks like a time series)
2) apply a function to the column "# of elements by hour"
I tried the following, but it produces nothing:
plotserieslines <- function(yvar){
ggplot(tickets_by_hour_2019031, aes_(x=c(0:23) ,y=yvar)) +
geom_line()
}
lapply(names(tickets_by_hour_2019031[,#elements by hour,]), plotserieslines)
where tickets_by_hour_2019031 is my data.table.
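One way to approach both points is to treat the column as a list column: apply functions to it with `sapply()`, and expand it into long format (one row per user and hour) before plotting. The sketch below uses base R and made-up data; the column name `elements_by_hour`, the users `u1`/`u2` and their values are hypothetical stand-ins for the table above.

```r
# Toy table with a list column holding 24 hourly values per user
# (hypothetical data, not taken from the question)
tickets <- data.frame(USER = c("u1", "u2"), active = c(1, 0))
tickets$elements_by_hour <- list(
  cumsum(rep(c(0, 5), each = 12)),  # flat for 12 hours, then +5 per hour
  cumsum(rep(c(0, 3), each = 12))   # flat for 12 hours, then +3 per hour
)

# 2) Apply a function to each element of the list column
tickets$peak <- sapply(tickets$elements_by_hour, max)

# 1) Expand to long format: one row per user and hour
long <- do.call(rbind, lapply(seq_len(nrow(tickets)), function(i) {
  data.frame(
    USER = tickets$USER[i],
    active = tickets$active[i],
    hour = 0:23,
    elements = tickets$elements_by_hour[[i]]
  )
}))

# Optional plot, only built if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = hour, y = elements, colour = USER)) +
    ggplot2::geom_line()
}
```

In long format, each user becomes one line in a single plot, and mapping `colour` to `active` instead of `USER` would label the lines by activity status.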
I have the raw totals for three values that I was looking to display over time in a stacked bar chart, but I don't know how to display this.
I have the percentage values (.22, et cetera), and the raw numbers.
How would I create a stacked bar chart using ggplot2, given that I have three proportions I am trying to graph? Do I need to melt the data?
I would like to do something like:
ggplot(data, aes(fill=condition, y=value, x=specie)) +
geom_bar( stat="identity", position="fill")
But I do not know how to do this as my data isn't formatted right. Should I use dplyr?
Here is my df:
structure(list(date = structure(c(17405, 17406, 17407, 17408,
17409, 17410, 17411, 17412, 17413, 17414), class = "Date"), total_membership = c(1,
1, 1, 1, 1, 188, 284, 324, 354, 390), full_members = c(1, 1,
1, 1, 1, 188, 284, 324, 354, 390), guests = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0), daily_active_members = c(1, 1, 1, 1, 1, 169,
225, 214, 203, 254), daily_members_posting_messages = c(1, 0,
1, 0, 1, 111, 110, 96, 67, 70), weekly_active_members = c(1,
1, 1, 1, 1, 169, 270, 309, 337, 378), weekly_members_posting_messages = c(1,
1, 1, 1, 1, 111, 183, 218, 234, 255), messages_in_public_channels = c(4,
0, 0, 0, 1, 252, 326, 204, 155, 135), messages_in_private_channels = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), messages_in_shared_channels = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), messages_in_d_ms = c(1, 0, 0, 0,
0, 119, 46, 71, 70, 122), percent_of_messages_public_channels = c(0.8,
0, 0, 0, 1, 0.6792, 0.8763, 0.7418, 0.6889, 0.5253), percent_of_messages_private_channels = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), percent_of_messages_d_ms = c(0.2,
0, 0, 0, 0, 0.3208, 0.1237, 0.2582, 0.3111, 0.4747), percent_of_views_public_channels = c(0.2857,
1, 1, 1, 1, 0.8809, 0.9607, 0.945, 0.9431, 0.9211), percent_of_views_private_channels = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), percent_of_views_d_ms = c(0.7143,
0, 0, 0, 0, 0.1191, 0.0393, 0.055, 0.0569, 0.0789), name = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), public_channels_single_workspace = c(10,
10, 11, 11, 12, 12, 12, 13, 13, 13), messages_posted = c(35,
35, 37, 38, 66, 1101, 1797, 2265, 2631, 3055)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Here is an example using a toy data set: the original data are first grouped and summarised to get a per-group value (here the mean amount), then piped to ggplot, which stacks the bars automatically.
library(dplyr)
library(ggplot2)

df <- data.frame(group = sample(letters[1:10], 1000, TRUE),
                 species = sample(1:4, 1000, TRUE),
                 amount = sample(10:30, 1000, TRUE))
df %>% group_by(group, species) %>% summarise(perc = mean(amount)) %>%
  ggplot(aes(group, perc, fill = factor(species))) +
  geom_bar(stat = 'identity')
UPDATE
This will calculate the proportion that 'species' occurs within each 'group'.
df %>% group_by(group,species) %>% summarise(n=n()) %>%
group_by(group) %>% mutate(perc=n/sum(n)) %>%
ggplot(aes(group,perc,fill=factor(species))) +
geom_bar(stat='identity')
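For data that already contain the proportions in separate columns, as in the question, the columns need to be gathered into long format (one row per date and category) before `fill` can be mapped to the category. The following is a sketch using base R for the reshaping and four rows copied from the question's data; the names `msgs`, `prop_cols` and `long` are illustrative.

```r
# Four rows from the question's data frame (dates 17410 to 17413)
msgs <- data.frame(
  date = as.Date(17410:17413, origin = "1970-01-01"),
  percent_of_messages_public_channels = c(0.6792, 0.8763, 0.7418, 0.6889),
  percent_of_messages_private_channels = c(0, 0, 0, 0),
  percent_of_messages_d_ms = c(0.3208, 0.1237, 0.2582, 0.3111)
)

prop_cols <- c("percent_of_messages_public_channels",
               "percent_of_messages_private_channels",
               "percent_of_messages_d_ms")

# Wide to long: stack the three proportion columns on top of each other
long <- data.frame(
  date = rep(msgs$date, times = length(prop_cols)),
  channel_type = rep(prop_cols, each = nrow(msgs)),
  proportion = unlist(msgs[prop_cols], use.names = FALSE)
)

# Optional plot, only built if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = date, y = proportion,
                                          fill = channel_type)) +
    ggplot2::geom_col()
}
```

`tidyr::pivot_longer()` or `reshape2::melt()` would produce the same long layout; base R is used here only to keep the sketch free of extra dependencies.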
What is the fastest way to do this? I have many 'value' columns (>100) in which I have to replace values whenever the matching 'aux' column is zero.
For example, 'value1' should be set to zero whenever 'value1aux' (in the same row) is zero.
Original data:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1),
value1 = c(23, 0, 4, 1, 0, 0, 8, 12),
value2 = c(0, 12, 56, 7, 8, 1, 8, 12),
value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
value2aux = c (1,1,0,0,4,15,67,12))
Desired result:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1),
value1 = c(0, 0, 4, 1, 0, 0, 0, 12),
value2 = c(0, 12, 0, 0, 8, 1, 8, 12),
value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
value2aux = c (1,1,0,0,4,15,67,12))
Code to optimize:
names <- colnames(df[2:3])
names2 <- colnames(df[4:5])
for (i in 1:nrow(df)){
df[i,names] <- replace (df[i,names], df[i,names2] == 0, 0)}
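A vectorised alternative to the row-by-row loop is to build a logical mask from the aux columns and assign zero to all matching cells at once. This is a sketch that assumes every value column has a partner named with the suffix "aux", as in the example; `value_cols` and `aux_cols` are illustrative names.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1),
                 value1 = c(23, 0, 4, 1, 0, 0, 8, 12),
                 value2 = c(0, 12, 56, 7, 8, 1, 8, 12),
                 value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
                 value2aux = c(1, 1, 0, 0, 4, 15, 67, 12))

# Pair each value column with its aux column by name
value_cols <- c("value1", "value2")
aux_cols <- paste0(value_cols, "aux")

# Logical matrix: TRUE where the aux entry is zero; its columns line up
# with value_cols because aux_cols is built in the same order
is_zero <- df[aux_cols] == 0

# Set all matching value cells to zero in a single assignment
df[value_cols][is_zero] <- 0
print(df)
```

With more than 100 value columns, `value_cols` could be derived from `names(df)` (for instance with `grep()`), provided the naming pattern holds for every pair.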