I am still fairly new to R and have stepped away for a while, so please bear with me.
I have a set of data which describes the degree of mobility (categorical data) after an operation across 3 days. I have been looking for a way to demonstrate the flow across those 3 days.
I've tried using geom_jitter with x and y being Day 1 and Day 2, and the colour aesthetic being Day 3, but this doesn't really convey what I want to show. I've done some reading on Sankey diagrams and parallel coordinates, but I don't understand them well enough to adapt the examples posted by others to my data.
This is what I've tried:
test %>% filter(!is.na(Mob_D1.factor) & !is.na(Mob_D2.factor) & !is.na(Mob_D3.factor)) %>%
ggplot(aes(x = Mob_D1.factor, y = Mob_D2.factor, colour = Mob_D3.factor)) +
geom_jitter(size = 5, alpha = 0.25, height = 0.25, width = 0.2) +
scale_colour_brewer(palette = "Dark2", name = "Mobilisation on Day 3") +
xlab("Mobilisation on Day 1") +
ylab("Mobilisation on Day 2") + theme_minimal()
As I said, not quite what I want.
This is a sample of the data:
structure(list(Mob_D1.factor = structure(c(2L, 2L, 2L, 2L, 4L,
1L, 2L, 2L, 1L, 4L, 2L, 4L, 2L, 1L, 2L, 4L, 4L, 2L, 4L, 4L, 2L,
4L, 2L, 2L, 4L, 2L, 1L, 4L, 4L, 3L, 4L, 2L, 3L, 2L, 2L, 2L, 2L,
2L, 4L, 4L, 2L, 4L, 4L, 2L, 2L, 4L, 2L, 4L, 4L, 4L), .Label = c("None",
"Bed", "Stand", "Assisted Walk"), class = "factor"), Mob_D2.factor = structure(c(2L,
3L, 2L, 4L, 4L, 1L, 3L, 4L, 4L, 4L, 3L, 4L, 2L, 2L, 2L, 4L, 4L,
4L, 4L, 4L, 1L, 4L, 2L, 2L, 4L, 2L, 1L, 4L, 4L, 4L, 4L, 2L, 3L,
2L, 2L, 2L, 4L, 4L, 2L, 4L, 3L, 4L, 4L, 2L, 2L, 4L, 4L, 4L, 4L,
4L), .Label = c("None", "Bed", "Stand", "Assisted Walk"), class = "factor"),
Mob_D3.factor = structure(c(2L, 3L, 2L, 4L, 4L, 1L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 4L, 2L,
2L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 2L, 4L, 4L, 2L, 2L, 4L, 4L,
3L, 4L, 4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("None",
"Bed", "Stand", "Assisted Walk"), class = "factor")), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
Thanks in advance to anyone who takes the time to reply. Any extended explanation would be appreciated as I am still learning.
Larry
I am not entirely sure what the expected result should be, but could a barplot be helpful?
Edit
I now think I understand what you need and I found the package ggalluvial that can help you with this.
Hope this helps.
library(tidyverse)
library(ggalluvial)
# Some data wrangling first (df is the data frame from the question).
# Add row_number to give a unique ID for each patient
d <- df %>% mutate(Patient = row_number()) %>%
# transform it to longer format
pivot_longer(cols = -Patient, values_to = "Stage", names_to = "Day")
# Make the plot
ggplot(d,
aes(x = Day, stratum = Stage, alluvium = Patient,
fill = Stage, label = Stage)) +
scale_fill_brewer(type = "qual", palette = "Set2") +
geom_flow(stat = "alluvium", lode.guidance = "frontback",
color = "darkgray") +
geom_stratum()
Created on 2020-02-24 by the reprex package (v0.3.0)
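If you also want the flows as numbers, a cross-tabulation of two consecutive days gives the counts that an alluvial plot draws as bands. This is only a minimal sketch: the `mob` data frame and its six rows are made up for illustration and are not the data from the question.

```r
# Hypothetical mini data set: mobility level of 6 patients on two days
lvls <- c("None", "Bed", "Stand", "Assisted Walk")
mob <- data.frame(
  Mob_D1 = factor(c("Bed", "Bed", "None", "Stand", "Bed", "Assisted Walk"),
                  levels = lvls),
  Mob_D2 = factor(c("Bed", "Stand", "Bed", "Assisted Walk", "Stand", "Assisted Walk"),
                  levels = lvls)
)

# Rows are the Day 1 level, columns the Day 2 level;
# each cell is the number of patients making that transition
trans <- table(Day1 = mob$Mob_D1, Day2 = mob$Mob_D2)
print(trans)

# Number of patients who went from "Bed" on Day 1 to "Stand" on Day 2
bed_to_stand <- trans["Bed", "Stand"]
```

Each non-zero cell corresponds to one band between two adjacent days in the plot, so the table is a quick way to check that the picture shows what you expect.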
I have a data frame and want to count the rows in which a given column contains a specific character. The data looks like the following:
df<-structure(list(Gene.refGene = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1BG", "A1BG-AS1", "A1CF",
"A1CF;PRKG1"), class = "factor"), Chr = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr10", "chr19"
), class = "factor"), Start = c(58858232L, 58858615L, 58858676L,
58859052L, 58859055L, 58859066L, 58859510L, 58863162L, 58864479L,
58864150L, 58864867L, 58864879L, 58865857L, 52566433L, 52569637L,
52571047L, 52573510L, 52576068L, 52580561L, 52603659L, 52619845L,
52625849L, 52642500L, 52650951L, 52675605L, 52703952L, 52723140L,
52723638L), End = c(58858232L, 58858615L, 58858676L, 58859052L,
58859055L, 58859066L, 58859510L, 58863166L, 58864479L, 58864150L,
58864867L, 58864879L, 58865857L, 52566433L, 52569637L, 52571047L,
52573510L, 52576068L, 52580561L, 52603659L, 52619845L, 52625849L,
52642500L, 52650958L, 52675605L, 52703952L, 52723140L, 52723638L
), Ref = structure(c(3L, 5L, 2L, 2L, 3L, 2L, 5L, 7L, 6L, 6L,
2L, 1L, 5L, 6L, 5L, 3L, 2L, 5L, 6L, 3L, 3L, 6L, 3L, 4L, 3L, 6L,
6L, 3L), .Label = c("-", "A", "C", "CTCTCTCT", "G", "T", "TTTTT"
), class = "factor"), Alt_df1 = structure(c(1L, 1L, 4L, 4L, 1L,
4L, 5L, 1L, 3L, 3L, 4L, 4L, 3L, 1L, 2L, 5L, 1L, 2L, 1L, 5L, 5L,
2L, 5L, 1L, 4L, 3L, 4L, 2L), .Label = c("-", "A", "C", "G", "T"
), class = "factor")), class = "data.frame", row.names = c(NA,
-28L))
I want to know how many rows of the column named "Alt_df1" are missing, i.e. contain either - or NA.
Here is an answer using which and utilising base R's LETTERS data:
length(which(!df$Alt_df1%in%LETTERS))
#[1] 8
Or using just which:
length(which(df$Alt_df1=="-"))
#[1] 8
One way would be to create a logical vector using %in% and then sum over them to count the number of occurrences.
sum(df$Alt_df1 %in% c("-", NA))
#[1] 8
Or we can also subset and count the number of rows.
nrow(subset(df, Alt_df1 %in% c("-", NA)))
which can also be done in dplyr by
library(dplyr)
df %>% filter(Alt_df1 %in% c("-", NA)) %>% nrow
Another option using grepl
with(df, sum(grepl("-", Alt_df1)) + sum(is.na(Alt_df1)))
and I am sure there are multiple other ways.
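One thing to keep in mind: the sample data contains no NA, which is why all of the approaches above agree. Once NA values are present they start to differ, because == returns NA for a missing value while %in% returns TRUE or FALSE. A small sketch with a made-up vector `x` (hypothetical, not from the question):

```r
# Hypothetical vector with one missing value
x <- c("-", "A", NA, "G", "-")

# which() silently drops the NA produced by the comparison
n_eq <- length(which(x == "-"))

# %in% never returns NA, so "-" and NA can be counted together
n_in <- sum(x %in% c("-", NA))

# sum() over a comparison containing NA is itself NA
n_sum_eq <- sum(x == "-")

print(c(n_eq = n_eq, n_in = n_in, n_sum_eq = n_sum_eq))
```

So if NA should be counted as missing, the %in% version (or an explicit is.na() term) is the safer choice.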
I am new to ggplot2 and trying to plot a continuous histogram showing the evolution of reviews by date and rating.
My data set looks like this:
date rating reviews
1 2017-11-24 1 some text here
2 2017-11-24 1 some text here
3 2017-12-02 5 some text here
4 2017-11-24 3 some text here
5 2017-11-24 3 some text here
6 2017-11-24 4 some text here
What I want to get is something like this:
for rating == 1
date count
1 2017-11-24 2
2 2017-11-25 7
.
.
.
and so on for rating == 2 and 3
I've tried
ggplot(aes(x = date, y = rating), data = df) + geom_line()
but it gives me only the rating on the y axis, not counts.
You can use dplyr to get the desired dataset and pass that into ggplot():
library(dplyr)
library(ggplot2)
sample_data %>% group_by(rating,date) %>% summarise(n=n()) %>%
ggplot(aes(x=date, y=n, group=rating, color=as.factor(rating))) +
geom_line(size=1.5) + geom_point()
Data:
sample_data <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 5L, 5L, 6L, 6L, 1L,
2L, 3L, 3L, 4L, 5L, 6L, 1L, 2L, 2L, 2L, 3L, 4L, 5L, 6L), date = structure(c(1L,
1L, 3L, 7L, 1L, 1L, 1L, 1L, 5L, 2L, 3L, 8L, 8L, 3L, 4L, 5L, 5L,
6L, 6L, 6L, 9L, 6L, 6L, 6L), .Label = c("2017-11-24", "2017-11-25",
"2017-11-26", "2017-11-27", "2017-11-28", "2017-11-29", "2017-12-02",
"2017-12-04", "2017-12-08"), class = "factor"), rating = c(1L,
1L, 1L, 5L, 3L, 3L, 3L, 4L, 4L, 1L, 1L, 5L, 5L, 3L, 3L, 4L, 1L,
1L, 1L, 1L, 5L, 3L, 3L, 4L), reviews = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "review", class = "factor")), .Names = c("id",
"date", "rating", "reviews"), row.names = c(NA, 24L), class = "data.frame")
Just using some dummy data:
library(tidyverse)
set.seed(999)
df <- data.frame(date = sample(seq(as.Date('2017/01/01'), as.Date('2017/04/01'), by = "day"), 2000, replace = TRUE),
rating = sample(1:5, 2000, replace = TRUE))
df$rating <- as.factor(df$rating)
df %>%
group_by(date,rating) %>%
summarise(n = length(rating)) %>%
ggplot(aes(date,n, color = rating)) +
geom_line() +
geom_point()
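Both answers come down to the same step: count the rows per date and rating, then plot the counts. If the date column is stored as a factor or as text, it may be worth converting it to Date first so the x axis is treated as continuous. A minimal base R sketch of just the counting step, using a made-up `reviews` data frame (hypothetical values):

```r
# Hypothetical data: date stored as a factor, one row per review
reviews <- data.frame(
  date   = factor(c("2017-11-24", "2017-11-24", "2017-12-02", "2017-11-24")),
  rating = c(1L, 1L, 5L, 3L)
)

# Convert the factor to a real Date so it can be used as a continuous axis
reviews$date <- as.Date(as.character(reviews$date))

# Add a helper column of ones and sum it per date/rating combination
counts <- aggregate(n ~ date + rating,
                    data = transform(reviews, n = 1L),
                    FUN = sum)
print(counts)
```

The resulting `counts` data frame has one row per date and rating, which is the shape that the ggplot() calls above expect.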
Let's start with the example of the data:
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple",
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L,
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange",
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"),
P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair",
"Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"),
P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge",
"Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L,
3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed",
"Table,Shelf,Fridge"), class = "factor")), .Names = c("P1",
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon",
"P2_location_all_predictors"), class = "data.frame", row.names = c(NA,
-20L))
I would like to compare two pairs of columns. The first pair is P1_location_subacon and P2_location_subacon. The second pair is P1_location_all_predictors and P2_location_all_predictors.
How do I want to compare them? Each column holds the "locations" of the fruit/vegetable. So:
if the location is the same in the first pair (P1/2_location_subacon), I would like to put the number 2 in an additional column.
if the location is the same in the second pair (P1/2_location_all_predictors), I would like to put the number 1 in the additional column. This one is a bit more complicated because not all of the locations have to match. At least one of them has to be the same for both fruits/vegetables.
if they differ in both cases, put 0. You won't see such a situation in the example data.
To summarize, here is the output I would like to achieve:
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple",
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L,
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange",
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"),
P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair",
"Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"),
P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge",
"Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L,
3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed",
"Table,Shelf,Fridge"), class = "factor"), X = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Correct = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("P1",
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon",
"P2_location_all_predictors", "X", "Correct"), class = "data.frame", row.names = c(NA,
-20L))
EDIT: I have improved my answer using feedback from the question "Test two columns of strings for match row-wise in R".
Where DT is your table:
library(data.table)
setDT(DT)
DT <- data.table(sapply(DT,as.character))
DT[, P1_location_all_predictors := gsub(",","|",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub(",","|",P1_location_subacon)]
DT[, match_all_pred := grepl(P1_location_all_predictors, P2_location_all_predictors) + 0, by = P1_location_all_predictors]
DT[, match_subacon := grepl(P1_location_subacon, P2_location_subacon), by = P1_location_subacon]
DT[, P1_location_all_predictors := gsub("\\|",",",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub("\\|",",",P1_location_subacon)]
I opted for two columns instead of your 0/1/2 notation, which makes the code less straightforward because you have to rely on nested ifs. I also think that multiple columns are better, as you can clearly see the F/F, T/F, F/T, and T/T cases.
If you must create the 0/1/2, you can call
DT[, MyCol := match_all_pred - match_subacon*match_all_pred+match_subacon*2]
which assumes that subacon supersedes the all location.
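To see what that formula does, you can evaluate it on all four combinations of the two match columns. This is just a check of the arithmetic; the `cases` data frame is made up for illustration.

```r
# All four combinations of the two 0/1 match columns
cases <- expand.grid(match_all_pred = c(0, 1), match_subacon = c(0, 1))

# Same formula as above: subacon wins (2) whenever it matches,
# otherwise the result is 1 if all_predictors matches and 0 if nothing matches
cases$MyCol <- with(cases,
                    match_all_pred - match_subacon * match_all_pred + match_subacon * 2)
print(cases)
```

The rows come out as 0, 1, 2, 2, so a subacon match gives 2 regardless of the all_predictors result.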
Here is another way:
myData <- data.frame(sapply(myData, as.character), stringsAsFactors=FALSE)
doesIntersect <- function(setA, setB) {length(intersect(setA,setB)) > 0}
myData$Correct <- 0
myData$Correct[mapply(doesIntersect, strsplit(myData$P1_location_all_predictors, ","), strsplit(myData$P2_location_all_predictors, ","))] <- 1
myData$Correct[mapply(setequal, strsplit(myData$P1_location_subacon, ","), strsplit(myData$P2_location_subacon, ","))] <- 2
> myData$Correct
[1] 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
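For a single row, the split-and-intersect idea looks like the sketch below. It also shows one way it differs from a pattern-based match: grepl() matches substrings, while intersect() compares whole location names. The strings `p1` and `p2` are taken from the example data; "Bedroom" is a made-up value used only to illustrate the substring case.

```r
# One pair of location strings from the example data
p1 <- "Table,Shelf,Cupboard,Bed,Fridge"
p2 <- "Table,Shelf,Fridge"

# Split on commas and keep the locations that occur in both
shared <- intersect(strsplit(p1, ",")[[1]], strsplit(p2, ",")[[1]])
print(shared)

# Substring matching vs whole-name matching ("Bedroom" is hypothetical)
substring_hit  <- grepl("Bed", "Bedroom")
whole_name_hit <- length(intersect("Bed", "Bedroom")) > 0
```

If location names can be substrings of each other, the intersect-based check is the more conservative of the two.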
I would like to create a plot like the one shown here: https://stats.stackexchange.com/a/25156 (the stacked bar chart type).
I have been trying to get it to work with the likert package all afternoon.
I did manage to get a plot, but I couldn't get it to group by pre/post.
Here's a sample of data:
data <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), Survey = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("pre", "post"), class = "factor"), Optimistic = structure(c(3L,
3L, 2L, 3L, NA, 2L, 3L, 2L, 3L, 1L, 3L, 3L, 2L, 2L, 1L, 2L, 2L,
2L, 3L, 2L), .Label = c("All of the time", "Often", "Some of the time"
), class = "factor"), Useful = structure(c(4L, 4L, 2L, 4L, NA,
2L, 4L, 2L, 4L, 1L, 4L, 3L, 4L, 2L, 2L, 2L, 1L, 2L, 4L, 2L), .Label = c("All of the time",
"Often", "Rarely", "Some of the time"), class = "factor"), Relaxed = structure(c(4L,
4L, 3L, 4L, NA, 4L, 3L, 2L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 1L,
2L, 4L, 4L), .Label = c("All of the time", "Often", "Rarely",
"Some of the time"), class = "factor"), Handling = structure(c(3L,
2L, 2L, 4L, 1L, 2L, 3L, 2L, 4L, 2L, 3L, 4L, 2L, 2L, 2L, 2L, 1L,
2L, 2L, 2L), .Label = c("All of the time", "Often", "Rarely",
"Some of the time"), class = "factor"), Thinking = structure(c(4L,
2L, 4L, 4L, 1L, 2L, 3L, 2L, 4L, 2L, 4L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 2L, 4L), .Label = c("All of the time", "Often", "Rarely",
"Some of the time"), class = "factor"), Closeness = structure(c(3L,
3L, 4L, 4L, 2L, 2L, 3L, 2L, 4L, 1L, 4L, 4L, 4L, 2L, 1L, 2L, 1L,
2L, 4L, 4L), .Label = c("All of the time", "Often", "Rarely",
"Some of the time"), class = "factor"), Mind = structure(c(3L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 4L, 1L, 4L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("All of the time", "Often", "Rarely",
"Some of the time"), class = "factor")), .Names = c("ID", "Survey",
"Optimistic", "Useful", "Relaxed", "Handling", "Thinking", "Closeness",
"Mind"), row.names = c(NA, -20L), class = "data.frame")
And here is what I have done to it so far:
surveylevels <- c("pre","post") # I put this bit in when I started trying to do the grouping
data$Survey <- factor(data$Survey, levels = surveylevels)
test <- data
test$ID <- NULL # because the likert package thing apparently can't deal with anything else :(
test$Survey = NULL # ditto
levels <- c("All of the time", "Often", "Some of the time", "Rarely", "None of the time")
test$Mind <- factor(test$Mind, levels = levels)
test$Optimistic <- factor(test$Optimistic, levels = levels)
test$Useful <- factor(test$Useful, levels = levels)
test$Relaxed <- factor(test$Relaxed, levels = levels)
test$Handling <- factor(test$Handling, levels = levels)
test$Thinking <- factor(test$Thinking, levels = levels)
test$Closeness <- factor(test$Closeness, levels = levels)
library(likert) # needed for likert()
thing <- likert(test)
plot(thing)
Anyway, the next thing I wanted to try was:
likert(test, grouping = data$Survey)
This just wasn't working. I read that it doesn't work anymore and that you have to mess around in the package files to get it sorted.
Additionally, I see that it isn't recognising all the levels of the data (some are missing). I amended the code to thing <- likert(test, nlevels = 5), but that did not fix it.
So my question is: is there a simpler way to do this? All the questions and answers I found on the internet are a year old or more. Has anything happened since then that might make this more straightforward?
It looks like the behavior of cast may have changed, which is what was causing the problem with grouping=. It appears you can fix it with this hack:
body(likert)[[c(7,3,3,4,4)]]<-quote(t <- apply(as.matrix(t), 2, FUN = function(x) {
x/sum(x) * 100
}))
Basically, we are just adding a call to as.matrix(). Calling the function with the grouping parameter then works for me.
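Since the hack above depends on the internals of one particular version of the package, it may help to know that the numbers behind a grouped stacked bar chart can also be computed directly in base R. This is only a sketch of that idea, not what the likert package does internally; the `resp` data frame and its values are made up.

```r
# Hypothetical answers to one item, four respondents per survey
lv <- c("All of the time", "Often", "Some of the time", "Rarely", "None of the time")
resp <- data.frame(
  Survey = factor(rep(c("pre", "post"), each = 4), levels = c("pre", "post")),
  Optimistic = factor(c("Often", "Rarely", "Often", "Some of the time",
                        "Often", "Often", "All of the time", "Often"),
                      levels = lv)
)

# Cross-tabulate survey by answer, then convert each row to percentages.
# Unused levels (e.g. "None of the time") are kept as zero columns.
pct <- prop.table(table(resp$Survey, resp$Optimistic), margin = 1) * 100
print(round(pct, 1))
```

Passing the transposed table to barplot(), for example barplot(t(pct), horiz = TRUE), should then draw one stacked bar per survey, with all five levels present even when some were never chosen.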