Creating an alluvial plot in R to demonstrate web traffic flow

I have a dataset that reads like a log file showing each user interaction with a website. I'm trying to visualize this data to show the most common sequences/pathways through the site (no, I do not have access to Google Analytics - just a data dump.) I've been able to distill the data down to a format that contains the page and the number of times it is the first, second, third page visited, etc.
I thought I might create an alluvial plot (using ggalluvial) stratified by the sequential position. I've roughed together a version of what I'm going for:
Here is a way to generate some sample data that is structured like mine:
pages <- rep(c("Home", "About", "People", "Contact", "Products"), each=6)
positions <- sample(c(1,2,3,4,5))
counts <- sample(1:100, 30)
df_colnames <- c("Page", "Position", "Count")
df <- data.frame(pages, positions, counts)
colnames(df) <- df_colnames
But I cannot seem to get ggalluvial to accept a single column as repeated strata, if that makes sense. Here's what I've got, but it's not much to go on:
library(ggalluvial)
ggplot(df,
aes(axis1 = Page,
axis2 = Position,
y = Count)) +
geom_alluvium() +
geom_stratum() +
geom_text(stat = "stratum",
label.strata = TRUE) +
theme_minimal()
This is just something that I have been trying. If you know of a better way to visualize this information, I'm all ears.
Thank you in advance.
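One possible approach, sketched under assumptions: ggalluvial can also take data in long ("lodes") form, where a single column supplies the strata at every position. That needs one row per user per step rather than page-level counts, so the `visits` data frame below (with columns `user`, `position`, `page`) is hypothetical sample data, not the structure shown above.

```r
library(ggplot2)
library(ggalluvial)

set.seed(1)
page_names <- c("Home", "About", "People", "Contact", "Products")

# Hypothetical per-visit data: one row per user per step of their path.
visits <- expand.grid(user = 1:50, position = 1:4)
visits$page <- sample(page_names, nrow(visits), replace = TRUE)

# Lodes (long) form: x is the step, stratum is the page at that step,
# and alluvium identifies the user whose path is being traced.
p <- ggplot(visits,
            aes(x = factor(position), stratum = page, alluvium = user,
                fill = page)) +
  geom_flow() +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  labs(x = "Position in visit", y = "Number of users") +
  theme_minimal()
p
```

Each row counts as one unit of flow here; if the raw log can be reshaped to this form, the flows between consecutive positions show which page-to-page moves are most common.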

Related

Plotting graph from Text file using R

I am using an NS3 based simulator called NDNsim. I can generate certain trace files that can be used to analyze performance, etc. However I need to visualize the data generated.
I am a complete novice with R, and would like a way to visualize it. Here is how the output I would like to plot looks. Any help is appreciated.
It's pretty difficult to know what you're looking for, since you have almost 50,000 measurements across 9 variables. Here's one way of getting a lot of that information on the screen:
df <- read.table(paste0("https://gist.githubusercontent.com/wuodland/",
"9b2c76650ea37459f869c59d5f5f76ea/raw/",
"6131919c105c95f8ba6967457663b9c37779756a/rate.txt"),
header = TRUE)
library(ggplot2)
ggplot(df, aes(x = Time, y = Kilobytes, color = Type)) +
geom_line() +
facet_wrap(~FaceDescr)
You could look into making sub structures from your input file and then graphing that by node, instead of trying to somehow invoke the plotter in just the right way.
df <- read.table(paste0("https://gist.githubusercontent.com/wuodland/",
"9b2c76650ea37459f869c59d5f5f76ea/raw/",
"6131919c105c95f8ba6967457663b9c37779756a/rate.txt"),
header = TRUE)
smaller_df <- df[which(df$Type=='InData'), names(df) %in% c("Time", "Node",
"FaceId", "FaceDescr", "Type", "Packets", "Kilobytes",
"PacketRaw", "KilobyteRaw")]
# Each `+` must end a line; a line that starts with `+` is parsed as a new statement.
ggplot(smaller_df, aes(x = Time, y = Kilobytes, color = Type)) +
  geom_line() +
  facet_wrap(~ Node)
The above snippet makes a smaller data frame from your original text data using only the "InData" Type, and then plots that by nodes.

How to draw a violin plot with the color showing the expression of gene value?

I am trying to plot the gene expression of "gene A" among several groups.
I am using ggplot2 to draw it, but my attempt fails:
p <- ggplot(MAPK_plot, aes(x = group, y = gene_A)) + geom_violin(trim = FALSE , aes( colour = gene_A)) + theme_classic()
And I want to get the figure like this from https://www.researchgate.net/publication/313728883_Neuropilin-1_Is_Expressed_on_Lymphoid_Tissue_Residing_LTi-like_Group_3_Innate_Lymphoid_Cells_and_Associated_with_Ectopic_Lymphoid_Aggregates
You would have to provide data to get a more specific answer tailored to your problem. But I do not want you to get demotivated by the down-votes you have received so far and, based on your link, maybe this example can give you some food for thought.
Nice job on figuring out that you have to use geom_violin. Further, you will need some form of faceting / multi-panels. Finally, to do the full annotation like in the given link, you need to make use of the grid package functionality (which I do not use here).
I am not familiar with gene-expression data sets, so I use an IMDb movie rating data set for this example (stored in the package ggplot2movies).
library(ggplot2)
library(ggplot2movies)
library(data.table)
mv <- copy(movies)
setDT(mv)
# make some variables for our plotting example
mv[, year_10 := cut_width(year, 10)]
mv[, rating_10yr_avg := mean(rating), by = year_10]
mv[, length_3gr := cut_number(length, 3)]
ggplot(mv,
aes(x = year_10,
y = rating)) +
geom_violin(aes(fill = rating_10yr_avg),
scale = "width") +
facet_grid(rows = vars(length_3gr))
Please do not take this answer as encouragement to not post data relevant to your problem.

visualizing statistical test results with ggplot2

I would like to get my statistical test results integrated to my plot. Example of my script with dummy variables (dummy data below generated after first post):
cases <- rep(1:1:5,times=10)
var1 <- rep(11:15,times=10)
outcome <- rep(c(1,1,1,2,2),times=10)
maindata <- data.frame(cases,var1,outcome)
df1 <- maindata %>%
group_by(cases) %>%
select(cases,var1,outcome) %>%
summarise(var1 = max(var1, na.rm = TRUE), outcome=mean(outcome, na.rm =TRUE))
wilcox.test(df1$var1[df1$outcome<=1], df1$var1[df1$outcome>1])
ggplot(df1, aes(x = as.factor(outcome), y = as.numeric(var1), fill=outcome)) + geom_boxplot()
With these everything works just fine, but I can't find a way to integrate my wilcox.test results into my plot automatically (of course I can use annotate() and write the results manually, but that's not what I'm after).
My script produces two boxplots with max-value of var1 on the y-axis and grouped by outcome on the x-axis (only two different values for outcome). I would like to add my wilcox.test results to that boxplot, all other relevant data is present. Tried to find a way from forums and help files but can't find a way (at least with ggplot2)
I'm new to R and trying to learn through ggplot2 and dplyr, which I see as the most intuitive packages for manipulation and visualization. I don't know if they are optimal for the solution I'm after, so feel free to suggest solutions from alternative packages as well...
I think this figure shows what you want. I also added some parts to the code because you're new to ggplot2. Take or leave them, but they are things I do to make publication-quality figures:
wtOut = wilcox.test(df1$var1[df1$outcome<=1], df1$var1[df1$outcome>1])
exampleOut <- ggplot(df1,
aes(x = as.factor(outcome), y = as.numeric(var1), fill=outcome)) +
geom_boxplot() +
scale_fill_gradient(name = paste0("P-value: ",
signif(wtOut$p.value, 3), "\nOutcome")) +
ylab("Variable 1") + xlab("Outcome") + theme_bw()
ggsave('exampleOut.jpg', exampleOut, width = 6, height = 4)
If you want to include the p-value as its own legend, it looks like it is some work, but doable.
Or, if you want, just throw signif(wtOut$p.value, 3) into annotate(...). You'll just need to come up with rules for where to place it.
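A minimal sketch of the annotate() route, assuming a simple placement rule (label centred between the two boxes, at the top of the observed range). The `df1` below is a stand-in with the same values the question's dummy data summarises to, and `p_label` is a name introduced here for illustration.

```r
library(ggplot2)

# Stand-in for df1 from the question (same values as the summarised dummy data).
df1 <- data.frame(cases = 1:5, var1 = 11:15, outcome = c(1, 1, 1, 2, 2))

wtOut <- wilcox.test(df1$var1[df1$outcome <= 1], df1$var1[df1$outcome > 1])
p_label <- paste0("Wilcoxon p = ", signif(wtOut$p.value, 3))

# Placement rule (an assumption): x = 1.5 sits between the two discrete
# x positions, and y is the largest observed value.
p <- ggplot(df1, aes(x = as.factor(outcome), y = var1)) +
  geom_boxplot() +
  annotate("text", x = 1.5, y = max(df1$var1), label = p_label) +
  xlab("Outcome") + ylab("Variable 1") + theme_bw()
p
```

Because the label text is computed from `wtOut`, it updates whenever the data or the test changes; only the x and y placement needs adjusting for other plots.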

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location), #$
f2=factor(data$Station),data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2) #$
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) + #$
geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
position = position_dodge(height = 0, width = 0.75), size = 3)
plot1+xlab("MPS Station") + ylab("Depth(m)") +
theme(legend.title=element_blank()) + scale_y_reverse() +
coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2+scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to combine those two plots by adding "plot2" (station depth scatterplot) to the "plot1" boxplots (detection depths). They both use the same variables (depth and station) and must share the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the `col=f1` argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different plots by referencing the data from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot()+geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) + geom_point(data=df2, aes(x=f2, y=depth), colour="blue") + scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.
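A minimal sketch of what such dummy data could look like, for anyone who wants to reproduce the idea. The station labels, site names, and depths are all made up here and are not the values behind the original figure.

```r
library(ggplot2)

set.seed(42)
stations <- paste0("S", 1:4)   # hypothetical station labels

# Detections: many depth readings per station, tagged by tagging site.
df <- data.frame(
  f1 = factor(sample(c("Site A", "Site B"), 200, replace = TRUE)),
  f2 = factor(sample(stations, 200, replace = TRUE), levels = stations),
  depth = runif(200, min = 0, max = 100)
)

# Stations: a single receiver depth per station.
df2 <- data.frame(
  f2 = factor(stations, levels = stations),
  depth = c(60, 90, 120, 140)
)

# Each geom gets its own data frame, so boxplots and points share one panel.
p <- ggplot() +
  geom_boxplot(data = df, aes(x = f2, y = depth, col = f1)) +
  geom_point(data = df2, aes(x = f2, y = depth), colour = "blue") +
  scale_y_reverse()
p
```

Giving both data frames the same factor levels for `f2` keeps the stations in the same order on the shared x-axis.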

How to create histogram in R with CSV time data?

I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars representing hits per hour, one each for svr01, svr02, svr03, and svr04.
Thank you for helping me here!
I don't know if I understood you right, so I will split my answer in two parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
library(ggplot2)
# binwidth = 1 gives one bar per hour
ggplot(data=df) + geom_histogram(aes(x=hours), binwidth=1) + facet_wrap(~ server)
# or use a color instead
ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), binwidth=1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
The third makes comparison easier, but I think ggplot2 simply looks better.
EDIT: Here is how these solutions look (figures for the first, second, and third solutions).
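The snippets above assume a data frame with `server` and `timestamp` columns, while the log in the question has no header row. A minimal sketch of reading it that way follows; the column names are assumptions, and the inline `log_lines` vector stands in for logs.csv.

```r
# Stand-in for logs.csv: the four sample lines from the question.
log_lines <- c(
  "svr01,07:17:14,'u1#user.de','8.3.1.35'",
  "svr03,07:17:21,'u2#sr.de','82.15.1.35'",
  "svr02,07:17:30,'u3#fr.de','2.15.1.35'",
  "svr04,07:17:40,'u2#for.de','2.1.1.35'"
)

# header = FALSE because the file has no header row; quote = "'" strips the
# single quotes. The column names are assumptions chosen for this sketch.
df <- read.csv(text = log_lines, header = FALSE, quote = "'",
               col.names = c("server", "timestamp", "user", "ip"),
               stringsAsFactors = FALSE)

# The hour is the first two characters of the HH:MM:SS timestamp.
df$hours <- as.numeric(substr(df$timestamp, 1, 2))

# Hits per server per hour as a contingency table.
hits <- table(df$server, df$hours)
hits
```

With a real file, `read.csv("logs.csv", header = FALSE, ...)` with the same remaining arguments should give the same structure.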
An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
Now we can use some plyr magic:
library(plyr)
hits_hour = count(dat, vars = c("server","hr"))
And create the plot:
library(ggplot2)
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
I don't really like this plot; I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.
