How to create a network chart?

I am trying to use networkD3::forceNetwork to create a chart of employers and colleges from which employers hire employees.
Right now, I have something like this:
forceNetwork(Links= Links, Nodes= netDf ,
Source = 'collegeName', Target = 'organizationName', Value='count',
NodeID = 'collegeName', Group = 'organizationName')
But the output doesn't look as expected. What I would like to have, is:
One bubble for each college
One bubble for each employer
Colleges connected to employers, with the number of employees (count) mapped to the width of the connection lines.
Colleges are never connected to each other, and the same holds for employers.
This is the dataset I am using, netDf:
structure(list(collegeName = c("college1", "college1", "college2",
"college3", "college3", "college3", "college4", "college5", "college5",
"college6", "college6", "college6", "college7", "college7", "college7",
"college8", "college9", "college10", "college10", "college11"
), organizationName = c("employer2", "employer3", "employer2",
"employer1", "employer2", "employer3", "employer2", "employer2",
"employer3", "employer1", "employer2", "employer3", "employer1",
"employer2", "employer3", "employer2", "employer2", "employer2",
"employer3", "employer2"), count = c(858, 176, 461, 201, 2266,
495, 430, 1992, 290, 127, 1754, 549, 136, 2839, 686, 638, 275,
1388, 387, 188), group = c(2, 3, 2, 1, 2, 3, 2, 2, 3, 1, 2, 3,
1, 2, 3, 2, 2, 2, 3, 2)), .Names = c("collegeName", "organizationName",
"count", "group"), row.names = c(NA, -20L), class = "data.frame")
And this is the Links dataset:
structure(list(collegeName = c(0, 0, 1, 2, 2, 2, 3, 4, 4, 5,
5, 5, 6, 6, 6, 7, 8, 9, 9, 10), organizationName = c(1, 2, 1,
0, 1, 2, 1, 1, 2, 0, 1, 2, 0, 1, 2, 1, 1, 1, 2, 1), count = c(858,
176, 461, 201, 2266, 495, 430, 1992, 290, 127, 1754, 549, 136,
2839, 686, 638, 275, 1388, 387, 188), group = c(2, 3, 2, 1, 2,
3, 2, 2, 3, 1, 2, 3, 1, 2, 3, 2, 2, 2, 3, 2)), .Names = c("collegeName",
"organizationName", "count", "group"), row.names = c(NA, -20L
), class = "data.frame")
Also, would it be possible to map a 4th variable to the bubble size? Say, for instance, that I want to map count to the size of the bubbles for the employers. How can I do that?

I think your Links and Nodes data frames do not meet the requirements as specified in ?forceNetwork. Instead, you could do:
library(networkD3)
set.seed(1)
nodes <- data.frame(Label = unique(c(netDf[,1], netDf[,2])))
nodes$Group <- as.factor(substr(nodes$Label, 1, 3))
nodes <- merge(
nodes,
aggregate(count~organizationName, netDf, sum),
by.x="Label", by.y="organizationName",
all.x=TRUE
)
nodes$count[is.na(nodes$count)] <- 1
links <- transform(netDf,
Source = match(netDf$collegeName, nodes$Label)-1,
Target = match(netDf$organizationName, nodes$Label)-1
)
forceNetwork(
Links = transform(links, count = count/min(count)),
Nodes = nodes,
Source = 'Source',
Target = 'Target',
Value='count',
NodeID = 'Label',
Group = "Group",
Nodesize = "count",
legend = TRUE,
opacity = 1,
radiusCalculation = JS("Math.log(d.nodesize)+6")
)
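To illustrate why the original call fails: as I read ?forceNetwork, Source and Target have to be zero-based row positions in the Nodes data frame, so colleges and employers must share a single index space. In the question's Links, college 0 and employer 0 both point at the same row. A minimal base-R sketch on a three-row toy subset (the toy vectors are only for illustration):

```r
# Toy subset of the question's data (illustrative only)
colleges  <- c("college1", "college1", "college2")
employers <- c("employer2", "employer3", "employer2")

# One node per distinct label, colleges and employers in the same table
nodes <- data.frame(Label = unique(c(colleges, employers)),
                    stringsAsFactors = FALSE)

# Zero-based positions of each endpoint within nodes
links <- data.frame(
  Source = match(colleges,  nodes$Label) - 1L,
  Target = match(employers, nodes$Label) - 1L
)

# Sanity checks: every index must point at an existing row of nodes
stopifnot(all(links$Source >= 0L),
          max(links$Source, links$Target) < nrow(nodes))

# Translate the indices back to labels to verify the mapping
data.frame(from = nodes$Label[links$Source + 1L],
           to   = nodes$Label[links$Target + 1L])
```

Because a college never appears in the employer column and vice versa, links built this way can only connect a college to an employer.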


ggbetweenstats: logarithmic y axis removes grouped analysis from plot

I am conducting a Kruskal-Wallis test to determine whether a measurement differs significantly between three groups. I use ggbetweenstats to determine which pairs of groups differ significantly.
Here is the code for sample data and the plot:
sampledata <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), group = c(1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2), measurement = c(0,
1, 200, 30, 1000, 6000, 1, 0, 0, 10000, 20000, 700, 65, 1, 8,
11000, 13000, 7000, 500, 3000)), class = "data.frame", row.names = c(NA,
20L))
library(ggstatsplot)
library(ggplot2)
ggbetweenstats(
data = sampledata,
x = group,
y = measurement,
type = "nonparametric",
plot.type = "box",
pairwise.comparisons = TRUE,
pairwise.display = "all",
centrality.plotting = FALSE,
bf.message = FALSE
)
You can see the results of the Kruskal-Wallis test at the top of the plot, as well as the grouped (pairwise) analysis in the plot. Now I want to change the y axis to a logarithmic scale:
ggbetweenstats(
data = sampledata,
x = group,
y = measurement,
type = "nonparametric",
plot.type = "box",
pairwise.comparisons = TRUE,
pairwise.display = "all",
centrality.plotting = FALSE,
bf.message = FALSE
) +
ggplot2::scale_y_continuous(trans=scales::pseudo_log_trans(sigma = 1, base = exp(1)), limits = c(0,25000), breaks = c(0,1,10,100,1000,10000)
)
However, this removes the grouped analysis. I have tried different scaling solutions and browsed SO for a solution but couldn't find anything. Thank you for your help!
It seems that the y_position parameter in the geom_signif component is not affected by the y axis transformation. You will need to pass the log values of the desired bracket heights manually. In theory, you can pass these via the ggsignif.args parameter, but it seems that in the latest version of ggstatsplot this isn't possible because the y_position is hard-coded.
One way around this is to store the plot and then change the y positions after the fact. Here's a full reprex with the latest versions of ggplot2, ggstatsplot and their dependencies (at the time of writing):
sampledata <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), group = c(1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2), measurement = c(0,
1, 200, 30, 1000, 6000, 1, 0, 0, 10000, 20000, 700, 65, 1, 8,
11000, 13000, 7000, 500, 3000)), class = "data.frame", row.names = c(NA,
20L))
library(ggstatsplot)
library(ggplot2)
library(scales)
p <- ggbetweenstats(
data = sampledata,
x = group,
y = measurement,
type = "nonparametric",
plot.type = "box",
pairwise.comparisons = TRUE,
pairwise.display = "all",
centrality.plotting = FALSE,
bf.message = FALSE
) + scale_y_continuous(trans = pseudo_log_trans(sigma = 1, base = exp(1)),
limits = c(0, exp(13)),
breaks = c(0, 10^(0:5)),
labels = comma)
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
i <- which(sapply(p$layers, function(x) inherits(x$geom, "GeomSignif")))
p$layers[[i]]$stat_params$y_position <- c(10, 10.8, 11.6)
p
Created on 2023-01-15 with reprex v2.0.2
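To pick the bracket heights on the data scale rather than by trial and error, the values can be pushed through the same transform as the axis. This is only a sketch: it assumes the pseudo-log transform is asinh(y / (2 * sigma)) / log(base), which is how the scales documentation describes pseudo_log_trans(), and it re-implements that formula in base R. The names to_axis, from_axis and bracket_heights are made up for this example.

```r
# Same parameters as the scale_y_continuous() call above
sigma <- 1
base  <- exp(1)

# Forward and inverse pseudo-log transform (assumed to match scales)
to_axis   <- function(y) asinh(y / (2 * sigma)) / log(base)
from_axis <- function(z) 2 * sigma * sinh(z * log(base))

# Heights on the data scale where the three brackets should sit
# (illustrative values just above the largest measurement)
bracket_heights <- c(25000, 60000, 150000)

# Values to assign to stat_params$y_position
y_position <- to_axis(bracket_heights)
round(y_position, 2)

# Going the other way: where the hard-coded 10, 10.8, 11.6 land on the data scale
round(from_axis(c(10, 10.8, 11.6)))
```

The resulting y_position vector can then replace c(10, 10.8, 11.6) in the assignment to p$layers[[i]]$stat_params$y_position.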

Complicated filtering of data frame without loops

I have a big data frame with positions, timestamps, trip ids etc.
I would like to filter it in a simple way, avoiding double loops, and keep only some of the rows.
For all the rows that have the same combination of trip_id and stop_id, I want to keep the row where the speed was first equal to zero, either by taking the minimum timestamp where the speed is zero or simply by taking the first row where the speed is zero, since the frame is ordered by timestamp.
So in the example below, I would like to find the top three rows (there are many more rows in the real data frame) and keep only the second row, where the speed was first zero.
Is there a way to do this without any loops?
trip_id.x stop_id latitude.x longitude.x bearing speed timestamp vehicle id
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 30 0.2 1541399400 9031005990005424
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 0 0 1541399401 9031005990005424
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 0 0 1541399402 9031005990005424
55700000048910300 9022005000050006 58.416879999999999 15.624510000000001 30 0.5 1541400000 9031005990005424
Edit:
Here is the dput() of a longer example with a simpler format of the data I have:
structure(list(trip_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3), stop_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1,
1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
3, 3), speed = c(5, 0, 0, 5, 2, 0, 0, 2, 4, 0, 0, 4, 5, 0, 0,
5, 2, 0, 0, 2, 4, 0, 0, 4, 5, 0, 0, 5, 2, 0, 0, 2, 4, 0, 0, 4
), timestamp = c(1, 2, 3, 4, 101, 102, 103, 104, 201, 202, 203,
204, 301, 302, 303, 304, 401, 402, 403, 404, 501, 502, 503, 504,
601, 602, 603, 604, 701, 702, 703, 704, 801, 802, 803, 804)), row.names = c(NA,
-36L), class = c("tbl_df", "tbl", "data.frame"))
And the wanted output:
structure(list(trip_id = c(1, 1, 2, 2, 2, 3, 3, 3), stop_id = c(1,
3, 1, 2, 3, 1, 2, 3), speed = c(0, 0, 0, 0, 0, 0, 0, 0), timestamp = c(2,
202, 302, 402, 502, 602, 702, 802)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Edit: I am trying to change the code to include conditions. I tried case_when and if but can't get it to work:
df_arrival_z <- df %>%
group_by(trip_id, stop_id) %>%
filter(speed == 0)
# Check if there is any rows where speed is zero
if (nrow(filter(speed == 0)) > 0){
# Take the first row if there is rows with zero
filter(speed == 0) %>% slice(1)
}
if (nrow(filter(speed == 0)) == 0){
# Take the middle point if there is no rows with speed = 0
slice(nrow%/%2)
}
Without desired output I can't be sure what you expect, but try this and let me know:
library(dplyr)
df %>%
group_by(trip_id, stop_id) %>%
filter(speed == 0) %>%
slice(1)
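For the conditional variant in the second edit (first zero-speed row if there is one, otherwise a row from the middle of the group), one possible sketch is to compute the row index per group inside slice(). This assumes a reasonably recent dplyr; the toy data below is invented for illustration, and max(1L, ...) is my own guard so that single-row groups still return a row.

```r
library(dplyr)

# Toy data: stop 1 has zero-speed rows, stop 2 never reaches zero
df <- tibble(
  trip_id   = c(1, 1, 1, 1, 1, 1, 1, 1),
  stop_id   = c(1, 1, 1, 1, 2, 2, 2, 2),
  speed     = c(5, 0, 0, 5, 2, 1, 3, 2),
  timestamp = c(1, 2, 3, 4, 101, 102, 103, 104)
)

df_arrival <- df %>%
  arrange(timestamp) %>%                 # make "first" mean "earliest"
  group_by(trip_id, stop_id) %>%
  slice(
    if (any(speed == 0)) {
      which(speed == 0)[1]               # first row with speed zero
    } else {
      max(1L, n() %/% 2L)                # fallback: a middle row of the group
    }
  ) %>%
  ungroup()

df_arrival
```

The if/else is evaluated once per group, so each (trip_id, stop_id) combination gets its own decision.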

How to specify predictor matrix for stan data block?

Dear Stack Overflow community. I want to use the variables w1 to w10 as the predictor matrix (matrix[N, W] weights;) in my Stan model. I am not certain how to accomplish that.
data frame
(dat <- data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4),
imput = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
A = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
B = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
Pass = c(278, 278, 278, 278, 278, 100, 100, 100, 100, 100, 153, 153, 153, 153, 153, 79, 79, 79, 79, 79),
Fail = c(740, 743, 742, 743, 740, 7581, 7581, 7581, 7581, 7581, 1231, 1232, 1235, 1235, 1232, 1731, 1732, 1731, 1731, 1731),
W_1= c(4, 3, 4, 3, 3, 1, 2, 1, 2, 1, 12, 12, 11, 12, 12, 3, 5, 3, 3, 3),
W_2= c(3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 3, 3, 3, 3, 3),
W_3= c(4, 3, 3, 3, 3, 1, 2, 1, 1, 1, 12, 12, 11, 12, 12, 3, 3, 3, 3, 3),
W_4= c(3, 3, 4, 3, 3, 1, 1, 1, 2, 1, 12, 12, 13, 12, 12, 3, 2, 3, 3, 3),
W_5= c(3, 3, 3, 3, 3, 1, 0, 1, 1, 1, 12, 12, 12, 12, 12, 3, 3, 3, 3, 3),
W_6= c(4, 3, 3, 3, 3, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 3, 3, 3, 3, 3),
W_7= c(3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 3, 3, 3, 3, 3),
W_8= c(3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 15, 12, 12, 12, 12, 3, 3, 3, 3, 3),
W_9= c(3, 3, 3, 4, 3, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 2, 3, 3, 3, 3),
W_10= c(3, 3, 4, 3, 3, 1, 1, 1, 1, 1, 12, 10, 12, 12, 12, 3, 3, 3, 3, 3)
))
creating list
N <- nrow(dat)
ncases <- dat$Pass
nn <- dat$Fail + dat$Pass
A <- dat$A
B <- dat$B
id <- dat$id
imput <- dat$imput
w_1 <- dat$W_1
w_2 <- dat$W_2
w_3 <- dat$W_3
w_4 <- dat$W_4
w_5 <- dat$W_5
w_6 <- dat$W_6
w_7 <- dat$W_7
w_8 <- dat$W_8
w_9 <- dat$W_9
w_10 <- dat$W_10
dat1 <- list (N = N,
ncases = ncases, A = A, B = B, id = id, P = imput, nn = nn,
w1 = w_1, w2 = w_2, w3 = w_3, w4 = w_4, w5 = w_5,
w6 = w_6, w7 = w_7, w8 = w_8, w9 = w_9, w10 = w_10)
data block
data{
int N; // number of observations
int ncases[N]; // independent variable
int A[N]; // independent variable
int B[N]; // independent variable
int nn[N]; // independent variable
int id[N]; //individual id
int W[N]; //vector of weights
int P[N]; // number of imputations
matrix[N, W] weights; // design matrix of weights
}
Thank you in advance for any help.
If W in the data block is actually an int (rather than a vector; i.e., W is the number of columns in weights), then I would expect this to do what you need:
dat1 <- list (N = N,
ncases = ncases, A = A, B = B, id = id, P = imput, nn = nn, W = 10,
weights = cbind(w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9, w_10))
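The same list can also be built without naming every column, by selecting the W_ columns by pattern and converting them to a matrix. A small sketch on a three-row toy data frame (the toy values and the names w_cols and stan_data are only for illustration), still assuming W is declared as a single int in the data block:

```r
# Toy stand-in for dat with three weight columns
dat <- data.frame(
  Pass = c(278, 100, 153),
  W_1  = c(4, 1, 12),
  W_2  = c(3, 1, 12),
  W_3  = c(4, 2, 11)
)

# Pick all columns named W_<number>, in their existing order
w_cols  <- grep("^W_[0-9]+$", names(dat), value = TRUE)

# N x W numeric matrix, matching matrix[N, W] weights;
weights <- as.matrix(dat[w_cols])

stan_data <- list(
  N       = nrow(dat),
  W       = ncol(weights),   # number of weight columns, a single integer
  weights = weights
)

str(stan_data)
```

Deriving W from ncol(weights) keeps the declared dimension and the matrix in sync if columns are added or removed later.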

How to mirror the outer positions with the variable with R

I have a data frame:
tes <- data.frame(x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
y = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
d = c(10, 20, 30, 100, 11, 12, 403, 43, 21))
They look like this on the plot
ggplot(aes(x = x, y = y), data = tes) + geom_point(aes(color = factor(d)), size = 5)
I'd like to "mirror" the outer rows and columns of this data to obtain data and a plot like this:
tes1 <- data.frame(x = c(0, 0, 0, 0,0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4),
y = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
d = c(10, 10, 20, 30, 30, 10, 10, 20, 30, 30, 100, 100, 11, 12, 12, 403, 403, 43, 21, 21, 403, 403, 43, 21, 21))
ggplot(aes(x = x, y = y), data = tes1) + geom_point(aes(color = factor(d)), size = 4)
Does this do what you're after?
Explanation: We first convert tes into a flattened table with ftable(xtabs(...)). Then we replicate the first and last columns and the first and last rows. We then assign new column and row names to reflect the extra "flanking" rows and columns, and finally convert back to a long data frame with data.frame(as.table(...)).
# Convert to table then matrix
m <- ftable(xtabs(d ~ x + y, data = tes));
class(m) <- "matrix";
# Replicate first and last column/row by binding to the beginning
# and end, respectively of the matrix
m <- cbind(m[, 1], m, m[, ncol(m)]);
m <- rbind(m[1, ], m, m[nrow(m), ]);
# Set column/row names
rownames(m) <- seq(min(tes$x) - 1, max(tes$x) + 1);
colnames(m) <- seq(min(tes$y) - 1, max(tes$y) + 1);
# Convert back to long dataframe
tes.ext <- data.frame(as.table(m));
colnames(tes.ext) <- colnames(tes);
# Plot
ggplot(aes(x = x, y = y), data = tes.ext) + geom_point(aes(color = factor(d)), size = 5)
Data
tes <- data.frame(x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
y = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
d = c(10, 20, 30, 100, 11, 12, 403, 43, 21))
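An alternative sketch that keeps x and y numeric (the table round trip above returns them as factors): build the enlarged grid directly and, for each new point, look up d at the nearest original position by clamping the coordinates into the original range. The helper clamp and the name tes.ext2 are introduced here for illustration.

```r
tes <- data.frame(x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  y = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                  d = c(10, 20, 30, 100, 11, 12, 403, 43, 21))

# Limit values to the interval [lo, hi]
clamp <- function(v, lo, hi) pmin(pmax(v, lo), hi)

# Enlarged grid: one extra row/column on each side (y varies fastest)
grid <- expand.grid(y = seq(min(tes$y) - 1, max(tes$y) + 1),
                    x = seq(min(tes$x) - 1, max(tes$x) + 1))

# Each grid point takes d from the closest point inside the original range
src_x <- clamp(grid$x, min(tes$x), max(tes$x))
src_y <- clamp(grid$y, min(tes$y), max(tes$y))
idx   <- match(paste(src_x, src_y), paste(tes$x, tes$y))

tes.ext2 <- data.frame(x = grid$x, y = grid$y, d = tes$d[idx])
```

tes.ext2 can be passed to the same ggplot() call as above; on this input it has the same 25 rows as the tes1 shown in the question.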

ggplot2 not working like I would like

I try to plot this visual with ggplot2:
ggplot(Trump_df, aes(x=freq.dif, y=Spacing)) +
geom_text(aes(size=TrumpT.txt, label=row.names(Trump_df),
colour=freq.dif), alpha=0.7, family="Arial") +
geom_text(data=Kasich_df, aes(x=freq.dif, y=Spacing,
label=row.names(Kasich_df), size=KasichT.txt, color=freq.dif),
alpha=0.7, family="Arial") +
geom_text(data=both_df, aes(x=freq.dif, y=Spacing,
label=row.names(both_df), size=TrumpT.txt, color=freq.dif),
alpha=0.7, family="Arial") +
scale_size(range=c(3,11)) +
scale_colour_gradient(low="red3", high="blue3", guide="none") +
scale_x_continuous(breaks=c(min(Kasich_df$freq.dif), 0, max(Trump_df$freq.dif)),
labels=c("Twitted More by Kasich","Twitted Equally","Twitted More by Trump")) +
scale_y_continuous(breaks=c(0), labels=c("")) +
labs(title="Conway's Word Cloud, Tweets (Trump -vs- Kasich)", x="", y="", size="Word Frequency") +
theme_bw() +
theme(panel.grid.major.y = element_line(),
panel.grid.minor.x = element_line(),
plot.title = element_text(family="Arial", size=18))
But get a plot with the axis labels stacked:
Here are my data frames:
dput(df)
>dput(df)
structure(list(TrumpT.txt = c(25, 34, 8, 3, 11, 4, 3, 4, 4, 16,
4, 8, 21, 16, 3, 21, 3, 10, 9, 11, 4, 6, 3, 4, 3, 6, 4, 3, 5,
55), KasichT.txt = c(6, 5, 3, 4, 3, 4, 6, 3, 4, 7, 29, 3, 3,
5, 4, 4, 3, 6, 22, 3, 3, 4, 3, 3, 3, 6, 5, 4, 6, 4), freq.dif = c(19,
29, 5, -1, 8, 0, -3, 1, 0, 9, -25, 5, 18, 11, -1, 17, 0, 4, -13,
8, 1, 2, 0, 1, 0, 0, -1, -1, -1, 51)), .Names = c("TrumpT.txt",
"KasichT.txt", "freq.dif"), row.names = c("america", "amp", "back",
"best", "can", "candidate", "check", "day", "done", "gopdebate",
"john", "join", "just", "make", "nation", "new", "next", "now",
"ohio", "one", "see", "state", "talk", "tax", "thanks", "time",
"today", "voting", "watch", "will"), class = "data.frame")
dput(df$freq.dif)
>dput(df$freq.dif)
c(19, 29, 5, -1, 8, 0, -3, 1, 0, 9, -25, 5, 18, 11, -1, 17, 0,
4, -13, 8, 1, 2, 0, 1, 0, 0, -1, -1, -1, 51)
dput(Trump_df)
>dput(Trump_df)
structure(list(TrumpT.txt = c(3, 3, 4, 3, 9, 4, 3, 5), KasichT.txt = c(4,
6, 29, 4, 22, 5, 4, 6), freq.dif = c(-1, -3, -25, -1, -13, -1,
-1, -1), Spacing = c(-0.4, 0.0249155652709305, -0.184168514423072,
-0.2, 0.0361329583451152, 0, 0.2, 0.4)), .Names = c("TrumpT.txt",
"KasichT.txt", "freq.dif", "Spacing"), row.names = c("best",
"check", "john", "nation", "ohio", "today", "voting", "watch"
), class = "data.frame")
dput(Kasich_df)
>dput(Kasich_df)
structure(list(TrumpT.txt = c(25, 34, 8, 11, 4, 16, 8, 21, 16,
21, 10, 11, 4, 6, 4, 55), KasichT.txt = c(6, 5, 3, 3, 3, 7, 3,
3, 5, 4, 6, 3, 3, 4, 3, 4), freq.dif = c(19, 29, 5, 8, 1, 9,
5, 18, 11, 17, 4, 8, 1, 2, 1, 51), Spacing = c(-0.171893353201449,
0.113299035839736, -0.5, -0.5, -0.333333333333333, 0.0629540653899312,
0.5, -0.161544232908636, 0.166435130592436, 0.193869082257152,
0.0795861765742302, 0.5, 0, -0.139369543548673, 0.333333333333333,
0.117351376824081)), .Names = c("TrumpT.txt", "KasichT.txt",
"freq.dif", "Spacing"), row.names = c("america", "amp", "back",
"can", "day", "gopdebate", "join", "just", "make", "new", "now",
"one", "see", "state", "tax", "will"), class = "data.frame")
dput(both_df)
>dput(both_df)
structure(list(TrumpT.txt = c(4, 4, 3, 3, 3, 6), KasichT.txt = c(4,
4, 3, 3, 3, 6), freq.dif = c(0, 0, 0, 0, 0, 0), Spacing = c(-0.833333333333333,
-0.5, -0.166666666666667, 0.166666666666667, 0.5, 0.833333333333333)), .Names = c("TrumpT.txt", "KasichT.txt", "freq.dif", "Spacing"),
row.names = c("candidate", "done", "next", "talk", "thanks",
"time"), class = "data.frame")
I would like to have "Twitted More by Kasich" on the far left, "Twitted Equally" in the middle, and "Twitted More by Trump" on the far right. I have tried messing with the scale, but what I'm doing doesn't seem to work. I'd appreciate any help.
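One likely cause, judging only from the dput() output above: Trump_df holds the negative freq.dif values and Kasich_df the positive ones, so min(Kasich_df$freq.dif) and max(Trump_df$freq.dif) both sit right next to zero and the three labels end up on top of each other. A small sketch with the freq.dif columns copied from the question (the names breaks_as_posted and breaks_spread are made up here):

```r
# freq.dif columns copied from the dput() output in the question
trump_dif  <- c(-1, -3, -25, -1, -13, -1, -1, -1)           # Trump_df$freq.dif
kasich_dif <- c(19, 29, 5, 8, 1, 9, 5, 18, 11, 17,
                4, 8, 1, 2, 1, 51)                          # Kasich_df$freq.dif

# Breaks as computed in the posted code: all three within one unit of zero
breaks_as_posted <- c(min(kasich_dif), 0, max(trump_dif))
breaks_as_posted   # 1 0 -1

# Breaks taken from the overall range instead: far left, middle, far right
breaks_spread <- c(min(trump_dif), 0, max(kasich_dif))
breaks_spread      # -25 0 51
```

Passing breaks_spread (or the equivalent expression on the data frames) as breaks in scale_x_continuous(), with the labels unchanged, should place the three labels at the left edge, the centre and the right edge, assuming freq.dif is the Trump count minus the Kasich count as the data suggests.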
