I have a table of 983 obs. of 27 variables; the data can be provided if need be, but I do not believe there is a need for it, as the following crosstable should summarise it well enough:
Kjønn Antall <> e f g s ug
Sex Count w d m s um
k 282 2 26 5 41 208
m 701 11 56 4 148 2 480
Abbreviations (with English translation):
e[nkemann], f[raskilt], g[ift], s[eparert], ug[ift]
w[idow(er)], d[ivorced], m[arried], s[eparated], u[n]m[arried]
I would like to create a variable-width boxplot showing the distribution of these individuals, but as can be seen from the table, the NAs, the divorced and the separated are such small groups that their boxes would be hardly legible (and pointless). How can I join these groups to create a boxplot showing e, f+s, g, and ug?
My current code:
# The basis for the boxplot
dBox_SexAge <- ggplot(data = tblHoved) +
geom_boxplot(
mapping = aes(colour = KJONN, x = KJONN, y = 1875-FAAR),
notch = TRUE,
lwd = .5, fatten = .125,
varwidth = TRUE
)
# Create the final boxplot
dBox_SexAgeMStat <- dBox_SexAge +
facet_grid(SIVST ~ .) +
coord_flip()
# Run it
dBox_SexAgeMStat
Current plot, from which I would like to group f and s:
Create a sample data frame
tblHoved <- data.frame(FAAR = rnorm(10),
SIVST = rep(c("e", "f", "g", "s", "ug"),2),
stringsAsFactors = FALSE)
tblHoved
# FAAR SIVST
# 1 0.22499630 e
# 2 1.10236362 f
# 3 0.02220001 g
# 4 0.19062022 s
# 5 0.05103136 ug
# 6 0.09280887 e
# 7 -0.70574835 f
# 8 0.39331232 g
# 9 0.24817094 s
# 10 0.66631994 ug
Merge f and s
tblHoved$SIVST[tblHoved$SIVST %in% c("f","s")] <- "f+s"
tblHoved
# FAAR SIVST
# 1 0.22499630 e
# 2 1.10236362 f+s
# 3 0.02220001 g
# 4 0.19062022 f+s
# 5 0.05103136 ug
# 6 0.09280887 e
# 7 -0.70574835 f+s
# 8 0.39331232 g
# 9 0.24817094 f+s
# 10 0.66631994 ug
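If SIVST is stored as a factor rather than a character vector, the same merge can be done on the levels instead. A minimal sketch, assuming a toy factor named sivst (a hypothetical stand-in, not the original data):

```r
# Toy marital-status factor (hypothetical values, for illustration only)
sivst <- factor(c("e", "f", "g", "s", "ug", "f"))

# Assigning a named list to levels() maps old levels onto new ones;
# here "f" and "s" are both mapped to the new level "f+s"
levels(sivst) <- list(e = "e", "f+s" = c("f", "s"), g = "g", ug = "ug")

merged <- as.character(sivst)
table(sivst)
```

The merged factor can then be used in facet_grid() in the same way as the original column.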
I have an undirected network with edge weights.
Each node has an attribute "group".
I want to change the existing weights by discounting the edges between nodes within the same group.
For instance,
if node1 and node2 have an edge weight of 10 and both have the attribute "group" equal to "A", then I want to divide the weight by, let's say, 2; if not (they belong to different groups), the weight shall remain 10.
I am not sure how to proceed.
I can view the weights using E(g)$weight and the attributes with V(g)$group, but I can't think of a way to use the vertex attributes to change the edge weights.
You can use the igraph::%--% operator to help. As in your example: if both vertices are in group A, divide weight by 2
v_a <- which(V(g)$group == "A") # All vertices in group A
e_a <- E(g)[v_a %--% v_a] # All edges between those vertices
edge_attr(g)$weight[e_a] <- edge_attr(g)$weight[e_a] / 2
If each group is handled in a similar way, you can use a for-loop. In this example, you divide the weight by some constant, depending on which group its vertices are in.
group_constants <- c("A" = 2, "B" = 20, "C" = 200)
for (i in unique(V(g)$group)) {
v_a <- which(V(g)$group == i)
e_a <- E(g)[v_a %--% v_a]
edge_attr(g)$weight[e_a] <- edge_attr(g)$weight[e_a] / group_constants[i]
}
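If the edge list is still available as a data frame, the same discount can also be applied before the graph is built. This is a minimal base R sketch and not the igraph approach above; the data frames edges and verts and their contents are hypothetical:

```r
# Hypothetical toy edge list and vertex table (not the data from the answer)
edges <- data.frame(
  from   = c("a", "a", "b", "c"),
  to     = c("b", "c", "c", "d"),
  weight = c(10, 10, 8, 6)
)
verts <- data.frame(
  name  = c("a", "b", "c", "d"),
  group = c("A", "A", "B", "B")
)

# Look up the group of each endpoint by vertex name
group_from <- verts$group[match(edges$from, verts$name)]
group_to   <- verts$group[match(edges$to, verts$name)]

# Halve the weight only where both endpoints share a group
same_group <- group_from == group_to
edges$weight[same_group] <- edges$weight[same_group] / 2
edges
```

This handles every group in one pass, at the cost of using a single discount factor for all of them.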
Here's the data I used. It's an undirected graph with vertex groups and edge weights.
verts <- letters[1:10]
g_df <- data.frame(
from = sample(verts, 15, replace = TRUE),
to = sample(verts, 15, replace = TRUE),
weight = sample(1:10, 15, replace = TRUE)
)
g_df_v <- data.frame(
name = verts,
group = sample(LETTERS[1:3], 10, replace = TRUE)
)
g <- graph_from_data_frame(g_df, directed = FALSE, vertices = g_df_v)
# Edges
# from to weight
# 1 h i 5
# 2 i e 9
# 3 d i 8
# 4 c a 10
# 5 f c 7
# 6 a b 5
# 7 j i 1
# 8 j b 8
# 9 a f 4
# 10 h i 7
# 11 g a 7
# 12 f b 6
# 13 e c 8
# 14 b f 4
# 15 g c 8
# Vertices
# name group
# 1 a C
# 2 b B
# 3 c C
# 4 d B
# 5 e A
# 6 f A
# 7 g A
# 8 h B
# 9 i B
# 10 j B
I am looking for a way to find clusters of size 2 (pairs).
Is there a simple way to do that?
Imagine I have some kind of data where I want to match on x and y, like
library(cluster)
set.seed(1)
df = data.frame(id = 1:10, x_coord = sample(10,10), y_coord = sample(10,10))
I want to find the closest pairs, based on the distances between x_coord and y_coord:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
I get a dendrogram like the one below. What I would like is for the pairs (9,10), (1,3), (6,7) and (4,5) to be grouped together, and for cases 8 and 2 to be left unmatched and removed.
Maybe there is a more effective alternative for doing this than clustering.
Ultimately, I would like to remove the unmatched ids, keep the pairs, and end up with a dataset like this one:
id x_coord y_coord pair_id
1 9 3 1
3 7 5 1
4 1 8 2
5 2 2 2
6 5 6 3
7 3 10 3
9 6 4 4
10 8 7 4
You could use the element h$merge. Any row of this two-column matrix in which both entries are negative represents a merge of two singletons, i.e. a pair. Therefore you can do:
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)), , drop = FALSE]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
df
#> id x_coord y_coord pair
#> 1 1 9 3 4
#> 3 3 7 5 4
#> 4 4 1 8 1
#> 5 5 2 2 1
#> 6 6 5 6 2
#> 7 7 3 10 2
#> 9 9 6 4 3
#> 10 10 8 7 3
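To see why rows with two negative entries identify pairs, it can help to inspect h$merge on a tiny example. This is a sketch on made-up one-dimensional data; the data frame pts is hypothetical:

```r
# Five made-up points: two tight pairs and one far-away singleton
pts <- data.frame(id = 1:5, x = c(0, 1, 10, 12, 50))
h <- hclust(dist(pts$x))

# In h$merge, negative entries are original observations and
# positive entries refer to clusters formed in earlier rows
h$merge

# Keep only the rows where two original observations were merged directly
is_pair <- rowSums(h$merge < 0) == 2
pairs <- -h$merge[is_pair, , drop = FALSE]

# Same indexing trick as above: position in c(pairs) -> row number of the pair
pts$pair <- (match(pts$id, c(pairs)) - 1) %% nrow(pairs) + 1
pts
```

Observation 5 only ever joins an existing cluster, so it never appears in a row with two negative entries and ends up with NA.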
Note that the pair numbers follow the order in which the pairs were merged, i.e. ascending "height" on the dendrogram. If you want them in ascending order of their first appearance in the data frame, you can add the line
df$pair <- as.numeric(factor(df$pair, levels = unique(df$pair)))
Anyway, if we repeat your plotting code on our newly modified df, we can see there are no unpaired singletons left:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
And we can see the method scales nicely:
df = data.frame(id = 1:50, x_coord = sample(50), y_coord = sample(50))
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)), , drop = FALSE]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
I would like to display a histogram showing the distribution of school grades.
The dataframe looks like:
> print(xls)
# A tibble: 103 x 2
X__1 X__2
<dbl> <chr>
1 3 w
2 1 m
3 2 m
4 1 m
5 1 w
6 0 m
7 3 m
8 1 w
9 0 m
10 5 m
I create the histogram with:
hist(xls$X__1, main='Grade distribution', xlab='Grade (0 = no assessment)', ylab='Count')
It looks like:
Why are there spaces between 1,2,3 but not between 0 & 1?
Thanks, BR Bernd
Use ggplot2 for that, and your bars will be aligned:
library(ggplot2)
ggplot(xls, aes(x = X__1)) + geom_histogram(binwidth = 1)
You can try
barplot(table(xls$X__1))
or try
# centre one bin on each grade so that 0 and 1 do not fall into the same bin
h <- hist(xls$X__1, xaxt = "n", breaks = seq(min(xls$X__1) - 0.5, max(xls$X__1) + 0.5, by = 1))
axis(side = 1, at = h$mids, labels = seq(min(xls$X__1), max(xls$X__1)))
and using ggplot
ggplot(xls, aes(X__1)) +
geom_histogram(binwidth = 1, color=2) +
scale_x_continuous(breaks = seq(min(xls$X__1), max(xls$X__1)))
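The uneven gaps most likely come from how hist() bins integer data. By default the bins are right-closed and the lowest value is included in the first bin, so when a break falls exactly on a grade, that grade is counted in the bin to its left. A minimal sketch on made-up grades (the vector grades is hypothetical) that compares breaks on the integers with breaks centred on them:

```r
# Made-up grades, for illustration only
grades <- c(0, 0, 1, 1, 1, 2, 3, 3, 5)

# Breaks on the integers: the first bin is [0, 1], so 0s and 1s are merged
unit_bins <- hist(grades, breaks = 0:5, plot = FALSE)
unit_bins$counts

# Breaks half-way between the integers: one bin per grade
centred <- hist(grades, breaks = seq(-0.5, 5.5, by = 1), plot = FALSE)
centred$counts
centred$mids
```

With the centred breaks, the counts agree with table(grades) and the bin midpoints are the grades themselves.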
I have a df in semi-long format with id variables a and b and measured data in columns m1 and m2. The type of data is specified by the variable v (values var1 and var2).
set.seed(8)
df_l <-
data.frame(
a = rep(sample(LETTERS,5),2),
b = rep(sample(letters,5),2),
v = c(rep("var1",5),rep("var2",5)),
m1 = sample(1:10,10,F),
m2 = sample(20:40,10,F))
It looks like this:
a b v m1 m2
1 W r var1 3 40
2 N l var1 6 32
3 R a var1 9 28
4 F g var1 5 21
5 E u var1 4 38
6 W r var2 1 35
7 N l var2 8 33
8 R a var2 10 29
9 F g var2 7 30
10 E u var2 2 23
If I want to make a wide format of the values in m1, using id a as rows and the values in v as columns, I do:
> reshape2::dcast(df_l, a~v, value.var="m1")
a var1 var2
1 E 4 2
2 F 5 7
3 N 6 8
4 R 9 10
5 W 3 1
How do I write a function that does this, where the arguments to dcast (row, column and value.var) are supplied as function arguments? Something like:
fun <- function(df,row,col,val){
require(reshape2)
res <-
dcast(df, row~col, value.var=val)
return(res)
}
I checked SO here and here and tried variations of match.call and eval(substitute()) in order to "get" the arguments inside the function, and also tried the lazyeval package. No success.
What am I doing wrong here? How do I get dcast to recognize the variable names?
The formula argument also accepts character input.
foo <- function(df, id, measure, val) {
dcast(df, paste(paste(id, collapse = " + "), "~",
paste(measure, collapse = " + ")),
value.var = val)
}
require(reshape2)
foo(df_l, "a", "v", "m1")
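If you prefer to pass a real formula object rather than a string, the string can be converted with as.formula() first. This is a minimal sketch; the helper make_cast_formula() and the toy data frame are hypothetical, and the dcast call only runs if reshape2 is installed:

```r
# Hypothetical helper: build "row1 + row2 ~ col1 + col2" from character vectors
make_cast_formula <- function(rows, cols) {
  as.formula(paste(paste(rows, collapse = " + "), "~",
                   paste(cols, collapse = " + ")))
}

f <- make_cast_formula(c("a", "b"), "v")
f

# Toy data in the same semi-long layout as df_l (values are made up)
toy <- data.frame(
  a  = rep(c("W", "N"), 2),
  b  = rep(c("r", "l"), 2),
  v  = rep(c("var1", "var2"), each = 2),
  m1 = c(3, 6, 1, 8)
)

# Only attempt the cast when reshape2 is available
if (requireNamespace("reshape2", quietly = TRUE)) {
  wide <- reshape2::dcast(toy, f, value.var = "m1")
  print(wide)
}
```

Either form should behave the same here; the formula object just makes it easier to inspect what will be passed on.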
Note that data.table's dcast (current development) can also cast multiple value.var columns directly. So, you can also do:
require(data.table) # v1.9.5
foo(setDT(df_l), "a", "v", c("m1", "m2"))
# a m1_var1 m1_var2 m2_var1 m2_var2
# 1: F 1 6 28 21
# 2: H 9 2 38 29
# 3: M 5 10 24 35
# 4: O 8 3 23 26
# 5: T 4 7 31 39
If I have a data set like this
set.seed(100)
data <- data.frame("x" = c(1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5),
"y" = rnorm(13),
"factor" = c("a","b","c","a","b", "c", "c", "a",
"b", "c", "a", "b","c"))
so it looks like this
x y factor
1 1 -0.50219235 a
2 1 0.13153117 b
3 1 -0.07891709 c
4 2 0.88678481 a
5 2 0.11697127 b
6 2 0.31863009 c
7 3 -0.58179068 c
8 4 0.71453271 a
9 4 -0.82525943 b
10 4 -0.35986213 c
11 5 0.08988614 a
12 5 0.09627446 b
13 5 -0.20163395 c
I would like to plot this with a separate smoother for each level of factor (a, b, c):
library(ggplot2)
ggplot(data = data, aes(x = x, y = y, col = factor)) +
geom_smooth(aes(group = factor))
However, since there are no values for "a" and "b" at x = 3, I would like the smoothers for "a" and "b" to have a break at x = 3. What's the best strategy to accomplish that?
I would create an expansion of the combinations of x and factor and then do a database-like join on the combinations and the data. For example, first I form a new data frame df with the combinations of the unique values of x and factor
df <- expand.grid(lapply(data[, c("x", "factor")], unique))
> df
x factor
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
6 1 b
7 2 b
8 3 b
9 4 b
10 5 b
11 1 c
12 2 c
13 3 c
14 4 c
15 5 c
Then we can simply perform a join operation on df and your data, requesting that we return all the rows from the left-hand side (the x argument, hence df) and include the corresponding values of y from the right-hand side (data). Where there is no corresponding row on the right-hand side (in data), we will get an NA.
newdf <- merge(df, data, all.x = TRUE)
> newdf
x factor y
1 1 a -0.50219235
2 1 b 0.13153117
3 1 c -0.07891709
4 2 a 0.88678481
5 2 b 0.11697127
6 2 c 0.31863009
7 3 a NA
8 3 b NA
9 3 c -0.58179068
10 4 a 0.71453271
11 4 b -0.82525943
12 4 c -0.35986213
13 5 a 0.08988614
14 5 b 0.09627446
15 5 c -0.20163395
Now we can fit and predict from a loess model by hand, but this is a little tedious; easier options are available via mgcv::gam().
loessFun <- function(XX, span = 0.85) {
fit <- loess(y ~ x, data = XX, na.action = na.exclude, span = span)
predict(fit)
}
Now split the data by factor and apply the loessFun() wrapper
fits <- lapply(split(newdf, newdf$factor), loessFun)
newdf <- transform(newdf, fitted = unsplit(fits, factor))
> head(newdf)
x factor y fitted
1 1 a -0.50219235 -0.50219235
2 1 b 0.13153117 0.13153117
3 1 c -0.07891709 -0.07891709
4 2 a 0.88678481 0.88678481
5 2 b 0.11697127 0.11697127
6 2 c 0.31863009 0.31863009
We can then plot using the new data frame
ggplot(newdf, aes(x = x, y = fitted, col = factor)) +
geom_line(aes(group = factor))
which gives:
It looks a bit funky because of the very low resolution of the sample data you provided and because this method that I've used predicts at the observed data only, preserving NAs. geom_smooth() is actually predicting over the range of x for each group separately and as such there are no missing xs in the data used to draw the geom layer.
Unless you can explain how wide a region around x = 3 should get a break (an NA), this may well be the best that you can do. Alternatively, we could predict over the whole range from the models and then set anything in 2.5 < x < 3.5 back to NA. Add a comment if that is what you wanted, indicating how we are to envisage the gaps, and I'll expand my answer with an example of doing that.
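A minimal sketch of that second option, for a single group and on made-up data. The data frame toy, the prediction grid and the gap limits 2.5 and 3.5 are all assumptions chosen for illustration:

```r
set.seed(42)

# Made-up data for one group, with no observation at x = 3
toy <- data.frame(x = setdiff(seq(1, 10, by = 0.5), 3))
toy$y <- sin(toy$x) + rnorm(nrow(toy), sd = 0.1)

# Fit the smoother, then predict on a fine grid over the observed range
fit  <- loess(y ~ x, data = toy, span = 0.85)
grid <- seq(1, 10, by = 0.1)
pred <- predict(fit, newdata = data.frame(x = grid))

# Blank out the predictions inside the assumed gap around x = 3
gap <- grid > 2.5 & grid < 3.5
pred[gap] <- NA

smooth_df <- data.frame(x = grid, fitted = pred)
head(smooth_df[gap, ])
```

Because geom_line() does not connect across NA values, plotting smooth_df with fitted on the y axis would show the smoother with a break over the masked interval. For several groups, the same steps can be applied to each piece of split(newdf, newdf$factor).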