Plot observation number (label) in outlier points - r

I have this boxplot with outliers. I need to plot the row number of each outlier observation next to its point, to make it easy to go into the data set and find the value. Can somebody help me?
set.seed(1)
a <- runif(10,1,100)
b <-c("A","A","A","A","A","B","B","B","B","B")
t <- cbind(a,b)
bp <- boxplot(a~b)
text(x = 1, y = bp$stats[,1] + 2, labels = round(bp$stats[,1], 2))
text(x = 2, y = bp$stats[,2] + 2, labels = round(bp$stats[,2], 2))

What is the point of t <- cbind(a, b)? That makes a character matrix and converts your numbers to character strings, and you don't use it anyway. If you want a single data structure, use data.frame(a, b), which keeps a numeric and stores b as a character vector (or as a factor in R versions before 4.0.0). I do not get the plot you do with set.seed(1), so I'll provide slightly different data. Note the use of the pos= and offset= arguments in text(). Be sure to read the manual page to see what they are doing:
a <- c(99.19, 59.48, 48.95, 18.17, 75.73, 45.94, 51.61, 21.55, 37.41,
59.98, 57.91, 35.54, 4.52, 64.64, 75.03, 60.21, 56.53, 53.08,
98.52, 51.26)
b <- c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B")
bp <- boxplot(a~b)
text(x = 1, y = bp$stats[,1], labels = round(bp$stats[, 1], 2),
pos=c(1, 3, 3, 1, 3), offset=.2)
text(x = 2, y = bp$stats[, 2], labels = round(bp$stats[, 2], 2),
pos=c(1, 3, 3, 1, 3), offset=.2)
obs <- which(a %in% bp$out)
text(bp$group, bp$out, obs, pos=4)
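Matching on value with a %in% bp$out can mislabel points when the same number also occurs as a non-outlier in another group, or when the rows are not sorted by group. A minimal sketch of a per-group alternative, assuming the same a and b as above (out_rows is a name introduced here for illustration only):

```r
a <- c(99.19, 59.48, 48.95, 18.17, 75.73, 45.94, 51.61, 21.55, 37.41,
       59.98, 57.91, 35.54, 4.52, 64.64, 75.03, 60.21, 56.53, 53.08,
       98.52, 51.26)
b <- rep(c("A", "B"), each = 10)
bp <- boxplot(a ~ b)

# For each group, keep the row numbers whose values boxplot.stats() flags as outliers
out_rows <- lapply(split(seq_along(a), b), function(idx) {
  idx[a[idx] %in% boxplot.stats(a[idx])$out]
})

# Label each outlier with its row number, one box position at a time
for (i in seq_along(out_rows)) {
  rows <- out_rows[[i]]
  if (length(rows) > 0) text(x = i, y = a[rows], labels = rows, pos = 4)
}
```

With this data the sketch should flag row 1 in group A and rows 13 and 19 in group B, the same points the answer above labels.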

Related

R: Create every possible combinations of given columns

I have data in a data frame df. It lists sources and destinations together with their longitudes and latitudes.
I want to use df to generate df1, which contains every possible combination of source and destination, each paired with the matching source and destination longitudes and latitudes.
Source <- c("A", "B", "C", "D")
Destination <- c("A", "B", "C", "D")
Source_Latitude <- c(1, 2, 3, 4)
Source_Longitude <- c(-1, -2, -3, -4)
Dest_Latitude <- c(1, 2, 3, 4)
Dest_Longitude <- c(-1, -2, -3, -4)
df <- data.frame(Source, Source_Latitude, Source_Longitude, Destination,Dest_Latitude,Dest_Longitude)
Source <- c("A", "A", "A", "A", "B","B","B","B", "C","C","C","C", "D","D","D","D")
Destination <- c("A", "B", "C", "D","A", "B", "C", "D","A", "B", "C", "D","A", "B", "C", "D")
Source_Latitude <- c(1,1,1,1, 2, 2, 2, 2, 3,3,3,3, 4,4,4,4)
Source_Longitude <- c(-1,-1,-1,-1,-2,-2,-2,-2,-3,-3,-3,-3,-4,-4,-4,-4)
Dest_Latitude <- c(1, 2, 3, 4,1, 2, 3, 4,1, 2, 3, 4,1, 2, 3, 4)
Dest_Longitude <- c(-1, -2, -3, -4,-1, -2, -3, -4,-1, -2, -3, -4,-1, -2, -3, -4)
df1 <- data.frame(Source, Source_Latitude, Source_Longitude, Destination,Dest_Latitude,Dest_Longitude)
I tried using crossing() and expand.grid() without any success.
library(dplyr)
# Build every Source/Destination pair, then attach the coordinates for each side
expand.grid(Source = df$Source, Destination = df$Destination, stringsAsFactors = FALSE) %>%
  inner_join(select(df, contains("Source")), by = "Source") %>%
  inner_join(select(df, contains("Dest")), by = "Destination") %>%
  select(contains("Source"), contains("Dest")) %>% View()
As an additional observation, although the code works, I don't think it's best to keep sources and destinations in the same data frame, because the number of sources and destinations may differ. It would probably be better to have one data frame for each and adapt the code accordingly.
all_combination <- expand.grid(Source = df$Source, Destination = df$Destination) %>%
  inner_join(select(df, contains("Source")), by = "Source") %>%
  inner_join(select(df, contains("Dest")), by = "Destination") %>%
  distinct()
This worked for me. It took a while to figure out how to use the expand.grid() function.
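For comparison, the same table can be built in base R without dplyr. This is only a sketch, assuming the df from the question; it relies on merge() returning the Cartesian product when by = NULL, and the names src, dst and all_pairs are introduced here for illustration:

```r
df <- data.frame(
  Source = c("A", "B", "C", "D"),
  Source_Latitude = c(1, 2, 3, 4),
  Source_Longitude = c(-1, -2, -3, -4),
  Destination = c("A", "B", "C", "D"),
  Dest_Latitude = c(1, 2, 3, 4),
  Dest_Longitude = c(-1, -2, -3, -4)
)

# Keep sources and destinations in separate tables, as suggested above
src <- df[, c("Source", "Source_Latitude", "Source_Longitude")]
dst <- df[, c("Destination", "Dest_Latitude", "Dest_Longitude")]

# by = NULL asks merge() for every row of src paired with every row of dst
all_pairs <- merge(src, dst, by = NULL)
all_pairs <- all_pairs[order(all_pairs$Source, all_pairs$Destination), ]
rownames(all_pairs) <- NULL
```

Because src and dst are independent here, the approach still works when the number of sources and destinations differs.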

Most elegant way to convert lists into igraph object for plotting

I am new to igraph and it seems to be a very powerful (and therefore also complex) package.
I tried to convert the following lists into an igraph object.
graph <- list(s = c("a", "b"),
a = c("s", "b", "c", "d"),
b = c("s", "a", "c", "d"),
c = c("a", "b", "d", "e", "f"),
d = c("a", "b", "c", "e", "f"),
e = c("c", "d", "f", "z"),
f = c("c", "d", "e", "z"),
z = c("e", "f"))
weights <- list(s = c(3, 5),
a = c(3, 1, 10, 11),
b = c(5, 3, 2, 3),
c = c(10, 2, 3, 7, 12),
d = c(15, 7, 2, 11, 2),
e = c(7, 11, 3, 2),
f = c(12, 2, 3, 2),
z = c(2, 2))
Interpretation is as follows: s is the starting node, it links to nodes a and b. The edges are weighted 3 for s to a and 5 for s to b and so on.
I tried all kinds of functions from igraph but only got all kinds of errors. What is the most elegant and easy way to convert the above into an igraph object for plotting the graph?
Create an edgelist and then a graph from that. Assign the weights and plot it.
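library(igraph)
set.seed(123)  # the default plot layout is randomized; a seed makes the picture reproducible
# stack() returns (neighbour, node); swap the columns so each edge runs node -> neighbour
e <- as.matrix(stack(graph))[, 2:1]
g <- graph_from_edgelist(e)
E(g)$weight <- stack(weights)[[1]]
plot(g, edge.label = E(g)$weight)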
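An alternative sketch keeps each edge and its weight in the same row, so they cannot drift out of alignment. It assumes the graph and weights lists from the question; the edges data frame and its column names are introduced here for illustration, and the igraph part only runs if the package is installed:

```r
graph <- list(s = c("a", "b"),
              a = c("s", "b", "c", "d"),
              b = c("s", "a", "c", "d"),
              c = c("a", "b", "d", "e", "f"),
              d = c("a", "b", "c", "e", "f"),
              e = c("c", "d", "f", "z"),
              f = c("c", "d", "e", "z"),
              z = c("e", "f"))
weights <- list(s = c(3, 5),
                a = c(3, 1, 10, 11),
                b = c(5, 3, 2, 3),
                c = c(10, 2, 3, 7, 12),
                d = c(15, 7, 2, 11, 2),
                e = c(7, 11, 3, 2),
                f = c(12, 2, 3, 2),
                z = c(2, 2))

# One row per directed edge: the list name is the origin, each element a target
edges <- data.frame(
  from   = rep(names(graph), lengths(graph)),
  to     = unlist(graph, use.names = FALSE),
  weight = unlist(weights, use.names = FALSE)
)

if (requireNamespace("igraph", quietly = TRUE)) {
  # Columns after the first two become edge attributes, so weight is carried along
  g <- igraph::graph_from_data_frame(edges, directed = TRUE)
  plot(g, edge.label = igraph::E(g)$weight)
}
```

The edge table can also be inspected or filtered with ordinary data frame tools before it is turned into a graph.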

Kruskal-Wallis test: create lapply function to subset data.frame?

I have a data set of values (val) grouped by multiple categories (distance & phase). I would like to run a Kruskal-Wallis test within each category, where val is the dependent variable, distance is a factor, and phase splits my data into 3 groups.
So I need to specify a subset of the data in the Kruskal-Wallis test and then apply the test to each group. But I cannot get my subsetting to work!
The R help says that subset is "an optional vector specifying a subset of observations to be used". But how do I correctly pass this to my lapply call?
My dummy data:
# create data
val<-runif(60, min = 0, max = 100)
distance<-floor(runif(60, min=1, max=3))
phase<-rep(c("a", "b", "c"), 20)
df<-data.frame(val, distance, phase)
# get unique groups
ii<-unique(df$phase)
# get basic statistics per group
aggregate(val ~ distance + phase, df, mean)
# run Kruskal test, specify the subset
kruskal.test(df$val ~df$distance,
subset = phase == "c")
This works well, so my subset should be correctly set as a vector.
But how to use this in a lapply function?
# DOES not work!!
lapply(ii, kruskal.test(df$val ~ df$distance,
subset = df$phase == as.character(ii)))
My overall goal is to create a function from kruskal.test, and save all statistics for each group into one table.
All help is highly appreciated.
Usually you would start by splitting, and then lapplying.
Something like
lapply(split(df, df$phase), function(d) { kruskal.test(val ~ distance, data=d) })
would yield a list, indexed by the phase, of the results of kruskal.test.
Your final expression does not work because lapply expects a function, and calling kruskal.test does not return a function; it returns the result of running the test. If you wrap it in a function definition that takes the index, it works, though it is a little less idiomatic.
lapply(ii, function(i) { kruskal.test(df$val ~ df$distance, subset=df$phase==i )})
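To get from that list to the single table the question asks for, the pieces of each test result can be pulled out and row-bound. A minimal sketch, assuming dummy data shaped like the question's (the seed and the alternating distance values are assumptions made here so that every phase contains both distances; results is an illustrative name):

```r
set.seed(1)  # assumed seed, only so the dummy data is reproducible
val <- runif(60, min = 0, max = 100)
distance <- rep(1:2, times = 30)
phase <- rep(c("a", "b", "c"), 20)
df <- data.frame(val, distance, phase)

# Run the test once per phase and keep the key numbers as a one-row data frame
results <- do.call(rbind, lapply(split(df, df$phase), function(d) {
  kt <- kruskal.test(val ~ distance, data = d)
  data.frame(phase     = d$phase[1],
             statistic = unname(kt$statistic),
             df        = unname(kt$parameter),
             p.value   = kt$p.value)
}))
rownames(results) <- NULL
results
```

Each row of results then holds the statistic, degrees of freedom and p-value for one phase.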
Though it is late, this might help someone with the same problem, so here is an answer using the tidyverse and rstatix packages. The rstatix package "provides a simple and intuitive pipe friendly framework, coherent with the 'tidyverse' design philosophy for performing basic statistical tests".
library(rstatix)
library(tidyverse)
df %>%
group_by(phase) %>%
kruskal_test(val ~ distance)
Output
# A tibble: 3 x 7
  phase .y.       n statistic    df     p method
* <chr> <chr> <int>     <dbl> <int> <dbl> <chr>
1 a     val      20    0.230      1 0.631 Kruskal-Wallis
2 b     val      20    0.0229     1 0.88  Kruskal-Wallis
3 c     val      20    0.322      1 0.570 Kruskal-Wallis
which is the same as the result from @user295691's answer.
Data
df = structure(list(val = c(93.8056977232918, 31.0681172646582, 40.5262873973697,
47.6368983509019, 65.23181500379, 64.4571609096602, 10.3301600087434,
90.4661140637472, 41.2359046051279, 28.3357713604346, 49.8977075796574,
10.8744730940089, 5.31001624185592, 71.9248640118167, 99.0267782937735,
73.7928744405508, 3.31214582547545, 40.2693636715412, 27.6980920461938,
79.501334275119, 60.5167196830735, 89.9171086261049, 87.4633299885318,
43.1893823202699, 91.1248738644645, 99.755659350194, 7.25280269980431,
96.957387868315, 75.0860505970195, 52.3794749286026, 26.6221587313339,
52.5518182432279, 24.1361060412601, 49.5364486705512, 65.5214034719393,
38.9469220302999, 0.687191751785576, 19.3090825574473, 19.6511475136504,
25.5966754630208, 7.33999472577125, 33.9820940745994, 50.3751677693799,
10.811762069352, 17.2359711956233, 53.958406439051, 64.2723652534187,
92.7404976682737, 26.824192632921, 30.0975760444999, 52.0105463219807,
74.4495407678187, 56.0636054025963, 91.891074879095, 14.0827904455364,
59.3607738381252, 66.5170294465497, 24.1726311156526, 83.0881901318207,
35.5380675755441), distance = c(2, 1, 1, 1, 1, 2, 1, 2, 2, 1,
2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1,
1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,
1, 2, 1, 1, 2, 2, 2, 2), phase = c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b",
"c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c")), class = "data.frame", row.names = c(NA, -60L))

rgl segments3d to connect 3d scatter points in order to plot a skeleton

I am working with motion capture data, and wish to plot two skeletons in 3D (motion capture data obtained from two different systems).
I have managed to plot and label the joints, but I can't figure out how to connect the joints with lines.
A short explanation of the abbreviations used in the sample dataset below:
RA and LA (Right and Left Ankle)
RK and LK (Right and Left Knee)
RH and LH (Right and Left Hip)
CG (Center of Gravity)
Simplified data set:
df <- data.frame(
Joint = c("LA", "RA", "LK", "RK", "LH", "RH", "CG", "LA", "RA", "LK", "RK", "LH", "RH", "CG"),
system = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"),
x = c(0, 10, 0, 10, 0, 10, 5, 0, 10, 0, 10, 0, 10, 5),
y = c(0,0,0,0,0,0,0, 20,20,20,20,20,20,20),
z = c(0, 0, 20, 20, 40, 40, 50, 0, 0, 20, 20, 40, 40, 50))
My code so far to plot and label the joints from the two systems:
library(rgl)
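# system is a character column, so convert it to a factor before using it as a colour index
with(df, plot3d(x, y, z, type = "s", col = as.numeric(factor(system))))
with(df, text3d(x, y, z, texts = Joint, adj = 2))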
Can you help me connect the joints?
Use the segments3d function to draw line segments. It takes the usual x, y, z coordinates, and joins pairs of points. So you'll need to work out which joints are joined, and plot segments between those joints.
If the joints are always in the order you gave, it would go something like this:
segs <- c(1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 7)
segments3d(df[segs, 3:5])
(This just does the system A segments.)
Edited to add: In response to the first comment: You will need to tell R that ankles connect to knees, etc, but you can do that. For example:
segs <- c()
for (s in unique(df$system)) {
  # Left ankle to left knee
  seg <- with(df, c(which(system == s & Joint == "LA"),
                    which(system == s & Joint == "LK")))
  if (length(seg) == 2)
    segs <- c(segs, seg)
  # Left knee to centre of gravity
  seg <- with(df, c(which(system == s & Joint == "LK"),
                    which(system == s & Joint == "CG")))
  if (length(seg) == 2)
    segs <- c(segs, seg)
  # etc for the other side
}
segments3d(df[segs, 3:5])
This could all be compressed if you have the connections arranged in an R object somehow. I'll leave that to you to work out.
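One possible way to arrange the connections in an R object is a two-column matrix of joint names, one row per bone. This is only a sketch under that assumption: bones is a hypothetical name, the joint pairs follow the first segs example above, and the drawing call is guarded so the index computation runs even without rgl:

```r
df <- data.frame(
  Joint = rep(c("LA", "RA", "LK", "RK", "LH", "RH", "CG"), 2),
  system = rep(c("A", "B"), each = 7),
  x = rep(c(0, 10, 0, 10, 0, 10, 5), 2),
  y = rep(c(0, 20), each = 7),
  z = rep(c(0, 0, 20, 20, 40, 40, 50), 2))

# Each row names the two joints that one segment connects (assumed skeleton)
bones <- rbind(c("LA", "LK"), c("LK", "LH"), c("LH", "CG"),
               c("RA", "RK"), c("RK", "RH"), c("RH", "CG"))

# For every system, look up the row numbers of both ends of each bone
segs <- unlist(lapply(split(seq_len(nrow(df)), df$system), function(rows) {
  from <- rows[match(bones[, 1], df$Joint[rows])]
  to   <- rows[match(bones[, 2], df$Joint[rows])]
  ok   <- !is.na(from) & !is.na(to)       # skip bones with a missing joint
  as.vector(rbind(from[ok], to[ok]))      # interleave as start, end, start, end, ...
}), use.names = FALSE)

if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::segments3d(df[segs, c("x", "y", "z")])
}
```

Adding or removing a connection then only means editing a row of bones, and the lookup by name no longer depends on the joints appearing in a fixed order.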

Using matplot in R whenever certain column changes

Sorry in advance because I am new at asking questions here and don't know how to input this table properly.
Say I have a data frame in R constructed like:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind(team, value)
I want to create a plot that will give me 3 lines graphing the values for teams A, B, and C. I believe I can do this inputting the matrix m into matplot somehow, but I'm not sure how.
EDIT: I've gotten a lot closer to solving my problem. However I've realized that for some reason, with the code I have, "Value" is a list of 745 which matches the number of rows in my dataframe m. However when I unlist(Value) it turns into a numeric of length 894. Any ideas on why this would happen?
You can try something like this:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind.data.frame(team, value)
library(ggplot2)
ggplot(m, aes(x=as.factor(1:nrow(m)), y=value, group=team, col=team)) +
geom_line(lwd=2) + xlab('index')
If you have the same number of ordered values for each team, you could use matplot to visualize them, but the data must be converted to a matrix first:
m = cbind.data.frame(team, value, index = rep(1:3, 3))
m <- reshape(m, v.names = 'value', idvar = 'team', direction = 'wide', timevar = 'index')
matplot(t(m[, 2:4]), type = 'l', lty = 1)
legend('top', legend = m[, 1], lty = 1, col = 1:3)
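A shorter route to the same matrix, sketched here on the assumption that every team has the same number of values and that they are already in plotting order, is to split value by team and bind the pieces as columns (mat is an illustrative name):

```r
team <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# One column per team; this only lines up if all teams have equally many values
mat <- do.call(cbind, split(value, team))

# matplot() draws one line per column
matplot(mat, type = "l", lty = 1, col = 1:3, xlab = "index", ylab = "value")
legend("topleft", legend = colnames(mat), lty = 1, col = 1:3)
```

If the teams have different numbers of rows, the columns no longer line up and the reshape or ggplot2 approach above is the safer choice.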
