I have a data frame like this, called data_frame_test:
Value time group
3.96655960 0 184
-8.71308460 0 184
-11.11638947 0 184
-6.84213562 11 184
-1.25926609 11 184
-4.60649529 11 184
0.27577858 11 184
11.85394249 20 184
-0.27114563 20 184
1.73081284 20 184
1.78209915 20 184
11.34305840 20 184
13.49688263 20 184
-7.54752045 20 184
-13.63673286 25 184
-5.75711517 25 184
0.35823669 25 184
-2.45237694 25 184
0.49313087 0 66
-9.04148674 0 66
-15.50337906 0 66
-17.51445351 0 66
-10.66807098 0 66
-2.24337845 5 66
-13.79929533 5 66
1.33287125 5 66
2.22143402 5 66
11.46484833 10 66
23.26805916 10 66
9.07377968 10 66
4.28664665 10 66
data_frame_test <- structure(list(Value = c(3.9665596, -8.7130846, -11.11638947,
-6.84213562, -1.25926609, -4.60649529, 0.27577858, 11.85394249,
-0.27114563, 1.73081284, 1.78209915, 11.3430584, 13.49688263,
-7.54752045, -13.63673286, -5.75711517, 0.35823669, -2.45237694,
0.49313087, -9.04148674, -15.50337906, -17.51445351, -10.66807098,
-2.24337845, -13.79929533, 1.33287125, 2.22143402, 11.46484833,
23.26805916, 9.07377968, 4.28664665), time = c(0L, 0L, 0L, 11L,
11L, 11L, 11L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 25L, 25L, 25L,
25L, 0L, 0L, 0L, 0L, 0L, 5L, 5L, 5L, 5L, 10L, 10L, 10L, 10L),
group = c(184L, 184L, 184L, 184L, 184L, 184L, 184L, 184L,
184L, 184L, 184L, 184L, 184L, 184L, 184L, 184L, 184L, 184L,
66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L, 66L,
66L)), .Names = c("Value", "time", "group"), class = "data.frame", row.names = c(NA,
-31L))
I want to plot a boxplot of a value for each time point and group.
ggplot(data_frame_test, aes(x=factor(time), y=Value, colour = factor(group))) +
geom_boxplot(outlier.size=0, fill = "white", position="identity", alpha=.5) +
scale_x_discrete(limits = seq(-1,26), breaks = seq(-1,26), labels = seq(-1,26))
This results in the following picture, which is almost right:
However, the x-axis labels and ticks are shifted. How do I put them where they belong?
You are trying to treat a factor like a numeric, which it isn't. Here is a better solution:
ggplot(data_frame_test, aes(x=factor(time, levels = seq(-1,26), ordered = TRUE),
y=Value, colour = factor(group))) +
geom_boxplot(outlier.size=0, fill = "white", position="identity", alpha=.5) +
scale_x_discrete(drop = FALSE)
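To illustrate what "treating a factor like a numeric" means, here is a minimal base R sketch. The vector name time_points is made up for this example. A factor stores integer codes for its levels, so positions on a discrete axis follow the level order rather than the numeric values, unless every intended position is declared as a level.

```r
# Hypothetical time points, similar to those in the question
time_points <- c(0, 11, 20, 25)

# A plain factor only knows the levels that occur in the data
f <- factor(time_points)
as.integer(f)                 # codes: 1 2 3 4 (positions, not values)
as.numeric(as.character(f))   # values: 0 11 20 25

# Declaring all levels from -1 to 26 gives each value its own slot
f_full <- factor(time_points, levels = seq(-1, 26))
nlevels(f_full)               # 28
as.integer(f_full)            # 2 13 22 27
```

With the full set of levels, a discrete scale with drop = FALSE has a slot for every value from -1 to 26, which is why the labels line up.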
I'm not quite sure why that is happening, but I would probably make the plot like this, since converting time to a factor is not intuitive to me:
ggplot(data_frame_test,
aes(x = time, y=Value, colour = factor(group), group = interaction(time, group))) +
geom_boxplot(outlier.size=0, fill = "white", position="identity", alpha=.5)
Which gives:
You can use scale_x_continuous to change the breaks and such.
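As a rough sketch of that last point, the following keeps time numeric and sets the breaks explicitly. The data frame toy and its values are made up for illustration, and the break spacing of 5 is only an assumption about what reads well.

```r
library(ggplot2)

# Made-up stand-in for data_frame_test
toy <- data.frame(
  Value = c(1, 2, 3, 4, 5, 6, 7, 8),
  time  = c(0, 0, 11, 11, 0, 0, 5, 5),
  group = c(184, 184, 184, 184, 66, 66, 66, 66)
)

p <- ggplot(toy,
            aes(x = time, y = Value, colour = factor(group),
                group = interaction(time, group))) +
  geom_boxplot(fill = "white", position = "identity", alpha = 0.5) +
  # ticks every 5 units on the numeric time axis
  scale_x_continuous(breaks = seq(0, 25, by = 5))

print(p)
```

Because time stays numeric, the boxes sit at their true positions and the ticks can be placed independently of where data exist.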
I am trying to use the $names component on my OutVals (outliers) to find the class each outlier is associated with, and then put the outliers and their class names inside a data frame so I can see clearly which class each outlier came from.
However, when trying to implement this, my class names are returned as "1", "2", etc. and not "Van", "Bus", etc. as they are in the dataset.
Have I missed something, or am I approaching this completely wrong?
The goal is to get the outliers in the data and place them in a table that shows which class each outlier came from.
Any help would be appreciated.
I have shown my data frame as well as my reproducible code below.
library(reshape2)
vehData <-
structure(
list(
Samples = 1:6,
Comp = c(95L, 91L, 104L, 93L, 85L,
107L),
Circ = c(48L, 41L, 50L, 41L, 44L, 57L),
D.Circ = c(83L,
84L, 106L, 82L, 70L, 106L),
Rad.Ra = c(178L, 141L, 209L, 159L,
205L, 172L),
Pr.Axis.Ra = c(72L, 57L, 66L, 63L, 103L, 50L),
Max.L.Ra = c(10L,
9L, 10L, 9L, 52L, 6L),
Scat.Ra = c(162L, 149L, 207L, 144L, 149L,
255L),
Elong = c(42L, 45L, 32L, 46L, 45L, 26L),
Pr.Axis.Rect = c(20L,
19L, 23L, 19L, 19L, 28L),
Max.L.Rect = c(159L, 143L, 158L, 143L,
144L, 169L),
Sc.Var.Maxis = c(176L, 170L, 223L, 160L, 241L, 280L),
Sc.Var.maxis = c(379L, 330L, 635L, 309L, 325L, 957L),
Ra.Gyr = c(184L,
158L, 220L, 127L, 188L, 264L),
Skew.Maxis = c(70L, 72L, 73L,
63L, 127L, 85L),
Skew.maxis = c(6L, 9L, 14L, 6L, 9L, 5L),
Kurt.maxis = c(16L,
14L, 9L, 10L, 11L, 9L),
Kurt.Maxis = c(187L, 189L, 188L, 199L,
180L, 181L),
Holl.Ra = c(197L, 199L, 196L, 207L, 183L, 183L),
Class = c("van", "van", "saab", "van", "bus", "bus")
),
row.names = c(NA,
6L), class = "data.frame")
#Remove outliers function
removeOutliers <- function(data) {
OutVals <- boxplot(data)$out
namesforgroups <- boxplot(OutVals)$names #get group name of the outliers
dataf <- as.data.frame(OutVals, col.names = namesforgroups)#dataframe of outlier + names
print(OutVals) # show all outliers
remOutliers <- sapply(data, function(x) x[!x %in% OutVals]) #remove outliers from data
return (remOutliers)
}
#Remove class column and sample number
vehDataRemove1 <- vehData[, -1]
vehDataRemove2 <- vehDataRemove1[,-19]
vehData <- vehDataRemove2 #assign to new variable
vehClass <- vehData$Class #store original class names
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData) #remove first set of outliers
removeOutliers2 <- removeOutliers(removeOutliers1) #test again for more and remove
Output data frame
The information about which row/class name the outlier is tied to is not provided in the boxplot object. You have to get it yourself. What is given is the column that the outlier came from, inside boxplot(data)$group, so you can use which to see which row it was from, and use that to get what class it is. I rewrote your function and it now prints a table of the outlier value, the column it came from, and the row/class it came from. There are 5 outliers from 3 rows in the first iteration, and no outliers in the second iteration - makes sense because they've been removed.
removeOutliers <- function(data, class) {
x=boxplot(data)
OutVals <- x$out
columns <- x$group #get group name of the outliers
ind=numeric()
classes=c()
if (length(columns) > 0) {
for (i in 1:length(columns)) {
rows=which(data[,columns[i]]==OutVals[i])
ind=union(ind, rows)
classes=c(classes, class[rows])
}
dt=data.frame(OutVals, columns, classes) # show all outliers
print(dt)
return (list(data[-ind,], class[-ind]))
}
return(list(data, class))
}
#Remove class column and sample number
vehData1 <- vehData[, -c(1,20)]
vehClass <- vehData$Class #store original class names
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData1, vehClass) #remove first set of outliers
OutVals columns classes
1 103 5 bus
2 52 6 bus
3 6 6 bus
4 127 14 bus
5 14 15 saab
removeOutliers2 <- removeOutliers(removeOutliers1[[1]], removeOutliers1[[2]])
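For reference, here is a small sketch of what the boxplot object contains and how an outlier can be traced back to its row. The data frame toy is made up; plot = FALSE only suppresses drawing.

```r
# Made-up data: column a has one obvious outlier (100)
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))

bp <- boxplot(toy, plot = FALSE)
bp$out     # outlier values: 100
bp$group   # index of the column each outlier came from: 1
bp$names   # column names: "a" "b"

# Recover the row of the first outlier by matching its value in its column
outlier_rows <- which(toy[, bp$group[1]] == bp$out[1])
outlier_rows   # 5
```

Note that matching by value returns every row holding that value in the column, so duplicated values would all be flagged.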
The first function returns a data frame with the outlier rows removed. The second function returns a table containing information about each outlier (the class, the column, and the value).
library(dplyr) # provides %>%, select() and bind_rows() used below
removeOutliers=function(data) {
x=boxplot(data %>% select(-Class), plot=FALSE)
outlierRows=c()
for (i in 1:length(x$out)) {
outlierRows=c(outlierRows, which(data[,x$group[i]]==x$out[i]))
}
return(data[-outlierRows,])
}
getOutliers=function(data) {
x=boxplot(data %>% select(-Class))
outlierInfo=data.frame()
for (i in 1:length(x$out)) {
rows=which(data[,x$group[i]]==x$out[i])
outlierInfo=bind_rows(outlierInfo, data.frame(class=data$Class[rows],
value=x$out[i],
column=names(data)[x$group[i]]))
}
return(outlierInfo)
}
removeOutliers(vehData)
Samples Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong Pr.Axis.Rect Max.L.Rect
1 1 95 48 83 178 72 10 162 42 20 159
2 2 91 41 84 141 57 9 149 45 19 143
4 4 93 41 82 159 63 9 144 46 19 143
Sc.Var.Maxis Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis Kurt.Maxis Holl.Ra Class
1 176 379 184 70 6 16 187 197 van
2 170 330 158 72 9 14 189 199 van
4 160 309 127 63 6 10 199 207 van
getOutliers(vehData)
class value column
1 bus 103 Pr.Axis.Ra
2 bus 52 Max.L.Ra
3 bus 6 Max.L.Ra
4 bus 127 Skew.Maxis
5 saab 14 Skew.maxis
Please help. I am trying to put all the columns on the x-axis and then make side-by-side bars grouped by date.
This is my data. I really tried, but to no avail.
dateVisited hh_visited hh_ind_confirmed new_in_mig out_mig deaths HOH_death Preg_Obs Preg_Outcome child_forms
102 2020-07-21 292 1170 131 86 18 7 3 14 79
103 2020-07-22 400 1553 115 100 25 10 11 18 107
104 2020-07-23 381 1458 103 67 21 9 5 23 87
105 2020-07-24 345 1379 90 98 12 4 3 20 89
106 2020-07-25 436 1585 131 119 13 2 7 20 117
107 2020-07-26 0 0 0 0 0 0 0 0 0
I think you're looking for something like this:
library(tidyr)
library(ggplot2)
df %>%
pivot_longer(cols = -1) %>%
ggplot(aes(name, value)) +
geom_col(aes(fill = dateVisited), width = 0.6,
position = position_dodge(width = 0.8)) +
guides(x = guide_axis(angle = 45))
Reproducible Data from question
df <- structure(list(dateVisited = structure(1:6, .Label = c("2020-07-21",
"2020-07-22", "2020-07-23", "2020-07-24", "2020-07-25", "2020-07-26"
), class = "factor"), hh_visited = c(292L, 400L, 381L, 345L,
436L, 0L), hh_ind_confirmed = c(1170L, 1553L, 1458L, 1379L, 1585L,
0L), new_in_mig = c(131L, 115L, 103L, 90L, 131L, 0L), out_mig = c(86L,
100L, 67L, 98L, 119L, 0L), deaths = c(18L, 25L, 21L, 12L, 13L,
0L), HOH_death = c(7L, 10L, 9L, 4L, 2L, 0L), Preg_Obs = c(3L,
11L, 5L, 3L, 7L, 0L), Preg_Outcome = c(14L, 18L, 23L, 20L, 20L,
0L), child_forms = c(79L, 107L, 87L, 89L, 117L, 0L)), class = "data.frame",
row.names = c("102", "103", "104", "105", "106", "107"))
Your data cannot be used easily since it takes time to format it into something that can be ingested by R. Here is something to get you started. I made up a hypothetical data frame of 4 columns that resemble your data, used the melt function from the reshape2 package to reshape the data into a form that ggplot2 understands, and used ggplot2 to generate a bar plot.
df <- data.frame(dateVisited = seq(as.Date('2019-01-01'), as.Date('2019-12-31'), 30),
hh_visited = runif(13, 0, 436),
hh_ind_confirmed = runif(13, 0, 1585),
new_in_mig = runif(13, 0, 131))
df <- reshape2::melt(df, id.vars = 'dateVisited')
ggplot(data = df, aes(x = dateVisited, y = value, fill = variable))+
geom_col(position = 'dodge')
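To make the reshaping step concrete, here is a minimal sketch of the wide-to-long conversion. It uses tidyr's pivot_longer, which plays the same role as reshape2::melt. The data frame wide is a two-row excerpt built by hand from the question's numbers.

```r
library(tidyr)

# Two dates and two measures taken from the question's table
wide <- data.frame(
  dateVisited = as.Date(c("2020-07-21", "2020-07-22")),
  hh_visited  = c(292, 400),
  deaths      = c(18, 25)
)

# Every column except dateVisited becomes a (name, value) pair
long <- pivot_longer(wide, cols = -dateVisited)
print(long)
```

The long table has one row per date and measure, which is the shape ggplot2 needs to map the measure to the x-axis or to fill.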
For a report I am summarizing data by a group. Due to copyright issues I have created some dummy data below (first column is group, then values):
X A B C D
1 1 12 0 12 0
2 2 24 0 15 0
3 3 56 0 48 0
4 4 89 0 96 0
5 5 13 3 65 0
6 6 11 16 0 0
7 7 25 19 0 0
8 8 24 98 0 0
9 9 18 111 0 0
10 10 173 125 0 0
11 11 10 65 0 0
I would like to create a barplot for every group (1:11) with a loop:
for(i in 1:11){x<-dummyloop[i,]
barplot(as.matrix(x), main=paste("Group", i), ylim=c(0,200))}
This works, I get a barplot for every iteration; however, they end up in one 4 by 4 plotting window, as if I had used par(mfrow=c(4,4)).
I need individual bar plots.
So I used par(mfrow=c(1,1)), which for some reason fixed the problem (I don't use par EVER, because I am only exporting for a scientific report featuring individual graphs), however the "main" is cut off on the top.
I would also like each bar to be a different color, so I used:
for(i in 1:11){x<-dummyloop[i,]
barplot(as.matrix(x), main=paste("Group", i), col=c(1:5),
ylim=c(0,200))}
Realizing that the coloring vector then only uses the first color, I tried variations of this:
for(i in 1:11){x<-dummyloop[i,]
barplot(as.matrix(x), main=paste("Group", i), col=c(4:10)[1:ncol(x)],
ylim=c(0,200))}
which doesn't do the trick...
I seem to be missing some key detail in the for loop here, thanks for help. I'm an R novice getting better every day thanks to the people here ;).
No idea why that happens in base plot. Here is an alternative way with ggplot2.
library(tidyr)
library(ggplot2)
# assumes the data frame from the question is named dummyloop
for(i in 1:11){x <- gather(dummyloop[i,])  # long format with columns key and value
print(ggplot(data = x, aes(x = key, y = value, fill = key)) +
geom_bar(stat = "identity", show.legend = FALSE) +
ggtitle(paste("Group ", i)) + theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,200))
}
So is your main still cut off?
Then extend the margin at the top of the plot by running this before plotting:
par(mar = c(2, 2, 3, 2)) # c(bottom, left, top, right)
You can reset your specifications with dev.off() when experimenting.
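On the colour question, one likely cause is that as.matrix(x) hands barplot a one-row matrix. For a matrix, barplot treats each column as a stacked bar and applies col to the segments within a bar, so with a single row only the first colour is ever used. A minimal sketch that passes a plain vector instead (the data frame dat_small is a made-up excerpt of the dummy data):

```r
# Made-up excerpt of the dummy data from the question
dat_small <- data.frame(X = 1:3, A = c(12, 24, 56), B = c(0, 0, 3),
                        C = c(12, 15, 48), D = c(0, 0, 0))

for (i in seq_len(nrow(dat_small))) {
  # unlist() turns the one-row data frame into a named numeric vector,
  # so barplot draws one independent bar per element
  heights <- unlist(dat_small[i, ])
  barplot(heights, main = paste("Group", i),
          col = seq_along(heights), ylim = c(0, 200))
}
```

With a vector as input, col is recycled across bars, so each bar gets its own colour.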
Staying base R, you simply could use by and set col according to the group.
colors <- rainbow(length(unique(dat$X))) # define colors, 11 in your case
by(dat, dat$X, function(x)
barplot(as.matrix(x), main=paste("Group", x$X), ylim=c(0, 200), col=colors[x$X]))
Data
dat <- structure(list(X = 1:11, A = c(12L, 24L, 56L, 89L, 13L, 11L,
25L, 24L, 18L, 173L, 10L), B = c(0L, 0L, 0L, 0L, 3L, 16L, 19L,
98L, 111L, 125L, 65L), C = c(12L, 15L, 48L, 96L, 65L, 0L, 0L,
0L, 0L, 0L, 0L), D = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11"))
I am curious as to why there is a difference in the frequencies created by these two methods (shown below), even though the same dataset is used.
First Method (cut(as.vector))
wd1<- apply(wd, 2, function(x) cut(((as.numeric(x))
+ 360/(16*2) )%% 360,seq(0,360,360/16) ,
c('N', 'NNE', 'NE', 'ENE', 'E', 'ESE', '
SE', 'SSE', 'S', 'SSW', 'SW', 'WSW',
'W', 'WNW', 'NW', 'NNW')))
wd2<- as.data.frame(table(wd1))
wd3<- transform(wd2, cumFreq = cumsum(Freq),
relative = prop.table(Freq))
and this yields
> wd3
wd1 Freq cumFreq relative
1 \nSE 2942 2942 0.01579292
2 E 11550 14492 0.06200144
3 ENE 5773 20265 0.03098998
4 ESE 5713 25978 0.03066790
5 N 11051 37029 0.05932276
6 NE 4725 41754 0.02536422
7 NNE 6196 47950 0.03326069
8 NNW 14880 62830 0.07987718
9 NW 18278 81108 0.09811795
10 S 6621 87729 0.03554212
11 SSE 3772 91501 0.02024844
12 SSW 10800 102301 0.05797537
13 SW 17004 119305 0.09127900
14 W 24903 144208 0.13368154
15 WNW 20603 164811 0.11059876
16 WSW 21475 186286 0.11527973
Second method(cut(wd,breaks=))
breaks1 <- apply(wd, 2, function(x) (cut(as.numeric(x), breaks=
(seq(0,360,360/16)))))
breaks2<- as.data.frame(table(breaks1))
breaks3<- transform(breaks2, cumFreq = cumsum(Freq),
relative = prop.table(Freq))
and this yields
> breaks3
breaks1 Freq cumFreq relative
1 (0,22.5] 8110 8110 0.04358036
2 (112,135] 3314 11424 0.01780830
3 (135,158] 3084 14508 0.01657236
4 (158,180] 5039 19547 0.02707786
5 (180,202] 8387 27934 0.04506886
6 (202,225] 14246 42180 0.07655312
7 (22.5,45] 5257 47437 0.02824932
8 (225,248] 19194 66631 0.10314198
9 (248,270] 24301 90932 0.13058525
10 (270,292] 22526 113458 0.12104700
11 (292,315] 19631 133089 0.10549027
12 (315,338] 16401 149490 0.08813335
13 (338,360] 13185 162675 0.07085167
14 (45,67.5] 4614 167289 0.02479405
15 (67.5,90] 9173 176462 0.04929256
16 (90,112] 9631 186093 0.05175369
The total frequency should be 186286, as in the first method, but it's not; I'm sure it is omitting some numbers. Also, the intervals do not all appear to be 22.5 wide (as 360/16 should give); only three bins do. Well, they are, but R is rounding off all but those three. Why is this?
The dput output is:
dput(head(wd))
structure(list(X1000mb = c(86L, 130L, 75L, 59L, 56L, 69L), X925mb = c(70L,
45L, 30L, 66L, 54L, 71L), X850mb = c(355L, 349L, 350L, 65L, 36L,
56L), X700mb = c(331L, 342L, 329L, 35L, 1L, 44L), X600mb = c(328L,
328L, 321L, 0L, 247L, 227L), X500mb = c(331L, 324L, 317L, 331L,
251L, 241L), X400mb = c(340L, 328L, 310L, 296L, 261L, 246L),
X300mb = c(336L, 334L, 328L, 295L, 259L, 262L), X250mb = c(334L,
333L, 348L, 300L, 259L, 279L), X200mb = c(336L, 330L, 356L,
331L, 257L, 282L), X150mb = c(333L, 327L, 346L, 342L, 277L,
279L), X100mb = c(317L, 326L, 325L, 318L, 260L, 274L), X70mb = c(323L,
326L, 332L, 306L, 277L, 276L), X50mb = c(350L, 4L, 352L,
328L, 305L, 311L), X30mb = c(5L, 42L, 32L, 15L, 29L, 12L),
X20mb = c(3L, 42L, 48L, 30L, 46L, 45L), X10mb = c(28L, 25L,
4L, 14L, 104L, 76L)), .Names = c("X1000mb", "X925mb", "X850mb",
"X700mb", "X600mb", "X500mb", "X400mb", "X300mb", "X250mb", "X200mb",
"X150mb", "X100mb", "X70mb", "X50mb", "X30mb", "X20mb", "X10mb"
), row.names = c(NA, 6L), class = "data.frame")
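A possible explanation, sketched under the assumption that the missing counts are directions exactly equal to 0 (the sample above does contain a 0): by default cut() builds right-closed intervals, so 0 falls outside (0,22.5] and becomes NA, and table() then drops it. The interval labels are also formatted with 3 significant digits by default (dig.lab = 3), which is a display matter only. The vector x below is made up for illustration.

```r
# Made-up wind directions, including the boundary values 0 and 360
x <- c(0, 10, 22.5, 100, 360)
breaks <- seq(0, 360, by = 360 / 16)

# Default: intervals are (a, b], so 0 belongs to no bin
default_bins <- cut(x, breaks = breaks)
sum(is.na(default_bins))      # 1

# include.lowest = TRUE closes the first interval: [0, 22.5]
# dig.lab = 4 prints breaks such as 112.5 in full
closed_bins <- cut(x, breaks = breaks, include.lowest = TRUE, dig.lab = 4)
sum(is.na(closed_bins))       # 0
levels(closed_bins)[1]        # "[0,22.5]"
```

The first method is not affected in the same way because it adds 360/(16*2) before taking the remainder, so a direction of 0 is shifted to 11.25 and lands inside the first bin.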
For a dataset like:
21 79
78 245
21 186
65 522
4 21
3 4
4 212
4 881
124 303
28 653
28 1231
7 464
7 52
17 102
16 292
65 837
28 203
28 1689
136 2216
7 1342
56 412
I need to find the number of associated patterns. For example 21-79 and 21-186 have 21 in common. So they form 1 pattern. Also 21 is present in 4-21. This edge also contributes to the same pattern. Now 4-881, 4-212, 3-4 have 4 in their edge. So also contribute to the same pattern. Thus edges 21-79, 21-186, 4-21, 4-881, 4-212, 3-4 form 1 pattern. Similarly there are other patterns. Thus we need to group all edges that have any 1 node common to form a pattern (or subgraph). For the dataset given there are total 4 patterns.
I need to write code (preferably in R) that will find such no. of patterns.
Since you're describing the data as subgraphs, why not use the igraph package, which is built for working with graphs? So here's your data in data.frame form:
dd <- structure(list(V1 = c(21L, 78L, 21L, 65L, 4L, 3L, 4L, 4L, 124L,
28L, 28L, 7L, 7L, 17L, 16L, 65L, 28L, 28L, 136L, 7L, 56L), V2 = c(79L,
245L, 186L, 522L, 21L, 4L, 212L, 881L, 303L, 653L, 1231L, 464L,
52L, 102L, 292L, 837L, 203L, 1689L, 2216L, 1342L, 412L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -21L))
We can treat each value as a vertex name so the data you provide is really like an edge list. Thus we create our graph with
library(igraph)
gg <- graph.edgelist(cbind(as.character(dd$V1), as.character(dd$V2)),
directed=F)
That defines the vertices and edges, resulting in the following graph (plot(gg)):
Now you wanted to know the number of "patterns" which are really represented as connected subgraphs in this data. You can extract that information with the clusters() command. Specifically,
clusters(gg)$no
# [1] 10
Which shows there are 10 clusters in the data you provided. But you only want the ones that have more than two vertices. That we can get with
sum(clusters(gg)$csize>2)
# [1] 4
Which is 4 as you were expecting.
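In more recent igraph releases the same steps can be written with graph_from_edgelist() and components(), which are the current names for the functions used above. A small sketch on a hand-picked subset of the edges (the matrix edges is chosen for illustration only):

```r
library(igraph)

# A few edges from the question, as a two-column character matrix
edges <- cbind(c("21", "21", "4", "3", "78"),
               c("79", "186", "21", "4", "245"))

g <- graph_from_edgelist(edges, directed = FALSE)
comp <- components(g)

comp$no               # number of connected components: 2
sum(comp$csize > 2)   # components with more than two vertices: 1
comp$membership       # which component each vertex belongs to
```

Here 21, 79, 186, 4 and 3 form one component and 78, 245 form another, so only the first counts as a "pattern" under the more-than-two-vertices rule.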