I was benchmarking different implementations of an algorithm as part of a project and wanted to visualize the results in R succinctly. My idea was to assign each version a different colour, with the x-axis of the plot representing size and the y-axis representing time. Please let me know how to plot these results in R. If there is another layout that would look better, I'd be happy to use that instead.
Size Version1 Version2 Version3 Version4
8192 2 1 1 1
65536 10 5 4 4
1048576 81 60 63 52
8388608 675 555 572 464
16777216 1334 1124 1171 953
33554432 2780 2348 2438 2014
67108864 5853 5229 4957 4238
134217728 12437 10303 10521 8921
Just putting @BenBolker's answer down here to close out the question. (If you would like, Ben, feel free to copy/paste this as your own.)
Here's your sample input:
mytab <- structure(list(Size = c(8192L, 65536L, 1048576L, 8388608L, 16777216L,
33554432L, 67108864L, 134217728L), Version1 = c(2L, 10L, 81L,
675L, 1334L, 2780L, 5853L, 12437L), Version2 = c(1L, 5L, 60L,
555L, 1124L, 2348L, 5229L, 10303L), Version3 = c(1L, 4L, 63L,
572L, 1171L, 2438L, 4957L, 10521L), Version4 = c(1L, 4L, 52L,
464L, 953L, 2014L, 4238L, 8921L)), .Names = c("Size", "Version1",
"Version2", "Version3", "Version4"), class = "data.frame", row.names = c(NA,
-8L))
and here's how to make the plot:
library(reshape2)
library(ggplot2)

d <- melt(mytab, "Size")
ggplot(d, aes(x = Size, y = value, colour = variable)) +
  geom_point() +
  geom_line() +
  scale_x_log10() + scale_y_log10()
which gives a log-log plot of time against size, with points and lines in one colour per version.
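If you prefer the current tidyverse reshaping tools (reshape2 is superseded), a pivot_longer() version of the same plot would look like this; a sketch assuming mytab as defined above:

library(tidyr)
library(ggplot2)

# Reshape to long format: one row per (Size, version) pair
d <- pivot_longer(mytab, -Size, names_to = "variable", values_to = "value")
ggplot(d, aes(x = Size, y = value, colour = variable)) +
  geom_point() +
  geom_line() +
  scale_x_log10() + scale_y_log10()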
My input data df is:
Action Difficulty strings characters POS NEG NEU
Field 0.635 7 59 0 0 7
Field or Catch 0.768 28 193 0 0 28
Field or Ball -0.591 108 713 6 0 101
Ball -0.717 61 382 3 0 57
Catch -0.145 89 521 1 0 88
Field 0.28 208 1214 2 3 178
Field and run 1.237 18 138 1 0 17
I am interested in group-based correlations of Difficulty with the remaining variables strings, characters, POS, NEG, NEU. The grouping variable is Action. If I am interested only in the group Field, I can do filter(str_detect(Action, 'Field')).
I can do it one by one between Difficulty and the remaining variables.
But is there a faster way to do it in one command with multiple variables?
My partial solution is:
df %>%
  filter(str_detect(Action, 'Field')) %>%
  na.omit %>% # original data had multiple NAs
  group_by(Action) %>%
  summarise_all(funs(cor))
But this results in an error, since cor() needs two vectors while summarise_all() passes each column on its own (and funs() is deprecated in current dplyr).
Some relevant SO posts that I looked at: Find correlation coefficient of two columns in a dataframe by group (quite relevant for generating a correlation matrix, but it does not address my question), and Check the correlation of two columns in a dataframe (in R) (useful for computing different types of correlations, and it introduces a different way of ignoring NAs).
Any help or guidance on this would be greatly appreciated!
For reference, this is the sample data as dput() output:
structure(list(
Action = c("Field", "Field or Catch", "Field or Ball", "Ball", "Catch", "Field", "Field and run"), Difficulty = c(0.635, 0.768, -0.591, -0.717, -0.145, 0.28, 1.237),
strings = c(7L, 28L, 108L, 61L, 89L, 208L, 18L),
characters = c(59L, 193L, 713L, 382L, 521L, 1214L, 138L),
POS = c(0L, 0L, 6L, 3L, 1L, 2L, 1L),
NEG = c(0L, 0L, 0L, 0L, 0L, 3L, 0L),
NEU = c(7L, 28L, 101L, 57L, 88L, 178L, 17L)),
class = "data.frame", row.names = c(NA,
-7L))
You may try -
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(Action, 'Field')) %>%
  na.omit %>% # original data had multiple NAs
  group_by(Action) %>%
  summarise(across(-Difficulty, ~cor(.x, Difficulty)))
If you don't want to group_by Action -
df %>%
  filter(str_detect(Action, 'Field')) %>%
  na.omit %>%
  summarise(across(-c(Difficulty, Action), ~cor(.x, Difficulty)))
# strings characters POS NEG NEU
#1 -0.557039 -0.5983826 -0.8733465 -0.1520684 -0.5899733
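If you also want p-values alongside the coefficients, across() accepts a named list of functions. Here is a sketch under the same assumptions (df as in the dput() above):

library(dplyr)
library(stringr)

# Sketch: correlation coefficient and cor.test() p-value per variable
df %>%
  filter(str_detect(Action, 'Field')) %>%
  na.omit %>%
  summarise(across(-c(Difficulty, Action),
                   list(r = ~cor(.x, Difficulty),
                        p = ~cor.test(.x, Difficulty)$p.value)))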
Please help, I am trying to put all the columns on the x-axis and then make side-by-side bars grouped by date.
This is my data; I really tried but to no avail:
dateVisited hh_visited hh_ind_confirmed new_in_mig out_mig deaths HOH_death Preg_Obs Preg_Outcome child_forms
102 2020-07-21 292 1170 131 86 18 7 3 14 79
103 2020-07-22 400 1553 115 100 25 10 11 18 107
104 2020-07-23 381 1458 103 67 21 9 5 23 87
105 2020-07-24 345 1379 90 98 12 4 3 20 89
106 2020-07-25 436 1585 131 119 13 2 7 20 117
107 2020-07-26 0 0 0 0 0 0 0 0 0
I think you're looking for something like this:
library(tidyr)
library(ggplot2)
df %>%
  pivot_longer(cols = -1) %>%
  ggplot(aes(name, value)) +
  geom_col(aes(fill = dateVisited), width = 0.6,
           position = position_dodge(width = 0.8)) +
  guides(x = guide_axis(angle = 45))
Reproducible Data from question
df <- structure(list(dateVisited = structure(1:6, .Label = c("2020-07-21",
"2020-07-22", "2020-07-23", "2020-07-24", "2020-07-25", "2020-07-26"
), class = "factor"), hh_visited = c(292L, 400L, 381L, 345L,
436L, 0L), hh_ind_confirmed = c(1170L, 1553L, 1458L, 1379L, 1585L,
0L), new_in_mig = c(131L, 115L, 103L, 90L, 131L, 0L), out_mig = c(86L,
100L, 67L, 98L, 119L, 0L), deaths = c(18L, 25L, 21L, 12L, 13L,
0L), HOH_death = c(7L, 10L, 9L, 4L, 2L, 0L), Preg_Obs = c(3L,
11L, 5L, 3L, 7L, 0L), Preg_Outcome = c(14L, 18L, 23L, 20L, 20L,
0L), child_forms = c(79L, 107L, 87L, 89L, 117L, 0L)), class = "data.frame",
row.names = c("102", "103", "104", "105", "106", "107"))
Your data cannot be used easily, since it takes time to format it into something that can be ingested by R. Here is something to get you started: I made up a hypothetical data frame of 4 columns that resembles your data, used the melt function from the reshape2 package to reshape the data into the long format that the ggplot2 package understands, and used ggplot2 to generate a bar plot.
library(ggplot2)

set.seed(1) # make the simulated draws reproducible
df <- data.frame(dateVisited = seq(as.Date('2019-01-01'), as.Date('2019-12-31'), 30),
                 hh_visited = runif(13, 0, 436),
                 hh_ind_confirmed = runif(13, 0, 1585),
                 new_in_mig = runif(13, 0, 131))
df <- reshape2::melt(df, id.vars = 'dateVisited')
ggplot(data = df, aes(x = dateVisited, y = value, fill = variable)) +
  geom_col(position = 'dodge')
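For completeness, reshape2 is superseded these days; the tidyr equivalent of the melt() call above would be (a sketch):

# Sketch: the same wide-to-long reshape with tidyr instead of reshape2
df <- tidyr::pivot_longer(df, cols = -dateVisited,
                          names_to = "variable", values_to = "value")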
I am trying to make a qplot in R. I know I could reformat my data, but I want to do the aggregation inside the plot call itself, as I plan to connect it to Shiny at a later date.
Before getting to the problem, this is my data:
Date Price Postnr
2016-08-01 5000 101
2016-08-01 4300 103
2016-07-01 7000 105
2016-07-01 4500 105
2016-07-01 3000 103
2016-06-01 3900 101
2016-06-01 2700 103
2016-06-01 2900 105
2016-05-01 7100 101
I am trying to create a graph using plot lines.
I want to group using Postnr.
My problem is:
I want the Date on the x-axis and Price on the y-axis, with each plotted point being the average Price for that day, but I have no idea how to go about creating that within the qplot itself.
-Edit-
Included reproducible data
mydata <- structure(list(Date = structure(c(4L, 4L, 3L, 3L, 3L, 2L,
2L, 2L, 1L), .Label = c("2016-05-01", "2016-06-01", "2016-07-01",
"2016-08-01"), class = "factor"), Price = c(5000L, 4300L, 7000L,
4500L, 3000L, 3900L, 2700L, 2900L, 7100L), Postnr = c(101L, 103L,
105L, 105L, 103L, 101L, 103L, 105L, 101L)), .Names = c("Date",
"Price", "Postnr"), row.names = c(NA, 9L), class = "data.frame")
After Ian Fellows got me on the right path, I finally found what I was looking for:
ggplot(data = mydata,
       aes(x = Date, y = Price, colour = Postnr, group = Postnr)) +
  stat_summary(fun = mean, geom = "point") + # fun.y was deprecated in ggplot2 3.3.0
  stat_summary(fun = mean, geom = "line")
Is this the idea you are looking for, @Atius?
date <- runif(100, 0, 10) + as.Date("1980-01-01")
Price <- runif(100, 0, 5000)
Postnr <- runif(100, 101, 105)
dataFrame <- data.frame(date = date, Price = Price, Postnr = Postnr)

d <- ggplot(dataFrame, aes(date, Price))
d + geom_point()
d + stat_summary_bin(aes(y = Postnr), fun = "mean", geom = "point")
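One caveat about the simulated data: runif() produces fractional Postnr values, whereas the question's postcodes are discrete. If that matters, sample() is closer (a sketch):

# Sketch: discrete postcodes rather than continuous uniform draws
Postnr <- sample(c(101, 103, 105), 100, replace = TRUE)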
For a dataset like:
21 79
78 245
21 186
65 522
4 21
3 4
4 212
4 881
124 303
28 653
28 1231
7 464
7 52
17 102
16 292
65 837
28 203
28 1689
136 2216
7 1342
56 412
I need to find the number of associated patterns. For example, 21-79 and 21-186 have 21 in common, so they form 1 pattern. 21 is also present in 4-21, so that edge contributes to the same pattern. Now 4-881, 4-212 and 3-4 have 4 in their edge, so they also contribute to the same pattern. Thus the edges 21-79, 21-186, 4-21, 4-881, 4-212 and 3-4 form 1 pattern. Similarly there are other patterns. In short, we need to group all edges that have at least one node in common to form a pattern (or subgraph). For the dataset given there are 4 patterns in total.
I need to write code (preferably in R) that will find the number of such patterns.
Since you're describing the data as subgraphs, why not use the igraph package, which is very knowledgeable about graphs? So here's your data in data.frame form:
dd <- structure(list(V1 = c(21L, 78L, 21L, 65L, 4L, 3L, 4L, 4L, 124L,
28L, 28L, 7L, 7L, 17L, 16L, 65L, 28L, 28L, 136L, 7L, 56L), V2 = c(79L,
245L, 186L, 522L, 21L, 4L, 212L, 881L, 303L, 653L, 1231L, 464L,
52L, 102L, 292L, 837L, 203L, 1689L, 2216L, 1342L, 412L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -21L))
We can treat each value as a vertex name so the data you provide is really like an edge list. Thus we create our graph with
library(igraph)
gg <- graph.edgelist(cbind(as.character(dd$V1), as.character(dd$V2)),
directed=F)
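In current igraph releases the dot-style names have been renamed; an equivalent construction (a sketch) is:

# Sketch: same graph with the newer igraph constructor; the first two
# columns of dd are taken as the edge list
gg <- graph_from_data_frame(dd, directed = FALSE)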
Either construction defines the vertices and edges; you can view the resulting graph with plot(gg).
Now you wanted to know the number of "patterns" which are really represented as connected subgraphs in this data. You can extract that information with the clusters() command. Specifically,
clusters(gg)$no
# [1] 10
Which shows there are 10 clusters in the data you provided. But you only want the ones that have more than two vertices. That we can get with
sum(clusters(gg)$csize>2)
# [1] 4
Which is 4 as you were expecting.
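As an aside, clusters() now goes by components() in current igraph; the same counts can be had with (a sketch):

# Sketch: clusters() under its current name
comp <- components(gg)
comp$no             # number of connected components: 10
sum(comp$csize > 2) # components with more than two vertices: 4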
A follow-up to this question here; even though the example is specific, this seems like a generalizable application, so I think it's worth a separate thread:
The general question is: How do I take elements in a list that correspond to a value in an original data frame and combine them according to that value in the original data frame, especially when the elements of the list are of different length?
In this example, I have a dataframe that has two groups, each sorted by date. What I ultimately want to do is get a dataframe, organized by date, that has just the relevant metrics for each segment. If a certain segment doesn't have data for a certain date, it gets a 0.
Here's some actual data:
test <- structure(list(date = structure(c(15706, 15707, 15708, 15709,
15710, 15706, 15707, 15708), class = "Date"), segment = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("abc", "xyz"), class = "factor"),
a = c(76L, 92L, 96L, 76L, 80L, 91L, 54L, 62L), x = c(964L,
505L, 968L, 564L, 725L, 929L, 748L, 932L), k = c(27L, 47L,
36L, 40L, 33L, 46L, 30L, 36L), value = c(6872L, 5993L, 5498L,
5287L, 6835L, 6622L, 5736L, 7218L)), .Names = c("date", "segment",
"a", "x", "k", "value"), row.names = c(NA, -8L), class = "data.frame")
So for the "abc" segment, I JUST care about (value/a) relative to its benchmark of 75.
and for the "xyz" segment, I JUST care about (k/x) relative to its benchmark of 0.04.
Ultimately I want a dataframe that looks like:
date abc xyz
1 2013-01-01 0.21 0.24
2 2013-01-02 -0.13 0.00
3 2013-01-03 -0.24 -0.03
4 2013-01-04 -0.07 0.00
5 2013-01-05 0.14 0.00
Where, since "xyz" only had info for 2013-01-01 thru 2013-01-03, it gets 0's for everything after.
How I got to this point was:
define the arguments to be passed to mapply
splits <- split(test, test$segment)
metrics <- c("ametric","xmetric")
benchmarks <- c(75,0.04)
and the function to get performance against benchmark
performance <- function(splits, metrics, benchmarks){
  (splits[, metrics] / benchmarks) - 1
}
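Note that ametric and xmetric are not columns in the dput() above; from the stated metrics ((value/a) against 75 for abc, (k/x) against 0.04 for xyz), they would presumably be derived first, before the split() above. A sketch of that assumed step:

# Assumed pre-processing: derive the metric columns named above
# (these columns are implied by the question, not in the dput)
test$ametric <- test$value / test$a # metric for segment "abc"
test$xmetric <- test$k / test$x     # metric for segment "xyz"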
Pass these to mapply:
temp <- mapply(performance, splits, metrics, benchmarks)
The problem now is that, since the splits were of different length, the output looks like this:
summary(temp)
Length Class Mode
abc 5 -none- numeric
xyz 3 -none- numeric
Is there a way to bring in the dates from the original data frame for each split, and combine according to those dates (with 0's where there's no data)?
You just need to set the SIMPLIFY=FALSE argument to mapply, then you can use do.call with rbind to put everything back into one dataframe:
> temp <- mapply(performance, splits, metrics, benchmarks)
> do.call('rbind',mapply(cbind, splits, performance=temp, SIMPLIFY=FALSE))
date segment a x k value performance
abc.1 2013-01-01 abc 76 964 27 6872 1.333333e-02
abc.2 2013-01-02 abc 92 505 47 5993 2.266667e-01
abc.3 2013-01-03 abc 96 968 36 5498 2.800000e-01
abc.4 2013-01-04 abc 76 564 40 5287 1.333333e-02
abc.5 2013-01-05 abc 80 725 33 6835 6.666667e-02
xyz.6 2013-01-01 xyz 91 929 46 6622 2.322400e+04
xyz.7 2013-01-02 xyz 54 748 30 5736 1.869900e+04
xyz.8 2013-01-03 xyz 62 932 36 7218 2.329900e+04
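To get all the way to the wide, zero-filled layout shown in the question, one hedged sketch (assuming splits, metrics, benchmarks and performance as defined above) is:

# Sketch: build a long date/segment/performance table, then widen it
temp <- mapply(performance, splits, metrics, benchmarks, SIMPLIFY = FALSE)
long <- do.call(rbind, Map(function(s, p)
  data.frame(date = s$date, segment = s$segment, performance = p),
  splits, temp))
wide <- reshape(long, idvar = "date", timevar = "segment",
                direction = "wide")
wide[is.na(wide)] <- 0 # 0 where a segment has no data for a date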