Two histograms with two variables in Ggplot2 - r

This is my data frame:
> head(xgb_1_plot)
   week PRICE id_item food_cat_id test_label xgb_1
2     5    18      60           7          2     2
7     5    21       9           6          5     8
12    5    14      31           4          4     6
21    5    15      25           7         12    12
31    5    14      76           3          4     2
36    5     7      48           8          2     4
Here test_label is the true value, xgb_1 is the column with the predicted values, and id_item identifies the items.
I want to plot a graph in which I can see predicted values vs. true values side by side for some id_items.
There are over 100 items, so I need just a subset for the plot (otherwise it will be a mess).
Let me know!
P.S. The best thing would be to transform test_label and xgb_1 into rows and add a dummy variable "Predicted/True value", but I have no idea how to do it.

I would suggest this approach: reshape the data to long format, then plot. With more data it will look better:
library(tidyverse)
#Data
dfa <- structure(list(id_item = c(60L, 9L, 31L, 25L, 76L, 48L), test_label = c(2L,
5L, 4L, 12L, 4L, 2L), xgb_1 = c(2L, 8L, 6L, 12L, 2L, 4L)), class = "data.frame", row.names = c("2",
"7", "12", "21", "31", "36"))
Code:
#Reshape
dfa %>% pivot_longer(cols = -id_item) %>%
ggplot(aes(x=value,fill=name))+
geom_histogram(position = position_dodge())+
facet_wrap(.~id_item)
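Since the question asks for only a subset of items, one possible variation is to filter before reshaping and to draw dodged bars per item instead of histograms. This is a minimal sketch, not the answerer's code: keep_ids is a hypothetical selection, and the column names type and value are arbitrary choices.

```r
library(tidyr)
library(ggplot2)

# Same six rows as the dput() data in the answer
dfa <- data.frame(
  id_item    = c(60L, 9L, 31L, 25L, 76L, 48L),
  test_label = c(2L, 5L, 4L, 12L, 4L, 2L),
  xgb_1      = c(2L, 8L, 6L, 12L, 2L, 4L)
)

# Hypothetical subset of items to show
keep_ids <- c(60, 9, 31)
sub <- dfa[dfa$id_item %in% keep_ids, ]

# One row per (item, true/predicted) pair
long <- pivot_longer(sub, cols = c(test_label, xgb_1),
                     names_to = "type", values_to = "value")

# Dodged bars: true and predicted values side by side for each item
p <- ggplot(long, aes(x = factor(id_item), y = value, fill = type)) +
  geom_col(position = "dodge") +
  labs(x = "id_item", y = "value", fill = NULL)
p
```

With over 100 items, changing keep_ids is enough to look at a different handful.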

Here's a different approach using geom_errorbar. Maybe the color thing is a little bit too much, but today is a rainy day, so I was in need of some variety:
"%>%" <- magrittr::"%>%"
dat <- dplyr::tibble(id_item=c(60,9,31,25,76,48),
test_label=c(2,5,4,12,4,2),
xgb_1=c(2,8,6,12,2,4))
dat %>%
dplyr::mutate(diff=abs(test_label-xgb_1)) %>%
ggplot2::ggplot(ggplot2::aes(x=id_item,ymin=test_label,ymax=xgb_1,color=diff)) +
ggplot2::geom_errorbar()


Is there a way to complete or expand an interval factor variable [duplicate]

I have a data frame/tibble that includes a factor variable of bins. There are missing bins because the original data did not include an observation in those 5-year ranges. Is there a way to easily complete the series without having to deconstruct the interval?
Here's a sample df.
library(tibble)
df <- structure(list(bin = structure(c(1L, 3L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("[1940,1945]",
"(1945,1950]", "(1950,1955]", "(1955,1960]", "(1960,1965]", "(1965,1970]",
"(1970,1975]", "(1975,1980]", "(1980,1985]", "(1985,1990]", "(1990,1995]",
"(1995,2000]", "(2000,2005]", "(2005,2010]", "(2010,2015]", "(2015,2020]",
"(2020,2025]"), class = "factor"), Values = c(2L, 4L, 14L, 11L,
8L, 26L, 30L, 87L, 107L, 290L, 526L, 299L, 166L, 502L, 8L)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
df
# A tibble: 15 x 2
bin Values
<fct> <int>
1 [1940,1945] 2
2 (1950,1955] 4
3 (1960,1965] 14
4 (1965,1970] 11
5 (1970,1975] 8
6 (1975,1980] 26
7 (1980,1985] 30
8 (1985,1990] 87
9 (1990,1995] 107
10 (1995,2000] 290
11 (2000,2005] 526
12 (2005,2010] 299
13 (2010,2015] 166
14 (2015,2020] 502
15 (2020,2025] 8
I would like to add the missing (1945,1950] and (1955,1960] bins.
bin already has the levels that you want, so you can use complete on your df:
tidyr::complete(df, bin = levels(bin), fill = list(Values = 0))
# A tibble: 17 x 2
# bin Values
# <chr> <dbl>
# 1 (1945,1950] 0
# 2 (1950,1955] 4
# 3 (1955,1960] 0
# 4 (1960,1965] 14
# 5 (1965,1970] 11
# 6 (1970,1975] 8
# 7 (1975,1980] 26
# 8 (1980,1985] 30
# 9 (1985,1990] 87
#10 (1990,1995] 107
#11 (1995,2000] 290
#12 (2000,2005] 526
#13 (2005,2010] 299
#14 (2010,2015] 166
#15 (2015,2020] 502
#16 (2020,2025] 8
#17 [1940,1945] 2
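Note that passing levels(bin) turns bin into a character column, which is why [1940,1945] sorts last in the output above. A possible variation, assuming a reasonably recent tidyr, is to pass the factor column itself: complete() then expands to the full set of levels and keeps their order. A minimal sketch that rebuilds the sample data as a plain data frame:

```r
library(tidyr)

# Rebuild the 17 bin labels and the 15 observed rows from the question
lv <- c("[1940,1945]",
        paste0("(", seq(1945, 2020, by = 5), ",", seq(1950, 2025, by = 5), "]"))
df <- data.frame(
  bin    = factor(lv[-c(2, 4)], levels = lv),
  Values = c(2L, 4L, 14L, 11L, 8L, 26L, 30L, 87L, 107L,
             290L, 526L, 299L, 166L, 502L, 8L)
)

# Passing the factor itself keeps bin a factor, in level order
out <- complete(df, bin, fill = list(Values = 0L))
out
```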
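# Alternative: bin the raw years, count per bin, then join against every level.
# orig_df is the raw data (not shown in the question), with a Year column.
library(dplyr)
library(tidyr)
library(ggplot2)  # for cut_width()
df <- orig_df %>%
mutate(bin = cut_width(Year, width = 5, center = 2.5))
df2 <- df %>%
group_by(bin) %>%
summarize(Values = n()) %>%
ungroup()
tibble(bin = levels(df$bin)) %>%
left_join(df2, by = "bin") %>%
replace_na(list(Values = 0))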
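If the bins come from raw observations anyway, another option is to let dplyr keep the empty levels while counting, which avoids the join. This sketch assumes hypothetical raw data (the years below are made up for illustration) and uses base cut() rather than cut_width().

```r
library(dplyr)

# Hypothetical raw data: one row per observation
orig_df <- data.frame(Year = c(1941, 1952, 1953, 1961))

# Five-year bins; dig.lab = 4 keeps the labels as plain four-digit years
binned <- mutate(orig_df,
                 bin = cut(Year, breaks = seq(1940, 1965, by = 5),
                           include.lowest = TRUE, dig.lab = 4))

# .drop = FALSE keeps factor levels that have no observations
counts <- count(binned, bin, name = "Values", .drop = FALSE)
counts
```

The empty bins come back with a count of 0, so no replace_na() step is needed.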

Valid observations based on conditions [duplicate]

What I am trying to solve is how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
- The observations of Adam, Char and Greg are valid since they have both scores and weights. Their scores and weights should be included.
- Fred does not have a Jan_18_Weight, therefore both his Jan_18_score and Jan_18_Weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
This calculation would be repeated for each class and each month.
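The Class 1 arithmetic above can be checked directly in base R. This is only a sketch of the rule as stated (a student counts when both score and weight are present); the vector names are illustrative.

```r
# Class 1, Jan_18 columns from the table above
score  <- c(Adam = 3, Char = 5, Fred = -7, Greg = 2)
weight <- c(Adam = 150, Char = 30, Fred = NA, Greg = 80)

# A student counts only if both score and weight are present
valid <- !is.na(score) & !is.na(weight)

# Sum of score * weight divided by sum of weights, over valid students
by_hand <- sum(score[valid] * weight[valid]) / sum(weight[valid])
by_hand

# weighted.mean() computes the same ratio
weighted.mean(score[valid], weight[valid])
```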
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.
Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
df %>%
mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
gather(month, value, -Student, -Class) %>%
group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
summarize(
weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = T)
) %>%
ungroup() %>%
spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))
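gather() and spread() still work but are superseded in current tidyr. A possible equivalent with pivot_longer() is sketched below, not as a drop-in replacement: it assumes every measurement column is named like Jan_18_score or Jan_18_Weight, and it drops incomplete student/month pairs explicitly instead of zeroing the weights.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Student = c("Adam", "Char", "Fred", "Greg", "Ed", "Mick",
              "Dave", "Nick", "Tim", "George", "Tom"),
  Class = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  Jan_18_score = c(3L, 5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA),
  Feb_18_score = c(2L, 7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L),
  Jan_18_Weight = c(150L, 30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L),
  Feb_18_Weight = c(153L, 60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)
)

res <- df %>%
  # Split "Jan_18_score" into month = "Jan_18" and a score / Weight column
  pivot_longer(cols = -c(Student, Class),
               names_to = c("month", ".value"),
               names_pattern = "(.+)_(score|Weight)$") %>%
  # Keep only student/month pairs that have both a score and a weight
  filter(!is.na(score), !is.na(Weight)) %>%
  group_by(Class, month) %>%
  summarise(weight_score = weighted.mean(score, Weight), .groups = "drop") %>%
  pivot_wider(names_from = month, values_from = weight_score)
res
```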
Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70

Sort and Return Top 5 Rows with Greatest Values

For this dataset, I would like to order Var1 by the corresponding frequency from largest to smallest and take the 5 rows with the largest values. I've been using the functions rank(), sort(), and order() to no avail.
Var1 Freq
2 Moderate 33
3 Luxury 31
4 Couples 31
5 Families with Children 33
6 Nightlife 23
7 Europe 60
8 Architecture 23
9 Drink 58
10 Northern Europe 27
11 Skiing 29
Ideally, I would like the final output to be:
Var1 Freq
7 Europe 60
9 Drink 58
5 Families with Children 33
2 Moderate 33
3 Luxury 31
When I use the functions stated above, R returns a series of numbers that are either gibberish or just the Freq column in ranked order.
Here's a dplyr solution.
df %>% top_n(5, Freq) %>% arrange(-Freq)
This gives you the top 5 scores in order.
# Var1 Freq
# 1 Europe 60
# 2 Drink 58
# 3 Moderate 33
# 4 Families with Children 33
# 5 Luxury 31
# 6 Couples 31
Note that 6 entries are included due to a tie.
If you just want the top 5 regardless of ties, then you can use this:
df %>% arrange(-Freq) %>% filter(row_number() <= 5)
# Var1 Freq
# 1 Europe 60
# 2 Drink 58
# 3 Moderate 33
# 4 Families with Children 33
# 5 Luxury 31
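In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), whose with_ties argument covers both cases above. A minimal sketch on the question's data, rebuilt here as a plain data frame with a character Var1 (an assumption; the original column is a factor):

```r
library(dplyr)

df <- data.frame(
  Var1 = c("Moderate", "Luxury", "Couples", "Families with Children", "Nightlife",
           "Europe", "Architecture", "Drink", "Northern Europe", "Skiing"),
  Freq = c(33L, 31L, 31L, 33L, 23L, 60L, 23L, 58L, 27L, 29L)
)

# Default with_ties = TRUE keeps the tie at 31, so six rows come back
with_ties <- slice_max(df, Freq, n = 5)

# with_ties = FALSE returns exactly five rows
top5 <- slice_max(df, Freq, n = 5, with_ties = FALSE)
top5
```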
Here is a one-liner. It uses order and head.
head(dat[order(dat$Freq, decreasing = TRUE), ], 5)
# Var1 Freq
#7 Europe 60
#9 Drink 58
#2 Moderate 33
#5 Families with Children 33
#3 Luxury 31
DATA.
dat <-
structure(list(Var1 = structure(c(7L, 6L, 2L, 5L, 8L, 4L, 1L,
3L, 9L, 10L), .Label = c("Architecture", "Couples", "Drink",
"Europe", "Families with Children", "Luxury", "Moderate", "Nightlife",
"Northern Europe", "Skiing"), class = "factor"), Freq = c(33L,
31L, 31L, 33L, 23L, 60L, 23L, 58L, 27L, 29L)), .Names = c("Var1",
"Freq"), class = "data.frame", row.names = c("2", "3", "4", "5",
"6", "7", "8", "9", "10", "11"))
Using data.table.
library(data.table)
DFDT <- as.data.table(dat)
DFDT[order(-Freq)][1:5]
Var1 Freq
1: Europe 60
2: Drink 58
3: Moderate 33
4: Families with Children 33
5: Luxury 31

Reducing multiple rows to 1 by index in R [duplicate]

I am relatively new to R. I am working with a dataset that has multiple data points per timestamp, but they are in multiple rows. I am trying to make a single row for each timestamp with a column for each variable.
Example dataset
Time Variable Value
10 Speed 10
10 Acc -2
10 Energy 10
15 Speed 9
15 Acc -1
20 Speed 9
20 Acc 0
20 Energy 2
I'd like to convert this to
Time Speed Acc Energy
10 10 -2 10
15 9 -1 (blank or N/A)
20 9 0 2
These are measured values so they are not always complete.
I have tried ddply to extract each individual value into an array and recombine, but the columns are different lengths. I have tried aggregate, but I can't figure out how to keep the variable and value linked. I know I could do this with a for loop type solution, but that seems a poor way to do this in R. Any advice or direction would help. Thanks!
I assume the data frame is named df:
library(tidyr)
spread(df,Variable,Value)
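In current tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape, with the example data rebuilt inline (Variable is kept as character here, which is an assumption about the real data):

```r
library(tidyr)

df <- data.frame(
  Time     = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
  Variable = c("Speed", "Acc", "Energy", "Speed", "Acc", "Speed", "Acc", "Energy"),
  Value    = c(10L, -2L, 10L, 9L, -1L, 9L, 0L, 2L)
)

# Missing combinations (Energy at Time 15) are filled with NA
wide <- pivot_wider(df, names_from = Variable, values_from = Value)
wide
```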
This is typically a job for dcast in reshape2. First, we make your example reproducible:
df <- structure(list(Time = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
Variable = structure(c(3L, 1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("Acc",
"Energy", "Speed"), class = "factor"), Value = c(10L, -2L, 10L,
9L, -1L, 9L, 0L, 2L)), .Names = c("Time", "Variable", "Value"),
class = "data.frame", row.names = c(NA, -8L))
Then:
library(reshape2)
dcast(df, Time ~ ...)
Time Acc Energy Speed
10 -2 10 10
15 -1 NA 9
20 0 2 9
With dplyr you can reorder the columns (purely cosmetic) with:
library(dplyr)
dcast(df, Time ~ ...) %>% select(Time, Speed, Acc, Energy)
Time Speed Acc Energy
10 10 -2 10
15 9 -1 NA
20 9 0 2

R - Aggregate Percentage for Stacked Bar Charts using ggplot2

I have some data that looks like the below. I'm aiming to generate stacked bar charts for them, but I need the values to be shown as percentages. I've managed to get as far as getting the data melted to the right shape and drawing the stacked bars, but the values go far beyond 100% (in my actual dataset, some values add up to 8000+). What is the correct way to setup ggplot2 so that I can create stacked bar charts in percentages?
#Raw Data
x A B C
1 5 10 14
1 4 4 14
2 5 10 14
2 4 4 14
3 5 10 14
3 4 4 14
#Aggregate
data <- read.table(...)
data <- aggregate(. ~ x, data, sum) #<---- Sum to Average?
x A B C
1 9 14 28
2 9 14 28
3 9 14 28
#Melt Data
data <- melt(data,"x")
x variable value
1 1 A 9
2 2 A 9
3 3 A 9
4 1 B 14
5 2 B 14
6 3 B 14
7 1 C 28
8 2 C 28
9 3 C 28
#Plot stack bar chart counts
ggplot(data, aes(x=1, y=value, fill=variable)) + geom_bar(stat="identity") + facet_grid(.~x)
I'm hoping to get something like this before the melt so that I can melt it and plot that as a stacked bar chart, but I'm not sure how to approach this.
#Ideal Data Format - After Aggregate, Before Melt
x A B C
1 17.64 27.45 54.90
2 17.64 27.45 54.90
3 17.64 27.45 54.90
Q: What is the correct way to create a stacked bar chart with percentages, using ggplot2?
You can calculate proportions from your melted data and then draw the figure. Here the proportion is calculated for each level of x using group_by() from the dplyr package, though there are other options. The mutate line reads as "for each level of x, compute the percentage." To remove the grouping by x afterwards, I added ungroup() at the end.
library(dplyr)
library(ggplot2)
### foo is your melt data
ana <- mutate(group_by(foo, x), percent = value / sum(value) * 100) %>%
ungroup()
### Plot once
bob <- ggplot(data = ana, aes(x = x, y = percent, fill = variable)) +
geom_bar(stat = "identity") +
labs(y = "Percentage (%)")
### Get ggplot data
caroline <- ggplot_build(bob)$data[[1]]
### Create values for text positions
caroline$position = caroline$ymax + 1
### Round to two decimals and convert to character. Get unique values
foo <- unique(as.character(round(ana$percent, digits = 2)))
### Create a column for text
caroline$label <- paste(foo,"%", sep = "")
### Plot again
bob + annotate(x = caroline$x, y = caroline$position,
label = caroline$label, geom = "text", size=3)
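A possibly simpler way to label the segments, sketched here as an alternative and not as the answerer's method, is to let geom_text() reuse the stacking via position_stack(), so no ggplot_build() step is needed. The percent column is computed with base ave(); the label format and the centred placement are arbitrary choices.

```r
library(ggplot2)

# Same melted data as foo in the answer
foo <- data.frame(
  x        = rep(1:3, times = 3),
  variable = rep(c("A", "B", "C"), each = 3),
  value    = rep(c(9, 14, 28), each = 3)
)

# Share of each value within its level of x
foo$percent <- foo$value / ave(foo$value, foo$x, FUN = sum) * 100

# Labels are stacked the same way as the bars and centred in each segment
p <- ggplot(foo, aes(x = factor(x), y = percent, fill = variable)) +
  geom_col() +
  geom_text(aes(label = sprintf("%.2f%%", percent)),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(x = "x", y = "Percentage (%)")
p
```

Because the labels are mapped row by row from the data, they stay correct even when the percentages differ between levels of x.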
DATA
foo <-structure(list(x = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), variable = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
value = c(9L, 9L, 9L, 14L, 14L, 14L, 28L, 28L, 28L)), .Names = c("x",
"variable", "value"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
