Hi there, can anybody help me? I have a big data frame with two columns, country_dest and SumTotal (the value), and I am trying to use the qplot function:
qplot(country_dest, SumTotal, data=Africa)
Brunei 58
Aruba 73
Cuba 95
Nicaragua 97
Turkmenistan 99
Saint Lucia 102
Honduras 153
Barbados 161
Haiti 165
Montenegro 175
I would like to draw a plot, but put on the x axis only the names of the countries with the highest SumTotal values (for example, the top 6 or 7). Is that possible?
Thank you in advance!
Using ggplot, just reorder by SumTotal:
ggplot(data = Africa, aes(x= reorder(country_dest, -SumTotal), y= SumTotal)) + geom_bar(stat = "identity")
If you just want to take, say, the top 5, use arrange and then subset:
library(dplyr)
Africa.ordered <- arrange(Africa, desc(SumTotal))
Africa.top5 <- Africa.ordered[1:5, ]
and then draw your plot
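As a rough sketch of that last step, the top rows can be selected and plotted in one go. The data frame below is rebuilt by hand from the ten sample rows in the question (the real Africa data is larger), and the name top5 is only illustrative.

```r
library(ggplot2)

# Sample rows copied from the question; the real data frame has more rows.
Africa <- data.frame(
  country_dest = c("Brunei", "Aruba", "Cuba", "Nicaragua", "Turkmenistan",
                   "Saint Lucia", "Honduras", "Barbados", "Haiti", "Montenegro"),
  SumTotal = c(58, 73, 95, 97, 99, 102, 153, 161, 165, 175)
)

# Sort by SumTotal (largest first) and keep the first five rows.
top5 <- head(Africa[order(-Africa$SumTotal), ], 5)

# reorder() keeps the bars in descending order on the x axis.
p <- ggplot(top5, aes(x = reorder(country_dest, -SumTotal), y = SumTotal)) +
  geom_col() +
  labs(x = "Country", y = "SumTotal")
print(p)
```

Changing the 5 in head() to 6 or 7 gives the number of countries asked for in the question.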
I've made a GAM model in R using the following code:
mod_gam1 <-gam(y ~ s(ï..x), data=Bird.data, method = "REML")
plot(mod_gam1)
coef(mod_gam1)
plot(mod_gam1, residuals = TRUE, pch = 1)
coef(mod_gam1)
mod_gam1$fitted.values
result <- data.frame(data = c(mod_gam1$fitted.values, Bird.data$y), Year = rep(1991:2019, times = 2),
'source' = c(rep('Modelled', times = 29), rep('Observed', times = 29)))
ggplot(result, aes(x = Year, y = data, colour = source))+ geom_point()+ geom_smooth(span= 0.8)+labs(x="Year", y = "Bird Island Total Debris Count")+ scale_y_continuous(limits = c(0,1000))
The output looks OK, but the shaded area (the geom_smooth confidence band) doesn't extend across my whole dataset: it stops short of my first two data points, and I am not sure why.
Any help would be appreciated!
I can't upload a picture as I am new to the site, but basically I have two datasets (observed and GAM-modelled values), each with its SE confidence ribbon, and the ribbons start two data points into my datasets rather than at the first points.
These are my datapoints:
Bird.data
ï..x    y
1991    17
1992    76
1993    328
1994    131
1995    425
1996    892
1997    501
1998    419
1999    297
2000    277
2001    310
2002    282
2003    189
2004    278
2005    322
2006    444
2007    412
2008    241
2009    242
2010    255
2011    289
2012    335
2013    279
2014    628
2015    500
2016    174
2017    636
2018    420
2019    447
Fitted Values
[1] 95.56189 177.01468 255.17074 324.97532 380.28813 415.71334 428.67793 420.86624 398.18522 369.06325
[11] 341.72715 321.65585 310.33971 305.81158 304.53360 303.60521 302.21413 301.75501 304.77184 313.43400
[21] 328.37279 348.39076 371.04203 393.66222 414.29754 432.15104 447.48020 461.14595 474.09266
Negative Binomial
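It is because of the limits you set with scale_y_continuous. Values outside the scale limits are treated as missing, so the ribbon is dropped wherever its lower edge falls below 0. If you remove that line (or lower the bottom limit so that it includes the minimum y value of the smooth), you will see the smooth filled completely.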
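If the aim is only to zoom the axis, one option is coord_cartesian(ylim = ...), which crops the view without discarding out-of-range values. This is a minimal sketch, assuming the observed counts from the question and the default loess smoother; the data frame name bird is made up for the example.

```r
library(ggplot2)

# Observed counts copied from the question.
bird <- data.frame(
  Year = 1991:2019,
  y = c(17, 76, 328, 131, 425, 892, 501, 419, 297, 277, 310, 282, 189, 278,
        322, 444, 412, 241, 242, 255, 289, 335, 279, 628, 500, 174, 636, 420, 447)
)

# coord_cartesian() zooms the panel; unlike scale limits it does not turn
# out-of-range ribbon values into NA, so the band is drawn for every year.
p <- ggplot(bird, aes(Year, y)) +
  geom_point() +
  geom_smooth(span = 0.8) +
  coord_cartesian(ylim = c(0, 1000))
print(p)
```

The part of the band that dips below zero is simply clipped at the panel edge instead of being removed.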
However, you have a larger problem here. You are not actually showing the GAM model in the smooth (only a smooth through the GAM point predictions). There are a couple of ways to fix this. The easiest might be to feed Bird.data directly to ggplot, and use the method and formula parameters of geom_smooth() to request the GAM smooth directly:
# assumes the year column of Bird.data is named x (rename ï..x first)
ggplot(Bird.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x)) +
  labs(x = "Year", y = "Bird Island Total Debris Count")
The problem with this approach is that you don't get the prediction points as well. This can be fixed with the following approach.
First, add the standard errors directly to the result data frame (NA for the observed rows):
result$se <- c(predict(mod_gam1, se.fit = TRUE)$se.fit, rep(NA, 29))
Then use ggplot as before, but with geom_ribbon, setting ymin and ymax directly:
ggplot(result, aes(x = Year, y = data, colour = source, fill = source)) +
  geom_point() +
  geom_ribbon(aes(ymin = data - 1.96 * se, ymax = data + 1.96 * se), alpha = 0.2) +
  labs(x = "Year", y = "Bird Island Total Debris Count") +
  scale_y_continuous(limits = c(-200, 1000))
I accessed this graph of estimated numbers of diabetes cases, with future projections, for estimation data points roughly every two years from 2000. The graph is factually incorrect, as the points on the line do not coincide with the scale on the left. I am trying to replot it in ggplot2 or ggplotly. In the replot I intend to draw two lines in a single plot: one for the estimates over the last few years, and the other for the future projections made in those years for the next 20-25 years, along with the year in which the projections were made. Any help is much appreciated.
Here is the data that was used to plot the graph. Estimated numbers with their years are shown in blue, while projected numbers for future years are shown by the red line. Since there are multiple projected numbers for a few years, I intend to keep the highest number on the line graph.
Estimation Year    Estimates (in millions)    Projections (in millions)    Projection Year
2000               151                        333                          2025
2003               194                        380                          2025
2006               246                        438                          2030
2009               285                        552                          2030
2011               366                        578                          2030
2013               382                        592                          2035
2015               415                        642                          2040
2017               425                        629                          2045
2019               463                        700                          2045
Your question is more about the data wrangling than the actual plotting with ggplot. Once you have the data in the right shape, the plotting command is just a few lines.
Prepare the data for the estimation (blue) points. Set a column type to "estimation".
Prepare the data for the projected (red) points. Set a column type to "projection".
Use bind_rows to combine both tables.
In the aesthetics of ggplot, use color = type.
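The steps above can be sketched roughly as follows. The column names (est_year, estimate, projection, proj_year) and the object names are invented for this example, and taking the maximum projection per projection year follows the question's stated wish to keep the highest number.

```r
library(dplyr)
library(ggplot2)

# Data typed in from the table in the question; column names are assumptions.
diabetes <- data.frame(
  est_year   = c(2000, 2003, 2006, 2009, 2011, 2013, 2015, 2017, 2019),
  estimate   = c(151, 194, 246, 285, 366, 382, 415, 425, 463),
  projection = c(333, 380, 438, 552, 578, 592, 642, 629, 700),
  proj_year  = c(2025, 2025, 2030, 2030, 2030, 2035, 2040, 2045, 2045)
)

# Step 1: estimation points, tagged with type = "estimation".
estimates <- diabetes %>%
  transmute(year = est_year, value = estimate, type = "estimation")

# Step 2: projected points, keeping the highest value per projection year.
projections <- diabetes %>%
  group_by(year = proj_year) %>%
  summarise(value = max(projection), .groups = "drop") %>%
  mutate(type = "projection")

# Step 3: stack both tables into one long table.
long <- bind_rows(estimates, projections)

# Step 4: map type to colour so each series gets its own line.
p <- ggplot(long, aes(year, value, colour = type)) +
  geom_line() +
  geom_point() +
  labs(y = "Millions")
print(p)
```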
Here is a start on how you can recreate the plot from the data. I haven't put any effort into recreating the balloons, setting the theme to something more elegant, and that kind of thing.
library(ggplot2)
txt <- "2000 151 333 2025
2003 194 380 2025
2006 246 438 2030
2009 285 552 2030
2011 366 578 2030
2013 382 592 2035
2015 415 642 2040
2017 425 629 2045
2019 463 700 2045"
df <- read.table(text = txt)
# Putting years and values in the same columns
# Probably some tidyverse function can do this more elegantly
df <- rbind(cbind(unname(df[1:2]), type = "Estimates"),
            cbind(unname(df[4:3]), type = "Projection"))
colnames(df) <- c("year", "value", "type")
# We're reordering on value, because the red line does not touch year-duplicates
df <- df[order(df$value, decreasing = TRUE),]
ggplot(df, aes(year, value, colour = type)) +
  # Formula notation to filter out data for the line
  geom_line(data = ~ subset(., !duplicated(year))) +
  geom_point() +
  scale_colour_manual(
    values = c("Estimates" = "dodgerblue",
               "Projection" = "tomato")
  ) +
  scale_y_continuous(limits = c(0, NA),
                     name = "Millions")
Created on 2021-01-06 by the reprex package (v0.3.0)
I would like to create a graph to represent projected vs collected revenue by person and I'm not sure how to do this. The goal would be to have the negative differences plotted as a red vertical bar and positive as black.
ggplot(appts2,
aes(Provider, Difference),
main = "Difference in Projected vs Actual Revenue") +
geom_bar(fill = ifelse(appts2$Difference < 0, "red", "black"), stat = 'identity') +
coord_flip()
works but isn't coloring things correctly.
Provider  Revenue  Visits  Ave    Total Add Ons  Total Scheduled  Total Seen  Total Not Seen  TotalBatchVisits  ProjectedRevenue  Difference  MissingRecords
Smith     40911    539     75.9   38             438              404         82              486               36887.4           -4023.6     53
Antonio   4827     63      76.62  7              88               60          35              95                7278.9            2451.9      -32
Jackson   13832    171     80.89  32             155              161         20              181               14641.09          809.09      -10
Redding   23030    278     82.84  25             164              144         34              178               14745.52          -8284.48    100
You can accomplish this by setting the "fill" aesthetic to a logical statement, such as Difference < 0. ggplot will then fill the bars depending on whether the bar is less than or greater than zero.
Avoid passing an outside vector built with the $ operator to a ggplot layer (you reference appts2$Difference in the fill argument). Instead, map the bare column name inside aes(), which ggplot will then look up in the provided data set. ggplot orders the data before plotting it (here the providers are sorted alphabetically), so a vector supplied in the original row order can end up attached to the wrong bars.
library(ggplot2)
set.seed(1)
df <- data.frame(category = letters[1:10], difference = rnorm(10))
g <- ggplot(data = df, aes(y = difference, x = category, fill = difference < 0)) +
geom_col() +
coord_flip()
print(g)
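Applied to the data in the question, the same idea could look like the sketch below. Only the Provider and Difference columns are reproduced here, and scale_fill_manual is added (it is not part of the answer above) to get the red and black colours that were asked for.

```r
library(ggplot2)

# Provider and Difference columns copied from the question's table.
appts2 <- data.frame(
  Provider   = c("Smith", "Antonio", "Jackson", "Redding"),
  Difference = c(-4023.6, 2451.9, 809.09, -8284.48)
)

# Map the logical test to fill inside aes(), then choose the two colours.
p <- ggplot(appts2, aes(Provider, Difference, fill = Difference < 0)) +
  geom_col() +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "black"),
                    guide = "none") +
  coord_flip() +
  ggtitle("Difference in Projected vs Actual Revenue")
print(p)
```

Because the colour is computed from the Difference column itself, it stays attached to the right bar however the providers are ordered.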
I have a dataset from the world bank with some continuous and categorical variables.
> head(nationsCombImputed)
iso3c iso2c country year.x life_expect population birth_rate neonat_mortal_rate region
1 ABW AW Aruba 2014 75.45 103441 10.1 2.4 Latin America & Caribbean
2 AFG AF Afghanistan 2014 60.37 31627506 34.2 36.1 South Asia
3 AGO AO Angola 2014 52.27 24227524 45.5 49.6 Sub-Saharan Africa
4 ALB AL Albania 2014 77.83 2893654 13.4 6.5 Europe & Central Asia
5 AND AD Andorra 2014 70.07 72786 20.9 1.5 Europe & Central Asia
6 ARE AE United Arab Emirates 2014 77.37 9086139 10.8 3.6 Middle East & North Africa
income gdp_percap.x log_pop
1 High income 47008.83 5.014693
2 Low income 1942.48 7.500065
3 Lower middle income 7327.38 7.384309
4 Upper middle income 11307.55 6.461447
5 High income 30482.64 4.862048
6 High income 67239.00 6.958379
I wish to use ggpairs to plot some of the continuous variables (life_expect, birth_rate, neonat_mortal_rate, gdp_percap.x) in a scatter plot but I would like to colour them using the region categorical variable from the data. I have tried a number of different ways but I cannot colour the continuous variables without including the categorical variable.
ggpairs(nationsCombImputed[,c(2,5,7,8,9,11)],
title="Scatterplot of Variables",
mapping = ggplot2::aes(color = region),
labeller = "iso2c")
But I get this error
Error in stop_if_high_cardinality(data, columns,
cardinality_threshold) : Column 'iso2c' has more levels (211) than
the threshold (15) allowed. Please remove the column or increase the
'cardinality_threshold' parameter. Increasing the
cardinality_threshold may produce long processing times
Ultimately I would just like a 4x4 scatter plot of the continuous variables, coloured by region, with the data points labelled using the iso2c code in column 2.
Is this possible in ggpairs?
Well, yes, it is possible! Following @Robin Gertenbach's suggestion, I added the columns argument to my code and this worked great; please see below.
ggpairs(nationsCombImputed,
title="Scatterplot of Variables",
columns = c(5,7,8,11),
mapping=ggplot2::aes(colour = region))
I still wish to add data point labels to the scatter plot using the iso2c column but I am struggling with this, any pointers would be greatly appreciated.
As mentioned in the comment you can get ggpairs to color but not plot a dimension by specifying the numeric indices of the columns you do want to plot with columns = c(5,7,8,11).
To have a text scatter plot you will need to define a function e.g. textscatter that you will supply via lower = list(continuous = textscatter) in the ggpairs function call and specify the labels in the aesthetics.
library(GGally)
library(ggplot2)
# Draw each lower-triangle panel as text labels instead of points
textscatter <- function(data, mapping, ...) {
  ggplot(data, mapping, ...) + geom_text()
}
ggpairs(
  nationsCombImputed,
  title = "Scatterplot of Variables",
  columns = c(5, 7, 8, 11),
  mapping = ggplot2::aes(colour = region, label = iso2c),
  lower = list(continuous = textscatter)
)
Of course you can also put the label aesthetic definition into textscatter
I am having some trouble applying a gradient fill to my area plot.
The data is as below:
> df
year annual
1 1960 0.0100
2 1961 -0.2700
3 1962 -0.3450
4 1963 -0.6508
5 1964 -0.9458
6 1965 -0.2458
7 1966 0.9492
8 1967 0.5383
9 1968 0.6275
10 1969 0.0000
I've set up a colorRampPalette for the gradient, and I know this works.
spi.cols <- colorRampPalette(c("darkred","red","yellow","white","green","blue","darkblue"),space="rgb")
With the plot, my aim is to have the fill colours follow the values in the annual column, so that it is easy to tell which values fall within certain boundaries. Right now, the plot seems to think every value it is "filling" is equal to zero, and is thus filling everything in one colour only.
ggplot(df, aes(x = year)) +
geom_polygon(aes(y = annual, fill = annual)) +
theme_classic() +
scale_fill_gradientn(colours = spi.cols(12), limits = c(-2.5, 2.5), guide = "legend")
I have also specified the breaks I'd like in my gradient, but I'm not sure how to utilise this. I attempted to use this in values of the scale_fill_gradientn but this was unsuccessful.
spi.breaks <- c(-2.5,-2,-1.6,-1.3,-0.8,-0.5,0.5,0.8,1.3,1.6,2,2.5)
Any help would be much appreciated