Descending order in ggplot geom_col - r

Below is the dataset,
# A tibble: 449 x 7
`Country or Area` `Region 1` Year Rate MinCI MaxCI Average
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Southern Asia 2011 4.2 2.6 6.2 4.4
2 Afghanistan Southern Asia 2016 5.5 3.4 8.1 5.75
3 Aland Islands Northern Europe NA NA NA NA NA
4 Albania Southern Europe 2011 18.8 14.8 23 18.9
5 Albania Southern Europe 2016 21.7 17 26.7 21.8
6 Algeria Northern Africa 2011 24 19.9 28.4 24.2
7 Algeria Northern Africa 2016 27.4 22.5 32.7 27.6
8 American Samoa Polynesia NA NA NA NA NA
9 Andorra Southern Europe 2011 24.6 19.8 29.8 24.8
10 Andorra Southern Europe 2016 25.6 20.1 31.3 25.7
I need to draw a bar chart with geom_col using the above dataset to compare the average obesity rate of each region, and I need to order the bars from the highest to the lowest.
I have also calculated the average obesity rate, shown in the Average column above.
Below is the code I used to generate the ggplot, but I am unable to figure out how to order the bars from highest to lowest.
region_plot <- ggplot(continent) +
  aes(x = continent$`Region 1`, y = continent$Average, fill = Average) +
  geom_col() +
  xlab("Region") + ylab("Average Obesity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Average obesity rate of each region")
region_plot

After checking your data: you have multiple rows per region, so in order to show the average per region you must first compute it and then plot. You can do that with dplyr using group_by() and summarise(). Your shared data is limited, but in the real data NA values should not be present (here they are filtered out). The reorder() function arranges the bars by the computed average. Be careful with column names when using your real data. Here is the code, using part of the shared data:
library(dplyr)
library(ggplot2)

# Code
df %>%
  group_by(Region) %>%
  summarise(Avg = mean(Average, na.rm = TRUE)) %>%
  filter(!is.na(Avg)) %>%
  ggplot(aes(x = reorder(Region, -Avg), y = Avg, fill = Region)) +
  geom_col() +
  xlab("Region") + ylab("Average Obesity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Average obesity rate of each region")
Output: (plot not shown)
Some data used:
#Data
df <- structure(list(Region = c("Southern Asia", "Southern Asia", "Northern Europe",
"Southern Europe", "Southern Europe", "Northern Africa", "Northern Africa",
"Polynesia", "Southern Europe", "Southern Europe"), Year = c(2011L,
2016L, NA, 2011L, 2016L, 2011L, 2016L, NA, 2011L, 2016L), Rate = c(4.2,
5.5, NA, 18.8, 21.7, 24, 27.4, NA, 24.6, 25.6), MinCI = c(2.6,
3.4, NA, 14.8, 17, 19.9, 22.5, NA, 19.8, 20.1), MaxCI = c(6.2,
8.1, NA, 23, 26.7, 28.4, 32.7, NA, 29.8, 31.3), Average = c(4.4,
5.75, NA, 18.9, 21.8, 24.2, 27.6, NA, 24.8, 25.7)), row.names = c(NA,
-10L), class = "data.frame")

The problem can be solved by preprocessing the data: summarise per region, sort the result by Average in descending order, and then coerce `Region 1` to a factor whose levels follow that order.
library(ggplot2)
library(dplyr)

continent %>%
  group_by(`Region 1`) %>%
  summarise(Average = mean(Average, na.rm = TRUE)) %>%
  arrange(desc(Average)) %>%
  mutate(`Region 1` = factor(`Region 1`, levels = unique(`Region 1`))) %>%
  ggplot(aes(x = `Region 1`, y = Average, fill = Average)) +
  geom_col() +
  xlab("Region") + ylab("Average Obesity") +
  ggtitle("Average obesity rate of each region") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) -> region_plot
region_plot
Data
continent <- read.table(text = "
'Country or Area' 'Region 1' Year Rate MinCI MaxCI Average
1 Afghanistan 'Southern Asia' 2011 4.2 2.6 6.2 4.4
2 Afghanistan 'Southern Asia' 2016 5.5 3.4 8.1 5.75
3 'Aland Islands' 'Northern Europe' NA NA NA NA NA
4 Albania 'Southern Europe' 2011 18.8 14.8 23 18.9
5 Albania 'Southern Europe' 2016 21.7 17 26.7 21.8
6 Algeria 'Northern Africa' 2011 24 19.9 28.4 24.2
7 Algeria 'Northern Africa' 2016 27.4 22.5 32.7 27.6
8 'American Samoa' Polynesia NA NA NA NA NA
9 Andorra 'Southern Europe' 2011 24.6 19.8 29.8 24.8
10 Andorra 'Southern Europe' 2016 25.6 20.1 31.3 25.7
", header = TRUE, check.names = FALSE)

Related

Compute a variable from the difference between another variable this year and the previous one in R

In the data below I want to compute the ratio tr(year) / (op(year) - op(year-1)). I would appreciate an answer with dplyr.
year op tr cp
<chr> <dbl> <dbl> <dbl>
1 1984 10 39.1 38.3
2 1985 55 132. 77.1
3 1986 79 69.3 78.7
4 1987 78 47.7 74.1
5 1988 109 77.0 86.4
This is the expected output:
year2 ratio
1 1985 2.933333
2 1986 2.887500
3 1987 -47.700000
4 1988 -2.483871
I have not managed to get any result...
Use lag:
library(dplyr)
df %>%
  mutate(year = year,
         ratio = tr / (op - lag(op)),
         .keep = "none") %>%
  tidyr::drop_na()
# year ratio
#2 1985 2.933333
#3 1986 2.887500
#4 1987 -47.700000
#5 1988 2.483871
We may use
library(dplyr)
df1 %>%
  reframe(year = year[-1], ratio = tr[-1] / diff(op))
Output:
year ratio
1 1985 2.933333
2 1986 2.887500
3 1987 -47.700000
4 1988 2.483871
data
df1 <- structure(list(year = 1984:1988, op = c(10L, 55L, 79L, 78L, 109L
), tr = c(39.1, 132, 69.3, 47.7, 77), cp = c(38.3, 77.1, 78.7,
74.1, 86.4)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5"))

Is there a function in R to fill in missing data in a variable

I have a set of panel data, df1, that take the following form:
df1:
year state eligibility coverage
1990 AL .87 .70
1991 AL .78 .61
1992 AL .82 .63
1993 AL .79 .69
1994 AL .82 .73
1990 AK .91 .88
1991 AK .83 .79
1992 AK .82 .71
1993 AK .77 .69
1994 AK .82 .73
I need to add a variable "professionalism" from a different set of data, df2, but the problem is that df2 only has observations measured on even years. df2 thus takes the following form:
df2:
year state professionalism
1990 AL 1.33
1992 AL 1.40
1994 AL 1.42
1990 AK -0.92
1992 AK -0.98
1994 AK -1.02
Is there a function in R that will add the odd years into df2, copying the value from year + 1, producing the following output:
df2':
year state professionalism
1990 AL 1.33
1991 AL 1.40
1992 AL 1.40
1993 AL 1.42
1994 AL 1.42
1990 AK -0.92
1991 AK -0.98
1992 AK -0.98
1993 AK -1.02
1994 AK -1.02
I can then merge the professionalism variable from the new df2' into df1... is this possible?
We can use complete() with fill():
library(dplyr)
library(tidyr)

df2 %>%
  complete(year = 1990:1994, state) %>%
  group_by(state) %>%
  fill(professionalism, .direction = "updown") %>%
  ungroup() %>%
  arrange(state, year)
Output:
# A tibble: 10 x 3
year state professionalism
<int> <chr> <dbl>
1 1990 AK -0.92
2 1991 AK -0.98
3 1992 AK -0.98
4 1993 AK -1.02
5 1994 AK -1.02
6 1990 AL 1.33
7 1991 AL 1.4
8 1992 AL 1.4
9 1993 AL 1.42
10 1994 AL 1.42
data
df2 <- structure(list(year = c(1990L, 1992L, 1994L, 1990L, 1992L, 1994L
), state = c("AL", "AL", "AL", "AK", "AK", "AK"), professionalism = c(1.33,
1.4, 1.42, -0.92, -0.98, -1.02)), class = "data.frame", row.names = c(NA,
-6L))
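For clarity, complete() first inserts the missing odd-year rows with professionalism = NA, and fill(.direction = "updown") then copies the next available value into them (falling back to the previous value at the end of a group). A sketch of just the first step, using the df2 above:
df2 %>%
  complete(year = 1990:1994, state) %>%
  arrange(state, year)
# 1991 and 1993 now appear for each state with professionalism = NA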
I think the easiest way to do it is to create a new column in df1 that rounds the year up to the next even value, and then left_join the data from df2:
library(tidyverse)
# Setting up example data
df1 <- tribble(
  ~year, ~state, ~eligibility, ~coverage,
  1990, "AL", .87, .70,
  1991, "AL", .78, .61,
  1992, "AL", .82, .63,
  1993, "AL", .79, .69,
  1994, "AL", .82, .73,
  1990, "AK", .91, .88,
  1991, "AK", .83, .79,
  1992, "AK", .82, .71,
  1993, "AK", .77, .69,
  1994, "AK", .82, .73
)
df2 <- tribble(
  ~year, ~state, ~professionalism,
  1990, "AL", 1.33,
  1992, "AL", 1.40,
  1994, "AL", 1.42,
  1990, "AK", -0.92,
  1992, "AK", -0.98,
  1994, "AK", -1.02
)
# Create a "year_even" variable in df1, then left join from df2
df1 <- df1 %>% mutate(year_even = ceiling(year / 2) * 2)
df1 <- left_join(df1, df2, by = c("year_even" = "year", "state" = "state"))
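Why ceiling()? Odd years round up to the next even year, which is exactly the "value of year + 1" behaviour asked for; a quick check:
ceiling(1991 / 2) * 2   # 1992
ceiling(1992 / 2) * 2   # 1992 (even years are unchanged)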

Summarise across each column by grouping their names

I want to calculate the weighted variance using the weights provided in the dataset, while grouping by country and capital city; however, the function returns NAs:
library(Hmisc) # for the 'wtd.var' function

weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(across(starts_with("winter"), wtd.var))
The provided output from the console (when in long format):
# A tibble: 35 x 3
# Groups: country [35]
country capital_city winter
<chr> <chr> <dbl>
1 ALBANIA Tirane NA
2 AUSTRIA Vienna NA
3 BELGIUM Brussels NA
4 BULGARIA Sofia NA
5 CROATIA Zagreb NA
6 CYPRUS Nicosia NA
7 CZECHIA Prague NA
8 DENMARK Copenhagen NA
9 ESTONIA Tallinn NA
10 FINLAND Helsinki NA
# … with 25 more rows
This is the code that I used to get the data from a wide format into a long format:
weather_winter <- weather_winter %>% pivot_longer(-c(31:33))
weather_winter$name <- NULL
names(weather_winter)[4] <- "winter"
Some example data:
structure(list(`dec-wet_2011` = c(12.6199998855591, 12.6099996566772,
14.75, 11.6899995803833, 18.2899990081787), `dec-wet_2012` = c(13.6300001144409,
14.2199993133545, 14.2299995422363, 16.1000003814697, 18.0299987792969
), `dec-wet_2013` = c(4.67999982833862, 5.17000007629395, 4.86999988555908,
7.56999969482422, 5.96000003814697), `dec-wet_2014` = c(14.2999992370605,
14.4799995422363, 13.9799995422363, 15.1499996185303, 16.1599998474121
), `dec-wet_2015` = c(0.429999977350235, 0.329999983310699, 1.92999994754791,
3.30999994277954, 7.42999982833862), `dec-wet_2016` = c(1.75,
1.29999995231628, 3.25999999046326, 6.60999965667725, 8.67999935150146
), `dec-wet_2017` = c(13.3400001525879, 13.3499994277954, 15.960000038147,
10.6599998474121, 14.4699993133545), `dec-wet_2018` = c(12.210000038147,
12.4399995803833, 11.1799993515015, 10.75, 18.6299991607666),
`dec-wet_2019` = c(12.7199993133545, 13.3800001144409, 13.9899997711182,
10.5299997329712, 12.3099994659424), `dec-wet_2020` = c(15.539999961853,
16.5200004577637, 11.1799993515015, 14.7299995422363, 13.5499992370605
), `jan-wet_2011` = c(8.01999950408936, 7.83999967575073,
10.2199993133545, 13.8899993896484, 14.5299997329712), `jan-wet_2012` = c(11.5999994277954,
11.1300001144409, 12.5500001907349, 10.1700000762939, 22.6199989318848
), `jan-wet_2013` = c(17.5, 17.4099998474121, 15.5599994659424,
13.3199996948242, 20.9099998474121), `jan-wet_2014` = c(12.5099992752075,
12.2299995422363, 15.210000038147, 9.73999977111816, 9.63000011444092
), `jan-wet_2015` = c(17.6900005340576, 16.9799995422363,
11.75, 9.9399995803833, 19), `jan-wet_2016` = c(15.6099996566772,
15.5, 14.5099992752075, 10.3899993896484, 18.4499988555908
), `jan-wet_2017` = c(9.17000007629395, 9.61999988555908,
9.30999946594238, 15.8499994277954, 11.210000038147), `jan-wet_2018` = c(8.55999946594238,
9.10999965667725, 13.2599992752075, 9.85999965667725, 15.8899993896484
), `jan-wet_2019` = c(17.0699996948242, 16.8699989318848,
14.5699996948242, 19.0100002288818, 19.4699993133545), `jan-wet_2020` = c(6.75999975204468,
6.25999975204468, 6.00999975204468, 5.35999965667725, 8.15999984741211
), `feb-wet_2011` = c(9.1899995803833, 8.63999938964844,
6.21999979019165, 9.82999992370605, 4.67999982833862), `feb-wet_2012` = c(12.2699995040894,
11.6899995803833, 8.27999973297119, 14.9399995803833, 13.0499992370605
), `feb-wet_2013` = c(15.3599996566772, 15.9099998474121,
17.0599994659424, 13.3599996566772, 16.75), `feb-wet_2014` = c(10.1999998092651,
11.1399993896484, 13.8599996566772, 10.7399997711182, 7.35999965667725
), `feb-wet_2015` = c(11.9200000762939, 12.2699995040894,
8.01000022888184, 14.5299997329712, 5.71999979019165), `feb-wet_2016` = c(14.6999998092651,
14.7799997329712, 16.7899990081787, 4.90000009536743, 19.3500003814697
), `feb-wet_2017` = c(8.98999977111816, 9.17999935150146,
11.7699995040894, 6.3899998664856, 13.9899997711182), `feb-wet_2018` = c(16.75,
16.8599987030029, 12.0599994659424, 16.1900005340576, 8.51000022888184
), `feb-wet_2019` = c(7.58999967575073, 7.26999998092651,
8.21000003814697, 7.57999992370605, 8.81999969482422), `feb-wet_2020` = c(10.6399993896484,
10.4399995803833, 13.4399995803833, 8.53999996185303, 19.939998626709
), country = c("SERBIA", "SERBIA", "SLOVENIA", "GREECE",
"CZECHIA"), capital_city = c("Belgrade", "Belgrade", "Ljubljana",
"Athens", "Prague"), weight = c(20.25, 19.75, 14.25, 23.75,
14.25)), row.names = c(76L, 75L, 83L, 16L, 5L), class = "data.frame")
Your code seems to provide the right answer now that there is more data:
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 27.2
2 GREECE Athens 14.6
3 SERBIA Belgrade 19.1
4 SLOVENIA Ljubljana 16.3
Is this what you were looking for?
I took the liberty of streamlining your code:
weather_winter <- weather_winter %>%
  pivot_longer(-c(31:33), values_to = "winter") %>%
  select(-name)

weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(winter = wtd.var(winter))
With only one "winter" column, there's no need for the across().
Finally, you are not using the weights. If these are needed, then change the last line to:
summarise(winter = wtd.var(winter, weights = weight))
To give:
# A tibble: 4 x 3
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 26.3
2 GREECE Athens 14.2
3 SERBIA Belgrade 18.8
4 SLOVENIA Ljubljana 15.8
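Putting the pieces together, a sketch of the full weighted pipeline (assuming the wide data above, where columns 31-33 are country, capital_city and weight):
library(dplyr)
library(tidyr)
library(Hmisc)

weather_winter.std <- weather_winter %>%
  pivot_longer(-c(31:33), values_to = "winter") %>%
  select(-name) %>%
  group_by(country, capital_city) %>%
  summarise(winter = wtd.var(winter, weights = weight), .groups = "drop")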

R | Adding index numbers

I have two datasets which look like below:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 NA
South Asia 2009 4.5 NA
South Asia 2011 11 0
South Asia 2014 16.7 NA
Africa 2008 0.4 NA
Africa 2013 3.5 0
Africa 2017 9.7 NA
Strategy
Region StrategyYear
South Asia 2011
Africa 2013
Japan 2007
SE Asia 2009
There are multiple regions and many review years, which are not periodic and not even the same for all regions. I have added a column 'Index' to the dataframe 'Sales' such that, for the strategy year from the second dataframe, the index value is zero. I now want to change the NAs to a series of numbers that tell how many rows before or after the 0 row each row sits, grouped by 'Region'.
I can do this using a for loop, but that is tedious, so I am checking whether there is a cleaner way. The final output should look like:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 -2
South Asia 2009 4.5 -1
South Asia 2011 11 0
South Asia 2014 16.7 1
Africa 2008 0.4 -1
Africa 2013 3.5 0
Africa 2017 9.7 1
Join the two datasets by Region and, for each Region, create the Index column by subtracting the position of the row whose ReviewYear equals the StrategyYear from each row number.
library(dplyr)

left_join(Sales, Strategy, by = 'Region') %>%
  arrange(Region, ReviewYear) %>%
  group_by(Region) %>%
  mutate(Index = row_number() - match(first(StrategyYear), ReviewYear))
# Region ReviewYear Sales Index StrategyYear
# <chr> <int> <dbl> <int> <int>
#1 Africa 2008 0.4 -1 2013
#2 Africa 2013 3.5 0 2013
#3 Africa 2017 9.7 1 2013
#4 SouthAsia 2006 1.5 -2 2011
#5 SouthAsia 2009 4.5 -1 2011
#6 SouthAsia 2011 11 0 2011
#7 SouthAsia 2014 16.7 1 2011
data
Sales <- structure(list(Region = c("SouthAsia", "SouthAsia", "SouthAsia",
"SouthAsia", "Africa", "Africa", "Africa"), ReviewYear = c(2006L,
2009L, 2011L, 2014L, 2008L, 2013L, 2017L), Sales = c(1.5, 4.5,
11, 16.7, 0.4, 3.5, 9.7), Index = c(NA, NA, 0L, NA, NA, 0L, NA
)), class = "data.frame", row.names = c(NA, -7L))
Strategy <- structure(list(Region = c("SouthAsia", "Africa", "Japan", "SEAsia"
), StrategyYear = c(2011L, 2013L, 2007L, 2009L)), class = "data.frame",
row.names = c(NA, -4L))
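A base R alternative (a sketch, assuming the Sales and Strategy data above): look up each region's strategy year, flag the matching row, and count positions relative to it with ave():
# named lookup vector: region -> strategy year
strat <- setNames(Strategy$StrategyYear, Strategy$Region)
# within each region, row position minus the position of the flagged (strategy-year) row
Sales$Index <- ave(
  as.integer(Sales$ReviewYear == strat[Sales$Region]),
  Sales$Region,
  FUN = function(flag) seq_along(flag) - which(flag == 1)[1]
)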

How to specify ID variables in dcast?

Following is a sample of the data I have
datahave
# A tibble: 6 x 6
YEAR SCHOOL_NAME CONTENT_AREA BELOW_BASIC_PCT BASIC_PCT ADVANCED_PCT
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2015 5TH AND 6TH GRADE CTR. Eng. Language Arts 38.1 28.3 10.1
2 2015 5TH AND 6TH GRADE CTR. Mathematics 39 30.3 14.6
3 2015 5TH AND 6TH GRADE CTR. Science 25.4 41.7 12.3
4 2015 6TH GRADE CENTER Eng. Language Arts 7.6 27.8 21.8
5 2015 6TH GRADE CENTER Mathematics 19.100000000000001 37.700000000000003 17.5
6 2015 7th and 8th Grade Center Eng. Language Arts 52.1 27.4 1.7
Following is a reproducible example similar to this
school <- c("A", "A", "A", "B", "B", "B")
content_area <- c("english", "math", "science", "english", "math", "science")
below_basic <- c(20, 30, 40, 10, 15, 20)
advanced <- c(2, 5, 3, 1, 2.5, 1.5)
df <- data.frame(school, content_area, below_basic, advanced)
df
and ran the following code on the above
library(reshape2)
dcast(melt(df), school ~ content_area + variable)
This gives me the desired output because melt uses school and content_area as id variables (it prints "Using school, content_area as id variables").
However when I run the same code on the original dataset
dcast(melt(datahave), SCHOOL_NAME ~ CONTENT_AREA + variable)
melt actually uses SCHOOL_NAME, CONTENT_AREA, BELOW_BASIC_PCT, BASIC_PCT, ADVANCED_PCT as id variables.
How do I specify which columns should be used as the ID variables, so that I get output similar to the reproducible example?
We can specify the id.var in melt; otherwise it automatically picks the id variables based on column type (in the original data the *_PCT columns are character, so they are all treated as id variables).
library(reshape2)

dcast(melt(datahave, id.var = c("YEAR", "SCHOOL_NAME", "CONTENT_AREA")),
      SCHOOL_NAME ~ CONTENT_AREA + variable)
# SCHOOL_NAME Eng. Language Arts_BELOW_BASIC_PCT Eng. Language Arts_BASIC_PCT
#1 5TH AND 6TH GRADE CTR. 38.1 28.3
#2 6TH GRADE CENTER 7.6 27.8
#3 7th and 8th Grade Center 52.1 27.4
# Eng. Language Arts_ADVANCED_PCT Mathematics_BELOW_BASIC_PCT Mathematics_BASIC_PCT Mathematics_ADVANCED_PCT
#1 10.1 39.0 30.3 14.6
#2 21.8 19.1 37.7 17.5
#3 1.7 NA NA NA
# Science_BELOW_BASIC_PCT Science_BASIC_PCT Science_ADVANCED_PCT
#1 25.4 41.7 12.3
#2 NA NA NA
#3 NA NA NA
The melt/dcast wrapper is recast(), which can be used as well:
recast(datahave, id.var = c("YEAR", "SCHOOL_NAME", "CONTENT_AREA"),
       SCHOOL_NAME ~ CONTENT_AREA + variable)
data
datahave <- structure(list(YEAR = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L
), SCHOOL_NAME = c("5TH AND 6TH GRADE CTR.", "5TH AND 6TH GRADE CTR.",
"5TH AND 6TH GRADE CTR.", "6TH GRADE CENTER", "6TH GRADE CENTER",
"7th and 8th Grade Center"), CONTENT_AREA = c("Eng. Language Arts",
"Mathematics", "Science", "Eng. Language Arts", "Mathematics",
"Eng. Language Arts"), BELOW_BASIC_PCT = c(38.1, 39, 25.4, 7.6,
19.1, 52.1), BASIC_PCT = c(28.3, 30.3, 41.7, 27.8, 37.7, 27.4
), ADVANCED_PCT = c(10.1, 14.6, 12.3, 21.8, 17.5, 1.7)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
