Complex Grouped Bar Chart - r

I would really like to learn how to use R, but I'm still struggling with basic things. I need to make a bar graph where the columns are grouped by four variables. This is a simplified version of my data:
REGION AREA AGE LOCALS FOREIGNER
1 USA CITY OLD 30.7485876 3.5254237
2 USA CITY YOUNG 51.1666667 1.1666667
3 USA COUNTRY OLD 6.1666667 1.8333333
4 USA COUNTRY YOUNG 14.0000000 2.5000000
5 EUROPE CITY OLD 4.5000000 8.8333333
6 EUROPE CITY YOUNG 0.6680672 18.7044818
7 EUROPE COUNTRY OLD 56.5000000 0.8333333
8 EUROPE COUNTRY YOUNG 59.8333333 0.6666667
9 ASIA CITY OLD 28.6666667 6.1666667
10 ASIA CITY YOUNG 25.8333333 7.3333333
11 ASIA COUNTRY OLD 3.0494232 18.1195224
12 ASIA COUNTRY YOUNG 2.1666667 21.5000000
And this is the result that I would like to obtain with R (made with Excel):
I've spent a lot of time looking online, but I can only find code for two variables. Could someone help me do this?

Not exactly what you asked for but here goes.
data <- read.table(textConnection("
REGION AREA AGE LOCALS FOREIGNER
1 USA CITY OLD 30.7485876 3.5254237
2 USA CITY YOUNG 51.1666667 1.1666667
3 USA COUNTRY OLD 6.1666667 1.8333333
4 USA COUNTRY YOUNG 14.0000000 2.5000000
5 EUROPE CITY OLD 4.5000000 8.8333333
6 EUROPE CITY YOUNG 0.6680672 18.7044818
7 EUROPE COUNTRY OLD 56.5000000 0.8333333
8 EUROPE COUNTRY YOUNG 59.8333333 0.6666667
9 ASIA CITY OLD 28.6666667 6.1666667
10 ASIA CITY YOUNG 25.8333333 7.3333333
11 ASIA COUNTRY OLD 3.0494232 18.1195224
12 ASIA COUNTRY YOUNG 2.1666667 21.5000000"), header = TRUE)
data <- as.data.frame(data)
library(tidyr)
data <- data %>%
  gather(LOC_FOR, VALUE, -REGION, -AREA, -AGE) # If you want to change the name "LOC_FOR" to something else, do it here.
library(ggplot2)
ggplot(data, aes(x = AGE, y = VALUE, fill = LOC_FOR)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  facet_grid(~REGION + AREA)
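For what it's worth, `gather()` still works, but recent tidyr versions describe it as superseded by `pivot_longer()`. A minimal sketch of the same reshaping step on the first two rows of the data above (the name `wide` is only for illustration):

```r
library(tidyr)

# First two rows of the question's data; `wide` is a hypothetical name.
wide <- data.frame(
  REGION = c("USA", "USA"),
  AREA = c("CITY", "CITY"),
  AGE = c("OLD", "YOUNG"),
  LOCALS = c(30.7485876, 51.1666667),
  FOREIGNER = c(3.5254237, 1.1666667)
)

# Stack LOCALS and FOREIGNER into a key column (LOC_FOR)
# and a value column (VALUE), one row per original cell.
long <- pivot_longer(wide, cols = c(LOCALS, FOREIGNER),
                     names_to = "LOC_FOR", values_to = "VALUE")
print(long)
```

The resulting `long` table has the same columns the ggplot call expects (AGE, VALUE, LOC_FOR, REGION, AREA), so it should be a drop-in replacement for the gathered data.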


Selecting a column with a dot in R (nested object)

I'm new to R and I'm not sure how to rephrase the question, but basically, I have this dataset coming from the following code:
library(dplyr) # provides bind_rows, inner_join, select and %>%
data_url <- 'https://prod-scores-api.ausopen.com/year/2021/stats'
dat <- jsonlite::fromJSON(data_url)
men_aces <- bind_rows(dat$statistics$rankings[[1]]$players[1])
men_aces_table <- dat$players %>%
  inner_join(men_aces, by = c('uuid' = 'player_id')) %>%
  select(full_name, nationality)
Which resulted in this data frame:
full_name nationality.uuid nationality.name nationality.code
1 Novak Djokovic 99da9b29-eade-4ac3-a7b0-b0b8c2192df7 Serbia SRB
2 Alexander Zverev 99d83e85-3173-4ccc-9d91-8368720f4a47 Germany GER
3 Milos Raonic 07779acb-6740-4b26-a664-f01c0b54b390 Canada CAN
4 Daniil Medvedev fa925d2d-337f-4074-a0bd-afddb38d66e1 Russia RUS
5 Nick Kyrgios 9b11f78c-47c1-43c4-97d0-ba3381eb9f07 Australia AUS
nationality is the nested object inside the player object (see the JSON URL); it contains the properties shown above (uuid, name, code). If I select the full_name property, I get its value (of type character) right back.
I'm not sure how to select just the name from that nested data frame (nationality) and rename it to country.
My expected outcome is:
full_name country
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
I would appreciate some help. Sorry I was unclear.
Use purrr::pmap_chr
library(tidyverse)
dat$players %>%
  inner_join(men_aces, by = c('uuid' = 'player_id')) %>%
  select(full_name, nationality) %>%
  mutate(nationality = pmap_chr(nationality, ~ ..2))
full_name nationality
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
11 Aslan Karatsev Russia
12 Taylor Fritz United States of America
13 Matteo Berrettini Italy
14 Grigor Dimitrov Bulgaria
15 Feliciano Lopez Spain
16 Stefanos Tsitsipas Greece
17 Felix Auger-Aliassime Canada
18 Thanasi Kokkinakis Australia
19 Ugo Humbert France
20 Borna Coric Croatia
You could do:
bind_cols(full_name = dat$players$full_name, country = dat$players$nationality$name)
# A tibble: 169 x 2
full_name country
<chr> <chr>
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
Just add this line at the end:
newdf <- data.frame(full_name = men_aces_table$full_name, country = men_aces_table$nationality$name)
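To see why `$nationality$name` works, it may help to rebuild the situation by hand. The sketch below assumes the JSON parser produced a data frame whose `nationality` column is itself a data frame; the `players` object and the uuid values are made up for illustration and are not the real API data.

```r
# Hypothetical stand-in for dat$players: one ordinary column plus
# one column that is itself a data frame (a "nested" column).
players <- data.frame(
  full_name = c("Novak Djokovic", "Milos Raonic"),
  stringsAsFactors = FALSE
)
players$nationality <- data.frame(
  uuid = c("id-1", "id-2"),          # placeholder ids, not real ones
  name = c("Serbia", "Canada"),
  code = c("SRB", "CAN"),
  stringsAsFactors = FALSE
)

# The first `$` returns the inner data frame, the second picks its column.
result <- data.frame(
  full_name = players$full_name,
  country = players$nationality$name,
  stringsAsFactors = FALSE
)
print(result)
```

Printing `players` shows the inner columns as `nationality.uuid`, `nationality.name` and so on, which matches the output in the question. Those dotted names are only how the nested column is displayed, so `select(nationality.name)` has nothing to match. If a flat table is preferred from the start, `jsonlite::fromJSON()` also accepts `flatten = TRUE`.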

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task.
I have a data frame of this type, with 99 different countries for thousands of IDs:
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable giving the macroregion that corresponds to each Nationality,
so I created a matrix of this kind with the rule to follow:
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I tried this command:
datasubset <- merge(dataset , continent.matrix )
but it doesn't work; it reports the following error:
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me. Applying the same code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
             Nationality=c("Italy", "Usa", "France"),
             var1=c("a", "b", "c"),
             var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
                   Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
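One thing to watch for: in the question the data uses "USA" while the lookup table uses "Usa", and an exact join would leave that row without a continent. A base R sketch that avoids this by comparing upper-cased keys with `match()` is shown below; the object names `dataset` and `lookup` are assumptions, and the rows are the ones from the question.

```r
# Small stand-ins for the question's two tables (names are assumptions).
dataset <- data.frame(
  ID = 1:7,
  Nationality = c("Italy", "Eritrea", "Italy", "USA", "France", "France", "Eritrea"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France", "Germany"),
  Continent = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# match() returns, for each row of dataset, the position of its
# nationality in the lookup table (NA if absent). Upper-casing both
# sides makes "USA" and "Usa" compare equal.
idx <- match(toupper(dataset$Nationality), toupper(lookup$Nationality))
dataset$Continent <- lookup$Continent[idx]
print(dataset)
```

Because this only builds one index vector and one new column, it stays small in memory. It is also worth checking that both tables really share the key column name: as far as I know, `merge()` with no common column names falls back to a full cross join, which could explain an allocation error on thousands of rows.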

How do I get the sum of frequency count based on two columns?

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequencies of the medals by Team, while excluding NA as a value? At the same time, the total frequency for each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals per country, ignoring NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count
library(dplyr)
df1 %>%
filter(!is.na(Medal)) %>%
count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)
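Note that the two answers differ slightly: `count()` after filtering drops Singapore entirely, while `colSums(table(...))` keeps it with a total of 0. A small sketch of the base R route that also removes the zero rows, assuming the same six rows as above (built with `data.frame()` here just to keep the example self-contained):

```r
# Same rows as the question's someData.
someData <- data.frame(
  ID = 1:6,
  Team = c("Australia", "Australia", "Australia",
           "United States", "United States", "Singapore"),
  Games = c("1992 Summer", "1994 Summer", "1992 Summer",
            "1991 Winter", "1992 Summer", "1991 Summer"),
  Medal = c(NA, "Gold", "Silver", "Gold", "Bronze", NA),
  stringsAsFactors = FALSE
)

# table() ignores NA medals by default; colSums() adds up the
# Gold/Silver/Bronze rows for each team.
counts <- colSums(table(someData$Medal, someData$Team))

# Keep only teams that won at least one medal.
medal_totals <- counts[counts > 0]
print(medal_totals)
```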

Summarize data using doBy package at region level

I have a dataset Data as below,
Region Country Market Price
EUROPE France France 30.4502
EUROPE Israel Israel 5.14110965
EUROPE France France 8.99665
APAC CHINA CHINA 2.6877232
APAC INDIA INDIA 60.9004
AFME SL SL 54.1729685
LA BRAZIL BRAZIL 56.8606917
EUROPE RUSSIA RUSSIA 11.6843732
APAC BURMA BURMA 63.5881232
AFME SA SA 115.0733685
I would like to summarize the data at Region level and get the sum of Price for every Region.
I want the output to look like the table below.
Data Output
Region Country Price
EUROPE France 30.4502
EUROPE Israel 5.14110965
EUROPE France 8.99665
EUROPE RUSSIA 11.6843732
Europe 56.27233285
APAC BURMA 63.5881232
APAC CHINA 2.6877232
APAC INDIA 60.9004
Apac 127.1762464
AFME BAHARAIN 54.1729685
AFME SA 115.0733685
AFME 169.246337
LA BRAZIL 56.8606917
LA 56.8606917
I have used the summaryBy function of the doBy package; I have tried the code below.
summaryBy
myfun1 <- function(x){c(s=Sum(x)}
DB= summaryBy(Data$Price ~Region + Country , data=Data, FUN=myfun1)
Any help in this regard is very much appreciated.
You can do this by using dplyr to generate a summary table:
library(dplyr)
totals <- data %>% group_by(Region) %>% summarise(Country="",Price=sum(Price))
And then merging the summary with the rest of the data:
summary <- rbind(data[-3], totals)
Then you can sort by Region to put the summary with the region:
summary <- summary %>% arrange(Region)
Output:
Region Country Price
1 AFME SL 54.1730
2 AFME SA 115.0734
3 AFME 169.2463
4 APAC CHINA 2.6877
5 APAC INDIA 60.9004
6 APAC BURMA 63.5881
7 APAC 127.1762
8 EUROPE France 30.4502
9 EUROPE Israel 5.1411
10 EUROPE France 8.9967
11 EUROPE RUSSIA 11.6844
12 EUROPE 56.2723
13 LA BRAZIL 56.8607
14 LA 56.8607
You have to split the data by the Region factor and sum Price within each level:
lapply(split(data, data$Region), function(x) sum(x$Price))
Or, if you need to present result as you have shown:
totals = lapply(split(data, data$Region), function(x) rbind(x,data.frame(Region=unique(x$Region), Country="", Market="", Price=sum(x$Price))))
do.call(rbind, totals)
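If only the per-region totals are needed, as a data frame rather than a list, base R's `aggregate()` is another option. A minimal sketch using the rows from the question (the name `regional` is just for illustration):

```r
# The question's data, without the redundant Market column.
Data <- data.frame(
  Region = c("EUROPE", "EUROPE", "EUROPE", "APAC", "APAC",
             "AFME", "LA", "EUROPE", "APAC", "AFME"),
  Country = c("France", "Israel", "France", "CHINA", "INDIA",
              "SL", "BRAZIL", "RUSSIA", "BURMA", "SA"),
  Price = c(30.4502, 5.14110965, 8.99665, 2.6877232, 60.9004,
            54.1729685, 56.8606917, 11.6843732, 63.5881232, 115.0733685),
  stringsAsFactors = FALSE
)

# One row per Region with the summed Price.
regional <- aggregate(Price ~ Region, data = Data, FUN = sum)
print(regional)
```

The totals here can be used to cross-check the summary rows produced by the other approaches.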

Create a moving sum of past levels of a variable, summed over for each level of 3 other variables, in R

I have a data.frame with the following structure (panel data): 16 levels of time (quarters), 14 levels of geo (countries) and 20 levels of citizen, each repeating accordingly in the data frame.
time geo citizen X
2008Q1 Belgium Afghanistan 22
2008Q1 Belgium Armenia 10
2008Q1 Belgium Bangladesh 25
2008Q1 Belgium Democratic Republic of the Congo 55
2008Q1 Belgium China (including Hong Kong) 5
2008Q1 Belgium Eritrea 8
I would like to create a new column, let's say MOVSUM, that sums X over the previous 4 quarters for each combination of citizen and geo, so that for each quarter t I know how many X's of each citizen in each geo were available during quarters t-4 to t-1.
Thanks in advance.
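One possible base R sketch of such a lagged moving sum is given below. It rests on several assumptions: every geo/citizen pair has one row per quarter with no gaps, the time labels sort correctly as text (as "2008Q1" style labels do), and MOVSUM should be NA until four earlier quarters exist. The helper `prev4_sum` and the toy `panel` data are made up for illustration.

```r
# Sum of the 4 values preceding each position; NA for the first 4.
prev4_sum <- function(x) {
  n <- length(x)
  out <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    if (i > 4) out[i] <- sum(x[(i - 4):(i - 1)])
  }
  out
}

# Toy panel: 8 quarters, one geo, two citizen groups (values invented).
quarters <- paste0(rep(2008:2009, each = 4), "Q", 1:4)
panel <- expand.grid(time = quarters,
                     citizen = c("Armenia", "Eritrea"),
                     geo = "Belgium",
                     stringsAsFactors = FALSE)
panel$X <- c(as.numeric(1:8), 10 * (1:8))

# Sort so that quarters are in order within each geo/citizen group,
# then apply the helper group by group with ave().
panel <- panel[order(panel$geo, panel$citizen, panel$time), ]
panel$MOVSUM <- ave(panel$X, panel$geo, panel$citizen, FUN = prev4_sum)
print(panel)
```

If some quarters are missing for a group, positions no longer correspond to calendar quarters, and the data would need to be completed first (or the window defined on the time values themselves).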
