How to plot a specific row in a dataframe using ggplot [duplicate] - r

I have a dataframe that shows the number of car sales in each country for the years 2000 to 2020. I wish to plot a line graph showing how the number of car sales has changed over time for one specific country (row) only, with year on the x axis and sales on the y axis. How would I do this using ggplot?

You perhaps want this
#toy_data
sales
#> Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
#> 2 A 1002 976 746 147 1207 627 157 1481 1885 1908 392
#> 3 B 846 723 1935 176 1083 636 1540 1692 899 607 1446
#> 4 C 1858 139 1250 121 1520 199 864 238 1109 1029 937
#> 5 D 534 1203 1759 553 1765 1784 1410 420 606 467 1391
library(tidyverse)
#for all countries
sales %>% pivot_longer(!Country, names_to = 'year', values_to = 'sales') %>%
mutate(year = as.numeric(year)) %>%
ggplot(aes(x = year, y = sales, color = Country)) +
geom_line()
#for one country
sales %>% pivot_longer(!Country, names_to = 'year', values_to = 'sales') %>%
mutate(year = as.numeric(year)) %>%
filter(Country == 'A') %>%
ggplot(aes(x = year, y = sales)) +
geom_line()
Created on 2021-06-07 by the reprex package (v2.0.0)
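The `sales` object above is only shown as printed output, so here is a minimal sketch for rebuilding part of it and running the single-country pipeline end to end. It assumes the year columns are stored under non-syntactic names such as `2000` (hence `check.names = FALSE`) and copies only the first three year columns; `sales_long`, `sales_a` and `p` are names chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data: first three year columns copied from the printed output above.
# check.names = FALSE keeps the column names as "2000", "2001", "2002".
sales <- data.frame(
  Country = c("A", "B", "C", "D"),
  `2000` = c(1002, 846, 1858, 534),
  `2001` = c(976, 723, 139, 1203),
  `2002` = c(746, 1935, 1250, 1759),
  check.names = FALSE
)

# Long format: one row per country-year pair, with year as a number
sales_long <- sales %>%
  pivot_longer(!Country, names_to = "year", values_to = "sales") %>%
  mutate(year = as.numeric(year))

# Keep a single country and build its line plot
sales_a <- filter(sales_long, Country == "A")
p <- ggplot(sales_a, aes(x = year, y = sales)) +
  geom_line()
p
```

Filtering before or after `pivot_longer()` should give the same plot; filtering first just reshapes fewer rows.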

Suppose you have a data frame that looks like this:
#make dummy df
df <- matrix(sample(1:100, 63), ncol=21, nrow=3)
rownames(df) <- c("UK", "US", "UAE")
colnames(df) <- 2000:2020
Here I generated some random data for 21 years between 2000 and 2020, and for three countries. To get a line plot with ggplot for UK, I did:
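library(ggplot2)
data_uk <- data.frame(year=colnames(df), sales=df["UK",], row.names=NULL)
# year is stored as text here, so group=1 tells geom_line() to join all points into one line
ggplot(data=data_uk, aes(x=year, y=sales, group=1)) + geom_point() + geom_line()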
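A possible variant of the same idea, assuming you would rather treat year as a number: converting the column names with `as.numeric()` gives a continuous x axis, and `group = 1` is then no longer needed. The `set.seed(1)` call and the name `p_uk` are additions for this sketch only.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the random toy data is reproducible
df <- matrix(sample(1:100, 63), ncol = 21, nrow = 3)
rownames(df) <- c("UK", "US", "UAE")
colnames(df) <- 2000:2020

# Numeric years give a continuous x axis, so geom_line() connects
# the points without an explicit group aesthetic
data_uk <- data.frame(
  year = as.numeric(colnames(df)),
  sales = df["UK", ],
  row.names = NULL
)

p_uk <- ggplot(data_uk, aes(x = year, y = sales)) +
  geom_point() +
  geom_line()
p_uk
```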

Related

How to make exploratory plots using only certain rows in a column

I am making some exploratory plots to analyze zone M. I need one that plots Distance over time and another with Distance vs. MHT.
Here is what I have so far:
library(ggplot2)
ggplot(datmarsh, aes(x=Year, y=Distance)) + geom_point()
ggplot(datmarsh, aes(x=MHT, y=Distance)) + geom_point()
What I'm struggling with is specifying only zone "M" in each of these graphs.
Here is a sample of what my data looks like:
Year Distance MHT Zone
1975 253.1875 933 M
1976 229.75 877 M
1977 243.8125 963 M
1978 243.8125 957 M
1975 103.5 933 P
1976 150.375 877 P
1977 117.5625 963 P
1978 131.625 957 P
1979 145.6875 967 P
1975 234.5 933 PP
1976 314.1875 877 PP
1977 248.5625 963 PP
1978 272 957 PP
1979 290.75 967 PP
Thanks!
dplyr::filter() will let you do what you need. However, this has probably been answered elsewhere a few times, so do try searching!
library(dplyr)
library(ggplot2)
library(magrittr)
datmarsh %>%
filter(Zone == "M") %>%
ggplot(aes(x=Year, y=Distance)) +
geom_point()
datmarsh %>%
filter(Zone == "M") %>%
ggplot(aes(x=MHT, y=Distance)) +
geom_point()
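If you prefer not to load dplyr just for this, base R's `subset()` should do the same job. A minimal sketch, using a few rows copied from the question's sample; `zone_m`, `p_time` and `p_mht` are names made up for illustration.

```r
library(ggplot2)

# A few sample rows copied from the question (zones M and P only)
datmarsh <- data.frame(
  Year = c(1975, 1976, 1977, 1978, 1975, 1976),
  Distance = c(253.1875, 229.75, 243.8125, 243.8125, 103.5, 150.375),
  MHT = c(933, 877, 963, 957, 933, 877),
  Zone = c("M", "M", "M", "M", "P", "P")
)

# Base R equivalent of filter(Zone == "M")
zone_m <- subset(datmarsh, Zone == "M")

# Both exploratory plots reuse the same filtered data frame
p_time <- ggplot(zone_m, aes(x = Year, y = Distance)) + geom_point()
p_mht  <- ggplot(zone_m, aes(x = MHT, y = Distance)) + geom_point()
```

Storing the filtered data once avoids repeating the filter for every plot.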

gather() with two key columns

I have a dataset that has two rows of data, and want to tidy them using something like gather() but don't know how to mark both as key columns.
The data looks like:
Country US Canada US
org_id 332 778 920
02-15-20 25 35 54
03-15-20 30 10 60
And I want it to look like
country org_id date purchase_price
US 332 02-15-20 25
Canada 778 02-15-20 35
US 920 02-15-20 54
US 332 03-15-20 30
Canada 778 03-15-20 10
US 920 03-15-20 60
I know gather() can move the country row to a column, for example, but is there a way to move both the country and org_id rows to columns?
It is not a good idea to have duplicate column names in the data so I'll rename one of them.
names(df)[4] <- 'US_1'
gather has been retired and replaced with pivot_longer.
This is not a traditional reshape, because the data in the first row needs to be treated differently from the rest of the rows. We can therefore reshape the two parts separately and combine the results into one final dataframe.
library(dplyr)
library(tidyr)
df1 <- df %>% slice(-1L) %>% pivot_longer(cols = -Country)
df %>%
slice(1L) %>%
pivot_longer(-Country, values_to = 'org_id') %>%
select(-Country) %>%
inner_join(df1, by = 'name') %>%
rename(Country = name, date = Country) -> result
result
# Country org_id date value
# <chr> <int> <chr> <int>
#1 US 332 02-15-20 25
#2 US 332 03-15-20 30
#3 Canada 778 02-15-20 35
#4 Canada 778 03-15-20 10
#5 US_1 920 02-15-20 54
#6 US_1 920 03-15-20 60
data
df <- structure(list(Country = c("org_id", "02-15-20", "03-15-20"),
US = c(332L, 25L, 30L), Canada = c(778L, 35L, 10L), US = c(920L,
54L, 60L)), class = "data.frame", row.names = c(NA, -3L))
First, we paste together Country and org_id
library(tidyverse)
data <- set_names(data, paste(names(data), data[1,], sep = "-"))
data
Country-org_id US-332 Canada-778 US-920
1 org_id 332 778 920
2 02-15-20 25 35 54
3 03-15-20 30 10 60
Then, we drop the first row, pivot the table and separate the column name.
df <- data %>%
slice(2:n()) %>%
rename(date = `Country-org_id`) %>%
pivot_longer(cols = -date, values_to = "price") %>%
separate(col = name, into = c("country", "org_id"), sep = "-")
df
# A tibble: 6 x 4
date country org_id price
<chr> <chr> <chr> <int>
1 02-15-20 US 332 25
2 02-15-20 Canada 778 35
3 02-15-20 US 920 54
4 03-15-20 US 332 30
5 03-15-20 Canada 778 10
6 03-15-20 US 920 60

R: fill out a new column based on several variables in another dataset

I have a first dataframe with 4 columns (ID, Year, X and Y)
Id Year X Y
1 2017 20_24
1 2016 45_49
2 2017 30_34
2 2014 20_24
4 2014 14_19
4 2015 20_24
I would like to fill out the Y column using another dataset.
The second dataset has the same variables ID and Year; its other columns are named after the values of column X in the first dataset.
Id Year 14_19 20_24 30_34 45_49
1 2017 123 122 5555 4444
1 2016 456 543 8888 333
1 2015 5644 0908 0987 5456
1 2014 5642 767 233 323
2 2017 123 123 5666 989
2 2016 456 876 55 45
2 2015 786 789 324 77
2 2014 633 543 334 34
3 2017 123 123 321 44
3 2016 456 345 45645 23
3 2015 876 4556 6554 23
So I would like Y to be filled in where ID, Year and the value of X match the corresponding column of the second dataset. How is this possible?
Thanks !
Try this dplyr and tidyr solution:
library(dplyr)
library(tidyr)
result <- df2 %>%
gather("X", "Y", -c("ID", "Year")) %>%
right_join(df1, by = c("ID", "Year", "X"))
Or with the use of pivot_longer()
result <- df2 %>%
pivot_longer(cols = -c(ID, Year),
names_to = "X",
values_to = "Y") %>%
right_join(df1, by = c("ID", "Year", "X"))
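To make the reshape-then-join idea concrete, here is a small self-contained sketch built from a few rows of the question's tables. It assumes the key column is called `Id`, as in the printed sample (the code above spells it `ID`), and that the age-band columns keep their non-syntactic names via `check.names = FALSE`.

```r
library(dplyr)
library(tidyr)

# First data set: Y is still to be filled in (rows taken from the question)
df1 <- data.frame(
  Id = c(1, 1, 2),
  Year = c(2017, 2016, 2017),
  X = c("20_24", "45_49", "30_34")
)

# Second data set: one column per possible value of X (rows from the question)
df2 <- data.frame(
  Id = c(1, 1, 2),
  Year = c(2017, 2016, 2017),
  `14_19` = c(123, 456, 123),
  `20_24` = c(122, 543, 123),
  `30_34` = c(5555, 8888, 5666),
  `45_49` = c(4444, 333, 989),
  check.names = FALSE
)

# Stack every age-band column into X/Y pairs, then keep only the
# Id/Year/X combinations that occur in df1
result <- df2 %>%
  pivot_longer(cols = -c(Id, Year), names_to = "X", values_to = "Y") %>%
  right_join(df1, by = c("Id", "Year", "X"))
result
```

Rows of `df1` with no match in `df2` would still appear in the result, with `Y` set to `NA`, because of the right join.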

How can I store very small numbers in a list after looping through values in R?

I have analyzed current data (ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide) from the European Centre for Disease Prevention and Control, which keeps track of COVID-19 cases across months and countries. I would like to gain insights into the spread of active cases and into how they relate to deaths from the disease. My goal: to create a variable that stores the percentage of deaths per total infections for every day in March, split by country.
Here is my code:
library(readxl)
library(dplyr) # needed for %>% and filter() below
d <- read_excel("C:/Users/hanna/Downloads/COVID-19-geographic-disbtribution-worldwide.xlsx")
#View(d)
corona_de <- d %>% filter(`Countries and territories` == "Germany" & Month == 3)
# explore the data
library(skimr)
skim(corona_de)
library(ggplot2)
ggplot(corona_de, aes (x = Day, y = Cases)) +
geom_line(color = "red")+ theme_classic()
# Germany, England, Spain, Italy, France, Austria
corona <- d %>% filter(`Countries and territories` == "Germany" |
`Countries and territories` == "France" |
`Countries and territories` == "Italy" |
`Countries and territories` == "Spain") #filter for month later %>% filter(Month == 3)
#----------------------------------------------------------------
# Preprocess data and create cumulative and percent variables
#----------------------------------------------------------------
# format dates
library(lubridate)
corona$DateRep <-as.Date(corona$DateRep,"%Y-%m-%d UTC")
# store in list for later
dates <- corona_march$DateRep
# store countries list to loop through
countries <- unique(corona$`Countries and territories`)
#create empty objects
active_cases<- NULL
deaths_cum <- NULL
active_percent <- NULL
death_percent <- NULL
#loop through number of countries
for (c in 1:4){
current_country <- subset(corona_march, `Countries and territories` == countries[c])
# loop through days of March
for (i in 25:1){
# get new cases, deaths and population size for that day
current_interval = current_country %>% filter(DateRep >= dates[i])
current_case = current_interval %>% select(Cases)
current_death = current_interval %>% select(Deaths)
pop = current_country %>% filter(DateRep == dates[i]) %>% select(Pop_Data.2018)
# calculate cumulative cases, deaths and percent active
active_cum = sum(current_case$Cases)
percent_active = active_cum / pop[[1]]
cum_death = sum(current_death)
# avoid scientific notation
options("scipen"=100, "digits"=7)
percent_death = cum_death / pop[[1]]
# store variables in list
active_cases <- append(active_cases,active_cum)
deaths_cum <- append(deaths_cum,cum_death)
active_percent <- append(active_percent,percent_active)
percent_death <- append(death_percent, percent_death)
}
}
Surprisingly, everything works fine except for the percent_death variable. For the cumulative deaths, the output looks like this:
deaths_cum
[1] 1098 1098 1098 1097 1096 1096 1093 1091 1090 1081 1070 1067 1052 1039 1021 1009 973
[18] 952 925 856 728 650 538 426 240 149 149 149 149 149 149 149 149 149
[35] 149 147 147 146 144 144 141 137 136 136 136 106 104 82 55 23 6799
[52] 6791 6785 6768 6740 6713 6672 6623 6587 6454 6356 6189 5993 5804 5552 5379 5009 4662
[69] 4315 3842 3413 2788 1993 1344 743 2696 2696 2696 2696 2696 2695 2693 2691 2691 2691
[86] 2668 2661 2649 2612 2575 2560 2408 2387 2205 2098 1929 1694 1370 976 514
But for the percent_death variable, it seems to stop after 1 iteration:
> percent_death
[1] 0.00001100083
Any idea what happened? Why does the append function work for all of the variables except for small numbers? Is there a smarter way to do it?
I have now found a neat way to perform the computation with data.table which I would not have found without initial advice by Gregor Thomas.
In short: I changed the object names to more distinguishable ones, avoided scientific notation for small values using options(), and used data.table instead of nested loops to reduce complexity and make the code cleaner.
corona <- d %>% filter(`Countries and territories` %in% c("Germany", "France", "Italy", "Spain") )
corona_march <- corona %>% filter(Month == 3)
library(data.table)
corona_table <- data.table(corona_march)
# sort by day so that cumsum() runs chronologically within each country
setorderv(corona_table, c("Countries and territories", "Day"))
# for each country, calculate the cumulative cases and cumulative deaths
# (grouping by Day as well would leave one row per group, so cumsum() would have no effect)
corona_table[, Active_cases := cumsum(Cases), by = .(`Countries and territories`)]
corona_table[, Deaths_cumulative := cumsum(Deaths), by = .(`Countries and territories`)]
# avoid scientific notation for percentages
options("scipen"=100, "digits"=7)
# these are row-wise ratios, so no grouping is needed
corona_table[, Percent_active := Active_cases/Pop_Data.2018]
corona_table[, Percent_death := Deaths_cumulative/Active_cases]
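As for why the original loop produced only one value: the line in the question assigns to `percent_death` but appends to `death_percent`, which is never updated and so stays `NULL`. The small size of the numbers is not the cause. A minimal sketch of that pattern, using made-up numbers rather than the ECDC data (`ratios`, `buggy` and `fixed` are hypothetical names):

```r
# Hypothetical values standing in for the per-day death ratios
ratios <- c(0.1, 0.2, 0.3)

# Pattern from the question: the accumulator that is read (death_percent)
# is never assigned, so each pass overwrites percent_death with one value
death_percent <- NULL
percent_death <- NULL
for (r in ratios) {
  percent_death <- r
  percent_death <- append(death_percent, percent_death)
}
buggy <- percent_death

# Corrected pattern: read from and write to the same accumulator
death_percent <- NULL
for (r in ratios) {
  death_percent <- append(death_percent, r)
}
fixed <- death_percent

length(buggy)  # a single element
length(fixed)  # one element per iteration
```

This is presumably what "more distinguishable object names" guards against: the scalar for the current day and the growing vector differed only in word order.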

100% stacked area using ggplot

I have two data frames: LF and HF
head(LF)
Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 386.18 1164.3966 4586 12.30089 5285 14.23955 6707 18.17906
2 2010 268.72 884.9963 4354 13.37728 4927 15.20045 6078 18.81523
3 2011 347.61 746.7686 6924 12.25466 7917 13.84788 9302 16.93291
4 2012 170.68 1218.6758 2471 16.39350 3006 19.60066 3670 24.18561
head(HF)
Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 184.44 4055.367 535 11.53037 621 13.50632 1175 25.82282
2 2010 118.08 2726.272 737 14.44196 868 16.92781 1236 24.56522
3 2011 119.90 2208.308 663 10.19803 742 11.42253 1086 17.36818
4 2012 554.07 11913.003 2413 45.44719 2781 52.90863 4290 85.87746
5 2013 165.32 5926.628 424 15.93962 461 17.16547 873 31.70556
The following relationship holds for the data frames above: LF$SS + HF$SS = total load
I want to plot the proportion (%) of LF and HF for each column variable, using the two data frames, as a 100% stacked area chart.
Your help would be appreciated
Here is an approach:
library(tidyverse)
lf %>%
mutate(col = "lf") %>% #add column to lf specifying the data frame
bind_rows(hf %>% #bind rows of hf
mutate(col = "hf")) %>% #add column to hf specifying the data frame
gather(key, value, 2:9) %>% #convert to long format
group_by(key, Year) %>% #group by variable and year
mutate(ratio = value/sum(value)) %>% #calculate the desired ratio
ggplot()+
geom_area(aes(x = Year, y = ratio, fill = col)) + #pretty much self explanatory
facet_wrap(~key) +
scale_y_continuous(labels = scales::percent)
data:
lf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 386.18 1164.3966 4586 12.30089 5285 14.23955 6707 18.17906
2 2010 268.72 884.9963 4354 13.37728 4927 15.20045 6078 18.81523
3 2011 347.61 746.7686 6924 12.25466 7917 13.84788 9302 16.93291
4 2012 170.68 1218.6758 2471 16.39350 3006 19.60066 3670 24.18561", header = T)
hf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 184.44 4055.367 535 11.53037 621 13.50632 1175 25.82282
2 2010 118.08 2726.272 737 14.44196 868 16.92781 1236 24.56522
3 2011 119.90 2208.308 663 10.19803 742 11.42253 1086 17.36818
4 2012 554.07 11913.003 2413 45.44719 2781 52.90863 4290 85.87746", header = T)
I have removed the last row from hf so it matches the number of rows in lf
My answer doesn't differ much from @missuse's, except that it skips the need to calculate proportions.
For ggplot, you generally want data in long shape, so after binding the two data frames and marking which data frame observations come from (creating the type column in mutate), you should gather the data. In geom_area, using position = position_fill() calculates proportions within each facet, rather than you needing to do this manually.
library(tidyverse)
lf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 386.18 1164.3966 4586 12.30089 5285 14.23955 6707 18.17906
2 2010 268.72 884.9963 4354 13.37728 4927 15.20045 6078 18.81523
3 2011 347.61 746.7686 6924 12.25466 7917 13.84788 9302 16.93291
4 2012 170.68 1218.6758 2471 16.39350 3006 19.60066 3670 24.18561", header = T)
hf <- read.table(text = "Year SS SS_CQT SRP SRP_CQT TDP TDP_CQT TP TP_CQT
1 2009 184.44 4055.367 535 11.53037 621 13.50632 1175 25.82282
2 2010 118.08 2726.272 737 14.44196 868 16.92781 1236 24.56522
3 2011 119.90 2208.308 663 10.19803 742 11.42253 1086 17.36818
4 2012 554.07 11913.003 2413 45.44719 2781 52.90863 4290 85.87746", header = T)
df <- bind_rows(
lf %>% mutate(type = "LF"),
hf %>% mutate(type = "HF")
) %>%
gather(key = measure, value = value, -Year, -type)
ggplot(df, aes(x = Year, y = value, fill = type)) +
geom_area(position = position_fill()) +
facet_wrap(~ measure) +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = c(HF = "darkorange", LF = "slateblue"))
Created on 2018-05-20 by the reprex package (v0.2.0).
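As a sanity check on what the filled areas represent, the shares for a single measure can be computed by hand. This sketch uses only the SS column for 2009 to 2012 from the data above and assumes, as the question states, that LF plus HF is the total load; the vector names are made up for illustration. The heights drawn by `position_fill()` should correspond to these shares.

```r
# SS values for 2009-2012, copied from the lf and hf tables above
lf_ss <- c(`2009` = 386.18, `2010` = 268.72, `2011` = 347.61, `2012` = 170.68)
hf_ss <- c(`2009` = 184.44, `2010` = 118.08, `2011` = 119.90, `2012` = 554.07)

# Total load per year, then each source's share of that total
total_ss <- lf_ss + hf_ss
share_lf <- lf_ss / total_ss
share_hf <- hf_ss / total_ss

# Shares in percent, rounded for display
round(100 * share_lf, 1)
round(100 * share_hf, 1)
```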
