Scatter plot with variables that have multiple different years - r

I'm trying to make a scatter plot of child mortality rate against child labor. My problem is that I don't have much data: some countries only have values for some years and other countries only for other years, so I can't plot all the data together, and no single year has enough observations to restrict the plot to that year. I was wondering whether there is a function that takes the last value available in the dataset for a given variable. For instance, if my last child labor value for Germany is from 2015 and my last value for Italy is from 2014, and so on for the rest of the countries, is there a way to plot the last value for each country?
Code goes like this:
head(data2)
# A tibble: 6 x 5
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AFG 1962 34.5 NA
2 Afghanistan AFG 1963 33.9 NA
3 Afghanistan AFG 1964 33.3 NA
4 Afghanistan AFG 1965 32.8 NA
5 Afghanistan AFG 1966 32.2 NA
6 Afghanistan AFG 1967 31.7 NA
Never mind about those NA's. Labor data just doesn't go back there. But I do have it in the dataset, for more recent years. Child mortality data, on the other hand, is actually pretty complete.
Thanks.

I can't tell which variable you want to plot, but the following code selects only the last year for each country.
data2 %>%
group_by(Entity) %>%
filter(Year == max(Year)) %>%
ungroup
The result looks like this:
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <lgl>
1 Afghanistan AFG 1967 31.7 NA
Now you can plot some variable.

You might want to define what you mean by 'last' value per group - as in most recent, last occurrence in the data or something else?
dplyr::last picks out the last occurrence in the data, so you could use it along with arrange to order your data. In this example we sort the data by Year (ascending order by default), so the last observation will be the most recent. Assuming you don't want to include NA values, we also use filter to remove them from the data.
data2 %>%
# first remove NAs from the data
filter(
!is.na(labor)
) %>%
# then sort the data by Year
arrange(Year) %>%
# then extract the last observation per country
group_by(Entity) %>%
summarise(
last_record = last(labor)
)
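As a minimal sketch of how the pieces above could lead to the scatter plot itself: keep only rows where both variables are present, take the most recent such row per country, and plot those rows. The `toy` data frame and its values are hypothetical stand-ins for `data2`, and `slice_max()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same columns as data2 in the question
toy <- data.frame(
  Entity    = c("Germany", "Germany", "Germany", "Italy", "Italy", "Italy"),
  Code      = c("DEU", "DEU", "DEU", "ITA", "ITA", "ITA"),
  Year      = c(2013, 2014, 2015, 2013, 2014, 2015),
  mortality = c(4.1, 4.0, 3.9, 3.8, 3.7, 3.6),
  labor     = c(NA, 1.2, 1.1, 2.5, 2.3, NA)
)

latest <- toy %>%
  # keep rows where both plotted variables are available
  filter(!is.na(labor), !is.na(mortality)) %>%
  group_by(Entity) %>%
  # most recent remaining year per country
  slice_max(Year, n = 1, with_ties = FALSE) %>%
  ungroup()

# One point per country, taken from that country's latest complete year
p <- ggplot(latest, aes(x = labor, y = mortality, label = Code)) +
  geom_point() +
  geom_text(vjust = -0.8)
```

Because the filter requires both variables, the year plotted can differ between countries, which is what the question asks for.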

Related

ggplot doesn't arrange my graph as expected [duplicate]

This question already has answers here:
ggplot2: sorting a plot
(5 answers)
How to force specific order of the variables on the X axis?
(1 answer)
Closed last month.
Good morning,
I'm trying to use ggplot with a data frame, but I've run into an issue: ggplot doesn't take into account the arrange() call on my data frame.
Here is my code :
data()
pop <- population[population$year == 1995, ]
pop <- pop[1:10, ]
pop %>%
ggplot(aes(x = country, y = population)) +
geom_point()
pop <- pop %>%
arrange(population)
pop %>%
ggplot(aes(x = country, y = population)) +
geom_point()
I would like my graph to be ordered by population: first the country with the lowest population, then the country with the second lowest, and so on. But ggplot doesn't order my graph as expected.
I have this data frame :
country year population
<chr> <int> <int>
1 Anguilla 1995 9807
2 American Samoa 1995 52874
3 Andorra 1995 63854
4 Antigua and Barbuda 1995 68349
5 Armenia 1995 3223173
6 Albania 1995 3357858
7 Angola 1995 12104952
8 Afghanistan 1995 17586073
9 Algeria 1995 29315463
10 Argentina 1995 34833168
But my graph is ordered alphabetically:
Do you have any idea how to order it by population instead?
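A possible sketch of the usual fix: ggplot2 places a character x variable in alphabetical order regardless of row order, so the ordering has to be carried by a factor. `reorder()` builds such a factor from the population values. The `pop` data frame below is a hypothetical subset of the one shown in the question.

```r
library(ggplot2)

# Hypothetical subset of the data frame shown in the question
pop <- data.frame(
  country    = c("Anguilla", "Andorra", "Armenia", "Albania", "Afghanistan"),
  population = c(9807, 63854, 3223173, 3357858, 17586073)
)

# reorder() returns a factor whose levels are sorted by population,
# and ggplot2 follows the factor levels on the x axis
p <- ggplot(pop, aes(x = reorder(country, population), y = population)) +
  geom_point() +
  labs(x = "country")

# The level order that the x axis will follow
country_levels <- levels(reorder(pop$country, pop$population))
```

With this approach the preceding `arrange()` call is not needed for the plot.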

In r, how do I add rows together to get totals for a specific set of variables [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 1 year ago.
My goal is to have a list of how much FDI China sent to each country per year. At the moment I have a list of individual projects that looks like this
Year Country Amount
2001 Angola 6000000
2001 Angola 8000000
2001 Angola 5.0E7
I want to sum it so it looks like this.
Year Country Amount
2001 Angola 6.4E7
How do I merge the rows and add the totals to get nice country-year data? I can't find an R command that does this precise thing.
library(tidyverse)
I copied the data table and read your dataframe into R using:
df <- clipr::read_clip_tbl(clipr::read_clip())
I like using dplyr to solve this question:
df2 <- as.data.frame(df %>% group_by(Country,Year) %>% summarize(Amount=sum(Amount)))
# A tibble: 1 x 3
# Groups: Country [1]
Country Year Amount
<chr> <int> <dbl>
1 Angola 2001 64000000
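For readers who cannot paste from the clipboard, here is a self-contained sketch of the same grouping step. The data frame is typed in by hand from the three rows in the question, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# The three project rows from the question, entered by hand
df <- data.frame(
  Year    = c(2001, 2001, 2001),
  Country = c("Angola", "Angola", "Angola"),
  Amount  = c(6000000, 8000000, 5.0E7)
)

totals <- df %>%
  group_by(Country, Year) %>%
  # one row per country-year, with the project amounts added up
  summarise(Amount = sum(Amount), .groups = "drop")
```

Base R offers a similar result with `aggregate(Amount ~ Country + Year, data = df, FUN = sum)`.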

R - Remove all rows but those for most recent date?

I am working on a project dealing with Covid-19 Data. I have data that is updated daily from Our World in Data. The csv file is here: https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv
The data has several columns country, date, cases, etc.
What I am interested in is saving only the most recent row for each country and removing everything else. What would be the best way to go about this?
Currently, my code looks like this. I have recently made the transition to R from another program, so guidance is helpful even if this is a dumb question!
world.data <- read.csv("https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv")
world.data$iso_code <- NULL # Remove Country ISO Code
world.data$date <- as.Date(world.data$date, "%Y-%m-%d")
library(ggplot2)
Here is a solution that uses the tidyverse. We group the data by location and select the maximum value of date.
rawData <- "https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv"
download.file(rawData,"./data/owid_covid_data.csv")
data <- read.csv("./data/owid_covid_data.csv",header = TRUE, stringsAsFactors = FALSE)
library(dplyr)
data %>% group_by(location) %>%
filter(date == max(date)) -> filteredData
...and the first few rows of output:
> head(filteredData[1:4])
# A tibble: 6 x 4
# Groups: location [6]
iso_code location date total_cases
<chr> <chr> <chr> <int>
1 ABW Aruba 2020-04-19 96
2 AFG Afghanistan 2020-04-19 908
3 AGO Angola 2020-04-19 24
4 AIA Anguilla 2020-04-19 3
5 ALB Albania 2020-04-19 548
6 AND Andorra 2020-04-19 704
>
Try something like:
library(tidyverse)
world.data %>% group_by(location) %>% top_n(1,date)
or without the pipe
top_n(group_by(world.data, location), 1, date)
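In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`. A sketch of the equivalent call follows, using a small hypothetical data frame that only borrows the column names of the OWID file.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the OWID file
world.data <- data.frame(
  location    = c("Aruba", "Aruba", "Angola", "Angola"),
  date        = as.Date(c("2020-04-18", "2020-04-19", "2020-04-18", "2020-04-19")),
  total_cases = c(95, 96, 19, 24)
)

latest <- world.data %>%
  group_by(location) %>%
  # keep the single most recent row per country
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Setting `with_ties = FALSE` guarantees one row per country even if a date appears twice.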

Looping through two dataframes and adding columns inside of the loop

I have a problem when specifying a loop with a data frame.
The general idea I have is the following:
I have an area which contains a certain number of raster quadrants. These raster quadrants have been visited irregularly over several years (e.g. from 1950 to 2015).
I have two data frames:
1) a data frame containing the IDs of the rasterquadrants (and one column for the year of first visit of this quadrant):
df1<- as.data.frame(cbind(c("12345","12346","12347","12348"),rep(NA,4)))
df1[,1]<- as.character(df1[,1])
df1[,2]<- as.numeric(df1[,2])
names(df1)<-c("Raster_Q","First_visit")
2) a data frame that contains the information on the visits, ordered first by raster quadrant and then by year. It records which raster quadrant was visited and when.
df2<- as.data.frame(cbind(c(rep("12345",5),rep("12346",7),rep("12347",3),rep(12348,9)),
c(1950,1952,1955,1967,1951,1968,1970,
1998,2001,2014,2015,2017,1965,1986,2000,1952,1955,1957,1965,2003,2014,2015,2016,2017)))
df2[,1]<- as.character(df2[,1])
df2[,2]<- as.numeric(as.character(df2[,2]))
names(df2)<-c("Raster_Q","Year")
I want to know when and how often the full area was 'sampled'.
Scheme of what I want to do; different colors indicate different areas/regions
My rationale:
I sorted the complete data in df2 according to Quadrant and Year. I then match the rasterquadrant in df1 with the name of the rasterquadrant in df2 and the first value of year from df2 is added.
For this I wrote a loop (see below)
In order not to replicate a quadrant I created a vector "visited"
visited<-c()
Every entry of df2 that matches df1 will be written into this vector, so that the second entry of e.g. rasterquadrant "12345" in df2 is ignored in the loop.
Here comes the loop:
visited<- c()
for (i in 1:nrow(df2)){
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"First_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This gives me the first full sampling period.
Raster_Q First_visit
1 12345 1950
2 12346 1968
3 12347 1965
4 12348 1952
However, I want to have all full sampling periods.
So I do:
df1$"Second_visit"<-NA
I reset the visited vector and specify the following loop:
visited <- c()
for (i in 1:nrow(df2)){
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"Second_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
}
Which is basically the same loop as before, however, only making sure that, if df2$"Year" in a certain raster quadrant has already been included in the first visit, then it is skipped.
That gives me the second full sampling period:
Raster_Q First_visit Second_visit
1 12345 1950 NA
2 12346 1968 1970
3 12347 1965 1986
4 12348 1952 2003
Okay, so far so good. I could do that all by hand. But I have loads and loads of rasterquadrants and several areas that can and should be screened in this way.
So doing all of this in a single loop would be really great! However, I realized that this creates a problem because the loop then gets recursive:
A column added inside the loop is not picked up by subsequent iterations, because df1 itself is not re-read on each pass, so the new column for the new sampling period is not included in the following iterations:
visited<- c()
for (i in 1:nrow(df2)){
m<-ncol(df1)
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
#finish "first_visit"
df1[,m+1]<-NA
# add column for "second visit"
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
# make sure that the first visit year are not included
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m+1]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This won't work. Another issue is that the vector visited is not emptied during this loop, so basically every Raster_Q already counts as visited in the second sampling period.
I am stuck.... any ideas?
You can do this without a for loop by using the dplyr and tidyr packages. First, you take your df2 and use dplyr::arrange to order by raster and year. Then you can rank the years visited using the rank function inside of the dplyr::mutate function. Then using tidyr::spread you can put them all in their own columns. Here is the code:
df <- df2 %>%
arrange(Raster_Q, Year) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year),
visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
Here is the output:
> df
# A tibble: 4 x 10
# Groups: Raster_Q [4]
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7 visit_8 visit_9
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 1951 1952 1955 1967 NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017 NA NA
3 12347 1965 1986 2000 NA NA NA NA NA NA
4 12348 1952 1955 1957 1965 2003 2014 2015 2016 2017
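In current tidyr, `spread()` is superseded by `pivot_wider()`. The following is a sketch of the same reshaping with that function, rebuilding `df2` from the question so the block runs on its own. It uses `row_number()` after sorting instead of `rank()`, which is an assumption that ties within a quadrant should still get distinct visit numbers.

```r
library(dplyr)
library(tidyr)

# df2 from the question, built directly with typed columns
df2 <- data.frame(
  Raster_Q = c(rep("12345", 5), rep("12346", 7), rep("12347", 3), rep("12348", 9)),
  Year = c(1950, 1952, 1955, 1967, 1951,
           1968, 1970, 1998, 2001, 2014, 2015, 2017,
           1965, 1986, 2000,
           1952, 1955, 1957, 1965, 2003, 2014, 2015, 2016, 2017)
)

wide <- df2 %>%
  arrange(Raster_Q, Year) %>%
  group_by(Raster_Q) %>%
  # number the visits within each quadrant in chronological order
  mutate(visit = paste0("visit_", row_number())) %>%
  ungroup() %>%
  # one column per visit number, one row per quadrant
  pivot_wider(names_from = visit, values_from = Year)
```

Quadrants with fewer visits get `NA` in the remaining visit columns, as in the output above.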
EDIT: So I think I understand your problem a little better now. You are looking to remove all duplicate visits to each quadrant that happened before the maximum Year of each respective "round" of visits. So to accomplish this, I wrote a short function that in essence does what the code above does, but with a slight change. Here is the function:
filter_by_round <- function(data, round) {
output <- data %>%
arrange(Raster_Q, Year) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(in_round = ifelse(Year <= max(.$Year[.$visit == round]) & visit > round,
TRUE, FALSE)) %>%
filter(!in_round) %>%
select(-c(in_round, visit))
return(output)
}
What this function does, is look through the data and if a given year is less than the max year for the specified "visit round" then it is removed. To apply this only to the first round, you would do this:
df2 %>%
filter_by_round(1) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
which would give you this:
# A tibble: 4 x 8
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017
3 12347 1965 1986 2000 NA NA NA NA
4 12348 1952 2003 2014 2015 2016 2017 NA
However, while it does accomplish what your for loop would have, you now have other occurrences of the same problem. I have come up with a way to do this successfully but it requires you to know how many "visit rounds" you had or some trial and error. To accomplish this, you can use map and assign the change to a global variable.
# I do this so we do not lose the original dataset
df <- df2
# I chose 1:5 after some trial and error showed there are 5 unique
# "visit rounds" in your toy dataset.
# If you overshoot the number, it should still work; you will just get
# warnings about `max` not working correctly.
# However, this may cause issues, so figuring out the exact number is
# recommended.
purrr::map(1:5, function(x){
# this assigns the output of each iteration to the global variable df
df <<- df %>%
filter_by_round(x)
})
# now applying the original transformation to get the spread dataset
df %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
This will give you the following output:
# A tibble: 4 x 6
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA
2 12346 1968 1970 2014 2015 2017
3 12347 1965 1986 NA NA NA
4 12348 1952 2003 2014 2015 2016
Granted, this is probably not the most elegant solution, but it works. Hopefully this solves the problem for you.
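One way to avoid the global assignment with `<<-` is to fold the rounds with base R's `Reduce()`, which passes the result of each round into the next. The sketch below is a rewrite under assumptions, not the code above verbatim: `filter_by_round()` is restated with the cutoff computed in a separate step, and its second argument is renamed to the hypothetical `round_no` so it does not shadow `base::round`.

```r
library(dplyr)

# df2 from the question, built directly with typed columns
df2 <- data.frame(
  Raster_Q = c(rep("12345", 5), rep("12346", 7), rep("12347", 3), rep("12348", 9)),
  Year = c(1950, 1952, 1955, 1967, 1951,
           1968, 1970, 1998, 2001, 2014, 2015, 2017,
           1965, 1986, 2000,
           1952, 1955, 1957, 1965, 2003, 2014, 2015, 2016, 2017)
)

# Restated version of filter_by_round() with the same filtering rule
filter_by_round <- function(data, round_no) {
  ranked <- data %>%
    arrange(Raster_Q, Year) %>%
    group_by(Raster_Q) %>%
    mutate(visit = rank(Year, ties.method = "first")) %>%
    ungroup()
  # latest year at which any quadrant received its round_no-th visit
  cutoff <- max(ranked$Year[ranked$visit == round_no])
  ranked %>%
    # drop later visits that fall inside the round being completed
    filter(!(Year <= cutoff & visit > round_no)) %>%
    select(-visit)
}

# Apply rounds 1 to 5 in sequence, starting from df2
result <- Reduce(filter_by_round, 1:5, df2)
```

`result` can then be passed to the same ranking and spreading steps as before. The number of rounds still has to be chosen by hand, as in the answer.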

Pass a string argument to a function as dataframe column name in dplyr

I am trying to pass a string variable to a function, to be used as the column name after some data alteration.
Here is the function:
cleandata <- function(df,name){
df <- df %>%
gather(key = 'Year',value = name,X1960:X2015)
df <- df %>%
select(-c(X,Indicator.Name,Indicator.Code))
df$Year <- substr(df$Year,start = 2,stop = 5)
df$Year <- as.factor(df$Year)
return(df)
}
I want to pass a string variable to 'name', and have it as the column name.
The current output of the function is:
> cleandata(lifeexp,'LifeExp')
Source: local data frame [13,888 x 4]
Country.Name Country.Code Year name
(fctr) (fctr) (fctr) (dbl)
1 Aruba ABW 1960 65.56937
2 Andorra AND 1960 NA
3 Afghanistan AFG 1960 32.32851
4 Angola AGO 1960 32.98483
5 Albania ALB 1960 62.25437
6 Arab World ARB 1960 46.84706
7 United Arab Emirates ARE 1960 52.24322
8 Argentina ARG 1960 65.21554
9 Armenia ARM 1960 65.86346
10 American Samoa ASM 1960 NA
.. ... ... ... ...
>
The last column should be 'LifeExp', not name. What am I missing?
Thanks in advance,
Rahul
You want to use gather_ here. See vignette('nse') for an explanation why.
year_cols <- names(df)[grepl('^X\\d{4}$', names(df))]
df %>% gather_('Year', name, year_cols)
The issue is that gather takes an unquoted name for its key and value columns, so you can't pass in a variable name. It will interpret whatever variable name you put there as the unquoted name you want for the value column. This is consistent with the principle that the tidyr functions without underscores are meant for interactive use, and those with underscores should be used when your work is more programmatic.
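In current tidyr, `gather_()` is deprecated, and `pivot_longer()` takes the new column names as strings, so a string argument can be passed straight through. Below is a minimal sketch under that assumption. The `lifeexp` data frame and its values are hypothetical, and the `select()` step from the original function is left out.

```r
library(tidyr)

cleandata <- function(df, name) {
  pivot_longer(
    df,
    cols = matches("^X\\d{4}$"),  # the year columns, e.g. X1960
    names_to = "Year",
    names_prefix = "X",           # strip the leading X from the year
    values_to = name              # string held in `name` becomes the column name
  )
}

# Hypothetical toy input shaped like the data in the question
lifeexp <- data.frame(
  Country.Name = c("Aruba", "Angola"),
  X1960 = c(65.5, 33.0),
  X1961 = c(66.0, 33.4)
)

out <- cleandata(lifeexp, "LifeExp")
```

Here `Year` comes back as a character column. It can be converted with `as.factor()` as in the original function.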

Resources