R - Remove all rows but those for most recent date? - r

I am working on a project dealing with Covid-19 Data. I have data that is updated daily from Our World in Data. The csv file is here: https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv
The data has several columns country, date, cases, etc.
What I am interested in is saving only the most recent row for each country and removing everything else. What would be the best way to go about this?
Currently, my code looks like this. I have recently made the transition to R from another program, so guidance is helpful even if this is a dumb question!
world.data < -read.csv("https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv")
world.data$iso_code < -NULL# Remove Country ISO Code
world.data$date < -as.Date(world.data$date, "%Y-%m-%d")
library(ggplot2)

Here is a solution that uses the tidyverse. We group the data by location and select the maximum value of date.
rawData <- "https://raw.githubusercontent.com/owid/covid-19-data/9ee33ac73942b2e37eb04014bf2a7a17a83998cf/public/data/owid-covid-data.csv"
download.file(rawData,"./data/owid_covid_data.csv")
data <- read.csv("./data/owid_covid_data.csv",header = TRUE, stringsAsFactors = FALSE)
library(dplyr)
data %>% group_by(location) %>%
filter(date == max(date)) -> filteredData
...and the first few rows of output:
> head(filteredData[1:4])
# A tibble: 6 x 4
# Groups: location [6]
iso_code location date total_cases
<chr> <chr> <chr> <int>
1 ABW Aruba 2020-04-19 96
2 AFG Afghanistan 2020-04-19 908
3 AGO Angola 2020-04-19 24
4 AIA Anguilla 2020-04-19 3
5 ALB Albania 2020-04-19 548
6 AND Andorra 2020-04-19 704
>

Try something like:
library(tidyverse)
world.data %>% group_by(location) %>% top_n(1,date)
or without the pipe
top_n(group_by(world.data, location), 1, date)

Related

How to exclude observation that does not appear at least once every year - R

I have a database where companies are identified by an ID (cnpjcei) from 2009 to 2018, where we can have 1 or more observations of a given company in a given year or no observations of a given company in a given year.
Here is a sample of the database:
> df
cnpjcei year
<chr> <dbl>
1 4774 2009
2 4774 2010
3 28959 2009
4 29688 2009
5 43591 2010
6 43591 2010
7 65803 2011
8 105104 2011
9 113980 2012
10 220043 2013
I would like to keep in that df only the companies that appear at least once a year.
What would be the easiest way to do this?
Using the data.table library:
library(data.table)
df<-data.table(df)
df<-df[,unique_years:=length(unique(year)), by=list(cnpjcei),][unique_years==10]
We can use dplyr, group_by id and filter only the cases in which all the elements in 2009:2018 can be found %in% the year column.
Please mind that, for this code to work with the sample database as in the question, the range would have to be replaced with 2009:2013
library(dplyr)
df %>% group_by(cnpjcei) %>% filter(all(2009:2018 %in% year))
You can keep the ids (cnpjcei) which has all the unique years available in the data.
library(dplyr)
result <- df %>%
group_by(cnpjcei) %>%
filter(n_distinct(year) == n_distinct(.$year)) %>%
ungroup

Scatter plot with variables that have multiple different years

I'm currently trying to make a scatter plot of child mortality rate and child labor. My problem is, I don't actually have a lot of data, and some countries may only get values for some years, and some other countries may only have data for some other years, so I can't plot all the data together, nor the data in any year is big enough to limit to that only year. I was wondering if there is a function that takes the last value available in the dataset for any given specified variable. So, for instance, if my last data for child labor from Germany is from 2015 and my last data from Italy is from 2014, and so forth with the rest of the countries, is there a way I can plot the last values for each country?
Code goes like this:
head(data2)
# A tibble: 6 x 5
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AFG 1962 34.5 NA
2 Afghanistan AFG 1963 33.9 NA
3 Afghanistan AFG 1964 33.3 NA
4 Afghanistan AFG 1965 32.8 NA
5 Afghanistan AFG 1966 32.2 NA
6 Afghanistan AFG 1967 31.7 NA
Never mind about those NA's. Labor data just doesn't go back there. But I do have it in the dataset, for more recent years. Child mortality data, on the other hand, is actually pretty complete.
Thanks.
I cannot find which variable to plot, but following code can select only last of each country.
data2 %>%
group_by(Entity) %>%
filter(Year == max(Year)) %>%
ungroup
result is like
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <lgl>
1 Afghanistan AFG 1967 31.7 NA
No you can plot some variable.
You might want to define what you mean by 'last' value per group - as in most recent, last occurrence in the data or something else?
dplyr::last picks out the last occurrence in the data, so you could use it along with arrange to order your data. In this example we sort the data by Year (ascending order by default), so the last observation will be the most recent. Assuming you don't want to include NA values, we also use filter to remove them from the data.
data2 %>%
# first remove NAs from the data
filter(
!is.na(labor)
) %>%
# then sort the data by Year
arrange(Year) %>%
# then extract the last observation per country
group_by(Entity) %>%
summarise(
last_record = last(labor)
)

How to check for each country if date falls within specific interval across rows?

Building on Check if a date is within an interval in R, we want to see if a specific event falls into a timeframe specified by another event. To give you a concrete example: For each country, did event (battle/protests/...) happen at the time of elections?
country <- c("Angola","Angola","Angola","Angola","Angola", "Benin","Benin","Benin","Benin","Benin","Benin")
event_type <- c("battle", "protests","riots", "riots", "elections","elections","protests","riots","violence","riots","elections")
event_date <- as.Date(c("2017-06-16", "2017-01-23", "2016-03-15", "2017-09-18", "2017-08-23", "2019-04-18", "2019-03-12", "2019-04-14", "2018-03-15", "2015-09-14", "2016-03-20"))
start_ecycle <- as.Date(c(NA,NA,NA,NA,"2017-05-25", "2019-01-18",NA,NA,NA,NA,"2015-12-21"))
end_ecycle <-as.Date(c(NA,NA,NA,NA,"2017-09-22","2019-05-18",NA,NA,NA,NA,"2016-04-19"))
mydata <- data.frame(country, event_type, event_date, start_ecycle, end_ecycle)
To this end, we created an interval variable
library(lubridate)
is.instant(mydata$start_ecycle); is.instant(mydata$end_ecycle)
mydata$ecycle <- interval(mydata$start_ecycle, mydata$end_ecycle)
Now, we got stuck. This is what the data.frame should look like in the end - i.e. here column G "ecycle_within" is added with 1 if event_date falls within ecycle (per country):
Any help much appreciated. Thanks!
Based on your comment about the elections cycles being across rows, I would recommend creating a separate dataset first with the elections data.
You can then join the election dates table. This will create a duplicate row for each event and election date range though.
The %within% lubridate function can then be used to check whether an event is within a specific election date range.
Lastly I reduce the number of rows, by filtering out rows corresponding to election date ranges that aren't relevant.
I am more familiar with dplyr and purrr and used them to implement it below. But you should be able to do something similar with base-r functions too.
I got the output close to your required output. But not 100% sure why you would like to do it this way.
library(tidyverse)
library(lubridate)
library(purrr)
elections <- mydata %>%
as_tibble() %>%
select(country, event_type, start_ecycle, end_ecycle) %>%
filter(event_type == "elections") %>%
mutate(election_year = year(start_ecycle)) %>%
select(country, start_ecycle, end_ecycle, election_year)
mydata2 <- mydata %>%
as_tibble() %>%
mutate(row = row_number()) %>%
select(row, country, event_type, event_date) %>%
left_join(elections, by = "country") %>%
mutate(ecycle = map2(start_ecycle, end_ecycle, ~ interval(.x, .y))) %>%
mutate(ecycle_within = map2_int(event_date, ecycle, ~ .x %within% .y)) %>%
select(-ecycle) %>%
group_by(country, event_type, event_date) %>%
arrange(desc(ecycle_within)) %>%
slice(1:1) %>%
ungroup() %>%
arrange(row) %>%
select(-row)
mydata2 %>% select(-election_year)
#> # A tibble: 11 x 6
#> country event_type event_date start_ecycle end_ecycle ecycle_within
#> <fct> <fct> <date> <date> <date> <int>
#> 1 Angola battle 2017-06-16 2017-05-25 2017-09-22 1
#> 2 Angola protests 2017-01-23 2017-05-25 2017-09-22 0
#> 3 Angola riots 2016-03-15 2017-05-25 2017-09-22 0
#> 4 Angola riots 2017-09-18 2017-05-25 2017-09-22 1
#> 5 Angola elections 2017-08-23 2017-05-25 2017-09-22 1
#> 6 Benin elections 2019-04-18 2019-01-18 2019-05-18 1
#> 7 Benin protests 2019-03-12 2019-01-18 2019-05-18 1
#> 8 Benin riots 2019-04-14 2019-01-18 2019-05-18 1
#> 9 Benin violence 2018-03-15 2019-01-18 2019-05-18 0
#> 10 Benin riots 2015-09-14 2019-01-18 2019-05-18 0
#> 11 Benin elections 2016-03-20 2015-12-21 2016-04-19 1

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(nrow(full_sites_wells_months_set)*data_availability)) # generate random groundwater values
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))

Looping through two dataframes and adding columns inside of the loop

I have a problem when specifying a loop with a data frame.
The general idea I have is the following:
I have an area which contains a certain number of raster quadrants. These raster quadrants have been visited irregularily over several years (e.g. from 1950 -2015).
I have two data frames:
1) a data frame containing the IDs of the rasterquadrants (and one column for the year of first visit of this quadrant):
df1<- as.data.frame(cbind(c("12345","12346","12347","12348"),rep(NA,4)))
df1[,1]<- as.character(df1[,1])
df1[,2]<- as.numeric(df1[,2])
names(df1)<-c("Raster_Q","First_visit")
2) a data frame that contains the infos on the visits; this one is ordered with by 1st rasterquadrants and then 2nd years. This dataframe has the info when the rasterquadrant was visited and when.
df2<- as.data.frame(cbind(c(rep("12345",5),rep("12346",7),rep("12347",3),rep(12348,9)),
c(1950,1952,1955,1967,1951,1968,1970,
1998,2001,2014,2015,2017,1965,1986,2000,1952,1955,1957,1965,2003,2014,2015,2016,2017)))
df2[,1]<- as.character(df2[,1])
df2[,2]<- as.numeric(as.character(df2[,2]))
names(df2)<-c("Raster_Q","Year")
I want to know when and how often the full area was 'sampled'.
Scheme of what I want to do; different colors indicate different areas/regions
My rationale:
I sorted the complete data in df2 according to Quadrant and Year. I then match the rasterquadrant in df1 with the name of the rasterquadrant in df2 and the first value of year from df2 is added.
For this I wrote a loop (see below)
In order not to replicate a quadrant I created a vector "visited"
visited<-c()
Every entry of df2 that matches df1 will be written into this vector, so that the second entry of e.g. rasterquadrant "12345" in df2 is ignored in the loop.
Here comes the loop:
visited<- c()
for (i in 1:nrow(df2)){
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"First_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This gives me the first full sampling period.
Raster_Q First_visit
1 12345 1950
2 12346 1968
3 12347 1965
4 12348 1952
However, I want to have all full sampling periods.
So I do:
df1$"Second_visit"<-NA
I reset the visited vector and specify the following loop:
visited <- c()
for (i in 1:nrow(df2)){
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"Second_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
}
Which is basically the same loop as before, however, only making sure that, if df2$"Year" in a certain raster quadrant has already been included in the first visit, then it is skipped.
That gives me the second full sampling period:
Raster_Q First_visit Second_visit
1 12345 1950 NA
2 12346 1968 1970
3 12347 1965 1986
4 12348 1952 2003
Okay, so far so good. I could do that all by hand. But I have loads and loads of rasterquadrants and several areas that can and should be screened in this way.
So doing all of this in a single loop for this would be really great! However, I realized that this will create a problem because the loop then gets recursive:
The added column will not be included in the subsequent iteration of the loop, because the df1 itself is not re-read for each loop, and in consequence, the new coulmn for the new sampling period will not be included in the following iterations:
visited<- c()
for (i in 1:nrow(df2)){
m<-ncol(df1)
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
#finish "first_visit"
df1[,m+1]<-NA
# add column for "second visit"
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
# make sure that the first visit year are not included
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m+1]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This won't work. Another issue is that the vector visited() is not emptied during this loop, so that basically every Raster_Q has already been visited in the second sampling period.
I am stuck.... any ideas?
You can do this without a for loop by using the dplyr and tidyr packages. First, you take your df2 and use dplyr::arrange to order by raster and year. Then you can rank the years visited using the rank function inside of the dplyr::mutate function. Then using tidyr::spread you can put them all in their own columns. Here is the code:
df <- df2 %>%
arrange(Raster_Q, Year) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year),
visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
Here is the output:
> df
# A tibble: 4 x 10
# Groups: Raster_Q [4]
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7 visit_8 visit_9
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 1951 1952 1955 1967 NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017 NA NA
3 12347 1965 1986 2000 NA NA NA NA NA NA
4 12348 1952 1955 1957 1965 2003 2014 2015 2016 2017
EDIT: So I think I understand your problem a little better now. You are looking to remove all duplicate visits to each quadrant that happened before the maximum Year of each respective "round" of visits. So to accomplish this, I wrote a short function that in essence does what the code above does, but with a slight change. Here is the function:
filter_by_round <- function(data, round) {
output <- data %>%
arrange(Raster_Q, Year) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(in_round = ifelse(Year <= max(.$Year[.$visit == round]) & visit > round,
TRUE, FALSE)) %>%
filter(!in_round) %>%
select(-c(in_round, visit))
return(output)
}
What this function does, is look through the data and if a given year is less than the max year for the specified "visit round" then it is removed. To apply this only to the first round, you would do this:
df2 %>%
filter_by_round(1) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
which would give you this:
# A tibble: 4 x 8
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017
3 12347 1965 1986 2000 NA NA NA NA
4 12348 1952 2003 2014 2015 2016 2017 NA
However, while it does accomplish what your for loop would have, you now have other occurrences of the same problem. I have come up with a way to do this successfully but it requires you to know how many "visit rounds" you had or some trial and error. To accomplish this, you can use map and assign the change to a global variable.
# I do this so we do not lose the original dataset
df <- df2
# I chose 1:5 after some trial and error showed there are 5 unique
# "visit rounds" in your toy dataset
# However, if you overshoot your number, it should still work,
# you will just get warnings about `max` not working correctly
# however, this may casue issues, so figuring out your exact number is
# recommended
purrr::map(1:5, function(x){
# this assigns the output of each iteration to the global variable df
df <<- df %>%
filter_by_round(x)
})
# now applying the original transformation to get the spread dataset
df %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
This will give you the following output:
# A tibble: 4 x 6
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA
2 12346 1968 1970 2014 2015 2017
3 12347 1965 1986 NA NA NA
4 12348 1952 2003 2014 2015 2016
granted, this is probably not the most elegant solution, but it works. Hopefully this solves the problem for you

Resources