Merge Data Frames of different years - r

I'm trying to merge some data frames using R. You can find the data here: https://www.kaggle.com/mathurinache/world-happiness-report.
There are 6 data frames, one for each year (2015-2020).
Is there any way of merging these data frames using the year as a new column?
For example:
Year Country Region
2015 Switzerland Western Europe ...
2016 Switzerland Western Europe ...
2017 Switzerland Western Europe ...
...

You first need to clean the data so that it contains the same columns. I gave it a shot, but it's not perfect yet. You still need to figure out how healthy life expectancy is defined across each year. In 2020 it seems to be a number in years, in 2019 it seems to be a standardized value and in the other years it is probably the proportion relative to the real life expectancy. Further, the log transformation of GDP in 2020 is my best guess, so no guarantee that this brings the 2020 data on the same scale as the rest of the data.
library(tidyverse)
mypath <- "Insert/Your/Path/Here/"
# `pattern` is a regular expression, not a glob; files are assumed to sort in year order
file_ls <- paste0(mypath, list.files(path = mypath, pattern = "\\.csv$"))
dat_ls <- tibble(year = 2015:2020, # setup a nested tibble
data = set_names(map(file_ls, ~ read.csv(.x) %>% as_tibble), year)) %>%
# mutate the 2020 data to match the other years
mutate(data = map_at(data, "2020",
~ mutate(.x,
Rank = rank(desc(Ladder.score)),
Logged.GDP.per.capita = log(exp(Logged.GDP.per.capita), exp(8)))
)) %>%
# enter here more map_at calls to transform the life expectancy column
# ...
# switch to rowwise, so that `map` etc is no longer needed
rowwise(year) %>%
# select names in each data set in given order and then rename them with set_names
mutate(data2 = list(select(data,
contains("country"),
contains("rank"),
contains("score") & !contains("Dystopia") & !contains("standard"),
contains("gdp") & !contains("explained"),
contains("expectancy") & !contains("explained")
) %>%
set_names(., c("country",
"rank",
"happiness_score",
"gdp_per_capita",
"life_expectancy"))
)) %>%
select(year, data2) %>%
unnest(data2)
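Once the yearly tables share the same columns, the stacking step itself is short. The following is a minimal base R sketch, not the code above: the two toy data frames, their `Score` column, and the names `happiness_2015`, `happiness_2016`, `yearly` and `combined` are placeholders made up for illustration.

```r
# Toy stand-ins for two of the yearly files (placeholder values, not real data)
happiness_2015 <- data.frame(Country = c("Switzerland", "Iceland"),
                             Region = "Western Europe",
                             Score = c(1, 2))
happiness_2016 <- data.frame(Country = c("Switzerland", "Iceland"),
                             Region = "Western Europe",
                             Score = c(3, 4))

# Name each list element after its year
yearly <- list("2015" = happiness_2015, "2016" = happiness_2016)

# Prepend a Year column to each data frame, taken from the list names
with_year <- Map(function(d, y) cbind(Year = as.integer(y), d),
                 yearly, names(yearly))

# Stack the pieces; rbind() requires identical column names in every piece
combined <- do.call(rbind, with_year)
rownames(combined) <- NULL
combined
```

If the column names differ between years, as they do in the real files, `rbind()` stops with an error, which is why the cleaning step comes first.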

Related

How to merge two rows so 2 years of data are represented in one row

This is a cutout of my dataframe
I have a dataframe with two different variables that are measured one year apart from each other. I would like to combine, for example, 2007 and 2008 into one row containing both variables, and name it Denmark2007/8.
I have about 300 rows to do this with and cannot find a command that will do it, and typing it manually is out of the question.
I have looked at everything from merge() to colSums(), and I am lost.
While one can debate whether a wide format data frame will be easiest to use in subsequent analysis steps, the tricky part of this request is that the names of countries may include multiple words. This means that a simpler solution like tidyr::separate() with sep = " " isn't feasible.
Here is a solution that uses length of each Country to extract the last 4 characters into a Year column, and everything before the final space as Country.
For the purposes of this example, v1 represents the odd year data, and v2 represents the even year data.
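As an aside, the same split can also be done with a regular expression anchored at the end of the string, which copes with multi-word country names. This is a minimal base R sketch under the assumption that every value ends in a space followed by a four-digit year; the vector `x` is made up for illustration.

```r
x <- c("Denmark 2007", "United Kingdom 2008")

# Capture the four digits at the end of the string
year <- as.integer(sub("^.*\\s(\\d{4})$", "\\1", x))

# Drop the trailing whitespace and year, keeping everything before it
country <- sub("\\s+\\d{4}$", "", x)

data.frame(country, year)
```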
Refactored Solution
After coding the tidyverse friendly answer (see below), I realized I could simplify the original solution by starting with the long form tidy data, splitting it into even and odd years, renaming columns and then merging by year.
First, we create data based on the graphic in the original post, and add a couple of rows for a country whose name includes multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
After reading the data into a data frame, we extract the last 4 characters from the Country column to create Year, merge v1 and v2 into a single column, add a yearType column, and use it to split the data into even and odd years.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 3,countryLength)),
value = if_else(!is.na(v1),v1,v2),
yearType = if_else(Year %% 2 == 0,"Even","Odd")) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) %>%
split(.$yearType) -> dataList
Having split the data into two data frames, we now rename columns in the even year data frame, subtract 1 from Year to merge with the odd numbered year data, join with the odd numbered year data, rename a few columns and add a column for the even numbered years.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType)
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 5.519108 0.93181 2008
2 Denmark 2009 4.938850 0.64285 2010
3 Denmark 2011 5.101908 0.55260 2012
4 United Kingdom 2007 3.187810 0.13187 2008
If it is absolutely required to append the start and end years to the Country column, that can be accomplished as follows.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType) %>%
# modify the Country name to include years
mutate(Country = paste(Country,OddYear,"-",EvenYear))
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 - 2008 2007 5.519108 0.93181 2008
2 Denmark 2009 - 2010 2009 4.938850 0.64285 2010
3 Denmark 2011 - 2012 2011 5.101908 0.55260 2012
4 United Kingdom 2007 - 2008 2007 3.187810 0.13187 2008
Original Solution
First, we convert the graphic from the question into usable data, and include a couple of rows for a country name that contains multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
Next, we load a couple of packages, create a column to count the number of characters in each row of Country, and use it to separate Year from countryName. We also drop the intermediary columns created during this operation and save the result to yearlyData.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 3,countryLength))) %>%
select(!c(Country,countryLength)) %>%
rename(Country = countryName) -> yearlyData
At this point we separate the even years data into another data frame, drop the v1 variable, and subtract 1 from Year so we can merge it with the data for odd numbered years.
yearlyData %>%
filter(Year %% 2 == 0) %>%
select(-v1) %>%
mutate( Year = Year - 1) -> evenYears
Next, we read the yearly data, filter() out the rows for even numbered years, merge in the evenYears data frame via full_join(), rename a few columns and generate a new column for the even numbered years.
yearlyData %>%
filter(Year %% 2 == 1) %>%
rename(OddYearValue = v1) %>%
select(-v2) %>%
full_join(.,evenYears,by = c("Year","Country")) %>%
rename(EvenYearValue = v2,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1)
...and the output:
OddYearValue Country OddYear EvenYearValue EvenYear
1 0.93181 Denmark 2007 5.519108 2008
2 0.64285 Denmark 2009 4.938850 2010
3 0.55260 Denmark 2011 5.101908 2012
4 0.13187 United Kingdom 2007 3.187810 2008
NOTE: the tidy data specification asserts that each column in a data frame should contain one and only one variable, so we did not combine OddYear, EvenYear and Country into a single column as requested in the original post.
A tidy friendly solution
In the classic article on this topic, Hadley Wickham defines two forms of tidy data, narrow / long form and wide form.
The following solution creates a tidy data long form data frame, where each row in the resulting table is one value for each combination of Country and Year.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 3,countryLength)),
value = if_else(!is.na(v1),v1,v2)) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) -> yearlyData
yearlyData
...and the output:
> yearlyData
Country Year value
1 Denmark 2007 0.931810
2 Denmark 2008 5.519108
3 Denmark 2009 0.642850
4 Denmark 2010 4.938850
5 Denmark 2011 0.552600
6 Denmark 2012 5.101908
7 United Kingdom 2007 0.131870
8 United Kingdom 2008 3.187810
Ironically, given the input data, it's much easier to create a long form tidy data frame than it is to format the data as requested in the original post.
I had to make very specific assumptions about the whole dataset. I hope they apply also to the rest of the table:
You always merge two rows together (not more than two)
The format of the column with Country + year is always the same
Older years (e.g. 2007) always have a non-NA value in the first column and an NA in the last column, while the opposite is true for more recent years (e.g. 2008).
If these assumptions hold, I thought to work it out by first creating two tibbles containing the columns for Country and year, only the non-NA values in v1 and v2, respectively (i.e. dropping all the NAs).
Then you add one year to the tibble containing v1, and finally perform an inner join on the year.
To make it more readable and avoid repeating code, I created a function that takes care of the string extraction.
# Import data and libraries
library(dplyr)
library(tidyr)
library(stringr)
df <- tribble(
~v1,~Country,~v2,
#--|--|---
0.93181,"Denmark 2007",NA,
NA,"Denmark 2008",5.519108,
0.64285,"Denmark 2009",NA,
NA,"Denmark 2010",4.93885,
0.55260,"Denmark 2011",NA,
NA,"Denmark 2012",5.101908,
0.13187,"New Zealand 2007",NA,
NA,"New Zealand 2008",3.187819
)
# Regular expressions to extract year and country from the Country column
regexp_year <- "[[:digit:]]+"
regexp_country <- "[[:alpha:]\\s]+"
# Function that carries out the string extraction from the `Country` column
do_separate_df <- function(df) {
df %>%
mutate(year = str_extract(Country,regexp_year) %>% as.numeric()) %>%
# str_trim() drops the trailing space that the pattern captures before the year
mutate(Country = str_trim(str_extract(Country,regexp_country)))
}
# Tibble with non-NA values in v1 (earlier year)
df_v1 <- df %>%
select(v1,Country) %>%
drop_na %>%
do_separate_df()
# Tibble with non-NA values in v2 (later year)
df_v2 <- df %>%
select(Country,v2) %>%
drop_na %>%
do_separate_df()
# Join on df_v1$year + 1 = df_v2$year
df_combined <-inner_join(
df_v1 %>% mutate(year_to_match = year + 1),
df_v2,
by=c("year_to_match" = "year", "Country")
) %>%
mutate(Country = paste(Country, year, year + 1, sep = " ")) %>%
relocate(Country) %>%
select(-c(year,year_to_match))
df_combined
Country               v1      v2
Denmark 2007 2008     0.93181 5.519108
Denmark 2009 2010     0.64285 4.938850
Denmark 2011 2012     0.55260 5.101908
New Zealand 2007 2008 0.13187 3.187819

How do you output results of a loop applied to a data frame into a new data frame? [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 1 year ago.
I am working on a data frame with the following columns - country, factor, year, number of technicians.
What I want to achieve: create a new data frame containing only the most recent data (i.e. highest year) for each country (NOTE: there are several rows of data for each country).
I have created a function to isolate the country and year as follows. (NOTE: main dataset = numeric)
#Function to isolate 1 country
one_country1 <- function(x) {
a <- numeric %>% filter(Factor == x)
return(a) }
#Function to isolate latest year
latest_country <- function(y){
b <- y %>% filter(Year == max(Year))
return(b) }
#Function to isolate both country and latest year
best_data <- function(z){
G <- latest_country(one_country1(z))
return(G)}
I then made this into a for loop to apply it to each country as follows.
z <- 1
loop_data <- for(z in 1:114){
print(best_data(z))}
This produces the correct data, but it is in a strange format that is not a data frame. When I try typeof() it says 'NULL', and I can't seem to convert it into a data frame using as.data.frame() or by starting with an empty data frame and incorporating rbind.data.frame into the function. The results appear as follows:
Country Factor Year Technicians
1 Yemen 112 2010 1809
Country Factor Year Technicians
1 Zambia 113 2012 1126
Country Factor Year Technicians
1 Zimbabwe 114 2018 1126
typeof(loop_data) = NULL
Any help on how to rework this code to output a data frame would be much appreciated! I only started learning R a couple of weeks ago, so please forgive how amateurish and untidy the code may be!
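# No loop is needed: a grouped filter keeps the latest year for every country
# (df stands for your main dataset, called `numeric` in the question)
library(dplyr)
loop_data <- df %>%
  group_by(Country) %>%
  filter(Year == max(Year)) %>%
  ungroup()
alternatively, if you want to keep the loop, collect each country's rows instead of overwriting loop_data
loop_data <- data.frame()
for (country in unique(df$Country)) {
  latest <- df %>%
    filter(Country == country) %>%
    filter(Year == max(Year))
  loop_data <- bind_rows(loop_data, latest)
}
or, to keep exactly one row per country even when the latest year is tied
loop_data <- data.frame()
for (country in unique(df$Country)) {
  latest <- df %>%
    filter(Country == country) %>%
    slice(which.max(Year))
  loop_data <- bind_rows(loop_data, latest)
}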
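On why typeof(loop_data) is NULL: a for loop in R always returns NULL, and print() only writes to the console, so nothing is ever stored. Here is a minimal base R sketch of collecting the per-country pieces into one data frame. The `tech` data frame and its values are made up for illustration, with column names borrowed from the question.

```r
# Made-up example data with several years per country
tech <- data.frame(
  Country = c("Yemen", "Yemen", "Zambia", "Zambia", "Zimbabwe"),
  Year = c(2004, 2010, 2012, 2005, 2018),
  Technicians = c(10, 20, 30, 40, 50)
)

# lapply() returns a list, unlike for(), which always evaluates to NULL
pieces <- lapply(split(tech, tech$Country), function(d) {
  d[d$Year == max(d$Year), ]  # keep the row(s) with the latest year
})

# Bind the list of one-country data frames into a single data frame
latest <- do.call(rbind, pieces)
rownames(latest) <- NULL
latest
```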

Compute a custom mean for each row over multiple columns, based on a set of conditions

I have a complex problem and I will be grateful if someone can help me out. I have a dataframe made up of appended survey data for different countries in different years. In the said dataframe, I also have air quality measures for the neighbourhoods where respondents were selected. The air quality data is from 1998 to 2016.
My problem is that I want to compute the row mean (or cumulative mean exposure) for each person based on the respondent's age and the air quality data years. My data frame looks like this
dat <- data.frame(ID=c(1:2000), dob = sample(1990:2020, size=2000, replace=TRUE),
survey_year=rep(c(1998, 2006, 2008, 2014, 2019), times=80, each=5),
CNT = rep(c('AO', 'GH', 'NG', 'SL', 'UG'), times=80, each=5),
Ozone_1998=runif(2000), Ozone_1999=runif(2000), Ozone_2000=runif(2000),
Ozone_2001=runif(2000), Ozone_2002=runif(2000), Ozone_2003=runif(2000),
Ozone_2004=runif(2000), Ozone_2005=runif(2000), Ozone_2006=runif(2000),
Ozone_2007=runif(2000), Ozone_2008=runif(2000), Ozone_2009=runif(2000),
Ozone_2010=runif(2000), Ozone_2011=runif(2000), Ozone_2012=runif(2000),
Ozone_2013=runif(2000), Ozone_2014=runif(2000), Ozone_2015=runif(2000),
Ozone_2016=runif(2000))
In the example data frame above, all respondents in country AO will have their cumulative mean air quality exposure restricted to Ozone_1998, while respondents in country SL will have their mean calculated from Ozone_1998 to Ozone_2014.
The next thing is that for a person in country SL aged 15 years, I want their cumulative exposure to run from Ozone_2000 to Ozone_2014 (the 15-year period of their life, including their birth year). A person aged 16 will have their mean from Ozone_1999 to Ozone_2014, etc.
Is there a way to do this complex task in R?
NB: Although my question is similar to another I posted (see link below), this task is much more complex. I tried adapting the solution to my previous question, but my attempts did not work. For instance, I tried
dat$mean_exposure = dat %>% pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
mutate(year = as.integer(year)) %>% group_by(ID) %>%
summarize(mean_under5_ozone = mean(value[ between(year, survey_year,survey_year + 0) ]), .groups = "drop")
but got an error
*Error: Problem with `summarise()` input `mean_under5_ozone`.
x `left` must be length 1
i Input `mean_under5_ozone` is `mean(value[between(year, survey_year, survey_year + 0)])`.
i The error occurred in group 1: ID = 1.*
Link to the previous question
How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?
Thank you
The tidying step from your last question works well:
tidy_data = dat %>%
pivot_longer(
starts_with("Ozone"),
names_pattern = "(.*)_(.*)",
names_to = c(NA, "year"),
values_to = "ozone"
) %>%
mutate(year = as.integer(year))
Now you can filter out the years you want to get mean exposure by country / age:
mean_lifetime_exposure = tidy_data %>%
group_by(CNT, dob) %>%
filter(year >= dob) %>%
summarise(mean(ozone))
PS I'm sorry I don't quite understand your first question about country AO.
Edit:
Does this do what you wanted? The logic is a bit convoluted but the code is straightforward.
tidy_data_filtered = tidy_data %>%
filter(
!(CNT == "AO" & year != 1998),
!(CNT == "SL" & !year %in% 1998:2014)
)
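For a per-respondent mean, one possible reading of the question is that each person's window runs from their birth year (but no earlier than 1998) to their survey year (but no later than 2016). The base R sketch below assumes exactly that reading. The four-row `dat` is a made-up stand-in for the real data, and `mean_exposure` is a hypothetical column name.

```r
set.seed(1)
years <- 1998:2016

# Made-up respondents: birth year and survey year
dat <- data.frame(ID = 1:4,
                  dob = c(1990, 2000, 2010, 2016),
                  survey_year = c(1998, 2014, 2014, 2008))

# Random ozone values, one column per year, named Ozone_1998 ... Ozone_2016
ozone <- matrix(runif(nrow(dat) * length(years)), nrow = nrow(dat),
                dimnames = list(NULL, paste0("Ozone_", years)))
dat <- cbind(dat, ozone)

# Row-wise mean over the columns between birth year and survey year
dat$mean_exposure <- vapply(seq_len(nrow(dat)), function(i) {
  first <- max(dat$dob[i], min(years))         # no data before 1998
  last <- min(dat$survey_year[i], max(years))  # no data after 2016
  if (first > last) return(NA_real_)           # born after the survey year
  cols <- paste0("Ozone_", first:last)
  mean(unlist(dat[i, cols]))
}, numeric(1))

dat[, c("ID", "dob", "survey_year", "mean_exposure")]
```

Under this reading, a respondent surveyed in 1998 gets Ozone_1998 only, and one born in 2000 and surveyed in 2014 gets the mean of Ozone_2000 to Ozone_2014.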

Searching and using databases

df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df8 <- read.csv ('https://raw.githubusercontent.com/hirenvadher954/Worldometers-Scraping/master/countries.csv')
The first dataset lists countries together with the continent each belongs to.
The second dataset contains each country's population.
How can I combine the population information in dataset 2 with the continent information in dataset 1?
Thank you. The problem is that the first dataset has countries grouped by continent, while the second has countries and their populations. I need the total population of each continent, e.g. Europe = 400 million, Asia = 2.4 billion.
Using the dplyr package all you have to do is join by a common variable, in this case country name. Since in one data frame the name is called countryName and in the other one country_name, we just have to specify that they in fact belong to the same variable.
library(dplyr)
library(stringr)
df %>%
left_join(df8, by = c("countryName" = "country_name")) %>%
mutate(population = as.numeric(str_remove_all(population, ","))) %>%
group_by(countryName) %>%
slice_tail(n = 1) %>%
group_by(region) %>%
summarize(population = sum(population, na.rm = TRUE))
# A tibble: 5 x 2
region population
* <chr> <dbl>
1 Africa 1304908713
2 Americas 1019607512
3 Asia 4592311527
4 Europe 738083720
5 Oceania 40731992
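The same join-then-aggregate idea can be sketched in base R with merge() and aggregate(). The two toy data frames below are made up for illustration, with placeholder countries and populations. Only the column names `countryName`, `country_name`, `region` and `population` are taken from the code above.

```r
# Toy stand-ins for the two datasets (placeholder values, not real data)
countries <- data.frame(countryName = c("A", "B", "C"),
                        region = c("Europe", "Europe", "Asia"),
                        stringsAsFactors = FALSE)
pops <- data.frame(country_name = c("A", "B", "C"),
                   population = c("1,000", "2,500", "10,000"),
                   stringsAsFactors = FALSE)

# Left join on the country name, which is spelled differently in each table
merged <- merge(countries, pops,
                by.x = "countryName", by.y = "country_name", all.x = TRUE)

# Populations arrive as text with thousands separators; strip them first
merged$population <- as.numeric(gsub(",", "", merged$population))

# Sum the population within each region
by_region <- aggregate(population ~ region, data = merged, FUN = sum)
by_region
```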

Spread function returning all "NA" within one of two columns

I am using my own version of the gapminder data set and trying to see which country has realized the most growth from 2008 to 2018. With the original gapminder data it works fine, but for some reason I cannot replicate it on my own data set. The problem is that I cannot use na.locf() because all the "2008" rows are populated before the "2018" rows.
I am using the spread function, but it returns values in a way where I can't carry the last observation forward, and the group_by function does not seem to work.
# The code on the original data that works fine
library(gapminder)
library(dplyr) # filter(), mutate(), %>%
library(tidyr) # spread()
library(zoo)   # na.locf()
gapminder %>%
filter(year %in% c("1952", "1957")) %>%
spread(year, pop) %>%
na.locf() %>%
mutate(diff = `1957` - `1952`)
However, when I use my data set (the structure is the same), it changes the data in a way that is difficult to subtract
> class(gapminder_df$Year)
[1] "integer"
> class(gapminder_df$population)
[1] "numeric"
# and
> nrow(gapminder_df[gapminder_df$Year == "2018",])
[1] 134
> nrow(gapminder_df[gapminder_df$Year == "2008",])
[1] 134
top_10 <- gapminder_df %>%
filter(Year %in% c("2008", "2018")) %>%
spread(Year, population) %>%
na.locf()
The first column has NAs for the first half of the rows and the second column has NAs for the second half, so I can't subtract them. group_by(country) doesn't give the desired result:
2018 2008 country
1 NA 27300000 Afghanistan
2 NA 2990000 Albania
3 NA 34900000 Algeria
4 NA 21800000 Angola
here is a sample of the data
gapminder_df <- tibble(
Country = c(rep("Afganistan", 4), rep("Albania", 4), rep("Algeria",4),rep("Angola",4)),
Year = rep(c("2008", "2009", "2018", "2004"), 4),
population = rnorm(16, mean = 5000000, sd = 50)
)
EDIT:
I was able to fix it by selecting only relevant columns before spread... can someone explain to me why that worked? I guess I had multiple of the same dates for the same countries with many different values for other variables?
top_10 <- gapminder_df %>%
select(country, Year, population) %>%
filter(Year %in% c("2008", "2018")) %>%
spread(Year, population)
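One likely explanation, offered as a sketch rather than a diagnosis of the actual data: spread() treats every column other than the key and value columns as an identifier. If any other column differs between a country's 2008 and 2018 rows, the two years land on separate output rows, each with NA in the other year's column. Selecting only country, Year and population removes those extra identifiers. The base R sketch below avoids the problem by keeping only the needed columns before combining the two years. The toy data and the `gdp` column are made up for illustration.

```r
# Made-up data: gdp varies by year, which is what trips up a wide reshape
gapminder_df <- data.frame(
  country = rep(c("Afghanistan", "Albania"), each = 2),
  Year = rep(c(2008L, 2018L), times = 2),
  population = c(100, 130, 50, 45),
  gdp = c(1, 2, 3, 4)
)

# Keep only the key and the value for each year of interest
p2008 <- subset(gapminder_df, Year == 2008, select = c(country, population))
p2018 <- subset(gapminder_df, Year == 2018, select = c(country, population))

# One row per country, with one population column per year
growth <- merge(p2008, p2018, by = "country", suffixes = c("_2008", "_2018"))
growth$diff <- growth$population_2018 - growth$population_2008

# Largest growth first
growth <- growth[order(-growth$diff), ]
growth
```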

Resources