This is a cutout of my dataframe
I have a dataframe where I have two different variables that are recorded one year apart from each other. I would like to combine, for example, 2007 and 2008 into one row with both variables and name it Denmark2007/8.
I have about 300 rows to do this with, and cannot find a command that will do this; typing it manually is out of the question.
I have looked at everything from merge() to colSums(), and I am lost.
While one can debate whether a wide format data frame will be easiest to use in subsequent analysis steps, the tricky part of this request is that the names of countries may include multiple words. This means that a simpler solution like tidyr::separate() with sep = " " isn't feasible.
Here is a solution that uses the length of each Country string to extract the last 4 characters into a Year column, and everything before the final space as Country.
For the purposes of this example, v1 represents the odd year data, and v2 represents the even year data.
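As a rough alternative to counting characters, the same split can be sketched with a regular expression anchored to the end of the string. This is only an illustration in base R: the vector `country_year` and the pattern `year_pattern` are names made up for the example, and the sketch assumes every entry ends in a space followed by a 4-digit year.

```r
# Hypothetical input: country names (possibly multi-word) followed by a year
country_year <- c("Denmark 2007", "United Kingdom 2008")

# Assumed pattern: a space and exactly four digits at the end of the string
year_pattern <- " ([0-9]{4})$"

# Drop the trailing " YYYY" to get the country name
country <- sub(year_pattern, "", country_year)

# Capture the four digits and convert them to integer
year <- as.integer(sub(paste0("^.*", year_pattern), "\\1", country_year))
```

Because the pattern is anchored with `$`, spaces inside the country name are left alone.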
Refactored Solution
After coding the tidyverse friendly answer (see below), I realized I could simplify the original solution by starting with the long form tidy data, splitting it into even and odd years, renaming columns and then merging by year.
First, we create data based on the graphic in the original post, and add a couple of rows for a country whose name includes multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
After reading the data into a data frame, we extract the last 4 characters from the Country column to create Year, merge v1 and v2 into a single column, add a yearType column, and use it to split the data into even and odd years.
library(dplyr)
library(stringr)
df %>%
  mutate(countryLength = str_length(Country),
         # everything before the final space is the country name
         countryName = substr(Country, 1, countryLength - 5),
         # the last 4 characters are the year
         Year = as.numeric(substr(Country, countryLength - 3, countryLength)),
         # each row has a value in exactly one of v1 / v2
         value = if_else(!is.na(v1), v1, v2),
         yearType = if_else(Year %% 2 == 0, "Even", "Odd")) %>%
  select(!c(Country, countryLength, v1, v2)) %>%
  rename(Country = countryName) %>%
  split(.$yearType) -> dataList
Having split the data into two data frames, we now rename columns in the even year data frame, subtract 1 from Year to merge with the odd numbered year data, join with the odd numbered year data, rename a few columns and add a column for the even numbered years.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType)
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 5.519108 0.93181 2008
2 Denmark 2009 4.938850 0.64285 2010
3 Denmark 2011 5.101908 0.55260 2012
4 United Kingdom 2007 3.187810 0.13187 2008
>
If it is absolutely required to append the start and end years to the Country column, that can be accomplished as follows.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType) %>%
# modify the Country name to include years
mutate(Country = paste(Country,OddYear,"-",EvenYear))
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 - 2008 2007 5.519108 0.93181 2008
2 Denmark 2009 - 2010 2009 4.938850 0.64285 2010
3 Denmark 2011 - 2012 2011 5.101908 0.55260 2012
4 United Kingdom 2007 - 2008 2007 3.187810 0.13187 2008
>
Original Solution
First, we convert the graphic from the question into usable data, and include a couple of rows for a country name that contains multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
Next, we load a couple of packages, create a column to count the number of characters in each row of Country, and use it to separate Year from countryName. We also drop the intermediary columns created during this operation and save the result to yearlyData.
library(dplyr)
library(stringr)
df %>%
  mutate(countryLength = str_length(Country),
         # everything before the final space is the country name
         countryName = substr(Country, 1, countryLength - 5),
         # the last 4 characters are the year
         Year = as.numeric(substr(Country, countryLength - 3, countryLength))) %>%
  select(!c(Country, countryLength)) %>%
  rename(Country = countryName) -> yearlyData
At this point we separate the even years data into another data frame, drop the v1 variable, and subtract 1 from Year so we can merge it with the data for odd numbered years.
yearlyData %>%
filter(Year %% 2 == 0) %>%
select(-v1) %>%
mutate( Year = Year - 1) -> evenYears
Next, we take the yearly data, keep only the rows for odd numbered years with filter(), merge in the evenYears data frame via full_join(), rename a few columns and generate a new column for the even numbered years.
yearlyData %>%
filter(Year %% 2 == 1) %>%
rename(OddYearValue = v1) %>%
select(-v2) %>%
full_join(.,evenYears,by = c("Year","Country")) %>%
rename(EvenYearValue = v2,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1)
...and the output:
OddYearValue Country OddYear EvenYearValue EvenYear
1 0.93181 Denmark 2007 5.519108 2008
2 0.64285 Denmark 2009 4.938850 2010
3 0.55260 Denmark 2011 5.101908 2012
4 0.13187 United Kingdom 2007 3.187810 2008
>
NOTE: the tidy data specification asserts that each column in a data frame should contain one and only one variable, so we did not combine OddYear, EvenYear and Country into a single column as requested in the original post.
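Since the question mentions merge(), the same odd/even pairing can also be sketched in base R. This is a minimal illustration, not part of the solution above: `long_df`, `odd_years`, `even_years` and `paired` are names chosen for the example, the values are copied from the sample data, and merge() performs an inner join by default, so unpaired years would be dropped rather than kept as they are with full_join().

```r
# Long-form example data (values taken from the example above)
long_df <- data.frame(
  Country = c(rep("Denmark", 4), rep("United Kingdom", 2)),
  Year    = c(2007, 2008, 2009, 2010, 2007, 2008),
  value   = c(0.93181, 5.519108, 0.64285, 4.93885, 0.13187, 3.18781)
)

# Split by parity of the year
odd_years  <- long_df[long_df$Year %% 2 == 1, ]
even_years <- long_df[long_df$Year %% 2 == 0, ]

# Shift even years back by one so they share a key with the preceding odd year
even_years$Year <- even_years$Year - 1

# merge() appends the suffixes to the clashing `value` columns
paired <- merge(odd_years, even_years,
                by = c("Country", "Year"),
                suffixes = c("_odd", "_even"))
names(paired)[names(paired) == "Year"] <- "OddYear"
paired$EvenYear <- paired$OddYear + 1
```

Passing `all = TRUE` to merge() would keep unmatched years, at the cost of NA values in the unmatched column.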
A tidy friendly solution
In the classic article on this topic, Hadley Wickham defines two forms of tidy data, narrow / long form and wide form.
The following solution creates a tidy data long form data frame, where each row in the resulting table is one value for each combination of Country and Year.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
library(dplyr)
library(stringr)
df %>%
  mutate(countryLength = str_length(Country),
         # everything before the final space is the country name
         countryName = substr(Country, 1, countryLength - 5),
         # the last 4 characters are the year
         Year = as.numeric(substr(Country, countryLength - 3, countryLength)),
         # each row has a value in exactly one of v1 / v2
         value = if_else(!is.na(v1), v1, v2)) %>%
  select(!c(Country, countryLength, v1, v2)) %>%
  rename(Country = countryName) -> yearlyData
yearlyData
...and the output:
> yearlyData
Country Year value
1 Denmark 2007 0.931810
2 Denmark 2008 5.519108
3 Denmark 2009 0.642850
4 Denmark 2010 4.938850
5 Denmark 2011 0.552600
6 Denmark 2012 5.101908
7 United Kingdom 2007 0.131870
8 United Kingdom 2008 3.187810
>
Ironically, given the input data, it's much easier to create a long form tidy data frame than it is to format the data as requested in the original post.
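If the wide layout is still needed later, one possible route from the long form is tidyr::pivot_wider(). The following is a sketch under the assumption that every pair starts with an odd year; the helper columns `parity` and `OddYear` and the data frame `yearly` are illustrative names, and only a subset of the example values is used.

```r
library(dplyr)
library(tidyr)

# Long-form data shaped like yearlyData above (subset of the example values)
yearly <- data.frame(
  Country = c("Denmark", "Denmark", "United Kingdom", "United Kingdom"),
  Year    = c(2007, 2008, 2007, 2008),
  value   = c(0.93181, 5.519108, 0.13187, 3.18781)
)

wide <- yearly %>%
  mutate(
    # label each row by the parity of its year
    parity  = if_else(Year %% 2 == 0, "EvenYearValue", "OddYearValue"),
    # key shared by both rows of a pair: the odd (first) year
    OddYear = if_else(Year %% 2 == 0, Year - 1, Year)
  ) %>%
  select(-Year) %>%
  pivot_wider(names_from = parity, values_from = value)
```

Each Country / OddYear combination ends up on one row with an OddYearValue and an EvenYearValue column.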
I had to make very specific assumptions about the whole dataset. I hope they apply also to the rest of the table:
You always merge two rows together (not more than two)
The format of the column with Country + year is always the same
Older years (e.g. 2007) always have a non-NA value in the first column and an NA in the last column, while the opposite is true for more recent years (e.g. 2008).
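These assumptions can be checked on the full table before joining. The sketch below uses base R on a few example rows; the variable names (`format_ok`, `one_value_ok`, `pairs_ok`) are made up for the illustration, and the pairing check simply tests that every v1 row has exactly one v2 partner one year later.

```r
# Example rows in the same layout as the question's data
df <- data.frame(
  v1      = c(0.93181, NA, 0.13187, NA),
  Country = c("Denmark 2007", "Denmark 2008", "New Zealand 2007", "New Zealand 2008"),
  v2      = c(NA, 5.519108, NA, 3.187819)
)

# Format check: every entry is "<name> <4-digit year>"
format_ok <- all(grepl("^.+ [0-9]{4}$", df$Country))

# Value check: each row has a value in exactly one of v1 / v2
one_value_ok <- all(xor(is.na(df$v1), is.na(df$v2)))

# Pairing check: each v1 row matches exactly one v2 row one year later
name <- sub(" [0-9]{4}$", "", df$Country)
year <- as.integer(sub("^.* ", "", df$Country))
v1_keys <- paste(name, year + 1)[!is.na(df$v1)]
v2_keys <- paste(name, year)[!is.na(df$v2)]
pairs_ok <- setequal(v1_keys, v2_keys) && !anyDuplicated(v2_keys)
```

If any of the three flags is FALSE, the rows that break the pattern need attention before the join.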
If these assumptions hold, my approach is to first create two tibbles, each containing the columns for Country and year plus only the non-NA values of v1 and v2, respectively (i.e. dropping all the NAs).
Then add one year to the tibble containing v1, and finally perform an inner join on the year.
To make it more readable and avoid repeating code, I created a function that takes care of the string extraction.
# Import data and libraries
library(dplyr)
library(tidyr)
library(stringr)
df <- tribble(
~v1,~Country,~v2,
#--|--|---
0.93181,"Denmark 2007",NA,
NA,"Denmark 2008",5.519108,
0.64285,"Denmark 2009",NA,
NA,"Denmark 2010",4.93885,
0.55260,"Denmark 2011",NA,
NA,"Denmark 2012",5.101908,
0.13187,"New Zealand 2007",NA,
NA,"New Zealand 2008",3.187819
)
# Regular expressions to extract year and country from the Country column
regexp_year <- "[[:digit:]]+"
regexp_country <- "[[:alpha:]\\s]+"
# Function that carries out the string extraction from the `Country` column
do_separate_df <- function(df) {
  df %>%
    mutate(year = str_extract(Country, regexp_year) %>% as.numeric()) %>%
    # str_trim() drops the trailing space that regexp_country captures
    mutate(Country = str_trim(str_extract(Country, regexp_country)))
}
# Tibble with non-NA values in v1 (earlier year)
df_v1 <- df %>%
select(v1,Country) %>%
drop_na %>%
do_separate_df()
# Tibble with non-NA values in v2 (later year)
df_v2 <- df %>%
select(Country,v2) %>%
drop_na %>%
do_separate_df()
# Join on df_v1$year + 1 = df_v2$year
df_combined <-inner_join(
df_v1 %>% mutate(year_to_match = year + 1),
df_v2,
by=c("year_to_match" = "year", "Country")
) %>%
mutate(Country = paste(Country, year, year + 1, sep = " ")) %>%
relocate(Country) %>%
select(-c(year,year_to_match))
df_combined
Country                     v1       v2
Denmark 2007 2008      0.93181 5.519108
Denmark 2009 2010      0.64285 4.938850
Denmark 2011 2012      0.55260 5.101908
New Zealand 2007 2008  0.13187 3.187819
I have a complex problem and I will be grateful if someone can help me out. I have a dataframe made up of appended survey data for different countries in different years. In the said dataframe, I also have air quality measures for the neighbourhoods where respondents were selected. The air quality data is from 1998 to 2016.
My problem is that I want to compute the row mean (cumulative mean exposure) for each person based on the respondent's age and the years of air quality data. My data frame looks like this:
dat <- data.frame(ID=c(1:2000), dob = sample(1990:2020, size=2000, replace=TRUE),
survey_year=rep(c(1998, 2006, 2008, 2014, 2019), times=80, each=5),
CNT = rep(c('AO', 'GH', 'NG', 'SL', 'UG'), times=80, each=5),
Ozone_1998=runif(2000), Ozone_1999=runif(2000), Ozone_2000=runif(2000),
Ozone_2001=runif(2000), Ozone_2002=runif(2000), Ozone_2003=runif(2000),
Ozone_2004=runif(2000), Ozone_2005=runif(2000), Ozone_2006=runif(2000),
Ozone_2007=runif(2000), Ozone_2008=runif(2000), Ozone_2009=runif(2000),
Ozone_2010=runif(2000), Ozone_2011=runif(2000), Ozone_2012=runif(2000),
Ozone_2013=runif(2000), Ozone_2014=runif(2000), Ozone_2015=runif(2000),
Ozone_2016=runif(2000))
In the example data frame above, all respondents in country AO will have their cumulative mean air quality exposure restricted to Ozone_1998, while respondents in country SL will have their mean calculated from Ozone_1998 to Ozone_2014.
Next, for a person in country SL aged 15 years, I want their cumulative exposure to run from Ozone_2000 to Ozone_2014 (the 15-year period of their life, including their birth year). A person aged 16 will have their mean from Ozone_1999 to Ozone_2014, etc.
Is there a way to do this complex task in R?
NB: Although my question is similar to another I posted (see link below), this task is much more complex. I tried adapting the solution to my previous question, but my attempts did not work. For instance, I tried
dat$mean_exposure = dat %>% pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
mutate(year = as.integer(year)) %>% group_by(ID) %>%
summarize(mean_under5_ozone = mean(value[ between(year, survey_year,survey_year + 0) ]), .groups = "drop")
but got an error
*Error: Problem with `summarise()` input `mean_under5_ozone`.
x `left` must be length 1
i Input `mean_under5_ozone` is `mean(value[between(year, survey_year, survey_year + 0)])`.
i The error occurred in group 1: ID = 1.*
Link to the previous question
How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?
Thank you
The tidying step from your last question works well:
library(dplyr)
library(tidyr)
tidy_data = dat %>%
  pivot_longer(
    starts_with("Ozone"),
    names_pattern = "(.*)_(.*)",
    # NA drops the "Ozone" prefix; only the year part is kept
    names_to = c(NA, "year"),
    values_to = "ozone"
  ) %>%
  mutate(year = as.integer(year))
Now you can filter out the years you want to get mean exposure by country / age:
mean_lifetime_exposure = tidy_data %>%
  group_by(CNT, dob) %>%
  # keep only the years from the birth year onwards
  filter(year >= dob) %>%
  summarise(mean_ozone = mean(ozone), .groups = "drop")
PS I'm sorry I don't quite understand your first question about country AO.
Edit:
Does this do what you wanted? The logic is a bit convoluted but the code is straightforward.
tidy_data_filtered = tidy_data %>%
filter(
!(CNT == "AO" & year != 1998),
!(CNT == "SL" & !year %in% 1998:2014)
)
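For a mean per respondent rather than per country and birth year, one possible sketch is to bound the years by each row's own dob and survey_year. This is an interpretation of the question, not a confirmed requirement: it assumes the exposure window runs from the birth year (or the first available year, 1998) through the survey year. The toy data frame `toy` is made up so that Ozone_1998 = 1, Ozone_1999 = 2, and so on, which makes the result easy to check by hand; `exposure`, `result` and `n_years` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy data: Ozone_1998 = 1, Ozone_1999 = 2, ..., Ozone_2016 = 19 for every row
years <- 1998:2016
ozone <- as.data.frame(matrix(rep(seq_along(years), each = 4), nrow = 4))
names(ozone) <- paste0("Ozone_", years)
toy <- cbind(
  data.frame(ID = 1:4,
             dob = c(1995, 2000, 1999, 2010),
             survey_year = c(1998, 2014, 2014, 2008)),
  ozone
)

exposure <- toy %>%
  pivot_longer(starts_with("Ozone"),
               names_to = "year", names_prefix = "Ozone_",
               values_to = "ozone") %>%
  mutate(year = as.integer(year)) %>%
  # >= and <= compare element-wise, so row-specific bounds are fine
  filter(year >= dob, year <= survey_year) %>%
  group_by(ID) %>%
  summarise(mean_exposure = mean(ozone), n_years = n(), .groups = "drop")

# Respondents with no matching years (born after the survey) come back as NA
result <- toy %>%
  select(ID, dob, survey_year) %>%
  left_join(exposure, by = "ID")
```

In this toy data, the respondent born in 2000 and surveyed in 2014 is averaged over the 15 columns Ozone_2000 to Ozone_2014, which matches the example in the question.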
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df8 <- read.csv ('https://raw.githubusercontent.com/hirenvadher954/Worldometers-Scraping/master/countries.csv')
In the 1st dataset, there are countries divided into continents.
In the second data set, there is country and population information.
How can I combine the population information in data set 2 according to the continental information in data set 1?
Thank you. The problem is that in the first dataset, countries are listed by continent, while the second dataset holds countries and their populations. I need the population of each continent, e.g. Europe = 400 million, Asia = 2.4 billion.
Using the dplyr package all you have to do is join by a common variable, in this case country name. Since in one data frame the name is called countryName and in the other one country_name, we just have to specify that they in fact belong to the same variable.
library(dplyr)
library(stringr)
df %>%
left_join(df8, by = c("countryName" = "country_name")) %>%
mutate(population = as.numeric(str_remove_all(population, ","))) %>%
group_by(countryName) %>%
slice_tail(n = 1) %>%
group_by(region) %>%
summarize(population = sum(population, na.rm = TRUE))
# A tibble: 5 x 2
region population
* <chr> <dbl>
1 Africa 1304908713
2 Americas 1019607512
3 Asia 4592311527
4 Europe 738083720
5 Oceania 40731992
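The join on differently named key columns can be tried out on a small toy example without downloading anything. The data frames `report` and `populations` below are stand-ins with invented population figures, and the sketch assumes, as the pipeline above suggests, that the report repeats each country on several rows.

```r
library(dplyr)

# Toy stand-ins for the two data sets; the population figures are invented
report <- data.frame(
  countryName = c("France", "France", "Germany", "Japan"),
  region      = c("Europe", "Europe", "Europe", "Asia")
)
populations <- data.frame(
  country_name = c("France", "Germany", "Japan"),
  population   = c("65,000,000", "83,000,000", "126,000,000")
)

by_region <- report %>%
  # keep one row per country so repeated rows are not counted twice
  distinct(countryName, region) %>%
  left_join(populations, by = c("countryName" = "country_name")) %>%
  # strip the thousands separators before converting to numeric
  mutate(population = as.numeric(gsub(",", "", population))) %>%
  group_by(region) %>%
  summarise(population = sum(population, na.rm = TRUE))
```

Here distinct() plays the role of the group_by() / slice_tail() step: both reduce the report to one row per country before summing.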
I am using my own version of the gapminder data set and trying to see which country has seen the most growth from 2008 to 2018. When I use the original gapminder data, it works fine, but for some reason I cannot replicate it on my own data set. The problem is that I cannot use na.locf() because all the "2008" rows are populated before the "2018" rows.
I am using the spread function, but it returns values in a way where I can't carry the last observation forward, and the group_by function does not seem to work.
# The code on the original data that works fine
library(gapminder)
library(dplyr)
library(tidyr)  # spread()
library(zoo)    # na.locf()
gapminder %>%
filter(year %in% c("1952", "1957")) %>%
spread(year, pop) %>%
na.locf() %>%
mutate(diff = `1957` - `1952`)
However, when I use my data set (the structure is the same), it changes the data in a way that is difficult to subtract
> class(gapminder_df$Year)
[1] "integer"
> class(gapminder_df$population)
[1] "numeric"
# and
> nrow(gapminder_df[gapminder_df$Year == "2018",])
[1] 134
> nrow(gapminder_df[gapminder_df$Year == "2008",])
[1] 134
top_10 <- gapminder_df %>%
filter(Year %in% c("2008", "2018")) %>%
spread(Year, population) %>%
na.locf()
The first column has NAs for the first half of the rows and the second column has NAs for the second half, so I can't subtract. group_by(country) doesn't provide the desired results:
2018 2008 country
1 NA 27300000 Afghanistan
2 NA 2990000 Albania
3 NA 34900000 Algeria
4 NA 21800000 Angola
Here is a sample of the data:
gapminder_df <- tibble(
  country = c(rep("Afghanistan", 4), rep("Albania", 4), rep("Algeria", 4), rep("Angola", 4)),
  Year = rep(c("2008", "2009", "2018", "2004"), 4),
  population = rnorm(16, mean = 5000000, sd = 50)
)
EDIT:
I was able to fix it by selecting only relevant columns before spread... can someone explain to me why that worked? I guess I had multiple of the same dates for the same countries with many different values for other variables?
top_10 <- gapminder_df %>%
select(country, Year, population) %>%
filter(Year %in% c("2008", "2018")) %>%
spread(Year, population)
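One likely explanation, assuming the full data set has other columns whose values change between years: spread() treats every column other than the key and value as an identifier, so rows only collapse when all of those columns match. The toy data frame below is made up to show the effect; `gdp` stands in for any such extra column.

```r
library(tidyr)

# Toy data with an extra column (gdp) that differs between years
toy <- data.frame(
  country    = c("A", "A", "B", "B"),
  Year       = c(2008, 2018, 2008, 2018),
  population = c(10, 12, 20, 25),
  gdp        = c(1, 2, 3, 4)
)

# gdp stays as an identifying column, so each (country, gdp) pair gets its own row
wide_all <- spread(toy, Year, population)

# With only the key and value columns, rows collapse to one per country
wide_sel <- spread(toy[, c("country", "Year", "population")], Year, population)
wide_sel$diff <- wide_sel$`2018` - wide_sel$`2008`
```

In `wide_all` every row has an NA in either the 2008 or the 2018 column, which matches the pattern described in the question; `wide_sel` has one complete row per country.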