This is an excerpt of my data frame.
I have a data frame with two different variables that are recorded one year apart from each other. I would like to combine, for example, 2007 and 2008 into one row containing both variables and name it Denmark2007/8.
I have about 300 rows to do this with and cannot find a command that will do it, and typing it manually is out of the question.
I have looked at everything from merge() to colSums(), and I am lost.
While one can debate whether a wide format data frame will be easiest to use in subsequent analysis steps, the tricky part of this request is that the names of countries may include multiple words. This means that a simpler solution like tidyr::separate() with sep = " " isn't feasible.
Here is a solution that uses length of each Country to extract the last 4 characters into a Year column, and everything before the final space as Country.
For the purposes of this example, v1 represents the odd year data, and v2 represents the even year data.
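As a minimal sketch of the length-based extraction described above, here it is on a plain character vector. It uses base R's nchar() in place of str_length(), and the two input strings are made up for illustration.

```r
# Hypothetical labels in the same "Country Year" layout as the question
x <- c("Denmark 2007", "United Kingdom 2008")

n <- nchar(x)                             # length of each label
country <- substr(x, 1, n - 5)            # everything before the final space
year <- as.numeric(substr(x, n - 3, n))   # the last 4 characters

country
year
```

Because the split is counted from the end of the string, spaces inside the country name do not matter.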
Refactored Solution
After coding the tidyverse friendly answer (see below), I realized I could simplify the original solution by starting with the long form tidy data, splitting it into even and odd years, renaming columns and then merging by year.
First, we create data based on the graphic in the original post, and add a couple of rows for a country whose name includes multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
After reading the data into a data frame, we extract the last 4 characters from the Country column to create Year, merge v1 and v2 into a single column, add a yearType column, and use it to split the data into even and odd years.
library(dplyr)
library(stringr)
df %>%
  mutate(countryLength = str_length(Country),
         countryName = substr(Country, 1, countryLength - 5),
         # the last 4 characters hold the year
         Year = as.numeric(substr(Country, countryLength - 3, countryLength)),
         value = if_else(!is.na(v1), v1, v2),
         yearType = if_else(Year %% 2 == 0, "Even", "Odd")) %>%
  select(!c(Country, countryLength, v1, v2)) %>%
  rename(Country = countryName) %>%
  split(.$yearType) -> dataList
Having split the data into two data frames, we now rename columns in the even year data frame, subtract 1 from Year to merge with the odd numbered year data, join with the odd numbered year data, rename a few columns and add a column for the even numbered years.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType)
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 5.519108 0.93181 2008
2 Denmark 2009 4.938850 0.64285 2010
3 Denmark 2011 5.101908 0.55260 2012
4 United Kingdom 2007 3.187810 0.13187 2008
>
If it is absolutely required to append the start and end years to the Country column, that can be accomplished as follows.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType) %>%
# modify the Country name to include years
mutate(Country = paste(Country,OddYear,"-",EvenYear))
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 - 2008 2007 5.519108 0.93181 2008
2 Denmark 2009 - 2010 2009 4.938850 0.64285 2010
3 Denmark 2011 - 2012 2011 5.101908 0.55260 2012
4 United Kingdom 2007 - 2008 2007 3.187810 0.13187 2008
>
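If the label should look like the Denmark2007/8 form asked for in the question, one possible way to build it is sketched below. The vectors are stand-ins for the Country, OddYear and EvenYear columns, and abbreviating the second year with %% 100 is my own assumption about the desired format.

```r
# Stand-in vectors for the Country, OddYear and EvenYear columns
Country  <- c("Denmark", "Denmark")
OddYear  <- c(2007, 2009)
EvenYear <- OddYear + 1

# Abbreviate the second year to its last one or two digits (assumed format)
label <- paste0(Country, OddYear, "/", EvenYear %% 100)
label
```

Inside the pipeline the same expression could go into a mutate() call in place of the paste() shown above.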
Original Solution
First, we convert the graphic from the question into usable data, and include a couple of rows for a country name that contains multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
Next, we load a couple of packages, create a column to count the number of characters in each row of Country, and use it to separate Year from countryName. We also drop the intermediary columns created during this operation and save the result to yearlyData.
library(dplyr)
library(stringr)
df %>%
  mutate(countryLength = str_length(Country),
         countryName = substr(Country, 1, countryLength - 5),
         # the last 4 characters hold the year
         Year = as.numeric(substr(Country, countryLength - 3, countryLength))) %>%
  select(!c(Country, countryLength)) %>%
  rename(Country = countryName) -> yearlyData
At this point we separate the even years data into another data frame, drop the v1 variable, and subtract 1 from Year so we can merge it with the data for odd numbered years.
yearlyData %>%
filter(Year %% 2 == 0) %>%
select(-v1) %>%
mutate( Year = Year - 1) -> evenYears
Next, we read the yearly data, filter() out the rows for even numbered years, merge in the evenYears data frame via full_join(), rename a few columns and generate a new column for the even numbered years.
yearlyData %>%
filter(Year %% 2 == 1) %>%
rename(OddYearValue = v1) %>%
select(-v2) %>%
full_join(.,evenYears,by = c("Year","Country")) %>%
rename(EvenYearValue = v2,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1)
...and the output:
OddYearValue Country OddYear EvenYearValue EvenYear
1 0.93181 Denmark 2007 5.519108 2008
2 0.64285 Denmark 2009 4.938850 2010
3 0.55260 Denmark 2011 5.101908 2012
4 0.13187 United Kingdom 2007 3.187810 2008
>
NOTE: the tidy data specification asserts that each column in a data frame should contain one and only one variable, so we did not combine OddYear, EvenYear and Country into a single column as requested in the original post.
A tidy friendly solution
In the classic article on this topic, Hadley Wickham defines two forms of tidy data, narrow / long form and wide form.
The following solution creates a tidy data long form data frame, where each row in the resulting table is one value for each combination of Country and Year.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
library(dplyr)
library(stringr)
df %>%
  mutate(countryLength = str_length(Country),
         countryName = substr(Country, 1, countryLength - 5),
         # the last 4 characters hold the year
         Year = as.numeric(substr(Country, countryLength - 3, countryLength)),
         value = if_else(!is.na(v1), v1, v2)) %>%
  select(!c(Country, countryLength, v1, v2)) %>%
  rename(Country = countryName) -> yearlyData
yearlyData
...and the output:
> yearlyData
Country Year value
1 Denmark 2007 0.931810
2 Denmark 2008 5.519108
3 Denmark 2009 0.642850
4 Denmark 2010 4.938850
5 Denmark 2011 0.552600
6 Denmark 2012 5.101908
7 United Kingdom 2007 0.131870
8 United Kingdom 2008 3.187810
>
Ironically, given the input data, it's much easier to create a long form tidy data frame than it is to format the data as requested in the original post.
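As a side note, merging v1 and v2 into a single value column can also be written with dplyr::coalesce(), which returns the first non-missing value at each position. This is a small sketch on stand-alone vectors, not a change to the solution above.

```r
library(dplyr)

# The first four rows of v1 and v2 from the example data
v1 <- c(0.93181, NA, 0.64285, NA)
v2 <- c(NA, 5.519108, NA, 4.93885)

# Take v1 where it is present, otherwise fall back to v2
value <- coalesce(v1, v2)
value
```

The result matches if_else(!is.na(v1), v1, v2) for this data.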
I had to make very specific assumptions about the whole dataset. I hope they apply also to the rest of the table:
You always merge two rows together (not more than two)
The format of the column with Country + year is always the same
Older years (e.g. 2007) always have a non-NA value in the first column and an NA in the last column, while the opposite is true for more recent years (e.g. 2008).
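One way to test these assumptions before relying on them is sketched below. The four-row data frame and the column names format_ok, one_value and side_ok are made up for the check, and the odd/even rule is only inferred from the sample rows.

```r
library(dplyr)
library(stringr)

# Small stand-in for the real table
df <- data.frame(
  v1 = c(0.93181, NA, 0.13187, NA),
  Country = c("Denmark 2007", "Denmark 2008",
              "New Zealand 2007", "New Zealand 2008"),
  v2 = c(NA, 5.519108, NA, 3.187819)
)

checks <- df %>%
  mutate(
    year = as.numeric(str_extract(Country, "[0-9]{4}$")),
    # label is "<name> <4-digit year>"
    format_ok = str_detect(Country, "^.+ [0-9]{4}$"),
    # exactly one of v1 / v2 is present in each row
    one_value = xor(is.na(v1), is.na(v2)),
    # assumed rule: v1 sits in odd years, v2 in even years
    side_ok = if_else(year %% 2 == 1, !is.na(v1), !is.na(v2))
  )

assumptions_hold <- all(checks$format_ok, checks$one_value, checks$side_ok)
assumptions_hold
```

Any row where one of the three flags is FALSE would need a closer look before joining.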
If these assumptions hold, I thought to work it out by first creating two tibbles containing the columns for Country and year, only the non-NA values in v1 and v2, respectively (i.e. dropping all the NAs).
Then you add one year to the tibble containing v1, and finally perform an inner join on the year.
To make it more readable and avoid repeating code, I created a function that takes care of the string extraction.
# Import data and libraries
library(dplyr)
library(tidyr)
library(stringr)
df <- tribble(
~v1,~Country,~v2,
#--|--|---
0.93181,"Denmark 2007",NA,
NA,"Denmark 2008",5.519108,
0.64285,"Denmark 2009",NA,
NA,"Denmark 2010",4.93885,
0.55260,"Denmark 2011",NA,
NA,"Denmark 2012",5.101908,
0.13187,"New Zealand 2007",NA,
NA,"New Zealand 2008",3.187819
)
# Regular expressions to extract year and country from the Country column
regexp_year <- "[[:digit:]]+"
regexp_country <- "[[:alpha:]\\s]+"
# Function that carries out the string extraction from the `Country` column
do_separate_df <- function(df) {
  df %>%
    mutate(year = str_extract(Country, regexp_year) %>% as.numeric()) %>%
    # str_trim() drops the trailing space that the country pattern also matches
    mutate(Country = str_extract(Country, regexp_country) %>% str_trim())
}
# Tibble with non-NA values in v1 (earlier year)
df_v1 <- df %>%
select(v1,Country) %>%
drop_na %>%
do_separate_df()
# Tibble with non-NA values in v2 (later year)
df_v2 <- df %>%
select(Country,v2) %>%
drop_na %>%
do_separate_df()
# Join on df_v1$year + 1 = df_v2$year
df_combined <-inner_join(
df_v1 %>% mutate(year_to_match = year + 1),
df_v2,
by=c("year_to_match" = "year", "Country")
) %>%
mutate(Country = paste(Country, year, year + 1, sep = " ")) %>%
relocate(Country) %>%
select(-c(year,year_to_match))
df_combined
Country                v1       v2
Denmark 2007 2008      0.93181  5.519108
Denmark 2009 2010      0.64285  4.938850
Denmark 2011 2012      0.55260  5.101908
New Zealand 2007 2008  0.13187  3.187819
I have a growth rate, calculated from individual measurements 4 times a year, that I am trying to assign to a different time frame called Year2 (August 1st of year 1 to July 31st of year 2, see attached photo).
My Dataframe:
ID  Date        Year  Year2  Lag         Lapse    Growth       Daily_growth
1   2009-07-30  2009  2009   NA          NA       35.004       NA
1   2009-10-29  2009  2010   2009-07-30  91 days  31.585       0.347
1   2010-01-27  2010  2010   2009-10-29  90 days  63.769       0.709
1   2010-04-27  2010  2010   2010-01-27  90 days  28.329       0.315
1   2010-07-29  2010  2010   2010-04-27  93 days  32.068       0.345
1   2010-11-02  2010  2011   2010-07-29  96 days  128.1617320  1.335
I took the growth rate as follows:
Growth_df <- Growth_df%>%
group_by(ID) %>% # Individuals we measured
mutate(Lag = lag(Date), #Last date measured
Lapse = round(difftime(Date, Lag, units = "days")), #days between Dates monitored
Daily_growth = as.numeric(Growth) / as.numeric(Lapse))
What I am trying to do is assign the daily growth rate between each measurement, matching to the Year2 timeframe:
Growth_df <- Growth_df %>%
mutate(Year = as.numeric(Year),
Year2_growth = ifelse(Year == Year2, Daily_growth*Lapse, 0)) %>%
group_by(Year2) %>%
mutate(Year2_growth = sum(Year2_growth, na.rm = TRUE))
My problem is that I do not know how to handle the measurement intervals that span two years (something in place of the 0 in the ifelse statement). I need some way to calculate how many days lie between the new start date (August 1st) and the most recent measurement, multiply that by the growth rate, and also cut the period off at its end (July 31st).
I tried making a second dataframe with nothing but years and days and then assigning the growth rate by comparing the two dataframes, but I have been stuck on the same issue: partitioning the time frame.
I am sure there's a much, much more efficient way to deal with this, but this is how I sorted it out:
Make my time frames
Create a function for the ranges I wanted
Make a data frame for both the start and the end ranges
Join them together
Marvel at my lack of R skills.
library(dplyr)
library(lubridate)  # for ymd()
Start_dates <- seq(ymd('2008-08-01'), ymd('2021-08-01'), by = '12 months')
End_dates <- seq(ymd('2009-07-31'), ymd('2022-07-31'), by = '12 months')
Year2_dates <- data.frame(Start_dates, End_dates)
# Year2 is the calendar year in which each August-to-July window ends
Year2_dates <- Year2_dates %>%
  mutate(Year2 = as.numeric(format(Start_dates, "%Y")) + 1)
Vegetation <- Vegetation %>%
left_join(Year2_dates)
Range_finder <- function(x, y) {
  as.numeric(difftime(x, y, units = "days"))
}
Range_start <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==1) %>%
filter(Year != Year2) #had to get rid of first year samples as they were the top row but didn't have a change in year
Range_start <- Range_start %>%
mutate(Number_days_start = Range_finder(Date, Start_dates),
Border_range = Number_days_start * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Range_end <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==n(),
Year2 != 2022)
Range_end <- Range_end %>%
mutate(Number_days_end = Range_finder(End_dates, Date),
Border_range = Number_days_end * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Ranges <- full_join(Range_start, Range_end)
Vegetation <- Vegetation %>%
left_join(Ranges)
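To make the border calculation more concrete, here is a small sketch that splits one measurement interval at an August 1st boundary. It uses the last row of the table in the question (2010-07-29 to 2010-11-02, daily growth 1.335). Counting August 1st itself as the first day of the new Year2 is my assumption.

```r
# One interval from the question's table that straddles a Year2 boundary
lag_date     <- as.Date("2010-07-29")  # previous measurement
meas_date    <- as.Date("2010-11-02")  # current measurement
boundary     <- as.Date("2010-08-01")  # assumed first day of Year2 = 2011
daily_growth <- 1.335

# Days of the interval on each side of the boundary
days_before <- as.numeric(difftime(boundary, lag_date, units = "days"))
days_after  <- as.numeric(difftime(meas_date, boundary, units = "days"))

# Growth attributed to each Year2 window
growth_before <- days_before * daily_growth  # belongs to Year2 = 2010
growth_after  <- days_after * daily_growth   # belongs to Year2 = 2011

c(days_before = days_before, days_after = days_after)
```

The two parts add up to the 96-day lapse, so no growth is counted twice or dropped.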
I want to summarize the dataset based on "year", "months", and "subdist_id" columns. For each subdist_id, I want to get average values of "Rainfall" for the months 11,12,1,2 but for different years. For example, for subdist_id 81, the mean Rainfall value of 2004 will be the mean Rainfall of months 11, 12 of 2004, and months 1,2 of 2005.
I have no clue how to do it, although I have searched online extensively.
Expanding on @Bloxx's answer and incorporating my comment:
# Set up example data frame:
df = data.frame(year=c(rep.int(2004,2),rep.int(2005,4)),
month=((0:5%%4)-2)%%12+1,
Rainfall=seq(.5,by=0.15,length.out=6))
Now use mutate to create year2 variable:
df %>% mutate(year2 = year - (month<3)*1) # or similar depending on the problem specs
And now apply the groupby/summarise action:
df %>% mutate(year2 = year - (month<3)*1) %>%
group_by(year2) %>%
summarise(Rainfall = mean(Rainfall))
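The year shift can be checked in isolation on two short vectors. This is only an illustration of the arithmetic, using one November-to-February season.

```r
# One winter season: Nov and Dec 2004, Jan and Feb 2005
year  <- c(2004, 2004, 2005, 2005)
month <- c(11, 12, 1, 2)

# (month < 3) is TRUE for Jan and Feb; multiplying by 1 turns it into 0/1,
# so those months are moved back to the previous year
year2 <- year - (month < 3) * 1
year2
```

All four months end up in the same group, which is what lets group_by(year2) average across the turn of the year.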
Let's assume your dataset is called df. Is this what you are looking for?
df %>% group_by(subdist_id, year) %>% summarise(Rainfall = mean(Rainfall))
I think you can simply do this:
df %>% filter(months %in% c(1,2,11,12)) %>%
group_by(subdist_id, year=if_else(months %in% c(1,2),year-1,year)) %>%
summarize(meanRain = mean(Rainfall))
Output:
subdist_id year meanRain
<dbl> <dbl> <dbl>
1 81 2004 0.611
2 81 2005 0.228
Input:
df = data.frame(
subdist_id = 81,
year=c(2004,2004, 2005, 2005, 2005, 2005),
months=c(11,12,1,2,11,12),
Rainfall = c(.251,.333,.731,1.13,.111,.346)
)
I use a dataset with 16 variables and 80,000 observations.
The variable "syear" describes the year of the observation (2008,2012,2016).
The variable "pid" describes the unique person ID.
As you can see in the screenshot, it is possible that persons only participated in one or two years. I only want to keep observations from persons, who participated in all three years. In the screenshots this would be the pid 901 and 1501.
How do I filter my dataset by this condition?
pid and year
You can try this:
library(tidyverse)
df <- tribble(
~syear, ~pid,
2008,201,
2008,203,
2008,602,
2012,602,
2008,604,
2008,901,
2012,901,
2016,901,
2008,1501,
2012,1501,
2016,1501
)
df %>%
group_by(pid) %>%
mutate(cnt = n()) %>%
filter(cnt == 3)
# alternatively, the cnt column can be dropped
df %>%
group_by(pid) %>%
mutate(cnt = n()) %>%
filter(cnt == 3) %>%
select(-cnt)
As a simplification of nyk's answer you could also do this:
library(dplyr)
library(conflicted)
conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over any other package
tibble(
year = c(2001, 2002, 2003, 2001, 2003, 2002, 2003),
pid = c(1, 1, 1, 2, 2, 3, 3)
) %>%
group_by(pid) %>%
filter(n() == 3)
#> # A tibble: 3 x 2
#> # Groups: pid [1]
#> year pid
#> <dbl> <dbl>
#> 1 2001 1
#> 2 2002 1
#> 3 2003 1
Created on 2021-01-05 by the reprex package (v0.3.0)
So you do not have to create cnt as an intermediary variable. Depending on what you want to do afterward you may call ungroup().
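Note that counting rows with n() == 3 assumes each person appears at most once per year. If duplicates are possible, a sketch like the following checks for the three survey years explicitly. The data and the name required_years are made up for illustration.

```r
library(dplyr)

# Made-up data: pid 700 has three rows, but all from the same year
df <- data.frame(
  syear = c(2008, 2012, 2016, 2008, 2008, 2008, 2008, 2012),
  pid   = c(901, 901, 901, 700, 700, 700, 602, 602)
)

required_years <- c(2008, 2012, 2016)

complete_cases <- df %>%
  group_by(pid) %>%
  # keep a person only if every required year occurs in their rows
  filter(all(required_years %in% syear)) %>%
  ungroup()

complete_cases
```

Here only pid 901 is kept, whereas filter(n() == 3) would also keep pid 700.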
I am using my own version of the gapminder data set and trying to see which country has realized the most growth from 2008 to 2018. When I use the original gapminder data, it works fine, but for some reason I cannot replicate it on my own data set. The problem is that I cannot use na.locf() because all the "2008" rows are populated before the "2018" rows.
I am using the spread function, but it returns values in a way where I can't carry the last observation forward, and the group_by function does not seem to work.
# The code on the original data that works fine
library(gapminder)
library(dplyr)
library(tidyr)  # spread()
library(zoo)    # na.locf()
gapminder %>%
  filter(year %in% c("1952", "1957")) %>%
  spread(year, pop) %>%
  na.locf() %>%
  mutate(diff = `1957` - `1952`)
However, when I use my data set (the structure is the same), it changes the data in a way that is difficult to subtract
> class(gapminder_df$Year)
[1] "integer"
> class(gapminder_df$population)
[1] "numeric"
# and
> nrow(gapminder_df[gapminder_df$Year == "2018",])
[1] 134
> nrow(gapminder_df[gapminder_df$Year == "2008",])
[1] 134
top_10 <- gapminder_df %>%
filter(Year %in% c("2008", "2018")) %>%
spread(Year, population) %>%
na.locf()
The first column has NAs for the first half of the rows and the second column has NAs for the second half, so I can't subtract them. group_by(country) doesn't provide the desired result either:
2018 2008 country
1 NA 27300000 Afghanistan
2 NA 2990000 Albania
3 NA 34900000 Algeria
4 NA 21800000 Angola
Here is a sample of the data:
gapminder_df <- tibble(
  # lower-case `country` to match the column name used in the other snippets
  country = c(rep("Afghanistan", 4), rep("Albania", 4), rep("Algeria", 4), rep("Angola", 4)),
  Year = rep(c("2008", "2009", "2018", "2004"), 4),
  population = rnorm(16, mean = 5000000, sd = 50)
)
EDIT:
I was able to fix it by selecting only relevant columns before spread... can someone explain to me why that worked? I guess I had multiple of the same dates for the same countries with many different values for other variables?
top_10 <- gapminder_df %>%
select(country, Year, population) %>%
filter(Year %in% c("2008", "2018")) %>%
spread(Year, population)
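A likely explanation, sketched on toy data: spread() treats every column other than the key and value as an identifier. If another column varies between years, the 2008 and 2018 values for one country land on different rows. The gdp column below is hypothetical and stands in for whatever other variables the real data set contains.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  country    = c("A", "A", "B", "B"),
  Year       = c(2008, 2018, 2008, 2018),
  population = c(100, 120, 200, 260),
  gdp        = c(1, 2, 3, 4)  # hypothetical extra column that differs by year
)

# gdp stays as an identifier, so each country keeps two half-empty rows
wide_with_extra <- toy %>% spread(Year, population)

# With only country, Year and population, each country collapses to one row
wide <- toy %>%
  select(country, Year, population) %>%
  spread(Year, population) %>%
  mutate(diff = `2018` - `2008`)

wide_with_extra
wide
```

In the first result every row has an NA in either the 2008 or the 2018 column, which matches the pattern in the output shown in the question.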