Related
This is a cutout of my dataframe
I have a dataframe where i have two different variables that is found one year apart from each other. I would like to combine for exampel 2007 and 2008 to make one row with both variable and name it Denmark2007/8.
I have about 300 rows to do this with, and cannot find a command that will do this, and typing it mannually is not in the question
I have looked at everything from merge() and colsums, and i am lost
While one can debate whether a wide format data frame will be easiest to use in subsequent analysis steps, the tricky part of this request is that the names of countries may include multiple words. This means that a simpler solution like tidyr::separate() with sep = " " isn't feasible.
Here is a solution that uses length of each Country to extract the last 4 characters into a Year column, and everything before the final space as Country.
For the purposes of this example, v1 represents the odd year data, and v2 represents the even year data.
Refactored Solution
After coding the tidyverse friendly answer (see below), I realized I could simplify the original solution by starting with the long form tidy data, splitting it into even and odd years, renaming columns and then merging by year.
First, we create data based on the graphic in the original post, and add a couple of rows for a country whose name includes multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
After reading the data into a data frame, we extract the last 4 characters from the Country column to create Year, merge v1 and v2 into a single column, add a yearType column, and use it to split the data into even and odd years.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 4,countryLength)),
value = if_else(!is.na(v1),v1,v2),
yearType = if_else(Year %% 2 == 0,"Even","Odd")) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) %>%
split(.$yearType) -> dataList
Having split the data into two data frames, we now rename columns in the even year data frame, subtract 1 from Year to merge with the odd numbered year data, join with the odd numbered year data, rename a few columns and add a column for the even numbered years.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType)
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 5.519108 0.93181 2008
2 Denmark 2009 4.938850 0.64285 2010
3 Denmark 2011 5.101908 0.55260 2012
4 United Kingdom 2007 3.187810 0.13187 2008
>
If it is absolutely required to append the start and end years to the Country column, that can be accomplished as follows.
dataList$Even %>%
rename(EvenYearValue = value) %>%
mutate(Year = Year - 1) %>%
select(-yearType) %>%
full_join(dataList$Odd,by = c("Country","Year")) %>%
rename(OddYearValue = value,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1) %>% select(-yearType) %>%
# modify the Country name to include years
mutate(Country = paste(Country,OddYear,"-",EvenYear))
...and the output:
Country OddYear EvenYearValue OddYearValue EvenYear
1 Denmark 2007 - 2008 2007 5.519108 0.93181 2008
2 Denmark 2009 - 2010 2009 4.938850 0.64285 2010
3 Denmark 2011 - 2012 2011 5.101908 0.55260 2012
4 United Kingdom 2007 - 2008 2007 3.187810 0.13187 2008
>
Original Solution
First, we covert the graphic from the question into usable data, and include a couple of rows for a country name that contains multiple words.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
Next, we load a couple of packages, create a column to count the number of characters in each row of Country, and use it to separate Year from countryName. We also drop the intermediary columns created during this operation and save the result to yearlyData.
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 4,countryLength))) %>%
select(!c(Country,countryLength)) %>%
rename(Country = countryName) -> yearlyData
At this point we separate the even years data into another data frame, drop the v1 variable, and subtract 1 from Year so we can merge it with the data for odd numbered years.
yearlyData %>%
filter(Year %% 2 == 0) %>%
select(-v1) %>%
mutate( Year = Year - 1) -> evenYears
Next, we read the yearly data, filter() out the rows for even numbered years, merge in the evenYears data frame via full_join(), rename a few columns and generate a new column for the even numbered years.
yearlyData %>%
filter(Year %% 2 == 1) %>%
rename(OddYearValue = v1) %>%
select(-v2) %>%
full_join(.,evenYears,by = c("Year","Country")) %>%
rename(EvenYearValue = v2,
OddYear = Year) %>%
mutate(EvenYear = OddYear + 1)
...and the output:
OddYearValue Country OddYear EvenYearValue EvenYear
1 0.93181 Denmark 2007 5.519108 2008
2 0.64285 Denmark 2009 4.938850 2010
3 0.55260 Denmark 2011 5.101908 2012
4 0.13187 United Kingdom 2007 3.187810 2008
>
NOTE: that the tidy data specification assets that each column in a data frame should contain one and only one variable, so we did not combine OddYear, EvenYear and Country into a single column as requested in the original post.
A tidy friendly solution
In the classic article on this topic, Hadley Wickham defines two forms of tidy data, narrow / long form and wide form.
The following solution creates a tidy data long form data frame, where each row in the resulting table is one value for each combination of Country and Year.
textData <- "v1,Country,v2
0.93181,Denmark 2007,NA
NA,Denmark 2008,5.519108
0.64285,Denmark 2009,NA
NA,Denmark 2010,4.93885
.55260,Denmark 2011,NA
NA,Denmark 2012,5.101908
0.13187,United Kingdom 2007,NA
NA,United Kingdom 2008,3.18781"
df <- read.csv(text = textData)
library(dplyr)
library(stringr)
df %>%
mutate(countryLength = str_length(Country),
countryName = substr(Country,1,countryLength - 5),
Year = as.numeric(substr(Country,countryLength - 4,countryLength)),
value = if_else(!is.na(v1),v1,v2)) %>%
select(!c(Country,countryLength,v1,v2)) %>%
rename(Country = countryName) -> yearlyData
yearlyData
...and the output:
> yearlyData
Country Year value
1 Denmark 2007 0.931810
2 Denmark 2008 5.519108
3 Denmark 2009 0.642850
4 Denmark 2010 4.938850
5 Denmark 2011 0.552600
6 Denmark 2012 5.101908
7 United Kingdom 2007 0.131870
8 United Kingdom 2008 3.187810
>
Ironically, given the input data, it's much easier to create a long form tidy data frame than it is to format the data as requested in the original post.
I had to make very specific assumptions about the whole dataset. I hope they apply also to the rest of the table:
You always merge two rows together (not more than two)
The format of the column with Country + year is always the same
Older years (e.g. 2007) always have a non-NA value in the first column and an NA in the last column, while the opposite is true for more recent years (e.g. 2008).
If these assumptions hold, I thought to work it out by first creating two tibbles containing the columns for Country and year, only the non-NA values in v1 and v2, respectively (i.e. dropping all the NAs).
Then you add one year to the tibble containing v1, and finally perform an inner join on the year.
To make it more readable, and do not repeat code, I created a function that takes care of the string extraction.
# Import data and libraries
library(dplyr)
library(tidyr)
library(stringr)
df <- tribble(
~v1,~Country,~v2,
#--|--|---
0.93181,"Denmark 2007",NA,
NA,"Denmark 2008",5.519108,
0.64285,"Denmark 2009",NA,
NA,"Denmark 2010",4.93885,
0.55260,"Denmark 2011",NA,
NA,"Denmark 2012",5.101908,
0.13187,"New Zealand 2007",NA,
NA,"New Zealand 2008",3.187819
)
# Regular expressions to extract year and country from the Country column
regexp_year <- "[[:digit:]]+"
regexp_country <- "[[:alpha:]\\s]+"
# Function that carries out the string extraction from the `Country` column
do_separate_df <- function(df) {
df %>%
mutate(year = str_extract(Country,regexp_year) %>% as.numeric()) %>%
mutate(Country = str_extract(Country,regexp_country))
}
# Tibble with non-NA values in v1 (earlier year)
df_v1 <- df %>%
select(v1,Country) %>%
drop_na %>%
do_separate_df()
# Tibble with non-NA values in v2 (later year)
df_v2 <- df %>%
select(Country,v2) %>%
drop_na %>%
do_separate_df()
# Join on df_v1$year + 1 = df_v2$year
df_combined <-inner_join(
df_v1 %>% mutate(year_to_match = year + 1),
df_v2,
by=c("year_to_match" = "year", "Country")
) %>%
mutate(Country = paste(Country, year, year + 1, sep = " ")) %>%
relocate(Country) %>%
select(-c(year,year_to_match))
df_combined
Country
v1
v2
Denmark 2007 2008
0.93181
5.519108
Denmark 2009 2010
0.64285
4.938850
Denmark 2011 2012
0.55260
5.101908
New Zealand 2007 2008
0.13187
3.187819
I want to summarize the dataset based on "year", "months", and "subdist_id" columns. For each subdist_id, I want to get average values of "Rainfall" for the months 11,12,1,2 but for different years. For example, for subdist_id 81, the mean Rainfall value of 2004 will be the mean Rainfall of months 11, 12 of 2004, and months 1,2 of 2005.
I am getting no clue how to do it, although I searched online rigorously.
Expanding on #Bloxx's answer and incorporating my comment:
# Set up example data frame:
df = data.frame(year=c(rep.int(2004,2),rep.int(2005,4)),
month=((0:5%%4)-2)%%12+1,
Rainfall=seq(.5,by=0.15,length.out=6))
Now use mutate to create year2 variable:
df %>% mutate(year2 = year - (month<3)*1) # or similar depending on the problem specs
And now apply the groupby/summarise action:
df %>% mutate(year2 = year - (month<3)*1) %>%
group_by(year2) %>%
summarise(Rainfall = mean(Rainfall))
Lets assume your dataset is called df. Is this what you are looking for?
df %>% group_by(subdist_id, year) %>% summarise(Rainfall = mean(Rainfall))
I think you can simply do this:
df %>% filter(months %in% c(1,2,11,12)) %>%
group_by(subdist_id, year=if_else(months %in% c(1,2),year-1,year)) %>%
summarize(meanRain = mean(Rainfall))
Output:
subdist_id year meanRain
<dbl> <dbl> <dbl>
1 81 2004 0.611
2 81 2005 0.228
Input:
df = data.frame(
subdist_id = 81,
year=c(2004,2004, 2005, 2005, 2005, 2005),
months=c(11,12,1,2,11,12),
Rainfall = c(.251,.333,.731,1.13,.111,.346)
)
I am trying to merge two datasets. The survey dataset consists of biodiversity surveys from different regions conducted every 1-5 years in a certain month (the month is constant within, but not between, regions). The temperature dataset consists of daily temperature readings in each survey region.
For multiple surveys that have different start months and temporal extents, I want to pair each survey*year combination with the twelve months of temperature data preceding it. In other words, I want to pair a May 1983 survey with the 12 months (or 365 days -- I don't care which) of daily temperature records preceding it, ending April 30, 1983. Meanwhile, another survey elsewhere conducted in August 1983 needs to be paired with the 365 days of temperature data ending July 31, 1983.
There are (at least) two ways to do this -- one would be joining the survey data to the (longer) temperature data and then somehow subsetting or identifying which dates fall in the 12 months preceding the survey-date. Another is to start with the survey data and try to pair the temperature data to each row with a matrix-column -- I tried doing this with time-series tools from tsibble and tsModel but couldn't get it to "lag" the right values when grouped by region.
I was able to create an identifier to join the datasets such that each date in the temperature data is matched with the subsequent survey in time. However, not all of those are within 365 days (e.g., in the dataset created below, the date 1983-06-03 is matched with the ref_year aleutian_islands-5-1986 because the survey only happens every 3-5 years).
Here are some examples of the behavior I want for a single region (from the example dataset below), although I'm open to solutions that achieve the same thing but don't look exactly like this:
For this row, the value in the new column that I want to generate (ref_match) should be NA; the date is more than 365 days before ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 1982-06-09 1982 6 6-1982 aleutian_islands-5-1983 0
For this row, ref_match should be aleutian_islands-5-2014 since the date is within 12 months of ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 2013-07-22 2013 7 7-2013 aleutian_islands-5-2014 0.998
The following script will generate a dataset temp_dat with columns like those in the snippets above from which I hope to generate the ref_match column.
# load packages
library(tidyverse)
library(lubridate)
set.seed=10
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <- rbind(ai_dat, ebs_dat) %>%
mutate(month_year = paste0(startmonth,"-",year)) %>%
select(region, month_year) %>%
distinct() %>%
mutate(region_month_year = paste0(region,"-",month_year))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <- expand.grid(month=seq(1, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(region_month_year = paste0(region,"-",month,"-",year)) %>% # create unique identifier
mutate(ref_year = ifelse(region_month_year %in% surv_dat$region_month_year, region_month_year, NA),
month_year = paste0(month,"-",year)) %>%
select(region, month_year, ref_year) %>%
distinct() %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
month_year = paste0(month,"-",year)) %>%
left_join(surv_dat_exploded, by=c('region','month_year')) %>%
filter(!is.na(ref_year))# get rid of dates that are after any ref_year
Sounds like you want a non-equi join. This is easily done with data.table and is very fast. Here's an example that lightly modifies your MWE:
library(data.table)
# make survey dfs
ai_dat = data.table(year = c(1983, 1986, 1991, 1994, 1997),
region = "aleutian_islands", "startmonth" = 5)
ebs_dat = data.table(year = seq(1983, 1999, 1),
region = "eastern_bering_sea", "startmonth" = 5)
# bind together and create date (and cutoffdate) vars
surv_dat = rbind(ai_dat, ebs_dat)
surv_dat[, startdate := as.IDate(paste(year, startmonth, '01', sep = '-'))
][, cutoffdate := startdate - 365L]
# make temperature df
temp_dat = CJ(date=seq(as.IDate("1982-01-01"), as.IDate("1999-12-31"), "days"),
region=c('aleutian_islands','eastern_bering_sea'))
# add temperature var
temp_dat$temp = rnorm(nrow(temp_dat))
# create duplicate date variable (will make post-join processing easier)
temp_dat[, matchdate := date]
# Optional: Set keys for better join performance
setkey(surv_dat, region, startdate)
setkey(temp_dat, region, matchdate)
# Where the magic happens: Non-equi join
surv_dat = temp_dat[surv_dat, on = .(region == region,
matchdate <= startdate,
matchdate >= cutoffdate)]
# Optional: get rid of unneeded columns
surv_dat[, c('matchdate', 'matchdate.1') := NULL][]
#> date region temp year startmonth
#> 1: 1982-05-01 aleutian_islands 0.3680810 1983 5
#> 2: 1982-05-02 aleutian_islands 0.8349334 1983 5
#> 3: 1982-05-03 aleutian_islands -1.3622227 1983 5
#> 4: 1982-05-04 aleutian_islands 1.4327587 1983 5
#> 5: 1982-05-05 aleutian_islands 0.5068226 1983 5
#> ---
#> 8048: 1999-04-27 eastern_bering_sea -1.2924594 1999 5
#> 8049: 1999-04-28 eastern_bering_sea 0.7519078 1999 5
#> 8050: 1999-04-29 eastern_bering_sea -1.0185174 1999 5
#> 8051: 1999-04-30 eastern_bering_sea -1.4322252 1999 5
#> 8052: 1999-05-01 eastern_bering_sea -1.0412836 1999 5
Created on 2021-05-20 by the reprex package (v2.0.0)
Try this solution.
I basically used your reference column to generate a ref_date and estimate the difference in days between the observation and reference. Then, I used a simple ifelse to test if the dates fall within the 365 days range and then copy them to the temp_valid column.
# load packages
library(tidyverse)
library(lubridate)
set.seed=10
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <-
rbind(ai_dat, ebs_dat) %>%
mutate(year_month = paste0(year,"-",startmonth),
region_year_month = paste0(region,"-",year,"-",startmonth))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <-
expand.grid(month=seq(01, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(year_month = paste0(year,"-",month)) %>%
mutate(region_year_month = paste0(region,"-",year,"-",month)) %>%
mutate(ref_year = ifelse(region_year_month %in% surv_dat$region_year_month, region_year_month,NA)) %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
year_month = paste0(year,"-",month))
final_df <-
left_join(temp_dat, surv_dat_exploded, by=c('region','year_month')) %>%
#split ref_column in ref_year and ref_region
separate(ref_year, c("ref_region","ref_year"), "-", extra="merge") %>%
#convert ref_year into date
mutate_at("ref_year", as.Date, format= "%Y-%M") %>%
#round it down to be in the first day of the month (not needed if the day matters)
mutate_at("ref_year", floor_date, "month" ) %>%
#difference between observed and the reference
mutate(diff_days = date - ref_year) %>%
# ifelse statement for capturing values of interest
mutate(temp_valid = ifelse(between(diff_days, -365, 0),temperature,NA))
This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 3 years ago.
I'm working with a time-series tibble that is organized in the following way:
Country<- ('Somalia')
'1961'<- 2999
'1962'<- 2917
'1963'<- 1853
df <- data.frame(Country, `1961`, `1962`, `1963`)
df
The problem is that is extremely hard to work with data organized in such a way, since that the only way to access the data that I want (those numbers that are under the column names) is by referring to each year individually.
Is there a simple way to organize them in a tidy way, such as:
x <- 'Somalia'
y <- c('1961', '1962', '1963')
z <- c(2999, 2917, 1853)
df <- data.frame(x, y, z)
df
Without having to manually rebuild the entire dataset?
> library(tidyverse)
> df %>%
gather(Year, Value, -Country)
Country Year Value
1 Somalia 1961 2999
2 Somalia 1962 2917
3 Somalia 1963 1853
where df is
df <- data.frame(Country = "Somalia",
`1961` = 2999,
`1962` = 2917,
`1963` = 1853,
check.names = FALSE)
I have a matrix of daily data of average flow and want to make a summary matrix that shows the maximum peak flow. Here's a little sample of what my data looks like:
x<-c(5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100)
flow<-matrix(c(c(rep(1990,365),rep(1991,365),rep(1992,365)),sample(x,(365*3), replace=TRUE)),nrow=(365*3), ncol=2)
I'd like the summary matrix to be formatted with the year in column 1 and the peak flow event from that year in column 2. Here's an example of how I would like the summary matrix formatted.
summary=matrix(, ncol=2, nrow=3)
summary[,1]=c(1990,1991,1992)
This should be close:
DF <- as.data.frame(flow)
names(DF) <- c("year", "flow")
DF$year <- as.factor(DF$year)
res <- aggregate(flow ~ year, data = DF, FUN = max)
And gives:
year flow
1 1990 100
2 1991 100
3 1992 100
in the form of a data frame.
And the dplyr family of functions (building on #Bryans work):
DF <- as.data.frame(flow)
names(DF) <- c("year", "flow")
group_by(DF, year) %>% summarize(flow = max(flow))
Gives:
Source: local data frame [3 x 2]
year flow
1 1990 100
2 1991 100
3 1992 100