Find the closest value for a certain year in R

I have this type of data:
iso3 year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA
and I would like to find the closest value to years 2012 and 2017 (+ or - 2 years, i.e. for 2012 it can be 2010, 2011, 2013 or 2014 data) for the cata10 variable. The output should be:
iso3year_UHC UHC year_cata cata10
AFG 2012 0.3468012 2013 14.631331
AFG 2017 0.3948606 2016 4.837534
AGO 2012 0.3400455 2011 12.379809
AGO 2017 0.3764945 2015 16.902584
EDIT: Note that I should have NA if there is no data within 2 years before or after the reference year.
I have tried tons of commands over the past two days but could not manage to find a solution. Could you please advise on the type of commands to try?
Thank you very much,
N.

Here are three approaches. The first is the clearest: it shows that the problem is really an aggregated and filtered self-join, models this directly, and automatically handles the edge case mentioned in the comments without additional code. The second uses an lapply loop to get the desired effect; it involves more tedious manipulation, although it has the advantage of zero package dependencies. The last gets around the fact that dplyr lacks complex self-joins by performing a left join twice.
1) sqldf Using DF, defined reproducibly in the Note at the end, perform a self-join such that the difference in years is -2, -1, 1 or 2, the iso3 codes are the same, and cata10 is not NA in the matching instance; among those rows, min(...) finds the row having the minimum absolute difference in year. This relies on an SQLite feature whereby min(...) causes the entire row satisfying the minimizing condition to be returned. Finally, take only the 2012 and 2017 rows. SQL's ability to express the constraints in a complex join lets us translate the requirements directly into code.
library(sqldf)

sqldf("select
         a.iso3year iso3year_UHC,
         a.UHC,
         substr(b.iso3year, 5, 8) year_cata,
         b.cata10,
         substr(a.iso3year, 5, 8) year,
         min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value
       from DF a
       left join DF b on year - year_cata in (-2, -1, 1, 2) and
         substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
         b.cata10 is not null
       group by a.iso3year
       having year in ('2012', '2017')")[1:4]
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
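As an aside, here is a tiny standalone illustration (toy data, not from the question) of the SQLite bare-column behavior the query relies on: with min(), the other selected columns come from the row that achieves the minimum.
library(sqldf)

sqldf("select x, min(y) from (select 'a' x, 2 y union select 'b' x, 1 y)")
#>   x min(y)
#> 1 b      1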
2) Base R This solution uses only base R. We first create year and iso variables by breaking iso3year into two parts. ix is an index into DF giving the rows having 2012 or 2017 as their year. For each of those rows we find the nearest year having a cata10 value and create a row of the output data frame; lapply returns these as a list of rows, L. Finally we rbind those rows together. This is not as straightforward as (1) but does have the advantage of no package dependencies.
to.year <- function(x) as.numeric(substr(x, 5, 8))

year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))

L <- lapply(ix, function(i) {
  DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
  if (nrow(DF0)) {
    with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")],
         data.frame(iso3year_UHC = DF$iso3year[i],
                    UHC = DF$UHC[i],
                    year_cata = as.numeric(substr(iso3year, 5, 8)),
                    cata10))
  } else {
    data.frame(iso3year_UHC = DF$iso3year[i],
               UHC = DF$UHC[i],
               year_cata = NA,
               cata10 = NA)
  }
})

do.call("rbind", L)
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
3) dplyr/tidyr
First separate iso3year into iso and year columns, giving DF2. Then pick out the 2012 and 2017 rows, giving DF3. Now left-join DF3 to DF2 using iso and keep those rows for which cata10 in the joined instance is not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice to pick out the row having the least distance in years and select the desired columns, giving DF4. Finally, left-join DF3 with DF4, which fills out any rows for which there was no match.
library(dplyr)
library(tidyr)

DF2 <- DF %>%
  separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)

DF3 <- DF2 %>%
  filter(year %in% c(2012, 2017))

DF4 <- DF3 %>%
  left_join(DF2, "iso") %>%
  drop_na(cata10.y) %>%
  filter(abs(year.x - year.y) %in% 1:2) %>%
  group_by(iso3year.x) %>%
  slice(which.min(abs(year.x - year.y))) %>%
  ungroup %>%
  select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)

DF3 %>%
  select(iso3year, UHC) %>%
  left_join(DF4, c("iso3year", "UHC"))
giving:
# A tibble: 4 x 4
iso3year UHC year_cata cata10
<chr> <dbl> <int> <dbl>
1 AFG 2012 0.347 2013 14.6
2 AFG 2017 0.395 2016 4.84
3 AGO 2012 0.340 2011 12.4
4 AGO 2017 0.376 2015 16.9
Note
Lines <- "iso3year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA"
DF <- read.csv(text = gsub(" +", ",", Lines), as.is = TRUE)

Here is an answer with dplyr only:
library(tidyverse)
uhc_comb = read.table(header = T, text = "
iso3 year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2026 0.3855273 4.837534 #Year is 2026 for the example
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA")
uhc_comb2 = uhc_comb %>%
  pivot_longer(cols = c("UHC", "cata10")) %>% # pivot UHC and cata10 to long format as columns "name" and "value"
  filter(!is.na(value)) %>% # remove missing
  group_by(iso3, name) %>% # for each iso3 and for each variable name (UHC and cata10)
  mutate(dist = pmin(abs(year - 2012), abs(year - 2017))) %>% # compute the distance between the year and the targets and keep only the lowest
  # filter(dist <= 2) %>% # remove
  top_n(-2, dist) %>% # select the minimal distance (in each group)
  mutate(year = ifelse(dist > 2, NA, year),
         value = ifelse(dist > 2, NA, value)) %>% # infer NA if the distance is too high
  select(-dist) # discard the now useless variable

uhc_comb2 %>%
  pivot_wider(id_cols = iso3, values_from = c("year", "value")) %>% # pivot to wide again
  unnest() # since there are several values, unnest the lists from the data frame
This will output some warnings, but they are not significant. I'm not sure it is possible to remove them.
If you want to understand this better, run it line by line. Pivoting tables is tough brain gymnastics in the beginning.
EDIT: this will get you the right output with no warnings:
uhc_comb2 %>%
  pivot_wider(id_cols = iso3,
              values_from = c("year", "value"),
              values_fn = list(value = list, year = list)) %>%
  unnest(cols = c(year_cata10, year_UHC, value_cata10, value_UHC))
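If your tidyr is 1.1 or later, values_fn can also be a single function applied to all value columns, so the same pipeline can be written slightly more compactly (a sketch, version-dependent):
uhc_comb2 %>%
  pivot_wider(id_cols = iso3,
              values_from = c("year", "value"),
              values_fn = list) %>% # wrap every cell's values in a list
  unnest(cols = c(year_UHC, year_cata10, value_UHC, value_cata10))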

Related

How to replace NA values with average of precedent and following values, in R

I currently have a dataset that has more or less the following characteristics:
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
What I would like to do is find a command/write a function that fills only the NAs for each country such that:
1. if the NA Observation is for the first year (2010), it is filled with the next non-NA Observation;
2. if the NA Observation is for the last year (2016), it is filled with the previous available period's Observation;
3.1 if the NA Observation is for a year between the first and last, it is filled with the average of the 2 closest periods;
3.2 however, if there are 2 or more consecutive NAs (let's take 2 as an example), the first is filled with the preceding Observation and the second with the same method as (3.1).
As an illustration, the previous dataset should finally be:
Observation2 <- c(2, 5, 5, 3.5 ,2,3,2, 2,2,3,1,1)
df2 <- data.frame(Country, Year, Observation2)
I hope I was sufficiently clear. It is very specific but I hope someone can help.
Feel free to ask any questions about it if you do not understand.
Input. There is some question of whether the alternation of country names, as mentioned in the comments under the question and shown in the Note at the end, was intended; at any rate, assume that each subsequence of increasing years is a separate group and group by them, grp. (If it was intended that the first 6 entries in Country be Honduras and the last 6 be Belize, then we could replace the group_by(...) with group_by(Country) in the code below.)
Clarification of Question. We assume that the question is asking that within group:
Leading NAs are to be replaced with the first non-NA.
Trailing NAs are to be replaced with the last non-NA.
If there is a single NA surrounded by non-NAs, it is replaced by the prior non-NA.
If there are two consecutive NAs, the first is replaced with the prior non-NA and the second is filled in with the average of the prior non-NA and the next non-NA.
The question does not address the situation of 3+ consecutive NAs, so maybe this never occurs; but just in case it does, the code fills in the first NA with the prior non-NA and the remainder by linear interpolation.
Code. Now, for each group, replace any NA with the prior value. Then use linear interpolation on what is left via na.approx, using rule = 2 to extend the ends. Finally, keep only the desired columns.
dplyr clashes. Note that lag and filter in dplyr collide incompatibly with the base R functions of the same name, so we exclude them on load and use the dplyr:: prefix when we do want the dplyr versions.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)

df2 <- df %>%
  # group_by(Country) %>%
  group_by(grp = cumsum(c(TRUE, diff(Year) < 0))) %>%
  mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
           na.approx(rule = 2)) %>%
  ungroup %>%
  select(Country, Year, Observation2)
identical(df2$Observation2, Observation2)
## [1] TRUE
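As a quick sanity check of the 3+ consecutive NA case discussed above, here is a minimal sketch on a made-up vector:
library(dplyr, exclude = c("lag", "filter"))
library(zoo)

x <- c(1, NA, NA, NA, 5)
coalesce(x, dplyr::lag(x))                       # 1 1 NA NA 5: the first NA takes the prior value
na.approx(coalesce(x, dplyr::lag(x)), rule = 2)  # 1.000000 1.000000 2.333333 3.666667 5.000000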
Note
We used this input taken from the question.
Country <- rep(c("Honduras", "Belize"),6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
df
giving:
Country Year Observation
1 Honduras 2010 2
2 Belize 2011 5
3 Honduras 2012 NA
4 Belize 2014 NA
5 Honduras 2015 2
6 Belize 2016 3
7 Honduras 2010 NA
8 Belize 2011 NA
9 Honduras 2012 2
10 Belize 2014 3
11 Honduras 2015 1
12 Belize 2016 NA
Added
In a comment the poster added another example. We run it here. This is the same code incorporating the simplification to group_by discussed in the first paragraph above. (That does not change the result.)
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5, NA, NA,2,3, NA, NA,2, NA,1,NA)
df <- data.frame(Country, Year, Observation)
df2 <- df %>%
  group_by(Country) %>%
  mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
           na.approx(rule = 2)) %>%
  ungroup %>%
  select(Country, Year, Observation2)
df2
giving:
# A tibble: 12 x 3
Country Year Observation2
<chr> <dbl> <dbl>
1 Honduras 2010 2
2 Honduras 2011 5
3 Honduras 2012 5
4 Honduras 2014 3.5
5 Honduras 2015 2
6 Honduras 2016 3
7 Belize 2010 2
8 Belize 2011 2
9 Belize 2012 2
10 Belize 2014 2
11 Belize 2015 1
12 Belize 2016 1

R: Count number of new observations compared to a previous groups

I would like to know the number of new observations that occurred between groups.
If I have the following data:
Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C
I would like the output to be:
Year New_Observation_Count
2009 2
2010 1
I am new to R and don't really know how to move forward. I have tried using the count function in the tidyverse package but still can't figure it out.
You can use union in Reduce:
y <- split(x$Observation, x$Year)
data.frame(Year = names(y),
           nNew = diff(lengths(Reduce(union, y, NULL, accumulate = TRUE))))
# Year nNew
#1 2009 2
#2 2010 1
Data:
x <- read.table(header=TRUE, text="Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C")

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
  Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Year = c(2015, 2016, 2017, 2015, 2016, 2017),
  Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
  GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
  GINI = NA,
  Status = 2,
  stringsAsFactors = F
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning it into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID <- master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country with the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
# For loop to make Country.ID numerically match Country
for (i in 1:460) {
  for (j in N.Countries) {
    master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
  }
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2 based on a key-value pair between Country.ID and 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
                        data.frame(Country = unique(master.set$Country),
                                   Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
library(dplyr)

df <- data.frame("Country" = c("Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Albania", "Albania"),
                 "Year" = c(2015, 2016, 2017, 2015, 2016, 2017),
                 "Happiness.Score" = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
                 "GDP.PPP" = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
                 "GINI" = NA,
                 "Status" = rep(2, 6))

df1 <- df %>%
  arrange(Country) %>%
  mutate(Country_id = group_indices_(., .dots = "Country"))
View(df1)
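Note that group_indices_() has since been deprecated; in current dplyr (>= 1.0.0) the same result can be obtained with cur_group_id(), for example:
library(dplyr)

df1 <- df %>%
  arrange(Country) %>%
  group_by(Country) %>%
  mutate(Country_id = cur_group_id()) %>% # one integer id per Country group
  ungroup()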

computing onset date of snowmelt in R [closed]

I have daily temperature data (columns Date and temp) starting from 1950 to 2017.
I need to compute the snowmelt onset date, which is defined as the first day when daily temperature is above 0 C, following the last five-day period between March and May when the daily temperature is below 0 C. My code so far:
df1 <- read.csv("temp.csv")
require(dplyr)

# apply the condition to check each temperature value
df1$boolean <- ifelse(df1$temp < 0.0, 1, 0)

# compute the runs of subzero days with their start and end dates
snow <- df1 %>%
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
  filter(boolean == 1 & n() > 1) %>%
  summarize("Start Date" = min(as.character(date)),
            "End Date" = max(as.character(date)),
            "Length of Run" = n()) %>%
  ungroup() %>%
  select(-matches("group"))

colnames(snow)[3] <- 'length'

# subset runs with length >= 5
obs <- subset(snow, length >= 5)
The code above gives me a partial solution (with further manual editing I could get the ideal output matching my definition). I am only interested in one onset date for each year, and I need some guidance on how to edit this code to compute the onset date based on the definition above.
I have a number of locations, so manually editing the output would not be an ideal solution.
Your help would be appreciated.
We have assumed in (1) that the melt day must occur in Mar, Apr or May, and in (2) that only the 5 subzero days must occur in Mar, Apr or May while the melt day could occur later, in June say.
1) Define df2, which is df1 plus additional columns: month, year and code, where code is 0 if the date is not in Mar, Apr or May, and otherwise 1 if temp < 0 and 2 if temp >= 0.
Now, using df2, run rollapplyr on code, returning TRUE if the most recent 6 dates have codes 1, 1, 1, 1, 1, 2 and FALSE otherwise. Take the TRUE rows and keep only the last in each year. Right-join that to a data frame of all years in order to generate NAs in the output for any missing years.
library(zoo)

df2 <- df1 %>%
  mutate(Date = as.Date(Date),
         month = as.numeric(format(Date, "%m")),
         year = as.numeric(format(Date, "%Y")),
         code = (month %in% 3:5) * ((temp < 0) + 2 * (temp >= 0)),
         OK = rollapplyr(code, 6, identical, c(1, 1, 1, 1, 1, 2), fill = FALSE))

df2 %>%
  filter(OK) %>%
  filter(!duplicated(year, fromLast = TRUE)) %>%
  right_join(unique(df2["year"]), by = "year") %>%
  select(year, Date)
giving:
year Date
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 <NA>
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-05-13
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
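As an aside, a minimal sketch of what the rollapplyr test does, on a made-up code vector:
library(zoo)

code <- c(1, 1, 1, 1, 1, 2, 0, 1)
# TRUE only where the trailing window of 6 codes is exactly 1, 1, 1, 1, 1, 2
rollapplyr(code, 6, identical, c(1, 1, 1, 1, 1, 2), fill = FALSE)
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE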
2) In (1) we assumed that the melt onset day must be in Mar, Apr or May, but here we assume that only the subzero days lie in that range and the melt onset day may extend further out.
Calculations are the same as in (1) except for the codes: 1 now indicates a subzero temperature in Mar, Apr or May; 2 indicates an above-zero temperature at any time (not just in Mar, Apr and May); and 0 is anything else. We collapse the codes into a character string (one character per date) and use a regular expression on it to look for a substring of five ones followed by anything up to the next 2. We process the rest as in (1), except that now we don't need the join since there will always be a melt onset day; without the join we can express this as a single pipeline.
df1 %>%
  mutate(Date = as.Date(Date),
         month = as.numeric(format(Date, "%m")),
         year = as.numeric(format(Date, "%Y")),
         code = (month %in% 3:5) * (temp < 0) + 2 * (temp >= 0),
         OK = { g <- gregexpr("1{5}.*?2", paste(code, collapse = ""))[[1]]
                seq_along(code) %in% (g + attr(g, "match.length") - 1) }) %>%
  filter(OK) %>%
  filter(!duplicated(year, fromLast = TRUE)) %>%
  select(year, Date)
giving:
year Date
1 1950 1950-05-24
2 1951 1951-06-01
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 1959-06-02
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-06-01
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
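A toy illustration of the regex step, on a made-up code string:
codes <- "0111112001111102"
g <- gregexpr("1{5}.*?2", codes)[[1]]
g + attr(g, "match.length") - 1  # string positions of the first above-zero day after each 5-day subzero run
#> [1]  7 16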
A straightforward solution in tidyverse.
library(tidyverse)
library(lubridate)
readxl::read_excel("temp.xlsx") -> df1
df1 %>%
  mutate(year = year(Date),
         month = month(Date)) %>%
  group_by(year) %>%
  mutate(below_0 = as.numeric(temp < 0),
         streak5 = cumsum(below_0) - cumsum(lag(below_0, 5, 0)),
         onset = month %in% c(3, 4, 5) & lag(streak5) == 5 & below_0 == 0) %>%
  filter(onset) %>%
  summarise(Date = last(Date))
Gives
# A tibble: 20 x 2
year Date
<dbl> <dttm>
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1960 1960-05-26
11 1961 1961-05-16
12 1962 1962-05-19
13 1963 1963-05-13
14 1964 1964-05-27
15 1965 1965-05-20
16 1966 1966-05-26
17 1967 1967-05-26
18 1968 1968-05-27
19 1969 1969-05-30
20 1970 1970-05-21
I hope the code more or less explains itself: streak5 counts the below-zero days in the five-day window ending at the current day (so lag(streak5) == 5 means the five preceding days were all below zero), onset implements the criteria given in the question, and summarise picks the last such date in a given year.
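A minimal illustration of the streak5 trick on a made-up vector:
library(dplyr)

below_0 <- c(1, 1, 1, 1, 1, 0, 1)
# below-zero count in the 5-day window ending at each position
cumsum(below_0) - cumsum(lag(below_0, 5, 0))
#> [1] 1 2 3 4 5 4 4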
rle() to the rescue!
library(broom)
library(tidyverse)
temp <- read_csv("temp.csv")
It's best to read the pipeline below before reading this helper function.
For each year we:
take a run-length encoding of the above/below-0 indicator
the first run that's TRUE (temp < 0) and 5+ days long is our candidate
take the next index (the first day after that run)
if that index is past the end of the data (no day fits the criteria), return NA
else return that date
thus:
mk_runs <- function(xdf) {
  r <- rle(xdf$below_0)                      # take the TRUE/FALSE run-length encoding
  pos <- which(r$values & r$lengths >= 5)[1] # find the first run meeting the criteria
  idx <- sum(r$lengths[1:pos]) + 1           # sum the lengths up to this point and add 1 to get the first > 0 day
  if (idx > nrow(xdf)) {                     # if past our date range return NA
    data_frame(year = xdf$year[1], date = NA)
  } else {
    xdf[idx, c("year", "date")]
  }
}
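If rle() is new to you, here is what it returns for a small made-up logical vector:
rle(c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE))
#> Run Length Encoding
#>   lengths: int [1:3] 2 1 3
#>   values : logi [1:3] TRUE FALSE TRUE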
We need to get the data into shape:
separate(temp, Date, c("month", "day", "year")) %>%
  mutate_all(as.numeric) %>%
  mutate(year = ifelse(year >= 50, 1900 + year, 2000 + year)) %>%
  mutate(date = as.Date(sprintf("%04d-%02d-%02d", year, month, day))) %>%
  mutate(month = lubridate::month(date)) %>%
  mutate(below_0 = temp < 0) %>%
  filter(month >= 3 & month <= 5) %>%
  group_by(year) %>% # year groups
  arrange(date) %>%  # in order
  do(mk_runs(.)) %>% # see the helper function above
  print(n = 21)
## # A tibble: 21 x 2
## # Groups: year [21]
## year date
## <dbl> <date>
## 1 1950 1950-04-30
## 2 1951 1951-05-21
## 3 1952 1952-05-28
## 4 1953 1953-05-15
## 5 1954 1954-05-28
## 6 1955 1955-05-14
## 7 1956 1956-05-02
## 8 1957 1957-05-07
## 9 1958 1958-04-27
## 10 1959 NA
## 11 1960 1960-04-24
## 12 1961 1961-05-16
## 13 1962 1962-05-19
## 14 1963 1963-05-13
## 15 1964 1964-05-20
## 16 1965 1965-05-20
## 17 1966 1966-05-07
## 18 1967 1967-04-27
## 19 1968 1968-05-10
## 20 1969 1969-05-22
## 21 1970 1970-05-21
Here is another attempt. In my first step, I create two new columns (year and month), then filter the data to the period between March and May. Then I create index numbers for rows that have temperature above 0 Celsius; this is done per year. Since you need five consecutive subzero days before a day above zero, index numbers less than or equal to 5 need to be ignored. This is done by the if_else() in the true condition of the outer if_else().
In my second step, I use a package called SOfun, developed by the author of splitstackshape; you can download it from GitHub. What getMyRows() does is: 1) identify which rows should be considered by specifying pattern, 2) get a certain range of rows around each marked row from 1), and 3) create a list. Here range = -5:0 means that I am choosing the five previous rows of a target row, plus the target row itself.
In my third step, I subset mylist with two logical conditions: !is.na(x$ind[6]) checks that the 6th element of ind is not NA, and all(x$temp[1:5] < 0) checks that the 1st-5th elements of temp (temperature) are all below zero. Filter() chooses the list elements that satisfy both conditions. Then I extract the 6th row from each data frame, since that is the target row. Finally, I bind the list, group the data by year, and choose the first observation for each year using slice().
library(devtools)
install_github("mrdwab/overflow-mrdwab")
install_github("mrdwab/SOfun")

library(overflow)
library(SOfun)
library(readxl)
library(dplyr)

# Part 1
mydf <- read_excel("temp.xlsx") %>%
  mutate(year = as.numeric(format(Date, "%Y")),
         month = as.numeric(format(Date, "%m"))) %>%
  filter(between(month, 3, 5)) %>%
  group_by(year) %>%
  mutate(ind = if_else(temp > 0,
                       {ind <- row_number()
                        if_else(ind <= 5, NA_integer_, ind)},
                       NA_integer_)) %>%
  ungroup

# Part 2
mylist <- getMyRows(mydf,
                    pattern = which(complete.cases(mydf$ind)),
                    range = -5:0, isNumeric = TRUE)

# Part 3
Filter(function(x) !is.na(x$ind[6]) & all(x$temp[1:5] < 0), mylist) %>%
  lapply(function(x) x[6, ]) %>%
  bind_rows %>%
  group_by(year) %>%
  slice(1) %>%
  select(Date)
year Date
<dbl> <dttm>
1 1950 1950-04-30 00:00:00
2 1951 1951-05-21 00:00:00
3 1952 1952-05-28 00:00:00
4 1953 1953-05-15 00:00:00
5 1954 1954-05-28 00:00:00
6 1955 1955-05-14 00:00:00
7 1956 1956-05-02 00:00:00
8 1957 1957-05-07 00:00:00
9 1958 1958-04-27 00:00:00
10 1960 1960-04-24 00:00:00
11 1961 1961-05-16 00:00:00
12 1962 1962-05-19 00:00:00
13 1963 1963-05-13 00:00:00
14 1964 1964-05-20 00:00:00
15 1965 1965-05-20 00:00:00
16 1966 1966-05-07 00:00:00
17 1967 1967-04-27 00:00:00
18 1968 1968-05-10 00:00:00
19 1969 1969-05-22 00:00:00
20 1970 1970-05-21 00:00:00

return final row of dataframe - recurring variable names

I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseason column is designated as a factor with 4 levels, whereas this is not the case in my original dataframe.
In your ddply code, you only forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
  group_by(year, allseason) %>%
  do(tail(., 1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
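One caveat: the ave approach hardcodes 3 because each (year, allseason) group here has exactly three rows. A more general base R sketch keeps the last row per group regardless of group size:
# keep the last row of each (year, allseason) group
df[!duplicated(df[c("year", "allseason")], fromLast = TRUE), ]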
And just to add to @beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452
