Looping through two dataframes and adding columns inside of the loop - r

I have a problem when specifying a loop with a data frame.
The general idea I have is the following:
I have an area which contains a certain number of raster quadrants. These raster quadrants have been visited irregularly over several years (e.g. from 1950 to 2015).
I have two data frames:
1) a data frame containing the IDs of the rasterquadrants (and one column for the year of first visit of this quadrant):
df1<- as.data.frame(cbind(c("12345","12346","12347","12348"),rep(NA,4)))
df1[,1]<- as.character(df1[,1])
df1[,2]<- as.numeric(df1[,2])
names(df1)<-c("Raster_Q","First_visit")
2) a data frame that contains the information on the visits; it is ordered first by raster quadrant and then by year. It records which raster quadrant was visited and when.
df2<- as.data.frame(cbind(c(rep("12345",5),rep("12346",7),rep("12347",3),rep(12348,9)),
c(1950,1952,1955,1967,1951,1968,1970,
1998,2001,2014,2015,2017,1965,1986,2000,1952,1955,1957,1965,2003,2014,2015,2016,2017)))
df2[,1]<- as.character(df2[,1])
df2[,2]<- as.numeric(as.character(df2[,2]))
names(df2)<-c("Raster_Q","Year")
I want to know when and how often the full area was 'sampled'.
[Figure: scheme of what I want to do; different colors indicate different areas/regions]
My rationale:
I sorted the complete data in df2 according to Quadrant and Year. I then match the rasterquadrant in df1 with the name of the rasterquadrant in df2 and the first value of year from df2 is added.
For this I wrote a loop (see below)
In order not to replicate a quadrant I created a vector "visited"
visited<-c()
Every entry of df2 that matches df1 will be written into this vector, so that the second entry of e.g. rasterquadrant "12345" in df2 is ignored in the loop.
Here comes the loop:
visited<- c()
for (i in 1:nrow(df2)){
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"First_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This gives me the first full sampling period.
Raster_Q First_visit
1 12345 1950
2 12346 1968
3 12347 1965
4 12348 1952
However, I want to have all full sampling periods.
So I do:
df1$"Second_visit"<-NA
I reset the visited vector and specify the following loop:
visited <- c()
for (i in 1:nrow(df2)){
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"Second_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
}
This is basically the same loop as before, except that any visit in df2 whose year is not later than the latest first visit is skipped.
That gives me the second full sampling period:
Raster_Q First_visit Second_visit
1 12345 1950 NA
2 12346 1968 1970
3 12347 1965 1986
4 12348 1952 2003
Okay, so far so good. I could do that all by hand. But I have loads and loads of rasterquadrants and several areas that can and should be screened in this way.
So doing all of this in a single loop would be really great! However, I realized that this creates a problem because the loop becomes self-referential:
The added column is not picked up in the subsequent iterations, because df1 itself is not re-read on each pass; in consequence, the new column for the new sampling period is not included in the following iterations:
visited<- c()
for (i in 1:nrow(df2)){
m<-ncol(df1)
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
#finish "first_visit"
df1[,m+1]<-NA
# add column for "second visit"
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
# make sure that the first visit year are not included
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m+1]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This won't work. Another issue is that the vector visited is not emptied during this loop, so basically every Raster_Q already counts as visited in the second sampling period.
I am stuck.... any ideas?

You can do this without a for loop by using the dplyr and tidyr packages. First, you take your df2 and use dplyr::arrange to order by raster and year. Then you can rank the years visited using the rank function inside of the dplyr::mutate function. Then using tidyr::spread you can put them all in their own columns. Here is the code:
library(dplyr)
df <- df2 %>%
arrange(Raster_Q, Year) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year),
visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
Here is the output:
> df
# A tibble: 4 x 10
# Groups: Raster_Q [4]
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7 visit_8 visit_9
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 1951 1952 1955 1967 NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017 NA NA
3 12347 1965 1986 2000 NA NA NA NA NA NA
4 12348 1952 1955 1957 1965 2003 2014 2015 2016 2017
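In current tidyr releases spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping with it, assuming a small made-up table (the name visits and its values are illustrative, not the data from the question):

```r
library(dplyr)
library(tidyr)

# Illustrative toy data (hypothetical, deliberately unsorted)
visits <- data.frame(
  Raster_Q = c("A", "A", "B"),
  Year     = c(1952, 1950, 1968),
  stringsAsFactors = FALSE
)

wide <- visits %>%
  arrange(Raster_Q, Year) %>%
  group_by(Raster_Q) %>%
  # number the visits within each quadrant in chronological order
  mutate(visit = paste0("visit_", row_number())) %>%
  ungroup() %>%
  # one column per visit number, padded with NA where a quadrant has fewer visits
  pivot_wider(names_from = visit, values_from = Year)

wide
```

row_number() after arrange() plays the role of rank(Year, ties.method = "first") here, so repeated years in one quadrant still get distinct column names.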
EDIT: So I think I understand your problem a little better now. You are looking to remove all duplicate visits to each quadrant that happened before the maximum Year of each respective "round" of visits. So to accomplish this, I wrote a short function that in essence does what the code above does, but with a slight change. Here is the function:
filter_by_round <- function(data, round) {
output <- data %>%
arrange(Raster_Q, Year) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(in_round = ifelse(Year <= max(.$Year[.$visit == round]) & visit > round,
TRUE, FALSE)) %>%
filter(!in_round) %>%
select(-c(in_round, visit))
return(output)
}
This function looks through the data and removes every visit that ranks after the specified "visit round" within its quadrant but whose year is not later than the max year of that round. To apply this only to the first round, you would do this:
df2 %>%
filter_by_round(1) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
which would give you this:
# A tibble: 4 x 8
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017
3 12347 1965 1986 2000 NA NA NA NA
4 12348 1952 2003 2014 2015 2016 2017 NA
However, while it does accomplish what your for loop would have, you now have other occurrences of the same problem. I have come up with a way to do this successfully but it requires you to know how many "visit rounds" you had or some trial and error. To accomplish this, you can use map and assign the change to a global variable.
# I do this so we do not lose the original dataset
df <- df2
# I chose 1:5 after some trial and error showed there are 5 unique
# "visit rounds" in your toy dataset
# However, if you overshoot your number, it should still work;
# you will just get warnings from `max` about an empty argument.
# This may cause issues, though, so figuring out your exact number is
# recommended
purrr::map(1:5, function(x){
# this assigns the output of each iteration to the global variable df
df <<- df %>%
filter_by_round(x)
})
# now applying the original transformation to get the spread dataset
df %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
This will give you the following output:
# A tibble: 4 x 6
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA
2 12346 1968 1970 2014 2015 2017
3 12347 1965 1986 NA NA NA
4 12348 1952 2003 2014 2015 2016
Granted, this is probably not the most elegant solution, but it works. Hopefully this solves the problem for you.
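The number of rounds does not have to be guessed. Here is a sketch in base R, assuming the rule from the question (a new round only counts visits later than the latest year of the previous round). The helper name sampling_rounds is hypothetical, and df2 is rebuilt from the question's toy data so the block runs on its own:

```r
# Toy data from the question
df2 <- data.frame(
  Raster_Q = c(rep("12345", 5), rep("12346", 7), rep("12347", 3), rep("12348", 9)),
  Year = c(1950, 1952, 1955, 1967, 1951, 1968, 1970, 1998, 2001, 2014, 2015, 2017,
           1965, 1986, 2000, 1952, 1955, 1957, 1965, 2003, 2014, 2015, 2016, 2017),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one column per sampling round, built until no visits remain
sampling_rounds <- function(visits) {
  quadrants <- sort(unique(visits$Raster_Q))
  result <- data.frame(Raster_Q = quadrants, stringsAsFactors = FALSE)
  cutoff <- -Inf      # latest year of the previous round
  round_no <- 0
  repeat {
    # earliest visit of each quadrant after the cutoff (NA if there is none)
    next_year <- vapply(quadrants, function(q) {
      yrs <- visits$Year[visits$Raster_Q == q & visits$Year > cutoff]
      if (length(yrs) == 0) NA_real_ else min(yrs)
    }, numeric(1))
    if (all(is.na(next_year))) break   # nothing left to assign
    round_no <- round_no + 1
    result[[paste0("visit_", round_no)]] <- unname(next_year)
    cutoff <- max(next_year, na.rm = TRUE)
  }
  result
}

rounds <- sampling_rounds(df2)
rounds
```

Because the cutoff strictly increases in every pass, the loop stops on its own once no later visits exist. For the toy data it produces five rounds.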

Related

Applying own function

I am trying to implement my own function. The function works with three arguments that need to be changed for each subsequent column.
# Data
library(dplyr)
df<-data.frame(
Year=c("2000","2001","2002","2003","2004","2005","2006","2007","2008","2009"),
Sales=c(100,200,300,400,500,600,100,300,200,200),
# Store,Mall and Grocery
Store=c(100,400,300,800,900,400,800,400,300,100),
Mall=c(100,600,300,200,200,300,200,500,200,400),
Grocery=c(100,600,300,200,200,300,200,500,200,400),
# Building + Store,Mall and Grocery
Building_Store=c(100,200,300,400,500,600,100,300,200,400),
Building_Mall=c(100,400,300,800,900,400,800,400,300,600),
Building_Grocery=c(100,600,300,200,200,300,200,500,200,400))
# Own function
my_function <- function(x,y,z){((x-(y*lag(z))))}
I applied this function with dplyr, using the code below:
estimation<-mutate(df,
df_Store=my_function(Store,Sales,Building_Store),
df_Mall=my_function(Mall,Sales,Building_Mall),
df_Grocery=my_function(Grocery,Sales,Building_Grocery))
In this way, I applied the function by changing its arguments manually. The results are shown below.
In practice, however, I have a huge data set with dozens of such columns, and it is not feasible to enter them all manually.
Can someone help me apply the map function to get the results shown in the above table automatically?
You can try this. The group_by(name) step matters: in the long format the rows for Store, Mall and Grocery alternate, so lag() has to run within each name to pick up the previous year's Building value rather than the previous row's.
library(dplyr)
library(tidyr)
rename(df,
Value_Store=Store,
Value_Mall=Mall,
Value_Grocery=Grocery) %>%
pivot_longer(-c(Year, Sales), names_to=c(".value", "name"), names_sep="_") %>%
group_by(name) %>%
mutate(df=my_function(Value, Sales, Building)) %>%
ungroup() %>%
pivot_wider(values_from=c(Value, Building, df)) %>%
select(Year, Sales, starts_with('df'))
# A tibble: 10 × 5
Year Sales df_Store df_Mall df_Grocery
<chr> <dbl> <dbl> <dbl> <dbl>
1 2000 100 NA NA NA
2 2001 200 -19600 -19400 -19400
3 2002 300 -59700 -119700 -179700
4 2003 400 -119200 -119800 -119800
5 2004 500 -199100 -399800 -99800
6 2005 600 -299600 -539700 -119700
7 2006 100 -59200 -39800 -29800
8 2007 300 -29600 -239500 -59500
9 2008 200 -59700 -79800 -99800
10 2009 200 -39900 -59600 -39600
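If the reshaping is not needed, the same columns can also be built directly in wide format. This is a sketch that assumes every target column X has a companion column named Building_X, as in the question's data; the names targets and new_cols are introduced here for illustration:

```r
library(dplyr)

df <- data.frame(
  Year = as.character(2000:2009),
  Sales = c(100, 200, 300, 400, 500, 600, 100, 300, 200, 200),
  Store = c(100, 400, 300, 800, 900, 400, 800, 400, 300, 100),
  Mall = c(100, 600, 300, 200, 200, 300, 200, 500, 200, 400),
  Grocery = c(100, 600, 300, 200, 200, 300, 200, 500, 200, 400),
  Building_Store = c(100, 200, 300, 400, 500, 600, 100, 300, 200, 400),
  Building_Mall = c(100, 400, 300, 800, 900, 400, 800, 400, 300, 600),
  Building_Grocery = c(100, 600, 300, 200, 200, 300, 200, 500, 200, 400)
)

my_function <- function(x, y, z) x - y * dplyr::lag(z)

# derive the target names from the Building_* columns (assumed naming convention)
targets <- sub("^Building_", "", grep("^Building_", names(df), value = TRUE))

# apply my_function to each (X, Sales, Building_X) triple
new_cols <- lapply(targets, function(nm) {
  my_function(df[[nm]], df$Sales, df[[paste0("Building_", nm)]])
})
names(new_cols) <- paste0("df_", targets)

estimation <- cbind(df, as.data.frame(new_cols))
```

Since each call works on whole columns, lag() shifts by one year within a column, which is what the manual mutate() call in the question does.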

How to replace NA values with average of precedent and following values, in R

I currently have a dataset that has more or less the following characteristics:
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
What I would like to do is find a command/write a function that fills only the NAs for each country as follows:
1. if the NA Observation is for the first year (2010), fill it with the next non-NA Observation;
2. if the NA Observation is for the last year (2016), fill it with the previous available period's Observation.
3.1 if the NA Observation is for a year between the first and last, fill it with the average of the 2 closest periods.
3.2 However, if there are 2 or more consecutive NAs (let's take 2 as an example), first fill the first with the preceding Observation and the second with the same method as (3.1)
As an illustration, the previous dataset should finally be:
Observation2 <- c(2, 5, 5, 3.5 ,2,3,2, 2,2,3,1,1)
df2 <- data.frame(Country, Year, Observation2)
I hope I was sufficiently clear. It is very specific but I hope someone can help.
Feel free to ask any questions about it if you do not understand.
Input. There is some question of whether alternation of country names as mentioned in the comments under the question and shown in the Note at the end was intended but at any rate assume that each subsequence of increasing years is a separate group and group by them, grp. (If it was intended that the first 6 entries in Country be Honduras the last 6 be Belize then we could replace the group_by(...) with group_by(Country) in the code below.)
Clarification of Question. We assume that the question is asking that within group:
Leading NAs are to be replaced with the first non-NA.
Trailing NAs are to be replaced with the last non-NA.
If there is one consecutive NA surrounded by non-NAs it is replaced by the prior non-NA.
If there are two consecutive NA's then the first is replaced with the prior non-NA and the second is filled in with the average of the prior non-NA and next non-NA.
The question does not address the situation of 3+ consecutive NAs so maybe this never occurs but just in case it does what the code should do is fill in the first NA with the prior non-NA and the remainder should be filled in using linear interpolation.
Code. Now for each group, replace any NA with the prior value. Then use linear interpolation on what is left via na.approx using rule=2 to extend the ends. Finally only keep desired columns.
dplyr clashes. Note that lag and filter in dplyr collide in an incompatible way with the functions of the same name in base R so we exclude them and use dplyr:: prefix if we want to access them.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
df2 <- df %>%
# group_by(Country) %>%
group_by(grp = cumsum(c(TRUE, diff(Year) < 0))) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
identical(df2$Observation2, Observation2)
## [1] TRUE
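To see the two steps in isolation, here is a sketch on plain vectors. The helper name fill_gaps is hypothetical, and the two inputs are the six-value halves of the question's Observation vector:

```r
library(zoo)

# Hypothetical helper showing the two steps on a single group
fill_gaps <- function(x) {
  # step 1: an NA takes the value directly before it, if that value is known
  prior <- c(NA, head(x, -1))
  step1 <- ifelse(is.na(x), prior, x)
  # step 2: linearly interpolate what is still NA; rule = 2 extends the ends
  na.approx(step1, rule = 2)
}

first_half  <- fill_gaps(c(2, 5, NA, NA, 2, 3))
second_half <- fill_gaps(c(NA, NA, 2, 3, 1, NA))
```

The base-R shift in step 1 stands in for coalesce(Observation, dplyr::lag(Observation)); it is used here only so the sketch needs no package besides zoo.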
Note
We used this input taken from the question.
Country <- rep(c("Honduras", "Belize"),6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
df
giving:
Country Year Observation
1 Honduras 2010 2
2 Belize 2011 5
3 Honduras 2012 NA
4 Belize 2014 NA
5 Honduras 2015 2
6 Belize 2016 3
7 Honduras 2010 NA
8 Belize 2011 NA
9 Honduras 2012 2
10 Belize 2014 3
11 Honduras 2015 1
12 Belize 2016 NA
Added
In a comment the poster added another example. We run it here. This is the same code incorporating the simplification to group_by discussed in the first paragraph above. (That does not change the result.)
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5, NA, NA,2,3, NA, NA,2, NA,1,NA)
df <- data.frame(Country, Year, Observation)
df2 <- df %>%
group_by(Country) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
df2
giving:
# A tibble: 12 x 3
Country Year Observation2
<chr> <dbl> <dbl>
1 Honduras 2010 2
2 Honduras 2011 5
3 Honduras 2012 5
4 Honduras 2014 3.5
5 Honduras 2015 2
6 Honduras 2016 3
7 Belize 2010 2
8 Belize 2011 2
9 Belize 2012 2
10 Belize 2014 2
11 Belize 2015 1
12 Belize 2016 1

Replacing items in a list with items from another list in R

I have a column in a list with country codes as character strings, and I want to replace these with numeric codes. For the "decoding" I have a second list where the character country codes are associated with the numeric codes.
I tried gsub:
for (i in 1:nrow(countries))
{gsub(countries$code3[i], countries$numcode[i], doc_report$nationality)}
I tried a for loop:
i <- NULL
n <- NULL
for (i in 1:nrow(doc_report)) {
for (n in 1:nrow(countries)) {
if(doc_report$nationality[i] == countries$code3[n])
doc_report$nationality[i] <- countries$numcode[n]
else
if(doc_report$nationality[i] == "NA")
doc_report$nationality[i] <- 000
}
}
I also had something in mind with merge().
This is what the column that has to be replaced looks like:
[nationality] IRL GBR ITA FRA POL BRA ESP GBR GBR GBR
This is what the second table for decoding looks like:
[code3] AFG ALB DZA ASM AGO AIA ATG ARG ARM
[numcode] 4 8 12 16 24 660 NA 28 32 51
So in table one I want the numcode from table 2 rather than the code3 style.
Updated Answer
Here's an example with data formatted like yours to make it clearer that it does work despite duplicate country codes.
library(tidyverse)
country <- c("IRL", "GBR", "ITA", "FRA", "POL", "BRA", "ESP")
codes <- c(1,2,3,4,5,6,7)
countries <- tibble(country, codes)
doc_report <- tibble(x=c("a","b","c","d","e"),
country = c("ITA","ITA", "POL", "BRA","ESP"))
left_join(doc_report, countries, by="country")
The output of this code is:
# A tibble: 5 x 3
x country codes
<chr> <chr> <dbl>
1 a ITA 3
2 b ITA 3
3 c POL 5
4 d BRA 6
5 e ESP 7
Which I believe is the behavior you're looking for.
Original Answer
A simple solution would be to use the left_join() function in the dplyr package and then select() to remove the unneeded column.
Let's say doc_report keys countries by a column named code3, and country_codes is a tibble with one column of country string codes (also named code3) and one column of the corresponding numeric codes. You could then do something like this:
## join the country codes
doc_report <- left_join(doc_report, country_codes, by="code3")
## remove the unneeded column
doc_report <- select(doc_report, -code3)
Does this make sense? Happy to expand otherwise.
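The same lookup can be sketched in base R with match(), assuming the column names from the question (nationality, code3, numcode). The numeric codes below are placeholder values for illustration only, and nationality_num is a hypothetical name for the new column:

```r
countries <- data.frame(
  code3   = c("IRL", "GBR", "ITA", "FRA"),
  numcode = c(101, 102, 103, 104),   # placeholder codes, not real ones
  stringsAsFactors = FALSE
)
doc_report <- data.frame(
  nationality = c("IRL", "GBR", "ITA", "GBR", "XXX"),
  stringsAsFactors = FALSE
)

# position of each nationality in the lookup table (NA when it is not found)
idx <- match(doc_report$nationality, countries$code3)

# pull the numeric code at those positions; repeated nationalities are fine
doc_report$nationality_num <- countries$numcode[idx]
```

Codes without an entry in the lookup table come back as NA, which makes unmatched nationalities easy to spot afterwards.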

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(nrow(full_sites_wells_months_set)*data_availability)) # generate random groundwater values
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
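For data shaped like the table in the question (Well, year, month, value), an alternative sketch is to pair each month in the first half of the year with the month six months later through a join; a missing month then simply produces no pair. The names gw, early, late and half_year_diff are illustrative, and the values are the ten rows shown in the question:

```r
library(dplyr)

gw <- data.frame(
  Well  = 222,
  year  = 1995,
  month = factor(month.name[2:11], levels = month.name),
  value = c(8.53, 8.69, 8.92, 9.59, 9.59, 9.70, 9.66, 9.46, 9.49, 9.31)
)

# factor levels are in calendar order, so the integer code is the month number
gw <- gw %>% mutate(month_no = as.integer(month))

early <- gw %>% filter(month_no <= 6)
# shift July..December onto 1..6 so they line up with January..June
late  <- gw %>% filter(month_no > 6) %>% mutate(month_no = month_no - 6L)

half_year_diff <- inner_join(early, late,
                             by = c("Well", "year", "month_no"),
                             suffix = c("_early", "_late")) %>%
  mutate(diff = value_early - value_late) %>%
  select(Well, year, month_early, month_late, diff)

half_year_diff
```

With these ten rows, January and December are absent, so only the February-August through May-November pairs appear. The sign convention (earlier month minus later month) follows the February - August attempt in the question.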

Use dplyr to compute lagging difference

My data frame consists of three columns: state name, year, and the tax receipt for each year and each state. Below is an example for just one state.
year RealTaxRevs
1 1971 8335046
2 1972 9624026
3 1973 10498935
4 1974 10052305
5 1975 8708381
6 1976 8911262
7 1977 10759032
I'd like to compute the change in tax receipt from one year to the next, for each state. I used the following code:
data %>% group_by(state) %>% summarise(diff(RealTaxRevs, lag = 1, differences = 1))
but it gives me "Error: expecting a single value".
Could anyone explain this error message, and help me do this correctly using dplyr? Thank you.
If you want to use a diff-like function, consider using the zoo library as well. Then you can have code which looks like the following:
library(zoo)
diff(as.zoo(1:4), na.pad=T)
In a data frame setting it would be like:
library(dplyr)
dat <- data.frame(a=c(8335046, 9624026, 10498935, 10052305, 8708381, 8911262, 10759032))
dat %>% mutate(b=diff(as.zoo(a), na.pad=T))
# a b
# 1 8335046 NA
# 2 9624026 1288980
# 3 10498935 874909
# 4 10052305 -446630
# 5 8708381 -1343924
# 6 8911262 202881
# 7 10759032 1847770
This way you can easily increase the number of lags, without continually adding NA:
dat %>% mutate(b2=diff(as.zoo(a), lag=2, na.pad=T))
# a b2
# 1 8335046 NA
# 2 9624026 NA
# 3 10498935 2163889
# 4 10052305 428279
# 5 8708381 -1790554
# 6 8911262 -1141043
# 7 10759032 2050651
We can also use data.table; shift() gives the lagged values within each state:
library(data.table)
setDT(data)[, Diffs := RealTaxRevs - shift(RealTaxRevs), by = state]
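The error in the question arises because summarise() in that dplyr version expected exactly one value per group, whereas diff() returns one value fewer than the group has rows. A minimal dplyr sketch with mutate() and lag() instead; the state labels "AA" and "BB" and the second state's figures are made up for illustration:

```r
library(dplyr)

data <- data.frame(
  state = rep(c("AA", "BB"), each = 3),          # hypothetical state labels
  year = rep(1971:1973, times = 2),
  RealTaxRevs = c(8335046, 9624026, 10498935,    # values from the question
                  100, 150, 130)                 # made-up values
)

changes <- data %>%
  group_by(state) %>%
  arrange(year, .by_group = TRUE) %>%
  # lag() returns the previous year's value within the state, NA for the first year
  mutate(change = RealTaxRevs - lag(RealTaxRevs)) %>%
  ungroup()

changes
```

mutate() keeps one row per input row, so the first year of each state gets NA instead of being dropped.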
