R - read Excel file and switch variables into observations

I am struggling with some data transformation. I have some huge xlsx files with insurance data. The data is structured a little bit like a "pyramid". The first line represents the quarter in which the survey took place. The next line is a breakdown by age categories; there are 4 categories: in total, up to 17, 18-64, and 65+. One sheet contains 4 quarters, so basically 48 unique variables plus the column with country names. One Excel file contains 3 sheets (2016, 2017 and 2018). The screenshot (INPUT DATA) comes from an Excel file named "sick-leave blue collar workers". I also have two other files: "sick-leave worker" and "sick-leave self-employment". The goal is to combine all three files and create a file with a structure like the RESULT DATA. Could you please help me?
INPUT DATA:
RESULT DATA:

Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.
# download Excel file from github repository
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
typeOfLeave <- "sick"
group <- "self employed"
# read the date from cell A2 and extract it as a single value
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
# setup column names using underscore so we can separate key column into Sex and Age columns
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
tidyData
...and the output:
> tidyData
# A tibble: 30 x 7
Country Sex Age Amount typeOfLeave group date
<chr> <chr> <chr> <dbl> <chr> <chr> <dttm>
1 Total both all 151708 sick self employed 2016-03-31 00:00:00
2 Afganistan both all 269 sick self employed 2016-03-31 00:00:00
3 Albania both all 129 sick self employed 2016-03-31 00:00:00
4 Algeria both all 308 sick self employed 2016-03-31 00:00:00
5 Andora both all 815 sick self employed 2016-03-31 00:00:00
6 Total women all 49919 sick self employed 2016-03-31 00:00:00
7 Afganistan women all 104 sick self employed 2016-03-31 00:00:00
8 Albania women all 30 sick self employed 2016-03-31 00:00:00
9 Algeria women all 18 sick self employed 2016-03-31 00:00:00
10 Andora women all 197 sick self employed 2016-03-31 00:00:00
# ... with 20 more rows
Key elements in the solution
Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.
In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:
Reading Excel files via exact cell references with readxl::read_excel()
Reading a single cell into a constant
Setting column names for ease of use with tidyr::separate()
Restructuring to narrow format Tidy Data
Assigning constants
1. Reading exact cell references
The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.
readxl::read_excel() enables reading of exact cell references with the range= argument.
2. Reading a constant from one cell
If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script. Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
3. Setting column names for ease of use with tidyr::separate()
One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.
The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.
To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
4. Restructuring the data to narrow format Tidy Data
The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
Data left of the underscore is assigned to the Sex column, and data right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used both in its place; similarly, I used all in place of total for Age.
5. Assigning constants
The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.
Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
Additional considerations
If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().
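A minimal sketch of both approaches, using the tidyData and theData objects created above (my illustration, not part of the original script):
# option 1: drop the total rows/categories from the tidy data with the extract operator
noTotals <- tidyData[tidyData$Country != "Total" &
                     tidyData$Sex != "both" &
                     tidyData$Age != "all", ]
# option 2: drop the total columns from the messy data frame before gather()
keepCols <- !(grepl("^both_", names(theData)) | grepl("_all$", names(theData)))
keepCols[1] <- TRUE  # always keep the Country column
noTotalCols <- theData[, keepCols]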
Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.
The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.
UPDATE: reading multiple quarters from a specific worksheet
On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.
Specify a worksheet with the sheet= parameter
Append a quarter suffix (_Q1, _Q2, ...) to each column name so the quarter can later be separated out as a key variable
Set worksheet names to years
The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so that worksheets representing different years contain different values, making the results distinguishable.
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2018"
# set up column names with an underscore separator so we can split the key column into Sex, Age, and Quarter columns
# rep() builds the repeating patterns so we avoid typing all 48 column names by hand
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output, reading from the 2018 sheet in the workbook.
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 2100 sick self employed 2018
2 Afganistan both all Q1 2100 sick self employed 2018
3 Albania both all Q1 2100 sick self employed 2018
4 Algeria both all Q1 2100 sick self employed 2018
5 Andora both all Q1 2100 sick self employed 2018
6 Total women all Q1 900 sick self employed 2018
7 Afganistan women all Q1 900 sick self employed 2018
8 Albania women all Q1 900 sick self employed 2018
9 Algeria women all Q1 900 sick self employed 2018
10 Andora women all Q1 900 sick self employed 2018
# ... with 230 more rows
>
If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.
# read second worksheet to illustrate multiple reads
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2017"
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output:
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
7 Afganistan women all Q1 15000 sick self employed 2017
8 Albania women all Q1 15000 sick self employed 2017
9 Algeria women all Q1 15000 sick self employed 2017
10 Andora women all Q1 15000 sick self employed 2017
# ... with 230 more rows
>
Pulling it all together...
At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and wrap it in lapply(), we can start with a vector of worksheet names, read each worksheet, convert it to tidy format, and combine the results into a single tidy data set with do.call() and rbind().
## version that combines multiple years into a single narrow format tidy data file
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
years <- c("2017","2018")
typeOfLeave <- "sick"
group <- "self employed"
# set up column names with an underscore separator so we can split the key column into Sex, Age, and Quarter columns
# rep() builds the repeating patterns so we avoid typing all 48 column names by hand
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
lapply(years,function(x){
theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- x
tidyData
}) %>% do.call(rbind,.) -> combinedData
...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.
> head(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
> tail(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Andora women 65 and over Q4 2300 sick self employed 2018
2 Total men 65 and over Q4 2400 sick self employed 2018
3 Afganistan men 65 and over Q4 2400 sick self employed 2018
4 Albania men 65 and over Q4 2400 sick self employed 2018
5 Algeria men 65 and over Q4 2400 sick self employed 2018
6 Andora men 65 and over Q4 2400 sick self employed 2018
>
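The same pattern extends to the OP's three workbooks ("sick-leave blue collar workers", "sick-leave worker", "sick-leave self-employment"). A hedged sketch follows, assuming each workbook uses the same layout and sheet names as the sample file; the file paths and group labels are placeholders I have not tested against the real files, and the range= argument would need to be widened to cover the real number of country rows:
# placeholder file paths mapped to a group label for each workbook
files <- c("sick-leave blue collar workers.xlsx" = "blue collar workers",
           "sick-leave worker.xlsx"              = "worker",
           "sick-leave self-employment.xlsx"     = "self-employment")
years <- c("2016", "2017", "2018")
lapply(names(files), function(f) {
    lapply(years, function(y) {
        read_excel(f, sheet = y, range = "A5:AW9", col_names = theCols) %>%
            gather(., key = "key", value = "Amount", 2:49) %>%
            separate(., key, into = c("Sex", "Age", "Quarter"), sep = "_") -> tidyData
        tidyData$typeOfLeave <- "sick"
        tidyData$group <- files[[f]]
        tidyData$year <- y
        tidyData
    }) %>% do.call(rbind, .)
}) %>% do.call(rbind, .) -> combinedAllFiles
The resulting combinedAllFiles object should approximate the long structure shown in the RESULT DATA screenshot, with one row per country, sex, age, quarter, year, and worker group.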

Related

In r, how do I add rows together to get totals for a specific set of variables [duplicate]

This question already has answers here: How to sum a variable by group (18 answers). Closed 1 year ago.
My goal is to have a list of how much FDI China sent to each country per year. At the moment I have a list of individual projects that looks like this
Year  Country  Amount
2001  Angola   6000000
2001  Angola   8000000
2001  Angola   5.0E7
I want to sum it so it looks like this.
Year  Country  Amount
2001  Angola   6.4E7
How do I merge the rows and add the totals to get nice country-year data? I can't find an R command that does this precise thing.
library(tidyverse)
I copied the data table and read your dataframe into R using:
df <- clipr::read_clip_tbl(clipr::read_clip())
I like using dplyr to solve this question:
df2 <- as.data.frame(df %>% group_by(Country,Year) %>% summarize(Amount=sum(Amount)))
# A tibble: 1 x 3
# Groups: Country [1]
Country Year Amount
<chr> <int> <dbl>
1 Angola 2001 64000000
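For a fully reproducible version that does not depend on the clipboard, the same grouped sum can be run on a data frame built from the values shown in the question (a minimal sketch):
library(dplyr)
df <- data.frame(Year = c(2001, 2001, 2001),
                 Country = c("Angola", "Angola", "Angola"),
                 Amount = c(6000000, 8000000, 5.0E7))
df %>%
  group_by(Country, Year) %>%
  summarize(Amount = sum(Amount))
# Country  Year   Amount
# Angola   2001 64000000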

How to transpose panel data into correct form in R

So I am struggling to transform my data into a panel data form so that I can start analysing it. So far I have imported and merged my excel files so my data looks something like this (bear in mind the real data has far more rows and far more variables)
Company Name Date Market Share ...5.x ...6.x ...7.x ...8.x
<chr> <dttm> <chr> <chr> <chr> <chr> <chr>
1 NA NA FY0 FY-1 FY-2 FY-3 FY-4
2 Kimball Elect 2020-06-29 23:00:00 4020 4422 4232 4111 4003
3 Mercadolibre 2019-12-31 00:00:00 8357 2843 2653 2222 2134
4 Lazard Ltd 2019-12-31 00:00:00 47700 45061 45050 43280 42281
As you can see, row 1 exists to specify the time lags for the market share variable: FY0 corresponds to the date in the Date column, FY-1 is the year before that, FY-2 two years before, etc. In the original Excel files the market share column had a multi-level header, so all the lags were associated with the Market Share column. When importing to R, however, only FY0 remained associated with the Market Share column and the other columns were auto-filled with '...5.x ...6.x ...7.x ...8.x'.
I essentially want to transform my data to look like this:
Company Name Date Market Share
1 Kimball Elect 2020 4020
2 Kimball Elect 2019 4422
3 Kimball Elect 2018 4232
4 Kimball Elect 2017 4111
5 Kimball Elect 2016 4003
6 Mercadolibre 2019 8357
7 Mercadolibre 2018 2843
8 Mercadolibre 2017 2653
9 Mercadolibre 2016 2222
10 Mercadolibre 2015 2134
11 Lazard Ltd 2019 47700
12 Lazard Ltd 2018 45061
13 Lazard Ltd 2017 45050
14 Lazard Ltd 2016 43280
15 Lazard Ltd 2015 42281
So basically I want to transpose the data so that the time lags become rows, and then associate each lag (FY0, FY-1, FY-2, ...) with a date/year determined by the date column minus the lag, i.e. if FY0 = 2020-06-29 then FY-1 = 2019-06-29.
Thanks in advance to anyone who is able to help, as I feel this is quite tricky to do in R!
One solution is the following
Data
> example <- data.frame(Company = "Kimball", date = "2020", FY0 = 4200, FY1 = 4210)
> example
Company date FY0 FY1
1 Kimball 2020 4200 4210
Code
library(dplyr)   # attach dplyr for the pipe and row_number()
example %>%
  tidyr::pivot_longer(., c("FY0", "FY1")) %>%
  dplyr::group_by(Company) %>%
  dplyr::mutate(Years = as.numeric(date) - (row_number() - 1)) %>%
  dplyr::select(-date, -name)
Output
# A tibble: 2 x 3
# Groups: Company [1]
Company value Years
<chr> <dbl> <dbl>
1 Kimball 4200 2020
2 Kimball 4210 2019
EDIT
To address your concerns:
(1) The first row contains the variables FY0, ... . Hence you just need to replace the names of the third through the last column with the values of the first row (skipping the first two columns), i.e. colnames(df)[3:ncol(df)] <- unlist(df[1, 3:ncol(df)]).
(2) The row_number() pertains to the grouping! Hence, for each group, i.e. firm, the numbering will start again at 1! No worries there.
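Putting (1) and (2) together on data shaped like the question's print-out (a hedged sketch; the object df and the column names Company Name, Date, and the FY lag labels are assumptions based on that print-out, not tested against the real import):
library(dplyr)
library(tidyr)
library(lubridate)
# (1) take the lag labels from the first row, then drop that row
names(df)[3:ncol(df)] <- unlist(df[1, 3:ncol(df)])
df <- df[-1, ]
# (2) pivot the lag columns into rows; row_number() restarts at 1 within each company
df %>%
  pivot_longer(cols = starts_with("FY"), names_to = "lag", values_to = "Market Share") %>%
  mutate(`Market Share` = as.numeric(`Market Share`)) %>%
  group_by(`Company Name`) %>%
  mutate(Year = year(Date) - (row_number() - 1)) %>%
  select(`Company Name`, Year, `Market Share`)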

Summing each hydrologic year in my dataframe at 649 locations and 11,088 observations

Can someone please help me? I have a data frame with 649 different locations, each with 11,088 observations from the last 30 years. One hydrologic year spans from Sep. 1 to Aug. 31. The data frame looks like this:
What I want to end up with is something like this:
In my original data frame I also have a lot of missing data. If a location (e.g. 1.50.0) is missing more than 10% of its data in one hydrologic year, I do not want to keep that year in my new data frame.
If my question is unclear, please ask. :)
Without the data it's not easy, but it may be something like this:
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)
# toy data: two series, one with a missing value, indexed by date in the row names
df <- data.frame(d1 = c(rnorm(9, 5, 2), NA),
                 d2 = rnorm(10, 15, 2))
row.names(df) <- seq(today() - days(9), today(), "day")
df %>%
  rownames_to_column("id") %>%
  gather(variable, value, -id) %>%
  mutate(yr = year(id)) %>%
  group_by(yr) %>%
  mutate(is_na = sum(is.na(value)) / n()) %>%   # share of missing values per year
  filter(is_na < .1) %>%
  group_by(yr, variable) %>%
  summarise(res = mean(value, na.rm = TRUE)) %>%
  spread(variable, res)
# A tibble: 1 x 3
# Groups: yr [1]
yr d1 d2
<dbl> <dbl> <dbl>
1 2018. 4.41 14.7
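The question actually asks for hydrologic years (Sep. 1 to Aug. 31) and sums rather than means, which the sketch above glosses over. One hedged way to adapt it, assuming the same long format produced by rownames_to_column() and gather() and the packages loaded above:
df %>%
  rownames_to_column("id") %>%
  gather(variable, value, -id) %>%
  # label Sep-Dec observations with the following year (adjust if you label by start year)
  mutate(hydro_yr = year(id) + (month(id) >= 9)) %>%
  group_by(hydro_yr, variable) %>%
  filter(sum(is.na(value)) / n() <= 0.1) %>%   # drop location-years with more than 10% missing
  summarise(total = sum(value, na.rm = TRUE)) %>%
  spread(variable, total)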

Looping through two dataframes and adding columns inside of the loop

I have a problem when specifying a loop with a data frame.
The general idea I have is the following:
I have an area which contains a certain number of raster quadrants. These raster quadrants have been visited irregularly over several years (e.g. from 1950 to 2015).
I have two data frames:
1) a data frame containing the IDs of the rasterquadrants (and one column for the year of first visit of this quadrant):
df1<- as.data.frame(cbind(c("12345","12346","12347","12348"),rep(NA,4)))
df1[,1]<- as.character(df1[,1])
df1[,2]<- as.numeric(df1[,2])
names(df1)<-c("Raster_Q","First_visit")
2) a data frame that contains the info on the visits; it is ordered first by raster quadrant and then by year. This data frame records which raster quadrant was visited and when.
df2<- as.data.frame(cbind(c(rep("12345",5),rep("12346",7),rep("12347",3),rep(12348,9)),
c(1950,1952,1955,1967,1951,1968,1970,
1998,2001,2014,2015,2017,1965,1986,2000,1952,1955,1957,1965,2003,2014,2015,2016,2017)))
df2[,1]<- as.character(df2[,1])
df2[,2]<- as.numeric(as.character(df2[,2]))
names(df2)<-c("Raster_Q","Year")
I want to know when and how often the full area was 'sampled'.
Scheme of what I want to do; different colors indicate different areas/regions
My rationale:
I sorted the complete data in df2 by quadrant and year. I then match each raster quadrant in df1 with the corresponding quadrant in df2 and add the first year value from df2.
For this I wrote a loop (see below)
In order not to replicate a quadrant I created a vector "visited"
visited<-c()
Every entry of df2 that matches df1 will be written into this vector, so that the second entry of e.g. rasterquadrant "12345" in df2 is ignored in the loop.
Here comes the loop:
visited <- c()
for (i in 1:nrow(df2)){
  index <- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
  if(length(index)==0) {next()} else{
    if(df1$"Raster_Q"[index] %in% visited){next()} else{
      df1$"First_visit"[index] <- df2$"Year"[i]
      visited[index] <- df1$"Raster_Q"[index]
    }
  }
}
This gives me the first full sampling period.
Raster_Q First_visit
1 12345 1950
2 12346 1968
3 12347 1965
4 12348 1952
However, I want to have all full sampling periods.
So I do:
df1$"Second_visit"<-NA
I reset the visited vector and specify the following loop:
visited <- c()
for (i in 1:nrow(df2)){
  if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
    index <- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
    if(length(index)==0) {next()} else{
      if(df1$"Raster_Q"[index] %in% visited){next()} else{
        df1$"Second_visit"[index] <- df2$"Year"[i]
        visited[index] <- df1$"Raster_Q"[index]
      }
    }
  }
}
Which is basically the same loop as before, however, only making sure that, if df2$"Year" in a certain raster quadrant has already been included in the first visit, then it is skipped.
That gives me the second full sampling period:
Raster_Q First_visit Second_visit
1 12345 1950 NA
2 12346 1968 1970
3 12347 1965 1986
4 12348 1952 2003
Okay, so far so good. I could do that all by hand. But I have loads and loads of rasterquadrants and several areas that can and should be screened in this way.
So doing all of this in a single loop would be really great! However, I realized that this creates a problem because the loop then gets recursive:
The added column will not be included in the subsequent iteration of the loop, because df1 itself is not re-read for each iteration; in consequence, the new column for the new sampling period will not be included in the following iterations:
visited<- c()
for (i in 1:nrow(df2)){
m<-ncol(df1)
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
#finish "first_visit"
df1[,m+1]<-NA
# add column for "second visit"
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
# make sure that the first visit year are not included
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m+1]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This won't work. Another issue is that the vector visited is not emptied during this loop, so basically every Raster_Q has already been visited by the second sampling period.
I am stuck... any ideas?
You can do this without a for loop by using the dplyr and tidyr packages. First, you take your df2 and use dplyr::arrange to order by raster and year. Then you can rank the years visited using the rank function inside of the dplyr::mutate function. Then using tidyr::spread you can put them all in their own columns. Here is the code:
library(dplyr)
df <- df2 %>%
  arrange(Raster_Q, Year) %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year),
         visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)
Here is the output:
> df
# A tibble: 4 x 10
# Groups: Raster_Q [4]
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7 visit_8 visit_9
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 1951 1952 1955 1967 NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017 NA NA
3 12347 1965 1986 2000 NA NA NA NA NA NA
4 12348 1952 1955 1957 1965 2003 2014 2015 2016 2017
EDIT: So I think I understand your problem a little better now. You are looking to remove all duplicate visits to each quadrant that happened before the maximum Year of each respective "round" of visits. So to accomplish this, I wrote a short function that in essence does what the code above does, but with a slight change. Here is the function:
filter_by_round <- function(data, round) {
  output <- data %>%
    arrange(Raster_Q, Year) %>%
    group_by(Raster_Q) %>%
    mutate(visit = rank(Year, ties.method = "first")) %>%
    ungroup() %>%
    mutate(in_round = ifelse(Year <= max(.$Year[.$visit == round]) & visit > round,
                             TRUE, FALSE)) %>%
    filter(!in_round) %>%
    select(-c(in_round, visit))
  return(output)
}
What this function does, is look through the data and if a given year is less than the max year for the specified "visit round" then it is removed. To apply this only to the first round, you would do this:
df2 %>%
  filter_by_round(1) %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year, ties.method = "first")) %>%
  ungroup() %>%
  mutate(visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)
which would give you this:
# A tibble: 4 x 8
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017
3 12347 1965 1986 2000 NA NA NA NA
4 12348 1952 2003 2014 2015 2016 2017 NA
However, while it does accomplish what your for loop would have, you now have other occurrences of the same problem. I have come up with a way to do this successfully but it requires you to know how many "visit rounds" you had or some trial and error. To accomplish this, you can use map and assign the change to a global variable.
# I do this so we do not lose the original dataset
df <- df2
# I chose 1:5 after some trial and error showed there are 5 unique
# "visit rounds" in your toy dataset
# However, if you overshoot your number, it should still work,
# you will just get warnings about `max` not working correctly
# however, this may cause issues, so figuring out your exact number is
# recommended
purrr::map(1:5, function(x){
  # this assigns the output of each iteration to the global variable df
  df <<- df %>%
    filter_by_round(x)
})
# now applying the original transformation to get the spread dataset
df %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year, ties.method = "first")) %>%
  ungroup() %>%
  mutate(visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)
This will give you the following output:
# A tibble: 4 x 6
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA
2 12346 1968 1970 2014 2015 2017
3 12347 1965 1986 NA NA NA
4 12348 1952 2003 2014 2015 2016
Granted, this is probably not the most elegant solution, but it works. Hopefully this solves the problem for you.
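If you would rather avoid the global assignment with <<-, a functional alternative (my sketch, under the same assumption of 5 rounds) is purrr::reduce(), which threads the data through filter_by_round() once per round:
library(purrr)
# apply filter_by_round() for rounds 1 to 5 in turn, starting from df2
df <- reduce(1:5, filter_by_round, .init = df2)
# df can then be spread exactly as in the block above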

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest/most recent year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need; it should output 100 and not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
           Year=c(2016,2015,2014,2016,2006,2006),
           Balance=c(100,150,65,75,150,10)) %>%  # create the data frame
  group_by(Name, Year) %>%                       # group it by Name and Year
  summarise(maxBalance = max(Balance)) %>%       # maximum balance within each Name/Year
  group_by(Name) %>%                             # regroup by Name only
  top_n(1, Year)                                 # keep the most recent year for each Name
Here is another solution without the data.table package.
First, sort the data frame:
df <- df[order(-df$Year, -df$Balance), ]
then select the first row in each group with the same name:
df[!duplicated(df$Name), ]
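Applied to the df defined in the question, those two steps should give the same result as the other answers (a quick check):
df <- df[order(-df$Year, -df$Balance), ]
df[!duplicated(df$Name), ]
#    Name Year Balance
# 1  John 2016     100
# 4 Stacy 2016      75
# 5   Kat 2006     150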
