How to summarize data in R (dplyr) and avoid duplicate identifiers? [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 1 year ago.
I'm trying to identify the lowest rate over a range of years for a number of items (ID).
In addition, I would like to know the Year the lowest rate was pulled from.
I'm grouping by ID, but I run into an issue when rates are duplicated across years.
sample data
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 Year = rep(2010:2012, 4),
                 Rate = c(0.3, 0.6, 0.9,
                          0.8, 0.5, 0.2,
                          0.8, 0.4, 0.9,
                          0.7, 0.7, 0.7))
sample data as table
| ID | Year | Rate |
|:------:|:------:|:------:|
| 1 | 2010 | 0.3 |
| 1 | 2011 | 0.6 |
| 1 | 2012 | 0.9 |
| 2 | 2010 | 0.8 |
| 2 | 2011 | 0.5 |
| 2 | 2012 | 0.2 |
| 3 | 2010 | 0.8 |
| 3 | 2011 | 0.4 |
| 3 | 2012 | 0.9 |
| 4 | 2010 | 0.7 |
| 4 | 2011 | 0.7 |
| 4 | 2012 | 0.7 |
Using dplyr I grouped by ID, then found the lowest rate.
df.Summarise <- df %>%
  group_by(ID) %>%
  summarise(LowestRate = min(Rate))
This gives me the following
| ID | LowestRate |
| --- | --- |
| 1 | 0.3 |
| 2 | 0.2 |
| 3 | 0.4 |
| 4 | 0.7 |
However, I also need to know the year that data was pulled from.
This is what I would like my final result to look like:
| ID | Rate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2012 |
Here's where I ran into some issues.
Attempt #1: Include "Year" in the original dplyr code
df.Summarise2 <- df %>%
  group_by(ID) %>%
  summarise(LowestRate = min(Rate),
            Year = Year)
Error: Column `Year` must be length 1 (a summary value), not 3
Makes sense. I'm not summarizing "Year" at all. I just want to include that row's value for Year!
Attempt #2: Use mutate instead of summarise
df.Mutate <- df %>%
  group_by(ID) %>%
  mutate(LowestRate = min(Rate))
So that essentially returns my original dataframe, but with an extra column for LowestRate attached.
How would I go from this to what I want?
I tried to left_join / merge based on ID and LowestRate, but there are multiple matches for ID #4. Is there any way to keep only one matching row?
df.joined <- left_join(df.Summarise, df, by = c("ID", "LowestRate" = "Rate"))
df.joined as table
| ID | LowestRate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2010 |
| 4 | 0.7 | 2011 |
| 4 | 0.7 | 2012 |
I've tried looking online, but I can't find anything that addresses this exactly.
Using .drop = FALSE in group_by() didn't help, as it seems to be intended for preserving empty groups.
The dataset I'm working with is large, so I'd really like to find how to make this work and avoid hard-coding anything :)
Thanks for any help!

You can group by ID and then filter without summarizing; that way you preserve all columns but still keep only the rows with the minimum rate:
df %>%
  group_by(ID) %>%
  filter(Rate == min(Rate))
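Note that ties survive the filter: ID 4 keeps all three years. If you want exactly one row per group, a minimal sketch (assuming dplyr >= 1.0.0, where slice_min() is available) is:

library(dplyr)

df %>%
  group_by(ID) %>%
  slice_min(Rate, n = 1, with_ties = FALSE) %>%  # with_ties = FALSE keeps one row per group
  ungroup()

With with_ties = FALSE, the row kept among ties is the first one in the data (Year 2010 for ID 4), so sort beforehand if you prefer a different tiebreaker; arrange(desc(Year)) first should keep the most recent year instead.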

Related

How can I conditionally expand rows in my R dataframe?

I have a dataframe that I would like to expand based on a few conditions. If the Activity is "Repetitive", I would like to explode the row into 2 × Duration rows, one for each 0.5-second event, filling in a new dataframe. The rest of the information would stay the same, except that the expanded rows will alternate between the object given in the original dataframe (e.g. "Food") and "Nothing."
Location <- c("Kitchen", "Living Room", "Living Room", "Garage")
Object <- c("Food", "Toy", "Clothes", "Floor")
Duration <- c(6,3,2,5)
CumDuration <- c(6,9,11,16)
Activity <- c("Repetitive", "Constant", "Constant", "Repetitive")
df <- data.frame(Location, Object, Duration, CumDuration, Activity)
So it looks like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 6 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 5 | 16 | Repetitive |
And I want it to look like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 0.5 | 0.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 1 | Repetitive |
| Kitchen | Food | 0.5 | 1.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 2 | Repetitive |
| Kitchen | Food | 0.5 | 2.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 3 | Repetitive |
| Kitchen | Food | 0.5 | 3.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 4 | Repetitive |
| Kitchen | Food | 0.5 | 4.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 5 | Repetitive |
| Kitchen | Food | 0.5 | 5.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 0.5 | 11.5 | Repetitive |
| Garage | Nothing | 0.5 | 12 | Repetitive |
| Garage | Floor | 0.5 | 12.5 | Repetitive |
| Garage | Nothing | 0.5 | 13 | Repetitive |
| Garage | Floor | 0.5 | 13.5 | Repetitive |
| Garage | Nothing | 0.5 | 14 | Repetitive |
| Garage | Floor | 0.5 | 14.5 | Repetitive |
| Garage | Nothing | 0.5 | 15 | Repetitive |
| Garage | Floor | 0.5 | 15.5 | Repetitive |
| Garage | Nothing | 0.5 | 16 | Repetitive |
Thanks so much in advance!
Here is a dplyr option to achieve this:
library(dplyr)
df$CumDuration <- as.numeric(df$CumDuration)
df %>%
  filter(Activity == "Repetitive") %>%
  group_by(Location) %>%
  slice(rep(1:n(), each = Duration / 0.5)) %>%  # create one row per 0.5-second event
  mutate(Duration = 0.5) %>%                    # each expanded row lasts 0.5 seconds
  ungroup() %>%
  arrange(CumDuration) %>%
  # alternate the Object with "Nothing" and add an ID to keep the sort order
  mutate(Object = ifelse(row_number() %% 2 == 0, "Nothing", Object),
         ID = 1:n()) %>%
  full_join(filter(df, Activity != "Repetitive")) %>%  # merge back the unmodified rows
  arrange(CumDuration, ID) %>%                         # arrange rows in the correct order
  mutate(CumDuration = cumsum(Duration)) %>%           # recalculate the cumulative sum
  select(-ID)                                          # remove the helper ID column
# A tibble: 24 x 5
Location Object Duration CumDuration Activity
<chr> <chr> <dbl> <dbl> <chr>
1 Kitchen Food 0.5 0.5 Repetitive
2 Kitchen Nothing 0.5 1 Repetitive
3 Kitchen Food 0.5 1.5 Repetitive
4 Kitchen Nothing 0.5 2 Repetitive
5 Kitchen Food 0.5 2.5 Repetitive
6 Kitchen Nothing 0.5 3 Repetitive
7 Kitchen Food 0.5 3.5 Repetitive
8 Kitchen Nothing 0.5 4 Repetitive
9 Kitchen Food 0.5 4.5 Repetitive
10 Kitchen Nothing 0.5 5 Repetitive
# ... with 14 more rows
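For comparison, a sketch of the same expansion using tidyr::uncount(), which avoids the slice()/rep() idiom (this assumes dplyr and tidyr are loaded; bind_rows() stands in for the full_join()):

library(dplyr)
library(tidyr)

df %>%
  filter(Activity == "Repetitive") %>%
  uncount(Duration * 2) %>%                    # one row per 0.5-second slot
  group_by(Location) %>%
  mutate(Duration = 0.5,
         Object = ifelse(row_number() %% 2 == 0, "Nothing", Object)) %>%
  ungroup() %>%
  bind_rows(filter(df, Activity != "Repetitive")) %>%
  arrange(CumDuration) %>%                     # stable sort keeps each block in order
  mutate(CumDuration = cumsum(Duration))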

R - Join two dataframes based on date difference

Let's consider two dataframes, df1 and df2. I would like to join the dataframes based on the date difference only. For example:
Dataframe 1: (df1)
| version_id | date_invoiced | product_id |
-------------------------------------------
| 1 | 03-07-2020 | 201 |
| 1 | 02-07-2020 | 2013 |
| 3 | 02-07-2020 | 2011 |
| 6 | 01-07-2020 | 2018 |
| 7 | 01-07-2020 | 201 |
Dataframe 2: (df2)
| validfrom | pricelist| pricelist_id |
------------------------------------------
|02-07-2020 | 10 | 101 |
|01-07-2020 | 20 | 102 |
|29-06-2020 | 30 | 103 |
|28-07-2020 | 10 | 104 |
|25-07-2020 | 5 | 105 |
I need to map the pricelist_id and the pricelist based on the validfrom column present in df2. That is, each row should be mapped based on the least difference between date_invoiced (df1) and validfrom (df2).
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
----------------------------------------------------------------------------------
| 1 | 03-07-2020 | 201 | 1 | 101 | 10 |
| 1 | 02-07-2020 | 2013 | 1 | 102 | 20 |
| 3 | 02-07-2020 | 2011 | 1 | 102 | 20 |
| 6 | 01-07-2020 | 2018 | 1 | 103 | 30 |
| 7 | 01-07-2020 | 201 | 1 | 103 | 30 |
The mapping should be based purely on the smallest difference: each date_invoiced (df1) should always be matched to the closest validfrom (df2). Thanks
Perhaps you might want to try a data.table rolling join with roll = 'nearest'. Here, the join is made on DATE, which is date_invoiced from df1 and validfrom from df2.
library(data.table)
setDT(df1)
setDT(df2)
df1$date_invoiced <- as.Date(df1$date_invoiced, format = "%d-%m-%Y")
df2$validfrom <- as.Date(df2$validfrom, format = "%d-%m-%Y")
setkey(df1, date_invoiced)[, DATE := date_invoiced]
setkey(df2, validfrom)[, DATE := validfrom]
df2[df1, on = "DATE", roll = "nearest"]
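One caveat: roll = "nearest" matches in either direction, so an invoice can pick up a price list whose validfrom is after the invoice date. If a price list should only apply from its validfrom date onward, a backward roll is the usual data.table idiom:

# match each invoice to the most recent validfrom at or before it (LOCF)
df2[df1, on = "DATE", roll = Inf]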

Functions by groups in another column in R [duplicate]

This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
How to sum a variable by group
(18 answers)
Closed 2 years ago.
I have 2 questions regarding groups in a dataframe in R.
Imagine I have a dataframe (df) like this
| CONT | COUNTRY | GDP | AVG_GDP |
|------|---------|-----|---------|
| AF | EGYPT | 3 | 2 |
| AF | SUDAN | 2 | 2 |
| AF | ZAMBIA | 1 | 2 |
| AM | CANADA | 4 | 5 |
| AM | MEXICO | 2 | 5 |
| AM | USA | 9 | 5 |
| EU | FRANCE | 5 | 4 |
| EU | ITALY | 4 | 4 |
| EU | SPAIN | 3 | 4 |
How can I calculate the average GDP by continent and put it in the AVG_GDP column so that it looks like the table above?
The second question is how I can sum the GDP by continent so that it looks like this:
| CONT | SUM_GDP |
|------|---------|
| AF | 6 |
| AM | 15 |
| EU | 12 |
For this last question I think that in base R the second table could be obtained with something like aggregate(df$GDP, by = list(df$CONT), FUN = sum) (note that aggregate() returns a data frame, so it cannot be assigned to a single column), but maybe there is another way to build it as a new dataframe.
Thank you in advance
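A minimal dplyr sketch of both operations, for reference (mutate() keeps every row and adds the per-continent mean; summarise() collapses to one row per continent):

library(dplyr)

# First question: group mean as a new column
df <- df %>%
  group_by(CONT) %>%
  mutate(AVG_GDP = mean(GDP)) %>%
  ungroup()

# Second question: one row per continent
df_sum <- df %>%
  group_by(CONT) %>%
  summarise(SUM_GDP = sum(GDP))

In base R, ave(df$GDP, df$CONT, FUN = mean) returns a row-aligned vector, so unlike aggregate() it can be assigned directly to a new column.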

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, as for some particular specifications I do not have data; hence no proportion table can be calculated. So what I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here some insights to my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year and quarter) and area for which X individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm and age = 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  if (alkplot == TRUE) {
    alkPlot(key, "area", xlab = "Age")
  }
  return(key)
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet, for the same area AND year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here, quarter 2 with the same area and year), and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here my LAK function which I'm trying to update:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  # In case of empty dataset
  # if (is.data.frame(sALK) && nrow(sALK) == 0) {
  if (sALK[rowSums(is.na(sALK)) > 0, ]) {
    warning("Empty subset combination; data will be subsetted based on the
            nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  if (alkplot == TRUE) {
    alkPlot(key, "area", xlab = "Age")
  }
  return(key)
}
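For the empty-subset check itself, a minimal sketch that works with the NA-row convention in the sample data (an assumption; adapt the column name to your real data) would be:

# a combination with no measurements is either zero rows or all-NA counts
no_data <- nrow(sALK) == 0 || all(is.na(sALK$no_individuals))
if (no_data) {
  warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
  # imputation rules would go here
}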
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  print(sALK)
  if (nrow(sALK) == 1) {
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    sALK2 <- subset(df, year == syear & area == sarea)
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    idx <- which(vals$Freq > 1)
    quarterId <- as.character(vals[idx, "time_comb"])
    imput <- subset(df, year == syear & area == sarea & time_comb == quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin = 1), 3)
    print(key2)
    if (alkplot == TRUE) {
      alkPlot(key2, "area", xlab = "Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin = 1), 3)
    print(key)
    if (alkplot == TRUE) {
      alkPlot(key, "area", xlab = "Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year & area combination. Yet, I'm still struggling to figure out how to deal with cases where I have no data at all for a particular year & area combination. In that case I need to borrow data from the closest year that contains data for all the quarters for the same area.
For the example above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
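One possible direction, as an untested sketch with hypothetical names: collapse year and quarter into a single quarter index and, for the same area, pick the combination with data that is closest to the requested one:

# hypothetical helper: nearest year/quarter with data for a given area,
# measured in whole quarters
nearest_combo <- function(df, Year, Quarter, Area) {
  have <- subset(df, area == Area & !is.na(no_individuals))
  target <- as.numeric(Year) * 4 + as.numeric(Quarter)
  dist <- abs(as.numeric(have$year) * 4 + as.numeric(have$quarter) - target)
  have[which.min(dist), c("year", "quarter")]
}

The returned year/quarter pair could then replace the empty combination in the existing subset() call; ties (one quarter earlier vs. one later) fall to whichever row comes first.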
I don't know if you have ever encountered MICE (the mice package on CRAN), but it is a pretty cool and comprehensive tool for variable imputation. It also lets you see how the imputed data are distributed, so that you can choose the method best suited to your problem; the package vignettes and the original package description are a good starting point.

Dealing with grouped dataset in R

I have a dataset like:
+----+-------+---------+----------+--+
| id | time | event | timediff | |
+----+-------+---------+----------+--+
| 1 | 15.00 | install | - | |
| 1 | 15.30 | sale | 00.30 | |
| 1 | 16.00 | sale | 00.30 | |
| 2 | 15.00 | sale | - | |
| 2 | 15.30 | sale | 0.30 | |
| 3 | 16.00 | install | - | |
| 4 | 15.00 | install | - | |
| 5 | 13.00 | install | - | |
| 5 | 14.00 | sale | 01.00 | |
+----+-------+---------+----------+--+
I want to clean this dataset:
I want to exclude the ids whose first (and following) events are sales with no preceding install.
I want to exclude the ids that have an install but no sales (those ids are indeed the unique ones).
Obtaining finally a result like:
+----+-------+---------+----------+
| id | time | event | timediff |
+----+-------+---------+----------+
| 1 | 15.00 | install | - |
| 1 | 15.30 | sale | 0.30 |
| 1 | 16.00 | sale | 0.30 |
| 5 | 13.00 | install | - |
| 5 | 14.00 | sale | 01.00 |
+----+-------+---------+----------+
How can I do that in R? Is there a specific package for data manipulation, or can I just use if statements? Should I use tapply?
Based on the example, we can group by 'id' and filter on the 'event' column, keeping groups whose first element is 'install' and whose second is 'sale', to get the expected output.
df1 %>%
  group_by(id) %>%
  filter(first(event) == 'install' & event[2L] == 'sale')
id time event timediff
# (int) (dbl) (chr) (dbl)
#1 1 15.0 install NA
#2 1 15.3 sale 0.3
#3 1 16.0 sale 0.3
#4 5 13.0 install NA
#5 5 14.0 sale 1.0
Or, if all the elements except the first should be 'sale', we create a logical variable ('ind') by comparing the first element to 'install' and the successive elements to 'sale' (using lead), then filter the groups where all 'ind' are TRUE. If needed, we can remove the 'ind' column using select.
df1 %>%
  group_by(id) %>%
  mutate(ind = first(event) == 'install' & lead(event, default = 'sale') == 'sale') %>%
  filter(all(ind)) %>%
  ungroup() %>%
  select(-ind)
Or we can use data.table. Grouped by 'id', if the number of rows is greater than 1 (.N > 1), the first element is 'install' (event[1L] == 'install'), and all the remaining elements are 'sale', then we get the Subset of Data.table (.SD).
library(data.table)
setDT(df1)[, if (.N > 1 & event[1L] == 'install' & all(event[2:.N] == 'sale')) .SD, by = id]
# id time event timediff
#1: 1 15.0 install NA
#2: 1 15.3 sale 0.3
#3: 1 16.0 sale 0.3
#4: 5 13.0 install NA
#5: 5 14.0 sale 1.0
