Find maximum value of one column grouped by multiple other columns - r

I have a table of the following form in R:
| COUNTRY | date_start | code  | bin | ord |
| ------- | ---------- | ----- | --- | --- |
| Chile   | 04/11/2020 | 4.5.1 | 1   | 3   |
| Chile   | 04/11/2020 | 4.5.2 | 1   | 0   |
| Norway  | 23/02/2021 | 4.4.1 | 1   | 2   |
| Norway  | 23/02/2021 | 4.4.2 | 0   | 1   |
| Norway  | 25/02/2021 | 4.4.2 | 0   | 1   |
First I want to drop the column 'code', and then I want to create an extra column 'ordMax' and populate it with the maximum value of the 'ord' column for a given 'COUNTRY' and 'date_start'. So in this example the new table would be
| COUNTRY | date_start | bin | ord | ordMax |
| ------- | ---------- | --- | --- | ------ |
| Chile   | 04/11/2020 | 1   | 3   | 3      |
| Chile   | 04/11/2020 | 1   | 0   | 3      |
| Norway  | 23/02/2021 | 1   | 2   | 2      |
| Norway  | 23/02/2021 | 0   | 1   | 2      |
| Norway  | 25/02/2021 | 0   | 1   | 1      |
I have tried a couple of methods in R, using both 'aggregate' and the dplyr library, but nothing seemed to work. One of the things that I tried was:
df_k_reduced <- df_k %>%
  group_by(COUNTRY, date_start) %>%
  select(-code) %>%
  summarise(ordMax = max(ord))
But this gives something like:
| COUNTRY | date_start | ordMax |
| ------- | ---------- | ------ |
| Chile   | 04/11/2020 | 3      |
| Norway  | 23/02/2021 | 2      |
| Norway  | 25/02/2021 | 1      |
Note that 'bin' and the original 'ord' column have also been dropped, even though that was not the original intention.
How would I obtain the table with that extra column, where the only dropped column is 'code', and no rows are dropped?

We can use slice_max instead of summarise to return all the columns after the select step (it keeps, for each group, the row(s) with the maximum 'ord'):
library(dplyr)
df_k %>%
  group_by(COUNTRY, date_start) %>%
  select(-code) %>%
  slice_max(order_by = ord, n = 1)
If we need to create a new column, use mutate
df_k %>%
  group_by(COUNTRY, date_start) %>%
  select(-code) %>%
  mutate(ordMax = max(ord, na.rm = TRUE)) %>%
  ungroup()
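For completeness, since 'aggregate' was mentioned in the question, the same idea can be expressed in base R with ave(), which computes the group-wise maximum and recycles it to every row of the group. This is only a minimal sketch, assuming df_k holds the data shown above:
# group-wise maximum of 'ord' per COUNTRY and date_start, recycled to every row
df_k$ordMax <- ave(df_k$ord, df_k$COUNTRY, df_k$date_start, FUN = max)
df_k$code <- NULL  # drop the unwanted 'code' column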

data.table way
sample data
library(data.table)
DT <- fread("COUNTRY | date_start | code | bin | ord
Chile | 04/11/2020 | 4.5.1 | 1 | 3
Chile | 04/11/2020 | 4.5.2 | 1 | 0
Norway | 23/02/2021 | 4.4.1 | 1 | 2
Norway | 23/02/2021 | 4.4.2 | 0 | 1
Norway | 25/02/2021 | 4.4.2 | 0 | 1 ")
code
DT[, ordMax := max(ord), by = .(COUNTRY, date_start)][, code := NULL][]
output
# COUNTRY date_start bin ord ordMax
# 1: Chile 04/11/2020 1 3 3
# 2: Chile 04/11/2020 1 0 3
# 3: Norway 23/02/2021 1 2 2
# 4: Norway 23/02/2021 0 1 2
# 5: Norway 25/02/2021 0 1 1


How can I conditionally expand rows in my R dataframe?

I have a dataframe that I would like to expand based on a few conditions. If the Activity is "Repetitive", I would like to explode the row into one row for each 0.5-second event, i.e. twice as many rows as the duration. The rest of the information would stay the same, except that the expanded rows will alternate between the object given in the original dataframe (e.g. "Toy") and "Nothing".
Location <- c("Kitchen", "Living Room", "Living Room", "Garage")
Object <- c("Food", "Toy", "Clothes", "Floor")
Duration <- c(6,3,2,5)
CumDuration <- c(6,9,11,16)
Activity <- c("Repetitive", "Constant", "Constant", "Repetitive")
df <- data.frame(Location, Object, Duration, CumDuration, Activity)
So it looks like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 6 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 5 | 16 | Repetitive |
And I want it to look like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 0.5 | 0.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 1 | Repetitive |
| Kitchen | Food | 0.5 | 1.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 2 | Repetitive |
| Kitchen | Food | 0.5 | 2.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 3 | Repetitive |
| Kitchen | Food | 0.5 | 3.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 4 | Repetitive |
| Kitchen | Food | 0.5 | 4.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 5 | Repetitive |
| Kitchen | Food | 0.5 | 5.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 0.5 | 11.5 | Repetitive |
| Garage | Nothing | 0.5 | 12 | Repetitive |
| Garage | Floor | 0.5 | 12.5 | Repetitive |
| Garage | Nothing | 0.5 | 13 | Repetitive |
| Garage | Floor | 0.5 | 13.5 | Repetitive |
| Garage | Nothing | 0.5 | 14 | Repetitive |
| Garage | Floor | 0.5 | 14.5 | Repetitive |
| Garage | Nothing | 0.5 | 15 | Repetitive |
| Garage | Floor | 0.5 | 15.5 | Repetitive |
| Garage | Nothing | 0.5 | 16 | Repetitive |
Thanks so much in advance!
Here is a dplyr option to achieve this:
library(dplyr)
df$CumDuration <- as.numeric(df$CumDuration)

df %>%
  filter(Activity == "Repetitive") %>%
  group_by(Location) %>%
  slice(rep(1:n(), each = Duration / 0.5)) %>%          # Create the new rows
  mutate(Duration = Duration / (Duration * 2)) %>%      # Change the Duration to 0.5
  ungroup() %>%
  arrange(CumDuration) %>%
  # Change the Object to "Nothing" every other row, and add an ID for sorting in the correct order
  mutate(Object = ifelse((row_number() %% 2) == 0, "Nothing", Object),
         ID = 1:n()) %>%
  full_join(filter(df, Activity != "Repetitive")) %>%   # Merge back the unmodified rows of the original data frame
  arrange(CumDuration, ID) %>%                          # Arrange rows in the correct order
  mutate(CumDuration = cumsum(Duration)) %>%            # Recalculate the cumulative sum
  select(-ID)                                           # Remove the ID column, no longer wanted
# A tibble: 24 x 5
Location Object Duration CumDuration Activity
<chr> <chr> <dbl> <dbl> <chr>
1 Kitchen Food 0.5 0.5 Repetitive
2 Kitchen Nothing 0.5 1 Repetitive
3 Kitchen Food 0.5 1.5 Repetitive
4 Kitchen Nothing 0.5 2 Repetitive
5 Kitchen Food 0.5 2.5 Repetitive
6 Kitchen Nothing 0.5 3 Repetitive
7 Kitchen Food 0.5 3.5 Repetitive
8 Kitchen Nothing 0.5 4 Repetitive
9 Kitchen Food 0.5 4.5 Repetitive
10 Kitchen Nothing 0.5 5 Repetitive
# ... with 14 more rows
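If adding tidyr is an option, the slice(rep(...)) step can be replaced with uncount(). This is only a sketch, assuming a recent tidyr (which provides uncount()) and the sample df above; n_rows and step are helper names introduced here, not columns from the original data:
library(dplyr)
library(tidyr)

df %>%
  # repeat each "Repetitive" row Duration / 0.5 times; "Constant" rows stay as a single row
  mutate(n_rows = ifelse(Activity == "Repetitive", Duration / 0.5, 1)) %>%
  uncount(n_rows, .id = "step") %>%
  mutate(Duration = ifelse(Activity == "Repetitive", 0.5, Duration),
         # alternate the original Object with "Nothing" on every second 0.5-second slot
         Object = ifelse(Activity == "Repetitive" & step %% 2 == 0, "Nothing", Object),
         CumDuration = cumsum(Duration)) %>%
  select(-step)
Because uncount() preserves the original row order, the final cumsum(Duration) should reproduce the CumDuration column shown in the desired output.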

total() in tab_cols only sums up to one, any suggestion?

Suppose I have a dataframe 'y':
WR <- c("S", "J", "T")
B <- c("b1", "b2", "b3")
wgt <- c(0.3, 2, 3)
y <- data.frame(WR, B, wgt)
I want to make a column-percentage crosstab with B as rows, and WR plus a total over WR as columns, using the expss package:
library(expss)
y %>%
  tab_cols(total(), WR) %>%                      # Columns
  tab_stat_valid_n("Base") %>%
  tab_weight(wgt) %>%
  tab_stat_valid_n("Projection") %>%
  tab_cells(mrset(B)) %>%                        # Rows
  tab_stat_cpct(total_row_position = "none") %>%
  tab_pivot()
Result
But the Base in the #Total column does not match up (it shows 1 instead of the 3 cases):
# #Total WR|J WR|S WR|T
# Base 1.000000 1 1.0 1
# Projection 5.300000 2 0.3 3
# b1 5.660377 NA 100.0 NA
# b2 37.735849 100 NA NA
# b3 56.603774 NA NA 100
I think I found the solution: moving tab_cells() above the tab_stat_*() calls gives the expected Base and Projection in the #Total column.
y %>%
  tab_cols(total(), WR) %>%                      # Columns
  tab_cells(mrset(B)) %>%                        # Rows
  tab_stat_valid_n("Base") %>%
  tab_weight(wgt) %>%
  tab_stat_valid_n("Projection") %>%
  tab_stat_cpct(total_row_position = "none") %>%
  tab_pivot()
| | | #Total | WR | | |
| | | | J | S | T |
| -- | ---------- | ------ | --- | ----- | --- |
| B | Base | 3.0 | 1 | 1.0 | 1 |
| | Projection | 5.3 | 2 | 0.3 | 3 |
| b1 | | 5.7 | | 100.0 | |
| b2 | | 37.7 | 100 | | |
| b3 | | 56.6 | | | 100 |

SQLite: count occurrences per year

So let's say I have a table in my SQLite database with some information about some files, with the following structure:
| id | file format | creation date             |
| -- | ----------- | ------------------------- |
| 1  | Word        | 2010:02:12 13:31:33+01:00 |
| 2  | PSD         | 2021:02:23 15:44:51+01:00 |
| 3  | Word        | 2019:02:13 14:18:11+01:00 |
| 4  | Word        | 2010:02:12 13:31:20+01:00 |
| 5  | Word        | 2003:05:25 18:55:10+02:00 |
| 6  | PSD         | 2014:07:20 20:55:58+02:00 |
| 7  | Word        | 2014:07:20 21:09:24+02:00 |
| 8  | TIFF        | 2011:03:30 11:56:56+02:00 |
| 9  | PSD         | 2015:07:15 14:34:36+02:00 |
| 10 | PSD         | 2009:08:29 11:25:57+02:00 |
| 11 | Word        | 2003:05:25 20:06:18+02:00 |
I would like results that show me a chronology of how many of each file format were created in a given year – something along the lines of this:
| Format | 2003 | 2009 | 2010 | 2011 | 2014 | 2015 | 2019 | 2021 |
| ------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Word   | 2    | 0    | 2    | 0    | 1    | 0    | 1    | 0    |
| PSD    | 0    | 1    | 0    | 0    | 1    | 1    | 0    | 1    |
| TIFF   | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    |
I've gotten kinda close (I think) with this, but am stuck:
SELECT
file_format,
COUNT(CASE file_format WHEN creation_date LIKE '%2010%' THEN 1 ELSE 0 END),
COUNT(CASE file_format WHEN creation_date LIKE '%2011%' THEN 1 ELSE 0 END),
COUNT(CASE file_format WHEN creation_date LIKE '%2012%' THEN 1 ELSE 0 END)
FROM
fileinfo
GROUP BY
file_format;
When I do this I am getting unique amounts for each file format, but the same count for every year…
| Format | 2010 | 2011 | 2012 |
| ------ | ---- | ---- | ---- |
| Word   | 4    | 4    | 4    |
| PSD    | 1    | 1    | 1    |
| TIFF   | 6    | 6    | 6    |
Why am I getting that incorrect tally, and moreover, is there a smarter way of querying that doesn't rely on the year being statically searched for as a string for every single year? If it helps, the column headers and row headers could be switched – doesn't matter to me. Please help a n00b :(
Your CASE expression compares file_format with the result of creation_date LIKE '%2010%' (which is 0 or 1), so it almost always falls through to the ELSE 0 branch; and because COUNT() counts every non-NULL value, including 0, every row in the group is counted for every year. Use the SUM() aggregate function for conditional aggregation instead:
SELECT file_format,
SUM(creation_date LIKE '2010%') AS `2010`,
SUM(creation_date LIKE '2011%') AS `2011`,
..........................................
FROM fileinfo
GROUP BY file_format;

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, as for some particular specifications I do not have data; hence no proportion table can be calculated. So what I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as an input for the empty one.
Here some insights to my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (Year & Quarter) and area for which X no. of individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of the year 2005 in area 24 I had 8 individuals belonging to a length class (lenCls) of 380 mm and age=3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  return(key)
  if (alkplot == TRUE) {
    alkPlot(key, "area", xlab = "Age")
  }
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet, for the same area AND year I have data for either quarter 1 or 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here, quarter 2 with the same area and year), and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here my LAK function which I'm trying to update:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  # In case of an empty dataset
  # if (is.data.frame(sALK) && nrow(sALK) == 0) {
  if (nrow(sALK[rowSums(is.na(sALK)) > 0, ]) > 0) {
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  return(key)
  if (alkplot == TRUE) {
    alkPlot(key, "area", xlab = "Age")
  }
}
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of someone's interest:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter, area and species
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  print(sALK)
  if (nrow(sALK) == 1) {
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    sALK2 <- subset(df, year == syear & area == sarea)
    vals <- as.data.frame(table(sALK2$comb_index))
    colnames(vals)[1] <- "comb_index"
    idx <- which(vals$Freq > 1)
    quarterId <- as.numeric(as.character(vals[idx, "comb_index"]))
    imput <- subset(df, year == syear & area == sarea & comb_index == quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin = 1), 3)
    print(key2)
    if (alkplot == TRUE) {
      alkPlot(key2, "area", xlab = "Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin = 1), 3)
    print(key)
    if (alkplot == TRUE) {
      alkPlot(key, "area", xlab = "Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet, I'm still struggling to figure out how to deal with the case where I do not have data for a particular Year & Area combination at all. In that case I need to borrow data from the closest year that contains data for all the quarters of the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also lets you see how the imputed data are distributed, so that you can choose the method best suited to your problem. Take a look at the vignettes and the documentation of the mice package on CRAN.
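For illustration only, a minimal sketch of what a mice run could look like. Here raw_df is a hypothetical name for the raw dataframe shown at the top of the question, the method and number of imputations are placeholders, and categorical columns should be coded as factors beforehand; this is not the nearest-timestep rule described above:
library(mice)

# 5 imputed datasets via predictive mean matching (illustrative settings, not a recommendation)
imp <- mice(raw_df, m = 5, method = "pmm", seed = 123)
densityplot(imp)                  # compare the distributions of imputed vs. observed values
raw_df_complete <- complete(imp, 1)  # extract the first completed dataset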

How to subset a dataframe using a column from another dataframe in r?

I have 2 dataframes
Dataframe1:
|    | Cue       | Ass_word  | Condition | Freq | Cue_Ass_word        |
| -- | --------- | --------- | --------- | ---- | ------------------- |
| 1  | ACCENDERE | ACCENDINO | A         | 1    | ACCENDERE_ACCENDINO |
| 2  | ACCENDERE | ALLETTARE | A         | 0    | ACCENDERE_ALLETTARE |
| 3  | ACCENDERE | APRIRE    | A         | 1    | ACCENDERE_APRIRE    |
| 4  | ACCENDERE | ASCENDERE | A         | 1    | ACCENDERE_ASCENDERE |
| 5  | ACCENDERE | ATTIVARE  | A         | 0    | ACCENDERE_ATTIVARE  |
| 6  | ACCENDERE | AUTO      | A         | 0    | ACCENDERE_AUTO      |
| 7  | ACCENDERE | ACCENDINO | B         | 2    | ACCENDERE_ACCENDINO |
| 8  | ACCENDERE | ALLETTARE | B         | 3    | ACCENDERE_ALLETTARE |
| 9  | ACCENDERE | ACCENDINO | C         | 2    | ACCENDERE_ACCENDINO |
| 10 | ACCENDERE | ALLETTARE | C         | 0    | ACCENDERE_ALLETTARE |
Dataframe2:
|    | Group.1             | x  |
| -- | ------------------- | -- |
| 1  | ACCENDERE_ACCENDINO | 5  |
| 13 | ACCENDERE_FUOCO     | 22 |
| 16 | ACCENDERE_LUCE      | 10 |
| 24 | ACCENDERE_SIGARETTA | 6  |
| ... | ...                | ...|
I want to exclude from Dataframe1 all the rows that contain words (Cue_Ass_word) that are not reported in the column Group.1 in Dataframe2.
In other words, how can I subset Dataframe1 using the strings reported in Dataframe2$Group.1?
It's not quite clear what you mean, but is this what you need? It keeps only the rows of Dataframe1 whose Cue_Ass_word appears in Dataframe2$Group.1:
Dataframe1[Dataframe1$Cue_Ass_word %in% Dataframe2$Group.1, ]
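If you prefer dplyr, a semi_join does the same filtering. This is just a sketch, assuming both data frames are loaded exactly as shown above:
library(dplyr)

# keep only the rows of Dataframe1 whose Cue_Ass_word has a match in Dataframe2$Group.1
Dataframe1_filtered <- Dataframe1 %>%
  semi_join(Dataframe2, by = c("Cue_Ass_word" = "Group.1"))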
