Expand data.frame by creating duplicates based on group condition (3) - r

Starting from this SO question.
Example data.frame:
df = read.table(text = 'ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 5
12 118 2 5
90 117 1 3
22 115 2 2
99 115 2 2', header = TRUE)
The Count column shows the total number of ID values for each Day, and the Count_group column shows the sum of the Count values for that Day and the four preceding days (Day - 1, Day - 2, Day - 3 and Day - 4).
e.g. for Day 1933, Count_group = 15 because Count 6 (1933) + Count 5 (1932) + Count 3 (1931) + Count 1 (1930) + Count 0 (1929).
What I need to do is create duplicated observations for each Count_group and add them to it, so that each Count_group shows the rows of its Day, Day - 1, Day - 2, Day - 3 and Day - 4.
e.g. Count_group = 15 is composed of the Count values of Days 1933, 1932, 1931 and 1930 (1929 is not present in the df). So those five days need to be included in Count_group = 15. The next one will be Count_group = 9, composed of 1932, 1931, 1930, 1929 and 1928; etc.
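As a quick check of the arithmetic above, here is a minimal base R sketch that recomputes Count_group as a 5-day rolling sum of Count for the days 1930 to 1933. The names counts, window and count_group are only illustrative and are not part of the original data.

```r
# Count per Day, taken from the example (days 1930-1933)
counts <- c("1930" = 1, "1931" = 3, "1932" = 5, "1933" = 6)
window <- 5  # Day, Day - 1, ..., Day - 4

count_group <- sapply(as.integer(names(counts)), function(d) {
  days_in_window <- as.character((d - window + 1):d)
  # days missing from the data index to NA and are dropped by na.rm
  sum(counts[days_in_window], na.rm = TRUE)
})
names(count_group) <- names(counts)
count_group
```

Days that are absent from the data (such as 1929) simply contribute nothing to the sum.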
Desired output:
ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 15
34 1932 5 15
60 1932 5 15
88 1932 5 15
18 1932 5 15
33 1931 3 15
13 1931 3 15
56 1931 3 15
23 1930 1 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 9
13 1931 3 9
56 1931 3 9
23 1930 1 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 12
6 1799 4 12
52 1799 4 12
133 1799 4 12
112 1798 2 12
677 1798 2 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 6
677 1798 2 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 8
26 887 2 8
8 886 1 8
56 885 1 8
37 887 2 4
26 887 2 4
8 886 1 4
56 885 1 4
8 886 1 2
56 885 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
22 115 2 6
99 115 2 6
99 118 2 5
12 118 2 5
90 117 1 5
22 115 2 5
99 115 2 5
90 117 1 3
22 115 2 3
99 115 2 3
22 115 2 2
99 115 2 2
(note that different group of 5 days each one have been separated by a blank line in order to make them clearer)
I have got different data.frames which are grouped by n days and therefore I would like to adapt the code (by changing it a little) specifically for each of them.
Thanks

A generalised version of my previous answer...
#first add grouping variables
days <- 5 #grouping no of days
df$smalldaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]!=df$Day[i-1]))) #individual days
df$bigdaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]<df$Day[i-1]-days+1))) #blocks of linked days
#duplicate days in each big group
df2 <- lapply(split(df,df$bigdaygroup),function(x) {
  n <- max(x$Day)-min(x$Day)+1 #number of consecutive days in big group
  dayvec <- (max(x$Day):min(x$Day)) #possible days in range
  daylog <- dayvec[dayvec %in% x$Day] #actual days in range
  pattern <- data.frame(base=rep(dayvec,each=days))
  pattern$rep <- sapply(1:nrow(pattern),function(i) pattern$base[i]+1-sum(pattern$base[1:i]==pattern$base[i])) #indices to repeat
  pattern$offset <- match(pattern$rep,daylog)-match(pattern$base,daylog) #offsets (used later)
  pattern <- pattern[(pattern$base %in% x$Day) & (pattern$rep %in% x$Day),] #remove invalid elements
  #store pattern in list as offsets needed in next loop
  return(list(df=split(x,x$smalldaygroup)[match(pattern$rep,daylog)],pat=pattern))
})
#change the Count_group to previous value in added entries
df2 <- lapply(df2,function(L) lapply(1:length(L$df),function(i) {
  x <- L$df[[i]]
  offset <- L$pat$offset #pointer to day to copy Count_group from
  x$Count_group <- L$df[[i-offset[i]]]$Count_group[1]
  return(x)
}))
df2 <- do.call(rbind,unlist(df2,recursive=FALSE)) #bind back together
df2[,5:6] <- NULL #remove grouping variables
head(df2,30) #ignore rownames!
ID Day Count Count_group
01.1 18 1933 6 15
01.2 33 1933 6 15
01.3 37 1933 6 15
01.4 18 1933 6 15
01.5 16 1933 6 15
01.6 11 1933 6 15
02.7 111 1932 5 15
02.8 34 1932 5 15
02.9 60 1932 5 15
02.10 88 1932 5 15
02.11 18 1932 5 15
03.12 33 1931 3 15
03.13 13 1931 3 15
03.14 56 1931 3 15
04 23 1930 1 15
05.7 111 1932 5 9
05.8 34 1932 5 9
05.9 60 1932 5 9
05.10 88 1932 5 9
05.11 18 1932 5 9
06.12 33 1931 3 9
06.13 13 1931 3 9
06.14 56 1931 3 9
07 23 1930 1 9
08.12 33 1931 3 4
08.13 13 1931 3 4
08.14 56 1931 3 4
09 23 1930 1 4
010 23 1930 1 1
11.16 6 1800 6 12
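For comparison, the same expansion can be sketched more directly in base R: for every Day d, take all rows whose Day falls in the window d - days + 1 to d and stamp them with the Count_group of d. This is not the method used in the answer above, only an illustrative alternative; expand_by_window and df_small are hypothetical names, and the sketch assumes each Day has a single Count_group value.

```r
# Small excerpt of the example data (days 115-120)
df_small <- read.table(text = 'ID Day Count Count_group
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 5
12 118 2 5
90 117 1 3
22 115 2 2
99 115 2 2', header = TRUE)

expand_by_window <- function(df, days = 5) {
  day_values <- unique(df$Day)  # keeps the order of appearance
  pieces <- lapply(day_values, function(d) {
    # rows belonging to d and the (days - 1) preceding days
    block <- df[df$Day <= d & df$Day > d - days, ]
    # every row in the block gets the Count_group of day d
    block$Count_group <- df$Count_group[df$Day == d][1]
    block
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

res <- expand_by_window(df_small, days = 5)
res
```

Changing the days argument adapts the window length for data frames grouped by a different number of days.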

I attach a rather mechanical method, but I believe it is a good starting point.
I have noticed that in your original table the entry
ID Day Count Count_group
18 1933 6 14
is duplicated; I have left it untouched for the sake of clarity.
Structure of the approach:
1. Read original data
2. Generate list of data frames, for each Day
3. Generate final data frame, collapsing the list in 2.
1. Read original data
We start with
df = read.table(text = 'ID Day Count Count_group
18 1933 6 14
33 1933 6 14
37 1933 6 14
18 1933 6 14
16 1933 6 14
11 1933 6 14
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 7
111 888 4 7
88 888 4 7
10 888 4 7
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1', header = TRUE)
# ordered vector of unique values for "Day"
ord_day <- unique(df$Day[order(df$Day)])
ord_day
[1] 885 886 887 888 1798 1799 1800 1930 1931 1932 1933
2. Generate list of data frames, for each Day
For each element in ord_day we introduce a data.frame as element of a list called df_new_aug.
Such data frames are defined through a for loop for all values in ord_day except ord_day[2] and ord_day[1] which are treated separately.
Idea behind the looping: for each unique ord_day[i] with i > 2 we check which days between ord_day[i-1] and ord_day[i-2] (or both!) contribute (through the variable "Count") to the value "Count_Group" at ord_day[i].
We therefore introduce if else statements in the loop.
Here we go
# Iterative generation of the list of data.frames (for days > 886)
#-----------------------------------------------------------------
df_new <- list()
df_new_aug <- list()
# we exclude cases i=1, 2: they are manually treated below
for ( i in 3: length(ord_day) ) {
  # is "Count_group" for ord_day[i] equal to the sum of "Count" at ord_day[i], ord_day[i-1] and ord_day[i-2]?
  if ( unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
       unique(df[df$Day == ord_day[i-1], "Count"]) + unique(df[df$Day == ord_day[i-2], "Count"])
  ) {
    # we create columns ID | Day | Count
    df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1] | df$Day == ord_day[i-2],
                                 c("ID", "Day", "Count")])
    # we append the Count_group of the Day in ord_day[i]
    df_new_aug[[i]] <- data.frame( df_new[[i]],
                                   Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
  } else if (unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
             unique(df[df$Day == ord_day[i-1], "Count"]) ) { #only "Count" at i and i-1 contribute to "Count_group" at i
    df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1],
                                 c("ID", "Day", "Count")])
    # we append the Count_group of the Day in ord_day[i]
    df_new_aug[[i]] <- data.frame(df_new[[i]],
                                  Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
  } else { #only "Count" at i contributes to "Count_group" at i
    df_new[[i]] <- data.frame(df[df$Day == ord_day[i],
                                 c("ID", "Day", "Count")])
    # we append the Count_group of the Day in ord_day[i]
    df_new_aug[[i]] <- data.frame(df_new[[i]],
                                  Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
  }
#closing the for loop
}
# for ord_day[2] = "886" (both "Count" at i =2 and i = 1 contribute to "Count_group" at i=2)
#-------------------------------------------------------------------------------------
df_new[[2]] <- data.frame(df[df$Day == ord_day[2] | df$Day == ord_day[1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[2]
df_new_aug[[2]] <- data.frame(df_new[[2]],
Count_group = rep(unique(df[df$Day == ord_day[2], "Count_group"]), nrow(df_new[[2]]) ) )
# for ord_day[1] = "885" (only "count" at i = 1 contributes to "Count_group" at i =1)
#------------------------------------------------------------------------------------
df_new[[1]] <- data.frame(df[df$Day == ord_day[1], c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[1]] <- data.frame(df_new[[1]], Count_group = rep(unique(df[df$Day == ord_day[1], "Count_group"]), nrow(df_new[[1]]) ) )
# produced list
df_new_aug
3. Generate final data frame, collapsing the list in 2.
We collapse df_new_aug through an ugly loop, but other solutions (for example with Reduce() and merge()) are possible:
# merging the list (mechanically): final result
df_result <- df_new_aug[[1]]
for (i in seq_len(length(df_new_aug) - 1)){
  df_result <- rbind(df_result, df_new_aug[[i+1]])
}
This yields df_result, which completes the analysis.
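A shorter way to collapse a list of data frames with identical columns is do.call(rbind, ...), which avoids hard-coding the list length. The sketch below uses a small hypothetical list, df_list, standing in for df_new_aug.

```r
# Hypothetical stand-in for df_new_aug: a list of data frames with the same columns
df_list <- list(
  data.frame(ID = 56, Day = 885, Count = 1, Count_group = 1),
  data.frame(ID = c(8, 56), Day = c(886, 885), Count = c(1, 1), Count_group = c(2, 2))
)

# Bind all list elements row-wise in a single call
df_result <- do.call(rbind, df_list)
df_result
```

With the list from step 2 this would read df_result <- do.call(rbind, df_new_aug).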

Related

Join data frame into one in r

I have 4 data frames that all look like this:
Product 2018  Number  Minimum  Maximum
1             56      1        5
2             42      12       16
3             6523    23       56
4             123     23       102
5             56      23       64
6             245623  56       87
7             546     25       540
8             54566   253      560
Product 2019  Number  Minimum  Maximum
1             56      32       53
2             642     423      620
3             56423   432      560
4             3       431      802
5             2       2        6
6             4523    43       68
7             555     23       54
8             55646   3        6
Product 2020  Number  Minimum  Maximum
1             23      2        5
2             342     4        16
3             223     3        5
4             13      4        12
5             2       4        7
6             223     7        8
7             5       34       50
8             46      3        6
Product 2021  Number  Minimum  Maximum
1             234     3        5
2             3242    4        16
3             2423    43       56
4             123     43       102
5             24      4        6
6             2423    4        18
7             565     234      540
8             5646    23       56
I want to join all the tables so I get a table that looks like this:
Products  Number 2021  Min-Max 2021  Number 2020  Min-Max 2020  Number 2019  Min-Max 2019  Number 2018  Min-Max 2018
1         234          3 to 5        23           2 to 5        ...          ...           ...          ...
2         3242         4 to 16       342          4 to 16       ...          ...           ...          ...
3         2423         43 to 56      223          3 to 5        ...          ...           ...          ...
4         123          43 to 102     13           4 to 12       ...          ...           ...          ...
5         24           4 to 6        2            4 to 7        ...          ...           ...          ...
6         2423         4 to 18       223          7 to 8        ...          ...           ...          ...
7         565          234 to 540    5            34 to 50      ...          ...           ...          ...
8         5646         23 to 56      46           3 to 6        ...          ...           ...          ...
The products are the same for all years, so I would like a data frame that contains the Number for each year as its own column and combines the Minimum and Maximum columns of each year into one.
Any help is welcome!
How about something like this? You are trying to join several data frames by a single column, which is relatively straightforward using full_join. The difficulty is that you are also trying to extract information from the column names and combine several columns at the same time. I would map everything you want to do over the list of data frames and then reduce the list at the end. Here is an example with two data frames, but you could add as many as you want to the list at the beginning.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = trimws(gsub("Product", "", names(.x)[1]))) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
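The same join can also be sketched in base R with Reduce() and merge(), without the tidyverse. This is an illustrative alternative rather than either answer's method: it assumes the first column has already been renamed to Product and that the year is supplied through the (hypothetical) names of the list tables instead of being parsed from the column name. Only the first three products of 2018 and 2019 are used.

```r
# First three rows of the 2018 and 2019 tables, with the first column renamed to Product
df_2018 <- data.frame(Product = 1:3, Number = c(56, 42, 6523),
                      Minimum = c(1, 12, 23), Maximum = c(5, 16, 56))
df_2019 <- data.frame(Product = 1:3, Number = c(56, 642, 56423),
                      Minimum = c(32, 423, 432), Maximum = c(53, 620, 560))
tables <- list("2018" = df_2018, "2019" = df_2019)  # list names carry the year

# Build one Number_<year> and one MinMax_<year> column per table
prepared <- lapply(names(tables), function(year) {
  x <- tables[[year]]
  out <- data.frame(Product = x$Product,
                    Number = x$Number,
                    MinMax = paste(x$Minimum, "to", x$Maximum),
                    stringsAsFactors = FALSE)
  names(out)[2:3] <- paste0(c("Number_", "MinMax_"), year)
  out
})

# Full join of all prepared tables on Product
wide <- Reduce(function(a, b) merge(a, b, by = "Product", all = TRUE), prepared)
wide
```

Adding the 2020 and 2021 tables to the list extends the result by two more pairs of columns.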

Combining grouping and filtering on a dataframe to plot in ggplot and shiny

I am creating a shiny app that tracks various stats of 6 teams in a competition over 6 years. The df is as follows:
Year Pos Team P W L D GF GA GD G. BP Pts
1 2017 1 Southern Steel 15 15 0 0 1062 812 250 130.8 0 30
2 2017 2 Central Pulse 15 9 6 0 783 756 27 103.6 2 20
3 2017 3 Northern Mystics 15 8 7 0 878 851 27 111.3 3 19
4 2017 4 Waikato Bay of Plenty Magic 15 7 8 0 873 848 25 103.0 5 19
5 2017 5 Northern Stars 15 4 11 0 738 868 -130 85.0 1 9
6 2017 6 Mainland Tactix 15 2 13 0 676 875 -199 77.3 2 6
7 2018 1 Central Pulse 15 12 3 0 850 679 171 125.2 3 27
8 2018 2 Southern Steel 15 10 5 0 874 866 8 100.9 2 22
9 2018 3 Mainland Tactix 15 7 8 0 746 761 -15 98.0 5 19
10 2018 4 Northern Mystics 15 7 8 0 783 796 -13 98.4 3 17
11 2018 5 Waikato Bay of Plenty Magic 15 5 10 0 804 878 -74 91.6 3 13
12 2018 6 Northern Stars 15 4 11 0 832 909 -77 91.5 5 13
13 2019 1 Central Pulse 15 13 2 0 856 676 180 126.6 0 39
14 2019 2 Southern Steel 15 12 3 0 946 809 137 116.9 2 38
15 2019 3 Northern Stars 15 6 9 0 785 840 -55 93.5 3 21
16 2019 4 Waikato Bay of Plenty Magic 15 5 10 0 713 793 -80 89.9 0 15
17 2019 5 Mainland Tactix 15 5 10 0 740 849 -109 87.2 0 15
18 2019 6 Northern Mystics 15 4 11 0 786 859 -73 91.5 2 14
19 2020 1 Central Pulse 15 11 2 2 594 474 120 125.3 1 49
20 2020 2 Mainland Tactix 15 9 4 2 606 566 40 107.1 2 42
21 2020 3 Northern Mystics 15 7 6 2 582 475 7 101.2 3 35
22 2020 4 Northern Stars 15 5 7 3 590 626 -36 94.2 3 29
23 2020 5 Southern Steel 15 4 10 1 578 637 -59 90.7 3 21
24 2020 6 Waikato Bay of Plenty Magic 15 2 9 4 520 592 -72 87.8 3 19
25 2021 1 Northern Mystics 15 11 4 0 924 878 46 105.2 4 37
26 2021 2 Southern Steel 15 11 4 0 813 801 12 101.5 2 35
27 2021 3 Mainland Tactix 15 9 6 0 801 775 26 103.4 4 31
28 2021 4 Northern Stars 15 9 6 0 825 791 34 104.3 2 29
29 2021 5 Central Pulse 15 4 11 0 789 810 -21 97.4 8 20
30 2021 6 Waikato Bay of Plenty Magic 15 1 15 0 807 904 -97 89.3 6 9
31 2022 1 Central Pulse 15 10 5 0 828 732 96 113.1 4 34
32 2022 2 Northern Stars 15 11 4 0 836 783 53 106.8 1 34
33 2022 3 Northern Mystics 15 9 6 0 858 807 51 106.3 4 31
34 2022 4 Southern Steel 15 6 9 0 853 898 -45 95.0 2 20
35 2022 5 Waikato Bay of Plenty Magic 15 4 11 0 733 803 -70 91.3 4 16
36 2022 6 Mainland Tactix 15 5 0 0 788 873 -85 90.3 1 16
I need 3 graphs:
A stacked bar chart showing wins/draws/losses for each team across the 6 years.
A line chart showing the position of each team at the end of each of the 6 years.
A bubble chart showing total goals for/ goals against for each team across all 6 years, with total wins dictating size of the plots.
I also need to be able to filter the data for these graphs with a checkbox for choosing teams and a slider to select the year range.
I have got a stacked bar chart which can not be filtered - I can't figure out how to group the original df by team AND have it connected to the reactive filter I have. Currently the graph is connected to a melted df which is no good as I need the reactive filtered one defined in the function. The graph is also a bit ugly - how can I flip the chart so that wins are on bottom and draws are on top?
The second chart is all good.
The third chart again I need to group the data so that I have total stats across the 6 years- currently there are 36 bubbles but I only want 6.
Screenshots of shiny app output: https://imgur.com/a/qzqlUob
Code:
library(ggplot2)
library(shiny)
library(dplyr)
library(reshape2)
library(scales)
df <- read.csv("ANZ_Premiership_2017_2022.csv")
teams <- c("Central Pulse", "Northern Stars", "Northern Mystics",
"Southern Steel", "Waikato Bay of Plenty Magic", "Mainland Tactix")
mdf <- melt(df %>%
group_by(Team) %>% summarise(Wins = sum(W),
Losses = sum(L),
Draws = sum(D)),
id.vars = "Team")
ui <- fluidPage(
titlePanel("ANZ Premiership Analysis"),
sidebarLayout(
sidebarPanel(
checkboxGroupInput("teams",
"Choose teams",
choices = teams,
selected = teams),
sliderInput("years",
"Choose years",
sep="",
min=2017, max=2022, value=c(2017,2022))
),
mainPanel(
h2("Chart Tabs"),
tabsetPanel(
tabPanel("Wins/ Losses/ Draws", plotOutput("winLoss")),
tabPanel("Standings", plotOutput("standings")),
tabPanel("Goals", plotOutput ("goals"))
)
)
)
)
server <- function(input, output){
filterTeams <- reactive({
df.selection <- filter(df, Team %in% input$teams, Year %in% (input$years[1]:input$years[2]))
})
output$winLoss <- renderPlot({
ggplot(mdf, mapping=aes(Team, value, fill=variable))+
geom_bar(stat = "identity", position = "stack")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
ylab("Wins")+
xlab("Team")
})
output$standings <- renderPlot({
filterTeams() %>%
ggplot(aes(x=Year, y=Pos, group=Team, color=Team)) +
geom_line(size=1.25) +
geom_point(size=2.5)+
ggtitle("Premiership Positions") +
ylab("Position")
})
output$goals <- renderPlot({
filterTeams()%>%
ggplot(aes(GF, GA, size=W, color=Team))+
geom_point(alpha=0.7)+
scale_size(range=c(5,15),name = "Wins")+
xlab("Goals for")+
ylab("Goals against")
})
}
shinyApp(ui = ui, server = server)
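One possible direction for the grouping problem, sketched in base R outside of Shiny: aggregate the filtered data to one row per team, then reshape the totals to long format for the stacked bars. The data frame filtered is a hypothetical stand-in for the value returned by filterTeams(), using a few rows from the table above; totals and long are illustrative names.

```r
# Hypothetical stand-in for filterTeams(): two teams, two seasons
filtered <- data.frame(
  Team = c("Central Pulse", "Central Pulse", "Southern Steel", "Southern Steel"),
  Year = c(2017, 2018, 2017, 2018),
  W  = c(9, 12, 15, 10),
  L  = c(6, 3, 0, 5),
  D  = c(0, 0, 0, 0),
  GF = c(783, 850, 1062, 874),
  GA = c(756, 679, 812, 866)
)

# One row per team: totals over the selected years (one bubble per team)
totals <- aggregate(cbind(W, L, D, GF, GA) ~ Team, data = filtered, FUN = sum)

# Long format for the stacked bar chart; the factor levels set the stacking order
long <- data.frame(
  Team = rep(totals$Team, times = 3),
  variable = factor(rep(c("Wins", "Losses", "Draws"), each = nrow(totals)),
                    levels = c("Draws", "Losses", "Wins")),
  value = c(totals$W, totals$L, totals$D)
)
totals
long
```

Inside the app, the same two steps would presumably run within each renderPlot() on filterTeams(), so that the plots react to the checkbox and slider; reordering the factor levels is one way to experiment with which category ends up at the bottom of the stack.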

How can I write a command in R that groups by multiple criteria?

I am looking for a function where I can classify my data into five different industries given their SIC code
Permno SIC Industry
1 854
2 977
3 549
4 1231
5 3295
6 2000
7 1539
8 2549
9 3950
10 4758
11 4290
12 5498
13 5248
14 142
15 3209
16 2759
17 4859
18 2569
19 739
20 4529
It could be that all SICS between 100-200 and 400-700 should be in Industry 1, all SICs between 300-350 and 980-1020 should be in Industry 2 etc.
So in short - an 'If = or' function where I could list all the SICs that could match a given industry
Thank you!
You can add a new column and fill it by filtering on the SIC number.
For example:
data$Group <- 0
data$Group[data$SIC < 1000] <- 1
data$Group[data$SIC >= 1000] <- 2
Floor the value after dividing the SIC value by 1000:
df$Industry <- floor(df$SIC/1000) + 1
df
# Permno SIC Industry
#1 1 854 1
#2 2 977 1
#3 3 549 1
#4 4 1231 2
#5 5 3295 4
#6 6 2000 3
#7 7 1539 2
#8 8 2549 3
#9 9 3950 4
#10 10 4758 5
#11 11 4290 5
#12 12 5498 6
#13 13 5248 6
#14 14 142 1
#15 15 3209 4
#16 16 2759 3
#17 17 4859 5
#18 18 2569 3
#19 19 739 1
#20 20 4529 5
If there is no way to programmatically define groups you may need to individually define the ranges. It is convenient to do this with case_when in dplyr.
library(dplyr)
df %>%
mutate(Industry = case_when(between(SIC, 100, 200) | between(SIC, 400, 700) ~ 'Industry 1',
between(SIC, 300, 350) | between(SIC, 980, 1020) ~ 'Industry 2'))
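If the list of ranges gets long, another option is to keep the ranges in a lookup table and classify against it. The following base R sketch uses the two example industries from the question; ranges and classify_sic are hypothetical names, and the bounds are treated as inclusive on both sides, like between().

```r
# Lookup table of SIC ranges (inclusive bounds), one row per range
ranges <- data.frame(
  lower    = c(100, 400, 300, 980),
  upper    = c(200, 700, 350, 1020),
  industry = c("Industry 1", "Industry 1", "Industry 2", "Industry 2"),
  stringsAsFactors = FALSE
)

# Return the industry of the first matching range, or NA if no range matches
classify_sic <- function(sic, ranges) {
  vapply(sic, function(s) {
    hit <- which(s >= ranges$lower & s <= ranges$upper)
    if (length(hit) == 0) NA_character_ else ranges$industry[hit[1]]
  }, character(1))
}

sic_codes <- c(142, 549, 320, 1000, 854)
industry <- classify_sic(sic_codes, ranges)
industry
```

SIC codes that fall outside every listed range come back as NA, which makes unclassified rows easy to spot.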

find max column value in r conditional on another column

I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
The first filter keeps only the rows where AB is greater than or equal to 100, and the second filter selects the entire row where avg is at its maximum. If you only want to get the maximum value, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))
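The same two steps can be sketched in base R. The data frame batting below is made up for illustration (it is not the question's data), and the sketch uses a strict AB > 100 as worded in the question.

```r
# Made-up example data with one NaN average
batting <- data.frame(
  playerID = c("a", "b", "c", "d"),
  AB       = c(716, 80, 150, 101),
  average  = c(0.296, 0.450, NaN, 0.310),
  stringsAsFactors = FALSE
)

# Keep only players with more than 100 at-bats
eligible <- batting[batting$AB > 100, ]

# Maximum average among them, ignoring NaN
max_avg <- max(eligible$average, na.rm = TRUE)

# Entire row of the player with that maximum (which.max skips NA/NaN)
best_row <- eligible[which.max(eligible$average), ]
max_avg
best_row
```

Player "b" has the highest average overall but is excluded because of the at-bats condition.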

Adding a column according to norm data in R

I have a longitudinal dataset in the long form with the length of around 2800, with around 400 participants in total. Here's a sample of my data.
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 30 2 65 16
#8 1003 2 30 2 66 16
#9 1003 3 29 2 67 16
#10 1003 4 28 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
I want to create a new column "cutoff" with values "Normal" or "Impaired" because my outcome data, "score" has a cutoff score indicating impairment according to norm. The norm consists of different -1SD measures(the cutoff point) according to Sex, Edu(year of education), and Age.
Below is what I'm currently doing, checking an excel file myself and putting in the corresponding cutoff score according to the three conditions. First of all, I am not sure if I am creating the right column.
data$cutoff <- ifelse(data$sex==1 & data$age<70
& data$edu<3
& data$score<19.91, "Impaired", "Normal")
data$cutoff <- ifelse(data$sex==2 & data$age<70
& data$edu<3
& data$score<18.39, "Impaired", "Normal")
Additionally, I am wondering if I can import an excel file stating the norm, and create a column according to the values in it.
The excel file has a structure as shown below.
# Sex Male Female
#60-69 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
#Age Number 22 51 119 72 130 138 106 51
# Mean 24.45 26.6 27.06 27.83 23.31 25.86 27.26 28.09
# SD 3.03 1.89 1.8 1.53 3.28 2.55 1.85 1.44
# -1.5SD' 19.92 23.27 23.76 24.8 18.53 21.81 23.91 25.15
#70-79 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
....
I have created new columns "agecat" and "educat," allocating each ID into a group of age and education used in the norm. Now I want to make use of these columns, matching it with rows and columns of the excel file above. One of the motivations is to create a code that can be used for further research using the test scores of my data.
I think your ifelse statements should work fine, but I would definitely import the Excel file rather than hardcoding it, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:
merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
Then you can compare the columns directly.
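A minimal sketch of that join and comparison, assuming the norm table has been restructured into long form. The column names (agecat, educat, cutoff_score) and the three participants are hypothetical; the cutoff values are taken from the -1.5SD row of the 60-69 excerpt in the question.

```r
# Hypothetical participants with their age and education categories
long_df <- data.frame(
  ID     = c(1001, 1003, 1005),
  sex    = c(1, 2, 2),
  agecat = c("60-69", "60-69", "60-69"),
  educat = c("7-12", "13>=", "7-12"),
  score  = c(28, 24, 20),
  stringsAsFactors = FALSE
)

# Norm table in long form: one row per sex / age / education group
norm_df <- data.frame(
  sex          = c(1, 2, 2),
  agecat       = c("60-69", "60-69", "60-69"),
  educat       = c("7-12", "7-12", "13>="),
  cutoff_score = c(23.76, 23.91, 25.15),
  stringsAsFactors = FALSE
)

# Left join on the three grouping columns, then compare score with the cutoff
merged <- merge(long_df, norm_df, by = c("sex", "agecat", "educat"), all.x = TRUE)
merged$cutoff <- ifelse(merged$score < merged$cutoff_score, "Impaired", "Normal")
merged
```

Rows whose group is missing from the norm table would get NA in cutoff_score and therefore NA in cutoff, which flags gaps in the norm data.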
If I understand correctly, the OP wants to mark a certain type of outliers in his dataset. So, there are two tasks here:
1. Compute the statistics mean(score), sd(score), and cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and edu category educat.
2. Find all rows where score is lower than the cutoff value for the particular group.
As already mentioned by hannes101, the second step can be implemented by a non-equi join.
library(data.table)
# categorize age and edu (left closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score),
m1.5SD = mean(score) - 1.5 * sd(score)),
by = .(sex, agecat, educat)]
# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]
mydata
ID wave score sex age edu agecat educat cutoff
1: 1001 1 28 1 69 12 [60,70) [7,13) Normal
2: 1001 2 27 1 70 12 [70,80) [7,13) Normal
3: 1001 3 28 1 71 12 [70,80) [7,13) Normal
4: 1001 4 26 1 72 12 [70,80) [7,13) Normal
5: 1002 1 30 2 78 9 [70,80) [7,13) Normal
6: 1002 3 30 2 80 9 [80,90) [7,13) Normal
7: 1003 1 33 2 65 16 [60,70) [13,Inf) Normal
8: 1003 2 32 2 66 16 [60,70) [13,Inf) Normal
9: 1003 3 31 2 67 16 [60,70) [13,Inf) Normal
10: 1003 4 24 2 68 16 [60,70) [13,Inf) Impaired
11: 1004 1 22 2 85 4 [80,90) [4,7) Normal
12: 1005 1 20 2 60 9 [60,70) [7,13) Normal
13: 1005 2 18 1 61 9 [60,70) [7,13) Normal
14: 1006 1 22 1 74 9 [70,80) [7,13) Normal
15: 1006 2 23 1 75 9 [70,80) [7,13) Normal
16: 1006 3 25 1 76 9 [70,80) [7,13) Normal
17: 1006 4 19 1 77 9 [70,80) [7,13) Normal
18: 1007 1 33 2 65 16 [60,70) [13,Inf) Normal
19: 1007 2 32 2 66 16 [60,70) [13,Inf) Normal
20: 1007 3 31 2 67 16 [60,70) [13,Inf) Normal
21: 1007 4 31 2 68 16 [60,70) [13,Inf) Normal
ID wave score sex age edu agecat educat cutoff
In this made-up example there is only one row which meets the "Impaired" conditions.
Likewise, the table of statistics is rather sparsely populated:
cutoffs
sex agecat educat N Mean SD m1.5SD
1: 1 [60,70) [7,13) 2 23.00000 7.071068 12.39340
2: 1 [70,80) [7,13) 7 24.28571 3.147183 19.56494
3: 2 [70,80) [7,13) 1 30.00000 NA NA
4: 2 [80,90) [7,13) 1 30.00000 NA NA
5: 2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6: 2 [80,90) [4,7) 1 22.00000 NA NA
7: 2 [60,70) [7,13) 1 20.00000 NA NA
Data
OP's sample dataset has been modified in one group for demonstration.
library(data.table)
mydata <- fread("
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 33 2 65 16
#8 1003 2 32 2 66 16
#9 1003 3 31 2 67 16
#10 1003 4 24 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
#18 1007 1 33 2 65 16
#19 1007 2 32 2 66 16
#20 1007 3 31 2 67 16
#21 1007 4 31 2 68 16
", drop = 1L)
