What is this arrange function in the second line doing here?

I'm currently working through R for Data Science and came across this chunk of code.
My question is as follows: I don't understand the necessity of the arrange function here. Doesn't the arrange function just reorder the rows?
library(tidyverse)
library(nycflights13)
flights %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  mutate(delay_gt1hr = dep_delay > 60) %>%
  mutate(before_delay = cumsum(delay_gt1hr)) %>%
  filter(before_delay < 1) %>%
  count(sort = TRUE)
However, the output does differ with and without the arrange call, as shown below:
#with the arrange function
tailnum n
<chr> <int>
1 N954UW 206
2 N952UW 163
3 N957UW 142
4 N5FAAA 117
5 N38727 99
6 N3742C 98
7 N5EWAA 98
8 N705TW 97
9 N765US 97
10 N635JB 94
# ... with 3,745 more rows
and
#Without the arrange function
tailnum n
<chr> <int>
1 N952UW 215
2 N315NB 161
3 N705TW 160
4 N961UW 139
5 N713TW 128
6 N765US 122
7 N721TW 120
8 N5FAAA 117
9 N945UW 104
10 N19130 101
# ... with 3,774 more rows
I'd appreciate it if you could help me understand this. Why is it necessary to include the arrange function here?

Yes, arrange just orders the rows, but you then compute a cumulative sum and filter on it. A cumulative sum depends on row order, so reordering the rows beforehand changes which rows pass the filter.
Here is a simplified example to demonstrate how the output differs with and without arrange.
library(dplyr)
df <- data.frame(a = 1:5, b = c(7, 8, 9, 1, 2))
df %>% filter(cumsum(b) < 20)
# a b
#1 1 7
#2 2 8
df %>% arrange(b) %>% filter(cumsum(b) < 20)
# a b
#1 4 1
#2 5 2
#3 1 7
#4 2 8
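The same ordering can also be made explicit inside the grouped pipeline: dplyr's arrange() takes a .by_group argument, so you can sort chronologically within each tail number right before the cumulative sum (a sketch of the question's pipeline, assuming the nycflights13 data):
library(dplyr)
library(nycflights13)
flights %>%
  group_by(tailnum) %>%
  # sort within each group so cumsum() runs in date order
  arrange(year, month, day, .by_group = TRUE) %>%
  mutate(delay_gt1hr = dep_delay > 60,
         before_delay = cumsum(delay_gt1hr)) %>%
  filter(before_delay < 1) %>%
  count(sort = TRUE)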


mutate across with vectorized function parameters

I know the "across" paradigm is "many columns, one function" so this might not be possible. The idea is that I want to apply the same function to several columns with a parameter varying based on the column.
I got this to work using cur_column(), but it basically amounts to computing the parameters one by one rather than supplying a vector of parameters, one per column.
This first block produces what I want, but I'm wondering if there's a cleaner way.
library(dplyr)
df = data.frame(column1 = 1:100, column2 = 1:100)
parameters = data.frame(
  column_names = c('column1','column2'),
  parameters = c(10,100))
custom_function = function(x, addend){
  x + addend
}
df2 = df %>% mutate(
  across(parameters$column_names,
         ~ custom_function(., addend = parameters %>%
             filter(column_names == cur_column()) %>%
             pull(parameters))))
What I would like the last line to look like is:
df2 = df %>% mutate(
  across(parameters$column_names, ~ custom_function(., addend = parameters$parameters)))
We can do this in base R with mapply:
mapply(`+`, df[,parameters$column_names], parameters$parameters)
##> column1 column2
##> [1,] 11 101
##> [2,] 12 102
##> [3,] 13 103
##> [4,] 14 104
##> [5,] 15 105
##> ...
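Note that mapply() returns a matrix here rather than a data.frame; if you want to keep a data.frame, one option is to assign the result back into a copy (a minimal sketch using the question's objects):
df2 <- df
# replaces the selected columns with column + its parameter
df2[parameters$column_names] <- mapply(`+`, df[parameters$column_names], parameters$parameters)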
I think a mapping function operating on the parameters would be easier than across on the data:
library(tidyverse)
with(parameters, as_tibble(map2(df[column_names], parameters, custom_function)))
#> # A tibble: 100 x 2
#> column1 column2
#> <dbl> <dbl>
#> 1 11 101
#> 2 12 102
#> 3 13 103
#> 4 14 104
#> 5 15 105
#> 6 16 106
#> 7 17 107
#> 8 18 108
#> 9 19 109
#> 10 20 110
#> # ... with 90 more rows
Created on 2022-12-15 with reprex v2.0.2
We could either use match on the column name (cur_column()) with the column_names column of 'parameters', extracting the corresponding 'parameters' value as the input to custom_function:
library(dplyr)
df %>%
mutate(across(all_of(parameters$column_names),
~ custom_function(.x, parameters$parameters[match(cur_column(),
parameters$column_names)])))
-output
column1 column2
1 11 101
2 12 102
3 13 103
4 14 104
5 15 105
6 16 106
7 17 107
8 18 108
...
Or convert the two-column data.frame to a named vector (deframe) and directly extract the value by name:
library(tibble)
params <- deframe(parameters)
df %>%
  mutate(across(all_of(names(params)),
                ~ custom_function(.x, params[cur_column()])))
-output
column1 column2
1 11 101
2 12 102
3 13 103
4 14 104
5 15 105
6 16 106
...
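A closely related sketch uses purrr::imap(), which passes each column's name as .y, so the by-name lookup happens without cur_column() (assumes the params named vector from the deframe step above):
library(purrr)
df2 <- df
# .x is the column, .y its name; params[.y] is that column's addend
df2[names(params)] <- imap(df[names(params)], ~ custom_function(.x, params[.y]))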
Edit (this doesn't work on the second column). I don't think across is what you want here. One of the other answers is better.
You could also just add the column to your function like this:
library(dplyr)
df = data.frame(column1 = 1:100, column2 = 1:100)
parameters = data.frame(
  column_names = c('column1','column2'),
  parameters = c(10,100))
custom_function = function(x, column, parameters){
  addend = parameters %>%
    filter(column_names == column) %>%
    pull(parameters)
  x + addend
}
df2 = df %>% mutate(
  across(parameters$column_names, custom_function, cur_column(), parameters))

Create numerical discrete values when values in a column are equal in R

I have a column of IDs in a dataframe that sometimes has duplicates, take for example,
ID
209
315
109
315
451
209
What I want to do is take this column and create another column that indicates which ID the row belongs to, i.e. I want it to look like:
ID   ID Category
209  1
315  2
109  3
315  2
451  4
209  1
Essentially, I want to loop through the IDs: if an ID equals a previous one, indicate that it is from the same ID, and if it is a new ID, create a new indicator for it.
Does anyone know a quick function in R that could do this? Or have any other suggestions?
Convert to factor with levels ordered by first appearance in the data set (unique) and then to numeric:
data$IDCategory <- as.numeric(factor(data$ID, levels = unique(data$ID)))
#> data
# ID IDCategory
#1 209 1
#2 315 2
#3 109 3
#4 315 2
#5 451 4
#6 209 1
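A base R one-liner with the same order-of-appearance numbering is match() against unique() (a small sketch on the question's data):
data$IDCategory <- match(data$ID, unique(data$ID))
# gives 1 2 3 2 4 1, since unique() keeps first-appearance order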
library(tidyverse)
data <- tibble(ID = c(209,315,109,315,451,209))
data %>%
  left_join(
    data %>%
      distinct(ID) %>%
      mutate(`ID Category` = row_number())
  )
#> Joining, by = "ID"
#> # A tibble: 6 × 2
#> ID `ID Category`
#> <dbl> <int>
#> 1 209 1
#> 2 315 2
#> 3 109 3
#> 4 315 2
#> 5 451 4
#> 6 209 1
Created on 2022-03-10 by the reprex package (v2.0.0)
df <- df %>%
  dplyr::mutate(`ID Category` = as.numeric(interaction(ID, drop=TRUE)))
(Note that interaction() numbers groups by sorted factor level rather than by order of appearance, so here 109 would get 1, not 3.)
Answer with data.table:
library(data.table)
df <- as.data.table(df)
df[, `ID Category` := as.numeric(interaction(ID, drop=TRUE))]
The pro of this solution is that you can create a unique ID for a group of variables. Here you only need ID, but if you wanted a unique ID for, say, the pair [ID, Location], you could:
data <- tibble(ID= c(209,209,209,315,315,315), Location = c("A","B","C","A","A","B"))
data <- data %>%
  dplyr::mutate(`ID Category` = as.numeric(interaction(ID, Location, drop=TRUE)))
Another way:
merge(data,
      data.frame(ID = unique(data$ID),
                 ID.Category = seq_along(unique(data$ID))),
      sort = F)
# ID ID.Category
# 1 209 1
# 2 209 1
# 3 315 2
# 4 315 2
# 5 109 3
# 6 451 4
data:
tibble(ID = c(209,315,109,315,451,209)) -> data

R output BOTH maximum and minimum value by group in dataframe

Let's say I have a dataframe of Name and Value. Is there any way to extract BOTH the minimum and maximum values within each Name in a single function?
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range to get the min and max values and use it in summarise to get separate rows for each Name.
library(dplyr)
df %>%
  group_by(Name) %>%
  summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
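As a side note, returning multiple rows per group from summarise() is deprecated in dplyr >= 1.1.0; the same computation is expected to go through reframe() (a sketch, assuming a recent dplyr):
library(dplyr)
df %>%
  reframe(Value = range(Value), .by = Name)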
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
  group_by(Name) %>%
  summarise(
    maximum = max(Value),
    minimum = min(Value)
  )
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
What's a little odd is that my original df object looks a little different from yours, in spite of the seed:
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
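The difference is almost certainly the sampling algorithm change in R 3.6.0; if the question's data was drawn under the old algorithm, it should be reproducible via the sample.kind argument (a hedged sketch; which draw you get depends on the R version that produced the original):
library(dplyr)
# request the pre-3.6.0 sample() algorithm for this seed
set.seed(1, sample.kind = "Rounding")
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))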
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best way or the most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
      df %>% group_by(Name) %>% slice_min(Value)) %>%
  arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply/stack: do a grouped tapply to get the ranges as a named list, stack it into a two-column data.frame, and change the column names if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
Using aggregate:
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82
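One caveat: aggregate() stores the range as a matrix column, which prints as Value.1/Value.2 but is a single column named Value; wrapping the result in do.call(data.frame, ...) flattens it into ordinary columns (a small sketch):
out <- aggregate(Value ~ Name, df, range)
# splits the matrix column into separate Value.1 / Value.2 columns
do.call(data.frame, out)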

Subtracting columns in a dataframe (or matrix)

I am trying to do less in Excel and more in R, but I get stuck on a simple calculation. I have a dataframe with meter readings over a number of weeks. I need to calculate the consumption in each week, i.e. subtract the previous reading column from each column. For instance, in the example below I need to subtract Reading1 from Reading2 and Reading2 from Reading3. My actual data set contains hundreds of readings, so I need an easy way to do this.
SerialNo = c(1,2,3,4,5)
Reading1 = c(100, 102, 119, 99, 200)
Reading2 = c(102, 105, 120, 115, 207)
Reading3 = c(107, 109, 129, 118, 209)
df <- data.frame(SerialNo, Reading1, Reading2, Reading3)
df
SerialNo Reading1 Reading2 Reading3
1 1 100 102 107
2 2 102 105 109
3 3 119 120 129
4 4 99 115 118
5 5 200 207 209
Here's a tidyverse solution that returns a data frame with similar formatting. It converts the data to long format (pivot_longer), applies the lag function, does the subtraction and then widens back to the original format (pivot_wider).
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(Reading1:Reading3,
               names_to = "reading",
               names_prefix = "Reading",
               values_to = "value") %>%
  group_by(SerialNo) %>%
  mutate(offset = lag(value, 1),
         measure = value - offset) %>%
  select(SerialNo, reading, measure) %>%
  pivot_wider(names_from = reading,
              values_from = measure,
              names_prefix = "Reading")
# A tibble: 5 x 4
# Groups: SerialNo [5]
SerialNo Reading1 Reading2 Reading3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 2 5
2 2 NA 3 4
3 3 NA 1 9
4 4 NA 16 3
5 5 NA 7 2
df[,paste0(names(df)[3:4], names(df)[2:3])] <- df[,names(df)[3:4]] - df[,names(df)[2:3]]
df
SerialNo Reading1 Reading2 Reading3 Reading2Reading1 Reading3Reading2
1 1 100 102 107 2 5
2 2 102 105 109 3 4
3 3 119 120 129 1 9
4 4 99 115 118 16 3
5 5 200 207 209 7 2
PS: I assume the columns are ordered 1, 2, 3, etc.
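Since the question mentions hundreds of readings, the same idea generalizes by indexing all reading columns at once (a sketch assuming column 1 is SerialNo and the readings are in chronological order; the Diff names are just illustrative):
rd <- 2:ncol(df)                       # positions of the reading columns
df[paste0("Diff", seq_along(rd[-1]))] <- df[rd[-1]] - df[rd[-length(rd)]]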
We can use apply row-wise to calculate the difference between consecutive columns.
temp <- t(apply(df[-1], 1, diff))
df[paste0('ans', seq_len(ncol(temp)))] <- temp
df
# SerialNo Reading1 Reading2 Reading3 ans1 ans2
#1 1 100 102 107 2 5
#2 2 102 105 109 3 4
#3 3 119 120 129 1 9
#4 4 99 115 118 16 3
#5 5 200 207 209 7 2
Another option is to use a simple for loop over the columns of your data frame. I think this solution can be easier to understand, especially if you are starting out with R.
# Create a data frame with the same rows as df and number of columns minus 1
resul <- as.data.frame(matrix(nrow = nrow(df), ncol = (ncol(df) - 1)))
# Add the SerialNo column as the first column of the results data frame
resul[, 1] <- df[, 1]
# Set the name of the first column to SerialNo (the first colname of df)
colnames(resul)[1] <- colnames(df)[1]
# Loop over the Reading columns of df (from the second column to the last minus 1)
for (i in 2:(ncol(df) - 1)) {
  # Do the subtraction
  resul[, i] <- df[, i + 1] - df[, i]
  # Set the colname for each iteration
  colnames(resul)[i] <- paste0(colnames(df)[i + 1], "-", colnames(df)[i])
}
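For the example data, the loop produces the following (the values match the differences shown in the earlier answers):
resul
#   SerialNo Reading2-Reading1 Reading3-Reading2
# 1        1                 2                 5
# 2        2                 3                 4
# 3        3                 1                 9
# 4        4                16                 3
# 5        5                 7                 2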

Calculating total sum of line segments overlapping on a line

I'm trying to calculate the total sum of overlapping line segments across a single line. With line A, the segments are disjoint, so it's pretty simple to calculate. However, with lines B and C, there are overlapping line segments, so it's more complicated: I need to somehow exclude the parts of previous segments that are already part of the total sum.
data = read.table(text="
line left_line right_line small_line left_small_line right_small_line
A 100 120 101 91 111
A 100 120 129 119 139
B 70 90 63 53 73
B 70 90 70 60 80
B 70 90 75 65 85
C 20 40 11 1 21
C 20 40 34 24 44
C 20 40 45 35 55", header=TRUE)
This should be the expected result.
result = read.table(text="
total_overlapping
A 0.6
B 0.75
C 0.85", header=TRUE)
EDIT: Added pictures to better illustrate what I'm trying to figure out. There are 3 pictures of lines (the solid red line), each with overlapping line segments (the dashed lines). The goal is to figure out how much of each line the dashed segments cover.
[Figures: lines A, B and C]
If I understand correctly, the small_line variable is irrelevant here. The rest of the columns can be used to get the sum of overlapping segments:
Step 1. Get the start & end point for each segment's overlap with the corresponding line:
library(dplyr)
data1 <- data %>%
rowwise() %>%
mutate(overlap.start = max(left_line, left_small_line),
overlap.end = min(right_line, right_small_line)) %>%
ungroup() %>%
select(line, overlap.start, overlap.end)
> data1
# A tibble: 8 x 3
line overlap.start overlap.end
<fct> <int> <int>
1 A 100 111
2 A 119 120
3 B 70 73
4 B 70 80
5 B 70 85
6 C 20 21
7 C 24 40
8 C 35 40
Step 2. Within the rows corresponding to each line, sort the overlaps in order. Consider an overlap the start of a new overlapping section if it is the first overlap, OR if the previous overlap ends before it starts. Label each new overlapping section:
data2 <- data1 %>%
  arrange(line, overlap.start, overlap.end) %>%
  group_by(line) %>%
  mutate(new.section = is.na(lag(overlap.end)) |
           lag(overlap.end) <= overlap.start) %>%
  mutate(section.number = cumsum(new.section)) %>%
  ungroup()
> data2
# A tibble: 8 x 5
line overlap.start overlap.end new.section section.number
<fct> <int> <int> <lgl> <int>
1 A 100 111 TRUE 1
2 A 119 120 TRUE 2
3 B 70 73 TRUE 1
4 B 70 80 FALSE 1
5 B 70 85 FALSE 1
6 C 20 21 TRUE 1
7 C 24 40 TRUE 2
8 C 35 40 FALSE 2
Step 3. Within each overlapping section, take the earliest starting point & the latest ending point. Calculate the length of each overlap:
data3 <- data2 %>%
  group_by(line, section.number) %>%
  summarise(overlap.start = min(overlap.start),
            overlap.end = max(overlap.end)) %>%
  ungroup() %>%
  mutate(overlap = overlap.end - overlap.start)
> data3
# A tibble: 5 x 5
line section.number overlap.start overlap.end overlap
<fct> <int> <dbl> <dbl> <dbl>
1 A 1 100 111 11
2 A 2 119 120 1
3 B 1 70 85 15
4 C 1 20 21 1
5 C 2 24 40 16
Step 4. Sum the length of overlaps for each line:
data4 <- data3 %>%
  group_by(line) %>%
  summarise(overlap = sum(overlap)) %>%
  ungroup()
> data4
# A tibble: 3 x 2
line overlap
<fct> <dbl>
1 A 12
2 B 15
3 C 17
Now, your expected result shows the percentage of overlap on each line rather than the sum. If that's what you are looking for, you can join each line's length onto data4 and calculate accordingly:
data5 <- data4 %>%
  left_join(data %>%
              select(line, left_line, right_line) %>%
              unique() %>%
              mutate(length = right_line - left_line) %>%
              select(line, length),
            by = "line") %>%
  mutate(overlap.percentage = overlap / length)
> data5
# A tibble: 3 x 4
line overlap length overlap.percentage
<fct> <dbl> <int> <dbl>
1 A 12 20 0.6
2 B 15 20 0.75
3 C 17 20 0.85
