dplyr group_by summarise inconsistent number of rows - r

I have been following the tutorial on DataCamp. I have the following code, which produces a different value for "drows" each time I run it:
hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(rows = n(), drows = n_distinct(rows))
First time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 86
2 AirTran BKG 14 6
3 Alaska SEA 32 18
4 American DFW 186 74
5 American MIA 129 57
6 American_Eagle DFW 234 101
7 American_Eagle LAX 74 34
8 American_Eagle ORD 133 56
9 Atlantic_Southeast ATL 64 28
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
Second time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 125
2 AirTran BKG 14 13
3 Alaska SEA 32 29
4 American DFW 186 118
5 American MIA 129 76
6 American_Eagle DFW 234 143
7 American_Eagle LAX 74 47
8 American_Eagle ORD 133 85
9 Atlantic_Southeast ATL 64 44
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
Third time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 88
2 AirTran BKG 14 7
3 Alaska SEA 32 16
4 American DFW 186 79
5 American MIA 129 61
6 American_Eagle DFW 234 95
7 American_Eagle LAX 74 31
8 American_Eagle ORD 133 67
9 Atlantic_Southeast ATL 64 31
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
My question is why does this value constantly change? What is it doing?

Apparently this is normal behaviour; see this dplyr issue: https://github.com/tidyverse/dplyr/issues/2222. Quoting from it:
This is because values in list columns are compared by reference, so
n_distinct() treats them as different unless they really point to the
same object.
So the internal storage of the data frame changes the result you get. Hadley's comment in that issue seems to say it might be a bug (in the sense of unwanted behaviour), or it might be expected behaviour that they need to document better.
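For contrast, n_distinct() over an ordinary atomic column is stable across runs; a minimal sketch (TailNum is picked here only as an example column from hflights):
library(dplyr)
library(hflights)

# Counting distinct values of an atomic (non-list) column is
# deterministic; only reference-based list-column comparisons drift.
hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(rows = n(), planes = n_distinct(TailNum))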

Related

Join data frames into one in R

I have 4 data frames that all look like this:
Product 2018   Number   Minimum   Maximum
           1       56         1         5
           2       42        12        16
           3     6523        23        56
           4      123        23       102
           5       56        23        64
           6   245623        56        87
           7      546        25       540
           8    54566       253       560
Product 2019   Number   Minimum   Maximum
           1       56        32        53
           2      642       423       620
           3    56423       432       560
           4        3       431       802
           5        2         2         6
           6     4523        43        68
           7      555        23        54
           8    55646         3         6
Product 2020   Number   Minimum   Maximum
           1       23         2         5
           2      342         4        16
           3      223         3         5
           4       13         4        12
           5        2         4         7
           6      223         7         8
           7        5        34        50
           8       46         3         6
Product 2021   Number   Minimum   Maximum
           1      234         3         5
           2     3242         4        16
           3     2423        43        56
           4      123        43       102
           5       24         4         6
           6     2423         4        18
           7      565       234       540
           8     5646        23        56
I want to join all the tables so I get a table that looks like this:
Products  Number 2021  Min-Max 2021  Number 2020  Min-Max 2020  Number 2019  Min-Max 2019  Number 2018  Min-Max 2018
1         234          3 to 5        23           2 to 5        ...          ...           ...          ...
2         3242         4 to 16       342          4 to 16       ...          ...           ...          ...
3         2423         43 to 56      223          3 to 5        ...          ...           ...          ...
4         123          43 to 102     13           4 to 12       ...          ...           ...          ...
5         24           4 to 6        2            4 to 7        ...          ...           ...          ...
6         2423         4 to 18       223          7 to 8        ...          ...           ...          ...
7         565          234 to 540    5            34 to 50      ...          ...           ...          ...
8         5646         23 to 56      46           3 to 6        ...          ...           ...          ...
The Product values are the same for all years, so I would like a data frame that contains the Number for each year as a column and combines the Minimum and Maximum columns into one.
Any help is welcome!
How about something like this? You are trying to join several data frames by a single column, which is relatively straightforward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everything you want to do and then reduce the list of data frames at the end. Here is an example with two data frames, but you could add as many as you want to the list at the beginning.
library(tidyverse)

# test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))

list(df1, df2) |>
  map(\(x){
    year <- str_extract(colnames(x)[1], "\\d+?$")
    mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum)) |>
      rename(!!quo_name(paste0("Number ", year)) := Number) |>
      rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
      select(-c(Minimum, Maximum))
  }) |>
  reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but it includes bind_rows to combine the data frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries the relevant year information for that data frame. If that column is renamed to Product, the tables are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)

list(df1, df2, df3, df4) %>%
  map(~.x %>%
        mutate(Year = gsub("Product", "", names(.x)[1])) %>%
        rename(Product = !!names(.[1]))) %>%
  bind_rows() %>%
  mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
  pivot_wider(id_cols = Product, names_from = Year,
              values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56

How to sum all variables that aren't characters/factors using group_by? [duplicate]

This question already has answers here:
Group by multiple columns and sum other multiple columns
(7 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 1 year ago.
I am new to R. I have some data from local elections in Mexico, and I want to determine how many votes each party received in each municipality.
Here is an example of the data (the political parties are all variables from PRI onwards; NOM_MUN is the name of the municipality):
head(Campeche)
# A tibble: 6 x 14
CABECERA_DISTRITAL CIRCUNSCRIPCION NOMBRE_ESTADO NOM_MUN PRI PAN MORENA PRD PVEM PT MC
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 153 137 43 5 6 9 7
2 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 109 113 52 15 9 4 5
3 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 169 154 33 14 12 5 6
4 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 1414 1474 415 154 73 62 53
5 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 199 238 88 25 17 11 12
6 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 176 197 60 15 7 13 11
# … with 3 more variables: NVA_ALIANZA <dbl>, PH <dbl>, ES <dbl>
tail(Campeche)
CABECERA_DISTRITAL CIRCUNSCRIPCION NOMBRE_ESTADO NOM_MUN PRI PAN MORENA PRD PVEM PT MC
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SABANCUY 3 CAMPECHE CARMEN 83 74 21 7 0 3 1
2 SABANCUY 3 CAMPECHE CARMEN 68 47 28 5 3 4 1
3 SABANCUY 3 CAMPECHE CARMEN 56 72 16 1 0 1 1
4 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 90 147 3 2 4 1 3
5 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 141 161 39 30 4 9 15
6 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 84 77 1 6 0 0 3
# … with 3 more variables: NVA_ALIANZA <dbl>, PH <dbl>, ES <dbl>
The data is disaggregated by electoral section, and there is more than one electoral section per municipality; what I am looking for is the total votes for each political party by municipality.
This is what I was doing, but I believe there is a faster way to do the same thing, one that can be replicated for different municipalities with different parties.
results_Campeche <- Campeche %>%
  group_by(NOM_MUN) %>%
  summarize(PRI = sum(PRI), PAN = sum(PAN), PRD = sum(PRD), MORENA = sum(MORENA),
            PVEM = sum(PVEM), PT = sum(PT), MC = sum(MC), NVA_ALIANZA = sum(NVA_ALIANZA),
            PH = sum(PH), ES = sum(ES), .groups = "drop")
head(results_Campeche)
NOM_MUN PRI PAN PRD MORENA PVEM PT MC NVA_ALIANZA PH ES
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CALAKMUL 4861 5427 290 198 70 109 84 236 9 53
2 CALKINI 9035 1326 319 11714 684 194 282 4537 41 262
3 CAMPECHE 39386 32574 4394 11639 2211 2033 1451 4656 1995 4681
4 CANDELARIA 6060 11982 98 209 38 73 135 73 21 21
5 CARMEN 25252 38239 2505 9314 1164 708 712 1124 742 838
6 CHAMPOTON 16415 8500 3212 5387 457 636 1122 1034 203 340
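If you don't want to name every party column, dplyr's across() helper (available from dplyr 1.0.0) can select them all at once; a minimal sketch, assuming CIRCUNSCRIPCION is the only numeric column you do not want summed:
library(dplyr)

# Sum every numeric column per municipality; character columns such as
# CABECERA_DISTRITAL are simply not selected, and CIRCUNSCRIPCION is
# excluded explicitly because it is numeric but not a vote count.
results_Campeche <- Campeche %>%
  group_by(NOM_MUN) %>%
  summarize(across(where(is.numeric) & !CIRCUNSCRIPCION, sum), .groups = "drop")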

Percentage change in values in R

Here is the df I am using:
Date Country City Specie count min max median variance
27 2020-03-25 IN Delhi pm25 797 6 192 92 12116.60
159 2020-03-25 IN Chennai pm25 96 27 89 57 1928.38
223 2020-03-25 IN Mumbai pm25 285 12 163 90 6275.41
412 2020-03-25 IN Bengaluru pm25 179 25 145 73 4890.82
419 2020-03-25 IN Kolkata pm25 260 6 168 129 10637.10
10 2020-04-10 IN Delhi pm25 835 2 393 137 24542.30
132 2020-04-10 IN Chennai pm25 87 5 642 53 87856.50
298 2020-04-10 IN Mumbai pm25 168 1 125 90 5025.35
358 2020-04-10 IN Bengaluru pm25 159 21 834 56 57091.10
444 2020-04-10 IN Kolkata pm25 219 4 109 64 2176.61
I want to calculate the percentage change between 'median' values of the data frame. For that I have used the following code:
pct_change_pm25 <- day %>%
  arrange(City, .by_group = TRUE) %>%
  mutate(pct_change = -diff(median) / median[-1] * 100)
But I am getting this error:
Error in arrange_impl(.data, dots) :
incorrect size (1) at position 2, expecting : 10
The number of values that mutate is creating is 9, which does not match the number of rows (10) in the df.
I have followed this post on Stack Overflow:
Calculate Percentage Change in R using dplyr
But, unfortunately, it didn't work for me.
Since diff returns a vector one element shorter than its input, append an NA at the start of the calculation. Also, you probably want to do this for each City separately, hence grouping by City.
library(dplyr)
df %>%
  arrange(City) %>%
  group_by(City) %>%
  mutate(pct_change = c(NA, -diff(median) / median[-1] * 100))
Another way to do the same calculation is using lag
df %>%
  arrange(City) %>%
  group_by(City) %>%
  mutate(pct_change = (lag(median) - median) / median * 100)
# Date Country City Specie count min max median variance pct_change
# <fct> <fct> <fct> <fct> <int> <int> <int> <int> <dbl> <dbl>
# 1 2020-03-25 IN Bengaluru pm25 179 25 145 73 4891. NA
# 2 2020-04-10 IN Bengaluru pm25 159 21 834 56 57091. 30.4
# 3 2020-03-25 IN Chennai pm25 96 27 89 57 1928. NA
# 4 2020-04-10 IN Chennai pm25 87 5 642 53 87856. 7.55
# 5 2020-03-25 IN Delhi pm25 797 6 192 92 12117. NA
# 6 2020-04-10 IN Delhi pm25 835 2 393 137 24542. -32.8
# 7 2020-03-25 IN Kolkata pm25 260 6 168 129 10637. NA
# 8 2020-04-10 IN Kolkata pm25 219 4 109 64 2177. 102.
# 9 2020-03-25 IN Mumbai pm25 285 12 163 90 6275. NA
#10 2020-04-10 IN Mumbai pm25 168 1 125 90 5025. 0
With data.table, we can do
library(data.table)
setDT(df)[, pct_change := (shift(median) - median)/median * 100, City]
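Note that both answers reproduce the asker's formula, (previous - current) / current * 100. If you instead want the conventional percent change relative to the previous value, a small sketch:
library(dplyr)

# Percent change of each city's median relative to the previous date
df %>%
  arrange(City, Date) %>%
  group_by(City) %>%
  mutate(pct_change = (median - lag(median)) / lag(median) * 100) %>%
  ungroup()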

dplyr summarize output - how to save it

I need to calculate summary statistics for observations of bird breeding activity for each of 150 species. The data frame has the species (sCodef), the type of observation (codef) (e.g. nest building), and the ordinal date (days since 1 January; the data were collected over multiple years). Using dplyr I get exactly the result I want.
library(dplyr)
library(tidyr)
phenology %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate), Median = median(jdate))
# A tibble: 552 x 6
# Groups: sCodef [?]
sCodef codef N Min Max Median
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 ABDU AY 3 172 184 181
2 ABDU FL 12 135 225 188
3 ACFL AY 18 165 222 195
4 ACFL CN 4 142 156 152.
5 ACFL FL 10 166 197 192.
6 ACFL NB 6 139 184 150.
7 ACFL NY 6 166 207 182
8 AMCO FL 1 220 220 220
9 AMCR AY 53 89 198 161
10 AMCR FL 78 133 225 166.
# ... with 542 more rows
How do I get these summary statistics into some sort of data object so that I can export them, ultimately for use in a Word document? I have tried the following and got an error. All of the many explanations of summarize I have reviewed just show the summary data on screen. Thanks
out3 <- summarize(N=n(), Min=min(jdate), Max=max(jdate), median=median(jdate))
Error: This function should not be called directly
Your out3 line fails because summarize() is called without any data piped into it. Assign the result of the full pipeline to a variable, then write it to a CSV (note that write.csv takes a file argument, not filename):
summarydf <- phenology %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate), Median = median(jdate))
write.csv(summarydf, file = "yourfilenamehere.csv")
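Since the end goal is a Word document, you could also skip the CSV round trip with the flextable package; a hedged sketch (assuming flextable is installed; save_as_docx() is its Word writer, and the file name here is made up):
library(flextable)

# Render summarydf (created above) as a table inside a .docx file
save_as_docx(flextable(summarydf), path = "phenology_summary.docx")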

How to cross-reference tibbles in R?

library(nycflights13)
library(tidyverse)
My task is
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
I have generated a tibble with the average flight times between every two airports:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
Now I need to sweep through flights and extract all rows whose air_time is outside, say, (mean_time/2, mean_time*2). How do I do that?
Assuming you have stored the tibble with the average flight times, join it to the flights table:
flights_suspicious <- left_join(flights, average_flight_times, by = c("origin", "dest")) %>%
  filter(air_time < mean_time / 2 | air_time > mean_time * 2)
You would first join that average flight time data frame onto your original flights data and then apply the filter. Something like this should work.
library(nycflights13)
library(tidyverse)

data("flights")

# get mean time per route
mean_time <- flights %>%
  group_by(origin, dest) %>%
  summarise(mean_time = mean(air_time, na.rm = TRUE))

# join mean time to original data
df <- left_join(flights, mean_time)

flag_flights <- df %>%
  filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>
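If you would rather avoid the intermediate join, a grouped mutate computes the route mean in place; a minimal sketch:
library(dplyr)
library(nycflights13)

# Attach each origin-dest route's mean air time to every flight,
# then keep flights far outside that mean
flag_flights <- flights %>%
  group_by(origin, dest) %>%
  mutate(mean_time = mean(air_time, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(air_time < mean_time / 2 | air_time > mean_time * 2)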
