How to count repetitions in a dataframe column in R using dplyr?

I have shrunk a df and kept only two columns, which are airport names (origin and destination):
Origin Destination
<chr> <chr>
1 LPPD LEMD
2 DAAE LFML
3 EDDH UUEE
4 LFLL DAAS
5 LFPO LFSL
6 UMKK ULLI
7 LFPO LFBA
8 LFPG EDDN
9 LFLL LFRN
10 LFPG EDDW
# … with more rows
Airport names are repeated in both columns. I would like to summarise the repeated airport names and output the following:
Airports totalMovements takeoffs landings
Airports lists each airport name once, whichever column it appears in. totalMovements is the number of times an airport name appears in the Origin column plus the number of times it appears in the Destination column. takeoffs is the number of times an airport name appears in the Origin column, and landings is the number of times it appears in the Destination column.
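For reference, a minimal sketch reconstructing the rows shown above (assuming the data frame is named df with character columns):
df <- data.frame(
  Origin      = c("LPPD", "DAAE", "EDDH", "LFLL", "LFPO",
                  "UMKK", "LFPO", "LFPG", "LFLL", "LFPG"),
  Destination = c("LEMD", "LFML", "UUEE", "DAAS", "LFSL",
                  "ULLI", "LFBA", "EDDN", "LFRN", "EDDW"),
  stringsAsFactors = FALSE
)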

We can use data.table
library(data.table)
melt(setDT(df1), measure = 1:2)[, .(.N, sum(variable == 'Origin'),
sum(variable == 'Destination')), value]
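If column names matching the requested output are wanted, the same aggregation can name them explicitly (a sketch along the same lines; df1 is the name used above for the data):
library(data.table)
melt(setDT(df1), measure = 1:2)[
  , .(totalMovements = .N,
      takeoffs = sum(variable == 'Origin'),
      landings = sum(variable == 'Destination')),
  by = .(Airports = value)]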

You could try:
library(dplyr)
library(tidyr)
pivot_longer(df, everything()) %>%
  group_by(Airports = value) %>%
  summarise(
    totalMovements = n(),
    takeoffs = sum(name == 'Origin'),
    landings = sum(name == 'Destination')
  )
Output (based on the rows shown in your question):
# A tibble: 17 x 4
Airports totalMovements takeoffs landings
<fct> <int> <int> <int>
1 DAAE 1 1 0
2 EDDH 1 1 0
3 LFLL 2 2 0
4 LFPG 2 2 0
5 LFPO 2 2 0
6 LPPD 1 1 0
7 UMKK 1 1 0
8 DAAS 1 0 1
9 EDDN 1 0 1
10 EDDW 1 0 1
11 LEMD 1 0 1
12 LFBA 1 0 1
13 LFML 1 0 1
14 LFRN 1 0 1
15 LFSL 1 0 1
16 ULLI 1 0 1
17 UUEE 1 0 1
If you'd like to stick to only using dplyr, you can also emulate the behaviour of pivot_longer by:
library(dplyr)
bind_rows(
  df %>% transmute(Airports = Origin, name = 'Origin'),
  df %>% transmute(Airports = Destination, name = 'Destination')
) %>%
  group_by(Airports) %>%
  summarise(
    totalMovements = n(),
    takeoffs = sum(name == 'Origin'),
    landings = sum(name == 'Destination')
  )

Pivot and transform df in dplyr

my input:
df<-data.frame("frame"=c(1,2,3,4,5,6,7,8,9,10),
"label_x"=c("AO","Other","AO","GS","GS","RF","RF","TI",NA,"Other"),
"label_y"=c("AO","RF","RF", "GS","GS","Other","Other","TI","AO","RF"),
"cross"=c("Matched","Mismatched", "Mismatched","Matcehed","Matched"
,"Mismatched", "Mismatched","Mismatched","Mismatched","Mismatched") )
I want to count all "Matched"/"Mismatched" values from the cross column per label, across both label_x and label_y. So I tried this code for each label_ column:
df %>% filter(!is.na(label_y )) %>% group_by(label_y) %>% count(cross)
but it doesn't answer my question; after that I would still need to sum the counts across the two columns.
So I expect something like this:
label Mismatching Matching Total
AO 5 7 13
RF 3 4 7
One way to do it:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = c(label_x, label_y), values_to = "label") %>%
  group_by(label) %>%
  count(cross) %>%
  pivot_wider(values_from = n, names_from = cross, values_fill = 0) %>%
  mutate(total = Matched + Mismatched)
Result tibble:
# A tibble: 6 x 4
# Groups: label [6]
label Matched Mismatched total
<chr> <int> <int> <int>
1 AO 2 2 4
2 GS 4 0 4
3 Other 0 4 4
4 RF 0 5 5
5 TI 0 2 2
6 NA 0 1 1
However, keep in mind that the matched number is overestimated because both label_x and label_y have been used. Could you show a result table with the real labels and numbers you expect?
Using table:
table(data.frame(label = unlist(df[, c("label_x", "label_y")]),
cross = df$cross))
# cross
#label Matcehed Matched Mismatched
# AO 0 2 2
# GS 2 2 0
# Other 0 0 4
# RF 0 0 5
# TI 0 0 2
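If a per-label total like in the expected output is also needed, base R's addmargins can append it to the table (a sketch; margin = 2 adds a Sum column across the cross categories):
tab <- table(data.frame(label = unlist(df[, c("label_x", "label_y")]),
                        cross = df$cross))
addmargins(tab, margin = 2)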

How best to create calculated columns in R

Below is the sample data. The task at hand is creating two new columns that flag each zip code by area. The first new column would be titled Las_Vegas and the second would be Laughlin. The first eight zip codes would have a value of 1 for Las_Vegas and the second eight would have a value of 1 for Laughlin. The purpose of this is that I want to sum the employment for Las Vegas and Laughlin.
First question: Would it be best to use ifelse or case_when?
Second question: Making the two new columns into defacto dummy variables... is this the best approach?
zipcode <-c(89102,89103,89104,89105,89106,89107,89108,89109,89110,89111,89112,89113,89114,89115,89116,89117)
naicstest<-c(541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541212,541215,541214)
emptest <-c(2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32)
county <- data.frame(zipcode,naicstest,emptest)
End result: the full result would have sixteen rows (eight for Las_Vegas and eight for Laughlin); I kept it short here for the sake of simplicity, showing one row for each. I know how to do the summarise (summing employment) but am struggling with how to make the two columns.
zipcode naicstest emptest Las_Vegas Laughlin
89102 541213 2 1 0
89110 541213 18 0 1
We can use tidyverse. The steps are:
1. Match 'zipcode' against unique(zipcode) to get a numeric index for each unique zip code.
2. Use the index from 1 to create another index for every 8 elements with %/%.
3. Use the index from 2 as a position index into the vector of area names c("Las_Vegas", "Laughlin").
4. Use the output from 3 as a grouping variable.
5. Get the first row of each group with slice_head(n = 1).
6. Reshape from 'long' to 'wide' with pivot_wider.
library(dplyr)
library(tidyr)
county %>%
  group_by(un1 = c("Las_Vegas", "Laughlin")[
    (match(zipcode, unique(zipcode)) - 1) %/% 8 + 1]) %>%
  slice_head(n = 1) %>%
  mutate(n = 1) %>%
  pivot_wider(names_from = un1, values_from = n, values_fill = 0)
Output:
# A tibble: 2 x 5
zipcode naicstest emptest Las_Vegas Laughlin
<dbl> <dbl> <dbl> <dbl> <dbl>
1 89102 541213 2 1 0
2 89110 541213 18 0 1
If we want to return all the rows, don't do the slice_head; instead create a sequence column with row_number():
county %>%
  group_by(un1 = c("Las_Vegas", "Laughlin")[
    (match(zipcode, unique(zipcode)) - 1) %/% 8 + 1]) %>%
  mutate(n = 1, rn = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = un1, values_from = n, values_fill = 0) %>%
  select(-rn)
Output:
# A tibble: 16 x 5
zipcode naicstest emptest Las_Vegas Laughlin
<dbl> <dbl> <dbl> <dbl> <dbl>
1 89102 541213 2 1 0
2 89103 541213 4 1 0
3 89104 541213 6 1 0
4 89105 541213 8 1 0
5 89106 541213 10 1 0
6 89107 541213 12 1 0
7 89108 541213 14 1 0
8 89109 541213 16 1 0
9 89110 541213 18 0 1
10 89111 541213 20 0 1
11 89112 541213 22 0 1
12 89113 541213 24 0 1
13 89114 541213 26 0 1
14 89115 541212 28 0 1
15 89116 541215 30 0 1
16 89117 541214 32 0 1
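On the original ifelse-versus-case_when question: since the assignment is purely by zip code, a simpler sketch (assuming, as in the sample, that the first eight zip codes 89102-89109 are Las Vegas and the remaining eight 89110-89117 are Laughlin) could be:
library(dplyr)
county %>%
  mutate(Las_Vegas = ifelse(zipcode <= 89109, 1, 0),  # first eight zip codes
         Laughlin  = ifelse(zipcode >= 89110, 1, 0))  # second eight zip codes
Either ifelse or case_when works for two areas; case_when scales more cleanly if further areas are added later, e.g. mutate(area = case_when(zipcode <= 89109 ~ "Las_Vegas", TRUE ~ "Laughlin")).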

Separate a row of data on different columns with the count of each item

I have a dataset with two columns where I want to separate the second one (delimited by |) into many columns, where each new column is named after an item and each observation holds that item's count.
id column
1 a|b|a
2 a|b|c|d|e
3 a|c|c
I would like to have a column named after each item holding its count, for example:
id a b c d e
1 2 1 0 0 0
2 1 1 1 1 1
3 2 0 1 0 0
How do I separate this data so that the counts are spread across columns like this?
A tidyverse approach, assuming data frame named mydata:
library(dplyr)
library(tidyr)
mydata %>%
  separate_rows(column, sep = "\\|") %>%
  count(id, column) %>%
  spread(column, n) %>%
  replace(., is.na(.), 0)  # or just spread(column, n, fill = 0)
Result:
# A tibble: 3 x 6
id a b c d e
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 0 0
2 2 1 1 1 1 1
3 3 1 0 2 0 0
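Note that spread is superseded in newer versions of tidyr; the same reshaping can be sketched with pivot_wider, which handles the fill directly (values_fill = 0 in recent tidyr; older versions use values_fill = list(n = 0)):
library(dplyr)
library(tidyr)
mydata %>%
  separate_rows(column, sep = "\\|") %>%
  count(id, column) %>%
  pivot_wider(names_from = column, values_from = n, values_fill = 0)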

Computing minimum distance between a row and all previous rows in R

I want to compute the minimum distance between the current row and every row before it within each group. My data frame has several groups, and each group has multiple dates with longitude and latitude. I use a Haversine function to compute distance, and I need to apply this function as described above. The data frame looks like the following:
grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4
My current attempt uses purrrlyr::by_row, but the method is too slow. In practice, each group has thousands of dates and geographic positions. Here is part of my current attempt:
calc_min_distance <- function(df, grp.name, row){
df %>%
filter(
group_name==grp.name
) %>%
filter(
row_number() <= row
) %>%
mutate(
last.lat = last(lat),
last.long = last(long),
rowid = 1:n()
) %>%
group_by(rowid) %>%
purrrlyr::by_row(
~haversinedistance.fnct(.$last.long, .$last.lat, .$long, .$lat),
.collate='rows',
.to = 'min.distance'
) %>%
filter(
row_number() < n()
) %>%
summarise(
min = min(min.distance)
) %>%
.$min
}
df_dist <-
df %>%
group_by(grp_name) %>%
mutate(rowid = 1:n()) %>%
group_by(grp_name, rowid) %>%
purrrlyr::by_row(
~calc_min_distance(df, .$grp_name,.$rowid),
.collate='rows',
.to = 'min.distance'
) %>%
ungroup %>%
select(-rowid)
For illustration, suppose that distance is defined as (lat + long) of the reference row minus (lat + long) of each earlier row. My expected output for grp 1 is the following:
grp date long lat rowid min.distance
1 1 1995-07-01 11 12 1 0
2 1 1995-07-05 3 0 2 -20
3 1 1995-07-09 13 4 3 -6
4 1 1995-07-13 4 25 4 6
How can I quickly compute the minimum distance between the current rowid and all rowids before it?
Here's how I would go about it. You need to calculate all the within-group pair-wise distances anyway, so we'll use geosphere::distm which is designed to do just that. I'd suggest stepping through my function line-by-line and looking at what it does, I think it will make sense.
library(geosphere)
find_min_dist_above = function(long, lat, fun = distHaversine) {
  # all pairwise distances within the group
  d = distm(x = cbind(long, lat), fun = fun)
  # keep only distances to rows above the current one (strict upper triangle)
  d[lower.tri(d, diag = TRUE)] = NA
  # the first row has no rows above it; define its distance as 0
  d[1, 1] = 0
  # column-wise minimum = min distance from each row to all earlier rows
  return(apply(d, MAR = 2, min, na.rm = TRUE))
}
df %>% group_by(grp) %>%
mutate(min.distance = find_min_dist_above(long, lat))
# # A tibble: 8 x 6
# # Groups: grp [2]
# grp date long lat rowid min.distance
# <int> <fct> <int> <int> <int> <dbl>
# 1 1 1995-07-01 11 12 1 0
# 2 1 1995-07-05 3 0 2 1601842.
# 3 1 1995-07-09 13 4 3 917395.
# 4 1 1995-07-13 4 25 4 1623922.
# 5 2 1995-03-07 12 6 1 0
# 6 2 1995-03-10 3 27 2 2524759.
# 7 2 1995-03-13 34 8 3 2440596.
# 8 2 1995-03-16 25 9 4 997069.
Using this data:
df = read.table(text = ' grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4', h = TRUE)
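If only the simplified (lat + long) metric from the question is needed, the per-row minimum over all earlier rows reduces to the current sum minus the running maximum of the earlier sums; a dplyr sketch reproducing the expected output for grp 1:
library(dplyr)
df %>%
  group_by(grp) %>%
  mutate(s = lat + long,
         # min over earlier rows of (current s - earlier s) = s - max(earlier s);
         # the lag default makes the first row of each group come out as 0
         min.distance = s - cummax(lag(s, default = first(s)))) %>%
  select(-s) %>%
  ungroup()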

Subsetting panel data based on two variables in R

library(dplyr)
id <- c(rep(1,4),rep(2,3),rep(3,4))
missing <- c(rep(0,4),rep(0,3),1,0,0,0)
wave <- c(seq(1:4),1,2,3,seq(1:4))
df <- as.data.frame(cbind(id,missing,wave))
df
id missing wave
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4
5 2 0 1
6 2 0 2
7 2 0 3
8 3 1 1
9 3 0 2
10 3 0 3
11 3 0 4
I am trying to delete cases if they have missing=1 or if they are missing a wave (1:4). For example, ID=3 should be dropped because at wave=1 they have missing=1 and ID=2 should be dropped because they only have values of 1, 2, and 3 in Wave.
I tried to use dplyr's group_by and filter functions but this removes all cases. I want to only end up with cases for ID=1.
df <- df %>% group_by(id) %>% filter(missing==0, wave==1, wave==2, wave==3, wave==4)
df
Try this. We first group_by id, then create a list column with the sorted unique values of wave for each id and check that it equals 1:4 (wave_list_check). We also create a missing_check variable, which is just the max of missing for each id. We then filter on both missing_check and wave_list_check.
df %>%
  group_by(id) %>%
  mutate(wave_list = I(list(sort(unique(wave))))) %>%
  mutate(wave_list_check = all(unlist(wave_list) == 1:4),
         missing_check = max(missing)) %>%
  filter(missing_check == 0, wave_list_check) %>%
  select(id:wave)
id missing wave
<dbl> <dbl> <dbl>
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4
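An alternative sketch with a single filter call keeps an id only if it has no missing == 1 and contains every wave 1 through 4:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(all(missing == 0), all(1:4 %in% wave)) %>%
  ungroup()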
