Write JSON with multiple inputs in R

I have a tibble and a list which I would like to write to a json file.
# A tibble: 2 x 12
  i     n             c              x
  <chr> <chr>         <chr>          <chr>
1 NYC   New York City United States  LON,271;BOS,201
2 LON   London        United Kingdom NYC,270
I would like to replace the 'x' column with a list.
When I try to merge the elements of the list into the tibble by the 'i' column, a lot of data is duplicated... :/
sample list:
$NYC
     d   p
1: LON 271
2: BOS 201

$LON
     d   p
1: NYC 270
I would like to end up with something that looks like this:
[
  {
    "i": "NYC",
    "n": "New York City",
    "c": "United States",
    "C": "US",
    "r": "Northern America",
    "F": 66.256,
    "L": -166.063,
    "b": 94.42,
    "s": 0.752,
    "q": 4417,
    "t": "0,0,0,0,0",
    "x": [
      {
        "d": "LON",
        "p": 271
      },
      {
        "d": "BOS",
        "p": 201
      }
    ]
  },
  ...
]
I'm thinking there should be a way to write the JSON file without merging the list and the tibble, or maybe there is a way to merge them in a ragged way?
Ah, I just had another idea: maybe I can convert my data frame to a list and then use Reduce to combine the lists...
http://www.sharecsv.com/s/2e1dc764430c6fe746d2299f71879c2e/routes-before-split.csv
http://www.sharecsv.com/s/b114e2cc6236bd22b23298035fb7e042/tibble.csv

We may do the following:
tbl
# A tibble: 1 x 13
#       X i     n      c             C     r                 F      L     b     s     q t        x
#   <int> <fct> <fct>  <fct>         <fct> <fct>         <dbl>  <dbl> <dbl> <dbl> <int> <fct>    <fct>
# 1     1 LON   London United Kingd… GB    Northern Eur…  51.5 -0.127  55.4  1.25  2088 0,0,1,3… AAL,15;AAR,15;A…
library(tidyverse)
library(jsonlite)
tbl$x <- map(tbl$x, ~ strsplit(as.character(.), ";|,")[[1]] %>%   # as.character() in case x was read in as a factor
  {data.frame(d = .[c(T, F)], p = as.numeric(.[c(F, T)]))})
The latter two lines are a shortened version of this base R equivalent:
tbl$x <- lapply(tbl$x, function(r) {
  tmp <- strsplit(as.character(r), ";|,")[[1]]   # as.character() again guards against factors
  data.frame(d = tmp[seq(1, length(tmp), 2)],
             p = as.numeric(tmp[seq(2, length(tmp), 2)]))
})
We go over the x column, split its elements on every ; and ,, and then use the fact that the resulting odd elements correspond to the d column in the desired outcome, and the even elements to the p column.
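To see the alternating-logical trick in isolation, here is what a single element of x turns into (R recycles the short logical index along the whole vector):
tmp <- strsplit("LON,271;BOS,201", ";|,")[[1]]
tmp
# [1] "LON" "271" "BOS" "201"
tmp[c(TRUE, FALSE)]   # odd positions -> d
# [1] "LON" "BOS"
tmp[c(FALSE, TRUE)]   # even positions -> p
# [1] "271" "201"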
Output:
toJSON(tbl, pretty = TRUE)
[
  {
    "X": 1,
    "i": "LON",
    "n": "London",
    "c": "United Kingdom",
    "C": "GB",
    "r": "Northern Europe",
    "F": 51.508,
    "L": -0.127,
    "b": 55.43,
    "s": 1.25,
    "q": 2088,
    "t": "0,0,1,3,1",
    "x": [
      {
        "d": "AAL",
        "p": 15
      },
      {
        "d": "AAR",
        "p": 15
      },
      {
        "d": "ABZ",
        "p": 48
      }
    ]
  }
]
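Since the goal is a file rather than a string, the result can then be written to disk, e.g. (the file name here is assumed):
write_json(tbl, "routes.json", pretty = TRUE)
# or, equivalently
writeLines(toJSON(tbl, pretty = TRUE), "routes.json")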

Related

Parse multi-level JSON file in R

I have a pretty good understanding of R but am new to JSON file types and best practices for parsing. I'm having difficulties building a data frame from a raw JSON file. The JSON file (data below) is made up of repeated measure data that has multiple observations per user.
When the raw file is read into R
jdata <- read_json("./raw.json")
It comes in as a "List of 1" with that list being user_ids. Within each user_id are further lists, like so -
jdata$user_id$`sjohnson`$date$`2020-09-25`$city
The very last position actually splits into two options - $city or $zip. At the highest level, there are about 89 users in the complete file.
My goal would be to end up with a rectangular data frame or multiple data frames that I can merge together like this - where I don't actually need the zip code.
(desired table shown as an image in the original post)
I've tried jsonlite along with tidyverse, and the farthest I seem to get is a data frame with one variable at the smallest level (cities and zip codes alternating rows) using this:
df <- as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))
Any help/suggestions to get closer to the table above would be much appreciated. I have a feeling I'm failing at looping it back through the different levels.
Here is an example of the raw json file structure:
{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}
Another (straightforward) solution, doing the heavy lifting with rrapply() from the rrapply package:
library(rrapply)
library(dplyr)
rrapply(jdata, how = "melt") %>%
  filter(L5 == "city") %>%
  select(user_id = L2, date = L4, city = value)
#>    user_id       date         city
#> 1 sjohnson 2020-09-25       Denver
#> 2 sjohnson 2020-10-01      Atlanta
#> 3 sjohnson 2020-11-04 Jacksonville
#> 4   asmith 2020-10-16    Cleavland
#> 5   asmith 2020-11-10     Elmhurst
Data
jdata <- jsonlite::fromJSON('{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}')
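If you also need the zip codes, the same melt can be filtered twice and joined; a sketch, under the assumption that the melted value column behaves for zip as it does for city above:
cities <- rrapply(jdata, how = "melt") %>%
  filter(L5 == "city") %>%
  select(user_id = L2, date = L4, city = value)
zips <- rrapply(jdata, how = "melt") %>%
  filter(L5 == "zip") %>%
  select(user_id = L2, date = L4, zip = value)
left_join(cities, zips, by = c("user_id", "date"))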
We can build our desired structure step by step:
library(jsonlite)
library(tidyverse)
df <- fromJSON('{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}')
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)
which gives:
# A tibble: 5 x 3
  user_id  date       city
  <chr>    <chr>      <chr>
1 sjohnson 2020-09-25 Denver
2 sjohnson 2020-10-01 Atlanta
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland
5 asmith   2020-11-10 Elmhurst
Alternative solution inspired by @Greg, where we change the last two lines:
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)
This gives almost the same results with the exception of one additional case where city is NA:
# A tibble: 6 x 3
  user_id  date                city
  <chr>    <chr>               <chr>
1 sjohnson 2020-09-25          Denver
2 sjohnson 2020-10-01          Atlanta
3 sjohnson 2020-11-04          Jacksonville
4 asmith   2020-10-16          Cleavland
5 asmith   2020-11-10          Elmhurst
6 asmith   2020-11-10 08:49:36 NA
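If the extra NA row is unwanted, the same pipeline with one additional filter reproduces the first result:
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  filter(!is.na(city)) %>%   # drop dates with no recorded city
  select(user_id, date, city)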
Here's a solution in the tidyverse: a custom function unnestable() designed to recursively unnest into a table the contents of a list like you describe. See Details for particulars regarding the format of such a list and its table.
Solution
First ensure the necessary libraries are present:
library(jsonlite)
library(tidyverse)
Then define the unnestable() function as follows:
unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest its
  # contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}
Finally, process the JSON data as follows:
# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")
# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)
# View the raw table.
flat_data
Naturally, you can reformat this table however you desire:
library(lubridate)
flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()
# View the reformatted table.
flat_data
Results
Given a raw.json file like that sampled here
{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}
then unnestable() will yield a tibble like this
# A tibble: 6 x 6
  user_id  date                city         zip   location     timestamp
  <chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013
which dplyr will format into the result below:
# A tibble: 6 x 3
  user_id  date                city
  <chr>    <dttm>              <chr>
1 sjohnson 2020-09-25 00:00:00 Denver
2 sjohnson 2020-10-01 00:00:00 Atlanta
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland
5 asmith   2020-11-10 00:00:00 Elmhurst
6 asmith   2020-11-10 08:49:36 NA
Details
List Format
To be precise, the list represents nested groupings by the fields {group_1, group_2, ..., group_n}, and it must be of the form:
list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          # .
          # .
          group_n = list(
            "value_1.1.….n.1" = list(
              field_a = 1,
              field_b = TRUE
            ),
            "value_1.1.….n.2" = list(
              field_a = 2,
              field_c = "2"
            )
            # ...
          )
        ),
        "value_1.2" = list(
          # .
          # .
          # .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          # .
          # .
          group_n = list(
            "value_2.1.….n.1" = list(
              field_a = 3,
              field_d = 3.0
            )
            # ...
          )
        ),
        "value_2.2" = list(
          # .
          # .
          # .
        )
        # ...
      )
    )
    # ...
  )
)
Table Format
Given a list of this form, unnestable() will flatten it into a table of the following form:
# A tibble: … x …
group_1 group_2 ... group_n field_a field_b field_c field_d
<chr> <chr> ... <chr> <dbl> <lgl> <chr> <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1 1 TRUE NA NA
2 value_1 value_1.1 ... value_1.1.….n.2 2 NA 2 NA
3 value_1 value_1.2 ... value_1.2.….n.1 ... ... ... ...
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
j value_2 value_2.1 ... value_2.1.….n.1 3 NA NA 3
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
k value_2 value_2.2 ... value_2.2.….n.1 ... ... ... ...
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮

All data in a row. Move observations under each other

I have the following data:
data <- list(
  list(eventId = 8, subEventName = "Simple pass", tags = list(list(id = 1801)),
       playerId = 122671, positions = list(list(y = 50, x = 50), list(y = 53, x = 35)),
       matchId = 2057954, eventName = "Pass", teamId = 16521, matchPeriod = "1H",
       eventSec = 1.656214, subEventId = 85, id = 258612104),
  list(eventId = 8, subEventName = "High pass", tags = list(list(id = 1801)),
       playerId = 139393, positions = list(list(y = 53, x = 35), list(y = 19, x = 75)),
       matchId = 2057954, eventName = "Pass", teamId = 16521, matchPeriod = "1H",
       eventSec = 4.487814, subEventId = 83, id = 258612106)
)
I want to create a data frame out of this list. I use unlist(data), which creates a row with repeated variables.
> unlist(data)
eventId subEventName tags.id playerId positions.y positions.x positions.y
"8" "Simple pass" "1801" "122671" "50" "50" "53"
positions.x matchId eventName teamId matchPeriod eventSec subEventId
"35" "2057954" "Pass" "16521" "1H" "1.656214" "85"
id eventId subEventName tags.id playerId positions.y positions.x
"258612104" "8" "High pass" "1801" "139393" "53" "35"
positions.y positions.x matchId eventName teamId matchPeriod eventSec
"19" "75" "2057954" "Pass" "16521" "1H" "4.487814"
subEventId id
"83" "258612106"
Each observation starts with the eventId variable. So basically I have to split the data into data frames starting with eventId, and then stack those data frames one under the other, i.e. ending up with two observations in this case. Do you have any ideas? Thanks in advance.
Try tibblify:
library(tibblify)
tibblify(data)
## A tibble: 2 x 12
# eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id
# <dbl> <chr> <list<tbl_df[,1]>> <dbl> <list<tbl_df[,2]>> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#1 8 Simple pass [1 x 1] 122671 [2 x 2] 2057954 Pass 16521 1H 1.66 85 258612104
#2 8 High pass [1 x 1] 139393 [2 x 2] 2057954 Pass 16521 1H 4.49 83 258612106
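If you then want one row per recorded position, the nested list-column can be unnested further; a sketch with tidyr (column names as in the output above):
library(tidyr)
tibblify(data) %>%
  unnest(positions)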
You can use rbindlist from data.table:
result <- data.table::rbindlist(data)
result
# eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id
#1: 8 Simple pass <list[1]> 122671 <list[2]> 2057954 Pass 16521 1H 1.66 85 2.59e+08
#2: 8 Simple pass <list[1]> 122671 <list[2]> 2057954 Pass 16521 1H 1.66 85 2.59e+08
#3: 8 High pass <list[1]> 139393 <list[2]> 2057954 Pass 16521 1H 4.49 83 2.59e+08
#4: 8 High pass <list[1]> 139393 <list[2]> 2057954 Pass 16521 1H 4.49 83 2.59e+08
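Note the duplicated rows: positions holds two elements per event, so rbindlist recycles all the length-one fields to two rows per event. The tibblify result above avoids this by keeping positions as a list-column.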
Does this help solve your problem?
df <- data.frame(matrix(unlist(data), nrow = length(data), byrow = TRUE))
Note that this coerces every field to character and spreads the two positions entries of each event across separate columns.

Loop Output Stored as List

I have wide supervisory data where a single observation consists of a level 1 employee and their department all the way down to level 8. I use a loop with other commands to produce a list all employees and the departments beneath them in long format so that I can see what departments employees are responsible for at all levels. There may be a more elegant way to do this than a loop, but it works fine. Sample data (through level 3 for succinctness):
data <- tibble(
  LV1_Employee_Name = "Chuck", LV1_Employee_Nbr = "1",
  LV1_Department = "Tha Boss", LV1_Department_Nbr = "90",
  LV2_Employee_Name = c("Alex", "Alex", "Paul", "Paul", "Jennifer", "Jennifer"),
  LV2_Employee_Nbr = c("2", "2", "3", "3", "4", "4"),
  LV2_Department = c("Leadership", "Leadership", "Finance", "Finance", "Philanthropy", "Philanthropy"),
  LV2_Department_Nbr = c("91", "91", "92", "92", "93", "93"),
  LV3_Employee_Name = c("Dan", "Wendy", "Sarah", "Monique", "Miguel", "Brandon"),
  LV3_Employee_Nbr = c("2", "2", "3", "3", "4", "4"),
  LV3_Department = c("Analytics", "Pop Health", "Acounting", "Investments", "Yacht Aquisitions", "Golf Junkets"),
  LV3_Department_Nbr = c("94", "95", "96", "97", "98", "99")
)
The loop below first produces six tibbles named level1_1, level1_2, level1_3, level2_2, level2_3, and level3_3. Each tibble contains an employee name, number, and the department at the same level or below. The code then collects these tibbles with ls() and mget(), binds their rows with bind_rows(), applies distinct(), and I've got what I need.
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
for(i in 1:3){
  for(k in first_department:3){
    assign(paste0("level", i, "_", k),
           setNames(distinct(as_tibble(c(data[, paste0("LV", i, "_", "Employee_Name")],
                                         data[, paste0("LV", i, "_", "Employee_Nbr")],
                                         data[, paste0("LV", k, "_", "Department")],
                                         data[, paste0("LV", k, "_", "Department_Nbr")]))),
                    data_colnames))
  }
  first_department = first_department + 1
}
employees_departments <- distinct(bind_rows(mget(ls(pattern = "^level")))) %>%
  filter(is.na(Department) == FALSE)
rm(list = ls(pattern = "^level"))
What I'd like to do is, rather than produce an initial output of six tibbles, have the loop itself output the list. This will save me from having a huge pile of tibbles in the environment, which, I'm told, is not very "R-like".
Here is a revised version that stores the results in a list within your loop. This will include an index idx incremented each time through the loop. Afterwards, you can use bind_rows on this list to get a complete result.
library(tidyverse)
idx <- 1
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
data_lst <- list()
for(i in 1:3){
  for(k in first_department:3){
    data_lst[[idx]] <- setNames(
      distinct(as_tibble(
        c(data[, paste0("LV", i, "_", "Employee_Name")],
          data[, paste0("LV", i, "_", "Employee_Nbr")],
          data[, paste0("LV", k, "_", "Department")],
          data[, paste0("LV", k, "_", "Department_Nbr")]))),
      data_colnames)
    idx <- idx + 1
  }
  first_department = first_department + 1
}
distinct(bind_rows(data_lst)) %>%
  filter(!is.na(Department))
Output
   Employee Employee_Id Department        Department_Number
   <chr>    <chr>       <chr>             <chr>
 1 Chuck    1           Tha Boss          90
 2 Chuck    1           Leadership        91
 3 Chuck    1           Finance           92
 4 Chuck    1           Philanthropy      93
 5 Chuck    1           Analytics         94
 6 Chuck    1           Pop Health        95
 7 Chuck    1           Acounting         96
 8 Chuck    1           Investments       97
 9 Chuck    1           Yacht Aquisitions 98
10 Chuck    1           Golf Junkets      99
11 Alex     2           Leadership        91
12 Paul     3           Finance           92
13 Jennifer 4           Philanthropy      93
14 Alex     2           Analytics         94
15 Alex     2           Pop Health        95
16 Paul     3           Acounting         96
17 Paul     3           Investments       97
18 Jennifer 4           Yacht Aquisitions 98
19 Jennifer 4           Golf Junkets      99
20 Dan      2           Analytics         94
21 Wendy    2           Pop Health        95
22 Sarah    3           Acounting         96
23 Monique  3           Investments       97
24 Miguel   4           Yacht Aquisitions 98
25 Brandon  4           Golf Junkets      99
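For what it's worth, the nested loop can also be replaced by a single iteration over the valid (employee level, department level) pairs; a sketch assuming the same data and column-naming scheme as above:
library(tidyverse)
pairs <- expand_grid(i = 1:3, k = 1:3) %>%
  filter(k >= i)   # departments at the employee's level or below
data_lst <- pmap(pairs, function(i, k) {
  distinct(tibble(
    Employee          = data[[paste0("LV", i, "_Employee_Name")]],
    Employee_Id       = data[[paste0("LV", i, "_Employee_Nbr")]],
    Department        = data[[paste0("LV", k, "_Department")]],
    Department_Number = data[[paste0("LV", k, "_Department_Nbr")]]
  ))
})
distinct(bind_rows(data_lst)) %>%
  filter(!is.na(Department))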

Sum consecutive strings in multiple columns

I have a table with flight ids, arrivals, and departures:
> test
   arrival departure flight_id
 1                 9      2233
 2                 8      1982
 3       1                2164
 4                 9      2081
 5                        2130
 6       2                2040
 7       9                2030
 8                        2130
 9                 4      3169
10       6                2323
11                 8      2130
12                        2220
13                        3169
14                 9      2204
15       1                1910
16                 2       837
17                        1994
18       9         8      1994
19                        1994
20                        1994
21       9         1      2338
22       1         8      1981
23       9                2365
24                 8      2231
25       9                2048
My objective is to count only the rows where arrival and departure are blank, and then to aggregate by flight_id. But there is a catch. I believe this cannot be done with table(), aggregate() or rle() because they do not account for breaks.
For example, only consecutive rows of a flight id where arrival == "" and departure == "" should be counted, and the count should start again from zero once a row of that flight id with a non-blank value occurs. NOTE: other flight ids appearing in between don't matter - each flight id should be treated separately, which is why flight 2130 is counted twice.
In other words, the resulting output from the test should look exactly like this:
output
  flight_id count
1      2130     2
2      2220     1
3      3169     1
4      1994     1
5      1994     2
Notice that flight id 1994 occurs three times where arrival and departure are blank but that there is a break in between at row 18. Therefore, the flight id must be counted twice.
I have tried writing a for loop but get an error that there is a missing value where TRUE/FALSE is needed:
raw_data = test
unique_id = unique(raw_data$flight_id)
output <- data.frame("flight_id" = integer(0), "count" = integer(0), stringsAsFactors = FALSE)
for (flight_id in unique_id)
{
  oneflight <- raw_data[which(raw_data$flight_id == flight_id), ]
  if(nrow(oneflight) >= 1){
    for(i in 2:nrow(oneflight)) {
      if(oneflight[i, "arrival"] == "" & oneflight[i, "departure"] == "") {
        new_row <- c(flight_id, sum(flight_id)[i])
        output[nrow(output) + 1, ] = new_row
      }
    }
  }
}
How could one improve the above code, or could someone suggest a quicker method, with dplyr for example? Here is a sample of the data:
> dput(test)
structure(list(arrival = c("", "", "1", "", "", "2", "9", "",
"", "6", "", "", "", "", "1", "", "", "9", "", "", "9", "1",
"9", "", "9"), departure = c("9", "8", "", "9", "", "", "", "",
"4", "", "8", "", "", "9", "", "2", "", "8", "", "", "1", "8",
"", "8", ""), flight_id = c(2233, 1982, 2164, 2081, 2130, 2040,
2030, 2130, 3169, 2323, 2130, 2220, 3169, 2204, 1910, 837, 1994,
1994, 1994, 1994, 2338, 1981, 2365, 2231, 2048)), .Names = c("arrival",
"departure", "flight_id"), row.names = c(NA, 25L), class = "data.frame")
A base R approach:
do.call("rbind", lapply(split(test, test$flight_id), function(x) {
  o <- rle(x[["arrival"]] == "" & x[["departure"]] == "")
  data.frame(flight_id = rep(unique(x[["flight_id"]]), sum(o$values)),
             count = o$lengths[o$values])
}))
#   flight_id count
#        1994     1
#        1994     2
#        2130     2
#        2220     1
#        3169     1
We split the dataframe by flight_id and for every group we apply rle to find continuous empty rows in arrival and departure and return a dataframe with the flight_id and the number of continuous empty rows in the group.
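Note that split() orders the groups by flight_id, so the rows come out sorted (the two 1994 runs first) rather than in the order of the desired output; the counts themselves match.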
If I understand your question, one trick you could use is to add a decimal to the flight_id, indicating a group.
For example, get an index vector of the blank rows
i <- which(test$arrival == "" & test$departure == "")
Then take cumsum(c(1, diff(i) != 1)) / 100 (or divide by a sufficiently large power of ten), add it to the flight IDs, and you have flight groups that can be counted with table().
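A working version of that idea has to compute the run within each flight (rows 5 and 8 of flight 2130 are consecutive for that flight even though other flights sit between them); here is a sketch, assuming fewer than 100 runs per flight:
blank <- test$arrival == "" & test$departure == ""
# per-flight counter that increases at every non-blank row of that flight
run <- ave(!blank, test$flight_id, FUN = cumsum)
keyed <- test$flight_id + run / 100   # e.g. 1994.00, 1994.01, ...
table(keyed[blank])
#    1994 1994.01    2130    2220 3169.01
#       1       2       2       1       1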
Here's a solution using data.table:
library(data.table)
flights <- test$flight_id[test$arrival == "" & test$departure == ""]
setDT(test)[flight_id %in% flights, grp := rleid(arrival == "", departure == "")][
  arrival == "" & departure == "", .(count = .N), .(flight_id, grp)]
#    flight_id grp count
# 1:      2130   1     2
# 2:      2220   3     1
# 3:      3169   3     1
# 4:      1994   3     1
# 5:      1994   5     2
Explanation:
First we obtain the flight_ids that have at least one record with empty arrival and departure values. Then we use this vector flights to subset the data and generate a run-length id column, grp, based on arrival == "" and departure == "". Lastly, we count the records (i.e. .N) where arrival == "" & departure == "", grouped by the columns flight_id and grp.
You can subsequently drop the grp column if needed.

Find levels of a factors that appear more than once

I have this dataframe:
data <- data.frame(countries = c(rep('UK', 5),
                                 rep('Netherlands 1a', 5),
                                 rep('Netherlands', 5),
                                 rep('USA', 5),
                                 rep('spain', 5),
                                 rep('Spain', 5),
                                 rep('Spain 1a', 5),
                                 rep('spain 1a', 5)),
                   var = rnorm(40))
        countries          var
1              UK  0.506232270
2              UK  0.976348808
3              UK -0.752151769
4              UK  1.137267199
5              UK -0.363406715
6  Netherlands 1a -0.800835463
7  Netherlands 1a  1.767724231
8  Netherlands 1a  0.810757929
9  Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11    Netherlands  0.428511920
12    Netherlands  0.835184425
13    Netherlands -0.198316780
14    Netherlands  1.108191193
15    Netherlands  0.946819500
16            USA  0.226786121
17            USA -0.466886468
18            USA -2.217910876
19            USA -0.003472937
20            USA -0.784264921
21          spain -1.418014562
22          spain  1.002412706
23          spain  0.472621627
24          spain -1.378960222
25          spain -0.197020702
26          Spain  1.197971896
27          Spain  1.227648883
28          Spain -0.253083684
29          Spain -0.076562960
30          Spain  0.338882352
31       Spain 1a  0.074459521
32       Spain 1a -1.136391220
33       Spain 1a -1.648418916
34       Spain 1a  0.277264011
35       Spain 1a -0.568411569
36       spain 1a  0.250151646
37       spain 1a -1.527885883
38       spain 1a -0.452190849
39       spain 1a  0.454168927
40       spain 1a  0.889401396
I want to be able to find levels of countries that appear in different forms more than once. Forms that levels of countries might appear in are:
lowercase, for example "spain"
titlecase, for example "Spain"
lowercase with a different word attached, for example "spain 1a"
titlecase with a different word attached, for example "Spain 1a"
So I need a function that returns a vector listing the levels of countries that appear more than once. For data, the vector that should be returned is:
"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"
Is it possible to make a function that would return this vector?
A quick solution that should meet all requirements (assuming that the country name is always the first element of your data$country entries):
# Note: levels() assumes data$countries is a factor; from R 4.0 on, data.frame()
# no longer creates factors by default, so you may need
# data$countries <- factor(data$countries) first.

# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)

# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)

# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
  levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a"
Update:
Assuming that the country name is not always to be found at the first position, you need to apply a different approach, which I adapted from another answer. Note that I slightly modified your sample data to clarify what I'm doing:
data <- data.frame(countries = c(rep('United Kingdom', 5),
                                 rep('united kingdom', 5),
                                 rep('Netherlands', 5),
                                 rep('Netherlands 1a', 5),
                                 rep('1a Netherlands', 5),
                                 rep('USA', 5),
                                 rep('spain', 5),
                                 rep('Spain', 5),
                                 rep('Spain 1a', 5),
                                 rep('spain 1a', 5)),
                   var = rnorm(50))
Now let's identify all country substrings that do NOT contain any numerics. The subsequent steps remain the same. Is that what you need?
# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
  # Identify, paste and return alphabetic-only components
  tmp <- grep("^[[:alpha:]]*$", i)
  if (length(tmp) == 1)
    return(i[tmp])
  else
    return(paste(i[tmp], collapse = " "))
})
# Identify duplicated country names
country.substr.dupl <- duplicated(country.substr)

# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
  levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "1a Netherlands" "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a" "united kingdom" "United Kingdom"
Why not use grep? The ignore.case argument is just what you need here.
> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
    if(!grepl("\\s|[0-9]", uch[i]))
      grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
  })
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands"    "spain"
# [4] "Spain"          "Spain 1a"       "spain 1a"
Here's my logic: take the unique factor levels of the column as a character vector. Then compare it with itself, looking only at those levels that do not contain a space or a digit. grep will catch those, but the other way around is a bit tougher. Then we just find the unique matches. So here's a function and a test run:
find.matches <- function(column)
{
  uch <- unique(as.character(column))
  found <- sapply(seq(uch), function(i){
    if(!grepl("\\s|[0-9]", uch[i]))
      grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
  })
  ff <- found[sapply(found, function(x) length(x) > 1)]
  unique(unlist(ff))
}

> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
                    y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a"    "a1"   "a 1b"
#
# $y
# [1] "fac"    "fac 1a" "tor"    "tor1a"
