Cleaning Origin and Destination data with duplicates but different factor levels - r

I have some GIS data with origins and destinations (OD) and information about the time of day of each OD. I intend to make a map of this and to color the ODs by the time-of-day information.
One thing is that some ODs appear in the data set with both day and night, possibly with origin and destination in a different order. I would like to mark those differently, e.g. "Day/Night".
Is there an easy way to do this? My MWE contains just one such OD, but I would need to identify it among several others. I can manage to find the duplicates regardless of the order, but I don't know how to find out whether both time cases are present and how to replace them with "Day/Night".
library(data.table)

Origin <- c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination <- c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris")
Time <- factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
dt <- data.table(Origin = Origin, Destination = Destination, Time = Time)

# duplicates regardless of order
dat.sort <- t(apply(dt[, .(Origin, Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast = TRUE), ]

You can do that using the dplyr package as follows; feel free to change the conditions to whatever fits your needs.
library(data.table)
library(dplyr)

# Creating data
dt <- data.table(
  Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
  Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
  Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
)
dt
#    Origin Destination  Time
#    London       Paris   Day
#     Paris      London Night
#     Italy      Norway   Day
#     Spain    Portugal  <NA>
#  Portugal       Spain  <NA>
#    Poland       Spain  <NA>
dt %>%
  # pmin and pmax are used to sort the two columns
  # in order to group by them regardless of their order
  group_by(Origin2 = pmin(Origin, Destination),
           Destination2 = pmax(Origin, Destination)) %>%
  mutate(count = n(),        # to check whether Origin/Destination are repeated or not
         row = row_number(), # placeholder to know whether it was the first or second occurrence
         # if not repeated, then Time = Day
         # if repeated and first occurrence, then Time = Day
         # if repeated and second occurrence, then Time = Night
         Time = case_when(count == 1 ~ "Day",
                          count == 2 & row == 1 ~ "Day",
                          count == 2 & row == 2 ~ "Night")) %>%
  ungroup() %>%
  select(Origin, Destination, Time)
# Origin Destination Time
# <chr> <chr> <chr>
# 1 London Paris Day
# 2 Paris London Night
# 3 Italy Norway Day
# 4 Spain Portugal Day
# 5 Portugal Spain Night
# 6 Poland Spain Day
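If the aim from the question is instead to label a pair as "Day/Night" whenever both a Day and a Night record (or an explicit Day/Night record) exist for it, the same pmin/pmax grouping idea can be adapted. This is a sketch against the OP's original dt, not part of the answer above:
dt %>%
  # flag the whole pair as "Day/Night" if it has both Day and Night (or Day/Night already)
  group_by(pair_min = pmin(Origin, Destination),
           pair_max = pmax(Origin, Destination)) %>%
  mutate(Time = if (all(c("Day", "Night") %in% Time) || "Day/Night" %in% Time)
                  "Day/Night" else as.character(Time)) %>%
  ungroup() %>%
  select(Origin, Destination, Time)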

Thanks for the dplyr solution by @Nareman Darwisch, which gave me the inspiration for my solution with data.table.
I am creating a new variable as a unique ID for each origin-destination pair:
dat.sort <- t(apply(dt[, .(Origin, Destination)], 1, sort))
dt.temp <- data.table(dat.sort)
dt.temp[, unique.name := paste(V1, V2)]
dt$unique.name <- factor(dt.temp$unique.name)
Then I can either count the unique occurrences of the factor by group, or check whether the group matches more than one of the three levels. Based on this I can recode the labels to the "Day/Night" level whenever the count is > 1 or the other condition is TRUE.
dt[, No.levels := length(unique(c(Time))), by = unique.name]
dt[, No.levels.logi := sum(c(Time) %in% c(1:3)) > 1, by = unique.name]
What I would like to understand is how I could use a logical condition that looks at the levels by group and compares them with the cases I want, e.g.
dt[,No.levels.logi:=sum(levels(Time) %in% c("Day", "Night"))>1 , by=unique.name]
But I guess the levels command always gives me all three levels.
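One way to express that comparison by group (a sketch, reusing the unique.name column created above): levels(Time) always returns every level defined on the column, so compare the values actually present in the group instead, e.g. via unique(as.character(Time)).
# TRUE for a pair only if both "Day" and "Night" actually occur within the group
dt[, No.levels.logi := sum(unique(as.character(Time)) %in% c("Day", "Night")) > 1,
   by = unique.name]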

If I understand correctly, the OP wants to
identify city pairs regardless of the order of origin and destination, e.g. London-Paris belongs to the same city pair as Paris-London
collapse separate rows if a city pair is operated Day and Night or Day/Night
or update the original dataset
This is what I would do:
library(data.table)
dt <- data.table(Origin, Destination, Time)
# add city pair as unique grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")][]
# identify city pairs which are operated day and night
pairs_DN <- dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair][(V1), .(Pair)]
# update original dataset by an update join
dt[pairs_DN, on = "Pair", Time := "Day/Night"][]
Origin Destination Time Pair
1: London Paris Day/Night London-Paris
2: Paris London Day/Night London-Paris
3: Lisbon Berlin Day Berlin-Lisbon
4: Madrid Lisbon Day/Night Lisbon-Madrid
5: Berlin Lisbon Day Berlin-Lisbon
6: London Paris Day/Night London-Paris
The key point is to identify the city pairs which fulfil the second requirement:
dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
Pair V1
1: London-Paris TRUE
2: Berlin-Lisbon FALSE
3: Lisbon-Madrid TRUE
So, there is no need to deal with factor levels. BTW, factor levels are an attribute of the whole column and do not change when subsetting or grouping. What does change is which of the levels are used in a subset or group.
pairs_DN contains the unique key of those city pairs
Pair
1: London-Paris
2: Lisbon-Madrid
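To illustrate the remark about factor levels, a small standalone example (x is just a throwaway vector for the demonstration):
x <- factor(c("Day", "Night", "Day/Night"))
levels(x[1])                # all three levels are kept after subsetting
unique(as.character(x[1]))  # only the value actually present: "Day"
droplevels(x[1])            # drops unused levels, if that is what you need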

Related

"Compare a variable by state abbreviations

How can I compare a variable by state abbreviations?
My data set currently has 5 variables. One of them is Location, and it is written like "Raleigh, NC".
I need to create a variable that contains the two-character state abbreviation for each observation, and afterwards another to group them by state. Each observation is a college, including its classification (private/public), in-state/out-of-state tuition, and location.
This should do it for you, if I understood your issue correctly.
Note: Please always share sample data using dput(your_dataset) or dput(head(your_dataset))
library(tidyverse)

d <- tibble(id = 1:3,
            Location = c("Newyork, NY", "Raleigh, NC", "Delhi, IN"))

d %>%
  separate(Location, into = c("city", "country")) %>%
  mutate_at(vars("city", "country"), str_trim)
# A tibble: 3 x 3
id city country
<int> <chr> <chr>
1 1 Newyork NY
2 2 Raleigh NC
3 3 Delhi IN
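The question also asks for grouping the observations by state afterwards. Building on the result above (the answer names the split-off column "country", although it holds the state abbreviation), a minimal sketch would be:
# count colleges per state, reusing the split from the answer
d %>%
  separate(Location, into = c("city", "country")) %>%
  group_by(country) %>%
  summarise(n = n())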

individuating rows based on conditions in nested data

I am new to R, and I am having some trouble manipulating the data in the way I need for my analysis. I would be grateful if anyone could help, because this is essential for my research.
I already asked a similar question, but the answer I got did not fully address my problem. I will try to be clearer this time to see if anyone can help.
My data looks something like this:
df <- data.frame(
  "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA"),
  "Partner" = c("EU", "EU", "EU", "EU", "EU", "EU"),
  "Product.cat" = c("1", "11", "111", "112", "12", "2"),
  "Product Description" = c("Food", "Fruit", "Apple", "Banana", "Meat", "Manifactured"),
  "Year" = c(1970, 1970, 1970, 1970, 1970, 1970),
  "trade value" = c(100, 50, 30, 20, 50, 220),
  stringsAsFactors = FALSE)
I have country-year observations about trade.
The column 'Product.cat' indicates what kind of commodity is exported. The more digits the Product.cat has, the more disaggregated the trade information is.
For example, Product.cat 111 (e.g. apple) and 112 (e.g. banana) are sub-categories of product category 11 (e.g. fruit).
The same holds for the higher levels of aggregation: product category 11 (fruit) is a sub-category of Product.cat 1 (food), together with Product.cat 12 (meat).
Note that data in the lower categories is nested in the higher levels of aggregation. Hence the value of Product.cat 11 (50) equals the value of Product.cat 111 (30) plus Product.cat 112 (20).
To do my analysis I need to identify those values that are not reported at the most disaggregated level possible, i.e. I need to identify the data not reported at the 3-digit level.
My problem is that for some country-year observations I have data reported accurately at all levels of aggregation (e.g. 1, 11, 111, 112), while for others I only have data at a higher level of aggregation (e.g. 12 and 2). For instance, in my example, I only have Product.cat 12 (meat), but no data on what kind of meat: Product.cat 121 (pork) or Product.cat 122 (veal).
Similarly, in the example, data on Product.cat 2 (manufacturing) is not reported at lower levels:
we do not know whether it is Product.cat 21 (clothing) or Product.cat 22 (wood products).
In other words, I have data reported at the 2-digit (12) or 1-digit level (2) that could be reported at the 3-digit level. Note that every category should be disaggregated to the 3-digit level.
What I would like to do is find a way to identify all the data reported exclusively at a higher level of aggregation and change their Product.cat name by adding an "m" to the end.
After manipulation, Product.cat 12 should become 12m, to indicate that the data was reported only at the 2-digit level.
Similarly, I would like to identify exports that are reported only at the first digit: Product.cat 2 should become 2mm to reflect that the data was reported only at the 1-digit level.
To be sure, only the data for which I have information exclusively at a higher level of aggregation - i.e. 12 and 2 in the example - should include "m"s.
For instance, I do not want 1mm, since I have data at a lower level of aggregation (11, 12). Similarly, I do not want 11m, because I have data at lower levels of aggregation (111, 112). What I would like to have is 12m and 2mm, because that data is reported only at a higher level of aggregation (12 and 2).
I know that this is a very specific question, but I would really appreciate it if anyone could help.
Note: in the real dataset, due to measurement errors, the sum of the disaggregated values does not always add up perfectly to the higher level of aggregation (for instance, 111 + 112 can be > 11). Hence, ideally, I am looking for a function that decides when to add the "m" based on the number of digits, grouped by country, partner and year, rather than on the sum of the traded value.
I really thank everyone who could help me with this; it would be a huge step forward for my research.
---- attempts
I have been working on this function, but it does not seem to do what I am looking for. Maybe someone can spot what is going wrong:
fillLevel <- function(x, width = 3, fill = "m"){
  # split the codes by their first digit
  sp <- split(x, substr(x, 1, 1))
  sp <- lapply(seq_along(sp), function(i){
    n <- nchar(sp[[i]])
    # only if all codes in the group are shorter than 3 digits,
    # right-pad the longest one(s) with "m" up to width 3
    if(all(n < 3)){
      j <- which(n == max(n))
      sp[[i]][j] <- gsub(" ", "m", formatC(sp[[i]][j], width = -3))
    }
    sp[[i]]
  })
  unname(unlist(sp))
}

library(dplyr)
df <- df %>% mutate(prdcat2 = fillLevel(Product.cat))
As you can see, it only identifies 2mm but not 12m. Moreover, when I run it on more complex codes it messes up the order of my data. I think this relates to sp <- lapply(seq_along(sp), ...), but I am not sure how to go about it.
Best
Here's one way to do it:
library(data.table)
setDT(df)
# tag levels
df[, lvl := nchar(Product.cat)]
df[lvl < 3L, has_subcat := FALSE]
# use level-3 observations to flag level-2s as okay
df[
  df[lvl == 3, .(Reporter, Partner, Year, Product.cat = substr(Product.cat, 1, 2))],
  on = .(Reporter, Partner, Year, Product.cat),
  has_subcat := TRUE
]
# use level-2 observations to flag level-1s as okay
df[
  df[lvl == 2, .(Reporter, Partner, Year, Product.cat = substr(Product.cat, 1, 1))],
  on = .(Reporter, Partner, Year, Product.cat),
  has_subcat := TRUE
]
# create new cat, flagging observations with no subcategories
df[, newcat := Product.cat]
df[has_subcat == FALSE, newcat := paste0(Product.cat, strrep("m", 3-lvl))]
Reporter Partner Product.cat Product.Description Year trade.value lvl has_subcat newcat
1: USA EU 1 Food 1970 100 1 TRUE 1
2: USA EU 11 Fruit 1970 50 2 TRUE 11
3: USA EU 111 Apple 1970 30 3 NA 111
4: USA EU 112 Banana 1970 20 3 NA 112
5: USA EU 12 Meat 1970 50 2 FALSE 12m
6: USA EU 2 Manifactured 1970 220 1 FALSE 2mm
I'm assuming that this should be done separately per Reporter-Partner-Year.
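For comparison, roughly the same logic can be written with dplyr on the question's original df: within each Reporter-Partner-Year group, a code is padded with "m" only if no longer code starts with it. This is a sketch of an alternative, not part of the answer above:
library(dplyr)
library(stringr)

df %>%
  group_by(Reporter, Partner, Year) %>%
  mutate(
    # a category has a subcategory if some longer code in the same group starts with it
    has_subcat = sapply(Product.cat, function(p)
      any(nchar(Product.cat) > nchar(p) & startsWith(Product.cat, p))),
    # pad only the codes without subcategories up to 3 characters with "m"
    newcat = ifelse(has_subcat,
                    Product.cat,
                    str_pad(Product.cat, width = 3, side = "right", pad = "m"))
  ) %>%
  ungroup()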

If value in a column starts with...mutate another column with given text, in R

I'm trying to build an if function that allows me to mutate the "city" column of a dataframe with a certain city name if the value in the "zipcode" column starts with a certain number.
For example: If zipcode starts with 1, mutate city column value with "NYC",
else if zipcode starts with 6, mutate city column value with "Chicago",
else if zipcode starts with 2, mutate city column value with "Boston",
and so on.
From:
city zipcode
NYC 11211
DC 20910
NYC 11104
NA 11106
NA 2008
NA 60614
To:
city zipcode
NYC 11211
DC 20910
NYC 11104
NYC 11106
DC 2008
Chicago 60614
It's a way to deal with NA values: the rule would just rewrite the same city for the values where it is already present, and fill in the city name where there is an NA value.
The dataframe is named data.frame and the columns are zipcode and city.
Both of them are factors and have to remain so for my further models.
I want to directly mutate the dataframe, as I will need it for further use.
PS: Sorry for the bad writing. I'm new to the community.
Thanks in advance!
Here's a solution that might work for you.
Full code:
# load library
library(tidyverse)

# create the sample dataframe
df <- tribble(~city, ~zipcode,
              'NYC', 11211,
              'DC',  20910,
              'NYC', 11104,
              NA,    11106,
              NA,     2008,
              NA,    60614)

# change the NAs to the appropriate values
df <- df %>%
  mutate(
    city = case_when(
      str_sub(zipcode, 1, 1) == '1' ~ 'NYC',
      str_sub(zipcode, 1, 1) == '2' ~ 'DC',
      str_sub(zipcode, 1, 1) == '6' ~ 'Chicago',
      TRUE ~ city
    )
  )

# convert everything to factors
df <- df %>%
  mutate(
    city = as.factor(city),
    zipcode = as.factor(zipcode)
  )

# preview the output
glimpse(df)
The output of the glimpse() is:
Observations: 6
Variables: 2
$ city <fct> NYC, DC, NYC, NYC, DC, Chicago
$ zipcode <fct> 11211, 20910, 11104, 11106, 2008, 60614
The trick that I used was first keep everything as a string or number, fill in the missing values, and then convert to factor.
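If the columns already arrive as factors, as in the question, a version-safe variant of the same trick (a sketch with hypothetical sample data) is to convert to character first, recode, and only then convert back:
library(tidyverse)

# sample data where both columns are already factors
df_fct <- tibble(city = factor(c('NYC', 'DC', 'NYC', NA, NA, NA)),
                 zipcode = factor(c(11211, 20910, 11104, 11106, 2008, 60614)))

df_fct <- df_fct %>%
  mutate(
    city = as.character(city),   # work on characters while recoding
    city = case_when(
      str_sub(as.character(zipcode), 1, 1) == '1' ~ 'NYC',
      str_sub(as.character(zipcode), 1, 1) == '2' ~ 'DC',
      str_sub(as.character(zipcode), 1, 1) == '6' ~ 'Chicago',
      TRUE ~ city
    ),
    city = as.factor(city)       # back to factor for the later models
  )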

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
for (j in 1:nrow(a)) {.
if (grepl(data.a$geocode_selector[j], b$geocode[i]) == TRUE) {
b$county_name[i] <- a$county_name[j]
}
}
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
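If you prefer to name the merge key explicitly rather than rely on the common-column default, the same call can be written like this:
# equivalent merge, naming the key column explicitly
b2 <- merge(b, a, by = "geocode_selector", all.x = TRUE)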
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
a <- data_frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
                county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data_frame(geocode = c("360050002001002", "360850323001019"),
                jobs = c("4", "204"))

b %>%
  mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
  inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond
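If rows of b without a matching county should be kept as well (with NA in county_name), the same pipeline works with left_join instead of inner_join; a small variant:
b %>%
  mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
  left_join(a, by = "geocode_selector")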

Q: How to fill a new column in a data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that accounts for whether a politician "i" remains in the same party or defects, for a given legislature "l". Politicians and parties are identified by indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner", "Winner", "Winner", "Loser", NA) # NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5), rep(5, 5), rep(6, 5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex <- c("Male", "Female", "Male", "Female", "Male")
Election <- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970, 5), rep(1975, 5), rep(1980, 5))

d <- data.frame(names = factor(rep(names, 6)),
                party.affiliation = c(rep(Parti.affiliation, 5), NA, "New party", "Loser", "Winner", NA),
                legislature = Legislature,
                selection = selection,
                gender = rep(sex, 6),
                Election.date = Election)

## generating an id for politician and party affiliation
library(dplyr)
d$id_pers <- paste(d$names, sep = "")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
d$party.affiliation1 <- as.numeric(d$party.affiliation)
The expected outcome should show the following: if a politician (identified by the column "id_pers") has changed their value in the column "party.affiliation1", a value of 1 should be assigned in a new column called "switch", otherwise 0. The same procedure should be applied to every politician in the dataset, so the expected outcome should look like this:
d["switch"] <- c(1, rep(0, 4), NA, rep(0, 6), rep(NA, 6), 1, rep(0, 5), rep(0, 5), 1) # 0 = remains in the same party / 1 = switched party affiliation
As an example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. However, "Anrita" changes her party affiliation in the 6th legislature, where she was a candidate for party '2'. Therefore, the new column "switch" should contain the value '1' to reflect Anrita's change of party affiliation, and '0' to show that she did not change her party affiliation during the first 5 legislatures.
I have tried several approaches (e.g. loops). I found this strategy the simplest, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, d$party.affiliation1[-1L] != d$party.affiliation1[-length(d$party.affiliation1)] & d$party.affiliation1 != 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d %>%
  group_by(id_pers) %>%
  mutate(switch = as.numeric((party.affiliation1 - lag(party.affiliation1)) != 0))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df <- d %>%
  group_by(id_pers) %>%
  mutate(switch = ifelse((party.affiliation1 - lag(party.affiliation1, default = -99)) > 90, 99,
                         ifelse(party.affiliation1 - lag(party.affiliation1) != 0, 1, 0)))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
EDITED
In order to identify the first entry of the political affiliation for each person and assign the value 99 to it, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
