Return values not found for each ID in R

For each vendor, I want to identify the countries that do not appear for that vendor in the Vendors data frame.
I have a data frame (Vendors) that looks like this:
Vendor_ID  Vendor       Country_ID  Country
1          Burger King  2           USA
1          Burger King  3           France
1          Burger King  5           Brazil
1          Burger King  7           Turkey
2          McDonald's   5           Brazil
2          McDonald's   3           France
Vendors <- data.frame (
Vendor_ID = c("1", "1", "1", "1", "2", "2"),
Vendor = c("Burger King", "Burger King", "Burger King", "Burger King", "McDonald's", "McDonald's"),
Country_ID = c("2", "3", "5", "7", "5", "3"),
Country = c("USA", "France", "Brazil", "Turkey", "Brazil", "France"))
and I have another data frame (Countries) that looks like this:
Country_ID  Country
2           USA
3           France
5           Brazil
7           Turkey
Countries <- data.frame (Country_ID = c("2", "3", "5", "7"),
Country = c("USA", "France", "Brazil", "Turkey"))
Desired Output:
Vendor_ID  Vendor      Country_ID  Country
2          McDonald's  2           USA
2          McDonald's  7           Turkey
Can someone please tell me how this could be achieved in R? I tried subset and anti-join, but the results are not correct.

In base R we could first split the data by vendor
VenList <- split(df, df$Vendor)
and then check which countries are missing for each vendor and return them.
res <- lapply(VenList, function(x){
# Identify the countries this vendor is missing
tmp1 <- df2[!(df2[, "Country"] %in% x[, "Country"]), ]
# Nothing is missing for this vendor
if(nrow(tmp1) == 0) return(NULL)
# Repeat the vendor ID and name once per missing country
tmp2 <- x[rep(1, nrow(tmp1)), 1:2]
# Combine and reset the row names
out <- cbind(tmp2, tmp1)
rownames(out) <- NULL
out
})
# Which yields
res
# $BurgerKing
# NULL
#
# $`McDonald's`
# Vendor_ID Vendor Country_ID Country
# 1 2 McDonald's 2 USA
# 2 2 McDonald's 7 Turkey
# If you want it as one df you could then flatten to
do.call(rbind, res)
# Vendor_ID Vendor Country_ID Country
# McDonald's.1 2 McDonald's 2 USA
# McDonald's.2 2 McDonald's 7 Turkey
Data
df <- read.table(text = "1 BurgerKing 2 USA
1 BurgerKing 3 France
1 BurgerKing 5 Brazil
1 BurgerKing 7 Turkey
2 McDonald's 5 Brazil
2 McDonald's 3 France", quote = "", col.names = c("Vendor_ID", "Vendor", "Country_ID", "Country"))
df2 <- read.table(text = "2 USA
3 France
5 Brazil
7 Turkey", col.names = c("Country_ID", "Country"))

Solution using expand.grid to create all possible Vendor - Country combinations (assuming that "Countries" has only one entry per country) and then using dplyr to join "Vendors" and find "missing countries"
Edit: The last two lines (left_joins) are only needed to "translate" the ID columns into "text":
library(dplyr)
expand.grid(Vendor_ID=unique(Vendors$Vendor_ID), Country_ID=Countries$Country_ID, stringsAsFactors=FALSE) %>%
left_join(Vendors, by=c("Vendor_ID", "Country_ID")) %>%
filter(is.na(Vendor)) %>%
select(Vendor_ID, Country_ID) %>%
left_join(Countries, by="Country_ID") %>%
left_join(unique(Vendors[, c("Vendor_ID", "Vendor")]), by="Vendor_ID")
Returns
Vendor_ID Country_ID Country Vendor
1 2 2 USA McDonald's
2 2 7 Turkey McDonald's
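Since the question mentions trying an anti-join, here is a minimal sketch of that route, assuming the Vendors and Countries data frames exactly as defined in the question. It builds every vendor/country pair with base merge(by = NULL) and then uses dplyr's anti_join() to keep the pairs that do not occur in Vendors. The object names vendor_list, all_pairs and missing_countries are made up for illustration.

```r
library(dplyr)

Vendors <- data.frame(
  Vendor_ID = c("1", "1", "1", "1", "2", "2"),
  Vendor = c("Burger King", "Burger King", "Burger King", "Burger King",
             "McDonald's", "McDonald's"),
  Country_ID = c("2", "3", "5", "7", "5", "3"),
  Country = c("USA", "France", "Brazil", "Turkey", "Brazil", "France"))
Countries <- data.frame(
  Country_ID = c("2", "3", "5", "7"),
  Country = c("USA", "France", "Brazil", "Turkey"))

# One row per vendor (hypothetical helper object)
vendor_list <- unique(Vendors[, c("Vendor_ID", "Vendor")])

# Cartesian product: every vendor paired with every country
all_pairs <- merge(vendor_list, Countries, by = NULL)

# Keep only the pairs that never occur in Vendors
missing_countries <- anti_join(all_pairs, Vendors,
                               by = c("Vendor_ID", "Country_ID"))
missing_countries
```

An anti-join applied directly to Vendors and Countries cannot work on its own, because every country is used by at least one vendor; the cross join first is what makes the comparison per vendor.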

Related

New Column Based on Conditions

To set the scene, I have a set of data where two columns of the data have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"), City=c("Apple", "Paris", "Orange", "Berlin"), Fruit=c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
Name City Fruit
1 Bob Apple London
2 John Paris Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
Cities
1 Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place using df2. I am a bit new to R so I don't know how this would work.
I don't really know where to even start with this sort of a problem. My full dataset is much larger and it would be good to have an efficient method of unpicking this issue!
If only the 'City' and 'Fruit' values are swapped, we can loop over the rows, build a logical vector marking the values that match 'Cities' in 'df2', and reassemble each row so that the matched value comes second.
df1[] <- t(apply(df1, 1, function(x)
{
i1 <- x %in% df2$Cities
i2 <- !i1
x1 <- x[i2]
c(x1[1], x[i1], x1[2])}))
Output:
> df1
Name City Fruit
1 Bob London Apple
2 John Paris Pear
3 Mark Madrid Orange
4 Will Berlin Orange
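Because the question asks for something efficient on a much larger dataset, a vectorised swap may be worth considering instead of a row-wise apply. The following is only a sketch under the assumption that, on every mixed-up row, the City value is not in df2$Cities while the Fruit value is; the helper names swap and tmp are illustrative.

```r
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"),
                  City = c("Apple", "Paris", "Orange", "Berlin"),
                  Fruit = c("London", "Pear", "Madrid", "Orange"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"),
                  stringsAsFactors = FALSE)

# Rows where City is not a known city but Fruit is
swap <- !(df1$City %in% df2$Cities) & (df1$Fruit %in% df2$Cities)

# Exchange the two values on those rows only
tmp <- df1$City[swap]
df1$City[swap] <- df1$Fruit[swap]
df1$Fruit[swap] <- tmp
df1
```

Rows where neither value is a known city are left untouched, so they can be inspected separately.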
Using the dplyr package, this solution looks up the City and Fruit values in df1 and takes whichever one exists in the df2 cities list.
If neither of the two is a city name, an empty string is returned; you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
df1$Fruit%in% df2$Cities ~ df1$Fruit,
TRUE ~ "")
Output: a new column is created, as you wanted, with the city name for each row.
> df1
Name City Fruit corrected_City
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(1:3, ~case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
Name City Fruit New_Col
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin

How do I rename the values in my column? I have misspelt them and can't rename them in R or Colab

I have a data frame that was given to me. Under the column titled state, there are two values with the same name but different capitalisation, i.e. one is "London" and the other is "LONDON". How would I be able to rename "LONDON" to "London" so that they are totalled together and not separately? As a reminder, I am trying to change the values in the column, not the name of the column.
You can use the following code, where df is your current data frame in which you want to replace "LONDON" with "London":
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil","US", "Brazil", "UK", "Germany"),
State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
Candidate = c(1:8))
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
Then run the following code to write "London" to every row where State is equal to "LONDON":
df[df$State == "LONDON", "State"] <- "London"
Now the output will be as
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
Maybe you could try using the case_when function from dplyr. I would do something like this:
library(dplyr)
mutate(data, State_def=case_when(State=="LONDON" ~ "London",
State=="London" ~ "London",
# keep every other state unchanged
TRUE ~ State))
I might misunderstand, but I think it should be as simple as this:
x$state <- sub( "LONDON", "London", x$state, fixed=TRUE )
This should change LONDON to London
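To connect the rename back to the stated goal of totalling the two spellings together, here is a small sketch. It reuses the example df from the first answer and assumes, purely for illustration, that Candidate is the quantity to be summed; the name totals is made up.

```r
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil", "US", "Brazil", "UK", "Germany"),
                 State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
                 Candidate = 1:8,
                 stringsAsFactors = FALSE)

# Match "London" regardless of case and write back one canonical spelling
df$State[toupper(df$State) == "LONDON"] <- "London"

# Total Candidate per State; both London rows now fall into one group
totals <- aggregate(Candidate ~ State, data = df, FUN = sum)
totals
```

Comparing on toupper() also catches variants such as "london", which an exact match on "LONDON" would miss.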

Function to subset dataframe on optional arguments

I have a dataframe as follows:
df1 <- data.frame(
Country = c("France", "England", "India", "America", "England"),
City = c("Paris", "London", "Mumbai", "Los Angeles", "London"),
Order_No = c("1", "2", "3", "4", "5"),
delivered = c("Yes", "no", "Yes", "No", "yes"),
stringsAsFactors = FALSE
)
and multiple other columns as well (around 50)
I want to write a function which is generic and can take in as many parameters as the user wants and return a subset of only those specific columns. So the user should technically be able to pass 1 column or 30 columns to get the result back from function
With what I was able to find online on optional arguments, I wrote this following code but I am running into issues. Can anyone help me out here?
SubsetFunction <- function(inputdf, ...)
{
params <- vector(...)
subset.df <- subset(inputdf, select = params)
return(subset.df)
}
This is the error I am getting -
error in vector(...) :
vector: cannot make a vector of mode 'Country'.
The use of vector(...) causes the problem here. The ellipsis has to be converted into a list instead. Therefore, to obtain a vector from the three-dot parameters, the seemingly awkward construction unlist(list(...)) should be used instead of vector(...):
SubsetFunction <- function(inputdf, ...){
params <- unlist(list(...))
subset.df <- subset(inputdf, select=params)
return(subset.df)
}
This allows SubsetFunction() to be called with an arbitrary number of parameters:
> SubsetFunction(df1, "City")
# City
#1 Paris
#2 London
#3 Mumbai
#4 Los Angeles
#5 London
> SubsetFunction (df1, "City", "delivered")
# City delivered
#1 Paris Yes
#2 London no
#3 Mumbai Yes
#4 Los Angeles No
#5 London yes
We can use the missing() function here to check whether an argument was supplied:
select_cols <- function(df, cols) {
if(missing(cols))
df
else
df[cols]
}
select_cols(df1, c("Country", "City"))
# Country City
#1 France Paris
#2 England London
#3 India Mumbai
#4 America Los Angeles
#5 England London
select_cols(df1)
# Country City Order_No delivered
#1 France Paris 1 Yes
#2 England London 2 no
#3 India Mumbai 3 Yes
#4 America Los Angeles 4 No
#5 England London 5 yes
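The two ideas above (variadic column names and a fallback when nothing is passed) can be combined. The following is a minimal sketch rather than a definitive implementation; the function name select_columns and its error message are hypothetical, and it assumes column names are passed as character strings.

```r
df1 <- data.frame(
  Country = c("France", "England", "India", "America", "England"),
  City = c("Paris", "London", "Mumbai", "Los Angeles", "London"),
  Order_No = c("1", "2", "3", "4", "5"),
  delivered = c("Yes", "no", "Yes", "No", "yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: accepts any number of column names via ...
select_columns <- function(inputdf, ...) {
  cols <- c(...)                      # NULL when no names are passed
  if (length(cols) == 0) return(inputdf)
  unknown <- setdiff(cols, names(inputdf))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  # drop = FALSE keeps a data frame even for a single column
  inputdf[, cols, drop = FALSE]
}

select_columns(df1, "City")
select_columns(df1, "City", "delivered")
select_columns(df1, c("Country", "Order_No"))
```

Checking the names up front gives a readable error for a misspelt column, which can matter with around 50 columns.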

Create a ggplot with grouped factor levels

This is a variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
location = factor(c("California", "Oregon", "Mexico",
"Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them as the distinction between states is useful for data analysis. I would like to have, however, a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot (Canada, Mexico, and US) with the US bar divided into three states.
If the state factor levels had the "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other")) %>%
mutate(State = as.factor(State)
%>% fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?
If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format: a column for country and a column for state/province.
There are many ways to do this yourself. Many times your data will already be in this format. If it isn't, the easiest way to fix it is to do a join to a table which maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
country = c('US', 'US', 'US'),
stringsAsFactors = FALSE)
df %>%
left_join(state_mapping, by = c('location' = 'state')) %>%
# Locations without a mapping are already countries, so reuse them
mutate(country = if_else(is.na(country),
as.character(location),
country))
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
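For completeness, here is a sketch of the final plotting step once a country column exists. It assumes the mapping-table approach above, stores location as character rather than factor to keep the join simple, and uses made-up names (with_country, p) and axis labels.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(respondent = factor(1:7),
                 location = c("California", "Oregon", "Mexico",
                              "Texas", "Canada", "Mexico", "Canada"),
                 stringsAsFactors = FALSE)
state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
                            country = "US",
                            stringsAsFactors = FALSE)

# Add a country column; unmapped locations are already countries
with_country <- df %>%
  left_join(state_mapping, by = c("location" = "state")) %>%
  mutate(country = if_else(is.na(country), location, country))

# One bar per country, stacked and coloured by the original location
p <- ggplot(with_country, aes(x = country, fill = location)) +
  geom_bar() +
  labs(x = "Country", y = "Respondents", fill = "Location")
p
```

Because fill uses location, the Canada and Mexico bars each get a single colour while the US bar is split into its three states.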

R column mapping

How can I map a column of one CSV file to a column of another CSV file in R, given that both have the same data type?
For example, the first column of data frame A contains text with a country name in it, while a column of the second data frame B contains a standard list of all countries. I have to map every row of the first data frame to the standard country column.
For example, the column (location) of data frame A consists of 10000 rows of data like this:
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have another column (country) in data frame B:
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location column matches the country column, I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore
Initially, it looked like a trivial question but it's not. This approach works like this:
1. We convert the location string into a vector using unlist and strsplit.
2. Then we check whether any string in the vector is present in the country column. If it is, we store the country name in res; if not, we store notfound.
3. Finally, we check whether res contains a country name or not.
df1 <- data.frame(location = c('Sydney, Australia',
'Aarhus C, Central Region, Denmark',
'Auckland, New Zealand',
'Mumbai Area, India',
'Singapore'),stringsAsFactors = F)
df2 <- data.frame(country = c('India',
'USA',
'New Zealand',
'UK',
'Singapore',
'Denmark',
'China'),stringsAsFactors = F)
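library(stringr) # provides str_trim
get_values <- function(i)
{
# Split the location on commas and trim surrounding whitespace
val <- unlist(strsplit(i, split = ','))
val <- str_trim(val)
res <- c()
for(j in val)
{
if(j %in% df2$country) res <- append(res, j)
else res <- append(res, 'notfound')
}
# No part matched a country: keep the original location
if(all(res == 'notfound')) return (i)
# Otherwise return the first matching country
else return (res[res!='notfound'][1])
}
df1$location2 <- sapply(df1$location, get_values, USE.NAMES = FALSE)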
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore
A solution using tidyverse. First, please convert your col2 to character by setting stringsAsFactors = FALSE because that is easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse.
library(tidyverse)
df3 <- df1 %>%
mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
col2 = ifelse(is.na(Country), col2, Country)) %>%
select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
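One caveat with joining the country names by "|" is that a name can match inside a longer word (for example "India" inside "Indiana"). A possible refinement, sketched here in base R, wraps the alternation in word boundaries. It assumes the country names contain no regex metacharacters; the names pattern, hit and col3 are illustrative.

```r
df1 <- data.frame(col1 = 1:5,
                  col2 = c("Sydney, Australia", "Aarhus C, Central Region, Denmark",
                           "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = 1:7,
                  col2 = c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
                  stringsAsFactors = FALSE)

# Alternation of all country names, wrapped in word boundaries
pattern <- paste0("\\b(", paste(df2$col2, collapse = "|"), ")\\b")

# Position of the first match in each string, or -1 when there is none
hit <- regexpr(pattern, df1$col2, perl = TRUE)

# Start from the original text and overwrite only the rows that matched
col3 <- df1$col2
col3[hit > 0] <- regmatches(df1$col2, hit)
df1$col3 <- col3
df1
```

Rows without any match, such as "Sydney, Australia", keep their original text, which mirrors the behaviour of the str_extract version.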
We can also start with df1, use separate_rows to separate the country name. After that, use semi_join to check if the country names are in df2. Finally, we can combine the data frame with the original df1 by rows, and then filter the first one for each id in col1. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
separate_rows(col2, sep = ", ") %>%
semi_join(df2, by = "col2") %>%
bind_rows(df1) %>%
group_by(col1) %>%
slice(1) %>%
ungroup() %>%
arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
stringsAsFactors = FALSE)
df2 <- data.frame(col1=1:7,
col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
stringsAsFactors = FALSE)
If you are looking for the countries, and they come after the cities then you can do something like this.
transform(df1,col3= sub(paste0(".*,\\s*(",paste0(df2$col2,collapse="|"),")"),"\\1",col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
> A=sub(".*,\\s(.*)","\\1",df1$col2)
> B=sapply(A,grep,df2$col2,value=T)
> transform(df1,col3=replace(A,!lengths(B),col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
