Function to subset dataframe on optional arguments - r

I have a dataframe as follows:
df1 <- data.frame(
  Country = c("France", "England", "India", "America", "England"),
  City = c("Paris", "London", "Mumbai", "Los Angeles", "London"),
  Order_No = c("1", "2", "3", "4", "5"),
  delivered = c("Yes", "no", "Yes", "No", "yes"),
  stringsAsFactors = FALSE
)
There are multiple other columns as well (around 50).
I want to write a generic function that can take as many parameters as the user wants and return a subset containing only those specific columns. So the user should be able to pass anywhere from 1 to 30 columns and get the result back from the function.
Based on what I was able to find online about optional arguments, I wrote the following code, but I am running into issues. Can anyone help me out here?
SubsetFunction <- function(inputdf, ...) {
  params <- vector(...)
  subset.df <- subset(inputdf, select = params)
  return(subset.df)
}
This is the error I am getting -
Error in vector(...) :
  vector: cannot make a vector of mode 'Country'.

The use of vector(...) is causing the problem here: vector() expects a mode (such as "character") as its first argument, not column names. The ellipsis has to be collected into a list instead. Therefore, in order to finally obtain a vector out of the three-dot parameters, the seemingly awkward construction unlist(list(...)) should be used instead of vector(...):
SubsetFunction <- function(inputdf, ...) {
  params <- unlist(list(...))
  subset.df <- subset(inputdf, select = params)
  return(subset.df)
}
This makes it possible to call the function SubsetFunction() with an arbitrary number of parameters:
> SubsetFunction(df1, "City")
# City
#1 Paris
#2 London
#3 Mumbai
#4 Los Angeles
#5 London
> SubsetFunction(df1, "City", "delivered")
# City delivered
#1 Paris Yes
#2 London no
#3 Mumbai Yes
#4 Los Angeles No
#5 London yes
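
If the tidyverse is an option, a sketch of the same idea with dplyr: select() already forwards the three-dot arguments to column selection, so no conversion is needed (SubsetFunction2 is a hypothetical name; assumes dplyr is installed):
library(dplyr)

# select() accepts any number of columns through ..., quoted or unquoted
SubsetFunction2 <- function(inputdf, ...) {
  select(inputdf, ...)
}

SubsetFunction2(df1, City, delivered)  # same columns as the subset() version above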

We can use the missing() function here to check whether the argument was supplied:
select_cols <- function(df, cols) {
  if (missing(cols))
    df
  else
    df[cols]
}
select_cols(df1, c("Country", "City"))
# Country City
#1 France Paris
#2 England London
#3 India Mumbai
#4 America Los Angeles
#5 England London
select_cols(df1)
# Country City Order_No delivered
#1 France Paris 1 Yes
#2 England London 2 no
#3 India Mumbai 3 Yes
#4 America Los Angeles 4 No
#5 England London 5 yes
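
The two answers can also be combined into one sketch: take the columns through ... and fall back to the whole data frame when none are supplied (select_cols2 is a hypothetical name):
select_cols2 <- function(df, ...) {
  cols <- c(...)                        # character vector, or NULL if empty
  if (length(cols) == 0) df else df[cols]
}

select_cols2(df1)                       # all columns
select_cols2(df1, "Country", "City")    # only the named columns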

Related

How to remove values in a column based on other column values equaling the column values above it?

I am currently coding in R and merged two dataframes together so I could have all the information in one place, but I don't want the column "Cost" to be duplicated multiple times (the duplication happened because of the unique values in the last three columns). I want the cost 100 to appear only in the first row, and NA in every other row where the columns "State", "Market", "Date", and "Cost" are the same as above. I attached what the dataframe looks like and what I want it to be changed to. Thank you!
[Screenshots attached: what it currently looks like, and what it should look like.]
Please use indexing, as in this example:
name_of_your_dataset[nrow_init:nrow_fin, ncol] <- NA
In your case, assuming your dataset is named 'data':
data[2:4, 4] <- NA
Here is a solution using duplicated with your dataframe (df):
  State  Market       Date Cost    Word format               Type
1    AZ Phoenix 10-20-2020  100   HELLO     AM     Sports related
2    AZ Phoenix 10-21-2020  100 GOODBYE     PM Non Sports related
3    AZ Phoenix 10-22-2020  100     YES     FM            Country
4    AZ Phoenix 10-23-2020  100    NONE     CM               Rock
Set duplicates to NA
df$Cost[duplicated(df$Cost)] <- NA
Output:
State Market Date Cost Word format Type
1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
3 AZ Phoenix 10-22-2020 NA YES FM Country
4 AZ Phoenix 10-23-2020 NA NONE CM Rock
The column Date differs between rows, so I think you want to replace duplicated Cost values within every State and Market combination.
library(dplyr)
df <- df %>%
  group_by(State, Market) %>%
  mutate(Cost = replace(Cost, duplicated(Cost), NA)) %>%
  ungroup()
df
# State Market Date Cost Word format Type
# <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
#2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
#3 AZ Phoenix 10-22-2020 NA YES FM Country
#4 AZ Phoenix 10-23-2020 NA NONE CM Rock
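For reference, a base R sketch of the same grouped replacement using ave(), so no packages are needed (assumes Cost is numeric, as in the data below):
df$Cost <- ave(df$Cost, df$State, df$Market,
               FUN = function(x) replace(x, duplicated(x), NA))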
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(State = c("AZ", "AZ", "AZ", "AZ"), Market = c("Phoenix",
"Phoenix", "Phoenix", "Phoenix"), Date = c("10-20-2020", "10-21-2020",
"10-22-2020", "10-23-2020"), Cost = c(100, 100, 100, 100), Word = c("HELLO",
"GOODBYE", "YES", "NONE"), format = c("AM", "PM", "FM", "CM"),
Type = c("Sports related", "Non Sports related", "Country",
"Rock")), row.names = c(NA, -4L), class = "data.frame")

How do I rename the values in my column as I have misspelt them and can't rename them in R or Colab

I have a data frame that was given to me. Under the column titled State, there are two entries with the same name but different capitalization, i.e. one is "London" and the other is "LONDON". How would I be able to rename "LONDON" to "London" in order to total them up together rather than separately? As a reminder, I am trying to change the value itself, not the name of the column.
You can use the following code; df is your current dataframe, in which you want to replace "LONDON" with "London":
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil", "US", "Brazil", "UK", "Germany"),
                 State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
                 Candidate = 1:8)
print(df)
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
then run the following code to substitute "London" in all the instances where State is equal to "LONDON":
df[df$State == "LONDON", "State"] <- "London"
Now the output will be as
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
Maybe you could try using the case_when function. I would do something like this:
mutate(data, State_def = case_when(State == "LONDON" ~ "London",
                                   State == "London" ~ "London",
                                   TRUE ~ NA_character_))
Note that the fallback must be NA_character_ (not NA_real_), because case_when requires all branches to return the same type and the other branches return character values.
I might misunderstand, but I think it should be as simple as this:
x$state <- sub( "LONDON", "London", x$state, fixed=TRUE )
This should change LONDON to London
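If the data might also contain other case variants (say "london"), a small sketch that normalizes them all at once by comparing in upper case:
# Match any capitalization of "London" and rewrite it consistently
x$state[toupper(x$state) == "LONDON"] <- "London"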

Return values not found for each ID - R

I want to identify the unmatched values in the Vendors data frame for each vendor. In other words, for each vendor, find the countries that do not appear in the Vendors data frame.
I have a data frame (Vendors) that looks like this:
Vendor_ID  Vendor       Country_ID  Country
1          Burger King  2           USA
1          Burger King  3           France
1          Burger King  5           Brazil
1          Burger King  7           Turkey
2          McDonald's   5           Brazil
2          McDonald's   3           France
Vendors <- data.frame(
  Vendor_ID = c("1", "1", "1", "1", "2", "2"),
  Vendor = c("Burger King", "Burger King", "Burger King", "Burger King", "McDonald's", "McDonald's"),
  Country_ID = c("2", "3", "5", "7", "5", "3"),
  Country = c("USA", "France", "Brazil", "Turkey", "Brazil", "France"))
and I have another data frame (Countries) that looks like this:
Country_ID  Country
2           USA
3           France
5           Brazil
7           Turkey
Countries <- data.frame(Country_ID = c("2", "3", "5", "7"),
                        Country = c("USA", "France", "Brazil", "Turkey"))
Desired Output:
Vendor_ID  Vendor      Country_ID  Country
2          McDonald's  2           USA
2          McDonald's  7           Turkey
Can someone please tell me how this could be achieved in R? I tried subset & anti-join but the results are not correct.
In base R we could first split the data by Vendor:
VenList <- split(df, df$Vendor)
and then we can check which countries are missing for each vendor and return them:
res <- lapply(VenList, function(x) {
  # Identify the countries missing for this vendor
  tmp1 <- df2[!(df2[, "Country"] %in% x[, "Country"]), ]
  # Get Vendor_ID and Vendor, one row per missing country
  tmp2 <- x[1:nrow(tmp1), 1:2]
  # cbind them; when the row counts disagree (nothing missing), NULL is returned
  if (nrow(tmp2) == nrow(tmp1)) {
    cbind(tmp2, tmp1)
  }
})
# Which yields
res
# $BurgerKing
# NULL
#
# $`McDonald's`
# Vendor_ID Vendor Country_ID Country
# 5 2 McDonald's 2 USA
# 6 2 McDonald's 7 Turkey
# If you want it as one df you could then flatten to
do.call(rbind, res)
# Vendor_ID Vendor Country_ID Country
# McDonald's.5 2 McDonald's 2 USA
# McDonald's.6 2 McDonald's 7 Turkey
Data
# quote = "" so the apostrophe in McDonald's is not treated as a quote character
df <- read.table(text = "1 BurgerKing 2 USA
1 BurgerKing 3 France
1 BurgerKing 5 Brazil
1 BurgerKing 7 Turkey
2 McDonald's 5 Brazil
2 McDonald's 3 France", quote = "",
col.names = c("Vendor_ID", "Vendor", "Country_ID", "Country"))
df2 <- read.table(text = "2 USA
3 France
5 Brazil
7 Turkey", col.names = c("Country_ID", "Country"))
A solution using expand.grid to create all possible Vendor-Country combinations (assuming that "Countries" has only one entry per country), then using dplyr to join "Vendors" and find the missing countries.
Edit: The last two lines (left_joins) are only needed to "translate" the ID columns into "text":
library(dplyr)
expand.grid(Vendor_ID = unique(Vendors$Vendor_ID), Country_ID = Countries$Country_ID) %>%
  left_join(Vendors) %>%
  filter(is.na(Vendor)) %>%
  select(Vendor_ID, Country_ID) %>%
  left_join(Countries) %>%
  left_join(unique(Vendors[, c("Vendor_ID", "Vendor")]))
Returns
Vendor_ID Country_ID Country Vendor
1 2 2 USA McDonald's
2 2 7 Turkey McDonald's
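
Since the question mentions anti-join, here is a sketch of that route as well (assumes dplyr and tidyr are installed; all_combos is a hypothetical name): build the full vendor/country grid with tidyr::crossing(), then drop the combinations that already occur in Vendors:
library(dplyr)
library(tidyr)

# All vendor x country combinations
all_combos <- crossing(distinct(Vendors, Vendor_ID, Vendor), Countries)

# Keep only the combinations that never appear in Vendors
anti_join(all_combos, Vendors, by = c("Vendor_ID", "Country_ID"))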

Need to ID states from mixed names /IDs in location data

I need to identify states from mixed location data: search for the 50 state abbreviations and the 50 full state names, and return the state abbreviation.
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL", "Houston, TX",
         "Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF")
df <- data.frame(N, Loc)
# Objective: create a variable State such that
# State contains the abbreviated names of states from Loc:
# for "Los Angeles, CA", State = CA
# for "Florida, USA", State = FL
# for "WV NY NJ", State = NA
# for "qwerty NJuy PO DOPL JKF", State = NA (despite containing the string NJ, it is not wrapped in spaces)

# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX", "TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
> Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package for this, or can a loop be written? Even if the scheme could be demonstrated with a few states, that would be sufficient; I will post the full solution when I get to it. By the way, this is for a Twitter dataset downloaded using the rtweet package, and the variable is place_full_name.
There are built-in constants in R, state.abb and state.name, which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b', c(state.abb, state.name),
                                            '\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do:
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, we can see that for the 9th row you expect NA as output, but here it returns "WV" because it is a state abbreviation. In such cases, you need to prepare rules that are strict enough to extract only the intended state names and nothing else.
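One possible stricter rule, as a sketch (not the answerer's code; abb and full are hypothetical names): accept a two-letter abbreviation only when it directly follows a comma at the end of the string, and otherwise fall back to full state names:
library(stringr)

# Two-letter code only right after a trailing comma, so stray tokens
# like "WV" in "WV NY NJ" are ignored
abb <- str_match(df$Loc, ",\\s*([A-Z]{2})$")[, 2]

# Full state names handle entries like "Florida, USA"
full <- str_extract(df$Loc, paste0("\\b(", paste(state.name, collapse = "|"), ")\\b"))

df$State <- ifelse(!is.na(abb) & abb %in% state.abb,
                   abb, state.abb[match(full, state.name)])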
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
Now if the state abbreviations are not in any of the built-in ones, then we can use match to find the positions of our state.names that are in any of the items in the built-in state.name vector, and use that to index state.abb, else keep what we already have. Those that don't match either return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

extracting country name from city name in R

This question may look like a duplicate, but I am facing an issue while extracting country names from strings. I have gone through this link (Extracting Country Name from Author Affiliations) but was not able to solve my problem. I have tried grepl and a for loop for text matching and replacement, but my data column consists of more than 300k rows, so pattern matching with grepl and a for loop is very, very slow.
I have a column like this.
org_loc
Zug
Zug Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza
York United Kingdom
Delhi
Yalleroi Queensland
Waterloo Ontario
Waterloo ON
Washington D.C.
Washington D.C. Metro
New York
df <- data.frame(org_loc = c("zug", "zug canton of zug", "zimbabwe",
                             "zigong", "zhuhai", "zaragoza", "York United Kingdom",
                             "Delhi", "Yalleroi Queensland", "Waterloo Ontario",
                             "Waterloo ON", "Washington D.C.", "Washington D.C. Metro",
                             "New York"))
The string may contain the name of a state, city, or country. I just want the country as output, like this:
org_loc
Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United States
United States
United States
I am trying to convert states (where a match is found) to their country using the countrycode library but am not able to do so. Any help would be appreciated.
You can use your City_and_province_list.csv as a custom dictionary for countrycode. The custom dictionary cannot have duplicates in the origin vector (the City column in your City_and_province_list.csv), so you'll have to remove them or deal with them somehow first (as in my example below). Currently, you don't have all of the possible strings from your example in your lookup CSV, so not all of them are converted, but if you added all of the possible strings to the CSV, it would work completely.
library(countrycode)
org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
             "Zaragoza", "York United Kingdom", "Delhi",
             "Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
             "Washington D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)
city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")
# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]
df$country <- countrycode(df$org_loc, "City", "Country",
                          custom_dict = city_country)
df
# org_loc country
# 1 Zug Switzerland
# 2 Zug Canton of Zug <NA>
# 3 Zimbabwe <NA>
# 4 Zigong China
# 5 Zhuhai China
# 6 Zaragoza Spain
# 7 York United Kingdom <NA>
# 8 Delhi India
# 9 Yalleroi Queensland <NA>
# 10 Waterloo Ontario <NA>
# 11 Waterloo ON <NA>
# 12 Washington D.C. <NA>
# 13 Washington D.C. Metro <NA>
# 14 New York United States of America
library(countrycode)
df <- c("zug switzerland", "zug canton of zug switzerland", "zimbabwe",
        "zigong chengdu pr china", "zhuhai guangdong china", "zaragoza",
        "York United Kingdom", "Yamunanagar", "Yalleroi Queensland Australia",
        "Waterloo Ontario", "Waterloo ON", "Washington D.C.",
        "Washington D.C. Metro", "USA")
df1 <- countrycode(df, 'country.name', 'country.name')
It didn't match a lot of them, but that should do what you're looking for, based on the reference manual for countrycode.
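Another offline possibility, as a sketch (not from the original answers): the maps package ships a world.cities lookup table with a country column, though the city spelling must match exactly:
library(maps)  # provides the world.cities data frame
data(world.cities)

org_loc <- c("Zug", "Zigong", "Zhuhai", "Zaragoza", "Delhi", "New York")
# match() takes the first hit, so homonymous cities resolve to whichever
# row happens to come first in world.cities
world.cities$country.etc[match(org_loc, world.cities$name)]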
With the function geocode from the package ggmap you may accomplish your task with good, but not total, accuracy; you must also use your own criterion to decide that "Zaragoza" is a city in Spain (which is what geocode returns) and not somewhere in Argentina; geocode tends to give you the biggest city when there are several homonyms.
(Remove the $country to see all of the output.)
library(ggmap)
org_loc <- c("zug", "zug canton of zug", "zimbabwe",
             "zigong", "zhuhai", "zaragoza", "York United Kingdom",
             "Delhi", "Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
             "Washington D.C.", "Washington D.C. Metro", "New York")
geocode(org_loc, output = "more")$country
As geocode is provided by Google, it has a query limit of 2,500 per day per IP address; if it returns NAs, it may be because of an inconsistent limit check, so just try it again.
