I am trying to extract text from a character column of one data frame based on matches against a column of another data frame. Here is a reproducible example.
productlist <- data.frame(prod_tg=c('Milk', 'Soybean', 'Pig meat'),
nomencl=c('milk|SMP|dairy|MK', 'Soybean|Soyabean', 'Pigmeat|PK|Pork|pigmeat') )
tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'))
I would like the matching to be case-insensitive. In productlist, each entry of the nomencl column lists all alternative spellings separated by '|', so that a match on any of them maps to the corresponding prod_tg entry (Milk, Soybean or Pig meat).
My expected data frame would look like this:
finaldf = data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'), prod_match=c('milk', 'dairy', 'MK','pig', 'Soybean'), product_tag=c('Milk', 'Milk', 'Milk', 'Pig meat', 'Soybean'))
I have been thinking of the grepl function in base R, but I am open to any other function. Grateful for your suggestions.
Here's a way using stringr::str_extract
library(stringr)
cbind(tctdf,t(sapply(tctdf$policy_label, function(x) {
v <- str_extract(x, regex(productlist$nomencl, ignore_case = TRUE))
c(prod_match = toString(na.omit(v)),
product_tag = toString(productlist$prod_tg[!is.na(v)]))
}))) |> `rownames<-`(NULL)
#    policy_label prod_match product_tag
#1    Market Milk       Milk        Milk
#2 dairy products      dairy        Milk
#3      OCHA - MK         MK        Milk
#4        pigmeat    pigmeat    Pig meat
#5       Soybeans    Soybean     Soybean
data
Changed <= to <- for tctdf and replaced 'pig meat' with 'pigmeat' so that it actually matches an entry in productlist.
tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products',
'OCHA - MK', 'pigmeat', 'Soybeans'))
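Since the question mentions grepl, here is a minimal base R sketch of the same idea, assuming each label should be tagged with the first product whose pattern matches. The helper name match_one is made up for illustration, and the data are the corrected frames from above.

```r
productlist <- data.frame(
  prod_tg = c("Milk", "Soybean", "Pig meat"),
  nomencl = c("milk|SMP|dairy|MK", "Soybean|Soyabean", "Pigmeat|PK|Pork|pigmeat"),
  stringsAsFactors = FALSE
)
tctdf <- data.frame(
  policy_label = c("Market Milk", "dairy products", "OCHA - MK", "pigmeat", "Soybeans"),
  stringsAsFactors = FALSE
)

# match_one() is a hypothetical helper: it returns the matched text and the
# tag of the first product whose pattern matches the label (NA if none does).
match_one <- function(label) {
  hits <- vapply(productlist$nomencl,
                 function(p) grepl(p, label, ignore.case = TRUE),
                 logical(1), USE.NAMES = FALSE)
  i <- which(hits)[1]
  if (is.na(i)) {
    return(c(prod_match = NA_character_, product_tag = NA_character_))
  }
  # regexpr() locates the first match, regmatches() extracts its text
  m <- regmatches(label, regexpr(productlist$nomencl[i], label, ignore.case = TRUE))
  c(prod_match = m, product_tag = productlist$prod_tg[i])
}

# One row per label, with columns prod_match and product_tag
res <- t(vapply(tctdf$policy_label, match_one,
                c(prod_match = "", product_tag = ""), USE.NAMES = FALSE))
finaldf <- cbind(tctdf, as.data.frame(res, stringsAsFactors = FALSE))
finaldf
```

Short codes such as MK or PK are matched anywhere in the label here, so with real data you may want to wrap them in word boundaries.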
Related
The following names are in a column. I want to retain just five distinct names and replace the rest with "Other". How do I go about that?
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
If I am understanding you correctly, use %in% and ifelse. Here, I chose the first five names as an example. I also created it in a new column, but you could just overwrite the column as well or create a vector:
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
fivenamez <- c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm')
df$names_transformed <- ifelse(df$names %in% fivenamez, df$names, "Other")
#               names names_transformed
# 1     Marvel Comics     Marvel Comics
# 2 Dark Horse Comics Dark Horse Comics
# 3         DC Comics         DC Comics
# 4      NBC - Heroes      NBC - Heroes
# 5         Wildstorm         Wildstorm
# 6      Image Comics             Other
# 7              <NA>             Other
# 8       Icon Comics             Other
# 9              SyFy             Other
If you want to keep NA values as NA, just use df$names_transformed <- ifelse(df$names %in% fivenamez | is.na(df$names), df$names, "Other")
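To see the difference between the two variants, here is a small sketch on a shortened vector. The names names_vec and keep are just for illustration and stand in for df$names and fivenamez.

```r
# Shortened, illustrative stand-ins for df$names and fivenamez
names_vec <- c("Marvel Comics", "Image Comics", NA, "DC Comics")
keep <- c("Marvel Comics", "DC Comics")

# NA %in% keep is FALSE, so the plain version turns NA into "Other"
plain <- ifelse(names_vec %in% keep, names_vec, "Other")

# Adding is.na() to the condition passes NA through unchanged
keep_na <- ifelse(names_vec %in% keep | is.na(names_vec), names_vec, "Other")

plain
keep_na
```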
You can also use something like case_when from dplyr. The following code keeps Marvel Comics, Dark Horse Comics, DC Comics, J. K. Rowling and George Lucas unchanged and changes all others to "Other". It is functionally the same as u/jpsmith's answer, but (in my humble opinion) offers a little more flexibility, because you can change several values more easily or map different publishers to the same name should you choose to do so.
library(dplyr)
df = df %>%
mutate(new_names = case_when(names == 'Marvel Comics' ~ 'Marvel Comics',
names == 'Dark Horse Comics' ~ 'Dark Horse Comics',
names == 'DC Comics' ~ 'DC Comics',
names == 'George Lucas' ~ 'George Lucas',
names == 'J. K. Rowling' ~ 'J. K. Rowling',
TRUE ~ "Other"))
I want to do the following: if any of the keywords "GARAGE", "PARKING" or "LOT" appears in column "Name", then set column "Type" to "Parking&Garage".
Here is the dataset:
df<-data.frame(Name=c("GARAGE 1","GARAGE 2", "101 GARAGE","PARKING LOT","CENTRAL PARKING","SCHOOL PARKING 1","CITY HALL"))
The following code works well for me, but is there a neater way to make it shorter? Thanks!
df$Type[grepl("GARAGE", df$Name) |
grepl("PARKING", df$Name) |
grepl("LOT", df$Name)]<-"Parking&Garage"
The regex "or" operator | is your friend here:
df$Type[grepl("GARAGE|PARKING|LOT", df$Name)]<-"Parking&Garage"
You can create a vector of keywords, build the pattern dynamically, and replace the values.
keywords <- c('GARAGE', 'PARKING', 'LOT')
df$Type <- NA
df$Type[grep(paste0(keywords, collapse = '|'), df$Name)] <- "Parking&Garage"
df
#              Name           Type
#1         GARAGE 1 Parking&Garage
#2         GARAGE 2 Parking&Garage
#3       101 GARAGE Parking&Garage
#4      PARKING LOT Parking&Garage
#5  CENTRAL PARKING Parking&Garage
#6 SCHOOL PARKING 1 Parking&Garage
#7        CITY HALL           <NA>
This would be helpful if you need to add more keywords to your list later.
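One thing to watch with a dynamically built pattern: a short keyword like LOT also matches inside longer words. A minimal sketch, assuming you only want whole-word matches; the "PILOT SCHOOL" row is a made-up example and not part of the original data.

```r
# "PILOT SCHOOL" is a hypothetical row added to show the whole-word behaviour
df <- data.frame(Name = c("GARAGE 1", "PARKING LOT", "PILOT SCHOOL", "CITY HALL"),
                 stringsAsFactors = FALSE)
keywords <- c("GARAGE", "PARKING", "LOT")

# Wrap the alternation in \b ... \b so that only whole words match
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

df$Type <- ifelse(grepl(pattern, df$Name, perl = TRUE),
                  "Parking&Garage", NA_character_)
df
```

Without the word boundaries, "PILOT SCHOOL" would also be tagged, because "LOT" occurs inside "PILOT".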
An alternative with the dplyr and stringr packages:
library(stringr)
library(dplyr)
df %>%
dplyr::mutate(TYPE = stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
TYPE = ifelse(TYPE == TRUE, "Parking&Garage", NA_character_))
I have a dataframe like so:
df = data.frame('name' = c('California parks', 'bear lake', 'beautiful tree house', 'banana plant'), 'extract' = c('parks', 'bear', 'tree', 'plant'))
How do I remove the strings of the 'extract' column from the name column to get the following result:
name_new = California, lake, beautiful house, banana
I suspect this demands a combination of str_extract and lapply, but I can't quite figure it out.
Thanks!
str_remove and str_replace are vectorized over both string and pattern. So, with two columns, just pass 'name' as the string and 'extract' as the pattern to remove the substring from the 'name' column elementwise. Removing the substring can leave stray spaces behind: trimws strips the leading/trailing ones, and str_replace_all collapses any run of two or more spaces into a single space.
library(dplyr)
library(stringr)
df %>%
mutate(name_new = str_remove(name, extract),
name_new = str_replace_all(trimws(name_new), "\\s{2,}", " "))
#                  name extract        name_new
#1     California parks   parks      California
#2            bear lake    bear            lake
#3 beautiful tree house    tree beautiful house
#4         banana plant   plant          banana
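A base R option using gsub + Vectorize (any whitespace around the match is replaced along with it, and trimws cleans up the ends)
within(df, name_new <- trimws(Vectorize(gsub, USE.NAMES = FALSE)(paste0("\\s*", extract, "\\s*"), " ", name)))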
which gives
                  name extract        name_new
1     California parks   parks      California
2            bear lake    bear            lake
3 beautiful tree house    tree beautiful house
4         banana plant   plant          banana
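Both answers treat the 'extract' values as regular expressions. If they can contain characters such as parentheses or dots, a literal match may be safer. A minimal base R sketch, assuming only the first occurrence should be removed; the third row is a made-up example, not from the question.

```r
# The third row is hypothetical, added to show a value with regex metacharacters
df <- data.frame(name = c("California parks", "bear lake", "C++ primer (2nd)"),
                 extract = c("parks", "bear", "(2nd)"),
                 stringsAsFactors = FALSE)

# fixed = TRUE makes sub() match the text literally instead of as a regex
removed <- mapply(function(p, x) sub(p, "", x, fixed = TRUE),
                  df$extract, df$name, USE.NAMES = FALSE)

# Collapse repeated spaces and strip leading/trailing ones
df$name_new <- trimws(gsub("\\s+", " ", removed))
df
```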
I have a fairly straightforward question, but I am very new to R and struggling a little. Basically, I need to delete duplicate rows and then update the remaining unique row based on the number of duplicates that were deleted.
In the original file I have directors and the company boards they sit on, with a director appearing in a new row for each company. I want each director to appear only once, but with a column that gives the number of their board seats (so 1 + the number of duplicates that were removed) and a column that lists the names of all companies on whose boards they sit.
So I want to go from this:
To this
Bonus if I can also get the code to list the director's "home company", i.e. the company on which she/he is an executive rather than an outsider.
Thanks so very much in advance!
N
You could use the ddply function from plyr package
#First I will enter a part of your original data frame
Name <- c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I', 'Abrahams, F', 'Abrahams, F', 'Abrahams, F')
Position <- c('Executive Director', 'Outsider', 'Outsider', 'Executive Director','Outsider', 'Outsider')
Companies <- c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG')
NoBoards <- c(1,1,1,1,1,1)
df <- data.frame(Name, Position, Companies, NoBoards)
# Then you could concatenate the Positions and Companies for each Name
library(plyr)
sumPosition <- ddply(df, .(Name), summarize, Position = paste(Position, collapse=", "))
sumCompanies <- ddply(df, .(Name), summarize, Companies = paste(Companies, collapse=", "))
# Merge the results into one data frame, using Name to join them
df2 <- merge(sumPosition, sumCompanies, by = 'Name')
# Sum the number of boards (NoBoards) for each Name
names_NoBoards <- aggregate(df$NoBoards, by = list(df$Name), sum)
names(names_NoBoards) <- c('Name', 'NoBoards')
# Merge the result with df2
df3 <- merge(df2, names_NoBoards, by = 'Name')
You get something like this
             Name                               Position  Companies NoBoards
1        Abbot, F                     Executive Director        ARM        1
2 Abdool-Samad, T                               Outsider          R        1
3      Abedian, I                               Outsider      FREIT        1
4     Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG        3
To also list each director's "home company", i.e. the company on which she/he is an executive rather than an outsider, you could use the following code:
ExecutiveDirector <- df[df$Position == 'Executive Director', c('Name', 'Companies')]
df4 <- merge(df3, ExecutiveDirector, by = 'Name', all.x = TRUE)
You get the following data frame
             Name                               Position Companies.x NoBoards Companies.y
1        Abbot, F                     Executive Director         ARM        1         ARM
2 Abdool-Samad, T                               Outsider           R        1        <NA>
3      Abedian, I                               Outsider       FREIT        1        <NA>
4     Abrahams, F Executive Director, Outsider, Outsider  FG, CG, LG        3          FG
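If you would rather avoid the extra package and the chain of merges, the same summary can be sketched in base R with tapply. This is only a sketch on the sample rows above; the column name HomeCompany is my own choice, and it assumes a director is an executive at no more than one company (only the first is kept otherwise).

```r
df <- data.frame(
  Name = c("Abbot, F", "Abdool-Samad, T", "Abedian, I",
           "Abrahams, F", "Abrahams, F", "Abrahams, F"),
  Position = c("Executive Director", "Outsider", "Outsider",
               "Executive Director", "Outsider", "Outsider"),
  Companies = c("ARM", "R", "FREIT", "FG", "CG", "LG"),
  stringsAsFactors = FALSE
)

# One value per director: number of rows and the concatenated company names
n_boards  <- tapply(df$Companies, df$Name, length)
companies <- tapply(df$Companies, df$Name, paste, collapse = ", ")

# First company where the director is an executive, NA if there is none
home <- tapply(seq_len(nrow(df)), df$Name, function(i) {
  ex <- df$Companies[i][df$Position[i] == "Executive Director"]
  if (length(ex) > 0) ex[1] else NA_character_
})

out <- data.frame(Name = names(companies),
                  NoBoards = as.vector(n_boards),
                  Companies = as.vector(companies),
                  HomeCompany = as.vector(home),
                  stringsAsFactors = FALSE)
out
```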
I would like to create a new column in my dataframe that corresponds to values in list variables.
My dataframe includes many rows with a 'product names' column. My intention is to create a new column that allows me to sort products into categories.
Sample code -
library(dplyr)
products <- c('Apple', 'orange', 'pear',
'carrot', 'cabbage',
'strawberry', 'blueberry')
df <- data.frame(products)
ls <- list(Fruit = c('Apple', 'orange', 'pear'),
Veg = c('carrot', 'cabbage'),
Berry = c('strawberry', 'blueberry'))
test <- df %>%
mutate(category = products %in% ls)
I hope that illustrates what I'm trying to do. By creating the list, I've basically got a register of products and their categories which could change over time.
Is there a solution to this using a list, or am I over-complicating it and not seeing the wood for the trees?
edit - It might help to let you know that I'm working with 100s of products.
Stack the list into a two-column data frame (values, ind) and then join it with the data frame:
df %>%
left_join(stack(ls), by = c('products' = 'values')) %>%
rename(category = ind)
#    products category
#1      Apple    Fruit
#2     orange    Fruit
#3       pear    Fruit
#4     carrot      Veg
#5    cabbage      Veg
#6 strawberry    Berry
#7  blueberry    Berry
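The same lookup also works without dplyr. A minimal base R sketch using stack and match; the list is named categories here (instead of ls, which shadows a base function), and 'kiwi' is a made-up product that is not in the register, to show what happens to unmatched rows.

```r
categories <- list(Fruit = c("Apple", "orange", "pear"),
                   Veg   = c("carrot", "cabbage"),
                   Berry = c("strawberry", "blueberry"))

# 'kiwi' is hypothetical and deliberately missing from the register
df <- data.frame(products = c("Apple", "carrot", "blueberry", "kiwi"),
                 stringsAsFactors = FALSE)

# stack() turns the named list into columns 'values' and 'ind'
lookup <- stack(categories)

# match() gives the row of each product in the lookup table (NA if absent)
df$category <- as.character(lookup$ind[match(df$products, lookup$values)])
df
```

Products that are not in the register come out as NA, which makes it easy to spot entries that still need a category.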