Matching two names in the same column in R

I am new to coding in R and am having trouble matching two names in the same column.
To be more specific, I have a table with multiple rows and a column called "fileName" that holds the names of, say, different colors. The table was combined from two different tables, so color names from the first table look like new_red and those from the second look like old_red, for example.
I want to create a new column that says "Match" for every row whose color appears more than once in the fileName column. If a new_ color is unique, with no old_ row for that color, the new column should say "No_new_match", and likewise "No_old_match" for an old_ color with no new_ counterpart.
I believe there is a way to reference a certain number of digits/characters after a name, i.e. to look for 3 characters as in new_xxx. I tried a pattern like "new\d{3}", but it didn't work the way I intended.
Here is an example of what I am referring to:
fileName     (new column)
new_red      Match
new_blue     No_new_match
new_green    No_new_match
old_red      Match
old_purple   No_old_match
Any help would be appreciated. I know how to create a new column and such for the table I want to make, but I am having trouble with this part. Again, thank you!

Here's a way with dplyr:
library(dplyr)
df <- data.frame(fileName = c("new_red", "new_blue", "new_green", "old_red", "old_purple"),
                 stringsAsFactors = FALSE)
df %>%
  mutate(
    # keep only the part after the underscore (the color)
    Match = sapply(strsplit(fileName, "_"), "[", 2),
    # TRUE for every color that occurs more than once
    Match = duplicated(Match) | duplicated(Match, fromLast = TRUE)
  )
fileName Match
1 new_red TRUE
2 new_blue FALSE
3 new_green FALSE
4 old_red TRUE
5 old_purple FALSE
You can make cosmetic changes to the Match column as per your needs.

Here's a way using regular expressions:
fileName <- c("new_red", "new_blue", "new_green", "old_red", "old_purple")
color <- gsub("(new_)|(old_)", "", fileName)
color.freq <- table(color)
df <- data.frame(
fileName = fileName,
color = color,
match = ifelse(
color.freq[color] == 2,
"Match",
ifelse(
grepl("new", fileName),
"No_new_match",
"No_old_match"
)
)
)
fileName color match
1 new_red red Match
2 new_blue blue No_new_match
3 new_green green No_new_match
4 old_red red Match
5 old_purple purple No_old_match
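Both answers above treat any repeated color as a match. As a hedged variation, the sketch below labels a color "Match" only when it occurs under both the new_ and the old_ prefix, which may be closer to what the question describes. It uses base R only and assumes every name has the form new_<color> or old_<color>; the variable names prefix, has_new and has_old are made up for this example.

```r
fileName <- c("new_red", "new_blue", "new_green", "old_red", "old_purple")

# Split each name into its prefix ("new" or "old") and its color.
prefix <- sub("_.*$", "", fileName)
color  <- sub("^(new|old)_", "", fileName)

# A color counts as matched only if it occurs under both prefixes.
has_new <- color %in% color[prefix == "new"]
has_old <- color %in% color[prefix == "old"]

match <- ifelse(has_new & has_old, "Match",
                ifelse(prefix == "new", "No_new_match", "No_old_match"))

data.frame(fileName, match)
```

Anchoring the pattern with ^ means only a leading new_ or old_ is stripped, so a color name that happens to contain "old_" elsewhere is left alone.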

Related

Bucket 2 columns of data (character class) into a newly created column in R

I would like to bin data from Column A (rideable_type) and Column B (member_casual) into Column C (a newly created column called type_of_users). For instance, if Column A is electric_bike and Column B is member, I would like Column C to show member_electric. I have searched for suitable code, but nothing seems to work. Below are some details which may be relevant:
Dataset name: df1
Column A (rideable_type) and Column B (member_casual): given in the dataset. Both class and mode are "character".
Column C (type_of_users): the end result that I wish to achieve. Both class and mode should be "character".
I tried the code below, but it duplicates itself on the right-hand side of the dataset.
df1$type_of_users %>%
mutate(case_when
(df1$member_casual =='casual' & df1$rideable_type =='docked_bike' ~ 'casual_docked',
df1$member_casual =='member' & df1$rideable_type =='docked_bike' ~ 'member_docked',
df1$member_casual =='casual' & df1$rideable_type =='electric_bike' ~ 'casual_electric',
df1$member_casual =='member' & df1$rideable_type =='electric_bike' ~ 'member_electric'))
If this isn't a suitable approach, what else would be useful? An if function? A stack function?
Thanks.
I think you could do this much more simply:
data <- data.frame(rideable_type = c('electric_bike', 'docked_bike', 'docked_bike', 'electric_bike'),
                   member_casual = c('member', 'casual', 'casual', 'member'))
# columns to paste together, in the order they should appear in the result
cols <- c('member_casual', 'rideable_type')
# create a new column `type_of_users` with the two columns collapsed together
data$type_of_users <- apply(data[, cols], 1, paste, collapse = "_")
# clean with gsub to remove or substitute any unwanted string
data$type_of_users <- gsub("_bike", "", data$type_of_users)
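Since paste() is already vectorized, the row-wise apply() may not be needed at all. The following is a minimal sketch in base R that builds the column in one call, putting member_casual first so the result reads member_electric as requested in the question. The four-row df1 is made-up sample data standing in for the real dataset.

```r
# Made-up sample standing in for the real df1.
df1 <- data.frame(
  rideable_type = c("electric_bike", "docked_bike", "docked_bike", "electric_bike"),
  member_casual = c("member", "casual", "casual", "member"),
  stringsAsFactors = FALSE
)

# Drop the trailing "_bike" and prepend the membership type.
df1$type_of_users <- paste(df1$member_casual,
                           sub("_bike$", "", df1$rideable_type),
                           sep = "_")

df1
```

Unlike the case_when() attempt, this needs no separate rule per combination, so any new rideable_type or member_casual value is handled automatically.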

Finding Matches Across Char Vectors in R

Given the two vectors below, is there a way to produce the desired data frame? This represents a real-world situation in which I have two data frames: the first contains a column of database values (keys) and the second contains a column of 1000+ rows, each a file name (potentials), which I need to match. The problem is that multiple files (potentials) can match any given key. I have worked with grep, merge, inner join, etc., but was unable to combine them into one solution. Any advice is appreciated!
potentials <- c("tigerINTHENIGHT",
"tigerWALKINGALONE",
"bearOHMY",
"bearWITHME",
"rat",
"imatchnothing")
keys <- c("tiger",
"bear",
"rat")
desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")
Pseudocode for what I think of as the solution:
#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
str result = na
foreach(key in keys){
if(substring(key, 0, x) == any(substring(potentals, 0 ,x))){ //search entire potential vector
result += potential that matched + ', '
}
}
return new column with result as the value on the current row
}
We can write a small function to extract the matches and then loop over the keys:
return_matches <- function(keys, potentials, fixed = TRUE) {
vapply(keys, function(k) {
paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
}, FUN.VALUE = character(1))
}
vapply is just a type-safe version of sapply: because FUN.VALUE = character(1) is given, it will never return anything but a character vector. Setting fixed = TRUE makes grep run a lot faster, but the keys are then treated as literal strings rather than regular expressions. Then we can easily make the desired data.frame:
df <- data.frame(
key = keys,
matches = return_matches(keys, potentials),
stringsAsFactors = FALSE
)
df
#> key matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear bear bearOHMY, bearWITHME
#> rat rat rat
The reason for putting the loop in a function instead of running it directly is just to make the code look cleaner.
You can iterate using grep:
Match <- sapply(keys, function(item) {
  paste0(grep(item, potentials, value = TRUE), collapse = ", ")
})
data.frame(keys, Match, row.names = NULL)
keys Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2 bear bearOHMY, bearWITHME
3 rat rat
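Both grep-based answers match a key anywhere inside a file name. If, as the pseudocode in the question suggests, only matches at the start of the name should count, a hedged alternative is base R's startsWith(). The function name prefix_matches below is hypothetical, and the key "at" is added only to illustrate the difference.

```r
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE", "bearOHMY",
                "bearWITHME", "rat", "imatchnothing")
keys <- c("tiger", "bear", "rat")

# For each key, collect the potentials that begin with it (literal comparison).
prefix_matches <- function(keys, potentials) {
  vapply(keys, function(k) {
    paste(potentials[startsWith(potentials, k)], collapse = ", ")
  }, FUN.VALUE = character(1), USE.NAMES = FALSE)
}

res <- data.frame(key = keys,
                  matches = prefix_matches(keys, potentials),
                  stringsAsFactors = FALSE)
res

# Substring matching vs. prefix matching for an illustrative key "at":
anywhere <- grep("at", potentials, value = TRUE, fixed = TRUE)
at_start <- potentials[startsWith(potentials, "at")]
```

Here anywhere contains "rat" and "imatchnothing", while at_start is empty, which is the behavior you would want if keys must be prefixes.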

Vectorized use of the substring function for a row selection of a dataframe with different length

My dataframe has a column named Code of type character whose values look like b, b1, b110-b139, b110, b1100, b1101, ... (1602 entries).
I am trying to select all the entries that match the strings in a vector, plus all the ones that start with one of those strings.
So let's say I have the vector
Selection=c("b114","d2")
then I want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003, etc.
What does work in principle is to create a new dataframe like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data I want, but requires me to manually type the condition for each entry, and I just want to give the script a vector with the strings.
I tried to do it like this:
SelectionL=nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code,1,SelectionL)==Selection,]
but this somehow gives me only half of the required entries, and I confess I don't really know what it is doing. I know I could use a for loop, but from everything I have read so far, loops should be avoided and the problem should be solvable with vectors.
sample data
df <- data.frame( Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
# create a regex pattern to filter on
pattern <- paste0("^", Selection, collapse = "|")
# keep only the rows where 'Code' starts with one of the entries from 'Selection'
df %>% filter(grepl(pattern, Code, perl = TRUE))
# Code
# 1 b114
# 2 b11456
# 3 d2
# 4 d200
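The substr() attempt in the question probably returns only part of the rows because R recycles both the prefix lengths and the prefixes along the rows, so each row is compared against just one of the prefixes rather than all of them. As a hedged base R alternative to the regex, the sketch below tests every prefix against every row with startsWith() and combines the results with a logical OR. It compares literally, so it also works if a prefix contains regex metacharacters; the name keep is made up for this example.

```r
df <- data.frame(Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
                 stringsAsFactors = FALSE)
Selection <- c("b114", "d2")

# One logical vector per prefix, then OR them together element-wise.
keep <- Reduce(`|`, lapply(Selection, function(p) startsWith(df$Code, p)))

df[keep, , drop = FALSE]
```

The lapply() call loops over the (short) Selection vector only; the comparison against the 1602 codes stays vectorized.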

R: Conditional Formatting across excel files

I am trying to highlight rows of an Excel file based on a match with the columns in a separate Excel file. Essentially, I want to highlight a row in file1 if a cell in that row matches a cell in file2.
I saw that the conditionalFormatting() function (from the openxlsx package) has some of this functionality, but I cannot figure out how to use it.
The pseudo-code, I think, would look something like this:
file1 <- read_excel("file1")
file2 <- read_excel("file2")
conditionalFormatting(file1, sheet = 1, cols = 1:end, rows = 1:22,
rule = "number in file1 is found in a specific column of file 2")
Please let me know if this makes sense or if I need to clarify something.
Thanks!
The conditionalFormatting() function embeds active conditional formatting into the excel document but is likely more complicated than you need for a one-time highlight. I'd suggest loading each file into a dataframe, determining which rows contain a matching cell, creating a highlight style (yellow background), loading the file as a workbook object, setting the appropriate rows to the highlight style, and saving the updated workbook object.
The following function is used to determine which rows have a match. The magrittr package provides the %>% pipe and the data.table package provides the transpose() function.
find_matched_rows <- function(df1, df2) {
require(magrittr)
require(data.table)
# the dataframe object treats each column as a list making it much easier and
# faster to search via column than row. Transpose the original file1 dataframe
# to treat the rows as columns.
df1_transposed <- data.table::transpose(df1)
# assuming that the location of the match in the second file is irrelevant,
# unlist the file2 dataframe so that each value in file1 can be searched in a
# vector
df2_as_vector <- unlist(df2)
# determine which columns contain a match. If one or more matches are found,
# attribute the row as 'TRUE' in the output vector to be used to subset the
# row numbers
match_map <- lapply(df1_transposed,FUN = `%in%`, df2_as_vector) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
sapply(function(x) sum(x) > 0)
# make a vector of row numbers using the logical match_map vector to subset
matched_rows <- seq_len(nrow(df1))[match_map]
return(matched_rows)
}
The following code loads the data, finds the matched rows, applies the highlight, and saves over the original file1.xlsx. The second pair of tst_df1 and tst_df2 assignments provides an easy way of testing the find_matched_rows() function. As expected, it finds that the 1st and 3rd rows of the first dataframe contain a cell that matches a cell in the second dataframe.
# used to ensure that the correct rows are highlighted. the dataframe does not
# include the header as an independent row unlike excel.
file1_header_row <- 1
file2_header_row <- 1
tst_df1 <- openxlsx::read.xlsx("./file1.xlsx",
startRow = file1_header_row)
tst_df2 <- openxlsx::read.xlsx("./file2.xlsx",
startRow = file2_header_row)
#example data for testing
tst_df1 <- data.frame(fname = c("John", "Bob", "Bill"),
lname = c("Smith", "Johnson", "Samson"),
wage = c(10, 15.23, 137.38),
stringsAsFactors = FALSE)
tst_df2 <- data.frame(a = c(10, 34, 284.2),
b = c("Billy", "Bill", "Billy-Bob"),
c = c("Samson", "Johansson", NA),
stringsAsFactors = FALSE)
df_matched_rows <- find_matched_rows(tst_df1, tst_df2)
# any color found in colours() can be used here or hex color beginning with "#"
highlight_style <- openxlsx::createStyle(fgFill = "yellow")
file1_wb <- openxlsx::loadWorkbook(file = "./file1.xlsx")
openxlsx::addStyle(wb = file1_wb,
sheet = 1,
style = highlight_style,
rows = file1_header_row + df_matched_rows,
cols = 1:ncol(tst_df1),
stack = TRUE,
gridExpand = TRUE)
openxlsx::saveWorkbook(wb = file1_wb,
file = "./file1.xlsx",
overwrite = TRUE)
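If you would rather not depend on magrittr and data.table for the matching step, the row search can also be sketched in base R by checking each column of the first data frame against a lookup vector and OR-ing the results. This is a hedged alternative, not the function above: rows_with_match is a hypothetical name, all values are compared as character strings, and (unlike %in% on raw data) empty NA cells are deliberately not treated as matches.

```r
rows_with_match <- function(df1, df2) {
  # Flatten the second data frame into one character lookup vector, dropping NAs.
  lookup <- as.character(unlist(df2, use.names = FALSE))
  lookup <- lookup[!is.na(lookup)]
  # For each column of df1, flag the cells found in the lookup vector.
  hits <- lapply(df1, function(col) as.character(col) %in% lookup)
  # A row matches if any of its cells matched.
  which(Reduce(`|`, hits))
}

# Same example data as in the answer above.
tst_df1 <- data.frame(fname = c("John", "Bob", "Bill"),
                      lname = c("Smith", "Johnson", "Samson"),
                      wage = c(10, 15.23, 137.38),
                      stringsAsFactors = FALSE)
tst_df2 <- data.frame(a = c(10, 34, 284.2),
                      b = c("Billy", "Bill", "Billy-Bob"),
                      c = c("Samson", "Johansson", NA),
                      stringsAsFactors = FALSE)

matched <- rows_with_match(tst_df1, tst_df2)
matched
```

The returned row numbers can be passed to openxlsx::addStyle() in the same way as df_matched_rows.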

NLP - identifying and replacing words (synonyms) in R

I have a problem with some code in R.
I have a data set (questions) with 4 columns and over 600k observations, one of whose columns is named 'V3'.
This column holds questions like 'what is the day?'.
I have a second data set (voc) with 2 columns, one named 'word' and the other named 'synonyms'. If a word from the 'synonyms' column of voc occurs in my first data set (questions), I want to replace it with the corresponding word from the 'word' column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
V3
1 what is the day today?
2 Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
word synonyms
1 weather day
2 a the
3 blue brown
Desired output
V3 V5
1 what is the day today? what is a weather today?
2 Tom has brown eyes Tom has blue eyes
I wrote some simple code, but it doesn't work.
for (k in 1:nrow(question))
{
for (i in 1:nrow(voc))
{
question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3)
}
}
Maybe someone can help me? :)
I wrote a second version, but it doesn't work either.
for( i in 1:nrow(questions))
{
for( j in 1:nrow(voc))
{
if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE)
{
new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,]))
questions[i,]=new
}
}
questions = cbind(questions,c(new))
}
First, it is important that you use the stringsAsFactors = FALSE option, either at the program level or during your data import. This is because R versions before 4.0.0 default to making strings into factors unless you specify otherwise (from R 4.0.0 on, the default is already FALSE). Factors are useful in modeling, but you want to analyze the text itself, so you should make sure that your text is not coerced to factors.
The way I approached this was to write a function that "explodes" each string into a vector and then uses match to replace the words. The vector then gets reassembled into a string.
I'm not sure how performant this will be given your 600K records. You might look into some of the R packages that handle strings, like stringr or stringi, since they will probably have functions that do some of this. match tends to be okay on speed; %in% is implemented with match, so it should behave similarly, but either can get slow depending on the length of the vectors involved.
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) {
# Start by breaking the input string into a vector
# Note that we use [[1]] to get first list element of strsplit output
# Obviously this relies on breaking sentences by spacing
orig_words <- strsplit(x = input_string,split = " ")[[1]]
# If we find at least one of the words to replace in the original words, proceed
if(sum(orig_words %in% words_to_repl) > 0) {
# The left side selects the elements of orig_words that match words to be replaced
# The right side uses match to find the numeric index of those words within the words_to_repl vector
# (nomatch = 0 drops the words that have no replacement)
# This numeric vector is used to select the values from repl_words
# These then replace the values in orig_words
orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)]
# We rebuild the sentence again, and return a list with original and new version
new_sent <- paste(orig_words,collapse = " ")
return(list(original = input_string,new = new_sent))
} else {
# Otherwise we return the original version since no changes are needed
return(list(original = input_string,new = input_string))
}
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
args = lapply(X = questions$V3,
FUN = uFunc_FindAndReplace,
words_to_repl = voc$synonyms,
repl_words = voc$word))
>
original new
1 What is the day today? What is a weather today?
2 Tom has brown eyes Tom has blue eyes
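Because the function above splits on spaces, a synonym followed by punctuation (for example "day?") would not be replaced. A hedged alternative is to run gsub() once per vocabulary row over the whole column, using \\b word boundaries so that only whole words are replaced. This is a sketch, not a tuned solution: replace_synonyms is a hypothetical name, the synonyms are assumed to contain no regex metacharacters, and replacements are applied in order, so a replacement word that is itself a later synonym would be replaced again.

```r
questions <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = FALSE)
voc <- data.frame(word = c("weather", "a", "blue"),
                  synonyms = c("day", "the", "brown"),
                  stringsAsFactors = FALSE)

# Replace each synonym with its word; gsub is vectorized over all questions.
replace_synonyms <- function(text, synonyms, words) {
  for (i in seq_along(synonyms)) {
    pattern <- paste0("\\b", synonyms[i], "\\b")
    text <- gsub(pattern, words[i], text, perl = TRUE)
  }
  text
}

questions$V5 <- replace_synonyms(questions$V3, voc$synonyms, voc$word)
questions
```

The loop runs once per vocabulary entry rather than once per question, which may matter with 600K rows. The word boundaries also keep "day" inside "today" and "the" inside "weather" from being touched.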

Resources