I have two data frames that I want to join by the first column, ignoring case:
df3<- data.frame("A" = c("XX28801","ZZ9"), "B" = c("one","two"),stringsAsFactors = FALSE)
df4<- data.frame("Z" = c("X2880","Zz9"),"C" = c("three", "four"), stringsAsFactors = FALSE)
What I want is this:
df5<- data.frame(A = c("XX28801","ZZ9"), B = c("one","two"), Z = c(NA,"Zz9"), C = c(NA, "four"))
but, interestingly, I get something different using the fuzzyjoin package:
join <- regex_left_join(df3,df4,by= c("A" = "Z"), ignore_case = TRUE)
It's good that ZZ9 and Zz9 matched, but I have no idea why XX28801 matched X2880. The only similarity is that X2880 appears inside XX28801.
I also don't want to uppercase/lowercase the values before joining as I want column A and column Z to retain their original values. Thanks.
Regex joins match on regular expressions: each value in the right-hand table is treated as a pattern and searched for within the text of the left-hand table. So, because "X2880" is found within "XX28801", this is considered a match.
To understand regex matching better, you might find it useful to explore some comparisons with grepl(pattern, text), which returns TRUE/FALSE depending on whether the pattern is found within text:
> grepl('X2880', 'XX28801', ignore.case = TRUE)
[1] TRUE
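For comparison, anchoring the pattern with ^ and $ so it must match the whole string returns FALSE:
> grepl('^X2880$', 'XX28801', ignore.case = TRUE)
[1] FALSE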
It seems like you want a match only when one entire string equals the other entire string, apart from upper/lower case. For this I would recommend creating temporary columns to join on:
library(dplyr)

# temporary lower-case key columns, so the original columns keep their capitalisation
df3_w_lower <- df3 %>%
  mutate(A_for_join = tolower(A))

df4_w_lower <- df4 %>%
  mutate(Z_for_join = tolower(Z))

join <- left_join(df3_w_lower, df4_w_lower, by = c("A_for_join" = "Z_for_join")) %>%
  select(-A_for_join)   # Z_for_join is already dropped by the join itself
By using temporary columns for joining you preserve the capitalization in the original columns.
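Alternatively, if you want to stay with fuzzyjoin, here is a sketch of the same idea: build a temporary column of anchored patterns so that only whole-string, case-insensitive matches count (this assumes the values in Z contain no regex metacharacters; otherwise they would need escaping):
library(dplyr)
library(fuzzyjoin)

df4_anchored <- df4 %>%
  mutate(Z_pattern = paste0("^", Z, "$"))  # "^Zz9$" only matches the full string

join <- regex_left_join(df3, df4_anchored, by = c("A" = "Z_pattern"), ignore_case = TRUE) %>%
  select(-Z_pattern)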
I am trying to use the dplyr pipe and left_join to clean up some metadata. Setting up the variables:
library(openxlsx)
library(tidyverse)
mdat <- read.xlsx("https://journals.plos.org/plospathogens/article/file?type=supplementary&id=info:doi/10.1371/journal.ppat.1005511.s011",
startRow = 3, fillMergedCells = TRUE) %>%
mutate(sample=Accession.Number)
dge$samples$sample contains:
[1] "SRR1346026" "SRR1346027" "SRR1346028" "SRR1346029" "SRR1346030" "SRR1346031" "SRR1346032" "SRR1346033" "SRR1346034"
[10] "SRR1346035" "SRR1346036" "SRR1346037" "SRR1346038" "SRR1346039" "SRR1346040" "SRR1346041" "SRR1346042" "SRR1346043"
[19] "SRR1346044" "SRR1346045" "SRR1346046" "SRR1346047" "SRR1346049" "SRR1346048" "SRR1346050" "SRR1346051" "SRR1346052"
I am trying to pipe in dge$samples$sample, which is a character vector. It needs to become a data frame with one column named sample so that I can left-join mdat onto it and remove all the metadata I don't have a sample for. If you run dim(mdat) you will find it is 35 by 15; I want to reduce it to the 19 samples I actually have data for, which are listed in dge$samples$sample. The code below is my progress so far, but I think I am failing to understand how the pipe works.
test = data.frame(dge$samples$sample) %>%
colnames(.) = c("sample") %>%
left_join(
.,
mdat,
by = sample,
copy = FALSE,
suffix = c(".x", ".y"),
keep = FALSE,
na_matches = c("na", "never")
)
Why not just check whether they're in there and filter them:
mdat %>% filter( sample %in% dge$samples$sample )
It's easier to understand and control than a join, and performance shouldn't be an issue.
I think your code can be reduced to the following (the original pipe fails because colnames(.) = c("sample") is an assignment, not a pipeable function call, and by = sample should be the string "sample"):
library(dplyr)
test <- data.frame(sample = dge$samples$sample) %>%
left_join(mdat, by = 'sample')
Or an inner join should work as well, using base R:
test <- merge(data.frame(sample = dge$samples$sample), mdat, by = 'sample')
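Note that merge() does an inner join by default; if you would rather keep samples that have no matching metadata (left-join behaviour), you can pass all.x = TRUE:
test <- merge(data.frame(sample = dge$samples$sample), mdat,
              by = 'sample', all.x = TRUE)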
Using collapse
library(collapse)
sbt(mdat, sample %in% dge$samples$sample)
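(sbt() is collapse's shorthand for fsubset(), a fast drop-in replacement for base subset().)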
I am rather new to R and I am trying to prepare an interactive data table using the DT package. My data contains numeric values, but some of them are preceded by a < or > sign. What I want is for my data table to allow interactive sorting on the numeric values, regardless of whether there is a < or > sign in front. So, for example, >10, <5, 9, >8 should sort to <5, >8, 9, >10.
My initial approach was to duplicate the column containing the numeric values with the < and > signs, remove the < and > signs from the duplicate column, and convert it to numeric to obtain a column with only the numeric values. I would then like to be able to order the table on these numeric values, but triggered by clicking the ordering button of the column containing the values with the < and > signs. Therefore, I want to hide the numeric-only column (since I do not want it to appear in the table), but somehow link the ordering of the original column to this hidden column.
Here are some example data and a script in which I have already duplicated the column (b to c), removed the < and > signs, and converted it to numeric values to obtain the column c, which I have then hidden:
library(DT)
df <- data.frame(a=1:5, b=c('10','5.0','2.0','< 1.0','> 20'), c=c(10,5,2,1,20))
DT <- DT::datatable(df,
options = list(columnDefs =
list(list(visible=FALSE,
targets=3))))
DT
I have not been able to find a way to sort the data in the table on this hidden column c by using the sorting button of column b.
I have found that this should be possible in JavaScript: jQuery DataTables - Ordering dates by hidden column
However, I am not able to figure out how to do the same in R, either by using a suitable function in R, or by providing it in JavaScript using the JS() function.
Could anyone help me with this problem?
Here is a solution using render:
library(DT)
render <- c(
"function(data, type, row){",
" if(type === 'sort'){",
" return parseFloat(data.match(/\\d+\\.?\\d+/)[0]);",
" }else{",
" return data;",
" }",
"}"
)
df <- data.frame(
a = 1:5,
b = c('10','5.0','2.0','< 1.0','> 20')
)
DT <- datatable(df,
options = list(
columnDefs = list(
list(render = JS(render), type = "num", targets = 2)
)
)
)
DT
This solution does not require a hidden column: the render callback returns just the numeric part (with the < or > stripped) whenever DataTables requests a value for sorting, and returns the original text otherwise.
Here's a way to do it. To get the "sorting key" use order.
library(DT)
# df <- data.frame(a=1:5, b=c('10','5.0','2.0','< 1.0','> 20'), c=c(10,5,2,1,20))
df <- data.frame(a = 1:5, b = c('10', '5.0', '2.0', '< 1.0', '> 20'))
df
# ONE APPROACH
df$c <- stringr::str_replace(string = df$b, pattern = "[<>]", replacement = "") %>%
  as.numeric()

# ANOTHER APPROACH
df$c <- gsub("[<>]", "", df$b) %>% as.numeric()
DT::datatable(df[order(df$c), -3], rownames = FALSE)
library(DT)
df <- data.frame(a=1:5, b=c('10','5.0','2.0','< 1.0','> 20'), c=c(10,5,2,1,20))
DT <- DT::datatable(df,
options = list(columnDefs =
list(list(visible=FALSE, targets=3),
list(orderData=3, targets=2)
)))
DT
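Here orderData = 3 tells DataTables to use the values of the hidden numeric column as the sort key whenever the user sorts the b column (targets = 2).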
Note: This answer is based on this one here, but DT now uses R indexing instead of JS indexing.
I am trying to highlight rows of an excel file based on a match from the columns in a separate excel file. Pretty much, I want to highlight a row in file1 if a cell in that row matches a cell in file2.
I saw the R package "conditionalFormatting" has some of this functionality, but I cannot figure out how to use it.
The pseudo-code, I think, would look something like this:
file1 <- read_excel("file1")
file2 <- read_excel("file2")
conditionalFormatting(file1, sheet = 1, cols = 1:end, rows = 1:22,
rule = "number in file1 is found in a specific column of file 2")
Please let me know if this makes sense or if i need to clarify something.
Thanks!
The conditionalFormatting() function embeds active conditional formatting into the Excel document, but it is likely more complicated than you need for a one-time highlight. I'd suggest loading each file into a dataframe, determining which rows contain a matching cell, creating a highlight style (yellow background), loading the file as a workbook object, setting the appropriate rows to the highlight style, and saving the updated workbook object.
The following function is used to determine which rows have a match. The magrittr package provides the %>% pipe and the data.table package provides the transpose() function.
find_matched_rows <- function(df1, df2) {
require(magrittr)
require(data.table)
# the dataframe object treats each column as a list making it much easier and
# faster to search via column than row. Transpose the original file1 dataframe
# to treat the rows as columns.
df1_transposed <- data.table::transpose(df1)
# assuming that the location of the match in the second file is irrelevant,
# unlist the file2 dataframe so that each value in file1 can be searched in a
# vector
df2_as_vector <- unlist(df2)
# determine which columns contain a match. If one or more matches are found,
# attribute the row as 'TRUE' in the output vector to be used to subset the
# row numbers
match_map <- lapply(df1_transposed,FUN = `%in%`, df2_as_vector) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
sapply(function(x) sum(x) > 0)
# make a vector of row numbers using the logical match_map vector to subset
matched_rows <- seq(1:nrow(df1))[match_map]
return(matched_rows)
}
The following code loads the data, finds the matched rows, applies the highlight, and saves over the original file1.xlsx. The second tst_df1 and tst_df2 provide an easy way of testing the find_matched_rows() function. As expected, it finds that the 1st and 3rd rows of the first dataframe contain a cell that matches a cell in the second dataframe.
# used to ensure that the correct rows are highlighted. the dataframe does not
# include the header as an independent row unlike excel.
file1_header_row <- 1
file2_header_row <- 1
tst_df1 <- openxlsx::read.xlsx("./file1.xlsx",
startRow = file1_header_row)
tst_df2 <- openxlsx::read.xlsx("./file2.xlsx",
startRow = file2_header_row)
#example data for testing
tst_df1 <- data.frame(fname = c("John", "Bob", "Bill"),
lname = c("Smith", "Johnson", "Samson"),
wage = c(10, 15.23, 137.38),
stringsAsFactors = FALSE)
tst_df2 <- data.frame(a = c(10, 34, 284.2),
b = c("Billy", "Bill", "Billy-Bob"),
c = c("Samson", "Johansson", NA),
stringsAsFactors = FALSE)
df_matched_rows <- find_matched_rows(tst_df1, tst_df2)
# any color found in colours() can be used here or hex color beginning with "#"
highlight_style <- openxlsx::createStyle(fgFill = "yellow")
file1_wb <- openxlsx::loadWorkbook(file = "./file1.xlsx")
openxlsx::addStyle(wb = file1_wb,
sheet = 1,
style = highlight_style,
rows = file1_header_row + df_matched_rows,
cols = 1:ncol(tst_df1),
stack = TRUE,
gridExpand = TRUE)
openxlsx::saveWorkbook(wb = file1_wb,
file = "./file1.xlsx",
overwrite = TRUE)
This is the code I have used in R via a Spark cluster; the error is also given below:
mydata<-spark_read_csv(spark_cluster,name = "rd_1",path = "IAF_Extracted_Data_Zipped.csv",header = F,delimiter = "|")
mydata %>% select(customer=V1,device_subscriber_id=V2,user_subscriber_id=V3,user_id=V4,location_id=V5)
Error in .f(.x[[i]], ...) : object 'V1' not found
The renaming convention goes the other way around (new name = old name)
You are looking for the following:
mydata %>%
select(V1 = customer,
V2 = device_subscriber_id,
V3 = user_subscriber_id,
V4 = user_id,
V5 = location_id)
If you want specific names just provide a vector of names on read:
columns <- c("customer", "device_subscriber_id",
"user_subscriber_id", "user_id", "location_id")
spark_read_csv(
spark_cluster, name = "rd_1",path = "IAF_Extracted_Data_Zipped.csv",
header = FALSE, columns = columns, delimiter = "|"
)
The number of columns should match the number of columns in the input.
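With the column names supplied at read time, the select()/rename step is no longer needed; you can refer to customer, device_subscriber_id, and so on directly.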
Off the top of my head, you could try customer = mydata$V1 and similar for the other variables (assuming V1, ... are column names of mydata).
I am trying to name a column and rename all items within the column of a dataset:
dataSet <- read.csv(url) %>%
rename("newColumn1" = V1) %>%
mutate(newColumn1 = recode(newColumn1, "oldEntryX" = "newEntryX") %>%
select(dataSet, newColumn1)
And I get this error:
Error in recode(newColumn1, oldEntryX = "newEntryX" :
object 'newColumn1' not found
What am I missing?
The code runs correctly up through the rename function and displays the renamed column correctly, but as soon as I include mutate it throws an error.
I have no problem sharing the real code but wanted to generalize it for the crowd.
The source data is from https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
In the mutate step, you don't need quotes for column names on the lhs of =. Also, there are a couple of case mismatches.
Assuming the dataset is read correctly, we can
df1 %>%
rename(newColumn1 = V1, newColumn2 = V2) %>%
mutate(newColumn1 = recode(newColumn1, oldEntryX = "newEntryX"),
newColumn2 = recode(newColumn2, oldEntryY = "newEntryY"))
Based on the OP's code, the mutate() call is also missing its closing parenthesis.
data
set.seed(24)
df1 <- data.frame(V1 = sample(c("oldEntryX", "x", "y"), 10, replace = TRUE),
V2 = sample(c("oldEntryY", "x", "y"), 10, replace = TRUE), stringsAsFactors= FALSE)
You can do this with some simple R code.
How to read a csv file:
data <- read.csv("datafile.csv", header = FALSE)
By default, read.csv("filename.csv") uses the first row as the header; pass header = FALSE if, as here, the file has no header row.
How to rename the header/column names:
names(data) <- c("Column1", "Column2", "Column3")
Now your headers are replaced by Column1, Column2 and Column3.
To change the data in Column1, assign a vector of replacement values of the same length as the column, for example:
data$Column1 <- c("a", "b", "c")  # one replacement value per row
To see the output, type
data