Multiple String Matching in R

Treat A, B, C, D, ... as words.
I have two data frames.
df1:
ColA
A B
B C
C D
E F
G H
A M
M
df2:
ColB
A B C D X Y Z
C D M N F K L
S H A F R M T U
Operation:
I want to search for every element of df1 in df2, then either append all the matching values in a new column or create one row per match.
Output 1:
ColB             Output
A B C D X Y Z    A,A B,B C,C D
C D M N F K L    C D,M
S H A F R M T U  A,A M
Output2:
ColB             Output
A B C D X Y Z    A
A B C D X Y Z    A B
A B C D X Y Z    B C
A B C D X Y Z    C D
C D M N F K L    C D
C D M N F K L    M
S H A F R M T U  A
S H A F R M T U  A M

I think this will do it, although it differs a bit from your expected answer, which I think is wrong.
First set up the input data frames:
# set up the data
df1 <- data.frame(ColA = c("A B",
                           "B C",
                           "C D",
                           "E F",
                           "G H",
                           "A M",
                           "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z",
                           "C D M N F K L",
                           "S H A F R M T"),
                  stringsAsFactors = FALSE)
Next we will form all the pairwise combinations of the things to search with the things to be searched:
# create a vector of patterns and items to search
intermediate <- as.vector(outer(df2$ColB, df1$ColA, paste, sep = "|"))
# split it into a list
intermediate <- strsplit(intermediate, "|", fixed = TRUE)
Then we can match the elements for each row of this full combination dataset. The core is foundMatch, a logical vector indicating whether all elements of the ColA entry are present in the ColB entry. In your examples order does not matter, so here we split both entries into words and check that every word of the ColA entry appears in the ColB entry.
# set up the output data.frame
Output2 <- data.frame(do.call(rbind, intermediate))
names(Output2) <- c("ColB", "Output")
# here is the core, which does the element matching
foundMatch <- apply(Output2, 1, function(x) {
  tokens <- strsplit(x, " ", fixed = TRUE)
  all(tokens[[2]] %in% tokens[[1]])
})
# filter out the ones with the match
Output2 <- Output2[foundMatch, ]
Output2
##             ColB Output
## 1  A B C D X Y Z    A B
## 4  A B C D X Y Z    B C
## 7  A B C D X Y Z    C D
## 8  C D M N F K L    C D
## 18 S H A F R M T    A M
## 20 C D M N F K L      M
## 21 S H A F R M T      M
This is not exactly what you have above (there is no row for a bare "A", because "A" is not an entry in df1), but I think it's correct.
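The question also asked for a collapsed form (Output 1), with one row per df2 entry and all matches joined by commas. A minimal sketch of that variant, using the same word-set matching rule as above. The helper name collapse_matches and the column name Matches are illustrative choices, not part of the original answer:

```r
# Input data, copied from the answer above
df1 <- data.frame(ColA = c("A B", "B C", "C D", "E F", "G H", "A M", "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z", "C D M N F K L", "S H A F R M T"),
                  stringsAsFactors = FALSE)

# Split every pattern into its words once, up front
pattern_tokens <- strsplit(df1$ColA, " ", fixed = TRUE)

# collapse_matches() is a hypothetical helper: for one ColB string it
# returns every ColA entry whose words all occur in that string,
# joined by commas (an empty string if nothing matches)
collapse_matches <- function(target) {
  target_tokens <- strsplit(target, " ", fixed = TRUE)[[1]]
  hit <- vapply(pattern_tokens,
                function(p) all(p %in% target_tokens),
                logical(1))
  paste(df1$ColA[hit], collapse = ",")
}

df2$Matches <- vapply(df2$ColB, collapse_matches, character(1),
                      USE.NAMES = FALSE)
df2$Matches
# "A B,B C,C D" "C D,M" "A M,M"
```

Matches are listed in the order in which the patterns appear in df1, so reordering df1 changes the order inside each comma-separated string but not which patterns are found.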

It is not obvious to me how your data frames df1 and df2 are built, but you can try to vectorise your data and match both sets.
d1 <- sort(as.character(unlist(df1)))
d2 <- sort(as.character(unlist(df2)))
# get the intersection/difference without duplicates
intersect(d1,d2)
setdiff(d1,d2)
# get all values matching with the first or with the second dataset, respectively
d1[ d1 %in% d2 ]
d2[ d2 %in% d1 ]

Related

R - Adding a total row in Excel output

I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X is a data.frame with 8 character variables and 1 numeric variable, so the total row should contain a total only for the numeric column. Ideally I could switch on Excel's total-row feature while writing the table to the workbook object, as I did with firstColumn, rather than adding a total row manually.
I searched for a solution both on Stack Overflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na

library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.

Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create the totals row using lapply():
totals <- lapply(df, function(col) {
  # sum numeric columns; leave NA in every other column
  if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind():
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>

Generate list of second-degree neighbors using lapply and igraph

I got some excellent advice here on how to lookup neighbors for a list of network nodes. See: lapply function to look up neighbors in igraph (when not all nodes are found)
Now I need to do the same thing with second-degree neighbors. However, substituting either ego or neighborhood function into this loop produces an error.
edgelist <- read.table(text = "
A B
B C
C D
D E
C F
F G")
testlist <- read.table(text = "
A
H
C
D
J")
testlist2 <- read.table(text = "
A
C
B
D
E")
library(igraph)
graph <- graph.data.frame(edgelist)
str(graph)
get_neighbors <- function(graph, n) {
  do.call(rbind, lapply(n, function(x) {
    if (x %in% V(graph)$name) {
      nb <- neighborhood(graph,2, x) ##HERE##
      if (length(nb) > 0) {
        data.frame(lookupnode=x,
                   neighbor=nb$name, # h/t @MrFlick for this shortcut
                   stringsAsFactors=FALSE)
      } else {
        data.frame(lookupnode=x, neighbor=NA, stringsAsFactors=FALSE)
      }
    } else {
      data.frame(lookupnode=x, neighbor=NA, stringsAsFactors=FALSE)
    }
  }))
}
A=get_neighbors(graph, as.character(testlist$V1))
Error in data.frame(lookupnode = x, neighbor = nb$name, stringsAsFactors = FALSE) : arguments imply differing number of rows: 1, 0
I gather the issue is that ego and neighborhood can't be directly coerced into a data frame. I can use unlist and then put in a data frame, but the values I want end up as row.names not values that I can put into my output.
How can I create an output of second-degree neighbors?

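I changed
neighbor=nb$name, # h/t @MrFlick for this shortcut
to
neighbor=names(unlist(nb)), # h/t @MrFlick for this shortcut
and it is working for me now. neighborhood() returns a list with one element per looked-up node, not a single vertex sequence, which is why nb$name does not yield the neighbor names.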
> A
lookupnode neighbor
1 A A
2 A B
3 A C
4 H <NA>
5 C C
6 C B
7 C D
8 C F
9 C A
10 C E
11 C G
12 D D
13 D C
14 D E
15 D B
16 D F
17 J <NA>
>

Apply List of functions on List of columns based on different combinations

I have a data frame df with three categorical variables (cat1, cat2, cat3) and two continuous variables (con1, con2). I would like to apply a list of functions (sd, mean) to a list of columns (con1, con2) for different combinations of the grouping columns cat1, cat2, cat3. So far I have done this by explicitly subsetting every combination.
# Random generation of values for categorical data
set.seed(33)
df <- data.frame(cat1 = sample( LETTERS[1:2], 100, replace=TRUE ),
                 cat2 = sample( LETTERS[3:5], 100, replace=TRUE ),
                 cat3 = sample( LETTERS[2:4], 100, replace=TRUE ),
                 con1 = runif(100,0,100),
                 con2 = runif(100,23,45))
# Introducing null values
df$con1[c(23,53,92)] <- NA
df$con2[c(33,46)] <- NA
results <- data.frame()
funs <- list(sd=sd, mean=mean)
# calculation of mean and sd on total observations
sapply(funs, function(x) sapply(df[,c(4,5)], x, na.rm=T))
# calculation of mean and sd on different levels of cat1
sapply(funs, function(x) sapply(df[df$cat1=='A',c(4,5)], x, na.rm=T))
sapply(funs, function(x) sapply(df[df$cat1=='B',c(4,5)], x, na.rm=T))
# calculation of mean and sd on different levels of cat1 and cat2
sapply(funs, function(x) sapply(df[df$cat1=='A' & df$cat2=='C' ,c(4,5)], x, na.rm=T))
.
.
.
sapply(funs, function(x) sapply(df[df$cat1=='B' & df$cat2=='E' ,c(4,5)], x, na.rm=T))
# Similarly for the combinations of three cat variables cat1, cat2, cat3
I would like to write a function that dynamically applies the list of functions to the list of columns for each combination. Could you please give some suggestions? Thanks!
Edit:
I have already received some smart suggestions using dplyr. It would be great if someone could provide suggestions using the apply family of functions, as that will help me use the results (data frames) for my further requirements.

This is a simple one-line base solution:
> do.call(cbind, lapply(funs, function(x) aggregate(cbind(con1, con2) ~ cat1 + cat2 + cat3, data = df, FUN = x, na.rm = TRUE)))
sd.cat1 sd.cat2 sd.cat3 sd.con1 sd.con2 mean.cat1 mean.cat2 mean.cat3 mean.con1 mean.con2
1 A C B NA NA A C B 25.52641 37.40603
2 B C B 32.67192 6.966547 B C B 46.70387 34.85437
3 A D B 31.05224 6.530313 A D B 37.91553 37.13142
4 B D B 23.80335 6.001468 B D B 59.75107 30.29681
5 A E B 22.79285 1.526472 A E B 38.54742 25.23007
6 B E B 32.92139 2.621067 B E B 51.56253 29.52367
7 A C C 26.98661 5.710335 A C C 36.32045 36.42465
8 B C C 20.22217 8.117184 B C C 60.60036 34.98460
9 A D C 33.39273 7.367412 A D C 40.77786 35.03747
10 B D C 12.95351 8.829061 B D C 49.77160 33.21836
11 A E C 33.73433 4.689548 A E C 55.53135 32.38279
12 B E C 25.38637 9.172137 B E C 46.69063 31.56733
13 A C D 36.12545 6.323929 A C D 48.34187 32.36789
14 B C D 30.01992 7.130869 B C D 53.87571 33.12760
15 A D D 15.94151 11.756115 A D D 35.89909 31.76871
16 B D D 10.89030 6.829829 B D D 22.86577 32.53725
17 A E D 24.88410 6.108631 A E D 47.32549 35.22782
18 B E D 12.73711 8.151424 B E D 33.95569 36.70167
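The one-liner above covers only the full three-way combination, while the question asked for every combination of the grouping columns (cat1 alone, cat1 with cat2, and so on). A rough sketch of one way to do that with the apply family and aggregate(); the names cats, cons, groupings and summarise_by are illustrative, not from the original answer. Note one difference in NA handling: the formula interface used above drops any row with an NA in either con1 or con2 before aggregating (its default na.action is na.omit), whereas the by = interface used here hands each column to the function separately and relies on na.rm = TRUE, so the numbers can differ slightly.

```r
# Data set-up copied from the question
set.seed(33)
df <- data.frame(cat1 = sample(LETTERS[1:2], 100, replace = TRUE),
                 cat2 = sample(LETTERS[3:5], 100, replace = TRUE),
                 cat3 = sample(LETTERS[2:4], 100, replace = TRUE),
                 con1 = runif(100, 0, 100),
                 con2 = runif(100, 23, 45))
df$con1[c(23, 53, 92)] <- NA
df$con2[c(33, 46)] <- NA

funs <- list(sd = sd, mean = mean)
cats <- c("cat1", "cat2", "cat3")   # grouping columns
cons <- c("con1", "con2")           # columns to summarise

# Every non-empty subset of the grouping columns (7 subsets for 3 columns)
groupings <- unlist(
  lapply(seq_along(cats), function(k) combn(cats, k, simplify = FALSE)),
  recursive = FALSE)

# summarise_by() is a hypothetical helper: one row per observed group,
# one column per function/variable pair (sd.con1, sd.con2, mean.con1, ...)
summarise_by <- function(group_cols) {
  per_fun <- lapply(names(funs), function(fn) {
    out <- aggregate(df[cons], by = df[group_cols],
                     FUN = funs[[fn]], na.rm = TRUE)
    names(out)[names(out) %in% cons] <- paste(fn, cons, sep = ".")
    out
  })
  Reduce(function(a, b) merge(a, b, by = group_cols), per_fun)
}

results <- lapply(groupings, summarise_by)
names(results) <- vapply(groupings, paste, character(1), collapse = "+")
names(results)
```

Each element of results is an ordinary data frame, so results[["cat1+cat2"]] can be indexed or passed on like any other. Groups with a single observation get an NA standard deviation, as in the output above.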

Whole dataset shows up, although a subset has been selected and newly defined

I have a data frame which I have subsetted using normal indexing. The code is below.
dframe <- dframe[1:10, c(-3,-7:-10)]
But when I write dframe$Symbol I get the following output:
BABA ORCL LFC TSM ACT ABBV MA ABEV KMI UPS
3285 Levels: A AA AA^B AAC AAN AAP AAT AAV AB ABB ABBV ABC ABEV ABG ABM ABR ABR^A ABR^B ABR^C ABRN ABT ABX ACC ACCO ACE ACG ACH ACI ACM ACN ACP ACRE ACT ACT^A ACW ADC ADM ADPT ADS ADT ADX AEB AEC AED AEE AEG AEH AEK AEL AEM AEO AEP AER AES AES^C AET AF AF^C ... ZX
I'm wondering what is happening here. Does the dframe dataframe only contain 10 rows or still all rows, but only outputs 10 rows?
Thanks

That's just the way factors work. When you subset a factor, it preserves all levels, even those that are no longer represented in the subset. For example:
f1 <- factor(letters);
f1;
## [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
f2 <- f1[1:10];
f2;
## [1] a b c d e f g h i j
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
To answer your question, it's actually slightly tricky to append all missing levels to a factor. You have to combine the existing factor data with all missing indexes (here I'm referring to the integer indexes that the factor class internally uses to map the actual factor data to its levels vector, which is stored as an attribute on the factor object), and then rebuild a factor (using the original levels) from that combined data. Below I demonstrate this, now randomizing the subset taken from f1 to demonstrate that order does not matter:
set.seed(1); f3 <- sample(f1,10);
f3;
## [1] g j n u e s w m l b
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
factor(c(f3,setdiff(1:nlevels(f3),as.integer(f3))),labels=levels(f3));
## [1] g j n u e s w m l b a c d f h i k o p q r t v x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
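For the situation in the question, the subset itself does contain only the selected rows; only the levels attribute still lists every symbol. A minimal sketch using base R's droplevels(), which discards unused levels. Here dframe_demo is a small made-up stand-in for the asker's dframe, which is not shown in full:

```r
f1 <- factor(letters)
f2 <- f1[1:10]

length(f2)                # 10 values ...
nlevels(f2)               # ... but still 26 levels

f2_clean <- droplevels(f2)
nlevels(f2_clean)         # 10

# The same applies to factor columns of a data frame.
# dframe_demo is hypothetical example data, not the asker's data.
dframe_demo <- data.frame(Symbol = factor(c("BABA", "ORCL", "LFC", "TSM")),
                          Price  = c(1, 2, 3, 4))
sub <- dframe_demo[1:2, ]
nrow(sub)                 # 2 rows: the subset really is smaller
levels(sub$Symbol)        # all 4 original levels are still listed

sub <- droplevels(sub)    # drops unused levels in every factor column
levels(sub$Symbol)        # "BABA" "ORCL"
```

Calling nrow() or dim() on the subset is the quickest way to confirm how many rows it holds, independently of what the Levels line prints.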

combining rows in sequence file

I have a data frame in which each individual has two rows, and I want to combine these two rows into one row.
Code lines:
dat <- read.table("cbin.csv",sep="\t", row.names=1)
dat
V2 V3 V4 V5
1_1 A B C D
1_2 a b c d
2_1 E F G H
2_2 e f g h
3_1 J K L M
3_2 j k l m
d <- apply( dat[ , colnames(dat) ] , 2 , paste , collapse = " " )
d
V2 V3 V4 V5
"A a E e J j" "B b F f K k" "C c G g L l" "D d H h M m"
But I want to combine each pair of rows like this:
1 A a B b C c D d
2 E e F f G g H h
3 J j K k L l M m
How can I do this?

This will get you more or less the data.frame you want. I just pull out the even rows and cbind them next to the odd rows.
dat2 <- cbind(dat[seq(1, nrow(dat), by = 2), ],
              dat[seq(2, nrow(dat), by = 2), ])
I'll leave reordering the columns (or pasting them together, if you want to combine them into individual strings) as an exercise for the reader.

Here are a couple of options:
Option 1: Use stack to get a long data.frame, then use paste within aggregate to get the output you want.
Here's how you make your "long" data.frame.
Long <- cbind(rn = rownames(dat), stack(dat))
head(Long)
# rn values ind
# 1 1_1 A V2
# 2 1_2 a V2
# 3 2_1 E V2
# 4 2_2 e V2
# 5 3_1 J V2
# 6 3_2 j V2
If the values in "dat" are factors, you might need to do:
Long <- cbind(rn = rownames(dat), stack(lapply(dat, as.character)))
Once your data are in a long form, use aggregate along with substr (among other choices) to get the values you need to paste together.
aggregate(values ~ substr(rn, 1, 1), Long, paste, collapse = " ")
# substr(rn, 1, 1) values
# 1 1 A a B b C c D d
# 2 2 E e F f G g H h
# 3 3 J j K k L l M m
Option 2: A similar approach to what @Gregor is suggesting. It also takes every alternate row and binds the two halves together, but goes the extra step of reordering the columns and pasting the values together.
do.call(paste,
        cbind(dat[c(TRUE, FALSE), ],
              dat[c(FALSE, TRUE), ])[order(rep(names(dat), 2))])
# [1] "A a B b C c D d" "E e F f G g H h" "J j K k L l M m"
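Both options above assume a regular layout; substr(rn, 1, 1) in particular reads only the first character of the row name, so it would group "10_1" together with "1_1". A sketch that derives the ID from everything before the underscore instead. The data frame is retyped from the question rather than read from cbin.csv, and the names id and combined are illustrative:

```r
# Data retyped from the question (an assumption: cbin.csv itself is not available)
dat <- data.frame(V2 = c("A", "a", "E", "e", "J", "j"),
                  V3 = c("B", "b", "F", "f", "K", "k"),
                  V4 = c("C", "c", "G", "g", "L", "l"),
                  V5 = c("D", "d", "H", "h", "M", "m"),
                  row.names = c("1_1", "1_2", "2_1", "2_2", "3_1", "3_2"),
                  stringsAsFactors = FALSE)

# Individual ID = row name with the "_1" / "_2" suffix removed
id <- sub("_.*$", "", rownames(dat))
# Keep IDs in order of first appearance rather than alphabetical order
id <- factor(id, levels = unique(id))

# unlist() walks a data frame column by column, so each pair of rows
# comes out interleaved: V2 row 1, V2 row 2, V3 row 1, V3 row 2, ...
combined <- vapply(split(dat, id),
                   function(pair) paste(unlist(pair), collapse = " "),
                   character(1))
combined
# named character vector:
#   1 = "A a B b C c D d", 2 = "E e F f G g H h", 3 = "J j K k L l M m"
```

This relies on the two rows of each individual appearing in the order _1 then _2, as they do in the question.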
