Using R to text mine and extract words - r

I asked a similar question before, but I still need some help or to be pointed in the right direction.
I am trying to locate certain words within a column that contains a SQL statement in every row, and to extract the word that follows each of them, in RStudio.
Example: let's call this data frame "SQL":
| **UserID** | **SQL Statement**
1 | N781 | "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
2 | N283 | "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
So I am trying to pull out the table names: find the words "FROM" and "JOIN" and extract the table name that follows each.
I have been using some code I got help with earlier.
I put the column "SQL Statement" into a list of 2 named "b".
I use the code:
z <- mapply(grepl,"(FROM|JOIN)",b)
which gives me TRUE or FALSE for each word in each list.
z <- mapply(grep,"(FROM|JOIN)",b)
The above is close. It gives me the position of every match in each of the lists.
But I am just trying to find the word JOIN or FROM and pull out the word after it. I was trying to get output something like:
| **UserID** | **SQL Statement** | Tables
1 | N781 | "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON | Table.1, Table.2
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
2 | N283 | "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 = | Table.11, Table.31
v.BusID WHERE SubID = 1 ORDER BY v.Name;"

Here is a working script which uses base R options. The inspiration here is to leverage strsplit to split the query string on the keywords FROM or JOIN. Then, the first separate word of each resulting term (except for the first term) should be a table name.
sql <- "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
terms <- strsplit(sql, "(FROM|JOIN)\\s+")
out <- unlist(lapply(terms, function(x) gsub("^([^[:space:]]+).*", "\\1", x)))
out <- out[2:length(out)]
out
[1] "Table.1" "Table.2" "Table.3"
Demo
To understand better what I did, follow the demo and have a look at the terms list which resulted from splitting.
Edit:
Here is a link to another demo which shows how you might use the above logic on a vector of query strings, to generate a list of vector of tables, for each query
Demo
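As a rough sketch of how the same strsplit logic might be applied to a whole column, the snippet below runs it over a vector of query strings and pastes the table names into a new column. The data frame `SQL`, its column names `UserID` and `Statement`, and the shortened single-line queries are assumptions made up for illustration, not the asker's real data.

```r
# Hypothetical data frame with one (single-line) SQL statement per row
SQL <- data.frame(
  UserID = c("N781", "N283"),
  Statement = c(
    "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON p.ProdID = pv.ProdID JOIN Table.3 v ON pv.BusID = v.BusID",
    "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON p.ProdID = pv.ProdID"
  ),
  stringsAsFactors = FALSE
)

# strsplit is vectorised: it returns one character vector of pieces per query
terms <- strsplit(SQL$Statement, "(FROM|JOIN)\\s+")

# For each query, drop the piece before the first keyword and keep
# only the first whitespace-delimited word of every remaining piece
tables <- lapply(terms, function(x) gsub("^([^[:space:]]+).*", "\\1", x[-1]))

# Collapse each vector of table names into one comma-separated string
SQL$Tables <- vapply(tables, paste, character(1), collapse = ", ")
SQL$Tables
```

The pattern is case-sensitive and splits on every literal FROM or JOIN, so lowercase keywords or those words appearing inside string literals would need extra handling.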

Related

Getting data from queryparser

I am attempting to use queryparser to extract table relationships from an SQL query. I can get most of what I need, I'm just having issues unpacking the lists.
library(queryparser)
file <- "select
p.name,
p.age,
p.hometown,
c.state,
c.country,
n.capitol,
n.leader
from person p
inner join city c
on p.hometown = c.name
inner join nation n
on c.county = n.name
where C.country != 'Antarctica' "
query <- parse_query(file, tidyverse = TRUE)
query$from
yields the following lists:
> query$from
$p
person
$c
city
$n
nation
attr(,"join_types")
[1] "inner join" "inner join"
attr(,"join_conditions")
attr(,"join_conditions")[[1]]
p.hometown == c.name
attr(,"join_conditions")[[2]]
c.county == n.name
I would like to have a data frame that has each table name and its alias, and a second table with the join criteria. What is the easiest way to do this dynamically, so that I don't have to adjust the code between scanning different scripts?
Convert to character and then use stack. For the info in the attributes remove the names and simplify giving the character matrix shown.
stack(sapply(query$from, as.character))
## values ind
## 1 person p
## 2 city c
## 3 nation n
simplify2array(attributes(query$from)[-1])
## join_types join_conditions
## [1,] "inner join" "p.hometown == c.name"
## [2,] "inner join" "c.county == n.name"
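If two separate data frames are wanted rather than a stacked frame and a matrix, something along these lines might work. To keep the sketch runnable without the package, `from` is a hand-built stand-in for `query$from`; its structure (a named list of symbols with "join_types" and "join_conditions" attributes) is an assumption inferred from the printed output above. It also uses deparse() where the answer uses as.character(), so that the join conditions come out as single strings.

```r
# Hypothetical stand-in for query$from (structure assumed from the printout)
from <- list(p = quote(person), c = quote(city), n = quote(nation))
attr(from, "join_types") <- c("inner join", "inner join")
attr(from, "join_conditions") <- list(
  quote(p.hometown == c.name),
  quote(c.county == n.name)
)

# Turn a symbol or call into a single string
to_text <- function(x) paste(deparse(x), collapse = "")

# Table 1: one row per table, with its alias
tables <- data.frame(
  alias = names(from),
  table = unname(vapply(from, to_text, character(1))),
  stringsAsFactors = FALSE
)

# Table 2: one row per join, with its type and condition
joins <- data.frame(
  join_type = attr(from, "join_types"),
  condition = vapply(attr(from, "join_conditions"), to_text, character(1)),
  stringsAsFactors = FALSE
)

tables
joins
```

Because the attributes are looked up by name, the sketch does not depend on the order in which they are stored.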

Compare two columns in different tables

Say I have two tables, Table A and Table B, and I want to compare a certain column.
For example,
Table A has the columns:
Name,Surname ,Family, species
Table B has the columns:
IP,Genes,Types,Species,Models
How do I compare the Species column between the two tables to get the matches? That is, I want to extract the names of the species that exist in both tables.
For example, if the first Species column contains
a b c d e f g h i
and the second Species column contains
k l m n a b y i l
I want this result:
a b i
Can you please tell me how I can do that, and also whether there is any way to do it without using a join?
Thank you very much
Try any of these options. I have used dummy data:
#Data
TableA <- data.frame(Species=c('a','b','c','d','e','f','g','h','i'),
Var=1,stringsAsFactors = F)
TableB <- data.frame(Species=c('k','l','m','n','a','b','y','i','l'),
Var2=2,stringsAsFactors = F)
#Option1
TableA$Species[TableA$Species %in% TableB$Species]
#Option 2
intersect(TableA$Species,TableB$Species)
In both cases the output will be:
[1] "a" "b" "i"
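The two options only agree when the first column has no repeated values. A small sketch with made-up data (the duplicated "b" and the extra `Name` and `IP` columns are assumptions for illustration) shows the difference, and how the same %in% test can pull out whole rows without a join:

```r
# Hypothetical tables; "b" appears twice in TableA
TableA <- data.frame(Species = c("a", "b", "b", "c"),
                     Name = c("n1", "n2", "n3", "n4"),
                     stringsAsFactors = FALSE)
TableB <- data.frame(Species = c("b", "a", "z"),
                     IP = 1:3,
                     stringsAsFactors = FALSE)

# %in% keeps every matching element of TableA, duplicates included
with_dups <- TableA$Species[TableA$Species %in% TableB$Species]
with_dups
# [1] "a" "b" "b"

# intersect() returns each common value once
common <- intersect(TableA$Species, TableB$Species)
common
# [1] "a" "b"

# The same logical test can subset full rows of TableA, no join needed
matched_rows <- TableA[TableA$Species %in% TableB$Species, ]
```

Wrapping the %in% result in unique() would make the two options equivalent.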

How do I get igraph to ignore blank cells?

So I have this dataframe that we will call test_a2. I want to use igraph to create a network map.
Col 1 Col 2 Col 3 Col 4
Table A | Table B | Table C |
Table Z | Table A | Table C | Table Y
Table K | Table L | Table M | Table B
Table J | Table H |
I am currently using the following code to map multiple columns
plot(graph.data.frame(rbindlist(lapply(seq(ncol(test_a2)-1), function(i) test_a2[i:(i+1)]))))
This gives me a graph with nodes and edges. However, wherever there is an empty cell, it creates a node for it and adds unnecessary connections. Is there any way to have it ignore these?
Would this work?
library(igraph)
library(data.table)
test_a2 <- data.frame(col1 = c("A","Z","K","J"),
col2 = c("B","A","L","H"),
col3 = c("C","C","M",""),
col4 = c("","Y","B",""), stringsAsFactors=FALSE)
# Treat empty cells as missing so they can be dropped
test_a2[test_a2 ==""] <- NA
# Pair adjacent columns into edges, then drop any edge with a missing endpoint
test_a3 <- na.omit(rbindlist(lapply(seq(ncol(test_a2)-1), function(i) test_a2[i:(i+1)])))
plot(graph.data.frame(test_a3))
One note about this approach: vertices whose only neighbours were "empty" cells will not appear in the graph. If you need to include them, you can add them afterwards.
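To see which vertices would be lost that way, one option is to compare the values in the data with the endpoints of the edges that survive. The sketch below uses base R only, on made-up data in which the last row holds a single table "Q"; the data and the variable names are assumptions for illustration.

```r
# Hypothetical data: the last row has one table and otherwise empty cells
test_a2 <- data.frame(col1 = c("A", "Z", "Q"),
                      col2 = c("B", "A", NA),
                      col3 = c("C", NA, NA),
                      stringsAsFactors = FALSE)

# Pair adjacent columns into a two-column edge list
edges <- do.call(rbind, lapply(seq_len(ncol(test_a2) - 1), function(i) {
  data.frame(from = test_a2[[i]], to = test_a2[[i + 1]],
             stringsAsFactors = FALSE)
}))

# Drop edges with a missing endpoint
edges <- edges[complete.cases(edges), ]

# Values present in the data that are not an endpoint of any remaining edge
all_vertices <- unique(na.omit(unlist(test_a2)))
isolated <- setdiff(all_vertices, c(edges$from, edges$to))
isolated
# [1] "Q"
```

With igraph, the full vertex set could then be supplied through the `vertices` argument of graph_from_data_frame(), for example as data.frame(name = all_vertices), so that isolated tables are still drawn.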

Split string using delimiter alternatively

I have a list of urls like this:
mydata <- read.table(header=TRUE, text="
Id
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrickpattern%3ADecorative%2FArt+Deco%3Abrickpattern%3AFloral%3Abrickpattern%3AGeometric%3Abrickpattern%3AGraphic%3Abrickpattern%3ATropical%3Aprice%3A300%2C10500&page=7&gridValue=4
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2040%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3ABlue%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=CjwKEAjw9_jJBRCXycSarr3csWcSJABthk07W_H0RxQtOPZX7VdD9CSmK4S01BMYdXbtc0XxC0OeChoCky_w_wcB
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI%3Abrand%3AUNITED%20COLORS%20OF%20BENETTON
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2038%3Averticalsizegroupformat%3AIN%2039%3Averticalsizegroupformat%3AIN%20M%3Averticalsizegroupformat%3AUK%2039%3Averticalsizegroupformat%3AUK%20M%3Averticalsizegroupformat%3AUK%20S%3Averticalsizegroupformat%3AUS%20M%3Averticalsizegroupformat%3AUS%20S%3Abrickpattern%3ASolid%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830216013?q=%3Aprce-asc%3Abricksleeve%3AShort%3Aprice%3A300%2C10500&page=2&gridValue=4
https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abrand%3AUS+POLO%3Abricksleeve%3AShort%3Aprice%3A300%2C10500
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AAJIO%3Abrand%3ABASICS%3Abrand%3ACelio%3Abrand%3ADNMX%3Abrand%3AGAS%3Abrand%3ALEVIS%3Abrand%3ANETPLAY%3Abrand%3ASIN%3Abrand%3ASUPERDRY%3Abrand%3AUS%20POLO%3Abrand%3AVIMAL%3Abrand%3AVIMAL%20APPARELS%3Abrand%3AVOI%20JEANS
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3ABritish+Club%3Abrand%3ACelio%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500&page=1&gridValue=4
")
I need to pull out value of parameters like the brand, verticalcolorfamily, q= etc from the urls. These parameters are the filters applied on the website.
The output I am looking for is a data frame with three columns: parameter, value, and the frequency of occurrence of the value. For example:
parameter | value | frequency
----------|----------------|----------
brand | FLYING+MACHINE | 2
q= | relevance | 5
price | 300%2C10500 | 2
brand | BASICS | 1
The only approach I can currently think of is to split each URL into a character vector, using every other "%3A" as the delimiter: [q=%3Arelevance, brickpattern%3ADecorative%2FArt+Deco, brickpattern%3AFloral, brickpattern%3AGeometric, brickpattern%3AGraphic, brickpattern%3ATropical, price%3A300%2C10500].
Then place each element in a column of a data frame, split again by "%3A", and do a group by.
Suggestions for another approach would be really appreciated.
Also, if I am supposed to use this approach, I don't know how to use alternating "%3A" occurrences as the delimiter.
urltools looks like an awesome package for what you want to do. Here's a hacked answer in the meantime. Starting with your data.frame:
# Convert to character list
# Get rid of url
# Split by "%3A" and convert to "long" list
L <- as.character(mydata$Id)
L <- gsub("https://www.example.com/dp/c/830216013\\?", "", L)
L <- unlist(strsplit(L, "%3A"))
head(L)
[1] "q=" "relevance" "brickpattern"
[4] "Decorative%2FArt+Deco" "brickpattern" "Floral"
Then:
# Convert to 2-column data frame
# Count unique parameter:value pairs
library(dplyr)
df <- data.frame(parameter = L[seq(1,length(L),2)], value = L[seq(2,length(L),2)]) %>%
group_by(parameter, value) %>%
summarize(frequency=sum(!is.na(value)))
I will show only the following entries where frequency >= 2:
# Show only entries with frequency >= 2
filter(df, frequency >= 2)
parameter value frequency
<fctr> <fctr> <int>
1 brand Celio 2
2 bricksleeve Short 2
3 q= relevance 6
4 verticalcolorfamily Black 2
5 verticalcolorfamily White 2
Note that the frequency for brand FLYING+MACHINE is not 2, because the value occurs once as FLYING%20MACHINE and once as FLYING+MACHINE.
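One way to avoid that mismatch might be to decode the values before counting. The sketch below is base R only and runs on two shortened, made-up query strings (an assumption, not the real URLs): it splits on every "%3A", takes odd elements as parameters and even elements as values, converts "+" to a space, percent-decodes with utils::URLdecode(), and then counts.

```r
# Hypothetical, shortened query strings in the same shape as the URLs above
q <- c("q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI",
       "q=%3Arelevance%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500")

# Split on every "%3A"; elements then alternate parameter, value, ...
parts <- unlist(strsplit(q, "%3A", fixed = TRUE))
kv <- data.frame(parameter = parts[c(TRUE, FALSE)],
                 value = parts[c(FALSE, TRUE)],
                 stringsAsFactors = FALSE)

# Normalise the values: "+" becomes a space, then percent-decode
kv$value <- gsub("+", " ", kv$value, fixed = TRUE)
kv$value <- vapply(kv$value, utils::URLdecode, character(1), USE.NAMES = FALSE)

# Count each parameter/value pair
kv$frequency <- 1L
freq <- aggregate(frequency ~ parameter + value, data = kv, FUN = sum)
freq
```

After decoding, both spellings collapse to "FLYING MACHINE" and are counted together. The alternation only holds while every query string contributes an even number of pieces, so trailing parts such as "&page=7" would need to be stripped first.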

Getting a dataframe of logical values from a vector of statements

I have a number of lists of conditions and I would like to evaluate their combinations, and then I'd like to get binary values for these logical values (True = 1, False = 0). The conditions themselves may change or grow as my project progresses, and so I'd like to have one place within the script where I can alter these conditional statements, while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# lapply to get binary values from these logical vectors
df[paste0("var_",names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
-id -- | -x -- | -y -- | -var_above2_belowpt6 -- | -var_above2_belowpt4 -- | etc.
1 | 11 | 0.5 | 1 | 0 |
2 | 4 | 0.9 | 0 | 0 |
3 | 8 | 0.11 | 1 | 1 |
etc. ....
Looks like the dreaded eval(parse()) does it (hard to think of a much easier way ...). Then use storage.mode()<- to convert from logical to integer ...
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
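Putting the pieces together, a complete run might look like the sketch below. It differs from the snippets above in two labeled ways, both assumptions for illustration: the conditions refer to the columns as x and y rather than df$x and df$y, and they are evaluated inside the data frame via the envir argument of eval(); lapply() with as.integer() replaces sapply() plus storage.mode() so that the result can be assigned as a list of new columns.

```r
df <- data.frame(id = 1:5, x = c(11, 4, 8, 9, 12), y = c(0.5, 0.9, 0.11, 0.6, 0.5))

# The one place where conditions are defined (bare column names, an assumption)
names1 <- c("above2", "above5")
conditions1 <- c("x > 2", "x > 5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("y < 0.6", "y < 0.4")

# All pairwise combinations, in matching order for names and conditions
names_combinations <- as.vector(t(outer(names1, names2, paste, sep = "_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep = " & ")))

# Evaluate each condition string with df's columns in scope, as 0/1 integers
res <- lapply(condition_combinations, function(cond) {
  as.integer(eval(parse(text = cond), envir = df))
})

# Attach the results as new, labeled columns
df[paste0("var_", names_combinations)] <- res
df
```

As with any eval(parse()) approach, the condition strings are executed as code, so they should only come from a trusted source such as the script itself.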
