Given:
let table1 = datatable (col1:string, col2:string)
["abc","def"];
let table2 = datatable (col3:string, col4:string)
["ghi","jkl"];
where table1 | union table2 gives:
col1  col2  col3  col4
abc   def
            ghi   jkl
How can I get this instead?
col1 col2 col3 col4
abc def ghi jkl
Assume there might be many more columns, so any solution requiring enumerating all of them doesn't work.
Related question:
Is there a way to combine data from two tables in Kusto?
One solution I know of is:
table1 | extend pivot=""
| join kind=innerunique (table2 | extend pivot = "") on pivot
| project-away pivot*
Related
I have a dataframe like this:
col1 col2 col3
abc def xzy
lmk qwe abc
def lmk xzy
xzy abc qwe
The three columns hold character datatype values.
Across the 3 columns I have 5 unique values: abc, def, xzy, lmk and qwe.
What I need is a count of the number of times each of these values appears in the whole data frame.
abc 3
qwe 2
def 2
xzy 3
lmk 2
All the count() and aggregate functions only work column-wise, and when I unlist, it doesn't seem to work either.
Any suggestions for functions that I can use?
Many thanks in advance.
You should do something like this (assuming that your data frame above is called df1):
data <- c(df1$col1, df1$col2, df1$col3)
table(data)
That gives you the desired values. Make sure the data in df1 are characters, not factors.
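As a quick check, here is a minimal, self-contained sketch that reproduces the sample from the question (the data frame name df1 and the expected counts are taken from the post above):

# Rebuild the example data frame from the question
df1 <- data.frame(col1 = c("abc", "lmk", "def", "xzy"),
                  col2 = c("def", "qwe", "lmk", "abc"),
                  col3 = c("xzy", "abc", "xzy", "qwe"),
                  stringsAsFactors = FALSE)  # characters, not factors

# Stack all three columns into one vector and tabulate it
data <- c(df1$col1, df1$col2, df1$col3)
table(data)

data
abc def lmk qwe xzy
  3   2   2   2   3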
You can use an aggregate function together with UNION ALL. Make sure you use UNION ALL and not plain UNION: UNION removes duplicate rows, so identical per-column counts would be collapsed.
Let us assume that your table is called tmpcol with columns col1, col2, and col3...
select value, sum(cnt)
from (
    select col1 as value, count(*) as cnt
    from tmpcol
    group by col1
    union all
    select col2 as value, count(*) as cnt
    from tmpcol
    group by col2
    union all
    select col3 as value, count(*) as cnt
    from tmpcol
    group by col3
) dataset
group by value;
So I have this dataframe that we will call test_a2. I want to use igraph to create a network map.
Col 1   | Col 2   | Col 3   | Col 4
Table A | Table B | Table C |
Table Z | Table A | Table C | Table Y
Table K | Table L | Table M | Table B
Table J | Table H |         |
I am currently using the following code to map multiple columns
plot(graph.data.frame(rbindlist(lapply(seq(ncol(test_a2)-1), function(i) test_a2[i:(i+1)]))))
This gives me a graph with nodes and edges. However, where there is an empty cell, it creates a node for it and adds unnecessary connections. Is there any way to have it ignore these?
Would this work?
library(igraph)
library(data.table)
test_a2 <- data.frame(col1 = c("A","Z","K","J"),
                      col2 = c("B","A","L","H"),
                      col3 = c("C","C","M",""),
                      col4 = c("","Y","B",""), stringsAsFactors = FALSE)
test_a2[test_a2 == ""] <- NA
# bind each pair of adjacent columns by position, then drop the rows containing NA
test_a3 <- na.omit(rbindlist(lapply(seq(ncol(test_a2)-1), function(i) test_a2[i:(i+1)]),
                             use.names = FALSE))
plot(graph.data.frame(test_a3))
One note about this approach: the graph will not contain vertices whose only connections were to "empty" cells. If you need to include them, you can add them afterwards.
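For example, here is a minimal sketch of adding those dropped vertices back afterwards, using the test_a2 and test_a3 objects from above (in this toy data nothing is actually missing, but on real data there may be):

g <- graph.data.frame(test_a3)
# vertices that appear somewhere in test_a2 but were dropped together with the NA rows
missing <- setdiff(na.omit(unique(unlist(test_a2))), V(g)$name)
g <- add_vertices(g, length(missing), name = missing)
plot(g)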
I asked a similar question before, but I still need some help / to be pointed in the right direction.
I am trying to locate certain words within a column that contains a SQL statement in every row, and to extract the word that follows, in RStudio.
Example: let's call this data frame "SQL".
  | **UserID** | **SQL Statement**
1 | N781       | "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON
                  p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
                  v.BusID WHERE SubID = 1 ORDER BY v.Name;"
2 | N283       | "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON
                  p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
                  v.BusID WHERE SubID = 1 ORDER BY v.Name;"
So I am trying to pull out the table names: I am trying to find the words "FROM" and "JOIN" and pull out the word that follows each.
I have been using some code with help from earlier:
I put the column "SQL Statement" into a list of 2, named "b".
I use the code:
z <- mapply(grepl,"(FROM|JOIN)",b)
which gives me TRUE and FALSE for each word in each list.
z <- mapply(grep,"(FROM|JOIN)",b)
The above is close. It gives me the position of every match in each of the lists.
But I am just trying to find the word JOIN or FROM and take out the word that follows. I was trying to get an output something like:
  | **UserID** | **SQL Statement**                                 | Tables
1 | N781       | "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON    | Table.1, Table.2
                  p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
                  v.BusID WHERE SubID = 1 ORDER BY v.Name;"
2 | N283       | "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON   | Table.11, Table.31
                  p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
                  v.BusID WHERE SubID = 1 ORDER BY v.Name;"
Here is a working script which uses only base R. The idea is to leverage strsplit to split the query string on the keywords FROM or JOIN; the first word of each resulting piece (except for the first piece) should then be a table name.
sql <- "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
terms <- strsplit(sql, "(FROM|JOIN)\\s+")
out <- unlist(lapply(terms, function(x) gsub("^([^[:space:]]+).*", "\\1", x)))
out <- out[2:length(out)]
out
[1] "Table.1" "Table.2" "Table.3"
To understand better what I did, have a look at the terms list which results from the splitting.
Edit:
The same logic can be applied to a whole vector of query strings, to generate a vector of table names for each query, as sketched below.
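Here is a minimal sketch of that vectorized version, reusing the same splitting logic (the helper name extract_tables and the example vector queries are only illustrative):

extract_tables <- function(q) {
  pieces <- strsplit(q, "(FROM|JOIN)\\s+")[[1]]
  # first word of every piece except the leading SELECT part
  gsub("^([^[:space:]]+).*", "\\1", pieces)[-1]
}

queries <- c(sql,
             "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON pv.BusID.1 = v.BusID;")

lapply(queries, extract_tables)   # a list with one vector of table names per query
sapply(queries, function(q) paste(extract_tables(q), collapse = ", "))   # a "Tables" column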
Let's suppose I have a data frame with the following parameters
DATA <- data.frame(ROWID, ID1, NAME1, ..., IDn, NAMEn)
Sample of what the data might look like:
ROWID | ID1 | NAME1 | ID2 | NAME2 | IDn | NAMEn
001 | 001 | FAS | 002 | MAS | 999 | ZOO
002 | 003 | BIN | 004 | DUN | 998 | SOO
Where I have 201 columns by 10k+ rows. What I would like to do is reshape this data so that, for each row in the original DATA, I produce a set of rows in a new data frame. Each of those rows would consist of the originating ROWID plus a pair (IDa, NAMEa, IDb, NAMEb), such that the first pair is matched with all the others (99 rows involving ID1, 98 involving ID2, and so on). This would happen for every row, producing one large data frame of all possible within-row combinations. The result would look like:
ROWID1 | ID1 | NAME1 | ID2 | NAME2
ROWID1 | ID1 | NAME1 | ID3 | NAME3
...
ROWID1 | ID2 | NAME2 | ID3 | NAME3
...
ROWID2 | ID1 | NAME1 | ID2 | NAME2
ROWID2 | ID1 | NAME1 | ID3 | NAME3
...
The code I produced to do this is as follows. It works fine, but only on smaller data frames; on the full data frame it is painfully slow, and I am hoping there are alternatives to speed it up using functions or something else of which I am unaware. Thanks in advance!
DATA <- data.frame(as described above)
META <- data.frame(ROWID=numeric(0), ID1=numeric(0),
                   BUS1=character(0), ID2=numeric(0), BUS2=character(0))

for (i in 1:length(DATA$ROWID)) {
  SET <- data.frame(ROWID=numeric(0), ID1=numeric(0),
                    BUS1=character(0), ID2=numeric(0), BUS2=character(0))
  ROWID <- DATA[i,1]
  for (x in seq(3, ncol(DATA), 2)) {
    for (y in seq(x, ncol(DATA), 2)) {
      ID1 <- DATA[i,x-2]
      BUS1 <- DATA[i,x]
      ID2 <- DATA[i,y-2]
      BUS2 <- DATA[i,y]
      if (!is.na(BUS1) && !is.na(BUS2)) {
        NEW <- cbind(ROWID, ID1, BUS1, ID2, BUS2)
        SET <- rbind(SET, NEW)
      }
    }
  }
  META <- rbind(META, SET)
}
Here is my way to write it, which includes all three of the optimizations I mentioned in the comments. Also, be careful: your code had some bugs in how it addressed the columns, which I have hopefully also fixed.
require('compiler')
enableJIT(3)

DATA2 <- as.matrix(DATA)
# use a matrix instead of a data.frame, and pre-allocate its size (an upper bound)
META2 <- matrix(character(), ncol=5, nrow=(nrow(DATA2)*(ncol(DATA2)-2)^2/2))
colnames(META2) <- c("ROWID","ID1","BUS1","ID2","BUS2")
k <- 0
for (i in 1:nrow(DATA2)) {
  for (x in seq(3, ncol(DATA2)-2, 2)) {
    for (y in seq(x+2, ncol(DATA2), 2)) {
      k <- k + 1
      # no need for temporary variables
      META2[k,] <- c(DATA2[i,1], DATA2[i,x-1], DATA2[i,x], DATA2[i,y-1], DATA2[i,y])
    }
  }
}
META2 <- as.data.frame(META2, stringsAsFactors=FALSE)  # converting back to a data.frame
META2$ROWID <- as.numeric(META2$ROWID)
META2$ID1   <- as.numeric(META2$ID1)   # the ID columns are numeric; the BUS/NAME columns stay character
META2$ID2   <- as.numeric(META2$ID2)
I will let you handle the case in which BUS1 or BUS2 is NA yourself: basically, you should not add those rows (and not increment the variable k), and after the loops you need to crop the matrix to remove the trailing empty rows.
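Here is a minimal sketch of that NA handling and of the final crop, staying with the matrix approach above:

# inside the innermost loop: only store complete pairs
if (!is.na(DATA2[i, x]) && !is.na(DATA2[i, y])) {
  k <- k + 1
  META2[k, ] <- c(DATA2[i,1], DATA2[i,x-1], DATA2[i,x], DATA2[i,y-1], DATA2[i,y])
}

# after the loops: drop the unused pre-allocated rows
META2 <- META2[seq_len(k), , drop = FALSE]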
My question relates to the creation of a variable which depends upon other columns within a data.table when none of the variable names are known in advance.
Below is a toy example where I have 5 rows and the new variable should be 1 when the condition equals "A" and 4 otherwise.
library(data.table)
DT <- data.table(Con = c("A","A","B","A","B"),
                 Eval_A = rep(1,5),
                 Eval_B = rep(4,5))
Col1 <- "Con"
Col2 <- "Eval_A"
Col3 <- "Eval_B"
Col4 <- "Ans"
The code below works but feels like I'm misusing the package!
DT[, Col4 := ifelse(DT[[Col1]] == "A",
                    DT[[Col2]],
                    DT[[Col3]]), with = FALSE]
Update:
Thanks, I did some quick timing of the answers below: once on a data.table with 5 million rows and only the relevant columns, and again after adding 10 irrelevant columns. Below are the results:
+-------------------------+---------------------+------------------+
| Method                  | Only relevant cols. | With extra cols. |
+-------------------------+---------------------+------------------+
| List method             | 1.8                 | 1.91             |
| Grothendieck - get/if   | 26.79               | 30.04            |
| Grothendieck - get/join | 0.48                | 1.56             |
| Grothendieck - .SDcols  | 0.38                | 0.79             |
| agstudy - Substitute    | 2.03                | 1.9              |
+-------------------------+---------------------+------------------+
Looks like .SDcols is best for speed, and substitute gives the easiest-to-read code.
1. get/if
Try using get:
DT[, (Col4) := if (get(Col1) == "A") get(Col2) else get(Col3), by = 1:nrow(DT)]
2. get/join
or try this approach:
setkeyv(DT, Col1)
DT[, (Col4):=get(Col3)]["A", (Col4):=get(Col2)]
3. .SDcols
or this:
setkeyv(DT, Col1)
DT[, (Col4):=.SD, .SDcols = Col3]["A", (Col4):=.SD, .SDcols = Col2]
UPDATE: Added some additional approaches.
Using ifelse and get:
DT[, (Col4) := ifelse(get(Col1) == "A", get(Col2), get(Col3))]
Or using substitute to create the expression, like this:
expr <- substitute(a4 := ifelse(a1 == "A", a2, a3),
                   list(a1 = as.name(Col1),
                        a2 = as.name(Col2),
                        a3 = as.name(Col3),
                        a4 = as.name(Col4)))
DT[, eval(expr)]
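As a quick sanity check, here is a minimal sketch running one of the approaches above on the toy data from the question (the other variants should produce the same Ans column):

library(data.table)
DT <- data.table(Con = c("A","A","B","A","B"),
                 Eval_A = rep(1,5),
                 Eval_B = rep(4,5))
Col1 <- "Con"; Col2 <- "Eval_A"; Col3 <- "Eval_B"; Col4 <- "Ans"

DT[, (Col4) := ifelse(get(Col1) == "A", get(Col2), get(Col3))]
DT

   Con Eval_A Eval_B Ans
1:   A      1      4   1
2:   A      1      4   1
3:   B      1      4   4
4:   A      1      4   1
5:   B      1      4   4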