I have a column of array type whose values are integers wrapped as strings. How can I convert this column into an array of proper integers?
Input:
Column1
["1","2","3"]
Desired output
Column1
[1,2,3]
P.S. I see I can do it using the mv-expand operator and then make_list as an aggregation function, but that causes a lot of performance overhead, and there are multiple other columns in my table that each need to be handled differently in the aggregation.
You can:
leave it as strings, depending on how you consume this array later on; or,
reformat the data at its source, before you ingest it into Kusto; or,
use mv-apply for the conversion at query runtime (this can also be done at ingestion time, using an update policy):
print Column1 = dynamic(["1","2","3"])
| mv-apply Column1 on (
    summarize Column1 = make_list(toint(Column1))
)
Column1
[ 1, 2, 3]
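For context: mv-apply expands the array, runs the subquery (here, toint on each element, re-packed with make_list) per record, and returns the rebuilt array, so it avoids the table-wide mv-expand/summarize round trip you were worried about.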
The data frame has multiple columns in dictionary (map) format which share the same keys.
How can I explode them into rows, keeping the key from any of the columns, without having to use any joins?
The schema of the data frame is here.
The columns that need to be exploded are pct_ci_tr, pct_ci_rn, pct_ci_ttv and pct_ci_comm.
I would do something like this:
from pyspark.sql import functions as F

df.select(
    "s__",
    F.expr("""
        stack(
            4,
            "pct_ci_tr", pct_ci_tr,
            "pct_ci_rn", pct_ci_rn,
            "pct_ci_ttv", pct_ci_ttv,
            "pct_ci_comm", pct_ci_comm
        ) as (lib, map_values)"""),
).select("s__", "lib", F.explode(F.col("map_values")))
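For context: stack(4, ...) reshapes the four map columns into four rows per input row, pairing each column's name (lib) with its map (map_values); F.explode then unpacks each map into one row per key/value pair, so no join is needed.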
I have imported a data set into RStudio and want to count the rows that have certain values in multiple columns. The columns I want to filter by are titled "ROW", which I want less than or equal to 90, "house", which I want equal to 1, and "type", which I want equal to 1.
I know that I can use the sum function like this:
sum(data$type==1)
and that returns the number of rows with the value 1 in the "type" column. I have tried to combine these conditions like this:
with(data, sum((type==1),(ROW<=90),(house==1)))
to no avail.
Any suggestions on what I can do?
If we need to combine logical expressions, use & (TRUE only if all of the conditions are TRUE) or | (TRUE if any of the conditions is TRUE):
with(data, sum((type==1)&(ROW<=90)&(house==1)))
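A minimal sketch with made-up data (the data frame and its values are hypothetical) showing why the conditions must be combined with & before summing:

# hypothetical toy data with one row matching all three conditions
data <- data.frame(type = c(1, 1, 2), ROW = c(50, 95, 10), house = c(1, 1, 1))

# & combines the conditions elementwise into one logical vector;
# sum() then counts the TRUEs, i.e. the rows where all three conditions hold
with(data, sum((type == 1) & (ROW <= 90) & (house == 1)))
#> [1] 1

# summing the conditions separately instead counts every TRUE across
# all three vectors, which is why the original attempt gave the wrong answer
with(data, sum((type == 1), (ROW <= 90), (house == 1)))
#> [1] 7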
So I have to import some data into R and am finding it reasonably difficult.
I have multiple similar tables in a directory and would like to write a script that looks for a specific row (matched by string, not by row number) in each of them and adds those rows to a new table.
Example data:
Table one:
name Johnny
registration date 01012001
userid>= 47
Table two:
name Jimmy
registration date 02052005
userid>= 1972
What I want is a table that contains:
userid>= 47
userid>= 1972
Note: separated by tap..
What I tried to do is the following:
A: Create a list of files in the working directory:
list <- Sys.glob("*.table")
B: Created a list of tables using lapply:
table <- lapply(list, function(x) read.table(x, header = FALSE, sep = "\t", fill = TRUE))
C: Tried to grep the word "userid" (failed):
table[grep("userid", rownames("userid")), ]
Error: incorrect number of dimensions
Is there a 'simpler' way to fetch the row of interest (userid>= in the example) based on a string, without relying on external packages? I could also use "grep userid *.table > newtable" in bash, but I want to do it purely in R.
How about this (given that the row names are in the first column, as in your example):
# list all tables in the current directory, optionally recursively
tbls <- list.files(getwd(), pattern = '\\.table$')  # if more dirs, maybe add recursive = TRUE
# create a list of tables
tbls_r <- lapply(tbls, function(x) read.table(x, header = FALSE, sep = '\t', stringsAsFactors = FALSE))
# using lapply to extract the row of interest
tbls_r <- lapply(tbls_r, function(x) x[x[,1] == 'userid>=',])
With lapply, we apply a function to each element of the list (in this case, each table). x references a single table, so with x[,1] == 'userid>=' I'm creating a logical vector (TRUE and FALSE) telling which values of the first column (indexed by x[,1] - note I'm leaving the row position empty because I want all rows but only the first column) are equal to the desired string.
I then use this logical vector right away to index the table itself, returning only the rows that have a corresponding TRUE value in the vector.
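For instance, with the first example table from the question read in as a data.frame:

# the first example table, as read.table would return it
x <- data.frame(V1 = c("name", "registration date", "userid>="),
                V2 = c("Johnny", "01012001", "47"),
                stringsAsFactors = FALSE)

x[, 1] == 'userid>='        # FALSE FALSE TRUE
x[x[, 1] == 'userid>=', ]   # keeps only the matching row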
# Bind the resulting rows to a single table
result_table <- do.call(rbind,tbls_r)
Hope that clears it up.
Edit:
If you just want to extract the values you can use this:
tbls_r <- sapply(tbls_r, function(x) x[x[,1] == 'userid>=',2])
In this case, I'm specifying during indexing that I only want column 2, leaving me with only the values.
Also, I'm using sapply instead of lapply, which returns a handy vector instead of a list, so there is no need to call do.call.
If you then want a data.frame, just go with something along the lines of
res <- data.frame(UID = tbls_r,stringsAsFactors=FALSE)
Of course you can add more variables to this data.frame given they have the same length.
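For the two example tables from the question, tbls_r would then be the vector c("47", "1972") (character, because the second column also holds names and dates, and in the order the files were listed), and res a one-column data.frame of those user ids.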
I have two tables which I need to compare
Table 1: XLOC IDs
Column A: Xloc id
Column B: gene id
Table 2: Ensembl IDs
Column A: Ensembl id
Column B: gene id
In both tables there are identical gene ids (names, e.g. cpa6). Table 1 has 25000 entries; table 2 has 46000 entries.
I need to insert the Ensembl IDs from Column A of Table 2 into Column C of Table 1 when the gene ids in Column B match, and create an output file with the new data, e.g.
Table 1
ENS0002 cpa6
Table 2:
Xloc0014 cpa6
Output file, Table 3:
ENS0002 cpa6 Xloc0014
The columns are not in the same order and cannot be sorted alphabetically etc. The remaining 21000 entries without corresponding Xlocs I will get rid of (but can easily do this after the output is produced).
Does anyone know how to do this, relatively easily, in either R, Excel, or other software?
N.B. The two tables cannot be sorted into the same order, so I really need a formula/script/bash to do this.
Try this. I have created example data frames to show how you can merge and keep only the values that exist in both tables.
As you can see, the new table is the result of the values that exist in both, and you now have 3 columns, with the value from the second table added.
In your case, merge on the gene id column so that only the gene ids present in both tables are kept, for example:
newTable <- merge(tab1, tab2, by = "gen_id")
tab1 <- data.frame(col1 = c("id1","id2","id3","id4"), col2 = c(1,2,3,4))
tab2 <- data.frame(col1 = c("id1","id2","id3","id5","id7"), col2 = c(1,3,3,5,6))
newTable <- merge(tab1, tab2, by = "col1")
In case you want to keep all rows from table 1, even when they don't exist in table 2, use this:
newTable <- merge(tab1, tab2, by = "col1", all.x = TRUE)
This keeps all the rows of table 1 and fills col2.y where a match exists; otherwise you will have NAs.
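For reference, the inner merge on the example data above yields (col2.x comes from tab1, col2.y from tab2):

newTable
#>   col1 col2.x col2.y
#> 1  id1      1      1
#> 2  id2      2      3
#> 3  id3      3      3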
In R I would use the merge function: merge(table1, table2, by = "gene_id"), merging on the shared gene id column.
However, I have done this in Excel before, which worked well too, using the VLOOKUP function. You just need an IF function with a nested VLOOKUP inside:
=IF(ISERROR(VLOOKUP(cell with the gene name in Table 1, range of cells containing the gene names in Table 2, number of the column to return from that range, FALSE so they must match exactly)), output if true, output if false)
Example:
=IF(ISERROR(VLOOKUP(C4,List1!A$2:A$1000,1,FALSE)), "Does NOT exist in List 1", "Exists in List 1")
Q1:
Is it possible to search on two different columns in a data.table? I have a table with roughly 2 million rows, and I want the option to search on either of two columns, one holding names and the other integers.
Example:
library(data.table)
x <- data.table(foo = letters, bar = 1:length(letters))
x
want to do
x['c'] : searching on foo column
as well as
x[2] : searching on bar column
Q2:
Is it possible to change the default data types in a data.table? I am reading in a matrix with both character and integer columns; however, everything is being read in as character.
Thanks!
-Abhi
To answer your Q2 first: a data.table is a data.frame, and both are internally a list, so each column of the data.table (or data.frame) can be of a different class. You can't do that with a matrix. You can use := to change the class (by reference - no unnecessary copy is made), for example for "bar" here:
x[, bar := as.integer(as.character(bar))]
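A quick illustration of the conversion by reference, on toy data where "bar" came in as character:

library(data.table)
x <- data.table(foo = letters[1:3], bar = c("1", "2", "3"))
class(x$bar)                                # "character"
x[, bar := as.integer(as.character(bar))]   # converted in place, no copy
class(x$bar)                                # "integer"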
For Q1: if you want to use the fast-subset (binary search) feature of data.table, you have to set a key first, using the function setkey.
setkey(x, foo)
allows you to fast-subset on 'foo' alone, like x['a'] (or x[J('a')]). Similarly, setting the key on 'bar' allows you to fast-subset on that column.
If you set the key on both 'foo' and 'bar', then you can provide values for both, like so:
setkey(x) # or alternatively setkey(x, foo, bar)
x[J('c', 3)]
However, this subsets the rows where foo == 'c' and bar == 3. Currently, I don't think there is a way to do an OR (|) operation with fast subsetting directly. You'll have to resort to a vector-scan approach in that case, as sketched below.
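A minimal sketch of both approaches on the toy table (standard data.table syntax; the vector scan is slower but handles OR):

library(data.table)
x <- data.table(foo = letters, bar = seq_along(letters))

setkey(x, foo, bar)   # key on both columns
x[J('c', 3)]          # fast binary search: foo == 'c' AND bar == 3

# OR across the two columns: fall back to an ordinary vector scan
x[foo == 'c' | bar == 2]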
Hope this is what your question was about. Not sure.
Your matrix is already character; matrices hold only one data type. With the keys set, you can try x['c'] and x[J(2)]. You can change a column's data type with x[, col := as.character(col)].