How can I filter out redundant results except one?

Type  Count  Date
2     20     11-24
3     0      11-24
4     0      11-24
2     13     12-01
3     1      12-01
4     0      12-01
2     0      12-08
3     0      12-08
4     0      12-08
The table above has entries from three dates (11-24, 12-01, and 12-08). If a date group (say 11-24) has at least one row with a positive count, I want to remove all rows with a 0 count. But if a date group (like 12-08) has only 0-count rows, I want to remove all but one of those rows, and ideally give the remaining 0-count row a new type, like 5 for instance. The desired output would be:
Type  Count  Date
2     20     11-24
2     13     12-01
3     1      12-01
5     0      12-08
Is this possible with KQL?

let T = datatable(Type:int, Count:int, Date:string)
[
    ,2 ,20 ,"11-24"
    ,3 ,0  ,"11-24"
    ,4 ,0  ,"11-24"
    ,2 ,13 ,"12-01"
    ,3 ,1  ,"12-01"
    ,4 ,0  ,"12-01"
    ,2 ,0  ,"12-08"
    ,3 ,0  ,"12-08"
    ,4 ,0  ,"12-08"
];
// Highest existing Type, used to mint new Type values.
let max_type = toscalar(T | summarize max(Type));
// Keep every row with a positive count, and union on one synthetic
// row per date whose counts are all zero.
union (T
| where Count > 0
)
,(T
| summarize Count = max(Count) by Date
| where Count == 0
| serialize Type = toint(max_type + row_number())
)
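The second branch of the union reduces each all-zero date to a single row; serialize plus row_number() mints a distinct new Type for each such date, starting just past the existing maximum. This gives: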
Type  Count  Date
2     20     11-24
2     13     12-01
3     1      12-01
5     0      12-08

Related

Ungroup column R and assign binary variable for each unique entry

I have the following dataframe [df] in R:
Flair          Text            Time
YOLO           asder           10:01
Fluff          qetrtgf         10:02
Fluff          eqargrstgfh     10:04
Fluff          qrettzuh        10:05
Fluff          erghgfxbhs      10:17
Art and Media  qaerzgh wtruws  10:27
Charity        eztzutzui       10:31
Memes          etzuj           10:54
Question       rehbgfd         10:55
Provocative    hetzjas         10:56
...            ...             ...
which I would like to transform to this:
Text            Time   Flair_YOLO  Flair_Fluff  Flair_Art  ...
asder           10:01  1           0            0          ...
qetrtgf         10:02  0           1            0          ...
eqargrstgfh     10:04  0           1            0          ...
qrettzuh        10:05  0           1            0          ...
erghgfxbhs      10:17  0           1            0          ...
qaerzgh wtruws  10:27  0           0            1          ...
eztzutzui       10:31  0           0            0          ...
etzuj           10:54  0           0            0          ...
rehbgfd         10:55  0           0            0          ...
hetzjas         10:56  0           0            0          ...
...             ...    ...         ...          ...        ...
As the problem is hard to describe, I was looking into ungroup() but was unable to find the correct formula.
This is a reshaping problem from long to wide format. But since you are asking for some additional arrangements, I did not consider this a duplicate.
First, we can create an additional column of ones to serve as the 'value' in the reshape:
df$Flair_ <- 1  # column of ones used as the value in the reshape

out <- reshape(df, idvar = c("Text", "Time"),
               timevar = "Flair", sep = "",
               direction = "wide")
out[is.na(out)] <- 0  # change NA cells to 0
out
gives,
# Text Time Flair_YOLO Flair_Fluff Flair_Art and Media ...
#1 asder 10:01 1 0 0
#2 qetrtgf 10:02 0 1 0
#3 eqargrstgfh 10:04 0 1 0
#4 qrettzuh 10:05 0 1 0
#5 erghgfxbhs 10:17 0 1 0
#6 qaerzgh wtruws 10:27 0 0 1
Data:
df <- read.table(text = "
Flair,Text,Time
YOLO,asder,10:01
Fluff,qetrtgf,10:02
Fluff,eqargrstgfh,10:04
Fluff,qrettzuh,10:05
Fluff,erghgfxbhs,10:17
Art and Media,qaerzgh wtruws,10:27
Charity,eztzutzui,10:31
Memes,etzuj,10:54
Question,rehbgfd,10:55
Provocative,hetzjas,10:56", sep = ",", header = TRUE, stringsAsFactors = FALSE)
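For comparison, here is a minimal tidyverse sketch of the same reshape (my addition, not part of the original answer; it assumes the df built above and tidyr >= 1.0 for pivot_wider):
library(dplyr)
library(tidyr)

df %>%
  mutate(value = 1) %>%                # same column of ones as above
  pivot_wider(names_from = Flair,
              values_from = value,
              names_prefix = "Flair_", # gives Flair_YOLO, Flair_Fluff, ...
              values_fill = 0)         # absent combinations become 0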

Merge by multiple criteria and split out duplicates into separate columns?

I'm quite sure this has been asked and answered at some point but I'm a novice and really lack the vocabulary to effectively find the question and solution. I have a simple task that I can't perform in Excel because of the internal memory limitations, but I don't know enough about SQL or R to figure out how to do it in either of those platforms.
I have two tables, one has unique entries with unique ID numbers, the other has multiple duplicates of those ID numbers, each showing a different number (representing each new salary over the course of a career). I'm trying to map each of the salaries to the original unique ID table, creating new columns for every possible change (Salary1:Salary50). Ultimately I'll also need to map on the dates and differences of each change for analysis. Here's an example:
This is the unique ID table:
Table 1
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 ? ? ? ? ?
2 ? ? ? ? ?
3 ? ? ? ? ?
4 ? ? ? ? ?
5 ? ? ? ? ?
Here's the salary table with duplicate IDs and the info I want:
Table2
ID Salary SalaryDate
1 10 1/1/2014
1 11 1/1/2015
1 12 1/1/2016
2 12 1/1/2015
2 13 1/1/2016
3 10 1/1/2016
4 10 1/1/2014
4 12 1/1/2015
4 14 1/1/2016
5 10 1/1/2016
And the end state should look like this:
Table3
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 10 11 12 0 0
2 12 13 0 0 0
3 10 0 0 0 0
4 10 12 14 0 0
5 10 0 0 0 0
I built a multiple-criteria VLOOKUP to pull everything into the right columns, but the dataset has well over 100,000 rows to check, so it cannot complete within Excel's memory limits. Can anyone advise on how I can do the same thing in Access, R, SPSS, or with some efficient Excel VBA code?
Thanks for any help!
I have no idea what a "Vlookup" is, but apparently you are looking for something like this:
DF <- read.table(text = "ID Salary SalaryDate
1 10 1/1/2014
1 11 1/1/2015
1 12 1/1/2016
2 12 1/1/2015
2 13 1/1/2016
3 10 1/1/2016
4 10 1/1/2014
4 12 1/1/2015
4 14 1/1/2016
5 10 1/1/2016", header = TRUE)
#years of employment assuming the table is sorted by dates
DF$y <- ave(DF$ID, DF$ID, FUN = seq_along)
#reshape
library(reshape2)
dcast(DF, ID ~ y, value.var = "Salary", fill = 0)
# ID 1 2 3
#1 1 10 11 12
#2 2 12 13 0
#3 3 10 0 0
#4 4 10 12 14
#5 5 10 0 0
Note that this is not a very useful data format in R. Your original data format seems much more useful for further analyses.
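For completeness, a comparable tidyr sketch (my addition, assuming DF and the sequence column y from above; names_prefix produces the Salary1, Salary2, ... headers the question asks for):
library(tidyr)

pivot_wider(DF, id_cols = ID,
            names_from = y, values_from = Salary,
            names_prefix = "Salary",  # columns Salary1, Salary2, ...
            values_fill = 0)          # IDs with fewer salaries get 0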
Assume that the IDs in Table1 are a subset of the IDs in Table2 and we want just those. Also we want the first Salary for any ID in the Salary1 result column, the second salary in the Salary2 result column and so on. First compute Seq which is 1 for the first date in any ID, 2 for the second and so on. Then create a factor out of those sequence numbers whose levels are labelled by the Salary columns in Table1. In the last statement subset Table2 to the ID values of Table1 (in the case of the data shown they are the same so it does not have any effect) and reshape from long to wide form using xtabs. No packages are used.
Seq <- ave(1:nrow(Table2), Table2$ID, FUN = seq_along)  # 1, 2, ... within each ID
Table0 <- Table1[-1]  # Table0 is Table1 without the ID column
Table2$SalaryNo <- factor(Seq, levels = 1:ncol(Table0), labels = colnames(Table0))
xtabs(Salary ~ ID + SalaryNo, data = subset(Table2, ID %in% Table1$ID))
giving:
SalaryNo
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 10 11 12 0 0
2 12 13 0 0 0
3 10 0 0 0 0
4 10 12 14 0 0
5 10 0 0 0 0
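Defining SalaryNo as a factor whose levels cover every Salary column of Table1 is what makes xtabs emit the empty Salary4 and Salary5 columns: xtabs tabulates all factor levels, so unused ones appear as zero-filled columns instead of being dropped.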
Note: The tables were not provided in reproducible form and the solution may depend on specifically what they are so we have assumed this:
Lines1 <- "
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 ? ? ? ? ?
2 ? ? ? ? ?
3 ? ? ? ? ?
4 ? ? ? ? ?
5 ? ? ? ? ?"
Table1 <- read.table(text = Lines1, header = TRUE)
Lines2 <- "
ID Salary SalaryDate
1 10 1/1/2014
1 11 1/1/2015
1 12 1/1/2016
2 12 1/1/2015
2 13 1/1/2016
3 10 1/1/2016
4 10 1/1/2014
4 12 1/1/2015
4 14 1/1/2016
5 10 1/1/2016"
Table2 <- read.table(text = Lines2, header = TRUE)
Update: Changed assumptions and code correspondingly. Also fixed an error (that did not affect the data shown but could affect other data).

average patient level variables on blocks defined by record_id

I have a multilevel dataset of records with repeated measurements (example below).
I know that in MLwiN it is possible to average these patient-level variables (age, date_admission, date_discharge) on blocks defined by record_id; is it possible to do the same in R?
At the moment if I try and find the duration of stay (date_discharge - date_admission) it comes up as NA, presumably because they are in different rows. And if I try any multilevel modelling it restricts the dataset to obs_id "1" and "8" where age is present.
Many thanks, Annemarie
obs_id record_id day age tn date_admission date_discharge
1 1 0 40 122 12/02/2015 00:00
2 1 1 90
3 1 2 71
4 1 3 71
5 1 4 75
6 1 5 73
7 1 182 17/02/2015 00:00
8 2 0 58 139 14/02/2015 00:00
9 2 1 130
10 2 2 119
11 2 3 106
12 2 4 102
13 2 5 111
14 2 182 19/02/2015 00:00
I believe your main question is how to get your data into a format so that it can be used by most R routines (such as lme4).
To get your example into R, I added some commas. Next I converted the dates to the internal date format used by R (one of them actually, POSIXct):
lines <- "obs_id, record_id, day, age, tn, date_admission, date_discharge
1 ,1 ,0 ,40 ,122 ,12/02/2015 00:00,
2 ,1 ,1 , ,90 , ,
3 ,1 ,2 , ,71 , ,
4 ,1 ,3 , ,71 , ,
5 ,1 ,4 , ,75 , ,
6 ,1 ,5 , ,73 , ,
7 ,1 ,182 , , , ,17/02/2015 00:00
8 ,2 ,0 ,58 ,139 ,14/02/2015 00:00,
9 ,2 ,1 , ,130 , ,
10 ,2 ,2 , ,119 , ,
11 ,2 ,3 , ,106 , ,
12 ,2 ,4 , ,102 , ,
13 ,2 ,5 , ,111 , ,
14 ,2 ,182 , , , ,19/02/2015 00:00"
data <- read.csv(textConnection(lines))
data$date_admission <- as.POSIXct(data$date_admission, format="%d/%m/%Y %H:%M")
data$date_discharge <- as.POSIXct(data$date_discharge, format="%d/%m/%Y %H:%M")
You then need to have a date of admission and discharge for each of the records for a patient. There are dozens of ways to do this, but one of them is to use the dplyr package. We first group the data by record_id, after which we can do computations per patient. Below I take the first and last values of the date_admission, date_discharge and age columns, but you could also calculate average values (although that wouldn't make much sense in this case):
library(dplyr)
data <- data %>% group_by(record_id) %>% mutate(
  date_admission = first(date_admission),  # admission date from the first row
  date_discharge = last(date_discharge),   # discharge date from the last row
  age = first(age),
  duration = difftime(date_discharge, date_admission, units = "days"))
A quick Google search for dplyr will turn up plenty of introductions to the package. The data wrangling cheat sheet is especially useful.
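As a quick sanity check (my addition, using the grouped data from above), you can collapse to one row per patient and inspect the filled values:
data %>%
  summarise(age = first(age),           # one row per record_id
            duration = first(duration)) # length of stay in days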

Merge columnwise from file_list

I have 96 files in file_list
file_list <- list.files(pattern = "*.mirna")
They all have the same columns, but the number of rows varies. Example file:
> head(test1)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 TGGAGTGTGATAATGGTGTTT seq_100003_x4 4 hsa-miR-122-5p 15 35 11TC 0 0 g GCTGTGGA TTTGTGTC miRNA
2 TGTAAACATCCCCGACCGGAAGCT seq_100045_x4 4 hsa-miR-30d-5p 6 29 17CT 0 0 CT TTGTTGTA GAAGCTGT miRNA
3 CTAGACTGAAGCTCCTTGAAAA seq_100048_x4 4 hsa-miR-151a-3p 47 65 0 I-AAA 0 gg CCTACTAG GAGGACAG miRNA
4 AGGCGGAGACTTGGGCAATTGC seq_100059_x4 4 hsa-miR-25-5p 14 35 0 0 0 C TGAGAGGC ATTGCTGG miRNA
5 AAACCGTTACCATTACTGAAT seq_100067_x4 4 hsa-miR-451a 17 35 0 I-AT 0 gtt AAGGAAAC AGTTTAGT miRNA
6 TGAGGTAGTAGCTTGTGCTGTT seq_10007_x24 24 hsa-let-7i-5p 6 27 12CT 0 0 0 TGGCTGAG TGTTGGTC miRNA
precursor ambiguity
1 hsa-mir-122 1
2 hsa-mir-30d 1
3 hsa-mir-151a 1
4 hsa-mir-25 1
5 hsa-mir-451a 1
6 hsa-let-7i 1
The second file:
> head(test2)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 ATTGCACTTGTCCTGGCCTGT seq_1000013_x1 1 hsa-miR-92a-3p 49 69 14TC 0 t 0 AAAGTATT CTGTGGAA miRNA
2 AAACCGTTACTATTACTGAGA seq_1000094_x1 1 hsa-miR-451a 17 36 11TC I-A 0 tt AAGGAAAC AGTTTAGT miRNA
3 TGAGGTAGCAGATTGTATAGTC seq_1000169_x1 1 hsa-let-7f-5p 8 28 9CT I-C 0 t GGGATGAG AGTTTTAG miRNA
4 TGGGTCTTTGCGGGCGAGAT seq_100019_x12 12 hsa-miR-193a-5p 21 40 0 0 0 ga GGGCTGGG ATGAGGGT miRNA
5 TGAGGTAGTAGATTGTATAGTG seq_100035_x12 12 hsa-let-7f-5p 8 28 0 I-G 0 t GGGATGAG AGTTTTAG miRNA
6 TGAAGTAGTAGGTTGTGTGGTAT seq_1000437_x1 1 hsa-let-7b-5p 6 26 4AG I-AT 0 t GGGGTGAG GGTTTCAG miRNA
precursor ambiguity
1 hsa-mir-92a-2 1
2 hsa-mir-451a 1
3 hsa-let-7f-2 1
4 hsa-mir-193a 1
5 hsa-let-7f-2 1
6 hsa-let-7b 1
I would like to create a unique ID consisting of the columns mir and seq:
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT
Then I would like to merge all 96 files based on this ID and take the column freq from each file.
ID freq_file1 freq_file2 ...
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT 4 12
If an ID is not present in a specific file, the freq should be NA.
We can use Reduce with merge on a list of data.frames. Here paste(mir, seq, sep = "_") builds the mir_seq ID from the question, and all = TRUE keeps freq as NA for IDs that are absent from a file.
lst <- lapply(mget(ls(pattern = "test\\d+")),
              function(x) subset(transform(x, ID = paste(mir, seq, sep = "_")),
                                 select = c("ID", "freq")))
Reduce(function(...) merge(..., by = "ID", all = TRUE), lst)
NOTE: In the above, I assumed that the "test1", "test2" objects are already created in the global environment by reading the files in 'file_list'. If not, we can directly read the files into a list instead of creating additional data.frame objects i.e.
library(data.table)
lst <- lapply(file_list, function(x)
  fread(x, select = c("mir", "seq", "freq"))[,
    list(ID = paste(mir, seq, sep = "_"), freq = freq)])
Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), lst)  # full outer join keeps NAs
Or, instead of fread (from data.table), use read.csv/read.table and merge as before on 'lst'.
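To get the freq_file1, freq_file2, ... headers from the question, the merged columns can be renamed afterwards; a small sketch (my addition, assuming the merged result is stored in res):
res <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), lst)
names(res)[-1] <- paste0("freq_file", seq_along(lst))  # all non-ID columns, in file order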

Not-equal to character in R

What is the command for printing the strings that are not equal to a specific character? From the data below, I would like to print the number of rows where the t5 column does not start with d-. (In this example that is all the rows.)
I tried
dim(df[df$t5 !="d-",])
df:
name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
6 seq_10002_x17 17 hsa-miR-10a-5p 23 44 5GT 0 d-T 0 TATATACC TGTGTAAG miRNA 1
19 seq_100091_x3 3 hsa-miR-142-3p 54 74 0 u-CA d-TG 0 AGGGTGTA TGGATGAG miRNA 1
20 seq_100092_x1 1 hsa-miR-142-3p 54 74 0 u-CT d-TG 0 AGGGTGTA TGGATGAG miRNA 1
23 seq_100108_x5 5 hsa-miR-10a-5p 23 44 4NC 0 d-T 0 TATATACC TGTGTAAG miRNA 1
26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
28 seq_100121_x1 1 hsa-miR-192-5p 25 45 1CT u-CT d-C d-A GGCTCTGA AGCCAGTG miRNA 1
df1 <- df[!grepl("^d-", df[, 8]), ]  # column 8 is t5
nrow(df1)
print(df1)
There is one row in your data that has a t5 entry that does not start with "d-". To find this row, you could try:
df[!grepl("^(d-)",df$t5),]
# name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
#26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
If you only want to know the row number, you can get it with rownames()
> rownames(df[!grepl("^(d-)",df$t5),])
#[1] "26"
or with which(),
> which(!grepl("^(d-)",df$t5))
#[1] 5
depending on whether you want the position counted from the top of your data frame or the row name shown on the left.
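As an alternative to the regular expression, base R's startsWith() performs a literal prefix test (my addition, using the same df; it assumes t5 is stored as character, not factor):
df[!startsWith(df$t5, "d-"), ]  # rows whose t5 does not start with "d-"
sum(!startsWith(df$t5, "d-"))   # count of such rows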
