Calculate U-SQL Percentage - u-sql

I am trying to calculate the Deployment percentage, but my CASE statement is returning a whole number.
Example:
Deployment = 133
Licensing = 930
Utilization should == 14%
However, returns 0
Here is the table schema for utilization:
,[Deployment (%)] float?
Here is how I calculate utilization:
(summary.[Deployments] == 0 OR summary.[Licensing] == 0) ? 0 :
(summary.[Deployments] / summary.[Licensing]) AS [Deployment (%)]

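I just needed to add a (decimal) cast on the dividend. If both columns are integers, the division is an integer division, so 133 / 930 truncates to 0:
(summary.[Deployments] == 0 OR summary.[Licensing] == 0) ? 0 :
((decimal)summary.[Deployments] / summary.[Licensing]) AS [Deployment (%)]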
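The truncation can be reproduced outside U-SQL. The Python sketch below is only an analogy: U-SQL expressions follow C# rules, where dividing two integers truncates, and Python's // operator is used here to mimic that behaviour. The function name deployment_ratio is hypothetical and simply mirrors the zero guard in the expression above.

```python
def deployment_ratio(deployments: int, licensing: int) -> float:
    """Ratio of deployments to licenses, with the same zero guard as the ternary."""
    if deployments == 0 or licensing == 0:
        return 0.0
    # True division keeps the fractional part, like the (decimal) cast does.
    return deployments / licensing


truncated = 133 // 930            # integer division, mimics C# int / int
ratio = deployment_ratio(133, 930)

print(truncated)                  # 0
print(round(ratio * 100))         # 14
```

Note that the ratio is a fraction (about 0.143); it still has to be multiplied by 100 if the column is meant to hold 14 rather than 0.14.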

Related

Error in if statement with NA in data in R [duplicate]

I have a problem with an if statement. I get an error message saying "missing value where TRUE/FALSE needed". I am trying to calculate a new variable using an if statement and a for loop, but the data has NA values and the loop stops as soon as it hits an NA value.
These are the variables I am using to create the new variable:
x=c(3,3,3,2,NA,2,3,NA,3,NA)
y=c(3,6,5,4,NA,3,2,NA,3,NA)
h=c(1,2,1.6666667,2,NA,1.5,0.6666667,NA,1,NA)
This is the code that fails on the NA values:
z=rep(NA,length(y))
for(i in 1:length(x)){
if((x[i]==0 & y[i]>=3) | h[i]>=3){
z[i]=1
} else if((x[i]==0 & y[i]<3) | h[i]<3){
z[i]=0
}
}
Can you tell me how could I include the NA values into the if statement or what should I do?
Thanks for your reply.
We can handle the NA values by adding is.na checks to the conditions:
for(i in 1:length(x)){
if((x[i] %in% 0 & y[i]>=3 & !is.na(y[i])) | h[i]>=3 & !is.na(h[i])){
z[i]=1
} else if((x[i] %in% 0 & y[i]<3 & !is.na(y[i])) | h[i]<3 & !is.na(h[i])){
z[i]=0
}
}
You can check with !is.na(). This operation is also vectorized, so you don't need a for loop.
inds <- x == 0 & y >= 3 | h >= 3
as.integer(inds & !is.na(inds))
#[1] 0 0 0 0 0 0 0 0 0 0
None of the values match the condition here.
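The same idea, test for missing values before comparing, can be sketched in Python. This is an illustration rather than a translation of either answer: None stands in for R's NA, and the function name flag is hypothetical. Like the vectorized version, it returns 0 wherever a needed value is missing.

```python
from typing import List, Optional, Sequence

Number = Optional[float]  # None plays the role of R's NA


def flag(x: Sequence[Number], y: Sequence[Number], h: Sequence[Number]) -> List[int]:
    """Return 1 where (x == 0 and y >= 3) or h >= 3 holds, else 0.

    A comparison involving a missing value counts as not satisfied.
    """
    out = []
    for xi, yi, hi in zip(x, y, h):
        first = xi is not None and yi is not None and xi == 0 and yi >= 3
        second = hi is not None and hi >= 3
        out.append(int(first or second))
    return out


NA = None
x = [3, 3, 3, 2, NA, 2, 3, NA, 3, NA]
y = [3, 6, 5, 4, NA, 3, 2, NA, 3, NA]
h = [1, 2, 1.6666667, 2, NA, 1.5, 0.6666667, NA, 1, NA]

print(flag(x, y, h))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```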

Making new variable through mutate

I want to make a new variable "churned" by taking into account five variables :
Include in churn
A-Churn
B-Churn
C-Churn
D-Churn
My condition is: if the variable "Include in churn" is 1 and at least one of the other variables is 1, then my new variable "Churned" should be 1, else 0. I am new to the mutate function.
Please help me create this new variable with the mutate function.
If I understand your formulation logically, you want
mutate(data, Churned = Include.in.Churn == 1 & (A.Churn == 1 | B.Churn == 1 | C.Churn == 1 | D.Churn == 1))
This will make Churned a logical. If you really need an integer, as.integer will produce 1 for TRUE and 0 for FALSE.
If all of the variables mentioned are either 1 or 0, you can also use the possibly faster
mutate(data, Churned = Include.in.Churn * (A.Churn + B.Churn + C.Churn + D.Churn) >= 1)
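As a quick sanity check that the logical and the arithmetic formulations agree whenever every variable is 0 or 1, here is a small Python sketch. The function names are hypothetical and only mirror the two mutate expressions above.

```python
from itertools import product


def churned_logical(include: int, a: int, b: int, c: int, d: int) -> int:
    """Logical form: included AND at least one churn flag is set."""
    return int(include == 1 and (a == 1 or b == 1 or c == 1 or d == 1))


def churned_arithmetic(include: int, a: int, b: int, c: int, d: int) -> int:
    """Arithmetic form: only valid when every input is 0 or 1."""
    return int(include * (a + b + c + d) >= 1)


# Compare both forms on all 32 combinations of 0/1 inputs.
all_agree = all(
    churned_logical(*bits) == churned_arithmetic(*bits)
    for bits in product((0, 1), repeat=5)
)
print(all_agree)  # True
```

The arithmetic form stops being equivalent as soon as a column can hold values other than 0 and 1, which is why the answer states that assumption.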

How to use AND in R to modify dataframe

I have a data matrix with 1200 rows (sample names) and 20000 columns (gene names). I want to delete the rows in which my 5 genes of interest all have zero values.
The command I used for a single gene:
allexp <-preallexp[preallexp$GZMB > 0, ]
But I want to use AND in the command above, like this:
allexp <-preallexp[preallexp$GZMB && preallexp$TP53 && preallexp$EGFR && preallexp$BRAF && preallexp$VGEF > 0, ]
But this command doesn't work. How can I use AND in the command above?
EDIT: in response to OP.
I'm sure there's a much more efficient way to code this, but this is what you're after:
allexp <-preallexp[preallexp$GZMB + preallexp$TP53 + preallexp$EGFR +
preallexp$BRAF + preallexp$VGEF > 0, ]
Unless you have negative expression values, I would have thought mkt's answer should work. But here is mine. It removes the rows where each of the 5 genes has a value of 0. Note the vectorized & operator (&& only compares single values, not whole columns):
which(preallexp$GZMB == 0 & preallexp$TP53 == 0 &
preallexp$EGFR == 0 & preallexp$BRAF == 0 & preallexp$VGEF == 0)
This gives the rows where all 5 genes have a value of zero.
We can then remove these rows from the data frame. Negating the logical condition is safer than indexing with -which(...), which would drop every row if no row matched:
allexp <- preallexp[
!(preallexp$GZMB == 0 & preallexp$TP53 == 0 &
preallexp$EGFR == 0 & preallexp$BRAF == 0 & preallexp$VGEF == 0), ]
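The filtering rule itself (drop a sample only when all five genes are zero) can be written language-independently as "keep the row if any gene is non-zero". A minimal Python sketch follows; the sample values and the drop_all_zero helper are made up for illustration, and only the gene names come from the question.

```python
GENES = ("GZMB", "TP53", "EGFR", "BRAF", "VGEF")


def drop_all_zero(rows, genes=GENES):
    """Keep rows where at least one of the given genes is non-zero."""
    return [row for row in rows if any(row[g] != 0 for g in genes)]


# Hypothetical expression values for three samples.
samples = [
    {"sample": "s1", "GZMB": 0, "TP53": 0, "EGFR": 0, "BRAF": 0, "VGEF": 0},
    {"sample": "s2", "GZMB": 0, "TP53": 2.5, "EGFR": 0, "BRAF": 0, "VGEF": 0},
    {"sample": "s3", "GZMB": 1.0, "TP53": 0.3, "EGFR": 4.2, "BRAF": 0, "VGEF": 7},
]

kept = drop_all_zero(samples)
print([row["sample"] for row in kept])  # ['s2', 's3']
```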

One hot encoding / binary columns for each day of the year and select them

I have an R dataset of flight data. I need to add 365 columns to this dataset, one for each day of the year, with value 1 if the data[i]$FlightDate of the entry corresponds to that day of the year, and 0 otherwise (see this question for why).
Previously I had managed to extract the day of Year from a FlightDate string using lubridate
data$DayOfYear <- yday(ymd(data$FlightDate))
How would I go about generating each of the 365 columns and keeping only those columns (along with some others) for a future SVD? I will actually need to repeat the same for the hours of the day (which I will probably split into ranges of 30 or 10 minutes), so an extra 48-120 one-hot columns for a different variable will have to be added later.
Note: my dataset contains about 500k flights per month (so about 16k flights for a single DayOfYear if I take just one year of data) and has 100 variables (columns).
Sample input data row data[1,]:
{
DayOfYear: 10,
FieldGoodForSvd1 : 235
FieldBadForSvd2 : "some string"
...
}
Sample output data row (after generating 365 binary cols and selecting fields compatible with an SVD)
{
DayOfYear1: 0,
...
DayOfYear9: 0,
DayOfYear10: 1, // The flight had taken place on that DayOfYear
DayOfYear11: 0,
...
DayOfYear365: 0,
FieldGoodForSvd1 : 235
}
EDIT
Suppose my input data matrix looks like this:
DayOfYear ; FieldGoodForSvd1 ; FieldBadForSvd2
1 ; 275 ; "los angeles"
1 ; 256 ; "san francisco"
5 ; 15 ; "chicago"
The final output should be
FieldGoodForSvd1 ; DayOfYear1 ; DayOfYear2 ; ... ; DayOfYear4 ; DayOfYear5 ; DayOfYear6 ; ... ; DayOfYear365
275 ; 1 ; 0 ; ... ; 0 ; 0 ; 0 ; ... ; 0
256 ; 1 ; 0 ; ... ; 0 ; 0 ; 0 ; ... ; 0
15 ; 0 ; 0 ; ... ; 0 ; 1 ; 0 ; ... ; 0
Here is my final code, which does the one-hot encoding for DayOfYear and TimeSlot and then runs the SVD:
library(lubridate) # provides yday() and ymd()
# Keep only the rows where both numeric fields are present
dsan <- d[!is.na(d$FieldGoodForSvd1) & !is.na(d$FieldGoodForSvd2), ]
# We need factors to perform one hot encoding
dsan$DayOfYear <- as.factor(yday(ymd(dsan$FlightDate)))
dsan$TimeSlot <- as.factor(round(dsan$DepTime/100)) # in my case times were stored like 2055 for 20h55
dSvd <- with(dsan, data.frame(
FieldGoodForSvd1,
FieldGoodForSvd2,
# ~ performs one hot encoding (on factors), -1 removes intercept term
model.matrix(~DayOfYear-1, dsan),
model.matrix(~TimeSlot-1, dsan)
))
theSVD <- svd(scale(dSvd))
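To make the encoding step concrete, here is a minimal Python sketch of one-hot encoding a day of the year, applied to the three sample rows from the question. It is an illustration under stated assumptions, not the model.matrix approach used above: the helper one_hot_day is hypothetical, and it always emits a fixed 365 columns, whereas model.matrix only creates columns for the factor levels that actually occur in the data.

```python
from typing import List


def one_hot_day(day_of_year: int, n_days: int = 365) -> List[int]:
    """Return a list of n_days zeros with a single 1 at position day_of_year (1-based)."""
    if not 1 <= day_of_year <= n_days:
        raise ValueError(f"day_of_year must be in 1..{n_days}, got {day_of_year}")
    return [1 if d == day_of_year else 0 for d in range(1, n_days + 1)]


# (DayOfYear, FieldGoodForSvd1) pairs from the sample input in the question.
rows = [(1, 275), (1, 256), (5, 15)]

# Numeric field first, then the 365 indicator columns.
encoded = [[value] + one_hot_day(day) for day, value in rows]

print(encoded[2][:7])  # [15, 0, 0, 0, 0, 1, 0]
```

A fixed column count keeps the layout identical across datasets; for leap years, n_days would have to be 366.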

Hash Table + Binary Search

I'm using a hash table to store some values. Here are the details:
There will be roughly 1M items to store (the items are not known in advance, so a perfect hash is not possible).
Table is 10M large.
Hash function is MurMurHash3.
In my tests, storing 1M values gave 350,000 collisions, with 30 elements in the most heavily loaded slot of the hash table.
Are these results good?
Would it make sense to implement binary search for the lists that build up in colliding slots?
What's your advice for improving performance?
EDIT: Here is my code
var
  HashList: array [0..10000000 - 1] of Integer;
// Clear all slot counters
for I := 0 to High(HashList) do
  HashList[I] := 0;
// Hash the keys 1..1000000 and count how many land in each slot
for I := 1 to 1000000 do
begin
  Y := MurmurHash3(UIntToStr(I));
  Y := Y mod Length(HashList);
  Inc(HashList[Y]);
  if HashList[Y] > 1 then
    Inc(TotalCollisionsCount);
  if HashList[Y] > MostCollidingSlotItemCount then
    MostCollidingSlotItemCount := HashList[Y];
end;
Writeln('Total: ' + IntToStr(TotalCollisionsCount) + ' Max: ' + IntToStr(MostCollidingSlotItemCount));
Here is the result I get:
Total: 48169 Max: 5
Am I missing something?
This is what you get when you put 1M items randomly into 10M cells
calendar_size=10000000 nperson = 1000000
E/cell |    Ncell |       frac |   Nelem |       frac |   h/cell |    hops | Cumhops
-------+----------+------------+---------+------------+----------+---------+--------
     0 |  9048262 | (0.904826) |       0 | (0.000000) |        0 |       0 |       0
     1 |   905064 | (0.090506) |  905064 | (0.905064) |        1 |  905064 |  905064
     2 |    45136 | (0.004514) |   90272 | (0.090272) |        3 |  135408 | 1040472
     3 |     1488 | (0.000149) |    4464 | (0.004464) |        6 |    8928 | 1049400
     4 |       50 | (0.000005) |     200 | (0.000200) |       10 |     500 | 1049900
-------+----------+------------+---------+------------+----------+---------+--------
 Total | 10000000 |            | 1000000 |            | 1.049900 |         | 1049900
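The left column is the number of items in a cell; the second is the number of cells holding that many items. h/cell is the number of hops needed to look up every item in a chain of that length once (1 + 2 + ... + k), so the final row works out to an average of about 1.05 hops per lookup.
As for the binary search: for chains this short (maximum chain length 4, and most chains of length 1), linear search outperforms binary search. The crossover point is probably somewhere between 10 and 100 items per chain.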
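The occupancy figures above can be cross-checked analytically. The Python sketch below assumes the hash behaves like a uniform random function, in which case the number of items per cell is approximately Poisson-distributed with mean n_items / n_cells. The function names are hypothetical, and this is an approximation rather than the simulation that produced the table.

```python
import math


def expected_cells(n_cells: int, n_items: int, k: int) -> float:
    """Expected number of cells holding exactly k items (Poisson approximation)."""
    lam = n_items / n_cells  # load factor
    return n_cells * math.exp(-lam) * lam ** k / math.factorial(k)


def expected_collisions(n_cells: int, n_items: int) -> float:
    """Expected number of items that land in an already occupied cell."""
    occupied = n_cells * (1 - math.exp(-n_items / n_cells))
    return n_items - occupied


N_CELLS, N_ITEMS = 10_000_000, 1_000_000
for k in range(5):
    print(k, round(expected_cells(N_CELLS, N_ITEMS, k)))
print("collisions:", round(expected_collisions(N_CELLS, N_ITEMS)))
```

With a load factor of 0.1 this predicts roughly 48,000 collisions, which is in line with the "Total: 48169" reported in the question, so MurmurHash3 is behaving about as well as a random function would on this input.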
