Power BI: distinct count - count

I have a table named Import in Power BI, and I want to count the number of sheets per file.
How do I write a measure that counts the number of sheets per file?
ColumnX  ColumnY  FileName  SheetName
x        y        File1     Sheet1
a        b        File1     Sheet2
c        d        File1     Sheet3
a        b        File2     Sheet1
a        b        File2     Sheet2
How do I get this result?
File   NumberOfSheets
File1  3
File2  2

Your measure can be like this:
NumberOfSheets = DISTINCTCOUNT ( 'Table'[SheetName] )
In a new table visual, add your FileName column and this new NumberOfSheets measure to produce your desired result.
However, in the Total row this measure only counts the number of distinct sheet names across the whole table. If you want the Total row to show the total number of sheets instead, you can change your formula to something like this:
NumberOfSheets (Total Sum) =
SUMX (
    VALUES ( 'Table'[FileName] ),
    DISTINCTCOUNT ( 'Table'[SheetName] )
)
Another edit:
If your data table is known to have 1 row per sheet, you can easily just count rows instead:
NumberOfSheets (Simple) = COUNTROWS ( 'Table' )
But this depends entirely on that assumption holding for the full data table. If it does hold, there is no reason to use anything other than the COUNTROWS approach, since it is probably the fastest calculation.

Related

How can I add and populate a new column based on an existing factor column in R?

- I would like to add a Species column and populate that column with either 'deer' or 'cow' depending on the Id, a factor column.
- My animal Ids are either A or B followed by 628-637 for cow and above 80000 for deer (e.g., A628, A82117).
- Anything with an Id below 1000, A or B, should be classified as 'cow' and everything else as 'deer'.
Sample Data
You can try the following steps:
- get a substring of the Ids to extract the numeric part
- cast the substring as numeric
- use the numeric value to generate the animal type column
You can do it directly in R (even if the ids are represented as a factor column):
ids <- factor(c('A630', 'B81000', 'A1200', 'B626'))

# Strip the leading letter, convert the remainder to numeric, and classify
type <- unlist(lapply(ids, function(x) {
  num <- as.numeric(substr(as.character(x), 2, nchar(as.character(x))))
  ifelse(num < 1000, 'cow', 'deer')
}))

animals <- data.frame(ids = ids, type = factor(type))
is.factor(animals$type)
animals
You can also prepare your data in a database with the following SQL code:
CREATE TABLE ANIMALS
(
ID VARCHAR(6)
);
INSERT INTO ANIMALS VALUES ('A628');
INSERT INTO ANIMALS VALUES ('A81000');
SELECT
ID,
CASE WHEN CAST(SUBSTR(ID,2,LENGTH(ID)-1) as decimal) < 1000 THEN 'COW' ELSE 'DEER' END AS TYPE
FROM ANIMALS A;
You can also create a new table with both columns:
CREATE TABLE CLASSIFIED_ANIMALS AS
SELECT
ID,
CASE WHEN CAST(SUBSTR(ID,2,LENGTH(ID)-1) as decimal) < 1000 THEN 'COW' ELSE 'DEER' END as TYPE
FROM ANIMALS A;
In dplyr:
library(dplyr)

df %>%
  mutate(animals = as.numeric(sub("[AB](\\d+).*", "\\1", ids)),
         animals = ifelse(animals > 600 & animals < 80000, "cow", "deer"))
ids animals
1 A628 cow
2 B82117 deer
3 A1200 cow
4 B626 cow
5 B80007 deer
How this works:
- first we extract the numerical part of the ids column using the backreference \\1 and convert it to type numeric
- then we run a simple ifelse comparison to assign the labels
Data:
df <- data.frame(
ids = factor(c('A628','B82117','A1200','B626', 'B80007'))
)

Quickly obtain the rows with the maximal value for each indicator in a big data.table

I am given a large data.table, e.g.
n <- 7
dt <- data.table(id_1 = sample(1:10^(n-1), 10^n, replace = TRUE),
                 other = sample(letters[1:20], 10^n, replace = TRUE),
                 val = rnorm(10^n, mean = 10^4, sd = 1000))
> structure(dt)
id_1 other val
1: 914718 o 9623.078
2: 695164 f 10323.943
3: 53186 h 10930.825
4: 496575 p 9964.064
5: 474733 l 10759.779
---
9999996: 650001 p 9653.125
9999997: 225775 i 8945.636
9999998: 372827 d 8947.095
9999999: 268678 e 8371.433
10000000: 730810 i 10150.311
and I would like to create a data.table that for each value of the indicator id_1 only has one row, namely the one with the largest value in the column val.
The following code seems to work:
dt[, .SD[which.max(val)], by = .(id_1)]
However, it is very slow for large tables.
Is there a quicker way?
Technically this is a duplicate of this question, but the answer there wasn't really explained, so here it goes:
dt[dt[, .(which_max = .I[val == max(val)]), by = "id_1"]$which_max]
The inner expression finds, for each group according to id_1, the row index of the maximum value (or indices, if there are ties), and returns those indices so that they can be used to subset dt.
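To make that concrete, here is a minimal sketch (not from the original answer) showing what the inner expression returns on a small made-up table:
library(data.table)

# Toy table: two id_1 groups; the maximum of val sits in rows 2 and 4
toy <- data.table(id_1 = c(1, 1, 2, 2), val = c(5, 9, 3, 7))

# The inner expression returns, per group, the global row numbers (.I) of the maximum
toy[, .(which_max = .I[val == max(val)]), by = "id_1"]
# -> one row per group, with which_max = 2 and 4

# Those indices then subset the original table, keeping one max row per group
toy[toy[, .(which_max = .I[val == max(val)]), by = "id_1"]$which_max]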
However, I'm kind of surprised I didn't find an answer suggesting this:
setkey(dt, id_1, val)[, .SD[.N], by = "id_1"]
which seems to be similarly fast on my machine, but it requires the table to be sorted first (setkey reorders dt by reference).
I am not sure how to do it in R, but what I have done is read the file line by line and then put those lines into a data frame. This is very fast; it happens in a flash for a 100 MB text file.
import pandas as pd

filename = "C:/Users/xyz/Downloads/123456789.012-01-433.txt"

sample = []  # empty list to collect the matching lines
with open(filename, 'r') as f:
    for line in f:
        tag = line[:45].split('|')[5]  # filter condition; you don't need this
        if tag == 'KV-C901':
            sample.append(line.split('|'))  # keep the split fields of the matching lines

print('lines are appended and ready to create a dataframe out of the list')
df = pd.DataFrame(sample)  # build the data frame from the collected rows

Code to get lines which have values less than or equal to given values in both columns

I have data with three columns. The first column holds a name (ID), while the second and third columns hold one or more semicolon (;) separated values.
Now I want to print the rows where at least one pair of the semicolon-separated values has Distance <= 10 and MAF >= 0.5.
I would be happy if someone could provide R code; if not R, then awk/sed.
Example
ID Distance MAF
cg12044689 8;40 0.000200;0.59
cg12143629 0;1;3 0.000200;0.520;0.0413
cg12247699 42 0.599
cg12375698 1;10 0.00231;0.51
Output should be:
ID Distance MAF
cg12143629 0;1;3 0.000200;0.520;0.0413
cg12375698 1;10 0.00231;0.51
Here is an awk script that accomplishes the task by splitting and comparing the pairwise values:
parse.awk
{
  # Split the Distance and MAF columns into the dist and maf arrays
  n = split($2, dist, ";"); split($3, maf, ";")

  # Walk the pairs from last to first; print the line once if any pair matches
  while (n >= 1) {
    if (dist[n] <= 10 && maf[n] >= 0.5) {
      print
      next  # stop after the first matching pair so the line is printed only once
    }
    n--
  }
}
Run it like this:
awk -f ./parse.awk infile
Output:
cg12143629 0;1;3 0.000200;0.520;0.0413
cg12375698 1;10 0.00231;0.51
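Since the question asked for R first, here is a minimal base-R sketch of the same pairwise check; it is not part of the original answer, and the data.frame dat below simply mirrors the example data:
# Hypothetical data.frame mirroring the example; in practice you would read it with
# read.table("infile", header = TRUE, stringsAsFactors = FALSE)
dat <- data.frame(
  ID       = c("cg12044689", "cg12143629", "cg12247699", "cg12375698"),
  Distance = c("8;40", "0;1;3", "42", "1;10"),
  MAF      = c("0.000200;0.59", "0.000200;0.520;0.0413", "0.599", "0.00231;0.51"),
  stringsAsFactors = FALSE
)

# Keep a row if any (Distance, MAF) pair has Distance <= 10 and MAF >= 0.5
keep <- mapply(function(d, m) {
  dist <- as.numeric(strsplit(d, ";")[[1]])
  maf  <- as.numeric(strsplit(m, ";")[[1]])
  any(dist <= 10 & maf >= 0.5)
}, dat$Distance, dat$MAF)

dat[keep, ]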

Group categories in R according to first letters of a string?

I have a dataset loaded in R, and I have one of the columns that has text. This text is not unique (any row can have the same value) but it represents a specific condition of a row, and so the first 3-5 letters of this field will represent the group where the row belongs. Let me explain with an example.
Having 3 different rows, only showing the id and the column I need to group by:
ID........... TEXTFIELD
1............ VGH2130
2............ BFGF2345
3............ VGH3321
Having the previous example, I would like to create a new column in the dataframe where it would be set the group such as
ID........... TEXTFIELD........... NEWCOL
1............ VGH2130............. VGH
2............ BFGF2345............ BFGF
3............ VGH3321............. VGH
And to determine the groups formed in this new column, I would like to make a vector with the possible groups (since every row will fall into one of these groups), for example c("VGH", "BFGF", ...).
Can anyone shed some light on how to do this efficiently? (Without a for loop, since I have millions of rows and that would take ages.)
You can also try
> library(stringr)
> data$group <- str_extract(data$TEXTFIELD, "[A-Za-z]+")
> data
ID TEXTFIELD group
1 1 VGH2130 VGH
2 2 BFGF2345 BFGF
3 3 VGH3321 VGH
You can try, if df is your data.frame:
df$NEWCOL <- gsub("([A-Z]+)\\d+.*", "\\1", df$TEXTFIELD)
> df
# ID TEXTFIELD NEWCOL
#1 1 VGH2130 VGH
#2 2 BFGF2345 BFGF
#3 3 VGH3321 VGH
Does the text field always have 3 or 4 letters preceding the numbers?
You can check that by doing:
nrow(data[grepl("[A-Za-z]{1,4}\\d+", data$TEXTFIELD) == TRUE, ]) # number of rows where TEXTFIELD contains up to 4 letters followed by digits
If yes, then:
require(stringr)
data$NEWCOL <- str_extract(data$TEXTFIELD, "[A-Za-z]{1,4}")
Final Step:
data$group <- ifelse(data$NEWCOL == "VGH", "Group Name", ifelse(data$NEWCOL == "BFGF", "Group Name", ifelse . . . . ))
# Complete the ifelse statement to classify all groups
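As one possible way to finish that step (not from the original answer), a named lookup vector scales better than deeply nested ifelse calls when there are many groups; the group labels here are made-up placeholders:
# Sketch only: map each extracted prefix (a character value) to a placeholder group label;
# prefixes not in the map come back as NA
group_map <- c(VGH = "VGH group", BFGF = "BFGF group")
data$group <- unname(group_map[data$NEWCOL])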

How to read specific rows of CSV file with fread function

I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by a binary vector V of length 10 million, which assumes value 0 if I don't want to read the row and 1 if I do want to read the row.
How do I get the I/O function fread from the data.table package to do this? I ask because fread is so much faster than all other I/O approaches.
The best answer to this question, Reading specific rows of large matrix data file, gives the following solution:
read.csv(pipe(paste0("sed -n '",
                     paste0(c(1, which(V == 1) + 1), collapse = "p; "),
                     "p' C:/Data/target.csv", collapse = "")),
         head = TRUE)
where C:/Data/target.csv is the large CSV file and V is the vector of 0 or 1.
However, I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even if V is equal to 1 for only a small subset of the total number of rows.
Thus, since fread on the whole matrix outperforms the above solution, how do I combine fread (and specifically fread) with row sampling?
This is not a duplicate because it is only about the function fread.
Here's my problem setup:
#create csv
csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
#my csv has a header:
colnames(csv) <- LETTERS[1:5]
#save csv
write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
#create vector of 0s and 1s that I want to read the CSV from
read_vec <- rep(0,50)
read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
#the following is the effect that I want, but I want an efficient approach to it:
csv <- read.csv("/home/user/test_csv.csv") #inefficient!
csv <- csv[which(read_vec==1),] #inefficient!
#the alternative approach, too slow when scaled up!
csv <- fread(pipe(paste0("sed -n '",
                         paste0(c(1, which(read_vec == 1) + 1), collapse = "p; "),
                         "p' /home/user/test_csv.csv", collapse = "")),
             head = TRUE)
#the fastest approach yet still not optimal because it needs to read all rows
require(data.table)
csv <- data.matrix(fread('/home/user/test_csv.csv'))
csv <- csv[which(read_vec==1),]
This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the result together.
If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.
# create sample dataset
set.seed(1)
m <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)
# run-length encode v to find the contiguous blocks of TRUE (rows to read)
seq <- rle(v)
idx <- c(0, cumsum(seq$lengths))[which(seq$values)] + 1
# indx: start = starting row of each block, length = length of the block (compare to s)
indx <- data.frame(start = idx, length = seq$lengths[which(seq$values)])

library(data.table)
# read each contiguous block with one fread call and bind the pieces together
result <- do.call(rbind, apply(indx, 1, function(x) fread("test.csv", nrows = x[2], skip = x[1])))
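One caveat worth noting (not part of the original answer): pieces read with skip > 0 do not include the header line, so they come back with default column names. A minimal sketch to restore the real names, assuming the header is on the first line of test.csv:
# Sketch only: read just the header row to recover the column names,
# then apply them to the combined result (the column counts must match)
header_names <- names(fread("test.csv", nrows = 0))
setnames(result, header_names)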
