Create new column with predefined data based on multiple conditions - datetime

I am trying to create a new column holding one of the following values: 'Visit Missed', 'Visit On Time' or 'Upcoming Visit', based on 3 conditions.
I would like the new column, named "Test" for example, to be filled automatically depending on which of the 3 conditions on the date columns is met.
I have created a "conditions" variable which holds my 3 conditions:
conditions = [
    ((df['Actual Visit Date'] == 'NaT') & (df['Latest Due Date'] <= pd.to_datetime("today"))) | (df['Actual Visit Date'] > df['Latest Due Date']),
    # below are okay
    (df['Actual Visit Date'] != 'NaT') & (df['Actual Visit Date'] <= df['Latest Due Date']),
    (df['Latest Due Date'] > pd.to_datetime("today")),
]
I have also created a "results" variable:
results = ['Visit Missed', 'Visit On Time', 'Upcoming Visit']
I tried to create the new column with the following:
df['test'] = np.select(conditions, results)
The results for "Visit On Time" and "Upcoming Visit" seem to work fine, but the first condition is not processed correctly: I end up with 0 as the result, or not all the rows that meet the condition are labeled.
It would be great if I could sort this out by defining a function.
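One likely cause, offered as a sketch rather than a confirmed diagnosis: comparisons against a missing datetime (NaT) evaluate to False, so `== 'NaT'` never matches, and `np.select` falls back to its default of 0 for rows that match no condition. The function below tests for missing dates with `isna()` / `notna()` instead. The sample data, the `label_visits` name, the `today` parameter and the "Unknown" default are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "Actual Visit Date": pd.to_datetime(["2020-01-05", None, "2020-01-20", None]),
    "Latest Due Date": pd.to_datetime(["2020-01-10", "2020-01-10", "2020-01-10", "2030-01-01"]),
})

def label_visits(frame, today=None):
    """Return one visit label per row of `frame` (hypothetical helper)."""
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    actual = frame["Actual Visit Date"]
    due = frame["Latest Due Date"]
    conditions = [
        # No visit recorded and the due date has passed, or the visit was late.
        (actual.isna() & (due <= today)) | (actual > due),
        # A visit was recorded on or before the due date.
        actual.notna() & (actual <= due),
        # No visit yet, and the due date is still in the future.
        due > today,
    ]
    results = ["Visit Missed", "Visit On Time", "Upcoming Visit"]
    # An explicit string default avoids the numeric 0 that np.select uses otherwise.
    return np.select(conditions, results, default="Unknown")

# A fixed "today" keeps the example reproducible.
df["test"] = label_visits(df, today="2020-02-01")
print(df)
```

Passing `today` explicitly is only a convenience for testing; leaving it out uses the current date, as in the original conditions.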

Related

PowerBI - Count instances of string in multiple columns

Been searching the forums on here but can't find anything that exactly replicates what I'm trying to do - I have split a string by delimiter and currently have an Excel file that has the following column names and some basic sample data:
Learnt 1 - Learnt 2 - Learnt 3 - Learnt 4
Books - (blank) - (blank) - (blank)
Online - Books - (blank) - (blank)
(blank) - Books - Bootcamp - (blank)
Bootcamp - (blank) - Books - (blank)
The four Learnt columns are each populated by a method of learning (there are around 8 possibilities, which apply to every column) or are blank, and more than one column can be non-blank at the same time in any given row. Very simply, I want to be able to count the total number of times each method appears across all the columns in Power BI, so the expected results here would be Books 4, Online 1, Bootcamp 2. Any help would be greatly appreciated.
I have tried using the USERELATIONSHIP function to link this to a separate table listing all methods of learning (Table1); see the query below, but I was having no luck:
CountinlearntocodeColumns =
CALCULATE (
COUNTROWS ( LearningMethodTable ),
USERELATIONSHIP ( Table1[Method], LearningMethodTable[Learnt 1] )
)
+ CALCULATE (
COUNTROWS ( LearningMethodTable ),
USERELATIONSHIP ( Table1[Method], LearningMethodTable[Learnt 2] )
)
+ CALCULATE (
COUNTROWS ( LearningMethodTable ),
USERELATIONSHIP ( Table1[Method], LearningMethodTable[Learnt 3] )
)
You are making life unnecessarily difficult by working with pivoted tables in Power BI. Fix that first.
In Power Query, simply unpivot all columns and filter out blanks:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WcsrPzy5WUNJRgqJYnWgl/7yczLxUkCCKLEgKWQzIKElOzC1QgEmiCCBrjo0FAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Learnt 1 " = _t, #"Learnt 2 " = _t, #"Learnt 3 " = _t, #"Learnt 4" = _t]),
#"Unpivoted Columns" = Table.UnpivotOtherColumns(
Source, {}, "Learnt", "Method"),
#"Filtered Rows" = Table.SelectRows(
#"Unpivoted Columns", each [Method] <> null and [Method] <> "")
in
#"Filtered Rows"
The result will look like this:
Now you can easily create your count in DAX
Count = Count('Table'[Method])
and pull it into a new result table:
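For readers who want to check the expected numbers outside Power BI, the same unpivot-then-count idea can be sketched in plain Python. This is only an illustration of the logic, not a replacement for the Power Query and DAX steps; the `rows` list reproduces the sample data from the question, with `None` standing in for a blank cell.

```python
from collections import Counter

# Sample rows from the question; None stands for a blank cell.
rows = [
    {"Learnt 1": "Books", "Learnt 2": None, "Learnt 3": None, "Learnt 4": None},
    {"Learnt 1": "Online", "Learnt 2": "Books", "Learnt 3": None, "Learnt 4": None},
    {"Learnt 1": None, "Learnt 2": "Books", "Learnt 3": "Bootcamp", "Learnt 4": None},
    {"Learnt 1": "Bootcamp", "Learnt 2": None, "Learnt 3": "Books", "Learnt 4": None},
]

# "Unpivot": turn every (column, value) pair into its own record and
# drop blanks, mirroring the unpivot and filter steps in Power Query.
unpivoted = [
    (column, method)
    for row in rows
    for column, method in row.items()
    if method not in (None, "")
]

# Count how often each method occurs in the single Method column.
counts = Counter(method for _, method in unpivoted)
print(dict(counts))
```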

Earliest Time in Datetime column PowerBI

Okay so I have a table like shown...
I want to use PowerBI to create a new column called 'First_Interaction' where it will say 'True' if this was the user's earliest entry for that day. Any entry that came in after the first entry will be set to "False".
This is what I want the column to be like...
Use the following DAX formula to create a column:
First_Interaction =
VAR __userName = 'Table'[UserName]
VAR __minDate = CALCULATE( MIN( 'Table'[Datetime] ), FILTER( 'Table', 'Table'[UserName] = __userName ) )
Return IF( 'Table'[Datetime] = __minDate, "TRUE", "FALSE" )
Power BI doesn't support sub-second precision, so your DateTime column must be a Text value. Take that into consideration for future transformations.
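Note that the formula above looks for the earliest entry per user across the whole table, while the question asks for the earliest entry per user per day. The logic with the day included can be sketched in Python as follows. The sample rows and variable names are hypothetical, and this is an illustration of the grouping rather than DAX that can be pasted into Power BI.

```python
from datetime import datetime

# Hypothetical sample rows: (user name, timestamp).
entries = [
    ("alice", datetime(2021, 3, 1, 9, 0, 0)),
    ("alice", datetime(2021, 3, 1, 9, 30, 0)),
    ("bob", datetime(2021, 3, 1, 8, 15, 0)),
    ("alice", datetime(2021, 3, 2, 7, 45, 0)),
    ("bob", datetime(2021, 3, 1, 8, 5, 0)),
]

# First pass: find the earliest timestamp per (user, calendar day).
earliest = {}
for user, stamp in entries:
    key = (user, stamp.date())
    if key not in earliest or stamp < earliest[key]:
        earliest[key] = stamp

# Second pass: a row is a first interaction if it equals that minimum.
first_interaction = [
    stamp == earliest[(user, stamp.date())] for user, stamp in entries
]
print(first_interaction)
```

In DAX terms, this would correspond to adding a same-day condition to the FILTER, under the assumption that the date part can be extracted from the Datetime column.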

Teradata : using case statement in Where clause

My question is about using a CASE statement in a WHERE clause to check dates and assign values to columns. My sample code is below.
select * from table
where
column 1 > 10 and
case when column 2 = 1
then
column 3<= 10 and column 4 between (1st day of prev month) and (prev month end) or column 5 = '8888-01-01'
else
column 4 between (1st day of this month) and (yesterday)
end ;
When I run this code, I get a 3706 syntax error: expected something in between field and '='.
How can I fix this?
A CASE statement will always return a value or NULL (if none of the conditions matches), so you can use it in your WHERE clause. There are a couple ways to format your CASE statement:
Format 1
CASE
WHEN <condition> THEN <some_expression>
WHEN <another_condition> THEN <another_expression>
ELSE <final_expression>
END
-- Example
CASE
WHEN col1 = 10 THEN 'Y'
WHEN col1 = 20 THEN 'N'
ELSE 'N/A'
END
Format 2
CASE <expression>
WHEN <value> THEN <expression>
WHEN <another_value> THEN <another_expression>
ELSE <final_expression>
END
-- Example
CASE col1
WHEN 10 THEN 'Y'
WHEN 20 THEN 'N'
ELSE 'NA'
END
I'm not sure what you're trying to do with your sample code, but it looks more like pseudo-code and will not work as-is. Your CASE statement is not formatted properly and your column references like column 1 will not work that way. If your column is actually named column 1, then you need to put double-quotes around it:
select * from table where "column 1" > 10
Can you please describe a little more clearly what exactly you are trying to do?
A CASE expression can't be used to create some kind of dynamic conditions. Write it as a bunch of AND/OR conditions:
select * from table
where
column 1 > 10 and
(
( column 2 = 1 and
(column 3<= 10 and column 4 between (1st day of prev month) and (prev month end) or column 5 = '8888-01-01')
)
or
column 4 between (1st day of this month) and (yesterday)
);
Double-check the logic; the precedence of logical operators, from highest to lowest, is:
1. parentheses
2. NOT
3. AND
4. OR
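When translating an IF/ELSE-style CASE into AND/OR conditions, the ELSE branch usually needs its own guard (the first condition being false or NULL), otherwise rows from the first branch can also pass through the second. The sketch below shows one such translation. It uses SQLite through Python's `sqlite3` module as a stand-in for Teradata, and the table `t`, its columns `c1` to `c3` and the thresholds are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, 20, 1, 5),       # c2 = 1 branch, passes c3 <= 10
        (2, 20, 1, 50),      # c2 = 1 branch, fails c3 <= 10
        (3, 20, 2, 500),     # else branch, passes c3 > 100
        (4, 20, 2, 50),      # else branch, fails c3 > 100
        (5, 5, 1, 5),        # fails c1 > 10
        (6, 20, None, 500),  # c2 is NULL, so the else branch applies
    ],
)

# Intended logic: IF c2 = 1 THEN c3 <= 10 ELSE c3 > 100,
# written as plain boolean conditions with an explicit guard on the ELSE part.
rewritten = """
    SELECT id FROM t
    WHERE c1 > 10
      AND (
            (c2 = 1 AND c3 <= 10)
         OR ((c2 <> 1 OR c2 IS NULL) AND c3 > 100)
      )
    ORDER BY id
"""
matching_ids = [row[0] for row in conn.execute(rewritten)]
print(matching_ids)
```

The `c2 IS NULL` test is there because a CASE falls through to ELSE when the WHEN condition is NULL, whereas `c2 <> 1` alone would reject such rows.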

Optimize left join query with multiple counts in SQLAlchemy?

Trying to optimize a query, which has multiple counts for objects in subordinate table (used aliases in SQLAlchemy). In Witch Academia terms, something like this:
SELECT
exam.id AS exam_id,
exam.name AS exam_name,
count(tried_witch.id) AS tried,
count(passed_witch.id) AS passed,
count(failed_witch.id) AS failed
FROM exam
LEFT OUTER JOIN witch AS tried_witch
ON tried_witch.exam_id = exam.id AND
tried_witch.is_failed = 0 AND
tried_witch.status != "passed"
LEFT OUTER JOIN witch AS passed_witch
ON passed_witch.exam_id = exam.id AND
passed_witch.is_failed = 0 AND
passed_witch.status = "passed"
LEFT OUTER JOIN witch AS failed_witch
ON failed_witch.exam_id = exam.id AND
failed_witch.is_failed = 1
GROUP BY exam.id, exam.name
ORDER BY tried ASC
LIMIT 20
The number of witches can be large (hundreds of thousands), while the number of exams is lower (hundreds), so the above query is quite slow. In a lot of similar questions I've found answers that propose the above, but I feel a totally different approach is needed here, and I am stuck coming up with an alternative. NB: there is a need to order by the calculated counts. It is also important to have zeros as counts, of course, where due. (Do not pay attention to the somewhat funny model: witches can easily clone themselves to go to multiple exams, hence the per-exam identity.)
With one EXISTS subquery, which is not reflected in the above and does not influence the outcome, the situation is:
# Query_time: 1.135747 Lock_time: 0.000209 Rows_sent: 20 Rows_examined: 98174
# Rows_affected: 0
# Full_scan: Yes Full_join: No Tmp_table: Yes Tmp_table_on_disk: Yes
# Filesort: Yes Filesort_on_disk: No Merge_passes: 0 Priority_queue: No
Updated query, which is still quite slow:
SELECT
exam.id AS exam_id,
exam.name AS exam_name,
count(CASE WHEN (witch.status != "passed" AND witch.is_failed = 0)
THEN witch.id
ELSE NULL END) AS tried,
count(CASE WHEN (witch.status = "passed" AND witch.is_failed = 0)
THEN witch.id
ELSE NULL END) AS passed,
count(CASE WHEN (witch.is_failed = 1)
THEN witch.id
ELSE NULL END) AS failed
FROM exam
LEFT OUTER JOIN witch ON witch.exam_id = exam.id
GROUP BY exam.id, exam.name
ORDER BY tried ASC
LIMIT 20
Indexing is the key to the performance of this query.
I do not know MariaDB at all, so not sure what the possibilities are. But if it is anything like Microsoft SQL Server, then here is what I would try:
Create ONE composite index covering ALL the required columns: witch_id, status and is_failed. If the query uses that index, that should be it. Here the order of the included columns might be very important. Then profile the query in order to understand if the index is used. See Optimization and Indexes documentation page.
Consider Generated (Virtual and Persistent) Columns.
It looks like all the information needed to classify a witch into the tried, passed or failed bucket is contained in the witch row itself. Therefore, you can create those virtual columns directly on the database table and use the PERSISTENT option, which allows an index to be created on them. Then you can create an index specifically for this query containing witch_id and the three virtual columns: tried, passed and failed. Make sure your query uses it, and that should be pretty good. The query will then look very simple:
SELECT exam.id,
exam.name,
sum(witch.tried) AS tried,
sum(witch.passed) AS passed,
sum(witch.failed) AS failed
FROM exam
INNER JOIN witch ON exam.id = witch.exam_id
GROUP BY exam.id,
exam.name
ORDER BY sum(witch.tried)
LIMIT 20
Although the query only involves simple comparisons and AND/OR clauses, you are basically offloading the calculation of the 3 statuses to the database during INSERT/UPDATE. Then during SELECT your query should be much faster.
Your example does not specify any result filtering (WHERE clause), but if you have one, it might also have an impact on the way one optimises indices for query performance.
Original answer: Below is the originally proposed change to the query.
Here I assume that the indexing part of the optimisation has already been done.
Could you try with SUM instead of COUNT?
SELECT exam.id,
exam.name,
sum(CASE
WHEN (witch.is_failed = 0
AND witch.status != 'passed') THEN 1
ELSE 0
END) AS tried,
sum(CASE
WHEN (witch.is_failed = 0
AND witch.status = 'passed') THEN 1
ELSE 0
END) AS passed,
sum(CASE
WHEN (witch.is_failed = 1) THEN 1
ELSE 0
END) AS failed
FROM exam
INNER JOIN witch ON exam.id = witch.exam_id
GROUP BY exam.id,
exam.name
ORDER BY sum(CASE
WHEN (witch.is_failed = 0
AND witch.status != 'passed') THEN 1
ELSE 0
END)
LIMIT 20
The rest:
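Given you have specified sqlalchemy in your question, here is the sqlalchemy code which I used to model and generate the query:
# imports (the list form of case() below is the SQLAlchemy 1.x style;
# SQLAlchemy 1.4+ expects the tuples positionally, e.g. case((cond, 1), else_=0))
from sqlalchemy import Column, ForeignKey, Integer, String, and_, case, func
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import relationship

Base = declarative_base()

# model
class Exam(Base):
    __tablename__ = 'exam'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Witch(Base):
    __tablename__ = 'witch'
    id = Column(Integer, primary_key=True)
    exam_id = Column(Integer, ForeignKey('exam.id'))
    is_failed = Column(Integer)
    status = Column(String)
    exam = relationship(Exam, backref='witches')

    # computed fields (evaluated in Python on a loaded instance)
    @hybrid_property
    def tried(self):
        return self.is_failed == 0 and self.status != 'passed'

    @hybrid_property
    def passed(self):
        return self.is_failed == 0 and self.status == 'passed'

    @hybrid_property
    def failed(self):
        return self.is_failed == 1

    # computed fields: expression (evaluated in SQL at class level);
    # each method reuses the hybrid's name so that Witch.tried etc.
    # pick up the SQL expression
    @tried.expression
    def tried(cls):
        return case([(and_(
            cls.is_failed == 0,
            cls.status != 'passed',
        ), 1)], else_=0)

    @passed.expression
    def passed(cls):
        return case([(and_(
            cls.status == 'passed',
            cls.is_failed == 0,
        ), 1)], else_=0)

    @failed.expression
    def failed(cls):
        return case([(cls.is_failed == 1, 1)], else_=0)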
and:
# query
q = (
session.query(
Exam.id, Exam.name,
func.sum(Witch.tried).label("tried"),
func.sum(Witch.passed).label("passed"),
func.sum(Witch.failed).label("failed"),
)
.join(Witch)
.group_by(Exam.id, Exam.name)
.order_by(func.sum(Witch.tried))
.limit(20)
)
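The question stresses that zero counts must be kept, which an INNER JOIN would drop for exams that have no witches at all. A minimal sketch of the same conditional aggregation with a LEFT JOIN is shown below. It uses SQLite through Python's `sqlite3` module rather than MariaDB, so it says nothing about performance there; the sample rows and the index name `ix_witch_exam_status` are hypothetical, and indexing on `exam_id` first is an assumption based on the join column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exam (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE witch (
        id INTEGER PRIMARY KEY,
        exam_id INTEGER REFERENCES exam(id),
        is_failed INTEGER,
        status TEXT
    );
    -- Composite index: join column first, then the classified columns.
    CREATE INDEX ix_witch_exam_status ON witch (exam_id, is_failed, status);
    INSERT INTO exam VALUES (1, 'Brooms'), (2, 'Potions'), (3, 'Runes');
    INSERT INTO witch (exam_id, is_failed, status) VALUES
        (1, 0, 'new'), (1, 0, 'passed'), (1, 1, 'new'),
        (2, 0, 'passed');
""")

# SUM(CASE ... ELSE 0) yields 0 for the NULL row that the LEFT JOIN
# produces when an exam has no witches, so zero counts are preserved.
query = """
    SELECT exam.id, exam.name,
           SUM(CASE WHEN witch.is_failed = 0 AND witch.status != 'passed'
                    THEN 1 ELSE 0 END) AS tried,
           SUM(CASE WHEN witch.is_failed = 0 AND witch.status = 'passed'
                    THEN 1 ELSE 0 END) AS passed,
           SUM(CASE WHEN witch.is_failed = 1 THEN 1 ELSE 0 END) AS failed
    FROM exam
    LEFT JOIN witch ON witch.exam_id = exam.id
    GROUP BY exam.id, exam.name
    ORDER BY tried, exam.id
    LIMIT 20
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The extra `exam.id` in the ORDER BY is only there to make the output order deterministic when several exams share the same count.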

Select query with logical operators as sub query

I would like to run a conditional select query on a table of sewer structures (S_Structures) in SQLite. The table contains the following columns: struct_type (structure type) and Year (construction year).
The selection will be based on the structure type as well as the structure age. I think I've sorted out the type selection, and for the age I intend to subtract the construction year from the current local year. This is all well and good, but how do I go about defining a logical operator as a subquery for the specified age range of 2 < age <= 5?
Select [Year],[Struc_Type]
FROM [S_Structures]
WHERE [Struc_Type] NOT IN
("Manhole", "Rodding Eye", "Dummy",
"End Manhole", "T-Piece", "Sub- Catchment", "Top End");
AND WHERE (strftime('%Y','now') - Year) > '2'
AND (strftime('%Y','now') - Year); <= '5';
A subquery would involve an additional SELECT in parentheses.
Only a single WHERE clause is allowed.
Just use the BETWEEN operator (for whole-year ages, 2 < age <= 5 is the same as BETWEEN 3 AND 5):
SELECT [Year], [Struc_Type]
FROM [S_Structures]
WHERE [Struc_Type] NOT IN ('Manhole', 'Rodding Eye', 'Dummy', 'End Manhole',
'T-Piece', 'Sub-Catchment', 'Top End')
AND (strftime('%Y','now') - Year) BETWEEN 3 AND 5;
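The query above can be tried out quickly with Python's built-in `sqlite3` module. In this sketch the structure types "Pump Station" and "Outfall" and the sample years are hypothetical; the years are derived from the current UTC year because `strftime('%Y', 'now')` in SQLite works in UTC by default.

```python
import sqlite3
from datetime import datetime, timezone

# Derive sample years relative to the current UTC year so the ages are known.
this_year = datetime.now(timezone.utc).year

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S_Structures (Struc_Type TEXT, Year INTEGER)")
conn.executemany(
    "INSERT INTO S_Structures VALUES (?, ?)",
    [
        ("Pump Station", this_year - 2),  # age 2: outside the range
        ("Pump Station", this_year - 3),  # age 3: inside the range
        ("Outfall", this_year - 5),       # age 5: inside the range
        ("Outfall", this_year - 6),       # age 6: outside the range
        ("Manhole", this_year - 4),       # right age, but excluded by type
    ],
)

query = """
    SELECT Struc_Type, Year
    FROM S_Structures
    WHERE Struc_Type NOT IN ('Manhole', 'Rodding Eye', 'Dummy', 'End Manhole',
                             'T-Piece', 'Sub-Catchment', 'Top End')
      AND (strftime('%Y', 'now') - Year) BETWEEN 3 AND 5
    ORDER BY Year
"""
selected = conn.execute(query).fetchall()
ages = [this_year - year for _, year in selected]
print(selected, ages)
```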
