R: Writing loops to replace NULL with Dates

Here is an example of my table:
custID | StartDate | EndDate | ReasonForEnd | TransactionType | TransactionDate
1a | NULL | 2/12/2014 | AccountClosed | AccountOpened | 1/15/2004
1a | NULL | 2/12/2014 | AccountClosed | Purchase | 3/16/2004
.......
2b | 7/7/2011 | 6/14/2013 | AccountClosed | AccountOpened | 8/1/2010
The problem has to do with the StartDate field. For each custID, if the entry is NULL, I want to replace it with the TransactionDate where TransactionType = AccountOpened. If StartDate is after that TransactionDate, I want to replace it with that date as well.
The actual data is over 250,000 rows. I really need some help figuring out how to write this in R.

You could try the following, though I haven't tested it. I assume your data.frame is called df:
require(dplyr)
df %>%
  # convert the date columns to Date; NULL entries become NA
  mutate_each(funs(as.Date(as.character(.), format = "%m/%d/%Y")),
              StartDate, EndDate, TransactionDate) %>%
  group_by(custID) %>%
  # replace StartDate where it is NA or later than the account-opening date;
  # ifelse() drops the Date class, so wrap the result in as.Date() again
  mutate(StartDate = as.Date(ifelse(is.na(StartDate) | StartDate > TransactionDate[TransactionType == "AccountOpened"],
                                    TransactionDate[TransactionType == "AccountOpened"],
                                    StartDate),
                             origin = "1970-01-01"))
This code first converts several columns to Date format (in this step, NULL entries become NA), groups by custID, and then replaces StartDate with the TransactionDate where TransactionType == "AccountOpened" whenever StartDate is either NA or later than that date. Note that ifelse() strips the Date class, which is why its result is wrapped in as.Date() with an origin.
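As an aside, if the table still lives in a database, the same repair could be done in SQL before the import. A minimal SQLite sketch, assuming a hypothetical table named transactions with the columns above and dates stored in a sortable format such as yyyy-mm-dd:

-- Hypothetical table name "transactions"; dates must be stored in a
-- sortable format (e.g. yyyy-mm-dd) for the > comparison to be meaningful.
UPDATE transactions
SET StartDate = (SELECT min(t2.TransactionDate)
                 FROM transactions t2
                 WHERE t2.custID = transactions.custID
                   AND t2.TransactionType = 'AccountOpened')
WHERE StartDate IS NULL
   OR StartDate > (SELECT min(t2.TransactionDate)
                   FROM transactions t2
                   WHERE t2.custID = transactions.custID
                     AND t2.TransactionType = 'AccountOpened');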

Related

query to transpose rows to columns SQLite

Good afternoon,
I would like to know if it is possible to write a query that generates columns according to the number of rows for each ID in my table.
Example:
ID COD DIAG
111111111 | Z359 | D
111111112 | Z359 | D
111111112 | Z359 | D
111111113 | Z359 | R
111111113 | Z359 | P
111111113 | Z359 | R
111111114 | Z359 | D
111111114 | Z359 | D
111111114 | Z359 | D
111111115 | Z359 | D
Ideally, the number of columns would match the number of rows for each ID; if that is not possible, a fixed number of columns would do.
Desired result:
ID | COD1 | DIAG1 | COD2 | DIAG2 | COD3 | DIAG3
111111111 | Z359 | D | | | |
111111112 | Z359 | D | Z359 | D | |
111111113 | Z359 | R | Z359 | P | Z359 | R
111111114 | Z359 | D | Z359 | D | Z359 | D
111111115 | Z359 | D | | | |
Sorry for my English.
Thanks a lot!
This first query follows the pattern of the answer to the duplicate question, included here for comparison.
WITH numbered AS (
    SELECT row_number() OVER
               (PARTITION BY ID ORDER BY COD, DIAG) AS seq,
           t.*
    FROM SO58566470 t)
SELECT ID,
       max(CASE WHEN seq = 1 THEN COD END) AS COD1,
       max(CASE WHEN seq = 1 THEN DIAG END) AS DIAG1,
       max(CASE WHEN seq = 2 THEN COD END) AS COD2,
       max(CASE WHEN seq = 2 THEN DIAG END) AS DIAG2,
       max(CASE WHEN seq = 3 THEN COD END) AS COD3,
       max(CASE WHEN seq = 3 THEN DIAG END) AS DIAG3
FROM numbered n
GROUP BY ID;
But that is a naive use of window functions, because it could have made better use of the window by calculating the other values at the same time. The first query already collects and traverses the partitioned rows to compute the row number, yet it essentially repeats that work in the outer query by re-collecting the values with the aggregate max() functions.
The following query looks longer and perhaps more complicated, but it takes advantage of the partitioned data (i.e., the window) by collecting the transformed values in the same pass. Because window functions necessarily operate on every row, it then becomes necessary to filter out "incomplete" rows. I did not profile the queries, but I suspect the second one is more efficient overall.
WITH transform AS (
    SELECT id,
           lag(COD, 0)  OVER IDWin AS COD1,
           lag(DIAG, 0) OVER IDWin AS DIAG1,
           lag(COD, 1)  OVER IDWin AS COD2,
           lag(DIAG, 1) OVER IDWin AS DIAG2,
           lag(COD, 2)  OVER IDWin AS COD3,
           lag(DIAG, 2) OVER IDWin AS DIAG3,
           row_number() OVER IDWin AS seq
    FROM SO58566470 t
    WINDOW IDWin AS (PARTITION BY ID ORDER BY COD, DIAG)
    ORDER BY ID, seq
),
last AS (
    SELECT id, max(seq) AS maxseq
    FROM transform
    GROUP BY id
)
SELECT transform.*
FROM transform
JOIN last
  ON transform.id = last.id AND transform.seq = last.maxseq
ORDER BY id;
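For anyone who wants to compare the two queries, here is a minimal setup with the sample data from the question, using the table name assumed above (window functions require SQLite 3.25 or later):

-- Sample data from the question, loaded into the table name used above.
CREATE TABLE SO58566470 (ID TEXT, COD TEXT, DIAG TEXT);
INSERT INTO SO58566470 (ID, COD, DIAG) VALUES
    ('111111111', 'Z359', 'D'),
    ('111111112', 'Z359', 'D'),
    ('111111112', 'Z359', 'D'),
    ('111111113', 'Z359', 'R'),
    ('111111113', 'Z359', 'P'),
    ('111111113', 'Z359', 'R'),
    ('111111114', 'Z359', 'D'),
    ('111111114', 'Z359', 'D'),
    ('111111114', 'Z359', 'D'),
    ('111111115', 'Z359', 'D');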

Make partitions based on difference in date in Postgres window function

I have data in the following format
id | first_name | last_name | birth_date
abc | Jared | Pollard | 1970-01-01
def | Jared | Pollard | 1972-02-02
ghi | Jared | Pollard | 1980-01-01
klm | Jared | Pollard | 2015-01-01
and I would like a query that groups the data based on the following rule:
If first_name and last_name are equal and the birth_dates are within 5 years of each other, then the records belong to the same group.
So the above data contains three groups: group1 = (abc, def), group2 = (ghi), and group3 = (klm).
Currently I have the following query, which incorrectly creates only 2 groups: group1 = (abc, def) and group2 = (ghi, klm)
SELECT g.id,
       FIRST_VALUE(g.id) OVER (PARTITION BY lower(trim(g.last_name)),
                                            lower(trim(g.first_name)),
                                            CASE WHEN g.birth_date BETWEEN g.fv_birth_date - interval '5 year'
                                                                       AND g.fv_birth_date + interval '5 year'
                                                 THEN 1 ELSE 0 END
                               ORDER BY g.last_used_dt DESC NULLS LAST) AS cluster_id
FROM (
    SELECT id, last_used_dt, last_name, first_name, birth_date,
           FIRST_VALUE(birth_date)
               OVER (PARTITION BY lower(trim(last_name)),
                                  lower(trim(first_name))
                     ORDER BY last_used_dt DESC NULLS LAST) AS fv_birth_date
    FROM guest
) g;
I understand this is because of the CASE expression inside the PARTITION BY clause, but I am unable to come up with any other query.
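One standard way to express this kind of rule (not from the original thread) is the gaps-and-islands pattern: sort each name's records by birth_date and start a new group whenever the gap from the previous record exceeds 5 years. A sketch against the guest table:

-- Gaps-and-islands sketch: new_group = 1 marks the start of a group,
-- and the running sum of those flags numbers the groups per name.
SELECT id, first_name, last_name, birth_date,
       sum(new_group) OVER (PARTITION BY lower(trim(last_name)),
                                         lower(trim(first_name))
                            ORDER BY birth_date) AS group_id
FROM (
    SELECT g.*,
           CASE WHEN g.birth_date <= lag(g.birth_date)
                                         OVER (PARTITION BY lower(trim(g.last_name)),
                                                            lower(trim(g.first_name))
                                               ORDER BY g.birth_date)
                                     + interval '5 years'
                THEN 0 ELSE 1 END AS new_group
    FROM guest g
) s;

This produces the three expected groups for the sample data. Note that it chains: a record joins the current group when it is within 5 years of the previous record, which is not quite a strict pairwise test across the whole group.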

Order of columns after pivot in application insights

A user wants a count of unique sessions per week in Application Insights. I have the query working, including a pivot, but the Week columns come out of order. I would prefer them in order.
pageViews
| where timestamp < now()
| summarize Sessions = dcount(session_Id)
    by Week = bin(datepart("weekOfYear", timestamp), 1), user_AuthenticatedId
| order by Week
| evaluate pivot(Week, sum(Sessions))
| join kind=innerunique (
    pageViews
    | summarize MostRecentRequest = max(timestamp) by user_AuthenticatedId)
    on $right.user_AuthenticatedId == $left.user_AuthenticatedId
| project-away user_AuthenticatedId1
I've tried ordering by timestamp before the summarize, and ordering by Week after the summarize (still in there), but no luck.
There's currently a "trick" that will work: serialize right after your order by
pageViews
| where timestamp < now()
| where isnotempty(user_AuthenticatedId)
| summarize Sessions = dcount(session_Id)
    by Week = bin(datepart("weekOfYear", timestamp), 1), user_AuthenticatedId
| order by Week
| serialize // <--------------------------------- RIGHT HERE
| evaluate pivot(Week, sum(Sessions))
| join kind=innerunique (
    pageViews
    | summarize TotalSessions = dcount(session_Id), MostRecentRequest = max(timestamp) by user_AuthenticatedId)
    on $right.user_AuthenticatedId == $left.user_AuthenticatedId
| project-away user_AuthenticatedId1
| top 100 by TotalSessions desc
This gets me the weeks in descending order in workbooks (I also added a total session count to sort/top by, with some custom column settings).
For the column settings in workbooks: delete all the #'d columns that are there by default and add one for ^[0-9]+$ set to heatmap.
I refactored the query a bit for my own comprehension, pulling the left and right sides of the join into "views". Thought I'd share.
let users_MostRecent_Session =
    pageViews
    | summarize
        TotalSessions = dcount(session_Id),
        MostRecentRequest = max(timestamp)
        by user_AuthenticatedId
;
//
let users_sessions_ByWeek =
    pageViews
    | where timestamp < now()
    | where isnotempty(user_AuthenticatedId)
    | summarize
        Sessions = dcount(session_Id)
        by
        Week = bin(datepart("weekOfYear", timestamp), 1),
        user_AuthenticatedId
    | order by Week
    | serialize
    | evaluate pivot(Week, sum(Sessions))
;
//
users_sessions_ByWeek
| join kind=innerunique users_MostRecent_Session
    on user_AuthenticatedId
| project-away user_AuthenticatedId1
| top 100 by TotalSessions desc

sqlite get data from one field on two different dates and present as two columns

I have a sqlite database with some time series data:
holdings:
| id | date | instrument | position | price | portfolio | sector |
prices:
| id | date | instrument | open | high | low | close | adjclose |
static_data:
| id | ticker | name | sector | industry | country | currency |
and I'd like to get the holdings for a particular day with the change in price on that day as a calculated field.
I've tried the following query
SELECT h.date,
h.portfolio,
h.instrument,
s.name,
h.position,
p.adjclose AS curpx,
(p.adjclose AS lastpx WHERE
h.date = "2013-01-10 00:00:00" AND
h.instrument = p.instrument)
FROM holdings AS h,
static_data AS s,
prices AS p
WHERE h.date = "2013-01-11 00:00:00"
AND h.portfolio = "usequity"
AND (h.instrument = p.instrument)
AND (h.date = p.date)
AND (h.instrument = s.ticker);
but I get a syntax error.
[2014-11-14 06:11:04] [1] [SQLITE_ERROR] SQL error or missing database (near "as": syntax error)
I'm a complete n00b at SQL, so I'd like to know how I can get two sets of data from the same table and show them side by side, or perform a calculation using one against the other.
Thanks
You want a correlated subquery:
SELECT ...,
p.adjclose AS curpx,
(SELECT p2.adjclose
FROM prices AS p2
WHERE p2.date = datetime(h.date, '-1 days')
AND p2.instrument = h.instrument
) AS lastpx
FROM ...
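Slotted into the original query, that would look something like this (a sketch, untested against your schema):

SELECT h.date,
       h.portfolio,
       h.instrument,
       s.name,
       h.position,
       p.adjclose AS curpx,
       (SELECT p2.adjclose
        FROM prices AS p2
        WHERE p2.date = datetime(h.date, '-1 days')
          AND p2.instrument = h.instrument) AS lastpx
FROM holdings AS h,
     static_data AS s,
     prices AS p
WHERE h.date = '2013-01-11 00:00:00'
  AND h.portfolio = 'usequity'
  AND h.instrument = p.instrument
  AND h.date = p.date
  AND h.instrument = s.ticker;

From there the daily change is simply curpx - lastpx. Be aware that datetime(h.date, '-1 days') is the previous calendar day, so lastpx will be NULL whenever that day has no price row (weekends, holidays).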

Generating attendance list for hours without a matching row

I have a project that calculates work hours from attendance logs that I import from an attendance machine. I use an SQLite database and VB.NET.
First I'll show the tables that I use:
CREATE TABLE [CheckLogs] (
[IDCheckLog] INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
[IDEmployee] TEXT NOT NULL,
[Dates] TEXT NOT NULL,
[In] TEXT,
[Out] TEXT,
[OverTime] NUMERIC DEFAULT 0);
CREATE TABLE integers (i INTEGER NOT NULL PRIMARY KEY);
INSERT INTO integers (i) VALUES
(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
Table CheckLogs holds the data that I import from the attendance machine; the OverTime column is calculated by my program. Table integers is used to generate the date list; I got that technique from here.
I want to generate a view that shows an employee's attendance between two dates, displaying the CheckLogs data when the employee was present and NULL when absent. When an employee is absent, there is simply no row in CheckLogs for that day.
This is the view I want (a report for employee 10001 between 2014-10-01 and 2014-10-05):
Dates | IDEmployee | In | Out
---------------------------------------
2014-10-01 | 10001 | 07:00 | 16:00
2014-10-02 | 10001 | 07:01 | 15:58
2014-10-03 | 10001 | null | null
2014-10-04 | 10001 | 07:08 | 15:48
2014-10-05 | 10001 | null | null
And this is the query that I have now:
SELECT X.[Dates], C.[IDEmployee], C.[In], C.[Out]
FROM (SELECT date('2014-10-01', '+' || (H.i*100 + T.i*10 + U.i) || ' day') AS Dates
      FROM integers AS H
      CROSS JOIN integers AS T
      CROSS JOIN integers AS U
      WHERE date('2014-10-01', '+' || (H.i*100 + T.i*10 + U.i) || ' day') <= '2014-10-05') AS X,
     CheckLogs AS C USING (Dates)
WHERE C.[IDEmployee] = '10001'
From this query I get this result:
Dates | IDEmployee | In | Out
---------------------------------------
2014-10-01 | 10001 | 07:00 | 16:00
2014-10-02 | 10001 | 07:01 | 15:58
2014-10-04 | 10001 | 07:08 | 15:48
To get NULL values for rows without a match, you need an outer join.
You also have to take care not to filter those rows back out with a WHERE clause, which would never match their NULL values; the employee condition must go into the join's ON clause instead:
SELECT ...
FROM ( ... ) AS X
LEFT JOIN CheckLogs AS C ON C.Dates = X.Dates AND
C.IDEmployee = '10001'
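Combined with the date generator from the question (with both base dates set to the start of the report range), the complete query would look something like this (a sketch, untested):

SELECT X.Dates,
       coalesce(C.IDEmployee, '10001') AS IDEmployee,  -- keep the ID visible on absent days
       C.[In],
       C.[Out]
FROM (SELECT date('2014-10-01', '+' || (H.i*100 + T.i*10 + U.i) || ' day') AS Dates
      FROM integers AS H
      CROSS JOIN integers AS T
      CROSS JOIN integers AS U
      WHERE date('2014-10-01', '+' || (H.i*100 + T.i*10 + U.i) || ' day') <= '2014-10-05') AS X
LEFT JOIN CheckLogs AS C
       ON C.Dates = X.Dates
      AND C.IDEmployee = '10001'
ORDER BY X.Dates;

The coalesce() keeps the employee ID visible on the NULL rows, matching the desired report above.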
