I can only give you a picture of the data I'm working with, or of the character that creates my problems in the .csv file; I don't know how to extract that character.
This pillar character stops fread from working. Is there a way to escape it? readr's read_csv parses through these rows with no problem. I have tried dropping the column, forcing it to be a character column, and using comment.char = "", but nothing seems to work.
Here is what I'm hoping to get out (what I get with read_csv):
# A tibble: 5 x 4
X1 trade date trade_condition
<dbl> <dbl> <date> <chr>
1 2902 28.3 2019-01-14 -12------P----
2 2903 28.0 2019-01-14 P
3 2904 28.0 2019-01-14 P
4 2905 28.0 2019-01-14 P
5 2906 28.1 2019-01-14 P
I'm using data.table_1.12.0.
Here is the output with verbose = TRUE:
omp_get_max_threads() = 8
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 8 threads (omp_get_max_threads()=8, nth=8)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file C:/Users/Markku/Desktop/KONECRANES_2019.01.14/trades.csv
File opened, size = 592KB (606768 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,trade,date,trade_condition,sy>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 9 fields using quote rule 0
Detected 9 columns on line 1. This line is either column names or first data row. Line starts as: <<,trade,date,trade_condition,sy>>
Quote rule picked = 0
fill=false and the most number of columns found is 9
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (606767 bytes from row 1 to eof) / (2 * 27623 jump0size) == 10
Type codes (jump 000) : 57AAAA5AA Quote rule 0
A line with too-few fields (4/9) was found on line 4 of sample jump 7. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-few fields (4/9) was found on line 13 of sample jump 9. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 010) : 57AAAA5AA Quote rule 0
'header' determined to be true due to column 2 containing a string on row 1 and a lower type (float64) in the rest of the 858 sample rows
=====
Sampled 858 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 606683
Line length: mean=213.01 sd=86.78 min=59 max=372
Estimated number of rows: 606683 / 213.01 = 2849
Initial alloc = 5698 rows (2849 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 57AAAA5AA
[10] Allocate memory for the datatable
Allocating 9 column slots (9 - 0 dropped) with 5698 rows
[11] Read the data
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==1
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==2
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==3
jumps=[0..1), chunk_size=606683, total_size=606683
Read 2903 rows x 9 columns from 592KB (606768 bytes) file in 00:00.014 wall clock time
[12] Finalizing the datatable
Type counts:
2 : int32 '5'
1 : float64 '7'
6 : string 'A'
=============================
0.003s ( 21%) Memory map 0.001GB file
0.007s ( 50%) sep=',' ncol=9 and header detection
0.000s ( 0%) Column type detection using 858 sample rows
0.000s ( 0%) Allocation of 5698 rows x 9 cols (0.000GB) of which 2903 ( 51%) rows used
0.004s ( 29%) Reading 1 chunks (0 swept) of 0.579MB (each chunk 2903 rows) using 1 threads
+ 0.000s ( 0%) Parse to row-major thread buffers (grown 0 times)
+ 0.002s ( 14%) Transpose
+ 0.002s ( 14%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.014s Total
Warning message:
In fread(trades_file, verbose = T) :
Stopped early on line 2905. Expected 9 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2903,28.04,2019-01-14,"P>>
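Note that the first discarded line in that warning begins with an opening double quote ("P...) that is never closed, which is what derails the field parsing. A minimal workaround sketch, assuming no field in the file legitimately needs quoting (quote = "" is fread's documented way to disable quote processing entirely):
library(data.table)
# Treat quote characters as ordinary text, so the unmatched " inside
# trade_condition no longer opens a quoted field.
trades <- fread("C:/Users/Markku/Desktop/KONECRANES_2019.01.14/trades.csv",
                quote = "")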
I set up a constraint in Pyomo that does not constrain the solver.
The constraint is the following:
def revenue_positive(model, t):
    for t in model.T:
        return (model.D[t] * model.P[t]) >= 0
    model.positive_revenue = Constraint(model.T, rule=revenue_positive)
while the model components are:
model = ConcreteModel()
model.T = Set(doc='quarter of year', initialize=df.index.tolist(), ordered=True)
model.P = Param(model.T, initialize=df['price'].to_dict(), within=Any, doc='Price for each quarter')
model.C = Var(model.T, domain=NonNegativeReals)
model.D = Var(model.T, domain=NonNegativeReals)
income = sum(df.loc[t, 'price'] * model.D[t] for t in model.T)
expenses = sum(df.loc[t, 'price'] * model.C[t] for t in model.T)
profit = income - expenses
model.objective = Objective(expr=profit, sense=maximize)
# Solve the model
solver = SolverFactory('cbc')
solver.solve(model)
The df dataframe is:
df time_stamp price Status imbalance Difference Situation ... week month hour_of_day day_of_week day_of_year yearly_quarter
quarter ...
0 2021-01-01 00:00:00 64.84 Final 16 -3 Deficit ... 00 1 0 4 1 1
1 2021-01-01 00:15:00 13.96 Final 38 2 Surplus ... 00 1 0 4 1 1
2 2021-01-01 00:30:00 12.40 Final 46 1 Surplus ... 00 1 0 4 1 1
3 2021-01-01 00:45:00 7.70 Final 65 14 Surplus ... 00 1 0 4 1 1
4 2021-01-01 01:00:00 64.25 Final 3 -9 Deficit ... 00 1 1 4 1 1
The objective is to constrain the solver not to accept negative revenue. As written, the constraint does not work: the solver passes 6 negative revenue values through. Looking at the indices with negative revenue, it appears the system chooses to sell at a negative price in order to buy later at a price even "more" negative, so from an optimization standpoint it is fine. I would like to check the difference in results if we prohibit the solver from doing that. Any input is welcome; after many searches on the web, I still have not found the right way to write it.
I did a pprint() of the constraint that returned:
positive_revenue : Size=35040, Index=T, Active=True
UPDATE following new constraint code:
def revenue_positive(model, t):
    return model.D[t] * model.P[t] >= 0
model.positive_revenue = Constraint(model.T, rule=revenue_positive)
This returns the following error:
ERROR: Rule failed when generating expression for constraint positive_revenue
with index 283: ValueError: Invalid constraint expression. The constraint
expression resolved to a trivial Boolean (True) instead of a Pyomo object.
Please modify your rule to return Constraint.Feasible instead of True.
Error thrown for Constraint 'positive_revenue[283]'
ERROR: Constructing component 'positive_revenue' from data=None failed:
ValueError: Invalid constraint expression. The constraint expression
resolved to a trivial Boolean (True) instead of a Pyomo object. Please
modify your rule to return Constraint.Feasible instead of True.
Error thrown for Constraint 'positive_revenue[283]'
Traceback (most recent call last):
File "/home/olivier/Desktop/Elia - BESS/run_imbalance.py", line 25, in <module>
results_df = optimize_year(df)
File "/home/olivier/Desktop/Elia - BESS/battery_model_imbalance.py", line 122, in optimize_year
model.positive_revenue = Constraint(model.T, rule=revenue_positive)
File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/block.py", line 542, in __setattr__
self.add_component(name, val)
File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/block.py", line 1087, in add_component
val.construct(data)
File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/constraint.py", line 781, in construct
self._setitem_when_not_present(
File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/indexed_component.py", line 778, in _setitem_when_not_present
obj.set_value(value)
File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/constraint.py", line 506, in set_value
raise ValueError(
ValueError: Invalid constraint expression. The constraint expression resolved to a trivial Boolean (True) instead of a Pyomo object. Please modify your rule to return Constraint.Feasible instead of True.
Error thrown for Constraint 'positive_revenue[283]'
So there are two issues with your constraint. It isn't clear if one is a cut & paste issue or not.
1. The call to make the constraint appears to be indented inside of your function, after the return statement, making it unreachable code. That could be just the spacing in your post.
2. You are incorrectly adding a loop inside of your function. You pass in the parameter t as a function argument and then immediately blow it away with the for loop, which only executes for the first value of t in T before hitting the return statement. Remove the loop. When you use the rule= structure in Pyomo, the rule is called once for each member of the set you pass in the Constraint(xx, rule=) structure.
So I think you should have:
def revenue_positive(model, t):
    return model.D[t] * model.P[t] >= 0

model.positive_revenue = Constraint(model.T, rule=revenue_positive)
Updated re: the error you added.
The error cites index 283. My bet is that price[283] is zero, so you are multiplying by zero and killing your variable.
You could add a check within the function that tests whether the price is zero and, in that case, just returns pyo.Constraint.Feasible, which is the trivial return that doesn't influence the model (or crash).
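A minimal sketch of that guard, assuming pyomo.environ is imported as pyo and the model from the question; pyo.value() keeps the zero test as a plain Python comparison rather than a Pyomo expression:
import pyomo.environ as pyo

def revenue_positive(model, t):
    # If P[t] is zero, D[t] * P[t] >= 0 simplifies to the trivial
    # Boolean True, which Constraint() rejects. Return the
    # Constraint.Feasible sentinel instead to skip this index safely.
    if pyo.value(model.P[t]) == 0:
        return pyo.Constraint.Feasible
    return model.D[t] * model.P[t] >= 0

model.positive_revenue = pyo.Constraint(model.T, rule=revenue_positive)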
We have citizen scientists recording data for us using In-Situ Aqua TROLL 600 instruments. It is similar to a CTD, but the data format is different enough that I cannot use ctdTrim() from the oce package in R. I need to remove all the rows recorded during the soak time (the time in the water before they start lowering the instrument) and during the upcast, that is, all the rows after the instrument reached its max depth. So I just need the center portion of my dataframe.
My Data
Date Time Salinity (ppt) (672441) Chlorophyll-a Fluorescence (RFU) (671721) RDO Concentration (mg/L) (672144) Temperature (°C) (676121) Depth (ft) (671051)
16:29.0 0 0.01089297 7.257619 31.91303 0.008220486
16:31.0 0 0.01765913 7.246986 31.93175 0.1499496
16:33.0 0 0.0130412 7.258863 31.93253 0.5387784
16:35.0 0 0.01299242 7.274049 31.93806 0.6187978
16:37.0 0 0.01429801 7.26965 31.94401 0.6640261
16:39.0 0 0.01342988 7.271608 31.93595 0.681709
16:41.0 0 0.01337719 7.271549 31.93503 0.684597
16:43.0 7.087267 0.007094439 6.98015 31.89018 1.598019
16:45.0 28.3442 0.007111916 6.268753 31.83806 1.687673
16:47.0 31.06357 0.007945394 6.197834 31.77821 1.418773
16:49.0 32.07076 0.0080788 6.166986 31.76881 1.382685
16:51.0 31.95504 0.004382414 6.191305 31.72906 1.358556
16:53.0 36.21165 0.01983912 5.732656 29.3942 123.4148
16:55.0 36.37849 0.02243886 5.626586 28.82502 125.2927
16:57.0 36.43061 0.02416219 5.450325 28.23787 126.7997
16:59.0 36.44484 0.02441683 5.421676 28.14037 127.0321
17:01.0 36.46815 4.510316 5.318929 28.09501 127.2064
17:03.0 36.41381 4.012657 5.241654 28.14595 127.2227
17:05.0 36.42724 0.7891375 5.174401 28.20383 127.2019
17:07.0 36.41064 0.4351442 5.120181 28.18592 127.197
17:09.0 36.38155 0.2253969 5.033384 28.21021 127.1895
17:11.0 36.37671 0.2089337 5.019629 28.21222 127.1885
17:13.0 36.43813 0.08728585 4.981099 28.17526 127.2223
17:15.0 36.47644 0.904435 4.951878 28.13579 127.2108
17:17.0 36.54742 0.1230291 4.93056 28.06166 127.2307
17:19.0 36.60466 10.04291 4.908442 27.9397 126.6003
17:21.0 36.61511 11.33922 4.904828 27.92038 126.5161
17:23.0 36.68179 0.6680982 4.87018 27.78319 123.707
17:25.0 36.74612 0.06539913 4.848994 27.72977 119.906
17:27.0 36.75729 0.02414635 4.826871 27.72545 114.9537
17:29.0 37.1578 0.01556828 4.804105 27.81129 113.3405
depthmax <- max(WS$`Depth (ft) (671051)`, na.rm = TRUE)
output <- WS[WS$`Depth (ft) (671051)` < depthmax, ]
Output2 <- output[output$`Depth (ft) (671051)` > 1, ]
I tried these and got Output2 to work, but I can't seem to get output to work. Is there a more elegant way to do this? Just to recap: I need to remove all rows after the depth max (127.2307) and all rows before the depth at which they start lowering the instrument (~2.41).
Your code removes the row with the maximum depth, but not the rows after the maximum depth is reached. You want to locate the row index of the maximum depth and drop that row and the ones after it:
start <- tail(which(na.omit(WS$`Depth (ft) (671051)`) < 2.41), 1) + 1
end <- which.max(na.omit(WS$`Depth (ft) (671051)`)) - 1
output <- WS[start:end, ]
The first line finds the index of the last row less than 2.41 and adds 1 to get the starting row. The second line finds the index of the maximum depth and subtracts 1 to get the row before that.
I have a dataset that looks like this:
df_dummy = data.frame(
  Company = c("0001", "0002", "0003", "0004", "0005"),
  Measure = c("A", "B", "C", "D", "E"),
  Num = c(10, 10, 10, 10, 10),
  Den = c(20, 20, 20, 20, 20),
  Rate = c(50.0, 50.0, 50.0, 50.0, 50.0)
)
df_dummy$Company <- as.character(df_dummy$Company)
df_dummy$Measure <- as.character(df_dummy$Measure)
I am using this to export to an .xpt file
write.xport(df_dummy, file = "data/tmp.xpt")
lookup.xport("data/tmp.xpt")
In SAS, I use this code to import:
libname sasfile 'PATH\data';
libname xptfile xport 'PATH\data\tmp.xpt' access=readonly;
proc copy inlib=xptfile outlib=sasfile;
run;
The table looks fine, but the Rate variable doesn't show the decimal point.
In my actual dataset there are many more rows, but it's essentially the same format, and if I run lookup.xport I get this:
Variables in data set `MEASURES':
dataset name type format flength fdigits iformat iflength ifdigits label nobs
MEASURES ID character 0 0 0 0 29064
MEASURES MEASURE character 0 0 0 0 29064
MEASURES NUM numeric 0 0 0 0 29064
MEASURES DEN numeric 0 0 0 0 29064
MEASURES RATE numeric 0 0 0 0 29064
However, if I use the same SAS code to import this, I get something that looks completely off and I can't figure out what's causing it.
I cannot replicate your issue using R (3.4.1) and SAS (9.4 TS1M4) on Mac OS X, with both being 64-bit versions. The 32/64-bit versions can sometimes cause issues.
I used RStudio and SAS UE, both freely available for educational use.
Full R code:
install.packages("SASxport")
library("SASxport")
df_dummy = data.frame(
  Company = c("0001", "0002", "0003", "0004", "0005"),
  Measure = c("A", "B", "C", "D", "E"),
  Num = c(10, 10, 10, 10, 10),
  Den = c(20, 20, 20, 20, 20),
  Rate = c(50.0, 50.0, 50.0, 50.0, 50.0)
)
df_dummy$Company <- as.character(df_dummy$Company)
df_dummy$Measure <- as.character(df_dummy$Measure)
write.xport(df_dummy, file = "tmp.xpt")
Full SAS Code:
libname sasfile '/folders/myfolders/';
libname xptfile xport '/folders/myfolders/tmp.xpt' access=readonly;
proc copy inlib=xptfile outlib=sasfile;
run;
Your example works, even with an older version of R. Make sure your transport file has not been corrupted by transferring it between machines. A transport file is binary data with fixed-length 80-byte records, but much of the data looks like ASCII codes.
SAS transport files follow the SAS V5 rules for names. Make sure that your member name and variable names are valid SAS names and are not longer than 8 characters. Character variables cannot be longer than 200 characters.
You can quickly look at the file using a simple data step, especially for your small example. If you see that the length is not exactly a multiple of 80, or that the header records do not start at the beginning of an 80-byte record, then something has corrupted the file.
data _null_;
  infile '/test/tmp.xpt' lrecl=80 recfm=f;
  input;
  list;
run;
NOTE: The infile '/test/tmp.xpt' is:
Filename=/test/tmp.xpt,
Owner Name=xxxxx,Group Name=xxxxx,
Access Permission=-rw-r--r--,
Last Modified=29Sep2017:09:16:16,
File Size (bytes)=1680
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1 HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!000000000000000000000000000000
2 CHAR SAS SAS SASLIB 7.00 R 3.0.2. 29SEP17:09:16:16
ZONE 54522222545222225454442232332222523232302222222222222222222222223354533333333333
NUMR 3130000031300000313C92007E000000203E0E200000000000000000000000002935017A09A16A16
3 29SEP17:09:16:16
4 HEADER RECORD*******MEMBER HEADER RECORD!!!!!!!000000000000000001600000000140
5 HEADER RECORD*******DSCRPTR HEADER RECORD!!!!!!!000000000000000000000000000000
6 CHAR SAS DF_DUMMYSASDATA 7.00 R 3.0.2. 29SEP17:09:16:16
ZONE 54522222445454455454454232332222523232302222222222222222222222223354533333333333
NUMR 3130000046F45DD9313414107E000000203E0E200000000000000000000000002935017A09A16A16
7 29SEP17:09:16:16
8 HEADER RECORD*******NAMESTR HEADER RECORD!!!!!!!000000000500000000000000000000
9 CHAR ........COMPANY ........
ZONE 00000000444544522222222222222222222222222222222222222222222222220000000022222222
NUMR 020008013FD01E900000000000000000000000000000000000000000000000000000000000000000
10 CHAR ....................................................................MEASURE
ZONE 00000000000000000000000000000000000000000000000000000000000000000000444555422222
NUMR 00000000000000000000000000000000000000000000000000000000000002000802D51352500000
11 CHAR ........ ....................
ZONE 22222222222222222222222222222222222222222222000000002222222200000000000000000000
NUMR 00000000000000000000000000000000000000000000000000000000000000000008000000000000
12 CHAR ................................................NUM
ZONE 00000000000000000000000000000000000000000000000045422222222222222222222222222222
NUMR 000000000000000000000000000000000000000001000803E5D00000000000000000000000000000
13 CHAR ........ ........................................
ZONE 22222222222222222222222200000000222222220000000100000000000000000000000000000000
NUMR 00000000000000000000000000000000000000000000000000000000000000000000000000000000
14 CHAR ............................DEN
ZONE 00000000000000000000000000004442222222222222222222222222222222222222222222222222
NUMR 000000000000000000000100080445E0000000000000000000000000000000000000000000000000
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
15 CHAR ........ ............................................................
ZONE 22220000000022222222000000010000000000000000000000000000000000000000000000000000
NUMR 00000000000000000000000000080000000000000000000000000000000000000000000000000000
16 CHAR ........RATE ........
ZONE 00000000545422222222222222222222222222222222222222222222222222220000000022222222
NUMR 01000805214500000000000000000000000000000000000000000000000000000000000000000000
17 CHAR ....... ....................................................
ZONE 00000002000000000000000000000000000000000000000000000000000022222222222222222222
NUMR 00000000000000000000000000000000000000000000000000000000000000000000000000000000
18 HEADER RECORD*******OBS HEADER RECORD!!!!!!!000000000000000000000000000000
19 CHAR 0001 A A ......B.......B2......0002 B A ......B.......B2......
ZONE 33332222422222224A000000410000004300000033332222422222224A0000004100000043000000
NUMR 00010000100000001000000024000000220000000002000020000000100000002400000022000000
20 CHAR 0003 C A ......B.......B2......0004 D A ......B.......B2......
ZONE 33332222422222224A000000410000004300000033332222422222224A0000004100000043000000
NUMR 00030000300000001000000024000000220000000004000040000000100000002400000022000000
21 CHAR 0005 E A ......B.......B2......
ZONE 33332222422222224A00000041000000430000002222222222222222222222222222222222222222
NUMR 00050000500000001000000024000000220000000000000000000000000000000000000000000000
NOTE: 21 records were read from the infile '/test/tmp.xpt'.
I am examining prescription patterns within a large EHR dataset. The data is structured so that we are given several key bits of information, such as patient_num, encounter_num, ordering_date, medication, age_event (age at event), etc. Example below:
Patient_num enc_num ordering_date medication age_event
1111 888888 07NOV2008 Wellbutrin 48
1111 876578 11MAY2011 Bupropion 50
2222 999999 08DEC2009 Amitriptyline 32
2222 999999 08DEC2009 Escitalopram 32
3333 656463 12APR2007 Imipramine 44
3333 643211 21DEC2008 Zoloft 45
3333 543213 02FEB2009 Fluoxetine 45
Currently I have the dataset sorted by patient_num and then by ordering_date so that I can see what each individual was prescribed during their encounters in longitudinal fashion. For now, I am most concerned with the prescription(s) made during each patient's first visit. I wrote some code to count the number of prescriptions, and had originally restricted later analyses to RX = 1, but as we can see, that doesn't work for people with multiple scripts on the same encounter (patient 2222).
data pt_meds_;
  set pt_meds;
  by patient_num;
  if first.patient_num then RX = 1;
  else RX + 1;
run;
Patient_num enc_num ordering_date medication age_event RX
1111 888888 07NOV2008 Wellbutrin 48 1
1111 876578 11MAY2011 Bupropion 50 2
2222 999999 08DEC2009 Amitriptyline 32 1
2222 999999 08DEC2009 Escitalopram 32 2
3333 656463 12APR2007 Imipramine 44 1
3333 643211 21DEC2008 Zoloft 45 2
3333 543213 02FEB2009 Fluoxetine 45 3
I think it would be more appropriate to recode the encounter numbers into a new variable that follows a style similar to the RX variable, where each encounter is numbered 1-n and the number repeats if multiple scripts are made in the same encounter, such as below:
Patient_num enc_num ordering_date medication age_event RX Enc_
1111 888888 07NOV2008 Wellbutrin 48 1 1
1111 876578 11MAY2011 Bupropion 50 2 2
2222 999999 08DEC2009 Amitriptyline 32 1 1
2222 999999 08DEC2009 Escitalopram 32 2 1
3333 656463 12APR2007 Imipramine 44 1 1
3333 643211 21DEC2008 Zoloft 45 2 2
3333 543213 02FEB2009 Fluoxetine 45 3 3
From what I have seen, this could be possible with a variant of the above code using two BY groups (patient_num & enc_num), but I can't seem to get it. I think the FIRST./LAST. flags require sorting, but if I sort by enc_num, the rows won't be in chronological order, because the encounter numbers are generated by the system and depend on all other encounters going in at that time.
I tried the following code (using ordering_date instead, because it's already sorted properly), but everything under Enc_ is printed as a 1. I'm sure my logic is all wrong. Any thoughts?
data pt_meds_test;
  set pt_meds_;
  by patient_num ordering_date;
  if first.patient_num;
  if first.ordering_date then enc_ = 1;
  else enc_ + 1;
run;
First
FIRST./LAST. flags don't require sorting if the data is properly ordered, or if you use NOTSORTED in your BY statement. If the variable in the BY statement is not properly ordered, the BY statement will throw an error and stop executing when it encounters a deviation. Like this:
data class;
  set sashelp.class;
  by age;
  first = first.age;
  last = last.age;
run;
ERROR: BY variables are not properly sorted on data set SASHELP.CLASS.
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 FIRST.Age=1 LAST.Age=1 first=. last=. _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
Try this code to see exactly how the FIRST./LAST. flags work:
data pt_meds_test;
  set pt_meds_;
  by patient_num ordering_date;
  fp = first.patient_num;
  lp = last.patient_num;
  fo = first.ordering_date;
  lo = last.ordering_date;
run;
Second
That condition works differently than you think:
if expression;
If the expression is true, the data step continues with the instructions after the IF. Otherwise it returns to the beginning of the data step with no implicit output, which means the observation is not kept in the output data set.
In most cases IF without THEN is equivalent to WHERE. However:
- WHERE works faster, but it is limited to variables that come from the data set you are reading
- IF can be used with any type of expression, including calculated fields (see the sketch below)
More info: IF Statement, Subsetting
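A tiny sketch of that difference, using the sashelp.class data set that ships with SAS (bmi is a made-up calculated field):
data teens;
  set sashelp.class;
  bmi = weight / (height * height) * 703;  /* calculated within the step */
  if bmi > 20;  /* subsetting IF can test bmi; WHERE could not, since bmi is not on the input data set */
run;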
Third
I think the LAG() function can be your answer.
data pt_meds_test;
  set pt_meds_;
  by patient_num;
  retain enc_;
  prev_patient_num = lag(patient_num);
  prev_ordering_date = lag(ordering_date);
  if first.patient_num then enc_ = 1;
  else if patient_num = prev_patient_num and ordering_date ne prev_ordering_date then enc_ + 1;
run;
With the LAG() function you can see what the value of a variable was on the previous observation and compare it with the current one.
But be careful: LAG() doesn't literally fetch the variable's value from the previous observation. It takes the current value of the variable and stores it in a FIFO queue of size 1; on the next call it retrieves the stored value from the queue and puts the new value there. This is why calling LAG() conditionally can return values from much earlier observations.
More info: LAG Function
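A quick sketch of that queue behavior, again with sashelp.class: because the LAG() call below only executes for female rows, prev_age holds the age from the previous row where the statement ran, not from the immediately preceding observation.
data lag_demo;
  set sashelp.class;
  /* LAG() enqueues/dequeues only when this statement executes, so under */
  /* a condition the "previous" value can come from several rows back.   */
  if sex = 'F' then prev_age = lag(age);
run;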
I'm not sure if this hurts the rest of your analysis, but what about just
proc freq data=pt_meds noprint;
  tables patient_num*ordering_date / out=pt_meds_freq;
run;

data pt_meds_freq2;
  set pt_meds_freq;
  by patient_num ordering_date;
  if first.patient_num;
run;
I'm using a hash table to store some values. Here are the details:
There will be roughly 1M items to store (not known in advance, so no perfect hash is possible).
The table is 10M slots large.
The hash function is MurmurHash3.
I did some tests and, storing 1M values, I get 350,000 collisions and 30 elements in the most-colliding slot of the hash table.
Are these results good?
Would it make sense to implement binary search for the lists that form at colliding slots?
What's your advice to improve performance?
EDIT: Here is my code
var
  HashList: array [0..10000000 - 1] of Integer;

for I := 0 to High(HashList) do
  HashList[I] := 0;

for I := 1 to 1000000 do
begin
  Y := MurmurHash3(UIntToStr(I));
  Y := Y mod Length(HashList);
  Inc(HashList[Y]);
  if HashList[Y] > 1 then
    Inc(TotalCollisionsCount);
  if HashList[Y] > MostCollidingSlotItemCount then
    MostCollidingSlotItemCount := HashList[Y];
end;

Writeln('Total: ' + IntToStr(TotalCollisionsCount) + ' Max: ' + IntToStr(MostCollidingSlotItemCount));
Here is the result I get:
Total: 48169 Max: 5
Am I missing something?
This is what you get when you put 1M items randomly into 10M cells:
calendar_size=10000000 nperson = 1000000
E/cell| Ncell | frac | Nelem | frac |h/cell| hops | Cumhops
----+---------+--------+----------+--------+------+--------+--------
0: 9048262 (0.904826) 0 (0.000000) 0 0 0
1: 905064 (0.090506) 905064 (0.905064) 1 905064 905064
2: 45136 (0.004514) 90272 (0.090272) 3 135408 1040472
3: 1488 (0.000149) 4464 (0.004464) 6 8928 1049400
4: 50 (0.000005) 200 (0.000200) 10 500 1049900
----+---------+--------+----------+--------+------+--------+--------
5: 10000000 1000000 1.049900 1049900
The left column is the number of items in a cell; the second is the number of cells having that item count.
WRT the binary search: it is obvious that for short chains like these (maximum chain length = 4, and most chains have length 1), linear search outperforms binary search. The takeover point is probably somewhere between 10 and 100 elements.
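For reference, these numbers follow from the Poisson approximation: hashing n = 1,000,000 keys uniformly into m = 10,000,000 cells gives a per-cell load of lambda = n/m = 0.1, and the expected number of cells holding exactly k items is m * e^(-lambda) * lambda^k / k!. That works out to about 9,048,374 empty cells, 904,837 cells with one item, 45,242 with two, 1,508 with three, and 38 with four, matching the table above. The expected number of collisions is n - m*(1 - e^(-lambda)), roughly 48,374, in line with the Total of 48,169 you measured.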