Removing duplicate records from .xdf file - R

I would like to remove the duplicate records from my large .xdf file trans.xdf.
Here are the file details:
File name: /poc/revor/data/trans.xdf
Number of observations: 1000000000
Number of variables: 5
Number of blocks: 40
Compression type: zlib
Variable information:
Var 1: CARD_ID, Type: character
Var 2: SE_NO, Type: character
Var 3: r12m_cv, Type: numeric, Low/High: (-2348.7600, 40587.3900)
Var 4: r12m_roc, Type: numeric, Low/High: (0.0000, 231.0000)
Var 5: PROD_GRP_CD, Type: character
Below is a sample of the data in the file:
CARD_ID SE_NO r12m_cv r12m_roc PROD_GRP_CD
900000999000000000 1045815024 110 1 1
900000999000000000 1052487253 247.52 2 1
900000999000000000 9999999999 38.72 1 1
900000999000000000 1090389768 1679.96 16 1
900000999000000000 1091226035 0 1 1
900000999000000000 1091241208 538.68 4 1
900000999000000000 9999999999 83 1 1
900000999000000000 1091468041 148.4 3 1
900000999000000000 1092640358 3.13 1 1
900000999000000000 1093468692 546.29 1 1
I have tried using the rxDataStep function, with its transform parameter calling the unique() function over the .xdf file. Below is the code:
uniq_dat <- function( dataList )
{
    datalist <- unique(datalist)
    return(datalist)
}

rxDataStepXdf(inFile = "/poc/revor/data/trans.xdf", outFile = "/poc/revor/data/trans.xdf", transformFunc = uniq_dat, overwrite = TRUE)
But I was getting the error below:
Error in unique(datalist) : object 'datalist' not found
Error in transformation function: Error in unique(datalist) : object 'datalist' not found
Error in rxCall("RxDataStep", params) :
Could anybody point out the mistake I am making here, or suggest a better way to remove the duplicate records from the .xdf file? I am avoiding loading the data into an in-memory data frame, as the data is pretty huge.
I am running the above code in Revolution R Environment over HDFS.
If the same result can be obtained by any other approach, an example would be appreciated.
Thanks for the help in advance :)
Cheers,
Amit

You can remove the duplicate values by passing the removeDupKeys = TRUE parameter to the rxSort() function. For example, for your case:
XdfFilePath <- file.path("<your file's fully qualified path>/trans.xdf")
rxSort(inData = XdfFilePath, sortByVars = c("CARD_ID", "SE_NO", "r12m_cv", "r12m_roc", "PROD_GRP_CD"), removeDupKeys = TRUE)
If you want to remove duplicate records based on a specific key column, for example the SE_NO column, set the key as sortByVars = "SE_NO", as in the sketch below.
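For instance, a minimal sketch of that variant (the output path trans_dedup.xdf is hypothetical; with removeDupKeys = TRUE, rxSort keeps the first record for each distinct key):

# Rows sharing an SE_NO are treated as duplicates; the first one is kept
rxSort(inData = "/poc/revor/data/trans.xdf",
       outFile = "/poc/revor/data/trans_dedup.xdf",  # hypothetical output path
       sortByVars = "SE_NO",
       removeDupKeys = TRUE,
       overwrite = TRUE)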

Related

Pyomo constraint issue: not returning constrained result

I set up a constraint that does not constrain the solver in Pyomo.
The constraint is the following:
def revenue_positive(model,t):
    for t in model.T:
        return (model.D[t] * model.P[t]) >= 0
    model.positive_revenue = Constraint(model.T, rule=revenue_positive)
while the model parameters are:
model = ConcreteModel()
model.T = Set(doc='quarter of year', initialize=df.index.tolist(), ordered=True)
model.P = Param(model.T, initialize=df['price'].to_dict(), within=Any, doc='Price for each quarter')
model.C = Var(model.T, domain=NonNegativeReals)
model.D = Var(model.T, domain=NonNegativeReals)
income = sum(df.loc[t, 'price'] * model.D[t] for t in model.T)
expenses = sum(df.loc[t, 'price'] * model.C[t] for t in model.T)
profit = income - expenses
model.objective = Objective(expr=profit, sense=maximize)
# Solve the model
solver = SolverFactory('cbc')
solver.solve(model)
The df dataframe is:
df time_stamp price Status imbalance Difference Situation ... week month hour_of_day day_of_week day_of_year yearly_quarter
quarter ...
0 2021-01-01 00:00:00 64.84 Final 16 -3 Deficit ... 00 1 0 4 1 1
1 2021-01-01 00:15:00 13.96 Final 38 2 Surplus ... 00 1 0 4 1 1
2 2021-01-01 00:30:00 12.40 Final 46 1 Surplus ... 00 1 0 4 1 1
3 2021-01-01 00:45:00 7.70 Final 65 14 Surplus ... 00 1 0 4 1 1
4 2021-01-01 01:00:00 64.25 Final 3 -9 Deficit ... 00 1 1 4 1 1
The objective is to constrain the solver so that it does not accept negative revenue. As written, it does not work: the solver passes 6 negative revenue values through. Looking at the indices with negative revenue, it appears the system chooses to sell at a negative price in order to buy later at an even "more" negative price, so from an optimization standpoint it is fine. I would like to check how the results differ if we prohibit the solver from doing that. Any input is welcome; after many searches on the web, I still have not found the right way to write this.
I did a pprint() of the constraint that returned:
positive_revenue : Size=35040, Index=T, Active=True
UPDATE: the new constraint code is now:
def revenue_positive(model, t):
    return model.D[t] * model.P[t] >= 0

model.positive_revenue = Constraint(model.T, rule=revenue_positive)
It returns the following error:
ERROR: Rule failed when generating expression for constraint positive_revenue
with index 283: ValueError: Invalid constraint expression. The constraint
expression resolved to a trivial Boolean (True) instead of a Pyomo object.
Please modify your rule to return Constraint.Feasible instead of True.
Error thrown for Constraint 'positive_revenue[283]'
ERROR: Constructing component 'positive_revenue' from data=None failed:
ValueError: Invalid constraint expression. The constraint expression
resolved to a trivial Boolean (True) instead of a Pyomo object. Please
modify your rule to return Constraint.Feasible instead of True.
Error thrown for Constraint 'positive_revenue[283]'
Traceback (most recent call last):
  File "/home/olivier/Desktop/Elia - BESS/run_imbalance.py", line 25, in <module>
    results_df = optimize_year(df)
  File "/home/olivier/Desktop/Elia - BESS/battery_model_imbalance.py", line 122, in optimize_year
    model.positive_revenue = Constraint(model.T, rule=revenue_positive)
  File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/block.py", line 542, in __setattr__
    self.add_component(name, val)
  File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/block.py", line 1087, in add_component
    val.construct(data)
  File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/constraint.py", line 781, in construct
    self._setitem_when_not_present(
  File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/indexed_component.py", line 778, in _setitem_when_not_present
    obj.set_value(value)
  File "/home/olivier/anaconda3/lib/python3.9/site-packages/pyomo/core/base/constraint.py", line 506, in set_value
    raise ValueError(
ValueError: Invalid constraint expression. The constraint expression resolved to a trivial Boolean (True) instead of a Pyomo object. Please modify your rule to return Constraint.Feasible instead of True.
Error thrown for Constraint 'positive_revenue[283]'
So there are 2 issues with your constraint. It isn't clear if one is a cut-and-paste issue or not.
The call that makes the constraint appears to be indented inside of your function, after the return statement, making it unreachable code. That could be just the spacing in your post.
You are incorrectly adding a loop inside of your function. You pass in the parameter t as a function argument and then blow it away with the for loop, which only executes for the first value of t in T and then hits the return statement. Remove the loop. When you use the rule= structure in Pyomo, it calls the rule once for each member of the set you pass to Constraint(xx, rule=).
So I think you should have:
def revenue_positive(model, t):
    return model.D[t] * model.P[t] >= 0

model.positive_revenue = Constraint(model.T, rule=revenue_positive)
Updated re: the error you added.
The error cites the 283rd index. My bet is that price[283] is zero, so you are multiplying by a zero and killing your variable.
You could add a check within the function for a zero price and, in that case, just return pyo.Constraint.Feasible, which is the trivial return that doesn't influence the model (or crash), as sketched below.
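A minimal sketch of that guard, assuming Pyomo is imported as pyo (the zero-price test mirrors the guess above; it is an assumption, not a confirmed cause):

import pyomo.environ as pyo

def revenue_positive(model, t):
    # A zero price turns the expression into the trivial Boolean 0 >= 0,
    # which Pyomo rejects; return Feasible to skip this index instead.
    if pyo.value(model.P[t]) == 0:
        return pyo.Constraint.Feasible
    return model.D[t] * model.P[t] >= 0

model.positive_revenue = pyo.Constraint(model.T, rule=revenue_positive)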

How to perform pandas drop_duplicates based on index column

I am banging my head against the wall trying to drop duplicates from a time series, based on the value of a datetime index.
My function is the following:
def csv_import_merge_T(f):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True,
                        names=['datetime', 'temp', 'rh'], header=0) for fp in files]
    dfT = pd.concat(dfsT)
    #print dfT.head(); print dfT.index; print dfT.dtypes
    dfT.drop_duplicates(subset=index, inplace=True)
    dfT.resample('H').bfill()
    return dfT
which is called by:
inputcsvT = ['./input_csv/A08_KI_T*.csv']
for csvnameT in inputcsvT:
    files = glob.glob(csvnameT)
    print ('___'); print (files)
    t = csv_import_merge_T(files)
print csvT
I receive the error:
NameError: global name 'index' is not defined
What is wrong?
UPDATE:
The issue appears to arise when the CSV input files (which are to be concatenated) overlap.
inputcsvT = ['./input_csv/A08_KI_T*.csv'] gets files
A08_KI_T5
28/05/2015 17:00,22.973,24.021
...
08/10/2015 13:30,24.368,45.974
A08_KI_T6
08/10/2015 14:00,24.779,41.526
...
10/02/2016 17:00,22.326,41.83
and it runs correctly, whereas:
inputcsvT = ['./input_csv/A08_LR_T*.csv'] gathers
A08_LR_T5
28/05/2015 17:00,22.493,25.62
...
08/10/2015 13:30,24.296,44.596
A08_LR_T6
28/05/2015 17:00,22.493,25.62
...
10/02/2016 17:15,21.991,38.45
which leads to an error.
IIUC you can call reset_index and then drop_duplicates and then set_index again:
In [304]:
df = pd.DataFrame(data=np.random.randn(5,3), index=list('aabcd'))
df
Out[304]:
0 1 2
a 0.918546 -0.621496 -0.210479
a -1.154838 -2.282168 -0.060182
b 2.512519 -0.771701 -0.328421
c -0.583990 -0.460282 1.294791
d -1.018002 0.826218 0.110252
In [308]:
df.reset_index().drop_duplicates('index').set_index('index')
Out[308]:
0 1 2
index
a 0.918546 -0.621496 -0.210479
b 2.512519 -0.771701 -0.328421
c -0.583990 -0.460282 1.294791
d -1.018002 0.826218 0.110252
EDIT
Actually there is a simpler method: call duplicated on the index and invert it:
In [309]:
df[~df.index.duplicated()]
Out[308]:
0 1 2
index
a 0.918546 -0.621496 -0.210479
b 2.512519 -0.771701 -0.328421
c -0.583990 -0.460282 1.294791
d -1.018002 0.826218 0.110252
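Putting that together with the function from the question, a minimal sketch (assuming the CSV layout from the question; the duplicated() filter keeps the first row for each repeated timestamp):

import glob
import pandas as pd

def csv_import_merge_T(files):
    # Read each CSV with the first column parsed as a datetime index
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True,
                        names=['datetime', 'temp', 'rh'], header=0)
            for fp in files]
    dfT = pd.concat(dfsT)
    # Keep only the first row for each duplicated index value
    dfT = dfT[~dfT.index.duplicated()]
    # Resample hourly and back-fill, as in the original function
    return dfT.resample('H').bfill()

files = glob.glob('./input_csv/A08_KI_T*.csv')
t = csv_import_merge_T(files)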

Error in lis[[i]] : attempt to select less than one element

This code is meant to compute the total distance of some given coordinates, but I don't know why it's not working.
The error is: Error in lis[[i]] : attempt to select less than one element.
Here is the code:
distant <- function(a, b)
{
    return(sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2))
}

totdistance <- function(lis)
{
    totdis = 0
    for (i in 1:length(lis) - 1)
    {
        totdis = totdis + distant(lis[[i]], lis[[i + 1]])
    }
    totdis = totdis + distant(lis[[1]], lis[[length(lis)]])
    return(totdis)
}
liss1<-list()
liss1[[1]]<-c(12,12)
liss1[[2]]<-c(18,23)
liss1[[4]]<-c(29,25)
liss1[[5]]<-c(31,52)
liss1[[3]]<-c(24,21)
liss1[[6]]<-c(36,43)
liss1[[7]]<-c(37,14)
liss1[[8]]<-c(42,8)
liss1[[9]]<-c(51,47)
liss1[[10]]<-c(62,53)
liss1[[11]]<-c(63,19)
liss1[[12]]<-c(69,39)
liss1[[13]]<-c(81,7)
liss1[[14]]<-c(82,18)
liss1[[15]]<-c(83,40)
liss1[[16]]<-c(88,30)
Output:
> totdistance(liss1)
Error in lis[[i]] : attempt to select less than one element
> distant(liss1[[2]],liss1[[3]])
[1] 6.324555
Let me reproduce your error in a simple way:
> list1 = list()
> list1[[0]] = list(a = c("a"))
Error in list1[[0]] = list(a = c("a")) :
  attempt to select less than one element
So the next question is: where are you accessing index 0 of a list? (Indexing of lists starts at 1 in R.)
As Molx indicated in a previous post, "The : operator is evaluated before the subtraction". This is what causes the zero-index list access.
For example:
> 1:10-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(10-1)
[1] 1 2 3 4 5 6 7 8 9
So replace the following lines of your code
>for(i in 1:(length(lis)-1))
{
totdis=totdis+distant(lis[[i]],lis[[i+1]])
}
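With that fix applied, the full function reads as follows (a sketch; nothing else changes from the question):

totdistance <- function(lis)
{
    totdis = 0
    for (i in 1:(length(lis) - 1))
    {
        totdis = totdis + distant(lis[[i]], lis[[i + 1]])
    }
    # close the tour: distance from the last point back to the first
    totdis = totdis + distant(lis[[1]], lis[[length(lis)]])
    return(totdis)
}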

R: field '$addToSet' in '$addToSet' is not valid for storage: rmongodb

I am trying to write a generic upsert for a tick database in R.
The python code would be:
collection.update({'symbol': 'somesymbol', 'sha': 'SoM3__W3|Re|7__Sh#'},
                  {'$set': {'segment': 5},
                   '$addToSet': {'parent': parent_id}},
                  upsert=True)
In R, I am using rmongodb and trying to build the BSON objects:
# get the query
mtch_b <- mongo.bson.buffer.create()
mongo.bson.buffer.append(mtch_b, "symbol", "somesymbol")
mongo.bson.buffer.append(mtch_b, "sha", "SoM3__W3|Re|7__Sh#")
mtch <- mongo.bson.from.buffer(mtch_b)

# set the segment
qry_b <- mongo.bson.buffer.create()
mongo.bson.buffer.start.object(qry_b, "$set")
mongo.bson.buffer.append(qry_b, "segment", 5)
mongo.bson.buffer.start.object(qry_b, "$addToSet")
mongo.bson.buffer.append(qry_b, "parent", "Initial")
mongo.bson.buffer.finish.object(qry_b)  # end of $addToSet object
mongo.bson.buffer.finish.object(qry_b)  # end of $set object
qry_bsn <- mongo.bson.from.buffer(qry_b)

mongo.update(mongo, "M__test.tmp", mtch, qry_bsn, flags = mongo.update.upsert)
When I run this I get an error:
"The dollar ($) prefixed field '$addToSet' in '$addToSet' is not valid for storage."
Looking at qry_bsn:
qry_bsn
$set : 3
segment : 4
0 : 1 1.000000
1 : 1 2.000000
2 : 1 3.000000
3 : 1 4.000000
$addToSet : 3
parent : 2 Initial
When I remove the start, append, and finish calls for the $addToSet object, the query runs fine.
Any help on how to do this would be much appreciated.
I can't find a reason not to use mongo.bson.from.list. It makes all the mongo.bson.buffer.* calls for you, and there is much less chance of producing a bug in the BSON construction.
query <- mongo.bson.from.list(list("symbol" = "somesymbol", "sha" = "SoM3__W3|Re|7__Sh#"))
upd_obj <- mongo.bson.from.list(list('$set' = list('segment' = 1:4), '$addToSet' = list('parent' = 'PARENT_ID')))
mongo.update(mongo = mongo, ns = "M__test.tmp", criteria = query, objNew = upd_obj, flags=mongo.update.upsert)

How can I apply fisher test on this set of data (nominal variables)

I'm pretty new to statistics:
fisher = function(idxToTest, idxATI){
    idxDependent = c()
    dependent = c()
    p = c()
    for (i in c(1:length(idxToTest)))
    {
        tbl = table(data[[idxToTest[i]]], data[[idxATI]])
        rez = fisher.test(tbl, workspace = 20000000000)
        if (rez$p.value < 0.1) {
            dependent = c(dependent, TRUE)
            if (rez$p.value < 0.1) {
                idxDependent = c(idxDependent, idxToTest[i])
            }
        }
        else {
            dependent = c(dependent, FALSE)
        }
        p = c(p, rez$p.value)
    }
}
This is the function I use, and it seems to work.
What I have understood so far is that I have to pass data like the following as the first parameter:
Men Women
Dieting 10 30
Non-dieting 5 60
My data comes from a CSV:
data = read.csv('***.csv', header = TRUE, sep=',');
My first problem is that I don't know how to convert from:
Loan.Purpose Home.Ownership
lp_value_1 ho_value_2
lp_value_1 ho_value_2
lp_value_2 ho_value_1
lp_value_3 ho_value_2
lp_value_2 ho_value_3
lp_value_4 ho_value_2
lp_value_3 ho_value_3
to:
ho_value_1 ho_value_2 ho_value_3
lp_value1 0 2 0
lp_value2 1 0 1
lp_value3 0 1 1
lp_value4 0 1 0
The second issue is that I don't know what the second parameter should be.
POST UPDATE: This is what I get using fisher.test(myTable):
Error in fisher.test(test) : FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.
where myTable is:
MORTGAGE NONE OTHER OWN RENT
car 18 0 0 5 27
credit_card 190 0 2 38 214
debt_consolidation 620 0 2 87 598
educational 5 0 0 3 7
...
Basically, Fisher tests only work on smallish data sets because they require a lot of memory. But all is good, because chi-squared tests make minimal additional assumptions and are much easier on the computer. Just do:
chisq.test(data$Loan.Purpose, data$Home.Ownership)
to get your p-values.
Make sure you read through and understand the help page for chisq.test, especially the examples at the bottom.
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
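As for the first problem in the question, converting the raw columns into a contingency table: base R's table() does that directly. A minimal sketch, using the column names from the question:

# Cross-tabulate the two nominal columns into a contingency table
myTable <- table(data$Loan.Purpose, data$Home.Ownership)
myTable

# chisq.test (and fisher.test) also accept the table directly
chisq.test(myTable)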
Then look at a mosaic plot to see the quantities, like:
mosaicplot(table(data$Loan.Purpose, data$Home.Ownership))
This reference explains how mosaic plots work:
http://alumni.media.mit.edu/~tpminka/courses/36-350.2001/lectures/day12/
