How to perform pandas drop_duplicates based on index column - datetime

I am banging my head against the wall trying to drop duplicates from a time series, based on the value of a datetime index.
My function is the following:
def csv_import_merge_T(f):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True, names=['datetime','temp','rh'], header=0) for fp in files]
    dfT = pd.concat(dfsT)
    #print dfT.head(); print dfT.index; print dfT.dtypes
    dfT.drop_duplicates(subset=index, inplace=True)
    dfT.resample('H').bfill()
    return dfT
which is called by:
inputcsvT = ['./input_csv/A08_KI_T*.csv']
for csvnameT in inputcsvT:
    files = glob.glob(csvnameT)
    print ('___'); print (files)
    t = csv_import_merge_T(files)
    print (t)
I receive the error
NameError: global name 'index' is not defined
What is wrong?
UPDATE:
The issue appears to arise when the CSV input files (which are to be concatenated) overlap.
inputcsvT = ['./input_csv/A08_KI_T*.csv'] gets files
A08_KI_T5
28/05/2015 17:00,22.973,24.021
...
08/10/2015 13:30,24.368,45.974
A08_KI_T6
08/10/2015 14:00,24.779,41.526
...
10/02/2016 17:00,22.326,41.83
and it runs correctly, whereas:
inputcsvT = ['./input_csv/A08_LR_T*.csv'] gathers
A08_LR_T5
28/05/2015 17:00,22.493,25.62
...
08/10/2015 13:30,24.296,44.596
A08_LR_T6
28/05/2015 17:00,22.493,25.62
...
10/02/2016 17:15,21.991,38.45
which leads to an error.

IIUC you can call reset_index, then drop_duplicates, and then set_index again:
In [304]:
df = pd.DataFrame(data=np.random.randn(5,3), index=list('aabcd'))
df

Out[304]:
          0         1         2
a  0.918546 -0.621496 -0.210479
a -1.154838 -2.282168 -0.060182
b  2.512519 -0.771701 -0.328421
c -0.583990 -0.460282  1.294791
d -1.018002  0.826218  0.110252

In [308]:
df.reset_index().drop_duplicates('index').set_index('index')

Out[308]:
              0         1         2
index
a      0.918546 -0.621496 -0.210479
b      2.512519 -0.771701 -0.328421
c     -0.583990 -0.460282  1.294791
d     -1.018002  0.826218  0.110252
EDIT
Actually, there is a simpler method: call duplicated on the index and invert it:
In [309]:
df[~df.index.duplicated()]

Out[309]:
              0         1         2
index
a      0.918546 -0.621496 -0.210479
b      2.512519 -0.771701 -0.328421
c     -0.583990 -0.460282  1.294791
d     -1.018002  0.826218  0.110252
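For the question's function, that last idiom can replace the failing drop_duplicates call, since the duplicates live in the index rather than in a column. A minimal sketch (assuming the first reading should win when files overlap, and keeping the hourly resample from the question):

import glob
import pandas as pd

def csv_import_merge_T(files):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True,
                        names=['datetime', 'temp', 'rh'], header=0)
            for fp in files]
    dfT = pd.concat(dfsT)
    # Drop rows whose datetime index value has already been seen.
    dfT = dfT[~dfT.index.duplicated(keep='first')]
    # resample returns a new object, so return the result rather than discarding it.
    return dfT.resample('H').bfill()

t = csv_import_merge_T(glob.glob('./input_csv/A08_LR_T*.csv'))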

Related

How to correct the output generated through str_detect/str_contains in R

I have a column "methods_discussed" in a CSV file (link: https://github.com/pandas-dev/pandas/files/3496001/multiple_responses.zip):
multi <- read.csv("multiple_responses.csv", header = T)
This file has names of family planning methods as values in that column, like:
methods_discussed
emergency female_sterilization male_sterilization iud NaN injectables male_condoms -77 male_condoms female_sterilization male_sterilization injectables iud male_condoms
I created a vector of the 8 family planning methods (excluding -77 and NaN):
method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')
I want to create new indicator variables in the existing data frame multi2, based on the names in the vector method_names. For this I used (I):
for (abc in method_names) {
  multi2[abc] <- as.integer(str_detect(multi2$methods_discussed, fixed(abc)))
}
(II)
for (abc in method_names) {
  multi2[abc] <- as.integer(str_contains(abc, multi2$methods_discussed))
}
(III) I also tried
for (abc in method_names) {
  multi2[abc] <- as.integer(stri_detect_fixed(multi2$methods_discussed, abc))
}
but the output does not match what I expect. Probably because male_sterilization is a substring of female_sterilization, the code shows 1 (TRUE) for male_sterilization when female_sterilization is present, as in row 2 of the Actual output below. It must show 0 (FALSE), since only female_sterilization appears in the methods_discussed column at row 2. I also don't want to generate any 0/1 (FALSE/TRUE) values for rows where methods_discussed is -77 or blank; those cells should stay blank (all highlighted in the Expected output).
Actual Output
Expected Output
There is no error in the code, only in the output.
You can add word boundaries to fix that issue.
multi <- read.csv("multiple_responses.csv", header = T)
method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')
for (abc in method_names) {
  multi[abc] <- as.integer(grepl(paste0('\\b', abc, '\\b'), multi$methods_discussed))
}
multi[multi$methods_discussed %in% c('', -77), method_names] <- ''

Syntax error when using count in loop

I am trying to run a loop where I count the total number of observations in each file under the variable _merge, and then count specific outcomes of _merge, such as _merge==1 and so on. I then want to calculate percentages by dividing each _merge count by the total.
Below is my code:
/* define local list */
local ward_names B C D E FN FS GS HE

/* loop over each dbase */
foreach file of local ward_names {
    use "../../../cleaning/sra/output/`file'_ward_CTS_Merged.dta", clear
    count if _merge
    local ward_count=r(N)
    count if _merge==1
    local count_master=r(N)
    count if _merge==2
    local count_using=r(N)
    count if _merge==3
    local count_match=r(N)
    clear
    set obs 1
    g ward_count='ward_count'
    g count_master=`count_master'
    g count_using=`count_using'
    g count_match=`count_match'
    g ward= "`file'"
    save "../temp/`file'_collapsed_diagnostics.dta", replace
    clear
}
The code was running fine until I tried to add the total count for each ward file:
g ward_count='ward_count'
'ward_count' invalid name
Is this a syntax error or something more severe?
You need to use ` instead of ' when you refer to a local macro:
generate ward_count = `ward_count'
EDIT:
As per @NickCox's recommendation, you can improve your code by using the tabulate command with its matcell() option to get the counts all at once:
tabulate _merge, matcell(A)

                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        master only (1) |          1       16.67       16.67
            matched (3) |          5       83.33      100.00
------------------------+-----------------------------------
                  Total |          6      100.00

matrix list A

A[2,1]
    c1
r1   1
r2   5
So you could then do the following:
generate count_master = A[1,1]
generate count_match = A[2,1]

is it possible to get a new instance for namedtuple pushed into a dictionary before values are known?

It looks like things are going wrong on line 9 for me. There I wish to push a new copy of the TagsTable into a dictionary. I'm aware that once a namedtuple field is recorded, it cannot be changed. However, the results baffle me, as it looks like the values do change: when this code exits, all entries of mp3_tags[ any of the three dictionary keys ].date are set to the last date, "1999_03_21".
So, two questions:
Is there a way to get a new TagsTable pushed into the dictionary?
Why doesn't the code fail and disallow the second (and even third) date being written to the TagsTable.date field (since all the entries seem to be references to the same namedtuple)? I thought you could not write a second value?
 1 from collections import namedtuple
 2 TagsTable = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])
 3 mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
 4 dates = ['1999_01_07', '1999_02_14', '1999_03_21']
 5
 6 mp3_tags = {}
 7
 8 for mp3file in mp3files:
 9     mp3_tags[mp3file] = TagsTable
10
11 for mp3file,date_string in zip(mp3files,dates):
12     mp3_tags[mp3file].date = date_string
13
14 for mp3file in mp3files:
15     print( mp3_tags[mp3file].date )
Looks like this is the fix I was looking for:
from collections import namedtuple

mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']

mp3_tags = {}

for mp3file in mp3files:
    mp3_tags[mp3file] = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])

for mp3file,date_string in zip(mp3files,dates):
    mp3_tags[mp3file].date = date_string

for mp3file in mp3files:
    print( mp3_tags[mp3file].date )
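That works because each dictionary value is now a distinct namedtuple class, so assigning .date rebinds a separate class attribute on each one. The original code stored the same class under every key, which is why one assignment showed up everywhere. For reference, a minimal sketch of the more conventional pattern, storing instances and updating them with _replace (the empty-string placeholders for the unknown fields are an assumption):

from collections import namedtuple

TagsTable = namedtuple('TagsTable', ['title', 'date', 'subtitle', 'artist',
                                     'summary', 'length', 'duration', 'pub_date'])

mp3files = ['42-001.mp3', '42-002.mp3', '42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']

# Store one instance per file; placeholder values fill the unknown fields.
placeholder = TagsTable(*[''] * len(TagsTable._fields))
mp3_tags = {mp3file: placeholder for mp3file in mp3files}

# namedtuple instances are immutable, so _replace returns a new instance.
for mp3file, date_string in zip(mp3files, dates):
    mp3_tags[mp3file] = mp3_tags[mp3file]._replace(date=date_string)

for mp3file in mp3files:
    print(mp3_tags[mp3file].date)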

Removing duplicate records from .Xdf file

I would like to remove the duplicate records from my large .xdf file trans.xdf.
Here are the file details:
File name: /poc/revor/data/trans.xdf
Number of observations: 1000000000
Number of variables: 5
Number of blocks: 40
Compression type: zlib
Variable information:
Var 1: CARD_ID, Type: character
Var 2: SE_NO, Type: character
Var 3: r12m_cv, Type: numeric, Low/High: (-2348.7600, 40587.3900)
Var 4: r12m_roc, Type: numeric, Low/High: (0.0000, 231.0000)
Var 5: PROD_GRP_CD, Type: character
Also below is the sample data of the file:
CARD_ID             SE_NO        r12m_cv  r12m_roc  PROD_GRP_CD
900000999000000000  1045815024    110           1            1
900000999000000000  1052487253    247.52        2            1
900000999000000000  9999999999     38.72        1            1
900000999000000000  1090389768   1679.96       16            1
900000999000000000  1091226035      0           1            1
900000999000000000  1091241208    538.68        4            1
900000999000000000  9999999999     83           1            1
900000999000000000  1091468041    148.4         3            1
900000999000000000  1092640358      3.13        1            1
900000999000000000  1093468692    546.29        1            1
I have tried using the rxDataStep function, using its transform parameter to call the unique() function over the .xdf file. Below is the code for the same:
uniq_dat <- function( dataList )
{
  datalist <- unique(datalist)
  return(datalist)
}

rxDataStepXdf(inFile = "/poc/revor/data/trans.xdf", outFile = "/poc/revor/data/trans.xdf", transformFunc = uniq_dat, overwrite = TRUE)
But I was getting the error below:
Error in unique(datalist) : object 'datalist' not found
Error in transformation function: Error in unique(datalist) : object 'datalist' not found
Error in rxCall("RxDataStep", params) :
Could anybody point out the mistake I am making here, or suggest a better way to remove the duplicate records from the .xdf file? I am avoiding loading the data into an in-memory data frame, as the data is pretty huge.
I am running the above code in the Revolution R environment over HDFS.
If the same can be achieved by any other approach, an example would be appreciated.
Thanks for the help in advance :)
Cheers,
Amit
You can remove the duplicate values by providing the removeDupKeys=TRUE parameter to the rxSort() function. For example, for your case:
XdfFilePath <- file.path("<your file's fully qualified path>/trans.xdf")
rxSort(inData = XdfFilePath, sortByVars = c("CARD_ID", "SE_NO", "r12m_cv", "r12m_roc", "PROD_GRP_CD"), removeDupKeys = TRUE)
If you want to remove duplicate records based on a specific key column, for example the SE_NO column, set the key value as sortByVars="SE_NO".

IndexError: list index out of range, scores.append( (fields[0], fields[1]))

I'm trying to read a file and put its contents in a list. I have done this many times before and it has worked, but this time it throws back the error "list index out of range".
The code is:
with open("File.txt") as f:
scores = []
for line in f:
fields = line.split()
scores.append( (fields[0], fields[1]))
print(scores)
The text file is in the format:
Alpha:[0, 1]
Bravo:[0, 0]
Charlie:[60, 8, 901]
Foxtrot:[0]
I can't see why it is giving me this problem. Is it because I have more than one value for each item? Or is it the fact that I have a colon in my text file? How can I get around this problem?
Thanks
If I understand you well, this code will print your desired result:
import re

with open("File.txt") as f:
    # Let's make a dictionary for scores {name: scores}.
    scores = {}
    # Define regular expressions to parse the team name and team scores from each line.
    patternScore = r'\[([^\]]+)\]'
    patternName = r'(.*):'
    for line in f:
        # Find the values for the team name and its scores.
        fields = re.search(patternScore, line).groups()[0].split(', ')
        name = re.search(patternName, line).groups()[0]
        # Update the dictionary with the new value.
        scores[name] = fields

# Print output: first element of the value list first, then the key name.
for key in scores:
    print(scores[key][0] + ':' + key)
You will receive the following output:
60:Charlie
0:Alpha
0:Bravo
0:Foxtrot
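As for why the original code fails: line.split() splits on whitespace, so a line such as Foxtrot:[0] yields only one field and fields[1] raises IndexError. A minimal sketch of a guard that keeps the original tuple-building approach (padding short lines with None is an assumption about the behaviour you want):

with open("File.txt") as f:
    scores = []
    for line in f:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # One-field lines such as "Foxtrot:[0]" would make fields[1] fail,
        # so pad the pair instead of indexing past the end.
        scores.append((fields[0], fields[1] if len(fields) > 1 else None))
print(scores)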
