Range index on mixed content node in eXist-db - XQuery

My XML file has the following structure:
<root>
<compound>abc<parts>a b c</parts></compound>
<compound>xyz<parts>x y z</parts></compound>
</root>
I have created a range index with:
<range>
<create qname="compound" type="xs:string"/>
</range>
I expected the index terms to be "abca b c" and "xyzx y z", but under the index link in the Monitoring and Profiling window I only found "abc" and "xyz". Also, the query
//compound[.="abca b c"]
returns 0 results.
Can anyone help me create an index on the whole string content of compound, i.e. on "abca b c", "xyzx y z", and so on?
Thanks
sony

In XQuery, you can use the data() function to return the atomized value of an element, which includes the text of all its descendants. With the sample document above, //compound/data() yields "abca b c" and "xyzx y z", so to test whether the full value of the compound element can be matched you can use the following:
//compound/data()[.="abca b c"]

The nested="yes" attribute solved the problem.
I changed the range index to:
<range>
<create qname="compound" type="xs:string" nested="yes" />
</range>
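For completeness, a range index like this goes inside an index configuration document; below is a minimal sketch of the whole file, where the target path (conventionally /db/system/config/<your-data-collection>/collection.xconf) is an assumption and must mirror the collection your documents are stored in:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <range>
            <!-- nested="yes" indexes the whole string value of compound,
                 including text inside child elements such as parts -->
            <create qname="compound" type="xs:string" nested="yes"/>
        </range>
    </index>
</collection>

After storing the configuration, reindex the data collection (e.g. with xmldb:reindex) so existing documents are covered; then //compound[.="abca b c"] can be answered from the index.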


Issues with importing R Data due to formatting

I'm trying to import txt data into R; however, due to the txt file's unique formatting, I'm unsure how to do this. I believe the issue is that the txt file was formatted to line up columns under the column names, and since it's a text file, this was done with varying numbers of spaces. For example:
Gene           Chromosomal     Swiss-Prot             MIM    Description
name           position        AC         Entry name  code
______________ _______________ ______________________ ______ ______________________
A3GALT2 1p35.1 U3KPV4 A3LT2_HUMAN Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]
AADACL3 1p36.21 Q5VUY0 ADCL3_HUMAN Arylacetamide deacetylase-like 3 (EC 3.1.1.-)
AADACL4 1p36.21 Q5VUY2 ADCL4_HUMAN Arylacetamide deacetylase-like 4 (EC 3.1.1.-)
ABCA4 1p21-p22.1 P78363 ABCA4_HUMAN 601691 Retinal-specific phospholipid-transporting ATPase ABCA4 (EC 7.6.2.1) (ATP-binding cassette sub-family A member 4) (RIM ABC transporter) (RIM protein) (RmP) (Retinal-specific ATP-binding cassette transporter) (Stargardt disease protein) [ABCR]
ABCB10 1q42 Q9NRK6 ABCBA_HUMAN 605454 ATP-binding cassette sub-family B member 10, mitochondrial precursor (ATP-binding cassette transporter
Because of this, I have not been able to import my data at all. Since the text was justified with spaces, the number of spaces between fields isn't uniform.
This is the link to the data sheet that I am using: https://www.uniprot.org/docs/humchr01.txt
Each field has a fixed width, so you can use the function read.fwf to read the file.
The following code reads the input file (assuming the file contains only the data rows, without the headers):
# Field widths inferred from the column layout of humchr01.txt
f <- read.fwf('input.txt', widths = c(14, 16, 11, 12, 7, 250), strip.white = TRUE)
colnames(f) <- c('Gene name', 'Chromosomal position', 'Swiss-Prot AC',
                 'Swiss-Prot Entry name', 'MIM code', 'Description')
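If you would rather point read.fwf at the original file with its preamble and header block still in place, the skip argument can jump over the leading lines; the count below is an assumption to adjust to the actual preamble of humchr01.txt:

f <- read.fwf('humchr01.txt', widths = c(14, 16, 11, 12, 7, 250),
              strip.white = TRUE, skip = 35)  # skip = 35 is a guess at the preamble length
head(f)  # sanity-check that the columns split where expected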

What is the meaning of "Some" in a collect result in Scala?

"some" is not a special term which makes the googling seem to just ignore that search.
What I am asking is in my learning below:
b.collect:
Array[(Int, String)] = Array((3,dog), (6,salmon), (3,rat), (8,elephant))
d.collect:
Array[(Int, String)] = Array((3,dog), (3,cat), (6,salmon), (6,rabbit), (4,wolf), (7,penguin))
If I do a join and then collect the result, like b.join(d).collect, I get the following:
Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (3,(dog,dog)), (3,(dog,cat)), (3,(rat,dog)), (3,(rat,cat)))
which seems understandable. However, if I do b.leftOuterJoin(d).collect, I get:
Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (8,(elephant,None)))
My question is: why are the two results expressed differently? Why does the second result contain "Some"? What is the difference between values with and without "Some"? Can "Some" be removed? Does "Some" have any impact on later operations on the RDD's contents?
Thank you very much.
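For reference, pair RDDs like these are typically produced by keying strings by their length; the construction below reproduces the collect output shown above, though it is a reconstruction, since the question does not show how b and d were built:

val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"))
val b = a.keyBy(_.length)  // (3,dog), (6,salmon), (3,rat), (8,elephant)
val c = sc.parallelize(List("dog", "cat", "salmon", "rabbit", "wolf", "penguin"))
val d = c.keyBy(_.length)  // (3,dog), (3,cat), (6,salmon), (6,rabbit), (4,wolf), (7,penguin)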
When you do a normal join, as in b.join(d).collect, you get Array[(Int, (String, String))].
This is because join emits only the keys present in both RDD b and RDD d, so a matching value is always guaranteed and the plain pair type (String, String) suffices.
But when you use b.leftOuterJoin(d).collect, the return type is Array[(Int, (String, Option[String]))]. This is because leftOuterJoin has to handle missing matches: there is no guarantee that every key of RDD b is also present in RDD d, so the right-hand value is returned as an Option[String], which has two cases:
Some(value) if the key is matched in both RDDs
None if the key is present in b but not in d
You can remove the Some wrapper by extracting the value from it and supplying a default for the None case, as below:
val z = b.leftOuterJoin(d).map(x => (x._1, (x._2._1, x._2._2.getOrElse("")))).collect
Now you should get Array[(Int, (String, String))] and the output
Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (3,(dog,dog)), (3,(dog,cat)), (3,(rat,dog)), (3,(rat,cat)), (8,(elephant,)))
Where you can replace "" with any other string as you require.
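As a side note, Option is a standard Scala type rather than anything Spark-specific, and pattern matching is the idiomatic way to unpack it; here is a minimal standalone sketch (the names are purely illustrative):

val present: Option[String] = Some("rabbit")  // a value exists
val missing: Option[String] = None            // no value

// Pattern matching handles both cases explicitly:
def describe(o: Option[String]): String = o match {
  case Some(v) => s"matched: $v"
  case None    => "no match"
}

println(describe(present))  // matched: rabbit
println(describe(missing))  // no match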
Hope this helps.

How to perform pandas drop_duplicates based on index column

I am banging my head against the wall trying to drop duplicates from a time series, based on the value of a datetime index.
My function is the following:
def csv_import_merge_T(f):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True,
                        names=['datetime','temp','rh'], header=0) for fp in files]
    dfT = pd.concat(dfsT)
    #print dfT.head(); print dfT.index; print dfT.dtypes
    dfT.drop_duplicates(subset=index, inplace=True)
    dfT.resample('H').bfill()
    return dfT
which is called by:
inputcsvT = ['./input_csv/A08_KI_T*.csv']
for csvnameT in inputcsvT:
    files = glob.glob(csvnameT)
    print ('___'); print (files)
    t = csv_import_merge_T(files)
    print csvT
I receive the error:
NameError: global name 'index' is not defined
What is wrong?
UPDATE:
The issue appears to arise when the CSV input files (which are to be concatenated) overlap.
inputcsvT = ['./input_csv/A08_KI_T*.csv'] gets files
A08_KI_T5
28/05/2015 17:00,22.973,24.021
...
08/10/2015 13:30,24.368,45.974
A08_KI_T6
08/10/2015 14:00,24.779,41.526
...
10/02/2016 17:00,22.326,41.83
and it runs correctly, whereas:
inputcsvT = ['./input_csv/A08_LR_T*.csv'] gathers
A08_LR_T5
28/05/2015 17:00,22.493,25.62
...
08/10/2015 13:30,24.296,44.596
A08_LR_T6
28/05/2015 17:00,22.493,25.62
...
10/02/2016 17:15,21.991,38.45
which leads to an error.
IIUC you can call reset_index, then drop_duplicates, and then set_index again:
In [304]:
df = pd.DataFrame(data=np.random.randn(5,3), index=list('aabcd'))
df
Out[304]:
          0         1         2
a  0.918546 -0.621496 -0.210479
a -1.154838 -2.282168 -0.060182
b  2.512519 -0.771701 -0.328421
c -0.583990 -0.460282  1.294791
d -1.018002  0.826218  0.110252
In [308]:
df.reset_index().drop_duplicates('index').set_index('index')
Out[308]:
              0         1         2
index
a      0.918546 -0.621496 -0.210479
b      2.512519 -0.771701 -0.328421
c     -0.583990 -0.460282  1.294791
d     -1.018002  0.826218  0.110252
EDIT
Actually, there is a simpler method: call duplicated on the index and invert it:
In [309]:
df[~df.index.duplicated()]
Out[309]:
          0         1         2
a  0.918546 -0.621496 -0.210479
b  2.512519 -0.771701 -0.328421
c -0.583990 -0.460282  1.294791
d -1.018002  0.826218  0.110252
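Applied to the function from the question, this also resolves the NameError: drop_duplicates(subset=index) fails simply because index is not a defined name there, and the index-based filter replaces that line. A minimal sketch under that assumption, reusing the question's own read options, and assigning the result of resample since it returns a new frame:

import glob
import pandas as pd

def csv_import_merge_T(files):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True,
                        names=['datetime', 'temp', 'rh'], header=0)
            for fp in files]
    dfT = pd.concat(dfsT)
    # keep only the first row for each duplicated datetime index value
    dfT = dfT[~dfT.index.duplicated(keep='first')]
    dfT = dfT.resample('H').bfill()
    return dfT

t = csv_import_merge_T(glob.glob('./input_csv/A08_LR_T*.csv'))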

How to store a SparkR DataFrame in Cassandra?

head(users)
1 jay chennai
2 kumar bangalore
3 vinoth Trichy
4 saswath perambalur
I want to store this output in a Cassandra table. I tried the lines below:
users.write
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "sparkusers", "keyspace" -> "bigdata"))
.save()
which throws the error
unexpected symbol in "test.write.format("org.apache.spark.sql.cassandra").options"
Please help me with this.
You are using the wrong syntax for R (that is the Python/Scala syntax). In SparkR the call takes the source and options as arguments instead, e.g. for reading:
read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "ks", table = "table")
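Since the goal here is to store the DataFrame rather than read one, the write-side call in SparkR follows the same pattern; a sketch assuming the spark-cassandra-connector package is on the classpath, with the keyspace and table names taken from the question (the exact signature varies a little across SparkR versions):

write.df(users,
         source = "org.apache.spark.sql.cassandra",
         mode = "append",
         keyspace = "bigdata",
         table = "sparkusers")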
See Spark R Dataframe Documentation

How to use the function "table:get" (table extension) when 2 keys are required?

I have a .txt file with 3 columns: ID-polygon-1, ID-polygon-2, and distance.
When I import my file into NetLogo, I obtain 3 lists [[list1][list2][list3]] corresponding to the 3 columns.
I used table:from-list list to create a table with the content of the 3 lists.
I obtain {{table: [[1 1] [67 518] [815 127]]}} (the table displays the first two lines of my dataset).
For example, I would like to get the value of the distance (list3) between ID-polygon-1 = 1 (list1) and ID-polygon-2 = 67 (list2), that is, 815.
How can I use table:get table key when I need 2 keys (ID-polygon-1 and ID-polygon-2)?
Thanks very much for your help.
Using table:from-list will not help you here: it expects "a list of two element lists, or pairs" where "the first element in the pair is the key and the second element is the value." That's not what you have in your original list.
Furthermore, NetLogo tables (and associative arrays in general) cannot have two keys. They are always just key-value pairs. Nothing prevents the value from being another table, however, and in your case, that is what you need: a table of tables!
There is no primitive to build that directly, however. You will need to build it yourself:
extensions [ table ]

globals [ t ]

to setup
  let lists [
    [ 1 1 ]     ; ID-polygon-1 column
    [ 67 518 ]  ; ID-polygon-2 column
    [ 815 127 ] ; distance column
  ]
  set t table:make
  foreach n-values length first lists [ ? ] [
    let id1 item ? (item 0 lists)
    let id2 item ? (item 1 lists)
    let dist item ? (item 2 lists)
    ; create the inner table for id1 the first time we see it
    if not table:has-key? t id1 [
      table:put t id1 table:make
    ]
    table:put (table:get t id1) id2 dist
  ]
end
Here is what you get when you print the resulting table:
{{table: [[1 {{table: [[67 815] [518 127]]}}]]}}
And here is a small reporter to make it convenient to get a distance from the table:
to-report get-dist [ id1 id2 ]
  report table:get (table:get t id1) id2
end
Using get-dist 1 67 will give the 815 result you were looking for.
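As an aside, NetLogo table keys can themselves be lists, so a single table keyed by the two-element list (list id1 id2) is a possible alternative to nested tables; a sketch of that variant, reusing the t global (the procedure names here are illustrative):

to setup-flat
  set t table:make
  ; key each distance by the two-element list [id1 id2]
  table:put t (list 1 67) 815
  table:put t (list 1 518) 127
end

to-report get-dist-flat [ id1 id2 ]
  report table:get t (list id1 id2)
end

Then get-dist-flat 1 67 reports 815, just like get-dist above.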
