Best way to import words from a text file into a data frame in R - r

I have a text file filled with words separated by spaces as seen below:
ACNES ACOCK ACOLD ACORN ACRED ACRES ACRID ACTED ACTIN ACTON ACTOR ACUTE ACYLS ADAGE ADAPT ADAWS ADAYS ADDAX ADDED ADDER ADDIO ADDLE ADEEM ADEPT ADHAN ADIEU ADIOS ADITS ADMAN ADMEN ADMIN ADMIT ADMIX ADOBE ADOBO ADOPT ADORE ADORN ADOWN ADOZE ADRAD ADRED ADSUM ADUKI ADULT ADUNC ADUST ADVEW ADYTA ADZED ADZES AECIA AEDES AEGIS AEONS AERIE AEROS AESIR AFALD AFARA AFARS AFEAR AFFIX AFIRE AFLAJ AFOOT AFORE AFOUL AFRIT AFROS AFTER AGAIN AGAMA AGAMI AGAPE AGARS AGAST AGATE AGAVE AGAZE AGENE AGENT AGERS AGGER AGGIE AGGRI AGGRO AGGRY AGHAS AGILA AGILE AGING AGIOS AGISM AGIST AGITA AGLEE AGLET AGLEY AGLOO AGLOW AGLUS AGMAS AGOGE AGONE AGONS AGONY AGOOD AGORA AGREE AGRIA AGRIN
What's the best way to import all these words into a 1 column data frame?


Crop-row detection of already known plant positions - How to?

Through image recognition and segmentation I have already obtained an abstract representation of plants on a field (i.e., I know the exact coordinates of every plant in an image).
Now I want to detect the crop rows in this abstract representation, and I can't quite figure out how.
My problems are:
The rows in the images may be slightly rotated and not exactly in north/south orientation (angles may vary between -10° and +10°).
The number of crop rows varies from image to image and is not known before processing.
The rotation of the crop rows may be slightly different in each processed image.
I have hundreds of images / representations to process (so doing it by hand is not really feasible :-) ), so I need an algorithm that I can later put into a loop, for example.
Can you perhaps help me with at least a strategy (or code snippets) for such a crop-row detection? Ideally, in the end I would have, for each crop row, the parameters of a linear equation (y=m*x+t), so that abline() can be used, but I am open to anything. In the end it could look something like this (here done by hand, purely for illustration purposes):
Underlying data of the images is here:
structure(c(5278.072, 2632.564, 393.34, 4057.704, 3805.599, 611.269,
1823.835, 3359.069, 3598.284, 5262.873, 2069.963, 1579.745, 4539.584,
3579.977, 4296.46, 1831.153, 2333.835, 1126.639, 152.948, 4030.205,
3368.738, 2066.733, 855.111, 2579.665, 3092.37, 1318.357, 1109.438,
3578.606, 375.756, 3796.788, 4520.064, 1807.36, 5001.773, 87.272,
4033.594, 836.708, 639.13, 3105.628, 1569.256, 2831.851, 826.444,
3557.598, 1078.643, 576.266, 4789.585, 3091.929, 5239.658, 1099.954,
1807.972, 2534.677, 4271.841, 5019.276, 2053.246, 1536.071, 3347.644,
4019.766, 3793.392, 5257.628, 604.323, 2561.307, 1792.665, 884.25,
109.456, 3066.108, 3750.833, 4511.819, 2815.08, 119.468, 4499.801,
2582.512, 2822.354, 3773.842, 1054.719, 4251.171, 4002.476, 2018.277,
1775.284, 4959.269, 2541.009, 4742.312, 2265.149, 3071.313, 1779.218,
3972.64, 2822.409, 5217.848, 1265.449, 1522.899, 3057.732, 5364.729,
346.341, 4226.012, 3287.299, 1767.18, 3991.963, 1811.498, 2785.251,
4488.214, 822.509, 2016.435, 3022.344, 2528.079, 4470.315, 3017.716,
572.771, 97.748, 5168.119, 4199.643, 2006.285, 3946.505, 2771.626,
3495.94, 1745.531, 3734.241, 3265.819, 4963.116, 1058.788, 300.408,
1252.845, 4453, 5411.107, 2768.93, 557.806, 2004.424, 2218.582,
4214.073, 4698.292, 5149.238, 4953.886, 1238.343, 3502.518, 2753.044,
5417.502, 1031.945, 2518.901, 1483.487, 4450.737, 2258.484, 289.261,
2987.945, 5156.371, 4171.407, 1995.901, 781.96, 3918.94, 1974.667,
316.758, 1470.993, 5160.868, 3237.828, 521.251, 787.228, 1039.416,
1202.261, 3456.837, 4148.167, 2200.492, 2720.912, 4915.451, 3902.744,
4435.419, 1209.418, 1471.057, 4641.269, 3913.51, 5412.672, 1953.878,
2220.277, 4911.249, 1006.368, 2974.173, 4410.827, 1688.391, 293.729,
1462.871, 4618.785, 5150.904, 2689.061, 1952.56, 5389.383, 2176.387,
995.073, 4125.245, 498.978, 5137.266, 5358.118, 1444.34, 1674.431,
2689.288, 2465.351, 4566.352, 765.125, 1196.984, 1687.859, 258.247,
1914.911, 4575.408, 3421.147, 495.879, 979.079, 1922.943, 4097.704,
737.439, 3410.562, 234.74, 2159.697, 471.983, 1418.991, 2440.575,
1942.708, 1162.525, 5312.409, 2162.656, 5059.814, 1411.412, 4558.905,
247.618, 4319.106, 3411.827, 1786.69, 1670.462, 1180.524, 1640.636,
4715.993, 3576.548, 3566.57, 3589.872, 3565.564, 3531.571, 3415.178,
3511.07, 3510.051, 3487.762, 3470.791, 3443.062, 3369.329, 3386.999,
3387.786, 3277.473, 3376.266, 3421.932, 3387.869, 3367.994, 3346.403,
3259.785, 3296.081, 3297.633, 3285.163, 3300.119, 2941.504, 3264.344,
3277.9, 3235.499, 3198.869, 3235.508, 3156.907, 3221.313, 3123.96,
3165.979, 3186.806, 3148.158, 3129.906, 3035.963, 2987.899, 3053.684,
3050.107, 3052.643, 3037.767, 3037.525, 2994.456, 3006.454, 2960.606,
2973.443, 2919.843, 2917.246, 2939.87, 2914.804, 2886.588, 2920.769,
2906.616, 2908.866, 2868.052, 2885.769, 2860.088, 2801.168, 2853.439,
2853.863, 2847.141, 2805.677, 2806.183, 2718.094, 2661.652, 2695.19,
2656.518, 2612.372, 2603.286, 2602.449, 2591.63, 2595.714, 2593.287,
2575.333, 2572.15, 2476.559, 2435.917, 2538.626, 2514.215, 2458.875,
2477.5, 2385.366, 2421.47, 2220.899, 2397.842, 2396.848, 2393.501,
2352.039, 2292.429, 2315.84, 2328.682, 2256.508, 2236.925, 2192.809,
2241.279, 2144.107, 2195.016, 2185.86, 2112.28, 2098.085, 2020.843,
1971.232, 1979.691, 1968.859, 1943.755, 1974.743, 1891.801, 1944.186,
1951.423, 1872.022, 1928.441, 1880.504, 1912.82, 1893.822, 1878.889,
1850.38, 1834.762, 1851.886, 1806.117, 1776.713, 1682.26, 1733.805,
1714.941, 1700.778, 1686.258, 1703.367, 1549.601, 1682.525, 1563.277,
1632.103, 1609.4, 1621.888, 1587.126, 1545.346, 1537.933, 1542.424,
1366.974, 1494.822, 1498.618, 1494.055, 1450.098, 1407.89, 1345.613,
1388.68, 1380.527, 1368.772, 1372.391, 1161.35, 1297.577, 1312.849,
1304.972, 1286.721, 1292.485, 1257.53, 1241.146, 1263.164, 1217.146,
1226.615, 993.046, 1166.837, 1112.254, 1072.249, 1117.723, 1061.758,
1098.207, 1084.597, 1059.916, 1059.685, 1063.814, 1054.735, 944.2,
982.653, 963.989, 969.55, 941.066, 907.014, 930.988, 776.849,
877.918, 889.259, 805.872, 831.361, 803.752, 786.654, 791.649,
814.271, 794.444, 776.833, 694.969, 664.718, 653.238, 661.703,
652.696, 655.997, 637.118, 539.101, 555.694, 491.482, 459.712,
453.73, 490.567, 391.441, 409.506, 319.697, 391.505, 390.46,
308.658, 310.59, 285.799, 268.86, 245.89, 195.933, 243.418, 214.203,
172.129, 173.754, 191.456, 194.795, 98.098, 99.4479999999999,
62.1419999999998), .Dim = c(224L, 2L))
Here is something that may help:
For each detected plant point, find the closest neighboring plant. Hopefully this finds a plant in the same crop row more often than not. If it's known a priori that images are roughly in north/south orientation, we should prefer looking more in the vertical direction to choose neighboring plants. One way to do that is to redefine "distance" for the nearest neighbor search as something anisotropic like
distance = 10 * (x0 - x1)² + (y0 - y1)²
Here is a plot of what this produces, making a line segment between each plant and its nearest neighbor:
It's not perfect, but could be a useful start. In most crop rows, a run of 4 or more plants is chained together correctly.
A thought on a possible strategy from here:
Identify the connected components, the "chains" of plants.
For each chain, fit a line by least squares. Or better yet, use the RANSAC algorithm so that the fit robustly ignores a single stray plant in an otherwise collinear chain.
Again using the rough north/south orientation, consider the best fit line "valid" only if it's close enough to vertical. Supposing it is valid, find all plants that are close to the best fit line. If many plants are close, then the best fit line is likely a crop row.
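The first steps of this strategy can be sketched in a few lines. The sketch below is in Python rather than R and is only an illustration under stated assumptions: it uses a brute-force neighbor search, plain least squares instead of RANSAC, and it omits the final step of collecting all plants near each fitted line. The function names and the defaults `min_len=4` and `max_tilt_deg=15` are hypothetical choices, not part of the answer above. Because the rows run roughly north/south, the line is fitted as x = a*y + b, which avoids near-infinite slopes.

```python
import math
from collections import defaultdict

def nearest_neighbor(points, x_weight=10.0):
    """Index of each point's nearest neighbor under the anisotropic
    distance x_weight * dx**2 + dy**2 (brute force, O(n^2))."""
    result = []
    for i, (x0, y0) in enumerate(points):
        best, best_d = None, float("inf")
        for j, (x1, y1) in enumerate(points):
            if i == j:
                continue
            d = x_weight * (x0 - x1) ** 2 + (y0 - y1) ** 2
            if d < best_d:
                best, best_d = j, d
        result.append(best)
    return result

def chains(points, x_weight=10.0):
    """Connected components of the nearest-neighbor graph (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in enumerate(nearest_neighbor(points, x_weight)):
        if j is not None:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return list(groups.values())

def fit_x_on_y(pts):
    """Least-squares fit of x = a*y + b; None if all y values coincide."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    if syy == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    a = sxy / syy
    return a, mean_x - a * mean_y

def detect_rows(points, x_weight=10.0, min_len=4, max_tilt_deg=15.0):
    """Fit one line per sufficiently long chain and keep near-vertical ones."""
    rows = []
    for chain in chains(points, x_weight):
        if len(chain) < min_len:
            continue
        fit = fit_x_on_y([points[i] for i in chain])
        if fit is None:
            continue
        a, b = fit
        if abs(math.degrees(math.atan(a))) <= max_tilt_deg:
            rows.append((a, b))
    return rows
```

Each result (a, b) describes the line x = a*y + b. When a is not zero, the form requested in the question follows as m = 1/a and t = -b/a.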

Issues with importing R Data due to formatting

I'm trying to import txt data into R, but I'm unsure how to do this because of the file's formatting. The columns were lined up under their column names, and since it's a plain text file, this was done with varying numbers of spaces. For example:
Gene Chromosomal Swiss-Prot MIM Description
name position AC Entry name code
______________ _______________ ______________________ ______ ______________________
A3GALT2 1p35.1 U3KPV4 A3LT2_HUMAN Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]
AADACL3 1p36.21 Q5VUY0 ADCL3_HUMAN Arylacetamide deacetylase-like 3 (EC 3.1.1.-)
AADACL4 1p36.21 Q5VUY2 ADCL4_HUMAN Arylacetamide deacetylase-like 4 (EC 3.1.1.-)
ABCA4 1p21-p22.1 P78363 ABCA4_HUMAN 601691 Retinal-specific phospholipid-transporting ATPase ABCA4 (EC 7.6.2.1) (ATP-binding cassette sub-family A member 4) (RIM ABC transporter) (RIM protein) (RmP) (Retinal-specific ATP-binding cassette transporter) (Stargardt disease protein) [ABCR]
ABCB10 1q42 Q9NRK6 ABCBA_HUMAN 605454 ATP-binding cassette sub-family B member 10, mitochondrial precursor (ATP-binding cassette transporter
Because of this, I have not been able to import my data at all. Because the text was justified with spaces, the number of spaces between fields isn't uniform.
This is the link to the data sheet that I am using: https://www.uniprot.org/docs/humchr01.txt
Each field has a fixed width, so you can read the file with the function read.fwf.
The following code reads the input file (assuming the file contains only the data rows, without the header lines):
# widths: one entry per column, in characters
f <- read.fwf("input.txt", widths = c(14, 16, 11, 12, 7, 250), strip.white = TRUE)
colnames(f) <- c("Gene name", "Chromosomal position", "Swiss-Prot AC",
"Swiss-Prot Entry name", "MIM code", "Description")
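To show what fixed-width parsing does under the hood, here is a minimal Python sketch of the same idea. It reuses the widths and column names from the answer above and is not a replacement for read.fwf; the function name `parse_fixed_width` is hypothetical.

```python
# Column widths and names taken from the read.fwf call above.
WIDTHS = (14, 16, 11, 12, 7, 250)
NAMES = ("Gene name", "Chromosomal position", "Swiss-Prot AC",
         "Swiss-Prot Entry name", "MIM code", "Description")

def parse_fixed_width(line, widths=WIDTHS, names=NAMES):
    """Cut one line into fields at fixed character offsets and strip padding."""
    row, pos = {}, 0
    for name, width in zip(names, widths):
        row[name] = line[pos:pos + width].strip()
        pos += width
    return row
```

A field that is blank in the file, such as a missing MIM code, comes back as an empty string.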

How can I select specific information from a XML file? in R or other platforms

Hi, I've just downloaded an XML file referring to the 5.8S region in Aedes aegypti from NCBI Nucleotide. As an example, I have pasted below the information I get for the first sample.
From here I wish to extract
1. <INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
2. <INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
3. <INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
4. <INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA </INSDReference_journal>
Also, as I said, this is a short version of everything I actually downloaded (13 samples, from https://www.ncbi.nlm.nih.gov/nuccore/?term=aedes+aegypti+5.8). Is it possible to extract the information I want for all the samples?
I'm familiar with R, but which platform is better suited to this?
<INSDSeq_locus>CH477247</INSDSeq_locus>
<INSDSeq_length>3065330</INSDSeq_length>
<INSDSeq_strandedness>double</INSDSeq_strandedness>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>CON</INSDSeq_division>
<INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
<INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
<INSDSeq_definition>Aedes aegypti strain Liverpool supercont1.62 genomic scaffold, whole genome shotgun sequence</INSDSeq_definition>
<INSDSeq_primary-accession>CH477247</INSDSeq_primary-accession>
<INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
<INSDSeq_other-seqids>
<INSDSeqid>gnl|WGS:AAGE|supercont1.62</INSDSeqid>
<INSDSeqid>gb|CH477247.1|</INSDSeqid>
<INSDSeqid>gi|78216626</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_project>PRJNA12434</INSDSeq_project>
<INSDSeq_keywords>
<INSDKeyword>WGS</INSDKeyword>
</INSDSeq_keywords>
<INSDSeq_source>Aedes aegypti (yellow fever mosquito)</INSDSeq_source>
<INSDSeq_organism>Aedes aegypti</INSDSeq_organism>
<INSDSeq_taxonomy>Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Holometabola; Diptera; Nematocera; Culicoidea; Culicidae; Culicinae; Aedini; Aedes; Stegomyia</INSDSeq_taxonomy>
<INSDSeq_references>
<INSDReference>
<INSDReference_reference>1</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Nene,V.</INSDAuthor>
<INSDAuthor>Wortman,J.R.</INSDAuthor>
<INSDAuthor>Lawson,D.</INSDAuthor>
<INSDAuthor>Haas,B.</INSDAuthor>
<INSDAuthor>Kodira,C.</INSDAuthor>
<INSDAuthor>Tu,Z.J.</INSDAuthor>
<INSDAuthor>Loftus,B.</INSDAuthor>
<INSDAuthor>Xi,Z.</INSDAuthor>
<INSDAuthor>Megy,K.</INSDAuthor>
<INSDAuthor>Grabherr,M.</INSDAuthor>
<INSDAuthor>Ren,Q.</INSDAuthor>
<INSDAuthor>Zdobnov,E.M.</INSDAuthor>
<INSDAuthor>Lobo,N.F.</INSDAuthor>
<INSDAuthor>Campbell,K.S.</INSDAuthor>
<INSDAuthor>Brown,S.E.</INSDAuthor>
<INSDAuthor>Bonaldo,M.F.</INSDAuthor>
<INSDAuthor>Zhu,J.</INSDAuthor>
<INSDAuthor>Sinkins,S.P.</INSDAuthor>
<INSDAuthor>Hogenkamp,D.G.</INSDAuthor>
<INSDAuthor>Amedeo,P.</INSDAuthor>
<INSDAuthor>Arensburger,P.</INSDAuthor>
<INSDAuthor>Atkinson,P.W.</INSDAuthor>
<INSDAuthor>Bidwell,S.</INSDAuthor>
<INSDAuthor>Biedler,J.</INSDAuthor>
<INSDAuthor>Birney,E.</INSDAuthor>
<INSDAuthor>Bruggner,R.V.</INSDAuthor>
<INSDAuthor>Costas,J.</INSDAuthor>
<INSDAuthor>Coy,M.R.</INSDAuthor>
<INSDAuthor>Crabtree,J.</INSDAuthor>
<INSDAuthor>Crawford,M.</INSDAuthor>
<INSDAuthor>Debruyn,B.</INSDAuthor>
<INSDAuthor>Decaprio,D.</INSDAuthor>
<INSDAuthor>Eiglmeier,K.</INSDAuthor>
<INSDAuthor>Eisenstadt,E.</INSDAuthor>
<INSDAuthor>El-Dorry,H.</INSDAuthor>
<INSDAuthor>Gelbart,W.M.</INSDAuthor>
<INSDAuthor>Gomes,S.L.</INSDAuthor>
<INSDAuthor>Hammond,M.</INSDAuthor>
<INSDAuthor>Hannick,L.I.</INSDAuthor>
<INSDAuthor>Hogan,J.R.</INSDAuthor>
<INSDAuthor>Holmes,M.H.</INSDAuthor>
<INSDAuthor>Jaffe,D.</INSDAuthor>
<INSDAuthor>Johnston,J.S.</INSDAuthor>
<INSDAuthor>Kennedy,R.C.</INSDAuthor>
<INSDAuthor>Koo,H.</INSDAuthor>
<INSDAuthor>Kravitz,S.</INSDAuthor>
<INSDAuthor>Kriventseva,E.V.</INSDAuthor>
<INSDAuthor>Kulp,D.</INSDAuthor>
<INSDAuthor>Labutti,K.</INSDAuthor>
<INSDAuthor>Lee,E.</INSDAuthor>
<INSDAuthor>Li,S.</INSDAuthor>
<INSDAuthor>Lovin,D.D.</INSDAuthor>
<INSDAuthor>Mao,C.</INSDAuthor>
<INSDAuthor>Mauceli,E.</INSDAuthor>
<INSDAuthor>Menck,C.F.</INSDAuthor>
<INSDAuthor>Miller,J.R.</INSDAuthor>
<INSDAuthor>Montgomery,P.</INSDAuthor>
<INSDAuthor>Mori,A.</INSDAuthor>
<INSDAuthor>Nascimento,A.L.</INSDAuthor>
<INSDAuthor>Naveira,H.F.</INSDAuthor>
<INSDAuthor>Nusbaum,C.</INSDAuthor>
<INSDAuthor>O&apos;leary,S.</INSDAuthor>
<INSDAuthor>Orvis,J.</INSDAuthor>
<INSDAuthor>Pertea,M.</INSDAuthor>
<INSDAuthor>Quesneville,H.</INSDAuthor>
<INSDAuthor>Reidenbach,K.R.</INSDAuthor>
<INSDAuthor>Rogers,Y.H.</INSDAuthor>
<INSDAuthor>Roth,C.W.</INSDAuthor>
<INSDAuthor>Schneider,J.R.</INSDAuthor>
<INSDAuthor>Schatz,M.</INSDAuthor>
<INSDAuthor>Shumway,M.</INSDAuthor>
<INSDAuthor>Stanke,M.</INSDAuthor>
<INSDAuthor>Stinson,E.O.</INSDAuthor>
<INSDAuthor>Tubio,J.M.</INSDAuthor>
<INSDAuthor>Vanzee,J.P.</INSDAuthor>
<INSDAuthor>Verjovski-Almeida,S.</INSDAuthor>
<INSDAuthor>Werner,D.</INSDAuthor>
<INSDAuthor>White,O.</INSDAuthor>
<INSDAuthor>Wyder,S.</INSDAuthor>
<INSDAuthor>Zeng,Q.</INSDAuthor>
<INSDAuthor>Zhao,Q.</INSDAuthor>
<INSDAuthor>Zhao,Y.</INSDAuthor>
<INSDAuthor>Hill,C.A.</INSDAuthor>
<INSDAuthor>Raikhel,A.S.</INSDAuthor>
<INSDAuthor>Soares,M.B.</INSDAuthor>
<INSDAuthor>Knudson,D.L.</INSDAuthor>
<INSDAuthor>Lee,N.H.</INSDAuthor>
<INSDAuthor>Galagan,J.</INSDAuthor>
<INSDAuthor>Salzberg,S.L.</INSDAuthor>
<INSDAuthor>Paulsen,I.T.</INSDAuthor>
<INSDAuthor>Dimopoulos,G.</INSDAuthor>
<INSDAuthor>Collins,F.H.</INSDAuthor>
<INSDAuthor>Birren,B.</INSDAuthor>
<INSDAuthor>Fraser-Liggett,C.M.</INSDAuthor>
<INSDAuthor>Severson,D.W.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Genome sequence of Aedes aegypti, a major arbovirus vector</INSDReference_title>
<INSDReference_journal>Science 316 (5832), 1718-1723 (2007)</INSDReference_journal>
<INSDReference_xref>
<INSDXref>
<INSDXref_dbname>doi</INSDXref_dbname>
<INSDXref_id>10.1126/science.1138878</INSDXref_id>
</INSDXref>
</INSDReference_xref>
<INSDReference_pubmed>17510324</INSDReference_pubmed>
</INSDReference>
<INSDReference>
<INSDReference_reference>2</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Galagan,J.</INSDAuthor>
<INSDAuthor>Devon,K.</INSDAuthor>
<INSDAuthor>Henn,M.R.</INSDAuthor>
<INSDAuthor>Severson,D.W.</INSDAuthor>
<INSDAuthor>Collins,F.</INSDAuthor>
<INSDAuthor>Jaffe,D.</INSDAuthor>
<INSDAuthor>Rounsley,S.</INSDAuthor>
<INSDAuthor>DeCaprio,D.</INSDAuthor>
<INSDAuthor>Kodira,C.</INSDAuthor>
<INSDAuthor>Lander,E.</INSDAuthor>
<INSDAuthor>Crawford,M.</INSDAuthor>
<INSDAuthor>Butler,J.</INSDAuthor>
<INSDAuthor>Alvarez,P.</INSDAuthor>
<INSDAuthor>Gnerre,S.</INSDAuthor>
<INSDAuthor>Grabherr,M.</INSDAuthor>
<INSDAuthor>Kleber,M.</INSDAuthor>
<INSDAuthor>Mauceli,E.</INSDAuthor>
<INSDAuthor>Brockman,W.</INSDAuthor>
<INSDAuthor>Young,S.</INSDAuthor>
<INSDAuthor>LaButti,K.</INSDAuthor>
<INSDAuthor>Pushparaj,V.</INSDAuthor>
<INSDAuthor>Koehrsen,M.</INSDAuthor>
<INSDAuthor>Engels,R.</INSDAuthor>
<INSDAuthor>Montgomery,P.</INSDAuthor>
<INSDAuthor>Pearson,M.</INSDAuthor>
<INSDAuthor>Howarth,C.</INSDAuthor>
<INSDAuthor>Zeng,Q.</INSDAuthor>
<INSDAuthor>Yandava,C.</INSDAuthor>
<INSDAuthor>Oleary,S.</INSDAuthor>
<INSDAuthor>Alvarado,L.</INSDAuthor>
<INSDAuthor>Nusbaum,C.</INSDAuthor>
<INSDAuthor>Birren,B.</INSDAuthor>
</INSDReference_authors>
<INSDReference_consortium>The Broad Institute Genome Sequencing Platform</INSDReference_consortium>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>3</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Loftus,B.J.</INSDAuthor>
<INSDAuthor>Nene,V.M.</INSDAuthor>
<INSDAuthor>Hannick,L.I.</INSDAuthor>
<INSDAuthor>Bidwell,S.</INSDAuthor>
<INSDAuthor>Haas,B.</INSDAuthor>
<INSDAuthor>Amedeo,P.</INSDAuthor>
<INSDAuthor>Orvis,J.</INSDAuthor>
<INSDAuthor>Wortman,J.R.</INSDAuthor>
<INSDAuthor>White,O.R.</INSDAuthor>
<INSDAuthor>Salzberg,S.</INSDAuthor>
<INSDAuthor>Shumway,M.</INSDAuthor>
<INSDAuthor>Koo,H.</INSDAuthor>
<INSDAuthor>Zhao,Y.</INSDAuthor>
<INSDAuthor>Holmes,M.</INSDAuthor>
<INSDAuthor>Miller,J.</INSDAuthor>
<INSDAuthor>Schatz,M.</INSDAuthor>
<INSDAuthor>Pop,M.</INSDAuthor>
<INSDAuthor>Pai,G.</INSDAuthor>
<INSDAuthor>Utterback,T.</INSDAuthor>
<INSDAuthor>Rogers,Y.-H.</INSDAuthor>
<INSDAuthor>Kravitz,S.</INSDAuthor>
<INSDAuthor>Fraser,C.M.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (07-OCT-2005) The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>4</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_consortium>VectorBase</INSDReference_consortium>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (05-SEP-2012) VectorBase / Ensembl, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK</INSDReference_journal>
<INSDReference_remark>Annotation update by submitter</INSDReference_remark>
</INSDReference>
</INSDSeq_references>
<INSDSeq_comment>The sequence for this assembly was produced jointly by The Broad Institute of Harvard/MIT and The Institute for Genomic Research. The assembly represents 7.6X sequence coverage of the genome and the total length of the contigs is 1.31 Gb. Additional information about the Aedes aegypti sequencing project and assembly can be found at http://www.broad.mit.edu/annotation/disease_vector/aedes_aegypti/ and http://www.tigr.org/msc/aedes/aedes.shtml. Long-term curation of the sequence and subsequent annotation updates will be the responsibility of VectorBase at http://www.vectorbase.org.~Annotation was updated by VectorBase in Sept 2012.</INSDSeq_comment>
<INSDSeq_feature-table>
<INSDFeature>
<INSDFeature_key>source</INSDFeature_key>
<INSDFeature_location>1..3065330</INSDFeature_location>
<INSDFeature_intervals>
<INSDInterval>
<INSDInterval_from>1</INSDInterval_from>
<INSDInterval_to>3065330</INSDInterval_to>
<INSDInterval_accession>CH477247.1</INSDInterval_accession>
</INSDInterval>
</INSDFeature_intervals>
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>organism</INSDQualifier_name>
<INSDQualifier_value>Aedes aegypti</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>mol_type</INSDQualifier_name>
<INSDQualifier_value>genomic DNA</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>strain</INSDQualifier_name>
<INSDQualifier_value>Liverpool</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>db_xref</INSDQualifier_name>
<INSDQualifier_value>taxon:7159</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>chromosome</INSDQualifier_name>
<INSDQualifier_value>2</INSDQualifier_value>
</INSDQualifier>
</INSDFeature_quals>
</INSDFeature>
</INSDSeq_feature-table>
<INSDSeq_contig>join(AAGE02003964.1:1..7226,gap(unk100),AAGE02003965.1:1..6376,gap(unk100),AAGE02003966.1:1..16236,gap(4301),AAGE02003967.1:1..174188,gap(unk100),AAGE02003968.1:1..24199,gap(1396),AAGE02003969.1:1..104064,gap(29770),AAGE02003970.1:1..12303,gap(56956),AAGE02003971.1:1..2368,gap(12542),AAGE02003972.1:1..29888,gap(1379),AAGE02003973.1:1..98175,gap(unk100),AAGE02003974.1:1..13180,gap(unk100),AAGE02003975.1:1..2872,gap(unk100),AAGE02003976.1:1..18626,gap(unk100),AAGE02003977.1:1..52378,gap(151),AAGE02003978.1:1..153108,gap(901),AAGE02003979.1:1..3583,gap(unk100),AAGE02003980.1:1..32852,gap(unk100),AAGE02003981.1:1..68239,gap(unk100),AAGE02003982.1:1..61056,gap(unk100),AAGE02003983.1:1..21852,gap(unk100),AAGE02003984.1:1..49659,gap(unk100),AAGE02003985.1:1..33070,gap(315),AAGE02003986.1:1..411266,gap(unk100),AAGE02003987.1:1..2985,gap(unk100),AAGE02003988.1:1..38365,gap(159),AAGE02003989.1:1..110697,gap(890),AAGE02003990.1:1..22405,gap(2299),AAGE02003991.1:1..7510,gap(187),AAGE02003992.1:1..447937,gap(263),AAGE02003993.1:1..92770,gap(1409),AAGE02003994.1:1..2258,gap(132),AAGE02003995.1:1..5605,gap(unk100),AAGE02003996.1:1..3451,gap(2717),AAGE02003997.1:1..20215,gap(unk100),AAGE02003998.1:1..35683,gap(514),AAGE02003999.1:1..307288,gap(unk100),AAGE02004000.1:1..71359,gap(433),AAGE02004001.1:1..10550,gap(unk100),AAGE02004002.1:1..289125,gap(unk100),AAGE02004003.1:1..45622,gap(unk100),AAGE02004004.1:1..35927)</INSDSeq_contig>
<INSDSeq_xrefs>
<INSDXref>
<INSDXref_dbname>BioProject</INSDXref_dbname>
<INSDXref_id>PRJNA12434</INSDXref_id>
</INSDXref>
<INSDXref>
<INSDXref_dbname>BioSample</INSDXref_dbname>
<INSDXref_id>SAMN02953616</INSDXref_id>
</INSDXref>
</INSDSeq_xrefs>
Use an XPath expression or a CSS selector, depending on the language and the libraries you use.
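As one possible illustration, here is a minimal Python sketch that uses only the standard library's xml.etree.ElementTree. It assumes each sample is an `INSDSeq` element (possibly wrapped in a root element), which matches the tag names in the excerpt but is not confirmed for the full download. The function name and the dictionary keys are hypothetical. Since a record can contain several `INSDReference_journal` entries, the sketch keeps every entry that starts with "Submitted" and leaves the final choice to you.

```python
import xml.etree.ElementTree as ET

def extract_records(xml_text):
    """Return one dict per INSDSeq element with the four requested fields."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() also matches the root itself, so a lone INSDSeq works too.
    for seq in root.iter("INSDSeq"):
        journals = [j.text.strip()
                    for j in seq.iter("INSDReference_journal") if j.text]
        records.append({
            "accession_version": seq.findtext("INSDSeq_accession-version"),
            "update_date": seq.findtext("INSDSeq_update-date"),
            "create_date": seq.findtext("INSDSeq_create-date"),
            # Direct-submission entries, e.g. "Submitted (07-OCT-2005) ..."
            "submitted": [j for j in journals if j.startswith("Submitted")],
        })
    return records
```

To read from a file instead of a string, `ET.parse(path).getroot()` yields the same kind of root element.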

Fuzzy match of strings using python

I have a record set as below.
"product_id"|"prod_descr"|"status"|"last_upd_time"
"102317"|"TELMINORM CH 40/12.5MG TAB 10'S"|"A"|"2016-08-31 15:02:06.609879"
"99996"|"BECOSTAR TAB 15'S"|"A"|"2016-09-05 18:20:25"
"99997"|"SUPRADYN TABLET15S"|"A"|"2016-09-06 09:05:24"
"120138"|"LASILACTONE 50MG TABLET 10'S"|"A"|"2016-09-07 12:01:05"
"101921"|"TELMA 20MG TABLET 15S"|"A"|"2016-08-31 15:02:06.609879"
"1220"|"ACNESTAR SOAP 75GM"|"A"|"2016-08-31 15:02:06.609879"
"120147"|"AMANTREL CAPSULES 15S"|"A"|"2016-09-09 09:54:35"
"113446"|"VOLIX 0 3MG TABLET 15S"|"A"|"2016-08-31 15:02:06.609879"
"121294"|"maxifer xt syrup "|"A"|"2016-09-29 15:32:40"
"120151"|"PIRITON CS SYRUP 100ML"|"A"|"2016-09-09 14:30:46"
"103481"|"TERBICIP SPRAY 30ML"|"A"|"2016-08-31 15:02:06.609879"
"96175"|"SORBITRATE 5MG TABLET 50S"|"A"|"2016-08-31 15:02:06.609879"
The set contains about a million records. I want to take the second field of each record, say "TELMINORM CH 40/12.5MG TAB 10'S" on row 2, make a fuzzy comparison with the rest of the records, and find out whether a similar record exists.
An example would be
TELMINORM CH 40/12.5MG TAB 10'S is the same as TELMINORM CH 40/12.5MG CAP 10'S. TAB/CAP stands for Tablet/Capsule. In this case it's a duplicate record.
To eliminate such duplicates I used the distance module: if the distance between two strings is less than 5, I write them to a file in the format below.
TELMINORM CH 40/12.5MG TAB 10'S - TELMINORM CH 80/12.5MG TAB 10'S, TELMINORM CH 40/12.5MG TAB 10'S, TELMINORM CH 40/12.5MG CAP 10'S
The logic I used does the trick, but it is very slow: it processes only 150 records in 1 hour.
I have used something like this
from fuzzywuzzy import fuzz
rank = fuzz.ratio("str_1", "str_2")
Then I check if the rank > 80 and proceed. This method seems to be faster than the distance module.
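Comparing every record with every other record grows quadratically, so with a million records the number of comparisons matters more than the speed of a single one. A common way to reduce it is "blocking": only compare records that share a cheap key. The sketch below is an assumption-laden illustration, not the poster's method. It uses the first word of the description as the blocking key and the standard library's difflib.SequenceMatcher in place of fuzzywuzzy, so its 0 to 1 scores need not match `fuzz.ratio` exactly. The function names are hypothetical.

```python
import csv
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def read_products(path):
    """Read (product_id, prod_descr) pairs from the pipe-delimited file."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="|", quotechar='"')
        return [(row["product_id"], row["prod_descr"]) for row in reader]

def similar_pairs(products, threshold=0.8):
    """Compare descriptions only within blocks that share the same first word."""
    blocks = defaultdict(list)
    for pid, descr in products:
        words = descr.upper().split()  # normalise case and whitespace
        if words:
            blocks[words[0]].append((pid, " ".join(words)))
    pairs = []
    for members in blocks.values():
        for (id_a, a), (id_b, b) in combinations(members, 2):
            score = SequenceMatcher(None, a, b).ratio()
            if score > threshold:
                pairs.append((id_a, id_b, score))
    return pairs
```

The trade-off is that two duplicates whose first words differ (for example, because of a typo in the brand name) land in different blocks and are never compared.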

Display Arabic text as separate characters (instead of cursive script) using CSS

To display license plates in Arabic, I wish to have each letter displayed without joining adjacent characters.
Is it possible to display Arabic text as separate characters, without the cursive script?
Unicode has dedicated isolated-form Arabic characters (in the Arabic Presentation Forms-B block, U+FE70 to U+FEFF) for just this kind of purpose.
It's all explained in this Wikipedia page.
A demonstration for the basic alphabet used in Modern Standard Arabic (contextual forms: Isolated, End, Middle, Beginning):
General Unicode | Isolated | End | Middle | Beginning | Name
0623 أ‎ | FE83 أ‎ | FE84 ـأ‎ | | | ʾalif
0628 ب‎ | FE8F ﺏ‎ | FE90 ـب‎ | FE92 ـبـ‎ | FE91 بـ‎ | bāʾ
062A ت‎ | FE95 ﺕ‎ | FE96 ـت‎ | FE98 ـتـ‎ | FE97 تـ‎ | tāʾ
062B ث‎ | FE99 ﺙ‎ | FE9A ـث‎ | FE9C ـثـ‎ | FE9B ثـ‎ | ṯāʾ
062C ج‎ | FE9D ﺝ‎ | FE9E ـج‎ | FEA0 ـجـ‎ | FE9F جـ‎ | ǧīm
062D ح‎ | FEA1 ﺡ‎ | FEA2 ـح‎ | FEA4 ـحـ‎ | FEA3 حـ‎ | ḥāʾ
062E خ‎ | FEA5 ﺥ‎ | FEA6 ـخ‎ | FEA8 ـخـ‎ | FEA7 خـ‎ | ḫāʾ
062F د‎ | FEA9 ﺩ‎ | FEAA ـد‎ | | | dāl
0630 ذ‎ | FEAB ﺫ‎ | FEAC ـذ‎ | | | ḏāl
0631 ر‎ | FEAD ﺭ‎ | FEAE ـر‎ | | | rāʾ
0632 ز‎ | FEAF ﺯ‎ | FEB0 ـز‎ | | | zayn/zāy
0633 س‎ | FEB1 ﺱ‎ | FEB2 ـس‎ | FEB4 ـسـ‎ | FEB3 سـ‎ | sīn
0634 ش‎ | FEB5 ﺵ‎ | FEB6 ـش‎ | FEB8 ـشـ‎ | FEB7 شـ‎ | šīn
0635 ص‎ | FEB9 ﺹ‎ | FEBA ـص‎ | FEBC ـصـ‎ | FEBB صـ‎ | ṣād
0636 ض‎ | FEBD ﺽ‎ | FEBE ـض‎ | FEC0 ـضـ‎ | FEBF ضـ‎ | ḍād
0637 ط‎ | FEC1 ﻁ‎ | FEC2 ـط‎ | FEC4 ـطـ‎ | FEC3 طـ‎ | ṭāʾ
0638 ظ‎ | FEC5 ﻅ‎ | FEC6 ـظ‎ | FEC8 ـظـ‎ | FEC7 ظـ‎ | ẓāʾ
0639 ع‎ | FEC9 ﻉ‎ | FECA ـع‎ | FECC ـعـ‎ | FECB عـ‎ | ʿayn
063A غ‎ | FECD ﻍ‎ | FECE ـغ‎ | FED0 ـغـ‎ | FECF غـ‎ | ġayn
0641 ف‎ | FED1 ف‎ | FED2 ـف‎ | FED4 ـفـ‎ | FED3 فـ‎ | fāʾ
0642 ق‎ | FED5 ﻕ‎ | FED6 ـق‎ | FED8 ـقـ‎ | FED7 قـ‎ | qāf
0643 ك‎ | FED9 ﻙ‎ | FEDA ـك‎ | FEDC ـكـ‎ | FEDB كـ‎ | kāf
0644 ل‎ | FEDD ﻝ‎ | FEDE ـل‎ | FEE0 ـلـ‎ | FEDF لـ‎ | lām
0645 م‎ | FEE1 ﻡ‎ | FEE2 ـم‎ | FEE4 ـمـ‎ | FEE3 مـ‎ | mīm
0646 ن‎ | FEE5 ن‎ | FEE6 ـن‎ | FEE8 ـنـ‎ | FEE7 نـ‎ | nūn
0647 ﻫ‎ | FEE9 ﻩ‎ | FEEA ـه‎ | FEEC ـهـ‎ | FEEB هـ‎ | hāʾ
0648 و‎ | FEED ﻭ‎ | FEEE ـو‎ | | | wāw
064A ي‎ | FEF1 ﻱ‎ | FEF2 ـي‎ | FEF4 ـيـ‎ | FEF3 يـ‎ | yāʾ
0622 آ‎ | FE81 ﺁ‎ | FE82 ـآ‎ | | | ʾalif maddah
0629 ة‎ | FE93 ﺓ‎ | FE94 ـة‎ | — | — | Tāʾ marbūṭah
0649 ى‎ | FEEF ﻯ‎ | FEF0 ـى‎ | — | — | ʾalif maqṣūrah
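If the plate text is generated by a program, the substitution can be automated before the text reaches the page. Here is a minimal Python sketch, assuming the isolated forms in Arabic Presentation Forms-B are what you want. It builds the mapping from the Unicode character database, where each isolated form carries a compatibility decomposition such as `<isolated> 0628`. The function names are hypothetical, and whether the result renders unjoined still depends on the font containing these glyphs.

```python
import unicodedata

def isolated_form_map():
    """Map base Arabic letters to their isolated presentation forms by
    scanning Arabic Presentation Forms-B (U+FE70..U+FEFF)."""
    mapping = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter isolated forms only, e.g. "<isolated> 0628";
        # ligatures and unassigned code points are skipped.
        if len(parts) == 2 and parts[0] == "<isolated>":
            mapping[chr(int(parts[1], 16))] = chr(cp)
    return mapping

def to_isolated(text):
    """Replace each Arabic letter with its isolated form; leave the rest as-is."""
    table = isolated_form_map()
    return "".join(table.get(ch, ch) for ch in text)
```

For example, `to_isolated("\u0628\u062A")` maps bāʾ and tāʾ to U+FE8F and U+FE95, the isolated forms listed in the table above.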

Resources