Turning a large character string into a dataframe [duplicate] - r

I'm getting data from an API in a very raw, messy text format. For reproducibility, here is the dput output:
people <- "Registrant UID,Status,Tracking Source,Tracking ID,Open Tracking ID,State API Submission Result,Language,Date of birth,Email address,US citizen?,Salutation,First name,Middle name,Last name,Name suffix,Home address,Home unit,Home city,Home County,Home state,Home zip code,Has mailing address?,Mailing address,Mailing unit,Mailing city,Mailing County,Mailing state,Mailing zip code,Party,Race,Phone,Phone type,Opt-in to RTV email?,Opt-in to RTV sms?,Opt-in to Partner email?,Opt-in to Partner SMS/robocall,Survey question 1,Survey answer 1,Survey question 2,Survey answer 2,Volunteer for RTV,Volunteer for partner,Ineligible reason,Pre-Registered,Started registration,Finish with State,Built via API,Has State License,Has SSN,VR Application Submission Modifications,VR Application Submission Errors,VR Application Status,VR Application Status Details,VR Application Status Imported DateTime,Submitted Via State API,Submitted Signature to State API,utm_source,utm_medium,utm_campaign,utm_term,utm_content,other_parameters,Change of Name,Prev Name Title,Prev First Name,Prev Middle Name,Prev Last Name,Prev Name Suffix,Registration Source,Registration Medium,Shift ID,Blocks Shift ID,Over 18 Affirmation,Preferred Language,State Flow Status,State API Transaction ID,Requested Assistance,Viewed Steps\n111,Complete,\"\",\"\",,Success: 333,English,4/13/sample#email.com,Yes,Ms.,FirstName,\"\",LastName,\"\",Street,\"\",City, State, Zip,No,\"\",,\"\",,,\"\",Political Party,\"\",111,,Yes,No,No,No,,,,,No,No,,No,DoB and Time,No,No,Yes,Yes,\"\",[],Approved,APPR - CHANGE APPLICATION,Date ,Yes,false,,,,,,amp=,No,,\"\",\"\",\"\",,Web,Submitted Via State API,,,Yes,\"\",complete,333,,\"state_registrants-edit,state_registrants-edit,state_registrants-edit,state_registrants-pending,state_registrants-complete\"\n111,Complete,\"\",\"\",,Success: 333,English,4/13/sample#email.com,Yes,Ms.,FirstName,\"\",LastName,\"\",Street,\"\",City, State, Zip,No,\"\",,\"\",,,\"\",Political 
Party,\"\",111,,Yes,No,No,No,,,,,No,No,,No,DoB and Time,No,No,Yes,Yes,\"\",[],Approved,APPR - CHANGE APPLICATION,Date ,Yes,false,,,,,,amp=,No,,\"\",\"\",\"\",,Web,Submitted Via State API,,,Yes,\"\",complete,333,,\"state_registrants-edit,state_registrants-edit,state_registrants-edit,state_registrants-pending,state_registrants-
n111,Complete,"","",,Success: 333,English,4/13/sample#email.com,Yes,Ms.,FirstName,"",LastName,"",Street,"",City, State, Zip,No,"",,"",,,"",Political Party,"",111,,Yes,No,No,No,,,,,No,No,,No,DoB and Time,No,No,Yes,Yes,"",[],Approved,APPR - CHANGE APPLICATION,Date ,Yes,false,,,,,,amp=,No,,"","","",,Web,Submitted Via State API,,,Yes,"",complete,333,,"state_registrants-edit,state_registrants-edit,state_registrants-edit,state_registrants-pending,state_registrants-
As you can probably tell, everything from Registrant UID to Viewed Steps is the header row (the column names). Some of the fields are empty, and some contain multiple commas; both are fine.
How exactly would I go about putting this into a neatly structured data frame?
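In R, read.csv accepts the string directly through its text argument, as in read.csv(text = people). For illustration, the same idea can be sketched in Python with the standard csv module. The function name parse_csv_text and the small sample string are hypothetical, and the sketch assumes that fields containing commas are quoted; unquoted commas such as "City, State, Zip" would still shift the columns.

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV held in a string (not a file) into a list of row dicts.

    The first line is used as the header. Quoted fields may contain
    commas, and empty fields come back as empty strings.
    """
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


# Hypothetical sample with an empty quoted field and a quoted field with commas.
sample = 'Registrant UID,Status,Viewed Steps\n111,Complete,"edit,pending,complete"\n112,"",\n'
rows = parse_csv_text(sample)
for row in rows:
    print(row)
```

Each row is keyed by the header names, so the result can be handed to a data frame constructor or inspected column by column.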

Error while using "EpiEstim" and "ggplot2" libraries

First of all, I must say I'm a complete beginner in R, so I apologize in advance for asking for help with such a simple task. My task is to plot COVID-19 cases for a certain period using data from a CSV file. Unfortunately, at the moment I cannot contact the person from the World Health Organization who provided the data and the script. I am left with an error that I cannot fix, neither by myself nor with the help of Google.
script.R
library(EpiEstim)
library(ggplot2)
COVID<-read.csv("dataset.csv")
res_parametric_si<-estimate_R(COVID$I,method="parametric_si",config=make_config(list(mean_si=4,std_si=3)))
plot(res_parametric_si)
dataset.csv
Date,Suspected per day,Total suspected,Discarded/pending,Confirmed per day,Total confirmed,Deaths per day,Deaths Total,Case fatality rate,Daily confirmed,Recovered per day,Recovered total,Active cases,Tested with PCR,# of PCR tests total,average tests/ 7 days,Inf HCW,Inf HCW/d,Vent HCW,Susp per day
01-Jul-20,1239,91172,45285,889,45887,12,1185,2.58%,889,505,20053,24649,11109,676684,10073,6828,63,,1239
02-Jul-20,1249,92421,45658,876,46763,27,1212,2.59%,876,505,20558,24993,13167,689851,9966,6874,46,,1249
03-Jul-20,1288,93709,46032,914,47677,15,1227,2.57%,914,597,21155,25295,11825,701676,9915.7,6937,63,,1288
04-Jul-20,926,94635,46135,823,48500,22,1249,2.58%,823,221,21376,25875,9934,711610,9957,6990,53,,926
05-Jul-20,680,95315,46272,543,49043,13,1262,2.57%,543,327,21703,26078,6696,718306,9963.7,7030,40,,680
06-Jul-20,871,96186,46579,564,49607,21,1283,2.59%,564,490,22193,26131,9343,727649,10303.9,7046,16,,871
07-Jul-20,1170,97356,46942,807,50414,23,1306,2.59%,807,926,23119,25989,13568,741217,10806,7092,46,,1170
Error
Error in process_I(incid) (script.R#4): incid must be a vector or a dataframe with either i) a column called 'I', or ii) 2 columns called 'local' and 'imported'.
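For the example data there seem to be two issues. First, COVID$I is NULL because the CSV has no column called I; after read.csv the daily counts are in COVID$Daily.confirmed, since read.csv replaces the space in "Daily confirmed" with a dot. Second, the data covers only 7 data points, while the default configuration assumes it can slide a window over more than 7 days. The following code worked for me (working in the sense that it does not throw an error).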
config <- make_config(incid = COVID$Daily.confirmed,
                      method = "parametric_si",
                      list(mean_si = 4, std_si = 3, t_start = c(2, 3), t_end = c(6, 7)))
res_parametric_si<-estimate_R(COVID$Daily.confirmed,method="parametric_si",config=config)
plot(res_parametric_si)
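The window arithmetic behind the answer can be checked with a small sketch. It assumes sliding windows that are 1-based and inclusive and that start no earlier than day 2; the weekly default width is my reading of the answer, not something confirmed against the EpiEstim source, and the function name sliding_windows is hypothetical.

```python
def sliding_windows(n_days, width=7, first_start=2):
    """Return every (t_start, t_end) pair, 1-based and inclusive, that fits in n_days.

    Assumption: windows start no earlier than first_start and span
    exactly `width` days.
    """
    windows = []
    t_start = first_start
    while t_start + width - 1 <= n_days:
        windows.append((t_start, t_start + width - 1))
        t_start += 1
    return windows


# With only 7 days of data, no 7-day window starting at day 2 fits.
print(sliding_windows(7))
# Shrinking the window to 5 days leaves exactly the two windows used in the answer.
print(sliding_windows(7, width=5))
```

Under these assumptions, 7 observations leave no room for a weekly window, while 5-day windows give the pairs (2, 6) and (3, 7), matching t_start = c(2,3) and t_end = c(6,7) above.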

getRetweeters() returns one id whereas getRetweetCount() returns 2 -- in twitteR package

I use the twitteR package and I am trying to retrieve the account ids of retweeters.
The retweet count and the list of retweeters do not always appear to be consistent.
For example, I retrieved a status (tweet) using
st<-showStatus("1058168768009043969")
retweeters(st$getId()) # returns "260857015"
st$getRetweetCount() # however returns 2
st$getRetweeters() # returns a known error
The error from getRetweeters() is the one described in "Using twitteR's getRetweeters method".
The Twitter site shows 2 retweets for this tweet:
https://twitter.com/ConsueloMack/status/1058168768009043969
To run this, you need valid API keys and an OAuth setup along these lines:
require('twitteR')
twapi<-read.csv("./coach_keys.json",sep=":",stringsAsFactors=F,header=F)
# in Linux you can obtain oauth as follows
setup_twitter_oauth(twapi[twapi$V1=="API_KEY",c("V2")],
                    twapi[twapi$V1=="API_SECRET_KEY",c("V2")],
                    twapi[twapi$V1=="ACCESS_TOKEN",c("V2")],
                    twapi[twapi$V1=="ACCESS_TOKEN_SECRET",c("V2")])
# then the above snippet can be run
I expected the retweeters method to return as many ids as indicated by getRetweetCount().
However, it does not. I am looking for pointers, especially if I am doing something wrong. Is this a common occurrence? Can someone show, for the ID above, how to retrieve a count and a list that are consistent with each other?
Thank you very much.

How to get QFileDialog to select and return multiple folders [duplicate]

I want to have the user be able to select multiple folders and then store the paths of those folders in a list.
How can I make that happen? My current QFileDialog looks like this:
str = QtGui.QFileDialog.getExistingDirectory(self, "Open Directory", "/folder/subfolder", QtGui.QFileDialog.DontResolveSymlinks)
But of course, it only lets me select one folder. How can I change it to select multiple folders and return them in a list?
As far as I know, you can't do that with the native file dialog.
There is, however, a workaround that avoids the native dialog:
# Assumes PyQt5; in PyQt4 these classes live in PyQt4.QtGui instead.
from PyQt5.QtWidgets import QAbstractItemView, QFileDialog, QListView, QTreeView

file_dialog = QFileDialog()
file_dialog.setFileMode(QFileDialog.DirectoryOnly)
file_dialog.setOption(QFileDialog.DontUseNativeDialog, True)
# Make it possible to select multiple directories in both internal views.
file_view = file_dialog.findChild(QListView, 'listView')
if file_view:
    file_view.setSelectionMode(QAbstractItemView.MultiSelection)
f_tree_view = file_dialog.findChild(QTreeView)
if f_tree_view:
    f_tree_view.setSelectionMode(QAbstractItemView.MultiSelection)
paths = []
if file_dialog.exec():
    # selectedFiles() returns the selected directory paths as a list.
    paths = file_dialog.selectedFiles()
This workaround is a bit clunky, but it's the best solution I know of other than rolling your own custom dialog.

How to connect data dictionaries to the unlabeled data

I'm working with some large government datasets from the Department of Transportation that are available as tab-delimited text files accompanied by data dictionaries. For example, the auto complaints file is a 670 MB file of unlabeled data (when unzipped), and comes with a dictionary. Here are some excerpts:
Last updated: April 24, 2014

FIELDS:
=======
Field#  Name       Type/Size  Description
------  ---------  ---------  --------------------------------------
1       CMPLID     CHAR(9)    NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
                              IS AN UPDATEABLE FIELD,THUS DATA FOR A
                              GIVEN RECORD POTENTIALLY COULD CHANGE FROM
                              ONE DATA OUTPUT FILE TO THE NEXT.
2       ODINO      CHAR(9)    NHTSA'S INTERNAL REFERENCE NUMBER.
                              THIS NUMBER MAY BE REPEATED FOR
                              MULTIPLE COMPONENTS.
                              ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
                              THIS NUMBER MAY BE REPEATED FOR MULTIPLE
                              PRODUCTS OWNED BY THE SAME COMPLAINANT.
Some of the fields have foreign keys listed like so:
21      CMPL_TYPE  CHAR(4)    SOURCE OF COMPLAINT CODE:
                              CAG  =CONSUMER ACTION GROUP
                              CON  =FORWARDED FROM A CONGRESSIONAL OFFICE
                              DP   =DEFECT PETITION,RESULT OF A DEFECT PETITION
                              EVOQ =HOTLINE VOQ
                              EWR  =EARLY WARNING REPORTING
                              INS  =INSURANCE COMPANY
                              IVOQ =NHTSA WEB SITE
                              LETR =CONSUMER LETTER
                              MAVQ =NHTSA MOBILE APP
                              MIVQ =NHTSA MOBILE APP
                              MVOQ =OPTICAL MARKED VOQ
                              RC   =RECALL COMPLAINT,RESULT OF A RECALL INVESTIGATION
                              RP   =RECALL PETITION,RESULT OF A RECALL PETITION
                              SVOQ =PORTABLE SAFETY COMPLAINT FORM (PDF)
                              VOQ  =NHTSA VEHICLE OWNERS QUESTIONNAIRE
There are import instructions for Microsoft Access, which I don't have and would not use if I did. But I THINK this data dictionary was meant to be machine-readable.
My question: Is this data dictionary a standard format of some kind? I've tried to Google around, but it's hard to do so without the right terminology. I would like to import into R, though I'm flexible so long as it can be done programmatically.
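Whether or not the layout follows a named standard, the excerpts are regular enough to parse programmatically. Below is a minimal Python sketch under assumptions drawn only from the excerpts: a field entry starts with a number, a name, and a type such as CHAR(9); code lines look like "CAG =CONSUMER ACTION GROUP"; anything else continues the previous description. The function name parse_dictionary and the two regular expressions are hypothetical and may need adjusting for other type names in the full file.

```python
import re

# Start of a field entry: number, name, TYPE(size), then the description.
FIELD_RE = re.compile(r"^\s*(\d+)\s+(\w+)\s+([A-Z]+\(\d+\))\s+(.*)$")
# A code line such as "CAG =CONSUMER ACTION GROUP".
CODE_RE = re.compile(r"^\s*([A-Z0-9]+)\s*=\s*(.+)$")


def parse_dictionary(text):
    """Turn the dictionary text into a list of field records."""
    fields = []
    for line in text.splitlines():
        field = FIELD_RE.match(line)
        if field:
            fields.append({
                "number": int(field.group(1)),
                "name": field.group(2),
                "type": field.group(3),
                "description": field.group(4).strip(),
                "codes": {},
            })
            continue
        if not fields:
            continue  # header lines before the first field entry
        code = CODE_RE.match(line)
        if code:
            fields[-1]["codes"][code.group(1)] = code.group(2).strip()
        elif line.strip():
            # Anything else is a continuation of the current description.
            fields[-1]["description"] += " " + line.strip()
    return fields


excerpt = """Field# Name Type/Size Description
------ --------- --------- --------------------------------------
1 CMPLID CHAR(9) NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
IS AN UPDATEABLE FIELD,THUS DATA FOR A
21 CMPL_TYPE CHAR(4) SOURCE OF COMPLAINT CODE:
CAG =CONSUMER ACTION GROUP
DP =DEFECT PETITION,RESULT OF A DEFECT PETITION
"""
fields = parse_dictionary(excerpt)
column_names = [f["name"] for f in fields]
print(column_names)
```

The list of names can then serve as column names when reading the tab-delimited data file, and each codes mapping can be used to label the coded values.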

How to import Geonames into SQLite?

I need to import the Geonames database (http://download.geonames.org/export/dump/) into SQLite (file is about a gigabyte in size, ±8,000,000 records, tab-delimited).
I'm using the SQLite tools built into Mac OS X, accessed through the terminal. All goes well until record 381174 (tested with an older file; the exact number varies slightly depending on the version of the Geonames database, as it is updated every few days), where the error "expected 19 columns of data but found 18" is displayed.
The exact line causing the problem is:
126704 Gora Kyumyurkey Gora Kyumyurkey Gora Kemyurkey,Gora
Kyamyar-Kup,Gora Kyumyurkey,Gora Këmyurkëy,Komur Qu",Komur
Qu',Komurkoy Dagi,Komūr Qū’,Komūr Qū”,Kummer Kid,Kömürköy Dağı,kumwr
qwʾ,كُمور
قوء 38.73335 48.24133 T MT AZ AZ 00 0 2471 Asia/Baku 2014-03-05
I've tested various countries separately, and the western countries all completely imported without a problem, causing me to believe the problem is somewhere in the exotic characters used in some entries. (I've put this line into a separate file and tested with several other database-programs, some did give an error, some imported without a problem).
How do I solve this error, or are there other ways to import the file?
Thanks for your help and let me know if you need more information.
Regarding the question title, a preliminary search turned up the following.
The GeoNames format description ("tab-delimited text in utf8 encoding"):
    https://download.geonames.org/export/dump/readme.txt
Some libraries (untested):
    Perl: https://github.com/mjradwin/geonames-sqlite (+ autocomplete demo JavaScript/PHP)
    PHP: https://github.com/robotamer/geonames-to-sqlite
    Python: https://github.com/commodo/geonames-dump-to-sqlite
A GUI (mentioned by @charlest):
    https://github.com/sqlitebrowser/sqlitebrowser/
The SQLite tools have import capability as well:
    https://sqlite.org/cli.html#csv_import
It looks like a bi-directional text issue. "كُمور قوء" is expected to be at the end of the comma-separated alternate name list. However, on account of it being dextrosinistral (or RTL), it's displaying on the wrong side of the latitude and longitude values.
I don't have visibility of your import method, but it seems likely to me that that's why it thinks a column is missing.
I found the same problem using the script from the geonames forum here: http://forum.geonames.org/gforum/posts/list/32139.page
Despite adjusting the script to run on Mac OS X (Sierra 10.12.6) I was getting the same errors. But thanks to the script author since it helped me get the sqlite database file created.
After a little while I decided to use the sqlite DB Browser for SQLite (version 3.11.2) rather than continue with the script.
I had errors with this method as well and found that I had to set the "Quote character" setting in the import dialog to blank. Once that was done, the import of the FULL allCountries.txt file ran to completion, taking just under an hour on my MacBook Pro (an old one, but with an SSD).
Although I have not dived in deeper I am assuming that the geonames text files must not be quote parsed in any way. Each line simply needs to be handled as tab delimited UTF-8 strings.
At the time of writing allCountries.txt is 1.5GB with 11,930,517 records. SQLite database file is just short of 3GB.
Hope that helps.
UPDATE 1:
Further investigation has revealed that it is indeed due to the embedded quotes in the geonames files, and looking here: https://sqlite.org/quirks.html#dblquote shows that SQLite has problems with quotes. Hence you need to be able to switch off quote parsing in SQLite.
Despite the 3.11.2 version of DB Browser being based on SQLite 3.27.2 which does not have the required mods to ignore the quotes, I can only assume it must be escaping the quotes when you set the "Quote character" to blank.
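The "no quote parsing" approach can also be scripted. The following is a minimal Python sketch, not the method used by any of the tools above: it reads tab-delimited lines with csv.QUOTE_NONE so that embedded quote characters are treated as ordinary text, and inserts them into SQLite. The table name geoname and the generic column names c1 to c19 are hypothetical placeholders; the column count of 19 is taken from the error message in the question.

```python
import csv
import sqlite3

N_COLUMNS = 19  # column count reported in the error message in the question


def import_tsv(conn, lines, table="geoname"):
    """Insert tab-delimited lines into `table` with quote parsing disabled."""
    cols = ", ".join(f"c{i} TEXT" for i in range(1, N_COLUMNS + 1))
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    placeholders = ", ".join("?" * N_COLUMNS)
    # QUOTE_NONE: a double quote inside a field is just another character.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows with an unexpected column count are skipped in this sketch.
    rows = (row for row in reader if len(row) == N_COLUMNS)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()


# Hypothetical one-line sample with an embedded double quote in the 4th field.
sample_fields = ["126704", "Gora Kyumyurkey", "Gora Kyumyurkey",
                 'Komur Qu",Komur Qu\'', "38.73335", "48.24133", "T", "MT",
                 "AZ", "", "00", "", "", "", "0", "", "2471", "Asia/Baku",
                 "2014-03-05"]
conn = sqlite3.connect(":memory:")
import_tsv(conn, ["\t".join(sample_fields) + "\n"])
stored = conn.execute("SELECT c1, c4, c19 FROM geoname").fetchall()
print(stored)
```

For the real file, the lines argument would be a file object opened with encoding="utf-8" and newline="", so the file is streamed rather than loaded into memory.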
