read.table unable to read tab delimited file? - r

I'm having trouble reading this table into R:
http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt
I tried all of the following:
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt")
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=7,header=FALSE)
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=8,header=FALSE)
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=10,header=FALSE)
If I tell it that the separator is a tab, I get the wrong table:
d = read.table(file="http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",header=FALSE,skip=7,sep="\t")
The only thing that seems to work is readLines, but then I don't know how to get a data.frame out of the lines.
d = readLines("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt")
Any suggestions? Thanks.

I agree that read.fwf will work, once you've worked out the widths.
But, yeah, I just hate people who allow whitespace inside elements (e.g. "South Dakota"). One other thing you can do is edit the source text file, replacing every run of two or more spaces (the regex " {2,}") with a tab. That will leave the state names as-is but give you a workable delimiter.
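The same replacement can be done inside R instead of a text editor. This is a minimal sketch, assuming the fields are separated by at least two spaces and that no value contains two consecutive spaces; the sample rows are made up for illustration and are not copied from the Census file.

```r
# Hypothetical rows in the same spirit as the file: columns padded with
# several spaces, and a name that contains a single space.
raw <- c(
  "  1   3   00   Midwest Region",
  "  2   4   46   South Dakota",
  "  4   8   35   New Mexico"
)

# Strip leading/trailing padding, then turn every run of 2+ spaces into a tab.
tabbed <- gsub(" {2,}", "\t", trimws(raw))

# Now a tab is an unambiguous delimiter. colClasses = "character" keeps
# codes such as "00" from being converted to the number 0.
d <- read.table(text = tabbed, sep = "\t", header = FALSE, quote = "",
                colClasses = "character", stringsAsFactors = FALSE)
print(d)
```

For the real file you would feed the output of readLines (minus the header lines) into the same gsub call.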

Related

How to edit hidden character in String

This is how "textparcali" looked in the RStudio Source Editor.
On textparcali (a tbl_df), I ran the following code to delete single-character words.
textparcali$word<-gsub("\\W*\\b\\w\\b\\W*",'', textparcali$word)
But the result of the deletion was odd. You can see the picture below. Please note lines 67 and 50.
Everything was fine for line 50 and lines like it. However, this was not the case for line 67 (and I think there are others like it).
I focused on one line (67) to understand why the deletion went wrong. I had already seen what this line contains in the editor, but I also wanted to look at it in the console, so I typed the following code into the console.
textparcali$word[67]
The word on line 67 looks different in the console. It contains a character that is lost when you copy and paste, but that surprisingly shows up in the console:
I am posting it as a picture because this character disappears after copy-paste.
You can download the file containing this character from the link below. However, you should open it with Notepad++.
Character.txt
So gsub did its job correctly. How is that possible? What is this character called? When I try to write code that removes it, the " sign changes and the character is not deleted.
The command textparcali$word<-gsub('[[:punct:]]+',' ',textparcali$word) does not work either.
What explains this behavior? Is there a way to remove this character? What caused it? I know I have asked a lot of questions.
Thank you all.
(I apologize for the bad scribbles in the pictures.)
I found the surprise character:
Combining Dot Above Right (U+0358) ͘
The following code removes this character.
c<-"surprise character"
c
[1] "\u0358"
textparcali$word<-gsub("\u0358","",textparcali$word,ignore.case = FALSE)
textparcali$word<-gsub("\u0307","",textparcali$word,ignore.case = FALSE)
Code point U+0307 (Combining Dot Above) did the job for me. However, you should determine the actual code point in your own data; if you use the wrong one, nothing is removed.
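One way to determine the actual code point is to list the code points of the suspicious string with utf8ToInt. This is a small sketch on a made-up word (an "i" followed by U+0307); the variable names are only for illustration, and with the real data you would pass textparcali$word[67] instead.

```r
# Hypothetical example word: "i" + combining dot above (U+0307) + "stanbul".
word <- "i\u0307stanbul"

# utf8ToInt returns one integer per Unicode code point, so invisible
# combining marks show up as separate entries.
codes <- utf8ToInt(word)
print(sprintf("U+%04X", codes))

# Once the code point is known, remove it as a literal (non-regex) string.
cleaned <- gsub("\u0307", "", word, fixed = TRUE)
print(cleaned)
```

Any entry in the 0300 to 036F range is a combining diacritical mark, which is why it seems to attach itself to the neighboring character in the editor.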
More detailed information can be found in the links below.
https://gist.github.com/ngs/2782436
https://www.charbase.com/0358-unicode-combining-dot-above-right
Thanks a lot!

Improperly formatted CSV, how to repair?

I have a csv, and each line reads as follows:
"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,,,,,,,,,,,,,,,,,,,,,,
Is there a program I can use to clean this up? I'm trying to import it to wordpress and map it to current fields, but it isn't functioning properly. Any suggestions?
Just use search and replace in this case: remove the commas at the end, then replace the remaining commas with "," (quote, comma, quote) so that every field ends up quoted.
If anyone else has the same issue, note that this solution only works with data much like the example given. If the data contains a lot of text with commas that need to be kept, a blanket search and replace on commas will not work. Using a regex would be the next option, and that can be done in Notepad++.
However, I think the regex pattern depends on the data, so there is not much point in creating an example.
PHP could also be used to explode each line: remove the values that match one of several regexes (e.g. URL, money), and what is left could be (depending on the data again) just a block of text. That approach may not work if there are two or more columns with a lot of text.
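The search-and-replace steps can also be scripted. Here is a sketch in R, assuming (as in the example line) that no field contains a comma of its own; the row is the one from the question, with the run of trailing commas shortened.

```r
# The broken row: the whole record is wrapped in one pair of quotes,
# followed by a run of empty trailing fields.
raw <- '"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,'

# Step 1: remove the commas at the end of the line.
fixed_line <- sub(",+$", "", raw)

# Step 2: replace each remaining comma with quote-comma-quote, so every
# field is quoted. The doubled quotes inside the iframe stay valid escapes.
fixed_line <- gsub(",", '","', fixed_line, fixed = TRUE)

# Check the result by parsing it as ordinary CSV.
row <- read.csv(text = fixed_line, header = FALSE, stringsAsFactors = FALSE)
print(ncol(row))
print(row$V5)
```

For a whole file, the same two calls can be applied to the vector returned by readLines and written back with writeLines.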

Aptana: find and replace tabs (with spaces)

I am belatedly setting up Aptana to use four spaces instead of tabs. I've made the necessary changes to the preferences so every new tab inserts four spaces.
All the existing tabs remain, however, and so I get "mixed spaces and tabs" errors. How can you do a Replace All to fix this? I've tried ^t, <TAB> etc., but it just searches for these as normal strings. What are the correct ways to specify a space and a tab?
I found myself in a similar situation, and copying the whole source and pasting it back helped. In your source file press Ctrl+A, Ctrl+C, then Del to delete everything, and Ctrl+V to paste it again. You will get only spaces, provided you have set that in the options as you mentioned above.
There's an option in the refactor source menu to convert between tabs and space-tabs.
This worked for me: Edit > Find/Replace... (CTRL+F), check Regular Expressions, type \t in the Find field and 4 spaces in the Replace field (I use 4 spaces for a tab).
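If you would rather fix a file outside the editor, the same \t replacement can be scripted. This is a rough sketch in R using a temporary file as a stand-in for the real source file; note that it replaces every tab with exactly four spaces and does not align to tab stops.

```r
# Stand-in source file with a tab-indented line (hypothetical content).
path <- tempfile(fileext = ".py")
writeLines(c("def f():", "\treturn 1"), path)

# Read the file, replace each literal tab with four spaces, write it back.
src <- readLines(path)
src <- gsub("\t", "    ", src, fixed = TRUE)
writeLines(src, path)

print(readLines(path))
```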

read.fwf and the number sign

I am trying to read this file (3.8mb) using its fixed-width structure as described in the following link.
This command:
a <- read.fwf('~/ccsl.txt',c(2,30,6,2,30,8,10,11,6,8))
Produces an error:
line 37 did not have 10 elements
After replicating the issue with different values of the skip option, I figured out that the lines causing the problem all contain the "#" symbol.
Is there any way to get around it?
As @jverzani already commented, the problem is probably that the # sign is often used as the character that signals a comment. Setting the comment.char argument of read.fwf to something other than # could fix the problem. I'll leave my answer below as a more general approach that you can use for any character that causes problems (e.g. the 's in the Dutch city name 's Gravenhage).
I've had this problem occur with other symbols. The approach I took was to simply replace the # with either nothing or a character that does not generate the error. In my case it was no problem to simply replace the character, but this might not be possible in your case.
So my approach would be to delete the symbol that generates the error, or replace it with another character. This can be done in a text editor (find and replace), in an R script, or with Linux tools such as grep and sed. If you want to do this in an R script, use scan or readLines to read the lines. Once the text is in memory, you can use sub (or gsub, for every occurrence) to replace the character.
If you cannot drop the character, I would try the following approach: replace it with a character that does not generate an error, read the data into R using read.fwf, and finally replace that character with # again.
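The replace, read, restore round trip described above might look like this in R. It is a sketch on a tiny made-up fixed-width file; the placeholder "~" and the widths are assumptions, and the placeholder must not occur anywhere in the real data.

```r
# Hypothetical fixed-width file: 4-character code, 5-character name.
path <- tempfile()
writeLines(c("AB#1Gouda", "CD22Hague"), path)

lines <- readLines(path)
placeholder <- "~"
# Refuse to continue if the placeholder already occurs in the data.
stopifnot(!any(grepl(placeholder, lines, fixed = TRUE)))

# Swap "#" for the placeholder and write a cleaned copy.
clean_path <- tempfile()
writeLines(gsub("#", placeholder, lines, fixed = TRUE), clean_path)

# Read the cleaned copy, then put "#" back in every character column.
a <- read.fwf(clean_path, widths = c(4, 5), stringsAsFactors = FALSE)
a[] <- lapply(a, function(col) {
  if (is.character(col)) gsub(placeholder, "#", col, fixed = TRUE) else col
})
print(a)
```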
Following up on the answer above: to get all characters to be read as literals, use both comment.char="" and quote="" in the call to read.fwf (the latter takes care of @PaulHiemstra's problem with single quotes in Dutch proper nouns). This is documented in ?read.table.
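A short sketch of that call on a made-up two-line file (the widths and contents are assumptions, not the layout of ccsl.txt). read.fwf passes extra arguments on to read.table, which is where comment.char and quote are handled.

```r
# Hypothetical fixed-width file containing both "#" and a single quote.
path <- tempfile()
writeLines(c("AB#1's Gr", "CD22Hague"), path)

# comment.char = "" stops "#" from starting a comment;
# quote = "" stops "'" from opening a quoted string.
a <- read.fwf(path, widths = c(4, 5),
              comment.char = "", quote = "",
              stringsAsFactors = FALSE)
print(a)
```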

repair data in csv file

I have a huge csv file, separated by commas, and I want to do an analysis with glm in R.
In one column some values contain an embedded comma, something like: bla,blabla
When reading the file in R with read.csv.sql I get an error message:
RS-DBI driver: (RS_sqlite_import: ./agp.csv line 47612 expected 37 columns of data but found 38)
This is due to the 'extra' comma in some of the values; only some rows are affected, not the whole column.
How can I fix this? I want to remove this superfluous comma.
Thanks for the reaction,
André
The CSV format is very simple and can easily be hand edited. In order to include a comma in a value, you must surround the value with quotes. Try this: "bla,blabla". If that data happens to contain any quotes, e.g. blah,"thequotedblah",blah, those quotes need to be escaped with another quote, like this: "blah,""thequotedblah"",blah".
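To see these quoting rules in action, here is a small sketch in R with made-up column names and rows; read.csv treats a quoted comma as data and a doubled quote as a literal quote.

```r
# Hypothetical CSV text: the second column contains commas and quotes.
good <- c(
  'id,comment,score',
  '1,"bla,blabla",10',
  '2,"blah,""thequotedblah"",blah",20'
)

d <- read.csv(text = good, stringsAsFactors = FALSE)
print(d$comment)
```

Both rows parse into three columns, so the embedded commas no longer shift the field count.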
Although there is no official standard around it, there isn't much to the CSV format. Wikipedia has a great CSV reference that I have personally used to implement CSV support in applications. Spend 5-10 minutes reading it and you'll know everything you ever need to know to manually create/read/repair CSV data.
Is it just this one line that contains a non-quoted comma - or are there several such lines? Editing the .csv with an editor that can handle large files (e.g. Ultraedit) to sanitize that one record would certainly help. Asaph's suggestion of quoting is also a good 'un.
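To find out how many lines are affected before editing anything, count.fields can report the number of fields on each line. This is a sketch on a small made-up file; for the real data you would point it at agp.csv and compare against the expected 37 columns.

```r
# Hypothetical file: line 3 has an unquoted comma inside a value.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,x,2", "3,bla,blabla,4", "5,y,6"), path)

# Number of comma-separated fields on every line (quotes are respected).
n <- count.fields(path, sep = ",", quote = "\"", comment.char = "")

# Lines whose field count differs from the header's.
bad <- which(n != n[1])
print(bad)
```

If only a handful of line numbers come back, fixing them by hand in an editor is practical; if there are many, quoting the column at the source is the better route.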
