Export into text in R, problem with html special characters


I am trying to export parts of a data frame in R into a .txt file, using rio::export.
The exported part is a text string that contains characters such as >, ≥ and ". Some of these show up in the text file as &lt;, &quot; or &gt;, which is not what I need (I want them written to the text file as the literal greater-than or less-than signs, etc.).
When I looked these up online, I found them described as "HTML special characters", but I could not find a solution. I believe it has something to do with the encoding. Is there a way to fix that by adding arguments to rio::export, or do you suggest using another package?
I am using the code:
rio::export(df, "df.txt")
I also tried the following, from the question "How to save R output data to a text file", but it still gives the same problem:
cat(df$text, file = "myfile.txt") where df$text is the column that has the items I want to export.
Thank you
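
One possible approach, sketched under the assumption that the strings in df$text already contain HTML entities (such as &gt; or &quot;) before they are exported: decode the entities first, then write the text with an explicit UTF-8 encoding. The helper decode_entities and the sample data are made up for illustration and cover only a handful of entities; this is not an option of rio::export.

```r
# Hypothetical helper: decode a few common HTML entities using base R only.
decode_entities <- function(x) {
  entities <- c(
    "&lt;"   = "<",
    "&gt;"   = ">",
    "&quot;" = "\"",
    "&ge;"   = "\u2265",
    "&le;"   = "\u2264",
    "&amp;"  = "&"      # decoded last so "&amp;lt;" becomes "&lt;", not "<"
  )
  for (entity in names(entities)) {
    x <- gsub(entity, entities[[entity]], x, fixed = TRUE)
  }
  x
}

# Made-up sample data standing in for the real data frame.
df <- data.frame(
  text = c("BMI &gt; 30", "age &ge; 65", "&quot;quoted&quot;"),
  stringsAsFactors = FALSE
)

# Write the decoded strings as UTF-8 so that characters like the
# greater-than-or-equal sign survive on any platform.
out <- tempfile(fileext = ".txt")
con <- file(out, open = "w", encoding = "UTF-8")
writeLines(decode_entities(df$text), con)
close(con)
```

If the entities are only introduced at export time rather than being present in the data, this sketch will not help and the cause lies elsewhere.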

Related

R write fixed width columns to csv (in the plain text file)

I write data frames to csv files using write.csv(). When this is done, the output when viewed in a plain text editor, in particular vi or notepad++, shows no spacing between the column content and the commas, resulting in it being relatively hard to read. For example, the columns are not lined up down the page.
I have negative interest in using Excel to view the csv files. I am definitely not looking for a suggestion for a csv viewer. Nor do I want instructions on how to modify the plain text file afterward. Padding needs to be spaces, not tabs.
I am interested in how to get R to line up the columns in the plain text csv file so that they are easier to read using a non specialized plain text editor.
I could (and might) write my own routine that converts everything to some fixed width string format and print that. But, I would prefer to find that this is an option within write.csv() or similar common output library call.
[I just this moment found out about sprintf in R, and that might be the best answer to this conundrum].
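
A minimal sketch of the "convert everything to fixed-width strings" idea, using formatC() to pad each column (header included) with spaces. The function name pad_columns and the sample data are hypothetical; as far as I know, write.csv() itself has no alignment option.

```r
# Hypothetical helper: pad every column (including its header) to a common
# width with spaces, then join the columns with a comma.
# Note: values are not quoted, so a value that contains a comma would break
# the layout.
pad_columns <- function(df, sep = ", ") {
  cols <- lapply(names(df), function(nm) {
    cells <- c(nm, trimws(format(df[[nm]])))               # header + values as text
    formatC(cells, width = max(nchar(cells)), flag = "-")  # left-justify with spaces
  })
  do.call(paste, c(cols, sep = sep))
}

# Made-up sample data.
df <- data.frame(name = c("a", "bbbb"), value = c(1.5, 22.25))

aligned <- pad_columns(df)
out <- tempfile(fileext = ".csv")
writeLines(aligned, out)
```

When reading such a file back, read.csv(out, strip.white = TRUE) should remove the padding from unquoted character fields.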

Missing delimiter error when importing html text

Playing with Azure Machine Learning using the Designer and am getting a "Delimiter not found" error when importing my data.
I originally started with a few hundred html files stored as azure blobs. Each file would be considered a single row of text, however, I had no luck importing these files for further text analytics.
I created a Data Factory job that imported each file, stripped all the tabs, quotes, cr/lf from the text, added a column for the file name and stored it all as a combined tab-delimited file. In notepad++ I can confirm that the format is FileName tab HtmlText. This is the file I'm trying to import into ML and getting the missing delimiter message as I'm trying to define the import module.
Here is the error when I try and create a dataset:
{
"message": "'Delimiter' is not specified or invalid."
}
Question 1: Is there a better way to do text analytics on a large collection of html files?
Question 2: Is there a format I need to use in my combined .tsv file that works?
Question 3: Is there maybe a max length to the string column? My html can be tens of thousands of characters long.

You're right that it might be line length, but my guess is that there are still some special characters (i.e. anything starting with \) that aren't properly escaped or removed. How did you scrape and strip the text data? Have you tried using beautifulsoup?

Import data csv with particular quotes in R

I have a csv like this:
"Data,""Ultimo"",""Apertura"",""Massimo"",""Minimo"",""Var. %"""
"28.12.2018,""86,66"",""86,66"",""86,93"",""86,32"",""0,07%"""
What is the solution for importing correctly please?
I tried with read.csv("IT000509408=MI Panoramica.csv", header=T,sep=",", quote="\"") but it doesn't work.

Each row in your file is encoded as a single csv field.
So instead of:
123,"value"
you have:
"123,""value"""
To fix this you can read the file as csv (which will give you one field per row without the extra quotes), and then write the full value of that field to a new file as plain text (without using a csv writer).
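
A sketch of the same unwrapping done with plain string operations in R instead of a csv reader (a swapped-in technique, not the exact steps above): strip the outer quotes, collapse the doubled quotes, then parse the result as a normal csv. The two sample lines are copied from the question; in practice they would come from readLines() on the real file.

```r
# The two lines from the question, each wrapped as one big quoted field.
raw <- c(
  '"Data,""Ultimo"",""Apertura"",""Massimo"",""Minimo"",""Var. %"""',
  '"28.12.2018,""86,66"",""86,66"",""86,93"",""86,32"",""0,07%"""'
)

# Step 1: drop the outer pair of quotes around each line.
inner <- sub('^"(.*)"$', "\\1", raw)

# Step 2: turn the escaped (doubled) quotes back into single quotes.
inner <- gsub('""', '"', inner, fixed = TRUE)

# Step 3: parse as an ordinary csv; dec = "," because the numbers use a
# decimal comma. check.names = FALSE keeps the header "Var. %" as it is.
prices <- read.csv(text = inner, dec = ",", check.names = FALSE,
                   stringsAsFactors = FALSE)
```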

characters converted to dates when using write.csv

my data frame has a column A with strings in character form
> df$A
[1] "2-60", "2-61", "2-62", "2-63" etc
I saved the table using write.csv, but when I open it with Excel column A appears formatted as date:
Feb-60
Feb-61
Feb-62
Feb-63
etc
Does anyone know what I can do to avoid this?
I tweaked the arguments of write.csv but nothing worked, and I can't seem to find an example in Stack Overflow that helps solve this problem.

As said in the comments, this is Excel's behaviour, not R's, and it can't be deactivated:
Microsoft Excel is preprogrammed to make it easier to enter dates. For
example, 12/2 changes to 2-Dec. This is very frustrating when you
enter something that you don't want changed to a date. Unfortunately
there is no way to turn this off. But there are ways to get around it.
Microsoft Office Article
The first suggested way around it according to the article is not helpful, because it relies on changing the cell formatting, but that's too late when you open the .csv file in excel (it's already converted to an integer representing the date).
There is, however, a useful tip:
If you only have a few numbers to enter, you can stop Excel from
changing them into dates by entering:
An apostrophe (‘) before you enter a number, such as ’11-53 or ‘1/47. The apostrophe isn’t displayed in the cell after you press
Enter.
So you can make the data display as original by using
vec <- c("2-60", "2-61", "2-62", "2-63")
vec <- paste0("'", vec)
Just remember the values will still have the apostrophe if you read them again in R, so you might have to use
vec <- sub("'", "", vec)
This might not be ideal but at least it works.
One alternative is enclosing the text in =" ", as an excel formula, but that has the same end result and uses more characters.
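
A sketch of that formula alternative, assuming Excel evaluates a cell of the form ="2-60" as a text formula when it opens the csv. The column name A comes from the question; the rest is illustrative.

```r
vec <- c("2-60", "2-61", "2-62", "2-63")

# Wrap every value as ="value" so that Excel treats it as a text formula.
df <- data.frame(A = paste0('="', vec, '"'), stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)   # embedded quotes are written doubled

written <- readLines(out)               # raw lines, to inspect what Excel will see

# Reading the file back into R: strip the =" ... " wrapper again.
back <- read.csv(out, stringsAsFactors = FALSE)
restored <- sub('^="(.*)"$', "\\1", back$A)
```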

Another solution, a bit tedious: use Import Text File in Excel, click through the dialog boxes, and in Step 3 of 3 of the Text Import Wizard you will have the option of setting the column data format. Use "Text" for the column that has "2-60", "2-61", "2-62", "2-63". If you use General (the default), Excel tries to be smart and converts the values for you.

I solved the problem by saving the file using the .xlsx format by using the function
write.xlsx()
from the package xlsx (https://www.rdocumentation.org/packages/xlsx/versions/0.6.5)

'Incomplete final line' warning when trying to read a .csv file into R

I'm trying to read a .csv file into R and upon using this formula:
pheasant<-read.table(file.choose(),header=TRUE,sep=",")
I get this warning message:
"incomplete final line found by readTableHeader on 'C:\Documents and Settings..."
There are a couple of things I thought may have caused this warning, but unfortunately I don't know enough about R to diagnose the problem myself so I thought I'd post here in the hope someone else can diagnose it for me!
the .csv file was originally an Excel file, which I saved into .csv format
the file comprises three columns of data
each data column is of a differing length, i.e. there are a different number of values in each column
I want to compare the means (using t-test or equivalent depending on normal / not normal distribution) of two of the columns at a time, so for example, t-test between column 1 values and column 2 values, then a t-test of column 1 and column 3 values, etc.
Any help or suggestions would be seriously appreciated!

The message indicates that the last line of the file doesn't end with an End Of Line (EOL) character (linefeed (\n) or carriage return+linefeed (\r\n)). The original intention of this message was to warn you that the file may be incomplete; most datafiles have an EOL character as the very last character in the file.
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return
Save the file
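
The same remedy can be sketched in R, checking the last byte first so that a newline is only appended when it is actually missing. The helper names ends_with_newline and ensure_final_newline are made up for this illustration, and the demo file is a temporary stand-in for the real csv.

```r
# Hypothetical helper: TRUE if the last byte of the file is a linefeed.
ends_with_newline <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  length(bytes) > 0 && bytes[length(bytes)] == as.raw(0x0a)
}

# Hypothetical helper: append an end-of-line only if it is missing.
ensure_final_newline <- function(path) {
  if (!ends_with_newline(path)) {
    cat("\n", file = path, append = TRUE)
  }
  invisible(path)
}

# Demo on a temporary file whose last line has no end-of-line character.
demo <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\n1,2"), demo)
ensure_final_newline(demo)
```

Calling ensure_final_newline() a second time leaves the file unchanged, which makes it safe to run before every read.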

The problem is easy to resolve;
it's because the last line MUST be empty.
Say, if your content is
line 1,
line2
change it to
line 1,
line2
(empty line here)
I ran into this kind of problem today when I was trying to read a JSON file in R with the command below:
json_data<-fromJSON(paste(readLines("json01.json"), collapse=""))
and I resolved it with the method above.

Are you really sure that you selected the .csv file and not the .xls file? I can only reproduce the error if I try to read in an .xls file. If I try to read in a .csv file or any other text file, it's impossible to recreate the error you get.
> Data <- read.table("test.csv",header=T,sep=",")
> Data <- read.table("test.xlsx",header=T,sep=",")
Warning message:
In read.table("test.xlsx", header = T, sep = ",") :
incomplete final line found by readTableHeader on 'test.xlsx'
readTableHead is the C function that raises the warning. It reads the first n lines (by default the first 5) to determine the type of the data. The rest of the data is read in using scan(). So the problem is the format of the file.
One way of finding out is to set the working directory to the directory where the file is. That way you see the extension of the file you read in. On Windows extensions are not shown by default, so you might believe it's a csv while it isn't.
The next thing you should do, is open the file in Notepad or Wordpad (or another editor) and check that the format is equivalent to my file test.csv:
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,
This file will give you the following dataframe :
> read.table(testfile,header=T,sep=",")
  Test1 Test2 Test3
1     1     1     1
2     2     2     2
3     3     3     3
4     4     4    NA
5     5     5    NA
6    NA     6    NA
The csv format saved by Excel separates all cells with a comma. Empty cells just don't have a value. read.table() can easily deal with this, and recognizes empty cells just fine.

Use readLines() (with warn = FALSE) to read the file into a character vector first.
After that use the text = option to read the vector into a data frame with read.table()
pheasant <- read.table(
text = readLines(file.choose(), warn = FALSE),
header = TRUE,
sep = ","
)

I realized that several answers have been provided but no real fix yet.
The reason, as mentioned above, is a missing "End of line" at the end of the CSV file.
While the real fix should come from Microsoft, the workaround is to open the CSV file with a text editor and add a line at the end of the file (i.e. press the return key).
I use Atom as a text/code editor, but virtually any basic text editor would do.
In the meantime, please report the bug to Microsoft.
Question: It seems to me that it is an Office 2016 problem. Does anyone have the issue on a PC?

I solved this problem by changing the encoding argument of read.table from fileEncoding = "UTF-16" to fileEncoding = "UTF-8".

I received the same message. My fix included: I deleted all the additional sheets (tabs) in the .csv file, eliminated non-numeric characters, resaved the file as comma delimited and loaded in R v 2.15.0 using standard language:
filename<-read.csv("filename",header=TRUE)
As an additional safeguard, I closed the software and reopened before I loaded the csv.

In various European locales, where the comma serves as the decimal mark (and the semicolon as the field separator), the read.csv2 function should be used instead.
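
A small illustration with made-up data: read.csv2() expects the semicolon as field separator and the comma as decimal mark.

```r
# Made-up sample in the "European" csv style: ';' between fields,
# ',' as the decimal mark.
txt <- c(
  "height;weight",
  "1,75;70,5",
  "1,60;55,0"
)

# read.csv2() uses sep = ";" and dec = "," by default, so both columns
# are parsed as numbers.
measurements <- read.csv2(text = txt)
```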

I got this problem once when I had a single quote as part of the header. When I removed it (i.e. renamed the respective column header from Jimmy's data to Jimmys data), the function returned no warnings.
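
One possible explanation, which this answer does not state and is therefore an assumption: read.table() treats both " and ' as quote characters by default, so an apostrophe in a header can open a quoted string that never closes. If renaming the column is not an option, restricting the quote characters may help. The sample data below is made up.

```r
# Made-up sample with an apostrophe in the header.
txt <- c(
  "Jimmy's data,score",
  "1,10",
  "2,20"
)

# Only the double quote is treated as a quoting character here, so the
# apostrophe in the header is read as an ordinary character.
scores <- read.table(text = txt, header = TRUE, sep = ",", quote = "\"")
```

read.csv() already uses quote = "\"" by default, so switching from read.table() to read.csv() would be another way to get the same effect.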

In my case, it was literally the final line. The issue was fixed by literally adding a blank row at the bottom of the CSV file.
FROM
cola,colb,colc
1,2,3
4,5,6
7,8,9
INTO
cola,colb,colc
1,2,3
4,5,6
7,8,9
(blank line here)
Look closely at the extra blank line after the very last row. Just add that blank line and it will fix the issue.
NOTE
It seems that R's CSV parser expects a newline character at the very end of the file as the final line separator. Programmers know this as the \r\n or \n characters.

The problem that you're describing occurred for me when I renamed a .xlsx as .csv.
What fixed it for me was going "Save As" and then saving it as a .csv again.
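
A quick way to detect this situation from R, sketched under the assumption that the renamed file is a real .xlsx: such files are ZIP archives, and ZIP archives start with the two bytes "PK". The helper name looks_like_zip and the demo files are made up. Old-style .xls files are not ZIP archives and would not be caught by this check.

```r
# Hypothetical helper: TRUE if the file starts with the ZIP signature "PK",
# which is what a renamed .xlsx would look like.
looks_like_zip <- function(path) {
  magic <- readBin(path, what = "raw", n = 2)
  identical(magic, charToRaw("PK"))
}

# Demo with two temporary files: one that only pretends to be a csv,
# and one that really is plain text.
fake_csv <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("PK"), as.raw(c(0x03, 0x04))), fake_csv)

real_csv <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), real_csv)

looks_like_zip(fake_csv)
looks_like_zip(real_csv)
```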

To fix this issue through R itself, I just used read.xlsx(..) instead of read.csv(). Works like a charm!! You do not even have to rename. Renaming an xlsx to csv is not a viable solution.

Open the file in TextWrangler or Notepad++ and show the formatting, e.g. in TextWrangler you choose "show invisibles". That way you can see the newline or tab characters.
Often Excel will add all sorts of tabs in the wrong places and no final newline character, but you need to show the symbols to see this.

My work around was that I opened the csv file in a text editor, removed the excessive commas on the last value, then saved the file. For example for the following file
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,,
Remove the commas after 6, then save the file.

I've experienced a similar problem; however, this appears to be a generic warning, and may not in fact be related to the line-end character. In my case it was giving this error because the file I was using contained Cyrillic characters. Once I replaced them with Latin characters the error disappeared.

I tried different solutions, such as using a text editor to insert a new line and get the End Of Line character as recommended in the top answer above. None of these worked, unfortunately.
The solution that did finally work for me was very simple: I copy-pasted the content of a CSV file into a new blank CSV file, saved it, and the problem was gone.

There is quite a simple solution (if it is indeed the final line which is causing trouble) where you don't need to open the file before reading it:
cat("\n", file = "your/File/Dir", append = TRUE)
Found this solution here.
