Getting extra symbols and numbers after reading a CSV file in R

I have a CSV file containing text. After reading it into R, I am getting some unwanted symbols and numbers.
How can I remove them from the file?
Example:
My CSV file has two text columns: Question and Answer.
Original Question (before opening in R):
Where do I see my bank's account details?
Original Answer:
That's a frequently asked question. You can find details at this link.
After reading the file, I get something like:
Question:
Where do I see my bank’s account details?
Answer:
That&#39;s a frequently asked question. You can find details at this link.
I tried saving the file as UTF-8 and then reading it:
df <- read.csv("data.csv", encoding = "UTF-8", stringsAsFactors = FALSE)
But some unwanted symbols and numbers still appear, as shown above. How do I remove them?

Try read_excel() from the readxl package. Note that read_excel() reads Excel workbooks, not true CSV files, so point it at the original .xlsx if the data started life in Excel:
library(readxl)
df <- read_excel("data.xlsx")
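If you need to stay with the CSV, here is a minimal cleanup sketch, assuming the stray text comes from HTML entities such as &#39; and from a mis-declared encoding. Note that fileEncoding, not encoding, is the read.csv argument that actually converts the input:
# re-read with fileEncoding so the bytes are converted, not just labelled
df <- read.csv("data.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
# strip leftover HTML entities and normalize smart quotes (hypothetical cleanup)
df$Answer <- gsub("&#39;", "'", df$Answer, fixed = TRUE)
df$Question <- gsub("[\u2018\u2019]", "'", df$Question)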

Related

Writing to a CSV file producing errors

I am using R to analyze some text data. After doing some aggregation, I had a new dataframe I wanted to write to a CSV file so I can use it in other analyses. The dataframe looks correct in R (it only has 2 columns with text data), but once I write the CSV and open it, the text is scattered across different columns. Here is the code I was using:
write.csv(new_df, "4.19 Group 1_agg by user try 2.csv")
I tried adding in an extra bit of code to specify that it should be using UTF-8, since I've heard this could be an encoding error, so the code then looked like this:
write.csv(new_df, "4.19 Group 1_agg by user try 2.csv", fileEncoding = "UTF-8")
I also tried reading the file back in differently (using fread instead of read.csv).
Still, the CSV file looks wrong/messy in many places. (Screenshots of the expected and actual output omitted.)
Again, I think the error must be in writing the csv file, because everything looks good in R when I check it using names and head. Any help is appreciated, thank you!
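One hedged suggestion, assuming the scattering only appears when Excel opens the file: Excel guesses the delimiter and encoding, and embedded commas or line breaks in the text columns can break that guess. readr's write_excel_csv() writes UTF-8 with a byte-order mark, which Excel detects reliably:
library(readr)
# quotes fields containing commas or newlines, and adds a UTF-8 BOM so Excel
# picks the right encoding and delimiter when it opens the file
write_excel_csv(new_df, "4.19 Group 1_agg by user try 2.csv")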

How can I extract URLs from Excel and match them with their displayed text using R?

I have an Excel file containing a large number of hyperlinks, and I want to write a program that extracts the URLs and matches them with the displayed text. I can import the URLs with the solution to a previous question, which uses the following code:
library(XML)
# rename a copy of the file to .zip (an .xlsx is just a zip archive)
my.zip.file <- sub("xlsx$", "zip", my.excel.file)
file.copy(from = my.excel.file, to = my.zip.file)
# unzip the file
unzip(my.zip.file)
# unzipping produces a bunch of files which we can read using the XML package
# assume sheet1 has our data
xml <- xmlParse("xl/worksheets/sheet1.xml")
# finally grab the hyperlinks' display attribute
hyperlinks <- xpathApply(xml, "//x:hyperlink/@display", namespaces = "x")
However, this ignores rows without any links, so the imported dataset is several thousand rows shorter than it should be. I can get the displayed text with read.xlsx, but I don't know how to match it with the URLs. I've tried looking for ways to find out which rows have links, or to change the code so it adds NAs in the right places, but I haven't had any success.
Having the same use case today, I dug a little and wrote an R function to extract all hyperlinks beneath the cells/text. My code snippet is posted at Extract hyperlink from Excel file in R, which I believe is a similar topic.
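To keep the rows aligned, here is a hedged sketch of one approach, assuming the sheet1.xml parsed above: each hyperlink element carries a ref attribute (its cell address), so you can recover the row number and merge with the full table from read.xlsx instead of losing the link-less rows. The display attribute may be absent, so it is read with an NA default.
# collect one small data frame per hyperlink: cell reference plus display text
links <- xpathApply(xml, "//x:hyperlink", function(node) {
  data.frame(ref = xmlGetAttr(node, "ref"),
             display = xmlGetAttr(node, "display", default = NA),
             stringsAsFactors = FALSE)
}, namespaces = "x")
links <- do.call(rbind, links)
# cell reference "B12" -> row 12; merge this row index with the displayed text
links$row <- as.integer(gsub("[^0-9]", "", links$ref))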

Tab Delimited Data

Since I'm not able to find/decipher solutions to my problem I'll try asking.
Up until now I've only worked with other people's data (.csv files) in RStudio, but this time around it's my own data, which I entered into Excel. All entries are tab-delimited, and I wouldn't enter data into Excel any other way. After some googling it seems like R and .xlsx files aren't best friends, so I saved my file in various other formats, .csv being one of them. Thus I have a tab-delimited .csv file.
The problem with loading tab-delimited .csv files also features here, but my problem is not with "reading some of the numeric variables in as factors" (whatever that means), but that data is loaded as semi-colon separated in R:
data <- read.table(file.choose(), sep="\t", header=T)
View(data)
Date.Miles.Time
2015-08-10;5;45
Apart from the improper formatting, there are 29 observations rather than 24; the last five entries are just ;;. Again, my problem is not the same as in the link, but I figured there's no harm in trying Justin's suggestion in the answer, i.e. options(stringsAsFactors=FALSE), and then running the above again, but it achieves nothing.
read.csv() and read.delim() yield the same result. Any suggestions?
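A hedged reading of the output above: the values come back joined with semicolons, which suggests Excel actually saved a semicolon-separated file (its CSV default in many European locales) rather than a tab-delimited one. If so, a sketch:
# read.csv2() defaults to sep = ";" and dec = ",", matching Excel's
# European-locale CSV output
data <- read.csv2(file.choose(), header = TRUE, stringsAsFactors = FALSE)
# drop the all-empty rows that the trailing ";;" lines produce, if any
data <- data[rowSums(is.na(data) | data == "") < ncol(data), ]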

"arules" library's "read.transaction()" reads in CSV files with an additional, blank column for every transaction

When you attempt to read CSV files other than the default groceries.csv, every transaction gets an additional entry, a blank item, which throws off all of the calculations for the analysis (and can even crash R if your CSV file is big enough). I've tried inserting NAs into all of the blank cells in my CSV file, but I cannot find a way to remove all of them within the read.transactions() call (rm.duplicates = TRUE leaves a single NA). I haven't found a trustworthy fix in any of the other questions on Stack Overflow, nor anywhere else on the internet.
Example entry:
> inspect(trans[1:5])
items
1 {,
FACEBOOK.COM,
Google,
Google Web Search}
It is hard to say. I assume you read the data with read.transactions(). Does your CSV file have leading white space on some or all lines? You could try the cols parameter of read.transactions() to fix the problem.
An example with data and the code to replicate the problem would help.
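A hedged sketch of one workaround, assuming the blank item comes from a leading comma on each line (consistent with the {, FACEBOOK.COM, ...} output above): strip the empty fields before handing the file to read.transactions(). The file name is hypothetical.
library(arules)
lines <- readLines("mydata.csv")
lines <- gsub("^,+|,+$", "", lines)  # drop leading/trailing commas
lines <- gsub(",{2,}", ",", lines)   # collapse runs of commas
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)
trans <- read.transactions(tmp, format = "basket", sep = ",")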

'Incomplete final line' warning when trying to read a .csv file into R

I'm trying to read a .csv file into R and upon using this command:
pheasant <- read.table(file.choose(), header = TRUE, sep = ",")
I get this warning message:
"incomplete final line found by readTableHeader on 'C:\Documents and Settings..."
There are a couple of things I thought may have caused this warning, but unfortunately I don't know enough about R to diagnose the problem myself so I thought I'd post here in the hope someone else can diagnose it for me!
the .csv file was originally an Excel file, which I saved into .csv format
the file comprises three columns of data
each data column is of a differing length, i.e. there are a different number of values in each column
I want to compare the means (using t-test or equivalent depending on normal / not normal distribution) of two of the columns at a time, so for example, t-test between column 1 values and column 2 values, then a t-test of column 1 and column 3 values, etc.
Any help or suggestions would be seriously appreciated!
The message indicates that the last line of the file doesn't end with an End Of Line (EOL) character (linefeed (\n) or carriage return+linefeed (\r\n)). The original intention of this message was to warn you that the file may be incomplete; most datafiles have an EOL character as the very last character in the file.
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return
Save the file
The problem is easy to resolve;
it's because the file MUST end with a newline, i.e. the last line must appear empty.
Say, if your content is
line 1,
line2
change it to
line 1,
line2
(empty line here)
Today I met this kind of problem when I was trying to use R to read a JSON file, using the command below:
json_data <- fromJSON(paste(readLines("json01.json"), collapse=""))
I resolved it with the method above.
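An alternative sketch that avoids editing the file, assuming fromJSON comes from rjson or jsonlite as above: readLines() takes warn = FALSE, which suppresses the incomplete-final-line warning at the source.
json_data <- fromJSON(paste(readLines("json01.json", warn = FALSE), collapse = ""))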
Are you really sure that you selected the .csv file and not the .xls file? I can only reproduce the error if I try to read in an .xls file. If I try to read in a .csv file or any other text file, it's impossible to recreate the error you get.
> Data <- read.table("test.csv",header=T,sep=",")
> Data <- read.table("test.xlsx",header=T,sep=",")
Warning message:
In read.table("test.xlsx", header = T, sep = ",") :
incomplete final line found by readTableHeader on 'test.xlsx'
readTableHeader is the C function that gives the warning. It tries to read in the first n lines (by default the first 5) to determine the type of the data; the rest of the data is read in using scan(). So the problem is the format of the file.
One way of finding out is to set the working directory to the directory where the file is. That way you see the extension of the file you read in. I know Windows doesn't show extensions by default, so you might believe your file is a csv while it isn't.
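A quick sketch of that check from within R, with a hypothetical C:/mydata folder: listing the directory shows the real extension even when Windows Explorer hides it.
setwd("C:/mydata")           # hypothetical folder containing the file
list.files(pattern = "test") # shows e.g. "test.xlsx" rather than "test.csv"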
The next thing you should do, is open the file in Notepad or Wordpad (or another editor) and check that the format is equivalent to my file test.csv:
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,
This file will give you the following dataframe:
> read.table(testfile,header=T,sep=",")
Test1 Test2 Test3
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
5 5 5 NA
6 NA 6 NA
The CSV format saved by Excel separates all cells with a comma. Empty cells just don't have a value. read.table() can easily deal with this and recognizes empty cells just fine.
Use readLines() (with warn = FALSE) to read the file into a character vector first.
After that, use the text = argument to read the vector into a data frame with read.table():
pheasant <- read.table(
text = readLines(file.choose(), warn = FALSE),
header = TRUE,
sep = ","
)
I realize that several answers have been provided, but no real fix yet. The reason, as mentioned above, is an End Of Line character missing at the end of the CSV file.
While the real fix should come from Microsoft, the workaround is to open the CSV file in a text editor and add a line at the end of the file (i.e. press the return key).
I use Atom as a text/code editor, but virtually any basic text editor would do.
In the meantime, please report the bug to Microsoft.
Question: it seems to me that this is an Office 2016 problem. Does anyone have the issue on a PC?
I solved this problem by changing the read.table encoding argument from fileEncoding = "UTF-16" to fileEncoding = "UTF-8".
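A sketch of that fix with a hypothetical file name, for a comma-separated file that was actually saved as UTF-8:
df <- read.table("data.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8")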
I received the same message. My fix: I deleted all the additional sheets (tabs) in the original workbook, eliminated non-numeric characters, resaved the file as comma-delimited, and loaded it in R v2.15.0 using the standard call:
filename <- read.csv("filename", header = TRUE)
As an additional safeguard, I closed the software and reopened it before loading the CSV.
In various European locales, where the comma character serves as the decimal point, the read.csv2() function should be used instead.
I got this problem once when I had a single quote as part of the header. When I removed it (i.e. renamed the respective column header from Jimmy's data to Jimmys data), the function returned no warnings.
In my case, it was literally the final line. The issue was fixed by adding a blank row at the bottom of the CSV file.
FROM
cola,colb,colc
1,2,3
4,5,6
7,8,9
INTO
cola,colb,colc
1,2,3
4,5,6
7,8,9
(blank line here)
Take a closer look at the extra blank line at the very end. Just add that blank line and it will fix the issue.
NOTE
It seems that R's CSV parser expects that very last newline character as the line terminator. This is better known to programmers as \n or \r\n.
The problem that you're describing occurred for me when I renamed a .xlsx as .csv.
What fixed it for me was going "Save As" and then saving it as a .csv again.
To fix this issue through R itself, I just used read.xlsx() instead of read.csv(). Works like a charm! You do not even have to rename the file; renaming an .xlsx to .csv is not a viable solution.
Open the file in TextWrangler or Notepad++ and show the formatting, e.g. in TextWrangler use Show Invisibles. That way you can see the newline and tab characters.
Often Excel will add all sorts of tabs in the wrong places and omit the final newline character, but you need to show the symbols to see this.
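An R-side alternative to an editor's invisible-character view, sketched with the hypothetical pheasant.csv from the question: reading the raw bytes makes tabs (09), line feeds (0a), and carriage returns (0d) visible at the end of the file.
bytes <- readBin("pheasant.csv", what = "raw", n = file.size("pheasant.csv"))
tail(bytes, 10)  # a healthy file ends in 0a (or 0d 0a on Windows)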
My workaround was to open the CSV file in a text editor, remove the excess commas after the last value, and then save the file. For example, for the following file:
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,,
Remove the extra comma after the 6, then save the file.
I've experienced a similar problem; however, this appears to be a generic warning and may not in fact be related to the line-end character. In my case it appeared because the file I was using contained Cyrillic characters; once I replaced them with Latin characters the warning disappeared.
I tried different solutions, such as using a text editor to insert a new line and get the End Of Line character as recommended in the top answer above. None of these worked, unfortunately.
The solution that did finally work for me was very simple: I copy-pasted the content of a CSV file into a new blank CSV file, saved it, and the problem was gone.
There is a quite simple solution (if it is indeed the final line causing the trouble) where you don't need to open the file before reading it:
cat("\n", file = "your/file/path.csv", append = TRUE)
Found this solution here.
