Non-existing blank/strange character at the beginning of the import

Non-existing blank/strange character at the beginning of the import - r

I have a strange (to me) case:
I am importing a data with read.table (the file is a CSV - a couple of columns, nothing special I believe).
The unexpected behavior is that in the resulting r data frame I receive a strange first character before the first value (first data row) - see head(data,5) below
Number
1 п»ї400007
2 400007
3 400007
4 400007
5 400007
When I open the data in the viewer it looks like there's a blank character before the first value.
When I open the csv in notepad there is nothing before 400007 in the first row.
This has happened to me a couple of times.
Now if I want to tell read.table function that this "Number" variable should be read as number/integer it gives an error because of this strange character.
Could it be anything related to my language system settings(Windows)?
Any ideas?

Related

R: Why am I getting an extra column titled "X.1" in my dataframe after reading my .txt file?

I have got this .txt file outputed by a microscope to process.
#read the .txt file generated by microscope, skipping the first 9 lines of garbage information
df <- read.csv("Objects_Population - AllCells.txt", sep="\t", skip = 9,header=TRUE, fill = T)
Then I started looking at the structure of the dataframe, everything seems fine except I now found an extra column in the end of the data frame named "x.1" and all rows of it are NA values. I don't see this column when I open the .txt file in excel. I suspect the problem has something to do with the column names generated by microscope, they contain quite some special characters
Below is the dataframe read by Excel(only showing the last 2 columns since I have 132 columns, and their names are disgustingly long):
AllCells - Cell Contact Area with Neighbors [%] AllCells - Nucleus Nearest Neighbor Distance [Âµm]
0 4.82083
21.9512 0
15.7895 0
29.4118 0.584611
0 4.21569
0 1.99599
0 3.50767
...
This has happened to me before but I never took it too serious as I was always interested in a subset of my data frame. Now I'm looking at all columns then this starts to bothering me.
Is there any way I can read them correctly without R attaching that additional "X.1" column in the end? Preferably not manually delete or subset out the last column...
Cheers,
ML

If all other column names are correct, you have probably a trailing \t in the text file. R tries to include it and gives it the generic column name X.1.
You could try and read the file first as 'plain text' and remove the trailing \t and only then use read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection )
close(file_connection)
Now we try to get rid of these trailing \t (this might need some testing to fit your needs)
sanitized <- gsub("\\t$", "", content)
And then we read this sanitized string as if it was a file (using the argument text)
df <- read.csv(text=paste0(sanitized, collapse="\n"), sep="\t", skip = 9,header=TRUE, fill = T)

Had that problem too. Fixed it by saving the file as "CSV (MS-DOS (*csv)" instead of what I originally had as "CSV (Comma delimited)(*csv)".

This is almost certainly because you've got an extra empty column in your spreadsheet.
In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.

I also encountered similar problem. I found that three extra columns were created (X, X.1, X.2), after I loaded dataset from excel sheet to R studio.
Steps Followed by me:
a) I went to the excel sheet and selected those three extra columns after last column with actual values in excel sheet. Selected extra 3 columns by keeping cursor on top of columns and then right click the mouse and select delete.
b) Again loaded that excel sheet in R. I did not find those 3 columns.

Character values stored in DATAFRAME with Double Quotes while reading into R

I have a csv file with almost 4 millions records and 30 + columns.
The Columns are of varied type that includes Numeric, Alphanumeric, Date Column, character etc.
Attempt 1:
When I first read the file in R using read.csv Function then only 2 millions of the records were read.
This may have happened because of some special characters in the DATA.
Attempt 2:
I provided the argument quote = "" in read.csv Function and all the records were read succesfully.
However this brings up 2 issues:
a. all teh Columns were appended with 'x.' modifier:
egs.: x.date , x.name
b. all the Character Columns were loaded in dataframe, enclosed with double quotes ""
Can someone, please advise me that how to resolve these 2 issues and get the data loaded in R succesfully?
I work for a financial insititution and the data is highly sensitive, hence cannot paste the screenshot over here.
I also tried to create the scenario at my home but all my efforts were of little or of no avail.
The below screenshot is closest I have came to the exact scenario:
DATAFRAME SCREENSHOT: Not exact copy

Character strings are converted to date in .csv. Can't even be converted back in Excel (R)

I have a data.frame that looks like this:
a=data.frame(c("MARCH3","SEPT9","XYZ","ABC","NNN"),c(1,2,3,4,5))
> a
c..MARCH3....SEPT9....XYZ....ABC....NNN.. c.1..2..3..4..5.
1 MARCH3 1
2 SEPT9 2
3 XYZ 3
4 ABC 4
5 NNN 5
Write into csv: write.csv(a,"test.csv")
I want everything to stay the way it is but MARCH3 and SEPT9 become 3-Mar and 9-Sep. I have tried everything in Excel: formatting by date, text, custom...none works. 3-Mar would be converted to 42066 and 9-Sep to 42256. In reality, a is a fairly large table so this can't even be done manually. Is there a way to coerce a[,1] so that Excel would ignore its format?

The best way to prevent Excel from autoformatting would probably be to store the data as excel file:
library(xlsx)
write.xlsx(a, "test.xlsx")

Your best bet is probably to change the file extension (e.g. make it ".txt" or ".dat" or something like that). When you open such a file in Excel the text import wizard will open. Specify that the file is delimited with commas, then make sure to change the appropriate column from "General" to "Text".
As an example: looking at the data in the question it appears that your CSV file might look like
,,,,MARCH3,,,,1
,,,,SEPT9,,,,2
,,,,XYZ,,,,3
,,,,ABC,,,,4
,,,,NNN,,,,5
If I save this file with a ".csv" extension and open it in Excel I get:
3-Mar 1
9-Sep 2
XYZ 3
ABC 4
NNN 5
with the date values changed as you noted. When I change the file extension to ".dat", making no other changes to the file, and open it in Excel I'm presented with the Text Import Wizard. I tell Excel that the file is "Delimited", choose "Comma" as the delimiter, and in the column with the "MARCH3" and "SEPT9" values I change the Column Data Type to "Text" (instead of "General"). After I clicked the Finish button on the wizard I got the following data in the spreadsheet:
MARCH3 1
SEPT9 2
XYZ 3
ABC 4
NNN 5
I tried putting the MARCH3 and SEPT9 values in double-quotes to see if that would convince Excel to treat these values as text but Excel still converted these cells to dates.
Share and enjoy.

My solution was to append a semicolon to all the gene names. The added character convinces excel that this column is text not a date. You can find and replace the semicolon later is you want, but most programs - like perseus will allow you to ignore everything after the semicolon so its not always a problem...
df$Gene.name <- paste(df$Gene.name, ";", sep="")
I would be interested in anyone has a trick for doing this to just the Sept, March gene names though...

fread() error and strange behaviour when reading csv

I used fread() from data.table library to try read a 540MB csv file. It returned an error message saying:
' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.
I have no idea what caused the error and want to track down if it's a bug or just some data formating issue that I can tweak fread() to process.
I managed to read the csv using read.csv(), and decided to track down the row that triggered the error above (line 617174, not line 4 as the error message above). I then re-output the row and one row each immediately preceding and following the offending row, written out using write.csv() as testout.csv
I was able to read back testout.csv using read.csv() creating a data frame with 3 observations, as expected. Using fread() on testout.csv, however, resulted in a data table with only 1 observation, which is the last row.
The four lines in testout.csv are below (I start a new line for each entry below for readability).
"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129
20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail.
.",617130
20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131
When I ran fread("testout.csv", sep=",", verbose=TRUE), the output was
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)
Any idea what may have caused the unexpected results, and the error in the first place? And any way around it? Just to be clear, my aim is to be able to use fread() to read the main file, even though read.csv() works so far.

UPDATE: Now fixed in v1.9.3 on GitHub :
fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting.See:
fread and a quoted multi-line column value
Windows users are reporting success with the latest version from GitHub.

Long number field does not retail string of digits due to Excel. Excel Office Professional 2013 converts digits to rounded number

I'm reading a csv file in R that includes a conversion ID column. The issue I'm running into is that my conversionID is being rounded as an exponential number. Below is snapshot of the CSV file (opened in Excel) that I'm reading into R. As you can see, the conversion ID is an exponential format, but the value is: 383305820480.
When I read the data into R, using the following lines, I got the following output. Which looks like it's rounding the string of conversion IDs.
x<-read.csv("./Test2.csv")
options("scipen"=100, "digits"=15)
x
When I export the file as CSV, using the code
write.csv(x,"./Test3.csv")
I get the following output. As you can see, I no longer have a unique identifier as it rounds the number.
I also tried reading the file as a factor, using the code, but I get the same output with numbers rounded. I need the Conversion.ID to be a unique identifier.
x<-read.csv("./Test2.csv", colClasses="character")
The only way I can get the Conversion ID column to stay as a unique identifier is to open the CSV file and write a ' in front of each conversion ID. That is not scalable because I have hundreds of files.

I can't replicate your experience.
(Update: OP reports that the problem is actually with Excel converting/rounding the data on import [!!!])
I created a file on disk with full precision (I don't know the least-significant digits of your data, you didn't show them except for the first element, but I put a non-zero value in the units place for illustration):
writeLines(c(
"Conversion ID",
" 383305820480",
" 39634500000002",
" 213905000000002",
"1016890000000002",
"1220910000000002"),
con="Test2.csv")
Read the file and print it with full precision (use check.names=FALSE for perfect "round trip" capability -- not something you want to do on a regular basis):
x <- read.csv("Test2.csv",check.names=FALSE)
options(scipen=100)
print(x,digits=20)
## Conversion ID
## 1 383305820480
## 2 39634500000002
## 3 213905000000002
## 4 1016890000000002
## 5 1220910000000002
Looks OK.
Now write output (use row.names=FALSE to avoid adding row names/allow a clean round-trip):
write.csv(x,"Test3.csv",row.names=FALSE,quote=FALSE)
The least-mediated way to examine a file on disk from within R is file.show():
file.show("Test3.csv")
## Conversion ID
## 383305820480
## 39634500000002
## 213905000000002
## 1016890000000002
## 1220910000000002
x3 <- read.csv("Test3.csv",check.names=FALSE)
all.equal(x,x3) ## TRUE
Use system tools to check that the files are the same (except for white space differences -- the original file was right-justified):
system("diff -w Test2.csv Test3.csv") ## no difference
If you have even longer ID strings you will need to read them as character to avoid loss of precision:
read.csv("Test2.csv",colClasses="character")
## Conversion.ID
## 1 383305820480
## 2 39634500000002
## 3 213905000000002
## 4 1016890000000002
## 5 1220910000000002
You could probably round-trip through Excel more safely (if you still think that's a good idea) by importing as character and exporting with quotation marks to protect the values.

I just figured out the issue. It looks like my version of Excel is converting the data, causing it to lose the digits. If I avoid opening the file in Excel after downloading it, it retains all the digits. I'm not sure if this is a known issue with newer version. I'm using Excel Office Professional Plus 2013.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Non-existing blank/strange character at the beginning of the import - r

Related

R: Why am I getting an extra column titled "X.1" in my dataframe after reading my .txt file?

Character values stored in DATAFRAME with Double Quotes while reading into R

Character strings are converted to date in .csv. Can't even be converted back in Excel (R)

fread() error and strange behaviour when reading csv

Long number field does not retail string of digits due to Excel. Excel Office Professional 2013 converts digits to rounded number

Categories

Resources