CSV file import in R

I am trying to import a CSV file into R to do fraud analysis with linear/logistic regression. What should have been pretty easy is turning complicated... This data set contains 26 variables and more than 2 million rows. I used this command line to import the CSV file:
data <- read.csv('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv',header=TRUE,sep=";")
Nevertheless, R imported 2.3 million rows into only 1 variable. I attach an image of the View(data) output obtained after this step for more information. I have tried switching from sep=";" to sep="," using:
datos <- read.csv('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv',header=TRUE,sep=",")
But got this error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I have tried changing read.csv to read.csv2 (result: 2.3 million rows and 1 variable) and using the fill=TRUE option (same result), but the import is still not correct. I attach another image showing how the original CSV looks when opened in Excel.
I appreciate in advance any suggestion or help to fix it.

Break down the problem into steps that you can check - initially I'd try something like
file <- 'C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv'
read.csv(file, header=F, skip=1, sep=',', nrow=1)
If this produces a data.frame with 1 row and 26 columns, you're in business. If not, check through the arguments of read.csv again and see whether any of them need changing for your file.
Now progress to
read.csv(file, header=T, skip=0, sep=',', nrow=1)
This should give you the same one-line data.frame, but with the correct column names. If not, check that the csv file has the right number of columns in the first row, or carry on skipping the header and assign column names after you've read it in.
Now increase nrow, initially to 10, then maybe by a factor of 10 until you read in the whole file, or you hit a problem. Use a binary search to find the exact line that causes the problem, by setting nrow to be halfway between the value you know works, and one that doesn't until you find the exact problem line.
Refer to the csv in Excel to see what is particular about this line - does it have a strange character, unmatched quotes, fewer entries... this will influence how you fix the problem.
Repeat until your whole file reads in!
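As a possible shortcut for finding the offending line, here is a minimal sketch using base R's count.fields(), which reports how many fields each line contains, so lines that disagree with the header stand out. The sample file below is made up for illustration; substitute the real path and separator.

```r
# Hypothetical sample standing in for the real file: line 3 has one field too many.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,amount,flag",
             "1,10,A",
             "2,3,5,B",
             "3,7,C"), tmp)

# Count the fields on every line; keep blank lines so positions match the file.
n_fields <- count.fields(tmp, sep = ",", quote = "\"",
                         blank.lines.skip = FALSE, comment.char = "")

# Lines whose field count differs from the header line are suspects.
bad_lines <- which(n_fields != n_fields[1])
bad_lines

# Show the raw text of the suspect lines.
readLines(tmp)[bad_lines]
```

On a file with millions of rows, table(n_fields) gives a quick overview of how many lines have each field count before looking at individual lines.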

From the excel screenshot, the first line of data in your file has 31 columns; the second has 29...
My guess is that your csv file uses a comma both as column separator and as decimal separator. You need to re-export the file to csv with different characters for the decimal and column separators.
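If the file is re-exported with a semicolon as column separator while keeping the comma as decimal mark (an assumption about the export settings, not something the question confirms), read.csv2() should read it with its defaults. The sample data below is invented for illustration.

```r
# Hypothetical re-exported file: ";" separates columns, "," marks decimals.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;amount;flag",
             "1;12,5;A",
             "2;3,25;B"), tmp)

# read.csv2() defaults to sep = ";" and dec = ",".
datos <- read.csv2(tmp, stringsAsFactors = FALSE)

# The amount column should come back as numeric, not character.
str(datos)
```

The same result can be had with read.csv(tmp, sep = ";", dec = ","), which makes the two separators explicit.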

Related

How to import YRBS ASCII .dat file into R

I'm trying to import the YRBS ASCII .dat file found here to analyze in R, but I'm having trouble importing the file. I followed the recommendations here and here but none seem to work. More specifically, it's still showing up as being one column/variable in R with 14,765 observations.
I've tried using the readLines(), read.table, and read.csv functions but none seem to be separating the columns.
Here is the specific code I tried:
readLines("D:/Projects/XXH2017_YRBS_Data.dat", n=5)
read.csv("D:/Projects/XXH2017_YRBS_Data.dat", header = FALSE)
read.table("D:/Projects/XXH2017_YRBS_Data.dat", header = FALSE)
readLines and read.csv only provided one column, and read.table gave an error stating that line 1 did not have 23 elements (which I'm assuming just refers to the missing values?).
The data also starts from line 1 so I cannot use skip = 1 like some have suggested online.
How do I import this file into R so that I can separate the columns?
The files are bulky, so I did not download them.
First get the Access version of the data, then try the following code and compare the result with the Access data.
data<- readr::read_table2("XXH2017_YRBS_Data.dat", col_names = FALSE, na = ".")
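For readers without readr, a rough base R analogue of the call above might look like the sketch below. It assumes the file is whitespace-delimited and marks missing values with ".", which the question does not confirm; the three sample lines are made up.

```r
# Hypothetical whitespace-delimited sample with "." as the missing-value marker.
tmp <- tempfile(fileext = ".dat")
writeLines(c("1 17 M",
             "2  . F",
             "3 16 M"), tmp)

# The default sep = "" splits on runs of whitespace; "." is read as NA.
yrbs <- read.table(tmp, header = FALSE, na.strings = ".",
                   stringsAsFactors = FALSE)
str(yrbs)
```

If the real file turns out to be fixed-width (missing values stored as blanks rather than "."), whitespace splitting will misalign columns and a fixed-width reader such as read.fwf() with the documented column widths would be the safer choice.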

Different number of lines when loading a file into R

I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using the read.table() command, it's showing 1044 lines (including a header).
The snippet of the file looks like
L*H
no
H*L
no
no
no
H*L
no
Might it be an issue with R?
When opened in Excel it doesn't show any errors either.
EDIT
The problem was that R read a line like L + H* as three separate lines: L, + and H*.
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines there are in your file, and feel free to import it again with read.csv() to see whether you get the expected result. Sometimes a file is parsed differently because of an extra quote, an extra line break, or other stray characters.
Possible import steps:
1. Look at your data with a text editor or readLines() to figure out the delimiter and file type.
2. Determine an import method (type read and press Tab to see the available import functions; also check out readr).
3. Customize the arguments, for example whether there is a header, or whether to skip the first n lines.
4. Look at the data again in R with View(head(data)) or View(tail(data)), and decide whether you need to repeat steps 2, 3 and 4.
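The steps above can be sketched roughly as follows. The file contents and the choice of ";" as delimiter are made up for illustration.

```r
# Hypothetical sample file with a header and a ";" delimiter.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;label",
             "1;L*H",
             "2;no",
             "3;H*L"), tmp)

# Step 1: look at the raw text to spot the delimiter and the header.
raw_lines <- readLines(tmp)
head(raw_lines, 3)

# Steps 2 and 3: pick a reader and set its arguments to match the file.
dat <- read.csv(tmp, header = TRUE, sep = ";", stringsAsFactors = FALSE)

# Step 4: inspect the result; data rows should equal raw lines minus the header.
head(dat)
nrow(dat) == length(raw_lines) - 1
```

Comparing nrow() against length(readLines()) is a cheap check that no line was split or merged during parsing.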
Based on the data you have provided, try using sep = "\n". By using sep = "\n" we ensure that each line is read as a single column value. Additionally, quote does not need to be used at all. There is no header in your example data, so I would remove that argument as well.
All that said, the following code should get the job done.
table <- read.table(file.choose(), sep = "\n")
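A small sketch of why this helps, using made-up lines modeled on the snippet in the question: with the default whitespace separator a value such as L + H* counts as three fields, while sep = "\n" keeps each line whole.

```r
# Hypothetical sample: the last line contains spaces inside a single value.
tmp <- tempfile(fileext = ".txt")
writeLines(c("L*H", "no", "H*L", "no", "no", "no", "L + H*"), tmp)

# With the default whitespace separator, the last line counts as three fields.
fields <- count.fields(tmp)
fields

# With sep = "\n", every line is read as exactly one value.
tab <- read.table(tmp, sep = "\n", stringsAsFactors = FALSE)
nrow(tab)
tab$V1[7]
```

If the real file does have a header line, adding header = TRUE to the read.table() call keeps it out of the data rows.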

Headers changing when reading data from csv or tsv in R

I'm trying to read a data file into R but every time I do R changes the headers. I can't see any way to control this in the documentation from the read function.
I have the same data saved as both a csv and a tsv but get the same problem with both.
The headers in the data file look like this when I open it in excel or in the console:
cod name_mun age_class 1985 1985M 1985F 1986 1986M 1986F
But when I read it into R using either read.csv('my_data.csv') or read.delim('my_data.tsv') R changes the headers to this:
> colnames(my_data)
[1] "ï..cod" "name_mun" "age_class" "X1985" "X1985M" "X1985F" "X1986"
[8] "X1986M" "X1986F"
Why does R do this and how can I prevent it from happening?
You are seeing two different things here.
The "ï.." on the first column comes from having a byte order mark at the beginning of your file. Depending on how you created the file, you may be able to save as just ASCII or even just UTF-8 without a BOM to get rid of that.
R does not like to have variable names that begin with a digit. If you look at the help page ?make.names you will see
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
You can get around that when you read in your data by using the check.names argument to read.csv possibly like this.
my_data = read.csv(file.choose(), check.names = FALSE)
That will keep the column names as numbers. It will also change the BOM residue at the start of the first column name to the full BOM, "ï»¿".
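If re-saving the file without a BOM is not an option, one possible approach is to let R strip the BOM while reading, via the "UTF-8-BOM" encoding name that R connections accept. The sketch below writes a small made-up file with a BOM and reads it back. It assumes the file really is UTF-8 with a BOM.

```r
# Write a hypothetical UTF-8 file that starts with a byte order mark.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                      # the BOM bytes
writeBin(charToRaw("cod,name_mun,1985,1985M\n1,a,10,5\n"), con)  # header + one row
close(con)

# fileEncoding = "UTF-8-BOM" drops the BOM; check.names = FALSE keeps "1985" as is.
my_data <- read.csv(tmp, check.names = FALSE, fileEncoding = "UTF-8-BOM")
colnames(my_data)
```

Columns whose names start with a digit then need backticks or [[ ]] when referenced, for example my_data[["1985"]].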

Loading csv into R with `sep=,` as the first line

The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
row.names=NULL,
header=T,
stringsAsFactors = F)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I do header=F instead then it loads, but then when I do names(initData) <- initData[2,] then the names have spaces and illegal characters and it breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
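# Read every line of the file as plain text
all_content <- readLines("initData.csv")
# Drop the leading "sep=," line
skip_first_line <- all_content[-1]
# Parse the remaining lines as CSV
initData <- read.csv(textConnection(skip_first_line),
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)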
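Another of the many ways, sketched here on the sample data from the question: read.csv() has a skip argument, so the sep=, line can be skipped directly without reading the file twice. The temporary file stands in for the real export.

```r
# Recreate the sample file from the question in a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("sep=,",
             "Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client",
             "FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1",
             "FakeInitiative2,40 %,161.50,400,Platform,FakeClient2"), tmp)

# skip = 1 ignores the "sep=," line, so the second line becomes the header.
initData <- read.csv(tmp, skip = 1, header = TRUE, stringsAsFactors = FALSE)

# check.names = TRUE (the default) turns the headers into syntactic names.
names(initData)
```

Because the header is parsed by read.csv() itself, spaces and parentheses in the column names are replaced with dots, which avoids the illegal names produced by assigning names(initData) from a data row.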
Your file could be in a UTF-16 encoding. See hrbrmstr's answer on how to read a UTF-16 file.

read_excel all columns text [duplicate]

I have an Excel file with all columns of type "text". When read_excel is called, however, some of the columns are guessed to be "dbl" instead. I know I can use col_types to specify the columns, but that requires me knowing how many columns there are in my file.
Is there a way I can detect the number of columns? Or, alternatively, specify that the columns are all "text"? Something like
read_excel("file.xlsx", col_types = "text")
which, quite reasonably, gives an error that I haven't specified the type for all the columns.
Currently, I can solve this by reading in the file twice:
read_excel_one_type <- function(filename, col_type = "text"){
temp <- read_excel(path = filename)
ncol.temp <- ncol(temp)
read_excel(path = filename, col_types = rep(col_type, ncol.temp))
}
but a method that doesn't require reading the file twice would be better.
This answer seems helpful: https://stackoverflow.com/a/34015430/5220858. I have found that the Excel file needs to be formatted correctly from the start for R to detect the correct data types (numeric, date, text) automatically. The linked post is more relevant to your question, though. The poster shows code similar to yours, except that only one row is read to determine the number of columns, and the rest is then read into R based on that first row.
library("xlsx")
file<-"myfile.xlsx"
sheetIndex<-1
mydf<-read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
startRow=NULL, endRow=NULL, colIndex=NULL,
as.data.frame=TRUE, header=TRUE, colClasses=NA,
keepFormulas=FALSE, encoding="unknown")
works for me
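For the readxl route from the question, it may be worth checking the installed version: assuming readxl 1.0.0 or later, a single col_types value is recycled across all columns, so the file does not need to be read twice. The sketch below uses an example workbook that ships with readxl rather than the asker's file, and does nothing if readxl is not installed.

```r
all_text <- NA  # stays NA when readxl is not available

if (requireNamespace("readxl", quietly = TRUE)) {
  # Example workbook bundled with readxl, standing in for "file.xlsx".
  path <- readxl::readxl_example("datasets.xlsx")

  # A single col_types value is applied to every column.
  df <- readxl::read_excel(path, col_types = "text")

  # Confirm that every column came back as character.
  all_text <- all(vapply(df, is.character, logical(1)))
}
all_text
```

On older readxl versions that reject a single col_types value, the two-pass function from the question remains a reasonable fallback.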
