The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
row.names=NULL,
header=T,
stringsAsFactors = F)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I do header=F instead then it loads, but then when I do names(initData) <- initData[2,] then the names have spaces and illegal characters and it breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content = readLines("initData.csv")
skip_first_line = all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
row.names=NULL,
header=T,
stringsAsFactors = F)
Your file could be in a UTF-16 encoding. See hrbrmstr's answer in how to read a UTF-16 file:
Related
I have an issue, where I'm reading in big (+500mb) CSV-files and then want to verify that all data has been read in correctly. To do so, I have been using a comparison between length() of readLines() and nrow() of read.csv2.
The following is my R-code:
df <- readFileFromServer(HOST, KEY,
paste0(SERVER_PATH, SERVER_FOLDER),
FILENAME,
FUN = read.csv2,
sep = ";",
quote = "", encoding = "UTF-8", skipNul = TRUE)
df_check <- readFileFromServer(HOST, KEY,
paste0(SERVER_PATH, SERVER_FOLDER),
FILENAME,
FUN = readLines,skipNul = TRUE)`
Then I verify that all data was loaded, by checking:
if(nrow(df) != (length(df_check) - dif)){
stop("some error msg")
}
dif is set to 1, to account for header in the CSV-files.
This check is the part that fails for a given CSV-file.
This has been working as intended up until this point, but now this check is causing issues, but I cannot fully understand why.
The one CSV-file that fails the check has "NULL" in the data, which I believe readLines interprets as a delimiter, thus causing a new line, and then the check fails, but I'm really not sure.
I tried parsing different parameters to my readfunctions, but issue still persists.
I expect readlines and read.csv2 to result in equal the same length()-1 and nrow() respectively, as shown in my code-snippet.
This is not a proper answer, but it was too long for a comment. This would be my debug strategy here.
Pick a file that fails. Slurp it with readLines.
Save the file locally using writeLines.
Your first job is to make sure that the check fails also when the file
is loaded from the disk. My first thought would be that the file transfer the first time you have run readFilesFromServer and the second time were not precisely identical.
Now. If your problem persists for the given file when you read it locally with read.csv (different number of rows than number of lines in the readLine output), your job becomes much easier (and faster, probably) to solve.
First, take a look at the beginning of the CSV file and at its end. Are they as they should be? Do they match the data in the head and tail of your data frame? If yes, then you need to find the missing lines systematically.
Since CSV is just comma separated files, you can compare each line read from the CSV file with readLines with the line as it should be based on the table you have read using read.csv. How this should be done, depends on how your original csv file looks like (whether you need to insert quotes etc.). Basically, you need to figure out a way of restoring the lines of the CSV file from the data in your data frame, and then looking for the first line that is different.
Here is some code to give you an idea what I mean:
## first, prepare data – for this example only!
f <- file("test.csv", "w")
writeLines(c("a,b,c", "1,what ever,42", "12,89,one"), f)
close(f)
## actual test
## first, read the file with readlines
f <- file("test.csv", "r")
rl <- readLines(f)
close(f)
## then, read it with test.csv
csv <- read.csv("test.csv")
## third, prepare the lines as they should look based on the CSV
rl_sim <- do.call(paste, c(csv, sep=","))
## find the first mismatch
for(i in 1:length(rl_sim)) {
if(rl_sim[i] != rl[i + 1]) {
message("Problems start at line ", i, "\n", rl_sim[i], rl[i + 1])
break
}
}
I have csv files that for some reason have random line breaks after some codes:
I can read this file fine in R but I was wondering if there was a way to create an output without the random line breaks? Importing this file in other programs is creating issues where the 416 becomes a new line.
id,Abuse,AbuseHistoryOfAbuse,AbuseCurrentlyInAbusive,AbuseHistoryOfCPS,AbuseImminentRisk,AbuseInterventionCodes,Alcohol,AlcoholCurrentlyInTreatment,AlcoholSuspectUse,AlcoholAdmitsUse,AlcoholInterventionCodes,Asthma,AsthmaHistory,AsthmaInterventionCodes,BarriersToService,BarriersExperiencing,BarriersHistoryOf,BarriersToServiceInterventionCodes,BasicNeeds,BasicFood,BasicFoodLimitedAccess,BasicFoodNoWIC,BasicFoodNoDHS,BasicHousing,BasicHousingHasRegular,BasicHousingHomelessWith,BasicHousingHomelessWithout,BasicTransportation,BasicTransportationNoneLimited,BasicOther,BasicNeedsInterventionCodes,Breastfeeding,BreastfeedingPrenatal,BreastfeedingInterventionCodes,BreastHealth,BreastHealthInterventionCodes,Diabetes,DiabetesHistoryGestational,DiabetesHistoryDiabetes,DiabetesInterventionCodes,Drugs,DrugsType,DrugsUse,DrugsInterventionCodes,FamilyPlanning,FamilyPlanningNoPlans,FamilyPlanningInterventionCodes,Hypertension,HypertensionHistoryHypertension,HypertensionHistoryPreeclampsia,HypertensionInterventionCodes,Nutrition,NutritionInterventionCodes,ChronicDisease,ChronicDiseaseHistoryOther,ChronicDiseaseInterventionCodes,Periodontal,PeriodontalNoVisit,PeriodontalInterventionCodes,PersonalGoals,PersonalGoalsInterventionCodes,Smoking,SmokingUse,SmokingInterventionCodes,SocialSupport,SocialSupportInterventionCodes,STD,STDDiscloseSTD,STDDiscloseHIV,STDInterventionCodes,Stress,PrenatalEDSScore,PostnatalEDSScore,StressScore,StressAll,StressModerate,StressHistoryMentalHealth,StressHistoryBabyBlues,StressReportsStress,StressCurrentlyTreated,StressNotFollowing,StressEndorsesSuicidal,StressInterventionCodes,WomensHealth,WomensHealthInterventionCodes
0001,FALSE,FALSE,FALSE,FALSE,FALSE,NA,FALSE,FALSE,FALSE,FALSE,NA,FALSE,FALSE,NA,TRUE,FALSE,TRUE,411
416 ,TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,5F11
5T42
,TRUE,Not breastfeeding,NA,FALSE,NA,FALSE,FALSE,FALSE,NA,FALSE,NA,NA,NA,FALSE,FALSE,NA,FALSE,FALSE,FALSE,NA,FALSE,NA,FALSE,FALSE,NA,FALSE,FALSE,NA,FALSE,NA,FALSE,NA,NA,TRUE,2041,FALSE,FALSE,FALSE,NA,FALSE,NA,NA,NA,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,NA,FALSE,NA
I added a screenshot that helps show more:
As you mentioned, R does know how to skip through null lines when reading CSV files. This means that the resulting dataframe will contain no unexpected linebreaks, so if you just write it to CSV again, it will, likewise, have no random linebreaks:
temp <- read.csv('table_with_line_breaks.csv')
write.csv(temp, 'table_without_line_breaks.csv', row.names = FALSE, quote = FALSE)
I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using the read.table() command, it's showing 1044 lines (including a header).
The snippet of the file looks like
L*H
no
H*L
no
no
no
H*L
no
Might it be an issue with R?
When opened in Excel it doesn't show any errors as well.
EDIT
The problem was that R read a line like L + H* as three separated lines L + H*.
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines are there in your file. And feel free to use read.csv() to import it again to see it gets the expected return. Sometimes, the file may be parsed differently due to extra quote, extra return, and potentially some other things.
possible import steps:
look at your data with text editor or readLines() to figure out the delimiter and file type
Determine an import method (type read and press tab, you will see the import functions for import. Also check out readr.)
customize your argument. For example, if you have a header or not, or if you want to skip the first n lines.
Look at the data again in R with View(head(data)) or View(tail(data)). And determine if you need to repeat step 2,3,4
Based on the data you have provided, try using sep = "\n". By using sep = "\n" we ensure that each line is read as a single column value. Additionally, quote does not need to be used at all. There is no header in your example data, so I would remove that argument as well.
All that said, the following code should get the job done.
table <- read.table(file.choose(), sep = "\n")
I am trying to read a data file into R with several delimited columns. Some columns have entries which are special characters (such as arrow). Read.table comes back with an error:
incomplete final line found by readTableHeader
and does not read the file. Tried UTF-8, UTF-16 coding options which didn't help either. Here is a small example file.
I am not able to reproduce the arrow in this question box, hence I am attaching the image of the notepad screen of a small file (test1.txt).
Here is what I get when I try to open it.
test <- read.table("test1.TXT", header=T, sep=",", fileEncoding="UTF-8", stringsAsFactor=F)
Warning message: In read.table("test1.TXT", header = T, sep = ",",
fileEncoding = "UTF-8", : incomplete final line found by
readTableHeader on 'test1.TXT'
However, if I remove the second line (with the special character) and try to import the file, R imports it without problem.
test2.txt =
id, ti, comment
1001, 105AB, "All OK"
test <- read.table("test2.TXT", header=T, sep=",", fileEncoding="UTF-8", stringsAsFactor=F)
id ti comment
1 1001 105AB All OK
Although this is a small example, the file I am working with is very large. Is there a way I can import the file to R with those special characters in place?
Thank you.
test1.txt
I'm new, and I have a problem:
I got a dataset (csv file) with the 15 columns and 33,000 rows.
When I view the data in Excel it looks good, but when I try to load the data
into R- studio I have a problem:
I used the code:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="")
View(x)
The result is that the columnnames are good, but the data (row 2 and further) are
all in my first column.
In the first column the data is separated with ; . But when i try the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";")
The next problem is: Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
So i made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looks liked it worked.... But now the data is in the wrong columns (for example, the "name" column contains now the "time" value, and the "time" column contains the "costs" value.
Does anybody know how to fix this? I can rename columns but i think that is not the best way.
Excel, in its English version at least, may use a comma as separator, so you may want to try
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",")
I once had a similar problem where header had a long entry that contained a character that read.csv mistook for column separator. In reality, it was a part of a long name that wasn’t quoted properly.
Try skipping header and see if the problem persists
x1 <- read.csv(file = "1energy.csv", skip = 1, head = FALSE, sep=";")
In reply to your comment:
Two things you can do. Simplest one is to assign names manually:
myColNames <- c(“col1.name”,”col2.name”)
names(x1) <- myColNames
The other way is to read just the name row (the first line in your file)
read only the first line, split it into a character vector
nameLine <- readLines(con="1energy.csv", n=1)
fileColNames <- unlist(strsplit(nameLine,”;”))
then see how you can fix the problem, then assign names to your x1 data frame. I don’t know what exactly is wrong with your first line, so I can’t tell you how to fix it.
Yet another cruder option is to open your csv file using a text editor and edit column names.
It happens because of Exel's specifics. The easy solution is just to copy all your data Ctrl+C to Notepad and Save it again from Notepad as filename.csv (don't forget to remove .txt if necessary). It worked well for me. R opened this newly created csv file correctly, all data was separated at columns right.
Open your file in text edit and see if it really is separated with commas...
Sometimes .csv files are separated with tabs instead of commas or semicolon and when opening in excel it has no problem but in R you have to specify the separator like this:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="\t")
I once had the same problem, this was my solution. Hope it works for you.
This problem can arise due to regional settings on the excel application where the .csv file was created.
While in most places a "," separates the columns in a COMMA separated file (which makes sense), in other places it is a ";"
Depending on your regional settings, you can experiment with:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",") #used in North America
or,
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";") #used in some parts of Asia and Europe
You could use -
df <- read.csv("filename.csv", sep = ";", quote = "")
It solved one my problems similar to yours.
So i made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names =
NULL) And it looks liked it worked.... But now the data is in the
wrong columns (for example, the "name" column contains now the "time"
value, and the "time" column contains the "costs" value.
Does anybody know how to fix this? I can rename columns but i think
that is not the best way.
I had the exact same issue. Did quite some research and found out, that the CSV was ill-formed.
In the header line of the CSV there were all the labels (separated by the separator) and then a line break.
Starting with line 2, there was an additional separator at the end of each line. So an example of such an ill-formed CSV file looks like this:
Field1;Field2 <-- see the *missing* semicolon at the end
12;23; <-- see the *trailing* semicolon in each of the data lines
34;67;
45;56;
Such ill-formatted files are even harder to spot for TAB-separated files.
Excel does not care about that, when importing CSV files.
But R does care.
When you use skip=1 you skip the header line that contains part of the mismatch. The data frame will be imported well, but there will be a column of "NA" at the end of each row. And obviously you will not have column names, as these were skipped.
Easiest solution: edit the CSV file and either add an additional separator at the end of the header line as well, or remove the trailing delimiters in the data lines. You can also use generic read and write functions in R for text files to automate that editing.
You can transform the data by arranging the data into many cells corresponding to columns.
1.Open your csv file
2.copy the content and paste it into txt file save and copy its content
3.open new excell file
4.in excell go to the section responsible for data . it is acually called "Data"
5.then on the left side go to external data query , in german "externe Daten abfragen"
6.go ahead step by step and seperate by commas
7. save your file as csv
I had the same problem and it was frustrating...
However, I found the ultimate solution
First take this (csv file) and then convert it online to Json file and download it ... then redo the whole thing backwards (re-convert Jason to csv) online... download the converted file... give it a name...
then put it on your Rstudio
file name <- read.csv(file='name your file.csv')
... took me 4 days to think out of the box... 🙂🙂🙂