Format CSV with random line breaks and export in R

I have CSV files that for some reason have random line breaks after some codes. I can read these files fine in R, but I was wondering whether there is a way to write an output file without the random line breaks? Importing the file into other programs creates issues where, for example, the 416 below becomes a new line:
id,Abuse,AbuseHistoryOfAbuse,AbuseCurrentlyInAbusive,AbuseHistoryOfCPS,AbuseImminentRisk,AbuseInterventionCodes,Alcohol,AlcoholCurrentlyInTreatment,AlcoholSuspectUse,AlcoholAdmitsUse,AlcoholInterventionCodes,Asthma,AsthmaHistory,AsthmaInterventionCodes,BarriersToService,BarriersExperiencing,BarriersHistoryOf,BarriersToServiceInterventionCodes,BasicNeeds,BasicFood,BasicFoodLimitedAccess,BasicFoodNoWIC,BasicFoodNoDHS,BasicHousing,BasicHousingHasRegular,BasicHousingHomelessWith,BasicHousingHomelessWithout,BasicTransportation,BasicTransportationNoneLimited,BasicOther,BasicNeedsInterventionCodes,Breastfeeding,BreastfeedingPrenatal,BreastfeedingInterventionCodes,BreastHealth,BreastHealthInterventionCodes,Diabetes,DiabetesHistoryGestational,DiabetesHistoryDiabetes,DiabetesInterventionCodes,Drugs,DrugsType,DrugsUse,DrugsInterventionCodes,FamilyPlanning,FamilyPlanningNoPlans,FamilyPlanningInterventionCodes,Hypertension,HypertensionHistoryHypertension,HypertensionHistoryPreeclampsia,HypertensionInterventionCodes,Nutrition,NutritionInterventionCodes,ChronicDisease,ChronicDiseaseHistoryOther,ChronicDiseaseInterventionCodes,Periodontal,PeriodontalNoVisit,PeriodontalInterventionCodes,PersonalGoals,PersonalGoalsInterventionCodes,Smoking,SmokingUse,SmokingInterventionCodes,SocialSupport,SocialSupportInterventionCodes,STD,STDDiscloseSTD,STDDiscloseHIV,STDInterventionCodes,Stress,PrenatalEDSScore,PostnatalEDSScore,StressScore,StressAll,StressModerate,StressHistoryMentalHealth,StressHistoryBabyBlues,StressReportsStress,StressCurrentlyTreated,StressNotFollowing,StressEndorsesSuicidal,StressInterventionCodes,WomensHealth,WomensHealthInterventionCodes
0001,FALSE,FALSE,FALSE,FALSE,FALSE,NA,FALSE,FALSE,FALSE,FALSE,NA,FALSE,FALSE,NA,TRUE,FALSE,TRUE,411
416 ,TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,5F11
5T42
,TRUE,Not breastfeeding,NA,FALSE,NA,FALSE,FALSE,FALSE,NA,FALSE,NA,NA,NA,FALSE,FALSE,NA,FALSE,FALSE,FALSE,NA,FALSE,NA,FALSE,FALSE,NA,FALSE,FALSE,NA,FALSE,NA,FALSE,NA,NA,TRUE,2041,FALSE,FALSE,FALSE,NA,FALSE,NA,NA,NA,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,NA,FALSE,NA

As you mentioned, R already reads this file correctly, so the resulting data frame contains no unexpected line breaks. If you simply write it back out to CSV, the output will likewise be free of them:
# Round-trip: parsing repairs the broken rows; writing produces one record per line
temp <- read.csv('table_with_line_breaks.csv')
write.csv(temp, 'table_without_line_breaks.csv', row.names = FALSE, quote = FALSE)
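One caveat: quote = FALSE assumes no field contains a comma; drop that argument if any do. As a quick sanity check on the round trip (a sketch reusing the same file names):

# Re-read the rewritten file and confirm it matches the original data frame
reread <- read.csv('table_without_line_breaks.csv')
all.equal(temp, reread)  # expect TRUE if no quoting issues arose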

Related

Excel Exporting Multiple Data Sets to a Single Spreadsheet File

I am trying to output multiple small data frames to an Excel file. The data frames contain residuals and predicted values from mgcv models run in a loop. Each is a separate small data set that I am trying to output to a separate worksheet in the same Excel spreadsheet file.
From what I can tell, the relevant line of code causing the error is
write.xlsx(resid_pred, parfilename, sheetName = parsheetname, append = TRUE)
where resid_pred is the residuals/predicted data frame, parfilename is the file name and path, and parsheetname is the sheet name.
The error message is
Error in saveWorkbook(wb, file = file, overwrite = overwrite) : File already exists!
Which makes no sense since the file would HAVE to exist if I am appending to it. Does anyone have a clue?
Amazingly, the following code does work:
write.xlsx2(resid_pred, file = parfilename, sheetName = parsheetname,
            col.names = TRUE, row.names = FALSE, append = TRUE, overwrite = FALSE)
The only difference is that it calls write.xlsx2 rather than write.xlsx.
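If the append behaviour keeps misfiring, another route (a sketch using the openxlsx package; resid_pred_list is a hypothetical named list holding one data frame per model) is to build a single workbook in memory and save it once:

library(openxlsx)

# Hypothetical: resid_pred_list is a named list of residual/predicted data frames
wb <- createWorkbook()
for (sheet_name in names(resid_pred_list)) {
  addWorksheet(wb, sheetName = sheet_name)
  writeData(wb, sheet = sheet_name, x = resid_pred_list[[sheet_name]])
}
saveWorkbook(wb, file = parfilename, overwrite = TRUE)  # one save; no append logic needed

Saving once at the end sidesteps the file-exists check that append = TRUE trips over.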

How to import YRBS ASCII .dat file into R

I'm trying to import the YRBS ASCII .dat file found here to analyze in R, but I'm having trouble importing the file. I followed the recommendations here and here, but none seem to work. More specifically, the data still shows up as a single column/variable in R with 14,765 observations.
I've tried the readLines(), read.table(), and read.csv() functions, but none of them separate the columns.
Here are the specific codes I tried:
readLines("D:/Projects/XXH2017_YRBS_Data.dat", n=5)
read.csv("D:/Projects/XXH2017_YRBS_Data.dat", header = FALSE)
read.table("D:/Projects/XXH2017_YRBS_Data.dat", header = FALSE)
readLines and read.csv only produced one column, and read.table gave an error stating that line 1 did not have 23 elements (which I'm assuming just refers to the missing values?).
The data also starts from line 1 so I cannot use skip = 1 like some have suggested online.
How do I import this file into R so that I can separate the columns?
The file is bulky, so I did not download it. Try the following code, then compare the result against the Access version of the data:
data <- readr::read_table2("XXH2017_YRBS_Data.dat", col_names = FALSE, na = ".")
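That said, YRBS ASCII files are fixed-width rather than delimited, so if read_table2() misaligns columns, a fixed-width reader is the safer route. A minimal sketch (the positions and names below are placeholders; take the real ones from the YRBS 2017 codebook):

library(readr)

# Hypothetical column positions -- replace with the start/end columns
# documented in the YRBS codebook for each variable
yrbs <- read_fwf(
  "XXH2017_YRBS_Data.dat",
  col_positions = fwf_positions(
    start = c(1, 9, 12),
    end   = c(8, 11, 13),
    col_names = c("record_id", "age", "sex")
  ),
  na = "."
)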

Different number of lines when loading a file into R

I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using read.table(), it shows 1044 lines (including the header).
The snippet of the file looks like
L*H
no
H*L
no
no
no
H*L
no
Might this be an issue with R?
When opened in Excel, the file doesn't show any extra lines either.
EDIT
The problem was that R read a line like L + H* as three separate lines: L, +, and H*.
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines there really are in your file, then use read.csv() to import it again and check whether it returns what you expect. Sometimes a file is parsed differently because of an extra quote, an extra carriage return, or something else.
Possible import steps:
1. Look at your data with a text editor or readLines() to figure out the delimiter and file type.
2. Determine an import method (type read and press Tab; you will see the available import functions. Also check out readr.)
3. Customize your arguments, for example whether there is a header or whether you want to skip the first n lines.
4. Look at the data again in R with View(head(data)) or View(tail(data)), and decide whether you need to repeat steps 2-4.
Based on the data you have provided, try using sep = "\n". This ensures that each line is read as a single column value. Additionally, quote does not need to be set at all, and since there is no header in your example data, I would remove that argument as well.
All that said, the following should get the job done:
table <- read.table(file.choose(), sep = "\n")
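Putting the two suggestions together (the file name here is a placeholder):

# Count the physical lines first, then compare to what read.table() parses
raw_lines <- readLines("tones.txt")           # hypothetical file name
length(raw_lines)                             # should report 1040 for the OP's file

tones <- read.table("tones.txt", sep = "\n")
nrow(tones)                                   # should now match the line count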

Loading csv into R with `sep=,` as the first line

The program I am exporting my data from (Power BI) saves the data as a .csv file, but the first line of the file is sep=, and the second line has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)
but I keep getting an error that there is the wrong number of columns (because R takes the first line as defining the number of columns).
If I use header=F instead then it loads, but when I then do names(initData) <- initData[2,], the names contain spaces and illegal characters, which breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can open the .csv in a text editor and delete the first line manually before loading (if I do that, everything works fine), but I have to export a bunch of files, and doing that each time is tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content <- readLines("initData.csv")
skip_first_line <- all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)
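A shorter alternative, since the offending sep=, line is always the first one, is read.csv()'s built-in skip argument:

initData <- read.csv("initData.csv",
                     skip = 1,             # jump over the "sep=," line
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)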
Your file could also be in a UTF-16 encoding. See hrbrmstr's answer on how to read a UTF-16 file.
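For example (a sketch; UTF-16LE is only a guess at the specific variant):

initData <- read.csv("initData.csv",
                     fileEncoding = "UTF-16LE",  # assumption: little-endian UTF-16
                     skip = 1,
                     header = TRUE)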

Inconsistent results between fread() and read.table() for .tsv file in R

My question concerns two issues I encountered while reading a published .tsv file that contains campaign finance data.
First, the file has a null character that terminates input and throws the error 'embedded nul in string: 'NAVARRO b\0\023 POWERS'' when using data.table::fread(). I understand there are a number of potential solutions to this problem, but I was hoping to find one within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = TRUE) does not throw an error, but it also does not read the same number of rows that data.table::fread() identified (~100k rows with read.table() vs. ~8M rows with fread()). The fread() answer seems more likely to be correct, as the file is ~1.5 GB and fread() identifies valid data when reading the rows leading up to where the error occurs.
Here is a link to the code and output for the issue.
Any ideas on why read.table() returns such different results? fread() works by guessing characteristics of the input file, but it doesn't seem to be guessing any exotic options that I didn't also use in read.table().
Thanks for your help!
NOTE
I do not know anything about the file in question other than its source (the California Secretary of State) and what information it contains. At any rate, the file is too large to open in Excel or Notepad, so I haven't been able to examine it visually beyond looking at a handful of rows in R.
I couldn't figure out an R way to deal with the issue, but I was able to use a Python script that relies on pandas:
import os
import pandas as pd

os.chdir("C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")

# Read the TSV in 500k-row chunks, skipping unparseable lines
receipts_chunked = pd.read_table("RCPT_CD.TSV", sep="\t", error_bad_lines=False,
                                 low_memory=False, chunksize=int(5e5))

chunk_num = 0
for chunk in receipts_chunked:
    chunk_num += 1
    file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
    print("Output file:", file_name)
    chunk.to_csv(file_name, sep=",", index=False)
The problem with this route is that, with error_bad_lines = False, problem rows are simply skipped rather than raising an error. There are only a handful of error cases (out of ~8 million rows), but this is obviously still suboptimal.
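For completeness, one R-only route (a sketch, untested at this file size and memory-hungry, since the whole file is pulled into a single string) is to strip the nul bytes before handing the text to fread(), which accepts a text argument:

library(data.table)

# Read the raw bytes, drop embedded nuls, then parse the cleaned text
raw_bytes <- readBin("RCPT_CD.TSV", what = "raw", n = file.size("RCPT_CD.TSV"))
clean_text <- rawToChar(raw_bytes[raw_bytes != as.raw(0)])
receipts <- fread(text = clean_text, sep = "\t", quote = "")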
