I'm trying to import the YRBS ASCII .dat file found here into R for analysis, but I'm having trouble importing the file. I followed the recommendations here and here, but none of them work: the data still shows up in R as a single column/variable with 14,765 observations.
I've tried the readLines(), read.table(), and read.csv() functions, but none of them separate the columns.
Here are the specific calls I tried:
readLines("D:/Projects/XXH2017_YRBS_Data.dat", n=5)
read.csv("D:/Projects/XXH2017_YRBS_Data.dat", header = FALSE)
read.table("D:/Projects/XXH2017_YRBS_Data.dat", header = FALSE)
readLines() and read.csv() returned only one column, and read.table() gave an error stating that line 1 did not have 23 elements (which I assume refers to the missing values?).
The data also starts from line 1 so I cannot use skip = 1 like some have suggested online.
How do I import this file into R so that I can separate the columns?
The file is bulky, so I did not download it. First, get the Access version of the data, then try the following code and compare the result to the Access data.
data <- readr::read_table2("XXH2017_YRBS_Data.dat", col_names = FALSE, na = ".")
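Note: if the .dat file turns out to be fixed-width rather than delimited (the YRBS documentation distributes SAS/SPSS input programs that define column positions), readr::read_fwf() would be the tool to reach for. A minimal sketch with hypothetical column positions:
library(readr)
# Hypothetical positions; take the real begin/end columns from the
# SAS or SPSS input program shipped with the .dat file.
data <- read_fwf("XXH2017_YRBS_Data.dat",
                 fwf_cols(site = c(1, 5), age = c(17, 17)),
                 na = ".")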
I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using the read.table() command, it's showing 1044 lines (including a header).
The snippet of the file looks like
L*H
no
H*L
no
no
no
H*L
no
Might it be an issue with R?
When opened in Excel, the file doesn't show any errors either.
EDIT
The problem was that R read a line like L*H as three separate lines (L, *, and H).
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines are actually in your file, then use read.csv() to import it again and check whether it returns what you expect. Sometimes a file is parsed differently because of an extra quote character, an extra carriage return, or something else along those lines.
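For example (file name is hypothetical):
length(readLines("mydata.txt"))  # the number of physical lines R sees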
Possible import steps (a minimal sketch in code follows this list):
1. Look at your data with a text editor or readLines() to figure out the delimiter and file type.
2. Choose an import function (type read and press Tab to see the available import functions; also check out readr).
3. Customize the arguments: for example, whether there is a header, or whether you want to skip the first n lines.
4. Look at the data again in R with View(head(data)) or View(tail(data)), and decide whether you need to repeat steps 2-4.
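A minimal sketch of those steps (file name and arguments are hypothetical):
readLines("mydata.txt", n = 5)      # step 1: inspect the delimiter and layout
data <- read.csv("mydata.txt",      # step 2: choose an import function
                 header = TRUE,     # step 3: customize its arguments
                 skip = 0)
View(head(data))                    # step 4: check the result in R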
Based on the data you have provided, try using sep = "\n"; this ensures that each line is read as a single column value. The quote argument does not need to be used at all, and since there is no header in your example data, I would remove that argument as well.
All that said, the following code should get the job done.
table <- read.table(file.choose(), sep = "\n")
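To verify the fix, the number of lines readLines() reports should now match the number of rows read.table() returns:
length(readLines(file.choose()))  # physical lines in the file
nrow(table)                       # rows parsed; the two should now agree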
I am trying to import data in xls format into R, but it reads the header incorrectly: instead of
X1
R interprets the column name as
`X1 `
The trailing space makes writing R syntax against that column needlessly complicated.
How can this issue be resolved?
You can skip the header record and supply your own column names with any of the R packages that read Excel data. Here is an example with readxl::read_excel().
library(readxl)
data <- read_excel("./data/anExcelWorksheet.xlsx",
                   col_names = FALSE,
                   skip = 1)
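Alternatively, since the underlying problem is a stray trailing space in the header, a hedged sketch that keeps the header row and trims the imported names (file path as in the example above):
library(readxl)
data <- read_excel("./data/anExcelWorksheet.xlsx")
names(data) <- trimws(names(data))  # strip stray whitespace from the column names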
I was trying to read an Excel spreadsheet into an R data frame. However, some of the columns contain formulas or are linked to other, external spreadsheets. Whenever I read the spreadsheet into R, many cells become NA. Is there a good way to fix this so that I get the original values of those cells?
The R script I used to do the import is like the following:
options(java.parameters = "-Xmx8g")
library(XLConnect)
# Step 1 import the "raw" tab
path_cost = "..."
wb = loadWorkbook(...)
raw = readWorksheet(wb, sheet = '...', header = TRUE, useCachedValues = FALSE)
UPDATE: read_excel from the readxl package looks like a better solution. It's very fast (0.14 sec in the 1400 x 6 file I mentioned in the comments) and it evaluates formulas before import. It doesn't use java, so no need to set any java options.
library(readxl)
# sheet can be a string (name of sheet) or an integer (position of sheet)
raw = read_excel(file, sheet = sheet)
For more information and examples, see the short vignette.
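If you are unsure of the sheet name, readxl also provides excel_sheets() to list them; a small sketch with a hypothetical file name:
library(readxl)
excel_sheets("workbook.xlsx")                  # list the available sheet names
raw <- read_excel("workbook.xlsx", sheet = 1)  # then read by name or position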
ORIGINAL ANSWER: Try read.xlsx from the xlsx package. The help file implies that by default it evaluates formulas before importing (see the keepFormulas parameter). I checked this on a small test file and it worked for me. Formula results were imported correctly, including formulas that depend on other sheets in the same workbook and formulas that depend on other workbooks in the same directory.
One caveat: If an externally linked sheet has changed since the last time you updated the links on the file you're reading into R, then any values read into R that depend on external links will be the old values, not the latest ones.
The code in your case would be:
library(xlsx)
options(java.parameters = "-Xmx8g") # xlsx also uses java
# Replace file and sheetName with appropriate values for your file
# keepFormulas=FALSE and header=TRUE are the defaults. I added them only for illustration.
raw = read.xlsx(file, sheetName=sheetName, header=TRUE, keepFormulas=FALSE)
I have a text file of 4.5 million rows and 90 columns to import into R. Using read.table I get the cannot allocate vector of size... error message, so I am trying to import with the ff package before subsetting the data to extract the observations that interest me (see my previous question for more details: Add selection crteria to read.table).
So, I use the following code to import:
test<-read.csv2.ffdf("FD_INDCVIZC_2010.txt", header=T)
but this returns the following error message :
Error in read.table.ffdf(FUN = "read.csv2", ...) :
only ffdf objects can be used for appending (and skipping the first.row chunk)
What am I doing wrong?
Here are the first 5 rows of the text file:
CANTVILLE.NUMMI.AEMMR.AGED.AGER20.AGEREV.AGEREVQ.ANAI.ANEMR.APAF.ARM.ASCEN.BAIN.BATI.CATIRIS.CATL.CATPC.CHAU.CHFL.CHOS.CLIM.CMBL.COUPLE.CS1.CUIS.DEPT.DEROU.DIPL.DNAI.EAU.EGOUL.ELEC.EMPL.ETUD.GARL.HLML.ILETUD.ILT.IMMI.INAI.INATC.INFAM.INPER.INPERF.IPO ...
1 1601;1;8;052;54;051;050;1956;03;1;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;1;Z;16;Z;03;16;Z;Z;Z;21;2;2;2;Z;1;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;1;1;1;4;M;22;32;AZ;AZ;00;04;2;2;0;1;2;4;1;00;Z;54;2;ZZ;1;32;2;10;2;11;111;11;11;1;2;ZZZZZZ;1;2;1;4;41;2;Z
2 1601;1;8;012;14;011;010;1996;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
3 1601;1;8;006;05;005;005;2002;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
4 1601;1;8;047;54;046;045;1961;03;2;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;6;Z;16;Z;14;974;Z;Z;Z;16;2;2;2;Z;2;2;4;1;1;4;4;4,02306147485403;ZZZZZZZZZ;2;2;2;1;M;22;32;MN;GU;14;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;2;32;1;10;2;11;111;11;11;1;4;ZZZZZZ;1;2;1;4;41;2;Z
5 1601;2;9;053;54;052;050;1958;02;1;ZZZZZ;2;Z;Z;Z;1;0;Z;2;Z;Z;2;1;2;Z;16;Z;12;87;Z;Z;Z;22;2;1;2;Z;1;2;3;1;1;2;2;4,21707670353782;ZZZZZZZZZ;1;1;1;2;M;21;40;GZ;GU;00;07;0;0;0;0;0;2;1;00;Z;54;2;ZZ;1;30;2;10;3;11;111;ZZ;ZZ;1;1;ZZZZZZ;2;2;1;4;42;1;Z
I encountered a similar problem related to reading csv into ff objects. On using
read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt")
instead of the implicit (positional) call
read.csv2.ffdf("FD_INDCVIZC_2010.txt")
I got rid of the error. Explicitly naming the file argument seems necessary with the ff functions, because their first positional argument is x (an ffdf object to append to), not file.
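Putting it together, a minimal complete form of the corrected call (the default chunk sizes are fine for a first attempt):
library(ff)
test <- read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt", header = TRUE)
dim(test)  # rows and columns of the on-disk ffdf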
You could try the following code:
read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt",
               sep = ";",
               VERBOSE = TRUE,
               first.rows = 100000,
               next.rows = 200000,
               header = TRUE)
The sample rows in the question are semicolon-separated, so sep = ";" (the read.csv2 default) is the right delimiter here; also note that file must be named explicitly, as in the answer above.
Sorry, I only came across the question just now. With the VERBOSE option you can see how much time each block of data takes to be read. Hope this helps.
If possible try to filter the data at the OS level, that is before they are loaded into R. The simplest way to do this in R is to use a combination of pipe and grep command:
textpipe <- pipe('grep XXXX file.name')
mutable <- read.table(textpipe)
You can use grep, awk, sed, and basically all the machinery of the Unix command-line tools to apply selection criteria and edit the csv files before they are imported into R. This is very fast, and it strips unnecessary data before R even begins reading from the pipe.
This works well under Linux and Mac; under Windows you may need to install Cygwin, or use some other Windows-specific utilities.
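For instance, a hedged sketch that keeps only rows whose first semicolon-separated field equals 1601, matching the sample data in the question (the header line is dropped by the filter, hence header = FALSE):
textpipe <- pipe("awk -F';' '$1 == 1601' FD_INDCVIZC_2010.txt")
subset_data <- read.csv2(textpipe, header = FALSE)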
Perhaps you could try the following code:
read.table.ffdf(x = NULL, file = 'your/file/path', sep = ';')
This question already has answers here:
'Incomplete final line' warning when trying to read a .csv file into R
I am trying to import CSV files to graph for a project. I'm using R 2.15.2 on Mac OS X.
The first way tried
The script I'm trying to run to import the CSV file is this:
group4 <- read.csv("XXXX.csv", header=T)
But I keep getting this error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
object 'XXXXXX.csv' not found
The second way tried
I tried changing my working directory but got another error saying I can't change it. So I went into the Preferences tab and set the working directory to the folder that contains my CSV files, but I still get the same error as in the first way.
The third way tried
Then I tried this script:
group4 <- read.table(file.choose(), sep="\t", header=T)
And I get this error:
Warning message:
In read.table(file.choose(), sep = "\t", header = T) :
incomplete final line found by readTableHeader on '/Users/xxxxxx/Documents/Programming/R/xxxxxx/xxxxxx.csv'
I've searched on the R site and all over the Internet, and nothing has got me to the point where I can import this simple CSV file into the R console.
Taking your three attempts in order:
1. The file is not in your working directory: change the directory, or use an absolute path.
2. You are pointing at a non-existent directory, or you do not have write privileges there.
3. The last line of your file is malformed.
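For the first error, a few quick checks in the console usually settle it (paths are hypothetical):
getwd()                          # where R is currently looking
setwd("/Users/me/Documents/R")   # point it at the folder holding the CSV
file.exists("XXXX.csv")          # TRUE once the path is right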
As to the missing EOF (i.e. last line in file is corrupted)...
Usually a data file should end with a newline after the last record (in many editors this shows up as an empty final line); check whether that is the case for your file.
As an alternative, I would suggest trying out readLines(). This function reads each line of your data file into a character vector. If you know the format of your input, i.e. the number of columns in the table, you could do this...
number.of.columns <- 5 # the number of columns in your data file
delimiter <- "\t" # this is what separates the values in your data file
lines <- readLines("path/to/your/file.csv", -1L)                  # read every line, incomplete final line included
values <- unlist(lapply(lines, strsplit, delimiter, fixed=TRUE))  # split each line into its fields
data <- matrix(values, byrow=TRUE, ncol=number.of.columns)        # reshape the fields into a table
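A possible follow-up, assuming the first row of the file is a header (a hypothetical, since the snippet above doesn't show one): convert the matrix to a data frame and promote that row to column names.
df <- as.data.frame(data[-1, ], stringsAsFactors = FALSE)  # drop the header row from the body
names(df) <- data[1, ]                                     # and reuse it as column names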