How to specify end of header line with read.table - r

I have ASCII files with data separated by $ signs.
There are 23 columns in the data; the first row contains the column names, but there is an inconsistency between the line endings, which causes R to import the data improperly, shifting the data one column to the left relative to their column names.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistency between the line endings causes R to think that the records have one more column than the header, so it treats the first column as a row.names column, which is not correct. Adding the specification row.names=NULL does not fix the issue.
If I manually add a $ sign at the end of the header line the problem is solved, but this is infeasible because the issue occurs in hundreds of files. Is there a way to specify how the header line is read? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names.

Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of the header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), "\n", file=paste0("fixed_", file_path), sep="")
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
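An alternative that avoids rewriting the files is to read the header line separately and pass it to read.table via col.names, so the trailing $ on the data lines no longer shifts anything. A minimal sketch, assuming the layout shown above; the extra name DROP_ME is just a hypothetical placeholder for the empty field created by the trailing separator:
# Read the header names on their own, then read the data rows with one
# extra column name for the empty field produced by the trailing "$".
hdr <- strsplit(readLines("deleteme.txt", n = 1), "$", fixed = TRUE)[[1]]
df <- read.table("deleteme.txt", skip = 1, sep = "$", quote = "",
                 comment.char = "", fill = TRUE,
                 col.names = c(hdr, "DROP_ME"), stringsAsFactors = FALSE)
df$DROP_ME <- NULL   # discard the placeholder column
str(df)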

Related

Exporting to csv as a vector, and be able to import it as a vector [duplicate]

Hi, I was trying to read a CSV file and I would like to get a vector.
The contents of the file look like this:
head filetoread.csv
0610010K14Rik,0610011F06Rik,1110032F04Rik,1110034G24Rik,1500011B03Rik,1700019L03Rik,1700021K19Rik, blah,blah,...
in R session:
c <- read.csv("filetoread.csv")
> c
[1] X0610010K14Rik X0610011F06Rik X1110032F04Rik
...
> str(c)
'data.frame': 0 obs. of 2840 variables:
$ X0610010K14Rik : logi
$ X0610011F06Rik : logi
$ X1110032F04Rik : logi
$ X1110034G24Rik : logi
...
but I want something like:
> c
[1] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik" ...
str(c)
chr [1:2840] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik"...
We can use scan:
scan("filetoread.csv", sep=',', what = "", quiet = TRUE)
#[1] "0610010K14Rik" "0610011F06Rik" "1110032F04Rik" "1110034G24Rik"
#[5] "1500011B03Rik" "1700019L03Rik" "1700021K19Rik" " blah" "blah"

Meta data .gz file in r won't load properly

The file here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104323
called: GSE104323_metadata_barcodes_24185cells.txt.gz
It will not load properly in R: it is missing the age column, which is arguably the most important metadata, and half the columns are NAs.
The following code loads the data.
hochgerner24k_2018_meta <- read.table(paste(testsetpath,"/Hochgerner2018/GSE104323_metadata_barcodes_24185cells.txt.gz", sep=""), header =TRUE, fill =TRUE)
Without fill = TRUE the following error is produced: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 30 elements
How can I load this metadata into a dataframe without all this missing information?
That file doesn't have any metadata that I can see. It's a tab-separated file. How do I know that? Well, I suppose I could have looked at its documentation, which is probably around somewhere, but what I did instead was to look at it in a text editor.
You can observe the typical arrangement of text for a tab-separated file: all the columns are left aligned with some of them shifting over when the text bleeds over into the next column.
hochgerner24k_2018_meta <- read.table("~/Downloads/GSE104323_metadata_barcodes_24185cells.txt.gz", header =TRUE, sep="\t")
> str(hochgerner24k_2018_meta)
'data.frame': 24216 obs. of 11 variables:
$ Sample.name..24185.single.cells. : chr "10X79_1_AAACTAGCTAGCCC-" "10X79_1_AAACTAGGATGTAT-" "10X79_1_AAACTCACGGCGTT-" "10X79_1_AAACTGTCGGCTCA-" ...
$ source.name : chr "dentate gyrus" "dentate gyrus" "dentate gyrus" "dentate gyrus" ...
$ organism : chr "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
$ characteristics..strain : chr "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" ...
$ characteristics..age : chr "P120" "P120" "P120" "P120" ...
$ characteristics..sex.of.pooled.animals: chr "2males+1female" "2males+1female" "2males+1female" "2males+1female" ...
$ characteristics..cell.cluster : chr "Neuroblast" "OPC" "GC-adult" "MOL" ...
$ molecule : chr "total RNA" "total RNA" "total RNA" "total RNA" ...
$ SRR.run.accession : chr "SRR6089817" "SRR6089947" "SRR6089529" "SRR6089595" ...
$ raw.file..original.file.name. : chr "10X79_1_AAACTAGCTAGCCC.fq.gz" "10X79_1_AAACTAGGATGTAT.fq.gz" "10X79_1_AAACTCACGGCGTT.fq.gz" "10X79_1_AAACTGTCGGCTCA.fq.gz" ...
$ UMI_CellularBarcode : chr "CGGCGATCCC_AAACTAGCTAGCCC" "AGTGGTAATG_AAACTAGGATGTAT" "GGGTGCGCTC_AAACTCACGGCGTT" "CCTTTCAACG_AAACTGTCGGCTCA" ...
Note: .gz files will not always be in this format (or any format). They are just compressed, and the delimiter has no particular importance to the compression algorithm. The next .gz file you see could have any structure or no structure.
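If you would rather not leave R to check, you can peek at the first few lines of a gzipped text file directly through a connection. A small sketch, using the file name from the question:
# Open the .gz file as a text connection and print the raw first lines;
# literal "\t" in the printed strings reveals a tab-separated layout.
con <- gzfile("~/Downloads/GSE104323_metadata_barcodes_24185cells.txt.gz", "rt")
first_lines <- readLines(con, n = 3)
close(con)
print(first_lines)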

How to read file with irregularly nested quotations?

I have a file with irregular quotes like the following:
"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""
"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"
It seems irregular because the first column is wrapped in only a single pair of quotes, whereas every subsequent column is wrapped in doubled quotes. I'd like to read it so that every column is imported without quotes (neither in the header, nor in the data).
What I've tried is the following:
# All sorts of tidyverse imports
tib <- readr::read_csv("file.csv")
And I also tried the suggestions offered here:
# Base R import
DF0 <- read.table("file.csv", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])
# Data table import
DT0 <- fread("file.csv", header =F)
DT <- fread(paste(DT0[[1]], collapse = "\n"))
But even when it imports the file in the latter two cases, the variable names and some of the elements are wrapped in quotation marks.
I used data.table::fread with the quote="" option (which is "as is").
Then I cleaned the names and data by eliminating all the quotes.
The dates could be converted too, but I didn't do that.
library(data.table)
library(magrittr)
DT0 <- fread('file.csv', quote = "")
DT0 %>% setnames(names(.), gsub('"', '', names(.)))
string_cols <- which(sapply(DT0, class) == 'character')
DT0[, (string_cols) := lapply(.SD, function(x) gsub('\\"', '', x)),
.SDcols = string_cols]
str(DT0)
Classes ‘data.table’ and 'data.frame': 1 obs. of 16 variables:
$ INDICATOR : chr "Spi-3"
$ CTY_CODE : chr "AFG"
$ MGN_CODE : chr ""
$ EVENT_NR : int 1
$ EVENT_NR_CR : int 1
$ START_DATE : chr "1952-10-01"
$ PEAK_DATE : chr "1952-11-01"
$ END_DATE : chr "1953-06-01"
$ MAX_EXT_ON : chr "1952-11-01"
$ DURATION : int 9
$ SEVERITY : num 6.98
$ INTENSITY : num 0.78
$ AVERAGE_AREA : num 19.8
$ WIDEST_AREA_PERC: num 44.1
$ SCORE : int 5
$ GRP_ID : chr "1"
- attr(*, ".internal.selfref")=<externalptr>

troubleshooting a connection to the ebd file path for awk in r in mac and windows

I'm opening eBird data with auk and am having trouble creating a path for the file. I set the path to a folder; when I try to change it to a file, it says the path is not TRUE.
With Sys.getenv() I can see the path is set to a folder, and with the auk_get_ebd_path() command I see the same thing. When I try to change the path to a file inside that folder with the auk_set_ebd_path() command, I receive an error message.
library(auk)
auk_get_ebd_path()
[1] "/Users/lucypullen/Documents/bird/data"
auk_set_ebd_path("/Users/lucypullen/Documents/bird/data/ebd_CA_relApr-2019.txt", overwrite = TRUE)
[1] Error: dir.exists(paths = path) is not TRUE
Other attempts yielded an Error in file(con, "r") : cannot open the connection message:
Warning messages: 1: In file(con, "r") :
'raw = FALSE' but '/Users.....data/CA' is not a regular file
2: In file(con, "r") :
cannot open file '/Users/lucypullen/Documents/bird/data/CA': it is a directory
It seems like they want the path to point to a file. I thought the path would be complete with the system.file() command. I've tried a bunch of variations:
input_file <- system.file("/Users/lucypullen/Documents/bird/data/CA/ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("~/ebd_CA_relApr-2019.txt", package = "auk")
I suspect you should be doing this, since there appears to have been some sort of setup operation that preceded this question:
my_ebd_path <- auk_get_ebd_path() # since you appear to have already set it
my_full_file_loc <- paste0(my_ebd_path, "/", "ebd_CA_relApr-2019.txt")
my_ebd_data <- read_ebd(my_full_file_loc)
str(my_ebd_data)
# ------what I get with the sample file in the package--------------
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 494 obs. of 45 variables:
$ checklist_id : chr "S6852862" "S14432467" "S39033556" "S38303088" ...
$ global_unique_identifier : chr "URN:CornellLabOfOrnithology:EBIRD:OBS97935965" "URN:CornellLabOfOrnithology:EBIRD:OBS201605886" "URN:CornellLabOfOrnithology:EBIRD:OBS530638734" "URN:CornellLabOfOrnithology:EBIRD:OBS520887169" ...
$ last_edited_date : chr "2016-02-22 14:59:49" "2013-06-16 17:34:19" "2017-09-06 13:13:34" "2017-07-24 15:17:16" ...
$ taxonomic_order : num 20145 20145 20145 20145 20145 ...
$ category : chr "species" "species" "species" "species" ...
$ common_name : chr "Green Jay" "Green Jay" "Green Jay" "Green Jay" ...
$ scientific_name : chr "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" ...
$ observation_count : chr "4" "2" "1" "1" ...
$ breeding_bird_atlas_code : chr NA NA NA NA ...
#----snipped a bunch of output---------
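A compact variant of the same idea, using file.path() and an early existence check so a wrong path fails immediately; a sketch only, with the file name taken from the question:
library(auk)
# Build the full path from the directory auk already knows about,
# then stop early if the file is not actually there.
my_ebd_path <- auk_get_ebd_path()
input_file <- file.path(my_ebd_path, "ebd_CA_relApr-2019.txt")
stopifnot(file.exists(input_file))
my_ebd_data <- read_ebd(input_file)
str(my_ebd_data)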

Bad interpretation of #N/A using `fread`

I am using the data.table fread() function to read some data that have missing values. They were generated in Excel, so the missing-value string is "#N/A". However, even when I use the na.strings argument, the columns in the resulting data are still character. To replicate this, here are the code and data.
Data:
Date,a,b,c,d,e,f,g
1/1/03,#N/A,0.384650146,0.992190069,0.203057232,0.636296656,0.271766148,0.347567706
1/2/03,#N/A,0.461486974,0.500702057,0.234400718,0.072789936,0.060900352,0.876749487
1/3/03,#N/A,0.573541006,0.478062582,0.840918789,0.061495666,0.64301024,0.939575302
1/4/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/5/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/6/03,#N/A,0.66678429,0.897482818,0.569609033,0.524295691,0.132941158,0.194114347
1/7/03,#N/A,0.576835985,0.982816576,0.605408973,0.093177815,0.902145012,0.291035649
1/8/03,#N/A,0.100952961,0.205491093,0.376410642,0.775917986,0.882827749,0.560508499
1/9/03,#N/A,0.350174456,0.290225065,0.428637309,0.022947911,0.7422805,0.354776101
1/10/03,#N/A,0.834345466,0.935128099,0.163158666,0.301310627,0.273928596,0.537167776
1/11/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/12/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/13/03,#N/A,0.325914633,0.68192633,0.320222677,0.249631582,0.605508964,0.739263677
1/14/03,#N/A,0.715104989,0.639040211,0.004186366,0.351412982,0.243570606,0.098312443
1/15/03,#N/A,0.750380716,0.264929325,0.782035411,0.963814327,0.93646428,0.453694758
1/16/03,#N/A,0.282389354,0.762102103,0.515151803,0.194083842,0.102386764,0.569730516
1/17/03,#N/A,0.367802161,0.906878948,0.848538256,0.538705673,0.707436236,0.186222899
1/18/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/19/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/20/03,#N/A,0.79933188,0.214688799,0.37011313,0.189503843,0.294051763,0.503147404
1/21/03,#N/A,0.620066341,0.329949446,0.123685075,0.69027192,0.060178071,0.599825005
(data saved in temp.csv)
Code:
library(data.table)
a <- fread("temp.csv", na.strings="#N/A")
gives (I have a larger dataset, so ignore the number of observations):
Classes ‘data.table’ and 'data.frame': 144 obs. of 8 variables:
$ Date: chr "1/1/03" "1/2/03" "1/3/03" "1/4/03" ...
$ a : chr NA NA NA NA ...
$ b : chr "0.384650146" "0.461486974" "0.573541006" NA ...
$ c : chr "0.992190069" "0.500702057" "0.478062582" NA ...
$ d : chr "0.203057232" "0.234400718" "0.840918789" NA ...
$ e : chr "0.636296656" "0.072789936" "0.061495666" NA ...
$ f : chr "0.271766148" "0.060900352" "0.64301024" NA ...
$ g : chr "0.347567706" "0.876749487" "0.939575302" NA ...
- attr(*, ".internal.selfref")=<externalptr>
This code works fine:
a <- read.csv("temp.csv", header=TRUE, na.strings="#N/A")
Is it a bug? Is there some smart workaround?
The documentation from ?fread for na.strings reads:
na.strings A character vector of strings to convert to NA_character_. By default for columns read as type character ",," is read as a blank string ("") and ",NA," is read as NA_character_. Typical alternatives might be na.strings=NULL or perhaps na.strings = c("NA","N/A","").
You should convert them to numeric yourself afterwards, I suppose. At least this is what I understand from the documentation.
Something like this?
cbind(a[, 1], a[, lapply(.SD[, -1], as.numeric)])
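The same conversion can also be done in place with data.table syntax, leaving the Date column untouched. A minimal sketch, assuming temp.csv as above:
library(data.table)
a <- fread("temp.csv", na.strings = "#N/A")
# Convert every column except Date from character to numeric, by reference.
num_cols <- setdiff(names(a), "Date")
a[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]
str(a)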
