Hi, I'm trying to read a CSV file and would like to get a character vector.
The contents of the file look like this:
head filetoread.csv
0610010K14Rik,0610011F06Rik,1110032F04Rik,1110034G24Rik,1500011B03Rik,1700019L03Rik,1700021K19Rik, blah,blah,...
In an R session:
c <- read.csv("filetoread.csv")
> c
[1] X0610010K14Rik X0610011F06Rik X1110032F04Rik
...
> str(c)
'data.frame': 0 obs. of 2840 variables:
$ X0610010K14Rik : logi
$ X0610011F06Rik : logi
$ X1110032F04Rik : logi
$ X1110034G24Rik : logi
...
but I want something like:
> c
[1] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik" ...
str(c)
chr [1:2840] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik"...
We can use scan:
scan("filetoread.csv", sep=',', what = "", quiet = TRUE)
#[1] "0610010K14Rik" "0610011F06Rik" "1110032F04Rik" "1110034G24Rik"
#[5] "1500011B03Rik" "1700019L03Rik" "1700021K19Rik" " blah" "blah"
The file here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104323
called GSE104323_metadata_barcodes_24185cells.txt.gz
will not load properly in R: it is missing the age column, which is arguably the most important metadata, and half the columns are NA's.
The following code loads the data.
hochgerner24k_2018_meta <- read.table(paste(testsetpath,"/Hochgerner2018/GSE104323_metadata_barcodes_24185cells.txt.gz", sep=""), header =TRUE, fill =TRUE)
Without fill = TRUE the following error is produced: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 30 elements
How can I load this metadata into a dataframe without all this missing information?
That file isn't missing any metadata that I can see. It's a tab-separated file. How do I know that? Well, I suppose I could have looked at its documentation, which is probably around somewhere, but what I did instead was to look at it in a text editor.
You can observe the typical arrangement of text for a tab-separated file: all the columns are left-aligned, with some of them shifting over when the text bleeds into the next column.
hochgerner24k_2018_meta <- read.table("~/Downloads/GSE104323_metadata_barcodes_24185cells.txt.gz", header =TRUE, sep="\t")
> str(hochgerner24k_2018_meta)
'data.frame': 24216 obs. of 11 variables:
$ Sample.name..24185.single.cells. : chr "10X79_1_AAACTAGCTAGCCC-" "10X79_1_AAACTAGGATGTAT-" "10X79_1_AAACTCACGGCGTT-" "10X79_1_AAACTGTCGGCTCA-" ...
$ source.name : chr "dentate gyrus" "dentate gyrus" "dentate gyrus" "dentate gyrus" ...
$ organism : chr "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
$ characteristics..strain : chr "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" ...
$ characteristics..age : chr "P120" "P120" "P120" "P120" ...
$ characteristics..sex.of.pooled.animals: chr "2males+1female" "2males+1female" "2males+1female" "2males+1female" ...
$ characteristics..cell.cluster : chr "Neuroblast" "OPC" "GC-adult" "MOL" ...
$ molecule : chr "total RNA" "total RNA" "total RNA" "total RNA" ...
$ SRR.run.accession : chr "SRR6089817" "SRR6089947" "SRR6089529" "SRR6089595" ...
$ raw.file..original.file.name. : chr "10X79_1_AAACTAGCTAGCCC.fq.gz" "10X79_1_AAACTAGGATGTAT.fq.gz" "10X79_1_AAACTCACGGCGTT.fq.gz" "10X79_1_AAACTGTCGGCTCA.fq.gz" ...
$ UMI_CellularBarcode : chr "CGGCGATCCC_AAACTAGCTAGCCC" "AGTGGTAATG_AAACTAGGATGTAT" "GGGTGCGCTC_AAACTCACGGCGTT" "CCTTTCAACG_AAACTGTCGGCTCA" ...
Note: .gz files will not always be in this format (or any particular format). They are just compressed, and the delimiter has no special importance to the compression algorithm. The next .gz file you see could have any structure, or none at all.
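If you'd rather not leave R to check, a quick sketch for peeking at the raw text (adjust the path to wherever you saved the download):
readLines(gzfile("~/Downloads/GSE104323_metadata_barcodes_24185cells.txt.gz"), n = 2)
The tab characters (printed as \t) inside the returned strings are what tell you to use sep = "\t".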
I have a file with irregular quotes like the following:
"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""
"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"
It seems irregular because the first column is wrapped in only a single quote character, whereas every subsequent column is wrapped in doubled quote characters. I'd like to read it so that every column is imported without quotes (neither in the header nor in the data).
What I've tried is the following:
# All sorts of tidyverse imports
tib <- readr::read_csv("file.csv")
And I also tried the suggestions offered here:
# Base R import
DF0 <- read.table("file.csv", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])
# Data table import
DT0 <- fread("file.csv", header =F)
DT <- fread(paste(DT0[[1]], collapse = "\n"))
But even when it imports the file in the latter two cases, the variable names and some of the elements are wrapped in quotation marks.
I used data.table::fread with the quote="" option (which is "as is").
Then I cleaned the names and data by eliminating all the quotes.
The dates could be converted too, but I didn't do that.
library(data.table)
library(magrittr)
DT0 <- fread('file.csv', quote = "")
DT0 %>% setnames(names(.), gsub('"', '', names(.)))
string_cols <- which(sapply(DT0, class) == 'character')
DT0[, (string_cols) := lapply(.SD, function(x) gsub('\\"', '', x)),
.SDcols = string_cols]
str(DT0)
Classes ‘data.table’ and 'data.frame': 1 obs. of 16 variables:
$ INDICATOR : chr "Spi-3"
$ CTY_CODE : chr "AFG"
$ MGN_CODE : chr ""
$ EVENT_NR : int 1
$ EVENT_NR_CR : int 1
$ START_DATE : chr "1952-10-01"
$ PEAK_DATE : chr "1952-11-01"
$ END_DATE : chr "1953-06-01"
$ MAX_EXT_ON : chr "1952-11-01"
$ DURATION : int 9
$ SEVERITY : num 6.98
$ INTENSITY : num 0.78
$ AVERAGE_AREA : num 19.8
$ WIDEST_AREA_PERC: num 44.1
$ SCORE : int 5
$ GRP_ID : chr "1"
- attr(*, ".internal.selfref")=<externalptr>
I'm opening eBird data with auk and having trouble creating a path for the file. I set the path to a folder; when I try to change it to a file, it says the path is not TRUE.
With Sys.getenv() I can see the path is set to a folder, and the auk_get_ebd_path() command shows the same thing. When I try to change the path to a file inside that folder with the auk_set_ebd_path() command, I receive an error message.
library(auk)
auk_get_ebd_path()
[1] "/Users/lucypullen/Documents/bird/data"
auk_set_ebd_path("/Users/lucypullen/Documents/bird/data/ebd_CA_relApr-2019.txt", overwrite = TRUE)
[1] Error: dir.exists(paths = path) is not TRUE
Other attempts yielded an Error in file(con, "r") : cannot open the connection message:
Warning messages: 1: In file(con, "r") :
'raw = FALSE' but '/Users.....data/CA' is not a regular file
2: In file(con, "r") :
cannot open file '/Users/lucypullen/Documents/bird/data/CA': it is a directory
Seems like they want the path to go to a file. I thought the path would be complete with the system.file() command. I've tried a bunch of variations:
input_file <- system.file("/Users/lucypullen/Documents/bird/data/CA/ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("~/ebd_CA_relApr-2019.txt", package = "auk")
I suspect you should be doing this, since there appears to have been some sort of setup operation that preceded this question:
my_ebd_path <- auk_get_ebd_path() # since you appear to have already set it
my_full_file_loc <- paste0(my_ebd_path, "/", "ebd_CA_relApr-2019.txt")
my_ebd_data <- read_ebd(my_full_file_loc)
str(my_ebd_data)
# ------what I get with the sample file in the package--------------
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 494 obs. of 45 variables:
$ checklist_id : chr "S6852862" "S14432467" "S39033556" "S38303088" ...
$ global_unique_identifier : chr "URN:CornellLabOfOrnithology:EBIRD:OBS97935965" "URN:CornellLabOfOrnithology:EBIRD:OBS201605886" "URN:CornellLabOfOrnithology:EBIRD:OBS530638734" "URN:CornellLabOfOrnithology:EBIRD:OBS520887169" ...
$ last_edited_date : chr "2016-02-22 14:59:49" "2013-06-16 17:34:19" "2017-09-06 13:13:34" "2017-07-24 15:17:16" ...
$ taxonomic_order : num 20145 20145 20145 20145 20145 ...
$ category : chr "species" "species" "species" "species" ...
$ common_name : chr "Green Jay" "Green Jay" "Green Jay" "Green Jay" ...
$ scientific_name : chr "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" ...
$ observation_count : chr "4" "2" "1" "1" ...
$ breeding_bird_atlas_code : chr NA NA NA NA ...
#----snipped a bunch of output---------
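As an aside, system.file() only finds files that ship inside an installed package, which is why it works for auk's bundled sample data but returns "" for a file you downloaded yourself. For your own download, build the path directly and check that it exists before reading (a sketch, assuming the file sits directly in the EBD directory reported above):
input_file <- file.path(auk_get_ebd_path(), "ebd_CA_relApr-2019.txt")
file.exists(input_file)   # should be TRUE before calling read_ebd()
my_ebd_data <- read_ebd(input_file)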
I am using data.table's fread() function to read some data that have missing values. The data were generated in Excel, so the missing-value string is "#N/A". However, even when I use the na.strings argument, the columns of the resulting data are still character. To replicate this, here are the code and data.
Data:
Date,a,b,c,d,e,f,g
1/1/03,#N/A,0.384650146,0.992190069,0.203057232,0.636296656,0.271766148,0.347567706
1/2/03,#N/A,0.461486974,0.500702057,0.234400718,0.072789936,0.060900352,0.876749487
1/3/03,#N/A,0.573541006,0.478062582,0.840918789,0.061495666,0.64301024,0.939575302
1/4/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/5/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/6/03,#N/A,0.66678429,0.897482818,0.569609033,0.524295691,0.132941158,0.194114347
1/7/03,#N/A,0.576835985,0.982816576,0.605408973,0.093177815,0.902145012,0.291035649
1/8/03,#N/A,0.100952961,0.205491093,0.376410642,0.775917986,0.882827749,0.560508499
1/9/03,#N/A,0.350174456,0.290225065,0.428637309,0.022947911,0.7422805,0.354776101
1/10/03,#N/A,0.834345466,0.935128099,0.163158666,0.301310627,0.273928596,0.537167776
1/11/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/12/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/13/03,#N/A,0.325914633,0.68192633,0.320222677,0.249631582,0.605508964,0.739263677
1/14/03,#N/A,0.715104989,0.639040211,0.004186366,0.351412982,0.243570606,0.098312443
1/15/03,#N/A,0.750380716,0.264929325,0.782035411,0.963814327,0.93646428,0.453694758
1/16/03,#N/A,0.282389354,0.762102103,0.515151803,0.194083842,0.102386764,0.569730516
1/17/03,#N/A,0.367802161,0.906878948,0.848538256,0.538705673,0.707436236,0.186222899
1/18/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/19/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/20/03,#N/A,0.79933188,0.214688799,0.37011313,0.189503843,0.294051763,0.503147404
1/21/03,#N/A,0.620066341,0.329949446,0.123685075,0.69027192,0.060178071,0.599825005
(data saved in temp.csv)
Code:
library(data.table)
a <- fread("temp.csv", na.strings="#N/A")
gives (I have a larger dataset, so disregard the number of observations):
Classes ‘data.table’ and 'data.frame': 144 obs. of 8 variables:
$ Date: chr "1/1/03" "1/2/03" "1/3/03" "1/4/03" ...
$ a : chr NA NA NA NA ...
$ b : chr "0.384650146" "0.461486974" "0.573541006" NA ...
$ c : chr "0.992190069" "0.500702057" "0.478062582" NA ...
$ d : chr "0.203057232" "0.234400718" "0.840918789" NA ...
$ e : chr "0.636296656" "0.072789936" "0.061495666" NA ...
$ f : chr "0.271766148" "0.060900352" "0.64301024" NA ...
$ g : chr "0.347567706" "0.876749487" "0.939575302" NA ...
- attr(*, ".internal.selfref")=<externalptr>
This code works fine:
a <- read.csv("temp.csv", header=TRUE, na.strings="#N/A")
Is it a bug? Is there some smart workaround?
The documentation from ?fread for na.strings reads:
na.strings A character vector of strings to convert to NA_character_. By default for columns read as type character ",," is read as a blank string ("") and ",NA," is read as NA_character_. Typical alternatives might be na.strings=NULL or perhaps na.strings = c("NA","N/A","").
You should convert them to numeric yourself afterwards, I suppose. At least, that is what I understand from the documentation.
Something like this?
cbind(a[, 1], a[, lapply(.SD[, -1], as.numeric)])
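Or, to convert in place with data.table (a sketch, assuming every column except Date should become numeric):
num_cols <- setdiff(names(a), "Date")
a[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]
str(a)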