Metadata .gz file in R won't load properly

The file here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104323
called: GSE104323_metadata_barcodes_24185cells.txt.gz
will not load properly in R: it is missing the age column (arguably the most important metadata) and half the columns are NAs.
The following code loads the data.
hochgerner24k_2018_meta <- read.table(paste(testsetpath,"/Hochgerner2018/GSE104323_metadata_barcodes_24185cells.txt.gz", sep=""), header =TRUE, fill =TRUE)
Without fill = TRUE the following error is produced: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 30 elements
How can I load this metadata into a dataframe without all this missing information?

That file doesn't appear to be missing any metadata that I can see. It's a tab-separated file. How do I know that? Well, I suppose I could have looked at its documentation, which is probably around somewhere, but what I did instead was to look at it in a text editor.
You can observe the typical arrangement of text for a tab-separated file: all the columns are left aligned with some of them shifting over when the text bleeds over into the next column.
hochgerner24k_2018_meta <- read.table("~/Downloads/GSE104323_metadata_barcodes_24185cells.txt.gz", header =TRUE, sep="\t")
> str(hochgerner24k_2018_meta)
'data.frame': 24216 obs. of 11 variables:
$ Sample.name..24185.single.cells. : chr "10X79_1_AAACTAGCTAGCCC-" "10X79_1_AAACTAGGATGTAT-" "10X79_1_AAACTCACGGCGTT-" "10X79_1_AAACTGTCGGCTCA-" ...
$ source.name : chr "dentate gyrus" "dentate gyrus" "dentate gyrus" "dentate gyrus" ...
$ organism : chr "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
$ characteristics..strain : chr "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" ...
$ characteristics..age : chr "P120" "P120" "P120" "P120" ...
$ characteristics..sex.of.pooled.animals: chr "2males+1female" "2males+1female" "2males+1female" "2males+1female" ...
$ characteristics..cell.cluster : chr "Neuroblast" "OPC" "GC-adult" "MOL" ...
$ molecule : chr "total RNA" "total RNA" "total RNA" "total RNA" ...
$ SRR.run.accession : chr "SRR6089817" "SRR6089947" "SRR6089529" "SRR6089595" ...
$ raw.file..original.file.name. : chr "10X79_1_AAACTAGCTAGCCC.fq.gz" "10X79_1_AAACTAGGATGTAT.fq.gz" "10X79_1_AAACTCACGGCGTT.fq.gz" "10X79_1_AAACTGTCGGCTCA.fq.gz" ...
$ UMI_CellularBarcode : chr "CGGCGATCCC_AAACTAGCTAGCCC" "AGTGGTAATG_AAACTAGGATGTAT" "GGGTGCGCTC_AAACTCACGGCGTT" "CCTTTCAACG_AAACTGTCGGCTCA" ...
Note: .gz files will not always be this format (or any format). They are just compressed and the delimiter has no particular special importance to the compression algorithm. The next .gz file you see could have any structure or no structure.
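Following that note, a quick way to decide on read.table() arguments is to peek at the raw first lines before parsing. A minimal sketch (the helper name and file name are mine, not from the question; gzfile() also opens plain text files, so it is safe either way):

```r
# Peek at the first lines of a text file, gzipped or not,
# before choosing sep/header arguments for read.table().
peek_lines <- function(path, n = 3) {
  con <- gzfile(path, open = "rt")  # transparently decompresses .gz
  on.exit(close(con))
  readLines(con, n = n)
}
```

Counting tab characters in the first line (e.g. with gregexpr("\t", ...)) then confirms whether sep="\t" is the right choice.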


Error with ymd() in r: "All formats failed to parse. No formats found. "

I'm trying to convert a column in a table from integer to date with ymd().
The table is a large one created by merging several csv files. The table structure is as follows:
Classes ‘data.table’ and 'data.frame': 49229 obs. of 46 variables:
PEP : chr
PN_Oper_M : chr
Desig_Oper_M : chr
Refer_M : int
Estado_M : chr
Conc_SAP : chr
Inc_SAP : chr
Conc_SICAP : chr
Incu_SICAP : chr
Avance : chr
Quemado : chr
RTD : chr
F_Ini_Plan_ : int 20200303 20201021 20210211 20210211 20210211...
F_Fin_Plan_ : int 20200424 20201021 20210211 20210211 20210211...
F_Cie_Plan_ : int 20200430 20201027 20210217 20210217 20210217...
Grupos : chr
Localiza : chr
Zona : chr
AG : chr
GNTs : chr
Empresas : chr
Hitos : chr
Req_Pdtes : int
Falta : int
CADO : chr
UDC : chr
New : int
Estr : int
N_HNC : chr
F_Libe_R_ : int 20191211 20200727 20201202 20201202 20201202...
F_Ini_R_ : int 20200303 20210308 20210216 20210204 20210218...
F_CERR_ : int 20200430 20210323 20210305 20210305 20210316...
F_Fin_Prod_ : int 20200424 20210316 20210216 20210204 20210218...
F_Fin_Cal_ : int 20200429 20210318 20210222 20210204 20210313...
Est_SICAP : chr
Est_SIPLA : chr
ZPLA : int
ZING : int
Comentarios : chr
IntExt : chr
Perfil : chr
H_Planif : chr
DOP_DI_Number : chr
DOP_DI_Status : chr
DOP_DI_Ubicacion: chr
Manual_JC : int
- attr(*, ".internal.selfref")=<externalptr>
All dates were imported as integers and I want to convert them into dates. I have two questions:
- I extract one of the columns and try to convert it with ymd() using the following code:
d1 <- all[ , 30]
d1 <- ymd(d1)
But I get the following error:
"Warning message:
All formats failed to parse. No formats found."
There are empty values; could that be the problem?
Is there a quick way to convert the format of several columns? The dataframe has no headers, so I have to address them by column position.
Many thanks
Hugo
I use ymd() in other contexts. I think what you want is to define the columns as dates.
Try:
df$F_Fin_Cal_ <- as.Date(df$F_Fin_Cal_, format="%Y%m%d") # df is the name of your data.frame
for all columns separately, or with lapply() for all columns at once.
cols <- c("col1", "col2",...) # names of all relevant columns
cols <- c(1,2,3,...) # alternative: addressing columns by position
df[cols] <- lapply(df[cols], as.Date, format="%Y%m%d")
You could also use the lubridate package
df$F_Fin_Cal_ <- as.character(df$F_Fin_Cal_) #in case your cols are not char, convert to char
df$F_Fin_Cal_ <- lubridate::as_date(df$F_Fin_Cal_)
I convert the column to character because lubridate (and as.Date) parse character input most reliably; I am not sure what the actual type of your column is.
You can also use lapply as @Clem showed.
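To make the lapply route concrete, here is a self-contained base-R sketch with made-up data (the column names are borrowed from the str() output above; your real column positions will differ):

```r
# Convert integer yyyymmdd columns to Date in one pass (base R only).
int_to_date <- function(x) as.Date(as.character(x), format = "%Y%m%d")

df <- data.frame(F_Ini_Plan_ = c(20200303L, NA, 20210211L),
                 F_Fin_Plan_ = c(20200424L, 20201021L, 20210211L))
date_cols <- c("F_Ini_Plan_", "F_Fin_Plan_")  # or numeric positions
df[date_cols] <- lapply(df[date_cols], int_to_date)
```

NA values pass through as NA, so empty entries alone should not make every row fail to parse.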

How to consider "Long Stopword List" from http://www.ranks.nl/stopwords?

I'm interested in removing all stopwords from my text using R. The list of stopwords that I want to remove can be found at http://www.ranks.nl/stopwords under the section which says "Long Stopword List" (a very long list version). I'm using the tm package. Can anyone help me, please? Thanks!
You can copy that list (after you select it in your browser) and then paste it into this expression in R:
LONGSWS <- " <paste into this position> "
You would place the cursor for your editor or the IDE console device inside the two quotes. Then do this:
sw.vec <- scan(text=LONGSWS, what="")
#Read 474 items
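To sanity-check that step with a short stand-in string instead of the full 474-word paste, note that scan() splits on any whitespace, including the newlines that come with a multi-line paste:

```r
# Tiny stand-in for the pasted stopword list.
demo <- "a about above\nafter again"
sw.demo <- scan(text = demo, what = "")
# sw.demo is c("a", "about", "above", "after", "again")
```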
The scan function needs to have the type of input specified via an example given to the what argument, and for that purpose just using "" is sufficient for character types. Then you should be able to apply the code you offered in your comment:
tm_map(text, removeWords, sw.vec)
You have not supplied an example text object. Using just a character vector is not successful:
tm_map("test of my text", removeWords, sw.vec )
#Error in UseMethod("tm_map", x) :
# no applicable method for 'tm_map' applied to an object of class "character"
So we will need to assume you have a suitable object of a suitable class to place in the first position of the arguments to tm_map. So using the example from the ?tm_map help page:
> res <- tm_map(crude, removeWords, sw.vec )
> str(res)
List of 20
$ 127:List of 2
..$ content: chr "Diamond Shamrock Corp said \neffective today cut contract prices crude oil \n1.50 dlrs barrel.\n The re"| __truncated__
..$ meta :List of 15
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "1987-02-26 17:00:56"
.. ..$ description : chr ""
.. ..$ heading : chr "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
.. ..$ id : chr "127"
.. ..$ language : chr "en"
.. ..$ origin : chr "Reuters-21578 XML"
.. ..$ topics : chr "YES"
.. ..$ lewissplit : chr "TRAIN"
.. ..$ cgisplit : chr "TRAINING-SET"
# ----------------snipped remainder of long output.

How to specify end of header line with read.table

I have ASCII files with data separated by $ signs.
There are 23 columns in the data and the first row contains the column names, but there is an inconsistency in the line endings, which causes R to import the data improperly, shifting the data left with respect to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistency between the line endings causes R to think that the records have one more column than the header, thus treating the first column as a row.names column, which is not correct. Adding the specification row.names=NULL does not fix the issue.
If I manually add a $ sign in the file the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$\n",
    file = "deleteme.txt")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), file=paste0("fixed_", file_path), "\n")
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
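An alternative sketch that avoids rewriting hundreds of files: read the header line yourself and pass it to read.table() via col.names, padding with one throwaway name for the extra empty field that the trailing $ creates on data rows (the read_dollar helper and the ".trailing" name are mine, not from the question):

```r
# Read a $-separated file whose data rows, but not the header,
# end with a trailing $ (which scan counts as one extra empty field).
read_dollar <- function(path) {
  hdr <- strsplit(readLines(path, n = 1), "$", fixed = TRUE)[[1]]
  df <- read.table(path, skip = 1, sep = "$", quote = "",
                   comment.char = "", stringsAsFactors = FALSE,
                   col.names = c(hdr, ".trailing"))
  df[hdr]  # drop the empty column produced by the trailing $
}
```

Because the header is read per file, this also copes with the headers changing across files.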

Troubleshooting a connection to the EBD file path for auk in R on Mac and Windows

I'm opening eBird data with auk and having trouble creating a path for the file. I set the path to a folder; when I try to change it to a file, it says the path is not TRUE.
with the Sys.getenv() I can see the path is set to a folder. with the auk_get_ebd_path() command I see the same thing. When I try to change the path to a file inside that folder with the auk_set_ebd_path() command I receive an error message.
library(auk)
auk_get_ebd_path()
[1] "/Users/lucypullen/Documents/bird/data"
auk_set_ebd_path("/Users/lucypullen/Documents/bird/data/ebd_CA_relApr-2019.txt", overwrite = TRUE)
Error: dir.exists(paths = path) is not TRUE
Other attempts yielded an Error in file(con, "r") : cannot open the connection message
Warning messages: 1: In file(con, "r") :
'raw = FALSE' but '/Users.....data/CA' is not a regular file
2: In file(con, "r") :
cannot open file '/Users/lucypullen/Documents/bird/data/CA': it is a directory
It seems like they want the path to point to a file. I thought the path would be complete with the system.file() command. I've tried a bunch of variations:
input_file <- system.file("/Users/lucypullen/Documents/bird/data/CA/ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("~/ebd_CA_relApr-2019.txt", package = "auk")
I suspect you should be doing this, since there appears to have been some sort of setup operation that preceded this question:
my_ebd_path = auk_get_ebd_path() # since you appear to already set it.
my_full_file_loc <- paste0(my_ebd_path, "/", "ebd_CA_relApr-2019.txt")
my_ebd_data <- read_ebd(my_full_file_loc)
str(my_ebd_data)
# ------what I get with the sample file in the package--------------
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 494 obs. of 45 variables:
$ checklist_id : chr "S6852862" "S14432467" "S39033556" "S38303088" ...
$ global_unique_identifier : chr "URN:CornellLabOfOrnithology:EBIRD:OBS97935965" "URN:CornellLabOfOrnithology:EBIRD:OBS201605886" "URN:CornellLabOfOrnithology:EBIRD:OBS530638734" "URN:CornellLabOfOrnithology:EBIRD:OBS520887169" ...
$ last_edited_date : chr "2016-02-22 14:59:49" "2013-06-16 17:34:19" "2017-09-06 13:13:34" "2017-07-24 15:17:16" ...
$ taxonomic_order : num 20145 20145 20145 20145 20145 ...
$ category : chr "species" "species" "species" "species" ...
$ common_name : chr "Green Jay" "Green Jay" "Green Jay" "Green Jay" ...
$ scientific_name : chr "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" ...
$ observation_count : chr "4" "2" "1" "1" ...
$ breeding_bird_atlas_code : chr NA NA NA NA ...
#----snipped a bunch of output---------
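A note on the system.file() attempts above: system.file() only resolves files shipped inside an installed package, so it returns "" for arbitrary paths on disk. For your own downloaded file, a sketch using file.path() plus an existence check (directory and file name taken from the question) is the safer route:

```r
# Build the full path portably and confirm the file is really there
# before handing it to read_ebd().
ebd_dir  <- "/Users/lucypullen/Documents/bird/data"
ebd_file <- file.path(ebd_dir, "ebd_CA_relApr-2019.txt")
file.exists(ebd_file)  # should be TRUE before calling read_ebd(ebd_file)
```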

How to read a ".TAB" file

I am trying to find a way to retrieve data from Harvard Dataverse website through R. I am using "dataverse" and "dvn" packages, among others. Many of the data files end with ".tab", although they are not formatted as normal tab-delimited text.
I have done this:
library(dataverse)
## 01. Using the dataverse server and making a search
Sys.setenv("DATAVERSE_SERVER" ="dataverse.harvard.edu")
## 02. Loading the dataset that I chose, by url
doi_url <- "https://doi.org/10.7910/DVN/ZTCWYQ"
my_dataset <- get_dataset(doi_url)
## 03. Grabbing the first file of the dataset
## which is named "001_AppendixC.tab"
my_files <- my_dataset$files$label
my_file <- get_file(my_files[1], doi_url)
AppendixC <- tempfile()
writeBin(my_file, AppendixC)
read.table(AppendixC)
> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> line 1 did not have 2 elements
> In addition: Warning message:
> In read.table(AppendixC) :
> line 1 appears to contain embedded nulls
Any hint?
The problem is that dataverse::get_file() returns the file in a raw binary format. The easiest way to load it into memory is to write it to a tempfile with writeBin() and then read that file with the appropriate import/read function.
Here is a function that should automagically read it into memory:
# Uses rio, which automatically chooses the appropriate import/read
# function based on file type.
library(rio)
install_formats() # only needs to run once after package installation
load_raw_file <- function(raw, type) {
  match.arg(
    arg = type,
    choices = c(
      "csv", "tab", "psc", "tsv", "sas7bdat",
      "sav", "dta", "xpt", "por", "xls", "xlsx",
      "R", "RData", "rda", "rds", "rec", "mtb",
      "feather", "csv.gz", "fwf"
    )
  )
  tmp <- tempfile(fileext = paste0(".", type))
  writeBin(as.vector(raw), tmp)
  out <- import(tmp)
  unlink(tmp)
  out
}
Let's try it out with your file, which is an Excel file.
library(dataverse)
raw <- get_file(
"001_AppendixC.tab",
"https://doi.org/10.7910/DVN/ZTCWYQ"
)
data <- load_raw_file(raw, "xlsx")
And look at the data:
str(data)
> 'data.frame': 132 obs. of 17 variables:
> $ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
> $ UN_9193 : chr "37.4" "7.7" "9.1" "65.400000000000006" ...
> $ UN_9901 : chr "46.1" "7.2" "10.7" "50" ...
> $ UN_0709 : chr "24.6" "9.6999999999999993" "7.5" "23.7" ...
> $ UN_1416 : chr "23" "4.9000000000000004" "4.5999999999999996" "14" ...
> $ stu90_94 : chr "51.3" "37.200000000000003" "22.9" "52.9" ...
> $ stu98_02 : chr "54.7" "39.200000000000003" "23.6" "47.1" ...
> $ stu06_10 : chr "51.3" "23.1" "13.2" "29.2" ...
> $ stu12_16 : chr "40.9" "17.899999999999999" "11.7" "37.6" ...
> $ wast90_94: chr "11.5" "9.4" "7.1" "7.9" ...
> $ wast98_02: chr "13.4" "12.2" "3.1" "8.6999999999999993" ...
> $ wast06_10: chr "8.9" "9.4" "4.0999999999999996" "8.1999999999999993" ...
> $ wast12_16: chr "9.5" "6.2" "4.0999999999999996" "4.9000000000000004" ...
> $ UM1992 : chr "16.8" "3.7" "4.5" "22.6" ...
> $ UM2000 : chr "13.7" "2.6" "4" "21.7" ...
> $ UM2008 : chr "11" "1.8" "2.9" "19.2" ...
> $ UM2015 : chr "9.1" "1.4" "2.6" "15.7" ...
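Since every column imported as character, a follow-up sketch converting all but Country to numeric may be useful (the tiny made-up frame below stands in for the real data, using names from the str() output above):

```r
# Convert all non-identifier columns from character to numeric.
data <- data.frame(Country = c("Afghanistan", "Albania"),
                   UN_9193 = c("37.4", "7.7"),
                   UM2015  = c("9.1", "1.4"),
                   stringsAsFactors = FALSE)
num_cols <- setdiff(names(data), "Country")
data[num_cols] <- lapply(data[num_cols], as.numeric)
```

Any cells that are not parseable numbers would become NA with a warning, which is worth checking for afterwards.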
