Read delimited text containing stray newlines and recover the correct number of columns - r

I have several hundred delimited text files. In some of them, a newline appears in a random column before the end of a row. When I try to read a file, R looks for the correct number of columns, but fails because the row is split onto the next line.
The argument fill=TRUE does not help, because it creates incorrect empty columns.
If I have:
"Aa|Bb|C\nc\ntwo|three|four"
But really should be two rows by three columns:
"Aa|Bb|Cc\ntwo|three|four"
How can I get there for all rows of the data (the error occurs randomly throughout)?

Note that you have C\nc in the string, which pushes c onto a new line. You probably need to ensure the format of your input string as a first step; otherwise it is difficult to fix via post-processing.
I am not sure if the code below is what you are after. Do you mean something like using read.csv?
read.csv(text = sub("\n", "", s), sep = "|", header = FALSE)
which gives
V1 V2 V3
1 Aa Bb Cc
2 two three four
If you are using data.table, you can try fread (thanks to @akrun):
fread(sub("\n", "", s))
Data
s <- "Aa|Bb|C\nc\ntwo|three|four"
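A sketch of a more general repair (my own heuristic, not from the answer above): assume every line that starts a logical row already contains all ncol - 1 separators, so any line with fewer separators is a continuation of the row above it. This holds for the sample, where the stray newline falls inside the last field; it would fail if a break landed in an earlier field.

```r
# Merge continuation lines back into the row above, based on separator count.
merge_continuations <- function(lines, sep = "|", ncol = 3) {
  # count separator occurrences per physical line
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  # lines with the full separator count start a new logical row
  grp <- cumsum(n_sep >= ncol - 1)
  as.vector(tapply(lines, grp, paste, collapse = ""))
}

s    <- "Aa|Bb|C\nc\ntwo|three|four"
rows <- merge_continuations(strsplit(s, "\n")[[1]])
rows
# [1] "Aa|Bb|Cc"       "two|three|four"
read.csv(text = paste(rows, collapse = "\n"), sep = "|", header = FALSE)
```

This yields the intended two rows by three columns for every affected row in a file, under the stated assumption.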

Related

importing a SAS data set to R

I am trying to import a SAS data set to R (I cannot share the data set). SAS sees columns as number or character. However, some of the number columns have coded character values. I've used the sas7bdat package to bring in the data set but those character values in number columns return NaN. I would like the actual character value. I have tried exporting the data set to csv and tab delimited files. However, I end up with observations that take 2 lines (a problem with SAS that I haven't been able to figure out). Since there are over 9000 observations I cannot go back and look for those observations that take 2 lines manually. Any ideas how I can fix this?
SAS does NOT store character values in numeric columns. But there are some ways that numeric values will be printed using characters.
First is if you are using the BEST format (which is the default for numeric variables). If the value cannot be represented exactly in the available number of characters, it will use scientific notation.
Second is special missing values. SAS has 28 missing values. Regular missing is represented by a period. The others by single letter or underscore.
Third would be a custom format that displays the numbers using letters.
The first should not cause any trouble when importing into R. The last two can be handled by Haven. See the semantics Vignette in the documentation.
As to your multiple-line CSV file, there are two possible issues. The first is just that you did not tell SAS to use long enough lines for your data. Just make sure to use a longer LRECL setting on the file you are writing to.
filename csv 'myfile.csv' lrecl=1000000 ;
proc export data=mydata file=csv dbms=csv ; run;
The second possible issue is that some of your character variables include end of line characters in them. It is best to just remove or replace those characters. You could always add them back if they are really wanted. For example these steps will export the same file as above. It will first replace the carriage returns and line feeds in the character variables with pipe characters instead.
data for_export ;
set mydata;
array _c _character_;
do over _c;
_c = translate(_c,'||','0A0D'x);
end;
run;
proc export data=for_export file=csv dbms=csv ; run;
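Back in R, the pipe substitution could be undone after import. A sketch with a hypothetical data frame; it assumes the real data never contains a literal "|", and note that a CRLF pair becomes two pipes under the translate() step above, so it would come back as two newlines:

```r
# Reverse the SAS-side translate(): turn "|" placeholders back into
# newlines inside an imported character column (toy data, base R only).
df <- data.frame(note = c("line1|line2", "plain"), stringsAsFactors = FALSE)
df$note <- gsub("|", "\n", df$note, fixed = TRUE)
df$note
# [1] "line1\nline2" "plain"
```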
A partial answer for dealing with data split across multiple rows:
library( data.table )
# first, read the whole lines into a single column, for example with
DT <- data.table::fread( myfile, sep = "" )
# sample data for this example: a data.table with ten rows containing the numbers 1 to 10
DT <- data.table( 1:10 )
# column-bind two subsets of the data, using a logical vector to select every first
# and every second row, then paste the columns together and collapse using a
# comma separator (or whatever separator you like)
ans <- as.data.table(
  cbind( DT[ rep( c(TRUE, FALSE), length = .N ), 1 ],
         DT[ rep( c(FALSE, TRUE), length = .N ), 1 ] )[ , do.call( paste, c( .SD, sep = "," ) ) ] )
# V1
# 1: 1,2
# 2: 3,4
# 3: 5,6
# 4: 7,8
# 5: 9,10
I prefer the read_sas function from the haven package for reading SAS data:
library(haven)
data <- read_sas("data.sas7bdat")

Loading csv - One row of integers

Problem reading data vector - My csv data file (rab.csv) has just one row of > 10,000 numbers read into R with:
bab <- read.table("rab.csv") #which yields:
bab
V1
1 23,29,9,28,16,10,8,24,16,20,14,15,17,31,25,19,24,55,28,55,23, . . . and so on
In using this data vector, I get:
Error: data vector must consist of at least two distinct values!
It seems to only see the number "1" that was somehow added in front of the data.
I'm quite new to this so probably something simple, but I've spent 2 days searching every possibility I can think of without finding a solution.
We can use scan to read the file as a vector.
v1 <- scan("rab.csv", what=numeric(), sep=",")
In the read.table call, no sep was specified, so the whole line is read as a single field; the leading 1 in the printout is just the row name that R adds, not part of the data. (Note also that with read.csv, where header=TRUE is the default, a numeric first row would be taken as the header and R would prepend X to make the names valid; that can be avoided with the check.names=FALSE argument.)
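A minimal base-R reproduction of what happened (my own sketch, using a short temporary file in place of rab.csv):

```r
# Write one comma-separated row to a temp file, mimicking rab.csv.
tmp <- tempfile(fileext = ".csv")
writeLines("23,29,9,28,16", tmp)

bab <- read.table(tmp)   # no sep given: the whole line lands in one cell
dim(bab)
# [1] 1 1

v1 <- scan(tmp, what = numeric(), sep = ",")   # a proper numeric vector
v1
# [1] 23 29  9 28 16
```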

In R, how to read file with custom end of line (eol)

I have a text file to read in R (and store in a data.frame). The file is organized in several rows and columns. Both "sep" and "eol" are customized.
Problem: the custom eol, i.e. "\t&nd" (without quotation marks), can't be set in read.table(...) (or read.csv(...), read.csv2(...), ...) nor in fread(...), and I can't find a solution.
I have searched here ("[r] read eol" and other queries I don't remember) and found no solution: the only suggestion was to preprocess the file to change the eol, which is not possible in my case because some fields can contain \n, \r, \n\r, ", ... and that is the reason for the customization.
Thanks!
You could approach this two different ways:
A. If the file is not too wide, you can read your desired rows using scan and split it into your desired columns with strsplit, then combine into a data.frame. Example:
# Provide reproducible example of the file ("raw.txt" here) you are starting with
your_text <- "a~b~c!1~2~meh!4~5~wow"
write(your_text,"raw.txt"); rm(your_text)
eol_str = "!" # whatever character(s) the rows divide on
sep_str = "~" # whatever character(s) the columns divide on
# read and parse the text file
# scan gives you a character vector of row strings (one string per row)
# sapply + strsplit turns each row string into a vector of column values
row_list <- sapply(scan("raw.txt", what = character(), sep = eol_str),
                   strsplit, split = sep_str)
df <- data.frame(do.call(rbind,row_list[2:length(row_list)]))
row.names(df) <- NULL
names(df) <- row_list[[1]]
df
# a b c
# 1 1 2 meh
# 2 4 5 wow
B. If A doesn't work, I agree with @BondedDust that you probably need an external utility -- but you can invoke it from R with system() and do a find/replace to reformat your file for read.table. The invocation will be specific to your OS; see for example https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that you have \n and \r\n in your text already, I recommend that you first find and replace them with temporary placeholders -- perhaps quoted versions of themselves -- and then convert them back after you have built your data.frame.
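For option B, the find/replace can also be done inside R with no external utility. A sketch using the same toy format as in A ("!" as eol, "~" as sep); in real data the embedded \n and \r would first be swapped for placeholders as described above:

```r
raw <- "a~b~c!1~2~meh!4~5~wow"             # whole file as one string
txt <- gsub("!", "\n", raw, fixed = TRUE)  # custom eol -> real newline
df  <- read.table(text = txt, sep = "~", header = TRUE,
                  stringsAsFactors = FALSE)
df
#   a b   c
# 1 1 2 meh
# 2 4 5 wow
```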

Concatenate select rows into one row without space in R (using forloop)

I'm trying to concatenate multiple rows into one.
Each row either starts with a ">Gene Identifier" line or contains sequence information:
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
Here I just put two genes, but there are hundreds of genes following this.
Basically I will leave the gene identifier lines as they are, but I want to concatenate the sequences when a sequence is split into multiple rows.
Therefore, the final result should look like this: the sequences concatenated and combined into one row, without any space in between.
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
By using "paste" function in R, I was able to achieve this manually.
i.e. paste(dat[2,1], dat[3,1], sep="")
However, I have a list of hundreads of gene, so I need a way to concatenate rows automatically.
I was thinking forloop, basically, if the row starts from ">", skip it, but if it is not start from ">", concatenate.
But I'm not expert in bioinformatics/R, it is hard for me to actually generate a script to achieve it.
Any help would be greatly appreciated!
The idea is to mark the identifier lines, group each one with the sequence lines that follow it, and paste each group together. Using the question's data:
Lines <-
readLines(textConnection(">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"))
geneIdx <- grepl("\\|", Lines)
grp <- cumsum(geneIdx)
grp
#[1] 1 1 1 2 2 2 2 2 2
tapply(Lines, grp, FUN=function(x) c(x[1], paste(x[-1], collapse="") ) )
#----------------------
$`1`
[1] ">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714"
[2] "GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA"
$`2`
[1] ">Laptm4a|ENSMUSG00000020585|ENSMUST00000020909"
[2] "GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"
Would regular expressions do the trick? The regular expression below deletes newlines (\\n) not followed by > ((?!>) being a negative lookahead).
text <-">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"
cat(text)
cat(gsub("\\n(?!>)", "", text, perl=TRUE))
Result
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
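Note that the lookahead above also removes the newline after each identifier line, which is why identifier and sequence end up merged in the result. If the identifiers should stay on their own lines, a variant (my own sketch; it assumes sequence lines contain only A/C/G/T and identifier lines do not end in one of those letters, as with the trailing digits here) deletes a newline only when it sits between two sequence characters:

```r
# Toy FASTA-like input with hypothetical identifiers g1/g2.
text  <- ">g1|id1\nAAAC\nCCGT\n>g2|id2\nGGGT\nTTAA"
# Fixed-length lookbehind + lookahead: drop "\n" only between sequence letters.
fixed <- gsub("(?<=[ACGT])\n(?=[ACGT])", "", text, perl = TRUE)
cat(fixed)
# >g1|id1
# AAACCCGT
# >g2|id2
# GGGTTTAA
```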

Changing column names or exceptions to strsplit

I have a dataframe Genotypes with columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R, such that for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D, which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E, giving me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas, because of how they were automatically named. How do I either write an exception to the strsplit for Penta.C, Penta.D, and Penta.E, or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I ran the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.
The first line of your code is splitting the variable names by . and extracting the first piece. It sounds like you instead want to drop the final piece only when it is a numeric suffix (unconditionally keeping "all pieces but the last" would mangle names without a suffix, such as D2S1338, and dotted names such as Penta.D):
locuses = unique(sapply(strsplit(names(Freqs), ".", fixed=TRUE),
    function(x) {
        if (length(x) > 1 && grepl("^[0-9]+$", x[length(x)])) x <- x[-length(x)]
        paste(x, collapse=".")
    }))
Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=FALSE,
    text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
    sep=",")
loci <- gsub("\\.\\d$", replacement="", unlist(loci.columns))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the line.
The basic problem seems to be that the names are made "valid" on import by the make.names function:
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column re-naming with the check.names=FALSE argument to read.table. If you refer explicitly to such columns, you'll need to provide a back-quoted string:
df$`Penta C`
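A small sketch of the check.names = FALSE route, with a two-column toy header:

```r
# With check.names = FALSE the header is kept verbatim, so the space in
# "Penta C" survives; such names then need back-quotes when referenced.
df <- read.csv(text = "Penta C,CSF1PO\n1,2", check.names = FALSE)
names(df)
# [1] "Penta C" "CSF1PO"
df$`Penta C`
# [1] 1
```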
