In R, how to read file with custom end of line (eol)

I have a text file to read in R (and store in a data.frame). The file is organized in several rows and columns. Both "sep" and "eol" are customized.
Problem: the custom eol, i.e. "\t&nd" (without quotation marks), can't be set in read.table(...) (or read.csv(...), read.csv2(...), ...) nor in fread(...), and I haven't been able to find a solution.
I searched here ("[r] read eol" and other queries I don't remember) and didn't find a solution: the only suggestion was to preprocess the file and change the eol, which is not possible in my case because some fields contain things like \n, \r, \n\r, ", ... and that is the reason for the customization.
Thanks!

You could approach this two different ways:
A. If the file is not too wide, you can read your desired rows using scan and split it into your desired columns with strsplit, then combine into a data.frame. Example:
# Provide reproducible example of the file ("raw.txt" here) you are starting with
your_text <- "a~b~c!1~2~meh!4~5~wow"
write(your_text,"raw.txt"); rm(your_text)
eol_str = "!" # the single character the rows divide on (scan's sep accepts only one character)
sep_str = "~" # whatever character(s) the columns divide on (strsplit treats this as a regular expression)
# read and parse the text file
# scan gives you an array of row strings (one string per row)
# sapply strsplit gives you a list of row arrays (as many elements per row as columns)
row_list <- sapply(scan("raw.txt", what=character(), sep=eol_str),
strsplit, split=sep_str)
df <- data.frame(do.call(rbind,row_list[2:length(row_list)]))
row.names(df) <- NULL
names(df) <- row_list[[1]]
df
# a b c
# 1 1 2 meh
# 2 4 5 wow
B. If A doesn't work, I agree with @BondedDust that you probably need an external utility -- but you can invoke it from R with system() and do a find/replace to reformat your file for read.table. Your invocation will be specific to your OS. Example: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that you already have \n and \r\n in your text, I recommend that you first find and replace them with temporary placeholders -- perhaps quoted versions of themselves -- and then convert them back after you have built your data.frame.
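A possible third route in base R, sketched here under assumptions: read the whole file as one string with readChar and split it on the literal terminator with strsplit(fixed=TRUE). This sidesteps the fact that scan's sep must be a single character, and embedded \n or quotes are left alone. The "|" field separator, the sample contents, and the temporary file are made up for illustration; only the "\t&nd" terminator comes from the question. It also assumes the file fits comfortably in memory.

```r
# Hypothetical sample: fields separated by "|" (an assumption), rows ended by "\t&nd"
eol_str <- "\t&nd"
sep_str <- "|"
path <- tempfile(fileext = ".txt")
raw_text <- paste0("id|note", eol_str,
                   "1|line one\nline two", eol_str,
                   "2|has \"quotes\"", eol_str)
# Write the characters exactly as given (eos = NULL: no terminator appended)
writeChar(raw_text, path, eos = NULL)

# Read the whole file as a single string, then split on the literal terminator
txt <- readChar(path, nchars = file.size(path))
rows <- strsplit(txt, eol_str, fixed = TRUE)[[1]]
rows <- rows[nzchar(rows)]                      # defensive: drop empty pieces
fields <- strsplit(rows, sep_str, fixed = TRUE)

# First row is the header, the remaining rows are data
df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(df) <- fields[[1]]
df
```

All columns come back as character; utils::type.convert() can be applied afterwards if numeric columns are wanted.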

Related

Replacing Column Names without typing the actual column names

Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyways - I'm looking to do quite a bit of manipulation on this data set, however the variable names are extraordinarily messy: ie. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ" - you get my gist. Each column head has an equally messy name as this example and there are typically >15 columns per spreadsheet.
Ideally I'd like to program a name manipulation solution in R to avoid manually changing the names in xls prior to importing. But I can't seem to find the right solution, since every R function I try/check requires the column name to be spelled out and set to a new variable. Spelling out the entire column name is tedious and impractical, and the special characters seem to break R's functions anyway.
Does anyone know how to do a global replace all names or a global rename by column number rather than name?
I've tried
replace()
for loops
lapply()

Remove non-printing characters in the first gsub. Then trim whitespace off the ends using trimws and replace consecutive strings of the same character with just one of them in the second gsub. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
## X Y Z
## 1 0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
## X.Y.Z
## 1 0
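On the "rename by column number" part of the question, a minimal sketch: names(d) can be indexed and assigned by position, so the messy header never has to be typed. The data frame and the new names below are invented for illustration.

```r
# Hypothetical data frame with messy headers
d <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(d) <- c("\r\n\r\n ID ", "\r\r Value  One", " Flag\n")

names(d)[2] <- "value_one"             # rename one column by position
names(d)[c(1, 3)] <- c("id", "flag")   # rename several columns by position
names(d)
```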

Loading csv - One row of integers

Problem reading data vector - My csv data file (rab.csv) has just one row of > 10,000 numbers read into R with:
bab <- read.table("rab.csv") #which yields:
bab
V1
1 23,29,9,28,16,10,8,24,16,20,14,15,17,31,25,19,24,55,28,55,23, . . . and so on
In using this data vector, I get:
Error: data vector must consist of at least two distinct values!
It seems to only see the number "1" that was somehow added in front of the data.
I'm quite new to this so probably something simple, but I've spent 2 days searching every possibility I can think of without finding a solution.

We can use scan to read the file as a vector.
v1 <- scan("rab.csv", what=numeric(), sep=",")
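With read.table, the default separator is whitespace, so the whole comma-separated line is read as a single string in column V1; the leading "1" is just the row name. With read.csv, header=TRUE is the default, so the single row would be taken as the header, and because the values are numeric, X would be prepended to each name (this can be avoided with the check.names=FALSE argument).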
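A small sketch of the difference, using a short made-up row written to a temporary file instead of the real rab.csv:

```r
# Hypothetical stand-in for rab.csv: one row of comma-separated integers
path <- tempfile(fileext = ".csv")
writeLines("23,29,9,28,16,10,8,24", path)

# read.table splits on whitespace by default, so the row becomes one cell
bab <- read.table(path)
dim(bab)

# scan with sep = "," returns a plain numeric vector
v1 <- scan(path, what = numeric(), sep = ",", quiet = TRUE)

# read.csv also works if the single row is not treated as a header
v2 <- unlist(read.csv(path, header = FALSE), use.names = FALSE)
v1
```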

Read selected files from the directory based on selection criteria in R

I would like to read only selected .txt files in a folder to construct a giant table... I have over 9K files, and would like to import the files with the selected distance and building type, which is indicated in part of the file name.
For example, I want to first select files with name containing "_U0" and "_0_Final.txt":
Type = c(0,1)
D3Test = 1
Distance = c(0,50,150,300,650,800)
D2Test = 1;
files <- list.files(path=data.folder, pattern=paste("*U", Type[D3Test],"*_",Distance[D2Test],"_Final.txt",sep=""))
But the result returned empty...
Any problem with my construction?
filename <- scan(what="")
"M10_F1_T1_D1_U0_H1_0_Final.txt" "M10_F1_T1_D1_U0_H1_150_Final.txt" "M10_F1_T1_D1_U0_H1_300_Final.txt"
"M10_F1_T1_D1_U0_H1_50_Final.txt" "M10_F1_T1_D1_U0_H1_650_Final.txt" "M10_F1_T1_D1_U0_H1_800_Final.txt"
"M10_F1_T1_D1_U0_H2_0_Final.txt" "M10_F1_T1_D1_U0_H2_150_Final.txt" "M10_F1_T1_D1_U0_H2_300_Final.txt"
"M10_F1_T1_D1_U0_H2_50_Final.txt" "M10_F1_T1_D1_U0_H2_650_Final.txt" "M10_F1_T1_D1_U0_H2_800_Final.txt"
"M10_F1_T1_D1_U0_H3_0_Final.txt" "M10_F1_T1_D1_U0_H3_150_Final.txt" "M10_F1_T1_D1_U0_H3_300_Final.txt"
"M10_F1_T1_D1_U0_H3_50_Final.txt" "M10_F1_T1_D1_U0_H3_650_Final.txt" "M10_F1_T1_D1_U0_H3_800_Final.txt"
"M10_F1_T1_D1_U1_H1_0_Final.txt" "M10_F1_T1_D1_U1_H1_150_Final.txt" "M10_F1_T1_D1_U1_H1_300_Final.txt"
"M10_F1_T1_D1_U1_H1_50_Final.txt" "M10_F1_T1_D1_U1_H1_650_Final.txt" "M10_F1_T1_D1_U1_H1_800_Final.txt"

Another way would be to use sprintf and grepl.
x <- c("M10_F1_T1_D1_U0_H1_150_Final.txt", "M10_F1_T1_D1_U0_H2_650_Final.txt", "M10_F1_T1_D1_U1_H1_650_Final.txt")
x[grepl(sprintf("U%i_H%i_%i", 1, 1, 650), x)]
[1] "M10_F1_T1_D1_U1_H1_650_Final.txt"

You should look at the result that you are passing to pattern:
"*U0*_0_Final.txt"
It is not going to pick up any of those filenames. The asterisk is saying zero or more instances of "0" between "U" and the underscore. If Type and Distance are not represented by T and D in the file names, then this delivers the correct pattern:
grep( pattern=paste0("_U", Type[D3Test],".*_", Distance[D2Test],"_Final\\.txt"), filename)
#-----------
#[1] 1 7 13 So matches 3 filenames
Notice that you need to escape (with two backslashes) the periods that you want to be only periods because periods are special characters. You also need to use ".*" to allow a gap in the pattern.
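To make the wildcard-versus-regex point concrete, here is a sketch on a subset of the file names from the question. It builds the regular expression by hand and, as an alternative, writes the intent as a shell-style wildcard and converts it with utils::glob2rx(). The extra underscore after the type digit and the trailing $ anchor are my own tightening, not part of the answer above.

```r
filename <- c("M10_F1_T1_D1_U0_H1_0_Final.txt",
              "M10_F1_T1_D1_U0_H1_50_Final.txt",
              "M10_F1_T1_D1_U0_H2_0_Final.txt",
              "M10_F1_T1_D1_U1_H1_0_Final.txt")
Type <- c(0, 1)
Distance <- c(0, 50, 150, 300, 650, 800)

# Regular expression built by hand: ".*" allows a gap, "\\." is a literal period
rx <- paste0("_U", Type[1], "_.*_", Distance[1], "_Final\\.txt$")
hits_rx <- grep(rx, filename, value = TRUE)

# Same intent as a shell-style wildcard, converted to a regex by glob2rx()
glob <- paste0("*_U", Type[1], "_*_", Distance[1], "_Final.txt")
hits_glob <- grep(utils::glob2rx(glob), filename, value = TRUE)

hits_rx
```

Either pattern string can be passed as the pattern argument of list.files, which expects a regular expression.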

files <- list.files(path=data.folder, pattern=paste("*U", Type[D3Test], "....",Distance[D2Test], sep=""))
I revised my code and this one works! Basically the idea is to use a dot to represent each character between Type[D3Test] and Distance[D2Test], since there are always exactly 4 characters between the two.
Thanks to:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

Changing column names or exceptions to strsplit

I have a dataframe Genotypes and it has columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R, such that for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D, which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E to give me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas because of how they were automatically named. How do I either write an exception for the strsplit at Penta.C, Penta.D, and Penta.E or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I ran the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.

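The first line of your code is splitting the variable names by . and extracting the first piece, so Penta.C, Penta.D, and Penta.E all collapse to Penta. It sounds like you instead want to split by . and drop the last piece only when it is the numeric suffix that was added on import:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) {
n <- length(x)
if (n > 1 && grepl("^[0-9]+$", x[n])) x <- x[-n] # drop a trailing numeric suffix such as "1"
paste(x, collapse=".") })))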

Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=FALSE,
text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
sep=",")
loci <- gsub("\\.\\d$", replacement="", unlist(loci.columns, use.names=FALSE))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the line.
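As a follow-on sketch, the stripped names can also be used to group the original columns by locus with split(), which avoids partial matches between similar locus names. The Penta.E columns are added here only for illustration, and \\d+ is used so that multi-digit suffixes would be stripped too.

```r
# Column names as they might look after import (Penta.E added for illustration)
cols <- c("D2S1338", "D2S1338.1", "CSF1PO", "CSF1PO.1",
          "Penta.D", "Penta.D.1", "Penta.E", "Penta.E.1")

# Strip a trailing ".<digits>" only; "Penta.D" itself is left untouched
locus <- sub("\\.\\d+$", "", cols)

# Group the original column names by locus (exact match on the stripped name)
groups <- split(cols, locus)
groups[["Penta.D"]]
```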

The basic problem seems like the names are being made "valid" on import by the make.names function
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column renaming by passing the check.names=FALSE argument to read.table. If you then refer explicitly to columns, you'll need to provide back-quoted names:
df$`Penta C`
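A small sketch of the effect, using inline CSV text as a stand-in for the real spreadsheet (the header row and values are made up):

```r
# Hypothetical header with duplicated names and a space, as in the Excel sheet
csv_text <- "CSF1PO,CSF1PO,Penta D,Penta D\n10,12,9,11"

fixed <- read.csv(text = csv_text)                       # check.names = TRUE is the default
raw   <- read.csv(text = csv_text, check.names = FALSE)  # names kept as written

names(fixed)   # made syntactic and unique
names(raw)     # original names, duplicates and spaces included
```

Keeping the original names means duplicated headers stay duplicated, so columns may need to be addressed by position rather than by name.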

R - Remove commas from values in a column and place separated values into new rows

I have a column of gene symbols that I have retrieved directly from a database, and some of the rows contain two or more symbols which are comma separated (see example below).
SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26
I would like to remove the commas, and place the separated symbols into new rows like so:
SLC6A13
ATP5J2-PTCD1
BUD31
PTCD1
ACOT7
BUD31
PDAP1
TTC26
I haven't been able to find a straight forward way to do this in R, does anyone have any suggestions?

You can use this vector result to put into a matrix or a data.frame:
vec <- scan(text="SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26", what=character(), sep=",")
Read 8 items
vec
[1] "SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7" "BUD31" "PDAP1"
[8] "TTC26"
Perhaps:
as.matrix(vec)
(The scan function can also read from files. The "text" parameter was only added relatively recently, but it saves typing file=textConnection("...").)

Another option is to use readLines and strsplit :
# txt is assumed to hold the raw text from the question as a single string
unlist(strsplit(readLines(textConnection(txt)),','))
"SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7"
"BUD31" "PDAP1" "TTC26"
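If the symbols sit in a data frame next to other columns, one way to keep those columns aligned is to repeat each row as many times as it has symbols. This is a sketch; the id column and the data frame itself are invented for illustration.

```r
# Hypothetical data frame: an id column plus the comma-separated symbols
df <- data.frame(id = 1:3,
                 symbol = c("SLC6A13", "ATP5J2-PTCD1,BUD31,PTCD1", "BUD31,PDAP1"),
                 stringsAsFactors = FALSE)

# Split each cell on commas, then repeat the id once per resulting symbol
parts <- strsplit(df$symbol, ",", fixed = TRUE)
long <- data.frame(id = rep(df$id, lengths(parts)),
                   symbol = unlist(parts),
                   stringsAsFactors = FALSE)
long
```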
