I am trying to read a file that has a column of + and - values, but each value apparently has a space after it. No matter how I read the file (read.csv, read.table, or read_excel), I pick up the trailing space.
For example:
df = data.frame(x = c("a", "b", "c"), y = c("+ ", "- ", "+ "))
df
x y
1 a +
2 b -
3 c +
Now, is there a way to specifically tell read.csv or read_excel not to pick up the space after the value?
Or is there a way to clean up the spaces after reading into a data frame?
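A minimal sketch of two options (not from the original post; the file name is a placeholder): read.csv/read.table have a strip.white argument that trims unquoted character fields, readxl::read_excel has trim_ws (TRUE by default), and trimws() cleans up after reading.
# Option 1: strip the whitespace while reading
df <- read.csv("myfile.csv", strip.white = TRUE)
# Option 2: clean up after reading with trimws()
df$y <- trimws(df$y)
# or trim every character column at once
df[] <- lapply(df, function(col) if (is.character(col)) trimws(col) else col)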
I have multiple FASTA files that use very simple headers identifying the specimen. However, I would like to make the headers much more detailed by adding geographic location, source, and culture date.
My first thought is to use the stringr package in R to read in each FASTA and replace any matching sequence ID with the appropriate string.
From an .xlsx containing the necessary data and specimen IDs, I can create a .txt with a series of strings holding the new names I want. Using this "master text" file, I would like to rename each of the sequences in each FASTA by matching on the specimen ID.
So I created rename.txt in the following format:
SpecimenID|ST|Geographic Location|Source|CultDate
VRE32491|736|PUH - 10C|Blood|2016-12-07
VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08
VRE32503|1471|PUH - 11N|Wound|2017-01-05
VRE32504|1471|PUH - EMEP|Blood|2017-01-10
VRE32514|1471|PUH - 6F|Wound|2017-01-20
Using Biostrings::readDNAStringSet(*.fasta) I am able to obtain the names for each sequence using names() on the object. I want to create a matching string from rename.txt that will enable me to simply rename the DNAStringSet object using names({DNAStringSet object}) <- {matching string}.
My problem is that I can't seem to extract a matching character vector from rename.txt.
Below is some code anyone can use for a reprex to simulate my issue:
cat(
  ">VRE32493", "AGCT",
  ">VRE32503", "CAGT",
  ">VRE32504", "TCAA",
  file = "example.fasta", sep = "\n"
)
cat(
  "SpecimenID|ST|Geographic Location|Source|CultDate",
  "VRE32491|736|PUH - 10C|Blood|2016-12-07",
  "VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08",
  "VRE32503|1471|PUH - 11N|Wound|2017-01-05",
  "VRE32504|1471|PUH - EMEP|Blood|2017-01-10",
  "VRE32514|1471|PUH - 6F|Wound|2017-01-20",
  file = "example.txt", sep = "\n"
)
origMult <- Biostrings::readDNAStringSet("example.fasta")
fasta_rename <- read.delim("example.txt", skip = 1, header = F)
Expected output of example.fasta:
>VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08
AGCT
>VRE32503|1471|PUH - 11N|Wound|2017-01-05
CAGT
>VRE32504|1471|PUH - EMEP|Blood|2017-01-10
TCAA
Thanks to assistance elsewhere, I worked out this solution.
cat(
  ">VRE32493", "AGCT",
  ">VRE32503", "CAGT",
  ">VRE32504", "TCAA",
  file = "example.fasta", sep = "\n"
)
cat(
  "SpecimenID|ST|Geographic Location|Source|CultDate",
  "VRE32491|736|PUH - 10C|Blood|2016-12-07",
  "VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08",
  "VRE32503|1471|PUH - 11N|Wound|2017-01-05",
  "VRE32504|1471|PUH - EMEP|Blood|2017-01-10",
  "VRE32514|1471|PUH - 6F|Wound|2017-01-20",
  file = "example.txt", sep = "\n"
)
# Read in from example.txt
# (splitting on "\r" and stripping "\n" assumes Windows-style CRLF line endings)
fasta_rename <- readr::read_file("example.txt")
fasta_rename <- unlist(strsplit(fasta_rename, "\\r"))
fasta_rename <- stringr::str_remove(fasta_rename[-1], "\\n")  # [-1] drops the header line
fasta_rename <- fasta_rename[-length(fasta_rename)]           # drop the trailing empty element
# Remove everything after first | to get the pattern to match off of
patterns <- stringr::str_remove(fasta_rename, "\\|(.+)$")
# Make fasta_rename a named character vector in the form of patterns = fasta_rename
names(fasta_rename) <- patterns
fasta_rename # print to verify
example_fasta <- readr::read_file("example.fasta")
example_fasta <- unlist(strsplit(example_fasta, "\\r"))    # same CRLF assumption as above
example_fasta <- stringr::str_remove(example_fasta, "\\n")
example_fasta <- example_fasta[-length(example_fasta)]     # drop the trailing empty element
example_fasta # print to verify
cat(stringr::str_replace_all(example_fasta, fasta_rename),
    file = "example.fasta",
    sep = "\n")
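An alternative sketch (not part of the original solution) that stays within Biostrings: rename the DNAStringSet directly, as the question originally intended, and write it back out. The output file name example_renamed.fasta is arbitrary.
seqs <- Biostrings::readDNAStringSet("example.fasta")
# Read the rename table; fields are "|"-separated
rename_df <- read.delim("example.txt", sep = "|", header = TRUE,
                        check.names = FALSE, stringsAsFactors = FALSE)
# Rebuild the full header for each specimen and name the vector by SpecimenID
new_names <- do.call(paste, c(rename_df, sep = "|"))
names(new_names) <- rename_df$SpecimenID
# The current sequence names are the specimen IDs, so a simple lookup renames them
names(seqs) <- new_names[names(seqs)]
Biostrings::writeXStringSet(seqs, filepath = "example_renamed.fasta")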
I am writing a script to enter a formula into an openxlsx Excel sheet using writeFormula. For that function, I need to supply a formula vector whose length equals the number of cells.
Here is what I am trying to do:
formula_dep <- character(0)
for (i in 2:(nrow(data) + 1)) {
  formula_dep <- append(formula_dep, paste0("IFERROR(SEARCH(\"pack\",H", i, "))"))
}
writeFormula(wb = file, sheet = "data", x = formula_dep, startCol = 9, startRow = 2)
In the output, the escape characters are probably getting written into the Excel sheet, which corrupts it (I have to repair the file in order to open it, and the column then contains nothing).
In R, the output is (as usual):
"IFERROR(SEARCH(\"pack\",H2))"
While the escape characters are not a problem in many other tasks, here I cannot make this work. I cannot use single quotes because, for some unknown reason, Excel does not allow them in the FIND or SEARCH functions. Please help with a solution.
Note: I cannot simply embed the formula in the data frame itself (using R equivalents), because it is supposed to operate on user input in the Excel file itself.
I am open to solutions from either the Excel side (changing the formula while keeping the same behaviour) or the R side.
"\" is not there, it just shows that print is escaping the ", see:
cat("IFERROR(SEARCH(\"pack\",H2))")
# IFERROR(SEARCH("pack",H2))
Here is a working example:
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Sheet 1")
df <- data.frame(
  a = letters[1:3]
)
writeData(wb, sheet = 1, x = df)
f <- paste0("FIND(\"b\",A", seq(nrow(df)) + 1L, ")")
f[1]
#[1] "FIND(\"b\",A2)"
cat(f[1])
# FIND("b",A2)
writeFormula(wb, sheet = "Sheet 1", x = f, startCol = 2, startRow = 2)
saveWorkbook(wb, "writeFormulaExample.xlsx", overwrite = TRUE)
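For the asker's specific formula, a hedged sketch along the same lines. Note that Excel's IFERROR takes two arguments, so a fallback value (0) is added here; the data frame, sheet name, and column positions are made up for illustration.
library(openxlsx)
# Hypothetical stand-in for the asker's `data`, with the text to search in column H
data <- data.frame(H = c("6-pack", "single", "pack of 3"))
wb <- createWorkbook()
addWorksheet(wb, "data")
writeData(wb, sheet = "data", x = data, startCol = 8, startRow = 1)  # header lands in H1
# One formula per data row; IFERROR gets 0 as its second (fallback) argument
formula_dep <- paste0("IFERROR(SEARCH(\"pack\",H", seq_len(nrow(data)) + 1L, "),0)")
writeFormula(wb, sheet = "data", x = formula_dep, startCol = 9, startRow = 2)
saveWorkbook(wb, "searchPackExample.xlsx", overwrite = TRUE)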
I have a text file where each line begins with a known character identifier, like this (* is the delimiter):
AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
.
.
.
ZZZ*123456789*.*.*.
The catch is that, even though the information is organized this way, every block of lines from AAA to ZZZ represents one record in this data; after the ZZZ line, the data starts over at AAA and runs to ZZZ again.
Is there a way, other than using a for loop and processing line by line, to take the chunk of lines from AAA to ZZZ and put it on one line, so I can then split it by the delimiter?
Or let me know if you have any other suggestions on processing this kind of data.
Thanks,
We can use tapply to paste the elements
tapply(lines, cumsum(grepl("^AAA", lines)), FUN = paste, collapse="")
No packages are used.
data
lines <- readLines(textConnection("AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
ZZZ*123456789*.*.*.
AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
ZZZ*123456789*.*.*."))
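A hedged usage sketch, assuming the lines object from the data block above: keeping the * delimiter between the pasted lines makes it easy to split each record into fields afterwards.
# Collapse each AAA..ZZZ block into one string, then split on the delimiter
records <- tapply(lines, cumsum(grepl("^AAA", lines)), FUN = paste, collapse = "*")
fields <- strsplit(unname(records), "*", fixed = TRUE)
fields[[1]]  # all fields of the first record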
Using the sample data in the Note at the end, read it into a data frame, create a grouping variable g, and then use reshape to convert it from long to wide form. No packages are used. text = Lines can be replaced with a filename, e.g. "myfile", if the input comes from a file.
DF <- read.table(text = Lines, sep = "*", as.is = TRUE, strip.white = TRUE)
DF$g <- cumsum(DF$V1 == "AAA")
reshape(DF, dir = "wide", idvar = "g", timevar = "V1")
Note:
Lines <- "AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*.
AAA*123456789*.*.*.
BBB*123456789*.*.*.
CCC*123456789*.*.*."
I have a text file of names, separated by commas, and I want to read this into R (a data frame or a vector is fine). I tried read.csv and it just reads them all in as headers for separate columns, with 0 rows of data. I tried header = FALSE and it still reads them in as separate columns. I could work with this, but what I really want is a single column with one row per name. For example, when I print this data frame, it prints all the (useless) column headers and then no values. One column of names would be much easier to work with.
Since the OP asked me to, I'll post the comment above as an answer.
It's very simple, and it comes from some practice in reading in sequences of data, numeric or character, using scan.
dat <- scan(file = your_filename, what = 'character', sep = ',')
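A small self-contained check of that call (file name and contents made up for illustration):
# Write a throwaway comma-separated file, then read it back with scan()
tf <- tempfile()
writeLines("qwerty,asdfg,zxcvb,poiuy", tf)
dat <- scan(file = tf, what = "character", sep = ",")
data.frame(name = dat)  # one row per name, if a data frame is preferred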
You can use read.csv and let it read the string as a header, then just extract the names (using names) and put them into a data.frame:
data.frame(x = names(read.csv("FILE")))
For example:
write.table("qwerty,asdfg,zxcvb,poiuy,lkjhg,mnbvc",
            "FILE", col.names = FALSE, row.names = FALSE, quote = FALSE)
data.frame(x = names(read.csv("FILE")))
x
1 qwerty
2 asdfg
3 zxcvb
4 poiuy
5 lkjhg
6 mnbvc
Something like this?
Make some test data:
# test data
list_of_names <- c("qwerty","asdfg","zxcvb","poiuy","lkjhg","mnbvc" )
list_of_names <- paste(list_of_names, collapse = ",")
list_of_names
# write to temp file
tf <- tempfile()
writeLines(list_of_names, tf)
You need this part:
# read from file
line_read <- readLines(tf)
line_read
list_of_names_new <- unlist(strsplit(line_read, ","))
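If the one-column data frame from the question is preferred, the resulting vector can simply be wrapped:
data.frame(name = list_of_names_new)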
I want to write a data frame from R into a CSV file. Consider the following toy example
df <- data.frame(ID = c(1,2,3), X = c("a", "b", "c"), Y = c(1,2,NA))
df[which(is.na(df[,"Y"])), 1]
write.table(t(df), file = "path to CSV/test.csv", col.names = FALSE, sep = ",", quote = FALSE)
The output in test.csv looks as follows:
ID,1,2,3
X,a,b,c
Y, 1, 2,NA
At first glance, this is exactly what I need, BUT what cannot be seen in the code block above is that after the NA in the last line there is another line break. When I pass test.csv to a JavaScript chart on a website, however, the trailing line break causes trouble.
Is there a way to avoid this final line break within R?
This is a little convoluted, but obtains your desired result:
zz <- textConnection("foo", "w")
write.table(t(df), file = zz, col.names=F, sep=",", quote=F)
close(zz)
foo
# [1] "ID,1,2,3" "X,a,b,c" "Y, 1, 2,NA"
cat(paste(foo, collapse='\n'), file = 'test.csv', sep='')
You should end up with a file that has a newline character after only the first two data rows.
You can use a command-line utility like sed to strip trailing whitespace from each line of a file:
sed -e 's/[ \t]*$//' test.csv
(This trims whitespace per line; removing the very last newline itself needs something like perl -pi -e 'chomp if eof' test.csv.)
Or, you could begin by writing a single row then using append.
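One way to read that suggestion (a sketch, not a tested recipe): write every row except the last with write.table, then append the final row with cat and no trailing newline.
tdf <- t(df)
# all rows except the last, written normally (each ends with a newline)
write.table(tdf[-nrow(tdf), , drop = FALSE], file = "test.csv",
            col.names = FALSE, sep = ",", quote = FALSE)
# the last row, appended without a trailing newline
cat(paste(c(rownames(tdf)[nrow(tdf)], tdf[nrow(tdf), ]), collapse = ","),
    file = "test.csv", append = TRUE)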
An alternative in a similar vein to the answer by @Thomas, but with slightly less typing. Send the output from write.csv to a character vector (capture.output). Concatenate the strings (paste) and separate the elements with line breaks (collapse = "\n"). Write to file with cat.
x <- capture.output(write.csv(df, row.names = FALSE, quote = FALSE))
cat(paste(x, collapse = "\n"), file = "df.csv")
You may also use format_csv from the readr package to create a single string with line breaks (\n). Remove the last end-of-line \n with substr. Write to file with cat.
library(readr)
x <- format_csv(df)
cat(substr(x, 1, nchar(x) - 1), file = "df.csv")