Load data from a text file to R - r

I have data in text format whose structure is as follows:
ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
.........T.....,,,,,,,,,.......T,,,,,,.........
......A..*............,,,,,,,,.A........T......
....*..................,,,T...............
...*.....................*...........
...................*.....
I have been trying to import it into R using the read.table() command but when I do the output has an altered structure like this:
V1
1 ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
2 .........T.....,,,,,,,,,.......T,,,,,,.........
3 ......A..*............,,,,,,,,.A........T......
4 ....*..................,,,T...............
5 ...*.....................*...........
6 ...................*.....
For some reason, R is shifting the rows with fewer characters to the right. How can I load my data into R without altering the structure present in the original text file?

Try this :)
read.table(file, sep = "\n")
result:
V1
1 ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
2 .........T.....,,,,,,,,,.......T,,,,,,.........
3 ......A..*............,,,,,,,,.A........T......
4 ....*..................,,,T...............
5 ...*.....................*...........
6 ...................*.....
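The shift in the question is most likely a display effect rather than a change to the data: print() right-aligns character columns of a data frame by default. A minimal sketch, assuming the file holds one sequence per line (the temporary file here stands in for the real one), that reads the lines unchanged and prints them left-aligned:

```r
# The six lines from the question, written to a temporary file as a stand-in.
raw <- c(
  "ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT",
  ".........T.....,,,,,,,,,.......T,,,,,,.........",
  "......A..*............,,,,,,,,.A........T......",
  "....*..................,,,T...............",
  "...*.....................*...........",
  "...................*....."
)
tmp <- tempfile(fileext = ".txt")
writeLines(raw, tmp)

# readLines() returns a plain character vector, one element per line.
seqs <- readLines(tmp)

# read.table() also works; disable quote and comment handling so that
# characters such as ' or # in the data are not interpreted.
df <- read.table(tmp, sep = "\n", quote = "", comment.char = "",
                 stringsAsFactors = FALSE)

# right = FALSE left-aligns the strings when printing.
print(df, right = FALSE)
```

If the data are only ever handled as strings, the character vector from readLines() may be all that is needed.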

Related

Read Excel file and select specific rows and columns

I want to read a xls file into R and select specific columns.
For example I only want columns 1 to 10 and rows 5 - 700. I think you can do this with xlsx but I can't use that library on the network that I am using.
Is there another package that I can use? And how would I go about selecting the columns and rows that I want?
You can try this:
library(xlsx)
read.xlsx("my_path\\my_file.xlsx", "sheet_name", rowIndex = 5:700, colIndex = 1:10)
Since you are unable to load the xlsx package, you might want to consider base R and use read.csv. For this, save your Excel file as a csv. The explanation for how to do this can be easily found on the web. Note that csv files can still be opened in Excel.
These are the steps you need to take to read only the 2nd and 3rd columns and rows.
hd <- read.csv('a.csv', header=FALSE, nrows=1, as.is=TRUE) # first read headers
removeCols <- c('NULL', NA, NA) # 'NULL' drops a column, NA keeps it
df <- read.csv('a.csv', skip=2, header=FALSE, colClasses=removeCols) # skip says how many lines not to read
colnames(df) <- hd[is.na(removeCols)]
df
two three
1 5 8
2 6 9
This is the example data I used.
a <- data.frame(one=1:3, two=4:6, three=7:9)
write.csv(a, 'a.csv', row.names=F)
read.csv('a.csv')
one two three
1 1 4 7
2 2 5 8
3 3 6 9
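For the range asked about in the question, the same base R approach can be sketched with a row and column subset. This is an illustration only: the temporary file and its 20 x 12 contents are made up so the example runs on its own, and rows 5 to 10 stand in for rows 5 to 700.

```r
# Hypothetical stand-in for the exported CSV: 20 data rows, 12 columns.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(matrix(1:240, nrow = 20, ncol = 12)), tmp,
          row.names = FALSE)

# Option 1: read everything, then keep data rows 5 to 10 and columns 1 to 10.
full <- read.csv(tmp)
subset_df <- full[5:10, 1:10]

# Option 2: skip lines while reading. skip = 5 drops the header line plus
# the first 4 data rows; nrows = 6 then reads data rows 5 to 10.
# The header is read separately so the column names can be reused.
hdr <- names(read.csv(tmp, nrows = 1))
part <- read.csv(tmp, skip = 5, nrows = 6, header = FALSE,
                 col.names = hdr)[, 1:10]
```

Option 2 avoids loading rows that are not needed, which may matter for large files; Option 1 is easier to read.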

Saving data frame to csv with column names as strings in R

Hi I have the following data frame:
b = data.frame(c(1,2),c(3,4))
> colnames(b) <- c("100.X0","100.00")
> b
100.X0 100.00
1 1 3
2 2 4
I would like to save this as a csv file with headers as strings. When I use write.csv the result ends up being:
100.X0 100
1 3
2 4
It turns the 100.00 to 100, how do I incorporate this?
I think the problem might be the way you read the csv file. Certain programs (e.g. Excel) will guess the type of each value and convert it.
Use write.xls from package dataframes2xls instead:
> library(dataframes2xls)
> write.xls(b, "test.csv")
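One way to check whether write.csv itself drops the trailing zeros is to look at the raw file instead of opening it in a spreadsheet. A small sketch using the data frame from the question (the temporary file name is an assumption):

```r
b <- data.frame(c(1, 2), c(3, 4))
colnames(b) <- c("100.X0", "100.00")

tmp <- tempfile(fileext = ".csv")
write.csv(b, tmp, row.names = FALSE)

# Inspect the first line of the file as plain text.
header_line <- readLines(tmp, n = 1)
print(header_line)

# Read it back; check.names = FALSE stops R from rewriting the names
# into syntactically valid ones (for example by prefixing an "X").
back <- read.csv(tmp, check.names = FALSE)
print(names(back))
```

If the header is intact here but shows as 100 in a spreadsheet, the conversion is happening in the viewing program, not in R.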

dealing with blank/missing data with write.table in R

I have a data frame where some of the rows have blank entries, e.g. to use a toy example
Sample Gene RS Chromosome
1 A rs1 10
2 B X
3 C rs4 Y
i.e. sample 2 has no rs#. If I attempt to save this data frame in a file using:
write.table(mydata,file="myfile",quote=FALSE,sep='\t')
and then read.table('myfile',header=TRUE,sep='\t'), I get an error stating that the number of entries in line 2 doesn't have 4 elements. If I set quote=TRUE, then a "" entry appears in the table. I'm trying to figure out a way to create a table using write.table with quote=FALSE while retaining a blank placeholder for rows with missing entries such as 2.
Is there a simple way to do this? I attempted to use the argument NA="" in write.table() but this didn't change anything.
When a data frame produced by my script contains NA, I always replace it. One way is to replace each NA with some other text that tells you the entry was NA in the data frame, especially if you are saving the result to a csv file, a database, or some other non-R environment.
A simple script to do that:
replace_NA <- function(x, replacement="N/A"){
x[is.na(x)] <- replacement
x # return the modified vector, not just the replacement value
}
df[] <- lapply(df, replace_NA, replacement="N/A") # df[] keeps the result a data frame
You are attempting to reinvent the fixed-width file format. Your requested format would have a blank column between every real column. I don't find a write.fwf in base R, although the 'utils' package has read.fwf. The simplest method of getting your requested output would be:
capture.output(dat, file='test.dat')
# Result in a text file
Sample Gene RS Chromosome
1 1 A rs1 10
2 2 B X
3 3 C rs4 Y
This essentially uses the print method (at the end of the R REPL) for dataframes to do the spacing for you.
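If a tab-delimited file is acceptable, a round trip can also be sketched with the na argument of write.table (lower case; R argument names are case-sensitive) and na.strings on the way back in. The data frame below is rebuilt from the toy example in the question, and the temporary file is an assumption:

```r
mydata <- data.frame(
  Sample = 1:3,
  Gene = c("A", "B", "C"),
  RS = c("rs1", NA, "rs4"),
  Chromosome = c("10", "X", "Y"),
  stringsAsFactors = FALSE
)

tmp <- tempfile()

# na = "" writes missing values as empty fields; the tabs on either side
# keep the placeholder, so every line still has four fields.
write.table(mydata, file = tmp, quote = FALSE, sep = "\t",
            na = "", row.names = FALSE)
written <- readLines(tmp)

# sep = "\t" makes empty fields significant; na.strings = "" turns them
# back into NA.
back <- read.table(tmp, header = TRUE, sep = "\t", na.strings = "",
                   stringsAsFactors = FALSE)
```

This only covers values that are NA in R. Entries that are empty strings are already written as empty fields.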

How to read a badly formatted CSV file with multiple embedded data sets and non-printing characters

I need to open a CSV file with the following options in the figure below. I add the link to my files. You can try with the file "20140313_Helix2_FP140_SC45.csv"
https://www.dropbox.com/sh/i5y8r8g7wymalw8/AABXsLkbpowxGObFpGHgv4m-a?dl=0
I have tried many options with read.table and read.csv, but I need a data frame with more than one column, with the data separated into those columns.
It looks like captured printer output. But it's not too messy:
# read it in as raw lines
lines <- readLines("20140313_Helix2_FP140_SC45.csv")
I'm assuming you want the "frequency point" data (it's the most prevalent) so we find the first one of those:
start <- which(grepl("^FREQUENCY POINTS:", lines))[1]
The rest of the file is "regular" enough to just look for lines beginning with a number (i.e. the PNT column) and read that in, giving it saner column names than the read.table default:
dat <- read.table(textConnection(grep("^[0-9]+",lines[start:length(lines)], value=TRUE)),
col.names=c("PNT", "FREQ", "MAGNITUDE"))
And, here's the result:
head(dat)
## PNT FREQ MAGNITUDE
## 1 1 0.800000 -19.033
## 2 2 0.800125 -19.038
## 3 3 0.800250 -19.071
## 4 4 0.800375 -19.092
## 5 5 0.800500 -19.137
## 6 6 0.800625 -19.167
nrow(dat)
## [1] 1601
The # of rows matches (from what I can tell) the # of frequency point records.
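The same steps can be tried without the original file. In this sketch the surrounding lines are invented stand-ins for the instrument output (only the three data rows come from the result above), and read.table's text argument is used in place of textConnection:

```r
# Hypothetical file contents: noise, a section marker, a header, data, noise.
lines <- c(
  "INSTRUMENT REPORT",
  "FREQUENCY POINTS: 3",
  "PNT  FREQ      MAGNITUDE",
  "1    0.800000  -19.033",
  "2    0.800125  -19.038",
  "3    0.800250  -19.071",
  "END OF DATA"
)

# Locate the first frequency-point section.
start <- which(grepl("^FREQUENCY POINTS:", lines))[1]

# Keep only the lines after that marker that begin with a digit.
keep <- grep("^[0-9]+", lines[start:length(lines)], value = TRUE)

# Parse the kept lines as whitespace-separated columns.
dat <- read.table(text = keep, col.names = c("PNT", "FREQ", "MAGNITUDE"))
```

A real file with several embedded data sets would need an end marker as well, so that numeric lines from later sections are not picked up.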

Remove index column in read.csv

Inspired by "Prevent row names to be written to file when using write.csv", I am curious if there is a way to ignore the index column in R using the read.csv() function. I want to import a text file into an RMarkdown document and don't want the row numbers to show in the HTML file produced by RMarkdown.
Running the following code
write.csv(head(cars), "cars.csv", row.names=FALSE)
produces a CSV that looks like this:
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
But, if you read this index-less file back into R (ie, read.csv("cars.csv")), the index column returns:
. speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
I was hoping the solution would be as easy as including row.names=FALSE to the read.csv() statement, as is done with write.csv(), however after I run read.csv("cars.csv", row.names=FALSE), R gets sassy and returns an "invalid 'row.names' specification" error message.
I tried read.csv("cars.csv")[-1], but that just dropped the speed column, not the index column.
How do I prevent the row index from being imported?
If you save your object, you won't have row names.
x <- read.csv("cars.csv")
But if you print it (to HTML), you will use the print.data.frame function, which shows row numbers by default. When I used the following as the last line in my markdown chunk, no row numbers were displayed:
print(read.csv("cars.csv"), row.names = FALSE)
Why?: This problem seems associated with a previous subset procedure that created the data. I have a file that keeps coming back with a pesky index column as I round-trip the data via read/write.csv.
Bottom Line: read.csv takes a file completely and outputs a dataframe, but the file has to be read before any other operation, like dropping a column, is possible.
Easy Workaround: Fortunately it's very simple to drop the column from the new dataframe:
df <- read.csv("data.csv")
df <- df[,-1]
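The two situations discussed above can be told apart with a small round trip. This is a sketch using the built-in cars data and a temporary file; the variable names are illustrative only:

```r
tmp <- tempfile(fileext = ".csv")

# Case 1: the file was written with row names (the write.csv default),
# so it really contains an extra first column.
write.csv(head(cars), tmp)
with_index <- read.csv(tmp)                   # extra column, named "X"
as_rownames <- read.csv(tmp, row.names = 1)   # first column becomes row names

# Case 2: the file was written without row names. Nothing extra is read;
# the numbers shown on screen are just how data frames are printed.
write.csv(head(cars), tmp, row.names = FALSE)
clean <- read.csv(tmp)
print(clean, row.names = FALSE)
```

In case 1 the column can be dropped or absorbed at read time; in case 2 only the printing needs to change.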
