Read Excel file and select specific rows and columns - r

I want to read an xls file into R and select specific columns.
For example, I only want columns 1 to 10 and rows 5-700. I think you can do this with xlsx, but I can't use that library on the network that I am using.
Is there another package that I can use? And how would I go about selecting the columns and rows that I want?

You can try this:
library(xlsx)
read.xlsx("my_path\\my_file.xlsx", sheetName = "sheet_name", rowIndex = 5:700, colIndex = 1:10)

Since you are unable to load the xlsx package, you might want to consider base R and use read.csv. For this, save your Excel file as a csv; instructions for doing so are easy to find on the web. Note that csv files can still be opened in Excel.
These are the steps to read only the 2nd and 3rd columns and rows.
hd <- read.csv('a.csv', header = FALSE, nrows = 1, as.is = TRUE) # first read the header row
removeCols <- c('NULL', NA, NA) # 'NULL' drops a column, NA keeps it
df <- read.csv('a.csv', skip = 2, header = FALSE, colClasses = removeCols) # skip says how many lines not to read
colnames(df) <- hd[is.na(removeCols)] # reattach the names of the kept columns
df
two three
1 5 8
2 6 9
This is the example data I used.
a <- data.frame(one=1:3, two=4:6, three=7:9)
write.csv(a, 'a.csv', row.names=F)
read.csv('a.csv')
one two three
1 1 4 7
2 2 5 8
3 3 6 9
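If installing packages is possible but Java is the blocker, readxl (which has no Java dependency) can do the row and column selection directly. A minimal sketch, assuming the data sit on the first sheet of a hypothetical my_file.xlsx:
library(readxl)
# range = "A5:J700" restricts the read to columns 1-10 and rows 5-700;
# col_names = FALSE treats row 5 as data rather than a header (adjust if needed)
d <- read_excel("my_file.xlsx", sheet = 1, range = "A5:J700", col_names = FALSE)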

Related

Load data from a text file to R

I have data in text format whose structure is as follows:
ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
.........T.....,,,,,,,,,.......T,,,,,,.........
......A..*............,,,,,,,,.A........T......
....*..................,,,T...............
...*.....................*...........
...................*.....
I have been trying to import it into R using the read.table() command but when I do the output has an altered structure like this:
V1
1 ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
2 .........T.....,,,,,,,,,.......T,,,,,,.........
3 ......A..*............,,,,,,,,.A........T......
4 ....*..................,,,T...............
5 ...*.....................*...........
6 ...................*.....
For some reason, R is shifting the rows with fewer characters to the right. How can I load my data into R without altering the structure present in the original text file?
Try this :)
read.table(file, sep = "\n")
result:
V1
1 ATCTTTGAT*TTAGGGGGAAAAATTCTACGC*TTACTGGACTATGCT
2 .........T.....,,,,,,,,,.......T,,,,,,.........
3 ......A..*............,,,,,,,,.A........T......
4 ....*..................,,,T...............
5 ...*.....................*...........
6 ...................*.....
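The right shift is just print.data.frame right-aligning a character column; the underlying strings are intact. If you only need the lines verbatim, readLines avoids the data frame entirely. A minimal sketch, with "file.txt" standing in for your file:
# each element is one line of the file, stored exactly as written
x <- readLines("file.txt")
print(x)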

r data.table readcsv file increases column amount

I am trying to read immense amounts of data from csv files (probably around 80 million rows split across roughly 200 files).
Some of the files are not well structured: after a few hundred thousand rows, for some reason, the rows end with a comma (",") with no additional information behind it. A short example to illustrate this behaviour:
a,b,c
1,2,3
d,e,f,
4,5,6,
The rows have 19 columns. I tried manually telling fread to read it as 20 columns, using colClasses, col.names and fill=TRUE:
library(data.table)
all.files <- list.files(getwd(), full.names = TRUE, recursive = TRUE)
lapply(all.files, fread,
       select = c(5, 6, 9),
       col.names = paste0("V", seq_len(20)),
       #colClasses = c("V1" = "character", "V2" = "character", "V3" = "integer"),
       colClasses = c(<all 20 data types, 20th arbitrarily as integer>),
       fill = TRUE)
Another workaround I tried was to not use fread at all, by doing
data <- lapply(all.files, readLines)
data <- unlist(data)
data <- as.data.table(tstrsplit(data,","))
data <- data[, c("V5","V6","V9"), with=F]
However, this approach leads to "Error: memory exhausted", which I believe might be solved by actually only reading the required 3 columns, instead of all 19.
Any hints on how to use fread for this scenario is greatly appreciated.
You can try using readr::read_csv as follows:
library(readr)
txt <- "a,b,c
1,2,3
d,e,f,
4,5,6,"
read_csv(txt)
results in the expected result:
# A tibble: 3 × 3
a b c
<chr> <chr> <chr>
1 1 2 3
2 d e f
3 4 5 6
And the following warning
Warning: 2 parsing failures.
row col expected actual
2 -- 3 columns 4 columns
3 -- 3 columns 4 columns
To only read specific columns use cols_only as follows:
read_csv(txt,
         col_types = cols_only(a = col_character(),
                               c = col_character()))
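To apply this across all 200 files, combining lapply with cols_only keeps memory usage down, since the unwanted columns are never materialised. A sketch, reusing the toy column names a and c (substitute the real ones) and the all.files vector from above, with dplyr::bind_rows doing the stacking:
library(readr)
library(dplyr)
# read only columns a and c from every file, then stack into one data frame
data <- bind_rows(lapply(all.files, read_csv,
                         col_types = cols_only(a = col_character(),
                                               c = col_character())))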

Saving data frame to csv with column names as strings in R

Hi I have the following data frame:
b <- data.frame(c(1,2), c(3,4))
colnames(b) <- c("100.X0","100.00")
b
100.X0 100.00
1 1 3
2 2 4
I would like to save this as a csv file with the headers kept as strings. When I use write.csv, the result ends up being:
100.X0 100
1 3
2 4
It turns the 100.00 into 100; how do I prevent this?
I think the problem might be the way you open the csv file afterwards. Certain programs (e.g. Excel) will guess the type and convert the value for display.
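You can verify that the csv itself is intact by inspecting the raw lines; a quick check, assuming the b data frame from the question:
write.csv(b, "test.csv", row.names = FALSE)
# the header text is written verbatim, quotes and all
readLines("test.csv")
# [1] "\"100.X0\",\"100.00\"" "1,3"                  "2,4"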
Use write.xls from package dataframes2xls instead:
library(dataframes2xls)
write.xls(b, "test.csv")

Remove index column in read.csv

Inspired by Prevent row names to be written to file when using write.csv, I am curious whether there is a way to ignore the index column in R using the read.csv() function. I want to import a text file into an RMarkdown document and don't want the row numbers to show in my HTML file produced by RMarkdown.
Running the following code
write.csv(head(cars), "cars.csv", row.names=FALSE)
produces a CSV that looks like this:
speed,dist
4,2
4,10
7,4
7,22
8,16
9,10
But, if you read this index-less file back into R (ie, read.csv("cars.csv")), the index column returns:
. speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
I was hoping the solution would be as easy as including row.names=FALSE to the read.csv() statement, as is done with write.csv(), however after I run read.csv("cars.csv", row.names=FALSE), R gets sassy and returns an "invalid 'row.names' specification" error message.
I tried read.csv("cars.csv")[-1], but that just dropped the speed column, not the index column.
How do I prevent the row index from being imported?
If you assign the result to an object, it won't print, so you won't see row names.
x <- read.csv("cars.csv")
But if you print it (to HTML), the print.data.frame function is used, which shows row numbers by default. If I use the following as the last line in my markdown chunk, no row numbers are displayed:
print(read.csv("cars.csv"), row.names = FALSE)
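In RMarkdown, another common route is knitr::kable, which also accepts a row.names argument; a sketch:
library(knitr)
# kable renders a markdown table; row.names = FALSE suppresses the index
kable(read.csv("cars.csv"), row.names = FALSE)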
Why?: This problem seems to stem from a previous subset procedure that created the data. I have a file that keeps coming back with a pesky index column as I round-trip the data via read/write.csv.
Bottom line: read.csv reads the file completely and outputs a data frame; the file has to be read before any other operation, like dropping a column, is possible.
Easy Workaround: Fortunately it's very simple to drop the column from the new dataframe:
df <- read.csv("data.csv")
df <- df[,-1] # drop the first (index) column
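As a sanity check, writing the cleaned frame back out with row.names = FALSE makes the round trip stable from then on:
# overwrite the file without the index so subsequent reads come back clean
write.csv(df, "data.csv", row.names = FALSE)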

Read Excel in R

In excel I have a table that looks like this:
` Data Freq
1 [35-39] 1
2 [40-44] 3
3 [45-49] 5
4 [50-54] 11
5 [55-59] 7
6 [60-64] 7`
I'm trying to figure out a way to read the values in the Data column as intervals for calculations in R.
I need to calculate things such as:
(39-35)/2
# read
library(xlsx)
d <- read.xlsx('data.xlsx', header = TRUE, sheetIndex = 1)
# parse the interval endpoints out of strings like "[35-39]"
dl <- do.call(rbind, strsplit(as.character(d$Data), split = '-|\\[|\\]'))
d$a <- as.numeric(dl[,2]) # lower bound
d$b <- as.numeric(dl[,3]) # upper bound
# calculate the interval midpoint
d$mid <- (d$b - d$a)/2 + d$a
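Note that the (39-35)/2 from the question is the half-width rather than the midpoint; it falls out of the same parsed columns:
d$halfwidth <- (d$b - d$a)/2 # e.g. (39 - 35)/2 = 2 for the first row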
Another way that doesn't use libraries is to convert your Excel file into a csv (via Save As in Excel) and then read the data using read.csv.
xlsx uses rJava and needs Java. An alternative is readxl:
library(readxl)
ed <- read_excel("myfile.xlsx")
