Read / Import specific rows from large Excel files in R

I have dozens of very heavy Excel files that I need to import into R (then rebind). Each file has 2 sheets, where the second sheet (name: "Results") consists of 100K rows at least and has about 350 columns.
I would like to read a subset of the sheet "Results" from each file by columns, but most importantly, by specific rows. Each "ID" in the data has a main row, followed by multiple rows that contain data in specific columns. I would like to read only the main rows, which leaves each file with 50-400 rows (depending on the file) and 150 variables. The first column, which numbers the main rows, does not have a header.
This is what the data looks like (simplified):
I would like to import only the rows whose first column isn't empty but numbered (i.e., 1., 13., 34., 211.) and particular columns, in this example columns 2,3,5 (i.e., name, ID, status). The desired output would be:
Is there a simple way to do this?

Let's say a is our Excel file, read in as a data frame.
library(readxl)
a <- as.data.frame(read_excel("Pattern/File.xlsx",sheet = "Results"))
For instance, to select columns 1 to 3, use
subset(a[, 1:3], !is.na(a[[1]]))
This keeps only the rows whose first column is not NA, i.e. the numbered main rows.
Output:
...1 name ID
1 1 Dan us1d
4 13 Nev sa2e
6 34 Sam il5a
Note the first column's name ("...1"). It is autogenerated by the read_excel() function because that column has no header, but it should not be a problem.
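The question asks for columns 2, 3 and 5 rather than 1 to 3, so the same idea can be sketched with an explicit column selection. This is a minimal sketch on a toy data frame standing in for the "Results" sheet: the helper name keep_main_rows, the score column and the status values are invented for illustration, and with a real file the data frame would come from read_excel() as above.

```r
# Toy stand-in for the "Results" sheet (values invented for illustration).
results <- data.frame(
  no     = c(1, NA, NA, 13, NA, 34),
  name   = c("Dan", NA, NA, "Nev", NA, "Sam"),
  ID     = c("us1d", NA, NA, "sa2e", NA, "il5a"),
  score  = c(NA, 5, 7, NA, 3, NA),
  status = c("open", NA, NA, "closed", NA, "open"),
  stringsAsFactors = FALSE
)
names(results)[1] <- "...1"  # mimic the name read_excel() generates

# Hypothetical helper: keep rows whose first column is filled in,
# and only the requested columns (selected by position).
keep_main_rows <- function(df, cols) {
  is_main <- !is.na(df[[1]])
  df[is_main, cols, drop = FALSE]
}

main <- keep_main_rows(results, c(2, 3, 5))
print(main)
```

For dozens of files, the helper could be applied to each file with lapply() and the pieces combined with do.call(rbind, ...), assuming every file has the same column layout.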

Related

How to split rows within a dataframe for a target column with multiple/nested values

With a dataframe that has, for example, one column x with nested or multiple values in some rows, how would I, for those rows that have multiple values for x, append duplicate rows to the dataframe, except that each one corresponds to a single value within x?
To try to explain better, see "mock dataframe pre-transform", below. Row 1 has values "webui, cli, mobile" for column "module", and what I want is to append three near copies of row 1 to the dataframe, one with module value "webui", one with module value "cli" and one with module value "mobile". I also then want to remove the original row 1. A similar operation would occur for row 4, such that the final dataframe would have 7 rows (see "mock dataframe post-transform", below).
mock dataframe pre-transform
mock dataframe post-transform
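One possible base R approach is to split the column on the separator and repeat each row once per resulting value. The mock data below is an assumption made up to match the description (4 rows in, 7 rows out); only the "module" column and the values in row 1 come from the question.

```r
# Mock data (invented apart from row 1's "module" value).
df <- data.frame(
  id     = 1:4,
  module = c("webui, cli, mobile", "cli", "webui", "cli, mobile"),
  stringsAsFactors = FALSE
)

# Split each cell on a comma followed by optional whitespace.
parts <- strsplit(df$module, ",\\s*")

# Repeat every row as many times as it has values, then fill in one value per copy.
out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
out$module <- unlist(parts)
rownames(out) <- NULL

print(out)
```

Because the original multi-value rows are replaced by their copies rather than appended to, there is no separate step to remove row 1 or row 4.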

How can I split the headers of data to their own column?

I have a file that has 52560 rows but only one column, with the different header names all in one row, so I need to separate those rows into their own values and columns. The data frame is 52560 x 1, but I need 52560 x 19 (headings). I tried the separate and split functions, but they did not work. I am new to R programming.
Picture of data frame:
I think the values are separated by ';'. Hence, create a vector of the 19 column headers, then read the file into R with ';' as the separator while assigning those headers to the columns.
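Assuming the guess about the ';' separator is right, the idea can be sketched with read.table(). The three lines of raw text and the header names below are invented stand-ins for the real file, which has 19 headings; with an actual file, the file path would replace the text argument.

```r
# Invented stand-in for the file contents: one header line plus two data lines.
raw <- c(
  "date;temp;wind",
  "2020-01-01;3.5;7",
  "2020-01-02;4.1;5"
)

# Hypothetical header names; the real file would need all 19.
headers <- c("date", "temp", "wind")

# Skip the original header line and assign our own column names.
parsed <- read.table(
  text = raw,
  sep = ";",
  header = FALSE,
  skip = 1,
  col.names = headers,
  stringsAsFactors = FALSE
)

str(parsed)
```

If the header line in the file is already usable, header = TRUE without skip and col.names would be the shorter route.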

How to sort the first 20 rows in first column in alphabetical order in a data frame

I'm new to R coding and I'm doing exercises, and I got stuck. In my data frame, the first row contains patients (e.g. patient 1, patient 2, etc.) and the first column contains gene names (e.g. gene abc123, gene def456). What I want to know is how to sort the first 20 rows in column 1 in alphabetical order. Thanks
EDIT
I have put up a screenshot of the file in Excel, and I am trying to extract the ones in the red box in alphabetical order. I am unsure what to call column 1 in the console, as it doesn't have a heading. In the file provided, each row represents expression values for a single gene, and each column represents expression values for a single sample (patient).
The first column of each row is the gene identifier: (gene-symbol|entrez ID)
e.g. "A2M|2" (A2M is the gene-symbol and 2 is the entrez database identifier for alpha 2 macroglobulin)
Each sample identifier is formatted as: TCGA-ID_Tissue
where the Tissue is either "TissueA" or "TissueB" e.g. "TCGA-AA-3548_TissueA"
The question is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
screenshot of the table
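Following the exercise wording (sort all gene names, then print the first 20), a minimal sketch could look like this. The toy data frame is invented; only the "symbol|entrez ID" identifier format and the sample-name format come from the description. The unnamed first column is addressed by position, which sidesteps the question of what it is called.

```r
# Toy expression table (values invented); the first column holds "symbol|entrezID".
expr <- data.frame(
  gene_id = c("TP53|7157", "A2M|2", "BRCA1|672"),
  `TCGA-AA-3548_TissueA` = c(1.2, 3.4, 5.6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Take the first column by position and strip everything from "|" onwards.
gene_symbols <- sub("\\|.*$", "", expr[[1]])

# Sort A-Z and keep at most the first 20 names.
first_20 <- head(sort(gene_symbols), 20)
print(first_20)
```

Note that sort() on character vectors follows the current locale's collation rules, so mixed-case names may order differently on different systems.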

How to eliminate several rows from a dataframe using R

I have two Excel files, A and B. Excel file A has 6 columns with 10,000 rows. Assume that the columns are named A-F. Excel file B has one column (A) with 3500 rows. Here is what I want to do:
I want to eliminate rows based on the cell values (ids) in column A (in Excel file A) and have a dataframe without them. To elaborate further: I want to check each id in column A (Excel file A) against all ids in column A of Excel file B. If an id in column A of Excel file A matches any of the listed ids in column A of Excel file B, then I want to eliminate the rows with matching ids from Excel file A.
I was able to do this in excel. I want to do the same with R as a crosscheck. I am new to R and I am learning. Could someone help me with a best way to do this?
I know how to subset rows based on header title and a particular value of a cell. But, in this case, I have data with 10,000 observations, of which, I want to eliminate at least 3500 through matching the ids.
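Once both files are loaded as data frames, one common way to drop the matching rows is a negated %in% test. The two small data frames below are invented stand-ins for Excel files A and B; the object names file_a, file_b and kept are assumptions for illustration.

```r
# Invented stand-ins for the two Excel files after import.
file_a <- data.frame(
  A = c(101, 102, 103, 104, 105),
  B = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
file_b <- data.frame(A = c(102, 105, 999))

# Keep only the rows of file_a whose id does NOT appear in file_b.
kept <- file_a[!(file_a$A %in% file_b$A), , drop = FALSE]

print(kept)
```

Comparing nrow(file_a) - nrow(kept) against the count obtained in Excel gives a quick crosscheck of how many rows were removed.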

R: Replace column names with row values except where cells equal NA

I have a data frame extracted from a data base that contains different types of data (record types). The different record types have different column names which occupy the first three rows (including header). This data frame is made to be used in excel where you can easily filter out the data by choosing the correct record type.
Here I present small sample of my data frame which in reality contains many more columns (59) as well as rows (34000).
sample <- data.frame(X01RecordType=c("01HL","01CA","HH","HH","HH","HL"), X02Quarter=c(NA,NA,2,2,2,1),X05Gear=c(NA,NA,"KRA","KRA","KRA",NA),X06SweepLngt=c(NA,NA,35,35,-9,-9),
X12Month=c("12SpecCodeType",NA,4,5,4,2), X13Day=c("13SpecCode",NA,26,5,25,160617), X22StatRec=c("22LngtCode","22CANoAtLngt","45G1",NA,NA,NA),X23Depth=c("23LngtClass","23IndWgt",41,NA,63,NA))
As you might see the cells which contain column names are preceded by an X and a number and then a text, e.g. X01RecordType. It would be very easy to replace column names with the first rows by using:
colnames(df) <- df[1,]
However, as you can see some of the cells in the first two rows also contain NA-values. These NA-values indicate that the column names are the same for all record types, using the current header and therefore I would like to keep these. So really what I would like to do is replace the column names with the values of the first row (where record type header equals 01HL) except for NA-values.
If possible I would like to do this without using any external packages. Cells within the data may also contain NA-values and I would like to keep these rows so filtering out all columns containing NA is not an option if it doesn't only apply to the first row. Which is really the way I tried to approach this problem, but I can't figure out how.
I hope this is all the information required to help me out and thanks!
Another option without a loop
colnames(sample)[!is.na(sample[1,])] <- sample[1,][!is.na(sample[1,])]
sample[1:2,]
# 01HL X02Quarter X05Gear X06SweepLngt 12SpecCodeType 13SpecCode 22LngtCode
#1 01HL NA <NA> NA 12SpecCodeType 13SpecCode 22LngtCode
#2 01CA NA <NA> NA <NA> <NA> 22CANoAtLngt
# 23LngtClass
#1 23LngtClass
#2 23IndWgt
I suggest a simple loop:
for (j in seq_along(sample)) if (!is.na(sample[1, j])) colnames(sample)[j] <- as.character(sample[1, j])
