With a dataframe that has, for example, one column x that holds nested or multiple values for some rows, how would I, for those rows that have multiple values for x, append duplicate rows to the dataframe, except that each duplicate corresponds to a single value within x?
To try to explain better, see "mock dataframe pre-transform" below. Row 1 has values "webui, cli, mobile" for column "module", and what I want is to append three near copies of row 1 to the dataframe, one with module value "webui", one with module value "cli" and one with module value "mobile". I then also want to remove the original row 1. A similar operation would occur for row 4, such that the final dataframe would have 7 rows (see "mock dataframe post-transform" below).
mock dataframe pre-transform
mock dataframe post-transform
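One possible base R approach, as a rough sketch: split the multi-valued column, repeat each row once per value, then overwrite the column with the single values. The data frame `df`, its `id` column and the comma-plus-space delimiter are assumptions made up for illustration, since the mock tables are not shown here.

```r
# Hypothetical stand-in for the mock dataframe (column names are assumed)
df <- data.frame(
  id = 1:4,
  module = c("webui, cli, mobile", "cli", "webui", "cli, mobile"),
  stringsAsFactors = FALSE
)

# Split each module string on commas (with optional trailing whitespace)
parts <- strsplit(df$module, ",\\s*")

# Repeat every row once per value it holds, then fill in the single values
expanded <- df[rep(seq_len(nrow(df)), lengths(parts)), ]
expanded$module <- unlist(parts)
rownames(expanded) <- NULL

print(expanded)
```

Because the original multi-valued rows are replaced by their copies rather than kept alongside them, there is no separate step to remove them afterwards.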
I have a file that has 52560 rows but only one column, with the different header names all in a single row, so I need to separate each row into its own values and columns. The data frame is 52560 x 1 but I need 52560 x 19 (headings). I tried the separate and split functions but they did not work. I am new to R programming.
Picture of data frame:
I think the values are separated by ';'. Create a vector of the 19 column headers, then read the file into R with ';' as the separator while assigning those headers to the columns.
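A minimal sketch of that idea, assuming the delimiter really is ';'. The three-column `raw_lines` input and the names in `headers` are made up for illustration; with a real file you would pass the file path as the first argument to `read.table()` instead of `text =`.

```r
# Hypothetical lines standing in for the file contents (assumed ';'-delimited)
raw_lines <- c("2020-01-01;1.5;ok", "2020-01-02;2.5;bad")

# Hypothetical header names; the real file would need its own full set
headers <- c("date", "value", "status")

# Read with ';' as the separator and assign the column names directly
dat <- read.table(
  text = raw_lines,
  sep = ";",
  header = FALSE,
  col.names = headers,
  stringsAsFactors = FALSE
)

str(dat)
```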
I'm new to R coding, I'm doing exercises and I got stuck. In my data frame, the first row holds patients, e.g. patient 1, patient 2 etc., and the first column holds gene names, e.g. gene abc123, gene def456. What I want to know is how to sort the first 20 rows in column 1 in alphabetical order. Thanks
EDIT
I have put up a screenshot of the file in Excel and I am trying to extract the ones in the red box in alphabetical order. I am unsure what to call column 1 in the console as it doesn't have a heading. In the file provided, each row represents expression values for a single gene, and each column represents expression values for a single sample (patient).
The first column of each row is the gene identifier: (gene-symbol|entrez ID)
e.g. "A2M|2" (A2M is the gene-symbol and 2 is the entrez database identifier for alpha 2 macroglobulin)
Each sample identifier is formatted as: TCGA-ID_Tissue
where the Tissue is either "TissueA" or "TissueB" e.g. "TCGA-AA-3548_TissueA"
The question is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
screenshot of the table
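A column without a heading can still be referred to by position, for example `expr[[1]]`. As a rough sketch, assuming the identifiers follow the "gene-symbol|entrez ID" format described above; the data frame `expr`, its column names and its values are hypothetical:

```r
# Hypothetical expression table; the first column holds "symbol|entrezID"
expr <- data.frame(
  gene_id = c("TP53|7157", "A2M|2", "BRCA1|672"),
  sample1 = c(1.2, 3.4, 5.6),
  stringsAsFactors = FALSE
)

# Drop everything from the "|" onwards to keep only the gene symbol
symbols <- sub("\\|.*$", "", expr[[1]])

# Sort A-Z and keep at most the first 20 names
first20 <- head(sort(symbols), 20)
print(first20)
```

Sorting the whole column first and then taking 20 names matches the wording of the exercise. Sorting only the first 20 rows would instead be `sort(symbols[1:20])`.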
I have two Excel files, A and B. Excel file A has 6 columns with 10,000 rows. Assume that the columns are named A-F. Excel file B has one column (A) with 3500 rows. Here is what I want to do:
I want to eliminate rows based on the cell values (ids) in column A (in Excel file A) and have a dataframe without them. To elaborate further: I want to check each id in column A (Excel file A) against all ids in column A of Excel file B. If the id in column A of Excel file A matches any of the listed ids in column A of Excel file B, then I want to eliminate the rows with those matching ids from Excel file A.
I was able to do this in Excel. I want to do the same with R as a crosscheck. I am new to R and I am learning. Could someone help me with the best way to do this?
I know how to subset rows based on header title and a particular value of a cell. But in this case I have data with 10,000 observations, of which I want to eliminate at least 3500 by matching the ids.
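One way to sketch this in base R is with `%in%`, which tests each id in the first table for membership in the second. The small data frames `file_a` and `file_b` below are hypothetical stand-ins for the two Excel sheets, assumed to be already imported into R.

```r
# Hypothetical stand-ins for the two imported Excel sheets
file_a <- data.frame(
  A = c(101, 102, 103, 104, 105),
  B = c("v", "w", "x", "y", "z"),
  stringsAsFactors = FALSE
)
file_b <- data.frame(A = c(102, 104))

# Keep only the rows of file_a whose id does NOT appear in file_b
kept <- file_a[!(file_a$A %in% file_b$A), ]

print(kept)
```

A quick crosscheck against the Excel result is to compare `nrow(kept)` with the row count obtained there.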
I have a data frame extracted from a database that contains different types of data (record types). The different record types have different column names, which occupy the first three rows (including the header). This data frame is made to be used in Excel, where you can easily filter the data by choosing the correct record type.
Here I present a small sample of my data frame, which in reality contains many more columns (59) as well as rows (34000).
sample <- data.frame(X01RecordType=c("01HL","01CA","HH","HH","HH","HL"), X02Quarter=c(NA,NA,2,2,2,1),X05Gear=c(NA,NA,"KRA","KRA","KRA",NA),X06SweepLngt=c(NA,NA,35,35,-9,-9),
X12Month=c("12SpecCodeType",NA,4,5,4,2), X13Day=c("13SpecCode",NA,26,5,25,160617), X22StatRec=c("22LngtCode","22CANoAtLngt","45G1",NA,NA,NA),X23Depth=c("23LngtClass","23IndWgt",41,NA,63,NA))
As you might see the cells which contain column names are preceded by an X and a number and then a text, e.g. X01RecordType. It would be very easy to replace column names with the first rows by using:
colnames(df) <- df[1,]
However, as you can see some of the cells in the first two rows also contain NA-values. These NA-values indicate that the column names are the same for all record types, using the current header and therefore I would like to keep these. So really what I would like to do is replace the column names with the values of the first row (where record type header equals 01HL) except for NA-values.
If possible I would like to do this without using any external packages. Cells within the data may also contain NA values and I would like to keep those rows, so filtering out everything containing NA is not an option unless it applies only to the first row. That is really the way I tried to approach this problem, but I can't figure out how.
I hope this is all the information required to help me out and thanks!
Another option without a loop
colnames(sample)[!is.na(sample[1,])] <- sample[1,][!is.na(sample[1,])]
sample[1:2,]
# 01HL X02Quarter X05Gear X06SweepLngt 12SpecCodeType 13SpecCode 22LngtCode
#1 01HL NA <NA> NA 12SpecCodeType 13SpecCode 22LngtCode
#2 01CA NA <NA> NA <NA> <NA> 22CANoAtLngt
# 23LngtClass
#1 23LngtClass
#2 23IndWgt
I suggest a simple loop:
for (i in seq_along(sample)) if (!is.na(sample[1, i])) colnames(sample)[i] <- as.character(sample[1, i])