Matching multiple letters across rows - r

I want to analyze my dataset in a certain way, but unfortunately, despite spending a lot of time on R, I could not figure out how to accomplish the task. Below is what I want to do:
Dataset name: Proteome (this dataset has thousands of rows and 14 columns: below I am showing only four entries in column 5)
Row 1, column 5: GHFCLKPGCNFHAESTRGYR
Row 2, column 5: FCLKPGCNFHAESTRGYR
Row 3, column 5: GHFCLKPGCNFHAESTR
Row 4, column 5: GCNFHAESTR
In row 2, first two letters of row 1 are missing;
in row 3, last three letters of row 1 are missing;
in row 4, first seven and last three letters of row 1 are missing.
Rows 2, 3, and 4 reflect the artifacts of the scientific method I have been using to generate the data, and therefore I want to remove these entries.
Ideally, I want R to return the top entry, but it would be OK if R could only collapse such rows into a single row. My idea is to collapse multiple rows into one if five consecutive letters in those rows match each other. In the above example, GCNFHAESTR matches in all four rows, so I want R to return only one row, ideally the top one.
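One possible way to approach this, sketched in base R: instead of the five-letter rule, treat a row as an artifact when its sequence occurs as a whole inside a different sequence in the same column, and keep only the rows that are not contained in any other. This is a stricter test than the five-letter idea (which could also merge unrelated peptides that happen to share a short motif), so it is a swapped-in technique, not the rule described above. The toy data frame, its column names, and the helper name `is_fragment` are made up for illustration, and the sketch assumes column 5 has no missing values.

```r
# Toy stand-in for the Proteome data frame; only column 5 matters here.
# The four sequences are the ones quoted in the question.
Proteome <- data.frame(
  c1 = 1:4, c2 = 1:4, c3 = 1:4, c4 = 1:4,
  sequence = c("GHFCLKPGCNFHAESTRGYR", "FCLKPGCNFHAESTRGYR",
               "GHFCLKPGCNFHAESTR", "GCNFHAESTR"),
  stringsAsFactors = FALSE
)

seqs <- Proteome[[5]]

# TRUE when a sequence occurs inside a different sequence in the column
is_fragment <- vapply(seq_along(seqs), function(i) {
  others <- seqs[-i]
  any(others != seqs[i] & grepl(seqs[i], others, fixed = TRUE))
}, logical(1))

# Keep only the full-length entries and drop exact duplicates of them
collapsed <- Proteome[!is_fragment & !duplicated(seqs), , drop = FALSE]
print(collapsed)
```

Each sequence is compared against all others, so the run time grows with the square of the number of rows; for a few thousand rows that should still be manageable.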

Related

How to assign multiple values for a matrix in R?

I have a 93060-by-141 matrix file with filled values.
I need to assign zeros to some rows and columns under this condition: rows 1:65 of column 1; rows 67:131 of column 2; rows 133:197 of column 3; and so on.
Some columns are excluded from this condition, i.e. the values in columns 10, 20, and 36 stay unchanged.
I think I will need a for loop, but I have no idea how to code the link between rows and columns that expresses this condition.
You can change multiple values at once by indexing a matrix with a two-column matrix of indices. The first column holds the row numbers and the second column holds the column numbers of the cells. For example, mymat[cbind(1:10, 1)] <- 0 would change the first through tenth rows of column 1 to zero. In your case, you could put together several such cbind() statements with a call to rbind(). For example, mymat[rbind(cbind(1:5, 1), cbind(6:10, 2))] <- 0 would change the first through fifth rows of column 1 and the sixth through tenth rows of column 2 to zero. In the example you proposed above, it would be something like this:
mymat[rbind(
cbind(1:65, 1),
cbind(67:131, 2),
cbind(133:197, 3))] <- 0
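For many columns, the index matrix can be built programmatically instead of typing each cbind() by hand. The sketch below assumes the "and so on" in the question means that column k covers 65 rows starting at row 66 * (k - 1) + 1, and that skipped columns still use up their slot in that sequence; both are guesses about the intended pattern. The matrix is a small stand-in (660 by 10 rather than 93060 by 141), and `skip_cols` is a hypothetical replacement for columns 10, 20, and 36.

```r
# Small stand-in matrix filled with ones
mymat <- matrix(1, nrow = 660, ncol = 10)

# Columns to leave untouched (hypothetical; the question names 10, 20, 36)
skip_cols <- c(4, 7)
cols <- setdiff(seq_len(ncol(mymat)), skip_cols)

# One block of (row, column) index pairs per column, stacked with rbind()
idx <- do.call(rbind, lapply(cols, function(k) {
  start <- 66 * (k - 1) + 1          # assumed start row for column k
  cbind(start:(start + 64), k)       # 65 consecutive rows in column k
}))

mymat[idx] <- 0
```

This avoids an explicit for loop over cells: the assignment happens in one vectorized step once the index matrix exists.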

How to split rows within a dataframe for a target column with multiple/nested values

Given a dataframe with, for example, one column x that holds nested or multiple values in some rows, how would I, for each row that has multiple values for x, append duplicate rows to the dataframe, except that each copy corresponds to one value within x?
To try to explain better, see "mock dataframe pre-transform" below. Row 1 has the values "webui, cli, mobile" for column "module", and what I want is to append three near copies of row 1 to the dataframe: one with module value "webui", one with module value "cli", and one with module value "mobile". I then also want to remove the original row 1. A similar operation would occur for row 4, such that the final dataframe would have 7 rows (see "mock dataframe post-transform" below).
mock dataframe pre-transform
mock dataframe post-transform
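Since the mock dataframes themselves are not shown here, the following is only a sketch in base R on a made-up four-row data frame: the `id` column and all values except "webui, cli, mobile" are hypothetical. It splits the "module" column on commas, repeats each row once per value, and then fills in the individual values, which gives 7 rows as described above.

```r
# Hypothetical stand-in for the pre-transform data frame
df <- data.frame(
  id = 1:4,
  module = c("webui, cli, mobile", "cli", "webui", "cli, mobile"),
  stringsAsFactors = FALSE
)

# Split each module string on commas (with optional trailing spaces)
parts <- strsplit(df$module, ",\\s*")

# Repeat every row once per value it holds, then assign one value per copy
out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
out$module <- unlist(parts)
rownames(out) <- NULL

print(out)
```

Because the rows are rebuilt rather than appended, the original multi-value rows do not need to be removed afterwards.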

Dividing two columns in a Dataframe and placing result in existing column, but referencing columns by Index rather than name

I have a dataframe with 21 columns. Columns 4 onwards are pairs of values (numerator and denominator). I want to divide the two and place the result into the first column of each pair, i.e. I want column 4 to become the result of column 4 divided by column 5, then I want column 6 to be the result of column 6 divided by column 7, and so on.
I know (or at least can find on Google) how to do this easily enough with reference to the column names, but I would prefer not to use these and rather refer to the column index.
It can be done by dividing equal-sized subsets of the data frame. Because the columns come in numerator/denominator pairs, take every second column from 4 up to the one before the last as the numerators and the column right after each as the denominators, then assign the result back to the numerator positions:
num <- seq(4, ncol(df1) - 1, by = 2)
df1[num] <- df1[num] / df1[num + 1]
NOTE: This assumes the columns are of numeric class.
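As a small check of index-based pairwise division, the sketch below uses a made-up seven-column data frame (three leading columns, then two numerator/denominator pairs); all column names and values are hypothetical. Stepping through the numerator indices by 2 divides column 4 by 5 and column 6 by 7 while leaving the denominator columns as they were.

```r
# Hypothetical data: columns 4-5 and 6-7 are numerator/denominator pairs
df1 <- data.frame(
  id = 1:2, a = c("x", "y"), b = c("p", "q"),
  n1 = c(10, 20), d1 = c(2, 5),
  n2 = c(9, 8),   d2 = c(3, 4),
  stringsAsFactors = FALSE
)

# Positions of the numerator columns: 4, 6, ...
num <- seq(4, ncol(df1) - 1, by = 2)

# Divide each numerator column by the column immediately to its right
df1[num] <- df1[num] / df1[num + 1]

print(df1)
```

Division between two data frames of equal shape works position by position, so the differing column names of numerators and denominators do not matter here.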

How to sort the first 20 rows in first column in alphabetical order in a data frame

I'm new to R coding, and I got stuck while doing exercises. In my data frame, the first row holds the patients (e.g. patient 1, patient 2, etc.) and the first column holds the gene names (e.g. gene abc123, gene def456). What I want to know is how to sort the first 20 rows in column 1 in alphabetical order. Thanks
EDIT
I have put up a screenshot of the file in Excel, and I am trying to extract the ones in the red box in alphabetical order. I am unsure what to call column 1 in the console as it doesn't have a heading. In the file provided, each row represents expression values for a single gene, and each column represents expression values for a single sample (patient).
The first column of each row is the gene identifier: (gene-symbol|entrez ID)
e.g. "A2M|2" (A2M is the gene symbol and 2 is the Entrez database identifier for alpha 2 macroglobulin)
Each sample identifier is formatted as: TCGA-ID_Tissue
where the Tissue is either "TissueA" or "TissueB", e.g. "TCGA-AA-3548_TissueA"
The question is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
screenshot of the table
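A column without a heading can still be reached by position, for example with `[[1]]`. The sketch below uses a tiny made-up data frame in place of the real file (the object name `expr`, the column names, and the expression values are hypothetical), sorts all gene identifiers A-Z as the exercise text asks, and prints the first 20. If the intent is instead to sort only the first 20 rows, the alternative line at the end does that.

```r
# Toy stand-in for the expression table; the first column holds gene identifiers
expr <- data.frame(
  gene = c("TP53|7157", "A2M|2", "BRCA1|672", "A1BG|1"),
  sample1 = c(1.2, 3.4, 5.6, 7.8),   # made-up expression values
  stringsAsFactors = FALSE
)

# Access the first column by position, so its name does not matter
genes <- as.character(expr[[1]])

# Sort all gene names A-Z, then keep the first 20 (fewer if the table is short)
first20 <- head(sort(genes), 20)
print(first20)

# Alternative reading: sort only the first 20 rows of column 1
first20_rows_sorted <- sort(head(genes, 20))
```

With the real file, `expr` would come from something like read.delim() or read.csv(), depending on how the file is delimited.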

R data - Combine two rows based on a single similar column within a dataset

I think this will be relatively elementary, but I cannot for the life of me figure it out.
Imagine a dataset in which there are 108 rows, made up of two readings for 54 clones. Pretty much, I need to condense the dataset based on clone (column 2) by averaging the cells in columns [6:653], whilst keeping the information in columns 1, 2, 3, and 654 (which is identical between the two readings).
I have a pretty small dataset, with 108 rows and 654 columns, which I would like to whittle down to a smaller dataset. The rows consist of 54 different tree clones (column 2), each with two readings (column 4) (54 * 2 = 108). I would like to average the two readings for each clone, reducing my dataset to 54 rows. Just FYI, the first 5 columns are characters, and the next 648 are numeric. I would like to remove columns 4 and 5 from the new dataset, leaving a dataset of 54x652, but this is optional.
I believe that a plyr function or something similar will do the trick, but I can't make it work. I've tried a bunch of things, but it just won't play ball.
Thanks in advance.
To average, you can use mean(). To leave out a row or column, use a negative index.
Example:
table[-x, ] - drops row x
table[, -x] - drops column x
(x can be a single number or a vector such as x <- 1:3 for the first three rows/columns)
If you provide more information, I think others will also be able to help.
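One way to put mean() to work per clone is base R's aggregate(), grouping by the columns that are identical across the two readings and averaging the numeric ones. The sketch below is a scaled-down guess at the layout (8 columns instead of 654): columns 1 to 3 and the last column are the grouping columns, columns 4 and 5 are dropped, and columns 6 to 7 stand in for the numeric block. All names and values are hypothetical.

```r
# Toy stand-in: 2 clones x 2 readings, laid out like the described dataset
dat <- data.frame(
  site    = c("s1", "s1", "s2", "s2"),
  clone   = c("c1", "c1", "c2", "c2"),
  block   = c("b1", "b1", "b2", "b2"),
  reading = c("1", "2", "1", "2"),      # dropped in the result
  note    = c("x", "y", "x", "y"),      # dropped in the result
  m1      = c(1, 3, 10, 20),
  m2      = c(2, 4, 30, 50),
  year    = c("2019", "2019", "2019", "2019"),
  stringsAsFactors = FALSE
)

num_cols <- 6:7             # 6:653 in the real data
key_cols <- c(1, 2, 3, 8)   # 1, 2, 3, 654 in the real data

# Average the numeric columns within each combination of the key columns
avg <- aggregate(dat[num_cols], by = dat[key_cols], FUN = mean)
print(avg)
```

The result has one row per clone, with the key columns first and the averaged columns after them; columns 4 and 5 are left out simply because they are neither grouped on nor averaged.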
