Subsetting a dataframe with columns of interest in an unusual pattern - r

I currently have a dataframe which is 500 rows x 39042 columns in size. Essentially I would like to subset a specific sequence of columns in this data frame which are in the following pattern:
1, 4, 7, 8, 11, 14, 15, 18, 21 ....
The file is too large for R so I have attached a picture of what the data looks like in Excel. The columns in red are the ones I would like to extract
I have thought about stacking the dataframe every 9 columns (so the data will follow a consistent pattern), subsetting only the three columns of interest and re-formatting to a wide format, but I don't know how to do this. There was a similar question posted previously about stacking columns (Automatically stack every nth column of a dataframe), but that was for a small dataframe and I am having issues applying the answers posted there to my data.
If you need any more info please let me know.
Thanks!
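One way to pick the columns without stacking is to index them by position with modular arithmetic. The sketch below assumes the pattern repeats every 7 columns with offsets 1, 4 and 7, which is what the listed indices (1, 4, 7, 8, 11, 14, 15, 18, 21) suggest; if the real block is 9 columns wide, as the stacking idea implies, the period and offsets would need adjusting. The toy data frame df is made up for illustration.

```r
# Toy stand-in for the real 500 x 39042 data frame (hypothetical data).
df <- as.data.frame(matrix(seq_len(2 * 21), nrow = 2))

# Assumed pattern: within each block of 7 columns keep the 1st, 4th and 7th.
period  <- 7
offsets <- c(1, 4, 7)

# Zero-based position of every column inside its block, matched to the offsets.
keep <- which(((seq_len(ncol(df)) - 1) %% period) %in% (offsets - 1))

# drop = FALSE keeps the result a data frame even if one column is selected.
sub <- df[, keep, drop = FALSE]
keep
```

Because this only builds an integer index, it should scale to tens of thousands of columns without reshaping the data.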

Related

Getting subset of data based on conditional values from two columns

I have a data frame (image attached below) where I need to select rows based on conditional values of two columns. Specifically, I need to select those rows where c1Pos and c2Pos have consecutive values (e.g., 4 and 5, 3 and 2, etc.).
I have tried the following, but it doesn't work:
combined_df_locat_short_cues_consec<-subset(combined_df_color_locat_cues, subset=!(c2Pos==c1Pos+1|c1Pos==c1Pos-1))
Any help would be very much appreciated.
Thanks in advance,
Mikel
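Two things need to change: the leading ! negates the whole condition, so it selects the non-consecutive rows, and c1Pos==c1Pos-1 compares c1Pos with itself, so it is never true. Please replace
subset=! with subset=
c1Pos==c1Pos-1 with c2Pos==c1Pos-1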
combined_df_locat_short_cues_consec<-subset(combined_df_color_locat_cues, subset=(c2Pos==c1Pos+1|c2Pos==c1Pos-1))
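The same condition can also be written as an absolute difference of 1. Here is a minimal sketch on made-up data; only the column names c1Pos and c2Pos come from the question, the values are invented.

```r
# Hypothetical toy data using the column names from the question.
df <- data.frame(
  c1Pos = c(4, 3, 1, 5, 2),
  c2Pos = c(5, 2, 3, 5, 1)
)

# Consecutive in either direction means the two positions differ by exactly 1.
consec <- subset(df, abs(c2Pos - c1Pos) == 1)
consec
```

Rows with pairs (4, 5), (3, 2) and (2, 1) are kept; (1, 3) and (5, 5) are dropped.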

Render rows that do not have zero in any columns in R [duplicate]

This question already has answers here:
How to remove rows with 0 values using R
(2 answers)
Closed 2 years ago.
I searched many of the questions that Stack Overflow suggested before posting this one, but I couldn't find what I was looking for, so I decided to ask here. I have a data file:
https://github.com/learnseq/learning/blob/main/GSE133399_Fig2_FPKM.csv
The file has 9 columns: the first column has names and the other 8 columns have values. I want to render into an object all columns that do not have zero and save them in CSV format.
I had a look at your data set: it contains some rows in which all values are zero, except the identifier. I assume you want to omit the lines that are full of zeros. This code does the job:
data1 = read.csv("GSE133399_Fig2_FPKM.csv")
## Apply <all> on each row.
allZero = apply(data1[, -1] == 0, 1, all)
data2 = data1[!allZero, ]
Now, data2 is the same as data1, but without the rows having only zeros.
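The title of the question could also be read as "drop any row that contains a zero anywhere". The sketch below shows both readings side by side on made-up data and writes the result to CSV, which the question asked for. The data frame, its column names and the temporary output file are all illustrative assumptions, not taken from the real file.

```r
# Hypothetical toy data: first column is an identifier, the rest are values.
data1 <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  s1   = c(1.5, 0, 0, 2),
  s2   = c(3,   0, 4, 1)
)

values <- data1[, -1]

# Reading 1: drop rows where every value is zero (same idea as the answer).
all_zero    <- rowSums(values != 0) == 0
no_all_zero <- data1[!all_zero, ]

# Reading 2: drop rows that contain at least one zero.
any_zero    <- rowSums(values == 0) > 0
no_any_zero <- data1[!any_zero, ]

# Save the chosen result as CSV (a temporary file is used here).
out_file <- tempfile(fileext = ".csv")
write.csv(no_any_zero, out_file, row.names = FALSE)
```

With this toy input, reading 1 keeps g1, g3 and g4, while reading 2 keeps only g1 and g4.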

Match a vector to multiple consecutive rows in R [duplicate]

This question already has answers here:
How to index a vector sequence within a vector sequence
(5 answers)
Closed 5 years ago.
I have got a dataframe and I need to find row numbers where the values of the entries in one column match a certain pattern.
Let col1 = matrix(c(1,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,0,0,1), nrow = 21, ncol = 1) be an example of my column, and let r = c(2, 0, 2) be the vector I need to match against it.
I need R to return an index number of rows where the pattern in r matches the values in col1 (in this case row 11, 12, 13).
I thought I could achieve this with row.match, but that is not the case. I have tried different combinations of match function, but it doesn't yield any results either.
Maybe the way I am approaching this problem is wrong from the beginning, but I have trouble believing that there isn't any function, that would provide me with the expected result given some adjustment.
Thanks.
You could do this using rollapply from zoo. Basically, this applies identical over a rolling window of length(r). The result tells you that the sequence is present starting at position 11 of the col1 vector.
library(zoo)
which(rollapply(col1,length(r),identical,r))
[1] 11
To get a vector of positions, you could do:
which(rollapply(col1,length(r),identical,r))+0:(length(r)-1)
[1] 11 12 13
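Adding 0:(length(r)-1) to the result works when there is a single match; with several matches the recycling would no longer line up. As an alternative that needs no extra package, here is a base R sketch that should also cope with more than one match. It treats col1 as a plain vector (for the matrix in the question, col1[, 1] could be used instead); the helper names candidates, is_match, starts and rows are made up.

```r
col1 <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1)
r    <- c(2, 0, 2)
n    <- length(r)

# Every position where a window of length n still fits inside col1.
candidates <- seq_len(max(0, length(col1) - n + 1))

# TRUE where the window starting at i equals r element by element.
is_match <- vapply(
  candidates,
  function(i) all(col1[i:(i + n - 1)] == r),
  logical(1)
)
starts <- candidates[is_match]

# Expand each start position into the full run of matched rows.
rows <- sort(unique(as.vector(outer(starts, seq_len(n) - 1, `+`))))
rows
```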

subset whole data frame for value and return rows in which value are found [duplicate]

This question already has answers here:
Finding rows containing a value (or values) in any column
(3 answers)
Closed 6 years ago.
I am trying to subset a data frame containing 626 obs. of 149 variables and I want to look for a specific string and return the rows that have that value regardless of what column it is found in.
For example:
I am looking for this string "GO:0004674" in a data frame that can contain this string in many different columns and rows as shown below in the image link.
For example the string "GO:0004674" can be found in row 12, 13 and 14. So I would want to keep only those rows and later on export them.
How can I perform this? All examples that I have seen thus far only look for string in a specific column and not in the whole dataframe.
Any help will be greatly appreciated.
You can use apply to do a row-wise operation using the argument MARGIN = 1. Example:
mydf[apply(mydf, MARGIN = 1, FUN = function(x) {"GO:0004674" %in% x}), ]
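To see this on something small, here is a sketch with a made-up data frame (the column names and most values are invented; only the search string comes from the question). It also shows a vectorised variant with rowSums, which is my own suggestion rather than part of the answer above. Both look for exact matches of a whole cell, not substrings.

```r
# Hypothetical toy data with GO terms scattered over several columns.
mydf <- data.frame(
  id   = c("p1", "p2", "p3", "p4"),
  go_a = c("GO:0004674", "GO:0005524", NA, "GO:0016301"),
  go_b = c("GO:0005524", "GO:0004674", "GO:0016301", NA),
  stringsAsFactors = FALSE
)
target <- "GO:0004674"

# Row-wise check, as in the answer: is the target among the row's values?
hits_apply <- mydf[apply(mydf, MARGIN = 1, FUN = function(x) target %in% x), ]

# Vectorised variant: compare every cell at once, then count hits per row.
hits_rowsums <- mydf[rowSums(mydf == target, na.rm = TRUE) > 0, ]

hits_apply
```

The filtered rows could then be exported with write.csv, for example.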

R data - Combine two rows based on a single similar column within a dataset

I think this will be relatively elementary, but I cannot for the life of me figure it out.
Imagine a dataset in which there are 108 rows, made up of two readings for 54 clones. Pretty much, I need to condense a dataset based on clone (column 2), by averaging the cells from [6:653], whilst keeping the information for column 1, 2, 3, 654 (which is identical for these columns between the two readings).
I have a pretty small dataset, in which I have 108 rows and 654 columns, which I would like to whittle down to a smaller dataset. Now, the rows consist of 54 different tree clones (column 2), each with two readings (column 4) (54 * 2 = 108). I would like to average the two readings for each clone, reducing my dataset to 54 rows. Just FYI, the first 5 columns are characters, the next 648 are numeric. I would like to remove columns 4 and 5 from the new dataset, leaving a dataset of 54x652, but this is optional.
I believe that a (plyr) function or something will do the trick, but I can't make it work. I've tried a bunch of things, but it just won't play ball.
Thanks in advance.
To average, you can use mean. To leave out a row or column, index it with a negative number.
Example:
table[-x, ] - drops row x
table[ ,-x] - drops column x
(x can be a single number or a vector, e.g. x<-c(1:3) for the first three rows/columns)
If you provide more information I think others will also help.
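As a rough sketch of how the averaging itself could be done, base R's aggregate can group by the identifying columns and take the mean of the numeric ones. The toy data frame below is invented and much smaller than the real one (3 clones instead of 54, 2 numeric columns instead of 648), and all column names are hypothetical.

```r
# Toy stand-in for the 108 x 654 data: 3 clones x 2 readings.
# All column names are hypothetical.
dat <- data.frame(
  site    = "A",
  clone   = rep(c("c1", "c2", "c3"), each = 2),
  species = "oak",
  reading = rep(1:2, times = 3),
  note    = "x",
  v1      = c(1, 3, 2, 4, 10, 20),
  v2      = c(0, 2, 5, 5, 1, 3),
  batch   = "b1",
  stringsAsFactors = FALSE
)

# Columns that are identical across the two readings of a clone.
id_cols  <- c("site", "clone", "species", "batch")
# Numeric columns to average; "reading" and "note" are simply left out.
num_cols <- c("v1", "v2")

# One row per clone, with the mean of each numeric column.
avg <- aggregate(dat[num_cols], by = dat[id_cols], FUN = mean)
avg
```

For the real data, the same call would presumably use dat[6:653] for the numeric part and dat[c(1, 2, 3, 654)] for the grouping columns, assuming the layout described in the question.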
