How to group rows in R based on row number

I would like to make 2 groups based on row numbers (the 1st group being rows 2 to 47, and the 2nd group being rows 48 to 92). The groups are my top- and bottom-performing samples, and I would like to compare the groups' values in the 12 data columns (genes being tested). So, my ultimate goal is to divide the samples into their appropriate groups and run statistical analyses on each group's values for each of the genes tested. Here is a small section of my table:
Sample icaA icaB icaC icaD
ST1 12 13 15 18
ST2 11 9 8 16
ST3 15 18 18 15
ST4 13 16 17 20
I don't know if I can use cbind to combine the groups. I think I've also seen others flip the rows and columns; I can do that if needed. I'm just a beginner with the software, so any suggestions would be great!

To get the first group:
df1 <- df[2:47, ]
To get the second group:
df2 <- df[48:92, ]
Right?
Then you can run stats on each column, for instance, like this:
apply(df1[-1], 2, mean)
...to get the mean of each column in the first group.
Then for the mean of each column in the second group:
apply(df2[-1], 2, mean)
Then, to bind the two groups back into one dataframe (or matrix), I recommend:
rbind(df1, df2)
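Since your ultimate goal is to compare the two groups statistically, here is a minimal sketch of one way to do that, assuming the first column of df is the Sample ID, the remaining columns are numeric gene values, and a two-sample t-test is an appropriate test for your data (that last part is a judgment call for you):
sapply(names(df1)[-1], function(gene) {
  # compare this gene's values between the two groups
  t.test(df1[[gene]], df2[[gene]])$p.value
})
This returns a named vector with one p-value per gene column.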

Related

Removing extreme values in a dataframe while sorting for multiple columns R

I have a dataframe like this:
mydf <- data.frame(A = c(40,9,55,1,2), B = c(12,1345,112,45,789))
mydf
A B
1 40 12
2 9 1345
3 55 112
4 1 45
5 2 789
I want to retain only 95% of the observations and throw out the 5% of the data that have extreme values. First, I calculate how many observations that is:
th <- length(mydf$A) * 0.95
Then I want to remove all the rows above th (or retain the rows below th, as you wish). I need to sort mydf in ascending order so that I remove only the extreme values. I tried several approaches:
mydf[order(mydf["A"], mydf["B"]),]
mydf[order(mydf$A,mydf$B),]
mydf[with(mydf, order(A,B)), ]
plyr::arrange(mydf,A,B)
but none of them works: mydf is not sorted in ascending order by the two columns at the same time. I looked at Sort (order) data frame rows by multiple columns, but the most common solutions do not work and I don't get why.
However, if I consider only one column at a time (e.g., A), those ordering methods work, but then I don't get how to throw out the extreme values, because this:
mydf <- mydf[(order(mydf$A) < th),]
removes the second row, which has a value of 9, while my intent is to subset mydf retaining only the values below the threshold (threshold here meaning a number of observations, not a value).
I can imagine it is something very simple and basic that I am missing... And probably there are nicer tidyverse approaches.
I think you want rank here, but it doesn't work on multiple columns. To work around that, note that rank(.) is equivalent to order(order(.)):
rank(mydf$A)
# [1] 4 3 5 1 2
order(order(mydf$A))
# [1] 4 3 5 1 2
With that, we can order on both (all) columns, then order again, then compare the resulting ranks with your th value.
mydf[order(do.call(order, mydf)) < th,]
# A B
# 1 40 12
# 2 9 1345
# 4 1 45
# 5 2 789
This approach benefits from preserving the natural sort of the rows.
If you would prefer to stick with a single call to order, you can sort the rows and take the first th with head:
head(mydf[order(mydf$A, mydf$B),], th)
# A B
# 4 1 45
# 5 2 789
# 2 9 1345
# 1 40 12
though this does not preserve the original order of rows (which may or may not be important to you).
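If you take this head() route but still want the original row order back afterwards, one option (a sketch assuming the default numeric row names are still intact) is to re-sort the result by its row names:
res <- head(mydf[order(mydf$A, mydf$B), ], th)
# restore the original row order using the preserved row names
res[order(as.numeric(rownames(res))), ]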
Possible approach
An alternative to your approach would be to use a dplyr ranking function such as cume_dist() or percent_rank(). These can accept a dataframe as input and return ranks / percentiles based on all columns.
library(dplyr)
set.seed(13)
dat_all <- data.frame(
A = sample(1:60, 100, replace = TRUE),
B = sample(1:1500, 100, replace = TRUE)
)
nrow(dat_all)
# 100
dat_95 <- dat_all[cume_dist(dat_all) <= .95, ]
nrow(dat_95)
# 95
General cautions about quantiles
More generally, keep in mind that defining quantiles is slippery, as there are multiple possible approaches. You'll want to think about what makes the most sense given your goal. As an example, from the dplyr docs:
cume_dist(x) counts the total number of values less than or equal to x_i, and divides it by the number of observations.
percent_rank(x) counts the total number of values less than x_i, and divides it by the number of observations minus 1.
Some implications of this are that the lowest value is always 1 / nrow() for cume_dist() but 0 for percent_rank(), while the highest value is always 1 for both methods. This means different cases might be excluded depending on the method. It also means the code I provided will always remove the highest-ranking row, which may or may not match your expectations. (e.g., in a vector with just 5 elements, is the highest value "above the 95th percentile"? It depends on how you define it.)
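To see the difference concretely, here is a small sketch contrasting the two functions on a 5-element vector:
library(dplyr)
x <- c(5, 1, 3, 2, 4)
cume_dist(x)
# [1] 1.0 0.2 0.6 0.4 0.8
percent_rank(x)
# [1] 1.00 0.00 0.50 0.25 0.75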

Extract 100 sections from a vector

I have a vector of length 1000. It contains (numeric) survey answers of 100 participants, thus 10 answers per participant. I would like to drop the first three values for every participant to create a new vector of length 700 (including only the answers to questions 4-10).
I only know how to extract every n-th value of the vector, but I cannot figure out how to solve the problem above.
vector <- seq(1,1000,1)
Expected output:
4 5 6 7 8 9 10 14 15 16 17 18 19 20 24 ...
Using a matrix to first structure and then flatten is one method. Another somewhat similar method is to use what I am calling a "logical pattern index":
head( # just showing the first couple of "segments"
vector[ c( rep(FALSE, 3), rep(TRUE, 10-3) ) ],
15)
[1] 4 5 6 7 8 9 10 14 15 16 17 18 19 20 24
This method can also be used inside the two-argument version of [ to select rows or columns using a logical pattern index. This works because of R's recycling of logical indices.
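For instance, here is a sketch of that two-argument use on a made-up 20-row matrix; the 10-long pattern recycles down the rows, dropping rows 1-3 and 11-13:
m <- matrix(1:40, nrow = 20)  # 20 rows, 2 columns, for illustration
m[c(rep(FALSE, 3), rep(TRUE, 7)), ]  # pattern of length 10 recycled over 20 rows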
Thanks for providing example data, which makes this thread reproducible. Here is one solution:
c(matrix(vector, 10)[4:10, ])
We first convert the vector to a matrix with 10 rows, so that each column corresponds to one participant. Then we use row subsetting to drop the first three rows. Finally, the matrix is flattened back into a vector.

Extract multiple data.frames from one with selection criteria

Let this be my data set:
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
split = sample( c('SPLITMEHERE', 'OBS'), 1000, replace=TRUE, prob=c(0.04, 0.96) ))
So, I have some variables (in my case, 15), and criteria by which I want to split the data.frame into multiple data.frames.
My criterion is the following: every other time 'SPLITMEHERE' appears, I want to take all the values (all the 'OBS' rows) below it and make a data.frame from just those observations. So, if there are 20 'SPLITMEHERE's in the starting data.frame, I want to end up with 10 data.frames.
I know it sounds confusing and like it doesn't make much sense, but this is the result of extracting the raw numbers from an awfully dirty .txt file. Basically, every 'SPLITMEHERE' denotes a new table in this .txt file, but each county is divided into two tables, so I want one table (data.frame) for each county.
In the hope of making it clearer, here is an example of exactly what I need. Let's say the first 20 observations are:
x1 x2 x3 split
1 0.307379064 0.400526799 0.2898194543 SPLITMEHERE
2 0.465236674 0.915204924 0.5168274657 OBS
3 0.063814420 0.110380201 0.9564822116 OBS
4 0.401881416 0.581895095 0.9443995396 OBS
5 0.495227871 0.054014926 0.9059893533 SPLITMEHERE
6 0.091463620 0.945452614 0.9677482590 OBS
7 0.876123151 0.702328031 0.9739113525 OBS
8 0.413120761 0.441159673 0.4725571219 OBS
9 0.117764512 0.390644966 0.3511555807 OBS
10 0.576699384 0.416279417 0.8961428872 OBS
11 0.854786077 0.164332814 0.1609375612 OBS
12 0.336853841 0.794020157 0.0647337821 SPLITMEHERE
13 0.122690541 0.700047133 0.9701538396 OBS
14 0.733926139 0.785366852 0.8938749305 OBS
15 0.520766503 0.616765349 0.5136788010 OBS
16 0.628549288 0.027319848 0.4509875809 OBS
17 0.944188977 0.913900539 0.3767973795 OBS
18 0.723421337 0.446724318 0.0925365961 OBS
19 0.758001243 0.530991725 0.3916394396 SPLITMEHERE
20 0.888036748 0.862066601 0.6501050976 OBS
What I would like to get is this:
data.frame1:
1 0.465236674 0.915204924 0.5168274657 OBS
2 0.063814420 0.110380201 0.9564822116 OBS
3 0.401881416 0.581895095 0.9443995396 OBS
4 0.091463620 0.945452614 0.9677482590 OBS
5 0.876123151 0.702328031 0.9739113525 OBS
6 0.413120761 0.441159673 0.4725571219 OBS
7 0.117764512 0.390644966 0.3511555807 OBS
8 0.576699384 0.416279417 0.8961428872 OBS
9 0.854786077 0.164332814 0.1609375612 OBS
And
data.frame2:
1 0.122690541 0.700047133 0.9701538396 OBS
2 0.733926139 0.785366852 0.8938749305 OBS
3 0.520766503 0.616765349 0.5136788010 OBS
4 0.628549288 0.027319848 0.4509875809 OBS
5 0.944188977 0.913900539 0.3767973795 OBS
6 0.723421337 0.446724318 0.0925365961 OBS
7 0.888036748 0.862066601 0.6501050976 OBS
Therefore, the split column only shows me where to split; the data in the rows where 'SPLITMEHERE' is written are meaningless. But this is no bother, as I can delete these rows later; the point is separating the data into multiple data.frames based on this criterion.
Obviously, the split() function and filter() from dplyr alone wouldn't suffice here. The real problem is that the lines which are supposed to separate the data.frames (i.e., every other 'SPLITMEHERE') do not appear at regular intervals, just as in my example above: sometimes the gap is 3 lines, other times it could be 10 or 15 lines.
Is there any way to extract this efficiently in R?
The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use split() to get your result.
With that said, you can use cumsum() to build the groups. Here I divide the cumulative sum by 2 and take the ceiling, so that each pair of SPLITMEHERE's collapses into one group. I also use ifelse() to assign group 0 to the SPLITMEHERE rows themselves:
df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split=="SPLITMEHERE")/2), 0)
res <- split(df, df$group)
The result is a list with a dataframe for each group. The group labeled 0 holds the rows you want to throw out.
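To finish the job, a short sketch (assuming res from above): drop the list element named "0", which holds the SPLITMEHERE rows, and optionally drop the helper column:
res <- res[names(res) != "0"]  # throw out the SPLITMEHERE rows
res <- lapply(res, function(d) d[, names(d) != "group"])  # drop the helper column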

Extract the first, second and last row that meets a criterion

I would like to know how to extract the last row that meets a criterion. I have seen the solution for getting the first one, using the duplicated() function, in this link: How do I select the first row in an R data frame that meets certain criteria?.
However, is it possible to get the second or last row that meets a criterion?
I would like to loop over each Class (here I only show two) and select the first, second, and last row that meet the criterion Weight >= 10. If no row meets the criterion, I would like to get NA.
Finally I want to store the three values (first, second, and last row) in a list containing the values for each class.
Class Weight
1 A 20
2 A 15
3 B 10
4 B 23
5 A 11
6 B 12
7 B 11
8 A 25
9 A 7
10 B 3
data.table can help with this.
This is an edit of David's comment, moved into the answers, as his approach is the correct way to do this.
library(data.table)
DT <- as.data.table(db)
DT[Weight >= 10][, .SD[c(1, 2, .N)], by = Class]
As a faster alternative, also from David, look at:
indx <- DT[Weight >= 10][, .I[c(1, 2, .N)], by = Class]$V1 ; DT[indx]
This creates the wanted index using .I and then subsets DT based on those rows.
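As a usage sketch, constructing db by hand from the question's table:
db <- data.frame(Class = c("A","A","B","B","A","B","B","A","A","B"),
                 Weight = c(20, 15, 10, 23, 11, 12, 11, 25, 7, 3))
library(data.table)
DT <- as.data.table(db)
DT[Weight >= 10][, .SD[c(1, 2, .N)], by = Class]
Note that positions beyond the number of qualifying rows come back as NA rows (e.g., the second row of a class with only one qualifying row), while a class with no qualifying rows at all drops out of the filtered table entirely.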

Delete random subsets of variables within a group that have certain values with a list

This is a slight variation on the question Deleting random subset of observations within a group of variables that have a certain value.
The variation I am looking for is how to delete subsets of rows where the number of rows removed changes each time the grouping criteria changes. Here is a simple example data set with a column of numeric values and a numeric grouping column (grouping column can also be a factor like "AA1", "AA2", etc).
set.seed(23)
df<-data.frame(a=round(rnorm(500,mean=20,sd=2)))
df$group<-seq(from = 1, to = length (df),by=5)
A table of the data, table(df$a), gives this result:
group: 14 15 16 17 18 19 20 21 22 23 24 25
count: 1 7 13 24 65 87 91 91 59 42 12 8
For example: when the grouping value is equal to 15, I want to randomly remove 4 rows; when the group is 16, randomly remove 3 rows; when the group is 17, randomly remove 7 rows. This process continues for each grouping value.
Here is my current solution:
(dfindex<-which(df$a==15)) ##create index that meets the grouping variable criteria
(delete.df.index<-sample(dfindex,4)) ##select number of rows to randomly remove
dfnew<-df[-delete.df.index,] ##create a new data frame and delete the randomly selected rows
Repeat the steps from above on the newly created data frame:
(dfindex<-which(dfnew$a==16)) ##create another index from the grouping variable criteria
(delete.df.index<-sample(dfindex,3)) ##select rows to randomly delete
dfnew<-dfnew[-delete.df.index,] ##delete rows
Repeat for each combination of grouping variable and number of rows to remove.
(dfindex<-which(dfnew$a==17))
(delete.df.index<-sample(dfindex,7))
dfnew<-dfnew[-delete.df.index,]
With this example, I have 12 grouping levels. The simple but time-consuming approach is to copy, paste, and edit the code for each combination of grouping variable and number of rows to remove. I was wondering if it would be possible to use a table (or something similar) to specify the grouping values and the number of rows to remove for each grouping value:
Example table of groups and rows to remove.
Group Number of rows to randomly remove
14 0
15 4
16 3
17 7
18 40
19 23
Thanks in advance for any input.
Try running this:
set.seed(23)
df<-data.frame(a=round(rnorm(50,mean=20,sd=2)))
# create table of no of rows that need to be removed per each a
noofrowsremove <- read.table(textConnection(
'a toremove
21 1
23 2
15 2
17 1
19 2
20 2
24 2
16 1
22 1
18 3'), header = TRUE)
library(data.table)
# assign random number in a new column, this will help in sampling
df$tosample <- runif(50)
# convert data.frame to data.table, grouped operations are easier on data.table
dt <- data.table(df)
# rank the tosample column within each unique a value
dt[,samplerank := rank(tosample), by = 'a']
# merge the filtering no of rows with dt
dt <- merge(dt,noofrowsremove, by = 'a')
# filter out rows that have samplerank columns <= the no of rows that need to be removed
dttrimmed <- dt[samplerank > toremove]
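As a quick sanity check (a sketch), compare per-value row counts before and after trimming; each value of a listed in noofrowsremove should lose exactly toremove rows. Note that with this inner merge, values of a absent from noofrowsremove disappear entirely, which is exactly the detail the follow-up below documents:
table(df$a)
table(dttrimmed$a)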
After working through the answer provided by Codoremifa, I noticed a few small details that might be worth documenting for others who find this post. Starting from Codoremifa's answer, I made a few small changes and added a little extra code to illustrate a few important details. Basically, pay attention to the merge step and decide how to handle the NA values it generates.
set.seed(23)
df<-data.frame(a=round(rnorm(50,mean=20,sd=2)))
# create table of no of rows that need to be removed per each a
noofrowsremove <- read.table(textConnection(
'a toremove
21 0
17 1
19 2
20 2
24 2
16 1
22 1
18 3'), header = TRUE)
##excluded values 23 and 15 from the noofrowsremove table above to illustrate an example below
#Kept value 21 and just assigned it 0 (i.e., do not remove any values of 21).
library(data.table)
# assign random number in a new column, this will help in sampling
df$tosample <- runif(50) #can also use runif(nrow(df))
# convert data.frame to data.table, grouped operations are easier on data.table
dt <- data.table(df)
# rank the tosample column within each unique a value
dt[,samplerank := rank(tosample), by = 'a']
# merge the filtering no of rows with dt. Be careful with merge options.
dt1 <- merge(dt,noofrowsremove, by = 'a') #46 rows
dt2 <- merge(dt,noofrowsremove, by = 'a',all=TRUE) #51 rows.
#Notice slight differences in the number of rows between dt1 and dt2
#In dt2, the toremove column is NA for value 23 because 23 was not included in noofrowsremove
nrow(dt1) #46 rows
nrow(dt2) #51 rows
##to keep values with "NA" change the "NA" to a 0
dt2$toremove[is.na(dt2$toremove)] <- 0 #assign NA to 0
# filter out rows that have samplerank columns <= the no of rows that need to be removed
dttrimmed1 <- dt1[samplerank > toremove] #36 rows. toremove values with NA are excluded
dttrimmed2 <- dt2[samplerank > toremove] #40 rows. Kept values with NA reassigned to 0