How to select rows based on median value in R?

I am quite new to R and cannot figure this one out. Let's say I have a data frame with four columns. The first column determines group membership, the second column should be used for filtering, and the last two columns should just follow along. It looks like this:
> test.data
group filter a b
first 1 1 2
first 2 3 1
first 3 2 3
second 1 2 1
second 2 2 5
second 3 3 1
second 4 3 1
For each group, I would like to find the median of the filter column. The same row (or rows) should then be used for columns a and b: if the group has an odd number of rows, return the middle row; if it is even, return the mean of the two middle rows.
The result should be:
group filter a b
first 2 3 1
second 2.5 2.5 3
When using dplyr, I can calculate the median of each column independently, but not select the rows according to the median of the filter column:
median.data <- test.data %>% group_by(group) %>% summarise_all(funs(median))
> median.data
group filter a b
first 2.0 2.0 2
second 2.5 2.5 1
When using tapply, I can calculate the median, but don't know how to also take the other columns into account:
median.data <- tapply(test.data$filter, test.data$group, median)
> median.data
first second
2.0 2.5
Then I figured that I should try to write a function myself that performs the steps below.
for each group:
order by column "filter"
extract middle row, two rows if even
calculate mean
But then I got stuck on how to find the middle (or two middle) rows...
Do you have any suggestions on how to solve it? Any help would be greatly appreciated!
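A base-R sketch of exactly those steps (sort each group by filter, take the middle row or the two middle rows, then average), with the data frame rebuilt from the table above:

```r
test.data <- data.frame(
  group  = c("first", "first", "first", "second", "second", "second", "second"),
  filter = c(1, 2, 3, 1, 2, 3, 4),
  a      = c(1, 3, 2, 2, 2, 3, 3),
  b      = c(2, 1, 3, 1, 5, 1, 1)
)

# For one group: order by "filter", take the middle row (two middle
# rows when the group size is even), and average each column.
middle_rows <- function(d) {
  d <- d[order(d$filter), ]
  n <- nrow(d)
  idx <- ceiling(n / 2):ceiling((n + 1) / 2)  # 1 index if n is odd, 2 if even
  colMeans(d[idx, c("filter", "a", "b")])
}

result <- do.call(rbind, lapply(split(test.data, test.data$group), middle_rows))
result
# first:  filter 2.0, a 3.0, b 1
# second: filter 2.5, a 2.5, b 3
```

For n = 3 the index range is 2:2 (just the middle row); for n = 4 it is 2:3 (the two middle rows), so the odd/even distinction falls out of the arithmetic.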

Related

How to find a total of row values in R

I am trying to find the total of rows that have a column value of 3 or 4. The first row has only one value of 3, so if I create a new column
currentdx_count1$TotalDiagnoses
that new column, TotalDiagnoses, should have a value of 1 for the first row. I have tried
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[2:32])
This doesn't give me what I need because it literally sums up the whole row. Is there an existing function that does what I want to do, or will I have to write one? Could I specify more in rowSums for it to work as I need it to?
Thanks for any and all help.
Edit: I'm trying to adapt a method I use earlier in my script that works for a similar purpose
findtotal <- endsWith(names(currentdx_count1), 'Current')
findtotal <- lapply(findtotal, `>`, 2)
findtotal <- unlist(findtotal)
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
I get an error which I have never seen before (an error in view?!)
So I tried just this
findtotal <- endsWith(names(currentdx_count1), 'Current')
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
Gets me closer but it is finding the total count for each column separately which is not what I need. I want a single column to encompass counts for each SID.
You can compare the data frame with the value of 3 or 4 and then use rowSums to count:
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[-1] == 3 |
currentdx_count1[-1] == 4)
currentdx_count1$TotalDiagnoses
#[1] 1 2 2 2 1 1 1 1 1 1 1 1 1 2
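If the set of target values ever grows beyond 3 and 4, the same count can be written with %in%. A sketch on made-up data, since the real currentdx_count1 isn't shown (the first column is assumed to be the ID):

```r
# Hypothetical stand-in for currentdx_count1:
dx <- data.frame(SID = 1:4,
                 dx1 = c(3, 1, 4, 2),
                 dx2 = c(0, 3, 4, 5))
targets <- c(3, 4)

# One logical vector per diagnosis column, then count TRUEs per row:
dx$TotalDiagnoses <- rowSums(sapply(dx[-1], `%in%`, targets))
dx$TotalDiagnoses
# [1] 1 1 2 0
```

Adding a value to the target set is then a one-place change to `targets` rather than another `| ... == k` clause.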

Removing/collapsing duplicate rows in R

I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I hope to do (remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed/collapsed. A comment said it was based on the median absolute deviation (MAD), but I am not following that. Could anyone help me understand this, please?
Probesets <- paste("a", 1:200, sep = "")
Genes <- sample(letters, 200, replace = TRUE)
Value <- rnorm(200)
X <- data.frame(Probesets, Genes, Value)
X <- X[order(X$Value, decreasing = TRUE), ]
Y <- X[which(!duplicated(X$Genes)), ]
Are you sure you want to remove the rows where the Genes values are duplicated? That's at least what this code does:
Y <- X[which(!duplicated(X$Genes)), ]
Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)) you will see that the result is the same:
nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:
Y=X[!duplicated(X),]
To see how it works consider this example:
df <- data.frame(
a = c(1,1,2,3),
b = c(1,1,3,4)
)
df
a b
1 1 1
2 1 1
3 2 3
4 3 4
df[!duplicated(df),]
a b
1 1 1
3 2 3
4 3 4
Your code keeps, for each gene, the record with the maximum Value.

Extract the first, second and last row that meets a criterion

I would like to know how to extract the last row that meets a criterion. I have seen the solution for getting the first one, using the duplicated function, in the next link: How do I select the first row in an R data frame that meets certain criteria?.
However, is it possible to get the second or last row that meets a criterion?
I would like to make a loop for each Class (here I only show two) and select the first, second, and last row that meet the criterion Weight >= 10. If no row meets the criterion, NA should be returned.
Finally I want to store the three values (first, second, and last row) in a list containing the values for each class.
Class Weight
1 A 20
2 A 15
3 B 10
4 B 23
5 A 11
6 B 12
7 B 11
8 A 25
9 A 7
10 B 3
data.table can help with this.
This is an edit of David's comment, moved into the answers, as his approach is the correct way to do this.
library(data.table)
DT <- as.data.table(db)
DT[Weight >= 10][, .SD[c(1, 2, .N)], by = Class]
As a faster alternative, also from David, look at
indx <- DT[Weight >= 10][, .I[c(1, 2, .N)], by = Class]$V1 ; DT[indx]
Which creates the wanted index using .I and then subsets DT based on those rows.
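To make this reproducible, here is the example table from the question rebuilt as a data frame (assuming db is exactly as shown) and run through the answer's expression:

```r
library(data.table)

db <- data.frame(Class  = c("A", "A", "B", "B", "A", "B", "B", "A", "A", "B"),
                 Weight = c(20, 15, 10, 23, 11, 12, 11, 25, 7, 3))
DT <- as.data.table(db)

# First, second and last row per Class among rows with Weight >= 10:
res <- DT[Weight >= 10][, .SD[c(1, 2, .N)], by = Class]
res
# Class A rows: Weight 20, 15, 25; Class B rows: Weight 10, 23, 11
```

Note that if a class had only one qualifying row, `.SD[c(1, 2, .N)]` would return an NA row for the missing second position, which matches the NA behaviour the question asks for.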

R: Compare a column of a data.table with a vector

I have a column of a data.table:
DT = data.table(R=c(3,8,5,4,6,7))
Further on I have a vector of upper cluster limits for the cluster 1, 2, 3 and 4:
CP=c(2,4,6,8)
Now I want to compare each entry of R with the elements of CP, considering the order of CP. The result shall be a column NoC in DT whose entries are just the number of the cluster that each element of R belongs to:
DT[,NoC:=c(2,4,3,2,3,4)]
(I need the cluster number to choose a factor out of another data.table.)
For example, take the 1st entry of R: 3 is not smaller than 2 (out of CP), but smaller than 4 (out of CP). So 3 belongs to cluster 2.
Another example, take the 6th entry of R: 7 is not smaller than 2, 4, or 6 (out of CP), but smaller than 8 (out of CP). So 7 belongs to cluster 4.
How can I do that without using if-clauses?
You can accomplish this using rolling joins:
data.table(CP, key="CP")[DT, roll=-Inf, which=TRUE]
# [1] 2 4 3 2 3 4
roll=-Inf performs a NOCB rolling join (Next Observation Carried Backward). That is, when a value falls in a gap, the next observation is rolled backward. For example, 7 falls between 6 and 8; the next value, 8, is rolled backward. We simply get the corresponding index of each match using which=TRUE.
You can just add this as a column to DT using := as you've shown.
Note that this will return the indices after ordering CP. In your example, CP is already ordered, so it returns the result as intended. If CP is not already ordered, you'll have to add an additional column and extract that column instead of using which=TRUE. But I'll leave it to you to work it out.
From your description this would seem to be the code to deliver the correct answers, but Arun, a most skillful data.table user, has come up with a completely different way to fit your expectations, so I think there must be a different way of reading your requirements.
> DT[, NoC := findInterval(R, c(0, 2, 4, 6, 8), rightmost.closed = TRUE)]
> DT
R NoC
1: 3 2
2: 8 4
3: 5 3
4: 4 3
5: 6 4
6: 7 4
I'm also very puzzled that findInterval is assigning the 5th item to the 4th interval since 6 is not greater than the upper boundary of the third interval (6).
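If the upper limits in CP are meant to be inclusive, so that 6 still lands in cluster 3 as the expected NoC column implies, two base-R sketches that reproduce the requested result:

```r
R  <- c(3, 8, 5, 4, 6, 7)
CP <- c(2, 4, 6, 8)   # upper cluster limits, treated as inclusive

# Cluster = index of the first upper limit the value does not exceed:
noc <- sapply(R, function(x) which(x <= CP)[1])
noc
# [1] 2 4 3 2 3 4

# Equivalently, findInterval with left-open intervals (0,2], (2,4], ...
# (left.open is available in recent R versions):
findInterval(R, c(0, CP), left.open = TRUE)
# [1] 2 4 3 2 3 4
```

The left.open variant is likely what the findInterval answer above was reaching for: it closes each interval on the right, so 6 falls in (4,6] and gets cluster 3.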

How to bin data based on values in one column, and sum occurrences from another column in R?

I have a dataframe df and want to bin rows using data from column A, and then for each bin, count the number of times that a value is present in another column B. Here is an example using only 2 columns (although my real example has many columns):
A B
5.4
4.6 36_8365
2.4
3.6
0.6
8.9 83_7433
4
7.6
4.7 54_3874
1.5 54_8364
I want to look in column A and find all values less than 1, at least 1 but less than 2, and so on; for each bin, I want to count the number of times that a value appears in column B. For the table above, this would give the following results:
Class Number
<1 0
1<=A<2 1
2<=A<3 0
3<=A<4 0
4<=A<5 2
5<=A<6 0
6<=A<7 0
7<=A<8 0
8<=A<9 1
9<=A<10 0
The following is close, but it will sum the values when instead I just want to count them:
with(df, sum(df[A >= 1 & A < 2, "B"]))
I'm not sure what to replace "sum" with to get just counts, instead of a sum. I know I can identify which rows in column B have a value by using
thing <- B==''
or make a table using
thing_table <- table(B=='')
However, I'm not sure how to search through column A, test if the value is between 2 other values, and then count the items in B that meet those criteria. Can anyone point me in the right direction?
Thanks!
First:
newdf <- na.omit(df)
This will shrink the df down to only rows with data in them. Make sure the empty cells are showing up as NAs before attempting.
Second:
Replace sum with length:
with(newdf, length(newdf[A >= 1 & A < 2, "B"]))
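A more direct route, assuming the blank B cells read in as NA, is to bin A with cut() and tabulate only the rows where B is present. A sketch on the data from the question:

```r
df <- data.frame(
  A = c(5.4, 4.6, 2.4, 3.6, 0.6, 8.9, 4, 7.6, 4.7, 1.5),
  B = c(NA, "36_8365", NA, NA, NA, "83_7433", NA, NA, "54_3874", "54_8364")
)

# Bin A into [0,1), [1,2), ..., [9,10); right = FALSE closes the
# left edge, matching 1 <= A < 2 and so on.
bins <- cut(df$A, breaks = 0:10, right = FALSE)

# Count, per bin, the rows where B actually has a value; table()
# keeps zero counts for empty bins because bins is a factor:
counts <- table(bins[!is.na(df$B)])
counts
#  [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)  [6,7)  [7,8)  [8,9) [9,10)
#      0      1      0      0      2      0      0      0      1      0
```

This reproduces the Class/Number table in the question in one pass, without looping over the bins by hand.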
