I have a data frame that I want to reorder based on column means, in decreasing order:
SNR SignalIntensity ID
1 1.0035798 6.817374 109
2 11.9438978 11.545993 110
4 3.2894878 9.780420 112
5 4.0170266 9.871984 113
6 1.6310523 9.078186 114
7 1.6405415 8.228931 116
8 1.6625413 8.043536 117
9 0.8489116 6.179346 118
10 7.5312260 10.558180 119
11 7.2832911 10.474533 120
12 0.5732577 4.157294 121
14 0.8149754 6.045174 124
I use the following code
means <- colMeans(df)      ## to get the column means
df <- df[, order(means)]   ## to reorder
to get the column means and the ordering, but the columns come back in increasing order of mean, the opposite of what I want. What should I do to reorder by decreasing column mean?
Expected output:
ID SignalIntensity SNR
1 109 6.817374 1.0035798
2 110 11.545993 11.9438978
4 112 9.780420 3.2894878
5 113 9.871984 4.0170266
6 114 9.078186 1.6310523
7 116 8.228931 1.6405415
8 117 8.043536 1.6625413
9 118 6.179346 0.8489116
10 119 10.558180 7.5312260
11 120 10.474533 7.2832911
12 121 4.157294 0.5732577
14 124 6.045174 0.8149754
The default setting in order is decreasing=FALSE. We can change that to TRUE:
df[, order(means, decreasing=TRUE)]
Or order on the negated values of 'means':
df[, order(-means)]
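Putting it together, a minimal self-contained sketch using the first few rows of the data above:
df <- data.frame(SNR = c(1.0035798, 11.9438978, 3.2894878),
                 SignalIntensity = c(6.817374, 11.545993, 9.780420),
                 ID = c(109, 110, 112))
means <- colMeans(df)                        # named vector of column means
df <- df[, order(means, decreasing = TRUE)]  # highest-mean column first
df                                           # columns now: ID, SignalIntensity, SNR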
I want to group on a column and then perform calculations using only some of the rows per group.
The data set I have is:
LoanRefId Tran_Type TransactionAmount
103 11 LoanIssue 1000.0000
104 11 InitiationFee 171.0000
105 11 Interest 59.6729
106 11 AdministrationFee 64.9332
107 11 RaisedClientInstallment 1295.5757
108 11 ClientInstallment 1295.4700
109 11 PaidUp 0.0000
110 11 Adjustment 0.1361
111 11 PaidUp 0.0000
112 12 LoanIssue 3000.0000
113 12 InitiationFee 399.0000
114 12 Interest 94.9858
115 12 AdministrationFee 38.6975
116 12 RaisedClientInstallment 3532.6350
117 12 ClientInstallment 3532.6100
118 12 PaidUp 0.0000
119 12 Adjustment 0.0733
120 12 PaidUp 0.0000
I would like to repeat the following calculation for each group:
ClientInstallment - LoanIssue.
So, group 1 will be for LoanRefId number 11. The calculation will take the ClientInstallment of 1295.47 and subtract the LoanIssue of 1000 to give me a new column, call it "Income", with value 295.47.
Is this possible using data.table or dplyr or any other clever tricks?
Alternatively, I could create two data summaries, one for ClientInstallment and one for LoanIssue, and then subtract them, but the truth is I need to do much more than just subtract two numbers, so I would need a data summary for each calculation, which is just plain unclever imho.
Any help is appreciated.
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)); then, grouped by 'LoanRefId', we extract the 'TransactionAmount' values where 'Tran_Type' is 'ClientInstallment' and 'LoanIssue' and subtract them.
library(data.table)
setDT(df1)[, list(Income = TransactionAmount[Tran_Type == 'ClientInstallment'] -
                           TransactionAmount[Tran_Type == 'LoanIssue']), by = LoanRefId]
# LoanRefId Income
#1: 11 295.47
#2: 12 532.61
We can also use dplyr with a similar approach
library(dplyr)
df1 %>%
  group_by(LoanRefId) %>%
  summarise(Income = TransactionAmount[Tran_Type == 'ClientInstallment'] -
                     TransactionAmount[Tran_Type == 'LoanIssue'])
Update
If we don't have a 'ClientInstallment' or a 'LoanIssue' for a 'LoanRefId', we can use an if/else condition:
setDT(df1)[, list(Income = if (any(Tran_Type == 'ClientInstallment') &
                               any(Tran_Type == 'LoanIssue'))
                    TransactionAmount[Tran_Type == 'ClientInstallment'] -
                    TransactionAmount[Tran_Type == 'LoanIssue'] else 0), by = LoanRefId]
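The same guard can be written in dplyr; a sketch of my own translation (not from the original answer), using the same column names:
library(dplyr)
df1 %>%
  group_by(LoanRefId) %>%
  summarise(Income = if (any(Tran_Type == 'ClientInstallment') &&
                         any(Tran_Type == 'LoanIssue'))
                       TransactionAmount[Tran_Type == 'ClientInstallment'] -
                       TransactionAmount[Tran_Type == 'LoanIssue']
                     else 0)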
I am trying to find the most efficient way to split a list of numbers into bins by value and then calculate a cumulative sum for each successive category.
I can't seem to get the bin labels from this onto the plot.
> scores
[1] 115 119 119 134 121 128 128 152 97 108 98 130 108 110 111 122 106 142 143 140 141 151 125 126
> table(cut(scores,breaks=10))
(96.9,102] (102,108] (108,113] (113,119] (119,124] (124,130] (130,136] (136,141] (141,147] (147,152]
2 1 4 1 4 5 1 2 2 2
> cumsum(table(cut(scores,breaks=10)))
(96.9,102] (102,108] (108,113] (113,119] (119,124] (124,130] (130,136] (136,141] (141,147] (147,152]
2 3 7 8 12 17 18 20 22 24
> plot(100*cumsum(table(cut(scores,breaks=10)))/length(scores),ylab="percent of scores")
> lines(100*cumsum(table(cut(scores,breaks=10)))/length(scores))
This produces an acceptable plot, but the x axis shows index values (2, 4, 6, ...). How can I get the bin labels (96.9, 102, etc.) instead? Is there a better way to do this?
You can pass xaxt = "n" to suppress the default x axis, then draw it yourself with axis, retrieving the bin labels with names:
plot(100*cumsum(table(cut(scores,breaks=10)))/length(scores),ylab="percent of scores", xaxt = "n")
lines(100*cumsum(table(cut(scores,breaks=10)))/length(scores))
axis(1, 1:10, names(table(cut(scores,breaks=10))))
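Computing the binned table once and reusing it avoids repeating the cut call; a small refactor of the same idea:
tab <- table(cut(scores, breaks = 10))       # counts per bin
pct <- 100 * cumsum(tab) / length(scores)    # cumulative percent of scores
plot(pct, ylab = "percent of scores", xaxt = "n")
lines(pct)
axis(1, seq_along(pct), names(tab))          # bin labels on the x axis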
I have a data frame with 20 columns. I need to filter out noise from one column. After filtering with the convolve function I get a new vector of values, and many values of the original column are lost in the filtering process. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values, but I can't bind the filtered column to the original table because the numbers of rows differ. Let me illustrate using the 'age' column of the 'Orange' data set in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
Convolve filter used:
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D/delta))  # exponential kernel of width 2*D + 1
  # normalised weighted moving average; type='filter' already drops D values at each end
  r <- convolve(x, z, type = 'filter')/convolve(rep(1, length(x)), z, type = 'filter')
  r <- head(tail(r, -D), -D)  # drop another D values from each end
  r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The age and age2 columns have 35 and 15 values, respectively. The original dataset has 2 more columns that I want to work with as well. Now, I only need the 15 rows of each column corresponding to the 15 rows of the age2 column. The filter here removed the first and last ten values from the age column. How can I apply the filter so that I get a truncated dataset with all columns and only the filtered rows?
You need to figure out how the variables line up. If you pad age2 with NAs, you can do Orange$age2 <- age2 followed by na.omit(Orange) to get what you want. Or, equivalently, perhaps this is what you are looking for:
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know the first and last x observations will be removed then the following works:
x <- 10
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2
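For completeness, a sketch of the NA-padding route mentioned at the start of this answer (my own illustration; here the filter dropped x = 10 values from each end):
x <- 10
Orange$age2 <- c(rep(NA, x), age2, rep(NA, x))  # pad age2 back to the full 35 rows
df <- na.omit(Orange)                           # keep only the rows where age2 exists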
The conversion of probe ids to entrez ids is quite straightforward
library(ath1121501.db)  # Bioconductor annotation package; also provides AnnotationDbi's select
i1 <- c("246653_at", "246897_at", "251347_at", "252988_at", "255528_at", "256535_at", "257203_at", "257582_at", "258807_at", "261509_at", "265050_at", "265672_at")
select(ath1121501.db, i1, "ENTREZID", "PROBEID")
PROBEID ENTREZID
1 246653_at 833474
2 246897_at 832631
3 251347_at 825272
4 252988_at 829998
5 255528_at 827380
6 256535_at 840223
7 257203_at 821955
8 257582_at 841494
9 258807_at 819558
10 261509_at 843504
11 265050_at 841636
12 265672_at 817757
But I am unsure how to do it for a long list of vectors resulting from a clustering, storing each element as ENTREZ ids instead of probe ids:
For instance:
[[1]]
247964_at 248684_at 249126_at 249214_at 250223_at 253620_at 254907_at 259897_at 261256_at 267126_s_at
28 40 44 45 54 95 108 152 171 229
[[2]]
248230_at 250869_at 259765_at 265948_at 266221_at
33 64 151 216 221
[[3]]
245385_at 247282_at 248967_at 250180_at 250881_at 251073_at 53874_at 256093_at 257054_at 260007_at
5 22 42 52 65 67 101 117 125 155
261868_s_at 263136_at 267497_at
181 195 232
It should be something like
[[1]]
"835761","834904","834356","834281","831256","829175","826721","843479","837084","816891","816892"
and similarly for the other list elements.
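A minimal sketch of one way to do this, assuming the clusters are stored as a list of named vectors (as printed above) in a variable I'll call clusters; the probe ids are the names of each vector, and AnnotationDbi's mapIds does the one-column lookup:
library(ath1121501.db)
entrez <- lapply(clusters, function(v)
  unname(mapIds(ath1121501.db, keys = names(v),
                column = "ENTREZID", keytype = "PROBEID")))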
I'm looking at some ecological data (diet) and trying to work out how to group by predator. I would like to be able to extract the data so that I can look at the weights of each individual prey species for each predator, i.e. work out the mean weight of each species eaten by, e.g., Predator 117. I've put a sample of my data below.
Predator PreySpecies PreyWeight
1 114 10 4.2035496
2 114 10 1.6307026
3 115 1 407.7279775
4 115 1 255.5430495
5 117 10 4.2503708
6 117 10 3.6268814
7 117 10 6.4342073
8 117 10 1.8590861
9 117 10 2.3181421
10 117 10 0.9749844
11 117 10 0.7424772
12 117 15 4.2803743
13 118 1 126.8559155
14 118 1 276.0256158
15 118 1 123.0529734
16 118 1 427.1129793
17 118 3 237.0437606
18 120 1 345.1957190
19 121 1 160.6688815
You can use the aggregate function as follows:
aggregate(formula = PreyWeight ~ Predator + PreySpecies, data = diet, FUN = mean)
# Predator PreySpecies PreyWeight
# 1 115 1 331.635514
# 2 118 1 238.261871
# 3 120 1 345.195719
# 4 121 1 160.668881
# 5 118 3 237.043761
# 6 114 10 2.917126
# 7 117 10 2.886593
# 8 117 15 4.280374
There are a few different ways of getting what you want:
The aggregate function. Probably what you are after.
aggregate(PreyWeight ~ Predator + PreySpecies, data=dd, FUN=mean)
tapply: Very useful, but it only splits the variable by a single factor, so we need to create a joint factor with the paste command:
tapply(dd$PreyWeight, paste(dd$Predator, dd$PreySpecies), mean)
ddply: Part of the plyr package. Very useful. Worth learning.
require(plyr)
ddply(dd, .(Predator, PreySpecies), summarise, mean(PreyWeight))
dcast: The output is in more of a table format. Part of the reshape2 package.
require(reshape2)
dcast(dd, Predator ~ PreySpecies, fun.aggregate = mean, fill = 0, value.var = "PreyWeight")
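For completeness, the same grouping can also be done with dplyr (my addition, following the dplyr pattern used earlier in this document):
library(dplyr)
dd %>%
  group_by(Predator, PreySpecies) %>%
  summarise(PreyWeight = mean(PreyWeight))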
Or, for a single predator, plain subsetting does the job:
mean(dd$PreyWeight[dd$Predator == 117])