R delete first and last x % of rows


I have a data frame with 3 ID variables, then several values for each ID.
user Log Pass Value
2 2 123 342
2 2 123 543
2 2 123 231
2 2 124 257
2 2 124 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
The start and end of each set of values is sometimes noisy, and I want to be able to delete the first and last few values. Unfortunately the number of values varies significantly, but it is always the first and last 20% of values that are noisy.
I want to delete the first and last 20% of rows in each group, with a minimum of 1 row deleted from each end.
So for instance, if there are 20 values for user 2 log 2 pass 123, I want to delete the first and last 4 rows. If there are only 3 values for the ID variables, I want to delete the first and last row.
The resulting dataset would be:
user Log Pass Value
2 2 123 543
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
I've tried fiddling around with nrow but I struggle to figure out how to reference the % of rows by id variable.
Thanks.
Jonathan.

I believe the following can do it.
DATA.
dat <-
structure(list(user = c(2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), Log = c(2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), Pass = c(123L, 123L, 123L, 124L, 124L, 125L, 125L,
125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L,
125L, 125L, 125L), Value = c(342L, 543L, 231L, 257L, 342L, 543L,
231L, 257L, 342L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L,
257L, 543L, 231L, 257L)), .Names = c("user", "Log", "Pass", "Value"
), class = "data.frame", row.names = c(NA, -21L))
CODE.
fun <- function(x, p = 0.20){
n <- nrow(x)
m <- max(1, round(n*p))
inx <- c(seq_len(m), n - seq_len(m) + 1)
x[-inx, ]
}
result <- do.call(rbind, lapply(split(dat, dat$user), fun))
row.names(result) <- NULL
result
# user Log Pass Value
#1 2 2 123 543
#2 2 2 123 231
#3 2 2 124 257
#4 4 3 125 342
#5 4 3 125 543
#6 4 3 125 231
#7 4 3 125 257
#8 4 3 125 543
#9 4 3 125 231
#10 4 3 125 257
#11 4 3 125 543
#12 4 3 125 231
#13 4 3 125 257
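The code above splits by user only and rounds to the nearest row, so its result differs from the output requested in the question. A minimal sketch of a variant follows, assuming the groups are meant to be the combination of user, Log and Pass, and that the 20% should be rounded up (16 rows gives 4 per end). The name trim_ends is hypothetical, and groups too small to leave anything are assumed to be dropped entirely.

```r
# Rebuild the example data from the question.
dat <- data.frame(
  user  = rep(c(2L, 4L), c(5, 16)),
  Log   = rep(c(2L, 3L), c(5, 16)),
  Pass  = rep(c(123L, 124L, 125L), c(3, 2, 16)),
  Value = c(342L, 543L, 231L, 257L, 342L,
            543L, 231L, 257L, 342L, rep(c(543L, 231L, 257L), 4))
)

# Hypothetical helper: drop ceiling(p * n) rows (at least 1) from each end.
trim_ends <- function(x, p = 0.20) {
  n <- nrow(x)
  m <- max(1, ceiling(n * p))
  # Assumption: a group with nothing left after trimming is removed.
  if (n <= 2 * m) return(x[0, , drop = FALSE])
  x[(m + 1):(n - m), , drop = FALSE]
}

# Split on all three ID columns; drop = TRUE skips empty combinations.
groups <- split(dat, list(dat$user, dat$Log, dat$Pass), drop = TRUE)
result <- do.call(rbind, lapply(groups, trim_ends))
row.names(result) <- NULL
result
```

Using an explicit (m + 1):(n - m) range instead of negative indices avoids the case where the head and tail indices overlap in very small groups.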

Would something like this help?
For a dataframe df:
df[-c(1:floor(nrow(df)*0.2), (1+ceiling(nrow(df)*0.8)):nrow(df)),]
This simply removes the first and last 20% of rows of a single data frame. The cut points are rounded with floor and ceiling so that a small data frame still keeps some of its rows:
> df<-data.frame(a=1:100)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[31] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
> df<-data.frame(1:3)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 2

You can do this with dplyr...
library(dplyr)
df2 <- df %>% group_by(user, Log, Pass) %>%
filter(n()>2) %>% #remove those with just two elements or fewer
slice(max(2, 1+ceiling(n()*0.2)):min(n()-1, floor(0.8*n())))
df2
user Log Pass Value
1 2 2 123 543
2 4 3 125 543
3 4 3 125 231
4 4 3 125 257
5 4 3 125 543
6 4 3 125 231
7 4 3 125 257
8 4 3 125 543
9 4 3 125 231

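Calculate the index of the first row you want to retain:
rem <- ceiling( nrow( dat ) * .2 ) + 1
Then keep only the rows between the two trimmed ends (the last retained row is nrow( dat ) - rem + 1, so the same number of rows is dropped from each end):
dat <- dat[ rem : ( nrow( dat ) - rem + 1 ), ]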

Here is an idea using base R that returns the row indices of each user to keep and then subsets on these indices.
idx <- unlist(lapply(split(seq_along(dat[["user"]]), dat[["user"]]), function(x) {
tmp <- max(1, ceiling(.2 * length(x)))
tail(head(x, -tmp), -tmp)}),
use.names=FALSE)
split(seq_along(dat[["user"]]), dat[["user"]]) returns a list of the row indices for each user. lapply loops through these index vectors, calculating the number of rows to drop from each end with max(1, ceiling(.2 * length(x))), and then dropping them with tail(head(x, -tmp), -tmp). Since lapply returns a named list, the result is unlisted and the names are dropped.
This returns
idx
2 3 4 10 11 12 13 14 15 16 17
Now subset
dat[idx,]
user Log Pass Value
2 2 2 123 543
3 2 2 123 231
4 2 2 124 257
10 4 3 125 543
11 4 3 125 231
12 4 3 125 257
13 4 3 125 543
14 4 3 125 231
15 4 3 125 257
16 4 3 125 543
17 4 3 125 231

Related

Creating differences in a new column for certain dates in R

I have a data frame that looks like this:
Date Value1 Value 2 Value 3
1997Q1 100 130 120
1997Q1 100 130 124
1997Q1 120 136 154
1997Q2 180 145 154
1997Q2 186 134 126
1997Q2 186 124 176
1997Q3 190 143 176
1997Q3 192 143 123
I would like to calculate the differences between consecutive values within the same date, for example the differences in the Value1 column for 1997Q1, then for 1997Q2, and so on.
I would like these differences to be shown in new columns, so that the result would look something like this:
Date Value1 Value 2 Value 3 Diff Val 1 Diff Val 2 Diff Val 3
1997Q1 100 130 120 0 0 4
1997Q1 100 130 124 20 6 30
1997Q1 120 136 154 N/A N/A N/A
1997Q2 180 145 154 6 -11 -28
1997Q2 186 134 126 0 10 50
1997Q2 186 124 176 N/A N/A N/A
1997Q3 190 143 176 2 0 -53
1997Q3 192 143 123

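You can use dplyr functions for this. The ~ .x - lead(.x) is the function applied to every value column selected with starts_with: it takes the current value minus the next value within each Date group. If you want the opposite sign (next value minus current value, as in the expected output in the question), switch it around to ~ lead(.x) - .x. If you need lag instead, use ~ lag(.x) - .x.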
library(dplyr)
df1 %>%
group_by(Date) %>%
mutate(across(starts_with("Value"), ~.x - lead(.x), .names = "diff_{.col}"))
If the values are numeric and the column names are not easily matched, you can use mutate(across(where(is.numeric), ~.x - lead(.x), .names = "diff_{.col}")) instead.
# A tibble: 8 × 7
# Groups: Date [3]
Date Value1 Value2 Value3 diff_Value1 diff_Value2 diff_Value3
<chr> <int> <int> <int> <int> <int> <int>
1 1997Q1 100 130 120 0 0 -4
2 1997Q1 100 130 124 -20 -6 -30
3 1997Q1 120 136 154 NA NA NA
4 1997Q2 180 145 154 -6 11 28
5 1997Q2 186 134 126 0 10 -50
6 1997Q2 186 124 176 NA NA NA
7 1997Q3 190 143 176 -2 0 53
8 1997Q3 192 143 123 NA NA NA
data:
df1 <- structure(list(Date = c("1997Q1", "1997Q1", "1997Q1", "1997Q2",
"1997Q2", "1997Q2", "1997Q3", "1997Q3"), Value1 = c(100L, 100L,
120L, 180L, 186L, 186L, 190L, 192L), Value2 = c(130L, 130L, 136L,
145L, 134L, 124L, 143L, 143L), Value3 = c(120L, 124L, 154L, 154L,
126L, 176L, 176L, 123L)), class = "data.frame", row.names = c(NA,
-8L))
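For comparison, here is a minimal base R sketch of the same grouped difference without dplyr. It assumes the sign convention from the question's expected output (next value minus current value) and that the value columns all start with "Value"; the diff_ prefix simply mirrors the naming used above.

```r
df1 <- data.frame(
  Date   = c("1997Q1", "1997Q1", "1997Q1", "1997Q2",
             "1997Q2", "1997Q2", "1997Q3", "1997Q3"),
  Value1 = c(100L, 100L, 120L, 180L, 186L, 186L, 190L, 192L),
  Value2 = c(130L, 130L, 136L, 145L, 134L, 124L, 143L, 143L),
  Value3 = c(120L, 124L, 154L, 154L, 126L, 176L, 176L, 123L)
)

value_cols <- grep("^Value", names(df1), value = TRUE)

for (col in value_cols) {
  # ave() applies FUN within each Date group and returns the results
  # in the original row order; diff() gives next minus current, and
  # the trailing NA marks the last row of each group.
  df1[[paste0("diff_", col)]] <- ave(df1[[col]], df1$Date,
                                     FUN = function(v) c(diff(v), NA))
}
df1
```

Because ave() keeps the original row order, this works even if rows of the same Date are not adjacent, although the differences are then taken between rows in their order of appearance within each Date.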

data.table efficiently finding common pairs between 2 columns

Say I have a data frame:
subject stim1 stim2 feedback
1 1003 50 51 1
2 1003 48 50 1
3 1003 49 51 1
4 1003 47 49 1
5 1003 47 46 1
6 1003 46 48 1
10 1003 50 48 1
428 1003 48 51 0
433 1003 46 50 0
434 1003 50 49 0
435 1003 54 59 0
I want to create a new column "transitive_pair", computed within each subject (column 1).
For every row in which feedback==0 (starting at index 428; for all other rows transitive_pair=NaN),
I want a boolean that tells me whether there is any chain of pairings (using only rows in which feedback==1) that transitively links the stim1 and stim2 values.
Working out a few examples.
row 428- stim1=48 and stim2=51
48 and 51 are not paired but 51 was paired with 50 (e.g.row 1 ) and 50 was paired with 48 (row 10) so transitive_pair[428]=True
row 433- stim 1=46 and stim2=50
46 and 48 were paired (row 6) and 48 was paired with 50 (row 2) so transitive_pair[433]=True
in row 435, stim1=54, stim2=59
there is no chain of pairs that could link them (59 is not paired with anything while feedback==1) so transitive_pair[435]=False
desired output
subject stim1 stim2 feedback transitive_pair
1 1003 50 51 1 NaN
2 1003 48 50 1 NaN
3 1003 49 51 1 NaN
4 1003 47 49 1 NaN
5 1003 47 46 1 NaN
6 1003 46 48 1 NaN
10 1003 50 48 1 NaN
428 1003 48 51 0 1
433 1003 46 50 0 1
434 1003 50 49 0 1
435 1003 54 59 0 0
Any help would be greatly appreciated!
Here is a reproducible data frame:
structure(list(subject = c(1003L, 1003L, 1003L, 1003L, 1003L,
1003L, 1003L, 1003L, 1003L, 1003L, 1003L), stim1 = c(50L, 48L,
49L, 47L, 47L, 46L, 50L, 48L, 46L, 50L, 54L), stim2 = c(51L,
50L, 51L, 49L, 46L, 48L, 48L, 51L, 50L, 49L, 59L), feedback = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), transitive_pair = c(NaN,
NaN, NaN, NaN, NaN, NaN, NaN, 1, 1, 1, 0)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 10L, 428L, 433L, 434L, 435L), class = "data.frame")

The columns "stim1" and "stim2" define an undirected graph. Create the graph for feedback == 1, get its connected components and for each row of the data.frame, check if the values of "stim1" and "stim2" belong to the same component. In the end assign NaN to the rows where feedback is 1.
suppressPackageStartupMessages(library(igraph))
inx <- df1$feedback == 1
g <- graph_from_data_frame(df1[inx, c("stim1", "stim2")], directed = FALSE)
plot(g)
g_comp <- components(g)$membership
df1$transitive_pair_2 <- apply(df1[c("stim1", "stim2")], 1, \(x) {
i <- names(g_comp) == x[1]
j <- names(g_comp) == x[2]
if(any(i) && any(j))
g_comp[i] == g_comp[j]
else 0L
})
df1$transitive_pair_2[inx] <- NaN
df1
#> subject stim1 stim2 feedback transitive_pair transitive_pair_2
#> 1 1003 50 51 1 NaN NaN
#> 2 1003 48 50 1 NaN NaN
#> 3 1003 49 51 1 NaN NaN
#> 4 1003 47 49 1 NaN NaN
#> 5 1003 47 46 1 NaN NaN
#> 6 1003 46 48 1 NaN NaN
#> 10 1003 50 48 1 NaN NaN
#> 428 1003 48 51 0 1 1
#> 433 1003 46 50 0 1 1
#> 434 1003 50 49 0 1 1
#> 435 1003 54 59 0 0 0
Created on 2022-07-31 by the reprex package (v2.0.1)
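If igraph is not available, the connected components can also be found with a small label-propagation loop in base R. This is only a sketch under assumptions: it handles a single subject (as in the sample data; with several subjects it would need to be run per subject), and the names edges, label and same are illustrative.

```r
df1 <- data.frame(
  subject  = 1003L,
  stim1    = c(50L, 48L, 49L, 47L, 47L, 46L, 50L, 48L, 46L, 50L, 54L),
  stim2    = c(51L, 50L, 51L, 49L, 46L, 48L, 48L, 51L, 50L, 49L, 59L),
  feedback = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)
)

# Only feedback == 1 rows define edges of the undirected graph.
edges <- df1[df1$feedback == 1, c("stim1", "stim2")]
nodes <- as.character(unique(c(edges$stim1, edges$stim2)))

# Start with one label per node, then repeatedly give both ends of
# every edge the smaller of their two labels until nothing changes.
label <- setNames(seq_along(nodes), nodes)
changed <- TRUE
while (changed) {
  changed <- FALSE
  for (k in seq_len(nrow(edges))) {
    u <- as.character(edges$stim1[k])
    v <- as.character(edges$stim2[k])
    m <- min(label[[u]], label[[v]])
    if (label[[u]] != m || label[[v]] != m) {
      label[c(u, v)] <- m
      changed <- TRUE
    }
  }
}

# Two stimuli are linked when they carry the same label; a stimulus
# that never appears in a feedback == 1 row yields NA, treated as 0.
same <- label[as.character(df1$stim1)] == label[as.character(df1$stim2)]
df1$transitive_pair <- ifelse(df1$feedback == 1, NaN,
                              as.numeric(!is.na(same) & same))
df1
```

The loop is quadratic in the worst case, so for large graphs a dedicated graph library is likely the better choice.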

Split data frame by class regarding to OID

I am trying to split a data frame 50/50 within each class. However, I do not want to split rows that share the same OID (object identifier): rows with the same OID should end up in the same set.
#Data frame:
"b1" "b2" "b3" "CLASS" "OID"
110 134 119 "tree" 1
112 133 118 "tree" 1
105 125 110 "tree" 2
112 132 117 "tree" 2
109 125 115 "meadow" 6
93 110 101 "meadow" 6
86 106 95 "meadow" 7
105 136 116 "meadow" 7
102 128 111 "meadow" 8
108 129 115 "meadow" 8
113 134 119 "meadow" 8
Expected data:
#Expected:
"b1" "b2" "b3" "CLASS" "OID"
110 134 119 "tree" 1
112 133 118 "tree" 1
109 125 115 "meadow" 6
93 110 101 "meadow" 6
86 106 95 "meadow" 7
105 136 116 "meadow" 7

This selects the top half of rows in each group, plus any rows which have the same OID as the rows in that top half.
library(dplyr)
df %>%
group_by(CLASS) %>%
filter(OID %in% head(OID, n() %/% 2)) %>%
ungroup
# # A tibble: 6 x 5
# b1 b2 b3 CLASS OID
# <int> <int> <int> <chr> <int>
# 1 110 134 119 tree 1
# 2 112 133 118 tree 1
# 3 109 125 115 meadow 6
# 4 93 110 101 meadow 6
# 5 86 106 95 meadow 7
# 6 105 136 116 meadow 7
If your real data is arranged by OID like this example, you could also use top_frac
df %>%
group_by(CLASS) %>%
top_frac(.5, -OID)
# # A tibble: 6 x 5
# b1 b2 b3 CLASS OID
# <int> <int> <int> <chr> <int>
# 1 110 134 119 tree 1
# 2 112 133 118 tree 1
# 3 109 125 115 meadow 6
# 4 93 110 101 meadow 6
# 5 86 106 95 meadow 7
# 6 105 136 116 meadow 7

Your data:
df = structure(list(b1 = c(110L, 112L, 105L, 112L, 109L, 93L, 86L,
105L, 102L, 108L, 113L), b2 = c(134L, 133L, 125L, 132L, 125L,
110L, 106L, 136L, 128L, 129L, 134L), b3 = c(119L, 118L, 110L,
117L, 115L, 101L, 95L, 116L, 111L, 115L, 119L), CLASS = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("meadow",
"tree"), class = "factor"), OID = c(1L, 1L, 2L, 2L, 6L, 6L, 7L,
7L, 8L, 8L, 8L)), class = "data.frame", row.names = c(NA, -11L
))
First create a function to take 1/2 according to OID
func = function(x){
x[x$OID %in% x$OID[1:round(nrow(x)/2)],]
}
We randomize the way the OID are sorted
df$OID = factor(df$OID,levels=sample(unique(df$OID)))
df = df[order(df$OID),]
do.call(rbind,by(df,df$CLASS,func))
This ensures you get a random ~50% of the rows every time, with each OID kept complete.
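Another way to look at the same idea is to sample whole OIDs rather than rows. The sketch below assumes that half of the distinct OIDs per class (rounded up) is an acceptable reading of "50%", and that an OID never spans two classes; set_a, set_b and picked are illustrative names.

```r
df <- data.frame(
  b1    = c(110L, 112L, 105L, 112L, 109L, 93L, 86L, 105L, 102L, 108L, 113L),
  CLASS = rep(c("tree", "meadow"), c(4, 7)),
  OID   = c(1L, 1L, 2L, 2L, 6L, 6L, 7L, 7L, 8L, 8L, 8L)
)

set.seed(42)  # only so that the sketch is repeatable

# For each class, pick half of its distinct OIDs at random.
# sample.int() on positions avoids the sample(x) surprise when a
# class has a single numeric OID.
picked <- unlist(lapply(split(df$OID, df$CLASS), function(oid) {
  u <- unique(oid)
  u[sample.int(length(u), ceiling(length(u) / 2))]
}), use.names = FALSE)

set_a <- df[df$OID %in% picked, ]   # rows of the selected OIDs
set_b <- df[!df$OID %in% picked, ]  # everything else
```

Since OIDs can have different numbers of rows, the row counts of the two sets will only be approximately equal.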

Change data set from wide to long while retaining group id, and also gathering columns [duplicate]

I'd really appreciate some help getting this messy set of new survey data into a usable form. It was collected in a strange way and now I've got strange data to work with. I've looked through tidyr and tried those approaches to no avail. I suspect my problem is that I'm thinking about this dataset all wrong and I'm blind to some real answer. But given all the things I need to do to this df, I can't figure out where to start and thus where to start googling.
What I need:
For each person to be their own row
Each person retains their GroupID and Treated value
For the variables currently attached to each person individually to become columns (age, weight, height)
Fake (and much smaller):
structure(list(GroupID = 1:5, Treated = c("Y", "Y", "N", "Y",
"N"), person1_age = c(45L, 33L, 71L, 19L, 52L), person1_weight = c(187L,
145L, 136L, 201L, 168L), person1_height = c(69L, 64L, 51L, 70L,
66L), person2_age = c(54L, 20L, 48L, 63L, 26L), person2_weight = c(140L,
122L, 186L, 160L, 232L), person2_height = c(62L, 70L, 65L, 72L,
74L), person3_age = c(21L, 56L, 40L, 59L, 67L), person3_weight = c(112L,
143L, 187L, 194L, 159L), person3_height = c(61L, 69L, 73L, 63L,
72L)), .Names = c("GroupID", "Treated", "person1_age", "person1_weight",
"person1_height", "person2_age", "person2_weight", "person2_height",
"person3_age", "person3_weight", "person3_height"), row.names = c(NA,
5L), class = "data.frame")
Any help or further readings you could point me to would be very much appreciated.

reshape can do this, with the appropriate arguments:
> reshape(x, direction="long", varying=names(x)[3:11], timevar='person', v.names=c('height', 'age', 'weight'), sep='_')
GroupID Treated person height age weight id
1.1 1 Y 1 187 45 69 1
2.1 2 Y 1 145 33 64 2
3.1 3 N 1 136 71 51 3
4.1 4 Y 1 201 19 70 4
5.1 5 N 1 168 52 66 5
1.2 1 Y 2 140 54 62 1
2.2 2 Y 2 122 20 70 2
3.2 3 N 2 186 48 65 3
4.2 4 Y 2 160 63 72 4
5.2 5 N 2 232 26 74 5
1.3 1 Y 3 112 21 61 1
2.3 2 Y 3 143 56 69 2
3.3 3 N 3 187 40 73 3
4.3 4 Y 3 194 59 63 4
5.3 5 N 3 159 67 72 5
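This relies on the columns passed to varying appearing in a consistent order in the original data, and on v.names lining up with that order. Check the result against the input: in the output above the height and weight labels are swapped (person1_weight is 187 for GroupID 1, but 187 is shown under height).
If the ordering does not line up, specify varying manually; passing it as a list with one vector of column names per measure makes the mapping to v.names explicit. Here's what is used above: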
> names(x)[3:11]
[1] "person1_age" "person1_weight" "person1_height" "person2_age" "person2_weight" "person2_height"
[7] "person3_age" "person3_weight" "person3_height"
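To make the mapping between columns and measures explicit, varying can be given as a list whose i-th element holds the columns for the i-th entry of v.names. The following is a minimal sketch on a two-group, two-person subset of the question's data; the names measures and long are illustrative, and idvar = "GroupID" is an assumption that GroupID uniquely identifies each wide row.

```r
x <- data.frame(
  GroupID = 1:2, Treated = c("Y", "N"),
  person1_age = c(45L, 33L), person1_weight = c(187L, 145L),
  person1_height = c(69L, 64L),
  person2_age = c(54L, 20L), person2_weight = c(140L, 122L),
  person2_height = c(62L, 70L)
)

measures <- c("age", "weight", "height")

# One character vector of column names per measure, in person order.
varying <- lapply(measures, function(m) {
  grep(paste0("_", m, "$"), names(x), value = TRUE)
})

long <- reshape(x, direction = "long",
                varying = varying, v.names = measures,
                timevar = "person", times = 1:2,
                idvar = "GroupID")
long
```

With the list form, each measure keeps its own values, so weight for GroupID 1, person 1 is 187 and height is 69, matching the wide data.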

We can also use melt from data.table which can take multiple patterns in the measure argument
library(data.table)
melt(setDT(x), measure = patterns("age$", "weight$", "height$"),
variable.name = "person", value.name = c("age", "weight", "height"))
# GroupID Treated person age weight height
# 1: 1 Y 1 45 187 69
# 2: 2 Y 1 33 145 64
# 3: 3 N 1 71 136 51
# 4: 4 Y 1 19 201 70
# 5: 5 N 1 52 168 66
# 6: 1 Y 2 54 140 62
# 7: 2 Y 2 20 122 70
# 8: 3 N 2 48 186 65
# 9: 4 Y 2 63 160 72
#10: 5 N 2 26 232 74
#11: 1 Y 3 21 112 61
#12: 2 Y 3 56 143 69
#13: 3 N 3 40 187 73
#14: 4 Y 3 59 194 63
#15: 5 N 3 67 159 72

Merge lines with same ID and take average value

From the table below I need to combine the lines by calculating the average value for those lines with same ID (column 2).
I was thinking of the plyr function??
ddply(df, summarize, value = average(ID))
df:
miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G 134R 141G 141R 167G 167R 185G 185R
1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703 442 715 309 546 113 358 309
2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13 3 6 3 6 4 7 5
3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1 1 3 2 2 0 2 1
4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4 1 3 0 0 2 2 4
5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32 21 23 25 10 14 32 55
Summary of original data:
> str(ClusterMatrix)
'data.frame': 113 obs. of 98 variables:
$ miRNA: Factor w/ 202 levels "hsa-miR-106a",..: 1 3 4 6 8 8 14 15 15 16 ...
$ ID : Factor w/ 27 levels "ID1","ID10","ID11",..: 25 12 12 12 21 21 12 21 21 6 ...
$ 100G : Factor w/ 308 levels "-0.307749042739963",..: 279 11 3 3 101 42 139 158 215 222 ...
$ 100R : Factor w/ 316 levels "-0.138028803567403",..: 207 7 8 8 18 42 128 183 232 209 ...
$ 106G : Factor w/ 260 levels "-0.103556709881933",..: 171 4 1 3 7 258 95 110 149 162 ...
$ 106R : Factor w/ 300 levels "-0.141810346640204",..: 141 4 6 2 108 41 146 196 244 267 ...
$ 122G : Factor w/ 336 levels "-0.0409548922061764",..: 237 12 4 6 103 47 148 203 257 264 ...
$ 122R : Factor w/ 316 levels "-0.135708706475279",..: 177 1 8 6 36 44 131 192 239 244 ...
$ 124G : Factor w/ 267 levels "-0.348439853247856",..: 210 5 2 3 7 50 126 138 188 249 ...
$ 124R : Factor w/ 303 levels "-0.176414190219115",..: 193 3 7 3 21 52 167 200 238 239 ...
$ 126G : Factor w/ 307 levels "-0.227658806811544",..: 122 88 5 76 169 61 240 220 281 265 ...
$ 126R : Factor w/ 249 levels "-0.271925865853123",..: 119 1 2 3 11 247 78 110 151 193 ...
$ 134G : Factor w/ 344 levels "-0.106333543799583",..: 304 14 8 5 33 48 150 196 248 231 ...
$ 134R : Factor w/ 300 levels "-0.0997616469801097",..: 183 5 7 7 22 298 113 159 213 221 ...
$ 141G : Factor w/ 335 levels "-0.134429748398679",..: 253 7 3 3 24 29 142 137 223 302 ...
$ 141R : Factor w/ 314 levels "-0.143299688877927",..: 210 4 5 7 98 54 154 199 255 251 ...
$ 167G : Factor w/ 306 levels "-0.211181452126958",..: 222 7 4 6 11 292 91 101 175 226 ...
$ 167R : Factor w/ 282 levels "-0.0490740880560127",..: 130 2 6 4 15 282 110 146 196 197 ...
$ 185G : Factor w/ 317 levels "-0.0567841338235346",..: 218 2 7 7 33 34 130 194 227 259 ...

We can use dplyr. We group by 'ID', use mutate_each to create columns that show the mean value of '100G' to '185R'. We select the columns in mutate_each by using regex patterns in matches. Then cbind (bind_cols) the original dataset with the mutated columns, and convert to data.frame if needed. We can also change the column names of the mean columns.
library(dplyr)
out <- df1 %>%
group_by(ID) %>%
mutate_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+')) %>%
setNames(., c(names(.)[1:2], paste0('Mean_', names(.)[3:ncol(.)]))) %>%
as.data.frame()
out1 <- bind_cols(df1, out[-(1:2)])
out1
# miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G
#1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703
#2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13
#3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1
#4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4
#5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32
# 134R 141G 141R 167G 167R 185G 185R Mean_100G Mean_100R Mean_106G
#1 442 715 309 546 113 358 309 1585.000000 423.0000000 180.000000
#2 3 6 3 6 4 7 5 4.666667 0.3333333 1.666667
#3 1 3 2 2 0 2 1 4.666667 0.3333333 1.666667
#4 1 3 0 0 2 2 4 4.666667 0.3333333 1.666667
#5 21 23 25 10 14 32 55 29.000000 17.0000000 6.000000
# Mean_106R Mean_122G Mean_122R Mean_124G Mean_124R Mean_126G Mean_126R
#1 113 598.000000 266 227.000000 242.000000 70 106
#2 3 6.333333 4 1.666667 2.666667 16 1
#3 3 6.333333 4 1.666667 2.666667 16 1
#4 3 6.333333 4 1.666667 2.666667 16 1
#5 55 40.000000 35 6.000000 20.000000 171 10
# Mean_134G Mean_134R Mean_141G Mean_141R Mean_167G Mean_167R Mean_185G
#1 2703 442.000000 715 309.000000 546.000000 113 358.000000
#2 6 1.666667 4 1.666667 2.666667 2 3.666667
#3 6 1.666667 4 1.666667 2.666667 2 3.666667
#4 6 1.666667 4 1.666667 2.666667 2 3.666667
#5 32 21.000000 23 25.000000 10.000000 14 32.000000
# Mean_185R
#1 309.000000
#2 3.333333
#3 3.333333
#4 3.333333
#5 55.000000
EDIT: If we need a single row mean for each 'ID', we can use summarise_each
df1 %>%
group_by(ID) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+'))
EDIT2: Based on the OP's update, the columns of the original dataset ('ClusterMatrix') are all of class factor. We need to convert the columns to numeric before computing the mean. There are two ways to convert a factor to numeric: 1) as.numeric(as.character(x)), which may be a bit slower, or 2) as.numeric(levels(x))[x], which is faster. Here I am using the first method as it may be clearer.
ClusterMatrix %>%
group_by(ID) %>%
summarise_each(funs(mean= mean(as.numeric(as.character(.)),
na.rm=TRUE)), matches('^\\d+'))
data
df1 <- structure(list(miRNA = c("hsa-miR-106a", "hsa-miR-1185-1",
"hsa-miR-1185-2",
"hsa-miR-1197", "hsa-miR-127"), ID = c("ID7", "ID2", "ID2", "ID2",
"ID3"), `100G` = c(1585L, 10L, 2L, 2L, 29L), `100R` = c(423L,
1L, 0L, 0L, 17L), `106G` = c(180L, 3L, 2L, 0L, 6L), `106R` = c(113L,
3L, 1L, 5L, 55L), `122G` = c(598L, 11L, 5L, 3L, 40L), `122R` = c(266L,
8L, 1L, 3L, 35L), `124G` = c(227L, 4L, 1L, 0L, 6L), `124R` = c(242L,
4L, 0L, 4L, 20L), `126G` = c(70L, 28L, 4L, 16L, 171L), `126R` = c(106L,
2L, 1L, 0L, 10L), `134G` = c(2703L, 13L, 1L, 4L, 32L), `134R` = c(442L,
3L, 1L, 1L, 21L), `141G` = c(715L, 6L, 3L, 3L, 23L), `141R` = c(309L,
3L, 2L, 0L, 25L), `167G` = c(546L, 6L, 2L, 0L, 10L), `167R` = c(113L,
4L, 0L, 2L, 14L), `185G` = c(358L, 7L, 2L, 2L, 32L), `185R` = c(309L,
5L, 1L, 4L, 55L)), .Names = c("miRNA", "ID", "100G", "100R",
"106G", "106R", "122G", "122R", "124G", "124R", "126G", "126R",
"134G", "134R", "141G", "141R", "167G", "167R", "185G", "185R"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
))
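mutate_each, summarise_each and funs are deprecated in current dplyr releases in favour of across(). As a dependency-free alternative, the one-row-per-ID means can also be sketched with base R's aggregate. The example below uses only the first two measurement columns of the sample data for brevity; value_cols and means are illustrative names, and factor columns (as in ClusterMatrix) would need converting to numeric first.

```r
df1 <- data.frame(
  miRNA = c("hsa-miR-106a", "hsa-miR-1185-1", "hsa-miR-1185-2",
            "hsa-miR-1197", "hsa-miR-127"),
  ID = c("ID7", "ID2", "ID2", "ID2", "ID3"),
  `100G` = c(1585, 10, 2, 2, 29),
  `100R` = c(423, 1, 0, 0, 17),
  check.names = FALSE  # keep column names that start with a digit
)

# Measurement columns are the ones whose names start with a digit.
value_cols <- grep("^[0-9]", names(df1), value = TRUE)

# One row per ID with the mean of every measurement column.
means <- aggregate(df1[value_cols], by = list(ID = df1$ID),
                   FUN = mean, na.rm = TRUE)
means
```

aggregate returns the groups sorted by ID, so the rows come back as ID2, ID3, ID7 rather than in order of first appearance.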
