Selecting middle n rows in R

I have a data.table in R, say df.
library(data.table)
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- sample(c(0:100), 20, replace = TRUE)
df <- data.table(row.number, a, b)
df
row.number a b
1 1 A 14
2 2 A 59
3 3 A 39
4 4 A 22
5 5 A 75
6 6 A 89
7 7 A 11
8 8 A 88
9 9 A 22
10 10 A 6
11 11 B 37
12 12 B 42
13 13 B 39
14 14 B 8
15 15 B 74
16 16 B 67
17 17 B 18
18 18 B 12
19 19 B 56
20 20 B 21
I want to take n rows (say 10) from the middle after arranging the records in increasing order of column b.

Use setorder to sort and .N to pick out the middle rows:
setorder(df, b)[(.N/2 - 10/2 + 1):(.N/2 + 10/2), ]
row.number a b
1: 11 B 36
2: 5 A 38
3: 8 A 41
4: 18 B 43
5: 1 A 50
6: 12 B 51
7: 15 B 54
8: 3 A 55
9: 20 B 59
10: 4 A 60
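If this is needed repeatedly, the formula generalises to a small helper function (a sketch, not part of the answer; middle_n is a made-up name, and the integer division keeps it working when the leftover rows don't split evenly):
library(data.table)
# return the n rows around the middle of dt after sorting by the column named in col
middle_n <- function(dt, col, n) {
  out <- setorderv(copy(dt), col)       # sort a copy so the original stays untouched
  start <- (nrow(out) - n) %/% 2 + 1    # first row of the centred block
  out[start:(start + n - 1)]
}
middle_n(df, "b", 10)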

You could use the following code
library(data.table)
set.seed(9876) # for reproducibility
# your data
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number,a,b)
df
# define how many to select and store in n
n <- 10
# calculate how many to cut off at start and end
n_not <- (nrow(df) - n )/2
# use data.tables setorder to arrange based on column b
setorder(df, b)
# select the rows wanted based on n
df[(n_not + 1):(nrow(df) - n_not), ]
Please let me know whether this is what you want.

Related

Filter a grouped variable from a dataset based on the range values of another dataset using dplyr

I want to take the values of a (large) data frame:
library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]
Which eventually reaches a 2, a 3, etc.
And I have a second data frame that contains, for some of the "x" variables, a range (min, max) of y values I want to keep; the range is different for each "x":
sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))
sub.data
x min max
1 a 2 6
2 c 50 53
3 d 25 30
The output should look like something like this:
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 c 50
7 c 51
8 c 52
9 c 53
10 d 25
11 d 26
12 d 27
13 d 28
14 d 29
15 d 30
I've tried this:
df.grid %>%
group_by(x) %>%
filter_if(y > sub.data$min)
But it doesn't work as the min column has multiple values and the 'if' part complains.
I also found this post, but it doesn't seem to work for me as there are no 'matching' variables to guide the filtering process.
I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.
We could use a non-equi join
library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]
-output
x y
1: a 2
2: a 3
3: a 4
4: a 5
5: a 6
6: c 50
7: c 51
8: c 52
9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30
With dplyr version 1.1.0, we could also use non-equi joins with join_by
library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min, y <= max)) %>%
  select(x, y)
-output
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 d 25
7 d 26
8 d 27
9 d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53
Or, as @Davis Vaughan mentioned, use between() with a left_join():
left_join(sub.data, df.grid,
          by = join_by(x, between(y$y, x$min, x$max))) %>%
  select(names(df.grid))

order data.frame based on 2 columns and vector of variables

I need to order a data.frame based on 2 columns and a given vector of variables.
Here is an example of my df:
df = data.frame(A = rnorm(45),
                B = rep(c('a', 'b', 'c'), each = 5, times = 3),
                C = rep(c(10, 20, 30), each = 15))
I need to change the order of col B from c('a', 'b', 'c') to c('c', 'a', 'b') while still keeping col C fixed to the 3 variables groups.
Here the first 30 rows of my desired output:
A B C
-0.11451485 c 10
-0.11860742 c 10
0.08156183 c 10
1.11850750 c 10
-0.79072556 c 10
1.24141030 a 10
0.88538811 a 10
-1.35548712 a 10
0.05723677 a 10
0.14660464 a 10
-0.28587107 b 10
0.59452832 b 10
1.00163605 b 10
1.15892322 b 10
-1.41771696 b 10
-2.05743546 c 20
-1.22835358 c 20
1.50060736 c 20
-0.14956114 c 20
-1.13126592 c 20
1.08571256 a 20
-1.04991699 a 20
-1.50655996 a 20
-0.63675392 a 20
-0.26485423 a 20
0.30509657 b 20
0.85471772 b 20
-0.54064736 b 20
0.24578056 b 20
0.14917900 b 20
Any help will be really appreciated,
thanks
The key is to change the level of the factor column. After that, we can use arrange from the dplyr package to sort multiple columns. Notice that in your original post, sorting column A is not a requirement. I just add column A to the arrange call to show it is easy to include more than two columns to the arrange function.
library(dplyr)
df2 <- df %>%
  # Change the level of the factor
  mutate(B = factor(B, levels = c("c", "a", "b"))) %>%
  # Arrange the columns
  arrange(C, B, A)
df2
# A B C
# 1 -2.39317699 c 10
# 2 -1.48901928 c 10
# 3 -0.42562766 c 10
# 4 0.03383395 c 10
# 5 0.66362189 c 10
# 6 -0.65324997 a 10
# 7 -0.59408686 a 10
# 8 0.37012883 a 10
# 9 0.53238177 a 10
# 10 3.03972004 a 10
# 11 -2.03192274 b 10
# 12 -1.05138447 b 10
# 13 -0.80795342 b 10
# 14 1.74526091 b 10
# 15 2.07681466 b 10
# 16 -1.90573715 c 20
# 17 -0.72626244 c 20
# 18 -0.48017481 c 20
# 19 -0.42995920 c 20
# 20 0.17729002 c 20
# 21 -0.62947278 a 20
# 22 -0.40038152 a 20
# 23 -0.23368555 a 20
# 24 0.44218806 a 20
# 25 1.58561071 a 20
# 26 -0.66270426 b 20
# 27 -0.50256255 b 20
# 28 -0.19890974 b 20
# 29 0.26562533 b 20
# 30 1.84093124 b 20
# 31 -0.93702848 c 30
# 32 0.10804529 c 30
# 33 0.25758608 c 30
# 34 1.33084399 c 30
# 35 1.67204875 c 30
# 36 -1.88922564 a 30
# 37 -1.74551938 a 30
# 38 -1.32215854 a 30
# 39 -0.43743607 a 30
# 40 1.07554466 a 30
# 41 -0.38154167 b 30
# 42 0.53823057 b 30
# 43 0.83401316 b 30
# 44 1.04418363 b 30
# 45 2.45985490 b 30
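For reference (not part of either answer), the same re-levelling idea carries over to data.table; a minimal sketch, assuming df is the data frame defined in the question:
library(data.table)
setDT(df)                                        # convert to data.table in place
df[, B := factor(B, levels = c("c", "a", "b"))]  # re-level B
setorder(df, C, B)                               # sort by C, then by the new level order of B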
You can first use factor() to order your B factor by the levels you define; with that, you can then order your data frame to get the desired output.
Generating some data:
set.seed(10)
df = data.frame(A = rnorm(45),
                B = rep(c('a', 'b', 'c'), each = 5, times = 3),
                C = rep(c(10, 20, 30), each = 15))
And using levels to re-level your factor before ordering the data frame:
df$B <- factor(df$B, levels = c('c', 'a', 'b'))
df <- df[order(df$C, df$B), ]
Output (first 20 rows):
1.0177950 c 10
0.75578151 c 10
-0.23823356 c 10
0.98744470 c 10
0.74139013 c 10
0.01874617 a 10
-0.18425254 a 10
-1.37133055 a 10
-0.59916772 a 10
0.29454513 a 10
0.38979430 b 10
-1.20807618 b 10
-0.36367602 b 10
-1.62667268 b 10
-0.25647839 b 10
-0.37366156 c 20
-0.68755543 c 20
-0.87215883 c 20
-0.10176101 c 20
-0.25378053 c 20

Check time series inconsistencies

Let's say that we have the following matrix:
x <- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                         c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                         c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42)))
colnames(x) <- c("ID", "Visit", "Age")
The first column represents subject ID, the second a list of observations and the third the age at each of this consecutive observations.
What would be the easiest way of finding visits where the age is wrong relative to the age at the previous visit? (E.g. in row 13, subject C is 66 years old when at the previous visit they were already 84, and in row 16, subject D is 32 when at the previous visit they were already 38.)
What would be the way to highlight the potential errors and remove rows 13 and 16?
I have tried to aggregate by IDs and look for the difference between ages across visits, but it seems hard for me since the error could occur in any visit.
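A note on the example data (not from the question or answers): cbind() coerces everything to character, so Age ends up as a character/factor column and diff() on it will not behave numerically. The approaches below assume a conversion along these lines first:
# convert Visit and Age from factor/character to numeric
x$Visit <- as.numeric(as.character(x$Visit))
x$Age   <- as.numeric(as.character(x$Age))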
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
  w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]))
df
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
The aggregate() approach is pretty concise.
Removing bad rows:
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE
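For completeness, a data.table sketch of the same filter (not from the original answers; it assumes the numeric conversion noted above and that rows are ordered by Visit within each ID):
library(data.table)
setDT(x)
setorder(x, ID, Visit)
x[, .SD[c(TRUE, diff(Age) > 0)], by = ID]   # keep each first visit plus visits where Age increased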

Assigning values to a new column based on a condition between two dataframes

I have two dataframes. I need to add the value of one column to every row in the other dataframe where the values of a particular column meet a condition from the first dataframe.
df1:
a b
x 23
s 34
v 15
g 05
k 69
df2:
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
Desired output:
a b n
x 23 3
s 34 4
v 15 2
g 05 1
k 69 7
In my dataset the intervals are large, and it's unlikely that a value from df1 is exactly on the boundary of a df2 interval.
Essentially, for every row in df1 I need to assign the number corresponding to the df2 range that its b value falls into. So if df1$b is between df2$y and df2$z, then assign the corresponding value of df2$x to output$n. This is quite a wordy question, so please ask if I need to clarify.
df1 = read.table(text = "
a b
x 23
s 34
v 15
g 05
k 69
", header=T, stringsAsFactors=F)
df2 = read.table(text = "
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
", header=T, stringsAsFactors=F)
# function
f = function(x) min(which(x >= df2$y & x <= df2$z))
f = Vectorize(f)
# apply function
df1$n = f(df1$b)
# check updated dataset
df1
# a b n
# 1 x 23 3
# 2 s 34 4
# 3 v 15 2
# 4 g 5 1
# 5 k 69 7
You can try:
library(tidyverse)
df1 %>%
  rowwise() %>%
  mutate(n = df2[b > df2$y & b <= df2$z, 1]) %>%
  ungroup()
# A tibble: 5 x 3
a b n
<chr> <int> <int>
1 x 23 3
2 s 34 4
3 v 15 2
4 g 5 1
5 k 69 7
As already commented, you may have to change < or > to <= or >= according to your needs.
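If the df2 intervals are contiguous and sorted (as they are here), a base R sketch with findInterval() avoids the row-wise work; this is not from the original answers:
# findInterval() returns, for each b, the index of the last df2$y that is <= b;
# with contiguous sorted breaks that index is the row of df2 containing b
df1$n <- df2$x[findInterval(df1$b, df2$y)]
df1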

Enumerate instances of a factor level

I have a data frame with 150000 lines in long format with multiple occurrences of the same id variable. I'm using reshape (from stats, rather than the reshape/reshape2 packages) to convert this to wide format. I am generating a variable to count each occurrence of a given level of id to use as an index.
I've got this working with a small dataframe using plyr, but it is far too slow for my full df. Can I programme this more efficiently?
I've struggled doing this with the reshape package as I have around 30 other variables. It may be best to reshape only what I'm looking at (rather than the whole df) for each individual analysis.
> # u=id variable with three value variables
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
u v w x
1 a 1 20 40
2 a 2 21 41
3 a 3 22 42
4 a 4 23 43
5 b 5 24 44
6 b 6 25 45
7 b 7 26 46
8 c 8 27 47
9 c 9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
>
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first"))
> df2
u v w x count
1 a 1 20 40 1
2 a 2 21 41 2
3 a 3 22 42 3
4 a 4 23 43 4
5 b 5 24 44 1
6 b 6 25 45 2
7 b 7 26 46 3
8 c 8 27 47 1
9 c 9 28 48 2
10 c 10 29 49 3
11 c 11 30 50 4
12 c 12 31 51 5
13 c 13 32 52 6
14 d 14 33 53 1
15 d 15 34 54 2
16 d 16 35 55 3
17 d 17 36 56 4
18 d 18 37 57 5
> reshape(df2, idvar="u", timevar="count", direction="wide")
u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1 a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
5 b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
8 c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
14 d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA
I still can't quite figure out why you would want to ultimately convert your dataset from long to wide, because to me, that seems like it would be an extremely unwieldy dataset to work with.
If you're looking to speed up the enumeration of your factor levels, you can consider using ave() in base R, or .N from the "data.table" package. Considering that you are working with a lot of rows, you might want to consider the latter.
First, let's make up some data:
set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
                 v = runif(150000, 0, 10),
                 w = runif(150000, 0, 100),
                 x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
# u v w x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
#
# [[2]]
# u v w x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
#
# a b c d e f
# 25332 24691 24993 24975 25114 24895
Load our required packages:
library(plyr)
library(data.table)
Create a "data.table" version of our dataset
DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
# u v w x
# 1: a 6.2378578 96.098294 643.2433
# 2: a 5.0322400 46.806132 544.6883
# 3: a 9.6289786 87.915303 334.6726
# 4: a 4.3393403 1.994383 753.0628
# 5: a 6.2300123 72.810359 579.7548
# ---
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
# u N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895
Now let's run a few basic tests. The results from ave() aren't sorted, but they are in "data.table" and "plyr", so we should also test the timing for sorting when using ave().
system.time(AVE <- within(df, {
  count <- ave(as.numeric(u), u, FUN = seq_along)
}))
# user system elapsed
# 0.024 0.000 0.027
# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
# user system elapsed
# 0.264 0.000 0.262
system.time(DDPLY <- ddply(df, .(u), transform,
                           count = rank(u, ties.method = "first")))
# user system elapsed
# 0.944 0.000 0.984
system.time(DT[, count := 1:.N, by = key(DT)])
# user system elapsed
# 0.008 0.000 0.004
all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE
That syntax for "data.table" sure is compact, and its speed is blazing!
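(A side note that is not in the original answer: newer data.table releases also provide rowid(), which produces the same within-group counter without setting a key; a sketch:)
library(data.table)
DT[, count := rowid(u)]   # equivalent to 1:.N grouped by u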
Using base R to create an empty matrix and then filling it in appropriately can often be significantly faster. In the code below I suspect the slow part would be converting the data frame to a matrix and transposing, as in the first two lines; if so, that could perhaps be avoided if the data could be stored differently to start with.
g <- df$u
x <- t(as.matrix(df[, -1]))
k <- split(seq_along(g), g)
n <- max(sapply(k, length))
out <- matrix(ncol = n*nrow(x), nrow = length(k))
for(idx in seq_along(k)) {
  out[idx, seq_len(length(k[[idx]])*nrow(x))] <- x[, k[[idx]]]
}
rownames(out) <- names(k)
colnames(out) <- paste(rep(rownames(x), n), rep(seq_len(n), each = nrow(x)), sep = ".")
out
#   v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
# a   1  20  40   2  21  41   3  22  42   4  23  43  NA  NA  NA  NA  NA  NA
# b   5  24  44   6  25  45   7  26  46  NA  NA  NA  NA  NA  NA  NA  NA  NA
# c   8  27  47   9  28  48  10  29  49  11  30  50  12  31  51  13  32  52
# d  14  33  53  15  34  54  16  35  55  17  36  56  18  37  57  NA  NA  NA
