Excluding both the minimum and maximum value in R

I want to exclude both the minimum and the maximum value of each row in a data frame. (If one of those values is repeated, only one occurrence should be excluded.)
I can exclude either the minimum or the maximum, but not both.
I can't find a way to combine these two (each works fine by itself):
d[-which(d == min(d))[1]]
d[-which(d == max(d))[1]]
This doesn't work:
d[
-which(d == min(d))[1] &
-which(d == max(d))[1]
]
It gives the full row.
(I also tried an approach using apply(d, 1, min/max), but this also fails.)

Update
Prompted by @Rich Pauloo's answer, I remembered that we can directly use which.max and which.min to get the indices of the minimum and maximum values:
as.data.frame(t(apply(df, 1, function(x) x[-c(which.max(x), which.min(x))])))
# V1 V2 V3
#1 13 11 6
#2 15 8 18
#3 5 10 21
#4 14 12 17
#5 19 9 20
Here which.max/which.min ensures that you get the index of the first minimum and the first maximum, respectively, for each row.
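A quick illustration of that guarantee on a vector with ties:
x <- c(3, 1, 4, 1, 5)  # the minimum 1 appears twice
which.min(x)           # 2 -> index of the first minimum only
which(x == min(x))     # 2 4 -> indices of all tied minima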
Some other variations could be
as.data.frame(t(apply(df, 1, function(x)
  x[-c(which.max(x == min(x)), which.max(x == max(x)))])))
If you want to use which, we can do
as.data.frame(t(apply(df, 1, function(x)
  x[-c(which(x == min(x))[1], which(x == max(x))[1])])))
(Note that [1] goes outside the which() call, so only the first tied minimum or maximum is dropped.)
data
set.seed(1234)
df <- as.data.frame(matrix(sample(25), 5, 5))
df
# V1 V2 V3 V4 V5
#1 3 13 11 16 6
#2 15 1 8 25 18
#3 24 5 4 10 21
#4 14 12 17 2 22
#5 19 9 20 7 23

You were very close! With data.frames you need to use a comma within the brackets to accomplish row-column subsetting.
1. Use which.max() and which.min() to return the index of the max and min values of a vector, respectively.
2. Bind those indices into a new vector with c().
3. Use - and the vector from 2. to subset your data frame for the desired rows.
Here's an example to copy/paste:
d <- data.frame(a = 1:5) # make example data.frame
d[-c(which.max(d$a), which.min(d$a)), ]
[1] 2 3 4
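To see why the comma matters, a small illustration:
m <- data.frame(a = 1:3, b = 4:6)
m[-1]    # no comma: drops the first COLUMN (list-style subsetting)
m[-1, ]  # with comma: drops the first ROW (row-column subsetting)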

This will remove the rows containing the min and max values of score as shown in the example data frame.
library(tidyverse)
df <- tribble(~name, ~score,
              'John', 10,
              'Mike', 2,
              'Mary', 11,
              'Jane', 1,
              'Jill', 5)
df %>%
  arrange(score) %>%
  slice(-1, -nrow(.))
# A tibble: 3 x 2
name score
<chr> <dbl>
1 Mike 2
2 Jill 5
3 John 10
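If you would rather keep the original row order, a variation on the same idea (a sketch using which.min()/which.max() inside slice()):
df %>% slice(-which.min(score), -which.max(score))
# keeps John, Mike and Jill in their original order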

We can use
t(apply(df, 1, function(x) x[!x %in% range(x)]))
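Note that %in% with range() drops every tied occurrence of the minimum and maximum. If only the first occurrence of each should go, match(), which returns the first matching position, is a drop-in adjustment (a sketch):
t(apply(df, 1, function(x) x[-c(match(min(x), x), match(max(x), x))]))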

Related

Split all values in column and store them in a single numeric vector

I have a dataframe in R and I want to create a single numeric vector by splitting all of the character values in a specific column and appending them to the vector. The values in the column are comma-separated numbers, and some rows are missing values (NA).
Current data
id col
1 2,6,10
2 NA
3 5, 10
4 1
Final vector
# v <- c(2, 6, 10, 5, 10, 1)
# v
[1] 2 6 10 5 10 1
I'm able to do this by iterating through all the values in the column but I know this isn't the most efficient way since R is made to work easily with vectors. Is there a better way to do this?
v <- c()
for (val in df$col) {
  if (!is.na(val)) {
    ints <- as.numeric(unlist(strsplit(val, ",")))
    v <- c(v, ints)
  }
}
You already have the answer in your code since all the functions you are using are vectorised.
v <- as.numeric(na.omit(unlist(strsplit(df$col, ','))))
v
#[1] 2 6 10 5 10 1
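Another base option, in case it's useful, is scan(), which parses the numbers directly (a sketch):
v <- scan(text = na.omit(df$col), sep = ",")
v
#[1]  2  6 10  5 10  1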
Does this work:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(col) %>%
  na.omit() %>%
  pull(col) %>%
  as.numeric() -> v
v
[1] 2 6 10 5 10 1
Data used:
df
# A tibble: 4 x 2
id col
<dbl> <chr>
1 1 2,6,10
2 2 NA
3 3 5, 10
4 4 1

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to a group, take all of that group's columns).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
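For a reproducible check of the resulting sizes (a quick sketch):
set.seed(1)
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
length(keep)
#[1] 20   # 7 from A, 7 from B, all 6 from C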
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)), function(x) {
                 cols <- which(colnames(dframe) == x)
                 dframe[, if (length(cols) <= nc) cols
                        else sample(cols, nc, replace = FALSE)]
               }))
It might look complicated, but it really just takes all of a group's columns if there are at most nc of them, and samples nc random columns otherwise.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+', '', colnames(res))

creating a moving window cumsum in R

Here is a sample df in which I would like to calculate cumulative sum over a moving window.
d <- data.frame(y = 1:10)
From previous suggestions, I am able to perform a sliding-window cumsum using the following script (thanks to nograpes):
size <- 2 # size of window
len <- nrow(d) - size +1 # number of sliding windows to perform
sumsmatrix <- apply(d, 2, function(x)
cumsum(x)[size:nrow(d)] - c(0,cumsum(x)[1:(len-1)]))
It gives the following output:
y
3
5
7
9
11
13
15
17
19
My request is to perform a cumsum over a moving (non-overlapping) window instead of a sliding one. For instance, if my window size is 2, I would like to calculate the cumsum of the first two rows of a column, then move on and calculate it for the 3rd and 4th rows, and so on.
Desired output:
y
1
3
3
7
5
11
7
15
9
19
How can I tweak the script to suit my needs?
We can do a group-by operation by creating a grouping variable with %/% and then using a group-by function such as tapply:
unlist(tapply(d$y, (seq_len(nrow(d))-1)%/% size, FUN = cumsum), use.names = FALSE)
#[1] 1 3 3 7 5 11 7 15 9 19
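The grouping variable built by %/% pairs off consecutive rows:
(seq_len(nrow(d)) - 1) %/% size
# [1] 0 0 1 1 2 2 3 3 4 4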
Another way would be to create a group variable and use cumsum().
library(dplyr)
d %>%
  group_by(group = rep(1:(nrow(.)/2), each = 2)) %>%
  transmute(y = cumsum(y)) %>%
  ungroup %>%
  select(-group)
# y
# <int>
#1 1
#2 3
#3 3
#4 7
#5 5
#6 11
#7 7
#8 15
#9 9
#10 19
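The rep(1:(nrow(.)/2), each = 2) grouping assumes an even number of rows; a sketch that also copes with an odd count uses integer division on the row number instead:
d %>%
  group_by(group = (row_number() - 1) %/% 2) %>%
  transmute(y = cumsum(y)) %>%
  ungroup() %>%
  select(-group)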
The initial example seems to be for a rolling sum over a window of width 2, not a cumulative sum. It can be written more compactly as:
library(zoo)
rollapplyr(d, 2, sum)
or
rollsum(d, 2)
or this which uses no packages:
apply(d, 2, function(x) rowSums(embed(x, 2)))
Now getting to the actual question here are some alternatives:
1) zoo. To perform a cumsum rolling forward by 2 at a time:
apply(d, 2, function(x) t(rollapplyr(x, 2, cumsum, by = 2)))
2) no packages. This would also work and uses no packages:
apply(d, 2, function(x) apply(matrix(x, 2), 2, cumsum))
Revised to work on every column of its input.

subset rows with (1) ALL and (2) ANY columns larger than a specific value

I have a data frame with an id column and some (potentially many) columns with values, here 'v1', 'v2':
df <- data.frame(id = c(1:5), v1 = c(0,15,9,12,7), v2 = c(9,32,6,17,11))
# id v1 v2
# 1 1 0 9
# 2 2 15 32
# 3 3 9 6
# 4 4 12 17
# 5 5 7 11
How can I extract rows where ALL values are larger than a certain value, say 10, which should return:
# id v1 v2
# 2 2 15 32
# 4 4 12 17
How can I extract rows where ANY (at least one) value is larger than 10, which should return:
# id v1 v2
# 2 2 15 32
# 4 4 12 17
# 5 5 7 11
See the functions all() and any() for the first and second parts of your question, respectively. The apply() function can be used to run functions over rows or columns (MARGIN = 1 is rows, MARGIN = 2 is columns, etc.). Note that I use apply() on df[, -1] to ignore the id variable when doing the comparisons.
Part 1:
> df <- data.frame(id=c(1:5), v1=c(0,15,9,12,7), v2=c(9,32,6,17,11))
> df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ]
id v1 v2
2 2 15 32
4 4 12 17
Part 2:
> df[apply(df[, -1], MARGIN = 1, function(x) any(x > 10)), ]
id v1 v2
2 2 15 32
4 4 12 17
5 5 7 11
To see what is going on, x > 10 returns a logical vector for each row (via apply()), indicating whether each element is greater than 10. all() returns TRUE if all elements of the input vector are TRUE, and FALSE otherwise. any() returns TRUE if any of the elements is TRUE, and FALSE if all are FALSE.
I then use the logical vector resulting from the apply() call
> apply(df[, -1], MARGIN = 1, function(x) all(x > 10))
[1] FALSE TRUE FALSE TRUE FALSE
> apply(df[, -1], MARGIN = 1, function(x) any(x > 10))
[1] FALSE TRUE FALSE TRUE TRUE
to subset df (as shown above).
This can be done using apply with MARGIN = 1, which will apply a function to each row. The function to check a given row would be
function(row) {all(row > 10)}
So the way to extract the rows themselves (again skipping the id column, which would otherwise be swept into the comparison) is
df[apply(df[, -1], 1, function(row) {all(row > 10)}), ]
One option is looping row-by-row (e.g. with apply) and using any or all, as proposed in the other two answers. However, this can be inefficient for large data frames.
A vectorized approach would be to use rowSums to determine the number of values in each row matching your criterion, and filter based on that.
(1) When filtering to rows where ALL values are larger than 10, this is the same as keeping rows where the number of values less than or equal to 10 is 0:
df[rowSums(df[,-1] <= 10) == 0,]
# id v1 v2
# 2 2 15 32
# 4 4 12 17
(2) Similarly, rowSums can easily be used to find the rows where ANY (at least one) value is larger than 10:
df[rowSums(df[,-1] > 10) > 0,]
# id v1 v2
# 2 2 15 32
# 4 4 12 17
# 5 5 7 11
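The intermediate counts show what each filter sees:
rowSums(df[, -1] <= 10)
#[1] 2 0 2 0 1   # rows 2 and 4 have no values <= 10
rowSums(df[, -1] > 10)
#[1] 0 2 0 2 1   # rows 2, 4 and 5 have at least one value > 10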
The speedup is clear with a larger input:
set.seed(144)
df <- matrix(sample(c(1, 10, 20), 3e6, replace=TRUE), ncol=3)
system.time(df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ])
# user system elapsed
# 1.754 0.156 2.102
system.time(df[rowSums(df[,-1] <= 10) == 0,])
# user system elapsed
# 0.04 0.01 0.05
The dplyr equivalent is as follows:
library(dplyr)
#ANY
df %>%
  rowwise() %>%
  filter(any(across(starts_with("v"), ~ sum(. > 10))))
# A tibble: 3 x 3
# Rowwise:
id v1 v2
<int> <dbl> <dbl>
1 2 15 32
2 4 12 17
3 5 7 11
#ALL
df %>%
  rowwise() %>%
  filter(all(across(starts_with("v"), ~ sum(. > 10))))
# A tibble: 2 x 3
# Rowwise:
id v1 v2
<int> <dbl> <dbl>
1 2 15 32
2 4 12 17
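In recent dplyr (>= 1.0.4), if_all() and if_any() express the same tests without rowwise(); a sketch:
df %>% filter(if_all(starts_with("v"), ~ .x > 10))  # ALL values > 10
df %>% filter(if_any(starts_with("v"), ~ .x > 10))  # ANY value > 10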

Subtract previous year's value from each grouped row in data frame

I am trying to calculate the lagged difference (i.e. the actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes the values from the previous year. A sample data set can be created with this code:
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
(df <- rbind(x, y, z))
I can use a combination of lapply() and split() to calculate the difference between each year for every unique id, like so:
(diffs <- lapply(split(df, df$id), function(x){-diff(x$value)}))
However, because of the nature of the diff() function, there are no results for the values in year 1, which means that after I flatten the diffs list of lists with Reduce(), I cannot add the actual yearly increases back into the data frame, like so:
df$actual <- Reduce(c, diffs) # flatten the list of lists
In this example, there are only 10 calculated differences or lags, while there are 15 rows in the data frame, so R throws an error when trying to add a new column.
How can I create a new column of actual increases with (1) the values for year 1 and (2) the calculated diffs/lags for all subsequent years?
This is the output I'm eventually looking for. My diffs list of lists calculates the actual values for years 2 and 3 just fine.
id value year actual
1 21 3 5
2 26 3 16
3 26 3 14
4 26 3 10
5 29 3 14
1 16 2 10
2 10 2 5
3 12 2 10
4 16 2 7
5 15 2 13
1 6 1 6
2 5 1 5
3 2 1 2
4 9 1 9
5 2 1 2
I think this will work for you. When you run into the diff problem, just lengthen the vector by putting a 0 in as the first number.
df <- df[order(df$id, df$year), ]
sdf <-split(df, df$id)
df$actual <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]
There's lots of ways to do this but this one is fairly fast and uses base.
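A related base idiom, sketched here in case it helps: ave() applies a function per group and returns the results already aligned with the input rows, so no splitting or re-flattening is needed once the data are sorted:
df <- df[order(df$id, df$year), ]
df$actual <- ave(df$value, df$id, FUN = function(x) diff(c(0, x)))
df[order(as.numeric(rownames(df))), ]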
Here are three more ways of approaching this problem, utilizing aggregate, by, and plyr:
aggregate:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]
by:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]
plyr:
df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(-df$year, df$temp),][, -1]
It gives you the final product of:
> df[order(as.numeric(rownames(df))),]
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2
EDIT: Avoiding the Loop
May I suggest avoiding the loop by turning what I gave you into a function (the by solution is the easiest one for me to work with) and using sapply to apply it to the two columns you desire.
set.seed(1234) #make new data with another numeric column
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- rbind(x, y, z)
df <- df.rep <- data.frame(df[, 1:2],
                           new.var = df[, 2] + sample(1:5, nrow(df), replace = T),
                           year = df[, 3])
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- data.frame(df, sapply(df[, 2:3], group.diff)) #apply group.diff to col 2:3
df[order(as.numeric(rownames(df))),] #reorder it
Of course you'd have to rename these unless you used transform as in:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))
df[order(as.numeric(rownames(df))),]
This would depend on how many variables you were doing this to.
1) diff.zoo. With the zoo package, it's just a matter of converting to zoo using split= and then performing the diff:
library(zoo)
zz <- zz0 <- read.zoo(df, split = "id", index = "year", FUN = identity)
zz[2:3, ] <- diff(zz)
It gives the following (in wide form rather than the long form you mentioned) where each column is an id and each row is a year minus the prior year:
> zz
1 2 3 4 5
1 6 5 2 9 2
2 10 5 10 7 13
3 5 16 14 10 14
The wide form shown may actually be preferable but you can convert it to long form if you want that like this:
dt <- function(x) as.data.frame.table(t(x))
setNames(cbind(dt(zz), dt(zz0)[3]), c("id", "year", "value", "actual"))
This puts the years in ascending order which is the convention normally used in R.
2) rollapply. Also using zoo this alternative uses a rolling calculation to add the actual column to your data. It assumes the data is structured as you show with the same number of years in each group arranged in order:
df$actual <- rollapply(df$value, 6, partial = TRUE, align = "left",
                       FUN = function(x) if (length(x) < 6) x[1] else x[1] - x[6])
3) subtraction. Making the same assumptions as in the prior solution, we can simplify further to this, which subtracts from each value the value 5 positions later in the vector:
transform(df, actual = value - c(tail(value, -5), rep(0, 5)))
or this variation:
transform(df, actual = replace(value, year > 1, -diff(ts(value), 5)))
EDIT: added rollapply and subtraction solutions.
Kind of hackish, but keeping your wonderful Reduce in place, you could add mock rows to your df for year 0:
mockRows <- data.frame(id = 1:5, value = 0, year = 0)
(df <- rbind(df, mockRows))
(df <- df[order(df$id, df$year), ])
(diffs <- lapply(split(df, df$id), function(x){diff(x$value)}))
(df <- df[df$year != 0,])
(df$actual <- Reduce(c, diffs)) # flatten the list of lists
df[order(as.numeric(rownames(df))),]
This is the output:
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2
