How to subset a data.frame according to the values of last two rows? - r

###the original data
df1 <- data.frame(a=c(2,2,5,5,7), b=c(1,5,4,7,6))
df2 <- data.frame(a=c(2,2,5,5,7,7), b=c(1,5,4,7,6,3))
When the values of column a in the last two rows are not equal (here the 4th row differs from the 5th row, namely 5 != 7), I want to keep only the last row.
#input
> df1
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
#output
> df1
a b
1 7 6
When the values of column a in the last two rows are equal (here the 5th row equals the 6th row, namely 7 == 7), I want to keep the last two rows.
#input
> df2
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
6 7 3
#output
> df2
a b
1 7 6
2 7 3

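You can write a function that compares the values of a column in the last two rows:
return_rows <- function(data) {
  n <- nrow(data)
  # With fewer than two rows there is nothing to compare, so return the data as is
  if (n < 2) return(data)
  if (data$a[n] == data$a[n - 1]) tail(data, 2) else tail(data, 1)
}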
return_rows(df1)
# a b
#5 7 6
return_rows(df2)
# a b
#5 7 6
#6 7 3

Try it this way:
library(tidyverse)
df1 %>%
  filter(a == last(a))
# a b
# 5 7 6
df2 %>%
  filter(a == last(a))
# a b
# 5 7 6
# 6 7 3

We can use subset from base R
subset(df1, a == a[length(a)])
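The value-based filters above keep every row whose a equals the final value, not only the trailing rows. For df1 and df2 that gives the same result as the positional check, but the two approaches can differ when the final value also occurs earlier. A minimal sketch, assuming a hypothetical data frame df3 that is not part of the question:

```r
# Hypothetical data: the final value 7 also appears in the first row
df3 <- data.frame(a = c(7, 2, 5, 7), b = c(1, 5, 4, 7))

# Positional check: compares only the last two rows
return_rows <- function(data) {
  n <- nrow(data)
  if (n < 2) return(data)
  if (data$a[n] == data$a[n - 1]) tail(data, 2) else tail(data, 1)
}

by_position <- return_rows(df3)               # last row only
by_value <- subset(df3, a == a[length(a)])    # rows 1 and 4
```

If only the last one or two rows should ever be returned, the positional version is the safer choice.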

Related

sorting data frame columns based on specific value in each column

I am using the Tidyverse package in R. I have a data frame with 20 rows and 500 columns. I want to sort all the columns based on the size of the value in the last row of each column.
Here is an example with just 3 rows and 4 columns:
1 2 3 4,
5 6 7 8,
8 7 9 1
The desired result is:
3 1 2 4,
7 5 6 8,
9 8 7 1
I searched stack overflow but could not find an answer to this type of question.
If we want to use dplyr from tidyverse, we can use slice to get the last row and then use order in decreasing order to subset columns.
library(dplyr)
# unlist() flattens the one-row data frame into a plain vector for order()
df[df %>% slice(n()) %>% unlist() %>% order(decreasing = TRUE)]
# V3 V1 V2 V4
#1 3 1 2 4
#2 7 5 6 8
#3 9 8 7 1
The base R equivalent is
df[order(unlist(df[nrow(df), ]), decreasing = TRUE)]
data
df <- read.table(text = "1 2 3 4
5 6 7 8
8 7 9 1")
The following reorders the data frame columns by the values in the last row:
df <- data.frame(col1=c(1,5,8),col2=c(2,6,7),col3=c(3,7,9),col4=c(4,8,1))
last_row <- unlist(df[nrow(df),])
df <- df[,order(last_row,decreasing = TRUE)]
First, extract the last row as a vector. Then pass it to order() to get the column positions in decreasing order, and use those positions to reorder the columns.
>df
col3 col1 col2 col4
1 3 1 2 4
2 7 5 6 8
3 9 8 7 1
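The same idea can be wrapped in a small helper. This is only a sketch: the function name sort_cols_by_last_row is made up for illustration, and it assumes all columns are numeric so that the last row can be flattened into a plain vector before calling order().

```r
# Hypothetical helper: reorder columns by the values in the last row
sort_cols_by_last_row <- function(data, decreasing = TRUE) {
  # Flatten the last row into a named numeric vector
  last_vals <- unlist(data[nrow(data), ])
  # order() returns the column positions in sorted order
  data[, order(last_vals, decreasing = decreasing), drop = FALSE]
}

df <- data.frame(V1 = c(1, 5, 8), V2 = c(2, 6, 7), V3 = c(3, 7, 9), V4 = c(4, 8, 1))
sorted <- sort_cols_by_last_row(df)
names(sorted)
# "V3" "V1" "V2" "V4"
```

Using drop = FALSE keeps the result a data frame even when the input has a single column.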

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD per each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code that keeps only the rows within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd. Note that the result is a list whose elements can have different lengths.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we need to remove the outliers for each column separately. If instead we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before, but return the row positions that pass the check rather than the values. Reduce with intersect then gives the positions common to all list elements, and that numeric index can be used to subset the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
  seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: the first one indicates the numeric columns, the second one checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
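The scale-based filter above can be generalised to an arbitrary threshold. The following is a minimal sketch, not a definitive implementation: the helper name keep_within_sd and its k argument are assumptions made for illustration, and it assumes the numeric columns contain no missing values and are not constant.

```r
# Toy data from the question
df <- data.frame(
  target = c(3, 3, 1, 1, 1, 6, 6, 6),
  birds = c(21, 8, 2, 2, 8, 1, 7, 1),
  wolfs = c(7, 4, 8, 3, 3, 2, 1, 5),
  Country = c("a", "b", "c", "a", "a", "a", "b", "c")
)

# Hypothetical helper: keep rows whose numeric values all lie within k SD
keep_within_sd <- function(data, k = 2) {
  is_num <- vapply(data, is.numeric, logical(1))
  # scale() centres each column and divides by its sd, giving z-scores
  z <- abs(scale(data[is_num]))
  # A row is kept only if every numeric column passes the check
  inside <- rowSums(z <= k) == sum(is_num)
  data[inside, , drop = FALSE]
}

kept <- keep_within_sd(df)
```

On the toy data only the first row is dropped, because birds = 21 lies more than 2 SD from the column mean.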

Return row number(s) for a particular value in a column in a dataframe

I have a data frame (df) and I was wondering how to return the row number(s) for a particular value (2585) in the 4th column (height_chad1) of the same data frame?
I've tried:
row(mydata_2$height_chad1, 2585)
and I get the following error:
Error in factor(.Internal(row(dim(x))), labels = labs) :
a matrix-like object is required as argument to 'row'
Is there an equivalent line of code that works for data frames instead of matrix-like objects?
Any help would be appreciated.
Use which(mydata_2$height_chad1 == 2585)
Short example
df <- data.frame(x = c(1,1,2,3,4,5,6,3),
y = c(5,4,6,7,8,3,2,4))
df
x y
1 1 5
2 1 4
3 2 6
4 3 7
5 4 8
6 5 3
7 6 2
8 3 4
which(df$x == 3)
[1] 4 8
length(which(df$x == 3))
[1] 2
plyr::count(df, vars = "x")
x freq
1 1 2
2 2 1
3 3 2
4 4 1
5 5 1
6 6 1
df[which(df$x == 3),]
x y
4 3 7
8 3 4
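As Matt Weller pointed out, wrapping which() in length() gives the number of matching rows.
The count function in plyr can be used to return the count of each unique column value.
To search the whole data frame rather than a single column, use which with arr.ind = TRUE, which returns the row and column index of every match (my.val stands for the value you are looking for):
which(df==my.val, arr.ind=TRUE)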
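As an illustration of the arr.ind form, here is a short sketch on the example data frame from above. The variable names my.val and hits are placeholders chosen for this example.

```r
df <- data.frame(x = c(1, 1, 2, 3, 4, 5, 6, 3),
                 y = c(5, 4, 6, 7, 8, 3, 2, 4))
my.val <- 3

# Row numbers where column x equals the value
rows_x <- which(df$x == my.val)

# Row and column positions of every match anywhere in the data frame
hits <- which(df == my.val, arr.ind = TRUE)
hits
```

Each row of hits describes one match: the "row" column holds the row number and the "col" column holds the column position (1 for x, 2 for y).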

Select max or equal value from several columns in a data frame

I'm trying to select the column with the highest value for each row in a data.frame. So for instance, the data is set up as such.
> df <- data.frame(one = c(0:6), two = c(6:0))
> df
one two
1 0 6
2 1 5
3 2 4
4 3 3
5 4 2
6 5 1
7 6 0
Then I'd like to set another column based on those rows. The data frame would look like this.
> df
one two rank
1 0 6 2
2 1 5 2
3 2 4 2
4 3 3 3
5 4 2 1
6 5 1 1
7 6 0 1
I imagine there is some sort of way that I can use plyr or sapply here but it's eluding me at the moment.
There might be a more efficient solution, but
ranks <- apply(df, 1, which.max)
# Mark ties between the two columns with 3
ranks[df[, 1] == df[, 2]] <- 3
df$rank <- ranks
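A vectorised alternative is max.col from base R, which returns the position of the largest value in each row of a matrix. The sketch below is one possible variant, not the answer's method: it generalises the tie code to "number of columns plus one", which is an assumption that happens to match the 3 used in the question.

```r
df <- data.frame(one = 0:6, two = 6:0)
m <- as.matrix(df)

# Position of the row maximum; ties.method = "first" makes the result deterministic
rank <- max.col(m, ties.method = "first")

# A row is a tie when more than one column equals the row maximum
row_max <- apply(m, 1, max)
is_tie <- rowSums(m == row_max) > 1
rank[is_tie] <- ncol(m) + 1

df$rank <- rank
df$rank
# 2 2 2 3 1 1 1
```

The comparison m == row_max recycles row_max down the columns, so each row is compared against its own maximum.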

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result is stored in two columns here, because an ordinary (atomic) data frame column holds a single value per row; the alternative would be a list column. (Or complex numbers, if you want to resort to that.) head with a negative argument drops that many elements from the tail; try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
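To address the second part of the question (summing or averaging the offset range), the lagged columns can be fed to rowSums or rowMeans. A minimal sketch, assuming a hypothetical helper lag_vec that is only valid for 1 <= k < length(x):

```r
# Hypothetical helper: shift x down by k positions, padding the top with NA
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

df <- data.frame(a = 1:10)
df$b1 <- lag_vec(df$a, 1)  # value one row above
df$b2 <- lag_vec(df$a, 2)  # value two rows above

# Sum of the window; NA wherever the window is incomplete
df$window_sum <- rowSums(df[c("b1", "b2")])

# Mean of whatever values are available in the window
df$window_mean <- rowMeans(df[c("b1", "b2")], na.rm = TRUE)
```

With na.rm = TRUE the second row averages the single available value, while the first row has no values at all and yields NaN.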
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8
