How to reorder dataframe rows based on conditions in more than 1 column in R? - r

The Problem
I am trying to reorder rows based on the conditions in 2 other columns. Specifically, I have a sequential ID for hundreds of randomly generated sampling transects called "ID_First" and then for each transect there is a corresponding "ID_Next" that represents the next transect that should be sampled. I am trying to reorder the rows such that the sampling transects are in order of execution rather than the original order based on "ID_First".
I know that data frames can be arranged based on one or more columns for numerical variables in either an ascending or descending way and, for factors, in an "ordered" way (e.g., high, medium, low). Is it possible to arrange the order of the rows based on the sequence of ID_First and then ID_Next? I have not been able to figure out how to do this so I have been doing it manually.
Simplified Reproducible Example
Data
# sequential ID for a small number of randomly generated transects
ID_First <- seq(1,10,1)
# represents the next transect that should be sampled following ID_First
ID_Next <- c(4,5,8,7,10,2,9,6,3,NA)
# make a dataframe
df <- cbind.data.frame(ID_First, ID_Next)
# look at the df
df
> ID_First ID_Next
> 1 1 4
> 2 2 5
> 3 3 8
> 4 4 7
> 5 5 10
> 6 6 2
> 7 7 9
> 8 8 6
> 9 9 3
> 10 10 NA
So, if you start with ID_First equal to 1 and then look at the corresponding ID_Next this would indicate that the next transect to sample is 4. Then you go to ID_First equal to 4 and the corresponding ID_Next to sample next would be 7, and so on. For this example, the order of sampling would progress as follows: 1,4,7,9,3,8,6,2,5,10.
Ideal Outcome
Here is what I am trying to accomplish:
> ID_First ID_Next
> 1 1 4
> 4 4 7
> 7 7 9
> 9 9 3
> 3 3 8
> 8 8 6
> 6 6 2
> 2 2 5
> 5 5 10
> 10 10 NA
Now the transects are following the order needed for sampling (e.g., 1 to 4, 4 to 7, 7 to 9, 9 to 3, etc. through 10) rather than the ascending ID_First.
Question
Is there an easy way to reorder the original data frame using ID_First equal to 1 as the starting point and then following the progression from ID_Next to ID_First to ID_Next to arrange the remainder of the transects?

You can use Reduce with match to find the chain from ID_First to ID_Next.
df[Reduce(function(i,j) match(df$ID_Next[i], df$ID_First)
, seq_len(nrow(df)), accumulate = TRUE),]
# ID_First ID_Next
#1 1 4
#4 4 7
#7 7 9
#9 9 3
#3 3 8
#8 8 6
#6 6 2
#2 2 5
#5 5 10
#10 10 NA
Data:
df <- data.frame(ID_First = 1:10, ID_Next = c(4,5,8,7,10,2,9,6,3,NA))
df
# ID_First ID_Next
#1 1 4
#2 2 5
#3 3 8
#4 4 7
#5 5 10
#6 6 2
#7 7 9
#8 8 6
#9 9 3
#10 10 NA

You can accomplish this for your specific example using a while loop and the match() function in R. I also used list.append() from the rlist package.
library(rlist)
# sequential ID for a small number of randomly generated transects
ID_First <- seq(1,10,1)
# represents the next transect that should be sampled following ID_First
ID_Next <- c(4,5,8,7,10,2,9,6,3,NA)
# make a dataframe
df <- cbind.data.frame(ID_First, ID_Next)
#create while loop to define target order
i = 1
order = c(i)
n = 1
while (n < length(df$ID_Next)){
j = df[df$ID_First == i, 2]
order = list.append(order, j)
i = j
n = n+1
}
#match df order to target order
df2 = df[match(order, df$ID_First),]
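The chain-following idea in the answers above can also be sketched in base R without rlist. This is only an illustration: follow_chain is a hypothetical helper name, and it assumes the walk should stop at the first NA (or at a repeated transect, to avoid looping forever on a cycle) rather than always taking nrow(df) steps.

```r
# Hypothetical helper: walk from `start` through first -> nxt links
# and return the row positions in visiting order.
follow_chain <- function(first, nxt, start = first[1]) {
  visited <- integer(0)
  pos <- match(start, first)
  # Stop when the next ID is NA / not found, or when a row repeats (cycle)
  while (!is.na(pos) && !(pos %in% visited)) {
    visited <- c(visited, pos)
    pos <- match(nxt[pos], first)
  }
  visited
}

df <- data.frame(ID_First = 1:10, ID_Next = c(4, 5, 8, 7, 10, 2, 9, 6, 3, NA))
ord <- follow_chain(df$ID_First, df$ID_Next)
df_ordered <- df[ord, ]
df_ordered
```

Rows that are never reached from the starting transect are simply left out of the result, which may or may not be what you want for your real data.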

Related

Comparing items in a list to a dataset in R

I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
1 1
2 2 1
3 3 1
4 4 1
5 5
6 6
7 7 2
8 8 2
9 9
10 10
How do I go about doing this? Thank you!
Here is one way to do it using boolean sum and %in%. If several match, then the last one is taken here:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(1:length(list_all), function(n) dat$Obs %in% list_all[[n]]*n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
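A possible alternative, sketched under the assumption that the first matching list should win when a value appears in several lists (the opposite of the answer above): flatten the lists into a lookup table and use match. The names lookup, value and list_id are made up for this example.

```r
dat <- data.frame(Obs = 1:10)
list_all <- list(2:4, c(7, 8))

# One row per list item, tagged with the index of the list it came from
lookup <- data.frame(
  value   = unlist(list_all),
  list_id = rep(seq_along(list_all), lengths(list_all))
)

# match() returns the first hit, so earlier lists take priority; no hit gives NA
dat$List <- lookup$list_id[match(dat$Obs, lookup$value)]
dat
```

This avoids building an n-by-16 matrix and should scale comfortably to 8,000 observations and a few thousand list items.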

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both columns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. Then the same goes for grp2. All 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
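If igraph is not available, the same connected-components idea can be sketched in base R by repeatedly spreading the smallest label within each grp1 group and each grp2 group until nothing changes. This is an illustrative sketch, not the method used above; combine_groups is a hypothetical name.

```r
# Hypothetical helper: label rows by connected component, where rows are
# connected if they share a grp1 value or a grp2 value.
combine_groups <- function(g1, g2) {
  a <- match(g1, unique(g1))   # integer codes for grp1
  b <- match(g2, unique(g2))   # integer codes for grp2
  label <- a                   # start with one label per grp1 group
  repeat {
    new <- ave(label, a, FUN = min)  # spread smallest label within grp1
    new <- ave(new, b, FUN = min)    # then within grp2
    if (all(new == label)) break     # stable: labels are components
    label <- new
  }
  match(label, unique(label))        # renumber as 1, 2, 3, ...
}

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
df$combinedGrp <- combine_groups(df$grp1, df$grp2)
df
```

Labels only ever decrease, so the loop terminates; for long chains of linked groups it may need several passes, so a proper graph library is likely faster on large data.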
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to @Frank's graph answer but uses an adjacency matrix rather than edges to define the graph. An advantage of this approach is that it can handle more than two grouping columns with the same code (so long as you write the function that determines links flexibly). A disadvantage is that you need to make all pairwise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are:
1. compare rows based on groups and define these rows as linked (i.e., create a graph)
2. determine connected components of the graph defined by the links in 1.
You could do 2 a few ways. Below I show a brute force way where you 2a) collapse links, till reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. construct an adjacency matrix (matrix of pairwise links) between rows
(i.e., if they are in the same group, the matrix entry is 1, otherwise it's 0). First, make a helper function that determines whether two rows are linked:
linked_rows <- function(data){
## helper function
## returns a _function_ to compare two rows of data
## based on group membership.
## Use Vectorize so it works even on vectors of indices
Vectorize(function(i, j) {
## numeric: 1= i and j have overlapping group membership
common <- vapply(names(data), function(name)
data[i, name] == data[j, name],
FUN.VALUE=FALSE)
as.numeric(any(common))
})
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
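As a side note, the same kind of 0/1 matrix can probably be built without Vectorize by comparing each column against itself with outer and combining the results with a logical OR. A minimal sketch, assuming all grouping columns can be compared with == and contain no NA values; adjacency_from_groups and A_alt are hypothetical names.

```r
# Hypothetical helper: 1 where two rows share a value in at least one column
adjacency_from_groups <- function(data) {
  # One logical n x n matrix per column: TRUE where values are equal
  per_column <- lapply(data, function(col) outer(col, col, "=="))
  # Rows are linked if any column matches; multiply by 1 to get 0/1
  Reduce(`|`, per_column) * 1
}

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
A_alt <- adjacency_from_groups(df)
A_alt[1:6, 1:6]
```

This still makes all pairwise comparisons, so the memory cost grows with the square of the number of rows.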
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
A <- A %*% A
A[A > 0] <- 1
A
}
repeat this till the links are stable
oldA <- 0
i <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way, is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
oldA <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
}
This works for the original df and also for the structure in @rawr's answer
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with @Frank's answer more clear:
lump2 <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
df$combinedGrp <- cluster_A$membership
df
}
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2,df$grp1)
parent <- vector('integer',length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1,length(split_df)-1)){
for (j in seq(i+1,length(split_df))){
inter <- intersect(split_df[[i]],split_df[[j]])
if (length(inter) > 0){
parent[j] <- i
}
}
}
ans <- vector('list',length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element common
for (i in seq_along(index)){
ans[[index[i]]] <- rep(i,length(split_df[[index[i]]]))
}
rest_index <- seq(1,length(split_df))[-index]
for (i in rest_index){
val <- ans[[parent[i]]][1]
ans[[i]] <- rep(val,length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(dplyr) # for group_by() and summarize(); loaded first so igraph::groups() is not masked
library(igraph)
library(sf)
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc),nrow(nc)*.75),] #drop some polygons
# Find intersections
b <- st_intersects(nc, sparse = FALSE)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
for(j in 1:length(gr)){
if(i %in% gr[[j]]){
nc[i,'group'] <- j
}
}
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD per each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code that keeps only the rows within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (by using a if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we need to remove the outliers for each column separately. But, if we need to keep only the rows that have no outliers for the numeric columns, we can loop through the columns with lapply as before, instead of returning 'x', we return the sequence of 'x' and then get the intersect of the list element with Reduce. The numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: one that identifies the numeric columns, and a second that checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function.
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
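The scale-based filter can be wrapped in a small function so the cutoff becomes a parameter. This is a sketch only: keep_within_sd and its argument k are hypothetical names, and it assumes missing values should not count as outliers.

```r
# Hypothetical helper: keep rows where every numeric column is within k SD
keep_within_sd <- function(data, k = 2) {
  is_num <- vapply(data, is.numeric, logical(1))
  z <- abs(scale(data[is_num]))             # column-wise z-scores
  outlier_count <- rowSums(z > k, na.rm = TRUE)
  data[outlier_count == 0, , drop = FALSE]
}

df <- data.frame(
  target  = c(3, 3, 1, 1, 1, 6, 6, 6),
  birds   = c(21, 8, 2, 2, 8, 1, 7, 1),
  wolfs   = c(7, 4, 8, 3, 3, 2, 1, 5),
  Country = c("a", "b", "c", "a", "a", "a", "b", "c")
)
res <- keep_within_sd(df)
res
```

Note that the mean and SD are computed once on the full data, so removing an outlier does not change the cutoffs for the remaining rows.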

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument cuts the negated value of the argument from the tail, try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
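To address the second part of the question (summing or averaging the offset range), one option is to build the lagged values as a matrix and apply rowSums or rowMeans to it. A minimal sketch, assuming offset is at least 1 and the window is shorter than the vector; lagged_window, prev_sum and prev_mean are hypothetical names.

```r
# Hypothetical helper: matrix whose row i holds the `size` values of x that
# end `offset` positions before i (oldest value in the first column).
lagged_window <- function(x, offset = 1, size = 2) {
  stopifnot(offset >= 1, offset + size - 1 < length(x))
  sapply(seq_len(size), function(k) {
    lag <- offset + size - k          # how many rows back this column looks
    c(rep(NA, lag), head(x, -lag))    # shift down by `lag`, pad with NA
  })
}

df <- data.frame(a = 1:10)
w <- lagged_window(df$a, offset = 1, size = 2)
df$prev_sum  <- rowSums(w, na.rm = TRUE)   # rows with no history sum to 0
df$prev_mean <- rowMeans(w, na.rm = TRUE)  # rows with no history give NaN
df
```

For row 4 the window holds the values from rows 2 and 3, matching the mockup in the question.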
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Randomly choose value between 1 and 10 with equal number of instances [duplicate]

I would like to generate 2000 random numbers between 1 and 10 such that for each random number I have the same number of instances.
In this case 200 for each number.
What should be random is the order in which it is generated.
I have the following problem:
I have an array with 2000 entries but not each with unique values, for example it starts like this:
11112233333333344445667777777777
and consists of 2000 entries.
I would like to generate random numbers and assign each UNIQUE value a separate random number but have an entry for each value
So my intended result would look like this:
original array: 11112233333333344445667777777777
random numbers: 33334466666666699991778888888888
You could do this in a few steps:
my_numbers <- rep(1:10, each=200)
my_randomizer <- sample(seq_along(my_numbers), length(my_numbers))
my_random_numbers <- my_numbers[my_randomizer]
Based on the edits:
I would use rle. It sounds like you don't have an array, but instead a vector:
my_array_rled <- rle(my_array)
my_random_numbers <- sample(1:10, length(unique(my_array)))
my_array_rled$values <- factor(my_array_rled$values)
levels(my_array_rled$values) <- my_random_numbers
my_array_randomized <- inverse.rle(my_array_rled)
If I understand you correctly, you can use "rep" to replicate the values 1 to 10 200 times each and "sample" to shuffle the resulting vector.
x <- sample(rep(1:10, 200))
A non-vectorized version:
# using a seed for reproducible example
set.seed(2)
original_array <- c(1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,6,6,7,7,7,7,7,7,7,7,7,7)
random_numbers <- numeric(length=length(original_array))
rdnum <- sample(unique(original_array), length(unique(original_array)))
for ( i in 1:length(unique(original_array)))
random_numbers[original_array == i] <- rdnum[i]
random_numbers
2 2 2 2 5 5 3 3 3 3 3 3 3 3 3 1 1 1 1 6 7 7 4 4 4 4 4 4 4 4 4 4
The table function with sample comes in quite handy for this scenario:
set.seed(1)
## ASSUMING ORIGINAL IS A VECTOR
original <- c(1, 1, 1, 1, 2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,6,6,7,7,7,7,7,7,7,7,7,7)
## CREATE A TABLE OF ALL THE VALUES
tabl <- table(original)
## RNG is the sample range to select from. Assuming 1:10 in this example
RNG <- 1:10
## PICK VALUES RANDOMLY FROM RNG
tabl[] <- sample(RNG, length(tabl), replace=FALSE)
# note that the `names` of `tabl` will contain the values from `original`
# whereas the values of `tabl` will contain the new random value.
## ASSIGN NEW VALUES
randomNums <- original
for(i in seq(length(tabl)))
randomNums[ original==as.numeric(names(tabl))[[i]] ] <- tabl[[i]]
Results:
rbind(orig=original, rand=randomNums)
orig: 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 5 6 6 7 7 7 7 7 7 7 7 7 7
rand: 3 3 3 3 4 4 5 5 5 5 5 5 5 5 5 7 7 7 7 2 8 8 9 9 9 9 9 9 9 9 9 9
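The loops in the last two answers can likely be replaced by a single match lookup. A sketch, assuming each distinct original value should map to a distinct random value drawn from 1:10; the variable names are made up for this example.

```r
set.seed(42)  # only for reproducibility of the example
original <- c(1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,6,6,7,7,7,7,7,7,7,7,7,7)

uniq     <- unique(original)
# One distinct random value per distinct original value (no replacement)
new_vals <- sample(1:10, length(uniq))
# Look up each element's position among the unique values, then translate
randomized <- new_vals[match(original, uniq)]

rbind(orig = original, rand = randomized)
```

Because match works on values rather than positions, this does not require the original values to be the integers 1, 2, 3, and so on.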
