Given this data frame:
library(dplyr)
dat <- data.frame(
bar = c(letters[1:10]),
foo = c(1,2,3,5,8,9,11,13,14,15)
)
bar foo
1 a 1
2 b 2
3 c 3
4 d 5
5 e 8
6 f 9
7 g 11
8 h 13
9 i 14
10 j 15
I first want to identify groups, if the foo number is consecutive:
dat <- dat %>% mutate(in_cluster =
ifelse( lead(foo) == foo +1 | lag(foo) == foo -1,
TRUE,
FALSE))
Which leads to the following data frame:
bar foo in_cluster
1 a 1 TRUE
2 b 2 TRUE
3 c 3 TRUE
4 d 5 FALSE
5 e 8 TRUE
6 f 9 TRUE
7 g 11 FALSE
8 h 13 TRUE
9 i 14 TRUE
10 j 15 TRUE
As can be seen, the values 1,2,3 form a group, then value 5 is on its own and does not belong to a cluster, then values 8,9 form another cluster, and so on.
I would like to add cluster numbers to these "groups".
Expected output:
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
7 g 11 FALSE NA
8 h 13 TRUE 3
9 i 14 TRUE 3
10 j 15 TRUE 3
There is probably a better tidyverse approach for something like this. For example, group_indices could be used if in_cluster is defined through an arbitrary-length case_when. However, we can also implement our own method that deals specifically with run lengths of logical values, using the rle function.
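As a quick illustration of what rle returns for the in_cluster column above (a minimal sketch; the variable names are only for illustration), each run of identical values is reduced to a length and a value, which can then be numbered and expanded again:

```r
# in_cluster values from the data frame above
in_cluster <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

runs <- rle(in_cluster)
runs$lengths  # how long each run is
runs$values   # the value repeated in each run

# Number the TRUE runs, blank out the FALSE runs, then expand back to row level
ids <- cumsum(runs$values)
ids[!runs$values] <- NA
cluster_number <- rep(ids, runs$lengths)
cluster_number
```

The solutions below wrap these same steps in a function.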
solution 1 (R version >= 4.1, for the native pipe and the \(x) lambda syntax)
lgl_indices <- function(var){
x <- rle(var)
cumsum(x[[2]]) |> (\(.){ .[which(!x[[2]], T)] <- NA ; .})() |> rep(x[[1]])
}
solution 2
lgl_indices <- function(var){
x <- rle(var)
y <- cumsum(x$values)
y[which(x$values == F)] <- NA
rep(y, x$lengths)
}
solution 3
lgl_indices <- function(var){
x <- rle(var)
l <- vector("list", length(x$lengths))
n <- 1L
for (i in seq_along(x[[1]])) {
if(!x$values[i]) grp <- NA else {
grp <- n
n <- n + 1L
}
l[[i]] <- rep(grp, x$lengths[i])
}
Reduce(c, l)
}
dat %>%
mutate(cluster_number = lgl_indices(in_cluster))
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
This may not be the most efficient way, but it works for this data (it assumes every FALSE row is isolated, since each FALSE increments the counter):
# Cumulative sum of the negated logical
dat$new_cluster <- cumsum(!dat$in_cluster)+1
# using in_cluster to subset and replacing the cluster number for FALSE by NA
dat[!dat$in_cluster,]$new_cluster <- NA
dat
bar foo in_cluster new_cluster
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
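One caveat with approaches that number runs of in_cluster: two clusters that follow each other without a FALSE row in between (for example foo = 1, 2, 4, 5, where every row is in some cluster) would be merged into one. A minimal sketch that works from foo directly is shown here; cluster_ids is a hypothetical helper name, not part of any package, and it assumes foo is sorted in increasing order:

```r
# Hypothetical helper: number runs of consecutive integers, NA for singletons
cluster_ids <- function(v) {
  # start a new run whenever the step to the previous value is not exactly 1
  run_id <- cumsum(c(TRUE, diff(v) != 1))
  # size of the run each element belongs to
  run_size <- ave(v, run_id, FUN = length)
  in_cluster <- run_size > 1
  # renumber the runs that are real clusters as 1, 2, 3, ...; singletons become NA
  match(run_id, unique(run_id[in_cluster]))
}

foo <- c(1, 2, 3, 5, 8, 9, 11, 13, 14, 15)
cluster_ids(foo)
cluster_ids(c(1, 2, 4, 5))
```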
This is my data
df <- data.frame (Var1 = c("a", "b", "c","d","e","b","c","d","e","c","d","e","d","e","e"),
Var2 = c("a","a","a","a","a","b","b","b","b","c","c","c","d","d","e"),
pre = c(1,2,3,4,5,1,6,7,8,1,9,10,1,11,1) )
I would like to build a symmetric matrix in R with the Var1 and Var2 values as row names and column names, where each cell holds the corresponding number in "pre", like this:
a b c d e
a 1 2 3 4 5
b 2 1 6 7 8
c 3 6 1 9 10
d 4 7 9 1 11
e 5 8 10 11 1
This seems like an easy problem, but I have searched through many posts without finding a solution, so I am asking here. Thank you!
Mengying
You can get the data in wide format first.
library(dplyr)
library(tidyr)
library(tibble) # for column_to_rownames()
mat <- df %>%
pivot_wider(names_from = Var2, values_from = pre, values_fill = 0) %>%
column_to_rownames('Var1') %>%
as.matrix()
mat
# a b c d e
#a 1 0 0 0 0
#b 2 1 0 0 0
#c 3 6 1 0 0
#d 4 7 9 1 0
#e 5 8 10 11 1
Since you have a symmetric matrix you can copy the lower triangular matrix to upper triangle.
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
mat
# a b c d e
#a 1 2 3 4 5
#b 2 1 6 7 8
#c 3 6 1 9 10
#d 4 7 9 1 11
#e 5 8 10 11 1
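To see why the upper.tri assignment works, here is a small self-contained sketch on a 3 x 3 matrix (the matrix m is made up for illustration). Both sides use the same logical mask, so each upper-triangle cell [i, j] receives t(m)[i, j], which is m[j, i]:

```r
# Lower triangle filled, upper triangle still zero
m <- matrix(c(1, 2, 3,
              0, 1, 6,
              0, 0, 1), nrow = 3)

upper.tri(m)                        # logical mask of the cells above the diagonal
m[upper.tri(m)] <- t(m)[upper.tri(m)]
m
isSymmetric(m)
```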
data
df <- data.frame (Var1 = c("a", "b", "c","d","e","b","c","d","e","c","d","e","d","e","e"),
Var2 = c("a","a","a","a","a","b","b","b","b","c","c","c","d","d","e"),
pre = c(1,2,3,4,5,1,6,7,8,1,9,10,1,11,1) )
Here is an option with the igraph package
library(igraph)
g <- graph_from_data_frame(df,directed = FALSE)
E(g)$pre <- df$pre
get.adjacency(g,attr = "pre")
which gives
a b c d e
a 1 2 3 4 5
b 2 1 6 7 8
c 3 6 1 9 10
d 4 7 9 1 11
e 5 8 10 11 1
Base R solution (using data provided by Ronak):
# Crosstab:
mdat <- as.data.frame.matrix(xtabs(pre ~ Var1 + Var2, df))
# Reflect on the diag (thanks @Ronak Shah):
mdat[upper.tri(mdat)] <- t(mdat)[upper.tri(mdat)]
As @ThomasIsCoding points out, we can also use this one-liner:
xtabs(pre ~ ., unique(rbind(df, cbind(setNames(rev(df[-3]), names(df)[-3]), df[3] ))))
As @thelatemail points out, we can also do:
xtabs(pre ~ ., unique(data.frame(Map(c, df, df[c(2,1,3)]))))
Here's a base R version:
df <- data.frame (Var1 = c("a", "b", "c","d","e","b","c","d","e","c","d","e","d","e","e"),
Var2 = c("a","a","a","a","a","b","b","b","b","c","c","c","d","d","e"),
pre = c(1,2,3,4,5,1,6,7,8,1,9,10,1,11,1))
# Generate a new matrix filled with zeros
mat2 <- matrix(0, length(unique(df$Var1)), length(unique(df$Var2)))
# Name the columns and rows so we can access values
rownames(mat2) <- unique(df$Var1)
colnames(mat2) <- unique(df$Var2)
# Save values into the appropriate cells of the matrix
mat2[as.matrix(df[, 1:2])] <- as.matrix(df[, 3])
# Using upper triangle trick from @Ronak Shah's answer
mat2[upper.tri(mat2)] <- t(mat2)[upper.tri(mat2)]
# See results
mat2
# a b c d e
# a 1 2 3 4 5
# b 2 1 6 7 8
# c 3 6 1 9 10
# d 4 7 9 1 11
# e 5 8 10 11 1
I have the following dataset:
df <- data.frame(c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5), c("a","a","a","b","b","b","b","b","b","b","b",
"a","a","a","b","b","b"),
c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300, 295,295),
c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t"))
colnames(df) <- c("ID", "Group", "Price", "OtherNumber")
> df
ID Group Price OtherNumber
1 1 a 300 c
2 1 a 295 d
3 1 a 295 e
4 2 b 25 f
5 2 b 25 g
6 2 b 25 h
7 2 b 25 i
8 3 b 25 j
9 3 b 20 l
10 3 b 20 m
11 3 b 20 n
12 4 a 300 o
13 4 a 295 p
14 4 a 295 q
15 5 b 300 r
16 5 b 295 s
17 5 b 295 t
I want to compare the first price of consecutive IDs. Two consecutive IDs should be flagged only if they have the same initial price and are in the same group. For example, when I compare IDs 1 and 2, both the group (a vs. b) and the initial price (300 vs. 25) differ. IDs 2 and 3, on the other hand, are both in group b and have the same initial price of 25 (cf. rows 4 and 8). The later prices do not matter and may differ.
I figure I should be able to do this with the dplyr package, and I have a very rough attempt (which does not work yet).
# Load dplyr
library(dplyr)
# Assign row numbers within IDs
df1 <- df %>%
group_by(ID) %>%
mutate(subID = row_number())
# Isolate first observation in ID
df2 <- df1[df1$subID == 1,]
# Set up loop to iterate through IDs
for (i in 2:length(df2)) {
if (df2$Price[i] - df2$Price[i - 1] == 0) {
df2$flag <- TRUE
} else {
df2$flag <- FALSE
}
}
If you tell me that this is the only possible solution, I will obviously devote more resources to it, but I am sure there must be an easier way. I checked on SO and maybe I missed something, but I was not able to find anything along these lines. Thanks!
The output I want to get looks something like this:
ID Group Price OtherNumber flag
1 1 a 300 c FALSE
2 1 a 295 d FALSE
3 1 a 295 e FALSE
4 2 b 25 f TRUE
5 2 b 25 g TRUE
6 2 b 25 h TRUE
7 2 b 25 i TRUE
8 3 b 25 j TRUE
9 3 b 20 l TRUE
10 3 b 20 m TRUE
11 3 b 20 n TRUE
12 4 a 300 o FALSE
13 4 a 295 p FALSE
14 4 a 295 q FALSE
15 5 b 300 r FALSE
16 5 b 295 s FALSE
17 5 b 295 t FALSE
Here is a data.table oneliner... cut into smaller pieces to view intermediate results; also see explanation at the bottom of the answer.
library(data.table)
dt <- as.data.table( df )
dt[ dt[ , .SD[1], ID][ ( Group == shift( Group, type = "lead") & Price == shift( Price, type = "lead") ) |
( Group == shift( Group, type = "lag") & Price == shift( Price, type = "lag") ),
flag := TRUE][is.na(flag), flag := FALSE], flag := i.flag, on = .(ID)][]
# ID Group Price OtherNumber flag
# 1: 1 a 300 c FALSE
# 2: 1 a 295 d FALSE
# 3: 1 a 295 e FALSE
# 4: 2 b 25 f TRUE
# 5: 2 b 25 g TRUE
# 6: 2 b 25 h TRUE
# 7: 2 b 25 i TRUE
# 8: 3 b 25 j TRUE
# 9: 3 b 20 l TRUE
# 10: 3 b 20 m TRUE
# 11: 3 b 20 n TRUE
# 12: 4 a 300 o FALSE
# 13: 4 a 295 p FALSE
# 14: 4 a 295 q FALSE
# 15: 5 b 300 r FALSE
# 16: 5 b 295 s FALSE
# 17: 5 b 295 t FALSE
explanation:
dt[ , .SD[1], ID] creates a data.table with the first row of each ID
[ Group == shift( ... , flag := TRUE] sets the column flag to TRUE when the next (or previous) row has matching Price and Group.
[is.na(flag), flag := FALSE] fills in the rest (which is not TRUE) with FALSE
..flag := i.flag, on = .(ID)] performs a left join (by reference, so it's fast and efficient) on the original data.table, to get the final result.
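The same three steps (take the first row per ID, compare each with its neighbours, join the flag back) can also be sketched in base R without extra packages. This is only an illustration under the assumption that the rows are already ordered by ID; the helper variables first, same_next and same_prev are names chosen here:

```r
df <- data.frame(
  ID = c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5),
  Group = c("a","a","a","b","b","b","b","b","b","b","b","a","a","a","b","b","b"),
  Price = c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300,295,295),
  OtherNumber = c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t")
)

# First row of each ID
first <- df[!duplicated(df$ID), c("ID", "Group", "Price")]
n <- nrow(first)

# Does an ID match the next ID on both Group and initial Price? (last ID has no next)
same_next <- c(first$Group[-n] == first$Group[-1] &
               first$Price[-n] == first$Price[-1], FALSE)
# Shift by one to get "matches the previous ID" (first ID has no previous)
same_prev <- c(FALSE, same_next[-n])
first$flag <- same_next | same_prev

# Carry the per-ID flag back to every row of the original data
df$flag <- first$flag[match(df$ID, first$ID)]
df
```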
This is my sample data.
index <- c(1,2,3,4,5,6,7,8,9,10)
a <- c('a','b','c',NA,'D','e',NA,'g','h','i')
data <- data.frame(index,a)
What I would like to do is create a new column, name, where only 'a' and 'b' stay. All other values, like 'c', 'd', 'e', ... will be tagged as others, while NA stays NA.
data$name = ifelse(!grepl('(a|b)',data$a),'others',data$name)
I tried to use the grepl function, but it does not seem to work with data that has missing values.
In base R:
data$res <- as.character(data$a)
data$res[! data$a %in% c("a","b") & !is.na(data$a)] <- "Other"
data
# index a res
# 1 1 a a
# 2 2 b b
# 3 3 c Other
# 4 4 <NA> <NA>
# 5 5 D Other
# 6 6 e Other
# 7 7 <NA> <NA>
# 8 8 g Other
# 9 9 h Other
# 10 10 i Other
Note that the new column is of type character here.
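For context on why the attempt in the question misbehaves: grepl returns FALSE rather than NA for a missing value, so !grepl(...) is TRUE and NA rows get tagged as well. The unanchored pattern '(a|b)' would also match any string that merely contains a or b. A minimal sketch that handles NA explicitly (the short vector a is made up for illustration):

```r
a <- c("a", "b", "c", NA, "D")

grepl("^(a|b)$", a)   # the NA element yields FALSE, not NA

# Handle NA first, then keep exact matches, then tag everything else
res <- ifelse(is.na(a), NA_character_,
              ifelse(a %in% c("a", "b"), a, "others"))
res
```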
Using dplyr and its recode function, you could do
library(dplyr)
data %>% mutate(name=recode(a, a="a", b="b", .default="other"))
# index a name
# 1 1 a a
# 2 2 b b
# 3 3 c other
# 4 4 <NA> <NA>
# 5 5 D other
# 6 6 e other
# 7 7 <NA> <NA>
# 8 8 g other
# 9 9 h other
# 10 10 i other
With a more complicated match, you might use case_when instead
data %>% mutate(name=case_when(
is.na(a) ~ NA_character_,
a %in% c("a","b") ~ as.character(a),
TRUE ~ "other"))
I would like to remove outliers (that is, remove the rows containing them) within each group (by BRMA_Name) of a data frame. My example data is as follows:
BRMA_No BRMA_Name Price
1 A 5
1 A 6
1 A 100
1 A 90
2 B 50
2 B 60
2 B 40
2 B 400
2 B 4
3 C 4
3 C 2
I have looked around but could not find an answer (sorry). Could anyone shed some light on this?
Kind regards
Lutfor
You could try this:
#outlier based on IQR - returns TRUE or FALSE based on the outlier condition
outlier <- function(x) {
ifelse(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x),
TRUE,
FALSE)
}
library(data.table)
#apply the function per group
setDT(df)[, out := outlier(Price), by = 'BRMA_Name']
df
# BRMA_No BRMA_Name Price out
# 1: 1 A 5 FALSE
# 2: 1 A 6 FALSE
# 3: 1 A 100 FALSE
# 4: 1 A 90 FALSE
# 5: 2 B 50 FALSE
# 6: 2 B 60 FALSE
# 7: 2 B 40 FALSE
# 8: 2 B 400 TRUE
# 9: 2 B 4 TRUE
#10: 3 C 4 FALSE
#11: 3 C 2 FALSE
Then just select the rows where out is FALSE (e.g. df[out == FALSE]).
Here's an option using boxplot to determine the outliers:
library(data.table)
setDT(mydf)[, rm := !Price %in% boxplot(Price, plot = FALSE)$out, BRMA_Name][(rm)]
# BRMA_No BRMA_Name Price rm
# 1: 1 A 5 TRUE
# 2: 1 A 6 TRUE
# 3: 1 A 100 TRUE
# 4: 1 A 90 TRUE
# 5: 2 B 50 TRUE
# 6: 2 B 60 TRUE
# 7: 2 B 40 TRUE
# 8: 3 C 4 TRUE
# 9: 3 C 2 TRUE
I suppose the more appropriate approach would be:
setDT(mydf)[, rm := !Price %in% boxplot.stats(Price)$out, BRMA_Name][(rm)]
From the help page for boxplot.stats, the function's default for the coef argument is 1.5. If you wanted to change your outlier detection rule, you can change that value.
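As a small illustration of the coef argument, using the Price values of group B from the question (a sketch; the value 20 is an arbitrary choice to show the effect of a much looser rule):

```r
x <- c(50, 60, 40, 400, 4)

# Default rule: points more than 1.5 * (hinge spread) beyond the hinges
boxplot.stats(x)$out

# A much looser rule flags nothing for these values
boxplot.stats(x, coef = 20)$out
```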
Define the wrapper:
TukeyRangeFilter <- function(x) {
normrange <- quantile(x, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(x)
findInterval(x, normrange)==1
}
Then apply it to each BRMA_Name group using by:
by(df, df$BRMA_Name, function(x) x[TukeyRangeFilter(x$Price), ])
Concatenate with do.call(rbind, <output>).
BRMA_No BRMA_Name Price
A.1 1 A 5
A.2 1 A 6
A.3 1 A 100
A.4 1 A 90
B.5 2 B 50
B.6 2 B 60
B.7 2 B 40
C.10 3 C 4
C.11 3 C 2
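Put together as a self-contained sketch, with the data frame rebuilt from the table in the question (the names pieces and cleaned are chosen here for illustration):

```r
df <- data.frame(
  BRMA_No = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  BRMA_Name = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"),
  Price = c(5, 6, 100, 90, 50, 60, 40, 400, 4, 4, 2)
)

# TRUE for values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
TukeyRangeFilter <- function(x) {
  normrange <- quantile(x, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(x)
  findInterval(x, normrange) == 1
}

# One filtered data frame per group, then stack them again
pieces <- by(df, df$BRMA_Name, function(g) g[TukeyRangeFilter(g$Price), ])
cleaned <- do.call(rbind, pieces)
cleaned
```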
Given a data.frame:
foo <- data.frame(ID=1:10, x=1:10)
rownames(foo) <- LETTERS[1:10]
I would like to reorder a subset of rows, defined by their row names. However, I would like to swap the row names of foo as well. I can do
sel <- c("D", "H") # rows to reorder
foo[sel,] <- foo[rev(sel),]
sel.wh <- match(sel, rownames(foo))
rownames(foo)[sel.wh] <- rownames(foo)[rev(sel.wh)]
but that is long and complicated. Is there a simpler way?
We can replace the sel values in rownames with the reverse of sel.
x <- rownames(foo)
foo[replace(x, x %in% sel, rev(sel)), ]
# ID x
#A 1 1
#B 2 2
#C 3 3
#H 8 8
#E 5 5
#F 6 6
#G 7 7
#D 4 4
#I 9 9
#J 10 10
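Note that replace(x, x %in% sel, rev(sel)) fills the matching positions in row order, so it relies on sel listing the names in the order they appear in the data frame; with sel <- c("H", "D") the rows would stay where they are. A minimal sketch that does not depend on the order of sel uses match instead (swap_rows is a hypothetical helper name):

```r
foo <- data.frame(ID = 1:10, x = 1:10)
rownames(foo) <- LETTERS[1:10]

# Hypothetical helper: reverse the positions of the rows named in sel
swap_rows <- function(d, sel) {
  x <- rownames(d)
  x[match(sel, x)] <- rev(sel)   # put the reversed names at the positions of sel
  d[x, ]                         # index by name, so values and names move together
}

res1 <- swap_rows(foo, c("D", "H"))
res2 <- swap_rows(foo, c("H", "D"))
res1
```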
Not as concise as ronak-shah's answer, but you could also use order.
# extract row names
temp <- row.names(foo)
# swap the selected names within the vector
temp[which(temp %in% sel)] <- temp[rev(which(temp %in% sel))]
# reorder the data.frame
foo[order(temp),]
ID x
A 1 1
B 2 2
C 3 3
H 8 8
E 5 5
F 6 6
G 7 7
D 4 4
I 9 9
J 10 10
As noted in the comments, this relies on the row names following a lexicographical order. In instances where this is not true, we can use match.
# set up
set.seed(1234)
foo <- data.frame(ID=1:10, x=1:10)
row.names(foo) <- sample(LETTERS[1:10])
sel <- c("D", "H")
Now, the rownames are
# initial data.frame
foo
ID x
B 1 1
F 2 2
E 3 3
H 4 4
I 5 5
D 6 6
A 7 7
G 8 8
J 9 9
C 10 10
# grab row names
temp <- row.names(foo)
# reorder vector containing row names
temp[which(temp %in% sel)] <- temp[rev(which(temp %in% sel))]
Using match along with order:
foo[order(match(row.names(foo), temp)),]
ID x
B 1 1
F 2 2
E 3 3
D 6 6
I 5 5
H 4 4
A 7 7
G 8 8
J 9 9
C 10 10
Your data frame is small, so you can copy it and then overwrite each row from the copy (note that this swaps only the values; the row names stay where they are):
footemp <- data.frame(foo)
foo[4,] <- footemp[8,]
foo[8,] <- footemp[4,]
Bob