R combine rows with similar values

I have a dataframe whose row values are ordered from smallest to largest. I compute the differences between adjacent rows, want to combine rows whose differences are small (e.g., smaller than 1), and return the averaged values of the combined rows. I could check each row difference with a for loop, but that seems very inefficient. Any better ideas? Thanks.
library(dplyr)
DF <- data.frame(ID = letters[1:12],
                 Values = c(1, 2.2, 3, 5, 6.2, 6.8, 7, 8.5, 10, 12.2, 13, 14))
DF <- DF %>%
  mutate(Diff = c(0, diff(Values)))
The expected output of DF would be
ID Values
a 1.0
b/c 2.6 # (2.2+3.0)/2
d 5.0
e/f/g 6.67 # (6.2+6.8+7.0)/3
h 8.5
i 10.0
j/k 12.6 # (12.2+13.0)/2
l 14.0

Here is an option with data.table
library(data.table)
setDT(DF)[, .(ID = toString(ID), Values = round(mean(Values), 2)),
by = .(Diff = cumsum(c(TRUE, diff(Values)>=1)))][, -1, with = FALSE]
# ID Values
#1: a 1.00
#2: b, c 2.60
#3: d 5.00
#4: e, f, g 6.67
#5: h 8.50
#6: i 10.00
#7: j, k 12.60
#8: l 14.00

Calculate the difference between the Values of adjacent rows and check whether those differences are >= 1. The cumulative sum of that >= 1 indicator provides distinct groups, which one can summarize to get the desired result.
library(dplyr)
DF %>%
  arrange(Values) %>%
  group_by(Diff = cumsum(c(1, diff(Values)) >= 1)) %>%
  summarise(ID = paste0(ID, collapse = "/"), Values = mean(Values)) %>%
  ungroup() %>%
  select(-Diff)
# # A tibble: 8 x 2
# ID Values
# <chr> <dbl>
# 1 a 1.00
# 2 b/c 2.60
# 3 d 5.00
# 4 e/f/g 6.67
# 5 h 8.50
# 6 i 10.0
# 7 j/k 12.6
# 8 l 14.0
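To see what drives the grouping, inspect the cumulative-sum vector on its own (a quick check using the DF from the question):
cumsum(c(1, diff(DF$Values)) >= 1)
# [1] 1 2 2 3 4 4 4 5 6 7 7 8
Rows b/c share group 2, e/f/g share group 4, and j/k share group 7, matching the expected output.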

library(magrittr)
df <- DF[order(DF$Values),]
df$Values %>%
  # Find groups for each row
  outer(., ., function(x, y) x >= y & x < y + 1) %>%
  # Remove sub-groups
  `[<-`(apply(., 1, cumsum) > 1, FALSE) %>%
  # Remove sub-group columns
  .[, colSums(.) > 0] %>%
  # Select these groups from the data
  apply(2, function(x) data.frame(ID = paste(df$ID[x], collapse = '/'),
                                  Values = mean(df$Values[x]))) %>%
  # Bind results by row
  do.call(what = rbind)
# ID Values
# 1 a 1.000000
# 2 b/c 2.600000
# 4 d 5.000000
# 5 e/f/g 6.666667
# 8 h 8.500000
# 9 i 10.000000
# 10 j/k 12.600000
# 12 l 14.000000
Note:
This method is different from those using diff because it groups rows together only if all Values are within 1 of each other.
Example:
Change the dataset so that the Value at ID g is 7.3.
Above method: IDs e, f, and g are no longer grouped together, because the value at ID e is 6.2 and 7.3 - 6.2 > 1.
Diff method: IDs e, f, and g are still grouped together, because the diff between e and f (0.6) is < 1 and the diff between f and g (0.5) is < 1.
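A quick sketch of the divergence, running only the diff-based grouping on the modified data (assumes dplyr is loaded; the outer() pipeline above would need to be re-run separately):
DF2 <- data.frame(ID = letters[1:12],
                  Values = c(1, 2.2, 3, 5, 6.2, 6.8, 7.3, 8.5, 10, 12.2, 13, 14))
DF2 %>%
  group_by(grp = cumsum(c(1, diff(Values)) >= 1)) %>%
  summarise(ID = paste0(ID, collapse = "/"), Values = mean(Values))
# e/f/g remain a single group because the adjacent diffs (0.6 and 0.5) are
# both < 1, whereas the outer() method splits them since 7.3 - 6.2 > 1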

Related

Mutate a dynamic subset of variable

a = tibble(x = runif(1000, 0, 10),
           t = rpois(1000, 4)) %>%
  arrange(t)
I want a column l that, for each row, averages the subset of x whose t values are smaller than that row's t.
Expected result:
for x[t=0], l = NaN
for x[t=1], l = mean(x[t<1])
for x[t=2], l = mean(x[t<2])
etc.
Code that does not work:
a %>%
  mutate(
    l = mean(x[a$t < .$t])
  ) -> a
Now this code would work:
for (i in c(1:1000)) {
  a$l[i] = mean(a$x[a$t < a$t[i]])
}
But it is not a mutate. I'd like a mutate so I can apply it to groups, etc.
To understand the issue better: imagine that you have to average all the x values before a given date. Now do this dynamically, in a mutate.
I think that purrr may be necessary but I hate it.
You can use map with mutate:
library(tidyverse)
f <- function(lim) mean(a$x[a$t < lim])
a %>% mutate(l = map_dbl(t, f))
Testing against the OP's solution:
res <- a %>% mutate(l = map_dbl(t, f))
l <- vector(mode = "numeric", length = 1000)
for (i in c(1:1000)) l[i] = mean(a$x[a$t < a$t[i]])
assertthat::are_equal(res$l, l) # TRUE
For each t value you can calculate the average value of x and then take the lagged value of the cumulative mean.
library(dplyr)
a %>%
  group_by(t) %>%
  summarise(l = mean(x)) %>%
  mutate(l = lag(cummean(l)))
# t l
# <int> <dbl>
# 1 0 NA
# 2 1 5.33
# 3 2 5.45
# 4 3 5.36
# 5 4 5.26
# 6 5 5.16
# 7 6 5.10
# 8 7 5.07
# 9 8 5.12
#10 9 4.96
#11 10 4.98
#12 11 5.15
#13 12 4.93
If you want to maintain the number of rows in the dataframe, add %>% left_join(a, by = 't') to the above answer.
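For example, the full pipeline with the join appended (a minimal sketch; it keeps all 1000 rows of a):
a %>%
  group_by(t) %>%
  summarise(l = mean(x)) %>%
  mutate(l = lag(cummean(l))) %>%
  left_join(a, by = 't')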
data
set.seed(123)
a = tibble(x = runif(1000, 0, 10),
           t = rpois(1000, 4)) %>%
  arrange(t)

How do I use conditions on rows in a dataframe (weighted average, R language)?

I need to calculate the weighted average of each row in the dataframe, where:
if value <= 5, the weight is 1
if value > 5 and <= 8, the weight is 2
if value > 8, the weight is 3
t1 <- c(1, 2, 4, 6, 7, 9)
t2 <- c(6, 6, 5, 3, 3, 7)
df <- data.frame(t1 = t1, t2 = t2, stringsAsFactors = FALSE)
For example, for row 1 (t1 = 1, t2 = 6) the weights are 1 and 2, so the weighted average is (1*1 + 6*2) / (1 + 2) = 4.33.
Does anyone know how to do it using the R language?
regards
A solution using tidyverse to calculate the weighted mean for each row.
library(tidyverse)
df2 <- df %>%
  # Add row numbers
  rowid_to_column() %>%
  # Convert to long format
  gather(Group, Value, -rowid) %>%
  # Assign weights
  mutate(Weight = case_when(
    Value <= 5 ~ 1,
    Value > 5 & Value <= 8 ~ 2,
    Value > 8 ~ 3,
    TRUE ~ NA_real_
  )) %>%
  # Calculate the weighted average
  group_by(rowid) %>%
  summarize(Weighted_Mean = weighted.mean(Value, Weight)) %>%
  ungroup()
df2
# rowid Weighted_Mean
# <int> <dbl>
# 1 1 4.33
# 2 2 4.67
# 3 3 4.5
# 4 4 5
# 5 5 5.67
# 6 6 8.2
If you'd like to multiply by the weights and then divide by their sum (equivalent to the weighted.mean function in R):
df %>%
  mutate_at(vars(t1, t2),
            list(weights = ~ case_when(. <= 5 ~ 1,
                                       . > 5 & . <= 8 ~ 2,
                                       TRUE ~ 3))) %>%
  mutate(rowMeanWeighted = rowSums(.[, 1:2] * .[, 3:4]) / rowSums(.[, 3:4])) %>%
  select(-contains("weights"))
Output:
t1 t2 rowMeanWeighted
1 1 6 4.333333
2 2 6 4.666667
3 4 5 4.500000
4 6 3 5.000000
5 7 3 5.666667
6 9 7 8.200000
A base R solution, where the function findInterval does the main part. Then multiply the weights by df and compute the weighted row means.
t1 <- c(1, 2, 4, 6, 7, 9)
t2 <- c(6, 6, 5, 3, 3, 7)
df <- data.frame(t1 = t1, t2=t2, stringsAsFactors = FALSE)
cp <- c(-Inf, 5, 8, Inf)
Edit.
If the weights are normalized to sum to 1 within each row, either of the following two gives the right answer (note left.open = TRUE, so that a value of exactly 5 gets weight 1, matching the value <= 5 rule above).
wt <- sapply(df, findInterval, cp, left.open = TRUE)
rowSums(df * (wt / apply(wt, 1, sum)))
#[1] 4.333333 4.666667 4.500000 5.000000 5.666667 8.200000
sapply(1:nrow(df), function(i)
  weighted.mean(df[i, ], sapply(df, findInterval, cp, left.open = TRUE)[i, ]))
#[1] 4.333333 4.666667 4.500000 5.000000 5.666667 8.200000
These results are now equal to the results in the other answers.
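The left.open argument matters only at the breakpoints; a quick check with the cp defined above:
findInterval(5, cp)                    # 2: default intervals are [cp_i, cp_{i+1})
findInterval(5, cp, left.open = TRUE)  # 1: intervals become (cp_i, cp_{i+1}]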

How do I subset a vector using a list of indices of variable lengths

I am trying to determine an efficient way to gather the means and standard deviations of subsections of a variable in a dataframe, based on a list of the lengths of the sections within the variable. Here is a small example of the type of data I have.
X1 <- c(1, 2.5, 3, .5, 1, 1.5, 3, 3.5, 4, 6, 8, 8, 6, 3, 4)
X2 <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
df <- c(X1, X2)
X3 <- list(3, 5, 4, 3)
I will note that my data are the result of preprocessing a large set of continuous-time data that I inherited, and the list (X3) represents the length of each sequence of X2. Unfortunately, I need a mean and SD for each section of the vector. I tried using the aggregate() function
aggregate(X1, by = list(X2), FUN = mean)
but this code aggregates all of the "0" and "1" values together into two values. I need 4 values.
I can't quite figure out how to capture some function of each unique sequence of X2 without doing it manually by creating a factor for each separate sequence.
Another option is to create a grouping variable by comparing adjacent elements and taking the cumulative sum.
library(dplyr)
data_frame(X1, X2) %>%
  group_by(ID = cumsum(X2 != lag(X2, default = X1[1]))) %>%
  summarise(MEAN = mean(X1), SD = sd(X1))
# A tibble: 4 x 3
# ID MEAN SD
# <int> <dbl> <dbl>
#1 1 2.17 1.04
#2 2 1.90 1.29
#3 3 6.50 1.91
#4 4 4.33 1.53
Or if we have the number of elements in 'X3', create the grouping variable with rep and use aggregate from base R
do.call(data.frame, aggregate(X1 ~ cbind(ID = rep(seq_along(X3), unlist(X3))),
                              FUN = function(x) c(MEAN = mean(x), SD = sd(x))))
# ID X1.MEAN X1.SD
#1 1 2.166667 1.040833
#2 2 1.900000 1.294218
#3 3 6.500000 1.914854
#4 4 4.333333 1.527525
First of all, I assume that you want to create a data frame with two columns, X1 and X2. Here is how to create the data frame.
df <- data.frame(X1, X2)
df
# X1 X2
# 1 1.0 0
# 2 2.5 0
# 3 3.0 0
# 4 0.5 1
# 5 1.0 1
# 6 1.5 1
# 7 3.0 1
# 8 3.5 1
# 9 4.0 0
# 10 6.0 0
# 11 8.0 0
# 12 8.0 0
# 13 6.0 1
# 14 3.0 1
# 15 4.0 1
We can then use the data.table package to calculate the mean and standard deviation of each group. The key is to use the rleid function to create the ID of each group. After that, we can summarize the data. df2 is the final output. X3 is actually not needed as long as you have the X2 column in your data frame.
# Load the package
library(data.table)
# Convert df to a data.table
setDT(df)
# Perform the analysis
df2 <- df[, ID := rleid(X2)][, .(MEAN = mean(X1), SD = sd(X1)), by = ID]
df2[]
# ID MEAN SD
# 1: 1 2.166667 1.040833
# 2: 2 1.900000 1.294218
# 3: 3 6.500000 1.914854
# 4: 4 4.333333 1.527525
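For reference, the same run-based grouping can be built in base R without any packages; a sketch using the X1 and X2 vectors from the question:
# rle() measures each run of identical X2 values; rep() expands the run
# indices into a per-row group ID (the base R analogue of rleid)
runs <- rle(X2)
grp <- rep(seq_along(runs$lengths), runs$lengths)
tapply(X1, grp, mean)  # group means
tapply(X1, grp, sd)    # group standard deviations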

R dplyr join by range or virtual column

I want to join two tibbles by a range or a virtual column, but it seems the by parameter only allows a chr or vector(chr) of existing column names.
In my example I have a tibble d with a column value, and a tibble r with a from and a to column.
d <- tibble(value = seq(1,6, by = 0.2))
r <- tibble(from = seq(1,6), to = c(seq(2,6),Inf), class = LETTERS[seq(1,6)])
> d
# A tibble: 26 x 1
value
<dbl>
1 1.0
2 1.2
3 1.4
4 1.6
5 1.8
6 2.0
7 2.2
8 2.4
9 2.6
10 2.8
# ... with 16 more rows
> r
# A tibble: 6 x 3
from to class
<int> <dbl> <chr>
1 1 2 A
2 2 3 B
3 3 4 C
4 4 5 D
5 5 6 E
6 6 Inf F
Now I want to join the value column in d to the range between from and to in r:
d %>% inner_join(r, by = "value between from and to") # >= and <
I can't find a way to do this, so I decided to join the floor of value in d with the from column in r:
d %>% inner_join(r, by = c("floor(value)" = "from"))
Of course I can create a second column to solve that:
d %>%
  mutate(join_value = floor(value)) %>%
  inner_join(r, by = c("join_value" = "from")) %>%
  select(value, class)
# A tibble: 26 x 2
value class
<dbl> <chr>
1 1.0 A
2 1.2 A
3 1.4 A
4 1.6 A
5 1.8 A
6 2.0 B
7 2.2 B
8 2.4 B
9 2.6 B
10 2.8 B
# ... with 16 more rows
But isn't there a more comfortable way?
Thanks
I don't think inequality joins are implemented in dplyr yet, or that they ever will be (see this discussion on Join on inequality constraints), but this is a good situation to use an SQL join:
library(tibble)
library(sqldf)
as.tibble(sqldf("select d.value, r.class from d
join r on d.value >= r.'from' and
d.value < r.'to'"))
Alternatively, if you want to integrate the join into your dplyr chain, you can use fuzzyjoin::fuzzy_join:
library(dplyr)
library(fuzzyjoin)
d %>%
  fuzzy_join(r, by = c("value" = "from", "value" = "to"),
             match_fun = list(`>=`, `<`)) %>%
  select(value, class)
Result:
# A tibble: 31 x 2
value class
<dbl> <chr>
1 1.0 A
2 1.2 A
3 1.4 A
4 1.6 A
5 1.8 A
6 2.0 A
7 2.0 B
8 2.2 B
9 2.4 B
10 2.6 B
# ... with 21 more rows
Notice I added single quotes around from and to since those are reserved words for the SQL language.
OK, thanks for the advice, this was pretty interesting. I finally wrote a function range_join (inspired by #ycw's code) and compared all the described solutions in terms of runtime.
I like fuzzy_join, but with only 50k rows in d it needs more than 40 sec. That's too slow.
Here are the results with 5k rows in d:
library(dplyr)
library(fuzzyjoin)
library(sqldf)
#join by range by #WiWeber
range_join <- function(x, y, value, left, right){
  x_result <- tibble()
  for (y_ in split(y, 1:nrow(y)))
    x_result <- x_result %>%
      bind_rows(x[x[[value]] >= y_[[left]] & x[[value]] < y_[[right]], ] %>% cbind(y_))
  return(x_result)
}
#dynamic join by #ycw
dynamic_join <- function(d, r){
  d$type <- NA_character_
  for (r_ in split(r, r$type))
    d <- d %>% mutate(type = ifelse(value >= r_$from & value < r_$to, r_$type, type))
  return(d)
}
d <- tibble(value = seq(1,6, by = 0.001), join = TRUE)
r <- tibble(from = seq(1,6), to = c(seq(2,6),Inf), type = LETTERS[seq(1,6)], join = TRUE)
# #useR sqldf - fast and intuitive but extra library with horrible code
start <- Sys.time()
d2 <- tbl_df(sqldf("select d.value, r.type from d
                    join r on d.value >= r.'from' and
                              d.value < r.'to'"))
Sys.time() - start
# #useR fuzzy_join .... very cool but veeeeeeeeeeeeeeeery slow
start <- Sys.time()
d2 <- d %>%
  fuzzy_join(r, by = c("value" = "from", "value" = "to"), match_fun = list(`>=`, `<`)) %>%
  select(value, type)
Sys.time() - start
# #jonathande4 cut pretty fast
start <- Sys.time()
d2 <- d
d2$type <- cut(d$value, unique(c(r$from, r$to)), r$type, right = FALSE)
Sys.time() - start
# #WiWeber floor
start <- Sys.time()
d2 <- d %>%
  mutate(join_value = floor(value)) %>%
  inner_join(r, by = c("join_value" = "from")) %>%
  select(value, type)
Sys.time() - start
# #WiWeber cross join - filter
start <- Sys.time()
d2 <- d %>%
  inner_join(r, by = "join") %>%
  filter(value >= from, value < to) %>%
  select(value, type)
Sys.time() - start
# #hardik-gupta sapply
start <- Sys.time()
d2 <- d %>%
  mutate(
    type = unlist(sapply(value, function(x) r[which(x >= r$from & x < r$to), "type"]))
  ) %>%
  select(value, type)
Sys.time() - start
# #ycw re-dynamic join
start <- Sys.time()
d2 <- d %>% dynamic_join(r)
Sys.time() - start
# #WiWeber range_join
start <- Sys.time()
d2 <- d %>%
  range_join(r, "value", "from", "to") %>%
  select(value, type)
Sys.time() - start
Results:
# #useR sqldf - fast and intuitive but extra library with horrible code
Time difference of 0.06221986 secs
# #useR fuzzy_join .... very cool but veeeeeeeeeeeeeeeery slow
Time difference of 4.765595 secs
# #jonathande4 cut pretty fast
Time difference of 0.004637003 secs
# #WiWeber floor
Time difference of 0.02223396 secs
# #WiWeber cross join - filter
Time difference of 0.0201931 secs
# #hardik-gupta sapply
Time difference of 5.166633 secs
# #ycw dynamic join
Time difference of 0.03124094 secs
# #WiWeber range_join
Time difference of 0.02691698 secs
greez WiWeber
You can use the cut function to create a "class" column in object d and then use a left join.
d <- tibble(value = seq(1,6, by = 0.2))
r <- tibble(from = seq(1,6), to = c(seq(2,6),Inf), class = LETTERS[seq(1,6)])
d[["class"]] <- cut(d[["value"]], c(0,2,3,4,5,6,Inf), c('A',"B", "C", "D", "E", "F"), right = FALSE)
d <- left_join(d, r)
To get the right buckets, you just need to adjust the breaks and labels passed to cut.
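For instance, rather than hardcoding them, the breaks and labels can be taken from r directly (a sketch, assuming the ranges in r are contiguous and sorted by from):
d[["class"]] <- cut(d[["value"]], breaks = c(r$from, Inf),
                    labels = r$class, right = FALSE)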
We can use sapply for this
library(tibble)
d <- tibble(value = seq(1,6, by = 0.2))
r <- tibble(from = seq(1,6), to = c(seq(2,6),Inf), class = LETTERS[seq(1,6)])
d <- cbind(d, data.frame(class = (unlist(sapply(d$value, function (x) r[which(x >= r$from & x < r$to), "class"]))) ) )
d
value class
1 1.0 A
2 1.2 A
3 1.4 A
4 1.6 A
5 1.8 A
6 2.0 B
7 2.2 B
8 2.4 B
9 2.6 B
10 2.8 B
11 3.0 C
12 3.2 C
13 3.4 C
14 3.6 C
15 3.8 C
16 4.0 D
17 4.2 D
18 4.4 D
19 4.6 D
20 4.8 D
21 5.0 E
22 5.2 E
23 5.4 E
24 5.6 E
25 5.8 E
26 6.0 F
We can use mutate and case_when from dplyr.
library(dplyr)
d2 <- d %>%
  mutate(class = case_when(
    value >= 1 & value < 2 ~ "A",
    value >= 2 & value < 3 ~ "B",
    value >= 3 & value < 4 ~ "C",
    value >= 4 & value < 5 ~ "D",
    value >= 5 & value < 6 ~ "E",
    value >= 6 ~ "F"
  ))
d2
# A tibble: 26 x 2
value class
<dbl> <chr>
1 1.0 A
2 1.2 A
3 1.4 A
4 1.6 A
5 1.8 A
6 2.0 B
7 2.2 B
8 2.4 B
9 2.6 B
10 2.8 B
# ... with 16 more rows
Update
Here is a workaround by defining a function for this task.
d <- tibble(value = seq(1,6, by = 0.2))
r <- tibble(from = seq(1,6), to = c(seq(2,6),Inf), class = LETTERS[seq(1,6)])
library(dplyr)
# Define a function for dynamic join
dynamic_join <- function(d, r){
  if (!("class" %in% colnames(d))){
    d[["class"]] <- NA_character_
  }
  d <- d %>%
    mutate(class = ifelse(value >= r$from & value < r$to, r$class, class))
  return(d)
}
re_dynamic_join <- function(d, r){
  r_list <- split(r, r$class)
  for (i in 1:length(r_list)){
    d <- dynamic_join(d, r_list[[i]])
  }
  return(d)
}
# Apply the function
d2 <- d %>% re_dynamic_join(r)
d2
# A tibble: 26 x 2
value class
<dbl> <chr>
1 1.0 A
2 1.2 A
3 1.4 A
4 1.6 A
5 1.8 A
6 2.0 B
7 2.2 B
8 2.4 B
9 2.6 B
10 2.8 B
# ... with 16 more rows
I really liked #WiWeber's range_join function, but it gives an error if a record is not within range. Here's a modification
library(dplyr)
d <- tibble(value = c(seq(1,4, by = 0.2),9))
r <- tibble(from = seq(1,5), to = c(seq(2,5),8), class = LETTERS[seq(1,5)])
range_join <- function(x, y, value, left, right){
  all_matches <- tibble()
  x <- as.data.frame(x)
  y <- as.data.frame(y)
  x$index <- x[, value]
  for (i in 1:nrow(y)){
    matches <- x %>% filter(index >= y[i, left] & index <= y[i, right])
    if (nrow(matches) > 0){
      all_matches <- all_matches %>% bind_rows(matches %>% cbind(y[i, ]))
    }
  }
  all_matches <- all_matches %>% select(-index)
  return(all_matches)
}
data <- d %>%
  range_join(r, "value", "from", "to")
data

How to calculate sums of multiplied columns in a dataset in R (loops)

I have a large data table with over 300 columns. For each letter column I would like to get:
-- the sum of (each observation in the column * the weight of that observation);
-- the sum of the weights where the observation in the letter column is greater than 0.
Here I provide an example for one column.
id <- c("0001", "0002", "0003", "0004")
a <- c(0, 9, 8, 5)
b <- c(0,5,5,0)
c <- c(1.5, 0.55, 0, 0.06)
weight <- c(102.354, 34.998, 84.664, .657)
data <- data.frame(id, a, b, c, weight)
data
id a b c weight
1 0001 0 0 1.50 102.354
2 0002 9 5 0.55 34.998
3 0003 8 5 0.00 84.664
4 0004 5 0 0.06 0.657
sum(data$a * data$weight)
[1] 995.579
sum(data$weight[data$a >0])
[1] 120.319
Any idea?
A possible data.table solution
You could define a helper function:
tempfunc <- function(x) c(sum(x * data$weight), sum(data$weight[x > 0]))
Then do either
library(data.table)
setDT(data)[, lapply(.SD, tempfunc), .SDcols = -c("id", "weight")]
# a b c
# 1: 995.579 598.310 172.8193
# 2: 120.319 119.662 138.0090
Or
library(dplyr)
setDT(data) %>% summarise_each(funs(tempfunc), -c(id, weight))
## a b c
## 1: 995.579 598.310 172.8193
## 2: 120.319 119.662 138.0090
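On current dplyr versions, where summarise_each is deprecated, an equivalent sketch uses across() inside reframe(), which allows the two-row result that tempfunc returns (assumes dplyr >= 1.1):
library(dplyr)
data %>% reframe(across(c(a, b, c), tempfunc))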
The following code should solve your question:
my.names <- names(data)[names(data) %in% letters]
res <- lapply(my.names, function(x){
  c(sum(data[[x]] * data[["weight"]]), sum(data[["weight"]][data[[x]] > 0]))
})
names(res) <- my.names
or directly to a data.frame:
do.call("rbind", lapply(my.names, function(letter){
data.frame(letter, "sum1_name" = sum(data[[letter]]*data[["weight"]]),
"sum2_name" = sum(data[["weight"]][data[[letter]]>0]))
}))
# letter sum1_name sum2_name
# 1 a 995.5790 120.319
# 2 b 598.3100 119.662
# 3 c 172.8193 138.009
