Apply a variable operation, depending on groups using tidyverse [R]

I'm not even sure how to google my question, so I think an example is the best way to explain what I want to achieve. In summary, I'd like to multiply each value of a data frame, grouped by some variable, where the multiplier depends on which group the row belongs to. Here is an example:
library(dplyr)
data <- data.frame(group = c("a", "b", "c"), value = c(1, 2, 3))
multiplier <- c(a = 1, b = 2, c = 3)
data %>%
  group_by(group)
  # ...then something that multiplies the value column by the corresponding multiplier contained in the vector
EDIT:
The values that replace the value column should be 1, 4, 9, in that order.

I think this should do it:
library(dplyr)
data %>%
mutate(value = value * multiplier[as.character(group)])
# group value
#1 a 1
#2 b 4
#3 c 9
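A note on the as.character() call: as far as I can tell, it guards against group being a factor (which was the default for strings in data.frame() before R 4.0). Indexing a named vector with a factor uses the factor's integer codes rather than its labels, so the lookup can silently pick the wrong entries. A minimal sketch, where the reordered group factor is made up purely for illustration:

```r
multiplier <- c(a = 1, b = 2, c = 3)

# Hypothetical factor whose level order differs from the name order in multiplier
group <- factor(c("c", "a", "b"), levels = c("c", "a", "b"))

# Indexing by the factor uses its integer codes (1, 2, 3), so this picks a, b, c
by_code <- unname(multiplier[group])

# Indexing by the labels looks entries up by name, so this picks c, a, b
by_name <- unname(multiplier[as.character(group)])

print(by_code)
print(by_name)
```

With character columns the as.character() call is a no-op, so keeping it costs nothing.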
Alternatively, you could append multiplier as a column of data and then calculate:
data %>%
  mutate(multiplier = multiplier[as.character(group)]) %>%
  mutate(new.value = value * multiplier)
# group value multiplier new.value
#1 a 1 1 1
#2 b 2 2 4
#3 c 3 3 9

In base R, we can do
transform(merge(data, stack(multiplier), by.x = 'group', by.y = 'ind'),
value = value * values)[-3]
# group value
#1 a 1
#2 b 4
#3 c 9

Related

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where:
1. The merging should happen in the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame; more specifically, replace NAs in the first data frame with values from the second data frame.
3. The second data frame contains more rows than the first one.
Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
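For comparison, the same idea can be sketched in base R, assuming the x and y from the question. Here merge() with all.x = TRUE plays the role of left_join, and ifelse(is.na(...)) stands in for coalesce. The names m and res are only for illustration, and stringsAsFactors = FALSE is added so the sketch behaves the same on older R versions:

```r
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"),
                stringsAsFactors = FALSE)

# Keep every row of x; the clashing column a is split into a.x and a.y
m <- merge(x, y, by = "id", all.x = TRUE, suffixes = c(".x", ".y"))

# Take a from x where present, otherwise fall back to the value from y
res <- data.frame(id = m$id,
                  a = ifelse(is.na(m$a.x), m$a.y, m$a.x),
                  q = m$q,
                  stringsAsFactors = FALSE)
print(res)
```

Rows of y with no match in x (id 4 here) are dropped, which matches the desired output in the question.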

Find the top n largest values from a dataframe (or matrix) in r

I have a dataframe like below:
df = data.frame(a = runif(10,0,10),
b = runif(10,1,10),
c = runif(10,0,12))
How can I find the n largest values from this dataframe?
We can easily find top n from a vector. Is there any good way to find the top n from a dataframe?
Thanks a lot.
You could try stack():
N=2
sort(stack(df)$values, decreasing=TRUE)[1:N]
[1] 10.884644 9.912067
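You can use tidyr::gather() and dplyr::top_n() (both have since been superseded, by tidyr::pivot_longer() and dplyr::slice_max() respectively).
First gather every column into a single column using gather(key, value), then keep the top n elements using top_n(). For example, the top 5: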
library(tidyverse) # dplyr and tidyr
set.seed(10)
mydf <-
data.frame(a = runif(10,0,10),
b = runif(10,1,10),
c = runif(10,0,12))
In gather(), you can choose the names of the key and value columns freely.
The wt argument of top_n() should then be the value column name you chose.
mydf %>%
gather(key = "key", value = "value") %>%
top_n(5, wt = value) %>%
arrange(desc(value)) # sort by value
#> key value
#> 1 c 10.38
#> 2 c 10.06
#> 3 c 9.30
#> 4 c 9.25
#> 5 b 8.53
This gives the top n values together with their column names.
However, if you only want the values, you can use unlist().
unlist(mydf) %>% # optionally, use.names = FALSE
sort(decreasing = TRUE) %>%
.[1:5]
#> c1 c7 c3 c9 b10
#> 10.38 10.06 9.30 9.25 8.53
Use unlist to convert the data frame into a vector, sort it, and take the top values. For the top 2 values we can do
tail(sort(unlist(df, use.names = FALSE)), 2)
#[1] 9.581705 9.591726
If it's already a matrix, you don't need unlist:
tail(sort(as.matrix(df)), 2)
data
set.seed(1233)
df = data.frame(a = runif(10,0,10),
b = runif(10,1,10),
c = runif(10,0,12))
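If you also need to know where the top values sit, one possible sketch in base R is to order the matrix by value and map the linear indices back to row and column positions with arrayInd(). The small fixed data set and the names m, idx, pos and top are made up for illustration, so that the result is easy to check by eye:

```r
# Small deterministic example instead of runif(), for readability
m <- as.matrix(data.frame(a = c(1, 8, 3),
                          b = c(9, 2, 7),
                          c = c(4, 6, 5)))
n <- 3

# Linear (column-major) indices of the n largest entries
idx <- order(m, decreasing = TRUE)[seq_len(n)]

# Convert linear indices back to (row, column) positions
pos <- arrayInd(idx, dim(m))

top <- data.frame(row = pos[, 1],
                  column = colnames(m)[pos[, 2]],
                  value = m[idx],
                  stringsAsFactors = FALSE)
print(top)
```

Ties are broken by position here, since order() keeps the first occurrence first.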
I suspect you're looking for slice_max().
Given, for example, the data below:
> df = data.frame(a = runif(5,0,10),
+ b = runif(5,1,10),
+ c = runif(5,-1,9))
> df
a b c
1 1.953615 6.663370 6.95084517
2 1.564794 2.376268 1.46826979
3 5.052276 3.609657 0.84467786
4 3.800541 5.506710 5.64018236
5 9.823815 9.158154 -0.03483406
We can get the three topmost rows (defined by the parameter n) sorted by the column a...
> slice_max(df, n=3, order_by=a)
a b c
1 9.823815 9.158154 -0.03483406
2 5.052276 3.609657 0.84467786
3 3.800541 5.506710 5.64018236
...column b...
> slice_max(df, n=3, order_by=b)
a b c
1 9.823815 9.158154 -0.03483406
2 1.953615 6.663370 6.95084517
3 3.800541 5.506710 5.64018236
...or column c:
> slice_max(df, n=3, order_by=c)
a b c
1 1.953615 6.663370 6.950845
2 3.800541 5.506710 5.640182
3 1.564794 2.376268 1.468270

How to mutate a column given a dataframe that has the conditions?

I have a two-column data frame. The first column is a timestamp and the second column is some value. For example:
library(tidyverse)
set.seed(123)
data_df <- tibble(t = 1:15,
value = sample(letters, 15))
I have another data frame that specifies the ranges of timestamps that need to be updated and their corresponding values. For example:
criteria_df <- tibble(start = c(1, 3, 7),
end = c(2, 5, 10),
value = c('a', 'b', 'c')
)
This means that I need to mutate the value column in data_df so that its value from t=1 to t=2 is 'a', from t=3 to t=5 is 'b' and from t=7 to t=10 is 'c'.
What is the recommended way to do this in R?
The only way I could think of is to loop each row in criteria_df and mutate the value column in data_df after filtering the t column, like so:
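library(iterators)
library(foreach)
library(magrittr) # provides the %<>% assignment pipe used below
row_iter <- iter(criteria_df, by = "row") # assumed definition; row_iter is not defined in the original post
foreach(row = row_iter, .combine = c) %do% {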
seg_start = row$start
seg_end = row$end
new_value = row$value
data_df %<>%
mutate(value = if_else(between(t, seg_start, seg_end),
new_value,
value))
NULL
}
We can do a two-step base R solution: first, for each t, find the criteria_df value whose start-end range contains it; then replace the data_df value with that criteria_df value if there is a match, or keep it as it is otherwise.
inds <- sapply(data_df$t, function(x) criteria_df$value[x >= criteria_df$start
& x <= criteria_df$end])
data_df$value <- unlist(ifelse(lengths(inds) > 0, inds, data_df$value))
data_df
# t value
# <int> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 b
# 6 6 a
# 7 7 c
# 8 8 c
# 9 9 c
#10 10 c
#11 11 p
#12 12 g
#13 13 r
#14 14 s
#15 15 b
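Another way to read the same logic is as one overwrite per row of criteria_df. The sketch below is a plain base R loop, not the answer's method; it uses letters[1:15] instead of sample() so the result is deterministic, and in_range is a name introduced only for illustration:

```r
data_df <- data.frame(t = 1:15,
                      value = letters[1:15],
                      stringsAsFactors = FALSE)
criteria_df <- data.frame(start = c(1, 3, 7),
                          end = c(2, 5, 10),
                          value = c("a", "b", "c"),
                          stringsAsFactors = FALSE)

# For each criteria row, overwrite value wherever t falls in [start, end]
for (i in seq_len(nrow(criteria_df))) {
  in_range <- data_df$t >= criteria_df$start[i] & data_df$t <= criteria_df$end[i]
  data_df$value[in_range] <- criteria_df$value[i]
}
print(data_df)
```

Timestamps outside every range (t = 6 and t = 11 to 15 here) keep their original value. If ranges overlapped, later rows of criteria_df would win.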

Grouping R data multiple times before summing

I'm trying to group my data by a number of variables before providing a summary table showing the sum of the values within each group.
I have created the below data as an example.
Value <- c(21000,10000,50000,60000,2000, 4000, 5500, 10000, 35000, 40000)
Group <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "C")
Type <- c(1, 2, 1, 2, 1, 1, 1, 2, 2, 1)
Matrix <- cbind(Value, Group, Type)
I want to group the above data first by the 'Group' variable, and then by the 'Type' variable to then sum the values and get an output similar to the attached example I worked on Excel. I would usually use the aggregate function if I just wanted to group by one variable, but am not sure whether I can translate this for multiple variables?
Further to this I then need to provide an identical table but with the values being calculated with a "count" function rather than a "sum".
Many thanks in advance!
You can supply multiple groupings to aggregate:
df <- data.frame(Value, Group, Type)
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), sum)
Type Group x
1 1 A 26500
2 2 A 20000
3 1 B 50000
4 2 B 95000
5 1 C 46000
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), length)
Type Group x
1 1 A 2
2 2 A 2
3 1 B 1
4 2 B 2
5 1 C 3
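aggregate also has a formula interface, which some find easier to read when grouping by several variables. A small sketch using the question's data; the names sums and counts are just for illustration, and note that the row order of the result may differ from the output above:

```r
Value <- c(21000, 10000, 50000, 60000, 2000, 4000, 5500, 10000, 35000, 40000)
Group <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "C")
Type  <- c(1, 2, 1, 2, 1, 1, 1, 2, 2, 1)
df <- data.frame(Value, Group, Type, stringsAsFactors = FALSE)

# Value on the left, grouping variables on the right of the formula
sums   <- aggregate(Value ~ Group + Type, data = df, FUN = sum)
counts <- aggregate(Value ~ Group + Type, data = df, FUN = length)

print(sums)
print(counts)
```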
There are other packages which may be easier to use such as data.table:
> library(data.table)
> dt <- as.data.table(df)
> dt[, .(Count = length(Value), Sum = sum(Value)),
+    by = .(Type, Group)]
Type Group Count Sum
1: 1 A 2 26500
2: 2 A 2 20000
3: 1 B 1 50000
4: 2 B 2 95000
5: 1 C 3 46000
dplyr is another option, and @waskuf has a good example of that.
Using dplyr (note that "Matrix" needs to be a data.frame):
library(dplyr)
Matrix <- data.frame(Value, Group, Type)
Matrix %>%
  group_by(Group, Type) %>%
  summarise(Sum = sum(Value), Count = n()) %>%
  ungroup()

cumsum when current obs equals next obs for same variable (column)

I want to add a column to a dataframe that makes a cumulated sum of another variable if yet another variable is equal for two rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to cumulate/sum the Var2 column if the Var1 obs for row 2 equals the Var1 obs for row 1. In other words, if it is equal to the obs before.
If the cumsum is based on the Var1 as a grouping variable
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(CumVal=cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If it is based on whether adjacent elements are not equal
transform(df, CumVal= ave(Var2, cumsum(c(TRUE,Var1[-1]!=
Var1[-nrow(df)])), FUN=cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach
df %>%
group_by(indx= cumsum(c(TRUE,(lag(Var1)!=Var1)[-1]))) %>%
mutate(CumVal=cumsum(Var2)) %>%
ungroup() %>%
select(-indx)
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects similar successive values in a vector and describe it in a nice synthetic way. E.g. let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can see two lengths different from 1, namely 4 and 2, corresponding to the values 2 and 0)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumsum to a new vector y <- 1:10, restarting at each new run of repeated values in x (the runs will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else) to achieve the desired output:
cumval <- unlist(tapply(y, f, cumsum))
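To tie this back to the question, here is a minimal sketch that applies the same rle idea to the df from the earlier answer. It uses ave instead of tapply so the result comes back in the original row order; run_id and runs are names introduced only for this example:

```r
df <- data.frame(Row = 1:4,
                 Var1 = c("A", "A", "B", "A"),
                 Var2 = c(2L, 4L, 5L, 6L),
                 stringsAsFactors = FALSE)

# One id per run of identical consecutive Var1 values: 1, 1, 2, 3
runs <- rle(df$Var1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Cumulative sum of Var2 that restarts whenever Var1 changes
df$CumVal <- ave(df$Var2, run_id, FUN = cumsum)
print(df)
```

The last row gets 6 rather than 12 because its "A" is not adjacent to the first two.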
