How to pull values by reference number - r

I have a df of paired values and I want to be able to subset it by accessing only one value. This is my data:
df1
values pair_num
<chr> <int>
1 apple 1
2 pb 1
3 apple 2
4 ranch 2
5 apple 3
6 sauce 3
7 orange 4
8 soda 4
9 grape 5
10 juice 5
So for example I would like to access all values associated with apple without knowing what they are and end up with something like this:
df1 %>% head()
values pair_num
<chr> <int>
1 apple 1
2 pb 1
3 apple 2
4 ranch 2
5 apple 3
6 sauce 3

I'm not sure I understand the question, as I would have thought this would be the output (with row 6) that you'd want.
library(dplyr)
df1 %>%
  filter(values == "apple") %>%
  select(pair_num) %>%
  left_join(df1)
Joining, by = "pair_num"
pair_num values
1 1 apple
2 1 pb
3 2 apple
4 2 ranch
5 3 apple
6 3 sauce
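The same lookup can be sketched in base R without a join. This is only an illustration, assuming df1 has the two columns shown in the question (the data frame below is re-typed from the question's printout, and apple_pairs and apple_rows are names made up for this example): first collect the pair_num values that contain "apple", then keep every row whose pair_num is in that set.

```r
# Data re-typed from the question's printout
df1 <- data.frame(
  values   = c("apple", "pb", "apple", "ranch", "apple", "sauce",
               "orange", "soda", "grape", "juice"),
  pair_num = rep(1:5, each = 2),
  stringsAsFactors = FALSE
)

# Pair numbers that contain "apple" at least once
apple_pairs <- unique(df1$pair_num[df1$values == "apple"])

# Keep every row that belongs to one of those pairs
apple_rows <- df1[df1$pair_num %in% apple_pairs, ]
apple_rows
```

Unlike the join, this keeps the original column order and cannot duplicate rows if "apple" were to appear twice within one pair.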

Related

Ranking observations within groups that are tied

I'm trying to rank certain groups by their counts using dense_rank, but it doesn't assign distinct ranks to groups that are tied. And none of the ranking functions I've tried that take a ties.method give me the rankings in consecutive 1, 2, 3 order. Example:
library(dplyr)
id <- c(rep(1, 8),
rep(2, 8))
fruit <- c(rep('apple', 4), rep('orange', 1), rep('banana', 2), 'orange',
rep('orange', 4), rep('banana', 1), rep('apple', 2), 'banana')
df <- data.frame(id, fruit, stringsAsFactors = FALSE)
df2 <- df %>%
mutate(counter = 1) %>%
group_by(id, fruit) %>%
mutate(fruitCnt = sum(counter)) %>%
ungroup() %>%
group_by(id) %>%
mutate(fruitCntRank = dense_rank(desc(fruitCnt))) %>%
select(id, fruit, fruitCntRank)
df2
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 2
7 1 banana 2
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 2
15 2 apple 2
16 2 banana 2
It doesn't matter which of orange or banana is ranked 3, and it doesn't even need to be consistent. I just need the groups to be ranked 1, 2, 3.
Desired result:
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 3
7 1 banana 3
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 3
15 2 apple 3
16 2 banana 2
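We can add a count for each id and fruit combination, arrange the rows in descending order of count, and get the rank using match. Within each id, match(fruit, unique(fruit)) numbers the fruits by order of first appearance, so tied fruits still get distinct, consecutive ranks.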
library(dplyr)
df %>%
  add_count(id, fruit) %>%
  arrange(id, desc(n)) %>%
  group_by(id) %>%
  mutate(n = match(fruit, unique(fruit)))
# Another option with cumsum and duplicated:
# mutate(n = cumsum(!duplicated(fruit)))
# id fruit n
# <dbl> <chr> <int>
# 1 1 apple 1
# 2 1 apple 1
# 3 1 apple 1
# 4 1 apple 1
# 5 1 orange 2
# 6 1 banana 3
# 7 1 banana 3
# 8 1 orange 2
# 9 2 orange 1
#10 2 orange 1
#11 2 orange 1
#12 2 orange 1
#13 2 banana 2
#14 2 apple 3
#15 2 apple 3
#16 2 banana 2
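For readers without dplyr, the same idea can be sketched in base R. This is a rough illustration rather than the answerer's method: rank_groups is a hypothetical helper written for this example, and it breaks ties by first appearance within each id, which the question says is acceptable.

```r
id <- c(rep(1, 8), rep(2, 8))
fruit <- c(rep("apple", 4), "orange", rep("banana", 2), "orange",
           rep("orange", 4), "banana", rep("apple", 2), "banana")
df <- data.frame(id, fruit, stringsAsFactors = FALSE)

# Hypothetical helper: rank the labels in f by descending count,
# breaking ties by the position where each label first appears.
rank_groups <- function(f) {
  counts    <- table(f)
  first_pos <- match(names(counts), f)
  ranked    <- names(counts)[order(-as.integer(counts), first_pos)]
  match(f, ranked)
}

# Apply the helper within each id; ave() is fed row positions so the
# result stays integer instead of being coerced to character.
df$fruitCntRank <- ave(seq_along(df$fruit), df$id,
                       FUN = function(i) rank_groups(df$fruit[i]))
df
```

On the example data this reproduces the desired 1, 2, 3 ranking within each id while leaving the rows in their original order.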

Renaming and merging columns based on an old/new name dataset

A confusing title.
Best explained by an example.
I have the following data
df <- "Green.Apple Red.Apple Pears Orange Lemon Lime
1 3 5 4 4 0 5
2 3 0 2 7 2 11
3 2 7 8 0 3 1
4 0 6 3 5 6 0 "
df <-read.table(text=df,header=T)
I would like to rename the columns using a lookup table of old and new names, and then merge any columns that end up with the same new name by summing them. I bring the names into the workspace:
names <- "Original New
1 Green.Apple Apple
2 Red.Apple Apple
3 Pears Pear
4 Orange Orange
5 Lemon Cirtus
6 Lime Cirtus"
#
names <-read.table(text=names,header=T)
I have tried various workarounds. For example, the names will always have the same length, so one could simply rename the columns by position from a list, but this is not robust and could cause errors in the larger task I am trying to accomplish.
This is what I am looking for:
yay <- "Apple Pear Orange Cirtus
1 8 4 4 5
2 3 2 7 13
3 9 8 0 4
4 6 3 5 6"
Many thanks
Jim
(controversial: Also open to a Pandas alternative)
You could also do:
names(df) <- names$New[match(names(df), names$Original)]
t(rowsum(t(df), group = colnames(df), na.rm = T))
# > t(rowsum(t(df), group = colnames(df), na.rm = T))
# Apple Cirtus Orange Pear
# 1 8 5 4 4
# 2 3 13 7 2
# 3 9 4 0 8
# 4 6 6 5 3
Use match to match old names with new names and rename df. Then use split.default to split based on similar names and sum similar columns.
names(df) <- names$New[match(names(df), names$Original)]
sapply(split.default(df, names(df)), rowSums)
# Apple Cirtus Orange Pear
#1 8 5 4 4
#2 3 13 7 2
#3 9 4 0 8
#4 6 6 5 3
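Both answers return the merged columns in alphabetical order, whereas the question lists them as Apple, Pear, Orange, Cirtus. A minimal sketch that keeps the order in which the new names first appear might look like the following. The data is re-typed from the question, and name_map, new_names and merged are names invented for this example (name_map stands in for the question's names data frame so that the base function names() is not shadowed).

```r
df <- data.frame(
  Green.Apple = c(3, 3, 2, 0),
  Red.Apple   = c(5, 0, 7, 6),
  Pears       = c(4, 2, 8, 3),
  Orange      = c(4, 7, 0, 5),
  Lemon       = c(0, 2, 3, 6),
  Lime        = c(5, 11, 1, 0)
)
name_map <- data.frame(
  Original = c("Green.Apple", "Red.Apple", "Pears", "Orange", "Lemon", "Lime"),
  New      = c("Apple", "Apple", "Pear", "Orange", "Cirtus", "Cirtus"),
  stringsAsFactors = FALSE
)

# Look up the new name for each column by its old name, not by position
new_names <- name_map$New[match(colnames(df), name_map$Original)]
stopifnot(!anyNA(new_names))  # fail loudly if a column has no mapping

# Sum the columns sharing a new name, in order of first appearance
merged <- sapply(unique(new_names),
                 function(nm) rowSums(df[new_names == nm]))
merged
```

The stopifnot guard addresses the worry about silent errors in the question: a column missing from the lookup table stops the script instead of producing an NA column name.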

ordered grouping of rows in R

I would like to create a new column that sequentially labels groups of rows. Original data:
> dt = data.table(index=(1:10), group = c("apple","apple","orange","orange","orange","orange","apple","apple","orange","apple"))
> dt
index group
1: 1 apple
2: 2 apple
3: 3 orange
4: 4 orange
5: 5 orange
6: 6 orange
7: 7 apple
8: 8 apple
9: 9 orange
10: 10 apple
Desired output:
index group id
1: 1 apple 1
2: 2 apple 1
3: 3 orange 1
4: 4 orange 1
5: 5 orange 1
6: 6 orange 1
7: 7 apple 2
8: 8 apple 2
9: 9 orange 2
10: 10 apple 3
dplyr attempt:
dt %>% group_by(group) %>% mutate( id= row_number())
# A tibble: 10 x 3
# Groups: group [2]
index group id
<int> <chr> <int>
1 1 apple 1
2 2 apple 2
3 3 orange 1
4 4 orange 2
5 5 orange 3
6 6 orange 4
7 7 apple 3
8 8 apple 4
9 9 orange 5
10 10 apple 5
How can I edit this to get the first group of apples as 1, then the first group of oranges as 1, then the second group of apples as 2, etc. (see desired output above)? I'm also open to a data.table solution.
library(data.table)
dt[, id := cumsum(c(TRUE, diff(index) > 1)), by="group"]
dt
# index group id
# 1: 1 apple 1
# 2: 2 apple 1
# 3: 3 orange 1
# 4: 4 orange 1
# 5: 5 orange 1
# 6: 6 orange 1
# 7: 7 apple 2
# 8: 8 apple 2
# 9: 9 orange 2
# 10: 10 apple 3
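To see why cumsum(c(TRUE, diff(index) > 1)) works, it may help to step through it for one group in plain base R. This is only an illustrative sketch: the vectors are re-typed from the question, and the intermediate names (apple_index, gaps, new_run, apple_id) are made up here. It also relies on an assumption the answer leaves implicit, namely that index numbers the rows consecutively, so that a gap larger than 1 within a group means another group sat in between.

```r
index <- 1:10
group <- c("apple", "apple", "orange", "orange", "orange",
           "orange", "apple", "apple", "orange", "apple")

apple_index <- index[group == "apple"]  # 1 2 7 8 10
gaps        <- diff(apple_index)        # 1 5 1 2
new_run     <- c(TRUE, gaps > 1)        # TRUE FALSE TRUE FALSE TRUE
apple_id    <- cumsum(new_run)          # 1 1 2 2 3

# The same computation for every group at once, in the original row order
id <- ave(index, group, FUN = function(i) cumsum(c(TRUE, diff(i) > 1)))
id                                      # 1 1 1 1 1 1 2 2 2 3
```

If index were not consecutive (for example after filtering rows), the gap test would no longer identify runs, and a run-length approach such as rle or data.table::rleid would be the safer choice.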
Starting from original dt:
library(dplyr)
dt %>%
  group_by(group) %>%
  mutate(id = cumsum(c(TRUE, diff(index) > 1))) %>%
  ungroup()
# # A tibble: 10 x 3
# index group id
# <int> <chr> <int>
# 1 1 apple 1
# 2 2 apple 1
# 3 3 orange 1
# 4 4 orange 1
# 5 5 orange 1
# 6 6 orange 1
# 7 7 apple 2
# 8 8 apple 2
# 9 9 orange 2
# 10 10 apple 3
Base R, perhaps a little clunky:
out <- do.call(rbind, by(dt, dt$group,
function(x) transform(x, id = cumsum(c(TRUE, diff(index) > 1)))))
out[order(out$index),]
# index group id
# apple.1 1 apple 1
# apple.2 2 apple 1
# orange.3 3 orange 1
# orange.4 4 orange 1
# orange.5 5 orange 1
# orange.6 6 orange 1
# apple.7 7 apple 2
# apple.8 8 apple 2
# orange.9 9 orange 2
# apple.10 10 apple 3
The names can be removed easily with rownames(out) <- NULL. The order part isn't necessary, but I wanted to present it in the same order as the other solutions, and do.call/by does not preserve the original order.
Another option using data.table::rleid twice:
dt[, gid := rleid(group)][, id := rleid(gid), .(group)]
We can also use rle from base R: number each run within its own label, then repeat that number by the run's length.
with(dt, with(rle(group), rep(ave(seq_along(values),
values, FUN = seq_along), lengths)))
#[1] 1 1 1 1 1 1 2 2 2 3

R, ggvis graphing from two data frames that both need to be grouped by

I'm creating a stacked bar graph with multiple horizontal lines running through it. This is done in a Shiny app. The user picks an option and depending on what it is, there could be either 2 or 3 horizontal lines.
Here is a minimal reproducible example:
df1 <- data.frame(a=as.factor(rep(1:10,2)),
b=sample(1:5,20, replace=T),
c=c(rep("apple",10), rep("banana",10)) )
df1 <- df1[order(df1$a, df1$c),]
df2 <- data.frame(a=as.factor(rep(1:10,2)),
i=c(rep(3,10),rep(4,10)),
j=c(rep("red",10), rep("green",10)) )
> df1
a b c
1 1 5 apple
11 1 2 banana
2 2 3 apple
12 2 3 banana
3 3 1 apple
13 3 2 banana
4 4 3 apple
14 4 1 banana
5 5 4 apple
15 5 3 banana
6 6 4 apple
16 6 2 banana
7 7 3 apple
17 7 4 banana
8 8 5 apple
18 8 1 banana
9 9 5 apple
19 9 2 banana
10 10 1 apple
20 10 3 banana
> df2
a i j
1 1 3 red
2 2 3 red
3 3 3 red
4 4 3 red
5 5 3 red
6 6 3 red
7 7 3 red
8 8 3 red
9 9 3 red
10 10 3 red
11 1 4 green
12 2 4 green
13 3 4 green
14 4 4 green
15 5 4 green
16 6 4 green
17 7 4 green
18 8 4 green
19 9 4 green
20 10 4 green
ggvis(data=df1, x=~a, y=~b) %>%
group_by(c) %>%
layer_bars(fill=~c) %>%
layer_paths(data=df2, x=~a, y=~i, strokeWidth:=2)
which gives me the following graph (it'll look different each time because of sample() ).
But I don't want the inverse Z in the middle. What I want is two parallel lines that are grouped by df2$j. But I'm not sure how to go about that with two data frames in my ggvis.
The reason I have df2 in a long form is because the user could choose an option that would create more than 2 horizontal lines. I don't want to use if and else to control for that. In my actual code, df1 and df2 are both reactives.
Thank you for your help in advance.
You can give layer_paths a dataset grouped on your y variable so the horizontal lines will be drawn separately for each group.
To do this, you can use data = group_by(df2, i) instead of data = df2.
Your code would then look like:
ggvis(data=df1, x=~a, y=~b) %>%
group_by(c) %>%
layer_bars(fill=~c) %>%
layer_paths(data = group_by(df2, i), x = ~a, y = ~i, strokeWidth:=2)

How to rank within groups in R?

OK, check out this data frame...
customer_name order_dates order_values
1 John 2010-11-01 15
2 Bob 2008-03-25 12
3 Alex 2009-11-15 5
4 John 2012-08-06 15
5 John 2015-05-07 20
Let's say I want to add a variable that ranks each customer's orders by order value, highest first, using the most recent order date as the tie breaker. So, ultimately the data should look like this:
customer_name order_dates order_values ranked_order_values_by_max_value_date
1 John 2010-11-01 15 3
2 Bob 2008-03-25 12 1
3 Alex 2009-11-15 5 1
4 John 2012-08-06 15 2
5 John 2015-05-07 20 1
Where everyone's single order gets 1, and all subsequent orders are ranked based on the value, and the tie breaker is the last order date getting priority.
In this example, John's 8/6/2012 order gets the #2 rank because it was placed after 11/1/2010. The 5/7/2015 order is 1 because it was the biggest. So, even if that order was placed 20 years ago, it should be the #1 Rank because it was John's highest order value.
Does anyone know how I can do this in R? Where I can Rank within a group of specified variables in a data frame?
Thanks for your help!
The top-rated answer (by cdeterman) is actually incorrect. The order function returns the positions of the 1st, 2nd, 3rd, etc. ranked values, not the ranks of the values in their current order.
Let’s take a simple example where we want to rank, starting with the largest, grouping by customer name. I have included a manual ranking so we can check the values.
> df
customer_name order_values manual_rank
1 John 2 5
2 John 5 2
3 John 9 1
4 John 1 6
5 John 4 3
6 John 3 4
7 Lucy 4 4
8 Lucy 9 1
9 Lucy 6 3
10 Lucy 2 6
11 Lucy 8 2
12 Lucy 3 5
If I run the code suggested by cdeterman I get the following incorrect ranks:
> df %>%
+ group_by(customer_name) %>%
+ mutate(my_ranks = order(order_values, decreasing=TRUE))
Source: local data frame [12 x 4]
Groups: customer_name [2]
customer_name order_values manual_rank my_ranks
<fctr> <dbl> <dbl> <int>
1 John 2 5 3
2 John 5 2 2
3 John 9 1 5
4 John 1 6 6
5 John 4 3 1
6 John 3 4 4
7 Lucy 4 4 2
8 Lucy 9 1 5
9 Lucy 6 3 3
10 Lucy 2 6 1
11 Lucy 8 2 6
12 Lucy 3 5 4
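order returns the positions that would sort a vector, which is why it is used to reorder data frames into increasing or decreasing order. What we actually want is to run the order function twice: applying order to that permutation inverts it, which gives the rank of each value in its original position.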
> df %>%
+ group_by(customer_name) %>%
+ mutate(good_ranks = order(order(order_values, decreasing=TRUE)))
Source: local data frame [12 x 4]
Groups: customer_name [2]
customer_name order_values manual_rank good_ranks
<fctr> <dbl> <dbl> <int>
1 John 2 5 5
2 John 5 2 2
3 John 9 1 1
4 John 1 6 6
5 John 4 3 3
6 John 3 4 4
7 Lucy 4 4 4
8 Lucy 9 1 1
9 Lucy 6 3 3
10 Lucy 2 6 6
11 Lucy 8 2 2
12 Lucy 3 5 5
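The difference between a single and a double order can be checked on John's values alone. The snippet below is a small base R sketch using the order_values from the example above; positions, ranks and ranks_via_rank are names introduced only for this illustration.

```r
x <- c(2, 5, 9, 1, 4, 3)  # John's order_values from the example above

# Where the largest, 2nd largest, ... values sit in x (not ranks)
positions <- order(x, decreasing = TRUE)   # 3 2 5 6 1 4

# Ordering that permutation inverts it, giving each value's rank
ranks <- order(positions)                  # 5 2 1 6 3 4

# Cross-check against rank() on the negated values
ranks_via_rank <- rank(-x, ties.method = "first")
all(ranks_via_rank == ranks)               # TRUE
```

The first result matches the incorrect my_ranks column for John, and the second matches manual_rank, which is the behaviour described above.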
You can do this pretty cleanly with dplyr
library(dplyr)
df %>%
group_by(customer_name) %>%
mutate(my_ranks = order(order(order_values, order_dates, decreasing=TRUE)))
Source: local data frame [5 x 4]
Groups: customer_name
customer_name order_dates order_values my_ranks
1 John 2010-11-01 15 3
2 Bob 2008-03-25 12 1
3 Alex 2009-11-15 5 1
4 John 2012-08-06 15 2
5 John 2015-05-07 20 1
This can be achieved with ave and rank. ave passes the proper groups to rank. The result from rank is reversed due to the requested order:
with(x, ave(as.numeric(order_dates), customer_name, FUN=function(x) rev(rank(x))))
## [1] 3 1 1 2 1
In base R you can do this with the slightly unwieldy
transform(df, rank = ave(1:nrow(df), customer_name,
  FUN = function(x) order(order(order_values[x], order_dates[x], decreasing = TRUE))))
customer_name order_dates order_values rank
1 John 2010-11-01 15 3
2 Bob 2008-03-25 12 1
3 Alex 2009-11-15 5 1
4 John 2012-08-06 15 2
5 John 2015-05-07 20 1
where order is provided both the primary and tie-breaker values for each group.
df %>%
  arrange(customer_name, desc(order_values), desc(order_dates)) %>%
  group_by(customer_name) %>%
  # rows are now sorted within each customer, so the row number is the rank
  mutate(rank2 = row_number())
Similar to @t-himmel's answer, you can get the ranks with data.table.
dt[ , rnk := order(order(order_values, decreasing = TRUE)), customer_name ]
