Dense ranking of column based on order of second column - r

I am beating my brains out on something that is probably straightforward. I want to get a "dense" ranking (as defined for the data.table::frank function) on a column in a data frame, but not based on that column's own order; the order should be given by another column (val in my example).
I managed to get the dense ranking with @Prasad Chalasani's solution, like this:
library(dplyr)
foo_df <- data.frame(id = c(4,1,1,3,3), val = letters[1:5])
foo_df %>% arrange(val) %>% mutate(id_fac = as.integer(factor(id)))
#>   id val id_fac
#> 1  4   a      3
#> 2  1   b      1
#> 3  1   c      1
#> 4  3   d      2
#> 5  3   e      2
But I would like the factor levels to be ordered based on val. Desired output:
foo_desired <- foo_df %>% arrange(val) %>% mutate(id_fac = as.integer(factor(id, levels = c(4,1,3))))
foo_desired
#>   id val id_fac
#> 1  4   a      1
#> 2  1   b      2
#> 3  1   c      2
#> 4  3   d      3
#> 5  3   e      3
I tried data.table::frank.
I tried both methods by @Prasad Chalasani.
I tried setting the order of id using id[rank(val)] (and sort(val), and order(val)).
Finally, I also tried to sort the levels using rank(val) etc., but this throws an error (Evaluation error: factor level [3] is duplicated.)
I know that one can specify the level order by hand; that is how I created the desired output. This solution is however not great, as my data has way more rows and levels.
I need this for convenience, in order to produce a table with a specific order, not for computations.
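To illustrate why frank alone doesn't help: it ranks by id's own sorted values, not by val.
library(data.table)
frank(foo_df$id, ties.method = "dense")
#> [1] 3 1 1 2 2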
Created on 2018-12-19 by the reprex package (v0.2.1)

You can do it with first(): after arranging by val, tag each id with the first val it appears with, then convert that to an integer factor.
foo_df %>%
  arrange(val) %>%
  group_by(id) %>%
  mutate(id_fac = first(val)) %>%
  ungroup() %>%
  mutate(id_fac = as.integer(factor(id_fac)))
# A tibble: 5 x 3
     id val    id_fac
  <dbl> <fctr>  <int>
1     4 a           1
2     1 b           2
3     1 c           2
4     3 d           3
5     3 e           3

Why do you even need factors? Not sure if I am missing something, but this gives your desired output.
You can use match to get id_fac based on the occurrence of ids.
library(dplyr)
foo_df %>%
  mutate(id_fac = match(id, unique(id)))
#  id val id_fac
#1  4   a      1
#2  1   b      2
#3  1   c      2
#4  3   d      3
#5  3   e      3
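Note that match(id, unique(id)) relies on the rows already being in val order (as they are in the example data); if they might not be, arrange first. A minimal sketch:
foo_df %>%
  arrange(val) %>%
  mutate(id_fac = match(id, unique(id)))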

Related

How to remove variables from a dataset that are actually named NA?

I have an Excel dataset that I have to work on. The problem is that empty cells were filled with the text "NA" instead of being left empty.
I'm trying to remove the NA values from the dataset. Usually I could use is.na() to omit them, but here they are literal "NA" strings, so I don't know how to go about this.
Any ideas to point me in the right direction?
You can try something like below:
library(dplyr)
df %>%
  mutate(across(everything(), ~ na_if(., 'NA'))) %>%
  na.omit()
# A tibble: 3 x 2
  Drinks ranked
  <chr>  <chr>
1 A      1
2 C      2
3 C      1
Data used:
df
# A tibble: 5 x 2
  Drinks ranked
  <chr>  <chr>
1 A      1
2 B      NA
3 NA     1
4 C      2
5 C      1
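If the file is read with readxl, another option is to declare the placeholder string at import time via the na argument, so the cells arrive as real missing values. A sketch (the file name is an assumption):
library(readxl)
df <- read_excel("data.xlsx", na = "NA")  # hypothetical file; "NA" strings become real NA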

R count() using dynamically generated list of variables/columns

If I have a tibble called observations with the following variables/columns:
category_1_red_length
category_1_red_width
category_1_red_depth
category_1_blue_length
category_1_blue_width
category_1_blue_depth
category_1_green_length
category_1_green_width
category_1_green_depth
category_2_red_length
category_2_red_width
category_2_red_depth
category_2_blue_length
category_2_blue_width
category_2_blue_depth
category_2_green_length
category_2_green_width
category_2_green_depth
Plus a load more. Is there a way to dynamically generate the following count()?
count(observations,
      category_1_red_length,
      category_1_red_width,
      category_1_red_depth,
      category_1_blue_length,
      category_1_blue_width,
      category_1_blue_depth,
      category_1_green_length,
      category_1_green_width,
      category_1_green_depth,
      category_2_red_length,
      category_2_red_width,
      category_2_red_depth,
      category_2_blue_length,
      category_2_blue_width,
      category_2_blue_depth,
      category_2_green_length,
      category_2_green_width,
      category_2_green_depth,
      sort = TRUE)
I can create the list of columns I want to count with:
columns_to_count <- list()
column_prefix <- 'category'
aspects <- c('red', 'blue', 'green')
dimensions <- c('length', 'width', 'depth')
for (x in 1:2) {
  for (aspect in aspects) {
    for (dimension in dimensions) {
      columns_to_count <- append(columns_to_count,
                                 paste(column_prefix, x, aspect, dimension, sep = '_'))
    }
  }
}
But then how do I pass my list of columns in columns_to_count to the count() function?
In my actual data set there are about 170 columns like this that I want to count so creating the list of columns without loops doesn't seem sensible.
I'm struggling to think of the name for what I'm trying to do, so I've been unable to find useful search results.
Thanks.
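As an aside, the name vector itself can also be built without explicit loops, e.g. with expand.grid and paste. A sketch that reproduces the same names (in a different order):
grid <- expand.grid(prefix = 'category', x = 1:2,
                    aspect = c('red', 'blue', 'green'),
                    dimension = c('length', 'width', 'depth'))
columns_to_count <- do.call(paste, c(grid, sep = '_'))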
You can use non-standard evaluation with syms and !!!. For example, using the mtcars dataset:
library(dplyr)
library(rlang)
cols <- c('am', 'cyl')
mtcars %>% count(!!!syms(cols), sort = TRUE)
#  am cyl  n
#1  0   8 12
#2  1   4  8
#3  0   6  4
#4  0   4  3
#5  1   6  3
#6  1   8  2
This is the same as doing
mtcars %>% count(am, cyl, sort = TRUE)
#  am cyl  n
#1  0   8 12
#2  1   4  8
#3  0   6  4
#4  0   4  3
#5  1   6  3
#6  1   8  2
You don't need to write the names into cols one by one by hand. You can use a regular expression if the column names share a specific pattern, or use positions to get the appropriate column names.
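For instance, applied to the observations tibble from the question, something like this should work (a sketch, assuming every relevant column starts with the category_ prefix):
cols <- grep("^category_", names(observations), value = TRUE)
observations %>% count(!!!syms(cols), sort = TRUE)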
You can use .dots to pass strings as variables:
count(observations, .dots=columns_to_count, sort=TRUE)
r$> d
  V1 V2
1  1  4
2  2  5
3  3  6
r$> count(d, .dots=list('V1', 'V2'))
# A tibble: 3 x 3
     V1    V2     n
  <int> <int> <int>
1     1     4     1
2     2     5     1
3     3     6     1
r$> count(d, V1, V2)
# A tibble: 3 x 3
     V1    V2     n
  <int> <int> <int>
1     1     4     1
2     2     5     1
3     3     6     1
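The .dots idiom comes from dplyr's older standard-evaluation interface and is deprecated in current releases; with dplyr >= 1.0, a tidyselect equivalent would be (a sketch):
d %>% count(across(all_of(c('V1', 'V2'))), sort = TRUE)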

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr select, similar to dropping a single column?
df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list('bob', 'cratchit', 'rules!', 'and', 'tiny tim too"')
)
df %>%
  select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a doable workaround, but I wanted to know if it can be done with select_if.
df %>%
  select(-which(map(df, class) == 'list'))
Use Negate:
df %>%
  select_if(Negate(is.list))
# A tibble: 5 x 2
  a         b
  <chr> <int>
1 A         1
2 B         2
3 C         3
4 D         4
5 E         5
There is also purrr::negate that would give the same result.
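For completeness, the purrr spelling, plus the where() form that supersedes select_if in dplyr >= 1.0 (a sketch):
library(purrr)
df %>% select_if(negate(is.list))
# or, with dplyr >= 1.0:
df %>% select(!where(is.list))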
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
#  a         b
#  <chr> <int>
#1 A         1
#2 B         2
#3 C         3
#4 D         4
#5 E         5

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
  year subject grade study_time
1    1       a    30         20
2    2       a    60         60
3    1       b    30         10
4    2       b    90        100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
  group_by(subject) %>%
  mutate(RN = row_number()) %>%
  mutate(study_time = study_time / study_time[RN == 1],
         grade = grade / grade[RN == 1]) %>%
  select(-RN)
I would get the following output
  year subject grade study_time
1    1       a     1          1
2    2       a     2          3
3    1       b     1          1
4    2       b     3         10
It's fairly easy to do when I know the variable names. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables to mutate; I'll only know the variable names not to mutate. I'm trying to get this done using tidyverse/data.table, and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to change multiple columns, dividing each element by the first element of its group:
library(dplyr)
df %>%
  group_by(subject) %>%
  mutate_at(3:4, funs(. / first(.)))
# A tibble: 4 x 4
# Groups: subject [2]
#   year subject grade study_time
#  <int> <chr>   <dbl>      <dbl>
#1     1 a           1          1
#2     2 a           2          3
#3     1 b           1          1
#4     2 b           3         10
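Since funs() is deprecated in current dplyr, the same idea with across() would look like this; selecting by exclusion also fits the constraint of only knowing which columns not to mutate (a sketch, assuming dplyr >= 1.0):
df %>%
  group_by(subject) %>%
  mutate(across(-year, ~ .x / first(.x))) %>%
  ungroup()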

R group by key get max value for multiple columns

I want to do something like this:
How to make a unique in R by column A and keep the row with maximum value in column B
Except my data.table has one key column, and multiple value columns. So say I have the following:
   a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1
If the key is column a, then for each unique a I want to return the row with the maximum b; if more than one row shares that maximum b, take the one with the maximum c, and so on through the remaining columns. So the result should be:
   a b c
1: 1 2 2
2: 2 3 3
3: 3 2 1
I'd also like this to be done for an arbitrary number of columns. So if my data.table had 20 columns, I'd want the max function to be applied in order from left to right.
Here is a suggested data.table solution. You might want to consider using data.table::frankv as follows:
DT[, .SD[which.max(frankv(.SD, ties.method = "first"))], by = a]
frankv returns the rank of each row across all columns. which.max picks the row with the largest rank (.N), and .SD[...] subsets to that particular row.
Please let me know if it fails for your larger dataset.
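An alternative data.table route that avoids .SD: order every column descending, then keep the first row per key (a sketch):
library(data.table)
DT <- data.table(a = c(1,1,1,2,2,2,3,3),
                 b = c(1,2,2,1,2,3,1,2),
                 c = c(1,1,2,1,5,3,4,1))
setorderv(DT, names(DT), order = -1L)  # sort all columns descending, by reference
unique(DT, by = "a")[order(a)]         # first row per a is now its row-wise maximum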
To make this work for any number of columns, a possible dplyr solution would be to use arrange_all:
df <- data.frame(a = c(1,1,1,2,2,2,3,3),
                 b = c(1,2,2,1,2,3,1,2),
                 c = c(1,1,2,1,5,3,4,1))
df %>% group_by(a) %>% arrange_all() %>% filter(row_number() == n())
# A tibble: 3 x 3
# Groups: a [3]
#      a     b     c
#1     1     2     2
#2     2     3     3
#3     3     2     1
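arrange_all() is superseded in current dplyr; the equivalent with across() and slice_tail() would be (a sketch, assuming dplyr >= 1.0):
df %>%
  arrange(across(everything())) %>%
  group_by(a) %>%
  slice_tail(n = 1)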
A generic solution for an arbitrary number of columns can also be built with arrange_at; in the example below, c("a","b","c") is an arbitrary vector of column names.
library(dplyr)
df %>%
  arrange_at(.vars = vars(c("a","b","c"))) %>%
  mutate(changed = ifelse(a != lead(a), TRUE, FALSE)) %>%
  filter(is.na(changed) | changed) %>%
  select(-changed)
#  a b c
#1 1 2 2
#2 2 3 3
#3 3 2 1
Another option is max with dplyr, as below. The approach is to first group_by a and filter for the maximum value of b, then group_by both a and b and filter for rows with the maximum value of c.
library(dplyr)
df %>%
  group_by(a) %>%
  filter(b == max(b)) %>%
  group_by(a, b) %>%
  filter(c == max(c))
# Groups: a, b [3]
#      a     b     c
#  <int> <int> <int>
#1     1     2     2
#2     2     3     3
#3     3     2     1
Data
df <- read.table(text = "a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1", header = TRUE, stringsAsFactors = FALSE)
dat <- data.frame(a = c(1,1,1,2,2,2,3,3),
                  b = c(1,2,2,1,2,3,1,2),
                  c = c(1,1,2,1,5,3,4,1))
A sqldf option (note that it leans on SQLite's bare-column GROUP BY behaviour, which standard SQL leaves unspecified):
library(sqldf)
sqldf("with d as (select * from 'dat' group by a order by b, c desc) select * from d order by a")
  a b c
1 1 2 2
2 2 3 3
3 3 2 1
