Minimum number of distinct elements after removing m items using R scripting

Need guidance on how to start writing code to find the minimum number of distinct elements after removing m items, using R scripting.
I don't know where to start.
For example, there are n = 6 items, ids = [1, 1, 1, 2, 3, 2], and the maximum number of items that can be deleted is m = 2. Two possible actions are to delete the two items with ID = 2, or one with ID = 2 and one with ID = 3. Either way, 2 item IDs remain in the final bag: either ids = [1, 1, 1, 3] or ids = [1, 1, 1, 2].
Any guidance would really help me.
Thanks.

Maybe you can try to define a function f:
f <- function(ids, m) {
  u <- table(ids)                # occurrence count of each unique id
  while (m > 0) {
    k <- which.min(u)            # id with the lowest remaining count
    # decrement that count and drop any id whose count reaches zero
    u <- subset(v <- replace(u, k, u[k] - 1), v > 0)
    m <- m - 1
  }
  rep(as.numeric(names(u)), u)   # rebuild the remaining bag of ids
}
which removes m items one at a time, always taking from the lowest-occurrence element of ids; table summarizes the occurrence of each unique element in ids.
Example
> f(ids,1)
[1] 1 1 1 2 2
> f(ids,2)
[1] 1 1 1 2
> f(ids,3)
[1] 1 1 1
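If you only need the count of distinct ids that remain, rather than the remaining bag itself, here is a minimal loop-free sketch based on the same idea of deleting whole low-occurrence classes first (min_distinct is a hypothetical name, not from the original answer):
min_distinct <- function(ids, m) {
  cnt <- sort(table(ids))          # class sizes, smallest first
  removable <- cumsum(cnt) <= m    # classes that can be deleted entirely
  length(cnt) - sum(removable)
}
min_distinct(c(1, 1, 1, 2, 3, 2), 2)
# [1] 2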

If I understand your question right, you can solve this as follows (I'm using the library dplyr).
First step is to count how many times each id appears:
library(dplyr)
m <- 2
dat <- tibble(id = c(1, 1, 1, 2, 3, 2))
comptage <- dat %>%
  group_by(id) %>%
  count() %>%
  ungroup()
Which gives you:
> comptage
# A tibble: 3 x 2
id n
<dbl> <int>
1 1 3
2 2 2
3 3 1
Then, to find how many ids you can remove, arrange the table by n (the count variable), compute the cumulative sum, and keep only the ids whose cumulative sum exceeds the removal limit m (those are the ids that cannot be deleted entirely):
comptage <- comptage %>%
  arrange(n) %>%
  mutate(sum_cum = cumsum(n))
> comptage
# A tibble: 3 x 3
id n sum_cum
<dbl> <int> <int>
1 3 1 1
2 2 2 3
3 1 3 6
result <- comptage %>%
  filter(sum_cum > m)
> result
# A tibble: 2 x 3
id n sum_cum
<dbl> <int> <int>
1 2 2 3
2 1 3 6
length(result$id)
# [1] 2
The length of result$id is the minimum number of distinct ids left. This is probably not the easiest way to do it, but it should work.

Related

Re-order columns: split the second half of columns up so that they come as every second

I want to rearrange a dataframe so that 1) the first column always stays first, and 2) the second half of the remaining columns is interleaved so that those columns come every second instead.
Please see the example data below:
# Example data
library(tibble)
N <- "AB"
l_x_1 <- 1
l_x_2 <- 2
l_x_3 <- 3
# ... note: it should be able to handle a different number of columns
s_x_1 <- 1
s_x_2 <- 2
s_x_3 <- 3
# ... note: it should be able to handle a different number of columns
# (although always an equal number of s_ and l_ columns)
# Current state
df <- tibble(N, l_x_1, l_x_2, l_x_3, s_x_1, s_x_2, s_x_3)
df
# What I want (but with the potential to handle a different number of l_ and s_ columns)
df <- tibble(N, l_x_1, s_x_1, l_x_2, s_x_2, l_x_3, s_x_3)
df
df[c(1, matrix(seq_along(df)[-1], 2, byrow = TRUE))]
# # A tibble: 1 × 7
# N l_x_1 s_x_1 l_x_2 s_x_2 l_x_3 s_x_3
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 AB 1 1 2 2 3 3
The idea is to reorder column indices with matrix(). E.g.
c(matrix(1:6, 2, byrow = TRUE))
# [1] 1 4 2 5 3 6
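Since the question asks for something that handles a varying number of columns, the same trick generalizes automatically; for instance, with four pairs instead of three (my own check, not from the original answer):
c(matrix(1:8, 2, byrow = TRUE))
# [1] 1 5 2 6 3 7 4 8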
A solution based on stringi::stri_reverse:
cbind(df[1], df[-1][order(stringi::stri_reverse(colnames(df[-1])))])
N l_x_1 s_x_1 l_x_2 s_x_2 l_x_3 s_x_3
1 AB 1 1 2 2 3 3
This relies on the fact that if you reverse every name, sorting recovers the desired order:
> sort(stringi::stri_reverse(colnames(df[-1])))
[1] "1_x_l" "1_x_s" "2_x_l" "2_x_s" "3_x_l" "3_x_s"
library(tidyverse)
(preferred_order <- str_split(names(df)[-1], "_",
                              n = 3,  # because 2 underscores
                              simplify = TRUE) |>
    data.frame() |>
    arrange(X3, X1) |>  # also because 2 underscores
    rowwise() |>
    mutate(nm = paste0(c_across(), collapse = "_")) |>
    pull(nm))
relocate(df, all_of(preferred_order), .after = "N")

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
  a = rep(1:2, each = 5),
  b = 1:10,
  step_window = c(2, 3, 1, 2, 4, 1, 2, 3, 2, 1)
)
I'm trying to sum up the values of b within the groups a. The trick is, for each row I want the sum of b over that row and the following rows, where the number of rows to include is given by step_window.
This is the output I'm looking for:
data.frame(
  a = rep(1:2, each = 5),
  step_window = c(2, 3, 1, 2, 4,
                  1, 2, 3, 2, 1),
  b = 1:10,
  sum_b_step_window = c(3, 9, 3, 9, 5,
                        6, 15, 27, 19, 10)
)
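For example, row 2 of group a = 1 has step_window = 3, so its value sums b over rows 2 through 4 of that group: 2 + 3 + 4 = 9. Row 5 has step_window = 4, but it is the last row of its group, so the sum is just its own b value, 5.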
I tried to do this using RcppRoll, but I get an error, Expecting a single value:
df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure if a variable window size is possible in any of the rolling functions. Here is one way to do this using map2_dbl:
library(dplyr)
df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
                                             ~ sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
# a b step_window sum_b_step_window
# <int> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 9
# 3 1 3 1 3
# 4 1 4 2 9
# 5 1 5 4 5
# 6 2 6 1 6
# 7 2 7 2 15
# 8 2 8 3 27
# 9 2 9 2 19
#10 2 10 1 10
1) rollapply
rollapply in zoo supports vector widths. partial = TRUE says that if the window runs past the end of the group, use just the values within the data. (Another possibility would be to use fill = NA instead, in which case it would fill with NAs if there were not enough data left; see the sketch after this block.) align = "left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)
df %>%
  group_by(a) %>%
  mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
  ungroup
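For comparison, a sketch of the fill = NA variant mentioned above (my own variation on the same call; windows that run past the group should come back as NA rather than a partial sum):
df %>%
  group_by(a) %>%
  mutate(sum = rollapply(b, step_window, sum, fill = NA, align = "left")) %>%
  ungroup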
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then, for each row, summing over all rows that satisfy it.
library(sqldf)
sqldf("select A.*, sum(B.b) as sum
from df A
left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
and A.a = B.a
group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)
df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = hop_vec(b, row_number(), step_window + row_number() - 1, sum)) %>%
  ungroup()
It is flexible on different window sizes.
Output:
# A tibble: 10 x 4
a b step_window sum_b_step_window
<int> <int> <dbl> <int>
1 1 1 2 3
2 1 2 3 9
3 1 3 1 3
4 1 4 2 9
5 1 5 4 5
6 2 6 1 6
7 2 7 2 15
8 2 8 3 27
9 2 9 2 19
10 2 10 1 10
slider is a fairly new tidyverse package dedicated to sliding-window functions. Have a look here for more info: page, vignette.
hop is the engine of slider. With this solution we are passing different .starts and .stops values to sum the values of b within the a groups.
With the _vec suffix you're asking hop to return a vector, a double in this case.
row_number() is a dplyr function that returns the row number within each group, which is what lets you slide along the rows.
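A minimal standalone illustration (my own example, not from the answer) of how the .starts/.stops pairs drive hop_vec:
library(slider)
# sum elements 1 to 2, then elements 3 to 5
hop_vec(1:5, .starts = c(1, 3), .stops = c(2, 5), .f = sum)
# [1] 3 12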
A data.table solution using cumulative sums:
library(data.table)
setDT(df)
df[, sum_b_step_window := {
  cs <- c(0, cumsum(b))  # prefix sums with a leading 0
  # window sum = prefix sum at the window end minus prefix sum before the start,
  # with the end index clamped to the group size
  cs[pmin(.N + 1, 1:.N + step_window)] - cs[pmax(1, (1:.N))]
}, by = a]
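The prefix-sum identity behind this: with cs <- c(0, cumsum(b)), the sum of b[i..j] equals cs[j + 1] - cs[i]. A quick worked check of my own on group a = 1, where b = 1:5:
cs <- c(0, cumsum(1:5))  # 0 1 3 6 10 15
cs[4 + 1] - cs[2]        # sum of b[2..4] = 2 + 3 + 4
# [1] 9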

Creating new rows for each group using values from first row of group

I need to create a new row for each group of a grouped tibble, based on values from the first row of each group.
I am trying to use do(add_row()) to create the new row, and top_n() to access the value from the first row of each group.
df <- tibble(ID = rep(1:2, each = 2), x = rep(1:2, each = 2), y = seq(1:4))
gb_df <- group_by(df, ID, x)
new_df <- gb_df %>%
  do(add_row(., ID = top_n(., 1, wt = y)[, "ID"], x = 0,
             y = top_n(., 1, wt = y)[, "y"] - 1, .before = 0))
However I get the following error message.
Error: Columns `ID`, `y` must be 1d atomic vectors or lists
I want the following output.
> new_df
# A tibble: 6 x 3
# Groups: ID [3]
ID x y
<dbl> <dbl> <dbl>
1 1 0 0
2 1 4 1
3 1 4 2
4 2 0 2
5 2 5 3
6 2 5 4
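Since the do(add_row()) attempt fails, here is a minimal sketch (not from the original thread) of one way to prepend a row per group with group_modify(); following the attempt above, it assumes the new row should get x = 0 and y equal to one less than the group's first y:
library(dplyr)
library(tibble)
new_df <- df %>%
  group_by(ID) %>%
  group_modify(~ add_row(.x, x = 0, y = first(.x$y) - 1, .before = 1)) %>%
  ungroup()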

Adding a column that counts sequential numbers

I would like to add a column that counts the number of consecutive values. Most of what I am seeing on here is how to count duplicate values (1, 1, 1, 1, 1), whereas I would like to count runs where the number goes up by 1 (5, 6, 7, 8, 9). The ID column is what I have and the Counter column is what I would like to create. Thanks!
ID Counter
5 1
6 2
7 3
8 4
10 1
11 2
13 1
14 2
15 3
16 4
A solution using the dplyr package. The idea is to calculate the difference between consecutive numbers to create a grouping column, and then assign a counter within each group.
library(dplyr)
dat2 <- dat %>%
  mutate(Diff = ID - lag(ID, default = 0),
         Group = cumsum(Diff != 1)) %>%
  group_by(Group) %>%
  mutate(Counter = row_number()) %>%
  ungroup() %>%
  select(-Diff, -Group)
dat2
# # A tibble: 10 x 2
# ID Counter
# <int> <int>
# 1 5 1
# 2 6 2
# 3 7 3
# 4 8 4
# 5 10 1
# 6 11 2
# 7 13 1
# 8 14 2
# 9 15 3
# 10 16 4
DATA
dat <- read.table(text = "ID
5
6
7
8
10
11
13
14
15
16",
header = TRUE, stringsAsFactors = FALSE)
A loop version is simple, although counter must be initialized first and the if/else needs braces to parse inside the loop:
counter <- rep(1, length(ID))
for (i in 2:length(ID)) {
  if (ID[i] - ID[i - 1] == 1) {
    counter[i] <- counter[i - 1] + 1
  } else {
    counter[i] <- 1
  }
}
But this loop will perform very badly for n > 10^4! I'll try to think of a vectorized solution!
You can use shift() from data.table:
library(data.table)
s <- dat$ID - shift(dat$ID)  # difference to the previous ID (NA for the first)
s[is.na(s)] <- 1
ave(s, cumsum(s != 1), FUN = seq_along)
[1] 1 2 3 4 1 2 1 2 3 4
This one relies solely on efficient vector arithmetic. The idea goes as follows:
1. take the cumulative sum of the differences of ID;
2. subtract the accumulated value wherever the jump is bigger than one.
cum <- c(0, cumsum(diff(ID)))      # cumulative difference of ID
ccm <- cum * c(1, (diff(ID) > 1))  # entries at jumps > 1 keep their value
# subtract the value at each jump > 1 from all following numbers (see link below)
# note: rep(0, n) is needed because ccm[...] starts at the first non-zero value
counter <- cum - c(rep(0, which(diff(ID) != 1)[1]),
                   ccm[which(ccm != 0)][cumsum(ccm != 0)]) + 1
Notes:
Reference for the highly efficient fill function by nacnudus: Fill in data frame with values from rows above
Restriction: ID must be monotonically increasing
That should deal with your millions of rows efficiently!
Another solution:
breaks <- c(which(diff(ID) != 1), length(ID))  # last index of each consecutive run
x <- c(breaks[1], diff(breaks))                # length of each run
unlist(sapply(x, seq_len))                     # count 1, 2, ... within every run

Get most frequent value(s) from a list of lists

I am trying to convert an LDA prediction result, which is a list object containing hundreds of lists (of topics, in numeric, assigned to each token in a document), such as the following example:
assignments <- list(
  as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
  as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
  as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)
where each list of the list object has a different length corresponding to the length of each tokenized document.
What I want to do is to 1) get the most frequent topic (1, 2, 3) out of each list, and 2) convert the result into tbl or data.frame format like this:
document topic freq
1 1 6
2 2 5
3 3 6
such that I can use inner_join() to merge this "consensus" prediction with the topic assignment results generated by tm or topicmodels applications and compare their precision, etc. Since assignments is in list format, I cannot apply the top_n() function to get the most frequent topic for each list. I tried using lapply(unlist(assignments), count), but it didn't give me what I want.
You can iterate over the list with sapply, get frequencies with table, and extract the first value from the sorted result:
result <- sapply(assignments, function(x) sort(table(x), decreasing = TRUE)[1])
data.frame(document = seq_along(assignments),
           topic = as.integer(names(result)),
           freq = result)
document topic freq
1 1 1 6
2 2 2 5
3 3 3 6
We can loop through the list, get the frequency of elements with tabulate, find the index of the maximum element, extract it along with its frequency as a data.frame, and rbind the list elements:
do.call(rbind, lapply(seq_along(assignments), function(i) {
  x <- assignments[[i]]
  ux <- unique(x)
  i1 <- tabulate(match(x, ux))   # counts of each unique value
  data.frame(document = i, topic = ux[which.max(i1)], freq = max(i1))
}))
# document topic freq
#1 1 1 6
#2 2 2 5
#3 3 3 6
Or another option is to convert the list to a two-column dataset and then group to find the rows with the maximum count:
library(data.table)
setDT(stack(setNames(assignments, seq_along(assignments))))[,
  .(freq = .N), .(document = ind, topic = values)][, .SD[freq == max(freq)], document]
# document topic freq
#1: 1 1 6
#2: 2 2 5
#3: 3 3 6
Or we can use the tidyverse:
library(tidyverse)
map(assignments, as_tibble) %>%
  bind_rows(.id = 'document') %>%
  count(document, value) %>%
  group_by(document) %>%
  filter(n == max(n)) %>%
  ungroup %>%
  rename_at(2:3, ~ c('topic', 'freq'))
# A tibble: 3 x 3
# document topic freq
# <chr> <int> <int>
#1 1 1 6
#2 2 2 5
#3 3 3 6
Using purrr::imap_dfr:
library(tidyverse)
imap_dfr(assignments, ~ tibble(
  document = .y,
  Topic = names(which.max(table(.x))),
  freq = max(tabulate(.x))))
# # A tibble: 3 x 3
# document Topic freq
# <int> <chr> <int>
# 1 1 1 6
# 2 2 2 5
# 3 3 3 6
