Get most frequent value(s) from a list of lists - r

I am trying to convert an LDA prediction result, which is a list object containing hundreds of lists (the numeric topics assigned to each token in a document), such as the following example:
assignments <- list(
as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)
where each element of the list has a different length, corresponding to the length of each tokenized document.
What I want to do is 1) get the most frequent topic (1, 2, 3) out of each list, and 2) convert the result into tbl or data.frame format like this:
document topic freq
1 1 6
2 2 5
3 3 6
such that I can use inner_join() to merge this "consensus" prediction with the topic assignment results generated by tm or topicmodels applications and compare their precision, etc. Since the assignments object is a list, I cannot apply the top_n() function to get the most frequent topic for each list. I tried using lapply(unlist(assignments), count), but it didn't give me what I want.

You can iterate over the list with sapply, get frequency with table and extract first value from sorted result:
result <- sapply(assignments, function(x) sort(table(x), decreasing = TRUE)[1])
data.frame(document = seq_along(assignments),
topic = as.integer(names(result)),
freq = result)
document topic freq
1 1 1 6
2 2 2 5
3 3 3 6
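The title asks for the most frequent value(s), and taking the first element of the sorted table keeps only one topic when several are tied. A minimal base R sketch that returns every tied topic might look like the following; the helper name top_topics and the fourth document (which contains a tie) are assumptions added purely for illustration.

```r
assignments <- list(
  as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
  as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
  as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2)),
  as.integer(c(1, 1, 2, 2))  # hypothetical extra document with a tie
)

# Hypothetical helper: one row per most-frequent topic in a document
top_topics <- function(x, document) {
  counts <- table(x)
  n <- as.vector(counts)       # plain integer counts
  keep <- n == max(n)          # TRUE for every topic that reaches the maximum
  data.frame(
    document = document,
    topic = as.integer(names(counts)[keep]),
    freq = n[keep]
  )
}

consensus <- do.call(rbind, Map(top_topics, assignments, seq_along(assignments)))
rownames(consensus) <- NULL
consensus
```

With this input the first three documents each contribute one row, and the tied document contributes two rows (topics 1 and 2, each with freq 2).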

We can loop through the list, get the frequency of elements with tabulate, find the index of maximum elements, extract those along with the frequency as a data.frame and rbind the list elements
do.call(rbind, lapply(seq_along(assignments), function(i) {
x <- assignments[[i]]
ux <- unique(x)
i1 <- tabulate(match(x, ux))
data.frame(document = i, topic = ux[which.max(i1)], freq = max(i1))})
)
# document topic freq
#1 1 1 6
#2 2 2 5
#3 3 3 6
Or another option is to convert it to a two column dataset and then do group by to find the index of max values
library(data.table)
setDT(stack(setNames(assignments, seq_along(assignments))))[,
.(freq = .N), .(document = ind, topic = values)][, .SD[freq == max(freq)], document]
# document topic freq
#1: 1 1 6
#2: 2 2 5
#3: 3 3 6
Or we can use tidyverse
library(tidyverse)
map(assignments, as_tibble) %>%
bind_rows(.id = 'document') %>%
count(document, value) %>%
group_by(document) %>%
filter(n == max(n)) %>%
ungroup %>%
rename_at(2:3, ~c('topic', 'freq'))
# A tibble: 3 x 3
# document topic freq
# <chr> <int> <int>
#1 1 1 6
#2 2 2 5
#3 3 3 6

Another option, using purrr::imap_dfr:
library(tidyverse)
imap_dfr(assignments,~ tibble(
document = .y,
Topic = names(which.max(table(.x))),
freq = max(tabulate(.x))))
# # A tibble: 3 x 3
# document Topic freq
# <int> <chr> <int>
# 1 1 1 6
# 2 2 2 5
# 3 3 3 6


Setting missing values using labelled package across multiple columns?

I am using the labelled package and trying to set user-defined missing values. I have a dataframe where I want to set missing values for a list of specific columns rather than the entire dataset.
Currently I have to type out each column (s2 and s3). Is there a more efficient way? My full dataset has dozens of columns.
df <- tibble(s1 = c(1, 2, 3, 9), s2 = c(1, 1, 2, 9), s3 = c(1, 1, 2, 9))
df <- df %>%
set_na_values(., s2 = 9) %>%
set_na_values(., s3 = 9)
na_values(df$s1)
na_values(df$s2)
na_values(df$s3)
The set_na_values() function takes multiple pairs so you don't need to call it more than once:
library(labelled)
library(dplyr)
df %>%
set_na_values(s2 = 9, s3 = 9)
If you are dealing with a lot of variables, you could programmatically build a named vector (or a list, if there are multiple missing values per variable) and splice it inside the function. If, as in your comment, you want to apply it to everything except the s1 variable, you can do:
nm <- setdiff(names(df), "s1")
df %>%
set_na_values(!!!setNames(rep(9, length(nm)), nm))
# A tibble: 4 x 3
s1 s2 s3
<dbl> <dbl+lbl> <dbl+lbl>
1 1 1 1
2 2 1 1
3 3 2 2
4 9 9 (NA) 9 (NA)
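To make the splicing step more concrete, here is a small base R sketch of what the named vector and the list variant look like before they are handed to set_na_values(). The names na_single and na_multi, and the extra missing code 8, are assumptions used only for illustration; the sketch does not call the labelled package itself.

```r
df <- data.frame(s1 = c(1, 2, 3, 9), s2 = c(1, 1, 2, 9), s3 = c(1, 1, 2, 9))

# Every column except s1
nm <- setdiff(names(df), "s1")

# One missing value per variable: a named numeric vector
na_single <- setNames(rep(9, length(nm)), nm)
na_single

# Several missing values per variable: a named list of vectors
# (8 is a hypothetical second missing code)
na_multi <- setNames(rep(list(c(8, 9)), length(nm)), nm)
str(na_multi)
```

Each name becomes an argument name and each element becomes that variable's missing value(s) once the object is spliced with !!!.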
Alternatively, you can use labelled_spss() and take advantage of across() which allows tidyselect semantics (but this will overwrite any existing labelled values):
df %>%
mutate(across(-s1, labelled_spss, na_values = 9))
# A tibble: 4 x 3
s1 s2 s3
<dbl> <dbl+lbl> <dbl+lbl>
1 1 1 1
2 2 1 1
3 3 2 2
4 9 9 (NA) 9 (NA)
To keep any existing value labels, use:
df %>%
mutate(across(-s1, ~ labelled_spss(.x, labels = val_labels(.x), na_values = 9)))

R: sequential ranking (1,1,1,2,2,2,etc) based on date of entry for each patient? [duplicate]

I'm trying to create a column that ranks each person based on their date of entry, but since everyone's date of entry is unique, it's been challenging.
here's a reprex:
df <- data.frame(
unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
date_of_entry = c("3-12-2001", "3-13-2001", "3-14-2001", "4-1-2001", "4-2-2001", "3-28-2001", "3-29-2001", "3-30-2001"))
What I want:
df_desired <- data.frame(
unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
date_of_entry = c("3-12-2001", "3-13-2001", "3-14-2001", "4-1-2001", "4-2-2001", "3-28-2001", "3-29-2001", "3-30-2001"),
day_at_facility = c(1, 2, 3, 1, 2, 1, 2, 3))
Basically, I want to number the days at the facility, but the count needs to restart for each unique ID. Let me know if this is not clear.
(This is a dupe of something, haven't found it yet, but in the interim ...)
base R
ave(rep(1L,nrow(df)), df$unique_id, FUN = seq_along)
# [1] 1 2 3 1 2 1 2 3
so therefore
df$day_at_facility <- ave(rep(1L,nrow(df)), df$unique_id, FUN = seq_along)
dplyr
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(day_at_facility = row_number())
# # A tibble: 8 x 3
# # Groups: unique_id [3]
# unique_id date_of_entry day_at_facility
# <dbl> <chr> <int>
# 1 1 3-12-2001 1
# 2 1 3-13-2001 2
# 3 1 3-14-2001 3
# 4 2 4-1-2001 1
# 5 2 4-2-2001 2
# 6 3 3-28-2001 1
# 7 3 3-29-2001 2
# 8 3 3-30-2001 3
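Both approaches above number rows in the order they appear, which matches the example because it is already sorted by date. If the rows might arrive unsorted, one possible sketch is to parse the dates and rank them within each ID. The shuffled data below and the month-day-year format string are assumptions based on how the example dates look.

```r
# Same data as the question, but with rows deliberately out of order
df <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  date_of_entry = c("3-14-2001", "3-12-2001", "3-13-2001", "4-2-2001",
                    "4-1-2001", "3-28-2001", "3-30-2001", "3-29-2001")
)

# Parse the strings as month-day-year dates
entry_date <- as.Date(df$date_of_entry, format = "%m-%d-%Y")

# Rank the dates within each unique_id; ties keep their row order
df$day_at_facility <- ave(
  as.numeric(entry_date), df$unique_id,
  FUN = function(d) rank(d, ties.method = "first")
)
df
```

Ranking on parsed dates rather than on the raw strings matters because character sorting would place "3-3-2001" after "3-28-2001".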

Minimum number of distinct elements after removing m items using R Scripting

Need guidance on how to start writing a code to find minimum number of distinct elements after removing m items using R scripting.
I don't know where to start.
For example, there are n = 6 items, ids = [1, 1, 1, 2, 3, 2], and the maximum number of items that can be deleted is m = 2. Two possible actions are delete two items with ID = 2 or one with ID = 2 and one with ID = 3. Either way she will have 2 item IDs in the final bag: either ids = [1, 1, 1, 3] or ids = [1, 1, 1, 2].
Any guidance would really help me.
Thanks.
Maybe you can try to define a function f
f <- function(ids, m) {
u <- table(ids)
while (m > 0) {
k <- which.min(u)
u <- subset(v <- replace(u, k, u[k] - 1), v > 0)
m <- m - 1
}
rep(as.numeric(names(u)), u)
}
which removes m items one at a time, each time taking one from the id that currently occurs least often; here table summarizes how often each unique element occurs in ids.
Example
> f(ids,1)
[1] 1 1 1 2 2
> f(ids,2)
[1] 1 1 1 2
> f(ids,3)
[1] 1 1 1
If I understand your question right, you can solve this as follows
(I'm using the dplyr library).
The first step is to count how many times each id appears:
library(dplyr)
m <- 2
dat <- tibble(id = c(1,1,1,2,3,2))
comptage <- dat %>%
group_by(id) %>%
count() %>%
ungroup()
Which gives you:
> comptage
# A tibble: 3 x 2
id n
<dbl> <int>
1 1 3
2 2 2
3 3 1
Then, to find how many ids can be removed entirely, arrange the table by "n" (the count variable) and compute the cumulative sum. An id can be removed completely only if the cumulative sum up to and including it is at most m, so keep the ids whose cumulative sum is greater than m:
comptage <- comptage %>%
arrange(n) %>%
mutate(sum_cum = cumsum(n))
> comptage
# A tibble: 3 x 3
id n sum_cum
<dbl> <int> <int>
1 3 1 1
2 2 2 3
3 1 3 6
result <- comptage %>%
filter(sum_cum > m)
> result
# A tibble: 2 x 3
id n sum_cum
<dbl> <int> <int>
1 2 2 3
2 1 3 6
length(result$id)
This is probably not the easiest way to do it but it should work.
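The count-then-cumulative-sum idea can also be condensed into a few lines of base R. This is only a sketch under the assumption that removing the rarest ids first is the right greedy strategy; the function name min_distinct is hypothetical.

```r
# Hypothetical helper: smallest number of distinct ids left after
# deleting at most m items, removing the rarest ids first
min_distinct <- function(ids, m) {
  counts <- sort(as.vector(table(ids)))  # occurrences per id, ascending
  # An id disappears only if all of its items (and those of every rarer id)
  # fit within the m deletions, i.e. its cumulative count is <= m
  sum(cumsum(counts) > m)
}

ids <- c(1, 1, 1, 2, 3, 2)
min_distinct(ids, 2)
min_distinct(ids, 3)
```

For the example in the question (m = 2) this gives 2, and with m = 3 only the id 1 remains, so the result is 1.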

function will not work with dplyr's select wrappers (contains, ends_with) [duplicate]

I'm trying to calculate row means on a dataset. I found a helpful function someone made here (dplyr - using mutate() like rowmeans()), and it works when I try out every column but not when I try to use a dplyr helper function.
Why does this work:
#The rowmeans function that works
my_rowmeans = function(..., na.rm=TRUE){
x =
if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
else list(...)
d = Reduce(function(x,y) x+!is.na(y), list(...), init=0)
Reduce(`+`, x)/d
}
#The data
library(tidyverse)
data <- tibble(id = c(1:4),
turn_intent_1 = c(5, 1, 1, 4),
turn_intent_2 = c(5, 1, 1, 3),
turn_intent_3R = c(5, 5, 1, 3))
#The code that is cumbersome but works
data %>%
mutate(turn_intent_agg = my_rowmeans(turn_intent_1, turn_intent_2, turn_intent_3R))
#The output
# A tibble: 4 x 5
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
But this does not work:
#The code
data %>%
mutate(turn_intent_agg = select(., contains("turn")) %>%
my_rowmeans())
#The output
Error in class1Def@contains[[class2]] : no such index at level 1
Of course, I can type each column, but this dataset has many columns. It'd be much easier to use these wrappers.
I need the output to look like the correct one shown that contains all columns (such as id).
Thank you!
I think that you can simplify it to:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn"))))
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
And you can indeed add also the na.rm = TRUE parameter:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn")), na.rm = TRUE))
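As for why the original attempt fails: a likely explanation is that select() hands my_rowmeans() a single data frame, while the function expects each column as a separate argument through its dots. The base R sketch below illustrates the difference with a simplified stand-in called row_mean_dots (a hypothetical name, not the function from the question) and uses do.call() to spread the columns into separate arguments.

```r
# Simplified stand-in for a function that takes columns through ...
row_mean_dots <- function(..., na.rm = TRUE) {
  cols <- list(...)                      # one list element per argument
  rowMeans(do.call(cbind, cols), na.rm = na.rm)
}

data <- data.frame(
  id = 1:4,
  turn_intent_1 = c(5, 1, 1, 4),
  turn_intent_2 = c(5, 1, 1, 3),
  turn_intent_3R = c(5, 5, 1, 3)
)

# Pick the columns whose names contain "turn"
turn_cols <- data[grepl("turn", names(data), fixed = TRUE)]

# do.call() passes each column as its own argument
data$turn_intent_agg <- do.call(row_mean_dots, as.list(turn_cols))
data
```

rowMeans() sidesteps the issue because it accepts the whole data frame as one argument, which is why the simplified answer works with select().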

Extracting corresponding dataframe values from multiple records using a function

I have a dataframe (df1) containing many records. Each record has up to three trials, and each trial can be repeated up to five times. Below is an example of some of the data I have:
Record Trial Start End Speed Number
1 2 1 4 12 9
1 2 4 6 11 10
1 3 1 3 10 17
2 1 1 5 14 5
I have the following code that calculates the longest 'Distance' and 'Maximum Number' for each Record:
library(dplyr)

getInfo <- function(race_df) {
  # Longest distance (End - Start) per record and trial
  race_distance <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.distance = max(End - Start)))
  # Largest Number per record and trial
  race_max_number <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.N = max(Number)))
  # Combine both summaries on their shared Record and Trial columns
  rd_rmn_merge <- as.data.frame(merge(x = race_distance, y = race_max_number))
  total_summary <- as.data.frame(rd_rmn_merge[order(rd_rmn_merge$Trial), ])
  return(list(race_distance, race_max_number, total_summary))
}
list_summary <- getInfo(race_df)
total_summary <- list_summary[[3]]
list_summary gives me an output like this:
[[1]]
Record Trial Max.Distance
1 2 3
1 3 2
2 1 4
[[2]]
Record Trial Max.Number
1 2 10
1 3 17
2 1 5
[[3]]
Record Trial Max.Distance Max.Number
1 2 3 10
1 3 2 17
2 1 4 5
I am now trying to find the longest distance together with its corresponding 'Number', regardless of whether that Number is the maximum. Record 1, Trial 2 would then look like this instead:
Record Trial Max.Distance Corresponding Number
1 2 3 9
Eventually I would like to be able to create a function that is able to take arguments 'Record' and 'Trial' through the 'race_df' dataframe to make searching for a specific record and trial's longest distance easier.
Any help on this would be much appreciated.
The data (in case anyone else wants to offer their solution):
df <- data.frame( Record = c(1,1,1,2),
Trial = c(2,2,3,1),
Start = c(1,4,1,1),
End = c(4,6,3,5),
Speed = c(12,11,10,14),
Number = c(9,10,17,5))
Here's a tidyverse solution:
library(tidyverse)
df %>%
mutate( Max.Distance = End - Start) %>%
select(-Start,-End,-Speed) %>%
group_by(Record) %>%
nest() %>%
mutate( data = map( data, ~ filter(.x, Max.Distance == max(Max.Distance)) )) %>%
unnest(cols = data)
The output:
Record Trial Number Max.Distance
<dbl> <dbl> <dbl> <dbl>
1 1 2 9 3
2 2 1 5 4
Note if you want to keep all of your columns in the final data frame, just remove select....
I hope I get right what your function is supposed to do. In the end it should take a record and a trial and put out the row(s) where we have the maximum distance, right?
So, it boils down to two filters:
filter rows for the record and trial.
filter the row inside that subset that has the maximum distance
Between those two filters, we have to calculate the distance although I suggest you move that outside the function because it is basically a one time operation.
library(dplyr)
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5), Speed = c(12, 11, 10, 14),
Number = c(9, 10, 17, 5))
get_longest <- function(df, record, trial){
df %>%
filter(Record == record & Trial == trial) %>%
mutate(Distance = End - Start) %>%
filter(Distance == max(Distance)) %>%
select(Number, Distance)
}
get_longest(race_df, 1, 2)
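If the longest distance and its corresponding Number are wanted for every Record and Trial at once, rather than one lookup at a time, a base R sketch along these lines may help. The names is_longest and longest are assumptions introduced for illustration.

```r
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
                      Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5),
                      Speed = c(12, 11, 10, 14), Number = c(9, 10, 17, 5))

# Distance covered in each row
race_df$Distance <- race_df$End - race_df$Start

# TRUE where a row's distance equals the maximum of its Record/Trial group
is_longest <- race_df$Distance ==
  ave(race_df$Distance, race_df$Record, race_df$Trial, FUN = max)

# Keep the longest row(s) per group together with the Number on that row
longest <- race_df[is_longest, c("Record", "Trial", "Distance", "Number")]
longest
```

For Record 1, Trial 2 this keeps the row with Distance 3 and Number 9, not the group's maximum Number of 10. Groups with tied distances would return more than one row.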
