Setting missing values using labelled package across multiple columns? - r

I am using the labelled package and trying to set user-defined missing values. I have a dataframe where I want to set missing values for a list of specific columns rather than the entire dataset.
Currently I have to type out each column (s2 and s3). Is there a more efficient way? My full dataset has dozens of columns.
library(labelled)
library(dplyr)
df <- tibble(s1 = c(1, 2, 3, 9), s2 = c(1, 1, 2, 9), s3 = c(1, 1, 2, 9))
df <- df %>%
  set_na_values(., s2 = 9) %>%
  set_na_values(., s3 = 9)
na_values(df$s1)
na_values(df$s2)
na_values(df$s3)

The set_na_values() function takes multiple pairs so you don't need to call it more than once:
library(labelled)
library(dplyr)
df %>%
  set_na_values(s2 = 9, s3 = 9)
If you were dealing with a lot of variables, you could programmatically build a named vector (or a list, if there are multiple missing values per variable) and splice it into the function call. If, as your comment suggests, you want to apply it to everything except the s1 variable, you can do:
nm <- setdiff(names(df), "s1")
df %>%
  set_na_values(!!!setNames(rep(9, length(nm)), nm))
# A tibble: 4 x 3
s1 s2 s3
<dbl> <dbl+lbl> <dbl+lbl>
1 1 1 1
2 2 1 1
3 3 2 2
4 9 9 (NA) 9 (NA)
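If some variables need more than one user-defined missing value, the same splicing idea extends to a named list. The sketch below is an illustration rather than part of the original answer: the extra code 8 is a made-up second missing value, and it assumes set_na_values() accepts spliced list elements the same way it accepts the spliced named vector above.

```r
library(labelled)
library(dplyr)

# Hypothetical data: 8 and 9 both stand for "missing" in s2 and s3
df <- tibble(s1 = c(1, 2, 8, 9), s2 = c(1, 1, 8, 9), s3 = c(1, 2, 8, 9))

nm <- setdiff(names(df), "s1")

# One list element per column, each holding all missing codes for that column
na_spec <- setNames(rep(list(c(8, 9)), length(nm)), nm)

out <- df %>%
  set_na_values(!!!na_spec)

na_values(out$s1)  # s1 was not listed, so it has no user-defined missing values
na_values(out$s2)  # the codes registered for s2
```

Building na_spec separately makes it easy to inspect which codes go to which column before applying them.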
Alternatively, you can use labelled_spss() and take advantage of across(), which allows tidyselect semantics (but this will overwrite any existing value labels):
df %>%
  mutate(across(-s1, ~ labelled_spss(.x, na_values = 9)))
# A tibble: 4 x 3
s1 s2 s3
<dbl> <dbl+lbl> <dbl+lbl>
1 1 1 1
2 2 1 1
3 3 2 2
4 9 9 (NA) 9 (NA)
To keep any existing value labels, pass them through explicitly:
df %>%
  mutate(across(-s1, ~ labelled_spss(.x, labels = val_labels(.x), na_values = 9)))

Related

Conditional cumulative sum from two columns

I can't get my head around the following problem.
Assuming the following data:
library(tidyverse)
df <- tibble(source = c("A", "A", "B", "B", "B", "C"),
value = c(5, 10, NA, NA, NA, 20),
add = c(1, 1, 1, 2, 3, 4))
What I want to do is: for all cases where source == "B", I want to calculate the cumulative sum of the previous row's value and the current row's add. Of course, for the first "B" row, I need to provide a starting value for value. Note: in this case, it would be fine if we just take the value from the last "A" row.
So for row 3, the result would be 10 + 1 = 11.
For row 4, the result would be 11 + 2 = 13.
For row 5, the result would be 13 + 3 = 16.
I tried to use purrr::accumulate, but I failed in many different ways, e.g. I thought I can do:
df %>%
mutate(test = accumulate(add, .init = 10, ~.x + .y))
But this leads to error:
Error: Problem with `mutate()` column `test`.
i `test = accumulate(add, .init = 10, ~.x + .y)`.
i `test` must be size 6 or 1, not 7.
Same if I use .init = value
I also didn't manage to do the job on group B only (although this is probably not an issue: I can probably perform it on the full data frame and then just replace the values for all non-B rows).
Expected output:
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA
You were essentially on the right track. Because you pass an .init value to accumulate(), the result has length n + 1, with .init as its first element. Drop that first element to get a vector that matches the column length.
Then, if you want NA in the remaining rows, here's a way to do it. Because accumulate() runs over the whole add column, the two "A" rows each contribute 1 before the first "B" row is reached, so .init has to be 8 (not 10) for row 3 to come out as 11.
df %>%
  mutate(test =
           ifelse(source == "B", accumulate(add, .init = 8, ~.x + .y)[-1], NA))
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA
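To see where the "size 7" in the error message comes from, it can help to run accumulate() on the add column by itself. This is a small sketch using the same numbers as the question; the variable names with_init and trimmed are only illustrative.

```r
library(purrr)

add <- c(1, 1, 1, 2, 3, 4)

# With .init, the starting value is kept as the first element of the result
with_init <- accumulate(add, `+`, .init = 8)
length(with_init)  # one longer than add

# Dropping the first element restores the original length
trimmed <- with_init[-1]
length(trimmed)
trimmed
```

The third to fifth elements of trimmed are the values that end up in the "B" rows.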
@tmfmnk provided an awesome answer and they deserve full credit (NOT ME)
Below is code adapted from their comment (for more visibility, while also setting an initial value). Non-B rows are set to NA to match the expected output:
init_value <- 10
df <- df %>%
  group_by(source) %>%
  mutate(test = if_else(source == "B", init_value + cumsum(add), NA_real_)) %>%
  ungroup()
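The question mentions that the starting value could simply be taken from the last "A" row. A base R sketch of that idea follows; it assumes the "B" rows are contiguous and that a row exists directly before the first "B" row. The names is_b and start are only illustrative.

```r
df <- data.frame(
  source = c("A", "A", "B", "B", "B", "C"),
  value  = c(5, 10, NA, NA, NA, 20),
  add    = c(1, 1, 1, 2, 3, 4)
)

is_b <- df$source == "B"

# Starting value: the value in the row just before the first "B" row
start <- df$value[which(is_b)[1] - 1]

# Fill only the "B" rows; every other row stays NA
df$test <- NA_real_
df$test[is_b] <- start + cumsum(df$add[is_b])

df
```

This avoids hard-coding either 8 or 10, at the cost of relying on the row order.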

How do you read or assign a value to a single cell in a tibble, using the name of the column?

I'm learning the tidyverse and ran into a problem with the simplest of operations: reading and assigning a value to a single cell. I need to do this by matching a specific value in another column and calling the name of the column whose value I'd like to change (so I can't use numeric row and column numbers).
I've searched online and on SO and read the tibble documentation (this seems the most applicable https://tibble.tidyverse.org/reference/subsetting.html?q=cell) and haven't found the answer. (I'm probably missing something - apologies for the simplicity of this question and if it's been answered elsewhere)
test<-tibble(x = 1:5, y = 1, z = x ^ 2 + y)
Yields:
A tibble: 5 x 3
x y z
<int> <dbl> <dbl>
1 1 1 2
2 2 1 5
3 3 1 10
4 4 1 17
5 5 1 26
test["x"==3,"z"]
Yields:
A tibble: 0 x 1
… with 1 variable: z <dbl>
But doesn't tell me the value of that cell.
And when I try to assign a value...
test["x"==3,"z"]<-20
...it does not work.
test[3,3] This works, but as stated above I need to call the cell by names not numbers.
What is the right way to do this?
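test is a tibble, not a data.table, so a bare column name does not work inside [ ]. In test["x" == 3, "z"] the string "x" is compared with 3, which is FALSE, so no rows are selected. With base R methods, extract the column 'x' with test$x or test[["x"]]: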
test[test$x == 3, "z"]
# A tibble: 1 x 1
# z
# <dbl>
#1 10
Or use subset
subset(test, x == 3, select = 'z')
Or with dplyr
library(dplyr)
test %>%
filter(x == 3) %>%
select(z)
Or if we want to pass a string as column name, convert to symbol and evaluate
test %>%
filter(!! rlang::sym("x") == 3) %>%
select(z)
Or with data.table
library(data.table)
as.data.table(test)[x == 3, .(z)]
# z
#1: 10
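The question also asks about assigning a value, which the snippets above do not show. The following sketch covers reading a cell as a plain value and three ways to write to one; target_col and the replacement values 20, 99 and 0 are made up for illustration.

```r
library(dplyr)

test <- tibble(x = 1:5, y = 1, z = x^2 + y)

# Read a single cell as a plain value rather than a 1 x 1 tibble
test$z[test$x == 3]
test %>% filter(x == 3) %>% pull(z)

# Assign with base R indexing: logical row condition plus column name
test[test$x == 3, "z"] <- 20

# Same idea when the column name is held in a string
target_col <- "y"
test[[target_col]][test$x == 4] <- 99

# dplyr equivalent: rewrite only the matching row, keep the rest
test <- test %>%
  mutate(z = if_else(x == 5, 0, z))

test
```

Note that if_else() requires both branches to have the same type, which is why the replacement is a double like the z column.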

function will not work with dplyr's select wrappers (contains, ends_with) [duplicate]

I'm trying to calculate row means on a dataset. I found a helpful function someone made here (dplyr - using mutate() like rowmeans()), and it works when I try out every column but not when I try to use a dplyr helper function.
Why does this work:
#The rowmeans function that works
my_rowmeans = function(..., na.rm=TRUE){
  x =
    if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
    else list(...)
  d = Reduce(function(x,y) x+!is.na(y), list(...), init=0)
  Reduce(`+`, x)/d
}
#The data
library(tidyverse)
data <- tibble(id = c(1:4),
turn_intent_1 = c(5, 1, 1, 4),
turn_intent_2 = c(5, 1, 1, 3),
turn_intent_3R = c(5, 5, 1, 3))
#The code that is cumbersome but works
data %>%
mutate(turn_intent_agg = my_rowmeans(turn_intent_1, turn_intent_2, turn_intent_3R))
#The output
# A tibble: 4 x 5
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
But this does not work:
#The code
data %>%
mutate(turn_intent_agg = select(., contains("turn")) %>%
my_rowmeans())
#The output
Error in class1Def@contains[[class2]] : no such index at level 1
Of course, I can type each column, but this dataset has many columns. It'd be much easier to use these wrappers.
I need the output to look like the correct one shown that contains all columns (such as id).
Thank you!
I think that you can simplify it to:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn"))))
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
And you can indeed add also the na.rm = TRUE parameter:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn")), na.rm = TRUE))
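As for why the original attempt fails: select() returns a whole tibble, so my_rowmeans() receives one argument (the tibble) instead of one argument per column, and the as(0, class(x)) call inside it appears to choke on the tibble's multi-element class vector. If you want to keep the custom function, one possible workaround is to spread the selected columns into separate arguments with do.call(). This is a sketch, not part of the original answer; turn_cols is an illustrative name.

```r
library(dplyr)

# Same helper as in the question
my_rowmeans <- function(..., na.rm = TRUE) {
  x <-
    if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
    else list(...)
  d <- Reduce(function(x, y) x + !is.na(y), list(...), init = 0)
  Reduce(`+`, x) / d
}

data <- tibble(id = 1:4,
               turn_intent_1 = c(5, 1, 1, 4),
               turn_intent_2 = c(5, 1, 1, 3),
               turn_intent_3R = c(5, 5, 1, 3))

turn_cols <- select(data, contains("turn"))

# A tibble carries several class names, unlike a plain numeric column
class(turn_cols)

# do.call() passes each column as its own argument to my_rowmeans()
out <- data %>%
  mutate(turn_intent_agg = do.call(my_rowmeans, unname(as.list(turn_cols))))

out
```

The unname() call keeps the column names from being matched against named parameters of the helper, such as na.rm.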

How to split dataframes into different dataframes based on one column name values that starts with some prefix?

How can I split a dataframe into separate dataframes based on the values of one column, say sensor_name, that start with a prefix such as "RI_" or "AI_" in R, so that I end up with two dataframes, one for RI and another for AI?
I have tried the following code, but it only works well when I pivot my dataframe.
map(set_names(c("RI", "AI","FI")),~select(temp_df,starts_with(.x),starts_with("time_stamp")))
I expect the output to have two different dataframes,
RI_df:
AI_df:
It would be great if anyone help me with this since I just started to work on R programming language.
An option is split from base R
lst1 <- split(df1, substr(df1$sensor_name, 1,2))
names(lst1) <- paste0(names(lst1), "_df")
If the prefix length is variable
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
Or using tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_split(grp = str_remove(sensor_name, "_.*"), .keep = FALSE)
NOTE: It is not recommended to create many separate objects in the global environment. Keep the data frames in the list and do all the analysis on that list.
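To make the "keep it in the list" advice concrete, here is a small base R sketch. The df1 below is made-up sample data with a hypothetical reading column, since the question does not show the real one.

```r
# Hypothetical sample data
df1 <- data.frame(
  sensor_name = c("RI_1", "RI_2", "AI_1", "AI_22"),
  reading     = c(1, 2, 3, 4)
)

# Split on everything before the first underscore
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
names(lst1) <- paste0(names(lst1), "_df")

# Access one piece by name
lst1$RI_df

# Apply the same analysis to every piece without creating separate objects
sapply(lst1, nrow)
sapply(lst1, function(d) mean(d$reading))
```

Each element is still an ordinary data frame, so anything you would do with RI_df as a standalone object also works on lst1$RI_df.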
Another approach from base R
df <- data.frame(sensor_name=c("R1_111","R1_113","A1_124","A1_2444"),
A=c(1,2,24,4),B=c(2,2,1,2),C=c(3,4,4,2))
df[grepl("R1",df$sensor_name),]
sensor_name A B C
1 R1_111 1 2 3
2 R1_113 2 2 4
df[grepl("A1",df$sensor_name),]
sensor_name A B C
3 A1_124 24 1 4
4 A1_2444 4 2 2
Create a variable to identify each group. After that you can subset the data to separate the groups. Functions from the stringr package can extract the relevant text from the longer sensor name.
library(stringr)
library(dplyr)
# Sample data
X <- tibble(
sensor = c("RI_1", "RI_2", "AI_1", "AI_2"),
A = c(1, 2, 3, 4),
B = c(5, 6, 7, 8),
C = c(9, 10, 11, 12)
)
# Extract text to identify groups
X <- X %>%
mutate(prefix = str_replace(sensor, "_.*", ""))
# Subset for desired group
X %>% filter(prefix == "AI")
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 AI_1 3 7 11 AI
2 AI_2 4 8 12 AI
# Or, split all the groups
lapply(unique(X$prefix), function(x) {
X %>% filter(prefix == x)
})
[[1]]
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 RI_1 1 5 9 RI
2 RI_2 2 6 10 RI
[[2]]
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 AI_1 3 7 11 AI
2 AI_2 4 8 12 AI
Depending on what you are doing with these groups, you may do better to use group_by() from the dplyr package.
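As an illustration of the group_by() route, the following sketch computes per-prefix summaries without splitting anything. The summary columns mean_A and n are arbitrary examples, not something the question asked for.

```r
library(dplyr)
library(stringr)

X <- tibble(
  sensor = c("RI_1", "RI_2", "AI_1", "AI_2"),
  A = c(1, 2, 3, 4),
  B = c(5, 6, 7, 8)
)

# One row per prefix, computed in a single pass
by_prefix <- X %>%
  mutate(prefix = str_replace(sensor, "_.*", "")) %>%
  group_by(prefix) %>%
  summarise(mean_A = mean(A), n = n())

by_prefix
```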

Get most frequent value(s) from a list of lists

I am trying to convert an LDA prediction result, which is a list containing hundreds of integer vectors (the numeric topic assigned to each token in a document), such as the following example:
assignments <- list(
as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)
where each element of the list has a different length, corresponding to the length of each tokenized document.
What I want to do is 1) get the most frequent topic (1, 2, 3) from each element, and 2) convert the result into tbl or data.frame format like this:
document topic freq
1 1 6
2 2 5
3 3 6
so that I can use inner_join() to merge this "consensus" prediction with the topic assignment results generated by tm or topicmodels applications and compare their precision, etc. Since assignments is a list, I cannot apply the top_n() function to get the most frequent topic for each element. I tried using lapply(unlist(assignments), count), but it didn't give me what I want.
You can iterate over the list with sapply, get frequency with table and extract first value from sorted result:
result <- sapply(assignments, function(x) sort(table(x), decreasing = TRUE)[1])
data.frame(document = seq_along(assignments),
topic = as.integer(names(result)),
freq = result)
document topic freq
1 1 1 6
2 2 2 5
3 3 3 6
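Note that taking the first element of the sorted table returns a single topic even when two topics are tied for the highest count. If ties matter (the title does ask for "value(s)"), a variant along these lines keeps all of them. This is a sketch with made-up data containing a tie; most_frequent is a hypothetical helper name.

```r
# Made-up example: document 1 has a tie between topics 1 and 2
assignments <- list(
  c(1L, 1L, 2L, 2L, 3L),
  c(3L, 3L, 3L, 1L)
)

# Return every topic whose count equals the maximum count
most_frequent <- function(x, doc_id) {
  counts <- table(x)
  top <- counts[counts == max(counts)]
  data.frame(document = doc_id,
             topic = as.integer(names(top)),
             freq = as.integer(top))
}

result <- do.call(rbind, lapply(seq_along(assignments), function(i) {
  most_frequent(assignments[[i]], i)
}))

result
```

Tied documents then appear on more than one row, which is worth keeping in mind before an inner_join().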
We can loop through the list, get the frequency of elements with tabulate, find the index of maximum elements, extract those along with the frequency as a data.frame and rbind the list elements
do.call(rbind, lapply(seq_along(assignments), function(i) {
x <- assignments[[i]]
ux <- unique(x)
i1 <- tabulate(match(x, ux))
data.frame(document = i, topic = ux[which.max(i1)], freq = max(i1))})
)
# document topic freq
#1 1 1 6
#2 2 2 5
#3 3 3 6
Or another option is to convert it to a two column dataset and then do group by to find the index of max values
library(data.table)
setDT(stack(setNames(assignments, seq_along(assignments))))[,
.(freq = .N), .(document = ind, topic = values)][, .SD[freq == max(freq)], document]
# document topic freq
#1: 1 1 6
#2: 2 2 5
#3: 3 3 6
Or we can use tidyverse
library(tidyverse)
map(assignments, as_tibble) %>%
bind_rows(.id = 'document') %>%
count(document, value) %>%
group_by(document) %>%
filter(n == max(n)) %>%
ungroup %>%
rename_at(2:3, ~c('topic', 'freq'))
# A tibble: 3 x 3
# document topic freq
# <chr> <int> <int>
#1 1 1 6
#2 2 2 5
#3 3 3 6
Using purrr::imap_dfr:
library(tidyverse)
imap_dfr(assignments,~ tibble(
document = .y,
Topic = names(which.max(table(.x))),
freq = max(tabulate(.x))))
# # A tibble: 3 x 3
# document Topic freq
# <int> <chr> <int>
# 1 1 1 6
# 2 2 2 5
# 3 3 3 6
