R: Encoding categorical data using across()


I have a dataset with features of type character (not all are binary and one of them represents a region).
In order to avoid having to use the function several times, I was trying to use a pipeline and across() to identify all of the columns of character type and encode them with the function created.
encode_ordinal <- function(x, order = unique(x)) {
x <- as.numeric(factor(x, levels = order, exclude = NULL))
x
}
dataset <- dataset %>%
encode_ordinal(across(where(is.character)))
However, it seems that I am not using across() correctly as I get the error:
Error: across() must only be used inside dplyr verbs.
I wonder whether I am overcomplicating things and there is an easier way to achieve this, i.e., to identify all of the features of character type and encode them.

You should call across and encode_ordinal inside mutate, as illustrated in the following example:
dataset <- tibble(x = 1:3, y = c('a', 'b', 'b'), z = c('A', 'A', 'B'))
# # A tibble: 3 x 3
# x y z
# <int> <chr> <chr>
# 1 1 a A
# 2 2 b A
# 3 3 b B
dataset %>%
mutate(across(where(is.character), encode_ordinal))
# # A tibble: 3 x 3
# x y z
# <int> <dbl> <dbl>
# 1 1 1 1
# 2 2 2 1
# 3 3 2 2
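The default `order = unique(x)` numbers categories by first appearance, which may not match a meaningful ordering. A minimal sketch of passing an explicit order through a lambda inside across(); the `size` column and the `size_order` vector are hypothetical and not part of the original question:

```r
library(dplyr)

# encode_ordinal() as defined in the question
encode_ordinal <- function(x, order = unique(x)) {
  as.numeric(factor(x, levels = order, exclude = NULL))
}

# Hypothetical data with an ordered category
dataset <- tibble(id = 1:3, size = c("medium", "small", "large"))

# Hypothetical level order, lowest to highest
size_order <- c("small", "medium", "large")

# The lambda (~ ...) forwards the extra argument to encode_ordinal()
encoded <- dataset %>%
  mutate(across(size, ~ encode_ordinal(.x, order = size_order)))

encoded$size
# medium -> 2, small -> 1, large -> 3
```

Columns that share no common ordering would each need their own across() call (or their own order vector), so where(is.character) with the default order is only appropriate when first-appearance numbering is acceptable.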

Related

R dplyr find all mutated rows

I would like to identify all rows of a tibble that have been altered after mutate .
My real data has multiple columns and the mutate function changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name=c("A","B","C","D"),value=c(1,2,3,4))
# mutate df
dfnew <- df %>%
mutate(value=case_when(name=="A" ~ value+1, TRUE ~value)) %>%
mutate(name=case_when(name=="B" ~ "K", TRUE ~name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I look for a way how to compare all rows of df with dfnew and identify all rows with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 2 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2

You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
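Note that anti_join() compares rows by value across all shared columns, not by position, so a row that is changed into a copy of some other row already present in df would not be reported. A minimal sketch that matches rows by position instead, assuming both tibbles keep the same row order; the `row_id` column is a hypothetical helper, not part of the original data:

```r
library(dplyr)

df <- tibble(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))
dfnew <- df %>%
  mutate(value = if_else(name == "A", value + 1, value),
         name  = if_else(name == "B", "K", name))

# Add a positional key to both tibbles, then keep rows of dfnew
# that have no identical counterpart at the same position in df
changed <- anti_join(
  mutate(dfnew, row_id = row_number()),
  mutate(df, row_id = row_number()),
  by = c("row_id", "name", "value")
)
changed
```

Spelling out `by` also silences the message that anti_join() prints when it has to guess the join columns.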

@tmfmnk's response does the trick, but if you'd like to use a loop (e.g. for the flexibility of emitting different messages or warnings depending on what you're checking), you could do:
output <- list()
for (i in seq_len(nrow(dfnew))) {
  # skip rows that are identical in both tibbles
  if (all(df[i, ] == dfnew[i, ])) {
    next
  }
  output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2

We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))

How do you read or assign a value to a single cell in a tibble, using the name of the column?

I'm learning the tidyverse and ran into a problem with the simplest of operations: reading and assigning a value to a single cell. I need to do this by matching a specific value in another column and referring to the column whose value I'd like to change by name (so I can't use numeric row and column indices).
I've searched online and on SO and read the tibble documentation (this seems the most applicable https://tibble.tidyverse.org/reference/subsetting.html?q=cell) and haven't found the answer. (I'm probably missing something - apologies for the simplicity of this question and if it's been answered elsewhere)
test<-tibble(x = 1:5, y = 1, z = x ^ 2 + y)
Yields:
A tibble: 5 x 3
x y z
<int> <dbl> <dbl>
1 1 1 2
2 2 1 5
3 3 1 10
4 4 1 17
5 5 1 26
test["x"==3,"z"]
Yields:
A tibble: 0 x 1
… with 1 variable: z <dbl>
But doesn't tell me the value of that cell.
And when I try to assign a value...
test["x"==3,"z"]<-20
...it does not work.
test[3,3] This works, but as stated above I need to call the cell by names not numbers.
What is the right way to do this?

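A tibble is not a data.table, so "x" == 3 inside [ ] just compares the string "x" with 3 (always FALSE) instead of looking at the column. With base R methods, the column 'x' is extracted with test$x or test[["x"]]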
test[test$x == 3, "z"]
# A tibble: 1 x 1
# z
# <dbl>
#1 10
Or use subset
subset(test, x == 3, select = 'z')
Or with dplyr
library(dplyr)
test %>%
filter(x == 3) %>%
select(z)
Or if we want to pass a string as column name, convert to symbol and evaluate
test %>%
filter(!! rlang::sym("x") == 3) %>%
select(z)
Or with data.table
library(data.table)
as.data.table(test)[x == 3, .(z)]
# z
#1: 10
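The options above read the cell; the question also asks about assignment. A minimal base R sketch, using the `test` tibble from the question (the `col` variable is a hypothetical stand-in for a column name held in a string):

```r
library(tibble)

test <- tibble(x = 1:5, y = 1, z = x^2 + y)

# Read the cell as a plain value rather than a 1 x 1 tibble
test$z[test$x == 3]

# Assign by row condition and column name
test[test$x == 3, "z"] <- 20

# Same idea when the column name is stored in a string
col <- "z"
test[[col]][test$x == 4] <- 0

test
```

If more than one row satisfies the condition, every matching row is updated, so the condition should identify a single row when a single cell is intended.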

How to split dataframes into different dataframes based on one column name values that starts with some prefix?

How can I split a dataframe in R into separate dataframes based on the values of one column (say sensor_name) that start with a prefix such as "RI_" or "AI_", so that I end up with two dataframes, one for RI and another for AI?
I have tried the following code, but it only works well when I pivot my dataframe.
map(set_names(c("RI", "AI","FI")),~select(temp_df,starts_with(.x),starts_with("time_stamp")))
I expect the output to have two different dataframes,
RI_df:
AI_df:
It would be great if anyone could help me with this, since I have just started working with the R programming language.

An option is split from base R
lst1 <- split(df1, substr(df1$sensor_name, 1,2))
names(lst1) <- paste0(names(lst1), "_df")
If the prefix length is variable
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
Or using tidyverse
library(dplyr)
library(stringr) # provides str_remove()
df1 %>%
  group_split(grp = str_remove(sensor_name, "_.*"), .keep = FALSE)
NOTE: It is not recommended to create multiple objects in the global environment. Instead, keep the data frames in the list and do all the analysis on that list.
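A minimal sketch of working with the list returned by split(), assuming a small made-up `df1` (the `reading` column and its values are hypothetical):

```r
# Hypothetical sample data
df1 <- data.frame(
  sensor_name = c("RI_1", "RI_2", "AI_1", "AI_22"),
  reading = c(1.5, 2.0, 3.2, 4.1)
)

# Split on everything before the first underscore
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
names(lst1) <- paste0(names(lst1), "_df")

# Access one group by name
lst1$RI_df

# Apply the same analysis to every group without leaving the list
row_counts <- sapply(lst1, nrow)
row_counts
```

The list elements are ordered alphabetically by prefix, so `AI_df` comes before `RI_df`.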

Another approach from base R
df <- data.frame(sensor_name=c("R1_111","R1_113","A1_124","A1_2444"),
A=c(1,2,24,4),B=c(2,2,1,2),C=c(3,4,4,2))
df[grepl("R1",df$sensor_name),]
sensor_name A B C
1 R1_111 1 2 3
2 R1_113 2 2 4
df[grepl("A1",df$sensor_name),]
sensor_name A B C
3 A1_124 24 1 4
4 A1_2444 4 2 2

Create a variable to identify each group. After that you can subset the data to separate the groups. Functions from the stringr package can extract the relevant text from the longer sensor name.
library(stringr)
library(dplyr)
# Sample data
X <- tibble(
sensor = c("RI_1", "RI_2", "AI_1", "AI_2"),
A = c(1, 2, 3, 4),
B = c(5, 6, 7, 8),
C = c(9, 10, 11, 12)
)
# Extract text to identify groups
X <- X %>%
mutate(prefix = str_replace(sensor, "_.*", ""))
# Subset for desired group
X %>% filter(prefix == "AI")
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 AI_1 3 7 11 AI
2 AI_2 4 8 12 AI
# Or, split all the groups
lapply(unique(X$prefix), function(x) {
X %>% filter(prefix == x)
})
[[1]]
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 RI_1 1 5 9 RI
2 RI_2 2 6 10 RI
[[2]]
# A tibble: 2 x 5
sensor A B C prefix
<chr> <dbl> <dbl> <dbl> <chr>
1 AI_1 3 7 11 AI
2 AI_2 4 8 12 AI
Depending on what you are doing with these groups, you may do better to use group_by() from the dplyr package.

How to apply functions sequentially with purrr and pipes

I am struggling with the purrr package.
I am trying to apply the function is.factor to a data frame, and then apply fct_count to those columns that are factors.
I have tried some variations of modify_if and summarise_if. I guess I am using the dots (.) incorrectly when referring to the previous object.
(A link to a guide about purrr and the dots would be really helpful.)
For example,
df <- data.frame(f1 = c("men", "woman", "men", "men"),
                 f2 = c("high", "low", "low", "low"),
                 n1 = c(1, 3, 3, 6),
                 stringsAsFactors = TRUE) # needed in R >= 4.0.0 for f1 and f2 to be factors
Then
map(df, is.factor)
If I use
map_if(df, is.factor, forcats::fct_count)
I got results for every variable, instead of only for the factors.
I think it is a pretty simple problem, and with a bit of understanding of the dots (.) can be solved.
Thanks in advance
:)

The issue is that map_if also returns the columns that do not satisfy the predicate, unmodified. Hence, when the OP runs the code (repeated here from the question just to show the output)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6 ### it is the same column value unchanged
Here, we can use the .else argument to return NULL for the other columns and then discard the NULL elements, which leaves a list of factor counts.
library(tidyverse)
map_if(df, is.factor, forcats::fct_count, .else = ~ NULL) %>%
discard(is.null)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
Or another option is summarise_if and place the output in a list
df %>%
summarise_if(is.factor, list(~ list(fct_count(.)))) %>%
unclass
Or another option would be to gather into 'long' format and then count once
gather(df, key, val, f1:f2) %>%
dplyr::count(key, val)
Or this can be done with lapply from base R
lapply(df[sapply(df, is.factor)], fct_count)
Or using only base R
lapply(df[sapply(df, is.factor)], table)
Or the results can be represented in a different way
table(names(df)[1:2][col(df[1:2])], unlist(df[1:2]))

The issue with map_if/modify_if is that it applies the function only to the columns that satisfy the predicate function; the rest are returned as is.
Hence, when you try
library(tidyverse)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6
fct_count is applied to columns f1 and f2, which are factors, and column n1 is returned as is. If you want only the factor columns in the output, one way is to select them first and then apply the function
df %>%
select_if(is.factor) %>%
map(forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
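In dplyr 1.0.0 and later, select_if() is superseded by select(where(...)). A minimal sketch of the same select-then-map idea in that style, assuming the `df` from the question; stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later no longer convert strings to factors by default:

```r
library(dplyr)
library(purrr)
library(forcats)

df <- data.frame(f1 = c("men", "woman", "men", "men"),
                 f2 = c("high", "low", "low", "low"),
                 n1 = c(1, 3, 3, 6),
                 stringsAsFactors = TRUE)

# Keep only the factor columns, then count the levels of each one
factor_counts <- df %>%
  select(where(is.factor)) %>%
  map(fct_count)

names(factor_counts)
```

Each element of `factor_counts` is a tibble with columns `f` (the level) and `n` (its count), as in the outputs shown above.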

How to get a frequency table of all columns of complete data frame in R?

I want to create a frequency table from a data frame and save it in Excel. Using the table() function I can only get the frequencies of a particular column, but I want a frequency table for all the columns together, where the levels or variable types may differ from column to column. It would be a kind of summary of the data frame, but without the mean or other measures, only frequencies.
I was trying something like this
for(i in 1:230){
rm(tb)
tb<-data.frame(table(mydata[i]))
tb2<-cbind(tb2,tb)
}
But it's showing the following Error
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 15, 12
In place of cbind() I also tried data.frame(), but the error did not change.

You are getting an error because you are trying to column-bind data frames that have different numbers of rows. From what I understand, your problem is two-fold: (1) you want the frequency distribution of each column regardless of type; and (2) you want to save all of the results in a single Excel sheet.
For the first problem, you can use the mapply() function.
set.seed(1)
dat <- data.frame(
x = sample(LETTERS[1:5], 15, replace = TRUE),
y = rbinom(5, 15, prob = 0.4)
)
mylist <- mapply(table, dat); mylist
# $x
#
# A B C D E
# 2 5 1 4 3
#
# $y
#
# 5 6 7 11
# 3 3 6 3
You can also use purrr::map().
library(purrr)
dat %>% map(table)
The second problem has several solutions in this question: Export a list into a CSV or TXT file in R. In particular, LyzandeR's answer will enable you to do just what you intended. If you prefer to save the outputs in separate files, you can do:
mapply(write.csv, mylist, file=paste0(names(mylist), '.csv'))

Maybe an rbind solution is better as it allows you to handle variables with different levels:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
# x y
# 1 A 1
# 2 A 1
# 3 B 2
# 4 C 1
dt_res = data.frame()
for (i in 1:ncol(dt)){
dt_temp = data.frame(t(table(dt[,i])))
dt_temp$Var1 = names(dt)[i]
dt_res = rbind(dt_res, dt_temp)
}
names(dt_res) = c("Variable","Levels","Freq")
dt_res
# Variable Levels Freq
# 1 x A 2
# 2 x B 1
# 3 x C 1
# 4 y 1 3
# 5 y 2 1
And an alternative (probably faster) process using apply:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
ff = function(x){
y = data.frame(t(table(x)))
y$Var1 = NULL
names(y) = c("Levels","Freq")
return(y)
}
dd = do.call(rbind, apply(dt, 2, ff))
dd
# Levels Freq
# x.1 A 2
# x.2 B 1
# x.3 C 1
# y.1 1 3
# y.2 2 1
# extract variable names from row names
dd$Variable = sapply(row.names(dd), function(x) unlist(strsplit(x,"[.]"))[1])
dd
# Levels Freq Variable
# x.1 A 2 x
# x.2 B 1 x
# x.3 C 1 x
# y.1 1 3 y
# y.2 2 1 y

Edit (2021-03-29): tidyverse Principles
Here is some updated code that utilizes tidyverse, specifically functions from dplyr, tibble, and purrr. The code is a bit more readable and easier to carry out as well. Example data set is provided.
tibble(
a = rep(c(1:3), 2),
b = factor(rep(c("Jan", "Feb", "Mar"), 2)),
c = factor(rep(LETTERS[1:3], 2))
) ->
dat
dat #print df
# A tibble: 6 x 3
a b c
<int> <fct> <fct>
1 1 Jan A
2 2 Feb B
3 3 Mar C
4 1 Jan A
5 2 Feb B
6 3 Mar C
Get counts and proportions across columns.
library(purrr)
library(dplyr)
library(tibble)
#library(tidyverse) #to load assortment of pkgs
#output tables - I like to use parentheses & specifying my funs
purrr::map(
dat, function(.x) {
count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100))
})
#here is the same code but more concise (purrr lambda shorthand with ~ and .x)
purrr::map(dat, ~ count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100)))
$a
# A tibble: 3 x 3
x n pct
<int> <int> <dbl>
1 1 2 33.3
2 2 2 33.3
3 3 2 33.3
$b
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 Feb 2 33.3
2 Jan 2 33.3
3 Mar 2 33.3
$c
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 A 2 33.3
2 B 2 33.3
3 C 2 33.3
Old code...
The table() function returns a "table" object, which in my experience is awkward to manipulate directly. I tend to write my own function to work around this, applied here to the example data frame dat defined above (wide-format data with some categorical variables/features).
We can use lapply() in conjunction with the table() function found in base R to create a list of frequency counts for each feature.
freqList = lapply(select_if(dat, is.factor),
function(x) {
df = data.frame(table(x))
names(df) = c("x", "y")
return(df)
}
)
This approach allows each list object to be easily indexed and further manipulated if necessary, which can be really handy with data frames containing a lot of features. Use print(freqList) to view all of the frequency tables.
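Since the question asks for one file covering every column, the per-column counts can also be stacked into a single long table before export. A minimal sketch under that assumption; the column names `variable`, `level`, and `n`, and the temporary output path, are illustrative choices rather than part of the original answer:

```r
library(dplyr)
library(purrr)
library(tibble)

dat <- tibble(
  a = rep(1:3, 2),
  b = factor(rep(c("Jan", "Feb", "Mar"), 2))
)

# Count each column separately; values are converted to character so
# that columns of different types can share one "level" column
freq_long <- imap(dat, function(col, nm) {
  count(tibble(variable = nm, level = as.character(col)), variable, level)
}) %>%
  bind_rows()

freq_long

# Write everything to one CSV file, which Excel can open
out_file <- tempfile(fileext = ".csv")
write.csv(freq_long, out_file, row.names = FALSE)
```

Because every column contributes rows rather than columns, differing numbers of levels no longer cause the "differing number of rows" error from the question.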
