Advice on writing generic function to encode variables in R

I have a data.frame:
mydata = data.frame(v1 = c("A", "A", "A", "B", "B", "C", "D"),
v2 = c("XY", "XY", "ZZ", "BB", "ZZ", NA, "ZZ"),
v3 = 5)
And I would like to encode each of the characters in the data frame to integers corresponding to each of the levels. I also want to "ignore" NA values. The expected output would be equal to:
output = data.frame(v1 = c(1, 1, 1, 2, 2, 3, 4),
v2 = c(1, 1, 2, 3, 2, NA, 2),
v3 = 5)
My hope is to write a function that accepts a data.frame object AND a list specifying the columns on which I want to perform the operation, something like:
my_function = function(df, vars){
...
}
EDIT: in the example above, "vars" would be = c("v1", "v2")
Any suggestions for how to approach this? I'm open to using packages such as dplyr to help.
Thanks,
D

We can convert to factor and then coerce to numeric
mydata[1:2] <- lapply(mydata[1:2], function(x)
as.numeric(factor(x, levels=unique(x))))
This can be converted to a function
myfunction <- function(df, vars) {
df[vars] <- lapply(df[vars], function(x)
as.numeric(factor(x, levels=unique(x))))
df
}
myfunction(mydata, c('v1', 'v2'))
# v1 v2 v3
#1 1 1 5
#2 1 1 5
#3 1 2 5
#4 2 3 5
#5 2 2 5
#6 3 NA 5
#7 4 2 5
To generalize further, check each column's class: leave numeric columns unchanged, and convert every other column to a factor (with levels in order of first appearance) before coercing it to numeric.
mydata[] <- lapply(mydata, function(x)
if(!is.numeric(x)) as.numeric(factor(x, levels=unique(x)))
else x)
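The same idea can be sketched without going through factor at all. This is a minimal sketch, not the answer's method: the function name encode_columns is hypothetical, and it swaps in match() against the unique non-NA values, which assigns integer codes in order of first appearance and leaves NA as NA. It also checks that the requested columns exist.

```r
# Hypothetical helper (name is an assumption, not from the answer above).
# Encodes the chosen non-numeric columns as integer codes in order of
# first appearance; NA values stay NA.
encode_columns <- function(df, vars = names(df)) {
  missing_vars <- setdiff(vars, names(df))
  if (length(missing_vars) > 0) {
    stop("Columns not found: ", paste(missing_vars, collapse = ", "))
  }
  for (v in vars) {
    x <- df[[v]]
    if (is.numeric(x)) next          # numeric columns are left unchanged
    x <- as.character(x)             # handles both character and factor input
    df[[v]] <- match(x, unique(x[!is.na(x)]))
  }
  df
}

mydata <- data.frame(v1 = c("A", "A", "A", "B", "B", "C", "D"),
                     v2 = c("XY", "XY", "ZZ", "BB", "ZZ", NA, "ZZ"),
                     v3 = 5)
encoded <- encode_columns(mydata, c("v1", "v2"))
```

Note that match() returns integers, whereas the as.numeric(factor(...)) version returns doubles; wrap the result in as.numeric() if that distinction matters.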

Related

How to replace factor NA's with the level of the cell above

I'm trying to replace NA values in a factor column with the value of the cell above. A tidyverse approach would be great, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na although they have basically amounted to trying the same thing which is:
mutate(site = as.character(site),
site = ifelse(is.na(site), "zero", site),
site = ifelse(site == "zero", lag(site), site),
site = as.factor(site))
Thanks!
Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
An option with na.locf0 from zoo:
library(zoo)
# overwrite the site column with its filled-down version
data$site <- na.locf0(data$site)
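If neither tidyr nor zoo is available, the "last observation carried forward" idea can also be sketched in base R. This is a minimal sketch under assumptions: the helper name fill_down is hypothetical, and it works by computing, for each position, the index of the most recent non-NA element. Leading NA values (with nothing above them) stay NA, and factors keep their class.

```r
# Hypothetical base R helper: carry the last non-NA value forward.
fill_down <- function(x) {
  # index of each non-NA element, 0 where the element is NA
  idx <- ifelse(is.na(x), 0L, seq_along(x))
  # running maximum = position of the most recent non-NA element
  idx <- cummax(idx)
  # nothing to carry forward yet: keep NA
  idx[idx == 0L] <- NA
  x[idx]
}

site <- factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C"))
filled <- fill_down(site)
```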

Updating old dataframe with new dataframe in R

I am working to update an old dataframe with data from a new dataframe.
I found this option, it works for some of the fields, but not all. Not sure how to alter that as it is beyond my skill set. I tried removing the is.na(x) portion of the ifelse code and that did not work.
df_old <- data.frame(
bb = as.character(c("A", "A", "A", "B", "B", "B")),
y = as.character(c("i", "ii", "ii", "i", "iii", "i")),
z = 1:6,
aa = c(NA, NA, 123, NA, NA, 12))
df_new <- data.frame(
bb = as.character(c("A", "A", "A", "B", "A", "A")),
z = 1:6,
aa = c(NA, NA, 123, 1234, NA, 12))
cols <- names(df_new)[names(df_new) != "z"]
df_old[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[df_new$z == df_old$z], x), df_old[,cols], df_new[,cols])
The code also changes my bb variable from a character vector to a numeric. Do I need another call to mapply focusing on specific variable bb?
To update the aa and bb columns you can approach this using a join via merge(). This assumes column z is the index for these data frames.
# join on `z` column
df_final<- merge(df_old, df_new, by = c("z"))
# replace NAs with new values for column `aa` from `df_new`
df_final$aa <- ifelse(is.na(df_final$aa.x), df_final$aa.y, df_final$aa.x)
# choose new values for column `bb` from `df_new`
df_final$bb <- df_final$bb.y
df_final<- df_final[,c("bb", "z", "y", "aa")]
df_final
bb z y aa
1 A 1 i NA
2 A 2 ii NA
3 A 3 ii 123
4 B 4 i 1234
5 A 5 iii NA
6 A 6 i 12
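The same update can be sketched without a join, by looking up each old row's position in the new data with match(). This is a sketch under the same assumption as above (z is a unique key in both data frames); the name df_updated is introduced here for illustration. Because each column is assigned separately, bb stays a character vector.

```r
df_old <- data.frame(
  bb = c("A", "A", "A", "B", "B", "B"),
  y = c("i", "ii", "ii", "i", "iii", "i"),
  z = 1:6,
  aa = c(NA, NA, 123, NA, NA, 12),
  stringsAsFactors = FALSE)
df_new <- data.frame(
  bb = c("A", "A", "A", "B", "A", "A"),
  z = 1:6,
  aa = c(NA, NA, 123, 1234, NA, 12),
  stringsAsFactors = FALSE)

# row of df_new that corresponds to each row of df_old (NA if none)
idx <- match(df_old$z, df_new$z)
has_match <- !is.na(idx)

df_updated <- df_old
# bb: always take the new value where a matching row exists
df_updated$bb[has_match] <- df_new$bb[idx[has_match]]
# aa: only fill in values that are missing in the old data
fill_aa <- has_match & is.na(df_updated$aa)
df_updated$aa[fill_aa] <- df_new$aa[idx[fill_aa]]
```

Unlike merge(), this keeps the row order and all rows of df_old, including rows whose z has no counterpart in df_new.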

dplyr mutate to replace specific values in a data frame

I have a data frame that consists of characters "a", "b", "x", "y".
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"))
Now I want to replace all values with the following scheme and also convert the whole data frame to numeric.
"a" -> 0
"b" -> 1
"x" -> 1
"y" -> 2
I know this must be somehow possible with mutate_all but I cannot figure out how
df %>% mutate_all(replace("a", 1)) %>%
mutate_all(is.character, as.numeric)
One solution could be with case_when:
library(dplyr)
# funs() is deprecated since dplyr 0.8; a formula lambda does the same job here
df %>%
mutate_all(~ case_when(. == "a" ~ 0,
. %in% c("b", "x") ~ 1,
. == "y" ~ 2,
TRUE ~ NA_real_))
# v1 v2
# 1 0 0
# 2 1 1
# 3 1 0
# 4 2 2
Create a named vector with mappings and then subset it using mutate_all
vec <- c(a = 0, b = 1, x = 1, y = 2)
library(dplyr)
df %>% mutate_all(~vec[.])
# v1 v2
#1 0 0
#2 1 1
#3 1 0
#4 2 2
In base R that would be just
df[] <- vec[unlist(df)]
data
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"), stringsAsFactors = FALSE)
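The named-vector lookup can be wrapped in a small function. This is a minimal sketch: the name recode_df is hypothetical. Converting each column with as.character() first is the one design choice worth noting, because indexing a named vector with a factor would use the factor's integer codes instead of its labels. Values missing from the mapping come back as NA.

```r
# Hypothetical helper: recode every column of df through a named vector.
recode_df <- function(df, mapping) {
  df[] <- lapply(df, function(col) {
    # as.character() guards against factor columns being used as integer indices
    unname(mapping[as.character(col)])
  })
  df
}

vec <- c(a = 0, b = 1, x = 1, y = 2)
df <- data.frame(v1 = c("a", "b", "x", "y"),
                 v2 = c("a", "b", "a", "y"), stringsAsFactors = TRUE)
recoded <- recode_df(df, vec)
```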

Get elements by position from one data frame to another

Let's say we have two data frames:
df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9], stringsAsFactors = FALSE)
A B C
1 a d g
2 b e h
3 c f i
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
I need to build a function that takes as input a single value or a vector containing elements from one of the data frames and returns the elements from the other data frame according to their positional indexes.
The function should work like this:
> matchdf(values = c("a", "e", "i"), dfin = df1, dfout = df2)
[1] 1 5 9
> matchdf(values = c(1, 5, 9), dfin = df2, dfout = df1)
[1] "a" "e" "i"
> matchdf(values = c(1, 1, 1), dfin = df2, dfout = df1)
[1] "a" "a" "a"
This is what I have tried so far:
library(dplyr)
toVec <- function(df) df %>% as.matrix %>% as.vector
matchdf <- function(values, dfin, dfout) toVec(dfout)[toVec(dfin) %in% values]
# But sometimes the output values aren't in correct order:
> matchdf(c("c", "i", "h"), df1, df2)
[1] 3 8 9
# should output 3 9 8
> matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
[1] 1
# Should output 1 1 1
Feel free to use data.table or/and dplyr if it eases the task. I would prefer a solution without for loops.
Assumptions:
elements from df1 are different from df2
dim(df1) = dim(df2)
matchdf <- function(values, dfin, dfout){
unlist(sapply(values,
function(val) dfout[dfin == val],
USE.NAMES = FALSE)
)
}
matchdf(c("c", "i", "h"), df1, df2)
#should output 3 9 8
[1] 3 9 8
matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
#should output 1 1 1
[1] 1 1 1
matchdf(values = c("X", "Y", "a"), dfin = df1, dfout = df2)
#should output vector, not list
[1] 1
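A fully vectorized variant can be sketched with match(), which returns positions in the order of the input values and repeats them for repeated values. This is a sketch, not the answer's method: the name matchdf2 is hypothetical, and it assumes, in addition to the assumptions listed above, that the values within dfin are unique and that all columns of each data frame share one type (otherwise unlist() coerces them). It also behaves differently for values that are not found: they yield NA instead of being dropped.

```r
# Hypothetical vectorized variant: look up positions with match().
matchdf2 <- function(values, dfin, dfout) {
  # flatten both data frames column by column so positions line up
  flat_in  <- unlist(dfin,  use.names = FALSE)
  flat_out <- unlist(dfout, use.names = FALSE)
  flat_out[match(values, flat_in)]
}

df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9],
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)

res_order  <- matchdf2(c("c", "i", "h"), df1, df2)
res_repeat <- matchdf2(c(1, 1, 1), df2, df1)
res_absent <- matchdf2(c("X", "a"), df1, df2)
```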

add a counter column based on categories of other columns in R

I am trying to add a counter column to my dataframe based on the combination of two categorical values. e.g:
dat <- data.frame(cat1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
cat2 = c("x", "x", "x", "y", "y", "j", "j", "k", "l"),
Result = c(1, 1, 1, 2, 2, 1, 1, 2, 3))
I have used this:
dat$Result <- ave(dat$cat1, dat$cat2, FUN=function(x) match(x,sort(unique(x))))
but I get errors. I have checked similar suggestions in other threads, but the answers only apply to numeric columns. Could anybody please offer a suggestion? Thank you.
We can use
with(dat, as.numeric(ave(as.character(cat2), cat1,
FUN = function(x) match(x, unique(x)))))
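If 'cat2' is a factor whose levels are already in the desired order, it can be coerced to numeric directly instead of going through character (this does not work when 'cat2' is a character column, since as.numeric would then return NA):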
with(dat, ave(as.numeric(cat2), cat1, FUN = function(x) match(x, unique(x))))
Update
With the new dataset,
with(dat, as.numeric(ave(as.character(cat2), cat1, FUN =
function(x) inverse.rle(within.list(rle(x), values <- seq_along(values))))))
#[1] 1 1 1 2 2 1 1 2 3 4
You can use rleid from data.table,
library(data.table)
setDT(dat)[, Result := rleid(cat2), by = cat1]
dat
# cat1 cat2 Result
#1: a x 1
#2: a x 1
#3: a x 1
#4: a y 2
#5: a y 2
#6: b j 1
#7: b j 1
#8: b k 2
#9: b l 3
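For comparison, here is a base R sketch that avoids the type problem of passing a character column to ave(). It is an illustration under assumptions rather than one of the answers above: it groups row indices instead of values, so FUN receives integers and returns integers. Within each cat1 group it numbers the cat2 values by first appearance, which matches rleid only when equal cat2 values are contiguous within a group, as they are in this data.

```r
dat <- data.frame(cat1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
                  cat2 = c("x", "x", "x", "y", "y", "j", "j", "k", "l"),
                  stringsAsFactors = FALSE)

# Group the row indices by cat1; inside each group, number the cat2
# values in order of first appearance.
dat$Result <- ave(seq_len(nrow(dat)), dat$cat1,
                  FUN = function(i) match(dat$cat2[i], unique(dat$cat2[i])))
```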
