I have a data.frame:
mydata = data.frame(v1 = c("A", "A", "A", "B", "B", "C", "D"),
v2 = c("XY", "XY", "ZZ", "BB", "ZZ", NA, "ZZ"),
v3 = 5)
And I would like to encode each of the characters in the data frame to integers corresponding to each of the levels. I also want to "ignore" NA values. The expected output would be equal to:
output = data.frame(v1 = c(1, 1, 1, 2, 2, 3, 4),
v2 = c(1, 1, 2, 3, 2, NA, 2),
v3 = 5)
My hope is to write a function that accepts a data.frame object AND a list specifying the columns on which I want to perform the operation, something like:
my_function = function(df, vars){
...
}
EDIT: in the example above, "vars" would be = c("v1", "v2")
Any suggestions for how to approach this? I'm open to using packages such as dplyr to help.
Thanks,
D
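We can convert each column to a factor, with levels in order of first appearance, and then coerce it to numeric. NA values stay NA because factor() excludes NA from the levels by default.
mydata[1:2] <- lapply(mydata[1:2], function(x)
  as.numeric(factor(x, levels = unique(x))))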
This can be converted to a function
myfunction <- function(df, vars) {
  df[vars] <- lapply(df[vars], function(x)
    as.numeric(factor(x, levels = unique(x))))
  df
}
myfunction(mydata, c('v1', 'v2'))
# v1 v2 v3
#1 1 1 5
#2 1 1 5
#3 1 2 5
#4 2 3 5
#5 2 2 5
#6 3 NA 5
#7 4 2 5
To generalize further, check each column's class: leave numeric columns unchanged, and convert every other column to a factor with the levels specified, then coerce it to numeric.
mydata[] <- lapply(mydata, function(x)
  if (!is.numeric(x)) as.numeric(factor(x, levels = unique(x)))
  else x)
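Since the question allows dplyr, the same idea can be sketched with across(). This is only a sketch, not the answerer's code: encode_columns is a hypothetical name, and it assumes dplyr 1.0.0 or later for across() and all_of().

```r
library(dplyr)

# Hypothetical helper (name not from the original posts): encode the columns
# named in `vars` as integers in order of first appearance; NA stays NA.
encode_columns <- function(df, vars) {
  df %>%
    mutate(across(all_of(vars),
                  ~ as.integer(factor(.x, levels = unique(.x)))))
}

mydata <- data.frame(v1 = c("A", "A", "A", "B", "B", "C", "D"),
                     v2 = c("XY", "XY", "ZZ", "BB", "ZZ", NA, "ZZ"),
                     v3 = 5,
                     stringsAsFactors = FALSE)

encoded <- encode_columns(mydata, c("v1", "v2"))
encoded
```

Columns not listed in vars, such as v3 here, pass through unchanged.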
I'm trying to replace NA values in a factor column with the value of the cell above. It would be great to have a tidyverse approach, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
                 value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na although they have basically amounted to trying the same thing which is:
mutate(site = as.character(site),
site = ifelse(is.na(site), "zero", site),
site = ifelse(site == "zero", lag(site), site),
site = as.factor(site))
Thanks!
Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
An option with na.locf0 from zoo, assigning the result back to site:
library(zoo)
data$site <- na.locf0(data$site)
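If neither tidyr nor zoo is available, carrying the last non-NA value forward can also be sketched in base R. The helper name locf is hypothetical, and this is a minimal illustration of the idea rather than a replacement for the packages above.

```r
# Hypothetical helper: last observation carried forward, base R only.
locf <- function(x) {
  # For each position, the index of the most recent non-NA value (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Leading NAs have nothing to carry forward, so they stay NA
  idx[idx == 0] <- NA
  x[idx]
}

site <- factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C"))
filled <- locf(site)
filled
```

Indexing with idx keeps the factor class and its levels, so the result can be assigned straight back to the column.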
I am working to update an old data frame with data from a new data frame.
I found this option, it works for some of the fields, but not all. Not sure how to alter that as it is beyond my skill set. I tried removing the is.na(x) portion of the ifelse code and that did not work.
df_old <- data.frame(
bb = as.character(c("A", "A", "A", "B", "B", "B")),
y = as.character(c("i", "ii", "ii", "i", "iii", "i")),
z = 1:6,
aa = c(NA, NA, 123, NA, NA, 12))
df_new <- data.frame(
bb = as.character(c("A", "A", "A", "B", "A", "A")),
z = 1:6,
aa = c(NA, NA, 123, 1234, NA, 12))
cols <- names(df_new)[names(df_new) != "z"]
df_old[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[df_new$z == df_old$z], x), df_old[,cols], df_new[,cols])
The code also changes my bb variable from a character vector to a numeric. Do I need another call to mapply focusing on specific variable bb?
To update the aa and bb columns you can approach this using a join via merge(). This assumes column z is the index for these data frames.
# join on `z` column
df_final<- merge(df_old, df_new, by = c("z"))
# replace NAs with new values for column `aa` from `df_new`
df_final$aa <- ifelse(is.na(df_final$aa.x), df_final$aa.y, df_final$aa.x)
# choose new values for column `bb` from `df_new`
df_final$bb <- df_final$bb.y
df_final<- df_final[,c("bb", "z", "y", "aa")]
df_final
bb z y aa
1 A 1 i NA
2 A 2 ii NA
3 A 3 ii 123
4 B 4 i 1234
5 A 5 iii NA
6 A 6 i 12
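The same join-then-fill logic could be written with dplyr, using coalesce() to take the old aa value where present and the new one otherwise. This is a sketch under the same assumption that z is the key; the suffixes _old and _new are my own choice, not from the original post.

```r
library(dplyr)

df_old <- data.frame(
  bb = c("A", "A", "A", "B", "B", "B"),
  y = c("i", "ii", "ii", "i", "iii", "i"),
  z = 1:6,
  aa = c(NA, NA, 123, NA, NA, 12),
  stringsAsFactors = FALSE)
df_new <- data.frame(
  bb = c("A", "A", "A", "B", "A", "A"),
  z = 1:6,
  aa = c(NA, NA, 123, 1234, NA, 12),
  stringsAsFactors = FALSE)

df_final <- df_old %>%
  # join on the key column; clashing columns get the (assumed) suffixes
  left_join(df_new, by = "z", suffix = c("_old", "_new")) %>%
  mutate(aa = coalesce(aa_old, aa_new),  # keep old aa, fill NAs from new
         bb = bb_new) %>%                # always take bb from df_new
  select(bb, z, y, aa)
df_final
```

Setting stringsAsFactors = FALSE keeps bb a character column, which avoids the character-to-numeric surprise described in the question.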
I have a data frame that consists of characters "a", "b", "x", "y".
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"))
Now I want to replace all values with the following scheme and also convert the whole data frame to numeric.
"a" -> 0
"b" -> 1
"x" -> 1
"y" -> 2
I know this must somehow be possible with mutate_all, but I cannot figure out how:
df %>% mutate_all(replace("a", 1)) %>%
mutate_all(is.character, as.numeric)
One solution could be with case_when (written with a ~ lambda, since funs() is deprecated as of dplyr 0.8.0):
library(dplyr)
df %>%
  mutate_all(~ case_when(. == "a" ~ 0,
                         . %in% c("b", "x") ~ 1,
                         . == "y" ~ 2,
                         TRUE ~ NA_real_))
# v1 v2
# 1 0 0
# 2 1 1
# 3 1 0
# 4 2 2
Create a named vector with mappings and then subset it using mutate_all
vec <- c(a = 0, b = 1, x = 1, y = 2)
library(dplyr)
df %>% mutate_all(~vec[.])
# v1 v2
#1 0 0
#2 1 1
#3 1 0
#4 2 2
In base R that would be just
df[] <- vec[unlist(df)]
data
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"), stringsAsFactors = FALSE)
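The named-vector lookup depends on the columns being character, which is why the data above sets stringsAsFactors = FALSE: indexing with a factor uses its integer codes, not its labels. A defensive sketch that converts each column with as.character() first (the name df_fac is only for illustration):

```r
vec <- c(a = 0, b = 1, x = 1, y = 2)

# Same data, but deliberately stored as factors to show the safe lookup
df_fac <- data.frame(v1 = c("a", "b", "x", "y"),
                     v2 = c("a", "b", "a", "y"),
                     stringsAsFactors = TRUE)

# as.character() makes the lookup go by label; unname() drops the vector names
df_fac[] <- lapply(df_fac, function(col) unname(vec[as.character(col)]))
df_fac
```

Any value without an entry in vec comes back as NA, which matches the fallback in the case_when version.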
Let's say we have two data frames:
df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9], stringsAsFactors = FALSE)
A B C
1 a d g
2 b e h
3 c f i
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
I need to build a function that takes as input a single value or a vector containing elements from one of the data frames and returns the elements from the other data frame according to their positional indexes.
The function should work like this:
> matchdf(values = c("a", "e", "i"), dfin = df1, dfout = df2)
[1] 1 5 9
> matchdf(values = c(1, 5, 9), dfin = df2, dfout = df1)
[1] "a" "e" "i"
> matchdf(values = c(1, 1, 1), dfin = df2, dfout = df1)
[1] "a" "a" "a"
This is what I have tried so far:
library(dplyr)
toVec <- function(df) df %>% as.matrix %>% as.vector
matchdf <- function(values, dfin, dfout) toVec(dfout)[toVec(dfin) %in% values]
# But sometimes the output values aren't in correct order:
> matchdf(c("c", "i", "h"), df1, df2)
[1] 3 8 9
# should output 3 9 8
> matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
[1] 1
# Should output 1 1 1
Feel free to use data.table or/and dplyr if it eases the task. I would prefer a solution without for loops.
Assumptions:
elements from df1 are different from df2
dim(df1) = dim(df2)
matchdf <- function(values, dfin, dfout) {
  unlist(sapply(values,
                function(val) dfout[dfin == val],
                USE.NAMES = FALSE))
}
matchdf(c("c", "i", "h"), df1, df2)
#should output 3 9 8
[1] 3 9 8
matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
#should output 1 1 1
[1] 1 1 1
matchdf(values = c("X", "Y", "a"), dfin = df1, dfout = df2)
#should output vector, not list
[1] 1
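A vectorized alternative is to flatten both data frames and use match(), which preserves the order and repetition of values. This is a sketch under the stated assumptions (same dimensions, a single column type within each data frame); matchdf2 is a hypothetical name. Unlike the version above, it returns NA for values that are not found instead of dropping them.

```r
df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9],
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)

# Hypothetical alternative: look up positions in the flattened input,
# then read the same positions from the flattened output.
matchdf2 <- function(values, dfin, dfout) {
  pos <- match(values, unlist(dfin, use.names = FALSE))
  unlist(dfout, use.names = FALSE)[pos]
}

matchdf2(c("c", "i", "h"), df1, df2)
matchdf2(c(1, 5, 9), df2, df1)
matchdf2(c("X", "Y", "a"), df1, df2)
```

Because unlist() on a data frame goes column by column, position k in one flattened vector lines up with position k in the other.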
I am trying to add a counter column to my dataframe based on the combination of two categorical values. e.g:
dat <- data.frame(cat1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
cat2 = c("x", "x", "x", "y", "y", "j", "j", "k", "l"),
Result = c(1, 1, 1, 2, 2, 1, 1, 2, 3))
I have used this:
dat$Result <- ave(dat$cat1, dat$cat2, FUN=function(x) match(x,sort(unique(x))))
but I get errors. I have checked similar suggestions in other threads, but the answers only apply to numeric columns. Could anybody please offer a suggestion? Thank you.
We can use
with(dat, as.numeric(ave(as.character(cat2), cat1,
FUN = function(x) match(x, unique(x)))))
If the factor levels are already in the same order for 'cat2', then coercing to numeric can also be done
with(dat, ave(as.numeric(cat2), cat1, FUN = function(x) match(x, unique(x))))
Update
With the new dataset,
with(dat, as.numeric(ave(as.character(cat2), cat1, FUN =
function(x) inverse.rle(within.list(rle(x), values <- seq_along(values))))))
#[1] 1 1 1 2 2 1 1 2 3 4
You can use rleid from data.table,
library(data.table)
setDT(dat)[, Result := rleid(cat2), by = cat1]
dat
# cat1 cat2 Result
#1: a x 1
#2: a x 1
#3: a x 1
#4: a y 2
#5: a y 2
#6: b j 1
#7: b j 1
#8: b k 2
#9: b l 3
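For comparison, the match(x, unique(x)) idea from the first answer can be sketched with dplyr's group_by(). Note the assumption: this numbers distinct cat2 values by first appearance within each cat1 group, so it agrees with rleid only when equal values are contiguous, as in this example.

```r
library(dplyr)

dat <- data.frame(cat1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
                  cat2 = c("x", "x", "x", "y", "y", "j", "j", "k", "l"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  group_by(cat1) %>%
  # index of each cat2 value among the distinct values seen in its group
  mutate(Result = match(cat2, unique(cat2))) %>%
  ungroup()
res
```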