Recoding specific column values using reference list - r

My dataframe looks like this
data = data.frame(ID=c(1,2,3,4,5,6,7,8,9,10),
Gender=c('Male','Female','Female','Female','Male','Female','Male','Male','Female','Female'))
And I have a reference list that looks like this -
ref=list(Male=1,Female=2)
I'd like to replace values in the Gender column using this reference list, without adding a new column to my dataframe.
Here's my attempt
do.call(dplyr::recode, c(list(data), ref))
Which gives me the following error -
no applicable method for 'recode' applied to an object of class
"data.frame"
Any inputs would be greatly appreciated
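The error arises because dplyr::recode() works on a vector, not on a whole data frame. A minimal sketch, assuming dplyr is installed, that passes the Gender column instead and assigns the result back to the same column:

```r
data <- data.frame(ID = 1:10,
                   Gender = c('Male','Female','Female','Female','Male',
                              'Female','Male','Male','Female','Female'),
                   stringsAsFactors = FALSE)
ref <- list(Male = 1, Female = 2)

# recode() takes the vector first, then old = new pairs;
# do.call() supplies those pairs from the reference list
data$Gender <- do.call(dplyr::recode, c(list(data$Gender), ref))
```

No new column is created; Gender is overwritten in place.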

One option is to do a left_join after stacking the 'ref' list into a two-column data.frame
library(dplyr)
left_join(data, stack(ref), by = c('Gender' = 'ind')) %>%
select(ID, Gender = values)
A base R approach would be
unname(unlist(ref)[as.character(data$Gender)])
#[1] 1 2 2 2 1 2 1 1 2 2

In base R:
data$Gender <- sapply(as.character(data$Gender), function(x) ref[[x]], USE.NAMES = FALSE)

You can use factor, i.e.
factor(data$Gender, levels = names(ref), labels = ref)
#[1] 1 2 2 2 1 2 1 1 2 2
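Note that factor() returns a factor whose labels are the strings "1" and "2" rather than numbers. A small sketch of getting numeric codes out of it, assuming that is what you need; the names f and codes are only illustrative:

```r
data <- data.frame(ID = 1:10,
                   Gender = c('Male','Female','Female','Female','Male',
                              'Female','Male','Male','Female','Female'))
ref <- list(Male = 1, Female = 2)

# Recode via factor(): levels are the old values, labels the new ones
f <- factor(data$Gender, levels = names(ref), labels = unlist(ref))

# Go through character first; as.numeric() on a factor alone
# would return the internal level positions instead of the labels
codes <- as.numeric(as.character(f))
```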

You can unlist ref to give you a named vector of codes, and then index this with your data:
transform(data,Gender=unlist(ref)[as.character(Gender)])
ID Gender
1 1 1
2 2 2
3 3 2
4 4 2
5 5 1
6 6 2
7 7 1
8 8 1
9 9 2
10 10 2
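One thing to keep in mind with the named-vector lookup: a value that has no entry in ref comes back as NA. A short sketch, where the value "Other" is a made-up example and not part of the original data:

```r
ref <- list(Male = 1, Female = 2)
lookup <- unlist(ref)                      # named numeric vector: Male = 1, Female = 2

gender <- c("Male", "Female", "Other")     # "Other" is hypothetical
codes <- unname(lookup[gender])            # names without a match yield NA
```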

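Surprisingly, this works as well, although ref[...] returns a list, so Gender becomes a list column (wrap the right-hand side in unlist() if you need an atomic vector):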
data$Gender <- ref[as.character(data$Gender)]
#> data
# ID Gender
# 1 1 1
# 2 2 2
# 3 3 2
# 4 4 2
# 5 5 1
# 6 6 2
# 7 7 1
# 8 8 1
# 9 9 2
# 10 10 2
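Because single-bracket indexing of a list returns a list, it may be worth checking the column type afterwards. A sketch of that check and of flattening the column; the name was_list is only illustrative:

```r
data <- data.frame(ID = 1:10,
                   Gender = c('Male','Female','Female','Female','Male',
                              'Female','Male','Male','Female','Female'))
ref <- list(Male = 1, Female = 2)

data$Gender <- ref[as.character(data$Gender)]
was_list <- is.list(data$Gender)           # the column is now a list column

# Flatten to a plain numeric vector
data$Gender <- unlist(data$Gender, use.names = FALSE)
```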

Related

Is there an R function to sequentially assign a code to each value in a dataframe, in the order it appears within the dataset?

I have a table with a long list of aliased values like this:
> head(transmission9, 50)
# A tibble: 50 x 2
In_Node End_Node
<chr> <chr>
1 c4ca4238 2838023a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a
I would like to have R go through both columns and assign a number sequentially to each value, in the order that an aliased value appears in the dataset. I would like R to read across rows first, then down columns. For example, for the dataset above:
In_Node End_Node
<chr> <chr>
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
Is this possible? Ideally, I'd also love to be able to generate a "key" which would match each sequential code to each aliased value, like so:
Code Value
1 c4ca4238
2 2838023a
3 d82c8d16
4 a684ecee
5 fc490ca4
Thank you in advance for the help!
You could do:
df1 <- df
df1[]<-as.numeric(factor(unlist(df), unique(c(t(df)))))
df1
In_Node End_Node
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
You can match against the unique values. For a single vector, the code is straightforward:
match(vec, unique(vec))
The requirement to go across columns before rows makes this slightly tricky: you need to transpose the values first. After that, match them.
Finally, use [<- to assign the result back to a data.frame of the same shape as your original data (here x):
y = x
y[] = match(unlist(x), unique(c(t(x))))
y
V2 V3
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
c(t(x)) is a bit of a hack:
t first converts the tibble to a matrix and then transposes it. If your tibble contains multiple data types, these will be coerced to a common type.
c(…) discards attributes. In particular, it drops the dimensions of the transposed matrix, i.e. it converts the matrix into a vector, with the values now in the correct order.
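The question also asks for a "key" that maps each code back to its aliased value. A sketch building on the match() approach above: the vector of unique values already holds the values in code order, so the key is just that vector alongside its positions. The names vals and key are illustrative, and x is a plain data.frame copy of the sample rows.

```r
x <- data.frame(
  In_Node  = c("c4ca4238", "c4ca4238", "c4ca4238", "c4ca4238", "28dd2c79", "f899139d"),
  End_Node = c("2838023a", "d82c8d16", "a684ecee", "fc490ca4", "c4ca4238", "3def184a"),
  stringsAsFactors = FALSE
)

# Unique values in row-wise order of first appearance
vals <- unique(c(t(x)))

# Recode both columns
y <- x
y[] <- match(unlist(x), vals)

# Key: position in vals is the code
key <- data.frame(Code = seq_along(vals), Value = vals)
```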
A dplyr version
Let's first re-create the sample data
library(tidyverse)
transmission9 <- read.table(header = TRUE, text = " In_Node End_Node
1 c4ca4238 2838023a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a")
Do this simply
transmission9 %>%
mutate(across(everything(), ~ match(., unique(c(t(cur_data()))))))
#> In_Node End_Node
#> 1 1 2
#> 2 1 3
#> 3 1 4
#> 4 1 5
#> 5 6 1
#> 6 7 8
Use the .names argument if you want to create new columns instead
transmission9 %>%
mutate(across(everything(), ~ match(., unique(c(t(cur_data())))),
.names = '{.col}_code'))
In_Node End_Node In_Node_code End_Node_code
1 c4ca4238 2838023a 1 2
2 c4ca4238 d82c8d16 1 3
3 c4ca4238 a684ecee 1 4
4 c4ca4238 fc490ca4 1 5
5 28dd2c79 c4ca4238 6 1
6 f899139d 3def184a 7 8

How to assign values in one column to other columns in wide data using R

There is a wide data set, a simple example is
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
I'm looking for a way to assign the values in "ax" to columns "bx" and "cx". Here, imagine we have thousands of columns we intend to replace with "ax", so I want this done in an automated way in R. The expected output looks like
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(1,2,2,3,4,4),
"cx"=c(1,2,2,3,4,4))
I've thought of, and tried, using mutate_at and ends_with, but this has not worked for me. For example, I tried
df %>%
mutate_at(vars(ends_with("x")), labels = "ax")
and this prints an error. Not sure what's wrong or what's to be added to get this working, so I would like to request your help on this. Thank you very much!
A simple way using base R would be:
change_cols <- grep('x$', names(df))
df[change_cols] <- df$ax
df
# id ax bx cx
#1 1 1 1 1
#2 2 2 2 2
#3 3 2 2 2
#4 4 3 3 3
#5 5 4 4 4
#6 6 4 4 4
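The pattern 'x$' also matches "ax" itself, which is harmless here because the column is overwritten with its own values. If you would rather leave the source column out explicitly, a possible variant (the name targets is illustrative):

```r
df <- data.frame("id" = c(1:6),
                 "ax" = c(1,2,2,3,4,4),
                 "bx" = c(7,8,8,9,10,10),
                 "cx" = c(11,12,12,13,14,14))

# All columns ending in "x", except the source column
targets <- setdiff(grep("x$", names(df), value = TRUE), "ax")

# The vector is recycled across every selected column
df[targets] <- df$ax
```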
I would suggest this tidyverse approach using across() to select the range of variables you want:
library(tidyverse)
#Data
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
#Mutate
df %>% mutate(across(c(bx:cx), ~ ax))
Output:
id ax bx cx
1 1 1 1 1
2 2 2 2 2
3 3 2 2 2
4 4 3 3 3
5 5 4 4 4
6 6 4 4 4
Another option with mutate_at()
df %>%
mutate_at(vars(matches("x$")), ~ax)
# id ax bx cx
# 1 1 1 1 1
# 2 2 2 2 2
# 3 3 2 2 2
# 4 4 3 3 3
# 5 5 4 4 4
# 6 6 4 4 4

Order/Sort/Rank a table

I have a table like this
table(mtcars$gear, mtcars$cyl)
I want to order the rows so that those with the most observations in the 4-cylinder column come first, e.g.
4 6 8
4 8 4 0
5 2 1 2
3 1 2 12
I have been playing with order/sort/rank without much success. How can I order the table output?
We can convert table to data.frame and then order by the column.
sort_col <- "4"
tab <- as.data.frame.matrix(table(mtcars$gear, mtcars$cyl))
tab[order(-tab[[sort_col]]), ]
# OR tab[order(tab[[sort_col]], decreasing = TRUE), ]
# 4 6 8
#4 8 4 0
#5 2 1 2
#3 1 2 12
If we don't want to convert it into data frame and want to maintain the table structure we can do
tab <- table(mtcars$gear, mtcars$cyl)
tab[order(-tab[,dimnames(tab)[[2]] == sort_col]),]
# 4 6 8
# 4 8 4 0
# 5 2 1 2
# 3 1 2 12
Could try this. Use sort for the relevant column, specifying decreasing=TRUE; take the names of the sorted rows and subset using those.
table(mtcars$gear, mtcars$cyl)[names(sort(table(mtcars$gear, mtcars$cyl)[, 1], decreasing = TRUE)), ]
4 6 8
4 8 4 0
5 2 1 2
3 1 2 12
Along the same lines as Milan's answer, but using the order() function instead of looking up names() in a sort()-ed vector.
The [,1] is to look at the first column when ordering.
table(mtcars$gear, mtcars$cyl)[order(table(mtcars$gear, mtcars$cyl)[,1], decreasing=T),]
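The one-liners above pick the sort column by position. A sketch of the same idea that stores the table once and selects the column by its name, which may be easier to read; the names tab and sorted are illustrative:

```r
tab <- table(mtcars$gear, mtcars$cyl)

# Column "4" holds the counts for 4-cylinder cars
sorted <- tab[order(tab[, "4"], decreasing = TRUE), ]

rownames(sorted)
```

The result is still a table, so the original structure is kept.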

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see syntax/logic is quite similar between these two solutions:
Find events that are equal to x
Extract their Sequence
Subset table according to specified Sequence
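The three steps above can be written out one at a time in base R. This is only a sketch of the same logic; the intermediate names is_x and seqs_with_x are illustrative:

```r
d <- data.frame(Event    = c("a","a","x","a","a","a","a","x","a","a"),
                Sequence = c(1,1,1,2,2,3,3,3,4,4),
                stringsAsFactors = FALSE)

is_x        <- d$Event == "x"                  # 1. find events equal to x
seqs_with_x <- unique(d$Sequence[is_x])        # 2. extract their Sequence
result      <- d[d$Sequence %in% seqs_with_x, ] # 3. subset by those sequences
```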
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

Return row number(s) for a particular value in a column in a dataframe

I have a data frame (df) and I was wondering how to return the row number(s) for a particular value (2585) in the 4th column (height_chad1) of the same data frame?
I've tried:
row(mydata_2$height_chad1, 2585)
and I get the following error:
Error in factor(.Internal(row(dim(x))), labels = labs) :
a matrix-like object is required as argument to 'row'
Is there an equivalent line of code that works for data frames instead of matrix-like objects?
Any help would be appreciated.
Use which(mydata_2$height_chad1 == 2585)
Short example
df <- data.frame(x = c(1,1,2,3,4,5,6,3),
y = c(5,4,6,7,8,3,2,4))
df
x y
1 1 5
2 1 4
3 2 6
4 3 7
5 4 8
6 5 3
7 6 2
8 3 4
which(df$x == 3)
[1] 4 8
length(which(df$x == 3))
[1] 2
plyr::count(df, vars = "x")
x freq
1 1 2
2 2 1
3 3 2
4 4 1
5 5 1
6 6 1
df[which(df$x == 3),]
x y
4 3 7
8 3 4
As Matt Weller pointed out, you can use the length function.
The count function in plyr can be used to return the count of each unique column value.
which(df==my.val, arr.ind=TRUE)
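With arr.ind = TRUE the search covers every column, and the result is a matrix of row and column positions. A small sketch, assuming my.val is the value you are looking for (set to 3 here purely as an example):

```r
df <- data.frame(x = c(1,1,2,3,4,5,6,3),
                 y = c(5,4,6,7,8,3,2,4))
my.val <- 3   # example value, not from the original post

# One row per match, with columns "row" and "col"
hits <- which(df == my.val, arr.ind = TRUE)

# Row numbers of matches in the first column only
rows_in_x <- unname(hits[hits[, "col"] == 1, "row"])
```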
