Replacing values in one table from a corresponding key in another table by specific column - r

I am processing a large dataset from a questionnaire that contains coded responses in some but not all columns. I would like to replace the coded responses with actual values. The key/dictionary is stored in another database. The complicating factor is that different questions (stored as columns in original dataset) used the same code (typically numeric), but the code has different meanings depending on the column (question).
How can I replace the coded values in the original dataset with different values from a corresponding key stored in the dictionary table, but do it by specific column name (also stored in the dictionary table)?
Below is an example of the original dataset and the dictionary table, as well as desired result.
original <- data.frame(
name = c('Jane','Mary','John', 'Billy'),
home = c(1,3,4,2),
car = c('b','b','a','b'),
shirt = c(3,2,1,1),
shoes = c('Black','Black','Black','Brown')
)
keymap <- data.frame(
column_name=c('home','home','home','home','car','car','shirt','shirt','shirt'),
value_old=c('1','2','3','4','a','b','1','2','3'),
value_new=c('Single family','Duplex','Condo','Apartment','Sedan','SUV','White','Red','Blue')
)
result <- data.frame(
name = c('Jane','Mary','John', 'Billy'),
home = c('Single family','Condo','Apartment','Duplex'),
car = c('SUV','SUV','Sedan','SUV'),
shirt = c('Blue','Red','White','White'),
shoes = c('Black','Black','Black','Brown')
)
> original
name home car shirt shoes
1 Jane 1 b 3 Black
2 Mary 3 b 2 Black
3 John 4 a 1 Black
4 Billy 2 b 1 Brown
> keymap
column_name value_old value_new
1 home 1 Single family
2 home 2 Duplex
3 home 3 Condo
4 home 4 Apartment
5 car a Sedan
6 car b SUV
7 shirt 1 White
8 shirt 2 Red
9 shirt 3 Blue
> result
name home car shirt shoes
1 Jane Single family SUV Blue Black
2 Mary Condo SUV Red Black
3 John Apartment Sedan White Black
4 Billy Duplex SUV White Brown
I have tried different approaches using dplyr but have not gotten far as I do not have a robust understanding of the mutate/join syntax.

We may loop across the columns of original that are named in the 'column_name' column of 'keymap'. For each one, subset keymap to the rows that match the current column name (cur_column()), drop column_name so that only value_old and value_new remain, deframe that pair into a named vector, and index the vector with the column's values (converted to character) to do the replacement:
library(dplyr)
library(tibble)
original %>%
mutate(across(all_of(unique(keymap$column_name)), ~
(keymap %>%
filter(column_name == cur_column()) %>%
select(-column_name) %>%
deframe)[as.character(.x)]))
-output
name home car shirt shoes
1 Jane Single family SUV Blue Black
2 Mary Condo SUV Red Black
3 John Apartment Sedan White Black
4 Billy Duplex SUV White Brown
Or an approach in base R
lst1 <- split(with(keymap, setNames(value_new, value_old)), keymap$column_name)
original[names(lst1)] <- Map(\(x, y) y[as.character(x)],
original[names(lst1)], lst1)
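Both approaches above index a named vector, so a code that has no entry in keymap would come back as NA. A minimal sketch of a variation that keeps the original code in that case is shown below. The helper name recode_by_key() is hypothetical (it is not part of any package), and the fallback behaviour is an assumption about what is wanted, not something the question specifies.

```r
# Example data copied from the question
original <- data.frame(
  name = c('Jane', 'Mary', 'John', 'Billy'),
  home = c(1, 3, 4, 2),
  car = c('b', 'b', 'a', 'b'),
  shirt = c(3, 2, 1, 1),
  shoes = c('Black', 'Black', 'Black', 'Brown')
)
keymap <- data.frame(
  column_name = c('home', 'home', 'home', 'home', 'car', 'car', 'shirt', 'shirt', 'shirt'),
  value_old = c('1', '2', '3', '4', 'a', 'b', '1', '2', '3'),
  value_new = c('Single family', 'Duplex', 'Condo', 'Apartment', 'Sedan', 'SUV',
                'White', 'Red', 'Blue')
)

# Hypothetical helper: recode every column of `data` that is listed in `key`,
# using only the key rows that belong to that column
recode_by_key <- function(data, key) {
  for (col in intersect(unique(as.character(key$column_name)), names(data))) {
    k <- key[key$column_name == col, ]
    old <- as.character(data[[col]])
    idx <- match(old, as.character(k$value_old))
    # Assumption: a code without a key entry is kept unchanged instead of becoming NA
    data[[col]] <- ifelse(is.na(idx), old, as.character(k$value_new)[idx])
  }
  data
}

recoded <- recode_by_key(original, keymap)
recoded
```

Columns that are not listed in keymap (name and shoes here) are never touched, and every recoded column comes back as character.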

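Please check the code below, which uses factor() to replace the values in a column with labels taken from keymap. The keymap has to be restricted to the rows for that column first; otherwise a code such as 3 in shirt is matched against the first 3 in value_old, which belongs to home.
library(tidyverse)
# Build the factor from the keymap rows of one column only, because the
# same code (e.g. '1') means different things in different columns
recode_col <- function(x, col) {
k <- keymap[keymap$column_name == col, ]
factor(x, k$value_old, k$value_new)
}
original %>% mutate(home=recode_col(home, 'home'),
car=recode_col(car, 'car'),
shirt=recode_col(shirt, 'shirt')
)
Created on 2023-02-04 with reprex v2.0.2
name home car shirt shoes
1 Jane Single family SUV Blue Black
2 Mary Condo SUV Red Black
3 John Apartment Sedan White Black
4 Billy Duplex SUV White Brown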

Related

Create weight node and edges lists from a normal dataframe in R?

I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I want to use this to create nodes and edges lists that can be used to build a visNetwork diagram.
I know that the "nodes" list will be made from the unique values in the "name" column but I'm not sure how I would use the rest of the data to create the "edges" list?
I was thinking that it may be possible to group by each column and then read back the matches from this function but I am not sure how to implement this. The idea that I thought of is to weight the edges based on how many matches they detect in the various group by functions. I'm not sure how to actually implement this yet.
For example, Joe will not match with anyone because he shares no common columns with any of the others. John and Sarah will have a weight of 2 because they share two common columns.
Also open to solutions in python!
One option is to compare the rows pairwise and count the number of values they have in common.
For instance for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then use combn() from the utils package to list in advance every pair of names for which a weight has to be calculated:
nodes <- matrix(combn(df$name, 2), ncol = 2, byrow = T) %>% as.data.frame()
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
(nodes)
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally, a loop calculates the weight for each pair.
for(n in 1:nrow(nodes)){
name1 <- df[df$name == nodes$V1[n],]
name2 <- df[df$name == nodes$V2[n],]
nodes$weight[n] <- sum(name1 == name2)
}
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
I think nodes (which, despite its name, is really an edge list) is the kind of data frame that you can pass to visNetwork() as the edges, probably after renaming V1 and V2 to from and to.
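The steps above can be condensed into a short base R sketch that produces both tables at once. It assumes that visNetwork expects an id column in the node table and from/to columns in the edge table, that only the non-name columns should count towards the weight, and that pairs with a weight of 0 should be dropped. The variable names attrs and pairs are illustrative.

```r
# Example data copied from the question
df <- data.frame(
  name = c('John', 'Sarah', 'Beth', 'Joe'),
  town = c('Bringham', 'Bringham', 'Burb', 'Spring'),
  car = c('Swift', 'Corolla', 'Swift', 'Polo'),
  color = c('Red', 'Red', 'Blue', 'Black'),
  age = c(22, 33, 43, 18),
  school = c('Brighton', 'Rustal', 'Brighton', 'Riding'),
  stringsAsFactors = FALSE
)

# Columns that take part in the comparison (everything except the identifier)
attrs <- setdiff(names(df), 'name')

# One row per unordered pair of row indices
pairs <- t(combn(seq_len(nrow(df)), 2))

# Weight = number of attribute columns in which the two rows agree
edges <- data.frame(
  from = df$name[pairs[, 1]],
  to = df$name[pairs[, 2]],
  weight = apply(pairs, 1, function(p) {
    sum(sapply(attrs, function(a) df[[a]][p[1]] == df[[a]][p[2]]))
  }),
  stringsAsFactors = FALSE
)

# Assumption: pairs that share nothing should not be drawn at all
edges <- edges[edges$weight > 0, ]

nodes <- data.frame(id = df$name, label = df$name, stringsAsFactors = FALSE)

# With the visNetwork package installed, the two tables could then be passed on:
# visNetwork::visNetwork(nodes, edges)
```

Joe stays in nodes but has no edges, which matches the expectation in the question that he is not connected to anyone.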

How to see/count the number of unique rows in a dataframe

For reference
library(vcd)
data(Arthritis)
Art = Arthritis[c("Treatment", "Sex", "Age")]
I want to find out the number of matching attributes in a data frame.
For example
Adj Name Verb
Red John Jumps
Blue John Sleeps
Red John Jumps
Red Smith Jumps
Red Smith Walks
In the end, I want to see:
Adj Name Verb Freq
Red John Jumps 2
Blue John Sleeps 1
Red Smith Jumps 1
Red Smith Walks 1
Is there a way to do this in R?
You can do this with aggregate.
DAT = read.table(text="Adj Name Verb
Red John Jumps
Blue John Sleeps
Red John Jumps
Red Smith Jumps
Red Smith Walks",
header=TRUE)
aggregate(rep(1, nrow(DAT)), DAT, length)
Adj Name Verb x
1 Red John Jumps 2
2 Red Smith Jumps 1
3 Blue John Sleeps 1
4 Red Smith Walks 1
You could also use sum instead of length.
Slightly clunkier than @G5W's, but:
## cross-tabulate
t1 <- with(DAT,table(Adj,Name,Verb))
## convert to long format
res <- as.data.frame(t1)
## drop zeros
subset(res,Freq>0)
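As a third variation, here is a minimal sketch that gets the Freq column name from the question directly by adding a helper column of ones and summing it with the formula interface of aggregate(). The column name Freq and the variable name counts are choices made for this example.

```r
# Example data copied from the answer above
DAT <- read.table(text = "Adj Name Verb
Red John Jumps
Blue John Sleeps
Red John Jumps
Red Smith Jumps
Red Smith Walks",
header = TRUE)

# Add a column of ones, then sum it within every distinct Adj/Name/Verb combination;
# the dot on the right-hand side means "all other columns"
counts <- aggregate(Freq ~ ., data = transform(DAT, Freq = 1), FUN = sum)
counts
```

Only combinations that actually occur are returned, so no zero rows have to be dropped afterwards.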

Multiple ifelse statements in R - Conditionally create new variable [duplicate]

I have a dataframe with a few columns, one of those columns is ranks, an integer between 1 and 20. I want to create another column that contains a bin value like "1-4", "5-10", "11-15", "16-20".
What is the most effective way to do this?
the data frame that I have looks like this(.csv format):
rank,name,info
1,steve,red
3,joe,blue
6,john,green
3,liz,yellow
15,jon,pink
and I want to add another column to the dataframe, so it would be like this:
rank,name,info,binValue
1,steve,red,"1-4"
3,joe,blue,"1-4"
6,john,green, "5-10"
3,liz,yellow,"1-4"
15,jon,pink,"11-15"
The way I am doing it now is not working. I would like to keep the data.frame intact and just add another column whose value depends on the range that df$rank falls in. Thank you.
See ?cut and specify breaks (and maybe labels).
x$bins <- cut(x$rank, breaks=c(0,4,10,15,20), labels=c("1-4","5-10","11-15","16-20"))
x
# rank name info bins
# 1 1 steve red 1-4
# 2 3 joe blue 1-4
# 3 6 john green 5-10
# 4 3 liz yellow 1-4
# 5 15 jon pink 11-15
dat <- "rank,name,info
1,steve,red
3,joe,blue
6,john,green
3,liz,yellow
15,jon,pink"
x <- read.table(textConnection(dat), header=TRUE, sep=",", stringsAsFactors=FALSE)
x$bins <- cut(x$rank, breaks=seq(0, 20, 5), labels=c("1-5", "6-10", "11-15", "16-20"))
x
rank name info bins
1 1 steve red 1-5
2 3 joe blue 1-5
3 6 john green 6-10
4 3 liz yellow 1-5
5 15 jon pink 11-15
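cut() treats each interval as right-closed by default, so a break at 4 puts the value 4 into the lower bin and 5 into the next one. The short sketch below checks that at every boundary for the four bins asked for in the question; the vector ranks is made-up test data.

```r
# Made-up ranks sitting exactly on both sides of every bin boundary
ranks <- c(1, 4, 5, 10, 11, 15, 16, 20)

# Intervals are (0,4], (4,10], (10,15], (15,20] because right = TRUE is the default
bins <- cut(ranks,
            breaks = c(0, 4, 10, 15, 20),
            labels = c('1-4', '5-10', '11-15', '16-20'))

data.frame(rank = ranks, bin = as.character(bins))
```

A rank outside 1 to 20 would fall outside every interval and come back as NA, which is a cheap way to spot bad input.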
We can use smart_cut from package cutr :
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
Using @Andrie's sample data:
x$bins <- smart_cut(x$rank,
c(1,5,11,16),
labels = ~paste0(.y[1],'-',.y[2]-1),
simplify = FALSE)
# rank name info bins
# 1 1 steve red 1-4
# 2 3 joe blue 1-4
# 3 6 john green 5-10
# 4 3 liz yellow 1-4
# 5 15 jon pink 11-15
more on cutr and smart_cut

Unpack embedded vectors in columns of a data frame and spread the values to multiple rows in R

I have a data frame created by this chunk of code:
df <- dplyr::data_frame(
id = c(1, 2, 3),
name = c('Jack', 'Peter', 'Sam'),
role = list(
c('Manager', 'Analyst'),
c('Analyst', 'Advisor'),
c('Analyst')
),
fav_color = list(
c('White', 'Blue'),
c('Black', 'Red'),
c('White', 'Red')
)
)
Each row in the role and fav_color columns contains a vector of characters instead of a single string. I want to spread the values into separate rows like this:
id name role fav_color
------------------------------
1 Jack Manager White
1 Jack Manager Blue
1 Jack Analyst White
1 Jack Analyst Blue
2 Peter Analyst Black
2 Peter Analyst Red
2 Peter Advisor Black
2 Peter Advisor Red
3 Sam Analyst White
3 Sam Analyst Red
I tried purrr and tidyjson but still didn't get very far.
Can anyone give me some advice? Much appreciated.
You can do this with dplyr fairly easily
df %>% rowwise %>% do(expand.grid(., stringsAsFactors=FALSE))
# id name role fav_color
# * <dbl> <chr> <chr> <chr>
# 1 1 Jack Manager White
# 2 1 Jack Analyst White
# 3 1 Jack Manager Blue
# 4 1 Jack Analyst Blue
# 5 2 Peter Analyst Black
# 6 2 Peter Advisor Black
# 7 2 Peter Analyst Red
# 8 2 Peter Advisor Red
# 9 3 Sam Analyst White
# 10 3 Sam Analyst Red
Here we use the base function expand.grid to find all combinations of the list values, one row at a time.
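The same idea can be sketched without dplyr, in case rowwise() and do() are not available. This version assumes a plain data.frame with list columns in place of the tibble from the question, and the names rows and long are illustrative.

```r
# Same data as in the question, built as a base data.frame with list columns
df <- data.frame(id = c(1, 2, 3),
                 name = c('Jack', 'Peter', 'Sam'),
                 stringsAsFactors = FALSE)
df$role <- list(c('Manager', 'Analyst'), c('Analyst', 'Advisor'), 'Analyst')
df$fav_color <- list(c('White', 'Blue'), c('Black', 'Red'), c('White', 'Red'))

# For each row, build every role/fav_color combination with expand.grid()
rows <- lapply(seq_len(nrow(df)), function(i) {
  expand.grid(id = df$id[i],
              name = df$name[i],
              role = df$role[[i]],
              fav_color = df$fav_color[[i]],
              stringsAsFactors = FALSE)
})

# Stack the per-row results into one long data frame
long <- do.call(rbind, rows)
long
```

expand.grid() varies its first arguments fastest, so within each person the roles cycle before the colours do, which is the same order as in the dplyr output above.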

Resources