More efficient approach to create a dummy coding - r

Question: In Python, I would work with dictionaries and use lots of map/apply functions. However, with R I started with this simple approach using lists, and I would like to know whether there is a more efficient/more elegant approach to doing the following.
In statistics, you use dummy variables to represent the levels of a nominal attribute. E.g., A/B/C would become 00, 01, 10. A/B/C/D would become 000, 001, 010, 100. Thus, at most one 1 per item is allowed. You therefore need n-1 numbers to represent n levels/letters.
Here I create some data:
data <- data.frame(
"upper" = c(1,1,1,2,2,2,3,3,3), # var 1
"country" = c(1,2,3,1,2,3,1,2,3), # var 2
"price" = c(1,2,3,2,3,1,3,1,2) # var 3
)
Create a list with keys (attributes) and values (lists of unique attribute levels):
lst <- list()
for (attribute in colnames(data)) {
lst[[attribute]] = unique(data[[attribute]])
}
Create dummy coding, i is used to consider only n-1 items:
dummy <- list()
for (attribute in colnames(data)) {
i <- 1
for (level in lst[[attribute]]) {
if (length(lst[[attribute]])!=i) {
dummy[[paste0(attribute, level)]] <- ifelse(
data[[attribute]]==level,
1,
0
)
}
i <- i + 1
}
}
Results:
dummy
$upper1
[1] 1 1 1 0 0 0 0 0 0
$upper2
[1] 0 0 0 1 1 1 0 0 0
$country1
[1] 1 0 0 1 0 0 1 0 0
$country2
[1] 0 1 0 0 1 0 0 1 0
$price1
[1] 1 0 0 0 0 1 0 1 0
$price2
[1] 0 1 0 1 0 0 0 0 1

We create a design matrix using model.matrix, split it by column to create a list of lists, and finally concatenate the list elements together (do.call("c", ...)).
res <- do.call("c",lapply(data, function(x) {
x1 <- model.matrix(~0+factor(x))
split(x1, col(x1))}))
As we only need the first two of the three levels of each attribute, we can subset 'res' using c(TRUE, TRUE, FALSE), which is recycled to the end of the list.
res[c(TRUE, TRUE, FALSE)]
#$upper.1
#[1] 1 1 1 0 0 0 0 0 0
#$upper.2
#[1] 0 0 0 1 1 1 0 0 0
#$country.1
#[1] 1 0 0 1 0 0 1 0 0
#$country.2
#[1] 0 1 0 0 1 0 0 1 0
#$price.1
#[1] 1 0 0 0 0 1 0 1 0
#$price.2
#[1] 0 1 0 1 0 0 0 0 1
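If the attributes do not all have exactly three levels, the c(TRUE, TRUE, FALSE) recycling no longer lines up. A minimal base R sketch that drops the last level of each attribute, however many levels it has, could look like this (dummy_for is a hypothetical helper name, not part of the answers above):

```r
data <- data.frame(
  upper   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  country = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  price   = c(1, 2, 3, 2, 3, 1, 3, 1, 2)
)

# hypothetical helper: one 0/1 vector per level, skipping the last level
dummy_for <- function(attribute) {
  lv <- head(unique(data[[attribute]]), -1)
  setNames(lapply(lv, function(l) as.numeric(data[[attribute]] == l)),
           paste0(attribute, lv))
}

# the outer list is unnamed, so unlist() keeps inner names such as "upper1"
dummy <- unlist(lapply(names(data), dummy_for), recursive = FALSE)
str(dummy)
```

This keeps the naming of the original loop (upper1, upper2, ...) rather than the upper.1 style produced by do.call("c", ...).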

Related

values changes (avoid 0 1 to 1 2)

I want to transform a factor to numeric to be able to take the mean of it. as.numeric changes the values, and numeric doesn't work.
mtcars$vec <- factor(c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
num.cols <- c("vec" )
mtcars[num.cols] <- lapply(mtcars[num.cols], as.numeric)
str(mtcars)
mtcars$vec
expected results should be numeric and consist of only 0 and 1
mtcars$vec
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
many thanks in advance
We need to convert to character first and then to numeric. If we apply as.numeric directly to a factor, we get its internal integer codes, which start from 1, instead of the actual values. In this case that is confusing because the values are binary: 0 and 1 become 1 and 2.
mtcars[num.cols] <- lapply(mtcars[num.cols],
function(x) as.numeric(as.character(x)))
mtcars$vec
#[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Or a faster option is also
mtcars[num.cols] <- lapply(mtcars[num.cols], function(x) as.numeric(levels(x)[x]))
If it is a single column, we can do this more easily
mtcars[[num.cols]] <- as.numeric(levels(mtcars[[num.cols]])[mtcars[[num.cols]]])
As an example
v1 <- factor(c(15, 15, 3, 3))
as.numeric(v1)
#[1] 2 2 1 1
as.numeric(as.character(v1))
#[1] 15 15 3 3
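Since the goal in the question was to take a mean, here is a small sketch of how the two conversions differ for a 0/1 factor. The vector vec below is made-up illustration data, not the mtcars column:

```r
vec <- factor(c(0, 0, 1, 1, 1))

codes  <- as.numeric(vec)                # internal codes: 1 1 2 2 2
values <- as.numeric(as.character(vec))  # level labels:   0 0 1 1 1

mean(codes)   # mean of the codes, shifted by one for this factor
mean(values)  # mean of the original 0/1 values
```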

How to set a loop to assign lots of variables

I just started using R for a psych class, so please go easy on me. I watched a bunch of youtube videos on For loops, but none have answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame together, and save them as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this (with column headings reflecting the letter of the data frame, i.e. B has QB1, QB2, etc.):
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0
To do it in a for loop, you can use assign() from base R together with eval_tidy() and sym() from rlang. Basically, you need to evaluate strings as variable names.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
# assign value to a string as variable
assign(
df_name,
as.data.frame(
matrix(
data = sample(
c(0,1),
size = nrows * ncols,
replace = TRUE
),
ncol = 25
)
)
)
# rename columns
assign(
df_name,
setNames(eval_tidy(sym(df_name)),paste0("Q",df_name,1:ncols))
)
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
new_df_name <- paste0("Q",i)
# initialize Qi with the same number of rows as A,B,C,D ...
assign(
new_df_name,
data.frame(tmp = matrix(NA,nrow = nrows))
)
# loop A,B,C,D ... and bind them
for(df_name in df_names){
assign(
new_df_name,
cbind(
eval_tidy(sym(new_df_name)),
eval_tidy(sym(df_name))[,i,drop = FALSE]
)
)
}
# drop tmp to clean up
assign(
new_df_name,
eval_tidy(sym(new_df_name))[,-1]
)
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The code would be much simpler if you saved the results in a list using map(). Most of the complexity comes from assigning values to separate variables.
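As a rough sketch of the list-based idea, the version below uses base R lapply() in place of map() and keeps everything in two lists. The names dfs and Q and the simulated data are assumptions made for illustration:

```r
set.seed(42)
nrows <- 10
ncols <- 25
df_names <- c("A", "B", "C", "D")

# simulate the four data frames and keep them in one named list
dfs <- lapply(df_names, function(nm) {
  m <- matrix(sample(c(0, 1), nrows * ncols, replace = TRUE), ncol = ncols)
  setNames(as.data.frame(m), paste0("Q", nm, seq_len(ncols)))
})
names(dfs) <- df_names

# for each column index i, bind the i-th column of every data frame
Q <- lapply(seq_len(ncols), function(i) {
  do.call(cbind, unname(lapply(dfs, function(d) d[i])))
})
names(Q) <- paste0("Q", seq_len(ncols))

head(Q$Q25)
```

Individual results are then reached as Q$Q1 or Q[["Q25"]], so no assign() or string evaluation is needed.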
You can combine some dplyr verbs in a for loop to combine the columns from each data set and assign them to 25 new objects.
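# the pipeline below needs dplyr (mutate, filter, select) and tidyr (gather, spread)
library(dplyr)
library(tidyr)
# merge data, gather, split by var numbers, assign each df to environment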
for (i in 1:25) {
df <- cbind(q1,q2,q3,q4) %>% mutate(id=row_number()) %>%
gather(k,v,-id) %>%
mutate(num=sub('A|B|C|D','',k)) %>%
filter(num==i) %>% select(-num) %>% spread(k,v)
assign(paste0('df',i),df)
}
ls(pattern = 'df')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))

Replace a sequence in data frame column

I have a data frame in R that looks somewhat like this:
A | B
0 0
1 0
0 0
0 0
0 1
0 1
1 0
1 0
1 0
I now want to replace all sequences of more than one "1" in the columns so that only the first "1" is kept and the others are replaced by "0", so that the result looks like this
A | B
0 0
1 0
0 0
0 0
0 1
0 0
1 0
0 0
0 0
I hope you understood what I meant (English is not my mother tongue and especially the R "vocabulary" is a bit hard for me, which is probably why I couldn't find a solution through googling). Thank you in advance!
Try this solution:
Input data
df<-data.frame(
A=c(1,0,0,0,0,0,1,1,1,0),
B=c(1,1,0,1,0,0,1,1,0,0))
f<-function(X)
{
return(as.numeric((diff(c(0,X)))>0))
}
Your output
data.frame(lapply(df,f))
A B
1 1 1
2 0 0
3 0 0
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 0 0
10 0 0
You can use ave and create groups based on the difference of your values to capture the consecutive 1s and 0s as different groups, then replace duplicates with 0, i.e.
df[] <- lapply(df, function(i)ave(i, cumsum(c(1, diff(i) != 0)),
FUN = function(i) replace(i, duplicated(i), 0)))
which gives,
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
Here's a simple one line answer:
> df * rbind(c(0,0), sapply(df, diff))
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
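This takes advantage of the fact that diff is 1 only where a 0 is followed by a 1, so every unwanted 1 in the original data is multiplied by 0. Note that a 1 in the very first row is also set to 0, because the padding row is c(0,0).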
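A variant that also keeps a leading 1 compares each value with the one before it. This is only a sketch; first_of_run is a hypothetical helper name, and the data frame is the one from the question:

```r
df <- data.frame(
  A = c(0, 1, 0, 0, 0, 0, 1, 1, 1),
  B = c(0, 0, 0, 0, 1, 1, 0, 0, 0)
)

# hypothetical helper: keep a 1 only if the previous value is not 1
first_of_run <- function(x) {
  previous <- c(0, head(x, -1))  # shift by one, padding the start with 0
  as.numeric(x == 1 & previous != 1)
}

res <- as.data.frame(lapply(df, first_of_run))
res

first_of_run(c(1, 1, 0, 1))  # a run starting in the first position keeps its first 1
```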
Here is an option with rleid
library(data.table)
df1[] <- lapply(df1, function(x) +(x==1& !ave(x, rleid(x), FUN = duplicated)))
df1
# A B
#1 0 0
#2 1 0
#3 0 0
#4 0 0
#5 0 1
#6 0 0
#7 1 0
#8 0 0
#9 0 0
Here's a more functional approach. There are shorter answers here, but it's good to know a possible implementation under the hood:
# helper function
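make_zero <- function(val)
{
# indexes of the current run of consecutive 1s
get_index <- c()
for(i in seq_along(val))
{
if(val[i] == 1) get_index <- c(get_index, i)
else get_index <- c()
# all() must wrap the comparison, not be compared to 1 itself
if(all(diff(get_index) == 1))
{
# keep the first 1 of the run and set the rest to 0
val[get_index[-1]] <- 0
}
}
return (val)
}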
df <- sapply(df, make_zero)
df
A B
[1,] 0 0
[2,] 1 0
[3,] 0 0
[4,] 0 0
[5,] 0 1
[6,] 0 0
[7,] 1 0
[8,] 0 0
[9,] 0 0
Explanation:
1. We save the indexes of consecutive 1s in get_index.
2. Next, we check if the difference between indexes is 1.
3. If found, we update the value in the column.

Showing missing levels in model matrix

I would like to know if there is a way to insert a column into a matrix, given the following:
p1 <- c("a","b","c","e","d","a","c")
p2 <- c("a","b","c","e","e","a","c")
p1mat <- model.matrix(~p1 + 0)
p2mat <- model.matrix(~p2 + 0)
colnames(p1mat) <- gsub("p1","",colnames(p1mat))
colnames(p2mat) <- gsub("p2","",colnames(p2mat))
this would give me for p1mat
a b c d e
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
and for p2mat
a b c e
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
5 0 0 0 1
6 1 0 0 0
7 0 0 1 0
My question is: is there a way to sneak a column vector d consisting of only zeros into the matrix p2mat, such that
d
0
0
0
0
0
0
0
and the vector is automatically ordered and placed between columns c and e, resulting in the following matrix for p2mat
a b c d e
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
Basically I want matrix p2mat to look into every column in p1mat to create an identical size matrix and to keep track of the data via dummy matrices.
Thank you.
You can factor both your inputs, making sure they both have the same levels. Then model.matrix should work as you expected.
Example:
p1 <- c("a","b","c","e","d","a","c")
p2 <- c("a","b","c","e","e","a","c")
levs <- sort(unique(c(p1, p2)))
f1 <- factor(p1, levs)
f2 <- factor(p2, levs)
model.matrix(~f1 + 0)
# f1a f1b f1c f1d f1e
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 0 1
# 5 0 0 0 1 0
# 6 1 0 0 0 0
# 7 0 0 1 0 0
# attr(,"assign")
# [1] 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$f1
# [1] "contr.treatment"
model.matrix(~f2 + 0)
# f2a f2b f2c f2d f2e
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 0 1
# 5 0 0 0 0 1
# 6 1 0 0 0 0
# 7 0 0 1 0 0
# attr(,"assign")
# [1] 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$f2
# [1] "contr.treatment"
If you're really looking to write a function, you might want to look at something like the following:
myfun <- function(..., overwrite = FALSE) {
l <- setNames(list(...), sapply(substitute(list(...))[-1], deparse))
cols <- sort(unique(unlist(lapply(l, colnames), use.names = FALSE)))
out <- lapply(l, function(x) {
cols_x <- c(colnames(x), setdiff(cols, colnames(x)))
temp <- `colnames<-`(x[, match(cols_x, colnames(x))], cols_x)[, cols]
replace(temp, is.na(temp), 0)
})
if (isTRUE(overwrite)) list2env(out, envir = .GlobalEnv)
out
}
This will take any number of items as inputs, compare the columns in all of them, and add missing columns where necessary. The output is stored as a list, which is a convenient structure to keep if you want to continue doing similar operations on all of the matrices. If you want to overwrite the original object, then you can change the "overwrite" argument to TRUE.
Here's some more sample data to work with.
set.seed(1)
p1 <- c("a","b","c","e","d","a","c"); p2 <-c("a","b","x","e","e","a","x")
p3 <- sample(c(letters[1:5], "z"), 7, TRUE) # 'cols' is undefined outside myfun; levels a-e plus "z" are assumed here
p1mat <- model.matrix(~p1 + 0)
p2mat <- model.matrix(~p2 + 0)
p3mat <- model.matrix(~p3 + 0)
colnames(p1mat) <- gsub("p1","",colnames(p1mat))
colnames(p2mat) <- gsub("p2","",colnames(p2mat))
colnames(p3mat) <- gsub("p3","",colnames(p3mat))
Try the function out:
myfun(p1mat, p2mat)
myfun(p2mat, p1mat)
myfun(p3mat, p1mat)
myfun(p3mat, p1mat, p2mat)
This function takes 2 matrices, and compares their dimensions. If their dimensions differ, it inserts a new column of zeros into the matrix with fewer columns, at the exact column position that is lacking. It thus produces a new matrix with the same dimensions as the other.
match_matrices <- function(matrix1, matrix2) {
# default: return matrix2 unchanged when the column counts already match
res <- matrix2
if(ncol(matrix1) != ncol(matrix2)) {
get_cols <- function(x) { l <- list(); for(i in 1:ncol(x)) { l[i] <- list(as.numeric(x[,i])) }; return(l) }
k <- get_cols(matrix2)
# name of the single column that matrix2 lacks
odd_one_out <- setdiff(colnames(matrix1), colnames(matrix2))
insert_at <- which(colnames(matrix1) == odd_one_out)
res <- t(do.call('rbind', append(k, list(rep(0, nrow(matrix2))), insert_at-1)))
colnames(res) <- colnames(matrix1)
}
return(res)
}
Using your matrices:
match_matrices(p1mat, p2mat)

changing values in vector given a location and condition with R

i'm having trouble manipulating vectors in R. i have a vector that looks like this:
stack <- append(append(rep(0,8),c(1,0,0,0,0,1)),rep(0,6))
[1] 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
my overall goal is to manipulate the vector as follows:
*when there is a 1, make the next three values in the vector 1.
*change the original 1 to 0.
so ultimately the vector would look like:
[1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
the second part I can do by:
replace(stack,which(stack == 1),0)
but I can't figure out how to do the first one efficiently. any help would be greatly appreciated.
You can use filter here (stats::filter, which is masked if dplyr is loaded):
c(stats::filter(stack,c(0,0,0,0,1,1,1),circular=TRUE))
## [1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
Here's a possible base R option
temp <- which(stack == 1)
stack[as.vector(mapply(`:`, temp, temp + 3))] <- c(0, rep(1, 3))
stack
# [1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
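The mapply version assumes that the 1s are at least four positions apart and at least three positions from the end of the vector. A sketch that builds the target indexes explicitly and drops any that fall past the end could look like this (the names ones, idx and out are illustrative):

```r
stack <- c(rep(0, 8), 1, 0, 0, 0, 0, 1, rep(0, 6))
n <- length(stack)

ones <- which(stack == 1)

# positions 1, 2 and 3 steps after each original 1
idx <- unique(as.vector(outer(ones, 1:3, `+`)))
idx <- idx[idx <= n]  # ignore positions beyond the end of the vector

# start from all zeros so the original 1s are cleared
out <- numeric(n)
out[idx] <- 1
out
```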
I would go with regular expressions
stack <- paste0(stack, collapse="")
stack <- gsub("1.{3}", "0111", stack)
stack <- as.numeric(strsplit(stack, "")[[1]]) # split into single characters and convert back to numbers

Resources