How can I create dummy variables from a numeric variable in R?
I want to create N dummy variables such that the numeric variable gives the number of leading zeros in each row, counting from the first column; every remaining column is 1. Imagine N=6. Like this:
x
a 5
b 2
c 4
d 1
e 9
It must become:
1 2 3 4 5 6
a 0 0 0 0 0 1
b 0 0 1 1 1 1
c 0 0 0 0 1 1
d 0 1 1 1 1 1
e 0 0 0 0 0 0
Thank you!
Here's a hacky solution for you
x = c(5,2,4,1,9)
N = 6
out = matrix(1, length(x), N)
for (i in 1:length(x))
out[i,1:min(x[i], N)] = 0
> out
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 1
[2,] 0 0 1 1 1 1
[3,] 0 0 0 0 1 1
[4,] 0 1 1 1 1 1
[5,] 0 0 0 0 0 0
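One caveat with the loop above: if x can contain 0, then 1:min(x[i], N) evaluates to c(1, 0) and the first column of that row is zeroed by mistake. A minimal sketch of a guard, assuming zeros can occur in your data (the input vector below is made up for illustration and is not from the question), is to index with seq_len() instead:

```r
# Hypothetical input that includes a zero (not from the question)
x <- c(0, 2, 7)
N <- 3
out <- matrix(1, length(x), N)
for (i in seq_along(x)) {
  # seq_len(0) is empty, so a row with x[i] == 0 keeps all its 1s
  out[i, seq_len(min(x[i], N))] <- 0
}
out
```

The first row stays all 1s, the second gets two leading zeros, and the third is capped at N zeros.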
We could do this in a vectorized manner: create a matrix of 1s, build a row/column index, and assign 0 to the indexed positions
m1 <- matrix(1, ncol = N, nrow = length(x),
dimnames = list(letters[seq_along(x)], seq_len(N)))
x1 <- pmin(x, ncol(m1))
m1[cbind(rep(seq_len(nrow(m1)), x1), sequence(x1))] <- 0
m1
# 1 2 3 4 5 6
#a 0 0 0 0 0 1
#b 0 0 1 1 1 1
#c 0 0 0 0 1 1
#d 0 1 1 1 1 1
#e 0 0 0 0 0 0
data
x <- c(5,2,4,1,9)
N <- 6
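Another way to express the same rule, offered as a sketch rather than a replacement for the answers above: entry [i, j] is 1 exactly when the column number j is greater than x[i]. Under that reading, outer() can build the whole matrix in one comparison. The name m2 is only illustrative.

```r
x <- c(5, 2, 4, 1, 9)
N <- 6
# outer() compares every x[i] with every column number j;
# the unary + turns the logical matrix into 0/1
m2 <- +outer(x, seq_len(N), "<")
dimnames(m2) <- list(letters[seq_along(x)], seq_len(N))
m2
```

Values of x larger than N need no special handling here, because x[i] < j is simply FALSE for every column.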
I have a data frame in R that looks somewhat like this:
A | B
0 0
1 0
0 0
0 0
0 1
0 1
1 0
1 0
1 0
I now want to replace all sequences of more than one "1" in the columns so that only the first "1" is kept and the others are replaced by "0", so that the result looks like this
A | B
0 0
1 0
0 0
0 0
0 1
0 0
1 0
0 0
0 0
I hope you understood what I meant (English is not my mother tongue, and the R "vocabulary" in particular is a bit hard for me, which is probably why I couldn't find a solution through googling). Thank you in advance!
Try this solution:
Input data
df<-data.frame(
A=c(1,0,0,0,0,0,1,1,1,0),
B=c(1,1,0,1,0,0,1,1,0,0))
f<-function(X)
{
return(as.numeric((diff(c(0,X)))>0))
}
Your output
data.frame(lapply(df,f))
A B
1 1 1
2 0 0
3 0 0
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 0 0
10 0 0
You can use ave and create groups based on the difference of your values, so that each run of consecutive 1s or 0s forms its own group, and then replace duplicates within each group with 0, i.e.
df[] <- lapply(df, function(i)ave(i, cumsum(c(1, diff(i) != 0)),
FUN = function(i) replace(i, duplicated(i), 0)))
which gives,
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
Here's a simple one line answer:
> df * rbind(c(0,0), sapply(df, diff))
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
This takes advantage of the fact that all unwanted 1's in the original data will become 0's with the diff function.
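One thing to watch with this one-liner: because the differences are padded with a row of zeros, a column that starts with 1 loses that leading 1. The question's data starts with 0 in both columns, so it is unaffected. The sketch below uses a made-up vector to show the difference between padding the differences and padding the data before calling diff():

```r
v <- c(1, 1, 0, 1)          # made-up column that starts with a run of 1s
lost <- v * c(0, diff(v))   # differences padded with 0: the leading 1 is dropped
kept <- v * diff(c(0, v))   # data padded with 0 before diff(): the leading 1 survives
lost
kept
```

Here lost is 0 0 0 1 while kept is 1 0 0 1.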
Here is an option with rleid
library(data.table)
df1[] <- lapply(df1, function(x) +(x==1& !ave(x, rleid(x), FUN = duplicated)))
df1
# A B
#1 0 0
#2 1 0
#3 0 0
#4 0 0
#5 0 1
#6 0 0
#7 1 0
#8 0 0
#9 0 0
Here's a more explicit, function-based approach. There are shorter answers here, but it's good to see a possible implementation under the hood:
# helper function
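make_zero <- function(val)
{
  get_index <- c()
  for(i in seq_along(val))
  {
    # extend the current run of 1s, or reset it when another value is seen
    if(val[i] == 1) get_index <- c(get_index, i)
    else get_index <- c()
    # the comparison belongs inside all(): every gap between indexes must be 1
    if(all(diff(get_index) == 1))
    {
      # set every value after the first in the run to 0
      val[get_index[-1]] <- 0
    }
  }
  return (val)
}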
df <- sapply(df, make_zero)
df
A B
[1,] 0 0
[2,] 1 0
[3,] 0 0
[4,] 0 0
[5,] 0 1
[6,] 0 0
[7,] 1 0
[8,] 0 0
[9,] 0 0
Explanation:
1. We save the indexes of consecutive 1s in get_index.
2. Next, we check if the difference between indexes is 1.
3. If found, we update the value in the column.
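The same three steps can also be sketched without an explicit loop by letting rle() find the runs. This is an alternative technique, not the answer's own code, and the names first_of_runs and df_q are made up for the example; df_q reproduces the data frame from the question.

```r
first_of_runs <- function(v) {
  r <- rle(v)                                   # runs of identical values
  starts <- cumsum(r$lengths) - r$lengths + 1   # first index of every run
  out <- numeric(length(v))
  out[starts[r$values == 1]] <- 1               # keep only the start of each run of 1s
  out
}

df_q <- data.frame(A = c(0, 1, 0, 0, 0, 0, 1, 1, 1),
                   B = c(0, 0, 0, 0, 1, 1, 0, 0, 0))
res <- data.frame(lapply(df_q, first_of_runs))
res
```

Unlike sapply(), wrapping lapply() in data.frame() keeps the result a data frame.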
Hi, I'm pretty much stumped trying to figure this out and could use a little help. Basically, I have an n x n matrix where the diagonal is set to a value k and every other value is 0.
1 2 3 4 5
1 k 0 0 0 0
2 0 k 0 0 0
3 0 0 k 0 0
4 0 0 0 k 0
5 0 0 0 0 k
Basically, I need to be able to make two other diagonals in this matrix with the value of 1 so it ends up looking like this:
1 2 3 4 5
1 k 1 0 0 0
2 1 k 1 0 0
3 0 1 k 1 0
4 0 0 1 k 1
5 0 0 0 1 k
So far, all my code does is create the diagonal matrix:
m=diag(k,n,n)
but I have no idea how to add the two other diagonals. Would I use apply() and cbind() or rbind()?
You can use col and row to create an index to subset and assign the upper and lower diagonals.
k=3
m <- k* diag(6)
m[abs(row(m) - col(m)) == 1] <- 1
m
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 3 1 0 0 0 0
#[2,] 1 3 1 0 0 0
#[3,] 0 1 3 1 0 0
#[4,] 0 0 1 3 1 0
#[5,] 0 0 0 1 3 1
#[6,] 0 0 0 0 1 3
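If you wanted only one of the two diagonals, you could drop abs(): row(m) - col(m) == 1 selects the diagonal below the main one, and col(m) - row(m) == 1 the one above it.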
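For a general n and k, the row()/col() indexing can be wrapped in a small helper. This is a sketch only; the function name tridiag is made up here and is not part of base R.

```r
# Hypothetical helper: k on the main diagonal, 1 on the two neighbouring diagonals
tridiag <- function(n, k) {
  m <- diag(k, n, n)
  # cells exactly one step away from the main diagonal
  m[abs(row(m) - col(m)) == 1] <- 1
  m
}

m5 <- tridiag(5, 3)
m5
```

Because the index is computed from the matrix itself, the same function works for any n without looping over rows.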
Try this function. It builds a row x col matrix with the value diag on the main diagonal and 1s on the two adjacent diagonals. Pass diag explicitly unless n is already defined in your workspace.
matfun <- function(diag=n, row=4,col=4){
x = diag(1,row,col)
diag*x+rbind(as.vector(rep(0,col)),x[1:(row-1),])+cbind(as.vector(rep(0,row)),x[,1:(col-1)])
}
HTH
I have data from a barter economy. I am trying to create a matrix that counts how frequently items act as counterparties with other items.
As an example:
myDat <- data.frame(
TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8)),
Origin = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0)),
ItemID = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4))
)
TradeID Origin ItemID
1 1 1 1
2 1 0 2
3 1 0 3
4 2 1 4
5 2 1 5
6 2 0 1
7 3 1 1
8 3 0 6
9 4 1 7
10 4 0 1
11 5 1 1
12 5 0 8
13 6 1 7
14 6 0 5
15 7 1 1
16 7 0 1
17 8 1 2
18 8 0 3
19 8 0 4
20 9 1 1
21 9 0 8
Here TradeID indicates a specific transaction, ItemID indicates an item, and Origin indicates which direction the item went.
Given my data, the matrix I'd create would have one row and one column per ItemID, with each cell counting the trades in which the two items were counterparties.
For example, the value 2 at [1,8] indicates that items 1 and 8 were counterparties in two trades. (Note that it's a symmetric matrix, so [8,1] also has the value 2.)
The value of 1 at [1,2] indicates that items 1 and 2 were counterparties in only one trade (all the other 1s throughout the matrix indicate the same).
As an odd example, the value of 1 at [1,1] indicates that item 1 was a counterparty to itself once (trade number 7).
A little extra insight into my motivation: note that in my simple example, item 1 tends to act as counterparty with many different items. In a barter economy (one without explicit money) we might expect a commodity currency to be a counterparty relatively more frequently than other items. A matrix like this would be the first step in one way of discovering which item was a commodity currency.
I've been struggling with this for a while. But I think I'm nearly done with an overly complicated solution, which I'll post shortly.
I'm curious if y'all might offer a bit of help also.
Alright, I think I've got this figured out. The short answer is:
Reduce("+",by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1)))
Which gives the following matrix, matching the desired result:
1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
Here's the long answer. You can get a list of matrices, one per TradeID, using the by, outer (%o%) and table functions. But this double-counts Trade 7, where item 1 is traded for item 1, so I use the pmin function to cap each cell at 1. Then I sum across the list using the Reduce function.
And here are the steps to get there. Note the addition of TradeID # 9, which was left out of the question's code.
# Data
myDat <- data.frame(
TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,9,9)),
Origin = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0)),
ItemID = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4,1,8))
)
# Sum in 1 direction
by(myDat, myDat$TradeID, function(x) table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]))
# Sum in both directions
by(myDat, myDat$TradeID, function(x) table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]) + table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]))
# Remove double-count in trade 7
by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1))
# Sum across lists
Reduce("+",by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1)))
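To see why the cap is needed, it can help to look at trade 7 on its own. The sketch below rebuilds just that trade by hand (the variable names are illustrative) and shows the cell [1,1] before and after pmin():

```r
# Trade 7 only: item 1 appears on both sides of the trade
items  <- factor(c(1, 1), levels = 1:8)
origin <- c(1, 0)

given    <- table(items[origin == 1])
received <- table(items[origin == 0])

# Summing both directions counts the self-trade once per direction
both   <- received %o% given + given %o% received
capped <- pmin(both, 1)

both[1, 1]     # 2
capped[1, 1]   # 1
```

For trades between two different items the two outer products fill different cells, so the cap changes nothing there.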
One way to speed this up would be to sum in only 1 direction (taking advantage of symmetry) and then clean up the results.
result = Reduce("+",by(myDat, myDat$TradeID, function(x) table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1])))
result2 = result + t(result)
diag(result2) = diag(result)
result2
1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
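This appears to run nearly twice as fast: in the benchmark below, the median drops from about 8.0 ms to about 4.3 ms. Note that the second timing covers only the one-direction sum, not the transpose and diagonal fix-up.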
> microbenchmark(Reduce("+",by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1))))
Unit: milliseconds
min lq median uq max neval
7.489092 7.733382 7.955861 8.536359 9.83216 100
> microbenchmark(Reduce("+",by(myDat, myDat$TradeID, function(x) table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]))))
Unit: milliseconds
min lq median uq max neval
4.023964 4.18819 4.277767 4.452824 5.801171 100
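Speed aside, it is worth confirming that the shortcut reproduces the pmin version. A minimal check on the sample data (including trade 9) might look like the following; one_way, full, half and fast are names made up for this sketch.

```r
myDat <- data.frame(
  TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,9,9)),
  Origin  = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0)),
  ItemID  = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4,1,8))
)

# Counts for one trade in one direction only
one_way <- function(x) table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1])

# Both directions per trade, capped at 1, then summed over trades
full <- Reduce("+", by(myDat, myDat$TradeID,
                       function(x) pmin(one_way(x) + t(one_way(x)), 1)))

# One direction summed over trades, then mirrored with the diagonal restored
half <- Reduce("+", by(myDat, myDat$TradeID, one_way))
fast <- half + t(half)
diag(fast) <- diag(half)

all(full == fast)
```

The two agree on this data. They could differ if a single trade listed the same pair of items in both directions, since only the pmin version caps each trade at 1.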
This will give you the number of observations per TradeID and ItemID
myDat <- data.frame(
TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8)),
Origin = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0)),
ItemID = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4))
)
result = tapply(myDat$Origin, list(myDat$ItemID,myDat$TradeID), length)
result[is.na(result)] = 0
result["1","7"]
result will then be:
> result
1 2 3 4 5 6 7 8
1 1 1 1 1 1 0 2 0
2 1 0 0 0 0 0 0 1
3 1 0 0 0 0 0 0 1
4 0 1 0 0 0 0 0 1
5 0 1 0 0 0 1 0 0
6 0 0 1 0 0 0 0 0
7 0 0 0 1 0 1 0 0
8 0 0 0 0 1 0 0 0
This will give you, for each ItemID and TradeID, the proportion of observations with Origin equal to 1
result = tapply(myDat$Origin, list(myDat$ItemID,myDat$TradeID), function(x) { sum(as.numeric(as.character(x)))/length(x) })
You can set the NA values in the last matrix to 0 using result[is.na(result)] = 0, but that would make cells with no observations indistinguishable from cells where every observation has Origin 0.
This will count how often two ItemIDs appear on consecutive rows within the same TradeID:
idxList <- with(myDat, tapply(ItemID, TradeID, FUN = function(items)
lapply(seq(length(items) - 1),
function(i) sort(c(items[i], items[i + 1])))))
# indices of observations
idx <- do.call(rbind, unlist(idxList, recursive = FALSE))
# create a matrix
ids <- unique(myDat$ItemID)
mat <- matrix(0, length(ids), length(ids))
# place values in matrix
for (i in seq(nrow(idx))) {
mat[idx[i, , drop = FALSE]] <- mat[idx[i, , drop = FALSE]] + 1
}
# create symmetric matrix
mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 1 0 0 1 1 1 1
[2,] 1 0 2 0 0 0 0 0
[3,] 0 2 0 1 0 0 0 0
[4,] 0 0 1 0 1 0 0 0
[5,] 1 0 0 1 0 0 1 0
[6,] 1 0 0 0 0 0 0 0
[7,] 1 0 0 0 1 0 0 0
[8,] 1 0 0 0 0 0 0 0
Is there a native R function that will take an input vector and return the corresponding binary matrix where the matrix has the same number of columns as unique values in the input vector?
For example, given x <- 1:3, I want to return the following matrix:
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
The function contrasts gets close, but I can't seem to get around the n-1 columns it returns:
> contrasts(as.factor(x))
2 3
1 0 0
2 1 0
3 0 1
Actually, contrasts is what you want.
contrasts(as.factor(1:3), contrasts=FALSE)
1 2 3
1 1 0 0
2 0 1 0
3 0 0 1
model.matrix() might help here, but you need to suppress the intercept:
> model.matrix(~ factor(1:3) - 1)
factor(1:3)1 factor(1:3)2 factor(1:3)3
1 1 0 0
2 0 1 0
3 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(1:3)`
[1] "contr.treatment"
Something slightly more complex:
> set.seed(1)
> fac <- factor(sample(1:3, 10, replace = TRUE))
> model.matrix(~ fac - 1)
fac1 fac2 fac3
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 0 1
8 0 1 0
9 0 1 0
10 1 0 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$fac
[1] "contr.treatment"
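If a plain matrix without the assign and contrasts attributes is preferred, one further base R option is to compare the vector against its unique values with outer(). This is a sketch with a made-up input vector, and the names lev and onehot are illustrative:

```r
x <- c(2, 1, 3, 2)                # made-up input, not from the question
lev <- sort(unique(x))            # one column per distinct value
# TRUE where x[i] equals the column's value; unary + converts to 0/1
onehot <- +outer(x, lev, "==")
colnames(onehot) <- lev
onehot
```

Each row contains a single 1, in the column that matches that element of x.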