Adding columns to data.frame in R

Having 2 vectors like the following:
vec1<-c("x", "y")
vec2<-c(rep(0, 5))
I would like to create a data.frame object DF where vec1 becomes the first column and vec2 supplies the values of each row, with the remaining column names as well. Visually, it would look like this:
vec1 1 2 3 4 5
x 0 0 0 0 0
y 0 0 0 0 0
I have tried the following code, but it adds both vectors as columns:
DF<-data.frame(vec1, vec2)

Instead of generating a vector for your rows, you can generate a whole matrix, and then use data.frame to bind it to your first vector. Something like this:
mat <- matrix(0, nrow=2, ncol=5)
vec <- c("x","y")
data.frame(vec, mat)
Which gives:
vec X1 X2 X3 X4 X5
1 x 0 0 0 0 0
2 y 0 0 0 0 0
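If the plain numeric headers 1..5 from your sketch are wanted, one option (not required, just a possible finishing touch) is to rename the columns afterwards:
DF <- data.frame(vec1 = vec, mat)
colnames(DF)[-1] <- 1:5   # replace the generated X1..X5 names
DF
#   vec1 1 2 3 4 5
# 1    x 0 0 0 0 0
# 2    y 0 0 0 0 0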

You can use rbind() inside the data.frame() call to put the vec2 values in both rows of the new data frame.
vec1<-c("x", "y")
vec2<-c(rep(0, 5))
data.frame(vec1,rbind(vec2,vec2))
vec1 X1 X2 X3 X4 X5
1 x 0 0 0 0 0
2 y 0 0 0 0 0
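A possible variation if vec1 had more than two entries, so that writing rbind(vec2, vec2) by hand would not scale (just a sketch):
# repeat vec2 once per element of vec1
data.frame(vec1, t(replicate(length(vec1), vec2)))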

Related

Create a new variable based on any 2 conditions being true

I have a dataframe in R with 4 variables and would like to create a new variable based on any 2 conditions being true on those variables.
I have attempted to create it via if/else statements, but that would require a permutation for every combination of variable conditions being true. I would also need to scale this up so I can create a new variable based on any 3 conditions being true. Is there a more efficient method than if/else statements?
My example:
I have a dataframe X with following column variables
x1 = c(1,0,1,0)
X2 = c(0,0,0,0)
X3 = c(1,1,0,0)
X4 = c(0,0,1,0)
I would like to create a new variable X5 that is 1 if any 2 of the variables are true (e.g. == 1).
Based on the above dataframe, the new variable X5 would be (1, 0, 1, 0).
This can easily be done by using the apply function:
x1 = c(1,0,1,0)
x2 = c(0,0,0,0)
x3 = c(1,1,0,0)
x4 = c(0,0,1,0)
df <- data.frame(x1,x2,x3,x4)
df$x5 <- apply(df,1,function(row) ifelse(sum(row != 0) == 2, 1, 0))
x1 x2 x3 x4 x5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
apply with MARGIN = 1 means: apply this function to every row. To scale this up to 3...N true values, just change the number in the ifelse statement.
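For example, a hypothetical new column x6 flagging rows with exactly three non-zero values among the original four columns might look like this (a sketch; use >= instead of == if "at least three" is meant):
df$x6 <- apply(df[, c("x1", "x2", "x3", "x4")], 1, function(row) ifelse(sum(row != 0) == 3, 1, 0))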
You can try this:
#Data
df <- data.frame(x1,X2,X3,X4)
#Code
df$X5 <- ifelse(rowSums(df,na.rm=T)==2,1,0)
x1 X2 X3 X4 X5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
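If "any 2" is meant as "at least 2" rather than "exactly 2", the comparison can be relaxed; a small sketch, restricted to the four indicator columns so the new X5 column is not counted:
df$X5 <- ifelse(rowSums(df[, c("x1", "X2", "X3", "X4")], na.rm = TRUE) >= 2, 1, 0)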
You can use:
df$X5 <- 1*(apply(df == 1, 1, sum) == 2)
or
df$X5 <- 1*(mapply(sum, df$x1, df$X2, df$X3, df$X4) == 2)
Output
> df
  x1 X2 X3 X4 X5
1  1  0  1  0  1
2  0  0  1  0  0
3  1  0  0  1  1
4  0  0  0  0  0
Data
df <- data.frame(x1, X2, X3, X4)

How to count how many conditions an observation meets using R?

I have a data set with lots of binary variables, all with values 0/1. I want to create a new column that counts them: 1 if the observation has a 1 in one binary variable, 2 if it has 1s in two binary variables, and so on.
Such as:
x1 x2 x3 x4 x5
1 1 1 0 1
0 0 1 0 0
0 0 0 0 0
I want to have
x1 x2 x3 x4 x5 count
1 1 1 0 1 4
0 0 1 0 0 1
0 0 0 0 0 0
If your dataset contains only the binary variables you are interested in, you can use
df$count <- rowSums(df)
Otherwise, please provide a more detailed description of your data.
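For instance, if the data frame also contained other columns, the sum could be restricted to the indicator columns (a sketch; binary_cols and its names are assumed from the example above):
binary_cols <- c("x1", "x2", "x3", "x4", "x5")  # assumed indicator columns
df$count <- rowSums(df[binary_cols])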
Another option is Reduce with +
df$count <- Reduce(`+`, df)

Convert long table of linked observations to wide adjacency matrix [duplicate]

This question already has an answer here:
How to convert two factors to adjacency matrix in R?
(1 answer)
Closed 4 years ago.
I am facing a challenge that I cannot manage to solve. I have a list of observations x_i (the dimension is large, around 30k) and a list of observations y_j (also large). x_i and y_j are IDs of the same kind of units (say firms).
I have a dataframe of two columns that links x_i and y_j: if they appear on the same line, they are connected. What I would like is to convert this network into a large square matrix M of dimension length(unique(union(x, y))) that takes the value 1 when two firms are connected.
Here is an example in small dimensions:
x1 x2
x3 x6
x4 x5
x1 x5
What I would like is a matrix:
0 1 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 0
Right now, the only solution I could think of is a double loop combined with a search in the initial dataframe:
list_firm <- sort(unique(c(df$col1, df$col2)))
M <- matrix(0, nrow = length(list_firm), ncol = length(list_firm),
            dimnames = list(list_firm, list_firm))
for (i in list_firm) {
  for (j in list_firm) {
    M[i, j] <- any(df$col1 == i & df$col2 == j)
  }
}
Where df is the two-column data frame. This is obviously much too slow to run.
Any suggestion? This would be very welcome
We convert the columns to factor with levels specified as the unique elements of both columns, and get the frequencies with table:
lvls <- sort(unique(unlist(df)))
df[] <- lapply(df, factor, levels = lvls)
table(df)
# col2
#col1 x1 x2 x3 x4 x5 x6
# x1 0 1 0 0 1 0
# x2 0 0 0 0 0 0
# x3 0 0 0 0 0 1
# x4 0 0 0 0 1 0
# x5 0 0 0 0 0 0
# x6 0 0 0 0 0 0
data
df <- structure(list(col1 = c("x1", "x3", "x4", "x1"), col2 = c("x2",
"x6", "x5", "x5")), class = "data.frame", row.names = c(NA, -4L
))
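If a plain matrix M is needed rather than a table object, the result can be converted, for instance (tab is just a helper name here):
tab <- table(df)
M <- matrix(tab, nrow = nrow(tab), dimnames = dimnames(tab))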
The answer provided by @akrun in the comments is a good one. However, this is a good scenario for taking advantage of a different data structure than data frames. What you're looking for is an adjacency matrix, a standard data structure in (social) network analysis. To achieve this, we can use the igraph package in R.
library(igraph)
library(dplyr)
df = data_frame(source=c('x1', 'x3', 'x4', 'x1'), target=c('x2', 'x6', 'x5', 'x5'))
g = graph_from_data_frame(df, directed=FALSE)
output = as.matrix(get.adjacency(g))
x1 x3 x4 x2 x6 x5
x1 0 0 0 1 0 1
x3 0 0 0 0 1 0
x4 0 0 0 0 0 1
x2 1 0 0 0 0 0
x6 0 1 0 0 0 0
x5 1 0 1 0 0 0
The output columns aren't in the exact order as your example, but this is a trivial problem to solve if needed.
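If a fixed ordering is wanted, the rows and columns can simply be reindexed by name; a small sketch assuming the vertex names x1..x6 from the example:
ord <- paste0("x", 1:6)   # desired order, assumed from the example
output <- output[ord, ord]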

extract rows for which first non-zero element is one

I would like to extract every row from the data frame my.data for which the first non-zero element is a 1.
my.data <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
0 2 1 1
2 1 2 1
1 1 1 2
0 0 0 0
0 1 0 0
', header = TRUE)
my.data
desired.result <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
1 1 1 2
0 1 0 0
', header = TRUE)
desired.result
I am not even sure where to begin. Sorry if this is a duplicate. Thank you for any suggestions or advice.
Here's one approach:
# index of rows
idx <- apply(my.data, 1, function(x) any(x) && x[as.logical(x)][1] == 1)
# extract rows
desired.result <- my.data[idx, ]
The result:
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
Probably not the best answer, but:
rows.to.extract <- apply(my.data, 1, function(x) {
  no.zeroes <- x[x != 0]          # remove the zeros
  to.return <- no.zeroes[1] == 1  # is the first non-zero value a 1?
  # if a row is all 0, no.zeroes[1] is NA and so is to.return;
  # treat those rows as FALSE
  to.return[is.na(to.return)] <- FALSE
  to.return
})
my.data[rows.to.extract, ]
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
Use apply to iterate over all rows:
first.element.is.one <- apply(my.data, 1, function(x) x[x != 0][1] == 1)
The function passed to apply compares the first ([1]) non-zero (x != 0) element of x to 1. It is called once for each row; x will be a vector of length four in your example.
Use which to extract the indices of the candidate rows (and remove NA values, too):
desired.rows <- which(first.element.is.one)
Select those rows of the data frame -- you probably know how to do this.
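For completeness, that last step might look like:
my.data[desired.rows, ]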
Bonus question: Where do the NA values mentioned in step 2 come from?

Join and sum not compatible matrices

My goal is to "sum" two incompatible matrices (matrices with different dimensions), using (and preserving) row and column names.
I've come up with this approach: convert the matrices to data.table objects, join them, and then sum the column vectors.
An example:
> M1
1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
> M2
1 3 4 5 8
1 0 0 1 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
8 0 0 0 0 0
> M1 %ms% M2
1 3 4 5 7 8
1 0 0 2 0 0 0
3 0 0 0 0 0 0
4 2 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
This is my code:
M1 <- matrix(c(0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0), byrow = TRUE, ncol = 6)
colnames(M1) <- c(1,3,4,5,7,8)
M2 <- matrix(c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), byrow = TRUE, ncol = 5)
colnames(M2) <- c(1,3,4,5,8)
# to data.table objects
DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn")
DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn")
# join and sum of common columns
if (nrow(DT1) > nrow(DT2)) {
  A <- DT2[DT1, roll = TRUE]
  A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn]
}
That outputs:
rn X1 X3 X4 X5 X7 X8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
Then I can convert this data.table back to a matrix and fix the row and column names.
The questions are:
How can I generalize this procedure?
I need a way to automatically create list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1) because I want to apply this function to matrices whose dimensions (and row/column names) are not known in advance.
In summary I need a merge procedure that behaves as described.
Are there other strategies/implementations that achieve the same goal while being faster and more general? (I'm hoping some data.table guru can help me.)
What kind of join (inner, outer, etc.) does this procedure correspond to?
Thanks in advance.
p.s.: I'm using data.table version 1.8.2
EDIT - SOLUTIONS
@Aaron's solution. No external libraries, only base R. It also works with more than two matrices.
add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows, cols))
  for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m
  out
}
@MadScone's solution. Uses the reshape2 package. It works on only two matrices per call.
add_matrices_2 <- function(m1, m2) {
  m <- acast(rbind(melt(m1), melt(m2)), Var1 ~ Var2, fun.aggregate = sum)
  mn <- unique(c(colnames(m1), colnames(m2)))
  rownames(m) <- mn
  colnames(m) <- mn
  m
}
@Aaron's solution. Uses the Matrix package. It works only on sparse matrices, and also with more than two of them.
add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(
      i = ifelse(ilj, i, j),
      j = ifelse(ilj, j, i),
      x = s$x,
      dims = c(nrows, ncols),
      dimnames = list(rows, cols),
      symmetric = TRUE
    )
  })
  Reduce(`+`, newms)
}
BENCHMARK (100 runs with microbenchmark package)
Unit: microseconds
expr min lq median uq max
1 add_matrices_1 196.009 257.5865 282.027 291.2735 549.397
2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448
No comment needed on the benchmark: @Aaron's solution wins.
Details
For insights about performance (which depends on the size and sparsity of the matrices), see @Aaron's edit (and the solution for sparse matrices: add_matrices_3).
I'd just line up the names and go to town with base R.
Here's a simple function that takes an unspecified number of matrices and adds them up by their row/column names.
add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim=c(length(rows), length(cols)), dimnames=list(rows,cols))
  for(M in a) { out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M }
  out
}
It then works like this:
# giving them rownames and colnames
colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)
add_matrices_1(M1, M2)
# 1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0
For bigger matrices, however, it doesn't do as well. Here's a function to make a matrix, choosing n columns out of N possibilities, and filling k spots with non-zero values. (This assumes symmetrical matrices.)
makeM <- function(N, n, k) {
  s1 <- sample(N, n)
  M1 <- array(0, dim=c(n,n), dimnames=list(s1, s1))
  r1 <- sample(n, k, replace=TRUE)
  c1 <- sample(n, k, replace=TRUE)
  M1[cbind(c(r1,c1), c(c1,r1))] <- sample(N, k)
  M1
}
Then here's another version that uses sparse matrices.
add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(i = ifelse(ilj, i, j),
                 j = ifelse(ilj, j, i),
                 x = s$x,
                 dims = c(nrows, ncols),
                 dimnames = list(rows, cols), symmetric = TRUE)
  })
  Reduce(`+`, newms)
}
This version is definitely faster when the matrices are large and sparse. (Note that I'm not timing the conversion to a sparse symmetric matrix, as hopefully if that's a suitable format, you'll use that format throughout your code.)
set.seed(50)
M1 <- makeM(10000, 5000, 50)
M2 <- makeM(10000, 5000, 50)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
system.time(add_matrices_1(M1, M2))
# user system elapsed
# 2.987 0.841 4.133
system.time(add_matrices_3(mm1, mm2))
# user system elapsed
# 0.042 0.012 0.504
But when the matrices are small, my first solution is still faster.
set.seed(50)
M1 <- makeM(100, 50, 20)
M2 <- makeM(100, 50, 20)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
microbenchmark(add_matrices_1(M1, M2), add_matrices_3(mm1, mm2))
# Unit: microseconds
# expr min lq median uq max
# 1 add_matrices_1(M1, M2) 398.495 406.543 423.825 544.0905 43077.27
# 2 add_matrices_3(mm1, mm2) 5734.623 5937.473 6044.007 6286.6675 509584.24
Moral of the story: Size and sparsity matter.
Also, getting it right is more important than saving a few microseconds. It's almost always best to use simple functions and not worry about speed until you run into trouble. So in small cases, I'd prefer MadScone's solution, as it's easy to code and simple to understand. When that gets slow, I'd write a function like my first attempt. When that gets slow, I'd write a function like my second attempt.
Here is a data.table solution. The magic is to add the .SD components (which have identical names in both) then assign the remaining column by reference.
# a function to quickly get the non key columns
nonkey <- function(DT){ setdiff(names(DT),key(DT))}
# the columns in DT1 only
notinR <- setdiff(nonkey(DT1), nonkey(DT2))
#calculate; .. means "up one level"
result <- DT2[DT1, .SD + .SD, roll= TRUE][,notinR := unclass(DT1[, ..notinR])]
# re set the column order to the original (DT1) order
setcolorder(result, names(DT1))
# voila!
result
rn 1 3 4 5 7 8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
I'm not convinced this is a particularly stable solution, given that I'm not sure it isn't fluking the answer because M1 and M2 are subsets of each other.
Edit, an ugly approach using eval
This is made harder because you have non-syntactic names (`1`, etc.).
inBoth <- intersect(nonkey(DT1), nonkey(DT2))
backquote <- function(x){paste0('`', x, '`')}
bqBoth <- backquote(inBoth)
charexp <- sprintf('list(%s)',paste(c(paste0( bqBoth,'=', bqBoth, '+ i.',inBoth), backquote(notinR)), collapse = ','))
result2 <- DT2[DT1,eval(parse(text = charexp)), roll = TRUE]
setcolorder(result2, names(DT1))
# voila!
result2
rn 1 3 4 5 7 8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
I think I managed to do it with this single disgusting line:
cast(aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum), X1 ~ X2)[,-1]
This makes use of the reshape package. The result is returned as a data frame, so convert it to a matrix as necessary.
If you want it in the format you suggested in your example, try this:
"%ms%" <- function(m1, m2) {
  m <- as.matrix(cast(aggregate(value ~ X1 + X2, rbind(melt(m1), melt(m2)), sum), X1 ~ X2)[,-1])
  mn <- unique(c(colnames(m1), colnames(m2)))
  rownames(m) <- mn
  colnames(m) <- mn
  return(m)
}
Then you can do:
M1 %ms% M2
EDIT:
EXPLANATION
Obviously this should have some explanation, sorry.
melt(M1)
Converts M1 from its original form into a format like this (row, col, value). E.g.
1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
Is converted to:
X1 X2 value
1 1 1 0
2 3 1 0
3 4 1 1
etc. Combining melt(M1) and melt(M2) with rbind() lists every (row, col, value) triple from both matrices in one data frame. Now this:
aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum)
Sums the values wherever the row and column labels are the same. So it will sum (1, 1) across both matrices, for example, and (3, 1), etc. It does nothing for combinations that don't exist, e.g. M2 has no row/column 7.
Finally, cast reshapes the result so that aggregate's first column becomes the rows and its second column becomes the columns, effectively undoing the earlier melt. The [,-1] drops an unnecessary leftover column from cast (there is probably a better way of doing that, but I don't know it).
As I said, it's returned as a data frame so use as.matrix() on the result if that's what you wish.
