Insert nonexistent columns in matrix or dataframe in given order - r

I am on the lookout for a function in R that would check for the presence of particular columns, e.g.
cols=c("a","b","c","d")
in a matrix or data frame and insert a column of NAs for any that do not exist, in the position given by the vector cols. For example, given a matrix or data frame with named columns "a" and "d", it would insert columns "b" and "c" filled with NAs before column "d", and delete any columns not listed in cols (e.g. column "e"). What would be the easiest and fastest way to achieve this (I am dealing with a fairly large dataset of about 1 million rows)? Or is there already a function that does this?

I would separate the creation step and the ordering step. Here is an example:
cols <- letters[1:4]
## initialize test data set
my.df <- data.frame(a = rnorm(100), d = rnorm(100), e = rnorm(100))
## exclude columns not in cols
my.df <- my.df[ , colnames(my.df) %in% cols, drop = FALSE]  # drop = FALSE keeps a data frame even if only one column matches
## add missing columns filled with NA
my.df[, cols[!(cols %in% colnames(my.df))]] <- NA
## reorder
my.df <- my.df[, cols]

Another approach I just discovered uses match, but it only works for matrices:
# original matrix
matrix=cbind(a = 1:2, d = 3:4)
# required columns
coln=c("a","b","c","d")
colnmatrix=colnames(matrix)
matrix=matrix[,match(coln,colnmatrix)]
colnames(matrix)=coln
matrix
# a b c d
# [1,] 1 NA NA 3
# [2,] 2 NA NA 4

Another possibility, if your data is in a matrix:
# original matrix
m1 <- cbind(a = 1:2, d = 3:4)
m1
# a d
# [1,] 1 3
# [2,] 2 4
# matrix with all columns, filled with NA
all.cols <- letters[1:4]
m2 <- matrix(nrow = nrow(m1), ncol = length(all.cols), dimnames = list(NULL, all.cols))
m2
# a b c d
# [1,] NA NA NA NA
# [2,] NA NA NA NA
# replace columns in 'NA matrix' with values from original matrix
m2[ , colnames(m1)] <- m1
m2
# a b c d
# [1,] 1 NA NA 3
# [2,] 2 NA NA 4
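The answers above can be folded into a single helper that accepts either a matrix or a data frame. This is only a sketch: the name conform_cols is made up for illustration, and I have not benchmarked it on a million rows.

```r
# Hypothetical helper (illustrative name): keep only the columns listed in
# `cols`, in that order, adding all-NA columns for any that are missing.
conform_cols <- function(x, cols) {
  if (is.matrix(x)) {
    # match() returns NA for missing names; an NA column index on a
    # matrix yields a column of NAs
    out <- x[, match(cols, colnames(x)), drop = FALSE]
    colnames(out) <- cols
    return(out)
  }
  # data frame: drop unwanted columns, add missing ones, then reorder
  out <- x[, colnames(x) %in% cols, drop = FALSE]
  for (nm in setdiff(cols, names(out))) out[[nm]] <- NA
  out[, cols, drop = FALSE]
}

cols <- c("a", "b", "c", "d")
m  <- cbind(a = 1:2, d = 3:4, e = 5:6)
df <- data.frame(a = 1:2, d = 3:4, e = 5:6)
res_m <- conform_cols(m, cols)
res_d <- conform_cols(df, cols)
```

Both results have exactly the columns a, b, c, d, with b and c filled with NA and e removed.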

Related

Combine/match/merge vectors by row names

I have a vector of variable names and several single-column matrices.
I want to create a new matrix by matching/merging the row names of those single-column matrices.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several matrices with a single column
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_2) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but to obtain exactly the desired output I used:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5
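If you want to stay with matrices and skip the merge/factor/order steps, indexing each matrix with match against Complete_names is an alternative. This is a sketch rather than one of the posted answers; it assumes that Matrix_2 carries the row names "A", "B", "C" (as the desired output implies) and that every matrix has a single column.

```r
Complete_names <- c("D", "C", "A", "B")
Matrix_1 <- matrix(c(1, 2, 3), 3, 1, dimnames = list(c("D", "C", "B"), NULL))
Matrix_2 <- matrix(c(4, 5, 6), 3, 1, dimnames = list(c("A", "B", "C"), NULL))

# For each matrix, pick the rows in the order of Complete_names;
# names that are absent give an NA row index and therefore an NA value.
aligned <- sapply(list(Matrix_1, Matrix_2), function(m) {
  unname(m[match(Complete_names, rownames(m)), 1])
})
rownames(aligned) <- Complete_names
aligned
```

The list passed to sapply can hold any number of single-column matrices; each one becomes a column of the result.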

How to divide each column by the following one in a data frame

I tried to create a function that divides every column by the following one in a data frame. For example, if I have a data frame like this:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
I would like to create a function that divides col1 by col2, col3 by col4, ..., col(n-1) by col(n) through to the end of the data frame, and prints a data frame that binds all the outputs together.
I wrote a function that divides a column by the following one, but it does not loop over the columns.
bigfunction<-function(data,n){
n<-1
data[,1]
data[,n+1]
d<-(data[,n]/data[,n+1])
print(as.list(d))}
Vectorise that calculation!
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
df[c(1,3)]/df[c(2,4)]
# a c
#1 0.25 0.7000000
#2 0.40 0.7272727
#3 0.50 0.7500000
divdf <- function(data) {
data[seq(1,ncol(data),2)]/data[seq(2,ncol(data),2)]
}
divdf(df)
# a c
#1 0.25 0.7000000
#2 0.40 0.7272727
#3 0.50 0.7500000
You could add some further error checking to this to make sure you always have an even number of columns etc, but this is the basic logic that you can add to.
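As a rough sketch of what that error checking might look like (the name divdf_checked and the choice to warn and drop a trailing unpaired column are my own assumptions, not part of the answer above):

```r
# Hypothetical variant of divdf() with basic input checks.
divdf_checked <- function(data) {
  stopifnot(is.data.frame(data), ncol(data) >= 2)
  if (ncol(data) %% 2 != 0) {
    # assumption: an unpaired last column is ignored rather than an error
    warning("odd number of columns; the last column is ignored")
    data <- data[-ncol(data)]
  }
  odd  <- seq(1, ncol(data), by = 2)   # numerators: columns 1, 3, 5, ...
  even <- seq(2, ncol(data), by = 2)   # denominators: columns 2, 4, 6, ...
  data[odd] / data[even]
}

df5 <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)
res5 <- suppressWarnings(divdf_checked(df5))
```

With five columns, e has no partner, so the result contains only a/b and c/d.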
You could try something like this:
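fun1 <- function(df){
  # start from an id column so there is something to cbind onto
  temp_df <- data.frame(id = seq_len(nrow(df)))
  for (i in seq_len(ncol(df))){
    if (i %% 2 == 1) next          # skip odd-numbered columns
    temp <- df[, i-1] / df[, i]    # previous column divided by the current one
    temp_df <- cbind(temp_df, temp)
  }
  return(temp_df)
}
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
new_df <- fun1(df)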
I have created a temporary dataframe to keep cbinding the vectors. You can remove the id column later on and change the column names as per requirement. This assumes that you have an even number of columns (if not then it will simply ignore the last one)

merge two vectors using one as the header for the other

I have two vectors.
x <- c("a","b","c")
y <- c(NA, 1, NA)
I want to combine to get the following where x is the column heading:
a b c
NA 1 NA
You can convert the transpose of y to a data frame and then use x to assign the column names.
df <- data.frame(t(y))
names(df) <- x
df
a b c
1 NA 1 NA

Return Column Names when True in R

I am using R for a project and I have a data frame in the following format:
A B C
1 1 0 0
2 0 1 1
I want to return a data frame that gives the Column Name when the value is 1.
i.e.
Impair1 Impair2
1 A NA
2 B C
Is there a way to do this for thousands of records? The max impairment number is 4.
Note: There are more than 3 columns. Only 3 were listed to make it easier.
You could loop through the rows of your data, returning the column names where the data is set with an appropriate number of NA values padded at the end:
`colnames<-`(t(apply(dat == 1, 1, function(x) c(colnames(dat)[x], rep(NA, 4-sum(x))))),
paste0("Impair", 1:4))
# Impair1 Impair2 Impair3 Impair4
# 1 "A" NA NA NA
# 2 "B" "C" NA NA
Using the apply family of functions, here is a general solution that should work for your larger dataset:
res <- apply(df, 1, function(x) {
out <- character(4) # create a 4-length vector of empty strings
tmp <- colnames(df)[which(x==1)] # store the matching column names in a tmp field
out[seq_along(tmp)] <- tmp # overwrite the relevant positions (safe when tmp is empty)
out
})
# transpose and turn it into a data.frame
data.frame(t(res))
# X1 X2 X3 X4
# 1 A
# 2 B C
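Neither answer defines its input, so here is a self-contained sketch that pads with NA (as the question asks) by setting the length of each result. The data frame dat, the extra all-zero row, and max_n are assumptions added for illustration.

```r
# Example data: the two rows from the question plus an all-zero row
dat <- data.frame(A = c(1, 0, 0), B = c(0, 1, 0), C = c(0, 1, 0))
max_n <- 4   # maximum number of impairments per record

res <- t(apply(dat == 1, 1, function(hit) {
  nm <- colnames(dat)[hit]   # names of the columns equal to 1
  length(nm) <- max_n        # extending a vector pads it with NA
  nm
}))
colnames(res) <- paste0("Impair", seq_len(max_n))
out <- as.data.frame(res, stringsAsFactors = FALSE)
out
```

A row with no 1s at all comes back as four NAs instead of raising an error.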

R: Mean of subsets of dataframe on both row and column labels

Let's say I have:
set.seed(42)
d = data.frame(replicate(6,rnorm(10)))
col_labels = c("a", "a", "b", "b", "c", "c")
row_labels = c(1,1,1,2,2,3,3,4,4,4)
I now want to calculate the mean value of a subset of d corresponding to each combination of col_labels and row_labels, ie:
s = subset(d, row_labels==1, select=col_labels=="a")
s_mean = mean(as.matrix(s))
In the end, I would like a dataframe, with rows corresponding to row_labels and columns corresponding to col_labels and values the mean value of the subset. How do I do this without a large number of for-loops?
Here's another option:
res <- lapply(split.default(d, col_labels), FUN=by, INDICES=list(row_labels), function(x) mean(unlist(x)))
do.call(rbind, res)
# 1 2 3 4
# a 0.56201 0.1563 0.4393 -0.3193
# b -0.01075 0.7515 -0.7973 -0.8620
# c 0.28615 -0.3406 0.1443 -0.1583
Try:
set.seed(42)
d <- data.frame(replicate(6,rnorm(10)))
indx <- expand.grid(unique(row_labels), unique(col_labels))
val1 <- apply(indx, 1, function(x)
mean(as.matrix(subset(d, row_labels==x[1], select=col_labels==x[2]))))
val1
#[1] 0.56200717 0.15625521 0.43927374 -0.31929307 -0.01074557 0.75147423
#[7] -0.79730155 -0.86200887 0.28615306 -0.34058148 0.14431610 -0.15834522
Or
fun1 <- function(x,y) mean(as.matrix(subset(d, row_labels==x, select=col_labels==y)))
mapply(fun1, indx[,1], indx[,2])
#[1] 0.56200717 0.15625521 0.43927374 -0.31929307 -0.01074557 0.75147423
#[7] -0.79730155 -0.86200887 0.28615306 -0.34058148 0.14431610 -0.15834522
Or using outer
outer(unique(row_labels), unique(col_labels), Vectorize(fun1))
# [,1] [,2] [,3]
#[1,] 0.5620072 -0.01074557 0.2861531
#[2,] 0.1562552 0.75147423 -0.3405815
#[3,] 0.4392737 -0.79730155 0.1443161
#[4,] -0.3192931 -0.86200887 -0.1583452
cbind the indx and val1
res <- cbind(indx, val1)
head(res,3)
#Var1 Var2 val1
#1 1 a 0.5620072
#2 2 a 0.1562552
#3 3 a 0.4392737
mean(as.matrix(subset(d, row_labels==1, select=col_labels=="a")))
#[1] 0.5620072
mean(as.matrix(subset(d, row_labels==2, select=col_labels=="a")))
#[1] 0.1562552
Update
You can also change the formatting
res1 <- outer(unique(row_labels), unique(col_labels), Vectorize(fun1))
dimnames(res1) <- list(unique(row_labels), unique(col_labels))
res1
# a b c
#1 0.5620072 -0.01074557 0.2861531
#2 0.1562552 0.75147423 -0.3405815
#3 0.4392737 -0.79730155 0.1443161
#4 -0.3192931 -0.86200887 -0.1583452
Or you could use reshape2
library(reshape2)
acast(res, Var1~Var2, value.var="val1")
# a b c
#1 0.5620072 -0.01074557 0.2861531
#2 0.1562552 0.75147423 -0.3405815
#3 0.4392737 -0.79730155 0.1443161
#4 -0.3192931 -0.86200887 -0.1583452
You're going to need to change the data to long format. You should consider why you imported the data in this format, and better ways of cleaning it.
Firstly, set the column names
colnames(d) <- col_labels
Secondly, you cannot have duplicate rownames, so you can't simply do rownames(d) <- row_labels.
Instead, we're going to have to split them up another way. You could use
split(d, row_labels)
Now we're going to get it all into long format. The melt function in the package reshape2 is commonly used for this.
require(reshape2)
dMelt <- melt(split(d, row_labels))
Now look at dMelt. Is there any reason you couldn't have organised the data in this way?
In order to find the subsetted means, use the function aggregate()
aggregate(dMelt$value, FUN=mean, by=list(dMelt$variable, dMelt$L1))
Here is an option using data.table. It should be very fast and works without any loop:
library(data.table)
library(reshape2)
set.seed(42)
merge(
setkey(data.table(variable=colnames(d),x=col_labels),variable),
setkey(melt(setDT(d)[,row:=row_labels,],id.vars="row"),variable))[
,mean(value),c("row","x")]
row x V1
1: 1 a 0.56200717
2: 2 a 0.15625521
3: 3 a 0.43927374
4: 4 a -0.31929307
5: 1 b -0.01074557
6: 2 b 0.75147423
7: 3 b -0.79730155
8: 4 b -0.86200887
9: 1 c 0.28615306
10: 2 c -0.34058148
11: 3 c 0.14431610
12: 4 c -0.15834522
The idea is to:
put the d data.frame in long format after adding the row labels as a column
merge it with another data table to get the correspondence between the previous column names and your repeated column names
compute the mean by group of row and x (resulting from the merge)
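The same "label every cell, then average by group" idea can also be sketched in base R without reshaping, by giving each matrix cell its row label and column label and passing both to tapply. This is an alternative I am adding for comparison, not one of the answers above.

```r
set.seed(42)
d <- data.frame(replicate(6, rnorm(10)))
col_labels <- c("a", "a", "b", "b", "c", "c")
row_labels <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)

m <- as.matrix(d)
# row(m) and col(m) give the row/column index of every cell, so each cell
# gets its row label and column label; tapply then averages per pair.
res <- tapply(m, list(row_labels[row(m)], col_labels[col(m)]), mean)
res
```

The result is a 4 x 3 matrix with the row labels as row names and the column labels as column names; wrap it in as.data.frame() if a data frame is needed.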
