Subset of a data frame based on multiple conditions - r

I'm having trouble with a particular part of my code. I have a data frame:
n <- 6
set.seed(123)
df <- data.frame(x=paste0("x",seq_along(1:n)), A=sample(c(-2:2),n,replace=TRUE), B=sample(c(-1:3),n,replace=TRUE))
#
# x A B
# 1 x1 -1 1
# 2 x2 1 3
# 3 x3 0 1
# 4 x4 2 1
# 5 x5 2 3
# 6 x6 -2 1
and a decision tree as
A>0;Y;Y;N;N
B>1;Y;N;Y;N
C;1;2;2;1
that I load by
dt <- read.csv2("tmp.csv", header=FALSE)
I'd like to loop over all possible combinations of (A>0) and (B>1) and assign the corresponding C value to the subset of rows of df that satisfies each combination. Here's what I did:
nr <- 3
nc <- 5
cond <- dt[1:(nr-1),1,drop=FALSE]
rule <- dt[nr,1,drop=FALSE]
subdf <- vector(mode="list",2^(nr-1))
for (i in 2:nc) {
check <- paste0("")
for (j in 1:(nr-1)) {
case <- paste0(dt[j,1])
if (dt[j,i]=="N")
case <- paste0("!",case)
check <- paste0(check, "(", case, ")" )
if (j<(nr-1))
check <- paste0(check, "&")
}
subdf[i] <- subset(df,check)
subdf[i]$C <- dt[nr,i]
}
unlist(subdf)
Unfortunately, I get an error from subset: it cannot evaluate the condition because I pass it as a string. What should I do?

The problem is in how you create the subset: subset() expects a logical vector, but you gave it a string ('check'). The simplest fix is to parse and evaluate that string. I feel there is a more elegant way to solve this problem, and I hope someone will come along and show it, but you can fix the final part of your code with the following:
mysubset <- subset(df,with(df,eval(parse(text=check))))
if(nrow(mysubset)>0){
mysubset$C <- dt[nr,i]
}
subdf[[i]]<-mysubset
I added the parse/eval step to turn the string into a logical vector, so that only the TRUE rows are kept, and a check that the subset has rows before adding C (assigning C to an empty subset gives an error).
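To see the parse/eval step in isolation, here is a minimal sketch. The A and B values are copied by hand from the data frame printed in the question, and the condition string is just one of the four combinations, so treat both as illustrative assumptions.

```r
# Data copied from the printed example in the question (not re-sampled)
df <- data.frame(x = paste0("x", 1:6),
                 A = c(-1, 1, 0, 2, 2, -2),
                 B = c(1, 3, 1, 1, 3, 1))

# A condition built as a string, as in the loop above
check <- "(A>0)&(!B>1)"

# parse() turns the string into an expression; eval() inside with()
# evaluates it using the columns of df, giving one logical per row
keep <- with(df, eval(parse(text = check)))

# Rows where the condition holds
df[keep, ]
```

Only row x4 (A = 2, B = 1) satisfies this condition. Note that in R `!B>1` is read as `!(B>1)`, because `!` binds less tightly than `>`.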
Building on the previous answer, here is a more elegant and practical way: generate a vector of combined rules, then apply them all to the data with apply/sapply.
##create list of formatted rules
#format each 'building' block separately,
#based on rows in 'dt'.
part_conditions <- apply(dt[-nrow(dt),],MARGIN=1,FUN=function(x){
res <- sprintf("(%s%s)", ifelse(x[-1]=="Y","","!"), x[1])
})
# > part_conditions
# 1 2
# [1,] "(A>0)" "(B>1)"
# [2,] "(A>0)" "(!B>1)"
# [3,] "(!A>0)" "(B>1)"
# [4,] "(!A>0)" "(!B>1)"
#combine to vector of conditions
conditions <- apply(part_conditions, MARGIN=1,FUN=paste, collapse="&")
# > conditions
# [1] "(A>0)&(B>1)" "(A>0)&(!B>1)" "(!A>0)&(B>1)" "(!A>0)&(!B>1)"
#for each condition, test in the data whether the condition is 'T'
temp <- sapply(conditions, function(rule){
return(with(df, eval(parse(text=rule))))
}
)
rules <- as.numeric(t(dt[nrow(dt),-1]))
#then find which of the (in this case) four is 'T', and put the appropriate rule
#in df
df$C <- rules[apply(temp,1,which)]
> df
x A B C
1 x1 -1 1 1
2 x2 1 3 1
3 x3 0 1 1
4 x4 2 1 2
5 x5 2 3 1
6 x6 -2 1 1
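If the conditions are known in advance rather than read from the file, the same result can be obtained without parse/eval by building a "Y"/"N" key per row and looking it up in a named vector. This is only a sketch under that assumption: the names `rules` and `key` and the hard-coded data are illustrative, with the rule values taken from the decision table in the question.

```r
# Data copied from the printed example in the question
df <- data.frame(x = paste0("x", 1:6),
                 A = c(-1, 1, 0, 2, 2, -2),
                 B = c(1, 3, 1, 1, 3, 1))

# Decision table as a named lookup: first letter is A>0, second is B>1
rules <- c(YY = 1, YN = 2, NY = 2, NN = 1)

# Build the key for every row, then look up the matching C value
key <- paste0(ifelse(df$A > 0, "Y", "N"), ifelse(df$B > 1, "Y", "N"))
df$C <- unname(rules[key])
df
```

This gives the same C column as the output above, with x4 the only row receiving 2.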

Related

iterating table() results into matrix/data frame

This must be simple, but I've been banging my head against it for a while. Please help. I have a large data set from which I get all kinds of information via table(). I then want to store these counts, together with the row names that were counted. For a reproducible example consider
a <- c("a", "b", "c", "d", "a", "b") # one count, occurring twice for a and
# b and once for c and d
b <- c("a", "c") # a completely different property from the dataset
# occurring once for a and c
x <- table(a)
y <- table(b) # so now x and y hold the information I seek
How can I merge/bind/whatever to get from x and y to this form:
  x y
a 2 1
b 2 0
c 1 1
d 1 0
HOWEVER, I need to use the solution to work iteratively, in a loop that takes x and y and gets the requested form above, and then gets more tables added, each hopefully adding a column. One of my many failed attempts, just to show my (probably flawed) logic, is:
member <- function (data = dfm, groupvar = 'group', analysis = kc15) {
res<-matrix(NA,ncol=length(analysis$size)+1) #preparing an object for the results
res[,1]<-table(docvars(data,groupvar)) #getting names and totals of groups
for (i in 1:length(analysis$size)) { #getting a bunch of counts that I care about
r<-table(docvars(data,groupvar)[analysis$cluster==i])
res<-cbind(res,r) #here's the problem, trying to add each new count as a column.
}
res
}
So, to sum up: x and y in the reproducible example above stand for the first column of res and one r, and I'm looking for (I think) a correct replacement for the cbind, one that allows adding columns of different lengths but with matching names, as in the example above.
Please help; it's embarrassing how much time I'm wasting on this.
The following may be an option, which merges on the "row names" of the data frames, converted from the frequency tables:
df <- merge(as.data.frame(x, row.names=1, responseName ="x"),
as.data.frame(y, row.names=1, responseName ="y"),
by="row.names", all=TRUE)
df[is.na(df)] <- 0; df
Row.names x y
1 a 2 1
2 b 2 0
3 c 1 1
4 d 1 0
Then, this method can be incorporated into your real data with some modification. I've made up the data since I didn't have any to work with.
set.seed(1234)
groupvar <- sample(letters[1:4], 16, TRUE)
clusters <- 1:4
cluster <- rep(clusters, each=4)
Merge the first two tables:
res <- merge(as.data.frame(table(groupvar[cluster==1]),
row.names=1, responseName=clusters[1]),
as.data.frame(table(groupvar[cluster==2]),
row.names=1, responseName=clusters[2]),
by="row.names", all=TRUE)
Then merge the others using your for loop.
for (i in 3:length(clusters)) {
r <- table(groupvar[cluster==i])
res <- merge(res, as.data.frame(r, row.names=1, responseName = clusters[i]),
by.x="Row.names", by.y="row.names", all=TRUE)
}
res[is.na(res)] <- 0
res
Row.names X1 X2 X3 X4
1 a 1 2 0 0
2 b 1 1 2 2
3 c 0 1 1 2
4 d 2 0 1 0
Alternatively, merge the transposed tables and re-transpose:
res <- t(merge(t(unclass(x)), t(unclass(y)), all=TRUE))
res <- `colnames<-`(res[order(rownames(res)), 2:1], c("x", "y"))
res[is.na(res)] <- 0
res
# x y
# a 2 1
# b 2 0
# c 1 1
# d 1 0
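Another way to get columns of equal length is to fix the set of row names up front: convert every variable to a factor with the same levels before calling table(), so absent categories are counted as 0. The sketch below assumes the raw vectors (here `a` and `b` from the question) are still available and not just their tables; the names `vars` and `lev` are made up for illustration.

```r
a <- c("a", "b", "c", "d", "a", "b")
b <- c("a", "c")

# Collect the variables to be counted; more can be added to this list
vars <- list(x = a, y = b)

# Common set of categories across all variables
lev <- sort(unique(unlist(vars)))

# Count each variable against the common levels; missing levels give 0
res <- sapply(vars, function(v) as.vector(table(factor(v, levels = lev))))
rownames(res) <- lev
res
```

Because every column is counted against the same levels, the result is a plain matrix with one row per category, and adding another table is just another element in `vars`.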

Return multiple results of column-to-matrix operations within a data.table

I have a data.table with multiple categorical variables for which I would like to create contrast (or "dummy") variables along with many more numerical variables which I would like to simply pass by reference.
Example dataset:
library('data.table')
d <- data.table(1:3, # there are lots of numerics, so I want to avoid copying
letters[1:3], # convert these to factor then dummy variable
10:12,
LETTERS[24:26])
# >d
# V1 V2 V3 V4
# 1: 1 a 10 X
# 2: 2 b 11 Y
# 3: 3 c 12 Z
The desired result looks like:
>dummyDT(d)
V1 V3 V2.b V2.c V4.Y V4.Z
1: 1 10 0 0 0 0
2: 2 11 1 0 1 0
3: 3 12 0 1 0 1
which can be produced with:
# this does what I want but is slow and inelegant and not idiomatic data.table
categorToMatrix <- function(x, name_prefix='Var'){
# set levels in order of appearance to avoid default re-sort by alpha
m <- contrasts(factor(x, levels=unique(x)))
dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.') )
m
}
dummyDT <- function(d){
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <-
data.table(
do.call(cbind, lapply(toDummy, function(j) {
categorToMatrix(d[[j]], name_prefix = names(d)[j])
} )
)
)
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
(I do not care about maintaining original column ordering.)
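One thing worth checking in categorToMatrix: contrasts() returns one row per factor level, not one row per observation. In the example every level occurs exactly once and in order, so the two coincide. The sketch below is a hedged variant, not the original function: it indexes the contrast matrix by the factor codes so that repeated values get one row each. The input vector and the "V2" prefix are made up for illustration.

```r
# A column where level "a" occurs twice
x <- c("a", "b", "a", "c")

# Levels in order of appearance, as in categorToMatrix
f <- factor(x, levels = unique(x))

# contrasts(f) has one row per level (3 x 2 here); indexing by the
# integer codes of f expands it to one row per observation (4 x 2)
m <- contrasts(f)[as.integer(f), , drop = FALSE]
dimnames(m) <- list(NULL, paste("V2", colnames(m), sep = "."))
m
```

With the default treatment contrasts the first level ("a") is the baseline, so both "a" rows are all zeros.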
I have tried in addition to the above, the approach of splitting each matrix into a list of columns, as in:
# split a matrix into list of columns and keep track of column names
# expanded from @Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}
# Example:
splitMatrix(categorToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
#
# $V2.c
# [1] 0 0 1
which works for an individual column, but when I try to lapply over multiple columns, the lists somehow get coerced into string rows and recycled, which baffles me:
dummyDT2 <- function(d){
stopifnot(inherits(d,'data.table'))
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <- d[, lapply(.SD, function(x) splitMatrix( categorToMatrix(x) ) ) ,
.SDcols=toDummy]
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
dummyDT2(d)
# V1 V3 V2
# 1: 1 10 0,1,0
# 2: 2 11 0,0,1
# 3: 3 12 0,1,0
# Warning message:
# In data.table::data.table(...) :
# Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table optimized version.
Closely related question: How to one-hot-encode factor variables with data.table?
But there are two differences: I do not want full-rank dummy variables (because I'm using this for regression) but rather contrast variables (n-1 of these for n levels) and I have multiple numeric variables that I do not want to OHE.

R ignore missing data

I have two R data files, data1 and data2, each with 100 columns; the number of rows per column varies from 220 to 360. They record the changes of two quantities during a set of experiments, so data1[i,j] and data2[i,j] refer to the same event but hold different values. I want to print every value greater than 2.5 in either file, along with its row and column number:
for (i in 1:360){
for (j in 1:100){
if((data1[i,j]>2.5) | ( data2[i,j]>2.5)) {
cat(i, j, data1[i,j], data2[i,j], "\n", file="extr-b2.5.txt", append=T)
}
}
}
I get this error because of NAs.
Error in if ((data1[i, j] > 2.5) | (data2[i, j] > :
missing value where TRUE/FALSE needed
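If I set i to 1:220 (every column has at least 220 rows), it works fine.
How can I modify the code above to ignore NA values?
I would do something like this (note that & keeps the cells where both values exceed 2.5; use | to keep the cells where either one does):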
idx <- which(dat1>2.5 & dat2>2.5,arr.ind=TRUE)
cbind(idx,v1=dat1[idx],v2=dat2[idx])
reproducible example:
set.seed(1)
dat1 <- as.data.frame(matrix(runif(12,1,5),ncol=3))
dat2 <- as.data.frame(matrix(runif(12,1,5),ncol=3))
idx <- which(dat1>2.5 & dat2>2.5,arr.ind=TRUE)
cbind(idx,v1=dat1[idx],v2=dat2[idx])
# row col v1 v2
# [1,] 3 1 3.291413 4.079366
# [2,] 4 1 4.632831 2.990797
# [3,] 2 2 4.593559 4.967624
# [4,] 3 2 4.778701 2.520141
# [5,] 4 2 3.643191 4.109781
# [6,] 1 3 3.516456 4.738821
where dat1 and dat2:
# dat1
# V1 V2 V3
# 1 2.062035 1.806728 3.516456
# 2 2.488496 4.593559 1.247145
# 3 3.291413 4.778701 1.823898
# 4 4.632831 3.643191 1.706227
# > dat2
# V1 V2 V3
# 1 3.748091 3.870474 4.738821
# 2 2.536415 4.967624 1.848570
# 3 4.079366 2.520141 3.606695
# 4 2.990797 4.109781 1.502220
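The reason this approach copes with missing data is that which() returns only the positions that are TRUE and silently drops NA. A small sketch with made-up matrices (`m1` and `m2` are illustrative stand-ins for data1 and data2), using | for the "either file" reading of the question:

```r
# Two small matrices with missing values in different cells
m1 <- matrix(c(3, NA, 1, 4), ncol = 2)
m2 <- matrix(c(1, 5, NA, 2), ncol = 2)

# NA | TRUE is TRUE, FALSE | NA is NA; which() keeps only the TRUE cells
idx <- which(m1 > 2.5 | m2 > 2.5, arr.ind = TRUE)

# Row, column and both values for every cell that passed
out <- cbind(idx, v1 = m1[idx], v2 = m2[idx])
out
```

Cell [1,2] holds 1 in m1 and NA in m2, so the comparison is NA and the cell is left out. Cell [2,1] is NA in m1 but 5 in m2, so it is kept.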
Without the for loops you can use pmax to compare two arrays.
bigger=pmax(data1,data2)
This gives an array with the element-wise maximum values. Then you can check whether the maximum is bigger than 2.5:
which(bigger>2.5,arr.ind=T)
This gives the locations where the maximum is bigger than your cutoff.
For completeness: if I were to do it in your double-loop framework, I would set the missing values to the minimum of all the other data. This works as long as there is a value below 2.5 somewhere in your data.
lowest=min(data1,data2,na.rm=TRUE)
data1[is.na(data1)]=lowest
data2[is.na(data2)]=lowest
Then run your double loop.

Gnu R: Rename variable in loop

I would like to write a loop that creates 15 cross tables, one for each of the 15 variables of a data.frame (var1) against another variable (var2); see the data, which can be downloaded here.
The code is now able to give results, but I would like to know how I can rename the variable "mytable" so that I get mytable1, mytable2, etc.
Code:
library(vcd) # for Cramer's V
var1 <- read.csv("~/example.csv", dec=",")
var2 <- sample(1:43)
i <- 1
while(i <= ncol(var1)) {
mytable[[i]] <- table(var2,var1[,i])
assocstats(mytable[[i]])
print(mytable[[i]])
i <- i + 1
}
As suggested in the comments, using names like mytable1, mytable2, etc. for a list of objects is actively discouraged when using R. Collecting all in a list is more useful and cleaner.
One way to do what you want would be this:
library(vcd) # for Cramer's V
data(mtcars)
var1 <- mtcars[ , c(2, 8:11)] ##OP's CSV no longer available
var2 <- sample(1:5, 32, TRUE)
mytable <- myassoc <- list() ##store output in a list
##a `for` loop looks simpler than `while`
for(i in 1:ncol(var1)){
mytable[[i]] <- table(var2, var1[ , i])
myassoc[[i]] <- assocstats(mytable[[i]])
}
So now to access "mytable2" and "myassoc2" you would simply do:
> mytable[[2]]
var2 0 1
1 4 2
2 6 6
3 1 1
4 2 3
5 5 2
> myassoc[[2]]
X^2 df P(> X^2)
Likelihood Ratio 1.7079 4 0.78928
Pearson 1.6786 4 0.79460
Phi-Coefficient : NA
Contingency Coeff.: 0.223
Cramer's V : 0.229
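If position numbers are hard to remember, the list can carry names instead. A sketch of that idea, assuming the same mtcars stand-in for the data; it leaves out assocstats so that it runs without vcd, and the column choice and seed are arbitrary.

```r
# Stand-in data, as in the answer above (the original CSV is unavailable)
var1 <- mtcars[, c("cyl", "vs", "am")]
set.seed(1)
var2 <- sample(1:5, nrow(var1), TRUE)

# lapply over a data frame iterates over its columns and keeps their
# names, so each cross table is stored under its column name
mytable <- lapply(var1, function(col) table(var2, col))

names(mytable)
mytable[["vs"]]
```

Each table is then reachable as mytable[["vs"]] or mytable$vs, which serves the same purpose as mytable1, mytable2, etc. without creating separate objects.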

R Matrix process with conditional additions

I have to pre-process a big matrix. To make my example easier to understand I will use the following matrix:
Raw data
Where col = people and row = skills
In R my matrix is:
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
Aim
In my case I need to process the matrix row by row, so there are 3 steps. For each row, and for each pair of people (i, j), I have to:
Put 0 if i = j (so all diagonal entries are zero)
Put 0 if either of the two values is 0
Otherwise add the two values
I will show the 3 steps to be more clear.
Step 1 (row1)
The data are the row 1
The result is:
Step 2 (row2)
The data are the row 2
The result is:
Step 3 (row3)
The data are the row 3
The result is:
Create a maximum matrix
Then the maximum matching are :
So my final matrix should be:
Question
Can someone tell me how to achieve this in R?
And of course the same process should work if my matrix has more row and columns...
Thanks a lot :)
Here is my implementation in R. The code doesn't execute the steps exactly in the way you specified them. I focused on your final matrix and assumed that this is the main result you're interested in.
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
rownames(test) <- paste("Skill", 1:dim(test)[1], sep="")
colnames(test) <- paste("People", 1:dim(test)[2], sep="")
test
# Pairwise combinations
comb.mat <- combn(1:dim(test)[2], 2)
pairwise.mat <- data.frame(matrix(t(comb.mat), ncol=2))
pairwise.mat$max.score <- 0
names(pairwise.mat) <- c("Person1", "Person2", "Max.Score")
for ( i in 1:dim(comb.mat)[2] ) { # Loop over the pairs (columns of comb.mat)
first.person <- comb.mat[1,i]
second.person <- comb.mat[2,i]
temp.mat <- test[, c(first.person, second.person)]
temp.mat[temp.mat == 0] <- NA
temp.rowSums <- rowSums(temp.mat, na.rm=FALSE)
temp.rowSums[is.na(temp.rowSums)] <- 0
max.sum <- max(temp.rowSums)
previous.val <- pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person]
pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person] <- max.sum*(max.sum > previous.val)
}
pairwise.mat
Person1 Person2 Max.Score
1 1 2 25
2 1 3 32
3 2 3 0
person.mat <- matrix(NA, nrow=dim(test)[2], ncol=dim(test)[2])
rownames(person.mat) <- colnames(person.mat) <- paste("People", 1:dim(test)[2], sep="")
diag(person.mat) <- 0
person.mat[cbind(pairwise.mat[,1], pairwise.mat[,2])] <- pairwise.mat$Max.Score
person.mat[lower.tri(person.mat, diag=F)] <- t(person.mat)[lower.tri(person.mat, diag=F)]
person.mat
People1 People2 People3
People1 0 25 32
People2 25 0 0
People3 32 0 0
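For comparison, the same final matrix can be built directly from the three rules in the question, without the intermediate pairwise table. This is a sketch rather than a tuned solution: it loops over all pairs of people, which is fine for small matrices, and the name `score` is made up for illustration.

```r
test <- matrix(c(18,12,15,0,13,0,14,0,12), ncol=3, nrow=3)
n <- ncol(test)

# Start from all zeros, so the diagonal (i == j) is already handled
score <- matrix(0, n, n)

for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {
      # Skills (rows) where both people have a non-zero value
      both <- test[, i] != 0 & test[, j] != 0
      # Best combined value over those skills; stays 0 if there are none
      if (any(both)) score[i, j] <- max(test[both, i] + test[both, j])
    }
  }
}
score
```

For the example this gives 25 for people 1 and 2, 32 for people 1 and 3, and 0 for people 2 and 3, matching person.mat above.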
