Turn thousands of dummy variables into a multinomial variable - r

I have a dataframe of the following sort:
a<-c('q','w')
b<-c(T,T)
d<-c(F,F)
.e<-c(T,F)
.f<-c(F,F)
.g<-c(F,T)
h<-c(F,F)
i<-c(F,T)
j<-c(T,T)
df<-data.frame(a,b,d,.e,.f,.g,h,i,j)
a b d .e .f .g h i j
1 q TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 w TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
I want to collapse all variables whose names start with a period into a single multinomial variable called Index, which holds, for each row, the name (without the period) of the one period-initial variable that is TRUE:
df$Index<-c('e','g')
a b d .e .f .g h i j Index
1 q TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE e
2 w TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE g
Many rows can have a T in the same period-initial variable, but each row is T for only ONE period-initial variable.
If it were just a few items I'd use an ifelse statement:
df$Index <- ifelse(df$`_10000`, '10000',...
But there are 12000 of these. In my real data the names of all dummy variables begin with underscores (the toy example above uses periods instead), so I feel like there must be a better way. In pseudocode I would say something like:
for every row:
    for every column beginning with '_':
        if value == T:
            assign the name of the column without '_' to a Column 'Index'
Thanks in advance

Sample data:
df <- cbind(a = letters[1:10], b = LETTERS[1:10],
data.frame(diag(10) == 1))
names(df)[-(1:2)] <- paste0("_", 1:10)
set.seed(42)
df <- df[sample(nrow(df)),]
head(df,3)
# a b _1 _2 _3 _4 _5 _6 _7 _8 _9 _10
# 1 a A TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# 5 e E FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Execution:
df$Index <- apply(subset(df, select = grepl("^_", names(df))), 1,
function(z) which(z)[1])
df
# a b _1 _2 _3 _4 _5 _6 _7 _8 _9 _10 Index
# 1 a A TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 1
# 5 e E FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE 5
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 10
# 8 h H FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 8
# 2 b B FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 2
# 4 d D FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE 4
# 6 f F FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE 6
# 9 i I FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE 9
# 7 g G FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE 7
# 3 c C FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 3
If there is more than one TRUE in a row of _-columns, the first one found is used and the rest are silently ignored. If there are none, Index will be NA for that row.
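Note that which(z)[1] returns the position of the first TRUE among the _-columns, which coincides with the column name here only because the sample columns are named _1 to _10. If the column name itself is wanted (as in the question's pseudocode), a vectorised sketch along the following lines may help. It assumes base R's max.col; dummy_cols, m, has_true and first_true are names introduced for illustration only:

```r
# Sample data as above, plus one row with no TRUE at all (row 3)
df <- cbind(a = letters[1:10], b = LETTERS[1:10],
            data.frame(diag(10) == 1))
names(df)[-(1:2)] <- paste0("_", 1:10)
dummy_cols <- grep("^_", names(df), value = TRUE)
df[3, dummy_cols] <- FALSE

# Logical matrix of the dummy columns only
m <- as.matrix(df[dummy_cols])

# Position of the first TRUE per row; rows without any TRUE are masked below
has_true   <- rowSums(m) > 0
first_true <- max.col(m * 1, ties.method = "first")

# Column name without the leading underscore, NA where no dummy is TRUE
df$Index <- ifelse(has_true, sub("^_", "", dummy_cols[first_true]), NA_character_)
```

This avoids a row-by-row apply, which may matter with 12000 dummy columns, but it does build one large logical matrix in memory.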

Related

How can I remove rows with same elements in all columns of a dataframe?

I have a dataframe with the following elements:
> x[3536:3540,]
V1 V2
3536 2 6
3537 13 6
3538 9 6
3539 6 6
3540 2 2
I want to remove rows with the same elements in all columns.
My desired result is as follows:
> x[3536:3540,]
V1 V2
3536 2 6
3537 13 6
3538 9 6
I tried this:
x<-x[,1] != x[,2]
But this only gives me a boolean value for each row, not the data frame restricted to the rows whose columns differ:
> x
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[29] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[57] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[71] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[113] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Any help would be greatly appreciated.
You need to subset/filter:
Base R:
x_new <- x[x$V1 != x$V2,]
dplyr:
library(dplyr)
x_new <- x %>%
filter(V1 != V2)
Result:
x_new
V1 V2
2 1 2
3 1 3
Data:
x <- data.frame(
V1 = c(1,1,1,1),
V2 = c(1,2,3,1)
)
The code below assumes you want to subset within specific rows, as in the original post.
library(data.table)
setDT(x)
x_new <- x[3536:3540][V1 != V2]
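If there are more than two columns, comparing every column against the first one generalises the same idea. A minimal base R sketch, where the third column V3 and the name all_same are made up for illustration:

```r
# Hypothetical three-column version of the data from the question
x <- data.frame(V1 = c(2, 13, 9, 6, 2),
                V2 = c(6, 6, 6, 6, 2),
                V3 = c(6, 1, 9, 6, 2))

# A row is "all the same" when no column differs from the first column
all_same <- rowSums(x != x[[1]]) == 0

# Keep only the rows where at least one column differs
x_new <- x[!all_same, ]
```

This assumes there are no NA values; with NAs the comparison yields NA and the rowSums call would need na.rm = TRUE plus a decision on how missing values should count.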

Function for mutating, tagging, or identifying records surrounding a condition by a given window

Given a data.frame with some type of a flag or identifier column, I would like to be able to flag the surrounding (leading and lagging) records by some time window parameter, n. So given:
df <- data.frame(
id = letters[1:26],
flag = FALSE
)
df$flag[10] <- TRUE
df$flag[17] <- TRUE
I would like to write something like:
flag_surrounding <- function(flag, n) {
# should flag surrounding -n to +n records with condition flag
}
# expected results for n = 2, n = 1...
df
# id flag flag_n2 flag_n1
# 1 a FALSE FALSE FALSE
# 2 b FALSE FALSE FALSE
# 3 c FALSE FALSE FALSE
# 4 d FALSE FALSE FALSE
# 5 e FALSE FALSE FALSE
# 6 f FALSE FALSE FALSE
# 7 g FALSE FALSE FALSE
# 8 h FALSE TRUE FALSE
# 9 i FALSE TRUE TRUE
# 10 j TRUE TRUE TRUE
# 11 k FALSE TRUE TRUE
# 12 l FALSE TRUE FALSE
# 13 m FALSE FALSE FALSE
# 14 n FALSE FALSE FALSE
# 15 o FALSE TRUE FALSE
# 16 p FALSE TRUE TRUE
# 17 q TRUE TRUE TRUE
# 18 r FALSE TRUE TRUE
# 19 s FALSE TRUE FALSE
# 20 t FALSE FALSE FALSE
# 21 u FALSE FALSE FALSE
# 22 v FALSE FALSE FALSE
# 23 w FALSE FALSE FALSE
# 24 x FALSE FALSE FALSE
# 25 y FALSE FALSE FALSE
# 26 z FALSE FALSE FALSE
I started writing some things using dplyr::lead and dplyr::lag and variants with cumsum, but I felt like this is already in a package somewhere, but couldn't find it quickly (and not really sure how to phrase this as a question for googling) - maybe someone has better recall than me :)
The following does the trick (using ideas from this post), but feels a bit clunky and error prone. I'd be curious to get other approaches/techniques and/or something more robust from a package.
library(dplyr)
flag_surrounding <- function(flag, n) {
as.logical(cumsum(lead(flag, n, default = FALSE)) - cumsum(lag(flag, n + 1, default = FALSE)))
}
df %>%
mutate(flag_n2 = flag_surrounding(flag, 2),
flag_n1 = flag_surrounding(flag, 1))
Here's a simple solution in base:
set.seed(4)
df <- data.frame(
id = letters[1:26],
flag = as.logical(rbinom(n = 26, size = 1, prob = 0.1))
)
lead_lag_flag = function(x, n) {
flagged = which(x)
to_flag = sapply(flagged, function(z) (z - n):(z + n))
to_flag = pmax(0, to_flag)
to_flag = pmin(length(x), to_flag)
to_flag = unique(to_flag)
new_flag = rep(FALSE, length(x))
new_flag[to_flag] = TRUE
return(new_flag)
}
df$flag_n1 = lead_lag_flag(df$flag, 1)
df$flag_n2 = lead_lag_flag(df$flag, 2)
df
# id flag flag_n1 flag_n2
# 1 a FALSE FALSE FALSE
# 2 b FALSE FALSE FALSE
# 3 c FALSE FALSE FALSE
# 4 d FALSE FALSE FALSE
# 5 e FALSE FALSE FALSE
# 6 f FALSE FALSE TRUE
# 7 g FALSE TRUE TRUE
# 8 h TRUE TRUE TRUE
# 9 i TRUE TRUE TRUE
# 10 j FALSE TRUE TRUE
# 11 k FALSE FALSE TRUE
# 12 l FALSE FALSE TRUE
# 13 m FALSE TRUE TRUE
# 14 n TRUE TRUE TRUE
# 15 o FALSE TRUE TRUE
# 16 p FALSE TRUE TRUE
# 17 q TRUE TRUE TRUE
# 18 r FALSE TRUE TRUE
# 19 s TRUE TRUE TRUE
# 20 t FALSE TRUE TRUE
# 21 u FALSE TRUE TRUE
# 22 v TRUE TRUE TRUE
# 23 w FALSE TRUE TRUE
# 24 x FALSE FALSE TRUE
# 25 y FALSE FALSE FALSE
# 26 z FALSE FALSE FALSE
Another base alternative:
n <- 1
nm <- paste0("flag", n)
i <- -n:n
df[ , nm] <- FALSE
ix <- rep(which(df$flag), each = length(i)) + i
ix <- ix[ix > 0 & ix <= nrow(df)]
df[ix, nm] <- TRUE
df
# id flag flag1
# 1 a FALSE FALSE
# 2 b FALSE FALSE
# 3 c FALSE FALSE
# 4 d FALSE FALSE
# 5 e FALSE FALSE
# 6 f FALSE FALSE
# 7 g FALSE FALSE
# 8 h FALSE FALSE
# 9 i FALSE TRUE
# 10 j TRUE TRUE
# 11 k FALSE TRUE
# 12 l FALSE FALSE
# 13 m FALSE FALSE
# 14 n FALSE FALSE
# 15 o FALSE FALSE
# 16 p FALSE TRUE
# 17 q TRUE TRUE
# 18 r FALSE TRUE
# 19 s FALSE FALSE
# 20 t FALSE FALSE
# 21 u FALSE FALSE
# 22 v FALSE FALSE
# 23 w FALSE FALSE
# 24 x FALSE FALSE
# 25 y FALSE FALSE
# 26 z FALSE FALSE
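Yet another way to phrase the same idea, as a sketch only: a record is flagged when its distance to any flagged position is at most n. The function name flag_near is hypothetical; it uses nothing beyond base R's outer and rowSums:

```r
flag_near <- function(flag, n) {
  # Matrix of signed distances: one row per record, one column per flagged record
  dist_to_flag <- outer(seq_along(flag), which(flag), "-")
  # TRUE when at least one flagged record lies within n positions
  rowSums(abs(dist_to_flag) <= n) > 0
}

df <- data.frame(id = letters[1:26], flag = FALSE)
df$flag[c(10, 17)] <- TRUE
df$flag_n2 <- flag_near(df$flag, 2)
df$flag_n1 <- flag_near(df$flag, 1)
```

No clipping at the ends of the vector is needed, and a vector with no TRUE at all gives all FALSE. The trade-off is a length(flag) by sum(flag) matrix, so this is probably only sensible when flags are sparse.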

Create a logical or binary matrix/data.frame from a list of factors in R

I have a list of approximately 2 million elements. The list is made up of vectors of character strings. There are only about 50 distinct strings, so they can be treated as factor levels. The vectors vary in length between 1 and 50 (i.e. the total number of distinct strings).
I want to convert the list to a logical or binary matrix/data.frame. Currently my method involves lapply and is incredibly slow, I would like to know if there is a vectorised approach.
require(dplyr); require(tidyr)
#create test data set
set.seed(123)
list1 <- list()
ListLength <-10
elementlength <- sample(1:5, ListLength, replace = TRUE )
for(i in 1:length(elementlength) ){
list1[[i]] <- sample(letters[1:15], elementlength[i])
}
#Create data frame from list using lapply
lapply(list1, function(n){
data.frame(type = n, value = TRUE) %>%
spread(., key = type, value )
}) %>% bind_rows()
I don't know if there is a way by preallocating the data frame then filling it in somehow.
Type <- unique(unlist(list1, use.names = FALSE))
#Create empty dataframe
TypeMat <- data.frame(matrix(NA,
ncol = length(Type),
nrow = ListLength)) %>%
setNames(Type)
We could use mtabulate from qdapTools
library(qdapTools)
mtabulate(list1)!=0
# a b c d e f g h i j k l m o
#[1,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
#[3,] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
#[5,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#[6,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
#[8,] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
#[9,] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[10,]FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
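The preallocate-and-fill idea from the question can also be done in base R without a loop, using matrix indexing. This is a sketch on a tiny made-up list; lev, row_id, col_id and m are illustrative names:

```r
# Small stand-in for the real list
list1 <- list(c("a", "c"), "b", c("c", "b", "a"))

# All distinct strings become the columns
lev <- sort(unique(unlist(list1, use.names = FALSE)))

# One (row, column) pair per string occurrence
row_id <- rep(seq_along(list1), lengths(list1))
col_id <- match(unlist(list1, use.names = FALSE), lev)

# Preallocate with FALSE, then switch on every (row, column) pair at once
m <- matrix(FALSE, nrow = length(list1), ncol = length(lev),
            dimnames = list(NULL, lev))
m[cbind(row_id, col_id)] <- TRUE
```

With 2 million elements and about 50 columns the result is roughly 100 million logicals, so it may be worth checking memory before converting it to a data.frame.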

Comparing Vectors Values: 1 element with all other

I'm wondering how I can compare 1 element of a vector with all elements in the other vector. As an example: suppose
x <- c(1:10)
y <- c(10,11,12,13,14,1,7)
Now I can compare the elements pairwise
x == y
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
But I want to compare all elements of y with a specific element of x, something like
x[7] == y
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Is this possible?
Do you mean something like this?
x <- 1:10
y <- c(10,7,11,12,13,14,15,16,17,18)
res <- outer(x, y, `==`)
colnames(res) <- paste0("y=", y)
rownames(res) <- paste0("x=", x)
Which gives you the following matrix:
y=10 y=7 y=11 y=12 y=13 y=14 y=15 y=16 y=17 y=18
x=1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=7 FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x=10 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If you want the dimnames to be as y[1] use
colnames(res) <- paste0("y[", seq_along(y), "]")
rownames(res) <- paste0("x[", seq_along(x), "]")
which gives you:
y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9] y[10]
x[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[2] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[3] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[4] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[5] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[6] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[7] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[8] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[9] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[10] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
To get the index use which as follows:
which(res)
[1] 10 17
As R stores matrices column-wise (column-major order), this results in 10 (row 10 of column 1) and 17 (row 7 of column 2, i.e. 10 + 7).
If you want the index in x and y component use:
which(res, arr.ind = TRUE)
row col
x=10 10 1
x=7 7 2
If you want to compare each element of x to y, usually one of the 'apply' functions will help.
As follows:
x <- c(1:10)
y <- c(10,11,12,13,14,1,7)
sapply(x,function(z){z==y})
Column i in the output is result from x[i]==y.
Is this what you're looking for?
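If only a single element of x is of interest, plain == already recycles that one value across y, so no apply call is needed. A short sketch with the vectors from the question (hits is just an illustrative name):

```r
x <- 1:10
y <- c(10, 11, 12, 13, 14, 1, 7)

# Compare one element of x with every element of y (result has length(y))
hits <- y == x[7]

# Positions in y that match x[7]
which(hits)

# Which elements of x occur anywhere in y
which(x %in% y)
```

Here which(hits) points at the seventh element of y, and which(x %in% y) gives the positions 1, 7 and 10 in x.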

R: Generate new data frame column based on a mapping of multiple (logical) columns

Clarification of 'map' or 'ordering' at bottom of post
Imagine we have a data frame with several logical columns, and a 'map' which, for specific combinations of those logical columns, gives a value.
What is the best/most efficient way to compute the value associated with each row of the data frame?
I have three possible solutions below: ifelse(), merge() and table(). I'd appreciate any comments or alternative solutions.
[Apologies, a rather long post]
Consider the following example data frame:
# Generate example
#N <- 15
#Data <- data.frame(A=sample(c(FALSE,TRUE),N,TRUE,c(8,2)),
# B=sample(c(FALSE,TRUE),N,TRUE,c(6,4)),
# C=sample(c(FALSE,TRUE),N,TRUE,c(7,3)),
# D=sample(c(FALSE,TRUE),N,TRUE,c(7,3)))
# Specific example used in this question
Data <- structure(list(A = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
B = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), C = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, TRUE, FALSE, FALSE, FALSE), D = c(TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, TRUE, FALSE)), .Names = c("A", "B", "C", "D"),
class = "data.frame", row.names = c(NA,-15L))
A B C D
1 FALSE FALSE FALSE TRUE
2 FALSE FALSE FALSE FALSE
3 FALSE TRUE FALSE FALSE
4 TRUE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE
6 FALSE TRUE FALSE FALSE
7 FALSE TRUE FALSE FALSE
8 FALSE FALSE FALSE FALSE
9 FALSE FALSE FALSE FALSE
10 TRUE FALSE TRUE TRUE
11 FALSE TRUE FALSE TRUE
12 FALSE FALSE TRUE FALSE
13 FALSE TRUE FALSE FALSE
14 FALSE FALSE FALSE TRUE
15 FALSE FALSE FALSE FALSE
Combined with the following map:
# A -> B -> C
# \_ D
### To clarify, if someone has both B & D TRUE (with C FALSE), D is higher than B
### i.e. there can be no ties
This defines an ordering of the logical columns. The final value I want is the 'highest' column within each row. That is, if column C is TRUE we always return "C", and we only return "D" if C is FALSE and D is TRUE.
The naive way to do this would be nested ifelse statements:
Data$Highest <- with(Data, ifelse( C, "C",
ifelse( D, "D",
ifelse( B, "B",
ifelse( A, "A", "none")
)
)
)
)
But that code is difficult to read/maintain and gets very complicated for complex orderings with many columns.
I can quickly generate a mapping from the column combinations to the desired output:
Map <- expand.grid( lapply( lapply( Data[c("A","B","C","D")], unique ), sort ) )
Map$Value <- factor(NA, levels=c("A","B","C","D","none"))
Map$Value[which(Map$A)] <- "A"
Map$Value[which(Map$B)] <- "B"
Map$Value[which(Map$D)] <- "D"
Map$Value[which(Map$C)] <- "C"
Map$Value[which(is.na(Map$Value))] <- "none"
A B C D Value
1 FALSE FALSE FALSE FALSE none
2 TRUE FALSE FALSE FALSE A
3 FALSE TRUE FALSE FALSE B
4 TRUE TRUE FALSE FALSE B
5 FALSE FALSE TRUE FALSE C
6 TRUE FALSE TRUE FALSE C
7 FALSE TRUE TRUE FALSE C
8 TRUE TRUE TRUE FALSE C
9 FALSE FALSE FALSE TRUE D
10 TRUE FALSE FALSE TRUE D
11 FALSE TRUE FALSE TRUE D
12 TRUE TRUE FALSE TRUE D
13 FALSE FALSE TRUE TRUE C
14 TRUE FALSE TRUE TRUE C
15 FALSE TRUE TRUE TRUE C
16 TRUE TRUE TRUE TRUE C
Which can be used with merge():
merge( Data, Map, by=c("A","B","C","D"), all.y=FALSE )
A B C D Highest Value
1 FALSE FALSE FALSE FALSE none none
2 FALSE FALSE FALSE FALSE none none
3 FALSE FALSE FALSE FALSE none none
4 FALSE FALSE FALSE FALSE none none
5 FALSE FALSE FALSE FALSE none none
6 FALSE FALSE FALSE TRUE D D
7 FALSE FALSE FALSE TRUE D D
8 FALSE FALSE TRUE FALSE C C
9 FALSE TRUE FALSE FALSE B B
10 FALSE TRUE FALSE FALSE B B
11 FALSE TRUE FALSE FALSE B B
12 FALSE TRUE FALSE FALSE B B
13 FALSE TRUE FALSE TRUE D D
14 TRUE FALSE FALSE FALSE A A
15 TRUE FALSE TRUE TRUE C C
However, the merge() function does not currently preserve the row order. There are ways round this though.
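One possible way round the row-order issue, sketched here on a smaller made-up Data, is to look each row up in the map with match() on a pasted key instead of merging. The helper row_key is hypothetical:

```r
# Small stand-in for Data
Data <- data.frame(A = c(FALSE, TRUE, TRUE, FALSE),
                   B = c(FALSE, FALSE, FALSE, TRUE),
                   C = c(FALSE, FALSE, TRUE, FALSE),
                   D = c(TRUE, FALSE, TRUE, TRUE))

# Map built as above: later assignments overwrite earlier (lower) ones
Map <- expand.grid(A = c(FALSE, TRUE), B = c(FALSE, TRUE),
                   C = c(FALSE, TRUE), D = c(FALSE, TRUE))
Map$Value <- "none"
Map$Value[Map$A] <- "A"
Map$Value[Map$B] <- "B"
Map$Value[Map$D] <- "D"
Map$Value[Map$C] <- "C"

# Collapse the four logical columns of each row into one string key
row_key <- function(d) do.call(paste, c(d[c("A", "B", "C", "D")], sep = "|"))

# Look up each Data row in Map; the order of Data is untouched
Data$Value <- Map$Value[match(row_key(Data), row_key(Map))]
```

Because match() returns one position per row of Data, the result lines up with the original rows without any re-sorting.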
My final idea was to use a 4-dimensional table with character entries corresponding to the map:
Map2 <- table( lapply( Data[c("A","B","C","D")], unique ) )
Map2[] <- "none"
Map2["TRUE",,,] <- "A"
Map2[,"TRUE",,] <- "B"
Map2[,,,"TRUE"] <- "D"
Map2[,,"TRUE",] <- "C"
But I find the above lines unclear (perhaps there is a better way to make the table? I thought it would be possible to turn Map into Map2, but I couldn't see how).
We then use matrix-indexing to pull out the corresponding value:
BOB <- as.matrix(Data[c("A","B","C","D")])
cBOB <- matrix(as.character(BOB),nrow=NROW(BOB),ncol=NCOL(BOB),dimnames=dimnames(BOB))
Data$Alt.Highest <- Map2[cBOB]
A B C D Highest Alt.Highest
1 FALSE FALSE FALSE TRUE D D
2 FALSE FALSE FALSE FALSE none none
3 FALSE TRUE FALSE FALSE B B
4 TRUE FALSE FALSE FALSE A A
5 FALSE FALSE FALSE FALSE none none
6 FALSE TRUE FALSE FALSE B B
7 FALSE TRUE FALSE FALSE B B
8 FALSE FALSE FALSE FALSE none none
9 FALSE FALSE FALSE FALSE none none
10 TRUE FALSE TRUE TRUE C C
11 FALSE TRUE FALSE TRUE D D
12 FALSE FALSE TRUE FALSE C C
13 FALSE TRUE FALSE FALSE B B
14 FALSE FALSE FALSE TRUE D D
15 FALSE FALSE FALSE FALSE none none
So in summary, is there a better way to achieve this 'mapping' type operation and any thoughts on the efficiency of these methods?
For the application I'm interested in, I have nine columns and an ordering chart with three branches to apply to 3000 rows. Essentially I am trying to construct a factor based on an awkward data storage format. So clarity of code is my first priority, with speed/memory efficiency my second.
Thanks in advance.
P.S. Suggestions for amending the question title also welcome.
Clarification
The real application involves a questionnaire with 9 questions asking whether the respondent has achieved a given education/qualification level. These are binary yes/no responses.
What we want is to generate a new variable 'highest qualification achieved'.
The problem is that the 9 levels don't form a simple stack. For example, professional qualifications can be achieved without going to university (especially in older respondents).
We have designed a 'map' or 'ordering' such that, for every combination of responses, we have a 'highest qualification' (this order is subjective, hence the desire to make it simple to implement alternative orders).
# So given the nine responses: A, B, C, D, E, F, G, H, I
# we define an ordering as:
# D > C > B > A
# F > E
# E > A
# E == B
# I > H
# H == B
# G == B
# which has a set of order relationships. There is equality in this example
# A -> B -> C -> D
# \_ E -> F
# \_ H -> I
# \_ G
# 0 1 2 3 4
# We could then have five levels in our final 'highest' ordered factor: none, 1, 2, 3, 4
# Or we could decide to add more levels to break certain ties.
The R question is: given an ordering (and a rule for what to do with ties) that maps combinations of the logical columns to a 'highest achieved' value, how best to implement this in R?
I think I might not understand your concept of 'ordering'. If it is the case that no ties are allowed, and you know exactly how each letter compares to all others, that means that there is a strict ordering, that can be broken down into a simple vector from highest to lowest. If this isn't true, then maybe you could give a more difficult example. If it is true, then you could code this quite easily like:
order<-c('C','D','B','A')
reordered.Data<-Data[order]
Data$max<-
c(order,'none')[apply(reordered.Data,1,function(x) min(which(c(x,TRUE))))]
# A B C D max
# 1 FALSE FALSE FALSE TRUE D
# 2 FALSE FALSE FALSE FALSE none
# 3 FALSE TRUE FALSE FALSE B
# 4 TRUE FALSE FALSE FALSE A
# 5 FALSE FALSE FALSE FALSE none
# 6 FALSE TRUE FALSE FALSE B
# 7 FALSE TRUE FALSE FALSE B
# 8 FALSE FALSE FALSE FALSE none
# 9 FALSE FALSE FALSE FALSE none
# 10 TRUE FALSE TRUE TRUE C
# 11 FALSE TRUE FALSE TRUE D
# 12 FALSE FALSE TRUE FALSE C
# 13 FALSE TRUE FALSE FALSE B
# 14 FALSE FALSE FALSE TRUE D
# 15 FALSE FALSE FALSE FALSE none
I think I now understand your concept of 'ordering'. However, I think that you can safely ignore it at first. For example, G is the same level as B. But G and B will never be compared; you can only have one of {B,E,H,G}. So, as long as each "branch" is in the correct order, it won't matter. If you provided some sample data for your new branching, I could test this, but try something like this:
order<-c('D','C','F','I','B','E','H','G','A')
levs<-c(4,3,3,3,2,2,2,2,1)
names(levs)<-order
reordered.Data<-Data[order]
Data$max<-
c(order,'none')[apply(reordered.Data,1,function(x) min(which(c(x,TRUE))))]
Data$lev<-levs[Data$max]
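Since no sample data for the nine-column case was given, here is a self-contained sketch of the same idea on invented data. Data9 and its contents are made up, priority plays the role of order above (renamed so it does not mask base::order), and max.col replaces the apply call:

```r
cols <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")

# Invented responses: five respondents, nine yes/no questions
Data9 <- as.data.frame(matrix(FALSE, nrow = 5, ncol = 9,
                              dimnames = list(NULL, cols)))
Data9[1, "A"] <- TRUE
Data9[2, c("A", "B", "C")] <- TRUE
Data9[3, c("A", "E", "F")] <- TRUE
Data9[5, c("A", "B", "C", "D")] <- TRUE   # row 4 stays all FALSE

# Columns from highest to lowest, and the level each one maps to
priority <- c("D", "C", "F", "I", "B", "E", "H", "G", "A")
levs <- c(D = 4, C = 3, F = 3, I = 3, B = 2, E = 2, H = 2, G = 2, A = 1)

# First TRUE per row in priority order is the highest achieved column
m <- as.matrix(Data9[priority])
first_hit <- max.col(m * 1, ties.method = "first")
Data9$max <- ifelse(rowSums(m) > 0, priority[first_hit], "none")

# 'none' has no entry in levs, so its level comes out as NA
Data9$lev <- unname(levs[Data9$max])
```

Trying an alternative ordering then only means editing priority and levs, which seems to fit the stated goal of keeping the ordering easy to change.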
Here's a data.table approach:
require(data.table)
DT <- data.table(Data)
valord <- c('none','A','B','D','C')
DT[,val:={
vals <- c('none'=TRUE,unlist(.SD))[valord]
names(vals)[max(which(vals))]
},by=1:nrow(DT)]
The result is
A B C D val
1: FALSE FALSE FALSE TRUE D
2: FALSE FALSE FALSE FALSE none
3: FALSE TRUE FALSE FALSE B
4: TRUE FALSE FALSE FALSE A
5: FALSE FALSE FALSE FALSE none
6: FALSE TRUE FALSE FALSE B
7: FALSE TRUE FALSE FALSE B
8: FALSE FALSE FALSE FALSE none
9: FALSE FALSE FALSE FALSE none
10: TRUE FALSE TRUE TRUE C
11: FALSE TRUE FALSE TRUE D
12: FALSE FALSE TRUE FALSE C
13: FALSE TRUE FALSE FALSE B
14: FALSE FALSE FALSE TRUE D
15: FALSE FALSE FALSE FALSE none
If you run
class(DT) # [1] "data.table" "data.frame"
you'll see that this is a data.frame, like your "Data," and the same functions can be applied to it.
