R within() order of operation and logic - r

I am trying to understand how the within() function in R "works." For example, in the code below I try to make a new variable named "FEELS" based on a condition. The first two uses of the within() function do not work. The third use of the within() function works, but I am not confident I understand the logic of "why" it works. Any help is appreciated.
DF <- data.frame(DATE = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), "month"), TEMP = c(30, 40, 50, 60, 70, 80, 90, 100, 90, 80, 70, 60))
DF <- within(DF, {
FEELS[30 <= TEMP & TEMP <= 50] <- "Cold"
FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
})
DF <- within(DF, {
FEELS <- "Cold"
FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
})
DF
DF <- within(DF, {
FEELS <- NA
FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
FEELS[is.na(FEELS)] <- "Cold"
})
DF

Let's break these down one by one.
1. This one simply results in an error message:
Error in FEELS[30 <= TEMP & TEMP <= 50] <- "Cold" : object 'FEELS' not found
That makes perfect sense. You haven't yet defined FEELS, so subsetting it results in an error.
2. This one's interesting and can be seen more clearly if you do it outside of 'within'
FEELS <- "cold"
tf <- 60 <= DF$TEMP & DF$TEMP <= 70
tf
[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
FEELS[tf] <- "Good"
FEELS
[1] "cold" NA NA "Good" "Good" NA NA NA NA NA "Good"
[12] "Good"
R starts with a vector of length one containing "cold". Assigning through a logical index of length 12 forces the vector to grow to length 12, and "Good" is placed wherever the index is TRUE. The positions from 2 onward where the index is FALSE were never given a value, so R fills them with NA.
3. The last one is pretty straightforward. You start with an NA vector which is extended in the same way as the one in 2. You then replace all the NAs which are left with "Cold".

When you create an object inside within(DF, {...}), it does not automatically have the same length as the columns of DF. Instead, it will be "recycled" at the end of {...} to fill out the column:
within(data.frame(A=1:6), { B = 1; C = 1:2 })
# A C B
# 1 1 1 1
# 2 2 2 1
# 3 3 1 1
# 4 4 2 1
# 5 5 1 1
# 6 6 2 1
If, before the end of {...}, you want to modify an object as if it were a full column, it must have the correct length:
within(data.frame(A=1:6), {
D = 1
D[ A < 3 ] = 0
D2 = rep(1, length(A))
D2[A < 3 ] = 0
})
# A D2 D
# 1 1 0 0
# 2 2 0 0
# 3 3 1 NA
# 4 4 1 NA
# 5 5 1 NA
# 6 6 1 NA
To understand why D2 gave the expected output while D did not, try examining the objects in steps, using browser() as suggested by @sebastian-c or following the steps as illustrated in his answer.
In the OP's case, initializing with rep and then making several substitutions is one option. Another would be to use cut, which is designed for assigning labels to intervals of ordered data.
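Both options mentioned above can be sketched as follows. This is only an illustration: the breaks passed to cut are an assumption chosen to match the three temperature bands in the question, and values that fall between the bands (e.g. 55) would be labelled differently by the two approaches.

```r
# Same TEMP values as the question's DF
DF <- data.frame(TEMP = c(30, 40, 50, 60, 70, 80, 90, 100, 90, 80, 70, 60))

# Option 1: start from a full-length column, then overwrite subsets
DF <- within(DF, {
  FEELS <- rep("Cold", length(TEMP))
  FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
  FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
})

# Option 2: cut() maps the intervals [30,50], (50,70], (70,100] to labels
# (breaks are an assumption based on the question's conditions)
DF$FEELS_CUT <- as.character(cut(DF$TEMP,
                                 breaks = c(30, 50, 70, 100),
                                 labels = c("Cold", "Good", "Hot"),
                                 include.lowest = TRUE))
DF
```

cut returns a factor; as.character is used here only so that the two columns can be compared directly.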

Related

How to filter alphanumeric characters range?

I need to create dummy variables using ICD-10 codes. For example, chapter 2 starts with C00 and ends with D48X. Data looks like this:
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
LINHAA2 = c("U071", "C99", "D68X", "J061"),
LINHAA3 = c("D48X", "Y098", "X223", "D640"))
Then I need to create a column that receives 1 if it's between the C00-D48X range and 0 if it's not. The result I desire:
LINHAA1 LINHAA2 LINHAA3 CHAPTER2
B342 U071 D48X 1
C000 C99 Y098 1
D450 D68X X223 1
O985 J061 D640 0
It needs to go through LINHAA1 to LINHAA3. Thanks in advance!
This should do it:
as.numeric(apply(apply(data, 1,
function(x) { x >="C00" & x <= "D48X" }), 2, any))
[1] 1 1 1 0
A little explanation: Because the codes sort alphabetically, the range check can use ordinary string comparison (<=, >=). The inner apply tests each element and produces a matrix of logical values. The outer apply uses any to check whether at least one of the three values for each row is TRUE. as.numeric converts the result from TRUE/FALSE to 1/0.
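The same string-comparison idea can be sketched without the nested apply by comparing a character matrix in one step. This is a variation, not the code from the answer above; the names codes and in_range are introduced here for illustration, and string comparison follows the collation rules of the current locale.

```r
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
                   LINHAA2 = c("U071", "C99", "D68X", "J061"),
                   LINHAA3 = c("D48X", "Y098", "X223", "D640"))

# Character matrix of all codes, one row per record
codes <- as.matrix(data)

# Element-wise test: TRUE where a code sorts between "C00" and "D48X"
in_range <- codes >= "C00" & codes <= "D48X"

# A row gets 1 if at least one of its codes is in range
CHAPTER2 <- as.integer(rowSums(in_range) > 0)
CHAPTER2
```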
This is the typical case for dplyr::if_any. if_any returns TRUE if a given condition is met in any of the tested columns, rowwise:
library(dplyr)
data %>%
mutate(CHAPTER2 = +if_any(starts_with("LINHAA"),
~.x >= 'C00' & .x <='D48X'))
LINHAA1 LINHAA2 LINHAA3 CHAPTER2
1 B342 U071 D48X 1
2 C000 C99 Y098 1
3 D450 D68X X223 1
4 0985 J061 D640 0
Using dedicated icd package
# remotes::install_github("jackwasey/icd")
library(icd)
#get the 2nd chapter start and end codes
ch2 <- icd::icd10_chapters[[ 2 ]]
# start end
# "C00" "D49"
#expand the codes to include all chapter2 codes
ch2codes <- expand_range(ch2[ "start" ], ch2[ "end" ])
# length(ch2codes)
# 2094
#check if codes in a row match
ix <- apply(data, 1, function(i) any(i %in% ch2codes))
# [1] FALSE TRUE FALSE FALSE
data$chapter2 <- as.integer(ix)
#data
# LINHAA1 LINHAA2 LINHAA3 chapter2
# 1 B342 U071 D48X 0
# 2 C000 C99 Y098 1
# 3 D450 D68X X223 0
# 4 0985 J061 D640 0
Note that you have some invalid codes:
#invalid
is_defined("D48X")
# [1] FALSE
explain_code("D48X")
# character(0)
#Valid
is_defined("D48")
# [1] TRUE
explain_code("D48")
# [1] "Neoplasm of uncertain behavior of other and unspecified sites"

Nested ifelse: improved syntax

Description
The ifelse() function lets you pass the values in a vector through a series of tests, each of which triggers a different action when the result is positive. For instance, let xx be a data.frame, as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx
a b
1 1
2 2
1 3
3 4
Suppose that you want to create a new column, c, from column b, but depending on the values in column a in the following way:
For each row,
if the value in column a is 1, the value in column c, is the same value in column b.
if the value in column a is 2, the value in column c, is 100 times the value in column b.
in any other case, the value in column c is the negative of the value in column b.
Using ifelse(), a solution could be:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
-xx$b))
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
Problem 1
An aesthetic problem arises when the number of tests increases, say, four tests:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
ifelse(xx$a==3, ...,
ifelse(xx$a==4, ...,
...))))
I found a partial solution to the problem on this page, which consists of defining the functions if.else_(), i_(), e_(), as follows:
library(lazyeval)
i_ <- function(if_stat, then) {
if_stat <- lazyeval::expr_text(if_stat)
then <- lazyeval::expr_text(then)
sprintf("ifelse(%s, %s, ", if_stat, then)
}
e_ <- function(else_ret) {
else_ret <- lazyeval::expr_text(else_ret)
else_ret
}
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string))
}
In this way, the problem given in the Description, can be rewritten as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
e_(-xx$b)
)
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
And the code for the four tests will simply be:
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
i_(xx$a==3, ...), # dots meaning actions for xx$a==3
i_(xx$a==4, ...), # dots meaning actions for xx$a==4
e_(...) # dots meaning actions for any other case
)
Problem 2 & Question
The given code apparently solves the problem. Then, I wrote the following test function:
test.ie <- function() {
dd <- data.frame(a=c(1,2,1,3), b=1:4)
if.else_(
i_(dd$a==1, dd$b),
i_(dd$a==2, dd$b*100),
e_(-dd$b)
) # it should give c(1, 200, 3, -4)
}
When I tried the test:
test.ie()
it spit the following error message:
Error in ifelse(dd$a == 1, dd$b, ifelse(dd$a == 2, dd$b * 100, -dd$b)) :
object 'dd' not found
Question
Since the if.else_() syntactic constructor is not supposed to run only from the console, is there a way for it to 'know' the variables from the function that calls it?
Note
In "Best way to replace a lengthy ifelse structure in R", a similar problem was posted. However, the given solution there focuses on building the table's new column with the given constant output values (the "then" or "else" slots of the ifelse() function), whereas my case addresses a syntactic problem in which the "then" or "else" slots can even be expressions in terms of other data.frame elements or variables.
I think you can use dplyr::case_when inside dplyr::mutate to achieve this.
library(dplyr)
df <- tibble(a=c(1,2,1,3), b=1:4)
df %>%
mutate(
foo = case_when(
.$a == 1 ~ .$b,
.$a == 2 ~ .$b * 100L,
TRUE ~ .$b * -1L
)
)
#> # A tibble: 4 x 3
#> a b foo
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 2 2 200
#> 3 1 3 3
#> 4 3 4 -4
In the upcoming release of dplyr 0.6.0 you won't need the awkward workaround of .$, and you can just use:
df %>%
mutate(
foo = case_when(
a == 1 ~ b,
a == 2 ~ b * 100L,
TRUE ~ b * -1L
)
)
Taking into account MrFlick's advice, I re-coded the if.else_() function as follows:
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string), envir = parent.frame())
}
Now the test.ie() function runs properly
test.ie()
[1] 1 200 3 -4
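Why envir = parent.frame() makes the difference can be seen in a reduced sketch. The function names below (eval_in_own_frame, eval_in_caller, caller) are hypothetical and only illustrate the lookup rules; they are not part of the if.else_() code.

```r
# Evaluates the text in this function's own frame (the default for eval),
# so only its own variables and those of its enclosure are visible
eval_in_own_frame <- function(txt) eval(parse(text = txt))

# Evaluates the text in the frame of whoever called this function
eval_in_caller <- function(txt) eval(parse(text = txt), envir = parent.frame())

caller <- function(evaluator) {
  hidden_value <- 41          # exists only inside caller()
  evaluator("hidden_value + 1")
}

ok <- caller(eval_in_caller)  # finds hidden_value in caller()'s frame
failed <- tryCatch(caller(eval_in_own_frame),
                   error = function(e) conditionMessage(e))
ok
failed
```

The second call fails for the same reason test.ie() did: the text is evaluated where dd (here hidden_value) is not visible.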
With full respect to the OP's remarkable effort to improve nested ifelse(), I prefer a different approach which I believe is easy to write, concise, maintainable and fast:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
library(data.table)
# coerce to data.table, and set the default first
setDT(xx)[, c:= -b]
xx[a == 1L, c := b] # 1st special case
xx[a == 2L, c := 100L*b] # 2nd special case, note use of integer 100L
# xx[a == 3L, c := ...] # other cases
# xx[a == 4L, c := ...]
#...
xx
# a b c
#1: 1 1 1
#2: 2 2 200
#3: 1 3 3
#4: 3 4 -4
Note that for the 2nd special case b is multiplied by the integer constant 100L to make sure that the right hand sides are all of type integer in order to avoid type conversion to double.
Edit 2: This can also be written in an even more concise (but still maintainable) way as a one-liner:
setDT(xx)[, c:= -b][a == 1L, c := b][a == 2L, c := 100L*b][]
data.table chaining works here, because c is updated in place so that subsequent expressions are acting on all rows of xx even if the previous expression was a selective update of a subset of rows.
Edit 1: This approach can be implemented with base R as well:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
xx$c <- -xx$b
idx <- xx$a == 1L; xx$c[idx] <- xx$b[idx]
idx <- xx$a == 2L; xx$c[idx] <- 100 * xx$b[idx]
xx
# a b c
#1 1 1 1
#2 2 2 200
#3 1 3 3
#4 3 4 -4

Concatenating positions into genomic segments

I would like to concatenate all rows which have a similarity score of more than 0.955. The Abo and Bel columns represent the similarity score with the rows above and below, respectively. In the following input df I have 10 genomic probes (NAME column), which are concatenated into just 4 genomic segments (dfout).
df <- " NAME Abo Bel Chr GD Position
BovineHD0100009217 NA 1.0000000 1 0 31691781
BovineHD0100009218 1.0000000 0.6185430 1 0 31695808
BovineHD0100019600 0.6185430 0.9973510 1 0 69211537
BovineHD0100019601 0.9973510 1.0000000 1 0 69213650
BovineHD0100019602 1.0000000 1.0000000 1 0 69214650
BovineHD0100019603 1.0000000 0.6600000 1 0 69217942
BovineHD0100047112 0.6600000 1.0000000 1 0 93797691
BovineHD0100026604 1.0000000 1.0000000 1 0 93815774
BovineHD0100026605 1.0000000 0.4649007 1 0 93819471
BovineHD0100029861 0.4649007 NA 1 0 105042452"
df <- read.table(text=df, header=T)
My expected output dfout:
dfout <- "Chr start end startp endp nprob
1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
1 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
1 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
1 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1"
dfout <- read.table(text=dfout, header=T)
Any ideas?
I couldn't think of any pretty solution using basic dataframe manipulation, so here's a bad-looking one that works:
First, add stringsAsFactors to df creation:
df <- read.table(text=df, header=T, stringsAsFactors = FALSE)
start <- df$Position[1]
end <- integer()
output <- NULL
count <- 1
for (i in 1:(nrow(df)-1)) {
if(df$Bel[i] < 0.955) {
end <- df$Position[i]
output <- rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count))
start <- df$Position[i+1]
count <- 0
}
count <- count + 1
}
end <- df$Position[nrow(df)]
output <- as.data.frame(rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count)))
colnames(output) <- c("start", "end", "startp", "endp", "nprob")
The basic idea here is to loop through the rows and check whether the next row should be added to the current segment (Bel >= 0.955) or whether a new segment should start (Bel < 0.955). When a new segment has to be started, the end row is recorded, the finished segment is added to the output, and the start of the next segment is set. A count tracks the number of rows used to create each segment (nprob).
Finally the last segment is added, outside the for loop, and the output receives its column names and is converted to a dataframe. I did not use Chr because 1. They are all equal, 2. if they weren't you didn't give any way to choose/summarize them.
Result:
> output
start end startp endp nprob
1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
2 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
3 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
4 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1
I'm pretty sure that you or someone else can work on this to make it shorter and more concise.
Here is a dplyr version. First we need to define the groups, which is what the mutate step does; then a simple summarise within each group produces the output.
library(dplyr)
df %>%
mutate(
Abo955=ifelse(Abo<0.955,NA,Abo),
myGroup=cumsum(is.na(Abo955)*1)) %>%
group_by(myGroup) %>%
summarise(
Chr=min(Chr),
start=min(Position),
end=max(Position),
startp=first(NAME),
endp=last(NAME),
nprob=n()) %>%
select(-myGroup)
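How the grouping step works can be sketched in base R on the Abo column alone. The name new_segment is introduced here for illustration; the values are the Abo scores from the question.

```r
# Abo scores from the question (similarity with the row above)
Abo <- c(NA, 1, 0.6185430, 0.9973510, 1, 1, 0.66, 1, 1, 0.4649007)

# A row opens a new segment if it has no row above (NA)
# or is not similar enough to the row above
new_segment <- is.na(Abo) | Abo < 0.955

# Running count of segment starts gives a group id per row
myGroup <- cumsum(new_segment)
myGroup
# [1] 1 1 2 2 2 2 3 3 3 4

# Rows per group correspond to nprob in the expected output
tabulate(myGroup)
# [1] 2 4 3 1
```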
This solution is purely based on logical vectors and works with the provided example.
As Molx said, let's add stringsAsFactors = FALSE
df <- read.table(text=df, header=T, stringsAsFactors = FALSE)
And just so that the logical evaluations work, let's change the NAs to 0s
df[is.na(df)] <- 0
Now, for the consecutive rows that will be concatenated lets find the "start" and "end" rows using logical evaluations
starts <- df$Bel >= 0.955 & df$Abo < 0.955
ends <- df$Bel < 0.955 & df$Abo >= 0.955
With this we can already construct a data.frame concatenating rows that need to be concatenated
concatenated <- data.frame(Chr = df[starts, "Chr"],
start = df[starts, "Position"],
end = df[ends, "Position"],
startp = df[starts, "NAME"],
endp = df[ends, "NAME"],
nprob = which(ends) - which(starts) + 1
)
And let's also construct a data.frame with the rows that are not concatenated, i.e. the ones that do not have the desired similarity score with either the row above or the row below
notConcatenate <- df$Abo < 0.955 & df$Bel < 0.955
non_concatenated <- data.frame(Chr = df[notConcatenate, "Chr"],
start = df[notConcatenate, "Position"],
end = df[notConcatenate, "Position"],
startp = df[notConcatenate, "NAME"],
endp = df[notConcatenate, "NAME"],
nprob = 1
)
And finally bind the two data.frames
dfout <- rbind(concatenated,non_concatenated)
Resulting in
> dfout
Chr start end startp endp nprob
1 1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
2 1 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
3 1 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
4 1 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1
NOTE: This code assumes that correlated probes are within the same chromosome
Cheers!

Comparison under more conditions

Following up on my last question, I have a new, related question. I edited my original post and asked there, but after waiting about a week I want to try again here.
This time with a better example:
Equip<- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,6,6,6)
Notif <-c(1,1,3,4,2,2,2,5,6,7,9,9,15,10,11,12,13,14,16,17,18,19)
rank <- c(1,1,2,3,1,1,1,1,2,3,1,1,2,1,2,3,1,2,3,4,5,6)
Component <- c("Ventil","Motor","Ventil","Ventil","Vergaser","Vergaser","Bremse",
"Lichtmaschine","Bremse","Lichtmaschine","Bremse","Motor","Lichtmaschine",
"Bremse","Bremse","Motor","Vergaser","Motor","Vergaser","Motor",
"Vergaser","Motor")
df <- data.frame(Equip,Notif,rank,Component)
Equip is my subject and rank is the visit number. Component is the item that has to be looked for.
I want to have an output like this:
If an Equip (subject) was visited 2 times (rank 1 and 2), look across all Equips with ranks 1 and 2 and check whether any Component was recorded on both the first and the second visit.
If an Equip (subject) was visited 3 times (rank 1, 2 and 3), look across all such Equips and check whether any Component is listed all 3 times, like Equip 1, rank 1, Component Motor; Equip 1, rank 2, Component Motor; Equip 1, rank 3, Component Motor.
The output should include the name of the Component, like TRUE "Motor".
I have some code, but with it I can only compare visits 1 and 2, visits 2 and 3, and so on (I cannot split the results by number of ranks, like Equips with 2 ranks, Equips with 3 ranks, and so on).
The code is this:
a <- lapply(split(df,df$Equip),function(x){
ll <- split(x,x$rank)
if(length(ll)>1 )
ii <- intersect(ll[[1]]$Component,ll[[2]]$Component ) ## test intersection
else
ii <- NA
c(length(ii)> 0 && !is.na(ii),ii)
})
b <- unlist(a)
c <- table(b,b)
rowSums(c)
Hopefully you can help me. Please ask if there are any questions.
Regarding your question about the output, and following your approach to the solution:
Equip Component V1 idx
1: 1 Ventil TRUE 3
2: 2 NA FALSE 1
3: 3 NA FALSE 3
4: 4 NA FALSE 2
5: 5 NA FALSE 3
6: 6 NA FALSE 6
Something like that, but if it's easier, Equip and idx are not necessarily needed.
for Equip with 2 ranks:
TRUE FALSE
0 1
for Equip with 3 ranks:
TRUE FALSE
1 2
for Equip with 6 ranks:
TRUE FALSE
0 1
Here's the output I think would be of interest to you. It uses data.table.
First, we create a data.table from your data.frame df with keys = Equip, Component as follows.
require(data.table) # load package
# then create the data.table with keys as specified above
# Check that both these columns are already sorted out for you!
dt <- data.table(df, key=c("Equip", "Component"))
Second, we create a function that'll give the desired output for a given rank query (2, 3 etc..)
this.check <- function(idx) {
chk <- seq(1, idx)
o <- subset(dt[, all(chk %in% rank), by=c("Equip", "Component")], V1 == TRUE)
if (nrow(o) > 0) o[, idx:=idx]
}
What does this do? Let's run this for rank=1,2. We run this by:
> this.check(2)
# output
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
This tells you that for Equip = 1 and 5, there are Components = Ventil and Bremse, respectively, with rank = 1 and 2 (indicated by idx=2). You also get the column V1 = TRUE, even though I, as @Carl already pointed out, don't understand the need for it. If required, you can change the column names of this output using setnames.
Third, we use this function to query ranks=1,2, then ranks=1,2,3 .. and so on. This can be accomplished with a simple lapply as follows:
# Let's run the function for idx = 2 to 6.
# This will check from rank = 1,2 until rank=1,2,3,4,5,6
o <- lapply(2:6, function(idx) {
this.check(idx)
})
> o
[[1]]
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
[[2]]
Equip Component V1 idx
1: 1 Ventil TRUE 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
It shows that for rank=1,2 and rank=1,2,3 you have some Component. For others there's nothing = NULL.
Finally, we can bind all of these together using rbind to get one single data.table as follows:
o <- do.call(rbind, o)
> o
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
3: 1 Ventil TRUE 3
Here, idx=2 are the Component that satisfies rank=1,2 and idx=3 are the ones that satisfy rank=1,2,3.
Putting it all together:
this.check <- function(idx) {
chk <- seq(1, idx)
o <- subset(dt[, all(chk %in% rank), by=c("Equip", "Component")], V1 == TRUE)
if (nrow(o) > 0) o[, idx:=idx]
}
o <- do.call(rbind, lapply(2:6, function(idx) {
this.check(idx)
}))
I hope this helps.
Edit: (After series of exchanges in comments, this is the new solution I propose. I hope this is what you are after.)
require(data.table)
dt <- data.table(df, key=c("Equip", "Component"))
dt[, `:=`(e.max=max(rank)), by=Equip]
dt[, `:=`(ec.max=max(rank)), by=c("Equip", "Component")]
setkey(dt, "e.max", "ec.max")
this.check <- function(idx) {
t1 <- dt[J(idx,idx)]
t2 <- t1[, identical(as.numeric(seq_len(idx)), as.numeric(rank)),
by=c("Equip", "Component")]
o <- table(t2$V1)
if (length(o) == 1)
o <- c(o, "TRUE"=0)
o <- c("idx"=idx, o)
}
o <- do.call(rbind, lapply(2:6, function(idx) this.check(idx)))
> o
# idx FALSE TRUE
# [1,] 2 1 0
# [2,] 3 2 1
# [3,] 4 1 0
# [4,] 5 1 0
# [5,] 6 1 0
If I make an array of your data, columnwise, as
foo<-cbind(Equip,Notif, rank, Component)
eqp<-1 # later, loop over all values
foo[c( which( foo[,1]==eqp & (foo[,3]==1 | foo[,3]==2) ) ),4]
[1] "Ventil" "Motor" "Ventil"
Feed those results to table and extract items with count ==2
Clearly any item which shows up twice is what you want.
This is not an answer I'd recommend using, since tools like ddply and aggregate will do this much more cleanly, but I want to be sure that this is the answer you're after, assuming a loop over eqp values in the original Equip .
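The table step described above could be sketched like this. The helper repeated_components is a hypothetical name introduced here; it deduplicates (rank, Component) pairs first, on the assumption that a component listed twice within the same visit should count only once.

```r
Equip <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,6,6,6)
rank <- c(1,1,2,3,1,1,1,1,2,3,1,1,2,1,2,3,1,2,3,4,5,6)
Component <- c("Ventil","Motor","Ventil","Ventil","Vergaser","Vergaser","Bremse",
               "Lichtmaschine","Bremse","Lichtmaschine","Bremse","Motor","Lichtmaschine",
               "Bremse","Bremse","Motor","Vergaser","Motor","Vergaser","Motor",
               "Vergaser","Motor")

# Components of one Equip that appear in every one of the first n_visits ranks
repeated_components <- function(eqp, n_visits) {
  keep <- Equip == eqp & rank %in% seq_len(n_visits)
  # one row per (rank, Component) pair, so repeats within a visit count once
  visits <- unique(data.frame(rank = rank[keep], Component = Component[keep]))
  counts <- table(visits$Component)
  names(counts)[as.vector(counts) == n_visits]
}

repeated_components(1, 2)  # "Ventil"
repeated_components(5, 2)  # "Bremse"
repeated_components(2, 2)  # character(0): Equip 2 has only rank 1
```

Looping this over unique(Equip) would give the per-Equip answer; the results for Equip 1 and 5 agree with the this.check(2) output shown earlier.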

Troubleshooting ddply() script

I am developing a censored dependent variable for use in survival analysis. My goal is to find the last time ("time") that someone answers a question in a survey (e.g. the point where "q.time" is coded as "1", and "q.time+1" and q at all subsequent times are coded as "0").
By this logic, the last question answered should be coded as "1" (q.time). The first question that is NOT answered (q.time+1) should be coded as "0". And all questions subsequent to the first question NOT answered should be coded as "NA". I then want to remove ALL rows where the DV=NA from my dataset.
A very generous coworker has helped me to develop the following code, but he's on vacation now and it needs a little more lovin'. Code is as follows:
library(plyr) # for ddply
library(stats) # for reshape(...)
# From above
dat <- data.frame(
id=c(1, 2, 3, 4),
q.1=c(1, 1, 0, 0),
q.2=c(1, 0, 1, 0),
dv.1=c(1, 1, 1, 1),
dv.2=c(1, 1, 0, 1))
# From above
long <- reshape(dat,
direction='long',
varying=c('q.1', 'q.2', 'dv.1', 'dv.2'))
ddply(long, .(id), function(df) {
# figure out the dropoff time
answered <- subset(df, q == 1)
last.q = max(answered$time)
subs <- subset(df, time <= last.q + 1)
# set all the dv as desired
new.dv <- rep(last.q,1)
if (last.q < max(df$time)) new.dv <- c(0,last.q)
subs$dv <- new.dv
subs
})
Unfortunately, this yields the error message:
"Error in `$<-.data.frame`(`*tmp*`, "dv", value = c(0, -Inf)) :
replacement has 2 rows, data has 0"
Any ideas? The problem seems to be located in the "rep" command, but I'm a newbie to R. Thank you so much!
UPDATE: SEE EXPLANATIONS BELOW, and then REFER TO FOLLOW-UP QUESTION
Hi there-I completely followed you, and really appreciate the time you took to help me out. I went back into my data and coded in a dummy Q where all respondents have a value of "1" - but, discovered where the error really may be. In my real data set, I have 30 questions (i.e., 30 times in long form). After I altered the dataset so FOR SURE q==1 for all id variables, the error message changed to saying
"Error in `$<-.data.frame`(`*tmp*`, "newvar", value = c(0, 29)) : replacement has 2 rows, data has 31"
If the problem is with the number of rows assigned to subs, then is the source of the error coming from...
subs <- subset(df, time <= last.q + 1)
i.e., time <= last.q + 1 is setting the number of rows to the value EQUAL to last.q+1?
UPDATE 2: What, ideally, I'd like my new variable to look like!
id time q dv
1 1 1 1
1 2 1 1
1 3 1 1
1 4 1 1
1 5 0 0
1 6 0 NA
2 1 1 1
2 2 1 1
2 3 0 0
2 4 0 NA
2 5 0 NA
2 6 0 NA
Please note that "q" can vary between "0" or "1" over time (See the observation for id=1 at time=2), but due to the nature of survival analysis, "dv" cannot. What I need to do is create a variable that finds the LAST time that "q" changes between "1" and "0", and then is censored accordingly. After Step 4, my data should look like this:
id time q dv
1 1 1 1
1 2 1 1
1 3 1 1
1 4 1 1
2 1 1 1
2 2 1 1
2 3 0 0
.(id) in plyr is equivalent to
> dum<-split(long,long$id)
> dum[[4]]
id time q dv
4.1 4 1 0 1
4.2 4 2 0 1
Your problem is in the 4th split. You reference
answered <- subset(df, q == 1)
in your function. This is an empty set as there are no dum[[4]]$q taking value 1
If you just want to ignore this split then something like
ans<-ddply(long, .(id), function(df) {
# figure out the dropoff time
answered <- subset(df, q == 1)
if(length(answered$q)==0){return()}
last.q = max(answered$time)
subs <- subset(df, time <= last.q + 1)
# set all the dv as desired
new.dv <- rep(last.q,1)
if (last.q < max(df$time)) new.dv <- c(0,last.q)
subs$dv <- new.dv
subs
})
> ans
id time q dv
1 1 1 1 2
2 1 2 1 2
3 2 1 1 0
4 2 2 0 1
5 3 1 0 2
6 3 2 1 2
would be the result
In short: The error is because there is no q == 1 when id == 4.
A good way to check what's going on here is to rewrite the function separately, and manually test each chunk that ddply is processing.
So first rewrite your code like this:
myfun <- function(df) {
# figure out the dropoff time
answered <- subset(df, q == 1)
last.q = max(answered$time)
subs <- subset(df, time <= last.q + 1)
# set all the dv as desired
new.dv <- rep(last.q,1)
if (last.q < max(df$time)) new.dv <- c(0,last.q)
subs$dv <- new.dv
subs
}
ddply(long, .(id), myfun )
That still gives an error of course, but at least now we can manually check what ddply is doing.
ddply(long, .(id), myfun ) really means:
Take the dataframe called long
Create a number of subset dataframes (one for each distinct id)
Apply the function myfun to each subsetted dataframe
Reassemble the results into a single dataframe
So let's attempt to do manually what ddply is doing automatically.
> myfun(subset(long, id == 1))
id time q dv
1.1 1 1 1 2
1.2 1 2 1 2
> myfun(subset(long, id == 2))
id time q dv
2.1 2 1 1 0
2.2 2 2 0 1
> myfun(subset(long, id == 3))
id time q dv
3.1 3 1 0 2
3.2 3 2 1 2
> myfun(subset(long, id == 4))
Error in `$<-.data.frame`(`*tmp*`, "dv", value = c(0, -Inf)) :
replacement has 2 rows, data has 0
In addition: Warning message:
In max(answered$time) : no non-missing arguments to max; returning -Inf
>
So it seems like the error is coming from the step where ddply applies the function for id == 4.
Now let's take the code outside of the function so we can examine each chunk.
> #################
> # set the problem chunk to "df" so we
> # can examine what the function does
> # step by step
> ################
> df <- subset(long, id == 4)
>
> ###################
> # run the bits of function separately
> ###################
> answered <- subset(df, q == 1)
> answered
[1] id time q dv
<0 rows> (or 0-length row.names)
> last.q = max(answered$time)
Warning message:
In max(answered$time) : no non-missing arguments to max; returning -Inf
> last.q
[1] -Inf
> subs <- subset(df, time <= last.q + 1)
> subs
[1] id time q dv
<0 rows> (or 0-length row.names)
> # set all the dv as desired
> new.dv <- rep(last.q,1)
> new.dv
[1] -Inf
> if (last.q < max(df$time)) new.dv <- c(0,last.q)
> subs$dv <- new.dv
Error in `$<-.data.frame`(`*tmp*`, "dv", value = c(0, -Inf)) :
replacement has 2 rows, data has 0
> subs
[1] id time q dv
<0 rows> (or 0-length row.names)
>
So the error that you're getting comes from subs$dv <- new.dv, because new.dv has length two (i.e. two values: 0 and -Inf) but subs has zero rows. That wouldn't be a problem if dv were a simple vector, but because it is a column of the subs data frame, whose columns all have zero rows, subs$dv must also have zero rows.
The reason subs has zero rows is that there is no q == 1 when id == 4.
Should the final data frame not have anything for id == 4? The answer to your problem really depends on what you want to happen in the case when there are no q==1 for an id. Just let us know, and we can help you with the code.
UPDATE:
The error that you're getting is because subs has 31 rows and new.dv has two values in it, and 31 is not a multiple of 2.
In R, when you try to assign a longer vector to a shorter data frame column, it will always complain.
> test <- data.frame(a=rnorm(100),b=rnorm(100))
> test$a <- rnorm(1000)
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(-0.0507065994549323, :
replacement has 1000 rows, data has 100
>
But when you assign a shorter vector to a longer column, it will only complain if the length of the longer one is not a whole multiple of the length of the shorter one (e.g. 3 does not go evenly into 100).
> test$a <- rnorm(3)
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(-0.897908251650798, :
replacement has 3 rows, data has 100
But if you tried this, it wouldn't complain since 2 goes into 100 evenly.
> test$a <- rnorm(2)
>
Try this:
> length(test$a)
[1] 100
> length(rnorm(2))
[1] 2
> test$a <- rnorm(2)
> length(test$a)
[1] 100
>
What it's doing is silently repeating the shorter vector to fill up the longer vector.
And again, what you do to get around the error (i.e. make both vectors the same length) will depend on what you're trying to achieve. Do you make new.dv shorter, or subs$dv longer?
First, to give credit where credit is due, the code below is not mine. It was generated in collaboration with another very generous coworker (and engineer) who helped me work through my problem (for hours!).
I thought that other analysts tasked with constructing a censored variable from survey data might find this code useful, so I am passing the solution along.
library(plyr)
#A function that only selects cases before the last time "q" was coded as "1"
slicedf <- function(df.orig, df=NULL) {
if (is.null(df)) {
return(slicedf(df.orig, df.orig))
}
if (nrow(df) == 0) {
return(df)
}
target <- tail(df, n=1)
#print(df)
#print('--------')
if (target$q == 0) {
return(slicedf(df.orig, df[seq_len(nrow(df) - 1), ]))
}
if (nrow(df.orig) == nrow(df)) {
return(df.orig)
}
return(df.orig[1:(nrow(df) + 1), ])
}
#Applies function to the dataset, and codes over any "0's" before the last "1" as "1"
long <- ddply(long, .(id), function(df) {
df <- slicedf(df)
if(nrow(df) == 0) {
return(df)
}
q <- df$q
if (tail(q, n=1) == 1) {
df$q <- rep(1, length(q))
} else {
df$q <- c(rep(1, length(q) - 1), 0)
}
return(df)
})
Thanks to everyone online who commented for your patience and help.
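For readers who prefer to avoid recursion, the same censoring rule might be sketched in base R as below. This is an assumption-laden reading of the code above, not the coworker's implementation: censor_one and long_demo are hypothetical names, rows are assumed to be one per time point, and ids that never answer are dropped.

```r
# Keep rows up to one past the last answered question; dv is 1 up to the
# last answer and 0 for the first unanswered row after it
censor_one <- function(df) {
  df <- df[order(df$time), ]            # make sure rows are in time order
  answered <- which(df$q == 1)          # positions of answered questions
  if (length(answered) == 0) return(NULL)   # never answered: drop this id
  last_q <- max(answered)               # position of the last answer
  n_keep <- min(last_q + 1, nrow(df))   # one extra row, if there is one
  out <- df[seq_len(n_keep), ]
  out$dv <- as.numeric(seq_len(n_keep) <= last_q)
  out
}

# Small made-up data set: id 1 stops after time 3, id 2 answers everything,
# id 3 never answers
long_demo <- data.frame(
  id   = rep(1:3, each = 4),
  time = rep(1:4, times = 3),
  q    = c(1, 0, 1, 0,   1, 1, 1, 1,   0, 0, 0, 0)
)

censored <- do.call(rbind, lapply(split(long_demo, long_demo$id), censor_one))
rownames(censored) <- NULL
censored
```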
