I am working on a big dataset and have got a problem with data cleaning. My data set looks like this:
data <- cbind (group = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
I just want to keep the group in which the sum of score is equal to 1 and remove the whole group in which the sum of score is equal to 0. For the group in which the sum of the score is greater than 1, e.g., sum of score = 3, I want to randomly select two group members with score equal to 1 and remove them from the group. Then the data may look like this:
newdata <- cbind (group = c(1,1,1,3,3,4,4,4),
member = c(1,2,3,2,3,1,3,5),
score = c(0,1,0,0,1,0,1,0))
Can anybody help me get this done?
I would write a function that combines the various manipulations for you. Here is one such function, heavily commented:
process <- function(x) {
## this adds a vector with the group sum score
x <- within(x, sumScore <- ave(score, group, FUN = sum))
## drop the groups with sumScore == 0; logical indexing is used because
## x[-which(...), ] would drop every row if no group had sumScore == 0
x <- x[x$sumScore != 0L, , drop = FALSE]
## choose groups with sumScore > 1
## sample sumScore - 1 of the rows where score == 1L
foo <- function(x) {
scr <- unique(x$sumScore) ## sanity & take only 1 of the sumScore
## which of the group's observations have score = 1L
want <- which(x$score == 1L)
## want to sample all bar one of these
want <- sample(want, scr-1)
## remove the selected rows & return
x[-want, , drop = FALSE]
}
## which rows are samples with group sumScore > 1 (kept as a logical vector,
## so negating it below is safe even when no group qualifies)
want <- x$sumScore > 1L
## select only those samples, split up those samples by group, lapplying foo
## to each group, then rbind the resulting data frames together
newX <- do.call(rbind,
lapply(split(x[want, , drop = FALSE], x[want, "group"]),
FUN = foo))
## bind the sampled sumScore > 1L on to x (without sumScore > 1L)
newX <- rbind(x[!want, , drop = FALSE], newX)
## remove row labels
rownames(newX) <- NULL
## return the data without the sumScore column
newX[, 1:3]
}
With your data:
dat <- data.frame(group = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
gives:
> set.seed(42)
> process(dat)
group member score
1 1 1 0
2 1 2 1
3 1 3 0
4 3 1 1
5 3 2 0
6 4 1 0
7 4 3 1
8 4 5 0
Which is, I think, what was wanted.
Update: In process() above, the internal function foo() could be rewritten to sample just one row with score == 1L to keep, and remove the other such rows. That is, replace foo() with the one below:
foo <- function(x) {
## which of the group's observations have score = 1L
ones <- which(x$score == 1L)
## sample just one of these to keep (indexing by position avoids
## sample()'s special handling of a single number)
keep <- ones[sample.int(length(ones), 1L)]
## remove the other score == 1L rows & return
## (foo() is only called on groups with at least two such rows)
x[-setdiff(ones, keep), , drop = FALSE]
}
They are the same operations essentially but foo() that selects just 1 row makes the intended behaviour explicit; we want to select 1 row at random from those with score == 1L, rather than sample scr-1 values.
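For comparison, the whole task can also be sketched as a single per-group helper in base R. This is only a sketch, assuming score contains nothing but 0s and 1s; keep_one() is a name made up for illustration and is not part of the answer above.

```r
# Same data as in the question
dat <- data.frame(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
                  member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
                  score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))

# keep_one() is a hypothetical helper; it assumes score is only ever 0 or 1
keep_one <- function(g) {
  ones <- which(g$score == 1)
  # no member scored 1: drop the whole group (return a zero-row data frame)
  if (length(ones) == 0) return(g[0, , drop = FALSE])
  # pick one score == 1 row by position; this sidesteps sample()'s
  # special handling of a single number
  keep <- ones[sample.int(length(ones), 1)]
  # keep every score == 0 row plus the one chosen score == 1 row
  g[sort(c(which(g$score == 0), keep)), , drop = FALSE]
}

res <- do.call(rbind, lapply(split(dat, dat$group), keep_one))
rownames(res) <- NULL
res
```

Which member survives in groups 3 and 4 depends on the random draw, so call set.seed() first if you need a reproducible result.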
I would define a function that does what you want it to. Then use ddply and split by group.
myfun <- function(x) {
if(sum(x$score)==1) {
return(x)
} else if(sum(x$score)==0) {
return(data.frame())
} else {
row.names(x) <- NULL
score.1 <- sample(as.integer(row.names(x[x$score==1,])), nrow(x[x$score==1,])-1)
return(x[-score.1,])
}
}
library(plyr)
ddply(as.data.frame(dat), .(group), myfun)
group member score
1 1 1 0
2 1 2 1
3 1 3 0
4 3 1 1
5 4 1 0
6 4 2 1
7 4 3 1
ugroups<-unique(data[,1])
scores<-sapply(ugroups,function(x){sum(data[,1]==x & data[,3]==1)})
data[data[,1]%in%ugroups[scores>0],]
....... etc
will give you the cumulative scores for each group etc
Related
I have panel data like id <- c(1,1,1,1,1,1,1,2,2,2,2,2), intm <- c(1,1,0,0,1,0,0,0,0,0,1,1). The data frame is
dta <- data.frame(cbind(id,intm)) which gives:
id intm
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 0
7 1 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 1
I would like to replace the subsequent values of "intm" variable by the first value within the ID variable. That is for ID=1, the first value is 1, so the intm should have all values as 1 and for ID=2, intm should have all values as 0. The data should be like
id <- c(1,1,1,1,1,1,1,2,2,2,2,2),intm <- c(1,1,1,1,1,1,1,0,0,0,0,0) with data frame
dta <- data.frame(cbind(id,intm))
How can I do this in R by looping or any other means? I have a big data set.
Consider the following code:
new_column <- c(); i <- 1; # new column to be created
# loop
for (j in unique(dta$id)){ # let's separate the unique values of ID
index <- which(dta$id==j) # which row index satisfy id==1, or id==2, ...?
value <- dta$intm[index[1]] # which value of intm corresponds to the first value of the index?
new_column[i:tail(index,n = 1)] <- rep(value,nrow(dta[dta$id==j,])) # repeat this value once for every row that has this ID
i <- tail(index,n = 1)+1 # the new_column component must start with its last value + 1
}
dta <- cbind(dta,new_column)
Alternatively, you can use the subset() function, i.e.
rep(value,nrow(subset(dta,dta$id==j)))
You can use dplyr to do this.
There are other fancier ways to do this but I think dplyr is more graceful.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
df = data.frame(id, intm)
library(dplyr)
df2 = df %>% group_by(id) %>% do({
.$intm = .$intm[1, drop = TRUE]
.
})
You can also try the data.table library, which should be faster.
BTW: you do not need cbind to make a data.frame.
Consider ave + head
dta$intm <- with(dta, ave(intm, id, FUN=function(x) head(x, 1)))
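A fully vectorised variant is also possible with match(), which returns the position of the first occurrence of each value. The sketch below uses the data from the question; first_pos is just an illustrative name.

```r
dta <- data.frame(id   = c(1,1,1,1,1,1,1,2,2,2,2,2),
                  intm = c(1,1,0,0,1,0,0,0,0,0,1,1))

# for every row, the row number at which its id first appears
first_pos <- match(dta$id, dta$id)

# look up intm at that first row
dta$intm <- dta$intm[first_pos]
dta
```

Because there is no per-group function call, this tends to scale well to large data, and it does not require the rows to be sorted by id.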
I am trying to mutate() a 0 or 1 at a specific position in a column. Normally mutate() changes the whole column, but I want to check a condition and then place a value at a specific position, using something like an index. Here is an example: I have values and want to compare them pairwise, 10 to 16, 16 to 9, and so on. The criterion is whether the two values are either both in a or both not in a (0), or one is in a and the other is not (1). I wrote down an approach, but it seems mutate() does not allow TaskS[i+1].
Thanks for your help!
Index  Values  TaskS
1      10
2      16      1
3      9       1
4      8       0
a <- c(1:10)
data_time_filter <- mutate(data_time_filter, TaskS = '')
for (i in 1:40){
current <- data_time_filter$Trial_Id[i] %in% a
adjacent <- data_time_filter$Trial_Id[i+1] %in% a
if (current == adjacent){
data_time_filter <- mutate(data_time_filter, TaskS[i+1] = 0)
}
else if (current != adjacent){
data_time_filter <- mutate(data_time_filter, TaskS[i+1] = 1)
}
}
I am not really sure if I understand your question correctly but I will try to help anyway.
In my approach I used a user-defined function in combination with sapply. I believe that for mutate to work correctly you need a vector output, which you won't get from a loop that assigns one element at a time.
So, here is what I did:
# Recreate df
data_time_filter <- data.frame(
index = 1:4,
Values = c(10, 16, 9, 8)
)
# Create filter
ff <- c(1:10)
# Add empty TaskS column
data_time_filter <- data_time_filter %>%
mutate(TaskS = '')
# Define a function
abc <- function(data, filter){
l <- length(data)
sapply(1:l, function(x){
if(x == 1){
""
} else {
current <- data[x-1] %in% filter
adjacent <- data[x] %in% filter
if(current == adjacent){
0
} else {
1
}
}
})
}
This approach will let you use mutate:
> data_time_filter
index Values TaskS
1 1 10
2 2 16
3 3 9
4 4 8
> data_time_filter %>%
mutate(TaskS = abc(Values, ff))
index Values TaskS
1 1 10
2 2 16 1
3 3 9 1
4 4 8 0
You could even skip making placeholder TaskS column and create a new one:
> data_time_filter %>%
mutate(TskS_new = abc(Values, ff))
index Values TaskS TskS_new
1 1 10
2 2 16 1
3 3 9 1
4 4 8 0
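The same comparison can also be written without sapply by comparing the membership vector with a shifted copy of itself. This is a base R sketch on the example data; it uses NA rather than "" for the first row so the column stays integer, and in_a and n are illustrative names.

```r
data_time_filter <- data.frame(index = 1:4, Values = c(10, 16, 9, 8))
a <- 1:10

# TRUE where the value is in a
in_a <- data_time_filter$Values %in% a
n <- length(in_a)

# compare each element with its predecessor: 1 if membership differs, 0 if not;
# the first row has no predecessor, so it gets NA
data_time_filter$TaskS <- c(NA, as.integer(in_a[-1] != in_a[-n]))
data_time_filter
```

The same expression can go inside mutate(), since it returns one value per row. It assumes the data frame has at least one row.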
H_D<-function(level, zero, ...){
special<-c(0,0,0)
D<-list(special,...)
cell <- do.call(expand.grid, lapply(level, seq)) # create all cell
support <- apply(cell, 1, function(x) +(x != zero)) # create all support set
# provide subset H_D (support sets and given vectors matches
hd<-lapply(D, function (x) cell[colSums(support==x)==length(x),])
h_D<-do.call(rbind, hd)
rownames(h_D)<-1:nrow(h_D)
return(h_D)
}
level<-c(3,2,4)
zero<-c(1,2,1)
y<-c(0,1,1)
H_D(level,zero,y)
> H_D(level,zero,y)
Var1 Var2 Var3
1 1 2 1
2 1 1 2
3 1 1 3
4 1 1 4
My function works fine for the above situation, as colSums works for a data frame. But if my argument is a vector instead of a data frame, it does not work, and I get the following error. My input argument could be a vector or a data frame. How can I handle both in my above-mentioned function?
level = 3
zero = 2
y<-1
H_D(level,zero,y)
> H_D(level,zero,y)
Error in colSums(support == x) :
'x' must be an array of at least two dimensions
I tried drop=FALSE, but it did not work!
We could change the function with an if/else based on the number of columns of 'cell'. If it has one column, just do the subset; otherwise do the rest of the computation.
H_D <- function(level, zero, ...){
special <- c(0,0,0)
D <- list(special,...)
cell <- do.call(expand.grid, lapply(level, seq)) # create all cell
if(ncol(cell) == 1) {
h_D <- subset(cell, Var1 != zero)
} else {
support <- apply(cell, 1, function(x) +(x != zero)) # create all support set
# provide subset H_D (support sets and given vectors matches
hd <- lapply(D, function (x) cell[colSums(support==x)==length(x),])
h_D <- do.call(rbind, hd)
rownames(h_D) <- 1:nrow(h_D)
}
return(h_D)
}
Testing:
level <- 3
zero <- 2
y <- 1
H_D(level, zero, y)
# Var1
#1 1
#3 3
and the first case
level <- c(3,2,4)
zero <- c(1,2,1)
y <- c(0,1,1)
H_D(level,zero,y)
# Var1 Var2 Var3
#1 1 2 1
#2 1 1 2
#3 1 1 3
#4 1 1 4
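The error itself arises because apply() simplifies its result to a plain vector when 'cell' has a single column, and colSums() needs at least two dimensions. A possible alternative to the if/else, sketched here for the one-column case only, is to force the result back into a matrix with one row per column of 'cell'; support_vec is an illustrative name.

```r
level <- 3
zero  <- 2
y     <- 1

cell <- do.call(expand.grid, lapply(level, seq))

# with a single column, apply() returns a plain vector without dimensions
support_vec <- apply(cell, 1, function(x) +(x != zero))
is.null(dim(support_vec))

# restore the "one row per column of cell" layout that colSums() expects
support <- matrix(support_vec, nrow = ncol(cell))
dim(support)

# the original subsetting step now works; drop = FALSE keeps a data frame
res <- cell[colSums(support == y) == length(y), , drop = FALSE]
res
```

In the multi-column case apply() already returns a matrix with ncol(cell) rows, so wrapping it in matrix(..., nrow = ncol(cell)) should leave it unchanged, but check that against your own inputs.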
Problem solved, solution added at bottom of posting!
I'd like to know how to "fill" a data frame by inserting rows in between existing rows (not appending to the end).
My situation is following:
I have a data set with about 1700 cases and 650 variables
Certain variables have possible answer categories from 0 to 100 (question was: "How many percent..." -> people could fill in from 0 to 100)
Now I want to show the distribution of one of those variables (let's call it var) in a geom_area().
Problem:
1) I need an X-axis ranging from 0 to 100
2) Not all possible percentage values in var were chosen, for instance I have 30 times the answer "20%", but no answer "19%". For the x-Axis this means, the y-Value at x-position 19 is "0", the y-value at x-position 20 is "30".
To prepare my data (this one variable) for plotting with ggplot, I transformed it via the table function:
dummy <- as.data.frame(table(var))
Now I have a column "Var1" with the answer categories and a column "Freq" with the counts of each answer category.
In total, I have 57 rows, which means that 44 possible answers (values from 0 to 100 percent) were not stated.
Example (of my dataframe), "Var1" contains the given answers, "Freq" the counts:
Var1 Freq
1 0 1
2 1 16
3 2 32
4 3 44
5 4 14
...
15 14 1
16 15 169 # <-- See next row and look at "Var1"
17 17 2 # <-- "16%" was never given as answer
Now my question is: How can I create a new data frame which inserts a row after row 16 (with "Var1"=15) where I can set "Var1" to 16 and "Freq" to 0?
Var1 Freq
...
15 14 1
16 15 169
17 16 0 # <-- This line I like to insert
18 17 2
I've already tried something like this:
dummy_x <- NULL
dummy_y <- NULL
for (k in 0:100) {
pos <- which(dummy$Var1==k)
if (!is.null(pos)) {
dummy_x <- rbind(dummy_x, c(k))
dummy_y <- rbind(dummy_y, dummy$Freq[pos])
}
else {
dummy_x <- rbind(dummy_x, c(k))
dummy_y <- rbind(dummy_y, 0)
}
}
newdataframe <- data.frame(cbind(dummy_x), cbind(dummy_y))
which results in the error that dummy_x has 101 values (from 0 to 100, correct), but dummy_y only contains 56 rows?
The result should be plotted like this:
plot(ggplot(newdataframe, aes(x=Var1, y=Freq)) +
geom_area(fill=barcolors, alpha=0.3) +
geom_line() +
labs(title=fragetitel, x=NULL, y=NULL))
Thanks in advance,
Daniel
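A note on why the loop in the question never reaches its else branch: which() returns a zero-length integer vector when nothing matches, not NULL, so !is.null(pos) is always TRUE and missing values add nothing to dummy_y. The sketch below shows this on a small made-up frequency table and tests length(pos) instead.

```r
# hypothetical frequency table in which the answer 2 never occurs
dummy <- data.frame(Var1 = c(0, 1, 3), Freq = c(5, 2, 7))

pos <- which(dummy$Var1 == 2)
is.null(pos)       # FALSE: pos is integer(0), not NULL
length(pos) == 0   # TRUE: this is the test the loop needs

# fill in 0 for every value that has no row in the table
filled <- sapply(0:3, function(k) {
  pos <- which(dummy$Var1 == k)
  if (length(pos) > 0) dummy$Freq[pos] else 0
})
filled
```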
Solution for this problem
library(ggplot2) # for ggplot(), geom_area(), geom_line(), labs()
plotFreq <- function(var, ftitle=NULL, fcolor="blue") {
# create data frame from frequency table of var
# to get answer category and counts in separate columns
dummyf <- as.data.frame(table(var))
# rename to "x-axis" and "y-axis"
names(dummyf) <- c("xa", "ya")
# transform $xa from factor to numeric
dummyf$xa <- as.numeric(as.character(dummyf$xa))
# get maximum x-value for graph
maxval <- max(dummyf$xa)
# Create a vector of zeros, one slot for each value 0..maxval
frq <- rep(0,maxval+1)
# Replace the values in frq for those indices which match dummyf$xa
# by dummyf$ya so that the remaining zeros are the rows to insert
# (R indexes from 1, so the count for answer 0 goes into slot 1)
frq[dummyf$xa+1] <- dummyf$ya
# create new data frame
newdf <- as.data.frame(cbind(var = 0:maxval, frq))
# print plot
ggplot(newdf, aes(x=var, y=frq)) +
# fill area
geom_area(fill=fcolor, alpha=0.3) +
# outline
geom_line() +
# no additional labels on x- and y-axis
labs(title=ftitle, x=NULL, y=NULL)
}
I think this is a much simpler solution. Looping is not necessary. The idea is to create a vector the size of the desired result, with all values set to zero, and then replace the appropriate entries with the non-zero values from the frequency table.
> #Let's create sample data
> set.seed(12345)
> var <- sample(100, replace=TRUE)
>
>
> #Lets create frequency table
> x <- as.data.frame(table(var))
> x$var <- as.numeric(as.character(x$var))
> head(x)
var Freq
1 1 3
2 2 1
3 4 1
4 5 2
5 6 1
6 7 2
> #Create a vector of 0s
> freq <- rep(0, 100)
> #Replace the values in freq for those indices which equal x$var by x$Freq so that remaining
> #indices are ones which you intended to insert
> freq[x$var] <- x$Freq
> head(freq)
[1] 3 1 0 1 2 1
> #cbind data together
> freqdf <- as.data.frame(cbind(var = 1:100, freq))
> head(freqdf)
var freq
1 1 3
2 2 1
3 3 0
4 4 1
5 5 2
6 6 1
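Another way to get the zero rows, which also covers the answer 0, is to turn the variable into a factor with every possible value as a level before calling table(); levels that never occur are then counted as 0. A sketch with made-up answers:

```r
# hypothetical answers in percent; 16 and 19 were never given
var <- c(0, 15, 15, 17, 20, 20, 20)

# table() counts every level of a factor, including unused ones
counts <- table(factor(var, levels = 0:100))

freqdf <- data.frame(var = 0:100, freq = as.vector(counts))
freqdf[freqdf$var %in% 15:20, ]
```

This assumes the answers are whole numbers between 0 and 100; any value outside the given levels would become NA and not be counted.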
try something like this
insertRowToDF<-function(X,index_after,vector_to_insert){
stopifnot(length(vector_to_insert) == ncol(X)); # to check valid row to be inserted
X<-rbind(X[1:index_after,],vector_to_insert,X[(index_after+1):nrow(X),]);
row.names(X)<-1:nrow(X);
return (X);
}
you can call it with
df<-insertRowToDF(df,16,c(16,0)); # inserting the values (16,0) after the 16th row
This is Aditya's code plus some conditions to handle special cases:
insertRowToDF<-function(X,index_after,vector_to_insert){
stopifnot(length(vector_to_insert) == ncol(X)); # to check valid row to be inserted
if (index_after != 0) {
if (dim(X)[1] != index_after) {
X <- rbind(X[1:index_after,], vector_to_insert, X[(index_after+1):nrow(X),]);
} else {
X <- rbind(X[1:index_after,], vector_to_insert);
}
} else {
if (dim(X)[1] != index_after) {
X <- rbind(vector_to_insert, X[(1):nrow(X),]);
} else {
X <- rbind(vector_to_insert);
}
}
row.names(X)<-1:nrow(X);
return (X);
}
I am developing a censored dependent variable for use in survival analysis. My goal is to find the last time ("time") that someone answers a question in a survey (e.g. the point where "q.time" is coded as "1", and "q.time+1" and q at all subsequent times are coded as "0").
By this logic, the last question answered should be coded as "1" (q.time). The first question that is NOT answered (q.time+1) should be coded as "0". And all questions subsequent to the first question NOT answered should be coded as "NA". I then want to remove ALL rows where the DV=NA from my dataset.
A very generous coworker has helped me to develop the following code, but he's on vacation now and it needs a little more lovin'. Code is as follows:
library(plyr) # for ddply
library(stats) # for reshape(...)
# From above
dat <- data.frame(
id=c(1, 2, 3, 4),
q.1=c(1, 1, 0, 0),
q.2=c(1, 0, 1, 0),
dv.1=c(1, 1, 1, 1),
dv.2=c(1, 1, 0, 1))
# From above
long <- reshape(dat,
direction='long',
varying=c('q.1', 'q.2', 'dv.1', 'dv.2'))
ddply(long, .(id), function(df) {
# figure out the dropoff time
answered <- subset(df, q == 1)
last.q = max(answered$time)
subs <- subset(df, time <= last.q + 1)
# set all the dv as desired
new.dv <- rep(last.q,1)
if (last.q < max(df$time)) new.dv <- c(0,last.q)
subs$dv <- new.dv
subs
})
Unfortunately, this yields the error message:
"Error in `$<-.data.frame`(`*tmp*`, "dv", value = c(0, -Inf)) :
replacement has 2 rows, data has 0"
Any ideas? The problem seems to be located in the "rep" command, but I'm a newbie to R. Thank you so much!
UPDATE: SEE EXPLANATIONS BELOW, and then REFER TO FOLLOW-UP QUESTION
Hi there-I completely followed you, and really appreciate the time you took to help me out. I went back into my data and coded in a dummy Q where all respondents have a value of "1" - but, discovered where the error really may be. In my real data set, I have 30 questions (i.e., 30 times in long form). After I altered the dataset so FOR SURE q==1 for all id variables, the error message changed to saying
"Error in `$<-.data.frame`(`*tmp*`, "newvar", value = c(0, 29)) : replacement has 2 rows, data has 31"
If the problem is with the number of rows assigned to subs, then is the source of the error coming from...
subs <- subset(df, time <= last.q + 1)
i.e., $time <= last.q + 1$ is setting the number of rows to the value EQUAL to last.q+1?
UPDATE 2: What, ideally, I'd like my new variable to look like!
id time q dv
1 1 1 1
1 2 1 1
1 3 1 1
1 4 1 1
1 5 0 0
1 6 0 NA
2 1 1 1
2 2 1 1
2 3 0 0
2 4 0 NA
2 5 0 NA
2 6 0 NA
Please note that "q" can vary between "0" or "1" over time (See the observation for id=1 at time=2), but due to the nature of survival analysis, "dv" cannot. What I need to do is create a variable that finds the LAST time that "q" changes between "1" and "0", and then is censored accordingly. After Step 4, my data should look like this:
id time q dv
1 1 1 1
1 2 1 1
1 3 1 1
1 4 1 1
2 1 1 1
2 2 1 1
2 3 0 0
.(id) in plyr is equivalent to
> dum<-split(long,long$id)
> dum[[4]]
id time q dv
4.1 4 1 0 1
4.2 4 2 0 1
your problem is in your 4th split. You reference
answered <- subset(df, q == 1)
in your function. This is an empty set as there are no dum[[4]]$q taking value 1
If you just want to ignore this split then something like
ans<-ddply(long, .(id), function(df) {
# figure out the dropoff time
answered <- subset(df, q == 1)
if(length(answered$q)==0){return()}
last.q = max(answered$time)
subs <- subset(df, time <= last.q + 1)
# set all the dv as desired
new.dv <- rep(last.q,1)
if (last.q < max(df$time)) new.dv <- c(0,last.q)
subs$dv <- new.dv
subs
})
> ans
id time q dv
1 1 1 1 2
2 1 2 1 2
3 2 1 1 0
4 2 2 0 1
5 3 1 0 2
6 3 2 1 2
would be the result
In short: The error is because there is no q == 1 when id == 4.
A good way to check what's going on here is to rewrite the function separately, and manually test each chunk that ddply is processing.
So first rewrite your code like this:
myfun <- function(df) {
# figure out the dropoff time
answered <- subset(df, q == 1)
last.q = max(answered$time)
subs <- subset(df, time <= last.q + 1)
# set all the dv as desired
new.dv <- rep(last.q,1)
if (last.q < max(df$time)) new.dv <- c(0,last.q)
subs$dv <- new.dv
subs
}
ddply(long, .(id), myfun )
That still gives an error of course, but at least now we can manually check what ddply is doing.
ddply(long, .(id), myfun ) really means:
Take the dataframe called long
Create a number of subset dataframes (one for each distinct id)
Apply the function myfun to each subsetted dataframe
Reassemble the results into a single dataframe
So let's attempt to do manually what ddply is doing automatically.
> myfun(subset(long, id == 1))
id time q dv
1.1 1 1 1 2
1.2 1 2 1 2
> myfun(subset(long, id == 2))
id time q dv
2.1 2 1 1 0
2.2 2 2 0 1
> myfun(subset(long, id == 3))
id time q dv
3.1 3 1 0 2
3.2 3 2 1 2
> myfun(subset(long, id == 4))
Error in `$<-.data.frame`(`*tmp*`, "dv", value = c(0, -Inf)) :
replacement has 2 rows, data has 0
In addition: Warning message:
In max(answered$time) : no non-missing arguments to max; returning -Inf
>
So it seems like the error is coming from the step where ddply applies the function for id == 4.
Now let's take the code outside of the function so we can examine each chunk.
> #################
> # set the problem chunk to "df" so we
> # can examine what the function does
> # step by step
> ################
> df <- subset(long, id == 4)
>
> ###################
> # run the bits of function separately
> ###################
> answered <- subset(df, q == 1)
> answered
[1] id time q dv
<0 rows> (or 0-length row.names)
> last.q = max(answered$time)
Warning message:
In max(answered$time) : no non-missing arguments to max; returning -Inf
> last.q
[1] -Inf
> subs <- subset(df, time <= last.q + 1)
> subs
[1] id time q dv
<0 rows> (or 0-length row.names)
> # set all the dv as desired
> new.dv <- rep(last.q,1)
> new.dv
[1] -Inf
> if (last.q < max(df$time)) new.dv <- c(0,last.q)
> subs$dv <- new.dv
Error in `$<-.data.frame`(`*tmp*`, "dv", value = c(0, -Inf)) :
replacement has 2 rows, data has 0
> subs
[1] id time q dv
<0 rows> (or 0-length row.names)
>
So the error that you're getting comes from subs$dv <- new.dv, because new.dv has length two (i.e. two values: 0 and -Inf) but subs$dv has length 0. That wouldn't be a problem if dv were a simple vector, but because it's a column in the subs data frame, whose columns all have zero rows, the replacement for subs$dv must also have zero rows.
The reason subs has zero rows is that there is no q == 1 when id == 4.
Should the final data frame not have anything for id == 4? The answer to your problem really depends on what you want to happen in the case when there are no q==1 for an id. Just let us know, and we can help you with the code.
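Whatever you decide should happen for such an id, the function needs an explicit check for the "nothing answered" case before calling max(). A minimal sketch of that check, using a made-up chunk shaped like the one for id == 4 (last_q_safe is an illustrative name, and returning NA is only one possible choice):

```r
# hypothetical chunk in which no question was answered
df <- data.frame(id = 4, time = 1:2, q = c(0, 0))

answered <- subset(df, q == 1)
nrow(answered)   # 0 rows

# max() of an empty vector gives -Inf (with a warning)
last.q <- suppressWarnings(max(answered$time))
last.q

# guard: only take the maximum when there is something to take it of
last_q_safe <- if (nrow(answered) > 0) max(answered$time) else NA
last_q_safe
```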
UPDATE:
The error that you're getting is because subs$dv has 31 values in it and new.dv has two values in it.
In R when you try to assign a longer vector to a shorter vector, it will always complain.
> test <- data.frame(a=rnorm(100),b=rnorm(100))
> test$a <- rnorm(1000)
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(-0.0507065994549323, :
replacement has 1000 rows, data has 100
>
But when you assign a shorter vector to a longer one, it will only complain if the length of the longer vector is not a whole multiple of the length of the shorter one (e.g. 3 does not go evenly into 100).
> test$a <- rnorm(3)
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(-0.897908251650798, :
replacement has 3 rows, data has 100
But if you tried this, it wouldn't complain since 2 goes into 100 evenly.
> test$a <- rnorm(2)
>
Try this:
> length(test$a)
[1] 100
> length(rnorm(2))
[1] 2
> test$a <- rnorm(2)
> length(test$a)
[1] 100
>
What it's doing is silently repeating the shorter vector to fill up the longer vector.
And again, what you do to get around the error (i.e. make both vectors the same length) will depend on what you're trying to achieve. Do you make new.dv shorter, or subs$dv longer?
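Here is the recycling behaviour in a form you can check for yourself. Fixed values are used instead of rnorm() so the result is easy to inspect; err is just an illustrative name for the captured error message.

```r
test <- data.frame(a = rnorm(100), b = rnorm(100))

# length 2 divides 100, so the values are silently recycled 50 times
test$a <- c(0, 1)
head(test$a)

# length 3 does not divide 100, so the assignment fails;
# tryCatch() captures the message instead of stopping the script
err <- tryCatch({
  test$b <- c(0, 1, 2)
  NULL
}, error = function(e) conditionMessage(e))
err
```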
First, to give credit where credit is due, the code below is not mine. It was generated in collaboration with another very generous coworker (and engineer) who helped me work through my problem (for hours!).
I thought that other analysts tasked with constructing a censored variable from survey data might find this code useful, so I am passing the solution along.
library(plyr)
#A function that only selects cases before the last time "q" was coded as "1"
slicedf <- function(df.orig, df=NULL) {
if (is.null(df)) {
return(slicedf(df.orig, df.orig))
}
if (nrow(df) == 0) {
return(df)
}
target <- tail(df, n=1)
#print(df)
#print('--------')
if (target$q == 0) {
return(slicedf(df.orig, df[seq_len(nrow(df) - 1), ]))
}
if (nrow(df.orig) == nrow(df)) {
return(df.orig)
}
return(df.orig[1:(nrow(df) + 1), ])
}
#Applies function to the dataset, and codes over any "0's" before the last "1" as "1"
long <- ddply(long, .(id), function(df) {
df <- slicedf(df)
if(nrow(df) == 0) {
return(df)
}
q <- df$q
if (tail(q, n=1) == 1) {
df$q <- rep(1, length(q))
} else {
df$q <- c(rep(1, length(q) - 1), 0)
}
return(df)
})
Thanks to everyone online who commented for your patience and help.
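For readers who prefer to avoid the recursion, the same kind of censoring can be sketched with plain indexing. This is not the code above, only one reading of the goal: dv is 1 up to the last answered question, 0 at the first unanswered one after it, and later rows are dropped. It assumes time runs 1, 2, 3, ... without gaps within each id; censor() and the example data are made up for illustration.

```r
# hypothetical long-format data: three respondents, six questions each
long <- data.frame(
  id   = rep(1:3, each = 6),
  time = rep(1:6, times = 3),
  q    = c(1,0,1,1,0,0,  1,1,0,0,0,0,  0,0,0,0,0,0)
)

# censor() is a hypothetical helper, applied to one respondent at a time
censor <- function(df) {
  df <- df[order(df$time), , drop = FALSE]
  answered <- which(df$q == 1)
  if (length(answered) == 0) {
    # respondent never answered: return no rows, but with a dv column
    out <- df[0, , drop = FALSE]
    out$dv <- integer(0)
    return(out)
  }
  last <- max(answered)
  # keep everything up to the first unanswered question after 'last'
  out <- df[seq_len(min(last + 1, nrow(df))), , drop = FALSE]
  # dv is 1 through the last answered question and 0 on the row after it
  out$dv <- as.integer(seq_len(nrow(out)) <= last)
  out
}

res <- do.call(rbind, lapply(split(long, long$id), censor))
rownames(res) <- NULL
res
```

Unlike the ddply code above, this leaves q untouched and puts the censored indicator in a separate dv column.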