R apply returning rows but want columns

I am trying to clean survey data where sometimes minutes information is entered in the hours field. The cleaning rules say to move the hours value to the minutes field if mins is missing or 0 and hours is 15 or 30 or 45 or 60.
Here is the function I wrote:
cleanHrMin <- function(x) {
  h <- x[1]
  m <- x[2]
  # Move hours into minutes when hours looks like a minutes value
  # and minutes is missing or zero.
  if (!is.na(h) && h %in% c(15, 30, 45, 60) && (is.na(m) || m == 0)) {
    return(c(0, h))
  } else {
    return(x)
  }
}
With test data:
df <- as.data.frame(cbind(hrs = c(1,15,0), mins = c(10,NA,15)))
I run the function as follows:
as.data.frame(apply(df,1,cleanHrMin))
and get output:
V1 V2 V3
1 1 0 0
2 10 15 15
But what I want is:
V1 V2
1 1 10
2 0 15
3 0 15
What am I doing wrong?

You're doing the correct apply, across rows. You simply need to transpose the output.
> as.data.frame(t(apply(df,1,cleanHrMin)))
V1 V2
1 1 10
2 0 15
3 0 15
This is because apply stores the vector returned by each call as a column of the result matrix, whether the margin is over rows or over columns. So to go from rows to rows, you must transpose.
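A quick way to see this is to look at the dimensions of the result. This is only a sketch: it uses an identity function in place of cleanHrMin, and the names res and res_t are introduced here for illustration.

```r
# Test data with the same shape as in the question: 3 rows, 2 columns
df <- data.frame(hrs = c(1, 15, 0), mins = c(10, NA, 15))

# Each call receives one row (length 2) and returns a length-2 vector.
# apply() binds those vectors together as columns.
res <- apply(df, 1, function(x) x)
dim(res)      # 2 rows, 3 columns: one column per input row

# Transposing puts one input row per output row again
res_t <- t(res)
dim(res_t)    # 3 rows, 2 columns
```

The transposed matrix has the same shape as the original data frame, which is why wrapping the call in t() gives the desired layout.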

Related

Row comparison in R

I have the below data frame,
R_Number A
1 0
2 15
3 10
4 11
5 12
6 18
7 19
8 15
9 17
10 11
Now I need to create another column B that holds the result of comparing values in A. The comparison is not between two consecutive rows: row 1 is compared with row 4, row 2 with row 5, and so on until the end of the data. The condition for the comparison result is:
if (A[1] >= 15 && A[4] <= 12) {
  B[1] <- 1
} else if (A[1] <= 0 && A[4] >= 10) {
  B[1] <- 2
} else {
  B[1] <- 0
}
Rows 8, 9 and 10 have no row three positions ahead to compare with, so their value should be 0.
Also, the comparison result of rows 1 and 4 is stored in row 1; similarly, the comparison result of rows 2 and 5 is stored in row 2.
So the resulting dataframe should be as shown below
R_Number A B
1 0 2
2 15 1
3 10 0
4 11 0
5 12 0
6 18 0
7 19 1
8 15 0
9 17 0
10 11 0
Following @nicola's comment, I tried to solve your problem as well.
I recreated your initial data frame:
df <- data.frame(R_Number = c(1:10), A = c(0,15,10,11,12,18,19,15,17,11), B = 0)
I then used an if statement inside a for loop:
for (i in seq_len(max(nrow(df) - 3, 0))) {
  # Compare row i with the row three positions below it
  if (df$A[i] >= 15 && df$A[i + 3] <= 12) {
    df$B[i] <- 1
  } else if (df$A[i] <= 0 && df$A[i + 3] >= 10) {
    df$B[i] <- 2
  } else {
    df$B[i] <- 0
  }
}
With last edit I solved the problem that came up when the length of data frame changed.
Now you have a generic solution!
First lagging the variable and then computing your new variable should work. Something like this:
library(Hmisc)
df <- data.frame(R_Number = c(1:10), A = c(0,15,10,11,12,18,19,15,17,11))
A_Lag<-Lag(df$A,-3)
df$B <- rowSums(cbind(df$A>=15 & A_Lag <= 12,(df$A<=0 & A_Lag>= 10)*2),na.rm= T)
df$B
I tried to avoid if statements. The Lag function can be found in the Hmisc package.
> df$B
[1] 2 1 0 0 0 0 1 0 0 0
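For comparison, here is a base R sketch of the same lagging idea without Hmisc. The offset k and the helper vector lead_A are names introduced here for illustration; the conditions are the ones from the question.

```r
df <- data.frame(R_Number = 1:10,
                 A = c(0, 15, 10, 11, 12, 18, 19, 15, 17, 11))

k <- 3  # row i is compared with row i + k

# Shift A up by k positions and pad the tail with NA
lead_A <- c(df$A[-seq_len(k)], rep(NA, k))

# Rows without a partner (NA in lead_A) get 0
df$B <- ifelse(is.na(lead_A), 0,
        ifelse(df$A >= 15 & lead_A <= 12, 1,
        ifelse(df$A <= 0  & lead_A >= 10, 2, 0)))
df$B
```

On the example data this gives 2 1 0 0 0 0 1 0 0 0, the same as the Hmisc version.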

How to extract all rows between start signal and end signal?

I have the following df and I would like to extract all rows based on the following start and end signals.
Start signal : When status changes from 1 to 0
End signal : When status changes from 0 to -1.
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
time status
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
6 6 0
7 7 -1
8 8 0
9 9 1
10 10 0
11 11 0
12 12 0
13 13 -1
14 14 0
Desire:
time status
4 4 0
5 5 0
6 6 0
10 10 0
11 11 0
12 12 0
Here's a possible solution using the data.table package. I first group by the cumulative count of status == 1 appearances and then check, per group, whether there is also a status == -1. If so, I subset the group from its second row up to the row just before the -1.
library(data.table)
setDT(df)[, indx := cumsum(status == 1)]
df[, if(any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
# indx time status
# 1: 2 4 0
# 2: 2 5 0
# 3: 2 6 0
# 4: 3 10 0
# 5: 3 11 0
# 6: 3 12 0
We flag start and end markers, then use those flags together with the running difference cumsum(start) - cumsum(end) to filter rows. The condition (cumsum(start)-cumsum(end)>1) is a slight fiddle to keep the cumulative counts from being upset by row 2, which starts but doesn't end; otherwise row 14 would wrongly be included.
library(dplyr)
df %>% mutate(start = (status == 1), end = (status == -1)) %>%
  filter(!start & !end & (cumsum(start) - cumsum(end) > 1)) %>%
  select(-start, -end)
# time status
# 1 4 0
# 2 5 0
# 3 6 0
# 4 10 0
# 5 11 0
# 6 12 0
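To make the filter condition easier to follow, the intermediate values can be computed step by step in base R. This is a sketch on the status vector from the question; balance and keep are names introduced here, and the threshold > 1 is specific to this data, as noted above.

```r
status <- c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0)

start <- status == 1     # rows carrying a start marker
end   <- status == -1    # rows carrying an end marker

# Running count of starts minus running count of ends
balance <- cumsum(start) - cumsum(end)
balance

# Keep rows that are neither markers nor outside an open start/end window
keep <- !start & !end & balance > 1
which(keep)
```

Here which(keep) returns rows 4, 5, 6, 10, 11 and 12, which are the rows in the desired output.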
A little ugly, but you can always just loop over the values and keep a flag for determining whether the element should be kept or not.
keepers <- rep(FALSE, nrow(df))
flag <- FALSE
for (i in 1:(nrow(df) - 1)) {
  if (df$status[i] == 1 && df$status[i + 1] == 0) {
    flag <- TRUE
    next # keep signal index false
  }
  if (df$status[i] == -1 && df$status[i + 1] == 0) {
    flag <- FALSE
    next # keep signal index false
  }
  keepers[i] <- flag
}
keepers[nrow(df)] <- flag # Set the last element to final flag value
newdf <- df[keepers, ] # subset based on the T/F values determined
Do you have some more data (or can you generate some more data whose outcome you know) to see whether these approaches generalize?
Two similar approaches:
library(stringr)
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
dfr <- rle(df$status)
# first approach
find_seq_str <- function() {
  str_locate_all(paste(gsub("-1", "x", dfr$values), collapse=""), "10")[[1]][,2]
}
df[as.vector(sapply(find_seq_str(),
                    function(n) {
                      i <- sum(dfr$lengths[1:(n-1)])
                      tail(i:(i+dfr$lengths[n]), -1)
                    })),]
# second approach
find_seq_ts <- function() {
  which(apply(embed(dfr$values, 2), 1, function(x) all(x == c(0, 1))))
}
df[as.vector(sapply(find_seq_ts(),
                    function(n) {
                      i <- sum(dfr$lengths[1:(n)])+1
                      head(i:(i+dfr$lengths[n+1]), -1)
                    })),]
Both approaches need a run length encoding of the status vector.
The first replaces -1 with a single character so that the run values form an unambiguous, contiguous string. It then uses str_locate_all to find the "10" pairs that mark where a target sequence starts, and rebuilds the ranges of zeroes from the rle lengths.
If it needs to be base R I can try to whip up something with regexpr.
The second builds a paired matrix and compares for the same target sequence.
Caveats:
I did no benchmarking
Both create potentially big things if status is big.
I'm not completely positive it generalizes (hence my initial question).
David's is far more readable, maintainable & transferrable code but you get to deal with all the "goodness" that comes with using data.table ;-)
I wrapped the approaches in functions as they could potentially then be parameterized, but you could just as easily just assign the value to a variable or shove it into the sapply (ugh, tho).
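As a follow-up to the remark about base R, here is a minimal sketch that uses gregexpr instead of stringr. It assumes status only takes the values -1, 0 and 1, and, unlike the code above, it matches the full run pattern 1, 0, -1. The names sym, s, hits, idx, starts, ends and rows are introduced here for illustration.

```r
status <- c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0)
r <- rle(status)

# Map run values -1/0/1 to the single characters "x"/"0"/"1"
sym <- c("x", "0", "1")[r$values + 2]
s <- paste(sym, collapse = "")

# Find a run of 1 followed by a run of 0 and then a run of -1
hits <- gregexpr("10x", s, fixed = TRUE)[[1]]
hits <- as.integer(hits[hits > 0])   # gregexpr returns -1 when nothing matches
idx <- hits + 1L                     # position of the zero run in each match

# First and last row covered by every run
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1L

# Expand the matched zero runs back into row numbers
rows <- unlist(Map(seq, starts[idx], ends[idx]))
rows
```

On the example data rows is 4 5 6 10 11 12, so df[rows, ] would give the desired subset. Like the two approaches above, this has not been benchmarked.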

R: ddply for an updating function

I have a silly function that steps through the vector ACC and updates the value of S by delta1 or delta2 at each step.
Sstart=0 #a starting value for S
ACC=c(1,1,0,1,1) #accuracy: 0 or 1
f=c(1,1,1,1,0) #feedback: 0 or 1
ID=rep(1,5) #ID of the participant
delta1=seq(1,5,1)
delta2=seq(1,5,1)
m<-as.matrix(expand.grid(delta1=delta1, delta2=delta2)) #all the possible combination of delta1 and delta2
The function is the following. When ACC is 1 and the feedback (f) is 1, it updates S by delta1; when ACC is 1 and the feedback is 0, it updates S by delta2. delta1 and delta2 each range from 1 to 5, and I increment them separately.
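silly_function <- function(delta1, delta2, ACC, f, Sstart) {
  S <- Sstart
  for (i in seq_along(ACC)) {
    if (ACC[i] == 1 && f[i] == 1) {
      S[i + 1] <- S[i] + delta1   # accurate, with feedback
    } else if (ACC[i] == 1 && f[i] == 0) {
      S[i + 1] <- S[i] + delta2   # accurate, without feedback
    } else if (ACC[i] == 0) {
      S[i + 1] <- S[i]            # inaccurate: S is carried over
    }
  }
  return(S)
}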
I call the function
N=length(delta1)*length(delta2)
SMat<-matrix(data=NA, nrow=N, ncol=(length(ACC)+1)) #matrix for the data
for (i in 1:N) {
  SMat[i, ] <- silly_function(m[i, 1], m[i, 2], ACC, f, Sstart)
}
My problem:
The function works perfectly for 1 subject, but I cannot find a clever way for applying it for all my subjects separately (I have a data frame where all the data from my participants are in one data frame) and combine the results into one matrix or data frame. I wanted to use ddply from the plyr package, but I couldn’t find an example similar to mine to modify it and get some idea how to implement it in this case.
Thank you very much for your comments/hints in advance!
My input for two participants
ID Feedback ACC
1 1 1
1 1 1
1 1 0
1 1 1
1 0 1
2 1 1
2 1 1
2 0 1
2 1 0
2 1 1
Actual output
V1 V2 V3 V4 V5 V6
0 1 2 2 3 4 #row1
0 2 4 4 6 7
.
.
0 5 10 10 15 20 #row25
For subject 1 the output is:
25 * 6 (rows*columns) matrix: 25 rows because I update S by all possible combinations of delta1 and delta2. the first column is always 0 because Sstart is 0.
Desired output
Basically the same as for subject 1 but with all the subjects data
1-25 rows for subject 1
26-50 rows for subject 2...
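One possible way to get this layout is sketched below in base R (split, lapply and mapply) rather than with ddply. It is only an illustration under some assumptions: the data frame dat is rebuilt from the two-participant input above, and the names dat, grid, per_subject and result are introduced here.

```r
silly_function <- function(delta1, delta2, ACC, f, Sstart) {
  S <- Sstart
  for (i in seq_along(ACC)) {
    if (ACC[i] == 1 && f[i] == 1) {
      S[i + 1] <- S[i] + delta1
    } else if (ACC[i] == 1 && f[i] == 0) {
      S[i + 1] <- S[i] + delta2
    } else if (ACC[i] == 0) {
      S[i + 1] <- S[i]
    }
  }
  S
}

# Input for two participants, as listed in the question
dat <- data.frame(ID       = rep(1:2, each = 5),
                  Feedback = c(1, 1, 1, 1, 0,  1, 1, 0, 1, 1),
                  ACC      = c(1, 1, 0, 1, 1,  1, 1, 1, 0, 1))

# All 25 combinations of delta1 and delta2
grid <- expand.grid(delta1 = 1:5, delta2 = 1:5)

# Run the function for every combination, one subject at a time
per_subject <- lapply(split(dat, dat$ID), function(d) {
  S <- t(mapply(silly_function, grid$delta1, grid$delta2,
                MoreArgs = list(ACC = d$ACC, f = d$Feedback, Sstart = 0)))
  data.frame(ID = d$ID[1], grid, S)
})

# Stack the per-subject blocks: rows 1-25 for subject 1, 26-50 for subject 2
result <- do.call(rbind, per_subject)
dim(result)
```

Each row of result carries the subject ID and the delta1/delta2 pair next to the six S values, so the blocks stay identifiable after stacking. The same per-subject function could in principle be passed to a grouping helper such as ddply, but that is not shown here.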

R ffdfdply reset cumsum using data.table

Sorry for asking a basic question. I am using the ff package and used read.csv.ffdf to import the data. I have more than 50 million rows in Excel and I want to compute a cumulative sum on one of the columns, resetting it whenever it finds a 0. I have the code below to generate the cumulative series, but I don't know how to access the current row.
idx <- ffdforder(i[c("a","c","b")])
ordered_i <- i[idx, ]
ordered_i$key_a_c_d <- ikey(ordered_i[c("a", "c","d")])
cumsum_i <- ffdfdply(ordered_i, split=as.character(ordered_i$key_a_c_d), FUN= function(x) {
  x <- as.data.table(x)
  if(x[**current row**, d]==0)
  {
    result <- x[,cumsum_a_c_d :=0]
  }
  else
  {
    result <- x[, cumsum_a_c_d := cumsum(d), by = list(key_a_c_d)]
  }
  as.data.frame(result)
}, trace=TRUE)
I am using the data.table package to compute the cumulative sum. How can I access the current row in the data.table so that I can compare it with 0 and reset the cumsum? The expected output is shown below; Result is the cumulative sum of column d, reset at every 0.
a b c d Result
1 1 1 1 1
1 4 1 0 0
1 6 1 1 1
1 2 1 1 2
1 5 1 0 0
1 3 1 1 1
Thanks
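The reset itself does not need access to the current row. A minimal base R sketch, on an in-memory vector only: every 0 in d starts a new run, and the cumulative sum is taken within each run. The names run_id and result are introduced here, and the ff chunking and the key_a_c_d grouping from the code above are deliberately left out.

```r
# Column d from the expected output
d <- c(1, 0, 1, 1, 0, 1)

# Each 0 opens a new run, so the running count of zeros labels the runs
run_id <- cumsum(d == 0)

# Cumulative sum within each run; a run starts at 0, so the sum restarts there
result <- ave(d, run_id, FUN = cumsum)
result
```

This gives 1 0 1 2 0 1, matching the Result column. Within a grouped computation, the run label could be used as an additional grouping variable next to the existing key.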

Apply in R: recursive function that operates on its own previous result

How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't get to 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions and half of them do nothing, the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set)
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This excel function is displayed without the 5-absence lock for easy readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use of the previous element's result, you might want to check out Reduce, with the accumulate parameter set to TRUE.
Here's an example:
## define your function that takes two parameters:
## the 'previous' result and the 'current' element
runSum <- function(sum, x) {
  res <- 0
  if (x == 1) {
    res <- sum + 1
  } else if (x == 0 && sum < 5) {
    res <- 0
  } else {
    res <- sum
  }
  res
}
# let's look at the absent values from subject B
x <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x = x, f = runSum, accumulate = TRUE)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) rle(x)$values[rle(x)$lengths >= 5])
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) 1 %in% rle(x)$values[rle(x)$lengths >= 5])
A B
FALSE TRUE
I also struggled through to a Reduce solution (but am second in priority to @kithpradhan):
ave(dfrm$Absent, dfrm$Student,
    FUN = function(XX)
      Reduce(function(x, y) if (x[1] >= 5) {
               y + x[1]          # locked at 5 or more: never resets
             } else {
               x[1] * y + y      # resets to 0 if y = 0
             },
             XX, accumulate = TRUE)
)
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
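If the lock at 5 is left aside, the resetting count asked about in the question can also be obtained without Reduce, by adding a run label as a second grouping variable for ave. This is a sketch with the Student and Absent columns retyped from the example data; the name consec is introduced here.

```r
Student <- rep(c("A", "B"), times = c(8, 17))
Absent  <- c(0, 1, 1, 0, 0, 1, 1, 1,
             1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)

# Every present day (Absent == 0) starts a new run; grouping by student
# and run label makes cumsum restart after each present day and for
# each new student.
consec <- ave(Absent, Student, cumsum(Absent == 0), FUN = cumsum)
consec
```

For student B this yields 5 at row 21 and then drops back to 0 at row 22, so the lock still needs the stateful Reduce step shown above.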
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce <- function(f, x, maxiter = 50) {
  i <- 1
  while (!identical(f(x), x) && i <= maxiter) {  # stop at a fixed point
    x <- f(x)
    i <- i + 1
  }
  x
}
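A small usage sketch for this helper, with a toy function chosen here purely for illustration (repeated halving with floor, which reaches the fixed point 0). The definition is repeated so the example runs on its own; halve and fixed are hypothetical names.

```r
ireduce <- function(f, x, maxiter = 50) {
  i <- 1
  # Keep applying f until the value stops changing or maxiter is hit
  while (!identical(f(x), x) && i <= maxiter) {
    x <- f(x)
    i <- i + 1
  }
  x
}

# Toy function: 100 -> 50 -> 25 -> 12 -> 6 -> 3 -> 1 -> 0 -> 0 ...
halve <- function(x) floor(x / 2)

fixed <- ireduce(halve, 100)
fixed
```

The loop stops once halve(0) returns 0 again, well before maxiter is reached. Note that identical is strict about types, so f should return the same type it receives.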
