I have a problem very similar to that described here:
subset of data.frame columns to maximize "complete" observations
I am trying to schedule a workshop that will meet five times. I have ten days from which to choose meeting dates, and each day has three overlapping possible meeting times. Hence, I have 30 columns grouped into ten groups (days) of three columns (meeting times) each. I need to select 5 columns (meeting date-time combinations) subject to two criteria: only one meeting time is selected per day (at most one column per group), and the number of respondents (rows) who can attend all 5 meetings is maximized. Ideally, I would also want to know how the optimal column selection changes if I relax the criterion that respondents must attend ALL 5 meetings, requiring only that they attend 4, or 3, etc.
To illustrate with a smaller example, let's say I want to know which two columns I should choose (no more than one each from the V1, V2, and V3 groups) such that I maximize the number of rows that have no zeros (i.e. whose row sums equal 2).
V1A V1B V1C V2A V2B V2C V3A V3B V3C
1 0 1 0 1 1 1 0 1
1 1 0 0 1 1 0 1 1
0 0 1 1 1 0 0 1 1
1 1 1 1 0 0 1 0 0
1 0 0 0 1 1 0 1 0
0 1 1 0 1 1 0 0 0
1 0 1 1 1 0 1 0 1
The actual data are here: https://drive.google.com/file/d/0B03dE9-8088aMklOUVhuV3gtRHc/view Groups are mon1* tue1* [...] mon2* tue2* [...] fri2*.
The code proposed in the link above would solve my problem if I did not need to select columns from groups. Ideally, I would also be able to say which columns I should choose to maximize the number of rows under the weaker condition that a row may contain one or more zeros (i.e. a row sum of 4, 3, etc.).
Many thanks!
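Since the search space is small (choose(10, 5) * 3^5 = 61236 candidate schedules), one direct route is brute-force enumeration. Below is a minimal, unoptimized sketch; it assumes df holds the 30 binary availability columns ordered day by day (three time slots per day), and day is a hypothetical helper vector mapping each column to its group:

# assumes df: 30 binary columns, 3 consecutive columns per day
day <- rep(1:10, each = 3)                      # column -> day group
slots <- expand.grid(rep(list(1:3), 5))         # all 3^5 time-slot choices
best <- list(score = -1, cols = NULL)
for (days in combn(10, 5, simplify = FALSE)) {  # which 5 of the 10 days
  for (i in seq_len(nrow(slots))) {
    cols <- mapply(function(d, s) which(day == d)[s], days, unlist(slots[i, ]))
    score <- sum(rowSums(df[, cols]) == 5)      # respondents free for all 5
    # relax to >= 4 to require attendance at only 4 of the 5 meetings
    if (score > best$score) best <- list(score = score, cols = names(df)[cols])
  }
}
best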
You could use rowSums to get the indices of rows that have two or more 1's within each group. (The conditions are not entirely clear.)
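# sub('.$', '', names(df)) drops the last character, so V1A/V1B/V1C all map to group V1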
lapply(split(names(df), sub('.$', '', names(df))),
       function(x) which(rowSums(df[x]) >= 2))
#$V1
#[1] 1 2 4 6 7
#$V2
#[1] 1 2 3 5 6 7
#$V3
#[1] 1 2 3 7
This finds, within each of the three groups, the position of the first column containing a 1 (or the very first column if all are zero), returning a three-column matrix, one column per group.
f <- substring(colnames(df), 1L, nchar(colnames(df))-1L)
ans <- lapply(split(as.list(df), f),
              function(x) max.col(do.call(cbind, x), ties.method = "first"))
do.call(cbind, ans)
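With the example data this returns a 7 x 3 matrix of within-group column positions:
     V1 V2 V3
[1,]  1  2  1
[2,]  1  2  2
[3,]  3  1  2
[4,]  1  1  1
[5,]  1  2  2
[6,]  2  2  1
[7,]  1  1  1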
With your example data, this returns, for each group, the rows in which all three columns equal 1:
> lapply( 1:3, function(grp) which( apply( dat[, grep(grp, names(dat))] , 1,
function(z) sum(z, na.rm=TRUE)==3) ) )
[[1]]
[1] 4
[[2]]
integer(0)
[[3]]
integer(0)
If you relax the requirement to row sums of at least 2, you get more candidates:
> lapply( 1:3, function(grp) which( apply( dat[, grep(grp, names(dat))] , 1, function(z) sum(z, na.rm=TRUE)>=2) ) )
[[1]]
[1] 1 2 4 6 7
[[2]]
[1] 1 2 3 5 6 7
[[3]]
[1] 1 2 3 7
Now, what exactly are the rules for this task?
I have a data frame where each observation is spread across two columns: columns 1 and 2 represent individual 1, columns 3 and 4 individual 2, and so on.
Basically, what I want to do is add each pair of contiguous columns so that I get each individual's real score.
In this example, V1 and V2 represent individual I and V3 and V4 represent individual II. The result data frame will have half as many columns and the same number of rows, and each value will be the sum of the two contiguous columns.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desired output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
b <- matrix(NA, nrow = nrow(a), ncol = ncol(a))
for (i in seq(2, ncol(a), by = 2)){
  for (k in 1:nrow(a)){
    b[k,i] <- a[k,i] + a[k,i-1]
  }
}
b <- as.data.frame(b)
b <- b[,-c(seq(1, length(b), by = 2))]
Is there a way to make it simpler?
We could use split.default to split the data and then do rowSums by looping over the list
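# gl(ncol(a), 2, ncol(a)) generates the labels 1 1 2 2, so adjacent column pairs share a group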
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
You can use vector recycling to select columns and add them.
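# c(TRUE, FALSE) recycles to TRUE FALSE TRUE FALSE and picks V1 and V3;
# c(FALSE, TRUE) picks V2 and V4, so the addition pairs V1+V2 and V3+V4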
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
A dplyr approach with row-wise operations (rowwise() is a special type of per-row grouping):
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a %>%
  rowwise() %>%
  transmute(I = sum(c(V1, V2)),
            II = sum(c(V3, V4)))
Or, alternatively, with rowSums, a built-in row-wise variant of sum:
a %>% transmute(I = rowSums(across(1:2)),
II = rowSums(across(3:4)))
I have a data frame which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B) consisting of 0's and 1's.
It should be 0 by default, but become 1 when 5 data points in a row are all positive OR all negative. The run must be unbroken (e.g. if a run of positives is interrupted by a negative number, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but it didn't work. I also tried converting the whole data frame to a list and looping over the list, unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df <- data.frame(Value = sample(-9:9, 1000, replace = TRUE))
sgn <- sign(df$Value)
library(zoo)
rolling <- rollmean(sgn, k = 5, fill = 0, align = "right")
df$B <- as.numeric(abs(rolling) == 1)
I generated 1000 values with both positive and negative runs. The steps:
1. Extract the sign of the values: -1 for negative, 1 for positive, and 0 for 0.
2. Calculate the right-aligned rolling mean of 5 values (it averages x[1:5], x[2:6], ...). This will be 1 or -1 exactly when five values in a row are all positive or all negative, respectively.
3. Take the absolute value and compare it to 1. The result is a logical vector that turns into the 0's and 1's your conditions require.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
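For instance, on the first five values of the example (all positive), only the fifth point has a full window:

library(zoo)
v <- c(1, 2, 1, 2, 2)
rollmean(sign(v), k = 5, fill = 0, align = "right")
# [1] 0 0 0 0 1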
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
  if (length(unique(sign(dat$Value[(x-4):x]))) == 1){
    dat$new_col[x] <- 1
  } else {
    dat$new_col[x] <- 0
  }
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
  as.integer(seq_along(x) > 4)
})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
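To see the grouping variable from the idiom on its own (each sign change starts a new group):

v <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
cumsum(c(0, diff(sign(v)) != 0))
# [1] 0 0 0 0 0 0 0 1 2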
I need to check whether each unique value of the variable PPT occurs the same number of times in A as in B, and whether any value occurs only in A or only in B.
For example:
PPTa <- c("ppt0100109","ppt0301104","ppt0100109","ppt0100109","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0504409","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNa <- c(110,54,110,110,49,10,49,110,409,40,10,10,110)
LLa <- c(150,55,150,150,45,15,45,115,405,45,5,15,50)
A <-data.frame(PPTa,CNa,LLa)
PPTb <- c("ppt0100200","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0764091","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNb <- c(110,54,110,110,49,10,49,110,409,40)
LLb <- c(150,55,150,150,45,15,45,115,405,45)
B <-data.frame(PPTb,CNb,LLb)
In this case, we have these unique values which occur a certain amount of times:
A$PPTa TIMES
"ppt0100109" 6
"ppt0301104" 1
"ppt0300249" 2
"ppt0504409" 1
"ppt2303401" 1
"ppt0704210" 2
B$PPTb TIMES
"ppt0100200" 1
"ppt0300249" 2
"ppt0100109" 3
"ppt0764091" 1
"ppt2303401" 1
"ppt0704210" 2
I would like to create a new matrix (or anything you could suggest) with a value of 0 if the unique value exists in both A and B with the same count, a value of 1 if it exists in both data frames but the counts differ, and a value of 2 if the value exists in only one of the two data frames.
Something like:
A$PPTa TIMES OUTPUT
"ppt0100109" 6 1
"ppt0301104" 1 2
"ppt0300249" 2 0
"ppt0504409" 1 2
"ppt2303401" 1 0
"ppt0704210" 2 0
B$PPTb TIMES OUTPUT
"ppt0100200" 1 2
"ppt0300249" 2 0
"ppt0100109" 3 1
"ppt0764091" 1 2
"ppt2303401" 1 0
"ppt0704210" 2 0
You can use a nested ifelse statement,
ifelse(do.call(paste0, A) %in% do.call(paste0, B), 0, ifelse(A$PPTa %in% B$PPTb, 1, 2))
#[1] 1 0 2 2 0 0
ifelse(do.call(paste0, B) %in% do.call(paste0, A), 0, ifelse(B$PPTb %in% A$PPTa, 1, 2))
#[1] 1 2 0 0 2 0
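Note that for this to return one value per unique code (in sorted order, matching the outputs above), A and B in these calls have to be the frequency tables rather than the raw data frames; a sketch of how they could be built:

# one row per unique code plus its count; the first column keeps the name PPTa/PPTb
A <- as.data.frame(table(PPTa), stringsAsFactors = FALSE)
B <- as.data.frame(table(PPTb), stringsAsFactors = FALSE)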
How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't get to 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions and half of them do nothing, the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set)
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This Excel formula is shown without the 5-absence lock for readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use of the previous result, you might want to check out Reduce with the accumulate parameter set to TRUE.
Here's an example:
## define your function that takes two parameters:
## the 'previous' result and the 'current' element
runSum <- function(sum, x){
  res <- 0
  if (x == 1){
    res <- sum + 1
  } else if (x == 0 & sum < 5){
    res <- 0
  } else {
    res <- sum
  }
  res
}
# let's look at the Absent values for student B
x <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x = x, f = runSum, accumulate = TRUE)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) rle(x)$value[rle(x)$length >=5] )
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) 1 %in% rle(x)$value[rle(x)$length >=5] )
A B
FALSE TRUE
I also struggled through to a Reduce solution (but defer to @kithpradhan's):
ave(dfrm$Absent, dfrm$Student,
    FUN = function(XX)
      Reduce(function(x, y) if (x[1] >= 5) { y + x[1]  # locked once at 5 or above
             } else { x[1]*y + y },                    # resets to 0 if y = 0
             XX, accumulate = TRUE)
)
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce <- function(f, x, maxiter = 50){
  i <- 1
  while(!identical(f(x), x) & i <= maxiter) {
    x <- f(x)
    i <- i + 1
  }
  x
}
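For instance, with a toy function whose fixed point is 10 (the function and starting value are just illustrative):

ireduce(function(x) pmin(x + 1, 10), 0)
# [1] 10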
I have a quite big data frame (a few million records).
I need to filter it according to the following rule:
- For each product, delete all records which come before the fifth record after the first record with x > 0.
So we are interested in only two columns, ID and x. The data frame is sorted by ID.
It is fairly easy to do with loops, but loops don't perform well on such a big data frame.
How can I do it in 'vector style'?
Example:
BEFORE FILTERING
ID x
1 0
1 0
1 5 # First record with x>0
1 0
1 3
1 4
1 0
1 9
1 0 # Delete all earlier records of that product
1 0
1 6
2 0
2 1 # First record with x>0
2 0
2 4
2 5
2 8
2 0 # Delete all earlier records of that product
2 1
2 3
After filtering:
ID x
1 9
1 0
1 0
1 6
2 0
2 1
2 3
For these split-apply-combine problems, I like using plyr. There are alternatives if speed becomes an issue, but for most things plyr is easy to understand and use. I wrote a function that implements the logic you described above and then fed it to ddply() to operate on each chunk of the data based on ID.
fun <- function(x, column, threshold, numplus){
  whichcol <- which(x[column] > threshold)[1]
  rows <- seq(from = whichcol + numplus, to = nrow(x))
  return(x[rows, ])
}
And then feed this to ddply()
require(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
ID x
1 1 9
2 1 0
3 1 0
4 1 6
5 2 0
6 2 1
7 2 3
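If you would rather stay in base R, the same per-ID logic can be vectorised with ave(); a sketch, assuming the data frame is called dat (note an ID with no x > 0 would yield NAs here):

# within each ID, keep only rows from position (first x > 0) + 5 onwards
keep <- as.logical(ave(dat$x, dat$ID, FUN = function(x) {
  first <- which(x > 0)[1]
  seq_along(x) >= first + 5
}))
dat[keep, ]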