I really don't know what I'm looking for: a loop, a recursive function, or maybe something different.
This is my toy dataset:
ID1 S1 S2 S3
1 10 20 30
2 20 30 40
1 50 60 70
3 20 40 50
1 10 30 10
2 40 20 20
toy$OLD_RANK = find the previous row with the same ID1 and copy that row's NEW_RANK. If there is no earlier row with the same ID1, use a default value (10 in this example).
toy$NEW_RANK = OLD_RANK + S1 + S2 + S3
expected result:
ID1 S1 S2 S3 OLD_RANK NEW_RANK
1 10 20 30 10 70
2 20 30 40 10 100
1 50 60 70 70 250
3 20 40 50 10 120
1 10 30 10 250 300
2 40 20 20 100 180
dataframe for R as requested:
toy <- matrix(c(1,10,20,30,2,20,30,40,1,50,60,70,3,20,40,50,1,10,30,10,2,40,20,20),ncol=4,byrow=TRUE)
colnames(toy) <- c("ID1","S1","S2","S3")
toy <- as.data.frame(toy)
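One way to do this (a sketch; the default of 10 and the column names are taken from the description above) is a plain loop that keeps the last NEW_RANK seen for each ID1 in a lookup list:

```r
toy <- data.frame(ID1 = c(1, 2, 1, 3, 1, 2),
                  S1  = c(10, 20, 50, 20, 10, 40),
                  S2  = c(20, 30, 60, 40, 30, 20),
                  S3  = c(30, 40, 70, 50, 10, 20))

last_rank <- list()            # last NEW_RANK seen, keyed by ID1
toy$OLD_RANK <- NA_real_
toy$NEW_RANK <- NA_real_
for (i in seq_len(nrow(toy))) {
  id <- as.character(toy$ID1[i])
  # previous row's NEW_RANK for this ID1, or the default 10
  toy$OLD_RANK[i] <- if (is.null(last_rank[[id]])) 10 else last_rank[[id]]
  toy$NEW_RANK[i] <- toy$OLD_RANK[i] + toy$S1[i] + toy$S2[i] + toy$S3[i]
  last_rank[[id]] <- toy$NEW_RANK[i]
}
```

The list lookup keeps this O(n) over the rows, and the loop order is what makes "previous row" well defined.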
I am trying to set up a linear programming solution using lpSolveAPI and R to solve a scheduling problem. Below is a small sample of the data; the minutes required for each session id, and their 'preferred' order/weight.
id <- 1:100
min <- sample(0:500, 100)
weight <- (1:100)/sum(1:100)
data <- data.frame(id, min, weight)
What I want to do is arrange/schedule these session IDs so that there is a maximum number of sessions in a day, filled preferably in order of their weight, with each day capped at a total of 400 minutes.
This is how I have set it up currently in R:
require(lpSolveAPI)
#Set up matrix to hold results; each row represents day
r <- 5
c <- 10
row <- 1
results <- matrix(0, nrow = r, ncol = c)
rownames(results) <- format(seq(Sys.Date(), by = "days", length.out = r), "%Y-%m-%d")
for (i in 1:r) {
  lp <- make.lp(0, nrow(data))
  set.type(lp, 1:nrow(data), "binary")
  set.objfn(lp, rep(1, nrow(data)))
  lp.control(lp, sense = "max")
  add.constraint(lp, data$min, "<=", 400)
  set.branch.weights(lp, data$weight)
  solve(lp)
  a <- get.variables(lp) * data$id  # selected ids (0 where not selected)
  b <- a[a != 0]
  tryCatch(results[row, 1:length(b)] <- b, error = function(x) 0)
  if (nrow(data[!data$id %in% b, ]) > 0) {
    data <- data[!data$id %in% b, ]  # drop scheduled sessions
    row <- row + 1
  }
}
sum(results > 0)
barplot(results) #View of scheduled IDs
A quick look at the results matrix tells me that while the setup works to maximise the number of sessions so that the total minutes in a day are as close to 400 as possible, it doesn't follow the weights given. I expected my results matrix to be filled with increasing session IDs.
I have tried assigning different weights, weights in reverse order, etc., but for some reason my setup doesn't seem to enforce set.branch.weights.
I have read the documentation for set.branch.weights from lpSolveAPI, but I think I am doing something wrong here.
Example - Data:
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25
Should be
Day 1 67 72 36 91 80 44 76
Day 2 58 84 96 21 1 41 66 89
Day 3 62 11 42 68 25 44 90 4 33 31
Each day has a cumulative sum of <= 480 minutes.
My simple minded approach:
df = read.table(header=T,text="
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25")
# assume sorted by weight
daynr <- 1
daymax <- 480
dayusd <- 0
for (i in 1:nrow(df)) {
  v <- df$min[i]
  dayusd <- dayusd + v
  if (dayusd > daymax) {  # current day is full, start a new one
    daynr <- daynr + 1
    dayusd <- v
  }
  df$day[i] <- daynr
}
This will give:
> df
id min weight day
1 1 67 1 1
2 2 72 2 1
3 3 36 3 1
4 4 91 4 1
5 5 80 5 1
6 6 44 6 1
7 7 76 7 1
8 8 58 8 2
9 9 84 9 2
10 10 96 10 2
11 11 21 11 2
12 12 1 12 2
13 13 41 13 2
14 14 66 14 2
15 15 89 15 2
16 16 62 16 3
17 17 11 17 3
18 18 42 18 3
19 19 68 19 3
20 20 25 20 3
21 21 44 21 3
22 22 90 22 3
23 23 4 23 3
24 24 33 24 3
25 25 31 25 3
>
I will concentrate on the first solve. We basically solve a knapsack problem (objective + one constraint):
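The model itself was shown as an image in the original post; reconstructed from the code above (binary selection variables x_j, session minutes min_j), it is the standard knapsack formulation:

    max   sum_j x_j
    s.t.  sum_j min_j * x_j <= 400
          x_j in {0, 1}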
When I run this model as is I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.5952381
Now when I change the objective to
I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.7428571
I.e. the count stayed at 14, but the weight improved.
I need to replicate - or at least find an alternative solution - for a SUMIFS function I have in Excel.
I have a transactional database:
SegNbr Index Revenue SUMIF
A 1 10 30
A 1 20 30
A 2 30 100
A 2 40 100
B 1 50 110
B 1 60 110
B 3 70 260
B 3 80 260
and I need to create another column that sums the Revenue, by SegNbr, for all indexes less than or equal to the Index in that row. It is a distorted rolling revenue, as it will be the same for each SegNbr/Index key. The Excel formula is this one:
=SUMIFS([Revenue],[SegNbr],[#SegNbr],[Index],"<="&[#Index])
Let's say you have this sample data.frame
dd<-read.table(text="SegNbr Index Revenue
A 1 10
A 1 20
A 2 30
A 2 40
B 1 50
B 1 60
B 3 70
B 3 80", header=T)
Now if we make sure the data is ordered by segment and index, we can do
dd<-dd[order(dd$SegNbr, dd$Index), ] #sort data
dd$OUT <- with(dd,
  ave(
    ave(Revenue, SegNbr, FUN = cumsum),          # get running sum per seg
    interaction(SegNbr, Index, drop = TRUE),
    FUN = function(x) max(x, na.rm = TRUE)       # find largest sum per index per seg
  )
)
dd
This gives
SegNbr Index Revenue OUT
1 A 1 10 30
2 A 1 20 30
3 A 2 30 100
4 A 2 40 100
5 B 1 50 110
6 B 1 60 110
7 B 3 70 260
8 B 3 80 260
as desired.
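For comparison, a direct translation of the SUMIFS semantics computes each row's sum explicitly. It is O(n^2), so only a sketch suitable for small data, but it mirrors the Excel formula one-to-one:

```r
dd <- read.table(text = "SegNbr Index Revenue
A 1 10
A 1 20
A 2 30
A 2 40
B 1 50
B 1 60
B 3 70
B 3 80", header = TRUE)

# for each row: sum Revenue over rows with the same SegNbr and Index <= this Index
dd$SUMIF <- mapply(function(s, i) sum(dd$Revenue[dd$SegNbr == s & dd$Index <= i]),
                   dd$SegNbr, dd$Index)
```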
This is my first post on here and I am pretty new to R.
I have a huge datafile that looks like the example below.
> name = factor(c("A","B","C","D","E","F","G","H","H"))
> school = c(1,1,1,2,2,2,3,3,3)
> age = c(10,20,0,30,40,50,60,NA,70)
> mark = c(100,70,100,50,90,100,NA,50,50)
> data = data.frame(name = name, school = school, age = age, mark = mark)
name school age mark (many other trait columns)
A 1 10 100
B 1 20 70
C 1 0 100
D 2 30 50
E 2 40 90
F 2 50 100
G 3 60 NA
H 3 NA 50
H 3 70 50
What I need to do is calculate the average of many traits per school, and for each trait I want to create two other columns: one with the per-school mean for the trait and another with the deviation from that mean. I also have trait values of zero and NA, which I don't want to include in the mean calculation. The file I need would look like this:
name school age agemean agedev mark markmean markdev (continue for other traits)
A 1 10 15 -5 100 90 10
B 1 20 15 5 70 90 -20
C 1 0 15 0 100 90 10
D 2 30 40 -10 50 80 -30
E 2 40 40 0 90 80 10
F 2 50 40 10 100 80 20
G 3 60 65 -5 NA 50 0
H 3 NA 65 0 50 50 0
H 3 70 65 5 50 50 0
I did a search on here and found some similar questions, but I didn't see how to apply them to my case. I tried to use the aggregate function, but it is not working. Any help would be very much appreciated.
Cheers.
Sounds like a good job for dplyr. Here's how you could do it if you want to keep all existing rows per school:
require(dplyr)
data %>%
group_by(school) %>%
mutate_each(funs(mean(., na.rm = TRUE), sd(., na.rm = TRUE)), -name)
#Source: local data frame [9 x 8]
#Groups: school
#
# name school age mark age_mean mark_mean age_sd mark_sd
#1 A 1 10 100 15 90 7.071068 17.32051
#2 B 1 20 70 15 90 7.071068 17.32051
#3 C 1 NA 100 15 90 7.071068 17.32051
#4 D 2 30 50 40 80 10.000000 26.45751
#5 E 2 40 90 40 80 10.000000 26.45751
#6 F 2 50 100 40 80 10.000000 26.45751
#7 G 3 60 NA 65 50 7.071068 0.00000
#8 H 3 NA 50 65 50 7.071068 0.00000
#9 H 3 70 50 65 50 7.071068 0.00000
If you want to reduce each school to a single-row-summary, you can replace mutate_each with summarise_each in the code above.
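The dplyr call above gives means and standard deviations, but the question also asks for per-row deviations from the school mean, with zeros treated as missing. A base-R sketch for one trait (the same pattern applies to mark and the other trait columns):

```r
data <- data.frame(name   = c("A","B","C","D","E","F","G","H","H"),
                   school = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   age    = c(10, 20, 0, 30, 40, 50, 60, NA, 70))

data$age[which(data$age == 0)] <- NA   # treat zeros as missing, per the question
data$agemean <- ave(data$age, data$school,
                    FUN = function(x) mean(x, na.rm = TRUE))  # per-school mean
data$agedev  <- data$age - data$agemean                       # deviation from it
```

Note that ave() writes the group mean back to every row of the group, including rows whose own value is NA, so the mean columns are fully populated.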
Say I have a data frame where one column is some repeating value (dates, IDs, etc). Is there a way to convert such a data frame into a new data frame with columns instead of replicated rows? Basically I want to transpose something like this:
col1 col2 col3
1 aa 30
2 aa 40
3 aa 10
1 bb 20
2 bb 12
3 bb 15
1 cc 40
2 cc 31
3 cc 12
Into this:
aa bb cc
1 30 20 40
2 40 12 31
3 10 15 12
Here is some code that makes a sample of the first data frame:
a <- c(rep(1:10, 3))
b <- c(rep("aa", 10), rep("bb", 10), rep("cc", 10))
set.seed(123)
c <- sample(seq(from = 20, to = 50, by = 5), size = 30, replace = TRUE)
d <- data.frame(a,b, c)
I am unsure how to transpose it.
a <- c(rep(1:10, 3))
b <- c(rep("aa", 10), rep("bb", 10), rep("cc", 10))
set.seed(123)
c <- sample(seq(from = 20, to = 50, by = 5), size = 30, replace = TRUE)
d <- data.frame(a,b, c)
#how to transpose it#
e <- reshape(d, idvar = "a", timevar = "b", direction = "wide")
e
This is also a case in which you can use unstack:
unstack(d, c ~ b)
# aa bb cc
# 1 30 50 50
# 2 45 35 40
# 3 30 40 40
# 4 50 40 50
# 5 50 20 40
# 6 20 50 40
# 7 35 25 35
# 8 50 20 40
# 9 35 30 30
# 10 35 50 25
Using your data frame d,
library(tidyr)
> spread(d, key = b, value = c)
a aa bb cc
1 1 30 50 50
2 2 45 35 40
3 3 30 40 40
4 4 50 40 50
5 5 50 20 40
6 6 20 50 40
7 7 35 25 35
8 8 50 20 40
9 9 35 30 30
10 10 35 50 25
Explanation, the argument key = b lets you specify a column in your data frame. spread will create a new column for each unique entry in the key column b. The argument value = c tells spread to retrieve the value in column c and write it in the corresponding new key column.
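In current tidyr, spread() is superseded; the equivalent call with pivot_wider() would be (a sketch, assuming tidyr >= 1.0.0 is installed):

```r
library(tidyr)

# same sample data as in the question
a <- rep(1:10, 3)
b <- c(rep("aa", 10), rep("bb", 10), rep("cc", 10))
set.seed(123)
c <- sample(seq(from = 20, to = 50, by = 5), size = 30, replace = TRUE)
d <- data.frame(a, b, c)

# names_from supplies the new column names, values_from the cell values
wide <- pivot_wider(d, names_from = b, values_from = c)
```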
If there are always equal numbers of observations in each group, this would be very easy with split then as.data.frame
as.data.frame(split(d$c, d$b))
# aa bb cc
# 1 30 50 50
# 2 45 35 40
# 3 30 40 40
# 4 50 40 50
# 5 50 20 40
# 6 20 50 40
# 7 35 25 35
# 8 50 20 40
# 9 35 30 30
# 10 35 50 25
With split and cbind:
> ll = lapply(split(d, d$b), function(x) x[3])
> dd = do.call(cbind, ll)
> names(dd) = names(ll)
> dd
aa bb cc
1 30 50 50
2 45 35 40
3 30 40 40
4 50 40 50
5 50 20 40
6 20 50 40
7 35 25 35
8 50 20 40
9 35 30 30
10 35 50 25
I am trying to identify the time of primary ambulance arrival for a number of patients in my dataframe=data.
The primary ambulance is either the 1st, 2nd, 3rd or 4th vehicle on scene (data$prim.amb.num=1, 2, 3, or 4 for each patient/row).
data$time_v1, data$time_v2, data$time_v3 and data$time_v4 have a time or a missing value, which corresponds to the 1st, 2nd, 3rd and 4th vehicles, where relevant.
What I would like to do is make a new variable, prim.amb.time, with the time that corresponds to the primary ambulance's arrival. Suppose for patient 1 the primary ambulance was the first vehicle; then I want data[1,"prim.amb.time"]=data[1,"time_v1"].
I can figure out the correct time_v* with the following:
paste("time_v", data$prim.amb.num, sep="")
But I'm stuck as to how to pass the resulting information to call the correct column.
My hope was to simply have something like:
data$prim.amb.time<-data$paste("time_v", data$prim.amb.num, sep="")
but of course, this doesn't work. I'm not even sure how to Google for this; I tried various combinations of this title but to no avail. Any suggestions?
Although I liked the answer by #mhermans, if you want a one-liner, one solution is to use ?apply as follows:
#From #mhermans
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
#Take each row of d and pull out time_vn where n = d$prime.amb.num
d$prime.amb.time <- apply(d, 1, function(x) {x[x['prime.amb.num'] + 2]})
> d
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4 prime.amb.time
1 1000 1 30 40 60 100 30
2 1001 3 40 50 60 80 60
3 1002 2 10 30 40 45 30
4 1003 1 24 40 45 60 24
EDIT - or with paste:
d$prime.amb.time <-
apply(
d,
1,
function(x) {
x[paste('time_v', x['prime.amb.num'], sep = '')]
}
)
#Gives the same result
Set up example data:
# read in basic example data for four patients, wide format
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
In the example dataset I'm thus assuming your data looks like this:
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1 1000 1 30 40 60 100
2 1001 3 40 50 60 80
3 1002 2 10 30 40 45
4 1003 1 24 40 45 60
Given that data structure, it is perhaps easier to work with a dataset with a vehicle per row, instead of a patient per row. This can be accomplished by using reshape() to convert from a wide to a long format.
dl <- reshape(d, direction='long', idvar="patient.id", varying=list(3:6))
# ordering & rename var for aesth. reasons:
dl <- dl[order(dl$patient.id, dl$time),]
dl$vehicle.id <- dl$time
dl$time <- NULL
dl
This gives a long dataset, with a row per vehicle:
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1000.2 1000 1 40 2
1000.3 1000 1 60 3
1000.4 1000 1 100 4
1001.1 1001 3 40 1
1001.2 1001 3 50 2
1001.3 1001 3 60 3
1001.4 1001 3 80 4
1002.1 1002 2 10 1
1002.2 1002 2 30 2
1002.3 1002 2 40 3
1002.4 1002 2 45 4
1003.1 1003 1 24 1
1003.2 1003 1 40 2
1003.3 1003 1 45 3
1003.4 1003 1 60 4
Getting the arrival time of the primary ambulance per patient then becomes a simple one-liner:
dl[dl$prime.amb.num == dl$vehicle.id,]
which gives
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1001.3 1001 3 60 3
1002.2 1002 2 30 2
1003.1 1003 1 24 1
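A vectorized base-R alternative (a sketch, avoiding both apply() and the reshape) indexes the wide time columns with a two-column matrix of (row, column) positions:

```r
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)

# pick row i, column prime.amb.num[i] out of the time_v* block
times <- as.matrix(d[, paste0("time_v", 1:4)])
d$prime.amb.time <- times[cbind(seq_len(nrow(d)), d$prime.amb.num)]
```

Matrix indexing with cbind() stays fast on large data because it never loops over rows in R.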