Taking a subset of a data set based on a specific condition - R

I have a data set in which the values of "age" have different units (days, months, years). I want to convert the rows whose values are in days or months to years. How can I do this in R?
If there is no letter after the number, then the unit is years.
If there is a ‘D’ after the number, then the unit is days (e.g. 10D means 10 days).
If there is an ‘M’ after the number, then the unit is months (e.g. 5M means 5 months).
Age <- c("33","32","44","54M","67M","34D","33D","44","77","88M","49 D","55D","11M")
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13)
Data <- data.frame(ID,Age)
> Data
ID Age
1 1 33
2 2 32
3 3 44
4 4 54M
5 5 67M
6 6 34D
7 7 33D
8 8 44
9 9 77
10 10 88M
11 11 49 D
12 12 55D
13 13 11M

Here's a quick way in base R (note it treats a year as 365.25 days; some answers below divide by 365, so the day-based results differ slightly):
# classify each entry by the letter it contains
Data$units <- ifelse(grepl("M", Data$Age), "month",
                     ifelse(grepl("D", Data$Age), "day", "year"))
# keep only the digits and convert to numeric
Data$value <- as.numeric(gsub(pattern = "[^0-9]", replacement = "", Data$Age))
# convert everything to years
Data$result <- with(Data,
                    ifelse(units == "year", value,
                           ifelse(units == "month", value / 12, value / 365.25)))
Data
# ID Age units value result
# 1 1 33 year 33 33.00000000
# 2 2 32 year 32 32.00000000
# 3 3 44 year 44 44.00000000
# 4 4 54M month 54 4.50000000
# 5 5 67M month 67 5.58333333
# 6 6 34D day 34 0.09308693
# 7 7 33D day 33 0.09034908
# 8 8 44 year 44 44.00000000
# 9 9 77 year 77 77.00000000
# 10 10 88M month 88 7.33333333
# 11 11 49 D day 49 0.13415469
# 12 12 55D day 55 0.15058179
# 13 13 11M month 11 0.91666667

And here's another option using tidyverse tools:
library(dplyr)
library(stringr)
Data %>%
  mutate(Unit = str_extract(string = Age, pattern = "[DM]"),
         Unit = if_else(is.na(Unit), 'Y', Unit),
         Age = as.numeric(gsub(pattern = "[MD]", "", Age))) %>%
  mutate(AgeYears = Age / c('Y' = 1, 'M' = 12, 'D' = 365)[Unit])
ID Age Unit AgeYears
1 1 33 Y 33.00000000
2 2 32 Y 32.00000000
3 3 44 Y 44.00000000
4 4 54 M 4.50000000
5 5 67 M 5.58333333
6 6 34 D 0.09315068
7 7 33 D 0.09041096
8 8 44 Y 44.00000000
9 9 77 Y 77.00000000
10 10 88 M 7.33333333
11 11 49 D 0.13424658
12 12 55 D 0.15068493
13 13 11 M 0.91666667

# base R
Age <- c("33","32","44","54M","67M","34D","33D","44","77","88M","49 D","55D","11M")
# numeric part: drop an optional space and the trailing unit letter
AgeNum <- as.numeric(sub("\\s*\\D$", "", Age))
Age[grepl("M$", Age)] <- AgeNum[grepl("M$", Age)] / 12
Age[grepl("D$", Age)] <- AgeNum[grepl("D$", Age)] / 365
Age <- as.numeric(Age)
Result:
> Age
[1] 33.00000000 32.00000000 44.00000000 4.50000000 5.58333333 0.09315068 0.09041096 44.00000000
[9] 77.00000000 7.33333333 0.13424658 0.15068493 0.91666667
>

Additionally, a further solution using data.table:
> library(data.table)
> dt <- data.table(ID, Age)
> dt[, Unit := ifelse(grepl("D$", Age), "D", ifelse(grepl("M$", Age), "M", "Y"))][
, Age := as.integer(gsub("M|D", "", Age))]
> dt[, Age_in_years := ifelse(Unit == "Y", Age,
ifelse(Unit == "M", Age / 12, Age / 365.25))][]
ID Age Unit Age_in_years
1: 1 33 Y 33.00000000
2: 2 32 Y 32.00000000
3: 3 44 Y 44.00000000
4: 4 54 M 4.50000000
5: 5 67 M 5.58333333
6: 6 34 D 0.09308693
7: 7 33 D 0.09034908
8: 8 44 Y 44.00000000
9: 9 77 Y 77.00000000
10: 10 88 M 7.33333333
11: 11 49 D 0.13415469
12: 12 55 D 0.15058179
13: 13 11 M 0.91666667
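If the conversion is needed in more than one place, the same logic can be wrapped in a small helper. A minimal sketch (the function name age_to_years is mine, not from the answers; like the base R and data.table answers it treats a year as 365.25 days):

age_to_years <- function(age) {
  # unit letter at the end of the string; no letter means years
  unit <- ifelse(grepl("M$", age), "M",
                 ifelse(grepl("D$", age), "D", "Y"))
  value <- as.numeric(gsub("[^0-9]", "", age))
  # look up the periods-per-year divisor for each unit in a named vector
  value / c(Y = 1, M = 12, D = 365.25)[unit]
}
Data$AgeYears <- age_to_years(Data$Age)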

Related

Operations on multiple columns across many tables

I have two tables (dt1, dt2). dt2 contains the same variable names as dt1.
For each variable in dt1, I would like to multiply it by its value from dt2.
In the example below, x from dt1 will get multiplied by 4 and y by 7.
What would be a fast way to do it?
Thank you
set.seed(123)
dt1 <- data.frame(x = sample(1:10, 10, TRUE), y = sample(1:10, 10, TRUE))
dt1
dt2 <- data.frame(names = c("x", "y"), values = c(4, 7))
dt2
A purrr-style option (pivot_wider is from tidyr):
library(purrr)
library(dplyr)
library(tidyr)
map2_df(dt1, dt2 %>% pivot_wider(names_from = names, values_from = values), ~ .y * .x)
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
You can try sweep (the match() call aligns dt2's values with dt1's column order):
> sweep(dt1, 2, dt2$values[match(names(dt1), dt2$names)], "*")
x y
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
or
> dt1[] <- t(t(dt1) * dt2$values[match(names(dt1), dt2$names)])
> dt1
x y
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
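For completeness, a similar base R sketch using Map (my addition, not from the answers above; it assumes the original, unmodified dt1). It multiplies each column of dt1 by the dt2 value whose name matches that column:

dt1[] <- Map(`*`, dt1, dt2$values[match(names(dt1), dt2$names)])
dt1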

Bind data.tables with different numbers of rows such that the shorter dataset repeats the last row

I am trying to bind two datasets in R with different numbers of rows, such that the shorter dataset repeats its last row until it matches the length of the other dataset, grouped by the 'BY' variable.
Suppose below are the two datasets
library(data.table)
dt1 <- data.table(Student = c(6,6,6,7,7),
                  RollNum1 = c(49,69,44,86,39),
                  Marks1 = c(8,9,10,8,5))
dt2 <- data.table(Student = c(6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7),
                  RollNum2 = c(58,69,45,38,88,73,33,99,29,58,31,55,58,44,56,89),
                  Marks2 = c(8,9,10,3,5,7,8,8,9,6,9,5,9,3,4,8))
The merge should give the result below:
Student RollNum1 Marks1 RollNum2 Marks2
6 49 8 58 8
6 69 9 69 9
6 44 10 45 10
6 44 10 38 3
6 44 10 88 5
6 44 10 73 7
6 44 10 33 8
6 44 10 99 8
6 44 10 29 9
7 86 8 58 6
7 39 5 31 9
7 39 5 55 5
7 39 5 58 9
7 39 5 44 3
7 39 5 56 4
7 39 5 89 8
I was able to get this working for a single student value from another post, but I am not sure how to group by student, run it, and then append to get the final result.
To run it for a single student value, e.g. 6, below is the code:
LastRowrep <- as.data.frame(matrix(rep(unlist(dt1[nrow(dt1), ]), times = nrow(dt2) - nrow(dt1)),
                                   ncol = ncol(dt1), byrow = TRUE))
colnames(LastRowrep) <- colnames(dt1)
cbind(rbind(dt1, LastRowrep), dt2)
But running this for different student values doesn't bind by the group value.
Another data.table approach, using melting and casting.
After casting, fill in the NAs with last observation carried forward (locf). In the development version of data.table an nafill function is implemented, but I'm using zoo::na.locf until 1.12.3 goes final.
DT <- rbind(melt(dt1, id.vars = "Student"),
            melt(dt2, id.vars = "Student"))[, rowid := rowid(Student, variable)]
ans <- dcast(DT, rowid + Student ~ variable, value.var = "value")
setorder(ans, Student)
zoo::na.locf(ans, na.rm = FALSE)[, rowid := NULL][]
# Student RollNum1 Marks1 RollNum2 Marks2
# 1: 6 49 8 58 8
# 2: 6 69 9 69 9
# 3: 6 44 10 45 10
# 4: 6 44 10 38 3
# 5: 6 44 10 88 5
# 6: 6 44 10 73 7
# 7: 6 44 10 33 8
# 8: 6 44 10 99 8
# 9: 6 44 10 29 9
#10: 7 86 8 58 6
#11: 7 39 5 31 9
#12: 7 39 5 55 5
#13: 7 39 5 58 9
#14: 7 39 5 44 3
#15: 7 39 5 56 4
#16: 7 39 5 89 8
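Following up on the nafill remark above: once that function landed (data.table >= 1.12.4, after this answer was written), the zoo call could presumably be replaced by filling by reference, along these lines:

setnafill(ans, type = "locf", cols = c("RollNum1", "Marks1", "RollNum2", "Marks2"))
ans[, rowid := NULL][]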
# Alternative: count dt2's rows per Student, pad dt1 within each group
# by repeating its last row, then bind the columns together
dt1[, nrows := dt2[, .(rows = .N), by = Student][.SD, on = "Student", rows]]
dt1 <- dt1[, .SD[c(1:.N, rep(.N, nrows[1] - .N))], by = Student]
cbind(dt1[, !"nrows"], dt2[, !"Student"])
# Student RollNum1 Marks1 RollNum2 Marks2
# 1: 6 49 8 58 8
# 2: 6 69 9 69 9
# 3: 6 44 10 45 10
# 4: 6 44 10 38 3
# 5: 6 44 10 88 5
# 6: 6 44 10 73 7
# 7: 6 44 10 33 8
# 8: 6 44 10 99 8
# 9: 6 44 10 29 9
# 10: 7 86 8 58 6
# 11: 7 39 5 31 9
# 12: 7 39 5 55 5
# 13: 7 39 5 58 9
# 14: 7 39 5 44 3
# 15: 7 39 5 56 4
# 16: 7 39 5 89 8
If you want a dplyr approach, here it is.
library(tidyverse)
dt1 <- dt1 %>% gather(key, value, -Student)
dt2 <- dt2 %>% gather(key, value, -Student)
dt3 <- bind_rows(dt1, dt2) %>%
  group_by(Student, key) %>%
  mutate(id = seq(n())) %>%
  spread(key, value) %>%
  fill(c(Marks1:RollNum2)) %>%
  select(-id)
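gather() and spread() have since been superseded by pivot_longer() and pivot_wider() in tidyr. A rough modern translation of the same idea (mine, not from the original answer; it starts again from the original dt1 and dt2, since the block above overwrote them with the gathered versions):

library(dplyr)
library(tidyr)
dt3 <- bind_rows(pivot_longer(dt1, -Student, names_to = "key"),
                 pivot_longer(dt2, -Student, names_to = "key")) %>%
  group_by(Student, key) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = key, values_from = value) %>%
  group_by(Student) %>%
  # carry each student's last observed dt1 row downwards
  fill(RollNum1, Marks1, RollNum2, Marks2) %>%
  ungroup() %>%
  select(-id)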

Selecting the middle n rows in R

I have a data.table in R, say df:
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number, a, b)
df
row.number a b
1 1 A 14
2 2 A 59
3 3 A 39
4 4 A 22
5 5 A 75
6 6 A 89
7 7 A 11
8 8 A 88
9 9 A 22
10 10 A 6
11 11 B 37
12 12 B 42
13 13 B 39
14 14 B 8
15 15 B 74
16 16 B 67
17 17 B 18
18 18 B 12
19 19 B 56
20 20 B 21
I want to take n rows (say 10) from the middle after arranging the records in increasing order of column b.
Use setorder to sort and .N to filter:
setorder(df, b)[(.N/2 - 10/2):(.N/2 + 10/2 - 1), ]
row.number a b
1: 11 B 36
2: 5 A 38
3: 8 A 41
4: 18 B 43
5: 1 A 50
6: 12 B 51
7: 15 B 54
8: 3 A 55
9: 20 B 59
10: 4 A 60
You could use the following code
library(data.table)
set.seed(9876) # for reproducibility
# your data
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number,a,b)
df
# define how many to select and store in n
n <- 10
# calculate how many to cut off at the start and end
n_not <- (nrow(df) - n) / 2
# use data.table's setorder to arrange based on column b
setorder(df, b)
# select the rows wanted based on n
df[(n_not + 1):(nrow(df) - n_not), ]
Please let me know whether this is what you want.

R data.frame: add a column depending on row values

In R, I have a data.frame that looks like this:
X Y
20 7
25 84
15 62
22 12
60 24
40 10
60 60
12 50
11 17
Now I want a new column, let's call it "SumX", that adds each value of X to the value before it and stores the result in that SumX column, and another that does the same for a "SumY" column. So the resulting data.frame would look like this:
X Y SumX SumY
20 7 20 #first row = X 7 #first row = Y
25 84 45 #X0 + X1 91 #Y0 + Y1
15 62 40 #X1 + X2 146 #Y1 + Y2
22 12 37 #X2 + X3 74 #Y2 + Y3
60 24 82 #X3 + X4 36 #Y3 + Y4
40 10 100 #X4 + X5 34 #Y4 + Y5
60 60 100 #and so on 70 #and so on
12 50 72 110
11 17 23 67
I can do a simple X + Y into a new column with
myFrame$SumXY <- with(myFrame, X + Y)
but is there a simple way to add two X values (n + (n-1)) into SumX, and two Y values (n + (n-1)) into SumY? Even if it is with a while loop, though I would prefer a simpler way (it's a lot of data like this). Any help is much appreciated! (I'm still pretty new to R)
The rollapply function from the zoo package will work here.
The following code block creates the rolling sum of each 2 adjacent values:
require(zoo)
# rolling sum of every 2 values; align = "right" with partial = TRUE keeps the
# result the same length as the input (the first window is just the first value,
# matching the expected output above)
myFrame$SumX <- rollapply(myFrame$X, 2, sum, align = "right", partial = TRUE)
You could add by = 2 as an argument to rollapply in order to not have a rolling sum (i.e. it sums values 1+2, then 3+4, then 5+6 etc.).
Look up ?rollapply for more info.
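For illustration, a quick sketch of the by = 2 behaviour on the first six X values from the question (note the result is half as long, so it cannot be assigned back as a column directly):

rollapply(c(20, 25, 15, 22, 60, 40), 2, sum, by = 2)
# [1]  45  37 100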
Here's a dplyr approach.
Use mutate() to add a new column and var + lag(var, default = 0) to compute your variable. Example:
library(dplyr)
d <- data.frame(
  x = 1:10,
  y = 11:20,
  z = 21:30
)
mutate(d, sumx = x + lag(x, default = 0))
#> x y z sumx
#> 1 1 11 21 1
#> 2 2 12 22 3
#> 3 3 13 23 5
#> 4 4 14 24 7
#> 5 5 15 25 9
#> 6 6 16 26 11
#> 7 7 17 27 13
#> 8 8 18 28 15
#> 9 9 19 29 17
#> 10 10 20 30 19
More variables can be handled similarly:
mutate(d, sumx = x + lag(x, default = 0), sumy = y + lag(y, default = 0))
#> x y z sumx sumy
#> 1 1 11 21 1 11
#> 2 2 12 22 3 23
#> 3 3 13 23 5 25
#> 4 4 14 24 7 27
#> 5 5 15 25 9 29
#> 6 6 16 26 11 31
#> 7 7 17 27 13 33
#> 8 8 18 28 15 35
#> 9 9 19 29 17 37
#> 10 10 20 30 19 39
If you know that you want to do this for many, or even every, column in your data frame, then here's a standard-evaluation approach with mutate_() that uses a custom function adapted from this blog post (note you need to have the lazyeval package installed). The function gets applied to each column in a for loop (which could probably be optimised).
f <- function(df, col, new_col_name) {
  mutate_call <- lazyeval::interp(~ x + lag(x, default = 0), x = as.name(col))
  df %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
for (var in names(d)) {
  d <- f(d, var, paste0('sum', var))
}
d
#> x y z sumx sumy sumz
#> 1 1 11 21 1 11 21
#> 2 2 12 22 3 23 43
#> 3 3 13 23 5 25 45
#> 4 4 14 24 7 27 47
#> 5 5 15 25 9 29 49
#> 6 6 16 26 11 31 51
#> 7 7 17 27 13 33 53
#> 8 8 18 28 15 35 55
#> 9 9 19 29 17 37 57
#> 10 10 20 30 19 39 59
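Since this answer was written, mutate_() and lazyeval have been deprecated in dplyr; the same loop can now be expressed in one call with across() (dplyr >= 1.0.0). A sketch of that translation (mine), applied to the original three-column d:

library(dplyr)
d %>% mutate(across(everything(),
                    ~ .x + lag(.x, default = 0),
                    .names = "sum{.col}"))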
Just to continue the tidyverse theme, here's a solution using the purrr package (again, it works for all columns, but you can subset the columns if needed):
library(purrr)
# Create new columns in new data frame.
# Subset `d` here if only want select columns
sum_d <- map_df(d, ~ . + dplyr::lag(., default = 0))  # dplyr::lag, not stats::lag
# Set names correctly and
# bind back to original data
names(sum_d) <- paste0("sum", names(sum_d))
d <- cbind(d, sum_d)
d
#>     x  y  z sumx sumy sumz
#> 1   1 11 21    1   11   21
#> 2   2 12 22    3   23   43
#> 3   3 13 23    5   25   45
#> 4   4 14 24    7   27   47
#> 5   5 15 25    9   29   49
#> 6   6 16 26   11   31   51
#> 7   7 17 27   13   33   53
#> 8   8 18 28   15   35   55
#> 9   9 19 29   17   37   57
#> 10 10 20 30   19   39   59
You can use dplyr's lag function to achieve something like this:
library(dplyr)
myFrame$SumX <- myFrame$X + lag(myFrame$X, default = 0)
# SumX: cumulative sum minus the cumulative sum lagged by two positions,
# which leaves X[i] + X[i-1] (and plain X[1] for the first row)
cumsum(df$X) - c(0, 0, cumsum(df$X)[1:(nrow(df) - 2)])
# [1]  20  45  40  37  82 100 100  72  23
# SumY
cumsum(df$Y) - c(0, 0, cumsum(df$Y)[1:(nrow(df) - 2)])
# [1]   7  91 146  74  36  34  70 110  67

Assign weights in lpSolveAPI to prioritise variables

I am trying to set up a linear programming solution using lpSolveAPI and R to solve a scheduling problem. Below is a small sample of the data; the minutes required for each session id, and their 'preferred' order/weight.
id <- 1:100
min <- sample(0:500, 100)
weight <- (1:100)/sum(1:100)
data <- data.frame(id, min, weight)
What I want to do is arrange/schedule these session IDs so that each day holds the maximum number of sessions, preferably in order of their weight, with each day capped at a total of 400 minutes.
This is how I have set it up currently in R:
require(lpSolveAPI)
# Set up matrix to hold results; each row represents a day
r <- 5
c <- 10
row <- 1
results <- matrix(0, nrow = r, ncol = c)
rownames(results) <- format(seq(Sys.Date(), by = "days", length.out = r), "%Y-%m-%d")
for (i in 1:r) {
  for (j in 1:c) {
    lp <- make.lp(0, nrow(data))
    set.type(lp, 1:nrow(data), "binary")
    set.objfn(lp, rep(1, nrow(data)))
    lp.control(lp, sense = "max")
    add.constraint(lp, data$min, "<=", 400)
    set.branch.weights(lp, data$weight)
    solve(lp)
    a <- get.variables(lp) * data$id
    b <- a[a != 0]
    tryCatch(results[row, 1:length(b)] <- b, error = function(x) 0)
    if (dim(data[!data$id == a, ])[1] > 0) {
      data <- data[!data$id == a, ]
      row <- row + 1
    }
    break
  }
}
sum(results > 0)
barplot(results) #View of scheduled IDs
A quick look at the results matrix tells me that while the setup maximises the number of sessions (so that the total minutes in a day are as close to 400 as possible), it doesn't follow the weights given. I expect my results matrix to be filled with increasing session IDs.
I have tried assigning different weights, weights in reverse order, etc., but for some reason my setup doesn't seem to enforce "set.branch.weights".
I have read the documentation for "set.branch.weights" from lpSolveAPI, but I think I am doing something wrong here.
Example - Data:
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25
The result should be:
Day 1 67 72 36 91 80 44 76
Day 2 58 84 96 21 1 41 66 89
Day 3 62 11 42 68 25 44 90 4 33 31
Each day has a cumulative sum of <= 480 minutes.
My simple-minded approach:
df = read.table(header=T,text="
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25")
# assume rows are sorted by weight
daynr  <- 1    # current day number
daymax <- 480  # minutes available per day
dayusd <- 0    # minutes used so far on the current day
df$day <- NA
for (i in 1:nrow(df))
{
  v <- df$min[i]
  dayusd <- dayusd + v
  if (dayusd > daymax)
  {
    # session does not fit: start a new day
    daynr <- daynr + 1
    dayusd <- v
  }
  df$day[i] <- daynr
}
This will give:
> df
id min weight day
1 1 67 1 1
2 2 72 2 1
3 3 36 3 1
4 4 91 4 1
5 5 80 5 1
6 6 44 6 1
7 7 76 7 1
8 8 58 8 2
9 9 84 9 2
10 10 96 10 2
11 11 21 11 2
12 12 1 12 2
13 13 41 13 2
14 14 66 14 2
15 15 89 15 2
16 16 62 16 3
17 17 11 17 3
18 18 42 18 3
19 19 68 19 3
20 20 25 20 3
21 21 44 21 3
22 22 90 22 3
23 23 4 23 3
24 24 33 24 3
25 25 31 25 3
>
I will concentrate on the first solve. We basically solve a knapsack problem (an objective plus one constraint): maximize sum(x[j]) subject to sum(min[j] * x[j]) <= 400 with binary x[j].
When I run this model as is I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.5952381
Now when I change the objective so that it also rewards the weights (essentially maximize sum(x[j]) + sum(weight[j] * x[j]), so that the weights break ties between solutions with the same session count), I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.7428571
I.e. the count stayed at 14, but the weight improved.
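In lpSolveAPI terms the change amounts to a one-line edit of the question's setup. A sketch of my reading of it (since the weights sum to 1, the count term still dominates and the weights only break ties):

lp <- make.lp(0, nrow(data))
set.type(lp, 1:nrow(data), "binary")
lp.control(lp, sense = "max")
# count each selected session as 1, plus its (small) weight as a tie-breaker
set.objfn(lp, 1 + data$weight)
add.constraint(lp, data$min, "<=", 400)
solve(lp)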
