I am facing an issue of a simple problem.
The data I have contain the following variables :BCSID id DD MM DAY.
The personal identifier, an id-day idenfifier, the calendar day, the calendar month and the day of the week. DD_flag is a variable that I need to create in order to correct the DD date that are wrong because they do not increment according to the day DAY.
My data look like this
BCSID id DD MM DAY
200 B10011Q B10011Q2 24 10 2
201 B10011Q B10011Q2 24 10 2
202 B10011Q B10011Q2 24 10 2
203 B10011Q B10011Q2 24 10 2
204 B10011Q B10011Q2 24 10 2
205 B10011Q B10011Q2 24 10 2
206 B10011Q B10011Q2 24 10 2
207 B10011Q B10011Q3 24 10 3
208 B10011Q B10011Q3 24 10 3
209 B10011Q B10011Q3 24 10 3
210 B10011Q B10011Q3 24 10 3
211 B10011Q B10011Q3 24 10 3
212 B10011Q B10011Q3 24 10 3
213 B10011Q B10011Q3 24 10 3
214 B10011Q B10011Q3 24 10 3
I will create my DD_flag variable based on DD
dtadate$DD_flag <- as.numeric(dtadate$DD)
What I need to do is to simply increment +1 to th DD_flag variable each time the day DAY change for each identifier BCSID.
I thought that it could be simpler to use the collapsed id id for my loop.
1
I tried a R loop but
I am not sure why this solution is wrong
for(i in 2:nrow(dtadate)){
if( dtadate$id[i] == dtadate$id[i-1] )
{ dtadate$DD_flag[i] = dtadate$DD_flag[i] + 1 }
}
2
I tried a Rcpp solution, that almost gives me the correct output.
Here I used the BCSID and the DAY.
The incrementation is correct but unfortunately is does not re-use the incremented value for the rest of the loop.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector TimeAddOneCpp(CharacterVector idDay, CharacterVector Day, NumericVector time) {
int n = idDay.size();
int len = n ;
for ( int i = 1; i < len; ++i ) {
if( ( idDay[i] == idDay[i - 1] ) &
( Day[i] != Day [i - 1] )
)
time[i] = time[i-1] + 1;
}
return time;
}
The function
TimeAddOneCpp(idDay = dtadate$BCSID, Day = dtadate$DAY, time = dtadate$DD_flag)
Expected output
The output I want is the following
BCSID id DD MM DAY DD_flag
200 B10011Q B10011Q2 24 10 2 24
201 B10011Q B10011Q2 24 10 2 24
202 B10011Q B10011Q2 24 10 2 24
203 B10011Q B10011Q2 24 10 2 24
204 B10011Q B10011Q2 24 10 2 24
205 B10011Q B10011Q2 24 10 2 24
206 B10011Q B10011Q2 24 10 2 24
207 B10011Q B10011Q3 24 10 3 25
208 B10011Q B10011Q3 24 10 3 25
209 B10011Q B10011Q3 24 10 3 25
210 B10011Q B10011Q3 24 10 3 25
211 B10011Q B10011Q3 24 10 3 25
212 B10011Q B10011Q3 24 10 3 25
213 B10011Q B10011Q3 24 10 3 25
214 B10011Q B10011Q3 24 10 3 25
215 B10011Q B10011Q3 24 10 3 25
216 B10011Q B10011Q3 24 10 3 25
217 B10011Q B10011Q3 24 10 3 25
218 B10011Q B10011Q3 24 10 3 25
219 B10011Q B10011Q3 24 10 3 25
220 B10011Q B10011Q4 24 10 4 26
...
So each time the DAY change for each BCSID, the DD_flag based on DD should be incremented by +1.
The data
dta = structure(list(BCSID = c("B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10017W", "B10017W", "B10017W", "B10017W", "B10017W", "B10017W",
"B10017W", "B10017W", "B10017W", "B10017W", "B10017W", "B10017W",
"B10017W", "B10017W", "B10017W", "B10017W", "B10017W", "B10017W",
"B10017W"), id = c("B10011Q2", "B10011Q2", "B10011Q2", "B10011Q2",
"B10011Q2", "B10011Q2", "B10011Q2", "B10011Q3", "B10011Q3", "B10011Q3",
"B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3",
"B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3", "B10011Q4", "B10011Q4",
"B10011Q4", "B10011Q4", "B10011Q4", "B10011Q4", "B10011Q4", "B10011Q4",
"B10011Q4", "B10011Q4", "B10011Q5", "B10011Q5", "B10015U1", "B10015U1",
"B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U1",
"B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U2",
"B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2",
"B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2",
"B10015U2", "B10015U2", "B10015U2", "B10015U3", "B10015U3", "B10015U3",
"B10015U3", "B10015U3", "B10015U3", "B10015U3", "B10015U3", "B10015U3",
"B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4",
"B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4",
"B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1",
"B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1",
"B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1",
"B10017W1"), DD = c("24", "24", "24", "24", "24", "24", "24",
"24", "24", "24", "24", "24", "24", "24", "24", "24", "24", "24",
"24", "24", "24", "24", "24", "24", "24", "24", "24", "24", "24",
"24", "24", "24", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "13", "13", "13", "13", "13", "13", "13", "13", "13",
"13", "13", "13", "13", "13", "13", "13", "13", "13", "13"),
MM = c("10", "10", "10", "10", "10", "10", "10", "10", "10",
"10", "10", "10", "10", "10", "10", "10", "10", "10", "10",
"10", "10", "10", "10", "10", "10", "10", "10", "10", "10",
"10", "10", "10", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "6", "6", "6", "6", "6", "6",
"6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6",
"6"), DAY = c("2", "2", "2", "2", "2", "2", "2", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "4",
"4", "4", "4", "4", "4", "4", "4", "4", "4", "5", "5", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3", "3",
"3", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4",
"4", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1")), .Names = c("BCSID",
"id", "DD", "MM", "DAY"), row.names = 200:300, class = "data.frame")
library(dplyr)
dta %>%
group_by(BCSID) %>%
mutate(DD_flag = c(0, cumsum(diff(as.integer(DAY))))+as.integer(DD))
# Source: local data frame [101 x 6]
# Groups: BCSID
#
# BCSID id DD MM DAY DD_flag
# 1 B10011Q B10011Q2 24 10 2 24
# 2 B10011Q B10011Q2 24 10 2 24
# 3 B10011Q B10011Q2 24 10 2 24
# 4 B10011Q B10011Q2 24 10 2 24
# 5 B10011Q B10011Q2 24 10 2 24
# 6 B10011Q B10011Q2 24 10 2 24
# 7 B10011Q B10011Q2 24 10 2 24
# 8 B10011Q B10011Q3 24 10 3 25
# 9 B10011Q B10011Q3 24 10 3 25
# 10 B10011Q B10011Q3 24 10 3 25
# .. ... ... .. .. ... ...
One option would be to create the desired values for DD_flag outside the original object, then merge them in. Let's call the data frame you posted z. So:
flags <- data.frame(id = unique(z$id), DD_flag = seq(length(unique(z$id))))
z2 <- merge(z, flags, all.x = TRUE)
That approach assumes that you don't care about the order of those flags. If you do, you just need to put the unique values of the id variable in the desired order in or before that first line.
That approach also assumes that you don't already have a variable named DD_flag in z when you go to merge. If you do, you could just run this before the merge:
z$DD_flag <- NULL
This could be a possible solution
library(data.table)
setDT(dta)
out = rbindlist(
lapply(split(dta, dta$BCSID),
function(x){ x[, DD_flag := (as.numeric(x$DD) + .GRP)-1, by = DAY]}))
#> out
# BCSID id DD MM DAY DD_flag
#1: B10011Q B10011Q2 24 10 2 24
#2: B10011Q B10011Q2 24 10 2 24
#3: B10011Q B10011Q2 24 10 2 24
#4: B10011Q B10011Q2 24 10 2 24
#5: B10011Q B10011Q2 24 10 2 24
#6: B10011Q B10011Q2 24 10 2 24
#7: B10011Q B10011Q2 24 10 2 24
#8: B10011Q B10011Q3 24 10 3 25
#9: B10011Q B10011Q3 24 10 3 25
#10: B10011Q B10011Q3 24 10 3 25
#11: B10011Q B10011Q3 24 10 3 25
#12: B10011Q B10011Q3 24 10 3 25
#13: B10011Q B10011Q3 24 10 3 25
#14: B10011Q B10011Q3 24 10 3 25
#15: B10011Q B10011Q3 24 10 3 25
#...
Related
I have the following extract of my dataset about the occupancy of a football match:
example <- data.frame(Date <- c("2019-03-21", "2019-03-30", "2019-04-07",
"2019-03-21", "2019-03-30", "2019-04-07",
"2019-03-21", "2019-04-07",
"2019-03-21", "2019-03-30", "2019-04-07",
"2019-03-21", "2019-03-30", "2019-04-07",
"2019-03-21", "2019-03-30", "2019-04-07",
"2019-03-21", "2019-03-30", "2019-04-07",
"2019-03-21", "2019-03-30", "2019-04-07",
"2019-03-21", "2019-03-30",
"2019-03-21", "2019-03-30",
"2019-03-21", "2019-03-30",
"2019-03-21"),
Block <- c("43L","43L", "43L", "15B", "15B", "15B", "43L", "43L",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B",
"15B", "15B",
"15B", "15B",
"15B"),
Preis <- as.numeric(c("24", "35", "30", "35", "45",
"40", "26", "30",
"35", "45", "40",
"34", "43", "42",
"35", "42", "45",
"36", "45", "43",
"36", "43", "40",
"35", "41",
"32", "42",
"30", "42",
"35")),
Max <- c("3", "3", "3", "10", "10","10","3", "3",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10",
"10", "10",
"10", "10",
"10"),
Actual <- c("2", "1", "2", "10", "9", "6","2", "2",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9",
"10", "9",
"10", "9",
"10"),
Temperatur <- c("15", "20", "18","15", "20", "18", "15", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20",
"15", "20",
"15", "20",
"15"),
Placesold <- c("1", "1", "1", "1", "1","1", "1", "1",
"1", "1", "1",
"1", "1", "1",
"1", "1", "1",
"1", "1", "1",
"1", "1", "1",
"1", "1",
"1", "1",
"1", "1",
"1") )
colnames(example) <- c("Date", "Block", "Price", "Max", "Actual", "Temprature", "Placesold")
In reality, the dataset contains over 100 blocks and 46 different dates.
If you take a closer look at the data, you can see that different numbers of seats are sold out in block 15B and 43L on different days.
table(example$Date, example$Block)
table(example$Placesold)
15B 43L
2019-03-21 10 2
2019-03-30 9 1
2019-4-07 6 2
> table(example$Placesold)
1
30
My goal is to add the seats that were not sold to the data set. The variable Placesold should be 0 instead of 1. In addition, the average price of the sold tickets should be used instead of the price (without 0).
To clarify my goal, I have added the missing rows for the reduced data set.
result <- data.frame(Date <- c("2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07",
"2019-03-21", "2019-03-30", "2019-4-07"),
Block <- c("43L","43L", "43L",
"15B", "15B", "15B",
"43L", "43L","43L",
"15B", "15B", "15B",
"43L", "43L","43L",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B",
"15B", "15B", "15B"),
Preis <- c("24", "35", "30",
"35", "45", "40",
"26", "35","30",
"35", "45", "40",
"25", "35","30",
"34", "43", "42",
"35", "42", "45",
"36", "45", "43",
"36", "43", "40",
"35", "41", "41.67",
"32", "42", "41.67",
"30", "42", "41.67",
"35","43.11","41.67"),
Max <- c("3", "3", "3", "10", "10","10",
"3", "3", "3",
"10", "10","10",
"3", "3", "3",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10","10",
"10", "10","10"),
Actual <- c("2", "1", "2",
"10", "9", "6",
"2", "1","2",
"10", "9", "6",
"2", "1","2",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6",
"10", "9", "6"),
Temperatur <- c("15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18",
"15", "20", "18"),
Placesold <- c("1", "1", "1", "1", "1","1",
"1", "0", "1",
"1", "1", "1",
"0", "0", "0",
"1", "1", "1",
"1", "1", "1",
"1", "1", "1",
"1", "1", "1",
"1", "1", "0",
"1", "1", "0",
"1", "1", "0",
"1", "0", "0") )
colnames(result) <- c("Date", "Block", "Price", "Max", "Actual", "Temprature", "Placesold")
The results of the blocks and the data as well as the occurrence of the variable "Placesold" look like this:
table(result$Date, result$Block)
table(result$Placesold)
15B 43L
2019-03-21 10 3
2019-03-30 10 3
2019-4-07 10 3
> table(result$Placesold)
0 1
9 30
My first thought was to create a matrix with more rows, but to be honest I don't really know how. I hope you can help me.
Thank you very much.
I use dplyr functions and base::merge. merge can perform cross join between data frames, vectors and other types.
Construction of each date and block pair - it includes unsold blocks of a date:
# ordered, unique vector of dates
dates <- example$Date %>% unique() %>% sort()
# ordered, unique vector of blocks
blocks <- example$Block %>% unique() %>% sort()
# insert dummy block to demonstrate effects of missing blocks
blocks <- c("11B", blocks)
# cross join of dates and blocks: each date paired with each block
# (it results a data.frame)
eachDateBlock <- merge(dates, blocks, by = NULL)
# merge generate x and y as names for the resulted data.frame
# I rename them as a preparation for left_join
eachDateBlock <- eachDateBlock %>% rename(Date = x, Block = y)
# rows from 'eachDateBlock' with matchig row in 'example' get values of variables,
# otherwise they filled by NAs
extendedData <- eachDateBlock %>%
left_join(example, by = c("Date" = "Date", "Block" = "Block"))
# NOTE: before avgPrice you need something similar conversion - I ignore
# other numeric columns here
#example$Price <- as.double(example$Price)
#example$Placesold <- as.double(example$Placesold)
# Overwrite NAs in rows of supplied unsold blocks
avgPrice <- mean(example$Price)
result <- extendedData %>% mutate(
Price = if_else(is.na(Price), avgPrice, Price),
Placesold = if_else(is.na(Placesold), 0, Placesold)
) %>% arrange(Date)
> table(result$Date, result$Block)
11B 15B 43L
2019-03-21 1 10 2
2019-03-30 1 9 1
2019-04-07 1 6 2
> table(result$Placesold)
0 1
3 30
> result
Date Block Price Max Actual Temprature Placesold
1 2019-03-21 11B 37.53333 <NA> <NA> <NA> 0
.
.
.
12 2019-03-21 43L 24.00000 3 2 15 1
13 2019-03-21 43L 26.00000 3 2 15 1
14 2019-03-30 11B 37.53333 <NA> <NA> <NA> 0
15 2019-03-30 15B 45.00000 10 9 20 1
.
.
.
24 2019-03-30 43L 35.00000 3 1 20 1
25 2019-04-07 11B 37.53333 <NA> <NA> <NA> 0
.
.
.
31 2019-04-07 15B 40.00000 10 6 18 1
32 2019-04-07 43L 30.00000 3 2 18 1
33 2019-04-07 43L 30.00000 3 2 18 1
I have two matrices provided below:
cf = structure(c("7", "7", "7", "7", "7", "7", "7", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3", "3", "17", "18", "19",
"20", "21", "22", "23", "0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20", "21", "22", "23", "0", "1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17",
"18", "19", "20", "21", "22", "23", "0", "1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"), .Dim = c(71L,
2L), .Dimnames = list(NULL, c("d", "h")))
hour_df<-data.frame(
day = as.character(rep(c(1,2,3,4,5,6,7), each = 24)),
hours = as.character(rep(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23), times = 7)),
period = rep(c(rep("night",times = 8),rep("day",times = 12),rep("night",times = 4)), times = 7),
tariff_label = rep(c(rep("special feed", times = 8),rep("normal feed", times = 12),rep("special feed", times = 4)), times = 7),
week_period = c(rep("weekend",times = 32),rep("weekday",times = 108),rep("weekend",times = 28))
)
hour_df$tariff_label[hour_df$day %in% c("7","1")]<-"special feed"
hour_df<-as.matrix(hour_df)
I want to merge these matrices on two common columns in each matrix. e.g by.x = c("d","h"), by.y = c("day","hours")
If I use the base function merge()I get my desired output that looks like this
merge(cf,hour_df, by.x = c("d","h"), by.y = c("day","hours"))
d h period tariff_label week_period
1 1 0 night special feed weekend
2 1 1 night special feed weekend
3 1 10 day special feed weekend
4 1 11 day special feed weekend
5 1 12 day special feed weekend
6 1 13 day special feed weekend
7 1 14 day special feed weekend
8 1 15 day special feed weekend
9 1 16 day special feed weekend
10 1 17 day special feed weekend
11 1 18 day special feed weekend
12 1 19 day special feed weekend
13 1 2 night special feed weekend
14 1 20 night special feed weekend
15 1 21 night special feed weekend
16 1 22 night special feed weekend
17 1 23 night special feed weekend
18 1 3 night special feed weekend
19 1 4 night special feed weekend
20 1 5 night special feed weekend
21 1 6 night special feed weekend
22 1 7 night special feed weekend
23 1 8 day special feed weekend
24 1 9 day special feed weekend
25 2 0 night special feed weekend
26 2 1 night special feed weekend
27 2 10 day normal feed weekday
28 2 11 day normal feed weekday
29 2 12 day normal feed weekday
30 2 13 day normal feed weekday
31 2 14 day normal feed weekday
32 2 15 day normal feed weekday
33 2 16 day normal feed weekday
34 2 17 day normal feed weekday
35 2 18 day normal feed weekday
36 2 19 day normal feed weekday
37 2 2 night special feed weekend
38 2 20 night special feed weekday
39 2 21 night special feed weekday
40 2 22 night special feed weekday
41 2 23 night special feed weekday
42 2 3 night special feed weekend
43 2 4 night special feed weekend
44 2 5 night special feed weekend
45 2 6 night special feed weekend
46 2 7 night special feed weekend
47 2 8 day normal feed weekday
48 2 9 day normal feed weekday
49 3 0 night special feed weekday
50 3 1 night special feed weekday
51 3 10 day normal feed weekday
52 3 11 day normal feed weekday
53 3 12 day normal feed weekday
54 3 13 day normal feed weekday
55 3 14 day normal feed weekday
56 3 15 day normal feed weekday
57 3 2 night special feed weekday
58 3 3 night special feed weekday
59 3 4 night special feed weekday
60 3 5 night special feed weekday
61 3 6 night special feed weekday
62 3 7 night special feed weekday
63 3 8 day normal feed weekday
64 3 9 day normal feed weekday
65 7 17 day special feed weekend
66 7 18 day special feed weekend
67 7 19 day special feed weekend
68 7 20 night special feed weekend
69 7 21 night special feed weekend
70 7 22 night special feed weekend
71 7 23 night special feed weekend
As you see above I have 71 rows. I wanted to see if there is a faster function for merging matrices. I saw online that there is a function called merge.Matrix() and it should be faster than base merge. However, when I tried to implement it, I got a completely different result.
library(Matrix.utils)
merge.Matrix(cf,hour_df, by.x = c("d","h"), by.y = c("day","hours"))
d h day hours period tariff_label week_period
"7" "17" "1" "2" "night" "special feed" "weekend"
"7" "19" "1" "0" "night" "special feed" "weekend"
"7" "18" "1" "2" "night" "special feed" "weekend"
"7" "19" "1" "1" "night" "special feed" "weekend"
I tried to see online how it is used and more information about it but information on this function seems to be scarce. I also checked out the vignette. Can someone tell me what I am doing wrong or whether there is a better function than this?
Please Note
I am already aware of dplyr joins and data.table. It is important that both of the matrices stay matrices and that they are not changed into some other format. In reality, my code is performing a join from a list that contains thousands of matrices and therefore needs to be quick.
I have the following data and I want to subset some rows from the table if the name is in the vector l.
df <-data.frame("Names" = c("TIGIT", "ABCB1", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "ABL1", "CD2", "IL12A", "PSEN2", "CD3G", "CD28", "PSEN1", "ITGA1"),"1S" = c("5", "6", "8", "99", "5", "0", "1", "3", "15", "15", "34", "62", "54", "6", "8", "9"), "1T" = c("6", "4", "6", "9", "5", "11", "33", "7", "8", "24", "34", "62", "66", "4", "78", "44"))
rownames(df) <- df$Names
df <- df %>% select(-"Names") # df I have
l <- c("TIGIT", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "CD2", "PSEN2", "CD3G", "CD28", "PSEN1") # genes I want to select
I want to get the following table in the output.
X1S X1T
TIGIT 5 6
CD8B 8 6
CD8A 99 9
CD1C 5 5
F2RL1 0 11
LCP1 1 33
LAG3 3 7
CD2 15 24
PSEN2 62 62
CD3G 54 66
CD28 6 4
PSEN1 8 78
It is easier to filter by the gene names, if you keep them as a column,
instead of making them rownames.
The following changes to your code will get you the result you are lookin for.
library(tidyverse)
df <-data.frame("Names" = c("TIGIT", "ABCB1", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "ABL1", "CD2", "IL12A", "PSEN2", "CD3G", "CD28", "PSEN1", "ITGA1"),"1S" = c("5", "6", "8", "99", "5", "0", "1", "3", "15", "15", "34", "62", "54", "6", "8", "9"), "1T" = c("6", "4", "6", "9", "5", "11", "33", "7", "8", "24", "34", "62", "66", "4", "78", "44"))
genes_to_select <- c("TIGIT", "CD8B", "CD8A", "CD1C", "F2RL1", "LCP1", "LAG3", "CD2", "PSEN2", "CD3G", "CD28", "PSEN1") # genes I want to select
df <-
df %>%
filter(Names %in% genes_to_select) %>%
column_to_rownames("Names") %>%
mutate(across(.fns = as.numeric)) %>%
as.matrix()
df
#> X1S X1T
#> [1,] 5 6
#> [2,] 8 6
#> [3,] 99 9
#> [4,] 5 5
#> [5,] 0 11
#> [6,] 1 33
#> [7,] 3 7
#> [8,] 15 24
#> [9,] 62 62
#> [10,] 54 66
#> [11,] 6 4
#> [12,] 8 78
We could also use slice
library(dplyr)
library(tibble)
df %>%
slice(match(Names, l)) %>%
column_to_rownames('Names')
One line does the job:
df[rownames(df) %in% l,]
X1S X1T
TIGIT 5 6
CD8B 8 6
CD8A 99 9
CD1C 5 5
F2RL1 0 11
LCP1 1 33
LAG3 3 7
CD2 15 24
PSEN2 62 62
CD3G 54 66
CD28 6 4
PSEN1 8 78
Or if you have Names:
df[df$Names %in% l,]
I am analyzing data of patient admission/discharge in a number of hospitals for various inconsistencies.
My data structure is like -
Row_id ; nothing but a unique identifier of records (used as foreign key in some other table)
patient_id : unique identifier key for a patient
pack_id : the medical package chosen by the patient for treatment
hospital_id : unique identifier for a hospital
admn_dt : the date of admission
discharge_date : the date of discharge of patient
Snapshot of data
row_id patient_id pack_id hosp_id admn_date discharge_date
1 1 12 1 01-01-2020 14-01-2020
2 1 62 2 03-01-2020 15-01-2020
3 1 77 1 16-01-2020 27-01-2020
4 1 86 1 18-01-2020 19-01-2020
5 1 20 2 22-01-2020 25-01-2020
6 2 55 3 01-01-2020 14-01-2020
7 2 86 3 03-01-2020 17-01-2020
8 2 72 4 16-01-2020 27-01-2020
9 1 7 1 26-01-2020 30-01-2020
10 3 54 5 14-01-2020 22-01-2020
11 3 75 5 09-02-2020 17-02-2020
12 3 26 6 22-01-2020 05-02-2020
13 4 21 7 14-04-2020 23-04-2020
14 4 12 7 23-04-2020 29-04-2020
15 5 49 8 17-03-2020 26-03-2020
16 5 35 9 27-02-2020 07-03-2020
17 6 51 10 12-04-2020 15-04-2020
18 7 31 11 11-02-2020 17-02-2020
19 8 10 12 07-03-2020 08-03-2020
20 8 54 13 20-03-2020 23-03-2020
sample dput of data is as under:
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
I have to identify the records where patient got admitted without discharge from previous treatment. For this I have used the following code taking help from this thread How to know customers who placed next order before delivery/receiving of earlier order? In R -
library(tidyverse)
df %>% arrange(patient_id, admn_date, discharge_date) %>%
mutate(sort_key = row_number()) %>%
pivot_longer(c(admn_date, discharge_date), names_to ="activity",
values_to ="date", names_pattern = "(.*)_date") %>%
mutate(activity = factor(activity, ordered = T,
levels = c("admn", "discharge")),
admitted = ifelse(activity == "admn", 1, -1)) %>%
group_by(patient_id) %>%
arrange(date, sort_key, activity, .by_group = TRUE) %>%
mutate (admitted = cumsum(admitted)) %>%
ungroup() %>%
filter(admitted >1, activity == "admn")
This give me nicely all the records where patients got admission without being discharged from previous treatment.
Output-
# A tibble: 6 x 8
row_id patient_id pack_id hosp_id sort_key activity date admitted
<chr> <chr> <chr> <chr> <int> <ord> <date> <dbl>
1 2 1 62 2 2 admn 2020-01-03 2
2 4 1 86 1 4 admn 2020-01-18 2
3 5 1 20 2 5 admn 2020-01-22 2
4 9 1 7 1 6 admn 2020-01-26 2
5 7 2 86 3 8 admn 2020-01-03 2
6 8 2 72 4 9 admn 2020-01-16 2
Explanation-
Row_id 2 is correct because it overlaps with row_id 1
Row_id 4 is correct because it overlaps with row_id 3
Row_id 5 is correct because it overlaps with row_id 3 (again)
Row_id 9 is correct because it overlaps with row_id 3 (again)
Row_id 7 is correct becuase it overlaps with row_id 6
Row_id 8 is correct becuase it overlaps with row_id 7
Now I am stuck at a given validation rule that patients are allowed to take admission in same hospital n number of times without actually validating for their previous discharge. In other words, I have to extract only those records where patients got admitted in a different hospital without being discharged from 'another hospital. If the hospital would have been same, the group_by at hosp_id field could have done the work for me, but here the case is actually reverse. For same hosp_id it is allowed but for different it is not allowed.
Please help how may I proceed?
If I could map the resultant row_id with its overlapping record's row_id, may be we can solve the problem.
Desired Output-
row_id
2
5
8
because row_ids 4,, 9 and 7 overlaps with record having same hospital id.
Thanks in advance.
P.S. Though a desired solution has been given, I want to know can it done through map/apply group of function and/or through data.table package?
Is this what you're looking for? (Refer to the comments in the code for details. I can provide clarifications if necessary.)
#Your data
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
#Solution
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
library(lubridate)
#Convert patient_id column into numeric
df$patient_id <- as.numeric(df$patient_id)
#Create empty (well, 1 row) data.frame to
#collect output data
#This needs three additional columns
#(as indicated)
outdat <- data.frame(matrix(nrow = 1, ncol = 9), stringsAsFactors = FALSE)
names(outdat) <- c(names(df), "ref_discharge_date", "ref_hosp_id", "overlap")
#Logic:
#For each unique patient_id take all
#their records.
#For each row of each such set of records
#compare its discharge_date with the admn_date
#of all other records with admn_date >= its own
#admn_date
#Then register the time interval between this row's
#discharge_date and the compared row's admn_date
#as a numeric value ("overlap")
#The idea is that concurrent hospital stays will have
#negative overlaps as the admn_date (of the current stay)
#will precede the discharge_date (of the previous one)
for(i in 1:length(unique(df$patient_id))){
#i <- 7
curdat <- df %>% filter(patient_id == unique(df$patient_id)[i])
curdat %<>% mutate(admn_date = lubridate::as_date(admn_date),
discharge_date = lubridate::as_date(discharge_date))
curdat %<>% arrange(admn_date)
for(j in 1:nrow(curdat)){
#j <- 1
currow <- curdat[j, ]
#otrows <- curdat[-j, ]
#
otrows <- curdat %>% filter(admn_date >= currow$admn_date)
#otrows <- curdat
for(k in 1:nrow(otrows)){
otrows$ref_discharge_date[k] <- currow$discharge_date
#otrows$refdisc[k] <- as_date(otrows$refdisc[k])
otrows$ref_hosp_id[k] <- currow$hosp_id
otrows$overlap[k] <- as.numeric(difftime(otrows$admn_date[k], currow$discharge_date))
}
otrows$ref_discharge_date <- as_date(otrows$ref_discharge_date)
outdat <- bind_rows(outdat, otrows)
}
}
rm(curdat, i, j, k, otrows, currow)
#Removing that NA row + removing all self-rows
outdat %<>%
filter(!is.na(patient_id)) %>%
filter(discharge_date != ref_discharge_date)
#Filter out only negative overlaps
outdat %<>% filter(overlap < 0)
#Filter out only those records where the patient
#was admitted to different hospitals
outdat %<>% filter(hosp_id != ref_hosp_id)
outdat
# row_id patient_id pack_id hosp_id admn_date discharge_date ref_discharge_date ref_hosp_id overlap
# 1 2 1 62 2 2020-01-03 2020-01-15 2020-01-14 1 -11
# 2 5 1 20 2 2020-01-22 2020-01-25 2020-01-27 1 -5
# 3 8 2 72 4 2020-01-16 2020-01-27 2020-01-17 3 -1
Group by the patient id again and then count the hospital IDs. Then merge that back on and filter the data.
Something like:
admitted_not_validated %>%
left_join(
admitted_not_validated %>%
group_by(patient_id) %>%
summarize (multi_hosp = length(unique(hosp_id)),.groups ='drop'),
by = 'patient_id') %>%
filter(multi_hosp >1)
I'm having trouble finding the mean for a subset of data. Here are the two questions I'm hoping to answer. The first seems to be working fine, but the second returns the same answer as the first, but without numbers to the right of the decimal place. What's going on?
There is also an error that appears:
NAs introduced by coercionNAs introduced by coercionNAs introduced by coercionNAs introduced by coercion
# What is the mean suspension rate for schools by farms overall?
aggregate(suspension_rate_total ~ farms, merged_data, FUN = function(suspension_rate_total)
mean(as.numeric(as.character(suspension_rate_total))))
# What is the mean suspension rate for schools with farms > 100?
aggregate(suspension_rate_total ~ farms, merged_data, FUN = function(suspension_rate_total)
mean(as.numeric(as.character(suspension_rate_total))), subset = farms< 100)
Data
merged_data <- structure(list(schid = c("1030642", "1030766", "1030774", "1030840",
"1130103", "1230150", "1530435", "1530492", "1530500", "1931047",
"1931708", "1931864", "1932623", "1933746", "1937226", "1938554",
"1938612", "1938885", "1995836", "1996016"), farms = c("132",
"116", "348", "406", "68", "130", "370", "204", "225", "2,616",
"1,106", "1,918", "1,148", "2,445", "1,123", "1,245", "1,369",
"1,073", "932", "178"), foster = c("2", "0", "1", "8", "1", "4",
"4", "0", "0", "22", "11", "12", "2", "8", "13", "13", "4", "3",
"2", "3"), homeless = c("14", "0", "8", "4", "1", "4", "5", "0",
"14", "35", "42", "116", "9", "8", "34", "54", "26", "31", "5",
"11"), migrant = c("0", "0", "0", "0", "0", "0", "18", "0", "0",
"0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0"), ell = c("18",
"12", "114", "45", "7", "4", "50", "28", "26", "274", "212",
"325", "95", "112", "232", "185", "121", "84", "24", "35"), suspension_rate_total = c("*",
"20", "0", "0", "95", "5", "*", "256", "78", "33", "20", "1",
"218", "120", "0", "0", "*", "*", "*", "0"), suspension_violent = c("*",
"9", "0", "0", "20", "2", "*", "38", "0", "6", "3", "0", "53",
"35", "0", "0", "*", "*", "*", "0"), suspension_violent_no_injury = c("*",
"6", "0", "0", "47", "1", "*", "121", "52", "7", "13", "1", "77",
"44", "0", "0", "*", "*", "*", "0"), suspension_weapon = c("*",
"0", "0", "0", "8", "0", "*", "1", "0", "1", "1", "0", "4", "3",
"0", "0", "*", "*", "*", "0"), suspension_drug = c("*", "0",
"0", "0", "9", "1", "*", "59", "12", "16", "0", "0", "6", "5",
"0", "0", "*", "*", "*", "0"), suspension_defiance = c("*", "1",
"0", "0", "9", "1", "*", "16", "12", "0", "3", "0", "69", "30",
"0", "0", "*", "*", "*", "0"), suspension_other = c("*", "4",
"0", "0", "2", "0", "*", "21", "2", "3", "0", "0", "9", "3",
"0", "0", "*", "*", "*", "0")), row.names = c(NA, 20L), class = "data.frame")
Thank you so much.
Image-1
Image-2
Tidy up your data:
# replace * with NA
merged_data$suspension_rate_total[merged_data$suspension_rate_total == '*'] <- NA
# convert character to numeric format
merged_data$suspension_rate_total <- as.numeric(merged_data$suspension_rate_total)
# remove comma in strings and convert character to numeric format
merged_data$farms <- as.numeric(gsub(",", "", merged_data$farms))
Output
# What is the mean suspension rate for schools by farms overall?
aggregate(suspension_rate_total ~ farms, merged_data, FUN = mean, na.rm = TRUE)
# farms suspension_rate_total
# 1 68 95
# 2 116 20
# 3 130 5
# 4 178 0
# 5 204 256
# 6 225 78
# 7 348 0
# 8 406 0
# 9 1106 20
# 10 1123 0
# 11 1148 218
# 12 1245 0
# 13 1918 1
# 14 2445 120
# 15 2616 33
# What is the mean suspension rate for schools with farms > 100?
aggregate(suspension_rate_total ~ farms, merged_data, FUN = mean, na.rm = TRUE, subset = farms > 100)
# farms suspension_rate_total
# 1 116 20
# 2 130 5
# 3 178 0
# 4 204 256
# 5 225 78
# 6 348 0
# 7 406 0
# 8 1106 20
# 9 1123 0
# 10 1148 218
# 11 1245 0
# 12 1918 1
# 13 2445 120
# 14 2616 33
Are you sure 'NA's introduced by coercion' is a error and not a warning.
When you convert a character column to numeric :
as.numeric(as.character(suspension_rate_total)) , the blanks are coerced into NA's , which is intimated through warnings.
Also, I get different answers for both blocks of code
> aggregate(suspension_rate_total ~ farms, merged_data, FUN = function(suspension_rate_total)
+ mean(as.numeric(as.character(suspension_rate_total))))
farms suspension_rate_total
1 68 95
2 116 20
3 130 5
4 132 NA
5 178 0
6 204 256
7 225 78
8 348 0
9 370 NA
10 406 0
11 932 NA
> aggregate(suspension_rate_total ~ farms, merged_data, FUN = function(suspension_rate_total)
+ mean(as.numeric(as.character(suspension_rate_total))), subset = farms< 100)
farms suspension_rate_total
1 68 95
>
>
Further, the comment on you second block of code mention farms > 100? , but in you code you used subset = farms< 100