Suppose we have the following data:
df1 <- tibble::tribble(
  ~"M1", ~"M2", ~"Beer, pints", ~"Coffee, oz", ~"Gasoline, galons", ~"Milk, galons", ~"Warehouse, square feet", ~"Nearest place, miles",
  "NY", "22", "10", "12", "15", "100", "100", "20",
  "NY", "20", "9", "10", "12", "100", "100", "20",
  "NY", "18", "8", "9", "11", "100", "100", "20",
  "M1", "M2", "Beer, liters", "Coffee, cups (120 ml)", "Gasoline, liters", "Milk, liters", "Warehouse, square meters", "Nearest place, kilometers",
  "PR", "22", "7", "8", "9", "70", "67", "7",
  "PR", "20", "6", "7", "8", "80", "75", "7",
  "M1", "M2", "Beer, pints", "Coffee, oz", "Gasoline, liters", "Milk, liters", "Warehouse, square feet", "Nearest place, miles",
  "KR", "22", "6", "6", "7", "60", "50", "9",
  "KR", "20", "5", "6", "8", "55", "65", "9",
  "KR", "18", "5", "6", "8", "50", "55", "9"
)
Is there a nice method to recalculate all columns into consistent units (for example, if a column uses liters, then the entire column should be in liters; if miles rather than kilometers, then the entire column should be in miles), based on the conditions given in the subheadings inside the data?
It would be great to think about the nicest methods to solve this.
PS: for information:
1 gallon = 3.78541 liters
1 pint = 0.473176 liters
1 oz = 0.0295735 liters
11 square feet = 1.02193 square meters
1 mile = 1.60934 kilometers
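These factors could, for instance, be kept in a named lookup vector so they are easy to apply later (just a sketch of one possible starting point; the names mirror the subheadings above, misspellings included):
to_metric <- c(
  "galons"      = 3.78541,       # US gallon -> liters
  "pints"       = 0.473176,      # pint -> liters
  "oz"          = 0.0295735,     # fluid ounce -> liters
  "square feet" = 1.02193 / 11,  # square feet -> square meters
  "miles"       = 1.60934        # mile -> kilometers
)
to_metric[["pints"]] * 10  # 10 pints -> 4.73176 liters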
I am just wondering about this and have only started to consider a solution.
I am interested in looking for possible nice solutions.
In addition, it will be interesting for the entire R community to think about the best methods to edit data by condition.
When the data is sloppy, we must also get our hands dirty. I thought of a way, with many steps.
Data
df1 <-
structure(list(m1 = c("M1", "NY", "NY", "NY", "M1", "PR", "PR",
"M1", "KR", "KR", "KR"), m2 = c("M2", "22", "20", "18", "M2",
"22", "20", "M2", "22", "20", "18"), beer = c("Beer, pints",
"10", "9", "8", "Beer, liters", "7", "6", "Beer, pints", "6",
"5", "5"), coffee = c("Coffee, oz", "12", "10", "9", "Coffee, cups (120 ml)",
"8", "7", "Coffee, oz", "6", "6", "6"), gasoline = c("Gasoline, galons",
"15", "12", "11", "Gasoline, liters", "9", "8", "Gasoline, liters",
"7", "8", "8"), milk = c("Milk, galons", "100", "100", "100",
"Milk, liters", "70", "80", "Milk, liters", "60", "55", "50"),
warehouse = c("Warehouse, square feet", "100", "100", "100",
"Warehouse, square meters", "67", "75", "Warehouse, square feet",
"50", "65", "55"), nearest_place = c("Nearest_place, miles",
"20", "20", "20", "Nearest place, kilometers", "7", "7",
"Nearest place, miles", "9", "9", "9")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
Convert function
convert_unit <- function(value, unit){
  # multiplier that converts the given unit to its metric counterpart
  m <-
    case_when(
      unit == "galons"     ~ 3.78541,
      unit == "pints"      ~ 0.473176,
      unit == "oz"         ~ 0.0295735,
      unit == "squarefeet" ~ 1.02193/11,
      unit == "miles"      ~ 1.60934,
      TRUE ~ 1
    )
  output <- m*as.numeric(value)
  return(output)
}
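A quick check of the function (note that case_when() comes from dplyr, which is loaded in the next section):
convert_unit(value = "10", unit = "pints")   # 4.73176 (10 pints in liters)
convert_unit(value = "7",  unit = "liters")  # 7, already metric so left unchanged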
Data preparation
First, I would add the header as the first row and also create better names.
library(dplyr)
library(stringr)
library(tidyr)
#remotes::install_github("vbfelix/relper")
library(relper)
or_names <- names(df1)
new_names <- str_to_lower(str_select(or_names,before = ","))
n_row <- nrow(df1)
df1[2:(n_row+1),] <- df1
df1[1,] <- as.list(or_names)
names(df1) <- new_names
Data manipulation
Then, I would create new columns with the units, and then apply the function to each one.
df1 %>%
  mutate(
    # extract the text after the comma as the unit for each measurement column
    across(.cols = -c(m1:m2), .fns = ~str_keep(str_select(., after = ",")), .names = "{.col}_unit"),
    # data rows have no comma, so their extracted "unit" is empty
    aux = beer_unit == "",
    across(.cols = ends_with("_unit"), ~if_else(. == "", NA_character_, .))) %>%
  # carry each subheading's unit down to the data rows below it
  fill(ends_with("_unit"), .direction = "down") %>%
  # keep only the data rows, dropping the subheading rows
  filter(aux) %>%
  mutate(
    across(
      .cols = beer:nearest_place,
      .fns = ~convert_unit(value = ., unit = get(str_c(cur_column(), "_unit")))
    )
  ) %>%
  select(-aux, -ends_with("_unit"))
Output
# A tibble: 8 x 8
m1 m2 beer coffee gasoline milk warehouse nearest_place
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NY 22 4.73 0.355 56.8 379. 9.29 32.2
2 NY 20 4.26 0.296 45.4 379. 9.29 32.2
3 NY 18 3.79 0.266 41.6 379. 9.29 32.2
4 PR 22 7 8 9 70 67 7
5 PR 20 6 7 8 80 75 7
6 KR 22 2.84 0.177 7 60 4.65 14.5
7 KR 20 2.37 0.177 8 55 6.04 14.5
8 KR 18 2.37 0.177 8 50 5.11 14.5
I am analyzing data of patient admission/discharge in a number of hospitals for various inconsistencies.
My data structure is as follows -
row_id : a unique identifier of records (used as a foreign key in another table)
patient_id : unique identifier for a patient
pack_id : the medical package chosen by the patient for treatment
hosp_id : unique identifier for a hospital
admn_date : the date of admission
discharge_date : the date of discharge of the patient
Snapshot of data
row_id patient_id pack_id hosp_id admn_date discharge_date
1 1 12 1 01-01-2020 14-01-2020
2 1 62 2 03-01-2020 15-01-2020
3 1 77 1 16-01-2020 27-01-2020
4 1 86 1 18-01-2020 19-01-2020
5 1 20 2 22-01-2020 25-01-2020
6 2 55 3 01-01-2020 14-01-2020
7 2 86 3 03-01-2020 17-01-2020
8 2 72 4 16-01-2020 27-01-2020
9 1 7 1 26-01-2020 30-01-2020
10 3 54 5 14-01-2020 22-01-2020
11 3 75 5 09-02-2020 17-02-2020
12 3 26 6 22-01-2020 05-02-2020
13 4 21 7 14-04-2020 23-04-2020
14 4 12 7 23-04-2020 29-04-2020
15 5 49 8 17-03-2020 26-03-2020
16 5 35 9 27-02-2020 07-03-2020
17 6 51 10 12-04-2020 15-04-2020
18 7 31 11 11-02-2020 17-02-2020
19 8 10 12 07-03-2020 08-03-2020
20 8 54 13 20-03-2020 23-03-2020
A sample dput of the data is given below:
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
I have to identify the records where a patient got admitted without being discharged from a previous treatment. For this I have used the following code, with help from this thread: How to know customers who placed next order before delivery/receiving of earlier order? In R -
library(tidyverse)
df %>% arrange(patient_id, admn_date, discharge_date) %>%
mutate(sort_key = row_number()) %>%
pivot_longer(c(admn_date, discharge_date), names_to ="activity",
values_to ="date", names_pattern = "(.*)_date") %>%
mutate(activity = factor(activity, ordered = T,
levels = c("admn", "discharge")),
admitted = ifelse(activity == "admn", 1, -1)) %>%
group_by(patient_id) %>%
arrange(date, sort_key, activity, .by_group = TRUE) %>%
mutate (admitted = cumsum(admitted)) %>%
ungroup() %>%
filter(admitted >1, activity == "admn")
This nicely gives me all the records where patients got admitted without being discharged from a previous treatment.
Output-
# A tibble: 6 x 8
row_id patient_id pack_id hosp_id sort_key activity date admitted
<chr> <chr> <chr> <chr> <int> <ord> <date> <dbl>
1 2 1 62 2 2 admn 2020-01-03 2
2 4 1 86 1 4 admn 2020-01-18 2
3 5 1 20 2 5 admn 2020-01-22 2
4 9 1 7 1 6 admn 2020-01-26 2
5 7 2 86 3 8 admn 2020-01-03 2
6 8 2 72 4 9 admn 2020-01-16 2
Explanation-
Row_id 2 is correct because it overlaps with row_id 1
Row_id 4 is correct because it overlaps with row_id 3
Row_id 5 is correct because it overlaps with row_id 3 (again)
Row_id 9 is correct because it overlaps with row_id 3 (again)
Row_id 7 is correct because it overlaps with row_id 6
Row_id 8 is correct because it overlaps with row_id 7
Now I am stuck on a given validation rule: patients are allowed to take admission in the same hospital any number of times without actually validating their previous discharge. In other words, I have to extract only those records where a patient got admitted to a different hospital without being discharged from another hospital. If the hospital were always the same, grouping by the hosp_id field could have done the work for me, but here the case is actually the reverse: for the same hosp_id it is allowed, but for a different one it is not.
Please help: how may I proceed?
If I could map each resulting row_id to its overlapping record's row_id, maybe we could solve the problem.
Desired Output-
row_id
2
5
8
because row_ids 4, 9 and 7 overlap with records having the same hospital id.
Thanks in advance.
P.S. Though a desired solution has been given, I want to know whether it can be done through the map/apply family of functions and/or through the data.table package.
Is this what you're looking for? (Refer to the comments in the code for details. I can provide clarifications if necessary.)
#Your data
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
#Solution
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
library(lubridate)
#Convert patient_id column into numeric
df$patient_id <- as.numeric(df$patient_id)
#Create empty (well, 1 row) data.frame to
#collect output data
#This needs three additional columns
#(as indicated)
outdat <- data.frame(matrix(nrow = 1, ncol = 9), stringsAsFactors = FALSE)
names(outdat) <- c(names(df), "ref_discharge_date", "ref_hosp_id", "overlap")
#Logic:
#For each unique patient_id take all
#their records.
#For each row of each such set of records
#compare its discharge_date with the admn_date
#of all other records with admn_date >= its own
#admn_date
#Then register the time interval between this row's
#discharge_date and the compared row's admn_date
#as a numeric value ("overlap")
#The idea is that concurrent hospital stays will have
#negative overlaps as the admn_date (of the current stay)
#will precede the discharge_date (of the previous one)
for(i in 1:length(unique(df$patient_id))){
#i <- 7
curdat <- df %>% filter(patient_id == unique(df$patient_id)[i])
curdat %<>% mutate(admn_date = lubridate::as_date(admn_date),
discharge_date = lubridate::as_date(discharge_date))
curdat %<>% arrange(admn_date)
for(j in 1:nrow(curdat)){
#j <- 1
currow <- curdat[j, ]
#otrows <- curdat[-j, ]
#
otrows <- curdat %>% filter(admn_date >= currow$admn_date)
#otrows <- curdat
for(k in 1:nrow(otrows)){
otrows$ref_discharge_date[k] <- currow$discharge_date
#otrows$refdisc[k] <- as_date(otrows$refdisc[k])
otrows$ref_hosp_id[k] <- currow$hosp_id
otrows$overlap[k] <- as.numeric(difftime(otrows$admn_date[k], currow$discharge_date))
}
otrows$ref_discharge_date <- as_date(otrows$ref_discharge_date)
outdat <- bind_rows(outdat, otrows)
}
}
rm(curdat, i, j, k, otrows, currow)
#Removing that NA row + removing all self-rows
outdat %<>%
filter(!is.na(patient_id)) %>%
filter(discharge_date != ref_discharge_date)
#Filter out only negative overlaps
outdat %<>% filter(overlap < 0)
#Filter out only those records where the patient
#was admitted to different hospitals
outdat %<>% filter(hosp_id != ref_hosp_id)
outdat
# row_id patient_id pack_id hosp_id admn_date discharge_date ref_discharge_date ref_hosp_id overlap
# 1 2 1 62 2 2020-01-03 2020-01-15 2020-01-14 1 -11
# 2 5 1 20 2 2020-01-22 2020-01-25 2020-01-27 1 -5
# 3 8 2 72 4 2020-01-16 2020-01-27 2020-01-17 3 -1
Group by the patient id again and then count the hospital IDs. Then merge that back on and filter the data.
Something like:
admitted_not_validated %>%
  left_join(
    admitted_not_validated %>%
      group_by(patient_id) %>%
      summarize(multi_hosp = length(unique(hosp_id)), .groups = 'drop'),
    by = 'patient_id') %>%
  filter(multi_hosp > 1)
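An equivalent filter that avoids the explicit self-join, if the multi_hosp helper column is not needed afterwards (a sketch, assuming the same admitted_not_validated object):
admitted_not_validated %>%
  group_by(patient_id) %>%
  filter(n_distinct(hosp_id) > 1) %>%
  ungroup()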
I am reading in an Excel table with multiple rows of headers, which, through read.csv, creates an object like this in R.
R1 <- c("X", "X.1", "X.2", "X.3", "EU", "EU.1", "EU.2", "US", "US.1", "US.2")
R2 <- c("Min Age", "Max Age", "Min Duration", "Max Duration", "1", "2", "3", "1", "2", "3")
R3 <- c("18", "21", "1", "3", "0.12", "0.32", "0.67", "0.80", "0.90", "1.01")
R4 <- c("22", "25", "1", "3", "0.20", "0.40", "0.70", "0.85", "0.98", "1.05")
R5 <- c("26", "30", "1", "3", "0.25", "0.50", "0.80", "0.90", "1.05", "1.21")
R6 <- c("18", "21", "4", "5", "0.32", "0.60", "0.95", "0.99", "1.30", "1.40")
R7 <- c("22", "25", "4", "5", "0.40", "0.70", "1.07", "1.20", "1.40", "1.50")
R8 <- c("26", "30", "4", "5", "0.55", "0.80", "1.09", "1.34", "1.67", "1.99")
table1 <- as.data.frame(rbind(R1, R2, R3, R4, R5, R6, R7, R8))
How do I now 'flatten' this so that I end up with an R table with "Min Age", "Max Age", "Min Duration", "Max Duration", "Area", "Level" and "Price" columns, with the "Area" column showing either "EU" or "US", the "Level" column showing 1, 2 or 3, and the "Price" column showing the corresponding price found in the Excel table?
I would use the gather function from tidyr if there weren't multiple header rows, but I can't seem to make it work with this data. Any ideas?
The output should have a total of 36 rows + headers
If you skip the first row, as suggested by akrun, you will presumably end up with data that looks something like this: (with "X"s and ".1"/".2" added automatically by R)
library(tidyverse)
df <- tribble(
~Min.Age, ~Max.Age, ~Min.Duration, ~Max.Duration, ~X1.1, ~X2.1, ~X3.1, ~X1.2, ~X2.2, ~X3.2,
"18", "21", "1", "3", "0.12", "0.32", "0.67", "0.80", "0.90", "1.01",
"22", "25", "1", "3", "0.20", "0.40", "0.70", "0.85", "0.98", "1.05",
"26", "30", "1", "3", "0.25", "0.50", "0.80", "0.90", "1.05", "1.21",
"18", "21", "4", "5", "0.32", "0.60", "0.95", "0.99", "1.30", "1.40",
"22", "25", "4", "5", "0.40", "0.70", "1.07", "1.20", "1.40", "1.50",
"26", "30", "4", "5", "0.55", "0.80", "1.09", "1.34", "1.67", "1.99"
)
With this data, you can then use gather to collect all headers beginning with X into one column and the prices into another. You can separate the headers into "Level" and "Area". Finally, recode Area and remove the "X" from the levels.
df %>%
gather(headers, Price, starts_with("X")) %>%
separate(headers, c("Level", "Area")) %>%
mutate(Area = if_else(Area == "1", "EU", "US"),
Level = parse_number(Level))
#> # A tibble: 36 x 7
#> Min.Age Max.Age Min.Duration Max.Duration Level Area Price
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 18 21 1 3 1 EU 0.12
#> 2 22 25 1 3 1 EU 0.20
#> 3 26 30 1 3 1 EU 0.25
#> 4 18 21 4 5 1 EU 0.32
#> 5 22 25 4 5 1 EU 0.40
#> 6 26 30 4 5 1 EU 0.55
#> 7 18 21 1 3 2 EU 0.32
#> 8 22 25 1 3 2 EU 0.40
#> 9 26 30 1 3 2 EU 0.50
#> 10 18 21 4 5 2 EU 0.60
#> # ... with 26 more rows
Created on 2018-10-12 by the reprex package (v0.2.1)
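gather() is superseded in current tidyr; a pivot_longer() equivalent of the step above might look like this (a sketch, assuming the same df):
df %>%
  pivot_longer(starts_with("X"),
               names_to = c("Level", "Area"),
               names_pattern = "X(\\d)\\.(\\d)",
               values_to = "Price") %>%
  mutate(Area = if_else(Area == "1", "EU", "US"),
         Level = as.numeric(Level))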
P.S. You can find lots of spreadsheet munging workflows here: https://nacnudus.github.io/spreadsheet-munging-strategies/small-multiples-with-all-headers-present-for-each-multiple.html
I am facing an issue with a simple problem.
The data I have contain the following variables: BCSID, id, DD, MM, DAY.
These are the personal identifier, an id-day identifier, the calendar day, the calendar month and the day of the week. DD_flag is a variable that I need to create in order to correct the DD dates that are wrong because they do not increment according to the day of the week DAY.
My data look like this
BCSID id DD MM DAY
200 B10011Q B10011Q2 24 10 2
201 B10011Q B10011Q2 24 10 2
202 B10011Q B10011Q2 24 10 2
203 B10011Q B10011Q2 24 10 2
204 B10011Q B10011Q2 24 10 2
205 B10011Q B10011Q2 24 10 2
206 B10011Q B10011Q2 24 10 2
207 B10011Q B10011Q3 24 10 3
208 B10011Q B10011Q3 24 10 3
209 B10011Q B10011Q3 24 10 3
210 B10011Q B10011Q3 24 10 3
211 B10011Q B10011Q3 24 10 3
212 B10011Q B10011Q3 24 10 3
213 B10011Q B10011Q3 24 10 3
214 B10011Q B10011Q3 24 10 3
I will create my DD_flag variable based on DD
dtadate$DD_flag <- as.numeric(dtadate$DD)
What I need to do is simply add +1 to the DD_flag variable each time the day DAY changes for each identifier BCSID.
I thought that it could be simpler to use the collapsed identifier id for my loop.
1. I tried an R loop, but I am not sure why this solution is wrong:
for(i in 2:nrow(dtadate)){
if( dtadate$id[i] == dtadate$id[i-1] )
{ dtadate$DD_flag[i] = dtadate$DD_flag[i] + 1 }
}
2. I tried an Rcpp solution, which almost gives me the correct output.
Here I used the BCSID and the DAY.
The incrementation is correct, but unfortunately it does not re-use the incremented value for the rest of the loop.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector TimeAddOneCpp(CharacterVector idDay, CharacterVector Day, NumericVector time) {
int n = idDay.size();
int len = n ;
for ( int i = 1; i < len; ++i ) {
if( ( idDay[i] == idDay[i - 1] ) &
( Day[i] != Day [i - 1] )
)
time[i] = time[i-1] + 1;
}
return time;
}
The function
TimeAddOneCpp(idDay = dtadate$BCSID, Day = dtadate$DAY, time = dtadate$DD_flag)
Expected output
The output I want is the following
BCSID id DD MM DAY DD_flag
200 B10011Q B10011Q2 24 10 2 24
201 B10011Q B10011Q2 24 10 2 24
202 B10011Q B10011Q2 24 10 2 24
203 B10011Q B10011Q2 24 10 2 24
204 B10011Q B10011Q2 24 10 2 24
205 B10011Q B10011Q2 24 10 2 24
206 B10011Q B10011Q2 24 10 2 24
207 B10011Q B10011Q3 24 10 3 25
208 B10011Q B10011Q3 24 10 3 25
209 B10011Q B10011Q3 24 10 3 25
210 B10011Q B10011Q3 24 10 3 25
211 B10011Q B10011Q3 24 10 3 25
212 B10011Q B10011Q3 24 10 3 25
213 B10011Q B10011Q3 24 10 3 25
214 B10011Q B10011Q3 24 10 3 25
215 B10011Q B10011Q3 24 10 3 25
216 B10011Q B10011Q3 24 10 3 25
217 B10011Q B10011Q3 24 10 3 25
218 B10011Q B10011Q3 24 10 3 25
219 B10011Q B10011Q3 24 10 3 25
220 B10011Q B10011Q4 24 10 4 26
...
So each time the DAY changes for a given BCSID, the DD_flag based on DD should be incremented by +1.
The data
dta = structure(list(BCSID = c("B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10011Q",
"B10011Q", "B10011Q", "B10011Q", "B10011Q", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10015U", "B10015U", "B10015U", "B10015U", "B10015U", "B10015U",
"B10017W", "B10017W", "B10017W", "B10017W", "B10017W", "B10017W",
"B10017W", "B10017W", "B10017W", "B10017W", "B10017W", "B10017W",
"B10017W", "B10017W", "B10017W", "B10017W", "B10017W", "B10017W",
"B10017W"), id = c("B10011Q2", "B10011Q2", "B10011Q2", "B10011Q2",
"B10011Q2", "B10011Q2", "B10011Q2", "B10011Q3", "B10011Q3", "B10011Q3",
"B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3",
"B10011Q3", "B10011Q3", "B10011Q3", "B10011Q3", "B10011Q4", "B10011Q4",
"B10011Q4", "B10011Q4", "B10011Q4", "B10011Q4", "B10011Q4", "B10011Q4",
"B10011Q4", "B10011Q4", "B10011Q5", "B10011Q5", "B10015U1", "B10015U1",
"B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U1",
"B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U1", "B10015U2",
"B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2",
"B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2", "B10015U2",
"B10015U2", "B10015U2", "B10015U2", "B10015U3", "B10015U3", "B10015U3",
"B10015U3", "B10015U3", "B10015U3", "B10015U3", "B10015U3", "B10015U3",
"B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4",
"B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4", "B10015U4",
"B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1",
"B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1",
"B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1", "B10017W1",
"B10017W1"), DD = c("24", "24", "24", "24", "24", "24", "24",
"24", "24", "24", "24", "24", "24", "24", "24", "24", "24", "24",
"24", "24", "24", "24", "24", "24", "24", "24", "24", "24", "24",
"24", "24", "24", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "13", "13", "13", "13", "13", "13", "13", "13", "13",
"13", "13", "13", "13", "13", "13", "13", "13", "13", "13"),
MM = c("10", "10", "10", "10", "10", "10", "10", "10", "10",
"10", "10", "10", "10", "10", "10", "10", "10", "10", "10",
"10", "10", "10", "10", "10", "10", "10", "10", "10", "10",
"10", "10", "10", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8", "8",
"8", "8", "8", "8", "8", "8", "6", "6", "6", "6", "6", "6",
"6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6",
"6"), DAY = c("2", "2", "2", "2", "2", "2", "2", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "4",
"4", "4", "4", "4", "4", "4", "4", "4", "4", "5", "5", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3", "3",
"3", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4",
"4", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1")), .Names = c("BCSID",
"id", "DD", "MM", "DAY"), row.names = 200:300, class = "data.frame")
library(dplyr)
dta %>%
group_by(BCSID) %>%
mutate(DD_flag = c(0, cumsum(diff(as.integer(DAY))))+as.integer(DD))
# Source: local data frame [101 x 6]
# Groups: BCSID
#
# BCSID id DD MM DAY DD_flag
# 1 B10011Q B10011Q2 24 10 2 24
# 2 B10011Q B10011Q2 24 10 2 24
# 3 B10011Q B10011Q2 24 10 2 24
# 4 B10011Q B10011Q2 24 10 2 24
# 5 B10011Q B10011Q2 24 10 2 24
# 6 B10011Q B10011Q2 24 10 2 24
# 7 B10011Q B10011Q2 24 10 2 24
# 8 B10011Q B10011Q3 24 10 3 25
# 9 B10011Q B10011Q3 24 10 3 25
# 10 B10011Q B10011Q3 24 10 3 25
# .. ... ... .. .. ... ...
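An equivalent way to express the same idea is to count the day changes directly, which follows the stated rule ("add 1 each time DAY changes") even when DAY does not move in steps of exactly 1 (a sketch; it gives the same result on this data):
dta %>%
  group_by(BCSID) %>%
  mutate(DD_flag = as.integer(DD) + cumsum(DAY != lag(DAY, default = first(DAY)))) %>%
  ungroup()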
One option would be to create the desired values for DD_flag outside the original object, then merge them in. Let's call the data frame you posted z. So:
flags <- data.frame(id = unique(z$id), DD_flag = seq(length(unique(z$id))))
z2 <- merge(z, flags, all.x = TRUE)
That approach assumes that you don't care about the order of those flags. If you do, you just need to put the unique values of the id variable in the desired order in or before that first line.
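For example, to pin the order down explicitly (a small sketch):
ids <- sort(unique(z$id))  # or any other desired ordering
flags <- data.frame(id = ids, DD_flag = seq_along(ids))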
That approach also assumes that you don't already have a variable named DD_flag in z when you go to merge. If you do, you could just run this before the merge:
z$DD_flag <- NULL
This could be a possible solution
library(data.table)
setDT(dta)
out = rbindlist(
lapply(split(dta, dta$BCSID),
function(x){ x[, DD_flag := (as.numeric(x$DD) + .GRP)-1, by = DAY]}))
#> out
# BCSID id DD MM DAY DD_flag
#1: B10011Q B10011Q2 24 10 2 24
#2: B10011Q B10011Q2 24 10 2 24
#3: B10011Q B10011Q2 24 10 2 24
#4: B10011Q B10011Q2 24 10 2 24
#5: B10011Q B10011Q2 24 10 2 24
#6: B10011Q B10011Q2 24 10 2 24
#7: B10011Q B10011Q2 24 10 2 24
#8: B10011Q B10011Q3 24 10 3 25
#9: B10011Q B10011Q3 24 10 3 25
#10: B10011Q B10011Q3 24 10 3 25
#11: B10011Q B10011Q3 24 10 3 25
#12: B10011Q B10011Q3 24 10 3 25
#13: B10011Q B10011Q3 24 10 3 25
#14: B10011Q B10011Q3 24 10 3 25
#15: B10011Q B10011Q3 24 10 3 25
#...
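A more compact variant of the same idea uses rleid() to number the runs of DAY within each BCSID (a sketch, continuing from the setDT(dta) above; it gives the same result on this data):
dta[, DD_flag := as.numeric(DD) + rleid(DAY) - 1L, by = BCSID]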
Basically, I have several frequency tables; suppose I have d1 and d2:
UPDATE 2: The actual structure of d1 is a table object, i.e. d1 is obtained by d1 <- table(datavector), and similarly for d2.
d1
Value 0 1 2 3 4 9
Freq 25 30 100 10 10 10
d2
Value 0 1 3 5 7 11 13
Freq 25 30 100 10 10 10 12
Problem: I want to produce a matrix with rows corresponding to d1 and d2 and the columns corresponding to all the distinct "Values" seen in d1 and d2. So I want to produce a matrix with rows and columns that looks like this:
[,"0"] [,"1"] [,"2"] [,"3"] [,"4"] [,"5"] [,"7"] [,"9"] [,"11"] [,"13"]
[1,] 25 30 100 10 10 0 0 10 0 0
[2,] 25 30 0 100 0 10 10 0 10 12
Notice that there are no columns 6, 8 and 10, because those values do not appear in either frequency table. Eventually, I am going to pass this matrix to the function image.plot().
UPDATE 1: I think I could allow columns 6, 8 and 10 to appear in the matrix, but then I would have to write a for loop to eliminate the columns consisting only of zero entries.
UPDATE 3: Please note that I am in fact working with 250 data vectors and hence 250 tables (each with a different length/dimension), so I am looking for an efficient solution.
UPDATE 4: Please treat the above as an abstract of what I want to achieve. The real dataset is as follows:
> dput(head(get.dist(fnn[1])))
structure(c(0.999214894571557, 0.000134589502018843, 4.48631673396142e-05,
2.24315836698071e-05, 6.72947510094213e-05, 8.97263346792284e-05,
2.24315836698071e-05, 4.48631673396142e-05, 4.48631673396142e-05,
2.24315836698071e-05, 2.24315836698071e-05, 6.72947510094213e-05,
2.24315836698071e-05, 2.24315836698071e-05, 4.48631673396142e-05,
2.24315836698071e-05, 6.72947510094213e-05, 2.24315836698071e-05
), class = "table", .Dim = 18L, .Dimnames = structure(list(d = c("0",
"1", "2", "3", "4", "5", "8", "9", "11", "12", "15", "16", "17",
"18", "20", "22", "24", "31")), .Names = "d"))
> dput(head(get.dist(fnn[2])))
structure(c(0.71161956034096, 0.199147599820547, 0.0644010767160162,
0.0147599820547331, 0.00327501121579183, 0.000807537012113055,
6.72947510094213e-05, 0.000785105428443248, 0.000179452669358457,
0.000134589502018843, 0.000112157918349035, 4.48631673396142e-05,
6.72947510094213e-05, 0.00307312696276357, 0.00107671601615074,
0.000336473755047106, 6.72947510094213e-05, 2.24315836698071e-05,
2.24315836698071e-05), class = "table", .Dim = 19L, .Dimnames = structure(list(
d = c("0", "1", "2", "3", "4", "5", "6", "9", "10", "11",
"35", "36", "37", "38", "39", "40", "41", "42", "43")), .Names = "d"))
> dput(head(get.dist(fnn[3])))
structure(c(0.747353073126963, 0.13138178555406, 0.0295423956931359,
0.0139075818752804, 0.0119560340960072, 0.0151861821444594, 0.0243382682817407,
0.00697622252131, 0.00255720053835801, 0.00161507402422611, 0.00293853746074473,
0.00116644235082997, 0.004419021982952, 0.0018842530282638, 0.000628084342754598,
0.00053835800807537, 0.000448631673396142, 0.000493494840735756,
0.000650515926424406, 0.000403768506056528, 0.000269179004037685,
0.000179452669358457, 0.000269179004037685, 0.000179452669358457,
8.97263346792284e-05, 0.000246747420367878, 4.48631673396142e-05,
4.48631673396142e-05, 4.48631673396142e-05, 2.24315836698071e-05,
2.24315836698071e-05, 4.48631673396142e-05, 2.24315836698071e-05,
2.24315836698071e-05, 2.24315836698071e-05, 2.24315836698071e-05,
2.24315836698071e-05, 2.24315836698071e-05, 2.24315836698071e-05
), class = "table", .Dim = 39L, .Dimnames = structure(list(d = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "30", "32", "33", "34", "36", "37",
"38", "43", "54", "67")), .Names = "d"))
> dput(head(get.dist(fnn[4])))
structure(c(0.217743382682817, 0.49416778824585, 0.135150291610588,
0.0331987438313145, 0.0243831314490803, 0.0431135038133692, 0.022790489008524,
0.00912965455361149, 0.00614625392552714, 0.00937640197397936,
0.00244504262000897, 0.000560789591745177, 0.000493494840735756,
0.000448631673396142, 0.000336473755047106, 0.000112157918349035,
0.000201884253028264, 4.48631673396142e-05, 4.48631673396142e-05,
2.24315836698071e-05, 2.24315836698071e-05, 4.48631673396142e-05,
2.24315836698071e-05), class = "table", .Dim = 23L, .Dimnames = structure(list(
d = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14", "15", "16", "17", "18", "19", "23",
"25", "45")), .Names = "d"))
Here is an option using Reduce that seems to work given the provided data:
# make a list including your 3 dput parts
keylist <- list(d1,d2,d3)
result <- Reduce(function(...) merge(..., by="d", all=T), keylist)
result <- transform(result,row.names=d,d=NULL)
result <- t(result)
rownames(result) <- NULL
It seems to work:
> result[,c(1:2,44:45)]
0 1 54 67
[1,] 0.9992149 0.0001345895 NA NA
[2,] 0.7116196 0.1991475998 NA NA
[3,] 0.7473531 0.1313817856 2.243158e-05 2.243158e-05
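If zeros are preferred to NAs (as in the matrix sketched in the question), the remaining NAs can simply be replaced afterwards:
result[is.na(result)] <- 0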
I was using dataframes, but if d1 and d2 were matrices this should still work if you removed the unlist calls:
M <- matrix(0, nrow=2, ncol=12 )
colnames(M) <- as.character(0:11)
M[1 , as.character(d1[1 , 2:7]) ] <- unlist(d1[2, 2:7 ])
M
# 0 1 2 3 4 5 6 7 8 9 10 11
#[1,] 25 30 100 10 10 0 0 0 0 10 0 0
#[2,] 0 0 0 0 0 0 0 0 0 0 0 0
M[2 , as.character(d2[1 , 2:7]) ] <- unlist(d2[2, 2:7 ])
M
#-------------------
0 1 2 3 4 5 6 7 8 9 10 11
[1,] 25 30 100 10 10 0 0 0 0 10 0 0
[2,] 25 30 0 100 0 10 0 10 0 0 0 10
Converting my examples to matrices (which inherit their indexing from the matrix class):
d1a <-data.matrix(d1[,-1])
rownames(d1a) <- d1[,1]
d2a <-data.matrix(d2[,-1])
rownames(d2a) <- d2[,1]
M[1 , as.character(d1a[1 , ]) ] <-d1a[2, ]
M[2 , as.character(d2a[1 , ]) ] <-d2a[2, ]
M
#---------
0 1 2 3 4 5 6 7 8 9 10 11
[1,] 25 30 100 10 10 0 0 0 0 10 0 0
[2,] 25 30 0 100 0 10 0 10 0 0 0 10
If, as thelatemail thinks (although I do not), these are one-row tables, then it's even easier:
M[2 , colnames(d2b) ] <-d2b
M[2 , colnames(d1b) ] <-d1b
M
0 1 2 3 4 5 6 7 8 9 10 11
[1,] 25 30 100 10 10 0 0 0 0 10 0 0
[2,] 25 30 0 100 0 10 0 10 0 0 0 10
And please, please, please, no for-loops to be used on these:
> M[ , !colSums(M==0)==2]
0 1 2 3 4 5 7 9 11
[1,] 25 30 100 10 10 0 0 10 0
[2,] 25 30 0 100 0 10 10 0 10
You don't need to remove any zero columns if you don't create any:
You can probably create dist.list this way:
dist.list= lapply(fnn, get.dist)
# 3 element example built from your example
dist.list<-{}
dist.list[[1]] <-
structure(c(0.999214894571557, 0.000134589502018843, 4.48631673396142e-05,
2.24315836698071e-05, 6.72947510094213e-05, 8.97263346792284e-05,
2.24315836698071e-05, 4.48631673396142e-05, 4.48631673396142e-05,
2.24315836698071e-05, 2.24315836698071e-05, 6.72947510094213e-05,
2.24315836698071e-05, 2.24315836698071e-05, 4.48631673396142e-05,
2.24315836698071e-05, 6.72947510094213e-05, 2.24315836698071e-05
), class = "table", .Dim = 18L, .Dimnames = structure(list(d = c("0",
"1", "2", "3", "4", "5", "8", "9", "11", "12", "15", "16", "17",
"18", "20", "22", "24", "31")), .Names = "d"))
dist.list[[2]] <-
structure(c(0.71161956034096, 0.199147599820547, 0.0644010767160162,
0.0147599820547331, 0.00327501121579183, 0.000807537012113055,
6.72947510094213e-05, 0.000785105428443248, 0.000179452669358457,
0.000134589502018843, 0.000112157918349035, 4.48631673396142e-05,
6.72947510094213e-05, 0.00307312696276357, 0.00107671601615074,
0.000336473755047106, 6.72947510094213e-05, 2.24315836698071e-05,
2.24315836698071e-05), class = "table", .Dim = 19L, .Dimnames = structure(list(
d = c("0", "1", "2", "3", "4", "5", "6", "9", "10", "11",
"35", "36", "37", "38", "39", "40", "41", "42", "43")), .Names = "d"))
dist.list[[3]] <-
structure(c(0.747353073126963, 0.13138178555406, 0.0295423956931359,
0.0139075818752804, 0.0119560340960072, 0.0151861821444594, 0.0243382682817407,
0.00697622252131, 0.00255720053835801, 0.00161507402422611, 0.00293853746074473,
0.00116644235082997, 0.004419021982952, 0.0018842530282638, 0.000628084342754598,
0.00053835800807537, 0.000448631673396142, 0.000493494840735756,
0.000650515926424406, 0.000403768506056528, 0.000269179004037685,
0.000179452669358457, 0.000269179004037685, 0.000179452669358457,
8.97263346792284e-05, 0.000246747420367878, 4.48631673396142e-05,
4.48631673396142e-05, 4.48631673396142e-05, 2.24315836698071e-05,
2.24315836698071e-05, 4.48631673396142e-05, 2.24315836698071e-05,
2.24315836698071e-05, 2.24315836698071e-05, 2.24315836698071e-05,
2.24315836698071e-05, 2.24315836698071e-05, 2.24315836698071e-05
), class = "table", .Dim = 39L, .Dimnames = structure(list(d = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "30", "32", "33", "34", "36", "37",
"38", "43", "54", "67")), .Names = "d"))
all.names <- lapply(dist.list, names)
uniq.names <- unique(unlist(all.names))
M <- matrix(0, nrow=length(dist.list), ncol=length(uniq.names) )
colnames(M) <- uniq.names
for (i in seq_along(dist.list) ) {
M[i, all.names[[i]] ] <- dist.list[[i]] }
M
First 20 columns
0 1 2 3 4
[1,] 0.9992149 0.0001345895 4.486317e-05 2.243158e-05 6.729475e-05
[2,] 0.7116196 0.1991475998 6.440108e-02 1.475998e-02 3.275011e-03
[3,] 0.7473531 0.1313817856 2.954240e-02 1.390758e-02 1.195603e-02
5 8 9 11 12
[1,] 8.972633e-05 2.243158e-05 4.486317e-05 4.486317e-05 2.243158e-05
[2,] 8.075370e-04 0.000000e+00 7.851054e-04 1.345895e-04 0.000000e+00
[3,] 1.518618e-02 2.557201e-03 1.615074e-03 1.166442e-03 4.419022e-03
15 16 17 18 20
[1,] 2.243158e-05 6.729475e-05 2.243158e-05 2.243158e-05 4.486317e-05
[2,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
[3,] 5.383580e-04 4.486317e-04 4.934948e-04 6.505159e-04 2.691790e-04
# remainder excluded
If you turn your d1 and d2 into data.tables, you can easily merge them by a common key:
library(data.table)
> d1 <- data.table(value = c(0, 1, 2, 3, 4, 9), freq = c(25, 30, 100, 10, 10, 10))
> d2 <- data.table(value = c(0, 1, 3, 5, 7, 11), freq = c(25, 30, 100, 10, 10, 10))
> setkey(d1, value)
> setkey(d2, value)
> merge(d1, d2, all = TRUE)
value freq.x freq.y
1: 0 25 25
2: 1 30 30
3: 2 100 NA
4: 3 10 100
5: 4 10 NA
6: 5 NA 10
7: 7 NA 10
8: 9 10 NA
9: 11 NA 10
You can then convert the resulting data.table to a matrix, replace NAs with 0s, etc.
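For instance, a follow-up along those lines might look like this (a sketch, reusing d1 and d2 from above):
m <- merge(d1, d2, all = TRUE)
out <- t(as.matrix(m[, !"value"]))      # one row per original table
colnames(out) <- as.character(m$value)  # columns named by the distinct values
out[is.na(out)] <- 0                    # replace NAs with 0s
rownames(out) <- NULL
out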