I have a dataset that I'd like to perform synth() on. Right now I'm working on the dataprep() command and I'm running into some issues. Below is a sample of what my dataset looks like:
unit.number 2000 2001 2002 2003 2004
1 400 344 252 212 344
2 342 234 111 102 222
3 244 555 512 122 152
4 515 125 324 100 155
My treated unit is unit.number = 3, and the treated time period is 2004. Since I'd like to use the lagged outcome variables as predictors, I've left the data in wide format, with the years as columns. However, I also obviously want to use the years as time.variable, so I'm not sure how to do that (I originally tried inputting it as a vector with all of the year columns -- see below). Is there some way that I can make the year columns into rows as well (kind of like converting the data to long, but while also keeping the columns present), or is there some other way that I can construct time.variable? Here is what I have so far in terms of code:
dataprep(foo = data, predictors = c("2000 : 2004"),
predictors.op = c("mean"), dependent = "2004",
unit.variable = "unit.number", time.variable = c("2000 : 2004"),
treatment.identifier = 3, controls.identifier = c(1:2),
time.predictors.prior = c("2000 : 2003"),
time.plot = c("2000 : 2004"))
I'd really appreciate some assistance with this! Thanks.
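For what it's worth, dataprep() expects panel data in long format, one row per unit per year, with a numeric time column, so the year columns need to be melted into rows first. A minimal sketch with base R's reshape(), assuming the wide table above sits in a data frame with columns named X2000...X2004 (those names, and the outcome name Y, are my assumptions):

```r
# Example wide data matching the table in the question
data_wide <- data.frame(
  unit.number = 1:4,
  X2000 = c(400, 342, 244, 515),
  X2001 = c(344, 234, 555, 125),
  X2002 = c(252, 111, 512, 324),
  X2003 = c(212, 102, 122, 100),
  X2004 = c(344, 222, 152, 155)
)

# Melt the year columns into rows: one row per unit per year
data_long <- reshape(data_wide,
                     direction = "long",
                     varying   = paste0("X", 2000:2004),
                     v.names   = "Y",
                     timevar   = "year",
                     times     = 2000:2004,
                     idvar     = "unit.number")
head(data_long)
```

From there, time.variable = "year" and dependent = "Y" should work, with the lagged outcomes supplied through special.predictors (e.g. list(list("Y", 2000:2003, "mean"))); the exact arguments depend on the Synth version, so check ?dataprep.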
I've been assigned to create a dataset of simulated patient data in R for an assignment. We've been provided variable names and that's it. I want to get a random sample of 100 and use set.seed() to make it reproducible, but when I run the code I originally got different sample values each time I re-opened the script, and now I just get error messages and it won't run.
This is what I have:
pulse_data <- data.frame(
group = c(rep("control", "treatment")),
age = sample(c(20:75)),
gender = c(rep("male", "female")),
resting_pulse = sample(c(40:120)),
height_cm = sample(c(140:220))
)
set.seed(30)
pulse_sim <- sample_n(pulse_data, 100, replace = FALSE)
am I missing something fundamental?!
(total beginner, speak to me like an idiot and I might understand :) )
I've tried sample_n() straight from the dataframe, with the set.seed(), and putting set.seed() inside the pulse_sim line, but to no avail... as for why I get errors now, I'm at my wits' end.
Realize that pulse_data is created using random data, so each time the script is called, you get random data. After you create it, you set the random seed, so you get the same rows you did the last time you opened the script, but ... the rows have different data. SOLUTION: set the random seed before you define pulse_data.
library(dplyr)  # for sample_n()
set.seed(30)    # set the seed *before* creating the random data
pulse_data <- data.frame(
  group = rep(c("control", "treatment"), length.out = 30),
  age = sample(20:75, size = 30),
  gender = rep(c("male", "female"), length.out = 30),
  resting_pulse = sample(40:120, size = 30),
  height_cm = sample(140:220, size = 30)
)
pulse_sim <- sample_n(pulse_data, 10, replace = FALSE)
I have put that code, plus a bare pulse_sim at the end (to print it), in a file 74408236.R. (Note that I added length.out and changed your sample size from 100 to 10, for the sake of this demonstration.) I can run it with this shell command (not in R):
$ Rscript.exe 74408236.R
group age gender resting_pulse height_cm
1 treatment 28 female 76 210
2 treatment 24 female 118 140
3 control 44 male 57 141
4 control 70 male 96 184
5 treatment 22 female 87 177
6 control 30 male 50 168
7 control 39 male 56 145
8 treatment 37 female 120 182
9 treatment 20 female 79 181
10 treatment 75 female 98 186
When I run it a few times in a row, I get the same output. For brevity, I'll demonstrate same-ness by showing its MD5 checksum; while MD5 is not the most "secure" (cryptographically), I think this is an easy way to suggest that the output is unlikely to be different. (This is shell-scripting, still not in R.)
$ for rep in $(seq 1 5) ; do Rscript.exe 74408236.R | md5sum; done
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
In fact, if I repeat it 100 times, I still see no change. I'll pipe through uniq -c to replace repeated output with the count (first number) and the output (everything else, the checksum).
$ for rep in $(seq 1 100) ; do /mnt/c/R/R-4.1.2/bin/Rscript.exe 74408236.R | md5sum; done | uniq -c
100 0f06ecd84c1b65d6d5e4ee36dea76add -
I have 16 large datasets of landcover variables around routes. Example dataset "Trial1":
RtNo TYPE CA PLAND NP PD LPI TE
2001 cls_11 996.57 6.4297 22 0.1419 6.3055 31080
2010 cls_11 56.34 0.3654 23 0.1492 0.1669 15480
18003 cls_11 141.12 0.9899 37 0.2596 0.1503 38700
18014 cls_11 797.58 5.3499 47 0.3153 1.3969 98310
2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670
2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260
18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780
18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650
2001 cls_22 732.33 4.7249 653 4.2131 0.7212 377430
2010 cls_22 32.31 0.2096 168 1.0896 0.0198 31470
18003 cls_22 275.85 1.9351 781 5.4787 0.0423 237390
18014 cls_22 469.44 3.1488 104 6.7345 0.1014 377580
I want to first select rows that meet a condition, for example, all rows where the column "TYPE" is cls_21. I know the following code does the job:
Trial21 <- subset(Trial1, TYPE == " cls_21 ")
(yes, the invisible space before and after the categorical value caused me a considerable headache).
And there are several other ways of doing this as shown in
[https://stackoverflow.com/questions/5391124/select-rows-of-a-matrix-that-meet-a-condition]
I get the following output (sorry this one has extra columns, but shouldn't affect my question):
RtNo TYPE CA PLAND NP PD LPI TE ED LSI
2 18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780 38.5668 46.1194
18 18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650 51.6255 56.2522
34 2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670 49.1418 49.3462
50 2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260 30.0457 46.0118
62 2020 cls_21 625.5 4.1165 180 1.1846 0.5064 384840 25.3268 38.6407
85 2021 cls_21 503.55 2.7926 214 1.1868 0.1178 348330 19.3175 38.9267
I want to rename the columns in this subset so they uniquely identify the class by adding "L21" at the back of existing column names, and I can do this using
library(data.table)
setnames(Trial21, old = c('CA', 'PLAND', 'NP', 'PD', 'LPI', 'TE', 'ED', 'LSI'),
new = c('CAL21', 'PLANDL21', 'NPL21', 'PDL21', 'LPIL21', 'TEL21', 'EDL21', 'LSIL21'))
I want help to develop a function or a loop that automates this process so I don't have to spend days repeating the same codes for 15 different classes and 16 datasets (240 times). Also, decrease the risk of errors. I may have to do the same for additional datasets. Any help to speed the process will be greatly appreciated.
You could do:
a <- split(df, df$TYPE)  # one data frame per TYPE
b <- sapply(names(a),
            function(x) setNames(a[[x]], paste0(names(a[[x]]), sub(".*_", "L", x))),  # "cls_21" -> "L21" suffix
            simplify = FALSE)
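To see what that produces, here is a self-contained toy check (the data frame below is made up for illustration, not the poster's Trial1, and it uses clean TYPE values without the padding spaces):

```r
df <- data.frame(
  RtNo  = c(2001, 2010, 2001, 2010),
  TYPE  = c("cls_11", "cls_11", "cls_21", "cls_21"),
  CA    = c(996.57, 56.34, 1514.97, 638.55),
  PLAND = c(6.4297, 0.3654, 9.7744, 4.1414)
)

a <- split(df, df$TYPE)  # one data frame per TYPE
b <- sapply(names(a),
            function(x) setNames(a[[x]], paste0(names(a[[x]]), sub(".*_", "L", x))),
            simplify = FALSE)

names(b$cls_21)  # "RtNoL21" "TYPEL21" "CAL21" "PLANDL21"
```

One thing to watch: the suffix is appended to every column, including RtNo and TYPE; if you want to keep those names unchanged, suffix only the measure columns (e.g. via an index into names(a[[x]])).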
You can use ls() to get the variable names of the datasets, manipulate them as you wish inside a loop with the get function, then create the new datasets with assign:
library(dplyr)
sets = grep("Trial", ls(), value = TRUE)  # assuming every dataset has "Trial" in the name
for (i in sets) {
  classes = unique(get(i)$TYPE)
  for (j in classes) {
    # extract just the number, e.g. " cls_21 " -> "21" (there may be simpler options)
    number = gsub(".*_([0-9]+).*", "\\1", j)
    # note: you may want paste0(i, "_", number) here to avoid name collisions across the 16 datasets
    assign(paste0("Trial", number),
           subset(get(i), TYPE == j) %>% rename_with(function(x) paste0(x, "L", number)))
  }
}
Here is a start that should work for your example:
library(dplyr)
myfilter <- function(data, number) {
  data %>%
    filter(TYPE == sprintf(" cls_%s ", number)) %>%  # note the padding spaces around the value
    rename_with(\(x) sprintf("%sL%s", x, number), !1:2)
}
myfilter(example_data, 21)
Given a list of numbers (here: 21 to 31) you could then automatically use them to filter a single dataframe:
multifilter <- function(data) {
purrr::map(21:31, \(i) myfilter(data, i))
}
multifilter(example_data)
Finally, given a list of dataframes, you can automatically apply the filters to them:
purrr::map(list_of_dataframes, multifilter)
I'm trying to generate some sample insurance claims data that is meaningful instead of just random numbers.
Assuming I have two columns Age and Injury, I need meaningful values for ClaimAmount based on certain conditions:
ClaimantAge | InjuryType | ClaimAmount
---------------------------------------
35 Bruises
55 Fractures
. .
. .
. .
I want to generate claim amounts that increase as age increases, and then plateaus at around a certain age, say 65.
Claims for certain injuries need to be higher than claims for other types of injuries.
Currently I am generating my samples in a random manner, like so:
amount <- sample(0:100000, 2000, replace = TRUE)
How do I generate more meaningful samples?
There are many ways this might need to be adjusted, as I don't know the field. Given that we're talking about dollar amounts, I would use the Poisson distribution to generate the data.
set.seed(1)
n_claims <- 2000
injuries <- c("bruises", "fractures")
prob_injuries <- c(0.7, 0.3)
sim_claims <- data.frame(claimid = 1:n_claims)
sim_claims$age <- round(rnorm(n = n_claims, mean = 35, sd = 15), 0)
sim_claims$Injury <- factor(sample(injuries, size = n_claims, replace = TRUE, prob = prob_injuries))
sim_claims$Amount <- rpois(n_claims, lambda = 100 + (5 * (sim_claims$age - median(sim_claims$age))) +
dplyr::case_when(sim_claims$Injury == "bruises" ~ 50,
sim_claims$Injury == "fractures" ~ 500))
head(sim_claims)
claimid age Injury Amount
1 1 26 bruises 117
2 2 38 bruises 175
3 3 22 bruises 102
4 4 59 bruises 261
5 5 40 fractures 644
6 6 23 bruises 92
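One caveat with the simulation above: with age ~ N(35, 15), the expression 100 + 5 * (age - median) can go negative for young claimants, and rpois() returns NA with a warning for a negative lambda. A small guard (my addition, not part of the original answer; it uses base ifelse() in place of dplyr::case_when to stay dependency-free) is to floor lambda at a token positive value:

```r
set.seed(1)
n_claims <- 2000
injuries <- c("bruises", "fractures")
prob_injuries <- c(0.7, 0.3)

sim_claims <- data.frame(claimid = 1:n_claims)
sim_claims$age <- round(rnorm(n = n_claims, mean = 35, sd = 15), 0)
sim_claims$Injury <- factor(sample(injuries, size = n_claims, replace = TRUE, prob = prob_injuries))

# Build lambda as before, then clamp so it never drops below 1
lambda <- 100 + 5 * (sim_claims$age - median(sim_claims$age)) +
  ifelse(sim_claims$Injury == "bruises", 50, 500)
sim_claims$Amount <- rpois(n_claims, lambda = pmax(lambda, 1))
```

The clamp only bites in the far left tail of the age distribution, so the age/injury structure of the simulated amounts is otherwise unchanged.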
I have an inventory dataframe that is like:
set.seed(5)
library(data.table)
#replicated data
invntry <- data.table(
  warehouse = sample(c("NY", "NJ"), 1000, replace = T),
  intid = c(rep(1,150), rep(2,100), rep(3,210), rep(4,50), rep(5,80), rep(6,70), rep(7,140), rep(8,90), rep(9,90), rep(10,20)),
  placement = c(1:150, 1:100, 1:210, 1:50, 1:80, 1:70, 1:140, 1:90, 1:90, 1:20),
  container = sample(1:100, 1000, replace = T),
  inventory = c(rep(3242,150), rep(9076,100), rep(5876,210), rep(9572,50), rep(3369,80), rep(4845,70), rep(8643,140), rep(4567,90), rep(7658,90), rep(1211,20)),
  stock = c(rep(3200,150), rep(10000,100), rep(6656,210), rep(9871,50), rep(3443,80), rep(5321,70), rep(8659,140), rep(4567,90), rep(7650,90), rep(1298,20)),
  risk = runif(100)
)
setnames(invntry, c("warehouse", "intid", "placement", "container", "inventory", "stock", "risk"))
invntry[ , ticket := 1:.N, by=c("intid", "warehouse")]
invntry$ticket[invntry$warehouse=="NJ"] <- 0
#ensuring some same brands are same container
invntry$container[27:32] <- 6
invntry$container[790:810] <- 71
invntry[790:820,]
There are more variables in the actual data that I want to use to compare the same items (intid) that are in different containers. So I would like to run multiple trials over a range of sample sizes n for each item: keep randomly selecting rows for an item until I have n rows from different containers, keeping any duplicates drawn along the way. So for a sample size of 6 for item 8, it might take 7 draws to get 6 distinct containers:
warehouse intid placement container inventory stock risk ticket
21: NY 8 10 71 4567 4567 0.38404806 5
22: NY 8 11 96 4567 4567 0.64665968 6
23: NJ 8 12 15 4567 4567 0.68265602 0
24: NY 8 13 19 4567 4567 0.84437586 7
21: NY 8 10 71 4567 4567 0.38404806 5
26: NY 8 15 34 4567 4567 0.69580270 8
28: NY 8 17 78 4567 4567 0.25352370 9
I tried searching on this site, but couldn't find anything covering the above, or anything that accommodates computing values for each trial and sample size from the trial's rows, so I think I have to use a for loop so that I can distinguish each trial for each sample size. To summarize, two goals:
conduct random sampling of each intid until n unique containers are selected, cumulatively keeping the rows already selected
be able to do calculations on the variables for each trial, for each sample size, for each item
Any ideas?
*doesn't have to involve data.table, that's just how it got started
(I think it's essentially the basic probability example of continuing to draw marbles from the urn until you have a sample size of all different colors-but even realizing that didn't help me find a solution!)
I'm not positive, but isn't this equivalent to grouping by intid and then sampling n values with replacement, where n is some integer? If so, then here's a way to do that using tidyverse functions. The code below groups by intid and samples 6 through 10 values with replacement from each group. The column Sample_Size identifies each n-sample group for each intid:
library(tidyverse)
invntry.sampled = map_df(setNames(6:10, 6:10),
~ invntry %>%
group_by(intid) %>%
sample_n(.x, replace=TRUE),
.id="Sample_Size")
And here's a data.table approach, using code adapted from this SO answer. I've wrapped the data.table code in lapply to cycle through the different sample sizes, as my data.table skills are limited. There may be a way to do this within the data.table code itself.
invntry.sampled = do.call(rbind,
lapply(6:10, function(n) invntry[ , .SD[sample(.N, n, replace=TRUE)], by=intid]))
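If the literal draw-until-n-unique-containers process matters (rather than the fixed-size resample above), one trial for one item can be sketched in base R like this; the function name and the toy rows are mine, not from the question:

```r
# Toy rows for a single item: each row carries a container id
item_rows <- data.frame(row_id = 1:10,
                        container = c(71, 96, 15, 19, 71, 34, 78, 15, 5, 42))

# Keep drawing rows (with replacement) until n distinct containers
# have appeared; return every draw, duplicates included.
sample_until_n_containers <- function(rows, n) {
  stopifnot(n <= length(unique(rows$container)))
  draws <- integer(0)
  while (length(unique(rows$container[draws])) < n) {
    draws <- c(draws, sample(nrow(rows), 1))
  }
  rows[draws, ]
}

set.seed(42)
trial <- sample_until_n_containers(item_rows, 6)
nrow(trial)  # may exceed 6 whenever a container repeats
```

Wrapping this in lapply over items (e.g. split(invntry, invntry$intid)) and over the sample sizes 6:10, with a trial index, gives the per-trial groups on which to compute the summary values.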
I have an xyplot grouped by a factor. I plot salinity (AvgSal = Y) against time (DayN = X) for 16 different sites, site being the factor (SiteCode). I want all of the site plots stacked so I set the layout to one column with 16 rows.
First issue: I would like to remove the strip above each plot that contains only the SiteCode label, as it takes up a lot of space. Instead, I could introduce a second column with the SiteCode names or introduce a legend in the same strip as the plot. Can anyone tell me how to remove the label strip and introduce labelling in a different fashion?
Here's the code:
Sample Data
zz <- "SiteCode DayN AvgSal
1 CC 157 29.25933
2 CC 184 29.68447
3 DW 160 26.47328
4 DW 190 29.07192
5 FP 157 30.40344
6 FP 184 30.58842
7 IN 157 30.25319
8 IN 184 29.20716
9 IP 156 29.09548
10 IP 187 27.86887
11 LB 162 27.58603
12 LB 191 28.86910
13 LR 160 28.06035
14 LR 190 29.52723
15 PB 159 30.10903
16 PB 188 29.46113
17 PG 161 29.67765
18 PG 189 28.90864
19 SA 162 23.23362
20 SA 190 26.96549
21 SH 156 24.86752
22 SH 187 23.12184
23 SP 161 18.95347
24 SP 189 19.16433
25 VC 162 29.49714
26 VC 186 29.66493
27 WP 157 27.33631
28 WP 183 27.18465
29 YB 157 30.50193
30 YB 183 30.49824
31 ZZ 159 30.14175
32 ZZ 186 29.44860"
Data <- read.table(text=zz, header = TRUE)
xyplot(AvgSal~DayN | factor(SiteCode),
layout = c(1, 16),
xlab = "Time (Day of the year)",
ylab = "Average Salinity (PSU)",
strip = function(bg = 'white', ...) strip.default(bg = 'white', ...),
data = Data, type = c("a","p"))
Second issue: The strips are ordered by SiteCode alphabetically, or in the original order I had them entered into the csv datafile. I would like to order them from highest to lowest average salinity, but I do not know how to achieve this. Can anyone help?
I have tried using order() to sort the data by ascending salinity before running the plot, but this doesn't seem to work, even when I remove the rownames.
I also tried the solution in How to change the order of the panels in simple Lattice graphs, by assigning set levels, i.e.
levels(Data$SiteCode) <- c("SP", "SA", "SH", "LB", "DW",
"LR", "PG", "VC", "ZZ", "PB",
"WP", "IP", "IN", "CC", "FP", "YB")
This seemed to change the label above each panel, but it did not change the corresponding plots, leaving plots with the wrong label. It also seems like an inefficient way to go about it if I want to do this process for a large number of variables.
Any help will be really appreciated! Cheers :)
The solutions always seem so simple, in hindsight.
Issue 1: Turned out to be very simple, once I knew what I was doing. I just had to 'turn off' the strip, with the following getting rid of the title strip altogether:
strip = FALSE
In addition, I decided having the strip vertically aligned on the left would be nice:
strip = FALSE,
strip.left = strip.custom(horizontal = FALSE)
Issue 2: I had to use levels() and reorder() inside the factor() call, where X is the numeric variable I want to order SiteCode by (negate X to go from highest to lowest):
xyplot(AvgSal ~ DayN | factor(SiteCode, levels = levels(reorder(SiteCode, X))), ...
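Putting both fixes together, here is a runnable sketch using a trimmed version of the sample data above (three sites instead of sixteen; the full data works the same way). reorder() sorts ascending, so AvgSal is negated to put the highest-salinity site first, and as.table = TRUE draws the panels top-to-bottom:

```r
library(lattice)

# Trimmed version of the sample data from the question
zz <- "SiteCode DayN AvgSal
1 CC 157 29.25933
2 CC 184 29.68447
3 SP 161 18.95347
4 SP 189 19.16433
5 YB 157 30.50193
6 YB 183 30.49824"
Data <- read.table(text = zz, header = TRUE)
Data$SiteCode <- factor(Data$SiteCode)

# Order sites by mean salinity, highest first
site_order <- levels(reorder(Data$SiteCode, -Data$AvgSal))

p <- xyplot(AvgSal ~ DayN | factor(SiteCode, levels = site_order),
            data = Data,
            layout = c(1, length(site_order)),
            as.table = TRUE,  # draw panels top-to-bottom
            xlab = "Time (Day of the year)",
            ylab = "Average Salinity (PSU)",
            strip = FALSE,
            strip.left = strip.custom(horizontal = FALSE),
            type = c("a", "p"))
print(p)
```

With the full dataset, replace the inline zz with the original 32-row sample and the layout adjusts automatically via length(site_order).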