Comparing test data and prediction outcome

Comparing test data and prediction outcome - r

I trying Logistic regression on a dataset. I have successfully divided my dataset into train and test. The regression model also works fine however when I apply it on my test I only get an outcome for 393 observations when the length of my test dataset is 480. How can I compare and get the mismatch or find out what went wrong?
My data has no NAs.
I am trying to create a confusion matrix.
This is my code:
n=nrow(wine_log)
shuffled=wine_log[sample(n),]
train_indices=1:round(0.7*n)
test_indices=(round(0.7*n)+1):n
#Making a new dataset
train=shuffled[train_indices,]
test=shuffled[test_indices,]
wmodel = glm(final_take~., family = binomial, data=train)
summary(wmodel)
result1 = predict(wmodel, newdata = test, type = 'response')
result1 = ifelse(result > 0.5, 1, 0) - Can someone also explain how will removing this affect the outcome?
result1
> table(result1)
result1
0 1
255 138
> table(test$final_take)
Bad Good
418 62
structure(list(fixed_acid = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4,
7.9, 7.3, 7.8, 7.5), vol_acid = c(0.7, 0.88, 0.76, 0.28, 0.7,
0.66, 0.6, 0.65, 0.58, 0.5), c_acid = c(0, 0, 0.04, 0.56, 0,
0, 0.06, 0, 0.02, 0.36), res_sugar = c(1.9, 2.6, 2.3, 1.9, 1.9,
1.8, 1.6, 1.2, 2, 6.1), chlorides = c(0.076, 0.098, 0.092, 0.075,
0.076, 0.075, 0.069, 0.065, 0.073, 0.071), free_siox = c(11,
25, 15, 17, 11, 13, 15, 15, 9, 17), total_diox = c(34, 67, 54,
60, 34, 40, 59, 21, 18, 102), density = c(0.9978, 0.9968, 0.997,
0.998, 0.9978, 0.9978, 0.9964, 0.9946, 0.9968, 0.9978), pH = c(3.51,
3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35), sulphates = c(0.56,
0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8), alcohol = c(9.4,
9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5), final_take = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("Bad", "Good"
), class = "factor")), row.names = c(NA, -10L), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"),

Your line of code here:
result1 = ifelse(result > 0.5, 1, 0)
Should be referencing result1 in the ifelse statement. I'm guessing that result is another object you have in your environment that isn't 480 rows.
So you should use this instead.
result1 = ifelse(result1 > 0.5, 1, 0)
You also asked what this line of code is doing. It's basically a threshold for your predictions from the glm model. If the prediction from the model is greater than 0.50, then you are translating the prediction to a "1". If it's less than or equal to 0.50 then you are translating that prediction to a "0". It's a way to convert a probability to a TRUE/FALSE or 1/0.

Related

Making quarto html documents use the full window width

I am creating a quarto html document using R and publishing it to the web at rpubs.com. The document uses only the middle third of the window and leaves the left and right thirds blank. I am showing a table but it is not started on the far left side and as a result it is cut off part way through. How can I tell quarto to use the full window? Here is a reproducible example.
---
title: "Test2"
author: "Philip Smith"
format: html
html:
page-layout: custom
editor: visual
code-fold: true
link-external-icon: true
knitr:
opts_chunk:
collapse: true
comment: "#>"
R.options:
knitr.graphics.auto_pdf: true---
---
```{r set-options, echo=FALSE, warning=FALSE, cache=FALSE}
library(lubridate)
library(gt)
FdfT <- structure(list(Name = c("LFS participation rate", "LFS employment rate",
"LFS unemployment rate", "Job vacancy rate", "LFS payroll", "LFS payroll employment",
"LFS payroll average hours worked", "LFS payroll average hourly earnings",
"SEPH payroll", "SEPH payroll employment", "SEPH payroll average hours worked",
"SEPH payroll average hourly earnings", "LFS payroll", "LFS payroll employment",
"LFS payroll average hours worked", "LFS payroll average hourly earnings",
"SEPH payroll", "SEPH payroll employment", "SEPH payroll average hours worked",
"SEPH payroll average hourly earnings"), V1 = c(65.4, 61.5, 6,
5.6, 8.9, 5.8, 0.3, 2.7, 9.2, 6.8, -0.3, 2.3, 0.8, 0.3, 0.1,
0.4, 1.3, 0.6, 0.6, 0.1), V2 = c(65, 60.8, 6.5, 5.4, 7.7, 5.8,
-0.6, 2.3, 9.3, 7.8, -0.6, 2, -0.2, -1.5, 0.4, 0.9, 0.9, 0.1,
0, 0.9), V3 = c(65.4, 61.8, 5.5, 5.3, 9.5, 6.3, 0, 3.1, 10.1,
8, -0.6, 2.5, 1.9, 2.3, -0.3, -0.1, 0.8, 0.7, 0, 0), V4 = c(65.4,
61.9, 5.3, 5.7, 9.1, 5.1, 0.3, 3.5, 11.8, 7.2, -0.6, 4.9, 0.7,
0.3, 0.1, 0.3, 2.2, 0.8, -0.3, 1.7), V5 = c(65.3, 61.9, 5.2,
5.7, 9.3, 6.3, -0.5, 3.3, 10.2, 7, -1.2, 4.3, -0.2, 0, -0.6,
0.4, 0.2, 0.7, 0, -0.4), V6 = c(65.3, 61.9, 5.1, 5.7, 10.9, 7.2,
-0.6, 3.9, 11.5, 8.6, -0.9, 3.5, 1.2, 0.2, 0.2, 0.8, -0.4, 0,
-0.3, -0.3), V7 = c(64.9, 61.7, 4.9, 5.6, 11.3, 5.2, 0.6, 5.2,
11.2, 7.8, -0.6, 3.9, 1.3, 0.1, 0.5, 0.7, 1, 0.8, 0, 0.4), V8 = c(64.7,
61.6, 4.9, 5.5, 10.5, 4.2, 0.9, 5.2, 9, 5.9, -0.6, 3.3, 0, -0.4,
-0.1, 0.4, 0.1, 0.1, 0, -0.1), V9 = c(64.8, 61.3, 5.4, 5.3, 9.4,
3.5, 0.3, 5.4, 9.2, 5.5, 0, 3.6, 0.2, -0.1, -0.4, 0.6, 0.8, 0,
0, 0.9), V10 = c(64.7, 61.3, 5.2, 5.2, 8.5, 2.6, 0.6, 5.2, 9.1,
5.5, -0.6, 3.8, 0.5, 0.2, 0.3, 0, 0.8, 0.5, -0.3, 0.4), V11 = c(64.9,
61.6, 5.2, 4.9, 8.8, 2.8, 0.3, 5.5, 7.9, 4.7, -1.2, 4.1, 1.1,
0.5, -0.1, 0.7, -0.6, 0, -0.6, 0.1), V12 = c(64.8, 61.5, 5.1,
NA, 8.3, 2, 0.5, 5.6, NA, NA, NA, NA, 0.7, 0, 0.3, 0.5, NA, NA,
NA, NA), V13 = c(65, 61.8, 5, NA, 7.7, 2.1, 0.3, 5.2, NA, NA,
NA, NA, 0.2, 0.4, -0.1, 0, NA, NA, NA, NA)), row.names = c(NA,
-20L), class = "data.frame")
LASTdate <-"2022-12-01"
NumMths <- 12
Dates <- seq.Date(as.Date("2021-12-01"),as.Date("2022-12-01"),by="month")
colls <- c("V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13")
MyTitle <- paste0("**Labour market indicators<br>",format(Dates[1],"%B %Y"),
" to ",format(Dates[13],"%B %Y"),"**")
gt_tbl1 <- gt(data=FdfT)
gt_tbl1 <- gt_tbl1 %>%
tab_options(table.font.size=12,container.width = 1450) %>%
tab_header(
title=md(html(MyTitle))
) %>%
cols_align(
align=c("left"),
columns=c(`Name`)
) %>%
fmt_number(
columns=all_of(colls),
decimals=1,
use_seps=TRUE
) %>%
cols_label(
`Name`="",
`V1`=md("**Dec<br>2021**"),
`V2`=md("**Jan<br>2022**"),
`V3`=md("**Feb<br>2022**"),
`V4`=md("**Mar<br>2022**"),
`V5`=md("**Apr<br>2022**"),
`V6`=md("**May<br>2022**"),
`V7`=md("**Jun<br>2022**"),
`V8`=md("**Jul<br>2022**"),
`V9`=md("**Aug<br>2022**"),
`V10`=md("**Sep<br>2022**"),
`V11`=md("**Oct<br>2022**"),
`V12`=md("**Nov<br>2022**"),
`V13`=md("**Dec<br>2022**")
) %>%
sub_missing(columns=everything(),rows=everything(),
missing_text="---") %>%
tab_style(
style = list(
cell_text(weight = "bold")
),
locations = cells_title()
) %>%
tab_style( # column label style
style = list(
cell_text(weight = "bold")
),
locations = cells_column_labels(
columns=c(Name,all_of(colls)))
) %>%
tab_row_group(label="Ratio, per cent",
rows=c(1:4),id="Levels") %>%
tab_row_group(label="12-month percentage change, per cent",
rows=c(5:12),id="PC12") %>%
tab_row_group(label="1-month percentage change, per cent",
rows=c(13:20),id="PC01") %>%
opt_row_striping(row_striping = TRUE) %>%
opt_vertical_padding(scale = 0.25) %>%
tab_footnote(
footnote = paste0("Dashes mean 'data not yet available'. Source for ",
"data: Statistics Canada. #PhilSmith26. Prepared: ",Sys.time()),
locations = cells_title()
)
gt_tbl1
```

I discovered that one must update to the pre-release 1.3 of quarto. Then the following YAML code does the job.
format:
html:
grid:
sidebar-width: 0px
body-width: 2000px
margin-width: 0px
gutter-width: 1.5rem

Loop over several columns while appending results to new columns using the name of the exisiting columns to name the new columns

The following for-loop loops over columns 6:22 in poll_22 to calculate lower and upper confidence intervals for each political party. The lower and upper intervals are then saved in two new columns. However, the code I have written overwrites the calculated intervals resulting in poll_22$lowerinterval and poll_22$upperinterval containing calculated intervals for the last political party in column 22.
Is it possible to add the name of the column i.e. the letter of a political party when appending the confidence intervals in the last two lines? Intervals for column 6 named A would then be poll_22$A_upperinterval and poll_22$A_lowerinterval. Column 7 named B would be poll_22$B_upperinterval and poll_22$B_lowerinterval etc.
for(i in poll_22[, 6:22]) {
n <- poll_22$n # sample size
p <- i/100 # party
# calculate confidence interval
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)
# calculate upper and lower intervals
lowerinterval <- (p - margin)*100
upperinterval <- (p + margin)*100
# append intervals
poll_22$lowerinterval <- lowerinterval
poll_22$upperinterval <- upperinterval
}
Head of poll_22
poll_22 <- structure(list(id = c(1555, 1556, 1557, 1558, 1559, 1560), pollingfirm = c("VOXMETER",
"VOXMETER", "MEGAFON", "VOXMETER", "VOXMETER", "VOXMETER"), year = c(2022,
2022, 2022, 2022, 2022, 2022), month = c(1, 1, 1, 1, 1, 2), day = c(8,
16, 20, 23, 30, 6), A = c(25.1, 25.8, 23.9, 26.8, 24.9, 24),
B = c(7.2, 7.5, 6.9, 6.9, 7.7, 7.2), C = c(15, 14.6, 18.8,
15.5, 16.4, 16.1), D = c(6, 5.9, 7.9, 6.8, 6, 5.4), E = c(NA,
NA, NA, NA, NA, NA), F = c(8.6, 8.5, 8.9, 8, 8.6, 9.2), G = c(NA,
NA, NA, NA, NA, NA), I = c(2.5, 3.3, 2.7, 2.9, 2.1, 2.8),
K = c(1.8, 1.4, 1.6, 1.2, 1.7, 1.8), M = c(NA, NA, 2.2, NA,
NA, NA), O = c(6.2, 5.3, 4.5, 5.5, 7.1, 6.2), P = c(NA, NA,
NA, NA, NA, NA), Q = c(0.1, 0.3, 0.2, 0.1, 0, 0.2), V = c(16.5,
15.1, 11.6, 14.4, 14.2, 14.8), Ø = c(8.9, 9.3, 9.4, 9.4,
8.3, 9.1), Å = c(1.2, 1.1, 0.7, 0.8, 0.9, 0.9), Æ = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), noparty = c(24.9,
23.5, NA, NA, 25, 25.8), n = c(1002, 1003, 2015, 1008, 1015,
1024)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))

If I understood what you are looking for correctly, then here is a solution using dplyr:
library(dplyr)
poll_22 %>%
mutate_at(vars(6:22), ~./100) %>% # performs p <- i/100 for columns 6:22
mutate_at(vars(6:22), list(lowerinterval = ~100 * (. - qnorm(0.975)*sqrt(.*(1-.)/n),
upperinterval = ~100 * (. + qnorm(0.975)*sqrt(.*(1-.)/n))) %>%
mutate_at(vars(6:22), ~100*.) # reverts values in columns 6:22

How to facet wrap plots with two different factors

For example, i have two trials, each with main factors of year and treatment.
I want to plot response to the same plot, preferably Trial one / treatment / year and the same for trial 2.
Closest that i can get is shown on the simple example picture. Basically i get two graphs joined together, one showing effects of treatment in trial 1 and trial 2 with y axis representing content, and other showing effect of year in trial 1 and trial 2 with y axis representing the same content.
Simple example of plot with Si content affected by treatments and year of sampling
Is it possible to facet wrap graphs together, or at least to remove redundant y axis?
Code that i use is:
a <- ggplot(I1, aes(x=fct_reorder(SISTEM, ORDER), y=Si)) + geom_jitter(show.legend=FALSE, width=0.25, color="black", size=0.5) + stat_summary(fun.data = mean_cl_normal, show.legend=FALSE, color="red", size=0.3) + labs(x=NULL,
y="Si (mg / 100 g)") + facet_wrap(~POSKUS, ncol=2, scales="free_x") + theme_classic(base_family = "Palatino Linotype") + theme(axis.text=element_text(colour="black", size=8), axis.title=element_text(colour="black", size=8), axis.text.x=element_text(angle=45, vjust = 1, hjust=1)) + theme(strip.background = element_blank()) + ggplot(I1, aes(x=Leto, y=Si)) + geom_jitter(show.legend=FALSE, width=0.25, color="black", size=0.5) + scale_x_continuous(breaks=c(2016,2017)) + stat_summary(fun.data = mean_cl_normal, show.legend=FALSE, color="red", size=0.3) + labs(x=NULL,
y="") + facet_wrap(~POSKUS + Leto, ncol=4, scales="free_x") + theme_classic(base_family = "Palatino Linotype") + theme(axis.text=element_text(colour="black", size=8), axis.title=element_text(colour="black", size=8), axis.text.x=element_text(angle=45, vjust = 1, hjust=1)) + theme(strip.background = element_blank())`
The solution as given below sorta works with minor tweaks still needed.
Using the provided code, and expanding it to sort the treatments as needed, and changing the order of variables in the facet wrap provided the plot as shown in figure 2. However, facets labels are now showing only 1 and 2 rather than Trial 1 and Trial 2. Furthermore, is it possible to have only one Trial 1 and Trial 2 name for both TREATMENT and YEAR variables?
Added dput:
I2 <- structure(list(Leto = c("2016", "2016", "2016", "2016", "2016",
"2016", "2016", "2016", "2016", "2016", "2016", "2016", "2016",
"2016", "2016", "2017", "2017", "2017", "2017", "2017", "2017",
"2017", "2017", "2017", "2017", "2017", "2017", "2017", "2017",
"2017", "2016", "2016", "2016", "2016", "2016", "2016", "2016",
"2016", "2016", "2016"), POSKUS = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("Trial 1", "Trial 2"), class = "factor"),
SISTEM = structure(c(5L, 5L, 5L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 4L, 4L, 4L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L,
9L), .Label = c("Manure-N0", "Manure-N1", "Manure-N2", "Manure-N3",
"No.org-N0", "No.org-N3", "Straw-N0", "Straw-N1", "Straw-N2",
"Straw-N3"), class = "factor"), ORDER = c(1, 1, 1, 2, 2,
2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3,
4, 4, 4, 5, 5, 5, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3), DUSIK = c(0,
0, 0, 0, 0, 0, 55, 55, 55, 110, 110, 110, 165, 165, 165,
0, 0, 0, 0, 0, 0, 55, 55, 55, 110, 110, 110, 165, 165, 165,
165, 165, 165, 0, 0, 0, 55, 55, 55, 110), Si = c(9.35, 11.6,
9.71, 8.96, 6.13, 7.08, 3.74, 3.72, 4.75, 1.3, 1.82, 3.41,
5.13, 3.41, 3.68, 7.67, 7.48, 6.21, 5.02, 9.46, 7.79, 8.11,
3.59, 8.28, 7.36, 9.69, 9.08, 6.46, 5.48, 7.9, 5.27, 4.06,
4.22, 5.6, 5.92, 6.9, 3.26, 4.45, 3.09, 4.38), P = c(2.62,
2.26, 2.33, 2.57, 3.06, 2.99, 1.71, 2.06, 2.18, 1.48, 1.71,
2.66, 2.24, 2.38, 2.55, 2.49, 2.48, 3.9, 2.65, 1.79, 2.88,
2.54, 3.22, 2.54, 2.88, 2.93, 3.26, 2.09, 3.03, 2.56, 2.43,
2.72, 2.59, 2.58, 3.71, 2.5, 2.45, 2.48, 3.49, 3.31), S = c(1.24,
0.95, 1.07, 1.17, 1.15, 1.15, 0.81, 1.08, 1.07, 0.89, 0.85,
1.15, 1.12, 1.22, 1.24, 1.16, 0.98, 1.32, 1.29, 1.04, 1,
0.9, 1.19, 1.03, 1.14, 1.05, 1.14, 1.1, 1.13, 1.25, 0.92,
1.19, 0.84, 1.27, 1.14, 1.05, 1.29, 1.05, 1.15, 1.02), Cl = c(0.39,
0.31, 0.32, 0.3, 0.39, 0.38, 0.24, 0.26, 0.32, 0.35, 0.3,
0.31, 0.3, 0.32, 0.28, 0.3, 0.24, 0.27, 0.29, 0.28, 0.25,
0.34, 0.38, 0.34, 0.33, 0.31, 0.33, 0.33, 0.31, 0.35, 0.25,
0.25, 0.26, 0.35, 0.35, 0.39, 0.33, 0.25, 0.25, 0.28), K = c(4.47,
4.05, 3.59, 4.18, 4.07, 4.43, 3.12, 3.79, 4.63, 5.02, 4.52,
4.49, 4.64, 4.21, 4.38, 4.27, 4.08, 5.23, 3.66, 3.39, 4.14,
3.99, 4.21, 3.83, 4.19, 4.95, 5.11, 3.44, 4.27, 4.6, 4.99,
4.54, 4.12, 3.82, 5.55, 4.48, 3.7, 3.8, 5.08, 4.47), Ca = c(0.78,
0.68, 0.66, 0.69, 0.77, 0.73, 0.46, 0.6, 0.66, 0.59, 0.61,
0.71, 0.77, 0.58, 0.7, 0.61, 0.79, 0.87, 0.77, 0.69, 0.84,
0.62, 0.77, 0.62, 0.66, 0.71, 0.68, 0.59, 0.67, 0.73, 0.62,
0.69, 0.61, 0.69, 0.8, 0.72, 0.56, 0.6, 0.63, 0.65), Ti = c(78.5,
73.7, 74, 69, 68.9, 52.3, 33.7, 35, 26.6, 41, 50.7, 42.2,
33.6, 38.7, 41.5, 56.9, 64.6, 60.1, 69.4, 65.7, 65.7, 52.6,
42.2, 46.1, 50.8, 44.1, 35.6, 47.3, 39.2, 47.7, 39.6, 40.3,
38.2, 67.9, 52.3, 63.1, 43.4, 35.1, 37.2, 27), Fe = c(56.2,
52.9, 57.1, 48.8, 46.7, 35.1, 45.8, 48.6, 49.6, 71.5, 66,
85.7, 45.6, 70.2, 58.8, 75.6, 85.2, 93.9, 85.7, 68.7, 70.1,
61.2, 60.6, 76.8, 113, 68.5, 74.9, 91.9, 44.4, 104, 62.1,
55.3, 78.5, 75.7, 51.7, 53.2, 49, 74.4, 51.9, 57.6), Zn = c(31.3,
29.9, 28, 27.4, 27.9, 27.7, 19.6, 19.6, 22, 20.6, 23.1, 20.6,
25.1, 22.6, 22.7, 32.5, 35.5, 31.1, 28.6, 29.2, 29.6, 21.8,
29.5, 25, 26.1, 24.7, 20.1, 23.9, 20.3, 24.6, 20.3, 21.1,
26.6, 27.4, 32.6, 30.4, 19.9, 21.8, 24.7, 20.7), Br = c(8.54,
7.65, 6.27, 5.83, 7.25, 6.92, 4.74, 4.79, 4.51, 7.53, 5.02,
4.35, 3.98, 3.64, 4.26, 10, 13.7, 12.7, 7.67, 8.62, 10.1,
2.52, 3.63, 2.7, 2.44, 2.73, 2.49, 5.9, 2.52, 2.56, 6.05,
5.6, 6.98, 7.81, 12.3, 8.11, 5.91, 6.01, 6.15, 5.74), Rb = c(1.95,
1.53, 2.12, 1.44, 2.54, 1.84, 1.62, 2.78, 2.35, 3.24, 3.62,
3.48, 4.74, 3.34, 4.21, 5.43, 3.94, 5.55, 3.01, 2.19, 3.34,
3.55, 5.08, 2.63, 5.44, 4.67, 4.71, 6.52, 2.99, 3.24, 4.19,
3.11, 4.11, 1.57, 1.26, 1.14, 1.95, 2.21, 2.57, 2.41), Sr = c(0.94,
0.97, 0.86, 1.07, 1.19, 1.97, 1.08, 1.23, 1.35, 1.23, 1.17,
1.03, 0.86, 0.96, 0.86, 3.51, 1.94, 3.44, 1.47, 1.95, 2.14,
1.36, 4.22, 2.07, 1.92, 1.8, 2.34, 2.89, 2.13, 2.62, 1.3,
1.16, 1.95, 1, 1.41, 0.77, 1.25, 1.09, 1.37, 1.28), N = c(5.68,
4.93, 4.36, 6.36, 4.68, 5, 4.67, 3.38, 3.33, 3.94, 3.61,
3.52, 3.03, 2.74, 2.7, 8.47, 7.33, 5.82, 8.01, 7.34, 7.12,
5.84, 4.5, 4.17, 3.91, 3.57, 2.35, 3.44, 4.2, 1.94, 2.97,
3.1, 3.42, 5.04, 5.42, 7.35, 3.28, 3.55, 4.2, 3.79), C = c(-29.04,
-28.81, -29.12, -28.91, -29.07, -29.13, -29.16, -29.16, -28.98,
-28.81, -28.74, -28.56, -28.58, -28.33, -28.51, -29.49, -30.45,
-30.34, -29.91, -30.13, -30.86, -30.3, -30.23, -30.46, -29.69,
-29.43, -29.74, -29.75, -29.92, -29.52, -28.44, -28.24, -28.01,
-28.68, -28.77, -29.13, -29.13, -29.41, -29, -28.85)), row.names = c(NA,
40L), class = "data.frame")

One way to achieve your desired result using faceting would be to split your dataframe into two like so:
The first dataframe contains the data by treatment, the second the data by year.
In each of these data frames rename the vars to be plotted on the x-axis to have the same name (I chose ´x). Doing so allows use to have one x-axis for the two different variables. But make sure to convert year or ´Leto to a character.
Add an identifier to each data.frame which could be used for facetting (besides your variable POSKUS).
Splitting the data into two df we need both two geom_jitter and two stat_summary layers.
Finally I added a custom labelled function to facet_wrap to show only the trial labels in the facet strip text.
library(dplyr)
library(ggplot2)
d1 <- I2 %>%
select(x = SISTEM, Si, POSKUS) %>%
mutate(name = "SISTEM", name = factor(name, levels = c("SISTEM", "Leto")))
d2 <- I2 %>%
select(x = Leto, Si, POSKUS) %>%
mutate(name = "Leto", name = factor(name, levels = c("SISTEM", "Leto")))
base <- ggplot(mapping = aes(x = x, y = Si)) +
geom_jitter(data = d1, show.legend = FALSE, width = 0.25, color = "black", size = 0.5) +
stat_summary(data = d1, fun.data = mean_cl_normal, show.legend = FALSE, color = "red", size = 0.3) +
geom_jitter(data = d2, show.legend = FALSE, width = 0.25, color = "black", size = 0.5) +
stat_summary(data = d2, fun.data = mean_cl_normal, show.legend = FALSE, color = "red", size = 0.3) +
labs(
x = NULL,
y = "Si (mg / 100 g)"
) +
#theme_classic(base_family = "Palatino Linotype") +
theme_classic() +
theme(axis.text = element_text(colour = "black", size = 8),
axis.title = element_text(colour = "black", size = 8),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
theme(strip.background = element_blank())
base +
facet_wrap(~name+POSKUS, nrow = 1, scales = "free_x", labeller = function(d) list(as.character(d$POSKUS)))
#> Warning: Removed 1 rows containing missing values (geom_segment).
EDIT To have only one label for each trial after changing the order of the variables you could make use of ggh4x:: facet_nested_wrap:
base +
ggh4x::facet_nested_wrap(~POSKUS+name, nrow = 1, scales = "free_x", labeller = function(d) list(as.character(d$POSKUS)))
#> Warning: Removed 1 rows containing missing values (geom_segment).

How to average 7 years of data

I'm new to R programming and this website so please bear with my incompetence. I pulled atmospheric data from the past 7 years for 7 variables; ozone, CO, NO, NO2, windspeed, PM 2.5, PM 10. What I am trying to do is graph this data to see if the government mandated stay at home orders during this covid-19 pandemic had any effect on atmospheric composition. From the graph I'm not quite sure what to do next. I believe I need an average of the past 7 years? The data is hourly data displayed in a 24 hour format.
NO.dat data frame
dput(head(NO.dat,10))
structure(list(Date = c("3/1/2014", "3/2/2014", "3/3/2014", "3/4/2014",
"3/5/2014", "3/6/2014", "3/7/2014", "3/8/2014", "3/9/2014", "3/10/2014"
), X0.00 = c(3.6, NA, 2.3, 17.1, 0.4, 0.9, 110.9, 0.1, NA, 0.4
), X1.00 = c(6.3, NA, 1.4, 18.7, 0.2, 0.2, 15.8, 0, NA, 0.6),
X2.00 = c(2.3, 0.6, 0.4, 13.9, 0.2, 0.1, 13.5, 0, 0.8, 0.3
), X3.00 = c(0.9, 0.3, 0.9, 4.2, 0.5, 0.3, 22.7, 0.2, 0.5,
0.7), X4.00 = c(0.2, 0.1, 2.8, 5.2, 0.7, 0, 40.1, 0.1, 0.8,
2.8), X5.00 = c(0.4, 0, 4.4, 11.1, 2.4, 1.8, 22.2, 0.1, 0.8,
4.1), X6.00 = c(11.8, 0.1, 17.6, 51.8, 3.6, 8.2, 2.8, 0.3,
1, 20.1), X7.00 = c(39.5, 0.6, 30.3, 118.6, 15.7, 12, 3.7,
1, 1.9, 39.1), X8.00 = c(23.9, 0.7, 25.8, 35.6, 20.6, 11.4,
6.3, 1.5, 1.2, 33.5), X9.00 = c(8.4, 1.1, 20.8, 28.7, 5.1,
9.4, 3.7, 1.3, 0.8, 9.6), X10.00 = c(4.3, 0.5, 13.3, 17.1,
1.1, 6, 1.3, 2.4, 1.4, 2.5), X11.00 = c(3.9, 0.3, 8.3, 13.9,
0.5, 5.6, 0.9, 2.3, 1.3, 1.2), X12.00 = c(4.1, 0.6, 6.3,
12.2, 0.6, 4.3, 0.8, 1.6, 1, 1.1), X13.00 = c(2.6, 0.6, 9.1,
9, 0.6, 3.6, 0.7, 2, 1.6, 1.1), X14.00 = c(3.7, 0.5, 9.3,
1.4, 0.9, 2.3, 0.9, 1.1, 1.1, 1.3), X15.00 = c(3.4, 0.5,
9.4, 0.8, 0.8, 1.8, 1.2, 1.8, 1.2, 1.1), X16.00 = c(1, 0.3,
5.7, 0.6, 2.5, 2.3, 1.1, 2.3, 1.2, 1), X17.00 = c(0.9, 0.3,
13.4, 0.5, 3.2, 1.8, 0.7, 1.4, 0.6, 0.7), X18.00 = c(0.8,
0.2, 22.1, 0.5, 3.9, 0.7, 0.7, 0.9, 0.4, 0.5), X19.00 = c(0.5,
0.2, 24.2, 1.8, 15.4, 1.1, 0.1, 0.8, 10.1, 0.6), X20.00 = c(0.5,
1, 18.4, 17.1, 5.1, 33.4, 0.3, 0.1, 45.3, 0.5), X21.00 = c(1,
0.5, 15.4, 55.7, 2, 39.5, 4.1, 0, 49.5, 0.4), X22.00 = c(0.4,
0.2, 8.1, 52.6, 2.7, 25.2, 0.9, 0.3, 27.2, 0.5), X23.00 = c(0.4,
6, 11.9, 2.2, 2.5, 62.1, 0.2, 0.1, 3.3, 0.4)), row.names = c(NA,
10L), class = "data.frame")
NO2.dat data frame
dput(head(NO2.dat,10))
structure(list(Date = c("3/1/2014", "3/2/2014", "3/3/2014", "3/4/2014",
"3/5/2014", "3/6/2014", "3/7/2014", "3/8/2014", "3/9/2014", "3/10/2014"
), X0.00 = c(5, 0.5, 3.2, 16.3, 0.4, 2, 91.2, 0.2, 0.5, 0.2),
X1.00 = c(7, 0.4, 2.4, 18.4, 0.3, 0.6, 17.7, 0.2, 0.5, 0.1
), X2.00 = c(1.7, 0.4, 0.3, 16.3, 0.1, 0.4, 10.3, 0.3, 0.6,
0.2), X3.00 = c(0.8, 0.6, 0.7, 4.4, 0.8, 0.6, 8.5, 0.4, 0.5,
0.6), X4.00 = c(0.6, 0.2, 2.6, 4.4, 1, 0.6, 43.7, 0.3, 0.7,
2.6), X5.00 = c(0.6, 0.3, 5, 12.8, 2.7, 2.8, 15.7, 0.4, 0.7,
4.3), X6.00 = c(5.8, 0.4, 18.6, 60.5, 3.8, 9.5, 3, 0.6, 0.9,
22.1), X7.00 = c(32, 0.7, 27.4, 117.5, 15.3, 12.6, 4.4, 1.7,
2.2, 36.2), X8.00 = c(21.3, 1, 22.7, 37.1, 20.3, 12.5, 7.6,
2.1, 1.4, 33.2), X9.00 = c(7.9, 1.4, 19.4, 28.7, 5, 10.5,
4.8, 2.3, 0.9, 11), X10.00 = c(4.2, 0.6, 12.4, 19, 1.6, 8.1,
1.9, 3.1, 1.8, 2.9), X11.00 = c(4.2, 0.8, 9.6, 15.7, 1.1,
7.4, 1.6, 3.4, 1.8, 1), X12.00 = c(4.2, 0.9, 6.6, 14.2, 1.2,
6.1, 1.4, 2.7, 1.3, 1.2), X13.00 = c(NA, 0.8, 9.4, NA, 1.4,
3.9, 1.2, NA, 1.9, 1.2), X14.00 = c(NA, 0.9, 9.6, NA, 1.9,
3.1, 1.3, NA, 1.3, 1.3), X15.00 = c(NA, 0.9, 9.6, NA, 1.7,
2.9, 1.9, NA, 1.6, 1), X16.00 = c(1, 0.8, 6.4, 1.2, 3.8,
3, 1.8, 3, 1.2, 1.1), X17.00 = c(1.2, 0.7, 12, 1, 4, 1.5,
1.5, 2, 0.5, 0.6), X18.00 = c(0.9, 0.5, 20.2, 0.9, 5, 1,
1.3, 1.5, 0.3, 0.3), X19.00 = c(0.5, 0.5, 19.1, 2.1, 15.8,
1.1, 0.6, 1.3, 5.1, 0.4), X20.00 = c(0.4, 1.1, 17.5, 7, 4.2,
24.9, 0.5, 0.7, 32.1, 0.4), X21.00 = c(0.7, 0.7, 13.3, 28.4,
2.4, 31.7, 3.4, 0.7, 37, 0.3), X22.00 = c(0.4, 0.4, 7.3,
21, 2.9, 18.5, 1.2, 0.6, 20, 0.3), X23.00 = c(0.4, 5.8, 11.6,
0.8, 2.9, 47.8, 0.5, 0.6, 2.1, 0.2)), row.names = c(NA, 10L
), class = "data.frame")
Any help would be much appreciated!
'''
library(reshape2)
library(dplyr)
library(lubridate)
library(ggplot2)
#remove summary stats
NO.dat <- NO.dat[,1:25]
NO2.dat <- NO2.dat[,1:25]
#reorganize data using reshape
x<-melt(NO.dat, id="Date")
colnames(x) <- c("Date","Hour","NO")
x$Hour<- as.numeric(x$Hour)-y<-melt(NO2.dat, id="Date")
y<-melt(NO2.dat, id="Date")
colnames(y) <- c("Date","Hour","NO2")
y$Hour<- as.numeric(y$Hour)-1
x <- cbind(x,y$NO2)
colnames(x)[4] <- "NO2"
x$min <- ":00"
x$time <- paste(x$Hour, x$min, sep="")
x$DT <- paste(x$Date, x$time)
x %>% select(DT, NO, NO2) %>% mutate(NOx=NO + NO2) %>% mutate(DT =
mdy_hm(DT)) %>% arrange(DT) -> x
p <- ggplot(x, aes(x=DT, y=NOx)) + geom_line() + xlab("")
x$index <- 1:nrow(x)
loessMod10 <- loess(NOx ~ index, data=x, span=0.10, na.action=)
x <- na.omit(x)
x$smoothed10 <- predict(loessMod10)
#pdf("El Paso NOx.pdf",w=6,h=3,useDingbats= FALSE)
p <- ggplot(x) + geom_line(aes(x=DT, y=NOx), linetype = "dashed",
size=0.3) + xlab("") +
geom_line(aes(x=DT, y=smoothed10), color = "red") + labs(y="NOx
(ppbv)") +
ggtitle("NOx concentrations at Chamizal TCEQ Site") +
theme(plot.title = element_text(hjust = 0.5)) +
annotate("text",x=as.POSIXct("2020-03-24 17:00:00"), y=130, +
label="Stay Home 1", angle=90, size=2.5)
annotate("text",x=as.POSIXct("2020-04-01 17:00:00"), y=130,
label="Stay Home 2", angle=90, size = 2.5) +
annotate("segment", x = as.POSIXct("2020-03-24 17:00:00"), xend=
as.POSIXct("2020-03-24 17:00:00"), y = 0, yend = 105, colour =
"blue") +
annotate("segment", x = as.POSIXct("2020-04-01 17:00:00"), xend=
as.POSIXct("2020-04-01 17:00:00"), y = 0, yend = 105, colour =
"blue")
'''
Graph that I think needs to be averaged??

Perhaps you can add a year column and use this to group your data by year. You could then overlay the data by year.
For fun, I had a go at reproducing the plots avoiding reshape.
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
# stack raw data for NO and NO2
NO_stacked <- NO %>%
pivot_longer(cols = starts_with("X"),
names_to = "hours",
values_to = "NO")
NO2_stacked <- NO2 %>%
pivot_longer(cols = starts_with("X"),
names_to = "hours",
values_to = "NO2")
# combine into one data frame
data <- bind_cols(NO_stacked, NO2_stacked) %>%
select(Date, hours, NO, NO2)
# coerce dates to POSIXct and add hours; remove hours; reshape to long format using pivot_longer
data <- data %>%
mutate(Date = as.POSIXct(Date, format = "%m/%d/%Y", tz = "UTC"),
hours = as.numeric(str_sub(hours, start = 2, end = -1))) %>%
mutate(Date = Date + 60*60*hours) %>%
select(-hours) %>%
pivot_longer(cols = contains("NO"),
names_to = "Contaminant",
values_to = "Concentration")
# plot
ggplot(data = data, aes(x = Date, y = Concentration)) +
geom_line() +
geom_smooth(method = "loess", formula = y ~ x) +
facet_wrap(vars(Contaminant), nrow = 2) +
ggtitle("NOx concentrations at Chamizal TCEQ Site")

Coarsened Exact Matching with cem package- Error in .subset2(x, i, exact = exact)

I'm trying to perform coarsened exact matching on the following data.
> dput(head(cem_data))
structure(list(sex = c(1, 1, 1, 2, 2, 2), age = c(40, 59, 53,
60, 49, 60), edlev = c(3, 3, 3, 2, 3, 3), sw = c(44, 17, 10,
41, 26, 23), sw2 = c(15, 1, 5, 34, 5, 6), som = c(2.14, 0.14,
1.86, 3, 1.71, 2.14), som_2 = c(0.71, 0.14, 2, 2.57, 1.71, 2.14
), ap = c(3.5, 1.5, 1.33, 3.33, 2.67, 2.17), ap_2 = c(3, 0.17,
2.33, 3, 0.83, 1.67), dep = c(2.83, 0.17, 0.33, 2.83, 2.17, 2.33
), dep_2 = c(1.17, 0, 0.33, 2.33, 0.83, 1), int = c(2.86, 1.43,
1, 2, 2.29, 2.14), int_2 = c(2.29, 0.57, 0.14, 2.57, 1.71, 1.43
), pho = c(3.2, 0, 0, 3.4, 0.8, 0.4), pho_2 = c(1.6, 0, 0, 3.2,
0, 0.4), psy_b = c(2.67, 0.11, 0.83, 3.06, 1.61, 1.72), psy_b_2 = c(1.11,
0.06, 0.89, 2.67, 0.94, 1.28), s_wirk = c(4, 2.2, 1.6, 3.2, 1.4,
2.2), s_wirk_2 = c(2.8, 0.8, 1.8, 2.6, 1.6, 1.4), soz_b = c(2.75,
1.5, 1, 2.25, 1.25, 1.5), soz_b_2 = c(2.75, 1, 1, 2.25, 1.5,
1.25), soz_u = c(0.75, 0.75, 1.75, 3.25, 1, 3.25), soz_u_2 = c(1,
0.25, 1.75, 2.5, 2.5, 2), wohl = c(3.6, 1.4, 1.8, 3.4, 3, 3),
wohl_2 = c(2, 0.6, 1.4, 2.8, 2.2, 1.2), au_bei_aufn = c(1,
1, 1, 1, 1, 1), age_reha = c(40.9890410958904, 59.3945205479452,
53.372602739726, 60.2, 49.3342465753425, 60.7534246575342
), group_format = c(0, 0, 0, 0, 0, 0)), row.names = c(6L,
7L, 10L, 15L, 20L, 29L), class = "data.frame")
With the following code:
require(cem)
voll_data <- voll_data %>%
select(-c("auf_nr", "icd_10_1", "icd_10_2", "icd_10_3", "icd_10_4","icd_10_5", "bdi_date", "aufnahme", "entlassung")) %>%
mutate_if(is.factor,as.numeric) %>%
mutate_if(is.character, as.numeric)
cem_data <- data.frame(na.omit(voll_data))
#cem_data_s <- scale(cem_data[,5:26])
#cem_data <- cbind.data.frame(cem_data[, 1:4], cem_data_s, cem_data[, 27:36])
variables <- c("age", "sex", "edlev", "sw","au_bei_aufn")
ungleich2 <- imbalance(cem_data$group_format, data=cem_data)
However, following error is being shown, when calculating the "matt".
Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index
7.
(function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, i, exact = exact))(x, ..., exact = exact)
6.
[[.data.frame(data, treatment)
5.
data[[treatment]]
4.
is.factor(x)
3.
as.factor(data[[treatment]])
2.
cem.main(treatment = treatment, data = data, cutpoints = cutpoints, drop = drop, k2k = k2k, method = method, mpower = mpower, verbose = verbose, baseline.group = baseline.group, keep.all = keep.all)
1.
cem(treatment = cem_data$group_format, data = cem_data, drop = "sw2", cutpoints = list(age = agecut), grouping = list(edlev_gr))
# automated coarsening
matt <- cem(cem_data$group_format, data = cem_data, drop= "sw2")
print(matt)
Does anyone have an idea what am I doing wrong?
Thanks a lot!!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Comparing test data and prediction outcome - r

Related

Making quarto html documents use the full window width

Loop over several columns while appending results to new columns using the name of the exisiting columns to name the new columns

How to facet wrap plots with two different factors

How to average 7 years of data

Coarsened Exact Matching with cem package- Error in .subset2(x, i, exact = exact)

Categories

Resources