Add a new row on the basis of column values in R

I am trying to get my head around this simple preprocessing task in R. I am trying to get the ideal value column as a row titled ideal in Product ID. I think the image below will shed more light on it.
> dput(df)
structure(list(Consumer = c(43L, 43L, 43L, 43L, 43L, 41L, 41L,
41L, 41L, 41L), Product = c(106L, 992L, 366L, 257L, 548L, 106L,
992L, 366L, 257L, 548L), Firm = c(1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 0L, 0L), Juicy = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L
), Sweet = c(0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L), Ideal_Firm = c(1L,
1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), Ideal_Juicy = c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Ideal_Sweet = c(1L, 1L, 1L,
1L, 1L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-10L))

Below is a solution:
df <- data.frame(
Consumer = c(rep(43, 5), rep(41, 5)),
Product = rep(sample(100:900,size = 5, replace = F), 2),
Firm = c(sample(rep(0:1, 5), replace = T)),
Juicy = c(sample(rep(0:1, 5), replace = T)),
Sweet = c(sample(rep(0:1, 5), replace = T)),
Ideal_Firm = 1,
Ideal_Juicy = c(rep(1, 5), rep(2, 5)),
Ideal_Sweet = c(rep(1, 5), rep(0, 5))
)
library(dplyr)
library(tidyr) # pivot_wider(), pivot_longer() and separate() come from tidyr
df <- merge(
# Bind the observation...
df %>% select(Consumer:Sweet) %>%
pivot_wider(id_cols = Consumer,names_from = Product,values_from = Firm:Sweet),
# ... to the ideal
df %>% group_by(Consumer) %>%
# Here I put mean, but it could be 1, median, min, max... If I understood correctly, it has to be 1?
summarise(across(Ideal_Firm:Ideal_Sweet, ~mean(.x))) %>%
# Rename so the column name has the form [characteristic]_ideal instead of Ideal_[characteristic]
# remove prefix Ideal_ ...
rename_with(~ sub("^Ideal_", "", .x), starts_with("Ideal_")) %>%
# ... add _Ideal as a suffix instead (rename_with replaces the deprecated rename_at/funs)
rename_with(~ paste0(.x, "_Ideal"), -Consumer)
)
# Then manipulate to get into long form again
df <- df %>% pivot_longer(cols = !Consumer) %>%
separate(name, c("Characteristic", "Product")) %>%
pivot_wider(id_cols = Consumer:Product, names_from = Characteristic, values_from = value)
df
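For the data posted in the question, a more direct route may be to build one "Ideal" row per consumer and stack it under the observed rows. The following is a minimal base R sketch under that assumption, using a shortened copy of the posted data. The names traits, observed, ideal and result are illustrative only, and Product is converted to character so that the label "Ideal" can share the column with the product IDs.

```r
# Shortened copy of the question's data (two products per consumer)
df <- data.frame(
  Consumer    = c(43L, 43L, 41L, 41L),
  Product     = c(106L, 992L, 106L, 992L),
  Firm        = c(1L, 1L, 0L, 0L),
  Juicy       = c(1L, 1L, 0L, 0L),
  Sweet       = c(0L, 1L, 0L, 1L),
  Ideal_Firm  = c(1L, 1L, 0L, 0L),
  Ideal_Juicy = c(1L, 1L, 1L, 1L),
  Ideal_Sweet = c(1L, 1L, 0L, 0L)
)

traits <- c("Firm", "Juicy", "Sweet")

# Observed rows: keep only the rated traits
observed <- df[, c("Consumer", "Product", traits)]
observed$Product <- as.character(observed$Product)

# One ideal row per consumer (assumes the Ideal_* values are constant within a consumer)
first_rows <- df[!duplicated(df$Consumer), ]
ideal <- data.frame(
  Consumer = first_rows$Consumer,
  Product  = "Ideal",
  first_rows[, paste0("Ideal_", traits)],
  row.names = NULL,
  stringsAsFactors = FALSE
)
names(ideal) <- c("Consumer", "Product", traits)

# Stack and sort so that each consumer's "Ideal" row comes last
result <- rbind(observed, ideal)
result <- result[order(result$Consumer, result$Product == "Ideal"), ]
rownames(result) <- NULL
print(result)
```

If the ideal values can differ within a consumer, the first_rows step would need to be replaced by an explicit summary such as the mean used in the answer above.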

Related

create a dataframe for multiple line plot for ggplot R

This question is about arranging data for a ggplot line plot. I have been doing this manually in Excel and I want to work out a way to do it in R.
I have reviewed this post which is similar
Arrange dataframe format for ggplot - R
I have a dataset that looks like this:
I want to convert it to a dataframe that is divided into the groups (N,A,G) and into age brackets and the proportion per age_group.
An example of what I am trying to achieve:
Appreciate your help.
Data:
structure(list(ID = 1:10, Age = c(9L, 16L, 12L, 13L, 29L, 24L,
23L, 24L, 16L, 40L), Sex = structure(c(1L, 1L, 2L, 1L, 1L, 2L,
2L, 1L, 1L, 1L), .Label = c("F", "M"), class = "factor"), Age_group =
c(1L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 4L), N = c(1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 0L, 0L), A = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L,
0L), G = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))
We can pivot to 'long' format with pivot_longer, create a grouping variable with cut on 'Age', sum 'n' within each age bracket and group, and then compute the 'proportion' within each age bracket.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = N:G, names_to = 'group', values_to = 'n') %>%
group_by(Age_group_new = cut(Age, breaks = c(-Inf, 0, seq(10, 70, by = 10), 100, Inf)), group) %>%
summarise(n = sum(n)) %>%
group_by(Age_group_new) %>%
mutate(proportion = n/sum(n),
proportion = replace(proportion, is.nan(proportion), 0))
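To see what cut and the NaN guard are doing, here is a small base R sketch of the same idea on made-up values (the vectors age, N, A and G are illustrative, not the full dataset). It uses xtabs and prop.table in place of the dplyr pipeline; age brackets with no observations produce 0/0, which is why the replacement step is needed.

```r
# Illustrative values only
age <- c(9, 16, 12, 29, 24, 40)
N <- c(1, 1, 1, 0, 0, 0)
A <- c(0, 0, 0, 1, 1, 0)
G <- c(0, 0, 0, 0, 0, 1)

# Same breaks as in the answer: right-closed intervals such as (0,10], (10,20], ...
bracket <- cut(age, breaks = c(-Inf, 0, seq(10, 70, by = 10), 100, Inf))

# Long format: one row per person and group
long <- data.frame(
  bracket = rep(bracket, times = 3),
  group   = rep(c("N", "A", "G"), each = length(age)),
  n       = c(N, A, G)
)

# Counts per bracket and group; empty brackets are kept as all-zero rows
counts <- xtabs(n ~ bracket + group, data = long)

# Row-wise proportions; empty brackets give 0/0 = NaN, so replace those with 0
props <- prop.table(counts, margin = 1)
props[is.nan(props)] <- 0
print(round(props, 2))
```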

How to mutate a column using dplyr with a value when any of the columns contain a 1 otherwise 0

events <- structure(list(ID = c(3049951, 3085397, 3204081, 3262134,
3467254), TVTProcedureStartDate = structure(c(16210, 16238, 16322,
16420, 16546), class = "Date"), DCDate = structure(c(16213, 16250,
16326, 16426, 16560), class = "Date"), CE_EventOccurred = c(0L,
0L, 0L, 0L, 0L), CE_EventDate = c(0L, 0L, 0L, 0L, 0L), `Annular Dissection (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Aortic Dissection (In Hospital)` = c(0L, 0L,
0L, 1L, 0L), `Atrial Fibrillation (In Hospital)` = c(0L, 1L,
0L, 0L, 1L), `Bleeding at Access Site (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Cardiac Arrest (In Hospital)` = c(1L, 0L, 0L,
0L, 0L), `Conduction/Native Pacer Disturbance Req ICD (In Hospital)` = c(0L,
0L, 1L, 0L, 0L), `Conduction/Native Pacer Disturbance Req Pacer (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Endocarditis (In Hospital)` = c(0L, 0L, 0L,
0L, 0L), `GI Bleed (In Hospital)` = c(0L, 0L, 0L, 0L, 0L), `Hematoma at Access Site (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Ischemic Stroke (In Hospital)` = c(0L, 0L,
0L, 0L, 0L), `Major Vascular Complications (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Minor Vascular Complication (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Mitral Leaflet Injury - detected during surgery (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Mitral Subvalvular Injury -detected during surgery (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `New Requirement for Dialysis (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Other Bleed (In Hospital)` = c(0L, 0L, 0L,
0L, 0L), `Perforation with or w/o Tamponade (In Hospital)` = c(1L,
0L, 0L, 0L, 0L), `Retroperitoneal Bleeding (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Single Leaflet Device Attachment (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Unplanned Other Cardiac Surgery or Intervention (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Unplanned Vascular Surgery or Intervention (In Hospital)` = c(0L,
0L, 0L, 1L, 0L)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), vars = "NCDRPatientID", labels = structure(list(
NCDRPatientID = c(3049951, 3085397, 3204081, 3262134, 3467254
)), class = "data.frame", row.names = c(NA, -5L), vars = "NCDRPatientID", labels = structure(list(
NCDRPatientID = c(3049951, 3085397, 3204081, 3262134, 3467254,
3467324, 3510387, 3586037, 3661089, 3668621, 3679485, 3737916,
3738064, 3960141, 4006862, 4018241, 4019056, 4025174, 4027490,
4050900, 4051101, 4096816, 4097119, 4097146, 4097180, 4098426,
4106410, 4109968, 4147466, 4198427, 4198450, 4198458, 4204554,
4208053, 4213116, 4218802, 4218854, 4223378, 4223415, 4243959,
4316979, 4341660, 4348676, 4413567, 4419513, 4421948, 4422768,
4426483, 4430159, 4431211, 4433156, 4433406, 4433988)), class = "data.frame", row.names = c(NA,
-53L), vars = "NCDRPatientID", labels = structure(list(NCDRPatientID = c(3049951,
3085397, 3204081, 3262134, 3467254, 3467324, 3510387, 3586037,
3661089, 3668621, 3679485, 3737916, 3738064, 3960141, 4006862,
4018241, 4019056, 4025174, 4027490, 4050900, 4051101, 4096816,
4097119, 4097146, 4097180, 4098426, 4106410, 4109968, 4147466,
4198427, 4198450, 4198458, 4204554, 4208053, 4213116, 4218802,
4218854, 4223378, 4223415, 4243959, 4316979, 4341660, 4348676,
4413567, 4419513, 4421948, 4422768, 4426483, 4430159, 4431211,
4433156, 4433406, 4433988)), class = "data.frame", row.names = c(NA,
-53L), vars = "NCDRPatientID", drop = TRUE), indices = list(0L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10:12, 13L, 14L, 15L,
16:17, 18L, 19:21, 22L, 23L, 24L, 25:26, 27L, 28L, 29:30,
31L, 32:33, 34L, 35:38, 39L, 40:41, 42L, 43L, 44L, 45L, 46L,
47L, 48:50, 51:53, 54L, 55L, 56L, 57L, 58L, 59:60, 61L, 62L,
63:64, 65:66, 67:68, 69L, 70L, 71:72, 73L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 4L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 2L, 1L), biggest_group_size = 4L), indices = list(0L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L,
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L,
51L, 52L), drop = TRUE, group_sizes = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), biggest_group_size = 1L), indices = list(0L, 1L, 2L, 3L, 4L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L), biggest_group_size = 1L)
From this data, I need to create a column that has the value 1 if any of the columns whose names end in "(In Hospital)" contains a 1, and 0 otherwise.
I tried several things, but each either does not work or raises an error:
Error in mutate_impl(.data, dots) : Evaluation error: NA/NaN argument.
event %>% mutate(TR = rowSums(select_(.,6:n)))
Error in mutate_impl(.data, dots) : Column `TR` must be length 1 (the group size), not 53
event %>% mutate(TR = rowSums(.[6:ncol(.)]))
I tried some other variations to make sense of it, but they keep running into similar errors.
Another thing I tried was the following, which seems to compute the row sums but also adds the ID:
event %>% select(6:27) %>% rowSums()
It added the ID to the 1s and 0s from columns 6 to 27 for each row, and I am not sure why.
I want the result as a data frame with the same data plus a column that is 1 if any of columns 6 to 27 contains a 1, and 0 otherwise.
Before I developed my solution, I ran the following code to ungroup your data.
library(dplyr)
events <- events %>% ungroup()
Solution 1: rowSums with selected columns
The idea of this solution is to use rowSums to add all the numbers from the selected columns, determine if the sum is larger than 0, and then convert the logical vector to an integer vector (with 1 or 0).
There are many ways to select the columns. We can select based on column numbers.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., 6:27)) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use ends_with.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., ends_with("(In Hospital)"))) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use matches. In the regular expression \\(In Hospital\\)$, the escaped parentheses match literally and $ anchors the pattern to the end of the column name.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., matches("\\(In Hospital\\)$"))) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use contains, but note that the target string then does not need to be at the end of the column name.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., contains("(In Hospital)"))) > 0))
events2$Col
# [1] 1 1 1 1 1
Solution 2: apply with max
Since the values in the target columns are all 1 or 0, we can use apply with max to get the row maximum, which is 1 if there are any 1s and 0 otherwise. All the ways of using select shown above also work here. Below is one way to do this.
events2 <- events %>% mutate(Col = apply(select(., ends_with("(In Hospital)")), 1, max))
events2$Col
# [1] 1 1 1 1 1
It is not a dplyr way, but it also works:
events$new_col <- 0
events$new_col[rowSums(events[, grep("In Hospital", colnames(events))]) >= 1] <- 1
A solution from base R using apply()
cols <- grep("in hospital", colnames(events), ignore.case = T)
apply(events[, cols], 1, function(x) ifelse(any(x == 1), 1, 0))
# [1] 1 1 1 1 1
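The approaches above share one idea: restrict to the event columns, then reduce each row to a single 0/1 flag. Here is a minimal, self-contained base R sketch of that idea on a made-up three-row table (the data frame toy and its last column are hypothetical). It also shows why anchoring the pattern at the end of the name matters, and that the rowSums and max variants agree for 0/1 data.

```r
# Made-up data; check.names = FALSE keeps the spaces and parentheses in the names
toy <- data.frame(
  ID = c(3049951, 3085397, 3204081),
  `Cardiac Arrest (In Hospital)` = c(1L, 0L, 0L),
  `GI Bleed (In Hospital)` = c(0L, 0L, 1L),
  `In Hospital Stay (Days)` = c(3L, 12L, 4L),  # hypothetical non-event column
  check.names = FALSE
)

# Only names that END in "(In Hospital)" are event columns;
# the ID and the length-of-stay column are excluded
target <- grepl("\\(In Hospital\\)$", names(toy))

# Variant 1: row sum greater than zero, converted to 0/1
by_sum <- as.integer(rowSums(toy[, target, drop = FALSE]) > 0)

# Variant 2: row maximum (valid because the columns only hold 0 and 1)
by_max <- apply(toy[, target, drop = FALSE], 1, max)

toy$any_event <- by_sum
print(toy[, c("ID", "any_event")])
```

Selecting the columns explicitly also avoids the problem from the question, where the grouping ID ended up in the sum.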

performing customized forecast's plot to save it in pdf file in R

I have these datasets. I must produce a six-week forecast from the store data, using train and test samples. I can compute the forecast, but I need
the visualization. Here is part of my train and test samples (in case it is needed):
combinedTrainingData=structure(list(id = 1:19, Store = 1:19, DayOfWeek = c(5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L), Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30.01.2015", class = "factor"),
Sales = c(5577L, 5919L, 6911L, 13307L, 5640L, 6555L, 11430L,
6401L, 8072L, 6350L, 10031L, 9156L, 7004L, 6491L, 8898L,
9546L, 7929L, 9941L, 7121L), Open = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
Promo = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), StateHoliday = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), SchoolHoliday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), WeekOfYear = c(5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L), Weekend = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), DateDiff = c(65L,
65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L,
65L, 65L, 65L, 65L, 65L, 65L)), .Names = c("id", "Store",
"DayOfWeek", "Date", "Sales", "Open", "Promo", "StateHoliday",
"SchoolHoliday", "WeekOfYear", "Weekend", "DateDiff"), class = "data.frame", row.names = c(NA,
-19L))
Now the test sample:
testingData=structure(list(Id = 1:20, Store = 1:20, DayOfWeek = c(5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L), Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "31.07.2015", class = "factor"),
Open = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Promo = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), StateHoliday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), SchoolHoliday = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L,
1L, 1L, 1L, 0L), WeekOfYear = c(31L, 31L, 31L, 31L, 31L,
31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L,
31L, 31L, 31L), Weekend = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), DateDiff = c(29L,
29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L,
29L, 29L, 29L, 29L, 29L, 29L, 29L)), .Names = c("Id", "Store",
"DayOfWeek", "Date", "Open", "Promo", "StateHoliday", "SchoolHoliday",
"WeekOfYear", "Weekend", "DateDiff"), class = "data.frame", row.names = c(NA,
-20L))
This is the forecast code:
regularSales <- combinedTrainingData[combinedTrainingData$Promo == 0 &
combinedTrainingData$Open == 1, ]
testingForecast <- testingData
for(store in storeData$Store) {
coeff <- lm(data = regularSales[regularSales$Store == store, ],
Sales ~ DateDiff)$coefficients
storeData[storeData$Store == store, 'reg_intercept'] <- coeff[1]
storeData[storeData$Store == store, 'reg_slope'] <- coeff[2]
combinedTrainingData[combinedTrainingData$Store == store, 'LinearRegressionForecast'] <-
coeff[1] + coeff[2] * combinedTrainingData[combinedTrainingData$Store == store, 'DateDiff']
testingForecast[testingForecast$Store == store, 'LinearRegressionForecast'] <-
coeff[1] + coeff[2] * testingForecast[testingForecast$Store == store, 'DateDiff']
}
predictors <- c('Store', 'WeekOfYear', 'DayOfWeek', 'Promo')
modelForecast <- combinedTrainingData[combinedTrainingData$Open == 1, ] %>%
group_by_(.dots=predictors) %>%
summarize(salesMinusForecast=mean(Sales - LinearRegressionForecast)) %>%
ungroup()
testingForecast <- testingForecast %>%
left_join(modelForecast, by=predictors) %>%
mutate(Sales=salesMinusForecast + LinearRegressionForecast) %>%
select(Id, Store, DayOfWeek, Date, Sales, Open, Promo, StateHoliday,
SchoolHoliday, WeekOfYear, Weekend, DateDiff, LinearRegressionForecast)
testingForecast[!is.na(testingForecast$Open) & testingForecast$Open == 0, 'Sales'] <- 0.0
index <- which(is.na(testingForecast$Sales))
for(i in index) {
iStore <- testingForecast[i, 'Store']
iWeekOfYear <- testingForecast[i, 'WeekOfYear']
iDayOfWeek <- testingForecast[i, 'DayOfWeek']
# 1 - Check to see if we have data for a previous day
iDayOfWeek <- ifelse(iDayOfWeek %in% 2:5, iDayOfWeek - 1, iDayOfWeek)
match = filter(modelForecast,
Store == iStore &
WeekOfYear == iWeekOfYear &
DayOfWeek == iDayOfWeek )
if(dim(match)[1] <= 0)
{
iDayOfWeek <- testingForecast[i, 'DayOfWeek']
# 2 - Check to see if we have data for the next day
iDayOfWeek <- ifelse(iDayOfWeek %in% 1:4, iDayOfWeek + 1, iDayOfWeek)
match = filter(modelForecast,
Store == iStore &
WeekOfYear == iWeekOfYear &
DayOfWeek == iDayOfWeek )
}
iDayOfWeek <- testingForecast[i, 'DayOfWeek']
if(dim(match)[1] <= 0)
{
# 3 - Check to see if we have data for a previous Week
iWeekOfYear <- ifelse(iWeekOfYear > 1, iWeekOfYear - 1, iWeekOfYear)
match = filter(modelForecast,
Store == iStore &
WeekOfYear == iWeekOfYear &
DayOfWeek == iDayOfWeek )
}
iWeekOfYear <- testingForecast[i, 'WeekOfYear']
if(dim(match)[1] <= 0)
{
# 4 - Check to see if we have data for a next Week
iWeekOfYear <- ifelse(iWeekOfYear < 51, iWeekOfYear + 1, iWeekOfYear)
match = filter(modelForecast,
Store == iStore &
WeekOfYear == iWeekOfYear &
DayOfWeek == iDayOfWeek )
}
iWeekOfYear <- testingForecast[i, 'WeekOfYear']
if(dim(match)[1] <= 0)
{
# 5 - Check to see if we have data for two weeks ago
iWeekOfYear <- ifelse(iWeekOfYear > 2, iWeekOfYear - 2, iWeekOfYear)
match = filter(modelForecast,
Store == iStore &
WeekOfYear == iWeekOfYear &
DayOfWeek == iDayOfWeek )
}
iWeekOfYear <- testingForecast[i, 'WeekOfYear']
if(dim(match)[1] <= 0)
{
# 6 - Check to see if we have data for two Weeks later
iWeekOfYear <- ifelse(iWeekOfYear < 50, iWeekOfYear + 2, iWeekOfYear)
match = filter(modelForecast,
Store == iStore &
WeekOfYear == iWeekOfYear &
DayOfWeek == iDayOfWeek )
}
iWeekOfYear <- testingForecast[i, 'WeekOfYear']
if(dim(match)[1] > 0)
{
testingForecast[i, 'Sales'] <-
match[1, 'salesMinusForecast'] +
testingForecast[i, 'LinearRegressionForecast']
if(match[1, 'Promo'] == 0){
testingForecast[i, 'Sales'] <-
testingForecast[i, 'Sales'] *
avgSalesRatios[avgSalesRatios$Store == iStore, 'Ratio']
}
}
}
combinedTrainingTestingData <- rbind(combinedTrainingData[, c(1:4, 6:15)],
testingForecast[, 2:15])
combinedTrainingTestingData[combinedTrainingTestingData$Forecast == 1,
'Type'] <- "Forecast"
combinedTrainingTestingData[combinedTrainingTestingData$Imputed == 0 &
combinedTrainingTestingData$Forecast == 0,
'Type'] <- "Observed"
finalForecast <- data.frame(Id=testingForecast$Id, Sales=testingForecast$Sales)
Then I want to create the forecast plot:
# Convert finalForecast from list to data frame object
df1 <- fortify(finalForecast) %>% as_tibble()
# Create Date column, remove Index column and rename other columns
df1 %<>%
mutate(Date = as.Date(Index, "%Y-%m-%d")) %>%
select(-Index) %>%
rename("Low95" = "Lo 95",
"Low80" = "Lo 80",
"High95" = "Hi 95",
"High80" = "Hi 80",
"Forecast" = "Point Forecast")
df1
### Avoid the gap between data and forecast
# Find the last non-missing value in Data, then use it
# to initialize all forecast columns
lastNonNAinData <- max(which(complete.cases(df1$Data)))
df1[lastNonNAinData,
!(colnames(df1) %in% c("Data", "Fitted", "Date"))] <- df1$Data[lastNonNAinData]
# Build a layered plot that overlays the forecast on the original time series values
ggplot(df1, aes(x = Date)) +
geom_ribbon(aes(ymin = Low95, ymax = High95, fill = "95%")) +
geom_ribbon(aes(ymin = Low80, ymax = High80, fill = "80%")) +
geom_point(aes(y = Data, colour = "Data"), size = 4) +
geom_line(aes(y = Data, group = 1, colour = "Data"),
linetype = "dotted", size = 0.75) +
geom_line(aes(y = Fitted, group = 2, colour = "Fitted"), size = 0.75) +
geom_line(aes(y = Forecast, group = 3, colour = "Forecast"), size = 0.75) +
scale_x_date(breaks = scales::pretty_breaks(), date_labels = "%b %y") +
scale_colour_brewer(name = "Legend", type = "qual", palette = "Dark2") +
scale_fill_brewer(name = "Intervals") +
guides(colour = guide_legend(order = 1), fill = guide_legend(order = 2)) +
theme_bw(base_size = 42)
So the question is: how can I save the forecast plot (in the format produced by the plot code above) for each store in a PDF file? As output I need a single PDF file with 1115 forecast plots (i.e. one plot per store).
I cannot run your code, so here is a generic answer:
# Compute your forecasts by store
forecasts <- list()
# Create PDF
pdf(file = path_to_file, width = your_width, height = your_height)
# Iterate over your forecasts
for (f in forecasts) {
# Plot forecast f
pl <- ggplot(f)
# Print forecast to new page in PDF file
print(pl)
}
# Close the PDF device
dev.off()
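As a runnable illustration of the same pattern (open one PDF device, draw one plot per loop iteration, close the device), here is a small sketch with simulated data for three hypothetical stores. It uses base graphics so that it runs without extra packages; with ggplot2 the only difference is that each plot object has to be passed to print() inside the loop, as in the generic answer. The file name and the simulated sales values are assumptions for the example.

```r
set.seed(1)

# Simulated six-week (42-day) forecasts for three hypothetical stores
stores <- 1:3
forecasts <- lapply(stores, function(s) {
  data.frame(Day = 1:42, Sales = 5000 + 100 * s + cumsum(rnorm(42, sd = 50)))
})
names(forecasts) <- paste("Store", stores)

# One PDF file; every call to plot() starts a new page
out_file <- file.path(tempdir(), "forecasts_by_store.pdf")
pdf(file = out_file, width = 8, height = 5)
for (store_name in names(forecasts)) {
  f <- forecasts[[store_name]]
  plot(f$Day, f$Sales, type = "l", main = store_name,
       xlab = "Day", ylab = "Forecast sales")
}
# Close the device so the file is written completely
invisible(dev.off())

file.exists(out_file)
```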

Turning a Presence/Absence Matrix into a Cluster Analysis in R Studio

Okay, I have a presence/absence matrix of 6 samples with 25 possibilities of presence/absence.
I've been able to make a cluster dendrogram with the data, but I'd rather have it plotted as a distance matrix that looks better and is easier to analyse. (Maybe a cluster plot or something similar?)
I'm really stuck with figuring out the next part - I've spent days searching on here and various other Google searches but nothing is turning up!
Here's the code I've got for the cluster dendrogram:
matrix<-read.csv("Horizontal.csv")
distance<-dist(matrix)
hc.m<-hclust(distance)
plot(hc.m, labels=matrix$Sample, main ="", cex.main=0.8, cex.lab= 1.1)
Help!
> dput(head(matrix,20))
structure(list(Sample = structure(1:6, .Label = c("CL1", "CL2",
"CL3", "COL1", "COL2", "COL3"), class = "factor"), X = c(0L,
0L, 0L, 1L, 1L, 1L), X.1 = c(1L, 0L, 0L, 1L, 1L, 1L), X.2 = c(1L,
1L, 1L, 0L, 0L, 0L), X.3 = c(1L, 1L, 1L, 1L, 1L, 1L), X.4 = c(1L,
1L, 1L, 0L, 0L, 0L), X.5 = c(0L, 0L, 0L, 1L, 1L, 0L), X.6 = c(1L,
1L, 1L, 1L, 1L, 1L), X.7 = c(1L, 1L, 1L, 1L, 1L, 1L), X.8 = c(0L,
0L, 0L, 1L, 1L, 1L), X.9 = c(0L, 0L, 0L, 1L, 1L, 1L), X.10 = c(1L,
1L, 1L, 1L, 1L, 1L), X.11 = c(1L, 1L, 1L, 1L, 1L, 1L), X.12 = c(1L,
1L, 1L, 1L, 1L, 1L), X.13 = c(1L, 0L, 0L, 0L, 0L, 0L), X.14 = c(0L,
0L, 0L, 1L, 1L, 1L), X.15 = c(0L, 0L, 0L, 1L, 1L, 1L), X.16 = c(1L,
1L, 1L, 1L, 0L, 0L), X.17 = c(1L, 1L, 1L, 1L, 1L, 1L), X.18 = c(1L,
1L, 1L, 1L, 1L, 1L), X.19 = c(1L, 1L, 1L, 1L, 1L, 1L), X.20 = c(1L,
1L, 1L, 1L, 1L, 1L), X.21 = c(1L, 1L, 1L, 1L, 0L, 0L), X.22 = c(0L,
0L, 0L, 0L, 1L, 1L), X.23 = c(1L, 1L, 1L, 1L, 1L, 1L), X.24 = c(0L,
1L, 1L, 1L, 1L, 1L)), .Names = c("Sample", "X", "X.1", "X.2",
"X.3", "X.4", "X.5", "X.6", "X.7", "X.8", "X.9", "X.10", "X.11",
"X.12", "X.13", "X.14", "X.15", "X.16", "X.17", "X.18", "X.19",
"X.20", "X.21", "X.22", "X.23", "X.24"), row.names = c(NA, 6L
), class = "data.frame")
Okay with this code:
library(vegan)
library(ggplot2)
library(tidyverse)
library(MASS)
#set working directory
setwd("~/Documents/Masters/BS707/Metagenomics")
#read csv file
cookie<-read.csv("Horizontal.csv")
data.frame(cookie, row.names = c("CL1", "CL2", "CL3", "COL1", "COL2", "COL3"))
df = subset(cookie)
data.frame(df, row.names = c("CL1", "CL2", "CL3", "COL1", "COL2", "COL3"))
dm<- dist(df, method = "binary") #calculate the distance matrix
cmdscale(dm, eig = TRUE, k=2) -> mds
as.tibble(mds$points) #mds coordinates
bind_cols(df, Sample = df$Sample) #bind sample names
mutate(df,group = gsub("\\d$", "", "Sample1"))#remove last digit from sample names to form groups
ggplot(df)+
geom_point (aes(x = "V1",y = "V2", color = "group")) #plot
as.tibble(mds$points) %>% ggplot() + geom_point (aes(x = V1, y = V2))
I get the plot, but each group is named 'Sample' rather than CL1, CL2, CL3, COL1, COL2, COL3. I had to remove the %>% because my R didn't recognise it and gave an error every single time (when I switched to + or deleted it, the code ran fine).
Here is a way to visualize your data in 2 dimensions:
library(tidyverse)
df %>%
dplyr::select(-1) %>% #remove first column
dist(method = "binary") %>% #calculate the distance matrix
cmdscale(eig = TRUE, k = 2) -> mds #do MDS also known as principal coordinates analysis
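as_tibble(mds$points, .name_repair = ~ c("V1", "V2")) %>% #mds coordinates (as.tibble is deprecated; the matrix has no column names, so name them explicitly)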
bind_cols( Sample = df$Sample) %>% #bind sample names
mutate(group = gsub("\\d$", "", Sample)) %>% #remove last digit from sample names to form groups
ggplot()+
geom_point(aes(x = V1,y = V2, color = group)) #plot
or without the tidyverse pipeline (ggplot2 is still needed for the plot):
df_dist <- dist(df[,-1], method = "binary")
mds <- cmdscale(df_dist, eig = TRUE, k = 2)
for_plot <- data.frame(mds$points, group = gsub("\\d$", "", df$Sample))
ggplot(for_plot)+
geom_point(aes(x = X1,y = X2, color = group))
Other options include isoMDS from the MASS package, which performs Kruskal's non-metric multidimensional scaling, and metaMDS from the vegan package, which performs non-metric multidimensional scaling with stable solution from random starts, axis scaling and species scores.
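To make the distance step concrete, here is a minimal base R sketch on a made-up presence/absence matrix (the object pa and its values are illustrative, not the poster's file). With method = "binary", the distance between two samples is the share of features that differ among the features present in at least one of the two samples, so identical samples get distance 0.

```r
# Made-up presence/absence matrix: 6 samples x 6 features
pa <- rbind(
  CL1  = c(0, 1, 1, 1, 0, 1),
  CL2  = c(0, 0, 1, 1, 0, 1),
  CL3  = c(0, 0, 1, 1, 0, 1),
  COL1 = c(1, 1, 0, 0, 1, 1),
  COL2 = c(1, 1, 0, 0, 1, 0),
  COL3 = c(1, 1, 0, 1, 1, 0)
)

# Binary (Jaccard-type) distance between samples
d <- dist(pa, method = "binary")
print(round(as.matrix(d), 2))

# Classical MDS (principal coordinates analysis) in two dimensions
mds <- cmdscale(d, eig = TRUE, k = 2)

# Coordinates plus a group label derived by dropping the trailing digit
for_plot <- data.frame(mds$points, group = gsub("\\d$", "", rownames(pa)))
print(for_plot)
```

The resulting for_plot data frame has the same shape as the one passed to ggplot in the answer above.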

Passing the list of strings as input to a function

I am trying automate a simple task in R using a function.
c is a vector of character strings (column names), and mydata is the dataset.
Basically, I need to give each of the strings in the vector c as an input to the function.
dataset:
mydata <- structure(list(a = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L), b = c(4L,3L, 1L, 2L, 1L, 5L, 2L, 2L), c = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,1L), d = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), t = c(42L, 34L, 74L,39L, 47L, 8L, 36L, 39L), s = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), .Names = c("a", "b", "c", "d", "t", "s"), row.names = c(NA,8L), class = "data.frame")
code:
c<-c("a","b","c","d")
plot<-function()
for (i in c)
{
fit<-survfit(Surv(s,t)~paste(i), dat=mydata)
ggsurvplot(fit, pval = TRUE)
}
plot()
I am facing the following error:
Error in model.frame.default(formula = Surv(mydata$s, mydata$t) ~
paste(i), : variable lengths differ (found for 'paste(i)')
I have tried the reformulate as well:
plot<-function()
for (i in c)
{
survfit(update(Surv(s,t)~., reformulate(i)), data=mydata)
ggsurvplot(fit, pval = TRUE)
}
plot()
but this code also gives this error:
Error in reformulate(i) : object 'i' not found
Any help to make this code work?
Thanks
Building formulas dynamically can be tricky. Rather than
survfit(Surv(mydata$s,mydata$t)~paste(i), dat=mydata)
use
survfit(update(Surv(s,t)~., reformulate(i)), data=mydata)
In the first version, paste(i) evaluates to a single string such as "a", not to the column of that name, which is why the variable lengths differ. You should also avoid using $ with formulas. Here reformulate() builds a formula from a string and update() combines parts of formulas. See the help pages for these functions if you would like more details.
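The formula mechanics can be checked without the survival package. The sketch below is only an illustration: it uses lm() as a stand-in for survfit(), a cut-down copy of mydata, and illustrative names (covariates, fits, bad). It shows the failing pattern, the formulas that update() and reformulate() build, and a loop that fits one model per column name.

```r
# Cut-down copy of the sample data
mydata <- data.frame(
  a = c(1, 1, 1, 1, 0, 0, 1, 0),
  b = c(4, 3, 1, 2, 1, 5, 2, 2),
  t = c(42, 34, 74, 39, 47, 8, 36, 39)
)
covariates <- c("a", "b")

# Failing pattern: paste(i) is the length-1 string "a", not the column a
i <- "a"
bad <- tryCatch(lm(t ~ paste(i), data = mydata), error = function(e) e)
print(conditionMessage(bad))

# reformulate() turns the string into a one-sided formula (~a);
# update() substitutes it for the "." on the right-hand side
surv_formula <- update(Surv(t, s) ~ ., reformulate("a"))
print(surv_formula)

# Same idea in a loop, with lm() standing in for survfit()
fits <- list()
for (v in covariates) {
  f <- update(t ~ ., reformulate(v))
  fits[[v]] <- lm(f, data = mydata)
}
print(lapply(fits, formula))
```

Building the formula object first and passing it to the model function keeps the column lookup inside the data argument, which is what the string-based attempt was missing.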
Here's the full working version with the sample input
#sample input
mydata <- structure(list(a = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L), b = c(4L,3L, 1L, 2L, 1L, 5L, 2L, 2L), c = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,1L), d = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), t = c(42L, 34L, 74L,39L, 47L, 8L, 36L, 39L), s = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), .Names = c("a", "b", "c", "d", "t", "s"), row.names = c(NA,8L), class = "data.frame")
c<-c("a","b","c","d")
and the code
library(survival)
library(survminer)
plot <- function() {
for (i in c) {
fit <- survfit(update(Surv(t,s)~., reformulate(i)), data=mydata)
print(ggsurvplot(fit)) # print() is needed for the plot to be drawn inside a for loop
}
}
plot()
When I copy/paste that into R I do not get any errors. You must be doing something different than the sample code you've posted.
