Create mean value plot without missing values count to total - r

Using a dataframe with missing values:
structure(list(id = c("id1", "test", "rew", "ewt"), total_frq_1 = c(54, 87, 10, 36), total_frq_2 = c(45, 24, 202, 43), total_frq_3 = c(24, NA, 25, 8), total_frq_4 = c(36, NA, 104, NA)), row.names = c(NA, 4L), class = "data.frame")
How is it possible to create a bar plot with the mean of every column, excluding the id column, without filling the missing values with 0 but instead leaving the missing values out of the calculation? For example, for total_frq_3: 24 + 25 + 8 = 57 and 57 / 3 = 19.

You can use the colMeans function and pass it na.rm = TRUE so that NA values are ignored.
library(ggplot2)
xy <- structure(list(id = c("id1", "test", "rew", "ewt"),
total_frq_1 = c(54, 87, 10, 36), total_frq_2 = c(45, 24, 202, 43), total_frq_3 = c(24, NA, 25, 8),
total_frq_4 = c(36, NA, 104, NA)),
row.names = c(NA, 4L),
class = "data.frame")
xy.means <- colMeans(x = xy[, 2:ncol(xy)], na.rm = TRUE)
xy.means <- as.data.frame(xy.means)
xy.means$total <- rownames(xy.means)
ggplot(xy.means, aes(x = total, y = xy.means)) +
theme_bw() +
geom_col()
Or just use base graphics:
barplot(height = colMeans(x = xy[, 2:ncol(xy)], na.rm = TRUE))
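To see why na.rm = TRUE matters here, the following small check compares it with zero-filling on the two columns that contain NA. It is only a sketch; the name zero_filled is introduced for illustration and is not part of the answer above.

```r
# Two of the columns from the question, the ones that contain NA
xy <- data.frame(total_frq_3 = c(24, NA, 25, 8),
                 total_frq_4 = c(36, NA, 104, NA))

# Means over the non-missing values only: 57 / 3 and 140 / 2
means_na_rm <- colMeans(xy, na.rm = TRUE)

# For comparison: replacing NA with 0 divides by all four rows instead
zero_filled <- replace(xy, is.na(xy), 0)
means_zero <- colMeans(zero_filled)

print(means_na_rm)
print(means_zero)
```

With na.rm = TRUE the denominator is the number of non-missing values in each column, which is the behaviour asked for in the question.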

Related

Error with using a function to create a new variable with subtraction between variables in R

I have a huge dataset on Marseille's rental property market (named marseilleannonces) which contains several variables:
structure(list(ID = c("af626000-342e-11e8-a56e-8326540c0e87",
"20629290-c926-11e6-a626-abf6d3bf8a25", "8495af50-b92c-11e5-86ef-abf6d3bf8a25",
"a4299b60-11e3-11ea-9589-c1180fadeaa5", "833f81d0-d3da-11ea-b28a-1b6a75606a9a",
"75358b40-6d76-11e5-bb7a-cfb08fbdec46", "8d6f22f3-abc7-11e4-b16a-1100e6029c1e",
"10ed2580-28cb-11e9-bcd9-d3a30a46a7fe", "dd156b70-1534-11e6-afdf-abf6d3bf8a25",
"15688650-2934-11e8-ab89-41d65c7c6457"), TYPE = c("APARTMENT",
"APARTMENT", "APARTMENT", "APARTMENT", "PREMISES", "APARTMENT",
"APARTMENT", "APARTMENT", "APARTMENT", "PREMISES"), SURFACE = c(19,
29, 17, 55, 35, 50, 67, 30, 28, 45), ROOM_COUNT = c(1, 2, 1,
3, 1, 2, 2, 1, 1, NA), PRICE = c(295, 470, 290, 610, 550, 500,
500, 655, 445, 1943), RENTAL_EXPENSES = c(45, NA, NA, NA, NA,
NA, 40, NA, NA, NA), RENTAL_EXPENSES_INCLUDED = c(TRUE, TRUE,
NA, TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
In this dataset, if RENTAL_EXPENSES_INCLUDED=TRUE, the variable PRICE includes the values in RENTAL_EXPENSES, and if RENTAL_EXPENSES_INCLUDED=FALSE, the variable PRICE does not include the values in RENTAL_EXPENSES. My goal is to create a new column, named HC, with prices that do not include the values in RENTAL_EXPENSES. I tried to write a loop:
for(i in 1:length(marseilleannonces$RENTAL_EXPENSES_INCLUDED)){
x = marseilleannonces$RENTAL_EXPENSES_INCLUDED[i]
if(x == TRUE){
marseilleannonces$HC[i] = PRICE[i]-RENTAL_EXPENSES[i]
}
else {
marseilleannonces$HC[i] = PRICE[i]
}
}
R tells me that there is a missing value where TRUE/FALSE is required. Maybe the fact that there are a lot of NAs in my dataset is the problem.
Any advice in the right direction is welcome.
Thanks in advance!
Edit: Based on your comments:
marseilleannonces %>%
mutate(HC = case_when(RENTAL_EXPENSES_INCLUDED == TRUE ~ PRICE - RENTAL_EXPENSES,
RENTAL_EXPENSES_INCLUDED == FALSE ~ PRICE))
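The original error comes from if (NA == TRUE), which evaluates to NA rather than TRUE or FALSE. The case_when version avoids the error but still returns NA for HC whenever RENTAL_EXPENSES_INCLUDED or RENTAL_EXPENSES is NA. A possible NA-tolerant variant is sketched below on a small toy tibble (toy is a made-up name). It assumes that a missing flag or a missing expense means there is nothing to subtract; that is an assumption about the data, not something stated in the question.

```r
library(dplyr)

# Toy subset of the question's columns (values taken from the sample data)
toy <- tibble(
  PRICE = c(295, 470, 290, 500),
  RENTAL_EXPENSES = c(45, NA, NA, 40),
  RENTAL_EXPENSES_INCLUDED = c(TRUE, TRUE, NA, TRUE)
)

res <- toy %>%
  mutate(HC = case_when(
    # %in% returns FALSE (not NA) when the flag is missing
    RENTAL_EXPENSES_INCLUDED %in% TRUE & !is.na(RENTAL_EXPENSES) ~ PRICE - RENTAL_EXPENSES,
    # Assumption: in every other case PRICE is kept unchanged
    TRUE ~ PRICE
  ))

print(res)
```

Whether keeping PRICE unchanged is right for rows with unknown expenses depends on how the data was collected, so it may be safer to leave those rows as NA.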

Label group of plots

I merged nine plots together and I would like to group them based on different characteristics (A, B, C). Is there a simple way to add labels or annotations at the bottom of the plots? When using cowplot or gridExtra I receive the following error:
In as_grob.default(plot) :
Cannot convert object of class list into a grob.
Sample data
list(list(stats = structure(c(43, 96.5, 297.5, 707.5, 778), .Dim = c(5L,
1L)), n = 36, conf = structure(c(136.603333333333, 458.396666666667
), .Dim = 2:1), out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(2, 10.5, 55.5, 102, 128), .Dim = c(5L,
1L)), n = 36, conf = structure(c(31.405, 79.595), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(1, 3, 5.5, 77, 88), .Dim = c(5L,
1L)), n = 36, conf = structure(c(-13.9866666666667, 24.9866666666667
), .Dim = 2:1), out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(531, 632.5, 701, 726.5, 786), .Dim = c(5L,
1L)), n = 36, conf = structure(c(676.246666666667, 725.753333333333
), .Dim = 2:1), out = c(485, 464, 446), group = c(1, 1, 1
), names = ""), list(stats = structure(c(104,
109.5, 113.5, 121, 125), .Dim = c(5L, 1L)), n = 36, conf = structure(c(110.471666666667,
116.528333333333), .Dim = 2:1), out = c(91, 91, 88, 84, 84,
79), group = c(1, 1, 1, 1, 1, 1), names = ""),
list(stats = structure(c(28, 53.5, 83.5, 88, 91), .Dim = c(5L,
1L)), n = 36, conf = structure(c(74.415, 92.585), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(80, 89, 102.5, 153, 236), .Dim = c(5L,
1L)), n = 36, conf = structure(c(85.6466666666667, 119.353333333333
), .Dim = 2:1), out = c(343, 318, 299, 257), group = c(1,
1, 1, 1), names = ""), list(stats = structure(c(7,
12, 22.5, 44, 72), .Dim = c(5L, 1L)), n = 36, conf = structure(c(14.0733333333333,
30.9266666666667), .Dim = 2:1), out = numeric(0), group = numeric(0),
names = ""), list(stats = structure(c(5,
5, 6, 12.5, 21), .Dim = c(5L, 1L)), n = 36, conf = structure(c(4.025,
7.975), .Dim = 2:1), out = numeric(0), group = numeric(0),
names = ""))
Many thanks
I agree with the idea of using ggplot2 graphics with facets, but given your plot objects, you could do something like this (to get you started). I used ggplotify instead of cowplot because I ran into trouble with the figure margins, but you might be able to fix that by changing the null device (not tested).
Edit:
Added individual labels and y axis labels, as well as outer margins. You might have to adjust some of that depending on the output size of your composite plot. This may show you how you could adjust those settings for individual plots. Still, using ggplot2 to generate the plots would make things quite a bit easier.
library(grid)
library(gridExtra)
library(ggplotify)
sdt <- list(list(stats = structure(c(43, 96.5, 297.5, 707.5, 778), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(136.603333333333, 458.396666666667), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(2, 10.5, 55.5, 102, 128), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(31.405, 79.595), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(1, 3, 5.5, 77, 88), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(-13.9866666666667, 24.9866666666667), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(531, 632.5, 701, 726.5, 786), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(676.246666666667, 725.753333333333), .Dim = 2:1),
out = c(485, 464, 446), group = c(1, 1, 1), names = ""),
list(stats = structure(c(104, 109.5, 113.5, 121, 125), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(110.471666666667, 116.528333333333), .Dim = 2:1),
out = c(91, 91, 88, 84, 84, 79), group = c(1, 1, 1, 1, 1, 1), names = ""),
list(stats = structure(c(28, 53.5, 83.5, 88, 91), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(74.415, 92.585), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(80, 89, 102.5, 153, 236), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(85.6466666666667, 119.353333333333), .Dim = 2:1),
out = c(343, 318, 299, 257), group = c(1,1, 1, 1), names = ""),
list(stats = structure(c(7, 12, 22.5, 44, 72), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(14.0733333333333, 30.9266666666667), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""),
list(stats = structure(c(5, 5, 6, 12.5, 21), .Dim = c(5L, 1L)),
n = 36, conf = structure(c(4.025, 7.975), .Dim = 2:1),
out = numeric(0), group = numeric(0), names = ""))
sublabels <- paste0(rep(LETTERS[1:3], each=3), 1:3)
gplts <- lapply(1:9, function(x) as.grob(function(y=sdt[[x]]) {
par(oma=c(0,3,0,3))
bxp(y, ylab="values", main=sublabels[x])}))
grid.arrange(rectGrob(gp=gpar(col="red")), rectGrob(gp=gpar(col="green")),
rectGrob(gp=gpar(col="yellow")), nrow=1, newpage =T)
vp <- viewport(.33/2,0.45, gp = gpar(col="red"))
grid.text("Group A",
y = .1, just = c("center", "bottom"),
gp = gpar(fontsize=20), vp = vp)
vp <- viewport(.5,.45, gp = gpar(col="green"))
grid.text("Group B",
y = .1, just = c("center", "bottom"),
gp = gpar(fontsize=20), vp = vp)
vp <- viewport(1-(.33/2),.45, gp = gpar(col="yellow"))
grid.text("Group C",
y = .1, just = c("center", "bottom"),
gp = gpar(fontsize=20), vp = vp)
grid.arrange(grobs=gplts, nrow=1, newpage=F)
Created on 2021-03-25 by the reprex package (v1.0.0)

How to create differences between several pairs of columns?

I have a panel (cross-sectional time series) dataset. For each group (defined by (NAICS2, occ_type) in time ym) I have many variables. For each variable I would like to subtract each group's first (dplyr::first) value from every value of that group.
Ultimately I am trying to take the Euclidean distance between each row's vector and the vector of its group's first entry (i.e. sqrt(c_1^2 + ... + c_k^2), where c_j is the difference in variable j).
I was able to create a column equal to the first entry for each group:
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
mutate_at(vars(-group_cols(), ends_with("_scf")),
list(first = dplyr::first))
I then tried to include variations of f.diff = . - dplyr::first(.) in the list, but none of those worked. I googled the dot notation for a while as well as first and lag in dplyr timeseries but have not been able to resolve this yet.
Ideally, I unite all variables into a vector for each row first and then take the difference.
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
unite(vector, c(-group_cols(), ends_with("_scf")), sep = ',') %>%
# TODO: DISTANCE_BETWEEN_ENTRY_AND_FIRST
mutate(vector.diff = ???)
I expect the output to be a numeric column that contains a distance measure of how different each group's row vector is from its initial row vector.
Here is a sample of the data:
structure(list(ym = c("2007-01-01", "2007-02-01"), NAICS2 = c(0L,
0L), occ_type = c("is_middle_manager", "is_middle_manager"),
Administration_scf = c(344, 250), Agriculture..Horticulture..and.the.Outdoors_scf = c(11,
17), Analysis_scf = c(50, 36), Architecture.and.Construction_scf = c(57,
51), Business_scf = c(872, 585), Customer.and.Client.Support_scf = c(302,
163), Design_scf = c(22, 17), Economics..Policy..and.Social.Studies_scf = c(7,
7), Education.and.Training_scf = c(77, 49), Energy.and.Utilities_scf = c(25,
28), Engineering_scf = c(90, 64), Environment_scf = c(19,
19), Finance_scf = c(455, 313), Health.Care_scf = c(105,
71), Human.Resources_scf = c(163, 124), Industry.Knowledge_scf = c(265,
174), Information.Technology_scf = c(467, 402), Legal_scf = c(21,
17), Maintenance..Repair..and.Installation_scf = c(194, 222
), Manufacturing.and.Production_scf = c(176, 174), Marketing.and.Public.Relations_scf = c(139,
109), Media.and.Writing_scf = c(18, 20), Personal.Care.and.Services_scf = c(31,
16), Public.Safety.and.National.Security_scf = c(14, 7),
Religion_scf = c(0, 0), Sales_scf = c(785, 463), Science.and.Research_scf = c(52,
24), Supply.Chain.and.Logistics_scf = c(838, 455), total_scf = c(5599,
3877)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), groups = structure(list(ym = c("2007-01-01",
"2007-02-01"), NAICS2 = c(0L, 0L), occ_type = c("is_middle_manager",
"is_middle_manager"), .rows = list(1L, 2L)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
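One possible approach, sketched on toy data: subtract each group's first value with across(), then take the row-wise square root of the summed squares. The columns a_scf and b_scf and the names toy, res and dist are hypothetical. The sketch also assumes groups are defined by (NAICS2, occ_type) only, with ym giving the order inside each group, because grouping by ym as well would leave one row per group.

```r
library(dplyr)

# Toy panel with two hypothetical "_scf" variables
toy <- data.frame(
  ym = c("2007-01-01", "2007-02-01", "2007-01-01", "2007-02-01"),
  NAICS2 = c(0L, 0L, 1L, 1L),
  occ_type = "is_middle_manager",
  a_scf = c(3, 6, 10, 10),
  b_scf = c(4, 8, 5, 17)
)

res <- toy %>%
  arrange(NAICS2, occ_type, ym) %>%
  group_by(NAICS2, occ_type) %>%
  # Difference between each value and the first value of its group
  mutate(across(ends_with("_scf"), ~ .x - first(.x), .names = "{.col}_diff")) %>%
  ungroup()

# Euclidean norm of the difference vector, row by row
diff_cols <- grepl("_diff$", names(res))
res$dist <- sqrt(rowSums(as.matrix(res[, diff_cols])^2))

print(res[, c("ym", "NAICS2", "dist")])
```

The first row of every group gets a distance of 0 by construction, and no unite() step is needed because the differences stay numeric.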

Subsetting and plotting data by TimeStamp

I have a data.frame P1 (5000rows x 4cols) and would like to save the subset of data in columns 2,3 and 4 when the time-stamp in column 1 falls into a set range determined by a vector TimeStamp (in seconds).
E.g. put all values in columns 2, 3, and 4 into a new data.frame and call each section of data: Condition.1.P1, Condition.2.P1, etc.
The reason I'd like to label each section separately is that I have 35 versions of P1 (P2, P3, P33, etc.) and need to be able to melt them together to plot them.
dput(TimeStamp)
c(18, 138, 438, 678, 798, 1278, 1578, 1878, 2178)
dput(head(P1))
structure(list(Time = c(0, 5, 100, 200, 500, 1200), SkinTemp = c(27.781,
27.78, 27.779, 27.779, 27.778, 27.777), HeartRate = c(70, 70,
70, 70, 70, 70), RespirationRate = c(10, 10, 10, 10, 10, 10)), .Names = c("Time",
"SkinTemp", "HeartRate", "RespirationRate"), row.names = c(NA,
6L), class = "data.frame")
Do you want to separate the data by the timestamp range and put it in a list? Then this might be what you are looking for:
TimeStamp <- c(18, 138, 438, 678, 798, 1278, 1578, 1878, 2178)
dat <- structure(list(Time = c(0, 5, 100, 200, 500, 1200), SkinTemp = c(27.781,
27.78, 27.779, 27.779, 27.778, 27.777), HeartRate = c(70, 70,
70, 70, 70, 70), RespirationRate = c(10, 10, 10, 10, 10, 10)), .Names = c("Time",
"SkinTemp", "HeartRate", "RespirationRate"), row.names = c(NA,
6L), class = "data.frame")
dat$Segment <- cut(dat$Time,c(-Inf,TimeStamp))
split(dat,dat$Segment)
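To get the Condition.1, Condition.2, ... names asked for in the question, cut() accepts a labels argument. The sketch below assumes one label per interval and keeps only columns 2 to 4 in each piece; the Condition column and the segments list are names made up for illustration.

```r
TimeStamp <- c(18, 138, 438, 678, 798, 1278, 1578, 1878, 2178)
P1 <- data.frame(Time = c(0, 5, 100, 200, 500, 1200),
                 SkinTemp = c(27.781, 27.78, 27.779, 27.779, 27.778, 27.777),
                 HeartRate = rep(70, 6),
                 RespirationRate = rep(10, 6))

# One label per interval; intervals are right-closed, e.g. (18, 138]
P1$Condition <- cut(P1$Time,
                    breaks = c(-Inf, TimeStamp),
                    labels = paste0("Condition.", seq_along(TimeStamp)))

# Named list of data.frames holding columns 2 to 4 for each condition
segments <- split(P1[, 2:4], P1$Condition)

print(names(segments))
print(segments$Condition.1)
```

Keeping the pieces in one named list, rather than as separate objects such as Condition.1.P1, should make it easier to repeat the step for the other 35 data frames and bind the results before plotting.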
# Alternative: loop over the time-stamp ranges (uses P1 and TimeStamp from the question)
P2 = data.frame(NA, NA, NA, NA) # Placeholder data.frame with one row of NAs
for (i in seq_along(TimeStamp)){
P3 = data.frame() # Temporary data.frame for the current range
if (i == 1) {ts1 = 0} else {ts1 = TimeStamp[i-1]} # First range starts at 0
ts2 = TimeStamp[i]
P3 = subset(P1, P1$Time > ts1 & P1$Time < ts2)[,c(2,3,4)] # Subset the columns and assign to P3
if (nrow(P3) == 0){P3 = data.frame(NA, NA, NA)} # If the subset is empty, assign NA
P3$TimeStamp = paste(ts1,ts2,sep="-") # Append the range label to P3
colnames(P3) = colnames(P2) # Make sure column names are the same to allow rbind
P2 = rbind(P2,P3) # Append P3 to P2
}
P2 = P2[c(2:nrow(P2)),] # Remove the first row (the NA placeholder)
colnames(P2) = c("SkinTemp", "HeartRate", "RespirationRate", "TimeStamp") # Provide column names
rm(P3); rm(i); rm(ts1); rm(ts2) # Cleanup

Compute and save the r-squared value of bootstrap objects in a new dataframe in R

I have a dataframe df
dput(df)
structure(list(x = c(49, 50, 51, 52, 53, 54, 55, 56, 1, 2, 3,
4, 5, 14, 15, 16, 17, 2, 3, 4, 5, 6, 10, 11, 3, 30, 64, 66, 67,
68, 69, 34, 35, 37, 39, 2, 17, 18, 99, 100, 102, 103, 67, 70,
72), y = c(2268.14043972082, 2147.62290922552, 2269.1387550775,
2247.31983098201, 1903.39138268307, 2174.78291538358, 2359.51909126411,
2488.39004804939, 212.851575751527, 461.398994384333, 567.150629704352,
781.775113821961, 918.303706148872, 1107.37695799186, 1160.80594193377,
1412.61328924168, 1689.48879626486, 260.737164468854, 306.72700499362,
283.410379620422, 366.813913489692, 387.570173754128, 388.602676983443,
477.858510450125, 128.198042456082, 535.519377609133, 1028.8780498564,
1098.54431357711, 1265.26965941035, 1129.58344809909, 820.922447928053,
749.343583476846, 779.678206156474, 646.575242339517, 733.953282899613,
461.156280127354, 906.813018662913, 798.186995701282, 831.365377249207,
764.519073183124, 672.076289062505, 669.879217186302, 1341.47673353751,
1401.44881976186, 1640.27575962036)), .Names = c("x", "y"), row.names = c(NA,
-45L), class = "data.frame")
I have created two non-linear regressions (nls1 and nls2) based on my dataset.
library(stats)
nls1 <- nls(y~A*(x^B)*(exp(k*x)),
data = df,
start = list(A = 1000, B = 0.170, k = -0.00295))
nls2<-nls(y~A*x^3+B*x^2+C*x+D, data=df,
start = list(A=0.02, B=-0.6, C= 50, D=200))
I then computed bootstrap objects for these two functions to get multiple sets of parameters (A,B and k for nls1 and A, B, C and D for nls2).
library(nlstools)
Boo1 <- nlsBoot(nls1, niter = 200)
Boo2 <- nlsBoot(nls2, niter = 200)
Based on these bootstrap objects, I would like to compute the r-squared of each combination of parameters and save the min, max and median of my r-squared values for each bootstrap object into one new dataframe. The dataframe could look like new.df.
structure(list(Median = c(NA, NA), Max = c(NA, NA), Min = c(NA,
NA)), .Names = c("Median", "Max", "Min"), row.names = c("nls1",
"nls2"), class = "data.frame")
The idea is then to do some box plots with the median, min and max values for each non-linear model based on bootstrapping to compare them. Can someone help me out with that? Thanks in advance.
Answer from #bunk
library(boot)
# Statistic for boot(): refit both models on the resampled rows and return their AIC
# (note that this compares the fits by AIC, not r-squared)
stat <- function(dat, inds) {
fit <- try(nls(y~A*(x^B)*(exp(k*x)), data = dat[inds,], start = list(A = 1000, B = 0.170, k = -0.00295)), silent=TRUE)
f1 <- if (inherits(fit, "nls")) AIC(fit) else NA
fit2 <- try(nls(y~A*x^3+B*x^2+C*x+D, data = dat[inds,], start = list(A=0.02, B=-0.6, C= 50, D=200)), silent=TRUE)
f2 <- if (inherits(fit2, "nls")) AIC(fit2) else NA
c(f1, f2)
}
res <- boot(df, stat, R=200)
Then, to get medians for example:
apply(res$t, 2, median, na.rm=TRUE)
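If r-squared itself is wanted, one option is a pseudo r-squared defined as 1 - RSS/TSS, computed for every parameter set and then summarised. The sketch below is only an illustration on toy data: pseudo_r2, r2_summary, line_fun and the coefficient matrix are made up here, and 1 - RSS/TSS is just one of several definitions used for non-linear fits.

```r
# Pseudo r-squared: 1 - residual sum of squares / total sum of squares
pseudo_r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Summarise r-squared over a matrix of parameter sets (one row per draw)
r2_summary <- function(coefs, model_fun, x, y) {
  r2 <- apply(coefs, 1, function(p) pseudo_r2(y, model_fun(x, p)))
  c(Median = median(r2), Max = max(r2), Min = min(r2))
}

# Toy illustration: straight-line model with three made-up parameter sets
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
line_fun <- function(x, p) p[["a"]] + p[["b"]] * x
coefs <- rbind(c(a = 0, b = 2), c(a = 1, b = 2), c(a = 0, b = 1))

new.df <- as.data.frame(rbind(toy = r2_summary(coefs, line_fun, x, y)))
print(new.df)
```

For the models in the question, the same helper could be called with the bootstrap coefficient matrix, for example r2_summary(Boo1$coefboot, function(x, p) p[["A"]] * x^p[["B"]] * exp(p[["k"]] * x), df$x, df$y). This assumes that nlsBoot stores one parameter set per row in coefboot, which is worth checking with str(Boo1) first.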
