Related
I have a data frame R_m which looks like this:
data frame R_m
I have used pivot_longer to modify the data for ggplot and then print it:
R_m2 <- R_m %>%
pivot_longer(names_to = "per", values_to="ind", cols=-sim, names_ptypes=list(per=integer()))
ggplot(R_m2, aes(x=per, y=ind,color=sim, group=sim))+geom_line() +
theme(legend.position = "none")
Now, I would like to add a line for mean, median and some quantiles in the graph. Before, I got this data from elsewhere and imported it as a data frame 'stat' into R. It looks like this:
median, mean and quantile
which plotted nicely by adding
geom_line(data=subset(stat,sim=="Median"),colour="black", size = 1)
Outcome looks like this:
Now I would like to achieve the same in R. Creating an array of medians (where median is taken across each column in R_m) was straightforward to do. But then my first stumbling block was that using rbind, cbind etc. I could not create a data frame that looks like 'stat'. I could not rename the zero column to "sim", so the geom_line command above no longer works. Anyway, this requires a lot of data manipulation. Is there a more efficient way of adding a median, mean and percentile lines to a graph?
structure(list(sim = c(166, 37, 163, 65, 95, 92, 98, 168, 19,
157, 177, 200, 115, 177, 149, 130, 66, 114, 96, 12, 138, 39,
80, 33, 157, 107, 180, 159, 166, 14, 126, 67, 190, 86, 147, 182,
43, 5, 109, 141, 53, 186, 49, 68, 168, 107, 67, 28, 158, 178),
per = c(407L, 1763L, 2158L, 1608L, 836L, 1638L, 285L, 1978L,
45L, 1927L, 192L, 1517L, 163L, 789L, 1989L, 2478L, 2410L,
2445L, 1532L, 181L, 1489L, 1434L, 2515L, 676L, 1503L, 2458L,
732L, 1266L, 1705L, 1852L, 1543L, 1568L, 41L, 1992L, 600L,
1314L, 33L, 199L, 370L, 46L, 1171L, 1173L, 2048L, 994L, 836L,
372L, 2374L, 1414L, 1628L, 1188L), ind = c(97.7428137181456,
100.462039003802, 95.2793483563514, 98.3721036305918, 99.0584691732282,
103.301132288618, 102.428408453689, 100.387198613893, 99.7888039221544,
101.017059784079, 106.12288506898, 102.636429823681, 93.8144062244855,
104.280572544198, 97.3182467653953, 96.5603916025096, 96.3529141792467,
98.3149638711415, 98.3629878947972, 94.6106342501915, 99.6835722307572,
98.1716050345778, 103.055895201755, 100.054976695486, 96.1369802984859,
98.5257212288309, 98.8568719059079, 102.900859147552, 99.37215427561,
102.623437273663, 104.128600607447, 102.673062489082, 100.368131206055,
98.3487549118012, 96.4401682804699, 96.4407823981984, 97.9413312935541,
102.122624393907, 98.2979203190445, 101.018531709501, 100.444354410774,
101.118257199515, 100.867412455804, 98.9923953588876, 100.417977446024,
102.21423103019, 102.296794518966, 99.9367239162818, 102.314273028354,
100.80711113148)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
This question aims to receive feedback to make a function more efficient. Apologies for a long, case specific post.
I created a function that calculates percentages in estimates of the American Community Survey (ACS). Because estimates in the ACS have margins of error, calculating percentages (e.g. % of the total population being below 17 years old) requires the recalculation of the error that results from dividing the estimate of both variables (population below 17 / total population).
So to calculate the new margin of error for a proportion calculated as p = estimate_a/estimate_b, the formula to use is MOE(p) = (1/estimate_b)*sqrt((MOE_b^2)-(p^2*MOE_a^2)). If the value inside of the square root was negative, then the substraction should be changed to a sum, with the formula becoming MOE(p) = (1/estimate_b)*sqrt((MOE_b^2)+(p^2*MOE_a^2)). If the result of p = estimate_a/estimate_b is 1, the documentation suggest calculating MOE using another formula: MOE(p) = MOE_a/estimate_b
To make these calculations, I created a function that takes a data frame with estimates and their MOEs, calculates the proportion between two specified variables, and writes two new columns in the original dataframe - one with the proportion, and another one with its margin of error. The function loops through the rows of the data frame carrying out if-else checks to determine what formula to apply, including skipping rows that might have NA values. The original data on which I apply this function is considerably long - ~250000 rows, and the structure of this function makes it go very slowly. Hence, the question is whether there are ways to improve the quality of this code to improve its speed. The function and dummy data are provided below:
percent_calculator <- function(DF, A_e, B_e, A_se, B_se, New_fn){
# arguments legend >> DF = data frame; A_e = estimate_a (string of the fieldname); B_e = estimate_b (string of the fieldname);
# A_se = MOE_a (string of the fieldname); B_se = MOE_b (string of the fieldname); New_fn = root for new fieldname in the data frame (string)
pb<- txtProgressBar(min = 0, max = nrow(DF), initial = 0) # progress bar initialization
for (i in 1:nrow(DF)){ # for loop that iterates through the rows of the DF
setTxtProgressBar(pb,i)
if(is.na(DF[[A_e]][i])==FALSE & is.na(DF[[B_e]][i])==FALSE){ # check if any of the estimates used to calculate the proportion is NA (if so, skip)
if (DF[[B_e]][i]!= 0){ # check if estimate_b is not 0, to avoid creating inf values from A_e/B-e
DF[[paste0(New_fn, "_e")]][i] <- (DF[[A_e]][i]/DF[[B_e]][i])
if(DF[[paste0(New_fn, "_e")]][i] == 1){ # check if P==1 to then use the appropiate formula for MOE
DF[[paste0(New_fn, "_se")]][i] <- (DF[[A_se]][i]/DF[[B_e]][i])
} else {
if((DF[[A_se]][i]^2)-(DF[[paste0(New_fn, "_e")]][i]^2)*(DF[[B_se]][i]^2)>= 0){ # check for the sign of the value inside of the square root
DF[[paste0(New_fn, "_se")]][i] <- (1/DF[[B_e]][i])*sqrt((DF[[A_se]][i]^2)-(DF[[paste0(New_fn, "_e")]][i]^2)*(DF[[B_se]][i]^2))
} else {
DF[[paste0(New_fn, "_se")]][i] <- (1/DF[[B_e]][i])*sqrt((DF[[A_se]][i]^2)+(DF[[paste0(New_fn, "_e")]][i]^2)*(DF[[B_se]][i]^2))
}
}
} else { # assign 0 value if B_e was 0
DF[[paste0(New_fn, "_e")]][i] <- 0
DF[[paste0(New_fn, "_se")]][i] <- 0
}
} else { # assign NA if any of the estimates was NA
DF[[paste0(New_fn, "_e")]][i] <- NA
DF[[paste0(New_fn, "_se")]][i] <- NA
}
DF[[paste0(New_fn, "_e")]][i] <- DF[[paste0(New_fn, "_e")]][i]*100 # switch from proportion to percentage in the estimate value
DF[[paste0(New_fn, "_se")]][i] <- DF[[paste0(New_fn, "_se")]][i]*100 # switch from proportion to percentage in the MOE value
}
return(DF)
}
Dummy <- structure(list(TotPop_e = c(636L, 1287L, 810L, 1218L, 2641L,
835L, 653L, 1903L, 705L, 570L, 2150L, 6013L, 1720L, 2555L, 1150L,
2224L, 1805L, 728L, 2098L, 3099L, 4194L, 1909L, 2401L, 1446L,
1345L, 1573L, 2037L, 634L, 1916L, 1522L, 592L, 831L, 577L, 2196L,
1482L, 1436L, 1668L, 3095L, 3677L, 2641L, 1285L, 932L, 2461L,
1609L, 1143L, 1617L, 1075L, 1280L, 838L, 1447L, 3941L, 2402L,
1130L, 851L, 10316L, 9576L, 2396L, 3484L, 5688L, 2200L, 1856L,
1441L, 2539L, 3056L, 1325L, 2454L, 2010L, 2340L, 1448L, 2435L,
2782L, 3633L, 1766L, 2564L, 1473L, 1214L, 1951L, 2561L, 4262L,
2576L, 4257L, 2314L, 2071L, 3182L, 1839L, 2214L, 1101L, 1898L,
790L, 867L, 1764L, 970L, 1320L, 2850L, 1019L, 1483L, 3720L, 2215L,
3581L, 3391L), TotPop_se = c(132.522796352584, 149.544072948328,
127.051671732523, 130.091185410334, 232.826747720365, 135.562310030395,
100.303951367781, 176.29179331307, 114.285714285714, 96.6565349544073,
339.817629179331, 438.297872340425, 245.592705167173, 324.012158054711,
333.130699088146, 224.924012158055, 321.580547112462, 169.604863221885,
175.075987841945, 469.908814589666, 375.075987841945, 411.550151975684,
378.115501519757, 235.258358662614, 241.337386018237, 291.793313069909,
337.386018237082, 138.601823708207, 145.896656534954, 193.920972644377,
135.562310030395, 117.325227963526, 244.984802431611, 318.54103343465,
207.90273556231, 200, 279.635258358663, 657.750759878419, 401.215805471125,
401.823708206687, 229.787234042553, 139.817629179331, 303.951367781155,
201.215805471125, 200, 252.887537993921, 356.838905775076, 241.945288753799,
238.297872340426, 267.477203647416, 320.9726443769, 255.31914893617,
178.115501519757, 116.109422492401, 891.793313069909, 766.565349544073,
255.31914893617, 463.22188449848, 448.632218844985, 367.781155015198,
269.300911854103, 261.398176291793, 286.93009118541, 446.808510638298,
224.316109422492, 212.158054711246, 233.434650455927, 304.559270516717,
356.231003039514, 275.379939209726, 330.699088145897, 368.996960486322,
248.024316109423, 310.030395136778, 153.799392097264, 243.768996960486,
265.65349544073, 337.386018237082, 436.474164133739, 359.270516717325,
344.072948328268, 196.960486322188, 231.003039513678, 356.231003039514,
212.158054711246, 348.328267477204, 206.079027355623, 240.729483282675,
196.352583586626, 141.033434650456, 215.80547112462, 127.659574468085,
248.024316109423, 589.057750759878, 231.61094224924, 486.93009118541,
605.471124620061, 713.06990881459, 488.753799392097, 382.370820668693
), Under17_se = c(35.8095476596307, 50.9877853224243, 50.0994474845873,
44.7376765786604, 113.994325548832, 59.7386237841673, 22.7862186188344,
95.1285234870203, 42.3093316505904, 35.4621507988699, 143.021311606928,
205.334390935311, 102.292167403598, 115.712493289527, 88.9617416652971,
98.0345650964952, 149.50823698925, 40.0016629212452, 86.7428425216985,
158.047696828218, 173.225615182675, 144.710221534209, 121.094774232467,
76.9999466678128, 88.9160360898593, 97.7665610480423, 133.02517642826,
30.4983051540691, 83.3625069421341, 75.7125713164268, 50.3826325227805,
37.5622898620679, 7.29483282674772, 122.185425418875, 83.4644035953588,
63.8384709681463, 99.5458131127046, 208.446825330589, 150.282359742524,
206.017151858922, 87.7761872483956, 56.194023821941, 120.701992909334,
50.6423479626955, 55.4225960853081, 93.2888100499867, 126.879946773287,
143.069104861932, 86.7747884744339, 79.4517480028886, 140.260959630942,
125.115775875384, 52.187662082273, 38.1819057688564, 365.828168907497,
380.635956883794, 135.735302000757, 213.321896356121, 198.507936644685,
126.535797699776, 141.516048792542, 114.238818548927, 117.737122860635,
165.644292987747, 71.238834852709, 93.0825940979755, 41.8438489710712,
97.0666682368976, 86.5060758100772, 92.8659724484427, 76.6536183156139,
192.822109819002, 101.83958502542, 139.341067042001, 55.3992539361667,
92.106793773051, 78.2330906844691, 115.177918141833, 207.546042154974,
139.609995160777, 153.568552211039, 73.5738128652025, 112.249861520572,
171.38868664475, 66.0687084216098, 181.939713349267, 28.4417934718288,
90.1132509720827, 57.4202669424023, 46.8440239496863, 80.4799857926917,
42.6875862955885, 81.3500156027725, 142.669475129055, 23.4653605661019,
191.159072511375, 159.615857998832, 191.592580855392, 184.123292172321,
125.375425911215), Under17_e = c(123, 284, 189, 228, 661, 180,
49, 500, 121, 115, 686, 1456, 385, 578, 302, 476, 738, 124, 527,
803, 1219, 459, 614, 218, 229, 422, 543, 69, 536, 306, 149, 80,
0, 520, 281, 270, 454, 669, 905, 978, 282, 178, 630, 187, 145,
367, 327, 577, 225, 246, 966, 629, 211, 65, 2857, 3051, 592,
1162, 1322, 464, 490, 264, 576, 617, 326, 695, 169, 381, 309,
476, 355, 915, 431, 869, 269, 358, 335, 650, 1443, 561, 900,
411, 759, 1265, 171, 833, 45, 255, 134, 144, 339, 203, 388, 413,
66, 416, 654, 565, 700, 362)), row.names = c(NA, 100L), class = "data.frame")
# example run to calculate pct of people below 17
Dummy <- percent_calculator(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")
You don't need that loop at all. All your operations are simple arithmetic that can take vectors, instead of single values. This is called vectorization. You then implement your logic tree with a nested ifelse. ifelse does compute all three possible outcomes (which is a bit unnecessary), but that is very much worth it in this case. If you want to optimize further have a look here: Is `if` faster than ifelse?
Timings at the bottom.
percent_calculator_vectorized <- function(DF, A_e, B_e, A_se, B_se, New_fn){
# arguments legend >> DF = data frame; A_e = estimate_a (string of the fieldname); B_e = estimate_b (string of the fieldname);
# A_se = MOE_a (string of the fieldname); B_se = MOE_b (string of the fieldname); New_fn = root for new fieldname in the data frame (string)
e_name <- paste0(New_fn, "_e")
se_name <- paste0(New_fn, "_se")
DF[[e_name]] <- DF[[A_e]] / DF[[B_e]]
DF[[se_name]] <- ifelse(
DF[[e_name]] == 1, # check if P==1 to then use the appropriate formula for MOE
DF[[A_se]] / DF[[B_e]],
ifelse(
(DF[[A_se]]^2)-(DF[[e_name]]^2)*(DF[[B_se]]^2)>= 0, # check for the sign of the value inside of the square root
(1/DF[[B_e]])*sqrt((DF[[A_se]]^2)-(DF[[e_name]]^2)*(DF[[B_se]]^2)),
(1/DF[[B_e]])*sqrt((DF[[A_se]]^2)+(DF[[e_name]]^2)*(DF[[B_se]]^2))
)
)
# assign 0 value if B_e was 0
DF[DF[[B_e]] == 0 & !is.na(DF[[B_e]]), c(e_name, se_name)] <- 0
# switch from proportion to percentage in the estimate value
DF[, c(e_name, se_name)] <- DF[, c(e_name, se_name)] * 100
return(DF)
}
Dummy2 <- percent_calculator(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")
Dummy3 <- percent_calculator_vectorized(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")
all.equal(Dummy2, Dummy3) #TRUE
Timings:
bench::mark(
orig = percent_calculator(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17"),
vect = percent_calculator_vectorized(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17"),
)
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis> <list>
1 orig 17.2ms 18.5ms 53.1 331.2KB 14.8 18 5 339ms <df[,6] … <df[,3] … <bch… <tibble …
2 vect 157.4µs 168µs 5700. 19.4KB 11.6 2450 5 430ms <df[,6] … <df[,3] … <bch… <tibble …
Speed up for this small dataset is ~100x, also with a ~10x smaller memory footprint.
I see you have a working solution but for my own reasons I wanted to try a tidyverse solution. This one is just about as fast as the base R solution and for me would be easier to maintain. I also added some more oddities to your toy data to make sure I caught the edge cases.
library(dplyr)
ACS_recalculator <- function(DF, A_e, B_e, A_se, B_se, New_fn){
e_name <- paste0(New_fn, "_e")
se_name <- paste0(New_fn, "_se")
A_ex <- ensym(A_e)
B_ex <- ensym(B_e)
A_sex <- ensym(A_se)
B_sex <- ensym(B_se)
DF <-
DF %>%
mutate(e_value = ifelse(!!B_ex != 0, !!A_ex / !!B_ex, 0),
se_value = case_when(
!!B_ex == 0 ~ 0,
e_value == 1 ~ !!A_sex / !!B_ex,
((!!A_sex)^2) - (e_value^2 * ((!!B_sex)^2)) >= 0 ~ (1/!!B_ex) * sqrt(((!!A_sex)^2) - (e_value^2) * ((!!B_sex)^2)),
((!!A_sex)^2) - (e_value^2 * ((!!B_sex)^2)) < 0 ~ (1/!!B_ex) * sqrt(((!!A_sex)^2) + (e_value^2) * ((!!B_sex)^2)),
TRUE ~ NA_real_),
e_value = e_value * 100,
se_value = se_value * 100) %>%
rename(!!e_name := e_value,
!!se_name := se_value)
return(DF)
}
Dummy2 <- ACS_recalculator(Dummy2, "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")
head(Dummy2)
#> TotPop_e TotPop_se Under17_se Under17_e P_Bel17_e P_Bel17_se
#> 1 636 132.5228 35.80955 123 19.33962 3.932255
#> 2 1287 149.5441 50.98779 284 22.06682 3.020104
#> 3 810 127.0517 50.09945 189 23.33333 4.986043
#> 4 1218 130.0912 44.73768 228 18.71921 3.081212
#> 5 2641 232.8267 113.99433 661 25.02840 3.709747
#> 6 835 135.5623 59.73862 180 21.55689 6.239876
Your original example data with more missings and zeros
Dummy2 <- structure(list(TotPop_e = c(636L, 1287L, 810L, 1218L, 2641L,
835L, 653L, 1903L, 0L, 570L, 2150L, 6013L, 1720L, 2555L, 1150L,
2224L, 1805L, 728L, 2098L, 3099L, 4194L, 1909L, 2401L, 1446L,
1345L, 1573L, 2037L, 634L, 1916L, 1522L, 592L, 831L, 577L, 2196L,
1482L, 1436L, 1668L, 3095L, 3677L, 2641L, 1285L, 932L, 2461L,
1609L, 1143L, 1617L, 1075L, 1280L, 838L, 1447L, 3941L, 2402L,
1130L, 851L, 10316L, 9576L, 2396L, 3484L, 5688L, 2200L, 1856L,
1441L, 2539L, 3056L, 1325L, 2454L, 2010L, 2340L, 1448L, 2435L,
2782L, 3633L, 1766L, 2564L, 1473L, 1214L, 1951L, 2561L, 4262L,
2576L, 4257L, 2314L, 2071L, 3182L, 1839L, 2214L, NA, 1898L,
790L, 867L, 1764L, 970L, 1320L, 2850L, 1019L, 1483L, 3720L, 2215L,
3581L, 3391L), TotPop_se = c(132.522796352584, 149.544072948328,
127.051671732523, 130.091185410334, 232.826747720365, 135.562310030395,
100.303951367781, 176.29179331307, 114.285714285714, 0,
339.817629179331, 438.297872340425, 245.592705167173, 324.012158054711,
333.130699088146, 224.924012158055, 321.580547112462, 169.604863221885,
175.075987841945, 469.908814589666, 375.075987841945, 411.550151975684,
378.115501519757, 235.258358662614, 241.337386018237, 291.793313069909,
337.386018237082, 138.601823708207, 145.896656534954, 193.920972644377,
135.562310030395, 117.325227963526, 244.984802431611, 318.54103343465,
207.90273556231, 200, 279.635258358663, 657.750759878419, 401.215805471125,
401.823708206687, 229.787234042553, 139.817629179331, 303.951367781155,
201.215805471125, 200, 252.887537993921, 356.838905775076, 241.945288753799,
238.297872340426, 267.477203647416, 320.9726443769, 255.31914893617,
178.115501519757, 116.109422492401, NA, 766.565349544073,
255.31914893617, 463.22188449848, 448.632218844985, 367.781155015198,
269.300911854103, 261.398176291793, 286.93009118541, 446.808510638298,
224.316109422492, 212.158054711246, 233.434650455927, 304.559270516717,
356.231003039514, 275.379939209726, 330.699088145897, 368.996960486322,
248.024316109423, 310.030395136778, 153.799392097264, 243.768996960486,
265.65349544073, 337.386018237082, 436.474164133739, 359.270516717325,
344.072948328268, 196.960486322188, 231.003039513678, 356.231003039514,
212.158054711246, 348.328267477204, 206.079027355623, 240.729483282675,
196.352583586626, 141.033434650456, 215.80547112462, 127.659574468085,
248.024316109423, 589.057750759878, 231.61094224924, 486.93009118541,
605.471124620061, 713.06990881459, 488.753799392097, 382.370820668693
), Under17_se = c(35.8095476596307, 50.9877853224243, 50.0994474845873,
44.7376765786604, 113.994325548832, 59.7386237841673, 22.7862186188344,
95.1285234870203, 42.3093316505904, 35.4621507988699, 143.021311606928,
205.334390935311, 102.292167403598, 115.712493289527, 88.9617416652971,
98.0345650964952, 149.50823698925, 40.0016629212452, 86.7428425216985,
158.047696828218, 173.225615182675, 144.710221534209, 121.094774232467,
76.9999466678128, 88.9160360898593, 97.7665610480423, 133.02517642826,
30.4983051540691, 83.3625069421341, 75.7125713164268, 50.3826325227805,
37.5622898620679, 7.29483282674772, 122.185425418875, 83.4644035953588,
63.8384709681463, 99.5458131127046, 208.446825330589, 150.282359742524,
206.017151858922, 87.7761872483956, 56.194023821941, 120.701992909334,
50.6423479626955, 55.4225960853081, 93.2888100499867, 126.879946773287,
143.069104861932, 86.7747884744339, 79.4517480028886, 140.260959630942,
125.115775875384, 52.187662082273, 38.1819057688564, 365.828168907497,
380.635956883794, 135.735302000757, 213.321896356121, 198.507936644685,
126.535797699776, 141.516048792542, 114.238818548927, 117.737122860635,
165.644292987747, 71.238834852709, 93.0825940979755, 41.8438489710712,
97.0666682368976, 86.5060758100772, 92.8659724484427, 76.6536183156139,
192.822109819002, 101.83958502542, 139.341067042001, 55.3992539361667,
92.106793773051, 78.2330906844691, 115.177918141833, 207.546042154974,
139.609995160777, 153.568552211039, 73.5738128652025, 112.249861520572,
171.38868664475, 66.0687084216098, 181.939713349267, 28.4417934718288,
90.1132509720827, 57.4202669424023, 46.8440239496863, 80.4799857926917,
42.6875862955885, 81.3500156027725, 142.669475129055, 23.4653605661019,
191.159072511375, 159.615857998832, 191.592580855392, 184.123292172321,
125.375425911215), Under17_e = c(123, 284, 189, 228, 661, 180,
49, 500, 121, 115, 686, 1456, 385, 578, 302, 476, 738, 124, 527,
803, 1219, 459, 614, 218, 229, 422, 543, 69, 536, 306, 149, 80,
0, 520, 281, 270, 454, 669, 905, 978, 282, 178, 630, 187, 145,
367, 327, 577, 225, 246, 966, 629, 211, 65, 2857, 3051, 592,
1162, 1322, 464, 490, 264, 576, 617, 326, 695, 169, 381, 309,
476, 355, 915, 431, 869, 269, 358, 335, 650, 1443, 561, 900,
411, 759, 1265, 171, 833, 45, 255, 134, 144, 339, 203, 388, 413,
66, 416, 654, 565, 700, 362)), row.names = c(NA, 100L), class = "data.frame")
Is there a clean way to divide each row of a dataframe by the sum of squared variables, in a time series database ---- where the events are the rows and the columns are the variables ---. At the moment, my method is the following
for(i in 1:nrow(base)){
base[i,] <- base[i,]/(as.numeric(t(c(base[i,]))) %*% as.numeric(t(c(base[i,]))))
}
In order to use the %*% operator the only way I found is the one shown above which uses the as.numeric %>% t %>% c mechanism, which it doesn't seems clean.
For the sake of a reproducible example:
base <-structure(list(Var1 = c(920, 734, 1001, 1033, 752, 837, 734,
817), Var2 = c(4861, 4966, 4855, 3835, 4782, 5348, 4648, 4595
), Var3 = c(5011, 4618, 2718, 4344, 4872, 5076, 4678, 4563),
Var4 = c(4785, 4610, 4697, 4149, 4693, 4866, 4517, 3271),
Var5 = c(5101, 4220, 4444, 4301, 965, 4557, 3524, 4201),
Var6 = c(5059, 4048, 4217, 4397, 3711, 4032, 5478, 4051),
Var7 = c(2134, 1766, 1640, 1837, 1662, 1711, 1838, 1625)), .Names = c("Var1",
"Var2", "Var3", "Var4", "Var5", "Var6", "Var7"), class = c("data.table",
"data.frame"), row.names = c(NA, -8L), .internal.selfref = <pointer: 0x1b8ec38>)
I would also question what you mean by standardize, however here is a simpler way to accomplish what you're doing in the for loop:
sweep(base, 1, rowSums(base^2), `/`)
This is the structure of my data
> dput(test)
structure(list(MAT = c(4.9, 4.9, 15.5, 14.1, 14.1, 14.1, 11.5,
11.5, 11.5, 17, 6.1, 2.7, 2.2, 2.2, 14.1, 14.1, 14.1, 9.5, 9.5,
9.5, 9.5, 9.3, 8.3, 8.266666651, 8.266666651, 4.3, 4.3, 22.3,
14.1, 14.1, 14.1, 8.5, 8.5, 8.5, 8.5, 21.5, 21.5, 3.8, 3.8, 6,
6, 6, 6, 6), es = c(0.29603085763985, 0.421393627439682, 0.189653473156549,
0.226685054608428, 0.291373762079697, 0.166533544378467, 0.250586529054368,
0.146320008054403, 0.199565119644333, -0.0819047677231083, 0.15963948187092,
-0.154628141843561, 0.201121044198443, 0.0867981239977565, 0.543870310978598,
0.34547921143505, 0.37557241352574, -0.287318919407836, 0.207937483228907, 0.190143660810163, 0.276182673435993, 0.128596803172119, 0.454753165843559,
0.399237234440439, 0.32075358541748, 0.362664873575803, -0.0865925288159671,
0.51290512543514, 0.186308318839249, 0.147936083867325, 0.243792477087184,
0.625169403695832, 0.110317782120045, 0.217836235313289, 0.171468156841181,
0.50548821117127, 0.164418265301427, -0.00246305543239786, 0.325552346507191,
0.381240606108843, 0.19337350462531, 0.0408803528990759, 0.321815078821239,
0.307642815014319), var = c(0.00496277337027962, 0.0130962311273343,
0.0180149624217804, 0.0134568083459063, 0.00139708925143695,
0.000725862546533828, 0.00670831011660164, 0.0190783110089115,
0.0641568910090007, 0.0121596544795352, 0.0653909966557582, 0.0514610437228611,
0.0231592619167496, 0.0108989891148006, 0.0588577146414195, 0.0695760532112402,
0.0744256820906048, 0.00997789089155498, 0.00928124381998638,
0.0145009450673482, 0.00652956018299188, 0.0111886178917916,
0.0265943757419349, 0.142676904340634, 0.110705177803624, 0.0576538348777718,
0.0625171635976251, 0.0131652117394448, 0.00947904166717649,
0.00813569411386797, 0.00444289889858652, 0.0673007030900184,0.00545169559098343, 0.240046081413733, 0.00561125010476281,
0.0185516235174018, 0.0179989506841957, 0.0496806959944248, 0.022478393723115,
0.0521209786580004, 0.282298667080106, 0.0151428845076692, 0.00992945920656693, 0.0145544965304081), MAP = c(810, 810, 1140, 1750, 1750, 1750,
1034, 1034, 1034, 720, 645, 645, 645, 645, 1000, 1000, 1000,
691, 691, 691, 691, 1134, 1750, 1326, 1326, 1140, 1140, 1310,
1750, 1750, 1750, 1003, 1003, 1003, 1003, 1750, 1750, 1750, 1750,
1750, 1750, 1750, 1750, 1750), CO2dif = c(162L, 162L, 190L, 165L,
165L, 165L, 200L, 200L, 200L, 150L, 335L, 335L, 335L, 335L, 348L,
348L, 348L, 200L, 200L, 200L, 200L, 220L, 350L, 350L, 350L, 350L,
350L, 350L, 180L, 180L, 180L, 130L, 130L, 130L, 130L, 320L, 320L,
360L, 360L, 345L, 345L, 350L, 348L, 348L)), row.names = c(NA,
-44L), class = "data.frame", .Names = c("MAT", "es", "var", "MAP",
"CO2dif"))
I run model selection using meta-analysis, and the best model to predict the effects size is:
library(metafor)
summary(rma(es, var, data=test ,control=list(stepadj=.5), mods= ~ 1 + log(MAT) + MAP + CO2dif + log(MAT):CO2dif, knha=TRUE))
Model Results:
estimate se tval pval ci.lb ci.ub
intrcpt 1.2556 0.3719 3.3758 0.0017 0.5033 2.0080 **
log(MAT) -0.5740 0.1694 -3.3882 0.0016 -0.9167 -0.2313 **
MAP 0.0001 0.0001 2.5181 0.0160 0.0000 0.0003 *
CO2dif -0.0042 0.0013 -3.2932 0.0021 -0.0067 -0.0016 **
log(MAT):CO2dif 0.0020 0.0005 3.7500 0.0006 0.0009 0.0031 ***
Now I want to plot es vs MAT, with an example with this model, assuming that MAP=1200 mm and CO2dif=350
MAPi <- 1200
CO2i <- 350
make_pct <- function(x) (exp(x) - 1) * 100
ggplot(test, aes(x = log(MAT), y = make_pct(es))) +
geom_abline(aes(intercept = make_pct(1.2556 + 0.0001 * MAPi - 0.0042 * CO2i),
slope = make_pct(log(0.0020 * CO2i)) - make_pct(log(0.5740))) ,
color = "red", size=0.8) +
geom_point() +
theme_classic()
Effect size (es) is in log format, and I want percentage, so I transform it with the function make_pct. MAT, on the other hand, has to be log-transformed in the plot as indicated in the model output. Is the slope of the ggplot above correct with the log and percentage transformations? It seems to me that the slope is rather low. I am not very familiar with this type of plots and transformations, so any tips are welcome. Thanks
The relationship between exp(es)-1 and the explanatory variable log(MAT) is not linear.
For a given set of values of MAP and CO2dif, this relationship is of the form: y = exp(es)-1 = k1*exp(k2*log(MAT)).
This function can be plotted as follows:
library(metafor)
library(ggplot2)
modfit <- rma(es, var, data=test ,control=list(stepadj=.5),
mods= ~ 1 + MAP + log(MAT)*CO2dif, knha=TRUE)
pars <- coef(modfit)
MAPi <- 1200
CO2i <- 350
make_pct <- function(x) (exp(x) - 1) * 100
mod_fun <- function(MAP, MAT, CO2dif, pars) {
y <- pars[1]+pars[2]*MAP+pars[3]*log(MAT)+
pars[4]*CO2dif+pars[5]*log(MAT)*CO2dif
make_pct(y)
}
test$ESpct <- mod_fun(MAPi, test$MAT, CO2i, coef(modfit))
ggplot(test, aes(x = log(MAT), y = make_pct(es))) +
geom_line(aes(y=ESpct), color = "red", size=0.8) +
geom_point() + theme_classic()
I'd like to output a chart similar to the one represented on this page (on the right) using R and any package that would make it look good:
http://processtrends.com/pg_charts_monthly_cycle_chart.htm
Anyone up to the challenge? :)
Thanks!
The stats package in R base already has a function to do this. Here is my one-liner and the output that it produces
monthplot(redata, labels = month.abb, ylab = 'Listings')
Building on this an example of using monthplot for a weekly cycle plot is here (gives full R code and source data): http://figshare.com/figures/index.php/OpenURL_Router_Data:_Total_Requests_by_Weekday
monthplot(ts(sdf$values, frequency = 7, start=c(12,5)), labels = dow, ylab = "No. requests / day", xlab = "Weekday")
which gives this weekly cycle plot:
Example of monthplot to create a graph showing a weekly cycle http://figshare.com/figures/images/a/a7/Total_requests_by_weekday_01_Apr_to_31_Jul_2011.jpeg
Of course no graphical challenge will be complete without a ggplot solution. The tricky bit is to use ddply to summarise the monthly averages, and to pass this as data to a separate layer to ggplot.
library(lubridate)
library(plyr)
library(ggplot2)
df$month <- factor(month(df$dates), levels=1:12, labels=month.abb, ordered=TRUE)
df$year <- year(df$dates)
hline.data <- ddply(df, .(month), summarize, avgvalue=mean(values))
ggplot() +
geom_line(aes(x=year, y=values, group=month), data=df, colour="blue") +
geom_hline(aes(yintercept=avgvalue), data=hline.data, colour="blue", size=2) +
facet_grid(~month) +
opts(axis.text.x = theme_blank()) +
xlab("")
The data:
df <- structure(list(dates = structure(c(8415, 8446, 8474, 8505, 8535,
8566, 8596, 8627, 8658, 8688, 8719, 8749, 8780, 8811, 8839, 8870,
8900, 8931, 8961, 8992, 9023, 9053, 9084, 9114, 9145, 9176, 9204,
9235, 9265, 9296, 9326, 9357, 9388, 9418, 9449, 9479, 9510, 9541,
9570, 9601, 9631, 9662, 9692, 9723, 9754, 9784, 9815, 9845, 9876,
9907, 9935, 9966, 9996, 10027, 10057, 10088, 10119, 10149, 10180,
10210, 10241, 10272, 10300, 10331, 10361, 10392, 10422, 10453,
10484, 10514, 10545, 10575, 10606, 10637, 10665, 10696, 10726,
10757, 10787, 10818, 10849, 10879, 10910, 10940, 10971, 11002,
11031, 11062, 11092, 11123, 11153, 11184, 11215, 11245, 11276,
11306, 11337, 11368, 11396, 11427, 11457, 11488, 11518, 11549,
11580, 11610, 11641, 11671, 11702, 11733, 11761, 11792, 11822,
11853, 11883, 11914, 11945, 11975, 12006, 12036, 12067, 12098,
12126, 12157, 12187, 12218, 12248, 12279, 12310, 12340, 12371,
12401, 12432, 12463, 12492, 12523, 12553, 12584, 12614, 12645,
12676, 12706, 12737, 12767, 12798, 12829, 12857, 12888, 12918,
12949, 12979, 13010, 13041, 13071, 13102, 13132), class = "Date"),
values = c(1093, 1182, 1299, 1372, 1319, 1362, 1239, 1162,
1059, 921, 815, 720, 835, 853, 1034, 1030, 1240, 1388, 1429,
1319, 1231, 1184, 1076, 825, 991, 1093, 854, 808, 1079, 1092,
1220, 1251, 1130, 1131, 1052, 951, 950, 1006, 1112, 1119,
1250, 1322, 1347, 1310, 1215, 1128, 1035, 992, 1079, 1018,
1112, 1224, 1323, 1344, 1326, 1267, 1171, 1075, 916, 932,
888, 904, 939, 1018, 1140, 1174, 1285, 1311, 1298, 1231,
1091, 1088, 991, 1028, 1177, 1322, 1322, 1398, 1389, 1174,
1196, 1115, 756, 496, 693, 673, 748, 777, 820, 948, 966,
1027, 960, 865, 767, 675, 765, 732, 613, 632, 659, 705, 684,
734, 715, 626, 551, 487, 500, 536, 575, 595, 736, 798, 832,
797, 792, 726, 650, 584, 567, 524, 574, 571, 591, 657, 699,
756, 867, 795, 760, 685, 609, 588, 521, 581, 614, 623, 668,
702, 777, 697, 647, 562, 523, 508, 493, 504, 534, 586, 621,
620, 636, 600, 549, 557)), .Names = c("dates", "values"), row.names = c(NA,
-156L), class = "data.frame")
An awful piece of R coding here by me, but it might give you some ideas on how to do it:
This was the data I used taken from the excel file on the linked site:
> dput(redata)
structure(c(1093L, 1182L, 1299L, 1372L, 1319L, 1362L, 1239L,
1162L, 1059L, 921L, 815L, 720L, 835L, 853L, 1034L, 1030L, 1240L,
1388L, 1429L, 1319L, 1231L, 1184L, 1076L, 825L, 991L, 1093L,
854L, 808L, 1079L, 1092L, 1220L, 1251L, 1130L, 1131L, 1052L,
951L, 950L, 1006L, 1112L, 1119L, 1250L, 1322L, 1347L, 1310L,
1215L, 1128L, 1035L, 992L, 1079L, 1018L, 1112L, 1224L, 1323L,
1344L, 1326L, 1267L, 1171L, 1075L, 916L, 932L, 888L, 904L, 939L,
1018L, 1140L, 1174L, 1285L, 1311L, 1298L, 1231L, 1091L, 1088L,
991L, 1028L, 1177L, 1322L, 1322L, 1398L, 1389L, 1174L, 1196L,
1115L, 756L, 496L, 693L, 673L, 748L, 777L, 820L, 948L, 966L,
1027L, 960L, 865L, 767L, 675L, 765L, 732L, 613L, 632L, 659L,
705L, 684L, 734L, 715L, 626L, 551L, 487L, 500L, 536L, 575L, 595L,
736L, 798L, 832L, 797L, 792L, 726L, 650L, 584L, 567L, 524L, 574L,
571L, 591L, 657L, 699L, 756L, 867L, 795L, 760L, 685L, 609L, 588L,
521L, 581L, 614L, 623L, 668L, 702L, 777L, 697L, 647L, 562L, 523L,
508L, 493L, 504L, 534L, 586L, 621L, 620L, 636L, 600L, 549L, 557L
), .Dim = 12:13, .Dimnames = list(c("Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), c("X1993",
"X1994", "X1995", "X1996", "X1997", "X1998", "X1999", "X2000",
"X2001", "X2002", "X2003", "X2004", "X2005")))
And here's my woeful coding... so much cleanup is possible here, but it is a quick test of the possibilities.
monthnames <- c(
"Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec"
)
# size of window
windows(w=6,h=3)
# margins
par(
mar=c(5.1,5.1,2.1,2.1),
cex.axis=0.7
)
# set up plot with the number of categories and the y limits
# yaxs="i" sets the yaxis as having no separation from the corner point
ylimlp <- c(0,max(redata))*1.06
plot(1:156, type="n", xaxt="n", ylim=ylimlp, ann=FALSE, yaxs="i", xaxs="i", bty="l", las="1")
abline(v=seq(13,156,13),lty=1,col="grey")
title(xlab="Month", col.lab=rgb(0,0,0), font.lab=2, cex.lab=0.75)
title(ylab="Listings", col.lab=rgb(0,0,0), font.lab=2, cex.lab=0.75)
lines(redata[1,],type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*1),redata[2,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*2),redata[3,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*3),redata[4,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*4),redata[5,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*5),redata[6,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*6),redata[7,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*7),redata[8,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*8),redata[9,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*9),redata[10,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*10),redata[11,]),type="l",pch=NA,lwd=1,col="grey")
lines(c(rep(NA,13*11),redata[12,]),type="l",pch=NA,lwd=1,col="grey")
redatamonthmean <- apply(redata,1,mean)
lines(rep(redatamonthmean[1],13),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*1),rep(redatamonthmean[2],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*2),rep(redatamonthmean[3],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*3),rep(redatamonthmean[4],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*4),rep(redatamonthmean[5],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*5),rep(redatamonthmean[6],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*6),rep(redatamonthmean[7],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*7),rep(redatamonthmean[8],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*8),rep(redatamonthmean[9],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*9),rep(redatamonthmean[10],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*10),rep(redatamonthmean[11],13)),type="l",pch=NA,lwd=1,col="black")
lines(c(rep(NA,13*11),rep(redatamonthmean[12],13)),type="l",pch=NA,lwd=1,col="black")
mtext(monthnames[1], side=1, cex=0.7, at=6.5)
mtext(monthnames[2], side=1, cex=0.7, at=6.5*3)
mtext(monthnames[3], side=1, cex=0.7, at=6.5*5)
mtext(monthnames[4], side=1, cex=0.7, at=6.5*7)
mtext(monthnames[5], side=1, cex=0.7, at=6.5*9)
mtext(monthnames[6], side=1, cex=0.7, at=6.5*11)
mtext(monthnames[7], side=1, cex=0.7, at=6.5*13)
mtext(monthnames[8], side=1, cex=0.7, at=6.5*15)
mtext(monthnames[9], side=1, cex=0.7, at=6.5*17)
mtext(monthnames[10], side=1, cex=0.7, at=6.5*19)
mtext(monthnames[11], side=1, cex=0.7, at=6.5*21)
mtext(monthnames[12], side=1, cex=0.7, at=6.5*23)
And the example image