Calculate percentage to total using rowPercents

I am trying to calculate each value as a percentage of its row total for, let's say, the following reproducible example:
structure(c(197.95, 197.95, 197.95, 186.8, 190.51, 195.16, 199.81,
202.59, 202.59, 202.59, 92.28, 92.28, 90.07, 89.82, 87.36, 87.61,
90.56, 89.82, 90.07, 89.82, 20.43, 20.43, 20.43, 20.43, 20.43,
20.43, 20.43, 20.43, 20.43, 20.64, 24.7, 24.95, 24.54, 23.97,
23.97, 24.38, 24.38, 24.38, 24.54, 24.54, 37.4, 37.4, 37.4, 35.43,
35.43, 35.43, 35.43, 35.43, 35.43, 39.37, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 16.05,
16.05, 16.05, 16.05, 15.62, 15.62, 16.05, 15.62, 15.62, 15.62,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), index = structure(c(470620800,
470880000, 470966400, 471052800, 471139200, 471225600, 471484800,
471571200, 471657600, 471744000), tzone = "UTC", tclass = "Date"), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", class = c("xts",
"zoo"), .Dim = c(10L, 9L), .Dimnames = list(NULL, c("AVON", "BA.",
"CMRG", "COB", "MGGT", "QQ.", "RR.", "SNR", "ULE")))
I need to return the same presentation of my data, but with each value expressed as a percentage of the total of the row it belongs to. I did a lot of research and tried prop.table, which returns a subscript error, and finally used rowPercents from the RcmdrMisc package. However, I could not find how to make it ignore the NAs in my data set.
In the example provided there are two whole columns of NA. I cannot drop them, as the full data set has values in those columns for subsequent rows.
Note that the class of my example is zoo and xts.

You don't need any external packages for this (dat is the xts object from the question):
dat.percent <- dat / rowSums(dat, na.rm = TRUE) * 100
Check that it works:
> all(abs(rowSums(dat.percent, na.rm = TRUE) - 100) < 0.0001)
[1] TRUE
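As a small self-contained illustration of the same idea on a plain matrix (the values below are made up, not taken from the question): dividing by rowSums(..., na.rm = TRUE) works because R recycles the vector of row totals down each column, so every element ends up divided by the total of its own row. A sketch, assuming NA cells should simply be left out of the totals:

```r
# Toy matrix with missing values (hypothetical numbers, not the question's data)
m <- matrix(c(50, 30, NA,
              20, 10, 10,
              NA, NA, 40),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("A", "B", "C")))

# Row totals that skip NA cells
row_totals <- rowSums(m, na.rm = TRUE)

# The length-3 vector is recycled column by column,
# so m[i, j] is divided by row_totals[i]
pct <- m / row_totals * 100

print(pct)
```

NA cells stay NA in the result. A row that is entirely NA would have a total of 0 and would also stay NA after the division.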

prop.table does not seem to work with xts/zoo objects directly, but it works on the underlying matrix (x is the object from the question):
library(xts)
prop.table(coredata(x), 1)
It returns all NAs, which is correct since there is an NA in each row (and it is impossible to calculate the proportions without knowing every value). If you want to treat the NA values as zero instead:
prop.table( na.fill(coredata(x), 0), 1)
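To see the difference between the two calls without xts, here is a sketch on a small base R matrix. The values are made up, and replacing NA with 0 by indexing stands in for na.fill here:

```r
# Toy 2 x 2 matrix; the second row contains an NA
m <- matrix(c(1, 3,
              NA, 4),
            nrow = 2, byrow = TRUE)

# Row proportions: a row containing an NA comes back as all NA
p_strict <- prop.table(m, 1)

# Treat NA as zero first, then compute row proportions
m_zero <- m
m_zero[is.na(m_zero)] <- 0
p_zero <- prop.table(m_zero, 1)

print(p_strict)
print(p_zero)
```

With the zero-fill approach the formerly missing cells show up as 0 in the result rather than NA, which may or may not be what you want.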

Related

replace values in a data frame with an NA based on the occurrence of NAs in lookup table

I want to replace values in a data frame in R with NA based on the occurrence of NAs in a lookup table, as in the example below, where values_new is the desired result.
lookup <- data.frame(date1=c("2018-02-21", "2019-01-14", "2019-01-14", "2019-01-14"),
date2=c("2018-08-22", "2019-01-14", "2019-01-14", NA),
date3=c("2018-10-03", "2019-01-14", NA, NA),
date4=c("2018-10-31", NA, NA, NA)
)
values <- data.frame(val1=c(22.2, 42.1, 38.2, 41.9),
val2=c(23.8, 40.5, 38.5, 39.7),
val3=c(24.2, 39.8, 40.2, NA),
val4=c(27.0,40.1, NA, NA)
)
values_new <- data.frame(val1=c(22.2, 42.1, 38.2, 41.9),
val2=c(23.8, 40.5, 38.5, NA),
val3=c(24.2, 39.8, NA, NA),
val4=c(27.0,NA, NA, NA)
)

We may multiply by a mask that is NA wherever lookup is NA and 1 elsewhere:
values2 <- values * NA^(is.na(lookup))
Or assign NA directly:
values[is.na(lookup)] <- NA
Checking:
> identical(values, values_new)
[1] TRUE
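The NA^(...) expression relies on two facts about R arithmetic: NA^0 is 1 and NA^1 is NA, with FALSE and TRUE coerced to 0 and 1. A short sketch on a plain vector (toy values, with variable names chosen only for illustration):

```r
# TRUE marks positions that should become NA
mask <- c(FALSE, TRUE, FALSE)

# NA^FALSE is 1, NA^TRUE is NA
multiplier <- NA^mask

vals <- c(22.2, 42.1, 38.2)

# Multiplying by 1 keeps a value, multiplying by NA blanks it out
masked <- vals * multiplier

print(multiplier)
print(masked)
```

The direct assignment form (values[is.na(lookup)] <- NA) is arguably easier to read; the multiplication form is handy when you want a new object in a single expression.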

Multiple individual graphs from a unique dataframe

I know there are existing questions on similar topics, but even with those I was not able to resolve the issue on my own. I am sorry if this appears to be a duplicate, but I am a bit stuck.
I have to draw nearly 40 graphs representing body temperature variations across 24 hours (one graph per individual in the study). To do that, I tried to write a loop using the dplyr and ggplot2 packages. You will find below an example of my data. There are numerous missing values, but I don't think that is an issue for the current question.
structure(list(heures = structure(1:13, .Label = c("01:00:00",
"03:00:00", "05:00:00", "07:00:00", "08:00:00", "10:00:00", "12:00:00",
"13:30:00", "15:00:00", "17:00:00", "19:00:00", "21:00:00", "23:00:00"
), class = "factor"), x1= c(36.55, 36.5, 36.44444444,
36.6, 36.86666667, 37.26, 37, NA, NA, 37.3, 37.1, 37, 35.6),
x2 = c(NA, 34.5, 35.4, 36.1, NA, NA, NA, NA, NA,
NA, NA, NA, NA), x3 = c(36.9, 36.4, NA, NA, 36.9,
NA, NA, NA, NA, 37.5, 37.5, 36.9, 37.1), x4 = c(36,
35.8, NA, NA, NA, 37.4, 36.7, 36.3, NA, 37.5, 37, NA, NA)), class = "data.frame", row.names = c(NA,
-13L))
So far, I have written the following code with "indiv" being a dataframe containing the above presented data.
names <- c(colnames(indiv))
graph <- list()
test <- function(df, names) {
for (i in 1:length(df)) {
name <- names[i]
stock <- df %>%
filter(heures, !!name)
graph[[i]] <- ggplot(data=stock, aes(x=heures, y=stock[,2])) +
geom_point() +
labs(x="Hours (HH:MM:SS)",
y="Temperature",
title=colnames(stock[2]))
}
return(graph)
}
It returns the following warning, which seems to indicate that the filter function is not doing what I intended:
Warning messages:
1: In Ops.factor(~heures, ~"x1") :
‘&’ not meaningful for factors
I can't figure out what I'm doing wrong here. I also tried code without the dplyr part of the current loop, but it didn't give me the wanted output either.
Thank you in advance for your advice.

I came up with this idea: tidy the dataset a little to make it easier to use with ggplot, then split it and store the resulting data frames in a list. Then I use lapply, rather than a loop, together with a custom function to create the plots.
This is not a very fast way if you have a lot of data, but I use this trick a lot with small datasets.
This code creates a separate plot for each individual (not facets).
library(tidyverse) # not all of the tidyverse packages are needed here
df = structure(list(heures = structure(1:13, .Label = c("01:00:00",
"03:00:00", "05:00:00", "07:00:00", "08:00:00", "10:00:00", "12:00:00",
"13:30:00", "15:00:00", "17:00:00", "19:00:00", "21:00:00", "23:00:00"
), class = "factor"), x1= c(36.55, 36.5, 36.44444444,
36.6, 36.86666667, 37.26, 37, NA, NA, 37.3, 37.1, 37, 35.6),
x2 = c(NA, 34.5, 35.4, 36.1, NA, NA, NA, NA, NA,
NA, NA, NA, NA), x3 = c(36.9, 36.4, NA, NA, 36.9,
NA, NA, NA, NA, 37.5, 37.5, 36.9, 37.1), x4 = c(36,
35.8, NA, NA, NA, 37.4, 36.7, 36.3, NA, 37.5, 37, NA, NA)), class = "data.frame", row.names = c(NA,
-13L))
# tidy your data, good practice makes it easier to plot things with ggplot
df = df %>% pivot_longer(2:ncol(df), names_to = "individual", values_to = "temperature")
# I would do it this way:
df_list = split(df, df$individual)
plot_fun = function(df) {
title = unique(df$individual)
ggplot(df, aes(x=heures, y=temperature))+
geom_point() +
labs(title = title)
#### add here things to save your plots, store them somewhere, etc
}
lapply(df_list, FUN = plot_fun)
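For what it's worth, the warning in the question comes from filter(), which keeps rows matching logical conditions, whereas the loop needs to pick columns. A minimal base R sketch of that column-picking step, using a shortened, made-up version of the data (the names indiv_small and per_individual are illustrative only):

```r
# Shortened toy version of the question's data frame (hypothetical values)
indiv_small <- data.frame(
  heures = factor(c("01:00:00", "03:00:00", "05:00:00")),
  x1 = c(36.55, 36.5, NA),
  x2 = c(NA, 34.5, 35.4)
)

# Every column except the time column is one individual
individuals <- setdiff(names(indiv_small), "heures")

# Build one two-column data frame per individual, dropping missing readings
per_individual <- lapply(individuals, function(name) {
  out <- indiv_small[, c("heures", name)]   # select columns by name
  names(out)[2] <- "temperature"
  out[!is.na(out$temperature), ]
})
names(per_individual) <- individuals

print(per_individual$x1)
```

Each element of per_individual could then be handed to a plotting function, much like the split-and-lapply approach above.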

Using toy data, as your data frame is incomplete:
library(tidyverse)
df <- tibble(
X=rep(1:10, times=2),
Y=c(1:10, seq(10, 1, -1)),
Name=rep(c("Patient 1", "Patient 2"), each=10)
)
df %>% ggplot() +
geom_line(aes(x=X, y=Y)) +
facet_grid(rows=vars(Name))
This gives a single figure with one panel per patient.

How to set scale_x_date for one line in plot

I want to plot a figure with 2 lines (inner bay and outer bay), but I have 2 columns of data for inner bay. I need to set the date limits to switch columns on a certain date. In the data below, I want to plot IB.y from 2015-09-08 to 2015-09-23 and IB.x from 2015-09-24 to 2015-10-07. And then I want to plot all of OB.
The code below plots both IB.x and IB.y for the entire date range rather than split at 2015-09-24 as desired:
AllDailyMean = ggplot(AllMean, aes(x=Date)) +
geom_line(aes(y=IB.x, color = "Inner Bay"), size = 0.5) +
geom_ribbon(aes(ymin=IBMin.x, ymax = IBMax.x), fill = "coral2", alpha = 0.2, linetype = 3) +
scale_x_date(limits = as.Date(c("2015-09-08", "2015-09-23"))) +
geom_line(aes(y=IB.y, color = "Inner Bay"), size = 0.5) +
geom_line(aes(y=OB, color = "Outer Bay"), size = 0.5) +
geom_ribbon(aes(ymin=IBMin.y, ymax=IBMax.y), fill = "coral2", alpha = 0.2, linetype = 3) +
geom_ribbon(aes(ymin=OBMin, ymax=OBMax), fill = "skyblue4", alpha = 0.2, linetype = 3) +
scale_x_date(labels = date_format("%b '%y"), date_breaks = "2 months") +
labs(y = expression(atop("Mean Daily Temp", paste(("°C"%+-%"Max/Min")))), x = "Date")
structure(list(Date = structure(c(16686, 16687, 16688, 16689,
16690, 16691, 16692, 16693, 16694, 16695, 16696, 16697, 16698,
16699, 16700, 16701, 16702, 16703, 16704, 16705, 16706, 16707,
16708, 16709, 16710, 16711, 16712, 16713, 16714, 16715, 16716
), class = "Date"), IB.x = c(29.7916666666667, 30.0166666666667,
30.075, 30.0875, 29.3666666666667, 29.2291666666667, 28.8875,
28.6826086956522, 28.6041666666667, 28.7125, 28.7416666666667,
28.5166666666667, 28.525, 28.525, 28.5166666666667,
28.3916666666667, 28.3, 28.0875, 27.9541666666667, 27.475,
27.1458333333333, 26.9166666666667, 26.85, 26.9625,
26.4041666666667, 25.95, 25.7416666666667, 25.85,
25.6875, 25.7, 25.7958333333333), IBMax.x = c(30.1, 30.3, 30.4,
30.6, 29.7, 29.4, 29.2, 29, 28.9, 29, 29.1, 28.9, 28.8, 28.7,
28.7, 28.5, 28.5, 28.4, 28.1, 27.9, 27.6, 27.1, 27.1, 27.3, 27.1,
26.3, 26.2, 26.1, 25.9, 26.2, 26.1), IBMin.x = c(29.1, 29.7,
29.8, 29.7, 29, 29, 28.4, 28.2, 28.4, 28.5, 28.6, 27.9, 28.2,
28.3, 28.2, 28.2, 28.1, 27.7, 27.8, 27, 26.8, 26.7, 26.5, 26.7,
25.7, 25.5, 25.4, 25.4, 25.2, 25.2, 25.5), IB.y = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 27.671,
27.6027395833333, 27.25271875, 26.7719895833333, 26.3682604166667,
26.3313229166667, 26.4141875, 26.2628020833333, 26.14065625,
26.1491041666667, 26.2293541666667, 25.7827604166667, 25.44615625,
25.6583854166667, 26.0718645833333), IBMax.y = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 28.177, 28.151,
27.632, 27.187, 26.917, 26.843, 27.237, 26.77, 26.573, 26.622,
26.671, 26.059, 25.913, 26.279, 26.328), IBMin.y = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 27.41, 26.917,
26.77, 26.23, 25.766, 25.717, 25.644, 25.352, 25.255, 25.231,
25.498, 25.523, 25.084, 25.036, 25.766), OB = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 27.794625, 27.5463125,
27.0850208333333, 26.8009375, 26.7057083333333, 26.65728125,
26.535375, 26.46721875, 26.4802604166667, 26.7571145833333,
26.3706145833333, 26.0067395833333, 25.9274166666667,
25.8764895833333, 25.9058333333333), OBMax = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 28.147, 27.998, 27.358,
27.014, 27.136, 27.014, 27.186, 26.965, 27.038, 27.308, 26.646,
26.231, 26.256, 26.329, 26.207), OBMin = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 27.628, 27.21, 26.793,
26.524, 26.304, 26.28, 26.158, 26.158, 25.939, 26.329, 25.939,
25.768, 25.671, 25.574, 25.501)), row.names = 70:100, class =
"data.frame")

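One option is to use a simple ifelse within your y aesthetic. I've removed a bunch of your code as it's superfluous to the problem at hand (it's helpful to provide a minimal reproducible example, removing any irrelevant details). Here df is the data frame from the question:
# In the posted data IB.y is NA up to 2015-09-23, so IB.x is drawn first and IB.y afterwards; swap them if the reverse is wanted
ggplot(df, aes(x=Date)) +
geom_line(aes(y=ifelse(Date <= as.Date("2015-09-23"), IB.x, NA), color = "Inner Bay"), size = 0.5) +
geom_line(aes(y=ifelse(Date > as.Date("2015-09-23"), IB.y, NA), color = "Inner Bay"), size = 0.5) +
geom_line(aes(y=OB, color = "Outer Bay"), size = 0.5) +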
labs(y = expression(atop("Mean Daily Temp", paste(("°C"%+-%"Max/Min")))), x = "Date")
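An alternative sketch, not from the answer above: build the combined series as its own column before plotting, so that a single geom_line can draw it. The toy data frame and the column name IB are assumptions for illustration; only the ifelse step is shown, in base R.

```r
# Toy data straddling the switch date (hypothetical values)
toy <- data.frame(
  Date = as.Date("2015-09-22") + 0:3,
  IB.x = c(28.1, 28.0, 27.9, 27.8),
  IB.y = c(NA, NA, 27.6, 27.3)
)

cutoff <- as.Date("2015-09-23")

# Take IB.x up to and including the cutoff, IB.y afterwards
toy$IB <- ifelse(toy$Date <= cutoff, toy$IB.x, toy$IB.y)

print(toy)
```

A single layer such as geom_line(aes(y = IB, color = "Inner Bay")) could then replace the two ifelse layers, which also avoids a visual gap between the two segments.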

Trouble trying to clean a character vector in R data frame (UTF-8 encoding issue)

I'm having some issues cleaning up a dataset after I manually extracted the data online; I'm guessing these are encoding issues. I am trying to remove the "U+00A0" in the "Athlete" column cells, along with the angle brackets around it. I looked up the corresponding code point and it's "No-Break Space". I'm also not sure how to replace the other characters to make the names legible, e.g. getting U+008A to display as Š.
Subset of data
head2007decathlon <- structure(list(Rank = 1:6, Athlete = c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)", "<U+00A0>Maurice Smith<U+00A0>(JAM)", "<U+00A0>Dmitriy Karpov<U+00A0>(KAZ)", "<U+00A0>Aleksey Drozdov<U+00A0>(RUS)", "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)", "<U+00A0>Aleksey Sysoyev<U+00A0>(RUS)"), Total = c(8676L, 8644L, 8586L, 8475L, 8371L, 8357L), `100m` = c(11.04, 10.62, 10.7, 10.97, 11.12, 10.8), LJ = c(7.56, 7.5, 7.19, 7.25, 7.42, 7.01), SP = c(15.92, 17.32, 16.08, 16.49, 14.12, 16.16), HJ = c(2.12, 1.97, 2.06, 2.12, 2.06, 2.03), `400m` = c(48.8, 47.48, 47.44, 50, 49.4, 48.42), `110mh` = c(14.33, 13.91, 14.03, 14.76, 14.51, 14.59), DT = c(48.75, 52.36, 48.95, 48.62, 44.48, 49.76), PV = c(4.8, 4.8, 5, 5, 5.3, 4.9), JT = c(71.18, 53.61, 59.84, 65.51, 63.28, 57.75), `1500m` = c(275.32, 273.52, 279.68, 276.93, 272.5, 276.16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2007", class = "factor"), Nationality = c(NA, NA, NA, NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
This is what I've tried so far to no success:
1) head2007decathlon$Athlete <- gsub(pattern="\U00A0",replacement="",x=head2007decathlon$Athlete)
2) head2007decathlon$Athlete <- gsub(pattern="<U00A0>",replacement="",x=head2007decathlon$Athlete)
3) head2007decathlon$Athlete <- iconv(head2007decathlon$Athlete, from="UTF-8", to="LATIN1")
4) Encoding(head2007decathlon$Athlete) <- "UTF-8"
5) head2007decathlon$Athlete<- enc2utf8(head2007decathlon$Athlete)

The following would remove the no break space.
head2007decathlon$Athlete <- gsub(pattern="<U\\+00A0>",replacement="",x=head2007decathlon$Athlete)
I'm not sure how to convert the other characters. One problem could be that the codes are not in a format that R recognizes as UTF-8.
One example:
iconv('\u008A', from="UTF-8", to="LATIN1")
This does have an effect, unlike trying to convert the literal text U+008A, although
the output is:
[1] "\x8a"
not the character you want. Hope this helps somehow.
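If, as the question suggests, <U+008A> is meant to display as Š, one possible explanation is that the page was encoded in Windows-1252, where byte 0x8A is Š. Under that assumption, a sketch that works on the escaped text directly is to replace the literal markers. Turning the no-break space into an ordinary space (rather than deleting it) is a choice made here, not something from the answer above:

```r
# Two names copied from the question's data, with the literal <U+....> markers
athlete <- c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)",
             "<U+00A0>Maurice Smith<U+00A0>(JAM)")

# fixed = TRUE treats the pattern as plain text, so "+" needs no escaping
athlete <- gsub("<U+00A0>", " ", athlete, fixed = TRUE)

# Assumption: <U+008A> stands for S with caron (U+0160)
athlete <- gsub("<U+008A>", "\u0160", athlete, fixed = TRUE)

# Drop the leading space left by the first marker
athlete <- trimws(athlete)

print(athlete)
```

This only applies when the strings literally contain the text <U+00A0>, as in the dput output; other markers in the data, such as <e9>, would need their own replacements.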

How to pass arguments to a function inside *apply family functions

I have the following dataset:
dput(tt2)
structure(c(1371.25, NA, 1373.95, NA, NA, 1373, NA, 1373.95,
1373.9, NA, NA, 1374, 1374.15, NA, 1374, 1373.85, 1372.55, 1374.05,
1374.15, 1374.75, NA, NA, 1375.9, 1374.05, NA, NA, NA, NA, NA,
NA, NA, 1375, NA, NA, NA, NA, NA, 1376.35, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 1376.25, NA, 1378, 1376.5, NA, NA, NA, 1378,
1378, NA, NA, 1378.8, 231.9, 231.85, NA, 231.9, 231.85, 231.9,
231.8, 231.9, 232.6, 231.95, 232.35, 232, 232.1, 232.05, 232.05,
232.05, 231.5, 231.3, NA, NA, 231.1, 231.1, 231.1, 231, 231,
230.95, 230.6, 230.6, 230.7, 230.6, 231, NA, 231, 231, 231.45,
231.65, 231.4, 231.7, 231.3, 231.25, 231.25, 231.4, 231.4, 231.85,
231.75, 231.5, 231.55, 231.35, NA, 231.5, 231.5, NA, 231.5, 231.25,
231.15, 231, 231, 231, 231.05, NA), .Dim = c(60L, 2L), .indexCLASS = c("POSIXct",
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "Asia/Calcutta", tzone = "Asia/Calcutta", index = structure(c(1459482300,
1459483766.38983, 1459485231.77966, 1459486697.16949, 1459488162.55932,
1459489627.94915, 1459491093.33898, 1459492558.72881, 1459494025.11864,
1459495490.50847, 1459496955.89831, 1459498421.28814, 1459499887.67797,
1459501353.0678, 1459502818.45763, 1459504283.84746, 1459505749.23729,
1459507214.62712, 1459508680.01695, 1459510145.40678, 1459511610.79661,
1459513076.18644, 1459514541.57627, 1459516007.9661, 1459517474.35593,
1459518939.74576, 1459520405.13559, 1459521870.52542, 1459523335.91525,
1459524804.30508, 1459526269.69492, 1459527735.08475, 1459529200.47458,
1459530667.86441, 1459532134.25424, 1459533600.64407, 1459535066.0339,
1459536531.42373, 1459537996.81356, 1459539702.20339, 1459541167.59322,
1459542634.98305, 1459544100.37288, 1459545565.76271, 1459547031.15254,
1459548496.54237, 1459549961.9322, 1459551429.32203, 1459552894.71186,
1459554360.10169, 1459555829.49153, 1459557294.88136, 1459558760.27119,
1459560225.66102, 1459561691.05085, 1459563160.44068, 1459564625.83051,
1459566091.22034, 1459567557.61017, 1459569028), tclass = c("POSIXct",
"POSIXt"), tzone = "Asia/Calcutta"), .Dimnames = list(NULL, c("A",
"B")), class = c("xts", "zoo"))
I want to learn how to pass arguments to a function inside the apply family functions.
1st example:
Since there are NAs in the data, mean function returns NA. So I want to pass na.rm=TRUE:
tt<-apply.daily(tt2, function(x) sapply(x,mean(na.rm=TRUE)))
But it returns:
Error in mean.default(na.rm = TRUE) :
argument "x" is missing, with no default
2nd example:
I want to use period.sum function that takes only single column values and requires index of the column.
tt<-lapply(tt2, period.sum, endpoints(tt2))
Error in FUN(X[[i]], ...) : NA/NaN/Inf in foreign function call (arg 3)
I know the 2nd example can be solved with period.apply, but since lapply is a general-purpose function, can the 2nd example also be solved using lapply?

You can pass extra arguments to any function of the apply family through the ellipsis (...) argument; see the help page for sapply. apply.daily is a similar wrapper for xts objects and forwards ... to FUN in the same way; see ?apply.daily.
apply.daily(tt2, mean, na.rm=TRUE)
#apply.daily( x, FUN, ...)
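The error in the first example arises because mean(na.rm=TRUE) calls mean immediately with no data, instead of handing the function itself to sapply. A base R sketch of the two usual ways to pass the extra argument, on a toy data frame (not the xts object from the question):

```r
# Toy data with missing values (hypothetical numbers)
df <- data.frame(A = c(1, NA, 3),
                 B = c(10, 20, NA))

# Pass the function itself; extra arguments after it are forwarded via ...
col_means <- sapply(df, mean, na.rm = TRUE)

# Equivalent form with an anonymous function
col_means2 <- sapply(df, function(col) mean(col, na.rm = TRUE))

print(col_means)
```

Both forms give the same per-column result; the first is shorter, while the second is useful when the data is not the first argument of the function being applied.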
