How to prevent R from rounding in frequency function?

I used the freq function from the frequency package to get frequency percentages on my dataset$MoriskyAdherence, but R gives me rounded percent values. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3, 23.8. Their sum is 100.1.
The exact amounts should be 35.0, 41.25, 23.75.
What should I do?
I tried sprintf, as.data.frame, formatC, and some other functions to deal with it, but...

The function freq returns its tables as character data frames and has no option to adjust the number of decimal places. However, it is easy to recreate the table in whatever form you want. For example, I have written this function, which gives the same result but with two decimal places instead of one:
freq2 <- function(data_frame) {
  df <- frequency::freq(data_frame)
  lapply(df, function(x) {
    # Freq is stored as character; recover the numeric counts
    n <- suppressWarnings(as.numeric(x$Freq))
    # grand total (last row) and the "Total" row of the valid block
    sum_all <- as.numeric(x$Freq[nrow(x)])
    raw_percent <- suppressWarnings(100 * n / sum_all)
    t_row <- grep("Total", x[, 2])[1]
    valid_percent <- suppressWarnings(100 * n / as.numeric(x$Freq[t_row]))
    # rebuild the percentage columns with two decimal places
    x$Percent <- format(round(raw_percent, 2), nsmall = 2)
    x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
    x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
    # blank out the cells that freq leaves empty below the valid block
    x$'Cumulative Percent'[t_row:nrow(x)] <- ""
    x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
    x
  })
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.

Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without round brackets), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x),
                                          maxrow1 = maxrow, trim1 = trim) {
  makefreqs(x1, y, maxrow1, trim1)
})
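Pending such a change, you can also sanity-check the unrounded figures without touching freq at all. A minimal base-R sketch, assuming MoriskyAdherence is the vector from the question:
MoriskyAdherence <- dataset$MoriskyAdherence
tab <- table(MoriskyAdherence)
100 * prop.table(tab)            # exact, unrounded percentages
round(100 * prop.table(tab), 2)  # or rounded to two decimals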
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a dataframe with two variables, one numeric, one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = TRUE)
Var2 <- sample(10:100, 10, replace = TRUE)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then, to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(Var2)*100, 3))
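Note that round changes the stored values themselves; if you only want to control how the numbers print, you can format at the end instead. A small sketch (sprintf returns character, so keep it as the last display step):
res <- aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
res$Var2 <- sprintf("%.3f", res$Var2)  # display formatting only; Var2 becomes character
res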


Extract multiple values from a dataset by subsetting with a vector

I have a data frame called "Navi", with 72 rows that describe all possible combinations of three variables f,g and h.
head(Navi)
f g h
1 40.00000 80 0.05
2 57.14286 80 0.05
3 74.28571 80 0.05
4 91.42857 80 0.05
5 108.57143 80 0.05
6 125.71429 80 0.05
I have a dataset that also contains these 3 variables f,g and h along with several others.
head(dataset1[,7:14])
# A tibble: 6 x 8
h f g L1 L2 Ref1 Ref2 FR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.02 20 100 53 53 0.501 2.00 2
2 0.02 20 260 67 67 0.200 5.01 5.2
3 0.02 20 420 72 71 0.128 7.83 8.4
4 0.02 20 580 72 72 0.0956 10.5 11.6
5 0.02 20 740 73 73 0.0773 12.9 14.8
6 0.02 20 900 72 71 0.0655 15.3 18
What I'm trying to do is:
for each row in the combinations data frame, filter the dataset by the three variables f,g and h.
Then, if there are exact matches, give me the matching rows of this dataset, extract the values in the columns "L1" and "FR", and calculate their average. Save the average values in the vectors "L_M2" and "FR_M2".
If there aren't exact matches, give me the rows where f, g and h in the dataset are closest to f, g and h from the data frame. Then extract all values for L1 and FR in these rows and calculate the average. Save the average values in the vectors "L_M2" and "FR_M2".
What I've already tried:
I created two empty vectors where the extracted values shall be saved later on.
Then I am looping over every row of the combinations data frame, filtering the dataset by f,g and h.
The result would be multiple rows, where the values for f,g and h are the same in the dataset as in the row of the combinations data frame.
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:nrow(Navi)) {
  matchingRows <- dataset1[dataset1$P == input$varP
                           & dataset1$Las == input$varLas
                           & dataset1$Opt == input$varO
                           & dataset1$f == Navi[i, 1]
                           & dataset1$g == Navi[i, 2]
                           & dataset1$h == Navi[i, 3], ]
}
The thing is, I don't know what to do from here on. I don't know how to check for rows with closest values by multiple variables, if there are no exact matches...
I only did something more or less similar in the past, where I only checked for the closest "g" value, like this:
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:nrow(Navi)) {
  matchingRows <- dataset1[dataset1$P == input$varP
                           & dataset1$Las == input$varLas
                           & dataset1$Opt == input$varO
                           & dataset1$f == Navi[i, 1]
                           & dataset1$g == Navi[i, 2]
                           & dataset1$h == Navi[i, 3], ]
  Differences <- abs(Navi[i, 2] - matchingRows$G)
  indexofMin <- which(Differences == min(Differences))
  L_M2 <- append(L_M2, matchingRows$L[[indexofMin]], after = length(L_M2))
  FR_M2 <- append(FR_M2, matchingRows$FR[[indexofMin]], after = length(FR_M2))
}
So can anybody tell me how to achieve this extraction process? I am still pretty new to R, so please tell me if I made a rookie mistake or forgot to include some crucial information. Thank you!
First convert your data into data frames (if not done already):
Navi <- data.frame(Navi)
Savi <- data.frame(dataset1[,7:14])
Then use merge to filter your lines:
df1 <- merge(Navi, Savi, by = c("f","g","h"))
Save "L1" and "FR" average from df1:
Average1 <- ((df1$L1+df1$FR)/2)
Then get the new Navi data frame, i.e. the rows that do not have an exact match on the f, g and h columns:
Navi_new <- Navi[!duplicated(rbind(df1[names(Navi)], Navi))[-seq_len(nrow(df1))], ]
To compare the remaining rows with their nearest match, variable by variable:
A1 <- vapply(Navi_new$f, function(x) x - Savi$f, numeric(nrow(Savi)))
A2 <- apply(abs(A1), 2, which.min)  # nearest Savi row by f, for each Navi_new row
B1 <- vapply(Navi_new$g, function(x) x - Savi$g, numeric(nrow(Savi)))
B2 <- apply(abs(B1), 2, which.min)  # nearest by g
C1 <- vapply(Navi_new$h, function(x) x - Savi$h, numeric(nrow(Savi)))
C2 <- apply(abs(C1), 2, which.min)  # nearest by h
You can then use these index vectors to pull the corresponding "L1" and "FR" values from Savi and average them, as in the exact-match case three steps back.
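Note that A2, B2 and C2 each pick the nearest row for one variable separately, so they can point at three different rows. If you want a single nearest row per combination, here is a sketch that minimises one combined distance over f, g and h (equal weighting is an assumption; rescale the columns first if their ranges differ widely):
nearest <- apply(Navi_new[, c("f", "g", "h")], 1, function(r) {
  d <- abs(Savi$f - r["f"]) + abs(Savi$g - r["g"]) + abs(Savi$h - r["h"])
  which.min(d)  # index of the overall closest Savi row
})
L_M2 <- Savi$L1[nearest]
FR_M2 <- Savi$FR[nearest]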

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
df <- data.frame(Boss = sample(1:3, 20, replace = TRUE),
                 Employee = sample(1:20, 20),
                 n_Projects = sample(50:100, 20, replace = TRUE),
                 Performance = round(sample(1:100, 20, replace = TRUE)/100, 2),
                 stringsAsFactors = FALSE)
df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to be weighted by the n_Projects value over the total n_Projects for that boss.
For example, Boss 1 has a total of 604 n_Projects, where project 1 has a Performance weight of 0.13 (78/604 * 0.97 = 0.13), project 3 a Performance weight of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these Performance weights is the Boss performance, which for Boss 1 is 0.52. So, the final output should be like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
In addition to this problem, can you give me any feedback about this problem (my code, specifically) or any recommendation to improve my data-manipulation skills? (You can see in my profile that I have asked a lot of questions like this, but I'm still not able to solve them on my own.)
We can get the sum of the product of 'n_Projects' and 'Performance' and divide by the 'total_projects':
library(dplyr)
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and about #akrun's answer:
You must have received the following error message:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you made for Weight_Project does not yield a single summary value for each Boss, but 7 of them. summarise is there to summarise several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise it into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
  group_by(Boss) %>%
  mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
  summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the number of total rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by.
Once you have a project weight for each row (which depends on the boss), you can calculate the weighted mean, which is a mean and thus a summary value, with summarise.
A more compact way that still shows the weighting explicitly would be:
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Reordering to minimise parentheses, which gives #akrun's answer
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
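Base R also has weighted.mean, which encodes exactly this sum(w * x) / sum(w) logic; a sketch on the same data:
library(dplyr)
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            weighted_mean_performance = weighted.mean(Performance, n_Projects))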

Elegant way of adding columns on a specific position in a data frame

I have a data.frame with 3 cols: date, rate, price. I want to add columns that come from a matrix, after rate and before price.
library(tibble)
df = tibble('date' = c('01/01/2000', '02/01/2000', '03/01/2000'),
            'rate' = c(7.50, 6.50, 5.54),
            'price' = c(92, 94, 96))
I computed the lags of rate using a function that outputs a matrix:
rate_Lags = matrix(data = c(NA, 7.50, 5.54, NA, NA, 7.50), ncol = 2,
                   dimnames = list(c(), c('rate_tMinus1', 'rate_tMinus2')))
I want to insert those lags after rate (and before price) using names indexing rather than column order.
The add_column function from the tibble package (Adding a column between two columns in a data.frame) does not work because it only accepts atomic vectors (hence if I have 10 lags I would have to call add_column 10 times). I could use apply on my rate_Lags matrix, but then I lose the dimnames of my rate_Lags matrix.
Using number indexing (subsetting) (https://stat.ethz.ch/pipermail/r-help/2011-August/285534.html) could work if I knew the position of a specific column name (is there any function that retrieves the position of a column name?).
Is there any simple way of inserting a bunch of columns in a specific position in a data frame/tibble object?
You may be overlooking the following
library(dplyr)
I <- which(names(df) == "rate")
if (I == ncol(df)) {
  cbind(df, rate_Lags)
} else {
  cbind(select(df, 1:I), rate_Lags, select(df, (I + 1):ncol(df)))
}
#         date rate rate_tMinus1 rate_tMinus2 price
# 1 01/01/2000 7.50           NA           NA    92
# 2 02/01/2000 6.50         7.50           NA    94
# 3 03/01/2000 5.54         5.54          7.5    96
Maybe this is not very elegant, but you only call the function once and I believe it's more or less general purpose.
fun <- function(DF, M) {
  nms_DF <- colnames(DF)
  nms_M <- colnames(M)
  # position of the DF column whose name occurs in M's column names
  inx <- which(sapply(nms_DF, function(x) length(grep(x, nms_M)) > 0))
  cbind(DF[seq_len(inx)], M, DF[seq_along(nms_DF)[-seq_len(inx)]])
}
fun(df, rate_Lags)
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
We could unclass the dataset to a list, then use append to insert 'rate_Lags' at the specific location, and reconvert the list to a data.frame:
i1 <- match('rate', names(df))
data.frame(append(unclass(df), as.data.frame(rate_Lags), after = i1))
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Or with tidyverse
library(tidyverse)
rate_Lags %>%
  as_tibble %>%
  append(unclass(df), ., after = i1) %>%
  bind_cols
# A tibble: 3 x 5
# date rate rate_tMinus1 rate_tMinus2 price
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 01/01/2000 7.5 NA NA 92
#2 02/01/2000 6.5 7.5 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
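If your dplyr is recent (relocate requires dplyr >= 1.0.0, which is an assumption about your setup), another name-based option is to bind the lags at the end and then move them; a sketch:
library(dplyr)
df %>%
  cbind(rate_Lags) %>%                                 # lag columns land at the end
  relocate(rate_tMinus1, rate_tMinus2, .after = rate)  # move them after rate by name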

R max of multiple categories [duplicate]

This question already has answers here:
Aggregate a dataframe on a given column and display another column
(8 answers)
Closed 5 years ago.
I've got data somewhat like this (of course with many more rows):
Age  Work Zone  SomeNumber
 26          1        2.61
 32          4        8.42
 41          2        9.71
 45          2        4.14
 64          3        6.04
 56          1        5.28
 37          4        7.93
I want to get the maximum SomeNumber for each zone at or below each age. SomeNumber increases with age, so I expect that the highest SomeNumber in Zone 2 by an under-32-y/o is by a guy who's age 31, but it could in fact be a guy age 27.
To do this I've written a nested for loop:
for (i in zonelist) {
  temp = data[data$zone == i, ]
  for (j in 1:max(data$age)) {
    temp.lessequal = c(temp.lessequal, max((temp[temp$Age <= j, ])$SomeNumber))
  }
  # plot temp.lessequal or save it at this point
}
which of course is tremendously slow. How can I do this faster? I've looked at the order function to sort by two columns at once, but that doesn't let me take the max of each group.
Data:
df1 <- read.table(text='Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93',
header = TRUE)
Code:
df2 <- with( df1, df1[ Age <= 32, ] ) # extract rows with Age <= 32
# get maximum of someNumber by aggregating with work_zone and then merging with df1 to combine the age column
merge(aggregate(SomeNumber ~ Work_Zone, data = df2, max), df2)
# Work_Zone SomeNumber Age
# 1 1 2.61 26
# 2 4 8.42 32
It seems the OP is looking for the max value based on a <= condition on a particular column (Age).
The use of sqldf comes in very handy in such cases to express the logic. One solution could be:
# Data
df <- read.table(text = "Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93", header = T, stringsAsFactors = F)
library(sqldf)
df3 <- sqldf("SELECT df1.Work_Zone, df1.Age, MAX(df2.SomeNumber) FROM df df1
              INNER JOIN df df2 ON df1.Work_Zone = df2.Work_Zone
              WHERE df2.Age <= df1.Age
              GROUP BY df1.Work_Zone, df1.Age")
# Result:
# Work_Zone Age max(df2.SomeNumber)
# 1 1 26 2.61
# 2 1 56 5.28
# 3 2 41 9.71
# 4 2 45 9.71
# 5 3 64 6.04
# 6 4 32 8.42
# 7 4 37 8.42
Using the data.table package, you can select the rows at or below the required age, then output max(SomeNumber) and the respective Age for each Work_Zone, i.e. grouping by Work_Zone.
library(data.table)
setDT(df1)[Age <= 32, .(max(SomeNumber), Age), by = Work_Zone]
Work_Zone V1 Age
1: 1 2.61 26
2: 4 8.42 32
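The question also asks for the running maximum at every age within each zone, which is what the sqldf self-join above computes. In data.table the same result can be had by sorting on Age and taking cummax; a sketch:
library(data.table)
dt <- as.data.table(df1)
setorder(dt, Work_Zone, Age)                         # within each zone, order by age
dt[, run_max := cummax(SomeNumber), by = Work_Zone]  # max at or below each age
dt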

data frame column total in R

I have data like this (derived using the table() function):
dat <- read.table(text = "responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6", header = TRUE)
dat
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
All I want is a total row at the bottom: a new row labelled Total that shows 35 in the freq column and 100 in the percent column. I am unable to find a solution. colSums doesn't work because of the first column, which is a string.
One option is converting to a matrix and using addmargins to get the column sums as a separate row at the bottom. But this will be a matrix.
m1 <- as.matrix(df1[-1])
rownames(m1) <- df1[,1]
res <- addmargins(m1, 1)
res
# freq percent
#A 9 25.7
#B 13 37.1
#C 10 28.6
#D 3 8.6
#Sum 35 100.0
If you want to convert to data.frame
data.frame(responses=rownames(res), res)
Another option would be getting the sum with colSums for the numeric columns (df1[-1]) (I think this is where the OP got into trouble, i.e. applying colSums on the entire dataset instead of a subset), creating a new data.frame with the responses column, and rbinding it with the original dataset.
rbind(df1, data.frame(responses='Total', as.list(colSums(df1[-1]))))
# responses freq percent
#1 A 9 25.7
#2 B 13 37.1
#3 C 10 28.6
#4 D 3 8.6
#5 Total 35 100.0
data
df1 <- structure(list(responses = c("A", "B", "C", "D"),
                      freq = c(9L, 13L, 10L, 3L),
                      percent = c(25.7, 37.1, 28.6, 8.6)),
                 .Names = c("responses", "freq", "percent"),
                 class = "data.frame", row.names = c(NA, -4L))
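A dplyr variant of the same colSums-and-rbind idea (a sketch; across and where require dplyr >= 1.0.0, an assumption about your installed version):
library(dplyr)
df1 %>%
  summarise(responses = "Total", across(where(is.numeric), sum)) %>%  # one-row total
  bind_rows(df1, .)                                                   # append it at the bottom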
This might be relevant, using the SciencesPo package; see this example:
library(SciencesPo)
tab(mtcars,gear,cyl)
#output
=================================
cyl
--------------------
gear 4 6 8 Total
---------------------------------
3 1 2 12 15
6.7% 13% 80% 100%
4 8 4 0 12
66.7% 33% 0% 100%
5 2 1 2 5
40.0% 20% 40% 100%
---------------------------------
Total 11 7 14 32
34.4% 22% 44% 100%
=================================
Chi-Square Test for Independence
Number of cases in table: 32
Number of factors: 2
Test for independence of all factors:
Chisq = 18.036, df = 4, p-value = 0.001214
Chi-squared approximation may be incorrect
X^2 df P(> X^2)
Likelihood Ratio 23.260 4 0.00011233
Pearson 18.036 4 0.00121407
Phi-Coefficient : NA
Contingency Coeff.: 0.6
Cramer's V : 0.531
#akrun I posted it but you already did the same. Correct me if I'm wrong, but I think we just need this, without creating a new data frame or using as.list (note that c() coerces the numeric columns to character here, which is why percent prints as 100 rather than 100.0):
rbind(df1, c("Total", colSums(df1[-1])))
Output:
responses freq percent
1 A 9 25.7
2 B 13 37.1
3 C 10 28.6
4 D 3 8.6
5 Total 35 100
With sqldf, the classes of the data frame are preserved:
library(sqldf)
sqldf("SELECT * FROM df1
UNION
SELECT 'Total', SUM(freq) AS freq, SUM(percent) AS percent FROM df1")
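One caveat: UNION in SQLite removes duplicate rows and does not guarantee row order, so the Total line is not certain to print last. UNION ALL keeps every row and, in practice, the input order; a sketch:
sqldf("SELECT * FROM df1
       UNION ALL
       SELECT 'Total', SUM(freq) AS freq, SUM(percent) AS percent FROM df1")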
Alternatively, you can use the margin.table and rbind functions from base R. Two lines and voila...
PS: The lines here are longer as I am recreating the data, but you know what I mean :-)
Data
df1 <- matrix(c(9, 25.7, 13, 37.1, 10, 28.6, 3, 8.6), ncol = 2, byrow = TRUE)
colnames(df1) <- c("freq", "percent")
rownames(df1) <- c("A", "B", "C", "D")
Creating the total calculation:
Total <- margin.table(df1, 2)
Combining the total with the original data:
df2 <- rbind(df1, Total)
df2
Inelegant but it gets the job done, please provide reproducible data frames so we don't have to build them first:
data = data.frame(letters[1:4], c(9, 13, 10, 3), c(25.7, 37.1, 28.6, 8.6))
colnames(data) = c("X", "Y", "Z")
data = rbind(data[, 1:3], matrix(c("Sum", lapply(data[, 2:3], sum)), nrow = 1)[, 1:3])
library(janitor)
dat %>%
adorn_totals("row")
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
Total 35 100.0
