When to use which vs subset function when calculating mean

When to use which vs subset function when calculating mean - r

I am trying to make a find the mean of a variable (age) by gender in R. Age and gender are each columns in the a dataset and I want to find the mean of age for women and for men. I believe I need to use the which function, but am unsure how to actually do that and was wondering if using the subset function would make more sense.

You're probably looking for aggregate().
aggregate(age ~ sex, dat, mean)
# sex age
# 1 1 24.5
# 2 2 24.0
Data
dat <- structure(list(sex = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), age = c(28L,
19L, 26L, 25L, 22L, 27L, 23L), x = c(0.978226428385824, 0.117487361654639,
0.474997081561014, 0.560332746244967, 0.904031387297437, 0.138710167724639,
0.988891728920862)), class = "data.frame", row.names = c(NA,
-7L))

We can also use dplyr
library(dplyr)
df1 %>%
group_by(sex) %>%
summarise(age = mean(age))

Related

R split in Group for parallel execution

I have a simple Table:
ID|Value
1|10
1|20
1|-5
2|25
3|2
3|15
4|8
5|18
6|33
6|5
6|50
Actual I use this code:
for (row in 1:nrow(Table)) {
ID <- Table[row, 1]
Value <- Table[row, 2]
if ( oldID == ID) {
currentValue <- currentValue * ((100 - Value)/100) }
else {
addrow <- data.frame(oldID, currentValue)
PriceRR <- rbind(PriceRR, addrow)
oldID <- ID
currentValue <- 100 - Value
}
}
To allocated a discount for a later DAX Value in Power BI.
But it slow as hell. So I want to parallelize it.
daply might do the work. But I do not know the inner workings of it.
So basically what I need.
Split table in sets by group of ID.
Set1 1,10 1,20 1,5
Set2 2,25
Set3 3,2 3,15
.
.
.
Apply function to Sets parallel.
First call of function in set, initialize currentValue <- 100
after
currentValue <- currentValue * ((100 - Value)/100)
For Set1.1 90 <- 100 * ((100 - 10)/100)
For Set1.2 72 <- 90 * ((100 - 20)/100)
For Set1.3 68,4 <- 72 * ((100 - 5)/100)
It should return ID=1 Value=68,4
I need to know, is it possible to make a variable persistent in memory for the duration of execute a function an set, as long it lives?
Will daply or a other function create a new working thread to apply it on a set?
I am a R beginner and must jump right in the inner working of the R environment. :-)
Sven

An option with reduce from purrr
library(dplyr)
library(purrr)
data %>%
group_by(ID) %>%
summarise(Result = reduce(Value, ~ .x * (100 -.y)/100, .init = 100))
# A tibble: 6 x 2
# ID Result
#* <int> <dbl>
#1 1 68.4
#2 2 75
#3 3 83.3
#4 4 92
#5 5 82
data
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))

Here's an approach with dplyr and Reduce from base R:
library(dplyr)
data %>%
group_by(ID) %>%
summarize(Result = Reduce(function(x,y) x * ((100 - y)/ 100),
Value, init = 100))
# A tibble: 6 x 2
ID Result
<int> <dbl>
1 1 68.4
2 2 75
3 3 83.3
4 4 92
5 5 82
6 6 31.8
Reduce is a tricky function mostly because the documentation is terrible. Reduce applies a function with two arguments to elements in a vector in succession with the previous value as the first argument and the current value as the second argument. You can set an initial value with init =.
I notice in your explaination that your expected output for group 1 is 68.4. This is only true if the value for row 3 is 5 rather than the -5 you posted. Since this was the only negative value in your data, I went ahead and changed it to 5.
Data
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))

Your original script is slow for a couple of reason. First you are looping through every element in your initial table and not taking advantage of the vectorized nature of R. Second, there is a rbind function within the loop. Binding is a slow process, especially as the object size grows.
It looks likes the objective is a cumulative product of the the value column grouped by the ID column.
Here is a base R solution using the split, apply and merge strategy.
Table <-structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, -5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
#Create column for the ((100 - Value)/100) factor
Table$factor<- ((100 - Table$Value)/100)
#split by ID
dfs<-split(Table, Table$ID)
currentValue<-sapply(dfs, function(x){
#find the cumulative product of the factor column
product<-cumprod(x$factor)
#return the last value fron the cumprod
return(100*product[length(product)])
})
#create the final answer
PriceRR<-data.frame(oldID=as.integer(names(dfs)), currentValue)
PriceRR
oldID currentValue
1 1 75.600
2 2 75.000
3 3 83.300
4 4 92.000
5 5 82.000
6 6 31.825
This script is using the cumprod function which is vectorized, thus very fast. Also the above script avoids the slow operation of continuing to growing the final dataframe.

How to calculate percentage of mising data in a time series in R dplyr

In the following sample data and script,
How can I calculate the % of missing data between start date strtdt and end date enddt for each ID. What I want to get is: add the missing days with NA between strtdt and enddt separately for each IDs than calculated the % of NA.
I tried following using dplyr but for no luck. Any suggestion will be highly appreciated.
Note: I can achieve same by calculating individually for each ID however that is not possible because I have more than 10000 IDs.
Ultimate goal is to get % of NA between start date and end date for each ID; If the dates are missing completely than i have to add missing date with NA values.
library(dplyr
df<-structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L
), .Label = c("xx", "xyz", "yy", "zz"), class = "factor"), Date = structure(c(8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 19L, 20L, 21L, 22L, 23L), .Label = c("1989-09-12",
"1989-09-13", "1989-09-14", "1989-09-19", "1989-09-23", "1990-01-12",
"1990-01-13", "1996-09-12", "1996-09-13", "1996-09-16", "1996-09-17",
"1996-09-18", "1996-09-19", "2000-09-12", "2000-09-13", "2000-11-10",
"2000-11-11", "2000-11-12", "2001-09-07", "2001-09-08", "2001-09-09",
"2001-09-10", "2001-09-11"), class = "factor"), val = c(3, 5,
9, 3, 5, 6, 8, 7, 9, 5, 3, 2, 8, 8, 5, 3, 2, 1, 5, 7, NA, NA,
NA, NA)), .Names = c("ID", "Date", "val"), row.names = c(NA,
-24L), class = "data.frame")
df$Date<-as.Date(df$Date,format="%Y-%m-%d")
df
df_mis<-df %>%
group_by(ID)%>%
dplyr::mutate(strtdt=min(Date),
enddt=max(Date))
df_mis
df_mis2<-df_mis %>%
group_by(ID) %>%
dplyr::do( data.frame(., Date1= seq(.$strtdt,.$enddt, by = '1 day')))
df_mis2

I assume from the sequence generation in the question's code, that the expected observations are one per day between the first observed date and last observed date per ID. Here's a clunky piece by piece calculation to count the % missing data.
1. Make a data frame of all expected dates for each ID
library(dplyr)
# df as in the question, but coerce Date column
df$Date <- as.Date(df$Date)
# Data frame with date ranges per id
ranges_df <- df %>%
group_by(ID) %>%
summarize(min=min(Date), max=max(Date))
# Data frame with IDs and date for every day expected.
alldays <- ranges_df %>%
group_by(ID) %>%
do(., data.frame(
Date = seq(.$dmin,.$dmax, by = '1 day')
)
)
2. JOIN the expected dates table with the observed dates table.
imputed_df <- left_join(alldays, df)
3. Count NAs
imputed_df %>%
group_by(ID) %>%
summarize(total=n(),
missing=sum(is.na(val)),
percent_missing=missing/total*100
)
result:
# A tibble: 4 x 4
ID total missing percent_missing
<fctr> <int> <int> <dbl>
1 xx 8 2 25.00000
2 xyz 4 4 100.00000
3 yy 62 57 91.93548
4 zz 4380 4371 99.794
Assuming that NAs in the original data should be counted as missing data, this will do so.

Calculate the number of days between the min and max of dates as an intermediate variable.
Then, calculate the number of missing days as number of days - number of observations. Then, calculate percentages.
df %>%
group_by(ID) %>%
mutate(numdays = as.numeric(max(Date) - min(Date)) + 1,
pctmissing = (numdays - n()) / numdays)

Mutate repeats first row value

I have a dataset with taxonomy assignment and I want to extract the genus in a new column.
library(tidyverse)
library(magrittr)
library(stringr)
df <- structure(list(C043 = c(18361L, 59646L, 27575L, 163L, 863L, 3319L,
0L, 6L), C057 = c(20020L, 97610L, 13427L, 1L, 161L, 237L, 2L,
105L), taxonomy = structure(c(3L, 2L, 1L, 6L, 4L, 4L, 5L, 2L), .Label = c("k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri"
), class = "factor")), .Names = c("C043", "C057", "taxonomy"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 8L, 10L), class = "data.frame")
So this is my function (it works)
extract_genus <- function(str){
genus <- str_split(str, pattern = ";")[[1]][6]
genus %<>% str_sub(start = 4) #%>% as.character
return(genus)
}
But when I applied it in mutate (with or without as.character), it repeats first row value in the new column.
df %>% mutate(genus = extract_genus(taxonomy))
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Escherichia
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Escherichia
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Escherichia
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Escherichia
When I use sapply (but I don't want to, I want a solution with dplyr pipeline), it works.
df_group_gen$genus <- sapply(df_group_gen$taxonomy, extract_genus)
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Enterobacter
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Enterobacter
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Pseudomonas
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Klebsiella
Why mutate doesn't compute as we can expect? I find this question but no answer is provided, only a had hoc code.
Thank you :)

You can Vectorize your function to allow mutate to occur on every row:
ex_gen <- Vectorize(extract_genus, vectorize.args='str')
df %>% mutate(genus=ex_gen(taxonomy))
Alternatively, you can use rowwise to mutate each row:
df %>%
rowwise() %>%
mutate(genus = extract_genus(taxonomy))

how do you calculate correlation for some columns in a data frame?

I need to be able to calculate correlation coefficient for a data frame by column. For example, my data frame is this:
dput(df)
structure(list(Server = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("server101", "server102"), class = "factor"),
JVM = structure(c(1L, 2L, 3L, 4L, 2L, 3L, 4L),
.Label = c("JVM1", "JVM2", "JVM3", "JVM4"),
class = "factor"), cpu = c(10L, 20L, 30L, 20L, 10L, 20L, 20L),
trans = c(1000L, 2000L, 30L, 30L, 50L, 60L, 30L)),
.Names = c("Server", "JVM", "cpu", "trans"),
class = "data.frame", row.names = c(NA, -7L))
I need to be able to calculate the correlation of coefficient between cpu and trans by each jvm and server.
I have tried this:
apply(df[,c('trans','cpu')], function(x) FUN=cor(x["trans"],x["cpu"]))
any ideas how I could do this?

There is a base R way to do this, but I would probably use dplyr package
If your data frame is called structure, try this:
library(dplyr)
structure %>% group_by(JVM) %>% summarize(cor = cor(trans,cpu))

We can use data.table
library(data.table)
setDT(df)[, .(Cor = cor(trans,cpu)), by = JVM]

You can't get a correlation out of one data point but assuming that your real data has more data points per server/jvm try this code. It does not use any packages:
out <- tapply(1:nrow(df), df[1:2], function(i) with(df[i, ], cor(cpu, trans)))
giving:
> out
JVM
Server JVM1 JVM2 JVM3 JVM4
server101 NA NA NA NA
server102 NA NA NA NA
If you prefer a long form then:
ftable(out, row.vars = 1:2)
giving:
Server JVM
server101 JVM1 NA
JVM2 NA
JVM3 NA
JVM4 NA
server102 JVM1 NA
JVM2 NA
JVM3 NA
JVM4 NA

R: dplyr summarize, sum only values of uniques

I am having trouble with a pesky command I would like to have for an analysis of a summary, for which I'm using the dplyr package. It's easiest to explain with some example data:
structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George",
"Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L,
1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L,
100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L,
20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year",
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA,
-9L))
Two simple summaries are my goal: first, I'd like to summarize just by Date, with the code seen below. The part that is wrong is the total_balance_sum calculation, in which I want to sum the balance of each person but only one time for each person. So for instance, the result of my command for Date=1 is total_balance_sum=100, but what it should be is 150 (add total_balance of 100 for Jack once to total_balance of Mary of 50 once). This wrong calculation obviously messes up the final pct calc.
example_data %>%
group_by(Date) %>%
summarise(
total_people=n_distinct(Name),
total_loan_exposures=n(),
special_sum=sum(Special_Balance,na.rm=TRUE),
total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
total_pct=special_sum/total_balance_sum
) -> example_summary
In the second summary (below), I group by both date and birth year, and again am calculating total_balance_sum incorrectly.
example_data %>%
group_by(Date,Birth.Year) %>%
summarise(
total_people=n_distinct(Name),
total_loan_exposures=n(),
special_sum=sum(Special_Balance,na.rm=TRUE),
total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
total_pct=special_sum/total_balance_sum
) -> example_summary_birthyear
What is the correct way to achieve my goal? Clearly the n_distinct I'm using is only taking one of the values and not summing it properly across names.
Thanks for your help.

I'm a little unclear on what you may be asking for, but does this do what you'd like?: (just for the first example)
example_data %>%
group_by(Date, Name) %>%
summarise(
total_loan_exposures=n(),
total_SpecialPerson=sum(Special_Balance,na.rm=TRUE),
total_balance_sumPerson=Total_Balance[1])%>%
ungroup() %>%
group_by(Date) %>%
summarise(
total_people=n(),
total_loan_exposures=sum(total_loan_exposures),
special_sum=sum(total_SpecialPerson,na.rm=TRUE),
total_balance_sum=sum(total_balance_sumPerson)) %>%
mutate(total_pct=(special_sum/total_balance_sum))-> example_summary
> example_summary
Source: local data frame [3 x 6]
Date total_people total_loan_exposures special_sum total_balance_sum total_pct
1 1 2 3 80 150 0.53333333
2 2 2 4 32 220 0.14545455
3 3 2 2 101 1700 0.05941176

For the second example (for the first, just remove the Birth.Year):
library(dplyr)
example_data %>% group_by(Date, Birth.Year) %>%
mutate(special_sum = sum(Special_Balance),
total_loan_exposure = n( )) %>%
distinct(Name, Total_Balance) %>%
summarise(Total_balance_sum = sum(Total_Balance),
special_sum = special_sum[1],
total_people = n(),
total_loan_exposure = total_loan_exposure[1],
special_sum/Total_balance_sum)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

When to use which vs subset function when calculating mean - r

We can also use dplyr library(dplyr) df1 %>% group_by(sex) %>% summarise(age = mean(age))

Related

R split in Group for parallel execution

How to calculate percentage of mising data in a time series in R dplyr

Mutate repeats first row value

how do you calculate correlation for some columns in a data frame?

R: dplyr summarize, sum only values of uniques

Categories

Resources