Aggregate data in dataframe - r

I have the following data frame in R
DeptNumber EmployeeTypeId
1 10
1 11
1 11
2 23
2 23
2 30
2 40
3 45
3 46
I need to generate another dataframe with a new column MaxEmployeeType, which should contain the EmployeeTypeId which is repeated the most for a given DeptNumber. The output should be as follows
DeptNumber MaxEmployeeType
1 11
2 23
3 45
For DeptNumber = 3 there is a tie, and either value is acceptable. What is the best way to do this? Any help is appreciated.
A similar question is posted already
How to aggregate data in R with mode (most common) value for each row?
but it was limited to plyr and lubridate. I would like the best solution, without restricting it to those two packages. That question was also downvoted, possibly because it looked like homework.

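You could try:
library(dplyr)
df %>%
  count(DeptNumber, EmployeeTypeId) %>%
  # group explicitly: in current dplyr, count() on an ungrouped data frame returns an ungrouped result
  group_by(DeptNumber) %>%
  top_n(1, wt = n) %>%
  slice(1)
Or, as suggested by @jazzuro:
count(df, DeptNumber, EmployeeTypeId) %>% group_by(DeptNumber) %>% slice(which(n == max(n))[1])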
Which gives:
#Source: local data frame [3 x 3]
#Groups: DeptNumber [3]
#
# DeptNumber EmployeeTypeId n
# (int) (int) (int)
#1 1 11 2
#2 2 23 2
#3 3 45 1

Try this.
# Mode function
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
# new data frame
new_df <- data.frame(DeptNumber = numeric(0), MaxEmployeeType = numeric(0))
# distinct departments
depts <- unique(df$DeptNumber)
# calculate the mode for every department
for (dept in depts) {
  dept_set <- subset(df, DeptNumber == dept)
  # bind a one-row data frame (not a bare vector) so the column names are preserved
  new_df <- rbind(new_df, data.frame(DeptNumber = dept,
                                     MaxEmployeeType = Mode(dept_set$EmployeeTypeId)))
}
Base R has no built-in function for the statistical mode (mode() returns an object's storage mode instead). The Mode function in the code above is taken from a post by Ken Williams.
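The per-department loop can also be written without an explicit loop. The following is a minimal base R sketch, assuming the question's data is rebuilt as df; it reuses the Mode helper from above and hands it to aggregate(). The name result is only illustrative.

```r
# Sample data from the question
df <- data.frame(
  DeptNumber     = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  EmployeeTypeId = c(10, 11, 11, 23, 23, 30, 40, 45, 46)
)

# Mode helper: most frequent value, first one wins on ties
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Apply Mode to EmployeeTypeId within each DeptNumber
result <- aggregate(EmployeeTypeId ~ DeptNumber, data = df, FUN = Mode)
names(result)[2] <- "MaxEmployeeType"
result
#   DeptNumber MaxEmployeeType
# 1          1              11
# 2          2              23
# 3          3              45
```

Because which.max returns the first maximum, the tie in department 3 resolves to the value that appears first in the data.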

Here is another dplyr solution:
library(dplyr)
df %>%
  count(DeptNumber, EmployeeTypeId) %>%
  # group by department so which.max() picks one row per department
  group_by(DeptNumber) %>%
  slice(which.max(n))

Related

R: Conditionally select rows based on their value and the average value of the other rows with the same key

This should be very simple, but I can't figure out how to do it properly.
Given the following example dataframe:
telar <- data.frame(name=c("uno","dos","tres","cuatro","cinco"), id=c(1,2,3,1,2), test=c(10,11,12,13,14))
telar
name id test
1 uno 1 10
2 dos 2 11
3 tres 3 12
4 cuatro 1 13
5 cinco 2 14
I am trying to select all the rows whose test value is, for example, below the average of all the test values in telar that share the same id.
I have already grouped the values by id and computed their averages like this, but I do not know how to perform the comparison.
> summarise(group_by(telar, id), test=mean(test))
A tibble: 3 x 2
id test
<dbl> <dbl>
1 1 11.5
2 2 12.5
3 3 12
Thank you!
You can simply group by id and keep the rows whose test value is below the group mean, i.e.
library(dplyr)
telar %>%
  # group by id only; adding name would make every row its own group
  group_by(id) %>%
  filter(test < mean(test)) %>%
  ungroup()
There is undoubtedly a way to do this without using data.table, but I offer it as a solution
library(data.table)
setDT(telar)
telar[, avg := mean(test), by = id][test < avg]
Note: if your further analysis expects a plain data.frame, I recommend converting back with setDF(telar).
Using base R, this can be done with ave:
telar[with(telar, test < ave(test, id)), ]
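To see why the grouping should use id alone, here is a small base R sketch built on the question's telar data. The names group_mean, below and own_value are illustrative and do not come from the original answers.

```r
telar <- data.frame(
  name = c("uno", "dos", "tres", "cuatro", "cinco"),
  id   = c(1, 2, 3, 1, 2),
  test = c(10, 11, 12, 13, 14)
)

# Mean of test within each id, repeated for every row of that id
group_mean <- ave(telar$test, telar$id)
group_mean
# 11.5 12.5 12.0 11.5 12.5

# Keep the rows whose test is strictly below their group's mean
below <- telar[telar$test < group_mean, ]
below
#   name id test
# 1  uno  1   10
# 2  dos  2   11

# Grouping by name as well makes every row its own group,
# so each "mean" is just the row's own value and nothing is below it
own_value <- ave(telar$test, telar$id, telar$name)
sum(telar$test < own_value)
# 0
```

The row with id 3 is its own group, so its value equals the mean and it is not selected under a strict comparison.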

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop that creates a tibble, tbl, at the end of each iteration. The loop uses a different date, date, each time.
Assume:
tbl <- tibble(colA=1:5,colB=5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(The contents change on every iteration, but the names tbl and date and the column names colA and colB stay the same.)
The name of each output object should start with output: outputdate1, outputdate2, etc.
The columns inside should be named colAdate1, colBdate1, then colAdate2, colBdate2, and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...)) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still to put the objects separately in the global environment.
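The list variation extends naturally to the loop from the question. The sketch below is a base R illustration under a few assumptions: a plain data.frame stands in for the tibble, the dates vector and the outputs list are hypothetical names, and the suffix uses format(date, "%Y%m") (for example 201702) rather than the month name, because month names depend on the locale.

```r
# Stand-in for the table produced in each iteration
tbl <- data.frame(colA = 1:5, colB = 5:9)

# Hypothetical dates the loop runs over
dates <- as.Date(c("2017-02-28", "2017-03-31"))

outputs <- list()
for (i in seq_along(dates)) {
  # Locale-independent year-month suffix, e.g. "201702"
  ym <- format(dates[i], "%Y%m")
  # Store the renamed copy under a name built from the suffix
  outputs[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
}

names(outputs)
# "output201702" "output201703"
names(outputs$output201702)
# "colA201702" "colB201702"
```

Keeping the results in one list makes it easy to iterate over them later with lapply(), which is harder when each result is a separate object in the workspace.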
dplyr
Here it is using dplyr and rlang, although it is more complex:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))

sapply results with dplyr

In the example below I am trying to determine which value is closest to each of the vals_int, by id. I can solve this problem using sapply() in a manner similar to the code below, but I am wondering whether the sapply() part can be done with a function from dplyr instead.
I am really just interested in if the sapply method and output can be reproduced using some function(s) in the dplyr package. I had thought that do() may work but am struggling to determine how.
library(tidyverse)
df <- tibble(
id = rep(1:10, 10) %>%
sort,
visit = rep(1:10, 10),
value = rnorm(100)
)
vals_int <- c(1, 2, 3)
tmp <- sapply(vals_int,
function(val_i) abs(df$value - val_i))
Yes, you can use the rowwise() and do() functions in dplyr to perform the same operation on every row, like so:
df %>% rowwise %>% do(diffs = abs(.$value - vals_int))
This will create a column called diffs in a new tibble which is a list of vectors with length 3. If you coerce the output that do() returns to be a data frame, it will instead create a tibble with three columns, one for each of the values subtracted.
df %>% rowwise %>% do(as.data.frame(t(abs(.$value - vals_int))))
The answer by @qdread does what you are looking for, but the tidyverse is moving away from do(), if that matters to you. Here is an alternative using map() from the purrr package.
df %>%
  mutate(closest = map(value, function(x) {
    abs(x - vals_int) %>%
      t() %>%
      as_tibble()   # as.tibble() is the deprecated spelling
  })) %>%
  unnest(closest)
That gives you this:
# A tibble: 100 x 6
id visit value V1 V2 V3
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.91813183 0.08186817 1.081868 2.081868
2 1 2 -1.68556173 2.68556173 3.685562 4.685562
3 1 3 -0.05984289 1.05984289 2.059843 3.059843
4 1 4 0.40128729 0.59871271 1.598713 2.598713
5 1 5 -0.09995526 1.09995526 2.099955 3.099955
6 1 6 0.81802663 0.18197337 1.181973 2.181973
7 1 7 -1.49244225 2.49244225 3.492442 4.492442
8 1 8 -0.74256185 1.74256185 2.742562 3.742562
9 1 9 -0.43943907 1.43943907 2.439439 3.439439
10 1 10 0.54985857 0.45014143 1.450141 2.450141
# ... with 90 more rows
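For comparison, the same matrix of absolute differences can be built in base R with outer(), and the closest visit per id and target read off with tapply(). This is only a sketch and does not use dplyr; the seed and the names dist_mat and closest_visit are assumptions added for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(
  id    = sort(rep(1:10, 10)),
  visit = rep(1:10, 10),
  value = rnorm(100)
)
vals_int <- c(1, 2, 3)

# Original approach: one column of absolute differences per target value
tmp <- sapply(vals_int, function(val_i) abs(df$value - val_i))

# outer() builds the same 100 x 3 matrix in a single call
dist_mat <- abs(outer(df$value, vals_int, "-"))
all.equal(tmp, dist_mat)
# TRUE

# For each target (column) and each id, the position of the smallest distance.
# Rows within an id are ordered by visit, so the position is the visit number.
closest_visit <- apply(dist_mat, 2, function(d) tapply(d, df$id, which.min))
dim(closest_visit)
# 10 3
```

Each row of closest_visit corresponds to an id and each column to one entry of vals_int.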

Retrieving unique combinations [duplicate]

So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.
Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
That is, only one row is kept for each set of rows that share the same group identifiers (here only (id, day) = (1, 1) is duplicated), sorting by value first so that the row with the lowest value is kept.
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web that does this, provided I first create a single group identifier:
mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="") ### the single group identifier
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(mydata, id1==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The dplyr package makes this kind of thing easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
I would order the data.frame at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(mydata)
mydata <- mydata[, .SD[1], by = .(id, day)]
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() at the end, dplyr's grouping structure will still be present and may interfere with subsequent operations.
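The Stata idiom of sorting within groups and keeping the first row also maps onto base R with order() and duplicated(), with no extra packages. A minimal sketch on the question's data follows; the names sorted and first_rows are illustrative.

```r
id    <- c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4)
day   <- c(1, 1, 2, 3, 1, 2, 2, 3, 1, 2, 3)
value <- c(12, 10, 15, 20, 40, 30, 22, 24, 11, 11, 12)
mydata <- data.frame(id, day, value)

# Sort by the grouping variables, then by value within each group
sorted <- mydata[order(mydata$id, mydata$day, mydata$value), ]

# duplicated() flags every repeat of an (id, day) pair after the first,
# so negating it keeps the first (lowest-value) row of each group
first_rows <- sorted[!duplicated(sorted[c("id", "day")]), ]

nrow(first_rows)
# 10
first_rows$value
# 10 15 20 40 30 22 24 11 11 12
```

Both order() and duplicated() are vectorised, so this avoids the row-by-row rbind() that makes the loop in the question slow.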

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset of Likert scales, and I want to compute row sums over different groups of columns that share the same prefix in their names.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are many more variables in the dataset, and I would like to know whether this can be done with a single line of code: for example, some way to group the variables that start with the same string and then apply the row-sum function to each group.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
Another potential solution is dplyr's rowwise() function, described at https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))
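If listing every prefix by hand is the main concern, the prefixes can be derived from the column names instead. The following base R sketch uses a small hypothetical data frame with two prefixes rather than the question's random data; prefixes and totals are illustrative names.

```r
# Small hypothetical example with two column groups
df <- data.frame(
  emp_1 = c(1, 2), emp_2 = c(3, 4),
  sat_1 = c(5, 1), sat_2 = c(2, 2)
)

# Everything before the first underscore is treated as the group prefix
prefixes <- unique(sub("_.*", "", names(df)))

# One row-sum column per prefix; the trailing "_" prevents
# a prefix from also matching longer names that merely start the same way
totals <- sapply(prefixes, function(p) {
  rowSums(df[startsWith(names(df), paste0(p, "_"))])
})
colnames(totals) <- paste0(prefixes, "_t")

cbind(df, totals)
#   emp_1 emp_2 sat_1 sat_2 emp_t sat_t
# 1     1     3     5     2     4     7
# 2     2     4     1     2     6     3
```

This assumes every column name follows the prefix_number pattern; columns without an underscore would each form their own group.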

Resources