R max of multiple categories [duplicate]

This question already has answers here: Aggregate a dataframe on a given column and display another column (8 answers). Closed 5 years ago.
I've got data somewhat like this (of course with many more rows):
Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93
I want to get the maximum SomeNumber for each zone at or below each age. SomeNumber increases with age, so I expect that the highest SomeNumber in Zone 2 by an under-32-y/o is by a guy who's age 31, but it could in fact be a guy age 27.
To do this I've written a nested for loop:
zonelist <- unique(data$Work_Zone)
for (i in zonelist) {
  temp <- data[data$Work_Zone == i, ]
  temp.lessequal <- c()
  for (j in 1:max(data$Age)) {
    temp.lessequal <- c(temp.lessequal, max(temp[temp$Age <= j, ]$SomeNumber))
  }
  # plot temp.lessequal or save it at this point
}
which of course is tremendously slow. How can I do this faster? I've looked at the order function to sort by two columns at once, but that doesn't let me take the max of each group.

Data:
df1 <- read.table(text='Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93',
header = TRUE)
Code:
df2 <- with(df1, df1[Age <= 32, ])  # extract rows with Age <= 32
# get the maximum SomeNumber per Work_Zone with aggregate, then merge with df2 to bring back the Age column
merge(aggregate(SomeNumber ~ Work_Zone, data = df2, max), df2)
# Work_Zone SomeNumber Age
# 1 1 2.61 26
# 2 4 8.42 32
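If you need the running maximum at every age rather than at a single cutoff, here is a minimal base-R sketch (assuming df1 from above): sort by Age, then let ave() apply cummax() within each Work_Zone.
df1 <- df1[order(df1$Age), ]  # sort so the cumulative max runs from youngest to oldest
df1$MaxSoFar <- ave(df1$SomeNumber, df1$Work_Zone, FUN = cummax)  # running max per zone
df1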

It seems the OP is looking for the maximum value based on a <= condition on a particular column (Age).
sqldf comes in very handy in such cases, since the SQL makes the logic explicit. One solution could be:
# Data
df <- read.table(text = "Age Work_Zone SomeNumber
26 1 2.61
32 4 8.42
41 2 9.71
45 2 4.14
64 3 6.04
56 1 5.28
37 4 7.93", header = T, stringsAsFactors = F)
library(sqldf)
df3 <- sqldf("select df1.Work_Zone, df1.Age, max(df2.SomeNumber) from df df1
inner join df df2 on df1.Work_Zone = df2.Work_Zone
WHERE df2.Age <= df1.Age
GROUP BY df1.Work_Zone, df1.Age")
# Result:
# Work_Zone Age max(df2.SomeNumber)
# 1 1 26 2.61
# 2 1 56 5.28
# 3 2 41 9.71
# 4 2 45 9.71
# 5 3 64 6.04
# 6 4 32 8.42
# 7 4 37 8.42
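For comparison, here is a sketch of the same self-join logic in dplyr (assuming df as defined above): the inner_join pairs every row with every row in its zone, and the filter keeps only partners at or below that row's age.
library(dplyr)
df %>%
  inner_join(df, by = "Work_Zone", suffix = c("", ".y")) %>%  # within-zone self-join
  filter(Age.y <= Age) %>%                                    # partners at or below this age
  group_by(Work_Zone, Age) %>%
  summarise(MaxSoFar = max(SomeNumber.y), .groups = "drop")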

Using the data.table package you can select the rows at or below the required age, then output max(SomeNumber) and the corresponding Age for each Work_Zone, i.e. group by Work_Zone.
library(data.table)
setDT(df1)[Age<=32,.(max(SomeNumber),Age),by=Work_Zone]
Work_Zone V1 Age
1: 1 2.61 26
2: 4 8.42 32
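If you want every age cutoff rather than a single filter, a sketch with a data.table non-equi self-join (assuming df1 as above) mirrors the sqldf answer:
library(data.table)
dt <- as.data.table(df1)
# for each (Work_Zone, Age) pair, take the max SomeNumber among rows
# of the same zone whose Age is at or below it
dt[dt, on = .(Work_Zone, Age <= Age),
   .(MaxSoFar = max(SomeNumber)), by = .EACHI]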


R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently?

I have a large dataset ~1M rows with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate Example Customer Data
library(dplyr)  # for tibble() and the pipeline below
set.seed(1)
n_customers <- 10
customer_df <-
  tibble(id = 1:n_customers,
         score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
  max = c(47.0, 53.0, 57.0, 60.5, 63.0, 65.5, 67.3, 69.7,
          71.7, 74.0, 76.3, 79.0, 82.5, 85.5, 100.0),
  rating = 15:1
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
library(purrr)  # map()
library(tidyr)  # unnest()
customer_df <-
  customer_df %>%
  mutate(rating = map(.x = score,
                      .f = ~ max(select(filter(rating_map, .x < max), rating)))) %>%
  unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient. If you set n_customers to 100k in the above code, you can get a sense of how long it takes to run.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency to be my use of the purrr::map() function. So my question is: how can I replicate the above results without using map()?
Thanks!
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster. It takes 1/20th of a second on 1M rows, which sounds like >72,000x speedup.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
We could do a non-equi join
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
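To check the relative speed of these approaches on your own machine, here is a rough benchmarking sketch (assuming the microbenchmark package is installed; actual timings will vary):
library(microbenchmark)
n <- 1e6
big_scores <- sample(50:80, n, replace = TRUE)
microbenchmark(
  cut = length(rating_map$max) -
    cut(big_scores, breaks = rating_map$max, labels = FALSE, right = FALSE),
  findInterval = nrow(rating_map) - findInterval(big_scores, rating_map$max),
  times = 10
)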

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
Employee = sample(1:20,20),
n_Projects = sample(50:100, 20, replace=TRUE),
Performance = round(sample(1:100,20,replace=TRUE)/100,2),
stringsAsFactors = FALSE)
> df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to have a weight equal to the n_Projects value over the total n_Projects for that boss.
For example, for Boss 1 we have a total of 604 n_Projects, where project 1 has a Performance weight of 0.13 (78/604 * 0.97 = 0.13), project 3 a Performance weight of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these Performance weights is the Boss performance, which for Boss 1 is 0.52. So, the final output should be like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
In addition, can you give me any feedback about my code specifically, or any recommendation to improve my data-manipulation skills? (You can see in my profile that I have asked a lot of questions like this, but I'm still not able to solve them on my own.)
We can get the sum of the product of 'n_Projects' and 'Performance' and divide by 'total_projects':
library(dplyr)
df %>%
group_by(Boss) %>%
summarise(total_projects = sum(n_Projects),
Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects)
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and #akrun's answer:
You must have received the following error message:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you made for Weight_Project does not yield a single value for each Boss, but 7. summarise is there to collapse several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise those values into a single one.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
group_by(Boss) %>%
mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the number of total rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by.
Once each row has a project weight (which depends on the boss), you can calculate the weighted mean, a summary value, with summarise.
A more compact way that still shows the weighted calculation explicitly would be:
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Reordering to minimise parentheses gives #akrun's answer:
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
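For reference, stats::weighted.mean computes the same quantity and hides the division; a sketch with the same grouping:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Performance = weighted.mean(Performance, n_Projects))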

Match two data frames by two columns and extract values from third column

I apologize if this is a basic or duplicate question, but I am a beginner R user.
I am attempting to match every row in Dataframe A by Sex and Age to the two corresponding columns in Dataframe B. I know there will be a match for sure, so I want to extract values from the matching rows of two different columns in Dataframe B and store them in Dataframe C.
Dataframe A
ID Sex Age Weight
1   1  24  36
2   1  34  56
3   2  87  12
4   2  21  08
5   1  64  33
...

Dataframe B
Row Sex Age X1   X2
1    1  24  18.2 12.3
2    2  87  15.4 16.5
3    1  64  16.3 11.2
4    2  21  15.6 14.7
5    1  34  17.7 18.9
...
Dataframe C
ID Age Sex Weight Y1 Y2
1 1 24 36 18.2 12.3
2 1 34 56 17.7 18.9
3 2 87 12 15.4 16.5
4 2 21 08 15.6 14.7
5 1 64 33 16.3 11.2
There are 9000 IDs in my dataframe. I've looked at similar questions like this one
Fill column values by matching values in each row in two dataframe
But I don't think I am applying this code correctly. Will a for loop be useful here?
for (i in 1:nrow(dfA)) {
  idx <- match(paste(dfA$Sex[i], dfA$Age[i]), paste(dfB$Sex, dfB$Age))
  dfC[i, "Y1"] <- dfB[idx, "X1"]
  dfC[i, "Y2"] <- dfB[idx, "X2"]
}
I know the merge function was also suggested, but these two variables are not actually named the same way in my data set.
Thanks!
Try the Reduce function in R for such operations:
set.seed(1)
list.of.data.frames <- list(
  data.frame(id = 1:10, sex = 1:10, age = 1:10, weight = 11:20),
  data.frame(row = 5:14, sex = 11:20, age = 1:10, x1 = 1:10, x2 = 1:10),
  data.frame(id = 8:14, sex = 11:17, age = 1:7, weight = 20:26, y1 = 1:7, y2 = 1:7))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
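Since the OP's real key columns are named differently in the two data frames, base merge can still be used via its by.x/by.y arguments. A minimal sketch (the names Gender and Years are hypothetical placeholders for whatever dfB actually calls Sex and Age):
# join dfA to dfB on two key columns with different names
dfC <- merge(dfA, dfB[, c("Gender", "Years", "X1", "X2")],
             by.x = c("Sex", "Age"), by.y = c("Gender", "Years"))
names(dfC)[names(dfC) == "X1"] <- "Y1"  # rename the extracted columns
names(dfC)[names(dfC) == "X2"] <- "Y2"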

How to approach loop with increasing variable name in R

My dataset is currently a set of answers to twenty questions with 300 observations.
Each of the questions is labeled q1, q2, q3, etc.
Each observation gives a 1 to 10 response.
The code below is what I have. What I want is for the q1 to change when the counter changes in R.
totaltenq1 <- sum(UpdatedQualtrix$tenq1)
totalnineq1 <- sum(UpdatedQualtrix$nineq1)
totaleightq1 <- sum(UpdatedQualtrix$eightq1)
totalsevenq1 <- sum(UpdatedQualtrix$sevenq1)
totalsixq1 <- sum(UpdatedQualtrix$sixq1)
totalfiveq1 <- sum(UpdatedQualtrix$fiveq1)
totalfourq1 <- sum(UpdatedQualtrix$fourq1)
totalthreeq1 <- sum(UpdatedQualtrix$threeq1)
totaltwoq1 <- sum(UpdatedQualtrix$twoq1)
totaloneq1 <- sum(UpdatedQualtrix$oneq1)
totaltenq2 <- sum(UpdatedQualtrix$tenq2)
totalnineq2 <- sum(UpdatedQualtrix$nineq2)
totaleightq2 <- sum(UpdatedQualtrix$eightq2)
totalsevenq2 <- sum(UpdatedQualtrix$sevenq2)
totalsixq2 <- sum(UpdatedQualtrix$sixq2)
totalfiveq2 <- sum(UpdatedQualtrix$fiveq2)
totalfourq2 <- sum(UpdatedQualtrix$fourq2)
totalthreeq2 <- sum(UpdatedQualtrix$threeq2)
totaltwoq2 <- sum(UpdatedQualtrix$twoq2)
totaloneq2 <- sum(UpdatedQualtrix$oneq2)
I would like to have code that is
count <- 20
for (i in 1:count) {
  totaltenq(i) <- sum(UpdatedQualtrix$tenq(i))
  totalnineq(i) <- sum(UpdatedQualtrix$nineq(i))
  # etc.
}
That way, when I do it again in the future, I can tell R how many questions it has the next time and it will change it. That way I don't have 10,000 lines of code from copying and pasting my code 20 times.
I don't think you need any loops at all. It all depends on how you want to store those values. I'm a big fan of not having more variables than necessary.
Here's some sample data. I'll just make 10 rows (observations) with values 1-5.
set.seed(15)
Q<-3
numbs<-c("one","two","three","four","five","six","seven","eight","nine","ten")
qs<-paste0("q",1:Q)
qnumbs <- outer(numbs, qs, paste0)
UpdatedQualtrix <-data.frame(ID=1:10,
matrix(sample(1:5, 10*length(numbs)*Q, replace=T), nrow=10))
colnames(UpdatedQualtrix) <- c("ID",qnumbs)
Now I can sum up each of the columns with
( Qsums<-colSums(UpdatedQualtrix[, qnumbs]) )
# oneq1 twoq1 threeq1 fourq1 fiveq1 sixq1 sevenq1 eightq1 nineq1 tenq1
# 37 35 29 26 32 39 40 33 40 26
# oneq2 twoq2 threeq2 fourq2 fiveq2 sixq2 sevenq2 eightq2 nineq2 tenq2
# 37 31 19 29 25 38 36 35 28 27
# oneq3 twoq3 threeq3 fourq3 fiveq3 sixq3 sevenq3 eightq3 nineq3 tenq3
# 37 30 31 31 24 31 29 31 25 41
And if we want the totals per question we can do
sapply(qs, function(a, b) sum(Qsums[paste0(b,a)]), b=numbs)
# q1 q2 q3
# 337 305 310
Or if we want the counts per response we can do
sapply(numbs, function(a, b) sum(Qsums[paste0(a,b)]), b=qs)
# one two three four five six seven eight nine ten
# 111 96 79 86 81 108 105 99 93 94
You might want to also consider melting your data since it's so structured. You can use the reshape2 library to help. You can do
require(reshape2)
mm <- melt(UpdatedQualtrix, id.vars="ID")
mm <- cbind(mm[,-2], colsplit(mm$variable, "q", c("resp","q")))
mm$resp <- factor(mm$resp, levels=numbs)
to turn your data into a "tall" format, so each value has its own row with a column for ID, value, response and question.
str(mm)
# 'data.frame': 300 obs. of 4 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ value: int 4 1 5 4 2 5 5 2 4 5 ...
# $ resp : Factor w/ 10 levels "one","two","three",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ q : int 1 1 1 1 1 1 1 1 1 1 ...
And then we can more easily do other calculations. If you want the total scores by question, you could do
aggregate(value~q, mm, sum)
# q value
# 1 1 337
# 2 2 305
# 3 3 310
If you wanted the average value for each question/response you could do
with(mm, tapply(value, list(q,resp), mean))
# one two three four five six seven eight nine ten
# 1 3.7 3.5 2.9 2.6 3.2 3.9 4.0 3.3 4.0 2.6
# 2 3.7 3.1 1.9 2.9 2.5 3.8 3.6 3.5 2.8 2.7
# 3 3.7 3.0 3.1 3.1 2.4 3.1 2.9 3.1 2.5 4.1
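For what it's worth, the melt step can also be written with the newer tidyr; a sketch using pivot_longer with a names_pattern (assumes tidyr and dplyr are installed):
library(tidyr)
library(dplyr)
long <- UpdatedQualtrix %>%
  pivot_longer(-ID,
               names_to = c("resp", "q"),
               names_pattern = "(.+)q(.+)",  # "oneq1" -> resp "one", q "1"
               values_to = "value")
long %>% group_by(q) %>% summarise(total = sum(value))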

Selecting top finite number of rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, c. There are multiple rows for each unique value of column a, and I want to select the top 5 rows for each unique value of column a. Column c holds some value and the data frame is already sorted by it in descending order, so that would not be a problem. Can anyone please suggest how I can do this in R?
Stealing #ptocquin's example, here's how you can use base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element, subsetting the first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try:
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
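These days the same "top 5 per group" is often written with dplyr; a sketch, assuming (as in the question) the data frame is already sorted by c in descending order:
library(dplyr)
data %>%
  group_by(a) %>%
  slice_head(n = 5) %>%  # first 5 rows of each group, keeping the sort order
  ungroup()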
