Dividing all possible rows within a given sub-data in R

My data looks like this:
set <- c(1,1,1,2,2,3,3,3,3,3,4,4)
density <- c(1,3,3,1,3,1,1,1,3,3,1,3)
counts <- c(100,2,4,76,33,12,44,13,54,36,65,1)
data <- data.frame(set,density,counts)
data$set <- as.factor(data$set)
data$density <- as.factor(data$density)
Within a given set there are two density levels, "1" and "3". For a given set, I want to divide every count at density "1" by every count at density "3". I then want to print the set, the original count at density "1", and the ratio.
For example, the result for the first few rows should look like:
set counts ratio
1 100 50 #100/2
1 100 25 #100/4
2 76 2.3 #76/33
3 12 0.22 #12/54
3 12 0.33 #12/36
3 44 0.8148 #44/54
...
I thought I could achieve this with dplyr, but it seems a little too complicated for dplyr.

It looks like the comments get you most of the way there. Here's a dplyr solution: with left_join, each density-1 row gets matched up with all density-3 rows in the same set, producing output in line with your specification.
# Edited below to use dplyr syntax; my base syntax had a typo
library(dplyr)
data_combined <- data %>%
  filter(density == 1) %>%
  # Match each density-1 row with every density-3 row in the same set
  left_join(data %>% filter(density == 3), by = "set") %>%
  mutate(ratio = counts.x / counts.y) %>%
  select(set, counts.x, counts.y, ratio)
data_combined
# set counts.x counts.y ratio
#1 1 100 2 50.0000000
#2 1 100 4 25.0000000
#3 2 76 33 2.3030303
#4 3 12 54 0.2222222
#5 3 12 36 0.3333333
#6 3 44 54 0.8148148
#7 3 44 36 1.2222222
#8 3 13 54 0.2407407
#9 3 13 36 0.3611111
#10 4 65 1 65.0000000
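For comparison, here's a base R sketch of the same self-join using merge(); the d1/d3 names are just illustrative, and the counts.x/counts.y suffixes mirror the dplyr output:
d1 <- data[data$density == 1, c("set", "counts")]
d3 <- data[data$density == 3, c("set", "counts")]
# merge() pairs every density-1 count with every density-3 count per set;
# it is an inner join, which is equivalent here since every set has both densities
combined <- merge(d1, d3, by = "set")
combined$ratio <- combined$counts.x / combined$counts.y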

Related

R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently?

I have a large dataset (~1M rows) with, among other columns, a score for each customer record; the score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate example customer data
library(dplyr)   # for tibble() and %>%
library(purrr)   # for map()
library(tidyr)   # for unnest()

set.seed(1)
n_customers <- 10
customer_df <- tibble(id = 1:n_customers,
                      score = sample(50:80, n_customers, replace = TRUE))
# Rating map: a score below a given `max` (and at or above the previous one) gets the paired rating
rating_map <- tibble(
  max = c(47.0, 53.0, 57.0, 60.5, 63.0, 65.5, 67.3, 69.7,
          71.7, 74.0, 76.3, 79.0, 82.5, 85.5, 100.0),
  rating = 15:1
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <- customer_df %>%
  mutate(rating = map(.x = score,
                      .f = ~ max(select(filter(rating_map, .x < max), rating)))) %>%
  unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient: it filters the whole rating table once per customer row, with all the per-call overhead that entails. If you set n_customers to 100k in the above code, you can get a sense of how long it takes.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency to be my use of the purrr::map() function. So my question is: how can I replicate the above results without using map()?
Thanks!
It seems like this is a good use case for the base R cut() function, which assigns values to a set of intervals you provide. From the documentation:
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks:
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster: about 1/20th of a second on 1M rows, which works out to a speedup of more than 72,000x.
EDIT: added right = FALSE because you want the intervals closed on the left and open on the right. The output now matches yours exactly; previously the results differed when a score fell exactly on a break.
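A quick illustration of the boundary behaviour, using a made-up score of exactly 74 (one of the breaks):
cut(74, breaks = c(71.7, 74, 76.3), labels = FALSE)                 # 1: 74 falls in (71.7, 74]
cut(74, breaks = c(71.7, 74, 76.3), labels = FALSE, right = FALSE)  # 2: 74 falls in [74, 76.3)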
We could do a non-equi join. With mult = "first", each customer row gets matched to the first rating_map row whose max exceeds its score; since rating_map is sorted by max, that is the tightest bound and hence the right rating.
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
Output:
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is findInterval:
customer_df$rating <- nrow(rating_map) -
  findInterval(customer_df$score, rating_map$max)
Output:
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
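To verify the speed claims yourself, here is a rough timing sketch; the 1M-row vector and the microbenchmark dependency are illustrative additions (assuming rating_map as defined in the question):
library(microbenchmark)
big_scores <- sample(50:80, 1e6, replace = TRUE)
microbenchmark(
  cut_based  = length(rating_map$max) -
    cut(big_scores, breaks = rating_map$max, labels = FALSE, right = FALSE),
  find_based = nrow(rating_map) - findInterval(big_scores, rating_map$max),
  times = 10
)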

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has completed some number of projects (n_Projects) with an overall Performance expressed as a percentage:
df <- data.frame(Boss = sample(1:3, 20, replace = TRUE),
                 Employee = sample(1:20, 20),
                 n_Projects = sample(50:100, 20, replace = TRUE),
                 Performance = round(sample(1:100, 20, replace = TRUE) / 100, 2),
                 stringsAsFactors = FALSE)
df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to be weighted by its n_Projects value over the total n_Projects for that boss.
For example, Boss 1 has a total of 604 n_Projects; row 1 (Employee 1) contributes a weighted Performance of 0.13 (78/604 * 0.97 = 0.13), row 2 (Employee 3) a weighted Performance of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these weighted Performances is the boss's performance, which for Boss 1 is 0.52. So, the final output should look like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
In addition to solving this problem, could you give me any feedback on my code, or any recommendations for improving my data-manipulation skills? (You can see from my profile that I have asked a lot of questions like this, but I am still not able to solve them on my own.)
We can take the sum of the product of 'n_Projects' and 'Performance' and divide it by 'total_projects':
library(dplyr)
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Weight_Project = sum(n_Projects * Performance) / total_projects)
# or, with a matrix product:
# Weight_Project = n_Projects %*% Performance / total_projects
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and about #akrun's answer: you must have received the following error message:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
##   Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you made for Weight_Project does not yield a single value per Boss, but one value per row (7 for the first group). summarise is there to collapse several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you never summarise it into a single value.
Assuming that what you had in mind was first calculating the weight of each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
  group_by(Boss) %>%
  mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
  summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the total number of rows in df, but sum(n_Projects) is calculated separately for each Boss value thanks to group_by.
Once each row has a project weight (which depends on its boss), you can compute the weighted mean, which is a genuine summary value, with summarise.
A more compact way that still makes the weighting explicit would be:
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Reordering to minimise parentheses, which gives #akrun's answer:
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
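Worth noting, as an aside: base R ships a weighted.mean() function that performs exactly this calculation (sum(w * x) / sum(w)), so the summarise step can also be written as:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            weighted_mean_performance = weighted.mean(Performance, n_Projects))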

R - Subtracting the mean of a group from each element of that group in a dataframe

I am trying to merge a vector of group means into a dataframe. My dataframe is called growth.
I first calculated the means for the different groups (1 group = Population + Temperature + Size + Replicat) using this command:
means <- aggregate(TL ~ Population + Temperature + Replicat + Size + Measurement,
                   data = growth, FUN = mean)
Then I selected the means for Measurement 1, since those are the only ones I am interested in:
meansT0 <- means[which(means$Measurement == "1"), ]
Now I would like to merge these mean values into my dataframe (growth) so that the right mean of each group lines up with the right part of the dataframe.
The goal is then to subtract each group's mean (at Measurement 1) from each element of the dataframe according to its group, for all Measurements other than Measurement 1. Maybe there is no need to add a means column to the dataframe? Do you know any command to do that?
Edit (27.06.18): I made up the simplified dataframe below; I hope it helps.
What I want is to subtract, for each individual in the dataframe and for each measurement (here Measurements 1 to 3; normally I have more), the mean of its group at Measurement 1.
So, if I get the means by group (1 group = Population + Temperature + Measurement):
means <- aggregate(TL ~ Population + Temperature + Measurement, data = growth, FUN = mean)
means
I got these mean values (in this example):
Population Temperature Measurement       TL
JUB        15          1           12.00000  <- mean of interest
JUB        20          1           15.66667  <- mean of interest
JUB        15          2           17.66667
JUB        20          2           18.66667
JUB        15          3           23.66667
JUB        20          3           24.33333
We are only interested in the means at Measurement 1. For each individual in the dataframe, I want to subtract the mean of its group at Measurement 1; in this example (see the dataframe built with the R command below):
- for the group JUB+15 at Measurement 1, mean = 12
- for the group JUB+20 at Measurement 1, mean = 15.66
growth <- data.frame(
  Population  = rep("JUB", 18),
  Measurement = rep(c("1", "2", "3"), each = 6),
  Temperature = rep(rep(c("15", "20"), each = 3), times = 3),
  TL = c(11, 12, 13, 15, 18, 14, 16, 17, 20, 21, 19, 16, 25, 22, 24, 26, 24, 23),
  New_TL = c("11-12", "12-12", "13-12", "15-15.66", "18-15.66", "14-15.66",
             "16-12", "17-12", "20-12", "21-15.66", "19-15.66", "16-15.66",
             "25-12", "22-12", "24-12", "26-15.66", "24-15.66", "23-15.66")
)
print(growth)
I hope this makes clearer what I am trying to do. I have a lot of data, and doing this manually would take a long time and increase the risk of mistakes.
Here is an option with tidyverse. After grouping by the grouping columns (leaving Measurement out, since the reference mean is taken from each group's Measurement-1 rows), use mutate_at, specifying the columns of interest, and take the difference between each column (.) and the mean of its Measurement-1 values.
library(tidyverse)
growth %>%
  group_by(Population, Temperature, Replicat, Size) %>%
  mutate_at(vars(HL, TL),
            funs(MeanGroupDiff = . - mean(.[Measurement == 1])))
Using a reproducible example with the mtcars dataset:
data(mtcars)
mtcars %>%
  group_by(cyl, vs) %>%
  mutate_at(vars(mpg, disp), funs(MeanGroupDiff = . - mean(.[am == 1])))
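As an aside (my note, not part of the original answer): funs() has since been deprecated in dplyr, and on current versions the same idea can be written with across():
mtcars %>%
  group_by(cyl, vs) %>%
  mutate(across(c(mpg, disp),
                ~ .x - mean(.x[am == 1]),
                .names = "{.col}_MeanGroupDiff"))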
Have you considered using the data.table package? It is very well suited to the kind of grouping, filtering, joining, and aggregation operations you describe, and might save you a great deal of time in the long run.
The code below shows how a workflow similar to the one you described, but based on the built-in mtcars data set, might look using data.table.
To be clear, there are also ways to do what you describe using base R as well as other packages like dplyr; I'm just throwing out a suggestion based on what I have found most useful in my own work.
library(data.table)
## Convert mtcars to a data.table
## only include columns `mpg`, `cyl`, `am` and `gear` for brevity
DT <- as.data.table(mtcars)[, .(mpg, cyl, am, gear)]
## Take a subset where `cyl` is equal to 6
DT <- DT[cyl == 6]
## Calculate grouped mean based on `gear` and `am` as grouping variables
DT[, group_mpg_avg := mean(mpg), keyby = .(gear, am)]
## Calculate each row's difference from the group mean
DT[, mpg_diff_from_group := mpg - group_mpg_avg]
print(DT)
# mpg cyl am gear group_mpg_avg mpg_diff_from_group
# 1: 21.4 6 0 3 19.75 1.65
# 2: 18.1 6 0 3 19.75 -1.65
# 3: 19.2 6 0 4 18.50 0.70
# 4: 17.8 6 0 4 18.50 -0.70
# 5: 21.0 6 1 4 21.00 0.00
# 6: 21.0 6 1 4 21.00 0.00
# 7: 19.7 6 1 5 19.70 0.00
Consider by to subset your data frame by the grouping factors (leaving Measurement out, so that each subset contains the Measurement-1 rows alongside all the others). Then run an ifelse calculation for the needed columns. Since by returns a list of data frames, bind them all back together outside with do.call():
df_list <- by(growth, growth[, c("Population", "Temperature")], function(sub) {
  # TL correction: subtract the group's Measurement-1 mean from other measurements
  sub$Correct_TL <- ifelse(sub$Measurement != 1,
                           sub$TL - mean(subset(sub, Measurement == 1)$TL),
                           sub$TL)
  # ADD OTHER CORRECTIONS HERE
  return(sub)
})
final_df <- do.call(rbind, df_list)
Output (using posted data)
final_df
# Population Measurement Temperature TL New_TL Correct_TL
# 1 JUB 1 15 11 11-12 11.0000000
# 2 JUB 1 15 12 12-12 12.0000000
# 3 JUB 1 15 13 13-12 13.0000000
# 7 JUB 2 15 16 16-12 4.0000000
# 8 JUB 2 15 17 17-12 5.0000000
# 9 JUB 2 15 20 20-12 8.0000000
# 13 JUB 3 15 25 25-12 13.0000000
# 14 JUB 3 15 22 22-12 10.0000000
# 15 JUB 3 15 24 24-12 12.0000000
# 4 JUB 1 20 15 15-15.66 15.0000000
# 5 JUB 1 20 18 18-15.66 18.0000000
# 6 JUB 1 20 14 14-15.66 14.0000000
# 10 JUB 2 20 21 21-15.66 5.3333333
# 11 JUB 2 20 19 19-15.66 3.3333333
# 12 JUB 2 20 16 16-15.66 0.3333333
# 16 JUB 3 20 26 26-15.66 10.3333333
# 17 JUB 3 20 24 24-15.66 8.3333333
# 18 JUB 3 20 23 23-15.66 7.3333333
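If you prefer the merge-based route you originally asked about, here is a base R sketch (the TL_mean1, growth2, and Centred_TL names are mine; note that unlike the by() answer it also centres the Measurement-1 rows themselves):
# Per-group means of TL at Measurement 1
meansT0 <- aggregate(TL ~ Population + Temperature,
                     data = subset(growth, Measurement == "1"), FUN = mean)
names(meansT0)[names(meansT0) == "TL"] <- "TL_mean1"
# Attach each group's baseline mean to every row, then subtract it
growth2 <- merge(growth, meansT0, by = c("Population", "Temperature"))
growth2$Centred_TL <- growth2$TL - growth2$TL_mean1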

Removing certain values from the dataframe in R

I am not sure how to do this, but I need to cluster the dataframe mydf below, omitting the Inf (infinite) values and any values greater than 50. How can I get a table that contains no Inf values and no values greater than 50 (perhaps by nullifying those cells)? The clustering itself is not a problem, as I can do it with the mfuzz package; the only problem is that I want to scale the cluster values within the 0-50 range.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45
You can use NA, the built-in missing-data indicator in R (see ?NA), by doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Anything you do downstream in R should have NA-handling methods, even if it's just na.omit.
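For instance, with mydf as above, you can either drop incomplete rows or skip NAs per computation (a minimal sketch):
clean_df <- na.omit(mydf)         # keep only rows with no NA in any column
colMeans(mydf[-1], na.rm = TRUE)  # or ignore NAs within a calculation (dropping the s.no column)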

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for some data I have, and I would like to print only those panels that satisfy a condition I put inside the call to panel.qq(x, y, ...).
Let me give you an example. The following is my code:
qq(y ~ x | cond, data = test.df,
   panel = function(x, y, subscripts, ...) {
     if (length(unique(test.df[subscripts, 2])) > 3) {
       panel.qq(x, y, subscripts, ...)
     }
   })
Here y is the factor, x is the variable plotted on the axes, and cond is the conditioning variable. What I would like is for only those panels to be printed that pass the condition in the panel function, namely
if (length(unique(test.df[subscripts, 2])) > 3).
I hope this information helps. Thanks in advance.
Added sample data:
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to have no panel for cond 123, as the number of unique x values for 123 is 3, while for the others it is 4. Thanks again.
Yeah, I think this is a subsetting problem, not a lattice one. It looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[, N := .N, by = c2][N > 3]
where c2 is the variable in the second column. The last line of code first adds a variable, N, with the count of rows (.N) for each value of c2, then subsets for N > 3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
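Note, as an editorial addition: given the sample data posted above, the condition is really about the number of unique x values per cond rather than the raw row count; in data.table that would be something like
test.dt.subset <- test.dt[, N := uniqueN(x), by = cond][N > 3]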
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x = 1:15, y = 1:15 %% 2,   # example data frame
                c2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5))
d$N <- 1                                              # create a column for the count
split(d$N, d$c2) <- lapply(split(d$x, d$c2), length)  # populate it with the counts
d
d[d$N > 3, ]  # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df.
test.df.list <- split(test.df, test.df$cond, drop = FALSE)
final.test.df <- do.call("rbind", lapply(test.df.list, function(r) {
  if (length(unique(r$x)) > 3) r
}))
So, here I am breaking test.df into a list of data frames by the conditioning variable. Then, in the lapply, I check the number of unique x values in each subset data frame; if that number is greater than 3 the data frame is returned, otherwise it is skipped (lapply yields NULL for it, which rbind ignores). Finally, do.call binds all the data frames back into one big data frame to run the quantile-quantile plot on.
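As an aside (my addition, not part of the original answer), the split/lapply step can be written more compactly in base R with ave(), assuming test.df as in the sample data:
# TRUE for rows whose cond group has more than 3 unique x values
keep <- ave(test.df$x, test.df$cond, FUN = function(v) length(unique(v))) > 3
final.test.df <- test.df[keep, ]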
In case anyone wants to know the qq function call after getting the specific data, it is:
trellis.device(postscript, file = "test.ps", color = FALSE, horizontal = TRUE, paper = "legal")
qq(y ~ x | cond, data = final.test.df, layout = c(1, 1), pch = ".", cex = 3)
dev.off()
Hope this helps.
