R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently? - r

I have a large dataset ~1M rows with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based the customer's score.
# Generate Example Customer Data
set.seed(1)
n_customers <- 10
customer_df <-
tibble(id = c(1:n_customers),
score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
max = c(
47.0,
53.0,
57.0,
60.5,
63.0,
65.5,
67.3,
69.7,
71.7,
74.0,
76.3,
79.0,
82.5,
85.5,
100.00
),
rating = c(15:1)
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <-
customer_df %>%
mutate(rating = map(.x = score,
.f = ~max(select(filter(rating_map, .x < max),rating))
)
) %>%
unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient. If you set n = 100k in the above code, you can get a sense of how long it takes to work.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency in the code to be my use of the purrr::map() function. So my question is how I could replicate the above results without using the map() function?
Thanks!

customer_df$rating <- length(rating_map$max) -
cut(score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster. It takes 1/20th of a second on 1M rows, which sounds like >72,000x speedup.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.

We could do a non-equi join
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9

Related

Mapping dataframe column values to a n by n matrix

I'm trying to map column values of a data.frame object (consisting of large number of bilateral trade data among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flows between any two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is reporter id, pid is (trade) partner id, a country's rid and pid are the same. The same id(s) in the rid column are matched with multiple rows in the pid column in terms of TradeValue.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their id(s) are absent in the rid column (such as country 1). On the other hand, those country id(s) may enter into pid column through other countries' reporting (in which case, the reporters tend to be developed countries). Hence, the rid column only contains some of the country id (only 139 out of 161), while the pid column has all 161 country id.
What I'm attempting to do is to map this example_data dataframe to a 161 x 161 adjacency matrix using rid for row and pid for column where each cell represent the TradeValue between any two country id. To this end, there are a couple things I need to tackle with:
Fill in those country id(s) that are missing in the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
By previous step, impute those "0" cells using bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but on top of my mind, I could not figure out how to. It will be really appreciated if someone can help me on this.
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The secrets/ tricks:
use factor variables to tell R what values are all possible as well as the order.
in data.tables dcast use fill = 0 (fill zero where you have nothing), drop = FALSE (make entries for factor levels that aren't observed)

R sampling with if statement and similar number of sample

I need to to create a sample from my dataframe and to do so I am using the code bellow.
name <- sample(c("Adam","John","Henry","Mike"),100,rep = TRUE)
area <- sample(c("run","develop","test"),100,rep = TRUE)
id <- sample(100:200,100,rep = FALSE)
mydata <- as.data.frame(cbind(id,area,name))
qcsample <- mydata %>%
group_by(area) %>%
nest() %>%
mutate(n = c(20, 15, 15)) %>%
mutate(samp = map2(data, n, sample_n)) %>%
select(area, samp) %>%
unnest()
Now, I am getting these results.
table(qcsample$area)
develop run test
15 15 20
--
table(qcsample$name)
Adam Henry John Mike
9 9 16 16
I would like to create a sample that would have more or less the same number of samples for each name eg. Adam - 12, Henry - 12, John - 13, Mike - 13.
How can I achieve that ? can I somehow request that the sample is equally distributed ?
Also, in this example I used function
sample_n
and specified number of samples.
I am anticipating that sometimes there will not be required number from a given group. In my example I am taking 20 samples from area called "test" but sometimes there will be only let's say 10 rows containing "test". The total number is 50 so I need to make sure if there are only 10 "test" the code has to automatically increase the others, so the sample would be "test" - 10, "run" - 20 and "develop" - 20. This can happen to any of the area so I need to test if there is enough rows to create the sample and increase other areas. If there is only 1 it can be added to any of the remaining areas or if the difference is 3 we add 1 to one area and 2 to the another one.
How could I check that taking into account all the possibilities ? I believe there are eight permutations in this case.
Thanks in advance A.
If you are using made up data then you can create a minimum amount of each row and then create filler to get you up to the total:
set.seed(42)
names <- c("Adam", "John", "Henry", "Mike")
areas <- c("run", "develop", "test")
totalrows <- 100
minname <- 22 # No less than 20 of each name (set to near threshold to test)
minarea <- 30 # No less than 30 of each area (less randomness the higher these are)
qcsample <- data.frame(
name=sample(c(rep(names, minname), sample(names, totalrows-length(names)*minname, replace=T))),
area=sample(c(rep(areas, minarea), sample(areas, totalrows-length(areas)*minarea, replace=T))),
id=sample(99+(1:totalrows))
)
This results in:
R> table(qcsample$name)
Adam Henry John Mike
23 28 24 25
R> table(qcsample$area)
develop run test
37 31 32
Notice that the count of name to area isn't constrained:
R> table(qcsample[,-3])
area
name develop run test
Adam 5 11 7
Henry 11 8 9
John 10 7 7
Mike 11 5 9
R>
Using a loop as suggested by #r2evans:
library(dplyr)
set.seed(42)
mydata <- data.frame(
name = sample(c("Adam","John","Henry","Mike"), 100, rep = TRUE),
area = sample(c("run","develop","test"), 100, rep = TRUE),
id = sample(100:200, 100, rep = FALSE)
)
Nsamples <- 50
mysample <- data.frame(sample_n(mydata, Nsamples))
minname <- 11 # max is 50/4 -> 12
minarea <- 15 # max is 50/3 -> 16
# the test you were asking about
while( (min(table(mysample$name)) < minname) || (min(table(mysample$area)) < minarea) ) {
mysample <- data.frame(sample_n(mydata, Nsamples))
}
This results in:
R> table(mysample$name)
Adam Henry John Mike
13 15 11 11
R> table(mysample$area)
develop run test
15 17 18
And, like before, there's no minimum of name to area.
R> table(mysample[-3])
area
name develop run test
Adam 4 3 6
Henry 2 6 7
John 4 4 3
Mike 5 4 2
If you needed to enforce an minimum number for each permutation add this to the test:
while(... || (min(table(mysample[-3])) < some_min)) {
BTW, the number of permutations, as you can see by the table, is the number of names times the number of areas.
Here's another thought.
Depending on your desired end-size, it might over-create the number of samples so that it can reduce some name/area pairs to bring the total down.
Let's say that you want to end up with a total of 50 rows:
final_size <- 50
For completeness, here are the sets from which we'll choose:
avail_names <- c("Adam", "John", "Henry", "Mike")
avail_areas <- c("run", "develop", "test")
and the minimum we need to create for Adam,run (etc) in order to certainly end up with no less than final_size rows:
size_per_namearea <- ceiling(final_size / (length(avail_names) * length(avail_areas)))
Ok, generate at least as many (likely more than) the number of rows we need:
set.seed(20180920)
qcsample <- crossing(data_frame(rownum = seq_len(size_per_namearea)),
data_frame(name = avail_names),
data_frame(area = avail_areas)) %>%
group_by(name, area) %>%
mutate(id = sample(100, size = n(), replace = FALSE))
qcsample
# # A tibble: 60 x 4
# # Groups: name, area [12]
# rownum name area id
# <int> <chr> <chr> <int>
# 1 1 Adam run 59
# 2 1 Adam develop 51
# 3 1 Adam test 23
# 4 1 John run 71
# 5 1 John develop 5
# 6 1 John test 24
# 7 1 Henry run 4
# 8 1 Henry develop 29
# 9 1 Henry test 79
# 10 1 Mike run 77
# # ... with 50 more rows
Verify we have identical sample sizes for each name/area:
xtabs(~ name + area, data = qcsample) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 5 5 5 15
# Henry 5 5 5 15
# John 5 5 5 15
# Mike 5 5 5 15
# Sum 20 20 20 60
If we just do head(final_size), then we know which names we will be cutting short, which undermines the randomness of your sampling a little. The reason I added rownum up front was so that I can arrange by it plus a jitter, ensuring I get all of max(rownum)-1, and then some sampling of max(rownum), guaranteeing that each name/area pair have either max(rownum)-1 or max(rownum) rows; your tallies are never different by more than 1.
reducedsample <- arrange(qcsample, rownum + runif(n())) %>%
head(final_size) %>%
select(-rownum)
reducedsample %>%
xtabs(~ name + area, data = .) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 4 4 5 13
# Henry 5 4 4 13
# John 4 4 4 12
# Mike 4 4 4 12
# Sum 17 16 17 50

Frequency distribution using binCounts

I have a dataset of Ages for the customer and I wanted to make a frequency distribution by 9 years of a gap of age.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to below-shared table, variable names can be differed(as you wish)
Could I use binCounts code into it ? if yes could you help me out using the code as not sure of bx and idxs in this code?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about the binCounts or even the package it is in but i have a bare r function:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,101)
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")#I add one to have 38 47 etc
group=cut(Ages,lowerlimit,Labels)#Determine which group the ages belong to
tab=table(group)#Form a frequency table
as.data.frame(tab)# transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))

Merge with replacement based on multiple non-unique columns

I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have an unique id column to help me match them. I have an x column and I have a y column which, combined, can make up a unique id.
My question is this: What is an elegant way of merging these two data sets, replacing the values in the original dataframe with the values in the "differenced" data frame whose x and y coordinates match.
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids. But the two columns are non unique.
One solution would be to treat them as strings or tuples and combine them to one column as a coordinate pair, and then use %in%.
But I was curious whether there were any solution to this problem involving indexing with boolean vectors. Any suggestions?
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
The match function is used to generate indices. Needs a nomatch argument to prevent NA on the left hand side of data.frame.[<-. I don't think it is as transparent as a merge followed by replace, but I'm guessing it will be faster:
original[ match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)] ,
"value"] <-
update[ which( match(update$x, original$x) == match(update$y, original$y)),
"value"]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8

Adding extreme value distributed noise (with µ=0,σ=10) to a vector of numbers in R

I have the following matrix
Measurement Treatment
38 A
14 A
54 A
69 A
20 B
36 B
35 B
10 B
11 C
98 C
88 C
14 C
I want to add extreme value distributed noise (with mean=0 and sd=10) to the Measurement values. How can I achieve that in R?
I found revd in extRemes package, but it does not work as expected. Does devd from the same package do what I want to do? (but it does not allow for mean and sd to be defined)
If you want to use your measure as the mean for the noise, then you can do this:
measure = round(runif(10,0,30),0)
data = data.frame(measure)
for(i in 1:nrow(data)){
data$measure1[i] = rnorm(1,data$measure[i],10)
}
data
measure measure1
1 6 6.281557
2 12 -5.780177
3 18 13.529773
4 26 33.665584
5 14 12.666614
6 24 41.146132
7 5 -1.850390
8 14 16.728703
9 13 26.082601
10 13 14.066475
EDIT: You can avoid the for loop with this instead:
data$measure1 = data$measure + rnorm(1,0,10)

Resources