Disaggregate coverage area data based on a ranking preference in R

I have 4G mobile coverage data at the Local Authority level in the UK, as a percentage of geographical area covered (for approximately 200 areas). I want to disaggregate this data so I can work with roughly 9,000 lower-level postcode sectors.
The most appropriate way for me to do this is to allocate 4G geographic coverage to the most densely populated areas first, as this best represents how mobile operators invest in the market. The least populated areas would end up with no coverage. However, I'm struggling with how to do this in R.
I have a data frame that looks like this for the postcode sector data (I've used hypothetical data here):
Name pcd.sect pop area pop.dens rank
Cambridge 1 5546 0.6 8341 1
Cambridge 2 7153 1.1 5970 2
Cambridge 3 5621 2.3 5289 3
Cambridge 4 10403 4.3 4361 4
Cambridge 5 14796 4.2 3495 5
...
I then took the aggregate local authority data and added it to each row (the three rightmost columns):
Name pcd.sect pop area pop.dens rank LA.4G LA.area LA.4G(km2)
Cambridge 1 5546 0.6 8341 1 58 140 82
Cambridge 2 7153 1.1 5970 2 58 140 82
Cambridge 3 5621 2.3 5289 3 58 140 82
Cambridge 4 10403 4.3 4361 4 58 140 82
Cambridge 5 14796 4.2 3495 5 58 140 82
...
I had to shorten the headings, so let me just explain them in more detail:
Name - Local Authority name
pcd.sect - postcode sector (the lower-level unit)
pop - the population in the postcode sector
area - surface area of the postcode sector in km2
pop.dens - population density of the postcode sector in persons per km2
rank - rank of the postcode sector based on population density within each local authority
LA.4G - % coverage of the local authority with 4G
LA.area - the sum of the area column for each local authority
LA.4G(km2) - the number of km2 with 4G coverage within each local authority
Taking Cambridge as a hypothetical example, there is 58% 4G coverage across the whole Local Authority. I then want to disaggregate this number to achieve 4G coverage for the individual postcode sectors.
Ideally the data would end up looking like this, with an extra column for the postcode sector coverage:
Name pcd.sect ... pcd.sector.coverage (%)
Cambridge 1 ... 100
Cambridge 2 ... 100
Cambridge 3 ... 100
Cambridge 4 ... 34
Cambridge 5 ... 0
... ... ... ...
How do I get R to allocate this 82km2 (58% geographical coverage) out to the postcode sectors in a new column, based on the area column, but then stopping once it hits the maximum coverage level of 82km2 (58% geographical coverage)?

This is how I interpret the question; correct me if this is not what you meant.
Suppose you have the following data.
dat <- data.frame(
  Name = "A", pcd.sector = 1:5,
  area = c(2, 3, 1, 5, 3),
  areaSum = 14, LA.4G = 8
)
dat
# Name pcd.sector area areaSum LA.4G
#1 A 1 2 14 8
#2 A 2 3 14 8
#3 A 3 1 14 8
#4 A 4 5 14 8
#5 A 5 3 14 8
You have five sectors with various areas. Although the areas sum to 14, only 8 are covered by 4G. You want to allocate that coverage to sectors 1 through 5 in order.
The following code does the job. I used the cumsum function to compute the cumulative sum of areas from the top-ranked sector, capped at the 4G coverage limit with pmin. The allocated area is then recovered with the diff function, which takes the one-step difference of a vector. Sectors 1 through 3 get 100% coverage, which uses up 6 of the 8 units of area, leaving only 2. Although sector 4 has an area of 5, it receives only 2, i.e. 40%. That exhausts the coverage and nothing is left for sector 5.
dat$area_allocated <- diff(c(0, pmin(cumsum(dat$area), dat$LA.4G)))
dat$area_coverage <- dat$area_allocated / dat$area * 100
dat
# Name pcd.sector area areaSum LA.4G area_allocated area_coverage
# 1 A 1 2 14 8 2 100
# 2 A 2 3 14 8 3 100
# 3 A 3 1 14 8 1 100
# 4 A 4 5 14 8 2 40
# 5 A 5 3 14 8 0 0
If you have many local authorities, you may want to use the dplyr::group_by function.
dat <- rbind(
  data.frame(
    Name = "A", pcd.sector = 1:5,
    area = c(2, 3, 1, 5, 3),
    areaSum = 14, LA.4G = 8
  ),
  data.frame(
    Name = "B", pcd.sector = 1:3,
    area = c(4, 3, 2),
    areaSum = 9, LA.4G = 5
  )
)
library(dplyr)
dat <- dat %>%
  group_by(Name) %>%
  mutate(area_allocated = diff(c(0, pmin(cumsum(area), LA.4G)))) %>%
  mutate(area_coverage = area_allocated / area * 100)
dat
# Name pcd.sector area areaSum LA.4G area_allocated area_coverage
# <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 2 14 8 2 100.00000
# 2 A 2 3 14 8 3 100.00000
# 3 A 3 1 14 8 1 100.00000
# 4 A 4 5 14 8 2 40.00000
# 5 A 5 3 14 8 0 0.00000
# 6 B 1 4 9 5 4 100.00000
# 7 B 2 3 9 5 1 33.33333
# 8 B 3 2 9 5 0 0.00000
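Applied back to the original postcode-sector data, the same idea would look roughly like this. This is only a sketch: it assumes your data frame is called pcd (a hypothetical name) and that the "LA.4G(km2)" column has been renamed to the syntactic LA.4G.km2.
library(dplyr)
pcd <- pcd %>%
  group_by(Name) %>%
  arrange(rank, .by_group = TRUE) %>%   # densest sectors get coverage first
  mutate(area_allocated = diff(c(0, pmin(cumsum(area), LA.4G.km2))),
         pcd.sector.coverage = area_allocated / area * 100) %>%
  ungroup()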


R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently?

I have a large dataset (~1M rows) with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate Example Customer Data
library(tidyverse)  # for tibble(), the pipe, and purrr::map()/tidyr::unnest() used below
set.seed(1)
n_customers <- 10
customer_df <-
  tibble(id = 1:n_customers,
         score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
  max = c(47.0, 53.0, 57.0, 60.5, 63.0, 65.5, 67.3, 69.7,
          71.7, 74.0, 76.3, 79.0, 82.5, 85.5, 100.0),
  rating = 15:1
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <-
  customer_df %>%
  mutate(rating = map(.x = score,
                      .f = ~ max(select(filter(rating_map, .x < max), rating)))) %>%
  unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient. If you set n_customers to 100,000 in the above code, you can get a sense of how long it takes to run.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it currently takes over an hour to run. I've identified the inefficiency as my use of the purrr::map() function. So my question is: how can I replicate the above results without using map()?
Thanks!
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster: it takes about 1/20th of a second on 1M rows, which works out to a speedup of more than 72,000x.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
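As a minimal illustration of how cut() indexes intervals with right = FALSE (the breaks and scores below are made up):
# intervals are closed on the left: [47,53), [53,74), [74,100)
cut(c(52.9, 53, 74), breaks = c(47, 53, 74, 100), labels = FALSE, right = FALSE)
# [1] 1 2 3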
We could do a non-equi join
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
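If you would rather keep customer_df as it is and just attach the rating column, an update-join along these lines should also work (a sketch; it assumes both objects have been converted to data.tables):
library(data.table)
setDT(customer_df)
setDT(rating_map)
customer_df[, rating := rating_map[customer_df, on = .(max > score),
                                   mult = "first", x.rating]]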
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9

Mapping dataframe column values to a n by n matrix

I'm trying to map the column values of a data.frame object (consisting of a large number of bilateral trade records among 161 countries) onto a 161 x 161 adjacency matrix (also of class data.frame), such that each cell represents the dyadic trade flow between any two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a country's rid and pid are the same. Each id in the rid column is matched with multiple rows in the pid column, each carrying a TradeValue.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their ids are absent from the rid column (such as country 1). On the other hand, those country ids may enter the pid column through other countries' reporting (in which case the reporters tend to be developed countries). Hence, the rid column only contains some of the country ids (139 out of 161), while the pid column has all 161.
What I'm attempting to do is map this example_data dataframe to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
Fill in those country id(s) that are missing in the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
After the previous step, impute those "0" cells using the bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave the cells at 0.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head I could not figure out how. It would be really appreciated if someone could help me with this.
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The secrets/tricks:
use factor variables to tell R which values are possible, as well as their order.
in data.table's dcast, use fill = 0 (fill zero where you have nothing) and drop = FALSE (make entries for factor levels that aren't observed).
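For the OP's second step (imputing the rows of non-reporting countries from their partners' reports), one possible sketch is to symmetrise the matrix, taking the larger figure when both sides reported; this assumes the trade flows can be treated as symmetric:
library(data.table)
wide <- dcast(as.data.table(df1), rid ~ pid, fill = 0, drop = FALSE,
              value.var = "TradeValue")
m <- as.matrix(wide, rownames = "rid")
m <- pmax(m, t(m))   # copy each partner-reported value into the missing mirror cell
m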

Summing depth data (consecutive rows) in R

How is it possible to sum up consecutive depth data in R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")),
depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The given output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First the removal of negative values is done with cummax(), because depth can only increase in this special case. Hence:
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now, taking max minus min of the depth within each consecutive run, you would get the following (the question is how to do this step):
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
And finally summing up those max-min values the output is the one presented above.
Steps tried to achieve the output:
The first obvious solution would be for instance for Air:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is the output also takes into account the big steps in between each consecutive subset. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance < 2, for the Air example once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives you almost the right result but does not work as expected here. I'm pretty sure there is already a function for what I'm looking for, because this is not an uncommon problem across many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct it with rle if you really like. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"),
depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))
a2 <- a %>%
  # filter to rows where previous value is lower, equal, or NA
  filter(depth >= lag(depth) | is.na(lag(depth))) %>%
  # group by label and its run
  group_by(label, run = data.table::rleid(label)) %>%
  summarise(depth = max(depth) - min(depth))  # aggregate
a2 %>% arrange(run) # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
A base R method using aggregate is
aggregate(cbind(val = cummax(a$depth)),
          list(label = a$label, ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))
The first argument to aggregate calculates the cumulative maximum of the input vector, as the OP does above; wrapping it in cbind gives the calculated vector a name in the output. The second argument is the grouping argument; rather than using rle, it builds run IDs from the cumulative sum of label changes. Finally, the third argument provides the function that calculates the desired output by taking the difference of the range within each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
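To finish with the per-material totals shown in the question, the per-run values can be summed once more (using the question's original a, where label is a factor), e.g.:
res <- aggregate(cbind(val = cummax(a$depth)),
                 list(label = a$label,
                      ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
                 function(x) diff(range(x)))
aggregate(val ~ label, res, sum)
#   label val
# 1   Air   7
# 2 Stone   1
# 3  Wood   3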
The data.table way (borrowing in part from @alistaire):
library(data.table)
setDT(a)
a[, depth := cummax(depth)]
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1],  # only the starting and max values are needed
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]

Plotting tetrahedron with data points in R

I'm in a little bit of pain at the moment.
I'm looking for a way to plot compositional data (https://en.wikipedia.org/wiki/Compositional_data). I have four categories, so the data can be represented in a 3D simplex (since one category is always 1 minus the sum of the others).
So I have to plot a tetrahedron (whose vertices are my four categories) that contains my data points.
I've found this gist https://gist.github.com/rmaia/5439815 but its use of the pavo package (tcs, vismodel, ...) is pretty obscure to me.
I've also found something in the compositions package, with the function plot3D. But in that case an rgl device is opened (?!), and I don't really need a rotating plot, just a static plot, since I want to save it as an image and insert it into my thesis.
Update: data looks like this. Consider only columns violent_crime (total), rape, murder, robbery, aggravated_assault
     cities  violent_crime murder rape rape(legally revised) robbery aggravated_assault
1    Autauga             68      2    8                    NA       6                 52
2    Baldwin             98      0    4                    NA      18                 76
3    Barbour             17      2    2                    NA       2                 11
4    Bibb                 4      0    1                    NA       0                  3
5    Blount              90      0    6                    NA       1                 83
6    Bullock             15      0    0                    NA       3                 12
7    Butler              44      1    7                    NA       4                 32
8    Calhoun             15      0    3                    NA       1                 11
9    Chambers             4      0    0                    NA       2                  2
10   Cherokee            49      2    8                    NA       2                 37
Update: my final plot, made with the compositions package.
Here is how you can do this without a dedicated package by using geometry and plot3D. Using the data you provided:
# Load test data
df <- read.csv("test.csv")[, c("murder", "robbery", "rape", "aggravated_assault")]
# Convert absolute data to relative
df <- t(apply(df, 1, function(x) x / sum(x)))
# Compute tetrahedron coordinates according to https://mathoverflow.net/a/184585
simplex <- function(n) {
  qr.Q(qr(matrix(1, nrow = n)), complete = TRUE)[, -1]
}
tetra <- simplex(4)
# Convert barycentric coordinates (4D) to cartesian coordinates (3D)
library(geometry)
df3D <- bary2cart(tetra, df)
# Plot data
library(plot3D)
scatter3D(df3D[,1], df3D[,2], df3D[,3],
          xlim = range(tetra[,1]), ylim = range(tetra[,2]), zlim = range(tetra[,3]),
          col = "blue", pch = 16, box = FALSE, theta = 120)
lines3D(tetra[c(1,2,3,4,1,3,1,2,4),1],
        tetra[c(1,2,3,4,1,3,1,2,4),2],
        tetra[c(1,2,3,4,1,3,1,2,4),3],
        col = "grey", add = TRUE)
text3D(tetra[,1], tetra[,2], tetra[,3],
       colnames(df), add = TRUE)
You can tweak the orientation with the phi and theta arguments in scatter3D.
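Since the goal is a static figure for a thesis: plot3D draws on a regular graphics device, so the plot can be written straight to a file, for example a PNG (the filename and resolution below are just placeholders):
png("tetrahedron.png", width = 1600, height = 1600, res = 300)
scatter3D(df3D[,1], df3D[,2], df3D[,3],
          xlim = range(tetra[,1]), ylim = range(tetra[,2]), zlim = range(tetra[,3]),
          col = "blue", pch = 16, box = FALSE, theta = 120)
lines3D(tetra[c(1,2,3,4,1,3,1,2,4),1],
        tetra[c(1,2,3,4,1,3,1,2,4),2],
        tetra[c(1,2,3,4,1,3,1,2,4),3],
        col = "grey", add = TRUE)
text3D(tetra[,1], tetra[,2], tetra[,3], colnames(df), add = TRUE)
dev.off()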

Two-way ANOVA with repeated measures in R, missing a desired effect

I am trying to do a two way mixed factorial ANOVA with repeated measures. From:
aov(Estimate ~ Dose*Visit, data = AUClast)
I get 3 sums of squares: the two main effects (Visit and Dose) and their interaction (Dose:Visit), which I have verified by hand are correct.
Both Dose and Visit are explanatory variables: Dose is a between-subjects variable with 4 levels (3, 10, 30, 100) and Visit is a within-subjects variable (the repeated measure) with 2 levels (1 and 28). The subject ID variable is 'Animal'.
I want to include one more effect in the result but do not know how. The desired effect is the variation between Animals within Dose, or as SAS puts it, Animal(Dose). The SS is calculated by:
sum((mean(Animal(ik)) - mean(Dose(i)))^2)
where k indexes the animals within dose i: for each Animal, average the Estimates from Visit 1 and Visit 28, subtract the mean Estimate over all Animals in that Dose group, square the difference, and sum over all Animals in the study.
Does anyone know how to adjust the formula accordingly to include the Animal(Dose) effect?
Thanks in advance for the help and sorry if all of this is too unspecific.
If I understand you correctly, I have a suggestion. First, a sample data set
# sample data
set.seed(15)
AUClast <- data.frame(
  expand.grid(
    Animal = 1:3,
    Dose = c(3, 10, 30, 100),
    Visit = c(1, 28)
  ),
  Estimate = runif(24)
)
Now we calculate the Animal(Dose) term as requested. First, we split the data into dose groups; then, for each dose, we subtract the overall dose mean from each animal's mean and sum the squares of those differences. Finally, we expand the values back out to the dose groups using unsplit.
animaldose <- unsplit(lapply(split(AUClast, AUClast$Dose), function(x) {
  rep(
    sum((tapply(x$Estimate, x$Animal, mean) - mean(x$Estimate))^2),
    nrow(x)
  )
}), AUClast$Dose)
And we can see what that looks like next to the original data.frame
cbind(AUClast, animaldose)
Which gives the result
Animal Dose Visit Estimate animaldose
1 1 3 1 0.60211404 0.1181935
2 2 3 1 0.19504393 0.1181935
3 3 3 1 0.96645873 0.1181935
4 1 10 1 0.65090553 0.1641363
5 2 10 1 0.36707189 0.1641363
6 3 10 1 0.98885921 0.1641363
7 1 30 1 0.81519341 0.0419291
8 2 30 1 0.25396837 0.0419291
9 3 30 1 0.68723085 0.0419291
10 1 100 1 0.83142902 0.1881314
11 2 100 1 0.10466936 0.1881314
12 3 100 1 0.64615091 0.1881314
13 1 3 28 0.50909039 0.1181935
14 2 3 28 0.70662857 0.1181935
15 3 3 28 0.86231366 0.1181935
16 1 10 28 0.84178515 0.1641363
17 2 10 28 0.44744372 0.1641363
18 3 10 28 0.96466695 0.1641363
19 1 30 28 0.14118707 0.0419291
20 2 30 28 0.77671251 0.0419291
21 3 30 28 0.80372740 0.0419291
22 1 100 28 0.79334595 0.1881314
23 2 100 28 0.35756312 0.1881314
24 3 100 28 0.05800106 0.1881314
So you can see each dose group has its own adjustment.
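To get that stratum out of aov() itself, one common approach is to declare the subject as an error term. A sketch, assuming each animal is a distinct subject nested within Dose (the sample data reuses Animal = 1:3 in every dose, so a unique subject id is built first):
AUClast$Subject <- interaction(AUClast$Dose, AUClast$Animal)
fit <- aov(Estimate ~ factor(Dose) * factor(Visit) + Error(Subject), data = AUClast)
summary(fit)
# The "Error: Subject" stratum corresponds to the Animal(Dose) variation and is
# what Dose is tested against; Visit and Dose:Visit appear in the Within stratum.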
