Count the values falling in each range of 100 - R

In Excel I can group the column and count, but I am unable to do it in R.
To do this in R I am using the cut function with some breaks:
cut(elapsed, breaks=seq(min(elapsed),max(elapsed)+100,50), include.lowest=T)
I attached a PNG of the data and the required output.
But the above code does not give my required output.
This is my data:
And my required output:
400 9
500 4
600 2
700 5
800 3
900 3

This should work:
data.frame(table(elapsed %/% 100))
For example:
elapsed <- c(400, 423, 423, 534, 534, 639, 602, 812, 703)
data.frame(table(elapsed %/% 100))
Var1 Freq
1 4 3
2 5 2
3 6 2
4 7 1
5 8 1
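For the desired result in hundreds, use this (Var1 is a factor, so convert it via as.character first; as.numeric on the factor itself would return the level codes 1, 2, 3, ...):
res <- data.frame(table(elapsed %/% 100))
res$Var1 <- as.numeric(as.character(res$Var1)) * 100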
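A self-contained sketch of the same tabulation that labels the buckets in hundreds directly. The column name bucket is only an illustrative choice, and stringsAsFactors = FALSE is used so the labels come back as character strings that convert cleanly to numbers:

```r
elapsed <- c(400, 423, 423, 534, 534, 639, 602, 812, 703)

# Floor each value to its hundred, then count occurrences per bucket
res <- as.data.frame(table(bucket = (elapsed %/% 100) * 100),
                     stringsAsFactors = FALSE)

# The labels are character strings here, so as.numeric() is safe
res$bucket <- as.numeric(res$bucket)
res
```

Buckets with no observations do not appear in this result; if they are needed, cut() with explicit breaks keeps them as zero counts.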

You can try:
require(magrittr)
elapsed <- runif(100, 400, 1000) %>% round
cut(elapsed, breaks = seq(400,1000,100),
labels = as.character(seq(400,900,100)),
include.lowest=TRUE) %>% table
gives you:
400 500 600 700 800 900
15 22 16 9 20 18
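One detail worth checking: by default cut() uses right-closed intervals, so a value of exactly 500 is counted in the bin labelled 400. If the bins are meant to be [400, 500), [500, 600) and so on, a sketch with right = FALSE might look like this (the small elapsed vector is made up for illustration):

```r
elapsed <- c(400, 450, 500, 599, 600, 900, 1000)
breaks <- seq(400, 1000, by = 100)

# right = FALSE gives left-closed bins [400,500), [500,600), ...
# include.lowest = TRUE then also keeps the top value (1000) in the last bin
bins <- cut(elapsed, breaks = breaks, labels = head(breaks, -1),
            right = FALSE, include.lowest = TRUE)
counts <- table(bins)
counts
```

Empty bins (700 and 800 here) are kept with a count of zero, because cut() returns a factor with one level per interval.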


multiply one dataframe with another containing growth rates, but have it compound

I have two dataframes: the first contains a single column with 180k rows (i.e. 180k x 1) and the other has a single row with 13 columns containing 13 growth rates (i.e. 1 x 13).
I am trying to multiply these dataframes so that I have a single dataframe that shows the growth of these values over time.
I can multiply them, but I can't work out how to make the growth compound over time.
Effectively, the dataframe I want will have the existing values in the first column, the second column will have the first column multiplied by the first growth rate, the third column will have the second column multiplied by the second growth rate, etc.
Note - my growth rates are expressed as decimal fractions (e.g. 0.05 for 5%).
I have this, but I am not sure how to reflect compounding in it.
LandValuesForecast <- LandValues[,1] %*% (1+t(unlist(GrowthRates[1,])))
You can loop over the columns of both dataframes, applying each rate to the value computed in the previous iteration.
# example data
values <- data.frame(x0 = 1:10 * 100)
rates <- data.frame(r1 = .1, r2 = .01, r3 = .05)
for (i in seq(ncol(rates))) {
values[[paste0("x", i)]] <- values[, i] * (1 + rates[, i])
}
values
x0 x1 x2 x3
1 100 110 111.1 116.655
2 200 220 222.2 233.310
3 300 330 333.3 349.965
4 400 440 444.4 466.620
5 500 550 555.5 583.275
6 600 660 666.6 699.930
7 700 770 777.7 816.585
8 800 880 888.8 933.240
9 900 990 999.9 1049.895
10 1000 1100 1111.0 1166.550
You can use Reduce() - borrowing @zephryl's data:
values <- data.frame(x0 = 1:10 * 100)
rates <- data.frame(r1 = .1, r2 = .01, r3 = .05)
data.frame(Reduce(`*`, rates + 1, init = values, accumulate = TRUE))
x0 x0.1 x0.2 x0.3
1 100 110 111.1 116.655
2 200 220 222.2 233.310
3 300 330 333.3 349.965
4 400 440 444.4 466.620
5 500 550 555.5 583.275
6 600 660 666.6 699.930
7 700 770 777.7 816.585
8 800 880 888.8 933.240
9 900 990 999.9 1049.895
10 1000 1100 1111.0 1166.550
Or same thing with purrr::accumulate():
library(purrr)
data.frame(accumulate(rates + 1, `*`, .init = values))
If I understood your question correctly, I would prefer conversion of dataframes to matrices with multiplication of results using outer function. It is expected to be fast.
library(dplyr)
df1 <- data.frame(aaa = c(1:10))
df2 <- data.frame(a1 = 1, a2 = 2, a3 = 3)
outer(as.matrix(df1, ncol = 1),
as.matrix(df2, nrow = 1),
`*`) %>% as.data.frame
This script will return:
aaa.1.a1 aaa.1.a2 aaa.1.a3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
7 7 14 21
8 8 16 24
9 9 18 27
10 10 20 30
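Note that outer() on the raw rates multiplies every value by each rate independently, so it does not compound by itself. A possible way to combine it with compounding, assuming the rates are decimal fractions as in the question, is to take the cumulative product of the growth factors first (the names factors and forecast are illustrative):

```r
values <- data.frame(x0 = 1:10 * 100)
rates <- data.frame(r1 = .1, r2 = .01, r3 = .05)

# Cumulative growth factor for each period; the leading 1 keeps the base values
factors <- c(1, cumprod(1 + unname(unlist(rates))))

# One column per period: base value times the compounded factor
forecast <- as.data.frame(outer(values$x0, factors))
names(forecast) <- paste0("x", seq_along(factors) - 1)
head(forecast, 3)
```

This avoids a loop over columns, which may help with 180k rows, though no timing was done here.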

R: Find out which observations are located in each "bar" of the histogram

I am working with the R programming language. Suppose I have the following data:
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
index <- 1:1400
my_data = data.frame(index,d)
I can make the following histograms of the same data by adjusting the number of bins (via the "breaks" option):
hist(my_data$d, breaks = 10, main = "Histogram #1, Breaks = 10")
hist(my_data$d, breaks = 100, main = "Histogram #2, Breaks = 100")
hist(my_data$d, breaks = 5, main = "Histogram #3, Breaks = 5")
My Question: In each one of these histograms there are a different number of "bars" (i.e. bins). For example, in the first histogram there are 8 bars and in the third histogram there are 4 bars. For each one of these histograms, is there a way to find out which observations (from the original file "d") are located in each bar?
Right now, I am trying to manually do this, e.g. (for histogram #3)
histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 20 & my_data$d > 15), ]
head(histogram3_bar1)
index d
1001 1001 4.156393
1002 1002 3.358958
1003 1003 1.605904
1004 1004 3.603535
1006 1006 2.943456
1007 1007 1.586542
But is there a more "efficient" way to do this?
Thanks!
hist itself can solve the problem of finding which data points fall in which intervals: it returns a list whose first member, breaks, holds the bin boundaries.
First, make the problem reproducible by setting the RNG seed.
set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
Now, save the return value of hist and have findInterval report which bin each data point falls in.
h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)
h1$breaks
# [1] -2 0 2 4 6 8 10 12 14 16
head(f1)
#[1] 6 7 7 7 7 6
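The first six observations fall in bins 6 and 7, i.e. the intervals with end points 8, 10 and 12. Note that d[f1] would index d by bin number rather than show those observations; to check, put the values and their bins side by side:
head(data.frame(d, bin = f1))
As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed: hist uses right-closed intervals by default, while findInterval uses left-closed ones, so a value lying exactly on a break can be assigned differently. See the left.open and rightmost.closed arguments in help("findInterval").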
As a final check, table the values returned by findInterval and see if they match the histogram's counts.
table(f1)
#f1
# 1 2 3 4 5 6 7 8 9
# 2 34 130 34 17 478 512 169 24
h1$counts
#[1] 2 34 130 34 17 478 512 169 24
To get the interval end points for each data point, use the following:
bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
# bin min max
#1 6 8 10
#2 7 10 12
#3 7 10 12
#4 7 10 12
#5 7 10 12
#6 6 8 10
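To get the actual rows that fall in each bar, one option is to bin with cut() using the histogram's own breaks and then split the data frame. This is a sketch, not part of the answer above; the names h and by_bin are illustrative, and plot = FALSE just suppresses drawing:

```r
set.seed(2021)
d <- c(rnorm(1000, 10, 1), rnorm(200, 3, 1), rnorm(200, 13, 1))
my_data <- data.frame(index = seq_along(d), d = d)

# Compute the histogram without plotting, to reuse its breaks
h <- hist(d, breaks = 10, plot = FALSE)

# cut() defaults to right-closed intervals, like hist();
# include.lowest = TRUE matches hist()'s default as well
my_data$bin <- cut(d, breaks = h$breaks, include.lowest = TRUE)

# One data frame per bar, in the same order as h$counts
by_bin <- split(my_data, my_data$bin)
sapply(by_bin, nrow)
```

by_bin[[1]] then holds the observations of the first bar, by_bin[[2]] those of the second, and so on.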

Calculated field based on Ranges in a second data frame in R

I have found similar posts regarding this task, but all of which have a common ID joining the two tables.
I have one data frame which contains sale records (sales_df). For this example I have simplified the data table so that it contains only 5 records. I would like to create a new column in the sales_df that calculates what the fee would be given a sale price amount as defined in the fee table (pricing_fees). Please note that the number of actual pricing fee ranges that I have to account for are around 30, so writing this into a mutate statement is something that I would like to try and avoid.
The two data frames are coded as follows
sales_df <- data.frame(invoice_id = 1:5,
sale_price = c(100, 275, 350, 500, 675))
pricing_fees <- data.frame(min_range = c(0, 50, 100, 200, 300, 400, 500), # >=
max_range = c(50, 100, 200, 300, 400, 500, 1000), # <
buyer_fee = c(1, 1, 25, 50, 75, 110, 125))
In the end I would like the resulting sales_df to look something like this.
invoice_id sale_price buyer_fee
1 1 100 25
2 2 275 50
3 3 350 75
4 4 500 125
5 5 675 125
Thanks in advance
You can use the findInterval function, which is efficient at assigning values to ranges (it uses binary search):
# build consecutive increasing ranges of fees
# (in order to use findInterval, since it works on ranges defined in a single vector)
pricing_fees <- pricing_fees[order(pricing_fees$min_range),]
consecFees <- data.frame(ranges=c(pricing_fees$min_range[1], pricing_fees$max_range),
fees=c(pricing_fees$buyer_fee,NA))
# consecFees now is :
#
# ranges fees
# 1 0 1 ---> it means for price in [0,50) -> 1
# 2 50 1 ---> it means for price in [50,100) -> 1
# 3 100 25 ---> it means for price in [100,200) -> 25
# 4 200 50 ... and so on
# 5 300 75
# 6 400 110
# 7 500 125
# 8 1000 NA ---> NA because for values >= 1000 we set NA
# add the column to sales_df using findInterval
sales_df$buyer_fee <- consecFees$fees[findInterval(sales_df$sale_price,consecFees$ranges)]
Result :
> sales_df
invoice_id sale_price buyer_fee
1 1 100 25
2 2 275 50
3 3 350 75
4 4 500 125
5 5 675 125
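One edge case to keep in mind with this approach: findInterval returns 0 for a value below the first break, and indexing with 0 silently drops that element, so the result can come out shorter than the input. A minimal sketch of guarding against that, using made-up prices that include out-of-range values:

```r
ranges <- c(0, 50, 100, 200, 300, 400, 500, 1000)
fees <- c(1, 1, 25, 50, 75, 110, 125, NA)

# Hypothetical prices, including one below 0 and two at or above 1000
prices <- c(-5, 100, 275, 1000, 1500)

idx <- findInterval(prices, ranges)
idx[idx == 0] <- NA          # below the first break: no fee defined
fee <- fees[idx]
fee
```

With the guard in place, fee has one entry per price and is NA wherever no range applies.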
You can also use cut to "bin" sales_df$sale_price values and label bins with corresponding buyer_fee values.
# Make pricing_fee table with unique buyer_fee
brks <- do.call(rbind, by(pricing_fees, pricing_fees$buyer_fee, FUN = function(x)
data.frame(min_range = min(x$min_range), max_range = max(x$max_range), buyer_fee = unique(x$buyer_fee))))
sales_df$buyer_fee = as.numeric(as.character(cut(
sales_df$sale_price,
breaks = c(0, brks$max_range),
labels = brks$buyer_fee,
right = FALSE)))
# invoice_id sale_price buyer_fee
#1 1 100 25
#2 2 275 50
#3 3 350 75
#4 4 500 125
#5 5 675 125

how to subtract a value from one column from a value from a previous row, different column in r

I have a dataframe composed of 3 columns and ~2000 rows.
ID DistA DistB
1 100 200
2 239 390
3 392 550
4 700 760
5 770 900
The first column (ID) is a unique identifier for each row. I'd like my script to read each row and subtract the value of column "DistB" in the previous row from the value of column "DistA" in the current row. If the difference for any consecutive pair is <40, it should output that they are in the same area.
For example: comparing rows 2 and 1 above, '239' from row 2 minus '200' from row 1 is 39, which is <40, so they are in the same area. Likewise rows 2 and 3 are in the same area, since the difference is 2 and 2<40. But rows 3 and 4 are not, as the difference is 150.
I have not been able to go far, as I am stuck in the comparison (subtraction/difference) step. I have tried to write a loop to iterate in all the rows, but I keep getting errors. Should I even use a loop, or can I do this without a loop?
I am a new R learner, and this is the 'rookie' code that I have so far. Where am I going wrong? Thanks in advance:
#the function to compare the two columns
funct <- function(x){
for(i in 1:(nrow(dat)))
(as.numeric(dat$DistA[i-1])) - (as.numeric(dat$DistB[i]))}
#creating a new column 'new2' with the differences
dat$new2 <- apply(dat[,c('DistB','DistA')]),1, funct
When I run this, I get the following error:
Error: unexpected ',' in "dat$new2 <- apply(dat[,c('DistB','DistA')]),"
I'll appreciate all the comments/suggestions.
I believe dplyr can help you here.
library(dplyr)
dfData <- data.frame(ID = c(1, 2, 3, 4, 5),
DistA = c(100, 239, 392, 700, 770),
DistB = c(200, 390, 550, 760, 900))
dfData <- mutate(dfData, comparison = DistA - lag(DistB))
This results in...
dfData
ID DistA DistB comparison
1 1 100 200 NA
2 2 239 390 39
3 3 392 550 2
4 4 700 760 150
5 5 770 900 10
You could then check to see if a row is within the same "area" as your previous row.
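That check could be sketched without any packages by shifting DistB down one row by hand. The column names diff and same_area are illustrative, and the threshold of 40 comes from the question:

```r
dat <- data.frame(ID = 1:5,
                  DistA = c(100, 239, 392, 700, 770),
                  DistB = c(200, 390, 550, 760, 900))

# DistB of the previous row; the first row has no predecessor
prev_B <- c(NA, head(dat$DistB, -1))

dat$diff <- dat$DistA - prev_B
dat$same_area <- dat$diff < 40   # NA for the first row
dat
```

If overlapping ranges can produce negative differences, wrapping the difference in abs() may be appropriate, depending on what "same area" should mean.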
We could also try data.table (similar to the approach suggested in the comments by @David Arenburg). shift is a new function introduced in the devel version with type='lag' as the default option. It can be installed from here
library(data.table)#data.table_1.9.5
setDT(df1)[, Categ := c('Diff', 'Same')[
(abs(DistA-shift(DistB)) < 40 )+1L]][]
# ID DistA DistB Categ
#1: 1 100 200 NA
#2: 2 239 390 Same
#3: 3 392 550 Same
#4: 4 700 760 Diff
#5: 5 770 900 Same
If we need both the 'difference' and 'category' columns
setDT(df1)[,c('Dist', 'Categ'):={tmp= abs(DistA-shift(DistB))
list(tmp, c('Diff', 'Same')[(tmp <40)+1L])}]
df1
# ID DistA DistB Dist Categ
#1: 1 100 200 NA NA
#2: 2 239 390 39 Same
#3: 3 392 550 2 Same
#4: 4 700 760 150 Diff
#5: 5 770 900 10 Same

Correct way of vectorizing "lookup" function

I am looking for a fast and efficient way to compute the problem described below. Any help would be appreciated, thanks in advance!
I have a couple of very large csv files that hold different information about the same objects, and in my final calculation I need the attributes from all of the tables. I am trying to calculate the load of a large number of electrical substations. First, I have a list of unique electrical substations;
Unique_Substations <- data.frame(Name = c("SubA", "SubB", "SubC", "SubD"))
In another list I have information about the customers behind these substations;
Customer_Information <- data.frame(
Customer = 1001:1010,
SubSt_Nm = sample(unique(Unique_Substations$Name), 10, replace = TRUE),
HouseHoldType = sample(1:2, 10, replace = TRUE)
)
And in another list I have information about the, let's say, solar panels on these customers roofs (for different years);
Solar_Panels <- data.frame(
Customer = sample(1001:1010, 10, replace = TRUE),
SolarPanelYear1 = sample(10:20, 10, replace = TRUE),
SolarPanelYear2 = sample(15:20, 10, replace = TRUE)
)
Now I want to see what the load is for each substation for each year. I have a household load and a solar panel load normalised for each type of household or solar panel;
SolarLoad <- data.frame(Load = c(0, -10, -10, 5))
HouseHoldLoad <- data.frame(Type1 = c(1, 3, 5, 2), Type2 = c(3, 5, 6, 1))
So now I have to match up these lists;
ML_SubSt_Cust <- sapply(Unique_Substations$Name,
function(x) which(Customer_Information$SubSt_Nm %in% x == TRUE))
ML_Cust_SolarP <- sapply(Customer_Information$Customer,
function(x) which(Solar_Panels$Customer %in% x == TRUE))
(Here I use the which(xxx %in% x == TRUE) method because I need multiple matches and match() only returns one match.)
And now we come to my big question (but probably not my only problem with this method) at last. I want to calculate the maximum load on each substation for each year. To this end I had first written a for loop that looped through the Unique_Substations list, which is of course highly inefficient. After that I tried to speed it up using outer() but I don't think I have properly vectorized my function. My maximum function looks as follows (I only wrote it out for the solar panel part to keep it simple);
GetMax <- function(i, Yr) {
max(sum(Solar_Panels[unlist(ML_Cust_SolarP[ML_SubSt_Cust[[i]]], use.names= FALSE),Yr])*SolarLoad)
}
I'm sure this is not efficient at all but I have no clue how to do it in any other way.
To get my final results I use a outer function;
Results <- outer(1:nrow(Unique_Substations), 1:2, Vectorize(GetMax))
In my example all of these data frames are much much larger (40000 rows each or so), so I really need some good optimization of the functions involved. I tried to think of ways to vectorize the function but I couldn't work it out. Any help would be appreciated.
EDIT:
Now that I fully understand the accepted answer I have another problem. My actual Customer_Information is 188k rows long and my actual HouseHoldLoad is 53k rows long. Needless to say this does not merge() very well. Is there another solution to this problem that does not require merge() or for loops that are too slow?
First: set.seed() when generating random data! I did set.seed(1000) before your code for these results.
I think a bit of merge-ing and dplyr can help here. First, we get the data into a better shape:
library(dplyr)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load") %>%
select(HouseHoldType=variable, Load) %>%
mutate(HouseHoldType=gsub("Type", "", HouseHoldType))
Solar_Panels <- melt(Solar_Panels, id.vars="Customer",
value.name="SPYearVal") %>%
select(Customer, SolarPanelYear=variable, SPYearVal) %>%
mutate(SolarPanelYear=gsub("SolarPanelYear", "", SolarPanelYear))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
That gives us:
## Customer SubSt_Nm HouseHoldType SolarPanelYear SPYearVal
## 1 1001 SubB 1 1 16
## 2 1001 SubB 1 2 18
## 3 1001 SubB 1 2 16
## 4 1001 SubB 1 1 20
## 5 1002 SubD 2 1 16
## 6 1002 SubD 2 1 13
## 7 1002 SubD 2 2 20
## 8 1002 SubD 2 2 18
## 9 1003 SubA 1 2 15
## 10 1003 SubA 1 1 16
## 11 1005 SubC 2 2 19
## 12 1005 SubC 2 1 10
## 13 1006 SubA 1 1 15
## 14 1006 SubA 1 2 19
## 15 1007 SubC 1 1 17
## 16 1007 SubC 1 2 19
## 17 1009 SubA 1 1 10
## 18 1009 SubA 1 1 18
## 19 1009 SubA 1 2 18
## 20 1009 SubA 1 2 18
Now we just group and summarize:
dat %>% group_by(SubSt_Nm, SolarPanelYear) %>%
summarise(mx=max(sum(SPYearVal)*SolarLoad))
## SubSt_Nm SolarPanelYear mx
## 1 SubA 1 295
## 2 SubA 2 350
## 3 SubB 1 180
## 4 SubB 2 170
## 5 SubC 1 135
## 6 SubC 2 190
## 7 SubD 1 145
## 8 SubD 2 190
If you use data.table vs data frames, it should be pretty speedy even with 40K entries.
UPDATE For those who cannot install dplyr, this just uses reshape2 (hopefully that is installable)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load")
colnames(HouseHoldLoad) <- c("HouseHoldType", "Load")
HouseHoldLoad$HouseHoldType <- gsub("Type", "", HouseHoldLoad$HouseHoldType)
Solar_Panels <- melt(Solar_Panels, id.vars="Customer", value.name="SPYearVal")
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("SolarPanelYear", "", Solar_Panels$SolarPanelYear)
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
If you really can't install even reshape2, then this works with just the base stats package:
colnames(HouseHoldLoad) <- c("Load.1", "Load.2")
HouseHoldLoad <- reshape(HouseHoldLoad, varying=c("Load.1", "Load.2"), direction="long", timevar="HouseHoldType")[1:2]
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear.1", "SolarPanelYear.2")
Solar_Panels <- reshape(Solar_Panels, varying=c("SolarPanelYear.1", "SolarPanelYear.2"), direction="long", timevar="SolarPanelYear")[1:2]
colnames(Solar_Panels) <- c("Customer", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("^[0-9]+\\.", "", rownames(Solar_Panels))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
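For the follow-up about avoiding merge(), one possibility is a plain match() lookup followed by rowsum(). This is only a sketch on small made-up data, and it assumes each customer appears exactly once in Customer_Information; applying the load factors is left out:

```r
Customer_Information <- data.frame(
  Customer = 1001:1004,
  SubSt_Nm = c("SubA", "SubB", "SubA", "SubC")
)
Solar_Panels <- data.frame(
  Customer = c(1001, 1003, 1003, 1002),
  SolarPanelYear1 = c(10, 12, 14, 20),
  SolarPanelYear2 = c(15, 16, 17, 18)
)

# Look up the substation of every solar panel row in one vectorized step
sub <- Customer_Information$SubSt_Nm[
  match(Solar_Panels$Customer, Customer_Information$Customer)
]

# Sum each year's column per substation
totals <- rowsum(Solar_Panels[, c("SolarPanelYear1", "SolarPanelYear2")],
                 group = sub)
totals
```

Substations without any solar panel rows (SubC here) do not appear in totals, so they would need to be added back if a complete list is required.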
