I need to calculate min, max and mean by customer after subsetting the population for primary contacts. To do this, I need to drop observations within a customer group if contact == relation and amount < 25. But the tricky part is: if contact == relation and amount == amount, I need to keep both observations regardless of the amount (this accounts for ties where we cannot define the primary contact).
If contact == relation, one can think of this as a household.
Each customer can be comprised of multiple households, so I've included contacts with NULL relationship values.
Sample Data
customer <- c(1,1,1,1,2,2,2,3,3,3,3)
contact <- c(1234,2345,3456,4567,5678,6789,7890,8901,9012,1236,2346)
relationship <- c(2345,1234,"","",6789,5678,"",9012,8901,2346,1236)
amount <- c(26,22,40,12,15,15,70,35,15,25,25)
score <- c(500,300,700,600,400,600,700,650,300,600,700)
creditinfoaggtestdata1 <- data.frame(customer,contact,relationship,amount,score)
Expected Outcome
As a point of reference, if I do not drop the appropriate contacts prior to calculating min, max and mean, by customer, I get an output table as follows:
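For completeness, that baseline (nothing dropped) can be reproduced from the sample data with a plain grouped summary, for example:
library(dplyr)

# Baseline: min, max and mean score per customer with no contacts dropped
creditinfoaggtestdata1 %>%
  group_by(customer) %>%
  summarise(min_score = min(score), max_score = max(score), avg_score = mean(score))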
I assume the requirement "contact = relation and amount = amount" means across different rows within the same customer group. Here's a dplyr solution:
library(dplyr)

# Create a contact-relationship id where direction doesn't matter
df <- creditinfoaggtestdata1 %>%
  rowwise() %>%
  mutate(id = paste0(min(contact, relationship), max(contact, relationship)))
# Filter new IDs where duplicates in amounts exist per customer group
dups <- df %>%
  group_by(customer, id, amount) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  ungroup() %>%
  select(customer, id)
# Use an inner join to only select contact-relationship combinations from above
a <- df %>%
  filter(amount < 25) %>%
  inner_join(dups, by = c("customer", "id"))
# Combine with >= 25 data
b <- df %>%
  filter(amount >= 25)
c <- rbind(a, b)
c %>%
  group_by(customer) %>%
  summarise(min_score = min(score), max_score = max(score), avg_score = mean(score))
Output:
customer min_score max_score avg_score
<dbl> <dbl> <dbl> <dbl>
1 1 500 700 600
2 2 400 700 567.
3 3 600 700 650
I have a data frame that has 160M rows and 2 columns (material name and price). I want to determine the frequency at which each price occurs.
For example, the price $10 was given 100 different times. I'd like to sort the values from the largest occurrence count to the smallest (for example, $100 was given 1000 times).
There are 2,484,557 unique prices, so a "table" is not the most useful solution.
My issue is that I'm running into memory problems.
Any suggestions on how I can accomplish this?
Here's a 2 GB data frame with 160M rows and about 3M unique prices:
set.seed(42)
n = 160E6
fake_data <- data.frame(material = sample(LETTERS, n, replace = TRUE),
                        price = sample(1:3E6, n, replace = TRUE))
I like dplyr syntax, but for large data with many groups, data.table and collapse offer much better performance.
We could use dtplyr to translate dplyr code to data.table. This takes 22 seconds on my machine, with the result showing how many times each price appears in the data.
library(dplyr)
library(dtplyr)
fake_data %>%
  lazy_dt() %>%
  count(price, sort = TRUE)
Result
Source: local data table [3,000,000 x 2]
Call: `_DT2`[, .(n = .N), keyby = .(price)][order(desc(n))]
price n
<int> <int>
1 2586972 97
2 2843789 95
3 753207 92
4 809482 92
5 1735845 92
6 809659 90
# … with 2,999,994 more rows
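For reference, the same count can be written directly in data.table; a quick sketch only, not benchmarked here (dt and price_counts are just placeholder names):
library(data.table)

# Copy into a data.table (setDT(fake_data) would convert in place instead)
dt <- as.data.table(fake_data)

# Count rows per price, sorted from most to least frequent
price_counts <- dt[, .N, by = price][order(-N)]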
If you need higher performance and don't mind a heuristic, you could also sample your data to make it 10% or 1% as big; if any placeholder values occur frequently in the whole data, they are also likely to be frequent in a random sample.
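For example, a 1% sample (an approximation: the resulting counts estimate roughly 1/100 of the true frequencies):
# Count prices within a 1% random sample of the rows
fake_data %>%
  slice_sample(prop = 0.01) %>%
  count(price, sort = TRUE)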
I'd probably create price intervals, e.g. $0-50, $51-100, $101-150 etc.
EDIT: a more comprehensive solution
library(tidyverse)
df <- letters %>%
  expand_grid(., .) %>%
  rename(v1 = `....1`,
         v2 = `....2`) %>%
  mutate(name = paste0(v1, v2)) %>%
  select(name) %>%
  bind_rows(., ., ., .)
df
n <- nrow(df)
df <- df %>%
  mutate(price = rnorm(n = n, mean = 1000, sd = 200))
df %>%
  ggplot(aes(x = price)) +
  geom_histogram()
df <- df %>%
  mutate(price_grp = case_when(price <= 500 ~ "$0-500",
                               price > 500 & price <= 1000 ~ "$501-1000",
                               price > 1000 & price <= 1500 ~ "$1001-1500",
                               price > 1500 ~ "+ $1500"))
df %>%
  group_by(price_grp) %>%
  summarize(occurrences = n()) %>%
  arrange(desc(occurrences))
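The same binning can also be written with base R's cut(), which keeps the breaks and labels in one place; a sketch equivalent to the case_when() above (cut() returns a factor rather than a character column):
df <- df %>%
  mutate(price_grp = cut(price,
                         breaks = c(-Inf, 500, 1000, 1500, Inf),
                         labels = c("$0-500", "$501-1000", "$1001-1500", "+ $1500")))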
I have a dataset that records the products associated with certain accounts. I want to summarise the total number of accounts for a specific set of products, only counting each account number once, no matter how many products they have. So the total for this sample would be 4. (a + b + c + d)
Account  Product
a        1
a        2
b        1
c        1
c        2
d        3
The code I have tried so far is
filter(Product == 1 | 2 | 3) %>%
summarise(total = n_distinct(), .groups = Account)
This gives the error message:
Error in summarise_verbose(.groups, caller_env()) :
  object 'Account' not found
I also tried
filter(Product == 1 | 2 | 3) %>%
summarise(total = n_distinct(Account))
But this doesn't reduce the number of rows properly - I'm still getting 300,000 rows when I should get 70,000 based on other data I have. Is there a way of counting the (alphanumeric) account numbers once and once only, no matter what the products are?
In the absence of a minimal data example, I suppose you want to count the distinct elements by group after filtering.
Data %>%
  filter(Product %in% c(1, 2, 3)) %>%
  group_by(Account) %>%
  summarise(
    total = n_distinct(Product)
  )
You were close
df %>%
  group_by(Account) %>%
  summarise(
    total = n_distinct(Product)
  )
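If the goal is instead a single overall total across products (4 in the sample above), drop the grouping so summarise() returns one row; on data that is already grouped upstream, add an ungroup() first. A sketch, assuming the data frame is called df as above:
df %>%
  ungroup() %>%                          # in case the data frame is still grouped
  filter(Product %in% c(1, 2, 3)) %>%
  summarise(total = n_distinct(Account))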
As a follow-up question to a previous one in the same project:
I found that real estate is often measured in inventory time, which is defined as (number of active listings) / (number of homes sold per month, averaged over the last 12 months). The best way I could find to count the number of homes sold in the 12 months before each home sale is through a for-loop.
homesales$yearlysales = 0
for (i in 1:nrow(homesales))
{
  sdt = as.Date(homesales$saledate[i])
  x <- homesales %>%
    filter(sdt - saledate >= 0 & sdt - saledate < 365) %>%
    summarise(count = n())
  homesales$yearlysales[i] = x$count[1]
}
homesales$inventorytime = homesales$inventory / homesales$yearlysales * 12
homesales$inventorytime[is.na(homesales$saledate)] = NA
homesales$inventorytime[homesales$yearlysales == 0] = NA
Obviously (?), the R language has some prejudice against using a for-loop for this type of selection. Is there a better way?
Appendix 1. data table structure
address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...
Appendix 2. The output I'm looking for is something like this.
The following gives you the number of active listings on any given day:
library(tidyverse)
library(lubridate)
tmp <- tempfile()
download.file("https://raw.githubusercontent.com/robhanssen/glenlake-homesales/master/homesalesdata-source.csv", tmp)
data <- read_csv(tmp) %>%
  select(ends_with("date")) %>%
  mutate(across(everything(), mdy)) %>%
  pivot_longer(cols = everything(), names_to = "activity", values_to = "date", names_pattern = "(.*)date")
active <- data %>%
  mutate(active = if_else(activity == "listing", 1, -1)) %>%
  arrange(date) %>%
  mutate(active = cumsum(active)) %>%
  group_by(date) %>%
  filter(row_number() == n()) %>%
  select(-activity)
tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "days")) %>%
  left_join(active) %>%
  fill(active)
Basically, we pivot longer and split each row of data into two rows indicating distinct activities: adding a listing or removing a listing. Then the cumulative sum of this gives you the number of active listings.
Note, this assumes that you are not missing any data. Depending on the specification from which the csv was made, you could be missing activity at the start or end. But this is a warning about the csv itself.
Active listings is a fact about an instant in time. Sales is a fact about a time period. You probably want to aggregate sales by month, and then use the number of active listings from the last day of the month, or perhaps the average number of listings over that month.
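A sketch of that month-level view, reusing the data and active tables built above (monthly_sales and inventory_months are just names for this sketch; inventory_months divides active listings by that single month's sales, and a 12-month rolling average of sales could be layered on top):
# Monthly sales counts from the long-format data above (lubridate/dplyr already loaded)
monthly_sales <- data %>%
  filter(activity == "sale", !is.na(date)) %>%
  count(month = floor_date(date, "month"), name = "sales")

# Active listings on the last recorded day of each month, joined to monthly sales
active %>%
  ungroup() %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  slice_max(date, n = 1) %>%
  ungroup() %>%
  select(month, active) %>%
  left_join(monthly_sales, by = "month") %>%
  mutate(inventory_months = active / sales)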
I've got a data frame (dfdat) with two categorical variables, location and employmentstatus.
I'd like to generate a data frame with the proportions of employment status for each location.
mydf_wide (achieved outcome) is almost what I'm looking for. The problem is that employmentstatus is a variable with two levels, yet there are three rows in mydf_wide. I don't understand why that is, because I'd have expected something similar to mytable (expected outcome).
Any help would be much appreciated.
Starting point (df):
dfdat <- data.frame(location=c("GA","GA","MA","OH","RI","GA","AZ","MA","OH","RI"),employmentstatus=c(1,2,1,2,1,1,1,2,1,1))
Expected outcome (table):
mytable <- table(dfdat$employmentstatus,dfdat$location)
mytable <- round(100*(prop.table(mytable, 2)),1)
Achieved outcome (df):
library(dplyr)
mydf <- dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(freq = round((n / sum(n) * 100), 1))
library(tidyr)
mydf_wide <- spread(mydf, location, freq)
mydf_wide <- as.data.frame(mydf_wide)
We need to do a second group_by with 'location' to get the sum. Also, instead of grouping and then creating the 'n', the count function can be used:
dfdat %>%
  count(location, employmentstatus) %>%
  group_by(location) %>%
  mutate(n = round(100 * n / sum(n), 2)) %>%
  spread(location, n, fill = 0)
# A tibble: 2 x 6
# employmentstatus AZ GA MA OH RI
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 100 66.67 50 50 100
#2 2 0 33.33 50 50 0
If we are using the OP's code, then remove the 'n' column and then do the spread:
dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(freq = round((n / sum(n) * 100), 1)) %>%
  select(-n) %>%
  spread(location, freq, fill = 0)
Or update the 'n' column with the output of round and then spread; using fill = 0 in spread makes sure that combinations missing from the data still appear in the output.
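A sketch of that variant, overwriting n with the rounded percentage instead of adding a freq column:
dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(n = round(100 * n / sum(n), 1)) %>%
  spread(location, n, fill = 0)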
Looking to reduce resource allocation by looping through each resource's name, looking at the accounts assigned to that person, selecting one at random, and replacing that person's name with NA.
Reproducible example:
Accts <- paste0("Acc", 1:200)
Value <- c(500, 2000, 5000, 1000)
AccountDF <- data.frame(Accts, Value)
AccountDF$Owner[1:200] <- NA
AccountDF$Owner[1:23] <- "Jeff"
AccountDF$Owner[24:37] <- "Alex"
AccountDF$Owner[38:61] <- "Steph"
AccountDF$Owner[62:111] <- "Matt"
AccountDF$Owner[112:141] <- "David"
library(dplyr)
OwnerDF <- AccountDF %>%
  group_by(Owner) %>%
  summarise(Count = n(),
            TotalValue = sum(Value)) %>%
  filter(!is.na(Owner))
Here's how far I got:
for (p in 1:nrow(OwnerDF)) {
  while (AccountDF$Count[p] > 22) {
    AccountDF %>%
      filter(Owner == OwnerDF$Owner[p]) %>%
      sample_n(1)
  }
}
I've heard that for loops are unnecessary here. I'm sure this can be done with the purrr package and pmap or something like that. I am still learning.
I would like to iterate through the OwnerDF and look at whether that person "owns" too many accounts. If yes, look at the original account list and select a random one and replace the owner's name with NA, remove 1 from their count, and continue on.
Lastly, after figuring this out, I would like to see if it can be done with multiple conditions, like while (Count > 22 & Value > $40,000), or maybe two while loops. The objective is to reduce each person's "owned" accounts to below a certain threshold and to reduce the $$ to below a certain threshold.
To select random accounts, just make a random var and sort on it, taking the first N accounts that meet your conditions:
set.seed(1)
res = AccountDF %>%
  mutate(r = runif(n())) %>%
  arrange(r) %>%
  group_by(Owner) %>%
  mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
  select(-r)
# Test that it worked...
res %>%
  filter(!is.na(newOwner)) %>%
  group_by(newOwner) %>%
  summarise(Count = n(), TotalValue = sum(Value))
# A tibble: 5 x 3
# newOwner Count TotalValue
# <chr> <int> <dbl>
# 1 Alex 14 27000
# 2 David 18 37000
# 3 Jeff 18 39500
# 4 Matt 18 39500
# 5 Steph 17 36500
An extension mentioned by the OP in a comment:
Another question for you. Say I have a threshold for each value and count, and if someone has a low count but high value, I want to take a random account from their high value accounts, if they have a high count and low value, I want to take low value accounts away from them. How can I do this from a random perspective?
I'd probably assign a real-valued score to each observation, like...
s = scale(f(x))
where f is some function based on the conditions you mentioned (high count, high value or both), maybe as simple as x when you want to bias towards the low values and -x when you want to bias towards the high values.
Then, add on some noise and sort using the result as above:
r = s + rnorm(length(s))
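A rough sketch of that idea on the AccountDF example above, plugging the score-based r into the same pipeline (here f(x) = Value, which pushes high-value accounts to the end of the sort so they are more likely to be released; use -Value to release low-value accounts instead):
set.seed(1)
res2 <- AccountDF %>%
  group_by(Owner) %>%
  # Score within each owner: scale() standardises Value, rnorm() adds the noise
  mutate(s = as.numeric(scale(Value)),
         r = s + rnorm(n())) %>%
  arrange(r) %>%
  mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
  select(-s, -r)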