Building off of this question: Pass a data.frame with column names and fields as filter.
Let's say we have the following data set:
filt = data.table(X1 = c("Gender","Male"),
                  X2 = c('jobFamilyGroup','Finance'),
                  X3 = c('jobFamilyGroup','Software Dev'))
df = data.table(Gender = c('Male','F','Male','Male','F'),
EmployeeStatus = c('Active','na','Active','Active','na'),
jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'))
and I want to use filt as a filter for df. filt is built by grabbing an input from Shiny and transforming it a bit to get the data.table above. My goal is to filter df so we keep all rows that are Male AND (Software Dev OR Finance).
Currently I'm hardcoding the logic to always be an AND, which isn't ideal for situations like this. My first thought was to add multiple if conditions to catch such cases, but I suspect there's an easier approach to building this logic.
UPDATE
Once I have a table like filt I can pass code like:
if (!is.null(primary)) {
  if (ncol(primary) == 1) {
    d2 = df[get(as.character(primary[1,1])) == as.character(primary[2,1])]
  } else if (ncol(primary) == 2) {
    d2 = df[get(as.character(primary[1,1])) == as.character(primary[2,1]) &
            get(as.character(primary[1,2])) == as.character(primary[2,2])]
  } else {
    d2 = df[get(as.character(primary[1,1])) == as.character(primary[2,1]) &
            get(as.character(primary[1,2])) == as.character(primary[2,2]) &
            get(as.character(primary[1,3])) == as.character(primary[2,3])]
  }
}
But this code doesn't account for the OR logic needed when there are multiple inputs for one grouping. The current code says: give me all rows where Gender == 'Male' & jobFamilyGroup == 'Finance' & jobFamilyGroup == 'Software Dev', when really it should be Gender == 'Male' & (jobFamilyGroup == 'Finance' | jobFamilyGroup == 'Software Dev').
This is a minimal example; the real data has many other columns, so ideally the solution can detect when a grouping has multiple inputs.
Given your problem, what if you parsed it so your logic looked like:
Gender %in% c("Male") & jobFamilyGroup %in% c('Finance','Software Dev')
By lumping all filter values with the same column name together in an %in% you get your OR and you keep your AND between column names.
UPDATE
Consider the case discussed in comments below.
Your reactive input is a data.table specifying:
Gender IS Male
Country IS China OR US
EmployeeStatus IS Active
In the sample data you provided there is no Country column, so I added one. I extract the columns to be filtered and the values to filter on, and split the values by column. I pass this into an lapply that does the logical check for each column using %in% rather than ==, so that options within the same column are treated as | instead of &. Then I rbind the logical results together, apply all over the columns, and filter df by the result.
This approach handles the & between columns and the | within columns. It supports any number of columns to be searched, removing the need for your if/else logic.
library(data.table)
df = data.table(Gender = c('Male','F','Male','Male','F'),
EmployeeStatus = c('Active','na','Active','Active','na'),
jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'),
Country = c('China','China','US','US','China'))
filt = data.table(x1 = c('Gender','Male'),
                  x2 = c('Country','China'),
                  x3 = c('Country','US'),
                  x4 = c('EmployeeStatus','Active'))
column = unlist(filt[1,])        # column names to filter on
value = unlist(filt[2,])         # values to match
tofilter = split(value, column)  # values grouped by column name
# For each column, check membership with %in% (OR within a column),
# then require every check to pass (AND across columns)
tokeep = apply(do.call(rbind, lapply(names(tofilter), function(x) {
  df[[x]] %in% tofilter[[x]]
})), 2, all)
df[tokeep==TRUE]
#> Gender EmployeeStatus jobFamilyGroup Country
#> 1: Male Active Finance China
#> 2: Male Active HR US
#> 3: Male Active Finance US
I am calculating the dissimilarity index of several groups compared to the total population with the function "seg" from the identically named package.
The data consists of about 450 rows, each a different district, and around 20 columns (groups that may be segregated). The values are the number of people from the respective group living in the respective district. Here are the first few rows of my csv file:
Region,Germany,EU15 without Germany,Poland,Former Yugoslavia and successor countries,Former Soviet Union and successor countries,Turkey,Arabic states,West Afrika,Central Afrika,East Afrika,North America,Central America and the Carribean,South America,East and Central Asia,South and Southeast Asia - excluding Vietnam,Australia and Oceania,EU,Vietnam,Non EU Europe,Total Population
1011101,1370,372,108,35,345,91,256,18,6,3,73,36,68,272,98,3,1979,19,437,3445
1011102,117,21,6,0,0,0,6,0,0,0,7,0,6,0,7,0,156,0,3,188
1011103,2180,482,181,102,385,326,358,48,12,12,73,24,75,175,129,12,3152,34,795,5159
Since the seg function only works with two columns as input, my current code to create a table with the index for all groups looks like this:
DI_table <- as.data.frame(0)
DI_table[1,1] <- print (seg(data =dfplrcountrygroups2019[, c( "Germany", "Total.Population")]))
DI_table[1,2] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[3], "Total.Population")]))
DI_table[1,3] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[4], "Total.Population")]))
DI_table[1,4] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[5], "Total.Population")]))
# and so on...
colnames(DI_table)<- (colnames(dfplrcountrygroups2019[2:20]))
This works, but it's a hassle to recode every time I change something in my data, and I would like to use this method for other datasets too.
I thought I might try something like the loop below, but the seg function did not consider it a valid selection of two columns.
for (i in colnames(dfplrcountrygroups2019)) {
di_matrix [i] <- seg(data =dfplrcountrygroups2019[, c( "i", "Total.Population")])
}
Error in [.data.frame(dfplrcountrygroups2019, , c("i",
"Total.Population")) : undefined columns selected
I also thought of the apply function, but I'm not sure how to make it repeat the computation while changing the column that "Germany" occupies in the example. How do I change the column selection each time I repeat the seg function?
my_function <- seg(data =dfplrcountrygroups2019[, c("Germany", "Total.Population")])
apply(X = dfplrcountrygroups2019,
FUN = my_function,
MARGIN = 2
)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'my_function' of mode 'function' was not found
The seg package's functions such as dissim (seg::seg is being deprecated in its favor) have a specific expected data format. From the docs:
data - a numeric matrix or data frame with two columns that represent mutually exclusive population groups (e.g., Asians and non-Asians). If more than two columns are given, only the first two will be used for computing the index.
To get a data frame of the d values seg::dissim returns, where each column is a region's dissimilarity index, you can iterate over the columns, making a temporary data frame and calculating the index. Because the data you're starting with isn't made up of mutually-exclusive categories, you'll have to subtract each population from the total population column to get a not-X counterpart for each group X.
A base R option with sapply will return a named list, which you can then convert into a data frame.
di_table <- sapply(names(dat)[2:20], function(col) {
  tmp_df <- dat[col]                               # group X
  tmp_df$other <- dat$Total.Population - dat[col]  # not-X counterpart
  seg::dissim(data = tmp_df)$d
}, simplify = FALSE)
as.data.frame(di_table)
#> Germany EU15.without.Germany Poland
#> 1 0.03127565 0.03989693 0.02770549
#> Former.Yugoslavia.and.successor.countries
#> 1 0.160239
#> Former.Soviet.Union.and.successor.countries Turkey Arabic.states West.Afrika
#> 1 0.08808277 0.2047 0.02266828 0.1415519
#> Central.Afrika East.Afrika North.America Central.America.and.the.Carribean
#> 1 0.08004711 0.213581 0.1116014 0.2095969
#> South.America East.and.Central.Asia
#> 1 0.08486598 0.2282734
#> South.and.Southeast.Asia...excluding.Vietnam Australia.and.Oceania EU
#> 1 0.0364721 0.213581 0.04394527
#> Vietnam Non.EU.Europe
#> 1 0.05505789 0.06624686
A couple of tidyverse options: you can use purrr functions to do the same thing as above in one step.
dat[2:20] %>%
purrr::map(~data.frame(value = ., other = dat$Total.Population - .)) %>%
purrr::map_dfc(~seg::dissim(data = .)$d)
# same output
Or by reshaping the data and splitting it by group. This takes more steps, but might fit a larger workflow better.
library(dplyr)
dat %>%
tidyr::pivot_longer(c(-Region, -Total.Population)) %>%
mutate(other = Total.Population - value) %>%
split(.$name) %>%
purrr::map_dfc(~seg::dissim(data = .[c("value", "other")])$d)
# same output
I am working on market transaction data where each observation contains the buyer's id and the seller's id. For each observation (i.e., each transaction), I would like to create a variable equal to the number of other transactions the associated seller has done with a buyer different from the one involved in this transaction. As a consequence, given the following
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched for existing answers to problems similar to mine, usually involving the function mapply(), and tried to implement them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If so, you might want to use c instead of list when you construct the expected result in your reproducible example.
data <- data.frame(Buyer_id = c("001","001","002","001"),
Seller_id = c("021","022","022","021"))
data$n <- NA
for (i in seq_len(nrow(data))) {
  seller <- as.character(data[i, "Seller_id"])
  buyer <- as.character(data[i, "Buyer_id"])
  # all distinct buyers this seller has transacted with
  with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
  with.buyers <- unique(with.buyers)
  # drop the buyer involved in the current transaction
  diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
  data[i, "n"] <- length(diff.buyers)
}
Buyer_id Seller_id n
1 001 021 0
2 001 022 1
3 002 022 1
4 001 021 0
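For larger data, a grouped, vectorized alternative is possible; here is a sketch, assuming the same data frame as above. Because the current buyer is always among a seller's distinct buyers, the count of other buyers is just the number of distinct buyers minus one:
library(dplyr)
data %>%
  group_by(Seller_id) %>%
  # distinct buyers for this seller, minus the one in the current row
  mutate(n = n_distinct(Buyer_id) - 1) %>%
  ungroup()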
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)
# Seller_id comes first so that edges point from sellers to buyers
data <- data.frame(Seller_id = c("021","022","022","021"),
                   Buyer_id = c("001","001","002","001"),
                   stringsAsFactors = FALSE)
my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))
# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))
V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))
data$n <- apply(data,
MARGIN = 1,
FUN = function(transaction)
{
node.out.degree <- V(my.graph)$out.degree[ V(my.graph)$name == transaction["Seller_id"] ]
if (node.out.degree <= 1) {
# Since the vertex has at most 1 out degree (after simplify), the seller
# has no distinct buyer other than the one in the current transaction.
return(0)
} else {
# In this case, the seller deals with at least one other buyer. We
# therefore subtract one (the edge to the current buyer) from the
# out degree.
return(node.out.degree - 1)
}
})
data
I have some code which includes a for loop, and nested if statements. The issue is that it is taking too long to run and I want to make it much faster.
I have data on cohorts in a data frame called f2_cebu_davao. There is also a column in this data frame called person_id. There are 3 categories of the cohorts: 'Baseline', 'Other Effects', 'Campaign'.
I want to loop through each person_id in the f2_cebu_davao data frame and check which cohort it is in. If it is in the 'Baseline' or 'Other Effects' cohort, I check the before_baseline_othereffects table to see whether the ID appears there; if it does, a new new_or_returning column in f2_cebu_davao gets the value 'Returning', and otherwise 'New'.
If the cohort name is 'Campaign', I check the before_campaign table and apply the same procedure.
My data is quite big (all my objects are big) so this is taking a really long time to run (it's been running for more than 30 minutes and still not finished!).
How can I speed this up (possibly by using vectorization, or just by modifying the code a little)?
I tried looping through, but it's taking too long.
before_baseline_othereffects <- subset(loans_final_full, submitted_at_date < '2018-05-21')
before_campaign <- subset(loans_final_full, submitted_at_date < '2019-01-21')
for(i in 1:nrow(f2_cebu_davao)){
if(as.vector(f2_cebu_davao[, 'cohort'][i]) == 'Baseline') {
if(as.vector(f2_cebu_davao[,'person_id'][i]) %in% as.vector(unique(before_baseline_othereffects$person_id)) == TRUE) {
f2_cebu_davao$new_or_returning[i] <- 'Returning'
} else {
f2_cebu_davao$new_or_returning[i] <- 'New'
}
} else if (as.vector(f2_cebu_davao[, 'cohort'][i]) == 'Other Effects'){
if(as.vector(f2_cebu_davao[,'person_id'][i]) %in% as.vector(unique(before_baseline_othereffects$person_id)) == TRUE) {
f2_cebu_davao$new_or_returning[i] <- 'Returning'
} else {
f2_cebu_davao$new_or_returning[i] <- 'New'
}
} else {
if(as.vector(f2_cebu_davao[,'person_id'][i]) %in% as.vector(unique(before_campaign$person_id)) == TRUE) {
f2_cebu_davao$new_or_returning[i] <- 'Returning'
} else {
f2_cebu_davao$new_or_returning[i] <- 'New'
}
}
}
Happy to update and test this if you can provide some example data and desired output. I expect something like this should work.
Here I make up some fake data:
f2_cebu_davao <- data.frame(stringsAsFactors = F,
                            cohort = rep(c("Baseline", "Other Effects", "Something else",
                                           "Another Something"), times = 3),
                            person_id = 1:12)
before_baseline_othereffects <- c(1:4)
before_campaign <- c(5:8)
Here I apply it using dplyr's case_when, spelling out four cases. This code is vectorized, so I expect it to run much faster than the current loop:
1. The cohort is either "Baseline" or "Other Effects", and the person_id appears in before_baseline_othereffects: mark "Returning", as in rows 1 & 2.
2. The first condition wasn't met, but the cohort is still "Baseline" or "Other Effects": mark "New", as in rows 5 & 6.
3. The first two conditions weren't met, but the person appears in before_campaign: mark "Returning", as in rows 7 & 8.
4. Otherwise, mark "New", as in rows 3 & 4 and 9-12.
library(dplyr)
output <- f2_cebu_davao %>%
mutate(new_or_returning = case_when(
cohort %in% c("Baseline", "Other Effects") &
person_id %in% before_baseline_othereffects ~ "Returning",
cohort %in% c("Baseline", "Other Effects") ~ "New",
person_id %in% before_campaign ~ "Returning",
TRUE ~ "New"
))
Here's the output:
> output
cohort person_id new_or_returning
1 Baseline 1 Returning
2 Other Effects 2 Returning
3 Something else 3 New
4 Another Something 4 New
5 Baseline 5 New
6 Other Effects 6 New
7 Something else 7 Returning
8 Another Something 8 Returning
9 Baseline 9 New
10 Other Effects 10 New
11 Something else 11 New
12 Another Something 12 New
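If you'd rather stay in base R, here is a minimal sketch of the same vectorized logic, using the fake data and lookup vectors above:
f2_cebu_davao$new_or_returning <- with(f2_cebu_davao, ifelse(
  cohort %in% c("Baseline", "Other Effects"),
  # Baseline / Other Effects cohorts check the first lookup
  ifelse(person_id %in% before_baseline_othereffects, "Returning", "New"),
  # every other cohort checks the campaign lookup
  ifelse(person_id %in% before_campaign, "Returning", "New")))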
Let's say I have the following:
filt = data.frame(X1 = c("Gender","EmployeeStatus"),X2 = c('Male','Active'))
df = data.frame(Gender = c('Male','F','Male','Male','F'),EmployeeStatus = c('Active','na','Active','Active','na'))
I want to be able to use the data in filt as a filter for df. I've tried the code below, but I'm getting Error in get: invalid first argument.
d2 = df[get(filt[1,1])==filt[2,1] &
get(filt[1,2])==filt[2,2]]
What you're looking for is called a semi_join, and you need the filter data frame to be in a different format:
library(dplyr)
df <- data.frame(Gender = c('Male','F','Male','Male','F'),
                 EmployeeStatus = c('Active','na','Active','Active','na'),
                 stringsAsFactors = FALSE)
filt <- data.frame(Gender = "Male",EmployeeStatus = "Active",stringsAsFactors = FALSE)
> semi_join(df,filt)
Joining, by = c("Gender", "EmployeeStatus")
Gender EmployeeStatus
1 Male Active
2 Male Active
3 Male Active
(By the way, I think you transposed some indices in your question.)
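If your filter arrives in the two-column layout from the question, a quick reshape (a sketch, assuming the filt from the question and the df above) turns it into the one-row format semi_join expects:
# build a named vector of values, then spread it into a one-row data frame
filt_wide <- as.data.frame(as.list(setNames(as.character(filt$X2),
                                            as.character(filt$X1))),
                           stringsAsFactors = FALSE)
semi_join(df, filt_wide)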
We can use base R to do this:
# Compare each filter column against its corresponding value, then AND the
# logical vectors together with Reduce()
df[Reduce(`&`, Map(`==`, df[as.character(filt$X1)],
                   as.character(filt$X2))),]
# Gender EmployeeStatus
#1 Male Active
#3 Male Active
#4 Male Active
Or with rowSums from base R
df[rowSums(df[as.character(filt$X1)] == as.list(as.character(filt$X2))) == 2,]
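Note that the == 2 matches the two filter conditions in this example; a slightly more general sketch of the same idea compares against nrow(filt) so it scales with the number of conditions:
# same rowSums check, but the threshold follows the number of filter rows
df[rowSums(df[as.character(filt$X1)] == as.list(as.character(filt$X2))) == nrow(filt),]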
There are several problems in the OP's code. That syntax is valid only within the scope of a data.table, e.g.:
library(data.table)
setDT(df)
d2 = df[get(as.character(filt[1,1]))==as.character(filt[1,2]) &
get(as.character(filt[2,1]))==as.character(filt[2,2])]
# Gender EmployeeStatus
# 1: Male Active
# 2: Male Active
# 3: Male Active
Let me try to explain why the OP's code is not working.
get(filt[1,1]) => Gender.
In a data.frame, df[Gender,] doesn't work because the variable Gender is not available in the data.frame's scope. On top of that, filt[1,1] is a factor, which get() cannot accept as its first argument.
Hence, the changes implemented in the solution above can be described as follows:
First, as.character(filt[1,1]) changes the factor to a character string. Then, inside a data.table, get("Gender") resolves to the Gender column, so the filtering expression is valid data.table syntax.
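A tiny sketch of the difference (assuming the filt and df from the question, created with the pre-R-4.0 default stringsAsFactors = TRUE, and setDT(df) already applied):
filt[1,1]                # a factor -- get() rejects it with 'invalid first argument'
as.character(filt[1,1])  # "Gender" -- a plain character string
df[get(as.character(filt[1,1])) == "Male"]  # resolved inside df's scope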
I have two data sets, Transaction_long and Transaction_short. Transaction_long has many quotes of policy and price, with purchase points (rows where Purchased is 1) in the dataset. Transaction_short has only the purchase-point entries.
My objective is to add a column called Policy_Change_Frequency to the Transaction_short dataset. For every customer in the short dataset, iterate over that customer's rows in the long dataset and calculate how many times the policy changed.
To find the policy changes I can use sum(diff(Transaction_Long$Policy) != 0), but I'm not sure how to iterate over these two data sets and get the results.
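For reference, here is what that idiom computes on a single customer's policy sequence (values taken from Joe's rows in the sample data below):
p <- c(1, 1, 2, 2, 2, 1, 3)  # Joe's Policy column
sum(diff(p) != 0)            # 3 policy changes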
Details:
Customer_Name: name of customer
Customer_ID: customer identifier number
Purchased: Boolean variable (Yes = 1, No = 0)
Policy: categorical (takes values 1-5)
Price: price quoted
Data set 1: Transaction_Long
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627
Data set 2: Transaction_Short
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,1,3,501
Mary,103,1,2,627
I need to add a Policy Change Frequency column to the Transaction_Short dataset, so my final Transaction_Short dataset should look like this:
Customer_Name,Customer_ID,Purchased,Policy,Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2
Consider a calculated column that tags a policy change from the previous row within each customer with a one, then aggregate the ones for a count. A merge is used because two aggregations are needed (the final row for each customer, and the PolicyChanged count):
# Tag a 1 whenever the policy changes from the previous row for the same customer
Transaction_Long$PolicyChangedFreq <- sapply(1:nrow(Transaction_Long),
  function(i)
    if (i > 1) {
      ifelse(Transaction_Long$Policy[i-1] == Transaction_Long$Policy[i], 0,
             ifelse(Transaction_Long$Customer_ID[i-1] !=
                    Transaction_Long$Customer_ID[i], 0, 1))
    } else { 0 }
)
# Last quoted row per customer, merged with the summed change tags
Transaction_Final <- merge(aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:5)], FUN = tail, n = 1),
                           aggregate(. ~ Customer_ID + Customer_Name,
                                     Transaction_Long[, c(1:2,6)], FUN = sum),
                           by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
# Customer_ID Customer_Name Purchased Policy Price PolicyChangedFreq
#1 101 Joe 1 3 501 3
#2 103 Mary 1 2 627 2
@Parfait, thank you for the solution. I solved this using the sqldf package in R:
library(sqldf)
for (i in 1:nrow(Transaction_short)) {
  sql <- sprintf("SELECT policy FROM Transaction_long WHERE customer_ID = %s",
                 Transaction_short$customer_ID[i])
  df <- sqldf(sql)
  # count changes between consecutive policies
  NF <- sum(df$policy[-1] != df$policy[-length(df$policy)])
  Transaction_short$Policy_Change_Freq[i] <- NF
}
Since I have about 500K rows in the long dataset and about 100K in the short dataset, this is taking a while. Is there any other solution that does not require loops? Thank you.
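One loop-free possibility, sketched under the assumption that the object and column names match the snippet above: count the policy changes once per customer with aggregate, then merge the counts into the short table.
# one row per customer with the number of consecutive-policy changes
changes <- aggregate(policy ~ customer_ID, data = Transaction_long,
                     FUN = function(p) sum(diff(p) != 0))
names(changes)[2] <- "Policy_Change_Freq"
# attach the counts to the purchase-point rows
Transaction_short <- merge(Transaction_short, changes, by = "customer_ID")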