R - efficiently organize tables on conditions over time

I'd like to know how to organize a data.frame into tables based on conditions over time. I have a politics data set where certain organizations take a position on a bill, and each bill either passed or failed, over the last few decades.
I know how to organize the data into tables individually, but I do it one-by-one, and it's really hard to see the trends. The stackoverflow community always seems to have ingenious ways of grouping data. Here's some mock data:
Data <- data.frame(
year = sample(1998:2004, 200, replace = TRUE),
outcome = sample(0:1, 200, replace = TRUE),
biz1 = sample(-2:2, 200, replace = TRUE),
biz2 = sample(-2:2, 200, replace = TRUE),
biz3 = sample(-2:2, 200, replace = TRUE)
)
In the biz columns, a negative number means the organization opposes the bill and a positive number means it supports it. In outcome, a 0 means the law did not pass and a 1 means that it did.
I would like to use tables to see how each business has become more or less successful over time, by looking at how often its positive numbers line up with 1s and its negative numbers line up with 0s, compared to every other organization (and vice versa, with positive positions matching 0s).
A few notes:
In the data set I have about 100 businesses as columns, so I definitely need an efficient way to make the tables without naming every single column. I can select them as a range, like 125:300, since they are ordered together.
Of course I'm open to all ideas! Feel free to list any other ways of looking at this.
If I failed to ask this question right, please let me know how I could improve it.

The comments above about your question being too vague are right on target. Having said that, this interests me and the vagueness leaves me free to interpret...
First, I'd recode the outcome as -1 if the bill fails. Then outcome * bizn is in a sense a success score for that business on that piece of legislation: positive if a bill that the business supported passed, or if a bill that the business opposed failed. There are several ways to visualize the scores; here are just a few to get you started.
# re-code outcomes: -1 = failed, 1 = passed
Data$outcome <- ifelse(Data$outcome == 0, -1, 1)

library(reshape2)  # for melt(...)
library(ggplot2)

# reshape to long format: one row per (year, outcome, business)
gg <- melt(Data, id = c("year", "outcome"),
           variable.name = "business", value.name = "support")
gg$score <- with(gg, outcome * support)  # score represents level of success

# mean success vs. year with +/- 1 sd
ggplot(gg, aes(x = year, y = score, color = business)) +
  stat_summary(fun.data = "mean_sdl") +
  stat_summary(fun.y = mean, geom = "line") +
  facet_grid(business ~ .)

# boxplot of success scores
ggplot(gg, aes(x = factor(year), y = score)) +
  geom_boxplot(aes(fill = business)) +
  facet_grid(business ~ .)

# barplot of success/failure frequencies
# excludes cases where a business did not take a position pro or con
gg.bar <- aggregate(score ~ year + business, gg,
                    function(eff) c(success = sum(eff > 0), failure = sum(eff < 0)))
gg.bar <- data.frame(gg.bar[1:2], gg.bar$score)

ggplot(gg.bar, aes(x = factor(year))) +
  geom_bar(aes(y = success, fill = "success"), stat = "identity") +
  geom_bar(aes(y = -failure, fill = "failure"), stat = "identity") +
  geom_hline(yintercept = 0, linetype = 2, color = "blue") +  # horizontal lines take yintercept
  scale_fill_discrete(name = "", breaks = c("success", "failure")) +
  labs(x = "", y = "frequency") +
  facet_grid(business ~ .)
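Since the question asked for tables, the same score can also be cross-tabulated directly, counting only the cases where a business actually took a position; a quick sketch:

# success = score > 0, failure = score < 0; support == 0 means no position was taken
ftable(with(subset(gg, support != 0),
            table(business, year, result = ifelse(score > 0, "success", "failure"))))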
All of these represent rather simplistic ways of looking at the data. If this were a serious project I would probably run a principal components analysis on the businesses to identify groups of businesses that tend to support or oppose the same legislation. Then I'd run a cluster analysis on the principal components to identify groups of legislation that tend to attract the support or opposition of those groups of businesses.
Another way to approach this would be to run a logistic regression on the outcomes, using the support/opposition of the various businesses as predictors. This would tell you which businesses tend to be more influential.
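A minimal sketch of that logistic-regression idea on the mock data (note the outcome was recoded to -1/1 above, so it is put back on 0/1 for glm; the 125:300 column range is the one mentioned in the question):

Data01 <- transform(Data, outcome = ifelse(outcome == -1, 0, 1))  # back to 0/1 for glm
fit <- glm(outcome ~ biz1 + biz2 + biz3, data = Data01, family = binomial)
summary(fit)

# with ~100 business columns, build the formula programmatically instead, e.g.
# f <- reformulate(names(Data01)[125:300], response = "outcome")
# fit <- glm(f, data = Data01, family = binomial)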

Related

Is there some way to detect 'wrong' measures in a dataframe?

I'm struggling with how to remove 'wrong' measures from my dataset. I'm dealing with a fairly large table where I have a date and the size of a piece of equipment. The equipment can't get bigger with use; at most it stays the same size, so an increase is necessarily a measurement error.
My database is extensive and has several particular cases, which makes it impossible for me to place it here, among other business reasons... Therefore I use an image and a part of the data as an example, but the problem is what I described above...
simplest_example <- data.frame(
  data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021", "20-11-2021"),
  measure = c(5, 4, 3, 5, 2)
)
# desired result (the 20-10-2021 row, where the measure increases, removed):
#        data1 measure
# 1 20-09-2020       5
# 2 15-10-2020       4
# 3 13-05-2021       3
# 4 20-11-2021       2
The point is: select the longest possible non-ascending sequence, excluding the values that prevent it.
So I would like to ask for a suggestion; if anyone here has come across something similar, please let me know what you would recommend.
If I understand, you want to detect any time the variable measure is greater than the value at the previous time point? I'd create a lag column, which is just the measure column lagged by one time step, and then identify the rows where the current measure is greater than the previous one:
library(dplyr)

simplest_example %>%
  mutate(previous_measure = lag(measure)) %>%  # measure at the previous time point
  filter(previous_measure < measure)           # rows where the measure increased
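If the goal is to actually drop the offending rows and keep a non-ascending sequence, a greedy pass like the following might be enough. It keeps a row only if its measure does not exceed anything seen before it, so it is a simple first pass rather than a guaranteed longest subsequence:

# dplyr is already loaded above
simplest_example %>%
  filter(measure <= cummin(measure))  # keeps measures 5, 4, 3, 2 and drops the 20-10-2021 row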

How to extract top features by CATScore in r?

I am running a machine learning algorithm that uses the CAT score for feature selection, as follows:
library(sda)
train1<- data.matrix(train, rownames.force = NA)
ranking.LDA = sda.ranking(train1[,1:lengthvar], train1[,lengthtrain], diagonal=FALSE)
topfs<-which(ranking.LDA[,"score"] >2)
My question is how to ask the CAT score to give me, for example, the top 20 features. The only way I could extract features was by setting a threshold, but that way it gives me a varying number of features for different data sets. What I want is to always get, e.g., the top 20 (or any other number of) features.
Thanks in advance for your valuable contribution.
ranking.LDA ranks the predictors by score, so we can extract the column names directly using that ranking:
# as ranking.LDA orders the predictors by score, the first 20 entries point at the top 20 features
colnames(train1[,ranking.LDA[1:20]])
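A slightly more explicit sketch, assuming the matrix returned by sda.ranking() has its rows sorted by decreasing score and an "idx" column holding the original column positions (worth checking with colnames(ranking.LDA) on your own object):

top20_idx   <- ranking.LDA[1:20, "idx"]    # positions of the 20 best-ranked features
top20_names <- colnames(train1)[top20_idx]
top20_names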

How to run a function on EACH of my observations in R?

My problem is as follows:
I have a dataset of 6000 observations containing information about customers (each observation is one client's information).
I'm optimizing a given function (in my case a profit function) in order to find an optimum for my variable of interest. In particular, I'm looking for the optimal interest rate I should offer in order to maximize my expected profits.
I don't have any doubt about my function. The problem is that I don't know how to apply this function to EACH OBSERVATION in order to obtain an OPTIMAL INTEREST RATE for EACH OF MY 6000 CLIENTS (or observations, as you prefer).
Until now it has been easy to find the UNIQUE optimum (the same for all clients) that would maximize my profits (the global maximum, I guess). But what I need to know is how to apply my optimization problem to EACH of my 6000 observations INDIVIDUALLY, in order to get the optimal interest rate to offer each customer (that is, 6000 optimal interest rates, one for each of them).
I guess I should do something like a for loop, but my experience in this area is limited and I'm quite frustrated already. What's more, I've tried to use mapply(myfunction, mydata) as usual, but I only get error messages.
This is how my (really) simple code now looks like:
# Amount, Previous.Rate and Competition.Rate are vectors taken from the global environment
profits <- function(Rate)
  sum((Amount * (Rate - 1.2) / 100) *
        (1 / (1 + exp(0.600002438 - 0.140799335888812 *
                        ((Previous.Rate - Rate) + (Competition.Rate - Rate))))))
And results for ONE optimal for the entire sample:
> optimise(profits, lower = 0, upper = 100, maximum = TRUE)
$maximum
[1] 6.644821
$objective
[1] 1347291
So the thing is: how do I rewrite my code in order to maximize this and obtain the optimum of my variable of interest for EACH of my rows?
Hope I've been clear! Thank you all in advance!
It appears each of your customers is independent, so you can wrap lapply() around the optimise() call. For this to give a different answer per customer, the profit function has to use that customer's own values rather than the whole vectors:
lapply(customer_list, function(one_customer){
  # profits() must be written so it uses one_customer's values (Amount, Previous.Rate, ...)
  optimise(profits, lower = 0, upper = 100, maximum = TRUE)
})
This will return a very big list in which each element has a $maximum (the optimal rate) and an $objective (the profit at that rate). You can then use sapply() to pull those out and total the objectives, to find just how rich you have become!
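A fuller sketch of that idea, assuming the customer data sit in a data frame called mydata with one row per customer and columns Amount, Previous.Rate and Competition.Rate (the names used in the question's profit function):

profits_row <- function(Rate, customer) {
  with(customer,
       (Amount * (Rate - 1.2) / 100) *
         (1 / (1 + exp(0.600002438 - 0.140799335888812 *
                         ((Previous.Rate - Rate) + (Competition.Rate - Rate))))))
}

results <- lapply(seq_len(nrow(mydata)), function(i)
  optimise(profits_row, lower = 0, upper = 100, maximum = TRUE,
           customer = mydata[i, ]))

optimal_rates    <- sapply(results, `[[`, "maximum")    # one optimal rate per customer
expected_profits <- sapply(results, `[[`, "objective")  # expected profit at that rate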

How to manage factors with mixed data types

I'm afraid this question has two sub-parts. My project is to determine which insurance carrier has the lowest cost based on CPT codes. Since there are so many CPT codes, I wanted to group them using cut like this:
uCPTCode<- unique(data$CPTCode)
uCPTCode <- cut(uCPTCode,
breaks = c(-Inf, "01999", "69979", "79999", "89398", "99091", "99499", Inf),
labels = c("NA","Anesthesia", "Surgery", "Radiology", "Pathology&Laboratory", "Medicine","Evaluation&Management", "Temp"),
right = FALSE)
Not sure unique is required or wise, but it seemed to make sense to me. The issue is that some codes have leading zeros and terminating letters, like this:
2608 Levels: 0014F 0159T 0164T 0191T 0195T 0232T 0319T 0326T 0513F 0517F 0518F
So question 1 is: what is the process to convert these ranges into integers corresponding to the labels I have in the cut function, so I can graph the grouped results on the x axis?
Question 2 is that I expected the ranges to be continuous, but they are not. How do I manage what happens around codes 99000 through 99216, where previous groups (Medicine, Anesthesiology and Evaluation and Management) get combined? Here is a link to the CPT grouper file https://www.dropbox.com/s/wm55n17pufoacww/CPTGrouper.xlsx?dl=0
Here is a smattering of results to see where I am going with it
https://www.dropbox.com/s/h6sdnvm9yew6jdg/SampleStudyResults.xlsx?dl=0
Thanks very much for your time and attention
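One possible sketch for question 1: flag the Category II/III codes (the ones ending in F or T) separately, and cut() only the purely numeric codes; the sample codes and break points below are illustrative, not taken from the grouper file:

codes    <- c("0014F", "0159T", "00140", "23456", "70010", "88300", "99213")  # hypothetical sample
is_cat23 <- grepl("[FT]$", codes)         # Category II/III codes, handled separately
num      <- as.integer(codes[!is_cat23])  # leading zeros drop out once converted to integer
section  <- cut(num,
                breaks = c(0, 1999, 69990, 79999, 89398, 99607),
                labels = c("Anesthesia", "Surgery", "Radiology",
                           "Pathology&Laboratory", "Medicine/E&M"),
                right = TRUE)
data.frame(code = codes[!is_cat23], section)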

Optimizing dataset based on several conditions

I am trying to construct an (optimal) subset from a large dataset based on several conditions. I know there are some possibilities for constructing such a subset; see for example this link. I tried that function but it is unsatisfactory, since it takes too long to find such a subset and might not be "intelligent" enough. Below you can find some sample data:
library(data.table)

data <- data.table(
  id = rep(c("a","b","c","d","e","f"), 3),
  balance = c(1000, 2000, 1500, 2000, 4000, 1500,
              800, 2000, 1300, 1800, 2000, 500,
              700, 1900, 1100, 1600, 500, 30),
  rate = c(1100, 1500, 1000, 700, 300, 200,
           400, 700, 500, 1300, 1600, 700,
           800, 1100, 1200, 700, 400, 150),
  grade = c(70, 100, 90, 50, 150, 40,
            30, 80, 55, 80, 85, 20,
            35, 70, 55, 75, 15, 10),
  date = rep(c(2012, 2013, 2014), each = 6)
)
data_agg <- aggregate(cbind(rate, grade) ~ date, data = data.frame(data),sum,na.rm=T)
data_agg$ratio <- data_agg$rate / data_agg$grade
> data_agg$ratio
[1] 9.60000 14.85714 16.73077
Now the objective is (e.g.) to minimize the increase in the data_agg$ratio over the years and at the same time include at least 3 id's in this subset.
By looking at the data we see, for example, that ID == "e" has a ratio of 300/150 = 2 in 2012, 1600/85 ≈ 19 in 2013 and 400/15 ≈ 27 in 2014. Since the objective is to minimize the increase over the years, deleting "e" might have a desirable effect on the subset.
datasubset <- subset(data, subset = id != "e")
data_aggsubset <- aggregate(cbind(rate, grade) ~ date, data = data.frame(datasubset),sum,na.rm=T)
data_aggsubset$ratio <- data_aggsubset$rate / data_aggsubset$grade
data_aggsubset$ratio
[1] 12.85714 13.58491 16.12245
And indeed, the ratio is more stable over the years now. So my question is whether there is some optimizer function which seeks IDs such that this ratio stays, e.g., within a bandwidth of +/- 50% of the starting value (9.6 in this example) and the subset contains at least three IDs. My original dataset is large, so I am looking for a more intelligent function than the one I attached in the link. Please let me know if anything is unclear. Thank you in advance!
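For a dataset of this toy size, one possible sketch is simply to enumerate every subset of at least three ids and score it; a real dataset with many ids would need something smarter (a greedy search or an integer-programming formulation), but this shows the shape of the computation:

library(data.table)

ids        <- unique(data$id)
candidates <- unlist(lapply(3:length(ids), function(k)
  combn(ids, k, simplify = FALSE)), recursive = FALSE)

# score a subset by how far its yearly ratios stray from the first year's ratio
score_subset <- function(keep) {
  agg <- data[id %in% keep, .(ratio = sum(rate) / sum(grade)), by = date]
  max(abs(agg$ratio / agg$ratio[1] - 1))
}

scores <- sapply(candidates, score_subset)
best   <- candidates[[which.min(scores)]]
best         # ids of the most stable subset
min(scores)  # its worst relative deviation from the 2012 ratio (e.g. 0.5 = +/- 50%)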
