I have a dataset with an outcome flag which indicates whether a row has 'Good' or 'Bad' performance. The dataset has 83,923 records; of these, only 150 are 'Bads' while the remainder are 'Goods'.
I'm using this dataset to get the information value (IV) of the variables against the outcome flag, using the code below:
all_data_IV <- smbinning.sumiv(all_data,y="OUTCOME_FLAG")
This gives a list of all variables and their IV for the outcome flag.
My problem is, I don't have enough 'Bads' to make this reliable and would therefore need to reduce the number of 'Goods' in the dataset.
I'd like to reduce the number of 'Goods' to 10,000 rows and keep all 150 'Bads', but then weight the 'Goods' up so that the sample is a fair representation. The weights would be as follows:
        No. Records  Sample  Weight
Goods        83,773  10,000    8.37
Bads            150     150    1.00
In SAS if you want to apply a weight to a dataset you do something like the following:
/*CODING WEIGHT VALUE*/
DATA DEV_SAMPLE;
    SET DEV_SAMPLE;
    IF FLAG = "DV" THEN WEIGHT_VALUE = 1;
    ELSE WEIGHT_VALUE = 8.37;
RUN;
/*APPLYING WEIGHT VALUE*/
PROC FREQ DATA = DEV_SAMPLE;
    TABLE ARREARS;
    WEIGHT WEIGHT_VALUE;
RUN;
Does anyone know how to replicate this in R so I can apply a weight to the new sample?
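For reference, here is a minimal sketch of one way to do this in base R, assuming your data frame has the same OUTCOME_FLAG ("Good"/"Bad") and ARREARS columns as in the SAS example. xtabs() sums a weight variable given on the left-hand side of its formula, which plays the role of the WEIGHT statement in PROC FREQ:
set.seed(1)
goods <- all_data[all_data$OUTCOME_FLAG == "Good", ]
bads  <- all_data[all_data$OUTCOME_FLAG == "Bad", ]
# keep all 150 Bads and a 10,000-row sample of the Goods
dev_sample <- rbind(goods[sample(nrow(goods), 10000), ], bads)
# mirror the SAS DATA step: weight of 8.37 for Goods, 1 for Bads
dev_sample$WEIGHT_VALUE <- ifelse(dev_sample$OUTCOME_FLAG == "Good", 8.37, 1)
# weighted frequency table, analogous to PROC FREQ with a WEIGHT statement
xtabs(WEIGHT_VALUE ~ ARREARS, data = dev_sample)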
I am new to the backtesting methodology: an algorithm to assess whether something works based on historical data. Since I am new to it, I am trying to keep things simple in order to understand it. So far I have understood the following. Let's say I have a time-series data set:
date = seq(as.Date("2000/1/1"),as.Date("2001/1/31"), by = "day")
n = length(date);n
class(date)
y = rnorm(n)
data = data.frame(date,y)
I will keep the first 365 days as the in-sample period in order to do something with them, and then I will update them with one observation at a time for the next month. Am I correct here?
So if I am correct, I define the in sample and out of sample periods.
ntotal = nrow(data); ntotal  # avoid T as a name, since it is shorthand for TRUE
outofsampleperiod = 31
initialsample = ntotal - outofsampleperiod
I want for example to find the quantile of alpha =0.01 of the empirical data.
pre = data[1:initialsample,]
ypre = pre$y
quantile(ypre,0.01)
1%
-2.50478
Now the difficult part for me is to update them in a for loop in R.
I want to add one observation at a time and find the empirical quantile of alpha = 0.01 again, print them all, and check whether each is greater than the in-sample quantile found previously.
for (i in 1:outofsampleperiod){
qnew = quantile(data$y[1:(initialsample+i-1)],0.01) # empirical quantile of the observed y values
print(qnew)
}
You can create a little function that gets the quantile of column y, over rows 1 to i of a frame df like this:
func <- function(i,df) quantile(df[1:i,"y"],.01)
Then apply this function to each row of data
data$qnew = sapply(1:nrow(data),func,df=data) # sapply returns a numeric vector; lapply would give a list column
Output (last six rows)
> tail(data)
date y qnew
392 2001-01-26 1.3505147 -2.253655
393 2001-01-27 -0.5096840 -2.253337
394 2001-01-28 -0.6865489 -2.253019
395 2001-01-29 1.0881961 -2.252701
396 2001-01-30 0.1754646 -2.252383
397 2001-01-31 0.5929567 -2.252065
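To finish the last step of the question (checking each updated quantile against the in-sample one), a minimal follow-up sketch using the objects defined above:
insample_q <- quantile(data$y[1:initialsample], 0.01) # the fixed in-sample quantile
exceeds <- data$qnew > insample_q                     # TRUE where the rolling quantile is greater
tail(exceeds)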
Very new to R here, also very new to the idea of coding and computer stuff.
It's my second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the data and tried to follow along with his verbal instructions during class, but I am one of the only people without a computer-science background in my degree program (I am an RN going for a degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on just where to start with his list of tasks. I downloaded his data into an Excel file and then uploaded it into R, where it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. I understand that I need to convert these column names into some sort of vector, but every website online tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get myself started would be greatly appreciated. I've included his instructions below. Please excuse my ignorance/lack of familiarity compared to most of you here... this is my second week into my master's and I am hoping I begin to pick this up soon; I am just not there yet.
the instructions include:
# * Import the dataset
# * Summarize the dataset; compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate smokers and age.level data with the variable and its frequency. How many smokers are in each age category?
# * Subset the dataset to the mothers that smoke and weigh less than 100kg; how many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram
Stack Overflow is not a place for homework, but I feel your pain. Let's go through it piece by piece.
First let's use a package that helps us do those tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) #again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1 summarize the dataset. Here you'll see the ranges, means, medians and other interesting data of your table.
summary(dt)
1A mean and standard deviation of age, height and weight (replace age with height or weight to get the others)
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
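If you'd rather not repeat that for each column, one possible data.table idiom computes all three at once (assuming the columns are named exactly age, height and weight):
dt[, lapply(.SD, function(x) c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))),
   .SDcols = c("age", "height", "weight")] # first row holds the means, second the standard deviations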
2 tabulate smokers and age.level. get the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3 subset smoker mothers with wt < 100 (I'm assuming non-pregnant mothers have NA in the gestation field. Adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4 Is the same as 1A.
5 Plot a histogram (but you don't specify of what variable, so let's say it's age):
hist(dt$age)
Keep on studying R, it's not that difficult. The book recommended in the comments is a very good start.
I am trying to generate synthetic user event log data for demonstration purposes. It's going to be very basic feature-wise (about 4 variables altogether). Here is what I have so far:-
require(wakefield)#for generating the Status variable
require(dplyr)
require(stringi)
set.seed(1)
#data<-data.frame()
eventDate<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
eventDate<-sample(rep(eventDate,each=1000),replace = T)
u <- runif(length(eventDate), 0, 60*60*12) # "noise" to add or subtract from some timepoint
eventDateTime<-as.POSIXlt(u, origin = paste(eventDate, "00:00:00")) # paste() inserts the space between date and time
eventDateTime
eventOutcome<-r_sample_factor(x = c("Passed", "Failed", "Ongoing","Unknown"), n=length(eventDate))
eventOutcome
data<-data.frame(eventDate,eventDateTime,eventOutcome)
head(data)
# eventDate eventDateTime eventOutcome
#1 2015-01-25 2015-01-25 04:48:47 Unknown
#2 2015-05-05 2015-05-05 09:35:22 Unknown
#3 2015-11-28 2015-11-28 08:56:16 Failed
#4 2015-05-23 2015-05-23 02:24:52 Ongoing
#5 2015-01-26 2015-01-26 07:43:52 Failed
#6 2015-10-22 2015-10-22 03:07:14 Passed
There are about 365,000 rows of data here. All that is left to do is add a user identifier variable. I would like some users to have maybe a handful of interactions recorded in the data set, whereas other users may have dozens/hundreds/thousands of interactions (I would like this dataset to have that kind of variability).
I can create a user identifier variable no problem:-
UserId<-stri_rand_strings(1300,6)
But if I add this to the data, it doesn't work:-
data$UserId<-stri_rand_strings(1300,6)
Error in `$<-.data.frame`(`*tmp*`, UserId, value = c("k3QlXs", "gK3eBa", :
replacement has 1300 rows, data has 365000
So my request is two-fold: how can I assign a user identifier variable to this kind of data, and how can I make it variable, so that some users have one or a few interactions whilst others appear frequently (i.e. dozens, hundreds or thousands of times)?
Thank you in advance for any help, always appreciated :)
One option might be to generate a UUID for each user. A UUID looks like this:
c7f2dde5-dfeb-45cb-9720-87b23effd45d
If you use a good UUID generator, then it is almost impossible to generate the same UUID more than once. R has a uuid package which can be used:
library(uuid)
user_uuid <- UUIDgenerate()
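To tie this to a data set, one possible usage is to generate one UUID per user and then sample them (UUIDgenerate() accepts an n argument in recent versions of the uuid package, as the next answer also relies on):
user_ids <- UUIDgenerate(n = 1300)                          # one UUID per user
data$UserId <- sample(user_ids, nrow(data), replace = TRUE) # assign a user to each row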
It sounds like the distribution of user IDs you are looking for is something akin to a power law or Pareto distribution. This is a way to generate a vector of relative probabilities of sampling each user ID that follows a power law, then sample the user IDs following those relative probabilities.
Here I've used the function Pareto::rPareto to generate the relative probabilities. You can vary the parameters, especially alpha, to get different distributions. As alpha increases the distribution will become more even. I also supplied a truncation parameter so that you will not get too many users with unrealistically high numbers of purchases (In this example the most frequent ID has ~3700 cases).
Note you are not guaranteed to sample each of the 1300 user IDs at least once with this particular way of doing it.
Next I use the function uuid::UUIDgenerate to generate 1300 unique strings. Finally I use sample to sample the unique IDs with replacement as many times as you have rows in your data frame. I plot the frequencies of the different IDs in the sample. Again, modify the parameters if this distribution is not what you are looking for.
library(Pareto)
library(uuid)
library(ggplot2)
set.seed(1)
n_users <- 1300
n_rows <- 365000
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
id_sample <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
# Check the distribution of IDs
ggplot(as.data.frame(table(id_sample)), aes(x = Freq)) +
geom_histogram() +
scale_x_log10()
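Attaching the sampled IDs to the question's data frame is then just:
data$UserId <- id_sample
head(sort(table(data$UserId), decreasing = TRUE)) # the most active users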
You are generating 1300 strings but the number of rows in your data is 365,000, so you can use sample to repeat those 1300 strings randomly.
library(stringi)
data$UserId <- sample(stri_rand_strings(1300,6), nrow(data), replace = TRUE)
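If you also want the uneven usage pattern described in the question, one hedged variant is to pass a prob argument to sample; rexp() weights are just one arbitrary way to skew the draw:
ids <- stri_rand_strings(1300, 6)
w <- rexp(1300)^3 # cubing exaggerates the tail, so a few users dominate
data$UserId <- sample(ids, nrow(data), prob = w, replace = TRUE)
summary(as.vector(table(data$UserId))) # spread of interactions per user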
I'm quite a newbie in R, so I was interested in the optimality of my solution. Even though it works, it may be (a bit) long, and I wanted your advice on whether the way I solved it is the best; it would help me learn new techniques and functions in R.
I have a dataset on students identified by their id, and I have the school where they are matched and the score they obtained on a specific test (in short, 3 variables: id, match and score).
I need to construct the following table: for students between two percentiles of score, I need to calculate the average (across students) of the average score of the school they are matched to (so for each school I take the average score of the students matched to it, and then I average those school averages within percentile classes; yes, a school's average can appear twice in this calculation). In plain English, it lets me answer: "A student belonging to the x-th percentile in terms of score will on average be matched to a school with this average quality."
Here is an example (using the match and score data from the code below):
So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students having a score between 0 and 15, I take the average of the average score of the schools they are matched to (note that b's average appears twice, but that's ok).
Here is how I did it:
match <- c("a","b","a","b","c") # quotes needed: a, b, c are labels, not objects
score <- c(18,4,15,8,24)
scoreQuant <- cut(score,quantile(score,probs=seq(0,1,0.1),na.rm=TRUE))
AvgeSchScore <- tapply(score,match,mean,na.rm=TRUE)
AvgScore <- 0
for(i in 1:length(score)) {
AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore,scoreQuant,mean,na.rm = TRUE)
Do you have a more direct way of doing it? I think the weak point is using a loop; maybe apply() is better? But I'm not sure how to use it here (I tried to code my own function but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in other languages. The tapply function returns a vector named by the levels of the factor you grouped by; indexing that vector with match repeats each school's average once per student.
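A tiny illustration of that named-vector subsetting, using the example data from the question:
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgeSchScore                  #    a    b    c
                              # 16.5  6.0 24.0
AvgeSchScore[c("a","b","a")]  # repeats values by name: 16.5 6.0 16.5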
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the value in the NA row bothers you, you can delete it after.
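For comparison, a compact base-R version of the same computation uses ave(), whose default is to repeat each group mean once per row:
avg_by_school <- ave(score, match)                    # school average, one entry per student
tapply(avg_by_school, scoreQuant, mean, na.rm = TRUE) # average of averages per quantile bin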
I'm trying to write an R function that calculates whether a data subject is eligible for subsidies based on their income (X_INCOMG), the size of their household (household calculated from CHILDREN and NUMADULT), and the federal poverty limit for their household size (fpl_matrix). I use a number of if statements to evaluate whether the record is eligible, but for some reason my code is labeling everyone as eligible, even though I know that's not true. Could someone else take a look at my code?
Note that the coding for the variable X_INCOMG denotes income categories (less than $15000, 25-35000, etc).
#Create a sample data set
sampdf=data.frame(NUMADULT=sample(3,1000,replace=T),CHILDREN=sample(0:5,1000,replace=T),X_INCOMG=sample(5,1000,replace=T))
#Introducing some "impurities" into the data so it's more realistic
sampdf[sample(1000,3),'CHILDREN']=13
sampdf[sample(1000,3),'CHILDREN']=NA
sampdf[sample(1000,3),'X_INCOMG']=9
#this is just a matrix of the federal poverty limit, which is based on household size
fpl_2004=matrix(c(
1,9310,
2,12490,
3,15670,
4,18850,
5,22030,
6,25210,
7,28390,
8,31570,
9,34750,
10,37930,
11,41110),byrow=T,ncol=2)
##################here is the function I'm trying to create
fpl250=function(data,fpl_matrix,add_limit){ #add_limit is the money you add on for every extra person beyond a household size of 11
data[which(is.na(data$CHILDREN)),'CHILDREN']=99 #This code wasn't liking NAs so I'm coding NA as 99
data$household=data$CHILDREN+data$NUMADULT #calculate household size
for(i in seq(nrow(data))){
if(data$household[i]<=11){data$bcccp_cutoff[i]=2.5*fpl_matrix[data$household[i],2]} #this calculates what the subsidy cutoff should be, which is 250% of the FPL
else{data$bcccp_cutoff[i]=2.5*((data$household[i]-11)*add_limit+fpl_matrix[11,2])}}
data$incom_elig='yes' #setting the default value as 'yes', then changing each record to 'no' if the income is definitely more than the eligibility cutoff
for(i in seq(nrow(data))){
if(data$X_INCOMG[i]=='1' | data$X_INCOMG[i]=='9'){data$incom_elig='yes'} #This is the lowest income category and almost all of these people will qualify
if(data$X_INCOMG[i]=='2' & data$bcccp_cutoff[i]<15000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='3' & data$bcccp_cutoff[i]<25000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='4' & data$bcccp_cutoff[i]<35000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='5' & data$bcccp_cutoff[i]<50000){data$incom_elig[i]='no'}
if(data$household[i]>90){data$incom_elig[i]='no'}
}
return(data)
}
dd=fpl250(sampdf,fpl_2004,3180)
with(dd,table(incom_elig)) #it's coding all except one as eligible
I know this is a lot of code to digest, but I appreciate whatever help you have to offer!
I find it easier to get the logic working well outside of a function first, then wrap it in a function once it is all working well. My code below does this.
I think one issue was that you compared X_INCOMG to strings (data$X_INCOMG[i]=='1'). That field is numeric in your sample code, so the comparison only works through implicit coercion; remove the quotes. Consider using a coded factor for X_INCOMG as well, which will make the code easier to manage later.
There is no need to loop over each row in the data frame.
#put the poverty level data in a data frame for merging
fpl_2004.df<- as.data.frame(fpl_2004)
names(fpl_2004.df)<-c("household","pov.limit")
#Include cutoffs
fpl_2004.df$cutoff = 2.5 * fpl_2004.df$pov.limit
add_limit=3181
#compute household size (if NA's this will skip them)
sampdf$household = numeric(nrow(sampdf))
cc<-which(complete.cases(sampdf))
sampdf$household[cc] = sampdf$NUMADULT[cc] + sampdf$CHILDREN[cc]
#get max household and fill fpl_2004 frame
max.hh<-max(sampdf$household,na.rm=TRUE)
#get the 11 person poverty limit
fpl11=subset(fpl_2004.df,household==11)$pov.limit
#rows to fill out the data frame
append<-data.frame(household=12:max.hh,pov.limit=numeric(max.hh-12+1),
cutoff=2.5 *(((12:max.hh)-11)*add_limit+fpl11))
fpl_2004.df<- rbind(fpl_2004.df,append)
#merge the two data frames
sampdf<- merge(sampdf,fpl_2004.df, by="household",all.x=TRUE)
#Add a logical variable to hold the eligibility
sampdf$elig <- logical(nrow(sampdf))
#compute eligibility
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 1,"elig"] = TRUE
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 9,"elig"] = TRUE
#for clarity define variable of what to subset
lvl2 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 2
lvl2 <- lvl2 & !is.na(sampdf$cutoff) & sampdf$cutoff>=15000
#set the eligibility (note the initial value was false thus cutoff logic reversed)
sampdf[lvl2,"elig"] = TRUE
#continue computing these
lvl3 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 3
lvl3 <- lvl3 & !is.na(sampdf$cutoff) & sampdf$cutoff>=25000
sampdf[lvl3,"elig"] = TRUE
Alternatively, you could load a small data frame with the cutoff comparison values (15000, 25000, 35000, etc.) and the corresponding X_INCOMG, merge by X_INCOMG as I did with the household size, and set all the values in one go like the line below. You may need to use complete.cases again.
sampdf$elig = sampdf$cutoff >= sampdf$comparison.value
You will then have elig == NA (rather than FALSE) for any rows without a comparison value or with incomplete cases, and those will need further investigation.
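A hedged sketch of that alternative, where the lookup values and the handling of income bands 1 and 9 are carried over from the original function:
inc_lookup <- data.frame(X_INCOMG = c(2, 3, 4, 5),
                         comparison.value = c(15000, 25000, 35000, 50000))
sampdf <- merge(sampdf, inc_lookup, by = "X_INCOMG", all.x = TRUE)
sampdf$elig <- sampdf$cutoff >= sampdf$comparison.value
# bands 1 and 9 get no comparison value from the merge, so set them eligible as before
sampdf$elig[sampdf$X_INCOMG %in% c(1, 9)] <- TRUE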