I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
In my initial analysis and numerical summary I noticed significant issues with outlier observations within some of the specific grouping criteria. I found them by using "group by" and looking at the quintile results within those groups. I would like to eliminate the low and high (individual observation) outliers relative to the (group-by based) quintiles to improve the decision tree and clustering analytics. I also want to keep the outliers so I can analyze them separately for the root cause.
How do I manipulate the dataframe so that the individual observations are compared to the group-based quintile results and the resulting split is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis shared above, you can use ifelse to create a new variable that marks which records are outliers and which are not. This way you keep all the data in place, and you can use the new variable to separate the two sets whenever you want.
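A minimal sketch of that idea, under stated assumptions: the data frame below is made up, the column names grp and value are hypothetical stand-ins for one grouping column and one measurement, and the cut-offs low_p and high_p are placeholders to be replaced with whatever group-wise quantile rule is actually chosen.

```r
set.seed(1)
# Made-up stand-in for orig_df: one grouping column, one measurement
orig_df <- data.frame(
  grp   = rep(c("a", "b"), each = 50),
  value = c(rnorm(50, mean = 10), rnorm(50, mean = 100))
)

low_p  <- 0.05  # hypothetical lower cut-off
high_p <- 0.95  # hypothetical upper cut-off

# Group-wise cut-offs, repeated on every row of the group
lo <- ave(orig_df$value, orig_df$grp, FUN = function(x) quantile(x, low_p))
hi <- ave(orig_df$value, orig_df$grp, FUN = function(x) quantile(x, high_p))

# Flag each observation relative to its own group's cut-offs
orig_df$is_outlier <- ifelse(orig_df$value < lo | orig_df$value > hi, "outlier", "ok")

# Split into the two data frames asked for in the question
ideal_df   <- orig_df[orig_df$is_outlier == "ok", ]
outlier_df <- orig_df[orig_df$is_outlier == "outlier", ]
```

With several grouping columns, they can all be passed to ave() as additional grouping arguments, and the same flag-then-split step can be repeated per measurement column.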
Related
I'm working on a large weighted national dataset that includes each respondent's state as a variable. I want to create subsets containing multiple states (e.g. AZ, IL, NJ). All of the threads I found show how to subset on multiple columns, but is there a way to subset on multiple categorical values in the same column? One of my attempts is included below; all attempts result in either an error or a subset with only the Arizona respondents.
AZILNJ <- subset(FFE.PRAMS, STATE=='AZ', 'IL', 'NJ')
I could just delete the unwanted state responses in Excel, but it would be nice if there were a way to do this in R.
You are looking for STATE %in% c('AZ', 'IL', 'NJ').
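Put into a full call, that might look like the sketch below. The small data frame is made up purely for illustration (only the STATE column name comes from the question). In the original attempt, the extra strings 'IL' and 'NJ' are passed to subset() as further positional arguments rather than as additional states, which is why only the 'AZ' test takes effect or an error occurs.

```r
# Made-up stand-in for the survey data; 'score' is a hypothetical column
FFE.PRAMS <- data.frame(
  STATE = c("AZ", "IL", "NJ", "TX", "CA", "AZ"),
  score = c(1, 2, 3, 4, 5, 6)
)

# Keep every row whose STATE matches any of the listed values
AZILNJ <- subset(FFE.PRAMS, STATE %in% c("AZ", "IL", "NJ"))
```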
I have five tables, each with between 20 and 30 variables and 1,500 to 2,300 observations. I want to make a new dataset that is similar in values to the original dataset but not the same. For example, if one variable is gender, I want the new dataset to have gender data, but with different values.
I'm unsure what functions to use or even how to search for some methods. My Google searches are coming up with how to make subsets of data, but I need new datasets with the same number of variables and observations with random values based on the values of the source dataset.
Any advice would be helpful.
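One possible starting point, not taken from the thread, is to resample every column independently with replacement. The sketch below assumes a made-up two-column source table (source_df and its columns are hypothetical). Note that resampling columns independently keeps each variable's own distribution but discards any relationship between variables, which may or may not be acceptable.

```r
set.seed(42)
# Made-up source table
source_df <- data.frame(
  gender = c("F", "M", "F", "F", "M"),
  age    = c(34, 51, 27, 45, 38)
)

# Resample each column on its own, with replacement, keeping the row count
synthetic_df <- as.data.frame(
  lapply(source_df, function(col) sample(col, size = length(col), replace = TRUE))
)
```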
I am trying to obtain proportions within subsets of a data frame. The inputs are Grade, Fully Paid and Charged Off. I tried using
DF$proportion<-as.vector(unlist(tapply(DF$Grade,paste(DF$Fully Paid ,DF$ Charged Off,sep="."),FUN=function(x){x/sum(x)}))
This is based on an answer given to the same question in a previous post, "Calculate proportions within subsets of a data frame", but I am not having any luck. I am guessing this is because Grade is a character, not a number, in my data.
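Based on your comments, here is the code you should try for each column. It uses ave() rather than unlist(tapply(...)), because ave() returns the values in the original row order, while unlist(tapply(...)) returns them grouped by Grade and only lines up with the rows if DF is already sorted by Grade.
DF$Charged_off_proportion <- ave(DF$Charged_Off, DF$Grade, FUN = function(x) x / sum(x))
Similarly, you can change the column names for the other columns, for example:
DF$Fully_Paid_proportion <- ave(DF$Fully_Paid, DF$Grade, FUN = function(x) x / sum(x))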
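A small worked sketch with made-up data may help check the result. It uses ave(), which applies the function within each Grade and returns the values in the original row order; the toy values are assumptions, and the rows are deliberately not sorted by Grade.

```r
# Made-up data; rows are intentionally not sorted by Grade
DF <- data.frame(
  Grade       = c("B", "A", "B", "A", "C"),
  Charged_Off = c(1, 3, 3, 1, 2)
)

# Share of each row's Charged_Off within its own Grade
DF$Charged_off_proportion <- ave(DF$Charged_Off, DF$Grade,
                                 FUN = function(x) x / sum(x))
```

Within each Grade the new column sums to 1, which is a quick sanity check on real data as well.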
I now have seven dataframes with the same structure: 41 rows and 12 columns each. I want to choose one row from each dataframe to build a new dataframe. Theoretically that gives 41^7 new dataframes, and I would like to run a regression on each of them. My goal is to get a range for the coefficient of my most important independent variable, but the crucial step right now is generating these 41^7 new dataframes. What should I do?
I have done a panel regression with 30 individuals and 12 half-year periods. My most important variable is partly missing, so I calculated the missing values. Now I want to vary the method of calculation to get an interval for the coefficient. This is where I got stuck.
FIT <- t(cbind(FITgan[1,],FITxin[1,],FITqing[1,],FITnei[1,],FIThe[1,],FITshan[1,],FITshann[1,]))
This is my attempt for one combination. The total number of combinations is 41^7 (7 dataframes, each with 41 rows). I would like to get a list with all these results.
I am not sure this question belongs here; it sounds more like a topic for Cross Validated. As you probably should not run a regression on 41^7 dataframes, I would suggest a bootstrapping approach:
1. Sample from your true observations WITH replacement 41 times.
2. Run your regression.
3. Save the coefficients of interest.
4. Repeat steps 1-3 maybe 10,000 times.
5. Estimate a simulated distribution from your coefficients of interest.
This will lead you to a better understanding of the underlying uncertainty of your parameters.
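The steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the data are simulated, the variable names x and y are hypothetical, and a plain lm() stands in for the panel regression in the question.

```r
set.seed(123)
# Simulated data standing in for the 41 true observations
obs <- data.frame(x = rnorm(41))
obs$y <- 2 * obs$x + rnorm(41)

n_boot <- 1000  # the answer suggests maybe 10,000; fewer here to keep the sketch fast

boot_coef <- replicate(n_boot, {
  idx <- sample(nrow(obs), size = nrow(obs), replace = TRUE)  # resample rows with replacement
  fit <- lm(y ~ x, data = obs[idx, ])                         # run the regression
  coef(fit)[["x"]]                                            # save the coefficient of interest
})

# Summarise the simulated distribution, e.g. with a 95% percentile interval
boot_interval <- quantile(boot_coef, c(0.025, 0.975))
```

The percentile interval is only one way to summarise boot_coef; a histogram or standard deviation of the bootstrapped coefficients conveys the same uncertainty.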
How can I code in R to duplicate cluster analyses done in SAS that used method=Ward and the TRIM=10 option to automatically delete 10% of the cases as outliers? (This dataset has 45 variables, each with some outlier responses.)
When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.
If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.
Thanks!
You haven't provided any information on how you want to identify outliers. Assuming the simplest case of removing the top and bottom 5% of cases for every variable (i.e. on a variable-by-variable basis), you could do this with the quantile function.
Illustrating using the example from the link above, you could do something like:
duration = faithful$eruptions
duration[duration <= quantile(duration,0.95) & duration > quantile(duration,0.05)]
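Extending the same rule to a whole data frame and then clustering could look like the sketch below. This is not a reproduction of the SAS TRIM= option; it only applies the per-variable quantile rule from above. The data are simulated, and the choices of scale(), method = "ward.D2", and k = 3 are assumptions made for illustration.

```r
set.seed(7)
# Simulated data: 200 cases, 3 variables
dat <- as.data.frame(matrix(rnorm(200 * 3), ncol = 3))

# TRUE where a value lies inside (5%, 95%] of its own column
inside <- sapply(dat, function(v) v > quantile(v, 0.05) & v <= quantile(v, 0.95))

# Keep only the cases that are inside the range on every variable
keep    <- rowSums(inside) == ncol(dat)
trimmed <- dat[keep, ]

# Ward clustering on the trimmed, standardised cases
hc       <- hclust(dist(scale(trimmed)), method = "ward.D2")
clusters <- cutree(hc, k = 3)
```

Because a case is dropped if it is extreme on any one variable, this removes more than 10% of the cases once there are several variables; with 45 variables the cut-offs would likely need to be much less strict.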