I'm working on a large weighted national dataset that includes the respondent's state as a variable. I want to create subsets containing multiple states (e.g. AZ, IL, NJ). All of the threads I found show how to subset on multiple columns, but is there a way to create subsets from multiple categorical values in the same column? One of my attempts is included below; all attempts result in either an error or a subset with only the Arizona respondents.
AZILNJ <- subset(FFE.PRAMS, STATE=='AZ', 'IL', 'NJ')
I could just delete the unwanted state responses in Excel, but it'd be nice if there were a way to do this in R.
You are looking for STATE %in% c('AZ', 'IL', 'NJ').
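A minimal sketch of that suggestion, using a small made-up data frame in place of FFE.PRAMS (the toy rows and the ID column are assumptions for illustration only):

```r
# Toy stand-in for FFE.PRAMS; the real data set is not shown in the question
FFE.PRAMS <- data.frame(
  ID    = 1:6,
  STATE = c("AZ", "IL", "NJ", "TX", "AZ", "CA"),
  stringsAsFactors = FALSE
)

# Keep every row whose STATE matches any of the listed values
AZILNJ <- subset(FFE.PRAMS, STATE %in% c("AZ", "IL", "NJ"))

# Equivalent row filter with bracket indexing
AZILNJ2 <- FFE.PRAMS[FFE.PRAMS$STATE %in% c("AZ", "IL", "NJ"), ]
```

In the original attempt only STATE=='AZ' acts as the row filter; 'IL' and 'NJ' are passed on as other arguments of subset() rather than as additional states. If STATE is stored as a factor, the unused levels remain after subsetting and can be dropped with droplevels().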
I'm totally new to R and I'm trying to analyze some healthcare data. I have a dataframe containing multiple different types of prices for a given medical procedure, of which there are 6 versions (231-236). The different prices are the Medicare price, chargemaster price, self-pay price, and commercial price. So there are 6x4=24 columns of data containing prices. Each row represents a different hospital. Is there a way to efficiently collect the medians and IQRs for each of the 24 columns and put them into a table?
So far, I'm just using the summary function: summary(cabg$chargemaster_231), but it's very tedious to manually copy and paste the output values into a table. Appreciate any help!
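One possible approach, sketched on made-up data: apply median(), quantile() and IQR() to every price column with sapply() and collect the results in a data frame. Only chargemaster_231 is named in the question; the other column names, the hospital column and all values below are assumptions.

```r
# Toy stand-in for cabg: 3 of the 24 price columns, one row per hospital
set.seed(1)
cabg <- data.frame(
  hospital         = paste0("H", 1:10),
  medicare_231     = runif(10, 20000, 40000),
  chargemaster_231 = runif(10, 80000, 150000),
  self_pay_231     = runif(10, 40000, 90000)
)

# Select the numeric (price) columns
price_cols <- names(cabg)[sapply(cabg, is.numeric)]

# One row per price column: median, quartiles and IQR
price_summary <- data.frame(
  column = price_cols,
  median = sapply(cabg[price_cols], median, na.rm = TRUE),
  q1     = sapply(cabg[price_cols], quantile, probs = 0.25, na.rm = TRUE, names = FALSE),
  q3     = sapply(cabg[price_cols], quantile, probs = 0.75, na.rm = TRUE, names = FALSE),
  iqr    = sapply(cabg[price_cols], IQR, na.rm = TRUE),
  row.names = NULL
)

print(price_summary)
```

The resulting table can be exported with write.csv(price_summary, "price_summary.csv", row.names = FALSE) instead of copying values by hand.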
I am trying to obtain proportions within subsets of a data frame. The inputs are Grade, Fully Paid and Charged Off. I tried using
DF$proportion<-as.vector(unlist(tapply(DF$Grade,paste(DF$Fully Paid ,DF$ Charged Off,sep="."),FUN=function(x){x/sum(x)}))
based on an answer given to the same question in a previous post, "Calculate proportions within subsets of a data frame", but I am not having any luck. I am guessing this is because Grade is a character, not a number, in my data.
Based on your comments, here is the code you should try for each column.
DF$Charged_off_proportion <- as.vector(unlist(tapply(DF$Charged_Off,DF$Grade,FUN=function(x){x/sum(x)})))
Similarly, you can change the column names for the other columns, for example:
DF$Fully_Paid_proportion <- as.vector(unlist(tapply(DF$Fully_Paid,DF$Grade,FUN=function(x){x/sum(x)})))
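One caveat with the tapply() version: unlist(tapply(...)) returns the values grouped by Grade, so the assignment only lines up with the rows if DF is already sorted by Grade. A sketch of an order-preserving alternative with ave(), on made-up data (the values are assumptions, and Charged_Off and Fully_Paid are assumed to be numeric):

```r
# Toy data, deliberately NOT sorted by Grade
DF <- data.frame(
  Grade       = c("B", "A", "B", "A", "C"),
  Charged_Off = c(10, 5, 30, 15, 8),
  Fully_Paid  = c(90, 45, 60, 30, 12),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Grade and returns the results in the original row order
DF$Charged_off_proportion <- ave(DF$Charged_Off, DF$Grade, FUN = function(x) x / sum(x))
DF$Fully_Paid_proportion  <- ave(DF$Fully_Paid,  DF$Grade, FUN = function(x) x / sum(x))

print(DF)
```

Within each grade the proportions sum to 1, and each row keeps the value that belongs to it regardless of how the data frame is ordered.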
I have a data frame consisting of three variables: momentum returns (numeric), volatility (factor) and market states (factor). Volatility and market states each have two levels: volatility has the levels high and low, and market states has the levels positive and negative. I want to make a two-way sorted table containing the mean of momentum returns for every combination.
library(wakefield)
mom<-rnorm(30)
vol<-r_sample_factor(30,x=c("high","low"))
mar<-r_sample_factor(30,x=c("positive","negative"))
df<-data.frame(mom,vol,mar)
Based on the suggestion given by @r2evans, if you want the mean for every sorted case, you can apply the following code.
xtabs(mom~vol+mar,aggregate(mom~vol+mar,data=df,mean))
## If you want a simple sum in every case
xtabs(mom~vol+mar,data=df)
You can also do this with the help of the data.table package, which is typically faster on large data sets.
library(data.table)
df<-as.data.table(df)
## if you want the results as a table with a named column
df[,.(mean_mom=mean(mom)),by=.(vol,mar)]
## if you want the means as a simple vector
df[,mean(mom),by=.(vol,mar)]$V1
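For comparison, a base R sketch that builds the same two-way table of means directly with tapply(), without wakefield or data.table. The factors are generated with rep() here purely so that the example is reproducible; that choice and the seed are assumptions, not part of the original question.

```r
# Reproducible toy data with the same three variables
set.seed(42)
mom <- rnorm(30)
vol <- factor(rep(c("high", "low"), times = 15))
mar <- factor(rep(c("positive", "negative"), each = 15))
df  <- data.frame(mom, vol, mar)

# 2 x 2 table: rows = volatility, columns = market state, cells = mean of mom
mean_table <- with(df, tapply(mom, list(vol = vol, mar = mar), mean))
print(mean_table)

# The same means in long format (one row per combination)
agg <- aggregate(mom ~ vol + mar, data = df, FUN = mean)
print(agg)
```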
I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
In the initial analysis and numerical summary, I noticed significant issues with outlier observations within some of the specific grouping criteria. I used "group by" and looked at the quintile results within those groups. I would like to eliminate the low and high (individual observation) outliers relative to the (group-by based) quintiles to improve the decision tree and clustering analytics. I also want to keep the outliers so that I can analyze them separately for the root cause.
How do I manipulate the dataframe such that the individual observations are compared to the group-based quintile results and the split is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis shared above, you can use ifelse to create a new variable that flags which records are outliers and which are not. This way you keep all of the data in place, but you can use the new variable to separate the outliers whenever you want.
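A minimal sketch of that idea on made-up data. The linked method is not reproduced here, so the rule used below (values outside the 5% and 95% quantiles of their own group) is an assumption, as are the column names group and value; any other group-wise cutoff can be swapped in.

```r
# Toy stand-in for orig_df: one grouping column and one numeric series
set.seed(7)
orig_df <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = c(rnorm(50, 10, 1), rnorm(50, 100, 10))
)

# Group-wise cutoffs, repeated per row so they line up with orig_df
lo <- ave(orig_df$value, orig_df$group, FUN = function(x) quantile(x, 0.05))
hi <- ave(orig_df$value, orig_df$group, FUN = function(x) quantile(x, 0.95))

# Flag each observation relative to the cutoffs of its own group
orig_df$outlier <- ifelse(orig_df$value < lo | orig_df$value > hi, "outlier", "ok")

# Split into the two data frames named in the question
ideal_df   <- orig_df[orig_df$outlier == "ok", ]
outlier_df <- orig_df[orig_df$outlier == "outlier", ]
```

With several grouping columns, they can all be passed to ave() as additional grouping arguments, e.g. ave(x, g1, g2, FUN = ...).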
I am trying to analyse subsets of a genepop file in the R package adegenet, and would like to apply different analyses to different subsets of populations (rows) and loci (columns). The subsets are defined by various exclusion criteria, and I have lists of loci and populations to be excluded under different circumstances.
Grabbing subsets of loci and populations in adegenet follows the standard conventions using square brackets, so the question is really quite general. By importing and transposing a list of loci and populations (simple csv files) and passing them as the subsetting arguments, I have been able to create an object which includes only the variables on those lists.
i.e.
#Selecting top 5 loci based on csv, transpose, then display the names
topfivelocs <- read.csv("top5loci.csv")
top5locs <- t(topfivelocs)
#Selecting top 5 pops based on csv, transpose, then display the names
topfivepops <- read.csv("top5pops.csv")
top5pops <- t(topfivepops)
#Grabbing a subset of the top 5 loci for the top 5 pops, display pop and loci names, and summarise the subset
fiveXfive <- full.dat[pop=top5pops,loc=c(top5locs)]
This successfully creates a genind (dataframe) object containing only the variables on the lists of desired populations and loci. My question is: how do I achieve the negative of this? In other words, how do I create a dataframe/genind object including all EXCEPT the listed variables in my csv file?
EDIT
To ask a more simple question, I have a full list of locations in a csv with one column i.e.
Bristol
Edinburgh
London
Cardiff
Bangor
Liverpool
and then a subset of locations to exclude from analysis in the same format i.e.
Bristol
London
Cardiff
How do I program R to create an object comprising my master list but excluding those listed in the subset list? That seems a very simple instruction, so there must be a way.
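A minimal sketch of the exclusion step using the example lists above. The vectors are typed in directly here; in practice they would come from the csv files, e.g. read.csv("master.csv", header = FALSE)$V1, where the file name and the absence of a header row are assumptions.

```r
# The two one-column lists from the question
master  <- c("Bristol", "Edinburgh", "London", "Cardiff", "Bangor", "Liverpool")
exclude <- c("Bristol", "London", "Cardiff")

# Keep everything in master that is NOT in exclude
keep <- master[!(master %in% exclude)]

# setdiff() gives the same result for lists without duplicates
keep2 <- setdiff(master, exclude)

print(keep)
```

The resulting vector can then presumably be passed to the pop= or loc= argument in the same way as the "top 5" lists; that step is not tested against adegenet here.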