Extracting matching rows from data frame - r

I have a data frame with 30+ columns. I want to extract the rows where three specific columns match some reference values. For example, col A has state names, col B has site types, and col C has the number of annual visitors. I want to find the number of visitors (col C) going to the capital (col B) of New Jersey (col A).

How about
subset(my_df, A=="New Jersey" & B=="capital")$C
or
with(my_df, my_df[A=="New Jersey" & B=="capital", "C"])
You should probably check out some introductory R material, e.g. http://www.ats.ucla.edu/stat/r/faq/subset_R.htm or http://digitheadslabnotebook.blogspot.ca/2009/07/select-operations-on-r-data-frames.html (results of googling "selecting rows from a data frame").

This is pretty easy with a subset command.
subset(data, A=="New Jersey" & B=="capital", select=C)
Or with standard indexing
data$C[ data$A=="New Jersey" & data$B=="capital" ]
I strongly recommend reading a basic introduction to R because this is pretty elementary stuff.
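For completeness, the same selection can be written with dplyr; a minimal sketch, assuming the example column names from the question (the data values here are made up):

```r
library(dplyr)

# toy data standing in for the 30+ column data frame
my_df <- data.frame(A = c("New Jersey", "New York"),
                    B = c("capital", "capital"),
                    C = c(150000, 220000))

# filter() keeps rows matching the conditions; pull() extracts a single column
visitors <- my_df %>%
  filter(A == "New Jersey", B == "capital") %>%
  pull(C)
visitors
# [1] 150000
```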

Related

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample IDs (~7,000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A through -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't know exactly what your data looks like, so I made mine based on your description. I'm sure you can modify this answer based on your needs and your dataset's structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=1:3,df)
df<-as.data.frame(df)
#you can just read in your main df to R with the line below; fread() keeps the dashes in column names from being turned into periods (you need the data.table package installed and loaded)
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
The last step is to add the Gender defined above as a row to df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
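A regex-based alternative to the substr() trimming above is sub(); a sketch, assuming every sample suffix is a dash plus a single letter, which also copes with patient numbers of any width:

```r
cols <- c("BIOPSY1-A", "BIOPSY12-B", "BIOPSY200-Z")
# strip a trailing "-<letter>" suffix to recover the patient ID
ids <- sub("-[A-Za-z]$", "", cols)
ids
# [1] "BIOPSY1"   "BIOPSY12"  "BIOPSY200"
```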

Subset around values that follow a pattern

Hopefully, this is a fairly straightforward question. I am using R to help subset some data that I am working with. Below is a print() of some of the data that I am currently working with. I am trying to create a subset() of the data based on JobCode. As you can see, JobCode follows a pattern (00-0000) where the first 2 numbers are the same for a specific industry.
ID State StateName JobCode
1 AL Alabama 51-9199
2 AL Alabama 27-3011
4 AL Alabama 49-9043
5 AL Alabama 49-2097
My current attempt is test <- subset(data, data$State == "AL" & data$JobCode == ("15-####")) (where # is a placeholder for the remaining 4 digits) to subset for JobCode beginning with "15-". Is there any way to tell subset() to look for those remaining 4 values?
I'm sorry for the poor formatting as I am new to StackOverflow and I am also quite inexperienced with R. Thank you for your help.
There are no wildcard characters in string equality. You need to use a function. You could use substr() to extract the first three characters:
test <- subset(data, State == "AL" & substr(JobCode,1,3) == ("15-"))
Also note that you don't need to use data$ inside subset(). Variables are evaluated in the context of the data frame for that function.
You can use the %like% operator from the data.table library (anchor the pattern with ^ so it only matches codes that start with "15-", rather than containing it anywhere):
library(data.table)
setDT(df)
df[ State == "AL" & JobCode %like% "^15-" ]
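In base R, grepl() with an anchored pattern does the same job without extra packages; a sketch using the column names from the question (the JobCode values here are invented):

```r
data <- data.frame(State   = c("AL", "AL", "AL"),
                   JobCode = c("15-1132", "51-9199", "15-2041"),
                   stringsAsFactors = FALSE)

# "^15-" anchors the match to the start of the string, so codes that
# merely contain "15-" somewhere in the middle are not picked up
test <- subset(data, State == "AL" & grepl("^15-", JobCode))
test$JobCode
# [1] "15-1132" "15-2041"
```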

subset data frame based on character value

I'm trying to subset a data frame that I imported with read.table using the colClasses='character' option.
A small sample of the data can be found here
Full99<-read.csv("File.csv",header=TRUE,colClasses='character')
After removing duplicates, missing values, and all unnecessary columns, I get a data frame of these dimensions:
>dim(NoMissNoDup99)
[1] 81551 6
I'm interested in reducing the data to only include observations of a specific Service.Type
I've tried with the subset function:
MU99<-subset(NoMissNoDup99,Service.Type=='Apartment'|
Service.Type=='Duplex'|
Service.Type=='Triplex'|
Service.Type=='Fourplex',
select=Service.Type:X.13)
dim(MU99)
[1] 0 6
MU99<-NoMissNoDup99[which(NoMissNoDup99$Service.Type!='Hospital'
& NoMissNoDup99$Service.Type!= 'Hotel or Motel'
& NoMissNoDup99$Service.Type!= 'Industry'
& NoMissNoDup99$Service.Type!= 'Micellaneous'
& NoMissNoDup99$Service.Type!= 'Parks & Municipals'
& NoMissNoDup99$Service.Type!= 'Restaurant'
& NoMissNoDup99$Service.Type!= 'School or Church or Charity'
& NoMissNoDup99$Service.Type!='Single Residence'),]
but that doesn't remove observations.
I've tried that same method but slightly tweaked...
MU99<-NoMissNoDup99[which(NoMissNoDup99$Service.Type=='Apartment'
|NoMissNoDup99$Service.Type=='Duplex'
|NoMissNoDup99$Service.Type=='Triplex'
|NoMissNoDup99$Service.Type=='Fourplex'), ]
but that removes every observation...
The final subset should have somewhere around 8000 observations
I'm pretty new to R and Stack Overflow, so I apologize if there's some convention of posting I've neglected to follow, but if anyone has a magic bullet to get this data to cooperate, I'd love your insights :)
The different methods should work if you use the right variable values. Your issue is likely extra spaces in your values.
You can avoid this kind of issue by using grep, for example:
NoMissNoDup99[grep("Apartment|Duplex|Business",NoMissNoDup99$Service.Type),]
## exclude
MU99<-subset(NoMissNoDup99,!(Service.Type %in% c('Hospital','Hotel or Motel')))
##include
MU99<-subset(NoMissNoDup99,Service.Type %in% c('Apartment','Duplex'))
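If stray whitespace in the values is indeed the culprit, trimws() can normalize the column before any of the comparisons above are run; a minimal sketch with invented values:

```r
x <- c(" Apartment", "Duplex ", "Hospital")
# strip leading/trailing whitespace so equality and %in% tests behave
keep <- trimws(x) %in% c("Apartment", "Duplex")
keep
# [1]  TRUE  TRUE FALSE
```

Applied to the question's data, that would be NoMissNoDup99$Service.Type <- trimws(NoMissNoDup99$Service.Type) before subsetting.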

Logical in my R function always returns "TRUE"

I'm trying to write an R function that calculates whether a data subject is eligible for subsidies based on their income (X_INCOMG), the size of their household (household calculated from CHILDREN and NUMADULT), and the federal poverty limit for their household size (fpl_matrix). I use a number of if statements to evaluate whether the record is eligible, but for some reason my code is labeling everyone as eligible, even though I know that's not true. Could someone else take a look at my code?
Note that the coding for the variable X_INCOMG denotes income categories (less than $15,000, $25,000-$35,000, etc.).
#Create a sample data set
sampdf=data.frame(NUMADULT=sample(3,1000,replace=T),CHILDREN=sample(0:5,1000,replace=T),X_INCOMG=sample(5,1000,replace=T))
#Introducing some "impurities" into the data so its more realistic
sampdf[sample(1000,3),'CHILDREN']=13
sampdf[sample(1000,3),'CHILDREN']=NA
sampdf[sample(1000,3),'X_INCOMG']=9
#this is just a matrix of the federal poverty limit, which is based on household size
fpl_2004=matrix(c(
1,9310,
2,12490,
3,15670,
4,18850,
5,22030,
6,25210,
7,28390,
8,31570,
9,34750,
10,37930,
11,41110),byrow=T,ncol=2)
##################here is the function I'm trying to create
fpl250=function(data,fpl_matrix,add_limit){ #add_limit is the money you add on for every extra person beyond a household size of 11
data[which(is.na(data$CHILDREN)),'CHILDREN']=99 #This code wasn't liking NAs so I'm coding NA as 99
data$household=data$CHILDREN+data$NUMADULT #calculate household size
for(i in seq(nrow(data))){
if(data$household[i]<=11){data$bcccp_cutoff[i]=2.5*fpl_matrix[data$household[i],2]} #this calculates what the subsidy cutoff should be, which is 250% of the FPL
else{data$bcccp_cutoff[i]=2.5*((data$household[i]-11)*add_limit+fpl_matrix[11,2])}}
data$incom_elig='yes' #setting the default value as 'yes', then changing each record to 'no' if the income is definitely more than the eligibility cutoff
for(i in seq(nrow(data))){
if(data$X_INCOMG[i]=='1' | data$X_INCOMG[i]=='9'){data$incom_elig='yes'} #This is the lowest income category and almost all of these people will qualify
if(data$X_INCOMG[i]=='2' & data$bcccp_cutoff[i]<15000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='3' & data$bcccp_cutoff[i]<25000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='4' & data$bcccp_cutoff[i]<35000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='5' & data$bcccp_cutoff[i]<50000){data$incom_elig[i]='no'}
if(data$household[i]>90){data$incom_elig[i]='no'}
}
return(data)
}
dd=fpl250(sampdf,fpl_2004,3180)
with(dd,table(incom_elig)) #it's coding all except one as eligible
I know this is a lot of code to digest, but I appreciate whatever help you have to offer!
I find it easier to get the logic working well outside of a function first, then wrap it in a function once it is all working well. My code below does this.
I think one issue was that you had the literal comparisons to X_INCOMG as strings (data$X_INCOMG[i]=='1'). That field is numeric in your sample code, so remove the quotes. Consider using a coded factor for X_INCOMG as well; this will make your code easier to manage later.
There is no need to loop over each row in the data frame.
#put the poverty level data in a data frame for merging
fpl_2004.df<- as.data.frame(fpl_2004)
names(fpl_2004.df)<-c("household","pov.limit")
#Include cutoffs
fpl_2004.df$cutoff = 2.5 * fpl_2004.df$pov.limit
add_limit=3181
#compute household size (if NA's this will skip them)
sampdf$household = numeric(nrow(sampdf))
cc<-which(complete.cases(sampdf))
sampdf$household[cc] = sampdf$NUMADULT[cc] + sampdf$CHILDREN[cc]
#get max household and fill fpl_2004 frame
max.hh<-max(sampdf$household,na.rm=TRUE)
#get the 11 person poverty limit
fpl11=subset(fpl_2004.df,household==11)$pov.limit
#rows to fill out the data frame
append<-data.frame(household=12:max.hh,pov.limit=numeric(max.hh-12+1),
cutoff=2.5 *(((12:max.hh)-11)*add_limit+fpl11))
fpl_2004.df<- rbind(fpl_2004.df,append)
#merge the two data frames
sampdf<- merge(sampdf,fpl_2004.df, by="household",all.x=TRUE)
#Add a logical variable to hold the eligibility
sampdf$elig <- logical(nrow(sampdf))
#compute eligibility
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 1,"elig"] = TRUE
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 9,"elig"] = TRUE
#for clarity define variable of what to subset
lvl2 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 2
lvl2 <- lvl2 & !is.na(sampdf$cutoff) & sampdf$cutoff>=15000
#set the eligibility (note the initial value was false thus cutoff logic reversed)
sampdf[lvl2,"elig"] = TRUE
#continue computing these
lvl3 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 3
lvl3 <- lvl3 & !is.na(sampdf$cutoff) & sampdf$cutoff>=25000
sampdf[lvl3,"elig"] = TRUE
Alternatively, you could load a small data frame with the cutoff comparison values (15000; 25000; 35000; etc.) and the X_INCOMG. Then merge by X_INCOMG, as I did with the household size, and set all the values in one go like the below. You may need to use complete.cases again.
sampdf$elig = sampdf$cutoff >= sampdf$comparison.value
You will then have elig == FALSE for any incomplete cases, which will need further investigation.
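A minimal sketch of that merge-based alternative, using hypothetical comparison values for each income category (the real category bounds and cutoffs would come from your data):

```r
# hypothetical lookup: income category -> upper bound of the income bracket
income_lookup <- data.frame(X_INCOMG = c(2, 3, 4, 5),
                            comparison.value = c(15000, 25000, 35000, 50000))

# toy records with a precomputed 250%-of-FPL cutoff column
hh <- data.frame(X_INCOMG = c(2, 3, 5),
                 cutoff   = c(23275, 23275, 47125))
hh <- merge(hh, income_lookup, by = "X_INCOMG", all.x = TRUE)

# eligible when the cutoff is at least the bracket's upper bound
hh$elig <- hh$cutoff >= hh$comparison.value
hh$elig
# [1]  TRUE FALSE FALSE
```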

Reliability tests for classic content analysis (multiple categorial codes per item)

In classic content analysis (or qualitative content analysis), as typically done with Atlas.TI or Nvivo type tools (sometimes called QACDAS tools), you typically face the situation of having multiple raters rate many objects with many codes, so there are multiple codes that each rater might apply to each object. I think this is what the excellent John Ubersax page on agreement statistics calls "Two Raters, Polytomous Ratings".
For example you might have two raters read articles and code them with some group of topic codes from a coding scheme (e.g., diy, shelving, circular saw), and you are asking how well the coders agree on applying the codes.
What I'd like is to use the irr package functions, agree and kappa2, in these situations. Yet their documentation didn't help me figure out how to proceed, since they expect input in the form of an "n*m matrix or dataframe, n subjects m raters", which implies that there is a single rating per rater, per object.
Given two raters using (up to) three codes to code two articles, with input data that looks like this (two diy articles, the second with some topic tags):
article,rater,code
article1,rater1,diy
article1,rater2,diy
article2,rater1,diy
article2,rater2,diy
article2,rater1,circular-saw
article2,rater1,shelving
article2,rater2,shelving
I'd like to get:
Overall percentage agreement.
Percentage agreement for each code.
Contingency table for each code.
Ideally, I'd also like to get Positive agreement (how often do the raters agree that a code should be present?) and Negative Agreement (how often do the raters agree that a code should not be present). See discussion of these at http://www.john-uebersax.com/stat/raw.htm#binspe
I'm pretty sure that this involves breaking the input data.frame up and processing it code by code, using something like dplyr, but I wondered if others have tackled this problem.
(The kappa functions take the same input, so let's keep this simple by using the agree function from the irr package; besides, positive and negative agreement only really make sense with percentage agreement.)
Looking at the meta.stackexchange threads on answering one's own question, it seems that is an acceptable thing to do. Makes sense, good place to store stuff for others to find :)
I solved most of this with the following code:
library(plyr); library(dplyr); library(reshape2); library(irr)
# The irr package expects input in the form of n x m (objects in rows, raters in columns)
# for multiple coders per coded item that is really confusing. Here we have 10 articles (to be coded) and
# many codes. So each rater rates each combinations of articles and codes as present (or not).
# Basically you send only the ratings columns to agree and kappa2. You can send them all at
# once for overall agreement, or send only those for each code for code-by-code agreement.
# letter,code,rater
# letter1,code1,rater1
# letter1,code2,rater1
# letter2,code3,rater2
coding <- read.csv("CombinedCoding.csv")
# Now want:
# letter, code, rater1, rater2
# where 0 = no (this code wasn't used), 1 = yes (this code was used)
# dcast can do this, collapsing across a group. In this case we're not really
# grouping, so if the code was not present length gives a 0, if it was length
# gives a 1.
# This excludes all the times where we agreed that both codes weren't present.
ccoding <- dcast(coding, letter + code ~ rater, length)
# create data.frame from combination of letters and codes
# this handles the negative agreement parts.
codelist <- unique(coding$code)
letterlist <- unique(coding$letter)
coding_with_negatives <- merge(codelist, letterlist) # Gets the Cartesian product of these.
names(coding_with_negatives) <- c("code", "letter") # align the names
# merge this with the coding, produces NA for rows that don't exist in ccoding
coding_with_negatives <- merge(coding_with_negatives,ccoding,by=c("letter","code"), all.x=T)
# replace NAs with zeros.
coding_with_negatives[is.na(coding_with_negatives)] <- 0
# Now want agreement per code.
# need a function that returns a df
# this function gets given the split data frame (ie this happens once per code)
getagree <- function(df) {
# for positive agreement remove the cases where we both coded it negative
positive_df <- filter(df, (rater1 == 1 & rater2 == 1) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
# for negative agreement remove the cases where we both coded it positive
negative_df <- filter(df, (rater1 == 0 & rater2 == 0) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
data.frame( positive_agree = round(agree(positive_df[,3:4])$value,2) # Run agree on the raters columns, get the $value, and round it.
, negative_agree = round(agree(negative_df[,3:4])$value,2)
, agree = round(agree(df[,3:4])$value,2)
, used_in_articles = nrow(positive_df) # gives some idea of the prevalence.
)
}
# split the df up by code, run getagree on the sections
# recombine into a data frame.
results <- ddply(coding_with_negatives, .(code), getagree)
The confusion matrices can be gotten with:
print(table(coding_with_negatives[,3],coding_with_negatives[,4],dnn=c("rater1","rater2")))
I haven't done it, but I think I could do that per code inside the function, using print to push them into a text file.
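One way to get a contingency table per code is to split the data frame by code and print one rater1 x rater2 table per group; a sketch assuming the coding_with_negatives layout above, with the rater columns in positions 3 and 4 (the data here is invented):

```r
coding_with_negatives <- data.frame(
  letter = c("l1", "l1", "l2", "l2"),
  code   = c("diy", "shelving", "diy", "shelving"),
  rater1 = c(1, 1, 1, 0),
  rater2 = c(1, 0, 1, 0))

# split() gives one sub-data-frame per code; sink() could redirect
# these prints to a text file if desired
for (cd in split(coding_with_negatives, coding_with_negatives$code)) {
  cat("Code:", unique(cd$code), "\n")
  print(table(cd[, 3], cd[, 4], dnn = c("rater1", "rater2")))
}
```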
