I'm working on a paper on electoral politics and tried using this dataset to calculate the share of the electorate that each religion,so I created an if() function and a Christian variable and tried to increase the number of Christians by one whenever a Christian name pops up, but was unable to do so. Would appreciate it if you could help me with this
library(dplyr)
library(ggplot2)
Christian=0
if(Sample...Sheet1$V2=="James"){
Christian=Christian+1
}
PS
The Output
Warning message:
In if (Sample...Sheet1$V2 == "James") { :
the condition has length > 1 and only the first element will be used
Notwithstanding my comment about the fundamental non-validity of this approach, here’s how you would solve this general problem in R:
Generate a lookup table of the different names and categories — this table is independent of your input data:
religion_lookup = tribble(
~ Name, ~ Religion,
'James', 'Christian',
'Christopher', 'Christian',
'Ahmet', 'Muslim',
'Mohammed', 'Muslim',
'Miriam', 'Jewish',
'Tarjinder', 'Sikh'
)
match your input data against the lookup table (I’m using an input table data with a column Name instead of your Sample...Sheet1$V2):
matched = match(data$Name, religion_lookup$Name)
religion = religion_lookup$Religion[matched]
Count the results:
table(religion)
religion
Christian Jewish Muslim Sikh
2 5 3 1
Note the lack of ifs and loops in the above.
Christian <- sum( Sample...Sheet1$V2=="James" )
There goes, don't need the if block.
I am currently working on the so-called "Moneyball" problem. I am basically trying to select the best combination of three baseball players (based on certain baseball-relevant statistics) for the least amount of money.
I have the following dataset (OBP, SLG, and AB are statistics that describe the performance of a player):
# the table has about 100 observations;
# the data frame is called "batting.2001"
playerID OBP SLG AB salary
giambja01 0.3569001 0.6096154 20 410333
heltoto01 0.4316547 0.4948382 57 4950000
berkmla01 0.2102326 0.6204506 277 305000
gonzalu01 0.4285714 0.3880131 409 9200000
martied01 0.4234079 0.5425532 100 5500000
My goal is to pick three players who in combination have the highest possible sum of OBP, SLG, and AB, but at the same time do not exceed a total salary of 15.000.000 dollar.
My approach so far has been rather simple... I just tried to arrange (in descending order) the columns OBP, SLG, and AB and simply picking the three players on the top that in combination do not exceed the salary restriction of 15 Million dollar:
batting.2001 %>%
arrange(desc(OPB), desc(SLG), desc(AB))
Can anyone of you think of a better solution? Also, what if I would like to get the best combination of three players for the least amount of money? What approach would you use in that scenario?
Thanks in advance, and looking forward to reading your solutions.
I would like to an output that shows the column names that has rows containing a string value. Assume the following...
Animals Sex
I like Dogs Male
I like Cats Male
I like Dogs Female
I like Dogs Female
Data Missing Male
Data Missing Male
I found an SO tread here, David Arenburg provided answer which works very well but I was wondering if it is possible to get an output that doesn't show all the rows. So If I want to find a string "Data Missing" the output I would like to see is...
Animals
Data Missing
or
Animal
TRUE
instead of
Anmials Sex
Data Missing Male
Data Missing Male
I have also found using filters such as df$columnName works but I have big file and a number of large quantity of column names, typing column names would be tedious. Assume string "Data Missing" is also in other columns and there could be different type of strings. So that is why I like David Arenburg's answer, so bear in mind I don't have two columns, as sample given above.
Cheers
One thing you could do is grep for "Data Missing" like this:
x <- apply(data, 2, grep, pattern = "Data Missing")
lapply(x, length) > 1
This will give you the:
Animal
TRUE
result you're after. It's also good because it checks all columns, which you mentioned was something you wanted.
If we want only the first row where it matches, use match
data[match("Data Missing", data$Animals), "Animals", drop = FALSE]
# Animals
#5 Data Missing
I've got a set of data, olympic_height.txt, where each row corresponds to a person. There are 3 columns that tell you their height, their gender and what sport they play respectively. How do I obtain a subset of the data that only contains people that are male and play basketball for example?
I tried this
MBP=read.table(file="olympic_height.txt", header=T)
MBP$sex=="M"
MBP$sport=="Basketball"
t=MBP
boxplot(t)
My goal is to have one boxplot of heights of male basketball players and one of heights of male football players. When I try this, I end up with 2 identical boxplots and I'm certain they should be very different. What am I doing wrong?
I'm trying to write an R function that calculates whether a data subject is eligible for subsidies based on their income (X_INCOMG), the size of their household (household calculated from CHILDREN and NUMADULT), and the federal poverty limit for their household size (fpl_matrix). I use a number of if statements to evaluate whether the record is eligible, but for some reason my code is labeling everyone as eligible, even though I know that's not true. Could someone else take a look at my code?
Note that the coding for the variable X_INCOMG denotes income categories (less than $15000, 25-35000, etc).
#Create a sample data set
sampdf=data.frame(NUMADULT=sample(3,1000,replace=T),CHILDREN=sample(0:5,1000,replace=T),X_INCOMG=sample(5,1000,replace=T))
#Introducing some "impurities" into the data so its more realistic
sampdf[sample(1000,3),'CHILDREN']=13
sampdf[sample(1000,3),'CHILDREN']=NA
sampdf[sample(1000,3),'X_INCOMG']=9
#this is just a matrix of the federal poverty limit, which is based on household size
fpl_2004=matrix(c(
1,9310,
2,12490,
3,15670,
4,18850,
5,22030,
6,25210,
7,28390,
8,31570,
9,34750,
10,37930,
11,41110),byrow=T,ncol=2)
##################here is the function I'm trying to create
fpl250=function(data,fpl_matrix,add_limit){ #add_limit is the money you add on for every extra person beyond a household size of 11
data[which(is.na(data$CHILDREN)),'CHILDREN']=99 #This code wasn't liking NAs so I'm coding NA as 99
data$household=data$CHILDREN+data$NUMADULT #calculate household size
for(i in seq(nrow(data))){
if(data$household[i]<=11){data$bcccp_cutoff[i]=2.5*fpl_matrix[data$household[i],2]} #this calculates what the subsidy cutoff should be, which is 250% of the FPL
else{data$bcccp_cutoff[i]=2.5*((data$household[i]-11)*add_limit+fpl_matrix[11,2])}}
data$incom_elig='yes' #setting the default value as 'yes', then changing each record to 'no' if the income is definitely more than the eligibility cutoff
for(i in seq(nrow(data))){
if(data$X_INCOMG[i]=='1' | data$X_INCOMG[i]=='9'){data$incom_elig='yes'} #This is the lowest income category and almost all of these people will qualify
if(data$X_INCOMG[i]=='2' & data$bcccp_cutoff[i]<15000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='3' & data$bcccp_cutoff[i]<25000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='4' & data$bcccp_cutoff[i]<35000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='5' & data$bcccp_cutoff[i]<50000){data$incom_elig[i]='no'}
if(data$household[i]>90){data$incom_elig[i]='no'}
}
return(data)
}
dd=fpl250(sampl,fpl_2004,3180)
with(dd,table(incom_elig)) #it's coding all except one as eligible
I know this is a lot of code to digest, but I appreciate whatever help you have to offer!
I find it easier to get the logic working well outside of a function first, then wrap it in a function once it is all working well. My code below does this.
I think one issue was you had the literal comparisons to X_INCOMG as strings (data$X_INCOMG[i]=='1'). That field is a numeric in your sample code, so remove the quotes. Try using a coded factor for X_INCOMG as well. This will make your code easier to manage later.
There is no need to loop over each row in the data frame.
#put the poverty level data in a data frame for merging
fpl_2004.df<- as.data.frame(fpl_2004)
names(fpl_2004.df)<-c("household","pov.limit")
#Include cutoffs
fpl_2004.df$cutoff = 2.5 * fpl_2004.df$pov.limit
add_limit=3181
#compute household size (if NA's this will skip them)
sampdf$household = numeric(nrow(sampdf))
cc<-which(complete.cases(sampdf))
sampdf$household[cc] = sampdf$NUMADULT[cc] + sampdf$CHILDREN[cc]
#get max household and fill fpl_2004 frame
max.hh<-max(sampdf$household,na.rm=TRUE)
#get the 11 person poverty limit
fpl11=subset(fpl_2004.df,household==11)$pov.limit
#rows to fill out the data frame
append<-data.frame(household=12:max.hh,pov.limit=numeric(max.hh-12+1),
cutoff=2.5 *(((12:max.hh)-11)*add_limit+fpl11))
fpl_2004.df<- rbind(fpl_2004.df,append)
#merge the two data frames
sampdf<- merge(sampdf,fpl_2004.df, by="household",all.x=TRUE)
#Add a logical variable to hold the eligibility
sampdf$elig <- logical(nrow(sampdf))
#compute eligibility
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 1,"elig"] = TRUE
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 9,"elig"] = TRUE
#for clarity define variable of what to subset
lvl2 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 2
lvl2 <- lvl2 & !is.na(sampdf$cutoff) & sampdf$cutoff>=15000
#set the eligibility (note the initial value was false thus cutoff logic reversed)
sampdf[lvl2,"elig"] = TRUE
#continue computing these
lvl3 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 3
lvl3 <- lvl3 & !is.na(sampdf$cutoff) & sampdf$cutoff>=25000
sampdf[lvl3,"elig"] = TRUE
Alternately you could load in a small data frame with the cutoff comparison values (15000; 25000; 35000 etc) and the X_INCOMG. Then merge by X_INCOMG, as I did with the household size, and set all the values in one go like this the below. You may need to use complete.cases again.
sampdf$elig = sampdf$cutoff >= sampdf$comparison.value
You will then have elig == FALSE for any incomplete cases, which will need further investigation.