Find if a specific choice is in a Data Frame R - r

I have a Data Frame object which contains a list of possible choices. For example, an analogy of this would be:
FirstName, SurName, Subject, Grade
Brian, Smith, History, 75
Jenny, Jackson, English, 60
How would I...
1) Check to see if a certain pupil-subject combination is in my Data Frame
2) And for those who are there, extract their grade (And potentially other relevant fields)
?
Thanks so much
The only solutions I've found so far include appending the values onto the end of the Data Frame and trying to see if it is unique or not? This seems a crude and ridiculous hack?

learn data subset (extraction) using base R.
To subset any data frame by its rows and column you use [ ]
Let df be your data frame.
FirstName SurName Subject Grade
1 Brian Smith History 75
2 Jenny Jackson English 60
3 Tom Brandon Physics 50
You can subset it by its rows and columns using
df[rows,columns]
Here rows and column can be :
1) Index (Number/Name)
Which means subset that give me that particular row and column like
df[2,3]
this will return second row and third column
[1] English
or
df[2,"Grade"]
returns
[1] 60
2) Range (Indices/List of Names)
Which means subset that give me these rows and columns like
df[1:2,2,drop=F]
Here drop=F to avoid flattening of result and output like a data.frame. It will give you this
SurName
1 Smith
2 Jackson
Range also supports all by leaving either rows or columns empty like
df[,3,drop=F]
this will return all rows for third column
Subject
1 History
2 English
3 Physics
or
df[1:2,c("Grade","Subject")]
Grade Subject
1 75 History
2 60 English
3) Logical
Which means you want to subset using a logical condition.
df[df$FirstName=="Brian",]
meaning give me rows where FirstName is Brian and all columns for it.
FirstName SurName Subject Grade
1 Brian Smith History 75
or
df[df$FirstName=="Brian",1:3]
give me rows where FirstName is Brian and give me only 1 to 3 columns.
or create complex logicals
df[df$FirstName=="Brian" & df$SurName==" Smith",1:3]
output
FirstName SurName Subject
1 Brian Smith History
or complex logical and extract column by name
df[df$FirstName=="Brian" & df$SurName==" Smith","Grade",drop=F]
Grade
1 75
or complex logical and extract multiple columns by name
df[df$FirstName=="Brian" & df$SurName==" Smith",c("Grade","Subject")]
Grade Subject
1 75 History
to use this in a function do
myfunc<-function(input_var1,input_var2,input_var3)
{
df[df$FirstName==input_var1 & df$SurName==input_var2 & df$Subject==input_var3,"Grade",drop=F]
}
run it like this
myfunc("Tom","Brandon","Physics")

I think you are looking for this:
result <- data[data$FirstName == "Brian" & data$Subject == "History", c("Grade") ]

Try subset:
con <- textConnection("FirstName,SurName,Subject,Grade\nBrian,Smith,History,75\nJenny,Jackson,English,60")
dat <- read.csv(con, stringsAsFactors=FALSE)
subset(dat, FirstName=="Brian" & SurName=="Smith" & Subject=="History", Grade)
Maybe aggregate can be helpful, too. The following code gives the mean of the grades for all pupil/subject combinations:
dat <- transform(dat, FullName=paste(FirstName, SurName), stringsAsFactors=FALSE)
aggregate(Grade ~ FullName+Subject, data=dat, FUN=mean)

Related

grepl function in R

I have a data set and I want to find the rows which include a specific word "result". I used the following function but it seems it doesn't work correctly. Any suggestion?
data$new<-data.frame(grepl("result",col1))
data:
col1 col2
ABC result VDCbvdc home 22
fgc school 34
university result home exam 45
exam math stat 65
try data$new <- grepl("result",data$col1)
data$new should be assigned to a vector, but you're trying to feed it a data frame. also, col1 only exists inside data, so you'll need data$col1.

R - Filter any rows and show all columns

I would like to an output that shows the column names that has rows containing a string value. Assume the following...
Animals Sex
I like Dogs Male
I like Cats Male
I like Dogs Female
I like Dogs Female
Data Missing Male
Data Missing Male
I found an SO tread here, David Arenburg provided answer which works very well but I was wondering if it is possible to get an output that doesn't show all the rows. So If I want to find a string "Data Missing" the output I would like to see is...
Animals
Data Missing
or
Animal
TRUE
instead of
Anmials Sex
Data Missing Male
Data Missing Male
I have also found using filters such as df$columnName works but I have big file and a number of large quantity of column names, typing column names would be tedious. Assume string "Data Missing" is also in other columns and there could be different type of strings. So that is why I like David Arenburg's answer, so bear in mind I don't have two columns, as sample given above.
Cheers
One thing you could do is grep for "Data Missing" like this:
x <- apply(data, 2, grep, pattern = "Data Missing")
lapply(x, length) > 1
This will give you the:
Animal
TRUE
result you're after. It's also good because it checks all columns, which you mentioned was something you wanted.
If we want only the first row where it matches, use match
data[match("Data Missing", data$Animals), "Animals", drop = FALSE]
# Animals
#5 Data Missing

Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df<-data.frame(Record=c(1,2,3,4,5),
First=c("Ed","Sue","Ed","Sue","Ed"),
Last=c("Bee","Cord","Bee","Cord","Bee"),
Address=c(123,NA,NA,456,789),
DOB=c("12/6/1995","0056/12/5",NA,"12/5/1956","10/4/1980"))
Record First Last Address DOB
1 Ed Bee 123 12/6/1995
2 Sue Cord 0056/12/5
3 Ed Bee
4 Sue Cord 456 12/5/1956
5 Ed Bee 789 10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
#Add a new column to the dataframe containing the number of NA values in each row.
df$nMissing <- apply(df,MARGIN=1,FUN=function(x) {return(length(x[which(is.na(x))]))})
#Using ave, find the indices of the rows for each name with min nMissing
#value and use them to filter your data
deduped_df <-
df[which(df$nMissing==ave(df$nMissing,paste(df$First,df$Last),FUN=min)),]
#If you like, remove the nMissinig column
df$nMissing<-deduped_df$nMissing<-NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which will automatically treat invalid dates as NA (missing data).
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")

R how to get data column to rows of first and second values

Apologies, I'm a novice but I don't seem to be able to find an answer to this question.
I've scraped tabular data from a web page. After some cleaning It appears in a single unnamed column.
[1] John
[2] Smith
[3] Tina
[4] Jordan
and so on.....
I'm obviously looking for the result of:
FirstName | LastName
[1] John Smith
[2] Tina Jordan
et al.
Much of what has gotten me to this point was sourced from: http://statistics.berkeley.edu/computing/r-reading-webpages
A very helpful resource for beginners such as myself.
I would be grateful for any advice you can give me.
Thanks,
C R Eaton
We create a logical index ('i1'), create a data.frame by extracting the elements in the first column of the original dataset ('dat') using 'i1'. The 'i1' elements will recycle to the length of the column, so if we do 'dat[i1,1]`, it will extract 1st element, 3rd, 5th, etc. For the last name, we just negate the 'i1', so that it will extract 2nd, 4th, etc..
i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1,1], LastName = dat[!i1, 1], stringsAsFactors=FALSE)

Selecting strings and using in logical expressions to create new variable - R

I have a categorical variable indicating location of flu clinics as well as an "other" category. Participants who select the "other" category give open-ended responses for their location. In most cases, these open-ended responses fit with one of the existing categories (for example, one category is "public health clinic", but some respondents picked "other" and cited "mall" which was a public health clinic). I could easily do this by hand but want to learn the code to select "mall" strings then use logical expressions to assign these people to "public health clinic" (e.g. create a new variable for location of flu clinics).
My categorical variable is "lrecflu2" and my character string variable is "lfother"
So far I have:
mall <- grep("MALL", Motiv82012$lfother, value = TRUE)
This gives me a vector with all the string responses containing "MALL" (all strings are in caps in the dataframe)
How do I use this vector in a logical expression to create a new variable that assigns these people to the "public health clinic" category and assigns the original value of flu clinic location variable for people that did not select "other" (and do not have values in the character string variable) to the new flu clinic location variable?
Perhaps, grep is not even the right function to be using.
As I understand it, you have a column in a data frame, where you want to reassign one character value to another. If so, you were almost there...
set.seed(1) # for generating an example
df1 <- data.frame(flu2=sample(c("MALL","other","PHC"),size=10,replace=TRUE))
df1$flu2[grep("MALL",df1$flu2)] <- "PHC"
Here grep() is giving you the required vector index; you then subset the vector based on this and change those elements.
Update 2
This should produce a data.frame similar to the one you are using:
set.seed(1)
lreflu2 <- sample(c("PHC","Med","Work","other"),size=10,replace=TRUE)
Ifother <- rep("",10) # blank character vector
s1 <- c("Frontenac Mall","Kingston Mall","notMALL")
Ifother[lreflu2=="other"] <- s1
df1 <- data.frame(lreflu2,Ifother)
### alternative:
### df1 <- data.frame(lreflu2,Ifother, stringsAsFactors = FALSE)
df1
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 other Frontenac Mall
5 PHC
6 other Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
If you're looking for an exact string match you don't need grep at all:
df1$lreflu2[df1$Ifother=="MALL"] <- "PHC"
Using a regex:
df1$lreflu2[grep("Mall",df1$Ifother)] <- "PHC"
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 PHC Frontenac Mall
5 PHC
6 PHC Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
Whether Ifother is a factor or vector with mode character doesn't affect things. data.frame will coerce string vectors to factors by default.

Resources