Matching data across two data frames in R

I've found a number of answers to my question that almost get me to the result I want, but not quite!
I've got two data sets that include word lists, something like:
df1:
Word    | Speaker
apple   | 1
dog     | 1
lobster | 1
tree    | 2

df2:
Word    | Speaker
car     | 2
lobster | 2
fish    | 1
bird    | 1
I want to create a new column in df1 that will tell me whether or not the same word appears in df2, regardless of exactly where in the list it occurs and who the speaker was. So I want to create a new column in df1, similar to this:
df1:
Word    | Speaker | Match
apple   | 1       | FALSE
dog     | 1       | FALSE
lobster | 1       | TRUE
tree    | 2       | FALSE
It seems that it should be very easy but I can't quite get it to do the right thing. Any help much appreciated!

You're right - it is easy! You need %in%...
df1$Match <- (df1$Word %in% df2$Word)
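For a fully reproducible sketch (the data frames below are reconstructed from the sample data in the question):
df1 <- data.frame(Word = c("apple", "dog", "lobster", "tree"),
                  Speaker = c(1, 1, 1, 2))
df2 <- data.frame(Word = c("car", "lobster", "fish", "bird"),
                  Speaker = c(2, 2, 1, 1))
# %in% returns TRUE for each word in df1 that occurs anywhere in df2$Word,
# regardless of position or speaker
df1$Match <- df1$Word %in% df2$Word
df1
#>      Word Speaker Match
#> 1   apple       1 FALSE
#> 2     dog       1 FALSE
#> 3 lobster       1  TRUE
#> 4    tree       2 FALSE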

Related

Is there a R function to delete an element in data frame [duplicate]

This question already has answers here:
Delete rows containing specific strings in R
(7 answers)
How to remove rows in a dataframe that contain certain words in R?
(2 answers)
Remove Rows From Data Frame where a Row matches a String
(6 answers)
Closed 1 year ago.
I'm looking for a way to delete an element out of a data frame (delete the full row) if a certain word is not found in the Name column. In my case, the word is a color like red, blue, or green.
The dataset looks like this:
Name          | Classification
A red apple   | Fruit
A banana      | Fruit
A blue carrot | Vegetable
I use the following code to loop through the data frame and check whether a color appears in the Name string. Currently I'm only looking for red, to get it working.
for (i in 1:nrow(nsl)) {
  if (!grepl("red", nsl$Name[i], fixed = TRUE)) {
    nsl$Name[i] <- "" # This is where I want to delete the row or set it to empty (so later on I can filter all empty rows out)
  }
}
But I have no clue how this is done on a data frame that has two columns/dimensions. All the tutorials I find tell me to use nsl[-index_element_here], but that deletes the Classification column, because they demonstrate it on one-dimensional lists. Others tell me to use %in%, which is, I think, not what I'm looking for.
The expected end result is a data frame containing only the fruits and vegetables whose names include a color; everything else is deleted.
There is no need for a loop:
nsl <- read.table(text = "Name|Classification
'A red apple' | Fruit
'A banana' | Fruit
'A blue carrot' | Vegetable
", header = TRUE, sep = "|")
nsl[grepl("red", nsl$Name, fixed = TRUE), ]
#>          Name Classification
#> 1 A red apple          Fruit
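To cover all three colors from the question (red, blue, green), collapse them into a single regular expression. Note that fixed = TRUE has to be dropped, since | is regex alternation:
colors <- c("red", "blue", "green")
# paste(collapse = "|") builds the pattern "red|blue|green"
nsl[grepl(paste(colors, collapse = "|"), nsl$Name), ]
#>            Name Classification
#> 1   A red apple          Fruit
#> 3 A blue carrot      Vegetable
This matches the expected end result: only the rows whose Name contains a color survive.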

Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"))
Record First Last Address DOB
1      Ed    Bee  123     12/6/1995
2      Sue   Cord         0056/12/5
3      Ed    Bee
4      Sue   Cord 456     12/5/1956
5      Ed    Bee  789     10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
# Add a new column to the data frame containing the number of NA values in each row.
df$nMissing <- apply(df, MARGIN = 1, FUN = function(x) sum(is.na(x)))
# Using ave, find the indices of the rows for each name with the minimum
# nMissing value and use them to filter your data.
deduped_df <- df[which(df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min)), ]
# If you like, remove the nMissing column.
df$nMissing <- deduped_df$nMissing <- NULL
deduped_df
  Record First Last Address       DOB
1      1    Ed  Bee     123 12/6/1995
4      4   Sue Cord     456 12/5/1956
5      5    Ed  Bee     789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which will automatically treat invalid dates as NA (missing data).
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
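Putting it together on the sample data, a minimal sketch; the key point is that the DOB conversion must happen before the NA counts are computed, so that invalid dates count as missing:
# Convert first: invalid dates such as "0056/12/5" become NA
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")
# Then count missing values per row and keep the most complete row(s) per name
df$nMissing <- rowSums(is.na(df))
deduped_df <- df[df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min), ]
deduped_df$nMissing <- NULL
deduped_df
#>   Record First Last Address        DOB
#> 1      1    Ed  Bee     123 1995-12-06
#> 4      4   Sue Cord     456 1956-12-05
#> 5      5    Ed  Bee     789 1980-10-04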

Find if a specific choice is in a Data Frame R

I have a Data Frame object which contains a list of possible choices. For example, an analogy of this would be:
FirstName, SurName, Subject, Grade
Brian, Smith, History, 75
Jenny, Jackson, English, 60
How would I:
1) Check to see if a certain pupil-subject combination is in my Data Frame?
2) And for those who are there, extract their grade (and potentially other relevant fields)?
Thanks so much
The only solutions I've found so far involve appending the values onto the end of the Data Frame and checking whether the result is unique or not, which seems like a crude and ridiculous hack.
Learn data subsetting (extraction) using base R.
To subset any data frame by its rows and columns, you use [ ].
Let df be your data frame.
  FirstName SurName Subject Grade
1     Brian   Smith History    75
2     Jenny Jackson English    60
3       Tom Brandon Physics    50
You can subset it by its rows and columns using
df[rows,columns]
Here rows and columns can be:
1) Index (Number/Name)
This means: give me that particular row and column, like
df[2,3]
This returns the value in the second row and third column:
[1] English
or
df[2,"Grade"]
returns
[1] 60
2) Range (Indices/List of Names)
This means: give me these rows and columns, like
df[1:2,2,drop=F]
Here drop=F prevents the result from being flattened to a vector, so the output stays a data.frame. It will give you this:
  SurName
1   Smith
2 Jackson
A range can also select everything by leaving either the rows or the columns empty, like
df[,3,drop=F]
This returns all rows of the third column:
  Subject
1 History
2 English
3 Physics
or
df[1:2,c("Grade","Subject")]
  Grade Subject
1    75 History
2    60 English
3) Logical
This means you want to subset using a logical condition, like
df[df$FirstName=="Brian",]
meaning: give me the rows where FirstName is Brian, with all columns.
  FirstName SurName Subject Grade
1     Brian   Smith History    75
or
df[df$FirstName=="Brian",1:3]
give me rows where FirstName is Brian and give me only 1 to 3 columns.
or create complex logicals
df[df$FirstName=="Brian" & df$SurName==" Smith",1:3]
output
  FirstName SurName Subject
1     Brian   Smith History
or use a compound condition and extract a column by name
df[df$FirstName=="Brian" & df$SurName=="Smith","Grade",drop=F]
  Grade
1    75
or use a compound condition and extract multiple columns by name
df[df$FirstName=="Brian" & df$SurName=="Smith",c("Grade","Subject")]
  Grade Subject
1    75 History
To use this in a function, do
myfunc <- function(input_var1, input_var2, input_var3) {
  df[df$FirstName == input_var1 & df$SurName == input_var2 & df$Subject == input_var3,
     "Grade", drop = F]
}
run it like this
myfunc("Tom","Brandon","Physics")
I think you are looking for this:
result <- data[data$FirstName == "Brian" & data$Subject == "History", "Grade"]
Try subset:
con <- textConnection("FirstName,SurName,Subject,Grade\nBrian,Smith,History,75\nJenny,Jackson,English,60")
dat <- read.csv(con, stringsAsFactors=FALSE)
subset(dat, FirstName=="Brian" & SurName=="Smith" & Subject=="History", Grade)
Maybe aggregate can be helpful, too. The following code gives the mean of the grades for all pupil/subject combinations:
dat <- transform(dat, FullName=paste(FirstName, SurName), stringsAsFactors=FALSE)
aggregate(Grade ~ FullName+Subject, data=dat, FUN=mean)
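If you have several pupil-subject combinations to check at once, a join handles both parts of the question in one step. A sketch using dat as first read in above (before the FullName column was added); the lookup table here is hypothetical:
lookup <- data.frame(FirstName = c("Brian", "Tom"),
                     Subject = c("History", "Physics"))
# merge keeps only the combinations that actually occur in dat,
# and carries the matching Grade (and other columns) along
merge(lookup, dat, by = c("FirstName", "Subject"))
#>   FirstName Subject SurName Grade
#> 1     Brian History   Smith    75
The "Tom"/"Physics" combination is absent from dat, so it simply drops out of the result.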

Changing values in list if that value meets criteria in R

I have a set of data that I am importing from a csv
info <- read.csv("test.csv")
here is an example of what it would look like
   name type purchase
1  mark  new      yes
2 steve  old       no
3   jim  old      yes
4  bill  new      yes
What I want to do:
I want to go through the purchase column and change all the yes values to TRUE and the no values to FALSE, then go through the type column and change all the old values to customer.
I've tried messing with the different apply functions and couldn't get it to work. I've also tried a number of the methods in the thread Replace a value in a data frame based on a conditional (`if`) statement in R, but still no luck.
Any help or guidance would be much appreciated!
Thanks,
Nico
Here's an approach using within, basic character substitution, and a basic test for character equality. The question's data frame is named info, so I'll use that:
within(info, {
  type <- gsub("old", "customer", type)
  purchase <- purchase == "yes"
})
#    name     type purchase
# 1  mark      new     TRUE
# 2 steve customer    FALSE
# 3   jim customer     TRUE
# 4  bill      new     TRUE
I've used gsub to do the replacement in type, but other approaches would work too (e.g. factor, ifelse, and so on); one ifelse variant is sketched below.
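A minimal ifelse sketch, assuming type and purchase were read in as character vectors (stringsAsFactors = FALSE), since ifelse on a factor would return the underlying integer codes:
info$purchase <- info$purchase == "yes"  # "yes"/"no" becomes TRUE/FALSE
info$type <- ifelse(info$type == "old", "customer", info$type)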

count distinct values in spreadsheet

I have a Google spreadsheet with a column that looks like this:
City
----
London
Paris
London
Berlin
Rome
Paris
I want to count the appearances of each distinct city (so I need the city name and the number of appearances).
City   | Count
-------+------
London | 2
Paris  | 2
Berlin | 1
Rome   | 1
How do I do that?
Link to Working Examples
Solution 0
This can be accomplished using pivot tables.
Solution 1
Use the unique formula to get all the distinct values. Then use countif to get the count of each value. See the working example link at the top to see exactly how this is implemented.
Unique Values    Count
=UNIQUE(A3:A8)   =COUNTIF(A3:A8;B3)
                 =COUNTIF(A3:A8;B4)
                 ...
Solution 2
If you set up your data like this:
City
----
London 1
Paris 1
London 1
Berlin 1
Rome 1
Paris 1
Then the following will produce the desired result.
=sort(transpose(query(A3:B8,"Select sum(B) pivot (A)")),2,FALSE)
I'm sure there is a way to get rid of the second column since all values will be 1. Not an ideal solution in my opinion.
via http://googledocsforlife.blogspot.com/2011/12/counting-unique-values-of-data-set.html
Other Possibly Helpful Links
http://productforums.google.com/forum/#!topic/docs/a5qFC4pFZJ8
You can use the query function, so if your data were in col A where the first row was the column title...
=query(A2:A,"select A, count(A) where A != '' group by A order by count(A) desc label A 'City'", 0)
yields
City   count
London 2
Paris  2
Berlin 1
Rome   1
Link to working Google Sheet.
https://docs.google.com/spreadsheets/d/1N5xw8-YP2GEPYOaRkX8iRA6DoeRXI86OkfuYxwXUCbc/edit#gid=0
=iferror(counta(unique(A1:A100))) counts the number of unique values in A1:A100.
Not exactly what the user asked, but an easy way to just count unique values:
Google introduced a new function to count unique values in just one step, and you can use this as an input for other formulas:
=COUNTUNIQUE(A1:B10)
This works if you just want the count of unique values in e.g. the following range
=counta(unique(B4:B21))
This is similar to Solution 1 from @JSuar...
Assume your original city data is a named range called dataCity. In a new sheet, enter the following:
  |         A         |                   B
--+-------------------+----------------------------------------
1 | =UNIQUE(dataCity) | Count
2 |                   | =DCOUNTA(dataCity,"City",{"City";$A2})
3 |                   | [copy down the formula above]
4 |                   | ...
5 |                   | ...
=UNIQUE({filter(Core!L8:L27,isblank(Core!L8:L27)=false),query(ArrayFormula(countif(Core!L8:L27,Core!L8:L27)),"select Col1 where Col1 <> 0")})
Where Core!L8:L27 is the list in the question.
