How to group factor levels? - r

I have a factor column with football position abbreviations, around 17 unique values with 220 observations. I want to have only three factor levels which encompass the 17 unique values.
levels(nfldraft$Pos) <- list(Linemen = c("C","OG","OT","TE","DT","DE"),
Small_Backs = c("CB","WR","FS"),
Big_Backs = c("FB","ILB","OLB","P","QB",
"RB","SS","WR"))
That is what I tried. Printing nfldraft$Pos to the console shows 3 factor levels, but all the values are either "Linemen" or "Small_Backs" and all the others are NA. Where am I going wrong?

I made up an example character vector with all of the abbreviations:
my_example <- c("C","OG","OT","TE","DT","DE","CB","WR","FS",
"FB","ILB","OLB","P","QB","RB","SS","WR")
class(my_example)
[1] "character"
Then I substituted the desired levels for their abbreviations (you could also use gsub here or any of many, many different approaches):
my_example[my_example %in% c("C","OG","OT","TE","DT","DE")] <- "Linemen"
my_example[my_example %in% c("CB","WR","FS")] <- "Small Backs"
my_example[my_example %in% c("FB","ILB","OLB","P",
"QB","RB","SS","WR")] <- "Big Backs"
Then I made it into a factor:
my_example <- as.factor(my_example)
head(my_example)
[1] Linemen Linemen Linemen Linemen Linemen Linemen
Levels: Big Backs Linemen Small Backs
tail(my_example)
[1] Big Backs Big Backs Big Backs Big Backs Big Backs Small Backs
Levels: Big Backs Linemen Small Backs
class(my_example)
[1] "factor"
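A named lookup vector is another way to do the same substitution in one step. This is only a sketch: the toy vector pos and the choice to list "WR" once, under "Small Backs", are assumptions made for the illustration, not something taken from the question's data.

```r
# Toy input: position abbreviations as a character vector (hypothetical data)
pos <- c("C", "WR", "QB", "FS", "OT", "RB")

# Names are the abbreviations, values are the group labels.
# "WR" is listed only once here (an assumption about the intended grouping).
lookup <- c(
  C = "Linemen", OG = "Linemen", OT = "Linemen",
  TE = "Linemen", DT = "Linemen", DE = "Linemen",
  CB = "Small Backs", WR = "Small Backs", FS = "Small Backs",
  FB = "Big Backs", ILB = "Big Backs", OLB = "Big Backs", P = "Big Backs",
  QB = "Big Backs", RB = "Big Backs", SS = "Big Backs"
)

# Index the lookup by abbreviation, then convert the result to a factor
pos_group <- factor(unname(lookup[pos]),
                    levels = c("Linemen", "Small Backs", "Big Backs"))
table(pos_group)
```

Any abbreviation missing from lookup comes back as NA, which makes unexpected values in the input easy to spot.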

This is a good example of needing a fully reproducible example. Actually, OP's code looks like it should work. Taking from @Hack-R's sample input:
my_example <- c("C","OG","OT","TE","DT","DE","CB","WR","FS",
"FB","ILB","OLB","P","QB","RB","SS","WR")
OP's original code works as-is:
nfldraft = list(Pos = factor(my_example))
levels(nfldraft$Pos) <- list(
Linemen = c("C","OG","OT","TE","DT","DE"),
Small_Backs = c("CB","WR","FS"),
Big_Backs = c("FB","ILB","OLB","P","QB","RB","SS","WR")
)
table(nfldraft$Pos)
# Linemen Small_Backs Big_Backs
# 6 2 9
This is exactly in line with the documentation for how to use levels<-:
levels(x) <- value
value A valid value for levels(x)... For the factor method, a vector of character strings with length at least the number of levels of x, or a named list specifying how to rename the levels.
So it seems there's something else wrong with OP's input
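One thing that would produce NAs with this syntax is a level in the data that is not listed in any group, for example a stray space or an abbreviation nobody expected. Whether that is what happened to OP is a guess; the toy values "K" and " QB" below are invented purely to show the behaviour.

```r
# Hypothetical input: "K" and " QB" (leading space) are in no group below
pos <- factor(c("C", "WR", "QB", "K", " QB"))

levels(pos) <- list(
  Linemen     = c("C", "OG", "OT", "TE", "DT", "DE"),
  Small_Backs = c("CB", "WR", "FS"),
  Big_Backs   = c("FB", "ILB", "OLB", "P", "QB", "RB", "SS")
)

as.character(pos)  # unlisted levels come back as NA
levels(pos)        # only the three group names remain
```

Running levels(nfldraft$Pos) before the assignment and comparing it with the abbreviations in the list would show quickly whether something like this is going on.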

You can also use the mapvalues() function from the plyr package.
In your example it would be:
library(dplyr) # for mutate() and %>%
Linemen_levels = c("C","OG","OT","TE","DT","DE")
Small_Backs_levels = c("CB","WR","FS")
Big_Backs_levels = c("FB","ILB","OLB","P","QB","RB","SS","WR")
# mapvalues() lives in plyr; calling it as plyr::mapvalues avoids attaching plyr on top of dplyr
nfldraft <- nfldraft %>% mutate(Pos = plyr::mapvalues(Pos,
from = c(Linemen_levels, Small_Backs_levels, Big_Backs_levels),
to = c(rep('Linemen', length(Linemen_levels)), rep('Small_Backs', length(Small_Backs_levels)), rep('Big_Backs', length(Big_Backs_levels)))))

Related

Find differences between 2 dataframes with different lengths

I have two dataframes with each two columns c("price", "size") with different lengths.
Each price must be linked to its size. These are two lists of trade orders. I have to find the differences between the two dataframes, knowing that each one can contain orders that the other doesn't have. I would like one output with the differences, or two outputs, it doesn't matter. But I need the row number in the output to find where the differences are in the series.
Here is sample data :
> out
price size
1: 36024.86 0.01431022
2: 36272.00 0.00138692
3: 36272.00 0.00277305
4: 36292.57 0.05420000
5: 36292.07 0.00403948
---
923598: 35053.89 0.30904890
923599: 35072.76 0.00232000
923600: 35065.60 0.00273000
923601: 35049.36 0.01760000
923602: 35037.23 0.00100000
>bit
price size
1: 37279.89 0.01340020
2: 37250.84 0.00930000
3: 37250.32 0.44284049
4: 37240.00 0.00056491
5: 37215.03 0.99891906
---
923806: 35053.89 0.30904890
923807: 35072.76 0.00232000
923808: 35065.60 0.00273000
923809: 35049.36 0.01760000
923810: 35037.23 0.00100000
For example, I need to know if the first row of the database out is in the database bit.
I've tried many functions, starting with comparedf():
summary(comparedf(bit, out, by = c("price","size")))
but I've got error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, :
I've tried compare_df() :
compareout=compare_df(out,bit,c("price","size"))
But I know the results are wrong: I get only 23 results, and I know that there are at least 200 differences.
I've also tried the match() and which() functions, but they don't give the results I'm looking for.
If you have any other methods, I will take them.
Perhaps you could just do an inner_join on out and bit by price and size? But first make an id variable for both data.frames:
library(dplyr)
out$id <- 1:nrow(out)
bit$id <- 1:nrow(bit)
joined <- inner_join(bit, out, by = c("price", "size"))
Now we can check which id from out and bit are not present in joined table:
id_from_bit_not_included_in_out <- bit$id[!bit$id %in% joined$id.x]
id_from_out_not_included_in_bit <- out$id[!out$id %in% joined$id.y]
These ids identify the rows that are missing from the other table: id_from_bit_not_included_in_out contains the rows present in bit but not in out, and id_from_out_not_included_in_bit contains the rows present in out but not in bit.
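The same idea can be sketched without a join, by building one text key per row and asking which keys are absent from the other table. The helper name row_key and the tiny tables below are made up for the illustration. It also assumes that equal prices and sizes print identically in both tables, so values that differ only by floating-point noise may need round() first.

```r
# Small stand-ins for the real tables (hypothetical values)
out <- data.frame(price = c(36024.86, 36272.00, 35053.89),
                  size  = c(0.01431022, 0.00138692, 0.30904890))
bit <- data.frame(price = c(37279.89, 35053.89),
                  size  = c(0.01340020, 0.30904890))

# row_key() is a hypothetical helper: one string per (price, size) pair
row_key <- function(d) paste(d$price, d$size, sep = "_")

# Row numbers of orders that have no counterpart in the other table
rows_only_in_out <- which(!row_key(out) %in% row_key(bit))
rows_only_in_bit <- which(!row_key(bit) %in% row_key(out))

out[rows_only_in_out, ]
bit[rows_only_in_bit, ]
```

Because which() returns positions, the row numbers come out directly and no id column is needed. dplyr::anti_join(out, bit, by = c("price", "size")) should return the same rows if you prefer to stay in dplyr.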
First attempt here. It will be difficult to do a very clean job with this data, though.
The data I used:
out <- read.table(text = "price size
36024.86 0.01431022
36272.00 0.00138692
36272.00 0.00277305
36292.57 0.05420000
36292.07 0.00403948
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
bit <- read.table(text = "price size
37279.89 0.01340020
37250.84 0.00930000
37250.32 0.44284049
37240.00 0.00056491
37215.03 0.99891906
37240.00 0.00056491
37215.03 0.99891906
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
Assuming purely that row 1 of out should match with row 1 of bit a simple solution could be:
df <- cbind(distinct(out), distinct(bit))
names(df) <- make.unique(names(df))
However, judging from the data you have provided, I am not sure this is the way to go (there are big differences in the first few rows), so maybe try sorting the data first:
df <- cbind(distinct(out[order(out$price, out$size),]), distinct(bit[order(bit$price, bit$size),]))
names(df) <- make.unique(names(df))

In R how do you factorise and add label values to specific data.table columns, using a second file of meta data?

This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss), this question is about getting the benefits of SPSS-style labelling when the data originates from CSV sources. The aim is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of the original file format.
Whilst CSV does a reasonable job of storing data, it is hopeless at carrying metadata. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples (e.g. in documentation) it is practical to simply hard-code the metadata. But for larger projects it makes more sense to store this metadata in a second CSV file.
Example data file
ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,
Example metadata file
Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,
So the common elements are the column names, which are the key to both files.
The first column of the metadata file describes the role of each row with respect to the data file:
varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
Missing describes how to treat missing data
vallable provides the labels for the factor levels, one row per level, starting at one and going up to as many labels as there are.
Right! Here's the code that works:
#Libraries
library(expss)
library(data.table)
library(magrittr)
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE,
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
(Here I copy the rdt data.table simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.)
tdt <- copy(rdt) # working copy; rdt stays untouched
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}
OK this is where I fall down.
Failure from here
factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col]
# if not a factor column, remove it from the factorcol list and dont try to factor it
else { # if it is a vector factorise
print(paste("working on",col)) # I have had a lot of problem with unrecognised ..col variables
tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get list of labels from the data column}
validrange <- seq(1,lengths(tlabels),1) # range of valid values is 1 to the length of labels list
print(as.character(tlabels)) # for testing
print(validrange) # for testing
tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
# expss::val_lab(tdt[, ..col]) <- tlabels
tlabels = c() # flush loop variable
validrange = c() # flush loop variable
}
}
So the problem is revealed here when we check the data table.
tdt
the labels have been applied as whole vectors to each column entry except where there is only one value in the vector ("checked" for varfour and varfive)
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked
And a mystery
This code works just fine on a single column when I don't use a for loop variable:
# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)
It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't produce a vector as you expect. Unlike a regular data.frame, a data.table doesn't drop dimensions when you select a single column in the index, and as.vector does nothing to data.frames/data.tables, so tlabels remains a data.table. This line needs to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]).
Example:
library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 1 variable:
# $ am: num 1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"
tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"
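Putting the fix back into the labelling loop might look roughly like the sketch below. It uses plain data.frames so that it runs without extra packages ([[ behaves the same way on a data.table), and the toy mdt and tdt are cut-down, invented stand-ins for the question's files. Note that once tlabels is a real vector, the range should be built with length() or seq_along() rather than lengths().

```r
# Toy metadata: one column per variable, rows are value labels (NA = no label)
mdt <- data.frame(varone   = c("One", "Two", "Three"),
                  varseven = c("x", "y", "z"),
                  vartwo   = c(NA, NA, NA),
                  stringsAsFactors = FALSE)

# Toy data, coded 1..k for the labelled columns
tdt <- data.frame(varone = c(1, 1, 2), varseven = c(1, 3, 2),
                  vartwo = c(34, 21, 54))

for (col in names(mdt)) {
  tlabels <- mdt[[col]][!is.na(mdt[[col]])]  # plain vector, NAs removed
  if (length(tlabels) == 0) next             # no labels: leave the column alone
  validrange <- seq_along(tlabels)           # 1 .. number of labels
  tdt[[col]] <- factor(tdt[[col]], levels = validrange,
                       labels = as.character(tlabels))
}
str(tdt)
```

Codes outside 1..k (such as a 0 in a 0/1 column) become NA with this scheme, so such columns would need their own levels.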

r - getting all NA in ordered factor column

Instead of showing more2 or less2 in the column, it only shows up as NA. Why aren't the character names appearing instead?
careermore2 <- vector(mode="character",length=length(mlb$careeryrs))
"less2" <- careermore2[mlb$careeryrs<=2]
"more2" <- careermore2[mlb$careeryrs>=2]
No.seasons <- factor(careermore2,levels=c("more2","less2"),exclude=NA,ordered=TRUE)
mlb2 <- cbind(mlb,No.seasons)
str(mlb2$No.seasons)
head(mlb2$No.seasons)
mlb2[mlb2$No.seasons=="more2",]
Looking at careermore2 I would say you've got these the wrong way round:
"less2" <- careermore2[mlb$careeryrs<=2]
"more2" <- careermore2[mlb$careeryrs>=2]
That creates two objects. You really meant:
careermore2[mlb$careeryrs<=2] = "less2"
careermore2[mlb$careeryrs>=2] = "more2"
i.e. set the corresponding values in careermore2. And you probably want <2 or >2 rather than having = in both...
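For completeness, here is a small self-contained version of the corrected idea. The toy mlb data frame is invented, and counting exactly 2 years as "less2" and ordering the levels as less2 < more2 are both assumptions you may want to change.

```r
# Hypothetical stand-in for the real mlb data
mlb <- data.frame(careeryrs = c(1, 2, 3, 5, 0))

# Assign the labels as values; exactly 2 years counts as "less2" here
careermore2 <- ifelse(mlb$careeryrs > 2, "more2", "less2")

No.seasons <- factor(careermore2, levels = c("less2", "more2"), ordered = TRUE)
mlb2 <- cbind(mlb, No.seasons)

str(mlb2$No.seasons)
mlb2[mlb2$No.seasons == "more2", ]
```

ifelse() fills every element in one pass, so there is no need to create an empty character vector first.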

Apply conditional selection to sequence of columns R

I use data from the NHANES periodontal dataset (https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm). After cleaning it to keep only the "pc" variables, I have a data frame, setPD, with 168 columns: 6 measurements (pcd, pcm, pcs, pcp, pcl, pca) around each of 28 teeth numbered from #02 to #31.
#names(setPD)
[1] "ohx02pcd" "ohx02pcm" "ohx02pcs" "ohx02pcp" "ohx02pcl" "ohx02pca" "ohx03pcd" "ohx03pcm" "ohx03pcs" "ohx03pcp" "ohx03pcl" "ohx03pca"
[13] "ohx04pcd" "ohx04pcm" "ohx04pcs" "ohx04pcp" "ohx04pcl" "ohx04pca" "ohx05pcd" "ohx05pcm" "ohx05pcs" "ohx05pcp" "ohx05pcl" "ohx05pca"
[25] "ohx06pcd" "ohx06pcm" "ohx06pcs" "ohx06pcp" "ohx06pcl" "ohx06pca" "ohx07pcd" "ohx07pcm" "ohx07pcs" "ohx07pcp" "ohx07pcl" "ohx07pca"
[37] "ohx08pcd" "ohx08pcm" "ohx08pcs" "ohx08pcp" "ohx08pcl" "ohx08pca" "ohx09pcd" "ohx09pcm" "ohx09pcs" "ohx09pcp" "ohx09pcl" "ohx09pca"
[49] "ohx10pcd" "ohx10pcm" "ohx10pcs" "ohx10pcp" "ohx10pcl" "ohx10pca" "ohx11pcd" "ohx11pcm" "ohx11pcs" "ohx11pcp" "ohx11pcl" "ohx11pca"
[61] "ohx12pcd" "ohx12pcm" "ohx12pcs" "ohx12pcp" "ohx12pcl" "ohx12pca" "ohx13pcd" "ohx13pcm" "ohx13pcs" "ohx13pcp" "ohx13pcl" "ohx13pca"
[73] "ohx14pcd" "ohx14pcm" "ohx14pcs" "ohx14pcp" "ohx14pcl" "ohx14pca" "ohx15pcd" "ohx15pcm" "ohx15pcs" "ohx15pcp" "ohx15pcl" "ohx15pca"
[85] "ohx18pcd" "ohx18pcm" "ohx18pcs" "ohx18pcp" "ohx18pcl" "ohx18pca" "ohx19pcd" "ohx19pcm" "ohx19pcs" "ohx19pcp" "ohx19pcl" "ohx19pca"
[97] "ohx20pcd" "ohx20pcm" "ohx20pcs" "ohx20pcp" "ohx20pcl" "ohx20pca" "ohx21pcd" "ohx21pcm" "ohx21pcs" "ohx21pcp" "ohx21pcl" "ohx21pca"
[109] "ohx22pcd" "ohx22pcm" "ohx22pcs" "ohx22pcp" "ohx22pcl" "ohx22pca" "ohx23pcd" "ohx23pcm" "ohx23pcs" "ohx23pcp" "ohx23pcl" "ohx23pca"
[121] "ohx24pcd" "ohx24pcm" "ohx24pcs" "ohx24pcp" "ohx24pcl" "ohx24pca" "ohx25pcd" "ohx25pcm" "ohx25pcs" "ohx25pcp" "ohx25pcl" "ohx25pca"
[133] "ohx26pcd" "ohx26pcm" "ohx26pcs" "ohx26pcp" "ohx26pcl" "ohx26pca" "ohx27pcd" "ohx27pcm" "ohx27pcs" "ohx27pcp" "ohx27pcl" "ohx27pca"
[145] "ohx28pcd" "ohx28pcm" "ohx28pcs" "ohx28pcp" "ohx28pcl" "ohx28pca" "ohx29pcd" "ohx29pcm" "ohx29pcs" "ohx29pcp" "ohx29pcl" "ohx29pca"
[157] "ohx30pcd" "ohx30pcm" "ohx30pcs" "ohx30pcp" "ohx30pcl" "ohx30pca" "ohx31pcd" "ohx31pcm" "ohx31pcs" "ohx31pcp" "ohx31pcl" "ohx31pca"
I am trying to apply a conditional selection in each group of six columns. This is:
transmute(setPD,PD02 = ifelse(setPD$ohx02pcd >5 |
setPD$ohx02pcm>5 |setPD$ohx02pcs >5|
setPD$ohx02pcp >5 | setPD$ohx02pcl >5 |
setPD$ohx02pca >5, 1, 0))
Then for the next tooth (03) I have to write again:
transmute(setPD,PD03 = ifelse(setPD$ohx03pcd >5 |
setPD$ohx03pcm>5|setPD$ohx03pcs >5|
setPD$ohx03pcp >5|setPD$ohx03pcl >5|
setPD$ohx03pca >5, 1, 0))
I tried to firstly do that conditional selection in a more efficient way, something like:
transmute(setPD,PD02 = ifelse(list(setPD$ohx02pcd:setPD$ohx02pcp) >5, 1, 0))
but it does not work.
Then I am looking for a way to write a loop that does that over each tooth without needing to write this 28 times!!
I thought of applying the select function of dplyr in a for loop but I don't know how to do that.
At the end I want to take all the new columns I made with transmute and say that if at least 2 of the 28 columns are 1, then I have disease, and if fewer than 2 are 1, then I have health. Any help would be appreciated.
Note: If you want to get the dataset, it is open access from CDC.org:
https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm
First, it is useful to point out that the logical statements of the form is A true OR is B true OR is C true are equivalent to asking is ANY of A,B,C true? We can use this to simplify the statements setPD$ohx02pcd >5 | setPD$ohx02pcm>5 |setPD$ohx02pcs >5| ... to ask if for any of these columns it is true that their value is larger than 5.
For example, let us focus on tooth number 02 first. To get all columns that concern this tooth, we can use grep to get a vector of column names. This can be achieved with
current_tooth <- grep("02", names(setPD), value = T)
Note that if there are any other columns in the data that contain the string 02, these columns will also show up. This does not appear to be the case in your data, but it is worthwhile pointing out here in case someone else uses it and this applies in other datasets.
Now, we can use these names to subset the dataframe. For instance,
setPD[,current_tooth]
will give you the corresponding columns. In each row, we want to check if any of the above mentioned conditions are true. Given a vector of logical statements, we can check if any of them is true with the function any. To go through a dataframe by row and apply a function, we can use apply, such as in
setPD$PD02 <-
apply(setPD[,grep("02", names(setPD), value = T)], 1, function(x) any(x>5))
Now, the above applies to one tooth only, namely 02. One way of doing it for all teeth is to create a vector with all tooth indicators and use this to loop over the above lines, replacing the "02" in the above grep call in each iteration and using assign or something similar to get the variable name right. A more elegant and more efficient way is to use the same principle on long data. Consider the following:
library(reshape2)
library(dplyr)
m <- melt(setPD, id.vars="SEQN")
m$num <- substr(m$variable, 4,5) # be careful here and check output!
# group by subject AND tooth so that any() is evaluated per person and per tooth
m <- m %>% group_by(SEQN, num) %>% summarise(PS = any(value>5)) %>% as.data.frame()
m$num <- paste0("PD", m$num)
md <- dcast(m, SEQN ~ num, value.var = "PS")
setPD <- merge(setPD, md, by="SEQN")
This melts your data first and creates a variable num that indicates your tooth. Again, make sure that this works. I have used the fact that in your data the tooth number always appears in the 4th and 5th position of the column name. Make sure this is true, and adjust the code otherwise. It also assumes the subject identifier SEQN is still in setPD. Then, for each subject and tooth, I create a variable PS which indicates whether any of the columns that contain the tooth identifier has a value larger than 5. Last but not least, I recast the data so that you have the values of PD02, PD03, etc. in columns again, before I merge this to the old dataset. The line with paste0 merely creates the variable names that you want to have.
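If you would rather stay in wide format, the loop mentioned earlier can be sketched in base R as below, together with the final "at least 2 teeth" rule from the question. The toy setPD has only two teeth and three sites per tooth, and the 0/1 coding and the column name disease are illustrative choices, not anything from NHANES.

```r
# Toy stand-in for setPD: two teeth, three sites each (hypothetical values)
setPD <- data.frame(
  SEQN     = 1:3,
  ohx02pcd = c(2, 6, 3), ohx02pcm = c(1, 2, 3), ohx02pcs = c(4, 4, 7),
  ohx03pcd = c(2, 6, 1), ohx03pcm = c(3, 2, 2), ohx03pcs = c(1, 1, 1)
)

# Tooth numbers are taken from characters 4-5 of the "ohxNNpc*" column names
pc_cols <- grep("^ohx[0-9]{2}pc", names(setPD), value = TRUE)
teeth   <- unique(substr(pc_cols, 4, 5))

for (tooth in teeth) {
  cols <- grep(paste0("^ohx", tooth, "pc"), names(setPD), value = TRUE)
  # 1 if any site of this tooth is > 5; missing sites are ignored
  n_deep <- rowSums(setPD[, cols, drop = FALSE] > 5, na.rm = TRUE)
  setPD[[paste0("PD", tooth)]] <- as.integer(n_deep > 0)
}

# Disease = at least 2 teeth flagged
pd_cols <- paste0("PD", teeth)
setPD$disease <- as.integer(rowSums(setPD[, pd_cols, drop = FALSE]) >= 2)
setPD[, c("SEQN", pd_cols, "disease")]
```

Anchoring the pattern with "^ohx" avoids the caveat above about other columns that merely contain "02".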

using hash to determine whether 2 dataframes are identical (PART 01)

A few months ago I created a dataset from the WHO ATC/DDD Index, and I want to check whether the online database is still unchanged today. So I downloaded it again and tried to use the digest package in R to do the comparison.
The two datasets (in txt format) can be downloaded here. (I am aware that you may think the files are unsafe and may contain a virus, but I don't know how to generate a dummy dataset that replicates the issue, so I have uploaded the real datasets.)
And I have written a little script as below:
library(digest)
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
ddd.old <- data.frame(ddd.old, hash = apply(ddd.old, 1, digest),stringsAsFactors=FALSE)
ddd.new <- data.frame(ddd.new, hash = apply(ddd.new, 1, digest),stringsAsFactors=FALSE)
ddd.old <- ddd.old[order(ddd.old[,"hash"]),]
ddd.new <- ddd.new[order(ddd.new[,"hash"]),]
And something really interesting happens when I do the checking:
> table(ddd.old[,"hash"]%in%ddd.new[,"hash"]) #line01
TRUE
506
> table(ddd.new[,"hash"]%in%ddd.old[,"hash"]) #line02
TRUE
506
> digest(ddd.old[,"hash"])==digest(ddd.new[,"hash"]) #line03
[1] TRUE
> digest(ddd.old)==digest(ddd.new) #line04
[1] FALSE
line01 and line02 show that every row in ddd.old can be found in ddd.new, and vice versa.
line03 shows that the hash columns of both data frames are the same.
line04 shows that the two data frames are different.
What is happening? Both data frames have identical rows (from line01 and line02) in the same order (from line03), yet they are different (from line04)?
Or have I misunderstood something about digest? Thanks.
Read in data as before.
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
Like Marek said, start by checking for differences with all.equal.
all.equal(ddd.old, ddd.new)
[1] "Component 6: 4 string mismatches"
[2] "Component 8: 24 string mismatches"
So we just need to look at columns 6 and 8.
different.old <- ddd.old[, c(6, 8)]
different.new <- ddd.new[, c(6, 8)]
Hash these columns.
hash.old <- apply(different.old, 1, digest)
hash.new <- apply(different.new, 1, digest)
And find the rows where they don't match.
different_rows <- which(hash.old != hash.new) #which is optional
Finally, combine the datasets.
cbind(different.old[different_rows, ], different.new[different_rows, ])
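If hashing is not essential, the differing rows and columns can also be located with a plain cell-by-cell comparison. The sketch below uses two tiny invented tables instead of the real files, and it assumes both data frames have the same dimensions, the same row order and no NA values.

```r
# Hypothetical stand-ins for ddd.old / ddd.new
old <- data.frame(code = c("A01", "A02", "A03"), ddd = c("1", "2.5", "10"),
                  stringsAsFactors = FALSE)
new <- data.frame(code = c("A01", "A02", "A03"), ddd = c("1", "2.50", "10"),
                  stringsAsFactors = FALSE)

identical(old, new)   # strict check, including attributes
all.equal(old, new)   # describes mismatches column by column

# Logical matrix marking every cell that differs
diff_cells <- old != new
different_rows <- which(rowSums(diff_cells) > 0)
different_cols <- names(which(colSums(diff_cells) > 0))

cbind(old[different_rows, different_cols, drop = FALSE],
      new[different_rows, different_cols, drop = FALSE])
```

Here the only difference is how one value is written ("2.5" versus "2.50"), which is the kind of string mismatch that all.equal reports.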
