Extract list of non-matches in R

So I have two dataframes, and both have one column with an ID number linked to a DNA sequence and another column with the DNA sequence itself. One dataframe is the raw data; the other has been filtered to include only a subset of the raw data. What I'm now interested in doing is generating a .csv of all the sequences in the raw dataframe that don't have a match in the filtered dataframe.
So as an example of the goal, I'll define a couple dataframes here with two columns (col1 and col2):
col1a<-c(1,2,3,4,5,6)
col2a<-c("a","t","a","t","a","g")
col1b<-c(1,3,5,6)
col2b<-c("a","a","a","g")
df1<-data.frame(col1a,col2a)
df2<-data.frame(col1b,col2b)
My desired output is this third dataframe (df3):
col1c <- c(2,4)
col2c <- c("t","t")
df3 <- data.frame(col1c,col2c)
I know I can use %in%. I can get this far:
IN <- sum(df1$col1a %in% df2$col1b) #Output = 4
NOTIN <- sum(!df1$col1a %in% df2$col1b) #Output = 2
So now I'm looking for a way to export the rows counted by NOTIN so that they can be written out as a table. I want to generate the example dataframe I called df3 earlier as my output.
Any help or suggestions are much appreciated :)

If df1 contains all the entries in df2, it's as simple as
df1[!df1$col1a %in% df2$col1b, ]
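Since the end goal is a .csv, you can assign that subset and write it out; the file name below is just a placeholder:
# rows of df1 whose IDs have no match in df2
df3 <- df1[!df1$col1a %in% df2$col1b, ]
# export them; row.names = FALSE keeps the file to the two columns
write.csv(df3, "unmatched_sequences.csv", row.names = FALSE)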

You can use an anti_join:
library(dplyr)
anti_join(df1, df2, by = c("col1a" = "col1b"))

You can do this in data.table as well:
library(data.table)
df1 <- data.table(df1, key = "col1a")
df2 <- data.table(df2, key = "col1b")
df1[!df2]
With version 1.9.5 (on GitHub, not on CRAN yet), you can use the on = syntax instead of setting a key:
df1[!df2, on = c(col1a = "col1b")]
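If you want to go straight from the anti-join to the .csv the question asks for, recent versions of data.table also provide fwrite; again, the file name is just a placeholder:
# anti-join, then export the unmatched rows
df3 <- df1[!df2, on = c(col1a = "col1b")]
fwrite(df3, "unmatched_sequences.csv")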

Related

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see, the variable names are the same in both dataframes; however, the numeric values differ slightly, and the correct versions are in df2 but need to end up in df1. I need to do this for many, many variables in a complex data set, and I wonder whether someone could help with a more efficient way to code this (possibly without referencing each column individually).
Here some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help, but I will try to summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their orders already match, then you can achieve this easily with the following subset notation:
df1[,c({column names df1}), drop = FALSE] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c, and you want to replace b and c with two columns of df2, whose columns are x, y, and z.
df1[,c("b","c"), drop = FALSE] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument is just for added protection against subsetting a data.frame to ensure you don't get a vector.
If you do NOT know that the order is correct, or one data frame may have a different size than the other, BUT there is a unique identifier shared between the two data.frames, then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base R or the *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
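For completeness, a roughly equivalent base R version with merge might look like the sketch below, under the same assumption that a and x are the unique identifiers; note that merge sorts by the key by default, so the original row order of df1 is not preserved:
# base R counterpart of the left_join above; suffixes label overlapping column names
new_df <- merge(df1, df2, by.x = "a", by.y = "x", all.x = TRUE,
                suffixes = c("_old", "_new"))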

How to separate a column of data.table given conditions

After reading some XML files, I want to create a data.table with specific column names, e.g. Name, Score, Medal, etc. However, I am confused about how I should separate the single column (see the code and results) into many columns, given certain criteria.
In my opinion, we either need a loop with a step, or a special function, but I do not know which function exactly :/
library(xml2)        # read_html(), xml_find_all(), xml_text()
library(data.table)
stage1 <- read_html("1973.html")
stage2 <- xml_find_all(stage1, ".//tr")
xml_text(stage2)
stage3 <- xml_text(xml_find_all(stage2, ".//td"))
stage3
DT <- data.table(stage3, keep.rownames = TRUE, check.names = TRUE, key = NULL,
                 stringsAsFactors = TRUE)
for (i in seq(from = 1, to = 1375, by = 11)){
  if (is.numeric(DT[i, stage3]) == FALSE){
    DT$Name <- DT[i, stage3]
  }
}
https://pp.userapi.com/c845220/v845220632/1678a5/IRykEniYiiA.jpg
This is an example of the first 20 of the 1375 rows, and shows how the data.table looks now. What I need is to separate these results into the columns "Name" (e.g. Sergei Konyagin), "Country" (e.g. USSR), the scores for problems 1-8 (8 columns, respectively), and the medal. The loop I have written is, I think, something that should step through the existing column in increments of 11 (since every name, country, etc. repeats every 11 rows) and transfer each value into a new column. Unfortunately, it doesn't work :/
Thanks in advance for your help!
Give this a shot.
First, load the required packages:
library (data.table)
library (stringr) # this is just for the piping operator %>%
You would read in your own data table here; I am creating one as an example:
dat = c( "Sergey","USSR",1,2,3,4,5,6,7,8,"silver") %>% rep (125) %>% data.table
setnames (dat, "stage3")
As a quick note, I would not read your strings in as factors as you do in your own code, because that can screw up the conversion to numeric.
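To see why: calling as.numeric() directly on a factor returns the underlying level codes rather than the original values, so you would need as.numeric(as.character(x)) instead:
x <- factor(c("10", "7", "3"))
as.numeric(x)                # 1 3 2 -- the level codes, not the values
as.numeric(as.character(x))  # 10 7 3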
This will repeat itself to fill out the table. This only works if your table doesn't skip values. Also, it's not advisable to have column names that are numbers; better to give them proper names like "test1", "test2", etc. (see the rename sketch after the reshape below):
dat [, metadata := c ("name","country",1:8,"medal") ] # whatever you want to name your future 11 columns
dat [, participant := 1: (.N / 11) %>% rep (each = 11) ] # same idea, can't have missing rows
Now, reshape and convert from strings to numeric where possible:
new.dat =
dcast (dat, participant ~ metadata, value.var = "stage3") [, lapply (.SD, type.convert) ]
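If you then want friendlier names than the bare 1-8 columns, a quick rename along these lines should work; the test1 ... test8 names are just placeholders:
# rename the numeric score columns to something more descriptive
setnames (new.dat, as.character (1:8), paste0 ("test", 1:8))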

Finding the closest character string in a second data frame in R

I have quite a big data.frame with out-of-date names, and I want to get the correct names, which are stored in another data.frame.
I am using the stringdist function to find the closest match between the two columns, and then I want to put the new names into the original data.frame.
I am using code based on the sapply function, as in the following example:
dat1 <- data.frame("name" = paste0("abc", 1:5),
                   "value" = round(rnorm(5), 1))
dat2 <- data.frame("name" = paste0("abd", 1:5),
                   "other_info" = 11:15)
dat1$name2 <- sapply(dat1$name, function(x) {
  char_min <- stringdist::stringdist(x, dat2$name)
  dat2[which.min(char_min), "name"]
})
dat1
However, this code is too slow considering the size of my data.frame.
Is there a more optimized alternative solution, using for example data.table R package?
First convert the data frames into data tables:
library(data.table)
dat1 <- data.table(dat1)
dat2 <- data.table(dat2)
Then use ":=" and stringdist's amatch() to create a new column with the closest matching name:
dat1[,name2 := dat2[stringdist::amatch(name, dat2$name)]$name]
This should be much faster than the sapply function. Hope this helps!
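One thing to watch for: amatch only returns a match whose distance is within its maxDist argument, and the default is small, so to mimic the sapply code (which always takes the nearest name, however far away) you may need to raise that limit; the value 10 below is just an example threshold:
# allow more distant matches so every name gets its nearest counterpart
dat1[, name2 := dat2[stringdist::amatch(name, dat2$name, maxDist = 10)]$name]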

In R, how can I check to see what values of two different vectors of different lengths are the same?

So, I have two different data frames, and they have different numbers of columns. I was just wondering: is there an easy way to check which column names are the same for both data frames when the two sets of column names have different lengths? I am sure I can do this using for loops and if statements, but I just want to know if there are any built-in R commands that can make this easier for me.
Thanks
Given
a <- (1:10)
b <- (11:20)
c <- (21:30)
df1 <- data.frame(a,b)
df2 <- data.frame(a,c)
You can use intersect
> intersect(names(df1), names(df2))
or you can check which columns of df1 have a match (by their contents) in df2:
> df1 %in% df2
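If you also want the column names that appear in one data frame but not the other, setdiff gives you that directly (the direction matters, so run it both ways if needed):
> setdiff(names(df1), names(df2)) # names only in df1
> setdiff(names(df2), names(df1)) # names only in df2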

How to find out erroneous values in one column based on values in another column in R?

I have two columns of data (say id and master_id) in R. It should be the case that all the values in id are present in master_id. But I suspect that is not the case, and I want to identify the erroneous values. I cannot just inspect the data by eye, as I am dealing with data on the order of 100k rows.
How do I go about finding the erroneous values?
The %in% operator may come in handy. It returns FALSE for those cases that are in the first set but not in the second.
E.g.
DF$id %in% DF$master_id
id should be a subset of master_id, so id values without a counterpart in master_id will get FALSE
or, to see how it works, run this (from the R help file):
1:10 %in% c(1,3,5,9)
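To pull out the offending values themselves rather than a logical vector, negate the test and subset; this assumes both columns sit in the same data frame DF, as above:
bad_ids <- DF$id[!DF$id %in% DF$master_id] # id values with no counterpart in master_id
# or, for just the unique offenders
setdiff(DF$id, DF$master_id)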
Here's an answer from 2 days ago:
library(data.table)
DF1<-data.frame(x=1:3,y=4:6,t=10:12)
DF2<-data.frame(x=3:5,y=6:8,s=1:3)
DF1 <- data.table(DF1, key = c("x", "y"))
DF2 <- data.table(DF2, key = c("x", "y"))
DF1[!DF2] # maybe you want this?
DF2[!DF1] # or maybe you want this?
