Compare two columns in different tables - R

Say I have two tables, Table A and Table B, and I want to compare a certain column.
For example,
Table A has the columns:
Name, Surname, Family, Species
Table B has the columns:
IP, Genes, Types, Species, Models
How do I compare the Species column between the two tables to get the matches? That is, I want to extract the names of the species that exist in both tables.
For example, if the first Species column contains
a b c d e f g h i
and the second Species column contains
k l m n a b y i l
I want this result:
a b i
Could you please tell me how to do that, and also whether there is any way to do it without using a join?
Thank you very much

Try any of these options. I have used dummy data:
#Data
TableA <- data.frame(Species = c('a','b','c','d','e','f','g','h','i'),
                     Var = 1, stringsAsFactors = FALSE)
TableB <- data.frame(Species = c('k','l','m','n','a','b','y','i','l'),
                     Var2 = 2, stringsAsFactors = FALSE)
#Option 1
TableA$Species[TableA$Species %in% TableB$Species]
#Option 2
intersect(TableA$Species, TableB$Species)
In both cases the output will be:
[1] "a" "b" "i"

Related

R: Is there a way to get unique, closest matches for rows in the same data.table based on multiple columns?

In R, I want to get unique, closest matches for the rows of a data.table, which are identified by unique ids, based on the values in two columns. Here I provide a toy example and the code I'm using to achieve this.
library(data.table)

dt <- data.table(id = letters,
                 value_1 = as.integer(runif(26, 1, 20)),
                 value_2 = as.integer(runif(26, 1, 10)))

pairs <- data.table()
while (nrow(dt) >= 2) {
  k <- dt[1]
  m <- dt[-1]
  t <- m[k, roll = "nearest", on = .(value_1, value_2)]
  pairs <- rbind(pairs, t)
  dt <- dt[!dt$id %in% pairs$id & !dt$id %in% pairs$i.id]
}
pairs <- pairs[, -c(2, 3)]
This gives me a data.table with the matched ids and the ones that do not get any matches.
   id i.id
1  NA    a
2  NA    b
3   m    c
4   v    d
5   y    e
6   i    f
...
Is there a way to do this without the loop? I intend to run this on a data.table with more than 20 million observations, and clearly the loop is extremely inefficient. I was wondering whether the rolling join could be run on a copy of the main data.table with an exclusion condition, so as not to match ids with themselves. Maybe something like this:
m <- dt
t <- m[dt, roll = "nearest", on = .(value_1, value_2)]
Without such an exclusion, this command merely matches each id with itself. It also does not ensure unique matches.
Thanks!
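For readers new to rolling joins: roll = "nearest" applies only to the last column listed in on, after exact matching on the earlier columns. A minimal illustration with toy data (the names x, i, g, and v are made up for this sketch):

library(data.table)
x <- data.table(g = 1, v = c(10, 20, 30), id = c("p", "q", "r"))
i <- data.table(g = 1, v = 23)
# exact match on g, nearest match on v: 20 is closest to 23, so row "q" wins
x[i, roll = "nearest", on = .(g, v)]
#    g  v id
# 1: 1 23  q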

Incorrect output from inner_join of dplyr package

I have two datasets, named "results" and "support2", available here.
I want to merge the two datasets by the only common column name "SNP". Code below:
> library(dplyr)
> library(readr)
> results <- read_delim("<path>\\results", delim = "\t", col_names = TRUE)
> support2 <- read_delim("<path>\\support2", delim = "\t", col_names = TRUE)
> head(results)
# A tibble: 6 x 2
SNP p.value
<chr> <dbl>
1 rs28436661 0.334
2 rs9922067 0.322
3 rs2562132 0.848
4 rs3930588 0.332
5 rs2562137 0.323
6 rs3848343 0.363
> head(support2)
# A tibble: 6 x 2
SNP position
<chr> <dbl>
1 rs62028702 60054
2 rs190434815 60085
3 rs62028703 60087
4 rs62028704 60095
5 rs181534180 60164
6 rs186233776 60177
> dim(results)
[1] 188242 2
> dim(support2)
[1] 1210619 2
# determine the number of common SNPs
length(Reduce(intersect, list(results$SNP, support2$SNP)))
[1] 187613
I would expect that after inner_join, the new data would have 187613 rows.
> newdata <- inner_join(results, support2)
Joining, by = "SNP"
> dim(newdata)
[1] 1409812 3
Strangely, instead of having 187613 rows, the new data has 1409812 rows, which is even larger than the sum of the row counts of the two data frames.
I switched to the merge function as below:
> newdata2 <- merge(results, support2)
> dim(newdata2)
[1] 1409812 3
This second new data frame has the same issue, and I have no idea why.
I would like to know how to obtain a new data frame whose rows are the common rows of the two data frames (it should have 187613 rows) and whose columns include the columns of both.
It could be a result of duplicated elements:
results <- data.frame(col1 = rep(letters[1:3], each = 3), col2 = rnorm(9))
support2 <- data.frame(col1 = rep(letters[1:5], each = 2), newcol = runif(10))
library(dplyr)
out <- inner_join(results, support2)
nrow(out)
#[1] 18
Here, the values of the common column ('col1') are duplicated in both datasets, which confuses the join as to which row it should take as a match, resulting in a situation similar to a cross join (but not exactly that).
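A quick way to confirm this diagnosis on the real data is to count duplicated keys in each table (a sketch, using the SNP column from the question):

sum(duplicated(results$SNP))   # repeated SNP ids in results
sum(duplicated(support2$SNP))  # repeated SNP ids in support2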
As already pointed out by @akrun, the data may have duplicates; that is probably the only explanation of this behavior.
Per the documentation, intersect always returns unique values, whereas an inner join can produce duplicates if the "by" value has duplicates; hence the count mismatch.
If you truly want to verify this, check the number of unique values of the "by" variable (your unique key); it should match your intersect result. But that does not mean your join/merge is right: ideally, any join that has duplicates in both table A and table B is not recommended (unless, of course, you have a business or other justification). So check whether the duplicates are present in both tables or only in one of them. If they are found in only one of the tables, your merge/join is probably fine. I hope I have been able to explain the scenario.
Please let me know if this doesn't answer your question, and I shall remove it.
From the documentation:
intersect:
Each of union, intersect, setdiff and setequal will discard any
duplicated values in the arguments, and they apply as.vector to their
arguments
inner_join():
return all rows from x where there are matching values in y, and all
columns from x and y. If there are multiple matches between x and y,
all combination of the matches are returned.
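If the goal is one row per common SNP, deduplicating on the key before joining reproduces the intersect count; a minimal sketch using dplyr's distinct (which row survives for each duplicated SNP is an arbitrary choice here):

library(dplyr)
newdata <- results %>%
  distinct(SNP, .keep_all = TRUE) %>%
  inner_join(distinct(support2, SNP, .keep_all = TRUE), by = "SNP")
nrow(newdata)  # equals length(intersect(results$SNP, support2$SNP))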

R: Track Changes Across Two Columns

I have a data frame which records the changes in the names of companies. A simple representation would be:
df <- data.frame(key = c("A", "B", "C", "E", "F", "G"),
                 Change = c("B", "C", "D", "F", "G", "H"))
print(df)
  key Change
1   A      B
2   B      C
3   C      D
4   E      F
5   F      G
6   G      H
I want to track all the changes a value goes through. Here is an output that would help me do so:
  Key 1st 2nd 3rd 4th
1   A   B   C   D
2   E   F   G   H
How can I do it in R? I am new to R and programming, so it would be great to get help.
The question was marked as a duplicate of "How to reshape data from long to wide format?"
However, it is not an exact duplicate, for these reasons:
1. The example used here contains data changing across columns. That is not the case in the reshaping question; here, the two columns depend on each other.
2. Before reshaping, I reckon there is another step: perhaps assigning an id to the changes taking place. I am not sure how to do that.
Could you help me?
Can we assume that the same name never reappears (i.e., nothing like A->B->C together with D->E->A)? If so, you can do the following.
df <- data.frame(key = c("A", "B", "C", "E", "F", "G"),
                 Change = c("B", "C", "D", "F", "G", "H"))
print(df)

# mapping from old name to new name
next_name <- as.character(df$Change)
names(next_name) <- df$key

all_names <- unique(c(as.character(df$key), as.character(df$Change)))

get_id <- function(x) {
  # for each name, repeatedly traverse the mapping until the final name
  ss <- x %in% names(next_name)
  if (any(ss)) {
    x[ss] <- get_id(next_name[x[ss]])
  }
  x
}
ids <- get_id(all_names)

lapply(unique(ids), function(i) all_names[ids == i])
# the outcome is a list of company names;
# each entry represents the history of one firm
##[[1]]
##[1] "A" "B" "C" "D"
##[[2]]
##[1] "E" "F" "G" "H"
The outcome is a list rather than a data frame, since the name sequences may have different lengths (firms may have gone through different numbers of names).
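If the wide shape from the question is still wanted, the list entries can be padded with NA and stacked; a minimal sketch building on ids and all_names from the answer above (the names histories, max_len, and wide, plus the ChangeN column labels, are made up for this sketch):

histories <- lapply(unique(ids), function(i) all_names[ids == i])
max_len <- max(lengths(histories))
# pad each history with NA to a common length, then stack row-wise
wide <- do.call(rbind, lapply(histories, function(h) {
  c(h, rep(NA, max_len - length(h)))
}))
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
names(wide) <- c("Key", paste0("Change", seq_len(max_len - 1)))
wide
#   Key Change1 Change2 Change3
# 1   A       B       C       D
# 2   E       F       G       H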

Splitting a data frame based on whether rows are numeric or not in R

I have a data frame (let's call it 'df') consisting of two columns:
Name Contact
A    34552325
B    423424
C    4324234242
D    hello1#company.com
I want to split the data frame into two data frames based on whether a row of the "Contact" column is numeric or not.
Expected Output:
Name Contact
A    34552325
B    423424
C    4324234242
and
Name Contact
D    hello1#company.com
I tried using:
df$IsNum <- !(is.na(as.numeric(df$Contact)))
But this also classified "hello1#company.com" as numeric (likely because Contact was read as a factor: as.numeric on a factor returns the underlying integer codes rather than NA).
Basically, if a value in the "Contact" column contains even a single non-numeric character, the code must classify it as non-numeric.
You may use grepl:
x <- " Name Contact
A 34552325
B 423424
C 4324234242
D hello1#company.com"
df <- read.table(text=x, header = T)
x <- df[grepl("^\\d+$",df$Contact),]
y <- df[!grepl("^\\d+$",df$Contact),]
x
# Name Contact
# 1 A 34552325
# 2 B 423424
# 3 C 4324234242
y
# Name Contact
# 4 D hello1#company.com
We can create a grouping variable with grepl (the same pattern @Avinash Raj used) and split the data frame with it to create a list of data frames:
split(df, grepl('^\\d+$', df$Contact))
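The result is a named list keyed by the grouping value; a short usage sketch, assuming df as defined above (the name out is made up):

out <- split(df, grepl('^\\d+$', df$Contact))
out[["TRUE"]]   # the all-numeric rows (A, B, C)
out[["FALSE"]]  # the non-numeric rows (D)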

How to find the rows of a table whose specific columns contain values from another table in R

I have a vector a with these elements:
a <- c("ENSMUST00000000094", "ENSMUST00000000137", "ENSMUST00000000305",
       "ENSMUST00000000349", "ENSMUST00000000356", "ENSMUST00000000384",
       "ENSMUST00000000430", "ENSMUST00000000449")
and a data.frame b that contains elements of a in some of its rows.
"b" is a data.frame with 2 columns:
gene <- c("ENSMUSG00000026427(Lgtn)", "ENSMUSG00000026427(Lgtn)",
          "ENSMUSG00000026427(Lgtn)", "ENSMUSG00000055184(Fam72a)",
          "ENSMUSG00000013275(Slc41a1)")
and
transcripts <- c("ENSMUST00000112446 ENSMUST00000149119 ENSMUST00000151874 ENSMUST00000068791 ENSMUST00000068805 ENSMUST00000131855 ENSMUST00000153651 ENSMUST00000086578",
                 "ENSMUST00000149119 ENSMUST00000151874 ENSMUST00000068791 ENSMUST00000068805 ENSMUST00000131855 ENSMUST00000086578",
                 "ENSMUST00000151874 ENSMUST00000068791 ENSMUST00000131855 ENSMUST00000086578",
                 "ENSMUST00000068613 ENSMUSG00000052688(5430435G22Rik) ENSMUST00000064679",
                 "ENSMUST00000086559")
b <- data.frame(gene, transcripts, stringsAsFactors = FALSE)
I want to find the rows in "b" whose transcripts column contains one of the elements of "a".
Thanks a lot for your help.
Just convert a to a data frame and merge:
library(dplyr)
data_frame(transcripts = a) %>%
  left_join(b)
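Note that this join only matches whole strings, while the transcripts column holds space-separated lists of ids, so an exact join will find nothing unless a value of a is an entire list. A pattern-based filter may be closer to the stated goal; a minimal sketch, assuming a and b as defined above:

# keep the rows of b whose transcripts string mentions any element of a
b[grepl(paste(a, collapse = "|"), b$transcripts), ]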
