Merge two tables together in R

Currently I am trying to merge two tables in R. Both of them have different contents and no ID column; the columns are just numbered by R.
My problem is that I cannot add the columns and their values from table 2 to table 1. I should mention that both of them have the same number of rows: table 1 has 1000 rows and so does table 2. I also cannot add an ID field, otherwise later steps of my code will not run.
Basically my tables look like this:
I would really appreciate it, if someone can help me.

The simplest (and perhaps most naive) way is to use cbind to combine the two tables, as long as the number of rows in each table is equal.
library(tibble)  # tribble() lives in the tibble package
x <- tribble(~Value1, ~Value2, ~Value3,
             "a",  "b",  "c",
             "aa", "bb", "cc")
y <- tribble(~Value4, ~Value5, ~Value6,
             "d",  "e",  "f",
             "dd", "ee", "ff")
cbind(x, y)
Output becomes
  Value1 Value2 Value3 Value4 Value5 Value6
1      a      b      c      d      e      f
2     aa     bb     cc     dd     ee     ff
Since the two tables are (I assume) mutually exclusive, there is no way to meaningfully join them without a shared key to relate them. If you instead call merge on them in R, it will return a dataframe with every unique combination of rows from the two tables, i.e. the Cartesian product. This means that, with 1000 rows in each, you may end up with a 1000 * 1000 = 1,000,000-row dataframe.
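The difference is easy to see on a pair of tiny, hypothetical two-row tables standing in for the 1000-row ones:

```r
x <- data.frame(Value1 = c("a", "aa"), Value2 = c("b", "bb"))
y <- data.frame(Value3 = c("c", "cc"), Value4 = c("d", "dd"))

nrow(cbind(x, y))  # pairs rows positionally: 2 rows
nrow(merge(x, y))  # no common columns, so Cartesian product: 2 * 2 = 4 rows
```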

This will reproduce your example
Value1 <- c("a", "aa")
Value2 <- c("b", "bb")
Value3 <- c("c", "cc")
Value4 <- c("d", "dd")
Value5 <- c("e", "ee")
Value6 <- c("f", "ff")
table1 <- data.frame(Value1, Value2, Value3)
table2 <- data.frame(Value4, Value5, Value6)
Result <- cbind(table1, table2)

Related

Count co-occurences from a dataframe with ids

I realize there are a lot of similar questions but they all tackle a slightly different problem and I have been stuck for a while.
I have a dataframe of all unique combinations of 2 variables as follows:
df = data.frame(id = c('c1','c2','c3','c2','c3','c1','c3'),
groupid = c('g1','g1','g1','g2','g2','g3','g3'))
And I need the following output:
c1 c2 c3
c1 3 1 2
c2 1 3 2
c3 2 2 3
In other words I need to count how often each pair of customer ids occur in the same group.
Seems like a basic question, but I can't figure it out. I tried:
making a cross join to find all possible combinations of (cid1,groupid,cid2)
looping through all of them and retrieving unique groups that match cid1 and unique groups that match cid2
taking the length of the intersection
..but this would take forever to run, so I am looking for an efficient and preferably clean solution (using tidyr/dplyr).
We may use crossprod after getting the frequency count with table on the two columns:
crossprod(table(df[2:1]))
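To unpack the one-liner: `table(df[2:1])` builds a group-by-id incidence matrix, and crossprod computes t(X) %*% X, which counts, for each pair of ids, how many groups they share. A minimal sketch with the question's data (note that the diagonal gives the number of groups each id belongs to):

```r
df <- data.frame(id = c('c1','c2','c3','c2','c3','c1','c3'),
                 groupid = c('g1','g1','g1','g2','g2','g3','g3'))

incidence <- table(df[2:1])   # rows = groupid, columns = id
co <- crossprod(incidence)    # t(incidence) %*% incidence: shared-group counts
co["c1", "c3"]  # c1 and c3 are both in g1 and g3, so this is 2
```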

Filter data (data.frame) according an attribute and assign it to a vector

Good afternoon,
I have a large data frame (20000+ rows) with three columns: two columns are the x-y coordinates of a point and the third one indicates an important attribute of that point (100+ different attributes in total).
I would like to filter the data for each attribute, so basically classify the points according to each attribute. The part that makes it difficult for me is the 100+ attributes, as it then needs to be done in a loop (e.g. a for loop).
#data looks like this:
x y att
1 1 a
2 3 a
4 6 a
3 5 b
5 5 b
4 1 c
etc.
Notice that the attributes don't all have the same number of points...
Thank you very much, any suggestion would help.
Are you saying you'd like to split your data frame into separate data frames, one for each attribute? That can be done with:
library(dplyr)
data %>%
  group_by(att) %>%
  group_split()
We can use split like below:
split(df, df$att)
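For the sample data in the question (reconstructed here as a small df, since the post only shows it printed), split returns a named list of data frames, one per attribute level:

```r
df <- data.frame(x   = c(1, 2, 4, 3, 5, 4),
                 y   = c(1, 3, 6, 5, 5, 1),
                 att = c("a", "a", "a", "b", "b", "c"))

parts <- split(df, df$att)  # named list: parts$a, parts$b, parts$c
nrow(parts$a)  # 3 points carry attribute "a"
```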

remove duplicates based on a+b logic applied to 2 non-numeric columns

This may be a failure to know the right keywords to search for, but I'm looking for a way to remove duplicates based on an order reversal between two non-numeric columns. Here is a very small subset of my data:
ANIMAL1<-c("20074674_K.v1","20085105_K.v1","20085638_K.v1","20085646_K.v1")
ANIMAL2<-c("20085105_K.v1","20074674_K.v1","20074674_K.v1","20074674_K.v1")
exclusions<-c(13,13,5,10)
data<-data.frame(ANIMAL1,ANIMAL2,exclusions)
ANIMAL1 ANIMAL2 exclusions
1 20074674_K.v1 20085105_K.v1 13
2 20085105_K.v1 20074674_K.v1 13
3 20085638_K.v1 20074674_K.v1 5
4 20085646_K.v1 20074674_K.v1 10
The first and second rows are duplicate comparisons; the order of the animals is just reversed between the first two columns. It doesn't matter which one is deleted, but I want to delete one of the duplicates, and all the rest of the duplicates that fit this logic in my larger dataframe. I'm used to subsetting according to the logic in questions like "Remove duplicate column pairs, sort rows based on 2 columns" and the other posts that come up when searching "remove duplicates based on 2 columns", but I haven't yet found anything that matches my use case. Here is what I would like my data to look like after the duplicate removal:
ANIMAL1 ANIMAL2 exclusions
1 20085105_K.v1 20074674_K.v1 13
2 20085638_K.v1 20074674_K.v1 5
3 20085646_K.v1 20074674_K.v1 10
Thanks much!
data[!duplicated(t(apply(data, 1, sort))), ]
Sorting within each row makes each row's combination of ANIMAL1 and ANIMAL2 identical regardless of which column each animal is in. (The exclusions value gets sorted into the row as well, but for this data that does not change the result.)
Since apply works row-wise, its output has to be transposed back so the rows line up with the original data set.
duplicated then flags the repeated rows so they can be stripped out.
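An alternative sketch, assuming the animal columns are character vectors: build a canonical key from just the two animal columns with pmin/pmax, so the exclusions column is left out of the duplicate test entirely:

```r
ANIMAL1 <- c("20074674_K.v1", "20085105_K.v1", "20085638_K.v1", "20085646_K.v1")
ANIMAL2 <- c("20085105_K.v1", "20074674_K.v1", "20074674_K.v1", "20074674_K.v1")
exclusions <- c(13, 13, 5, 10)
data <- data.frame(ANIMAL1, ANIMAL2, exclusions, stringsAsFactors = FALSE)

# canonical form of each pair: smaller id first, regardless of column order
key <- paste(pmin(data$ANIMAL1, data$ANIMAL2),
             pmax(data$ANIMAL1, data$ANIMAL2))
deduped <- data[!duplicated(key), ]
nrow(deduped)  # rows 1 and 2 are the same pair, so 3 rows remain
```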

Concatenating column values of identical rows (except for this column) of two different data tables

I have two data tables of genes aggHuman and aggRat
> aggHuman
Human Rat RNAtype
1 ASAP2 Asap2 Hy
2 BBS1 Bbs1 Hn
3 BBS2 Bbs2 Hn
4 SPATA22 Spata22 Hn
and
> aggRat
Human Rat RNAtype
1 ASAP2 Asap2 Rn
2 BBS1 Bbs1 Ry
3 BBS2 Bbs2 Rn
4 SPATA22 Spata22 Rn
Now, I want to stitch together the values in column RNAtype of these two tables. For example, in aggHuman ASAP2 has Hy, whereas in aggRat it has Rn, so I want to make another similar table of the following form by stitching them into HyRn.
Human Rat RNAtype
1 ASAP2 Asap2 HyRn
But the initial two tables can list the genes in a different order. So what I need to do is find the row corresponding to ASAP2 in aggHuman, find the same gene's row in aggRat, and then do the stitching.
Could anyone help me on how to do this?
Try this:
Step 1: Load data.table library (you may need to install this):
library(data.table)
Step 2: Convert your data.frame to data.table, and set appropriate keys:
setDT(aggHuman)
setkey(aggHuman,Human,Rat)
setDT(aggRat)
setkey(aggRat,Human,Rat)
Step 3: Join the two data tables and combine the RNAtype columns (in current data.table versions, the clashing column from aggRat comes through as i.RNAtype rather than RNAtype.1):
aggHumanRat <- aggHuman[aggRat]
aggHumanRat[, RNAtype := paste0(RNAtype, i.RNAtype)][, i.RNAtype := NULL]
aggHumanRat
Hope this helps!!
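The same result can be sketched in base R by matching the genes with merge, which aligns the rows even when the two tables list them in different orders (merge adds .x/.y suffixes to the clashing RNAtype columns):

```r
aggHuman <- data.frame(Human = c("ASAP2", "BBS1", "BBS2", "SPATA22"),
                       Rat   = c("Asap2", "Bbs1", "Bbs2", "Spata22"),
                       RNAtype = c("Hy", "Hn", "Hn", "Hn"),
                       stringsAsFactors = FALSE)
aggRat <- data.frame(Human = c("ASAP2", "BBS1", "BBS2", "SPATA22"),
                     Rat   = c("Asap2", "Bbs1", "Bbs2", "Spata22"),
                     RNAtype = c("Rn", "Ry", "Rn", "Rn"),
                     stringsAsFactors = FALSE)

m <- merge(aggHuman, aggRat, by = c("Human", "Rat"))  # match rows by gene
m$RNAtype <- paste0(m$RNAtype.x, m$RNAtype.y)         # e.g. "Hy" + "Rn" -> "HyRn"
m <- m[, c("Human", "Rat", "RNAtype")]
```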

Merging databases in R on multiple conditions with missing values (NAs) spread throughout

I am trying to build a database in R from multiple CSVs. There are NAs spread throughout each CSV, and I want to build a master list that summarizes all of them in a single database. Here is some quick code that illustrates my problem (most CSVs actually have thousands of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape (melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic, but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below goes through all the variables and pulls as much information as possible back into the dataset. Once all the data is there, running it one last time on the first variable gives you the result.
# combine all into one data frame
require(gtools)
d <- smartbind(d1, d2, d3)

# function to get the first non-NA result
getfirstnonna <- function(x){
  ret <- head(x[which(!is.na(x))], 1)
  # head() on an all-NA column returns a zero-length vector, not NULL,
  # so test length rather than is.null
  if (length(ret) == 0) NA else ret
}

# function to get max info based on one variable
runiteration <- function(dataset, variable){
  require(plyr)
  e <- ddply(.data = dataset, .variables = variable,
             .fun = function(x){apply(X = x, MARGIN = 2, FUN = getfirstnonna)})
  # drop the rows where the grouping variable itself was NA
  e[which(!is.na(e[, variable])), ]
}

# run through all variables
for(i in seq_along(names(d))){
  d <- rbind(d, runiteration(d, names(d)[i]))
}
# repeat the first variable, since all possible info should now be in the dataset
d <- runiteration(d, names(d)[1])
If id, species, etc. differ between the separate datasets, this will return whichever non-NA value comes first, so changing the row order in d or the variable order can affect the result. Changing the getfirstnonna function alters this behaviour (tail would pick the last value; you could even collect all possibilities). You could also order the dataset from the most complete records to the least.
