Count co-occurrences from a dataframe with ids - r

I realize there are a lot of similar questions but they all tackle a slightly different problem and I have been stuck for a while.
I have a dataframe of all unique combinations of 2 variables as follows:
df = data.frame(id = c('c1','c2','c3','c2','c3','c1','c3'),
groupid = c('g1','g1','g1','g2','g2','g3','g3'))
And I need the following output:
c1 c2 c3
c1 2 1 2
c2 1 2 2
c3 2 2 3
In other words I need to count how often each pair of customer ids occur in the same group.
Seems like a basic question, but I can't figure it out. I tried:
- making a cross join to find all possible combinations of (cid1, groupid, cid2)
- looping through all of them and retrieving the unique groups that match cid1 and the unique groups that match cid2
- taking the length of the intersection
...but this would take forever to run, so I am looking for an efficient and preferably clean solution (using tidyr/dplyr).

We may use crossprod after getting the frequency count with table on the two columns:
crossprod(table(df[2:1]))
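For completeness, a minimal run of that one-liner on the example data. Note that with this data the diagonal counts how many groups each id belongs to, while the off-diagonal entries count shared groups:

```r
df <- data.frame(id = c('c1','c2','c3','c2','c3','c1','c3'),
                 groupid = c('g1','g1','g1','g2','g2','g3','g3'))

# table(df[2:1]) builds a groupid x id incidence matrix;
# crossprod() then computes t(m) %*% m, i.e. pairwise co-occurrence counts
m <- crossprod(table(df[2:1]))
m
#    c1 c2 c3
# c1  2  1  2
# c2  1  2  2
# c3  2  2  3
```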


Selecting rows based on grepl results in multiple columns in R

I have data (df) like this with 50 diagnosis codes (dx.1 through dx.50) per patient:
ID dx.1   dx.2   ... dx.50
1  150200 140650 ... 250400
2  752802 851812 ... NA
3  441402 450220 ... NA
4  853406 853200 ... 150404
5  250604 NA     ... NA
I would like to select the rows that have any of the diagnosis codes starting with "250". So in the example, it would be ID 1 and 5.
After stumbling around for a while, I finally came up with this:
df$select = rowSums(sapply(df[,2:ncol(df)], function(x) grepl("\\<250", x)))
selected = df[df$select>0,]
It's kind of clunky and takes a while since I'm running it on several thousand rows.
Is there a better/faster way to do this?
Is there an easy way to extend this to multiple search criteria?
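One vectorized sketch in base R (toy data shortened from the question; treat the exact column layout as an assumption). Building a single anchored pattern from a vector of prefixes and reducing the per-column matches with | avoids the rowSums detour and also answers the multiple-criteria part:

```r
df <- data.frame(ID    = 1:5,
                 dx.1  = c("150200","752802","441402","853406","250604"),
                 dx.2  = c("140650","851812","450220","853200",NA),
                 dx.50 = c("250400",NA,NA,"150404",NA),
                 stringsAsFactors = FALSE)

prefixes <- c("250")                 # extend with more prefixes, e.g. c("250","441")
pattern  <- paste0("^(", paste(prefixes, collapse = "|"), ")")

# grepl() returns FALSE for NA, so missing codes are handled for free
hit      <- Reduce(`|`, lapply(df[-1], grepl, pattern = pattern))
selected <- df[hit, ]
selected$ID
# [1] 1 5
```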

Merge two tables together in R

Currently I am trying to merge two tables in R. Both of them have different contents and no ID column; the rows are just numbered by R.
My problem is that I cannot add the columns and their values from table 2 to table 1. Both tables have the same number of rows: table 1 has 1000 rows and so does table 2. I also cannot add an ID field, because otherwise it is not possible to run further steps of my code.
Basically my tables look like this:
I would really appreciate it, if someone can help me.
The simplest (and perhaps naive) way is to use cbind to combine the two tables, as long as the number of rows in the two tables is equal.
library(tibble)
x <- tribble(~Value1, ~Value2, ~Value3,
             "a",  "b",  "c",
             "aa", "bb", "cc")
y <- tribble(~Value4, ~Value5, ~Value6,
             "d",  "e",  "f",
             "dd", "ee", "ff")
cbind(x, y)
Output becomes
  Value1 Value2 Value3 Value4 Value5 Value6
1      a      b      c      d      e      f
2     aa     bb     cc     dd     ee     ff
Since the two tables (I assume) share no common columns, there is no way to meaningfully join them if you don't have relations to work with. If you merge them in R, it will return a dataframe containing every combination of rows from the two tables. This means that, if you have 1000 rows in each, you may end up with a 1000*1000 = 1,000,000-row dataframe.
This will reproduce your example
Value1=c("a","aa")
Value2=c("b","bb")
Value3=c("c","cc")
Value4=c("d","dd")
Value5=c("e","ee")
Value6=c("f","ff")
table1=data.frame(Value1,Value2,Value3)
table2=data.frame(Value4,Value5,Value6)
Result=cbind(table1,table2)
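To see that caveat in numbers, a tiny sketch contrasting merge() (cross product when there are no shared columns) with cbind() (positional):

```r
x <- data.frame(Value1 = c("a", "aa"))
y <- data.frame(Value2 = c("b", "bb"))

# no common column, so merge() forms every row combination
nrow(merge(x, y))  # 4
# cbind() simply places the rows side by side
nrow(cbind(x, y))  # 2
```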

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth"
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly, e.g.
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
AnimalId is column 1, and LactationNo is column 2. I can't figure out how to select only those AnimalId values that have both LactationNo 1 and 2.
Have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!
If I understand your question correctly, then your dataset looks something like this:
AnimalId LactationNo
1 A 1
2 B 2
3 E 2
4 A 2
5 E 2
and you'd like to select animals that happen to have both lactation numbers 1 & 2 (like A in this particular example). If that's the case, then you can simply use merge:
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
totaldata[totaldata$LactationNo == 2,],
by.x="AnimalId",
by.y="AnimalId")[,"AnimalId"]
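If you need the full rows (with all 16 extra columns preserved) rather than just the ids, one base-R sketch with toy data (column names assumed from the question) keeps every row whose AnimalId shows up under both lactation numbers:

```r
totaldata <- data.frame(AnimalId    = c("A","B","E","A","E"),
                        LactationNo = c(1, 2, 2, 2, 2))

# ids present with LactationNo 1 AND LactationNo 2
both <- intersect(totaldata$AnimalId[totaldata$LactationNo == 1],
                  totaldata$AnimalId[totaldata$LactationNo == 2])

# keep all rows (and all columns) for those animals
lactboth <- totaldata[totaldata$AnimalId %in% both, ]
lactboth
#   AnimalId LactationNo
# 1        A           1
# 4        A           2
```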

finding "almost" duplicates indices in a data table and calculate the delta

I have a smallish (2k) data set that contains questionnaire answers filled out by students who were sampled twice a year. Not all the students who were present for the first wave were there for the second wave, and vice versa. For each student, a unique id was created that consists of the school code, the class code, the student number, and the wave as a decimal point. For example, 100612.1 is a student from school 10, grade 6, number 12 on the names list, surveyed in the first wave. The idea behind the decimal point was to have a way to identify the same student again in the data set (the only value that differs by less than abs(1) from a given id is the same student in the other wave). At least, that was the idea.
i was thinking of a script that would do the following:
- find the rows whose unique id is less than abs(1) from one another
- for those rows, generate a new row (in a new table) that consists of the student id and the delta of the measured variables (i.e. value in wave 2 - value in wave 1).
I am new to R, but I have a tiny bit of background in other OOP languages. I thought about creating a for loop that runs from 1 to nrow(df) and just looks for each row's "brother". My gut feeling tells me that this is not the way things are done in R. Any ideas?
All I need is a quick way of sifting through the data looking for the second-wave row. I think the rest should be straightforward from there.
Thank you for helping.
PS: Since this is my first post here, I apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt @jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before, now instead of data.frame and tapply you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
surveyDT[, delta := if (.N > 1L) diff(answers) else NA_real_, by = child] # NA for children seen in only one wave
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example,
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
There are two ways that come to mind. The easiest is to use the function floor(), which returns the integer part of a number. For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regular expression to strip the decimal part too. Then you can use unique() to find the rows that are or are not duplicated entries.
Let's make some fake data so we can see our problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now let's split our ids into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave, and compute differences based on child:
survey <- survey[order(survey$child_id, survey$wave_id), ] # assign, so the ordering sticks
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA,diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA

Concatenating column values of identical rows (except for this column) of two different data tables

I have two data tables of genes aggHuman and aggRat
> aggHuman
Human Rat RNAtype
1 ASAP2 Asap2 Hy
2 BBS1 Bbs1 Hn
3 BBS2 Bbs2 Hn
4 SPATA22 Spata22 Hn
and
> aggRat
Human Rat RNAtype
1 ASAP2 Asap2 Rn
2 BBS1 Bbs1 Ry
3 BBS2 Bbs2 Rn
4 SPATA22 Spata22 Rn
Now I want to stitch together the values in column RNAtype of these two tables. For example, for ASAP2 we have Hy in aggHuman, whereas in aggRat we have Rn, so I want to make another similar table of the following form by stitching them into HyRn.
Human Rat RNAtype
1 ASAP2 Asap2 HyRn
But the initial two tables can have genes in different order. So, what I need to do is find the row corresponding to ASAP2 in aggHuman and "find" the same gene row in aggRat and then do the stitch thing.
Could anyone help me on how to do this?
Try this:
Step 1: Load data.table library (you may need to install this):
library(data.table)
Step 2: Convert your data.frame to data.table, and set appropriate keys:
setDT(aggHuman)
setkey(aggHuman,Human,Rat)
setDT(aggRat)
setkey(aggRat,Human,Rat)
Step 3: Join the two data tables, and perform desired combination:
aggHumanRat <- aggHuman[aggRat]
aggHumanRat[, RNAtype := paste0(RNAtype, i.RNAtype)][, i.RNAtype := NULL] # data.table prefixes the clashing column from aggRat with "i."
aggHumanRat
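For reference, the same stitch can be sketched in base R with merge(), which aligns the gene rows regardless of their order in the two tables (toy data below mirrors a slice of the question's tables):

```r
aggHuman <- data.frame(Human = c("ASAP2","BBS1"), Rat = c("Asap2","Bbs1"),
                       RNAtype = c("Hy","Hn"), stringsAsFactors = FALSE)
aggRat   <- data.frame(Human = c("BBS1","ASAP2"), Rat = c("Bbs1","Asap2"),
                       RNAtype = c("Ry","Rn"), stringsAsFactors = FALSE)

# merge() matches on the gene columns; clashing RNAtype gets .x / .y suffixes
m <- merge(aggHuman, aggRat, by = c("Human", "Rat"))
m$RNAtype <- paste0(m$RNAtype.x, m$RNAtype.y)
m[, c("Human", "Rat", "RNAtype")]
#   Human   Rat RNAtype
# 1 ASAP2 Asap2    HyRn
# 2  BBS1  Bbs1    HnRy
```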
Hope this helps!!
