Using a lookup table in R with varying counts of data

Hi, I've been working on this issue all weekend. I'm trying to do a simple lookup, but my lookup table has a different count of data per lookup key.
Let's say I have two tables:
Table1: (sample of 3 rows; there are some extra columns of data, but they're irrelevant to my problem)
col1 (GeneName)   col2
HGGR              .554444
BRAC4             .333222
FAM34             .111222
My lookup table is a table of gene groups followed by their respective genes. The lookup table can have a varying number of columns depending on how many genes are in the group... This is a small example; the table often has 20-30 genes per group...
Table2: (example of 2 rows)
col1 (GeneGroupName)   col2    col3
CHR1_45000_46000       HGGR    BRAC4
CHR1_67000_70000       FAM34
What I want is another column in Table1 which shows the corresponding gene group!
FinalResultTable
col1               col2    col3
CHR1_45000_46000   HGGR    .554444
CHR1_45000_46000   BRAC4   .333222
CHR1_67000_70000   FAM34   .111222
The code I have so far is:
finalresult<-cbind( gene_group[match(table1[,1], gene_group[,2]),1], table1)
but of course that only works for genes found in the 2nd column of the gene group table! I need it to search through the whole table and return the row number...
Any help? Thanks in advance
David

One way to do it is to convert your Table 2 to long format, with a column for GeneGroupName and a single column for the member genes, and then use match.
(table1 <- data.frame(GeneName=sample(LETTERS[1:12]), col2=runif(12)))
# GeneName col2
# 1 F 0.6116285
# 2 L 0.5752088
# 3 J 0.7499011
# 4 D 0.9405068
# 5 A 0.9360968
# 6 K 0.6549850
# 7 I 0.7070163
# 8 E 0.3521952
# 9 C 0.4234293
# 10 G 0.7750203
# 11 B 0.1418680
# 12 H 0.6632382
(table2 <- data.frame(GeneGroupName=1:4, g1=LETTERS[1:4], g2=LETTERS[5:8],
g3=LETTERS[9:12]))
# GeneGroupName g1 g2 g3
# 1 1 A E I
# 2 2 B F J
# 3 3 C G K
# 4 4 D H L
# varying=list(-1): every column except the first is treated as a repeated
# measure, so g1, g2, g3 collapse into one long column (named after the first, g1)
(table2.long <- reshape(table2, direction='long', varying=list(-1), timevar='gene'))
# GeneGroupName gene g1 id
# 1.1 1 1 A 1
# 2.1 2 1 B 2
# 3.1 3 1 C 3
# 4.1 4 1 D 4
# 1.2 1 2 E 1
# 2.2 2 2 F 2
# 3.2 3 2 G 3
# 4.2 4 2 H 4
# 1.3 1 3 I 1
# 2.3 2 3 J 2
# 3.3 3 3 K 3
# 4.3 4 3 L 4
table1$GeneGroupName <- table2.long$GeneGroupName[match(table1$GeneName,
                                                        table2.long$g1)]
table1
# GeneName col2 GeneGroupName
# 1 F 0.6116285 2
# 2 L 0.5752088 4
# 3 J 0.7499011 2
# 4 D 0.9405068 4
# 5 A 0.9360968 1
# 6 K 0.6549850 3
# 7 I 0.7070163 1
# 8 E 0.3521952 1
# 9 C 0.4234293 3
# 10 G 0.7750203 3
# 11 B 0.1418680 2
# 12 H 0.6632382 4

One solution could be to use the data.table package.
Reproducing a minimal example:
library(data.table)
table1 = data.table(col1=c("HGGR","BRAC4","FAM34"), col2=c(.55,.33,.11))
table2 = data.table(col2=c("HGGR","FAM34"),col1=c("CHR1_45000_46000", "CHR1_67000_70000"), col3=c("BRAC4",NA))
# > table1
# col1 col2
# 1: BRAC4 0.33
# 2: FAM34 0.11
# 3: HGGR 0.55
# > table2
# col2 col1 col3
# 1: HGGR CHR1_45000_46000 BRAC4
# 2: FAM34 CHR1_67000_70000 NA
First reshape the second data.table with melt, collapsing col2 and col3 into a single gene column:
table2 = melt(table2, id.vars="col1", value.name="col2", na.rm=TRUE)
table2[, variable := NULL]
Then merge the two data.table to get the wanted result:
setkey(table1, col1)
setkey(table2, col2)
table2[table1]
#     col2             col1 col2.1
# 1: BRAC4 CHR1_45000_46000   0.33
# 2: FAM34 CHR1_67000_70000   0.11
# 3:  HGGR CHR1_45000_46000   0.55
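As an aside, with recent data.table versions (1.9.6+) the setkey calls can be skipped by passing the join columns via on=; a quick sketch on the same two tables:
table2[table1, on = .(col2 = col1)]
This joins table1's col1 (the gene names) directly against table2's col2.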

Modifying @jbaums's sample data a bit (adding an NA in table2), here is one way with dplyr and tidyr.
library(dplyr)
library(tidyr)
table1 <- data.frame(GeneName=sample(LETTERS[1:12]), col2=runif(12),
                     stringsAsFactors = FALSE)
table2 <- data.frame(GeneGroupName=1:4, g1=LETTERS[1:4], g2=LETTERS[5:8],
g3=c(LETTERS[9:11], NA), stringsAsFactors = FALSE)
table2 %>%
gather(gene, whatever, - GeneGroupName) %>%
left_join(., table1, by = c("whatever" = "GeneName")) %>%
select(-gene, GeneGroupName, gene = whatever, value = col2)
# GeneGroupName gene value
#1 1 A 0.9926841
#2 2 B 0.3531973
#3 3 C 0.6547239
#4 4 D 0.4781180
#5 1 E 0.1293723
#6 2 F 0.6334933
#7 3 G 0.2132081
#8 4 H 0.5987610
#9 1 I 0.7317925
#10 2 J 0.9761707
#11 3 K 0.9240745
#12 4 <NA> NA
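As a side note, gather() is superseded in current tidyr; assuming tidyr 1.0+ is available, the same reshape-and-join can be sketched with pivot_longer:
library(dplyr)
library(tidyr)
table2 %>%
  pivot_longer(-GeneGroupName, names_to = 'col', values_to = 'gene',
               values_drop_na = TRUE) %>%
  left_join(table1, by = c('gene' = 'GeneName')) %>%
  select(GeneGroupName, gene, value = col2)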


R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes. One, called inputs, looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get:
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then picking the column among v1 through v3 indicated by v: in the first example row, id = 1 matches column v1 since the value of v equals 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The example data frames are simplified versions; the real data has more records, and the value columns run from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>% pivot_longer(!id,names_prefix='v',names_to = 'v') %>%
mutate(v=as.numeric(v)) %>%
inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
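A small follow-up to the pivot_longer answer above: assuming tidyr >= 1.1, the mutate step can be folded into pivot_longer itself via names_transform:
key %>%
  pivot_longer(!id, names_prefix = 'v', names_to = 'v',
               names_transform = list(v = as.numeric)) %>%
  inner_join(df, by = c('id', 'v'))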
You can use two-column matrices as index arguments to "[", so this is a one-liner. (Note that the data objects here are named d1 and d2, where d1 is the inputs/key table and d2 is df from above; I'm opposed to using df as a data object name.)
d1[-1][ data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So full solution is:
cbind( d2, key= d1[-1][ data.matrix(d2)] )
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
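In case the matrix-indexing trick is unfamiliar, here is a tiny standalone illustration (the objects m and idx are made up purely for exposition): when "[" receives a two-column matrix, each row of that matrix is read as a (row, column) pair.
m <- matrix(letters[1:6], nrow = 2)  # 2 x 3 matrix
idx <- cbind(c(1, 2), c(3, 1))       # the pairs (1,3) and (2,1)
m[idx]                               # "e" "b", i.e. m[1,3] and m[2,1]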
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
  # row df2$id[i]; the "+ 1L" skips the id column to land on v1..v3
  key <- append(df[df2$id[i], (df2$v[i] + 1L)], key)
}
# append() above prepends each new value, so reverse to restore row order
df2$key <- rev(key)
df2
># id v key
># 1 1 1 A
># 2 1 2 A
># 3 1 3 C
># 4 2 2 D
># 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)
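For what it's worth, the loop can also be replaced by a vectorised lookup; a one-line sketch over the same df/df2 objects (the as.character() guards against factor columns on R versions before 4.0):
df2$key <- mapply(function(i, v) as.character(df[i, v + 1L]), df2$id, df2$v)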

different print.gap value for specific column

Is there any way to have a different print.gap for a particular column?
Example data:
dd <- data.frame(col1 = 1:5, col2 = 1:5, col3 = I(letters[1:5]))
print(dd, quote = FALSE, right = TRUE, print.gap = 5)
Output with print.gap=5:
      col1     col2     col3
1        1        1        a
2        2        2        b
3        3        3        c
4        4        4        d
5        5        5        e
Desired output (print.gap mix, first two with print.gap=5, third with print.gap=12)
      col1     col2            col3
1        1        1               a
2        2        2               b
3        3        3               c
4        4        4               d
5        5        5               e
I realise this may not be achievable with any variation of the print statement, but perhaps someone has an alternative method or suggestion. The output is to be saved in a text file. Also please note, the solution should be flexible enough to not just increase the gap for the last column: it could be any column, or multiple columns with different print.gaps in one data frame.
There's probably a way to do this by defining a "proper" alternative print method, but here's a hackish solution that can be used to adjust each column width independently.
rbind(
  data.frame(lapply(dd, as.character), stringsAsFactors=FALSE),
  strrep(" ", c(1, 7, 12))  # a padding row of 1, 7 and 12 spaces
)
#   col1    col2         col3
# 1    1       1            a
# 2    2       2            b
# 3    3       3            c
# 4    4       4            d
# 5    5       5            e
# 6
(row 6 is the all-spaces padding row that forces the extra column widths; drop that line after writing the file if it is unwanted)
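Since the output is destined for a text file anyway, another option is to skip print() entirely and assemble the lines yourself, with one gap per column. A minimal sketch (row names omitted; the gaps vector, the dd data frame from above, and the file name table.txt are illustrative):
gaps  <- c(5, 5, 12)                 # spaces printed before each column
width <- pmax(nchar(names(dd)),
              sapply(dd, function(x) max(nchar(as.character(x)))))
pad   <- strrep(" ", gaps)
header <- paste0(pad, formatC(names(dd), width = width), collapse = "")
body   <- apply(mapply(function(x, w) formatC(as.character(x), width = w),
                       dd, width), 1,
                function(r) paste0(pad, r, collapse = ""))
writeLines(c(header, body), "table.txt")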

Sort Data in the Table

For example, I currently have this table:
  A B C
A 0 4 1
B 2 1 3
C 5 9 6
I'd like to order the columns and rows in my own defined order, to achieve:
  B A C
B 1 2 3
A 4 0 1
C 9 5 6
This can be accomplished in base R. First we make the example data:
# make example data
df.text <- 'A B C
0 4 1
2 1 3
5 9 6'
df <- read.table(text = df.text, header = T)
rownames(df) <- LETTERS[1:3]
  A B C
A 0 4 1
B 2 1 3
C 5 9 6
Then we simply re-order the columns and rows using a vector of named indices:
# re-order data
defined.order <- c('B', 'A', 'C')
df <- df[, defined.order]
df <- df[defined.order, ]
  B A C
B 1 2 3
A 4 0 1
C 9 5 6
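Since both indices are just character vectors, the two steps can also be collapsed into a single indexing call:
df <- df[defined.order, defined.order]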
If the defined order is given as
defined_order <- c("B", "A", "C")
and the initial table is created by
library(data.table)
# create data first
dt <- fread("
id A B C
A 0 4 1
B 2 1 3
C 5 9 6")
# note that row names are added as own id column
then you could achieve the desired result using data.table as follows:
# change column order
setcolorder(dt, c("id", defined_order))
# change row order: match() maps each element of defined_order to its row,
# which works regardless of how the rows were ordered to begin with
dt[match(defined_order, id)]
# id B A C
# 1: B 1 2 3
# 2: A 4 0 1
# 3: C 9 5 6
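Equivalently, and handy if the custom order is reused elsewhere, the row order can be driven by a factor with explicit levels; a small sketch on the same dt:
dt[order(factor(id, levels = defined_order))]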

Pass variable as column name to dplyr?

I have a very ugly dataset that is a flat file of a relational database. A minimal reproducible example is here:
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col1.p = 1:5,
                 col2 = c("a", "c", "l", "c", "l"),
                 col2.p = 6:10,
                 col3 = letters[3:7],
                 col3.p = 11:20)  # the length-5 columns are recycled to 10 rows
I need to be able to identify the '.p' value for the 'col#' column that contains the "c". My previous question on SO (In R, find the column that contains a string for each row) got the first part, which I'm providing for context.
tmp <- which(projectdata=='Transmission and Distribution of Electricity', arr.ind=TRUE)
cnt <- ave(tmp[,"row"], tmp[,"row"], FUN=seq_along)
maxnames <- paste0("max",sequence(max(cnt)))
projectdata[maxnames] <- NA
projectdata[maxnames][cbind(tmp[,"row"],cnt)] <- names(projectdata)[tmp[,"col"]]
rm(tmp, cnt, maxnames)
This results in a dataframe that looks like this:
df
col1 col1.p col2 col2.p col3 col3.p max1
1 a 1 a 6 c 11 col3
2 b 2 c 7 d 12 col2
3 c 3 l 8 e 13 col1
4 d 4 c 9 f 14 col2
5 c 5 l 10 g 15 col1
6 a 1 a 6 c 16 col3
7 b 2 c 7 d 17 col2
8 c 3 l 8 e 18 col1
9 d 4 c 9 f 19 col2
10 c 5 l 10 g 20 col1
When I tried to get the ".p" that matched the value in "max1", I kept getting errors. I thought the approach would be:
df %>%
mutate(my.p = eval(as.name(paste0(max1,'.p'))))
Error: object 'col3.p' not found
Clearly, this did not work, so I thought maybe this was similar to passing a column name in a function, where I need to use 'get'. That also didn't work.
df %>%
mutate(my.p = get(as.name(paste0(max1,'.p'))))
Error: invalid first argument
df %>%
mutate(my.p = get(paste0(max1,'.p')))
Error: object 'col3.p' not found
I found something that gets rid of this error, using data.table, from a different but related problem here: http://codereply.com/answer/7y2ra3/dplyr-error-object-found-using-rle-mutate.html. However, it gives me the value of "col3.p" for every row; col3 is max1 for the first row only (df$max1[1]).
library('dplyr')
library('data.table') # must have the data.table package
df %>%
  tbl_dt() %>%
  mutate(my.p = get(paste0(max1,'.p')))
Source: local data table [10 x 8]
col1 col1.p col2 col2.p col3 col3.p max1 my.p
1 a 1 a 6 c 11 col3 11
2 b 2 c 7 d 12 col2 12
3 c 3 l 8 e 13 col1 13
4 d 4 c 9 f 14 col2 14
5 c 5 l 10 g 15 col1 15
6 a 1 a 6 c 16 col3 16
7 b 2 c 7 d 17 col2 17
8 c 3 l 8 e 18 col1 18
9 d 4 c 9 f 19 col2 19
10 c 5 l 10 g 20 col1 20
Using the lazyeval interp approach (from this SO: How to pass dynamic column names in dplyr into custom function?) doesn't work for me. Perhaps I am implementing it incorrectly?
library(lazyeval)
library(dplyr)
df %>%
mutate_(my.p = interp(~colp, colp = as.name(paste0(max1,'.p'))))
I get an error:
Error in paste0(max1, ".p") : object 'max1' not found
Ideally, I will have the new column my.p equal the appropriate p based on the column identified in max1.
I can do this all with ifelse, but I am trying to do it with less code and to make it applicable to the next ugly flat table.
We can do this with data.table. We convert the 'data.frame' to 'data.table' (setDT(df)) and, grouped by the row sequence, get the value of the paste output and assign (:=) it to a new column ('my.p').
library(data.table)
setDT(df)[, my.p:= get(paste0(max1, '.p')), 1:nrow(df)]
df
# col1 col1.p col2 col2.p col3 col3.p max1 my.p
# 1: a 1 a 6 c 11 col3 11
# 2: b 2 c 7 d 12 col2 7
# 3: c 3 l 8 e 13 col1 3
# 4: d 4 c 9 f 14 col2 9
# 5: c 5 l 10 g 15 col1 5
# 6: a 1 a 6 c 16 col3 16
# 7: b 2 c 7 d 17 col2 7
# 8: c 3 l 8 e 18 col1 3
# 9: d 4 c 9 f 19 col2 9
#10: c 5 l 10 g 20 col1 5
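For completeness, a base R sketch over the same df (assuming it is still a plain data.frame, i.e. before any setDT conversion), with no extra packages: build the wanted column names from max1, locate them with match, and pull the values out with matrix indexing. (Matrix-indexing a data frame coerces through as.matrix, so mixed columns come back as character; hence the as.numeric.)
p.cols  <- match(paste0(df$max1, '.p'), names(df))
df$my.p <- as.numeric(df[cbind(seq_len(nrow(df)), p.cols)])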

How to calculate the frequency of each value in a column corresponding to each value in another column in R?

I have a dataset as follows:
col1 col2
A 1
A 2
A 2
B 1
B 1
C 1
C 1
C 2
I want the output as:
col1 col2 Frequency
A 1 1
A 2 2
B 1 2
C 1 2
C 2 1
I tried using the aggregate function and also the table function, but I am unable to get the desired result.
You can add a dummy column or use the rownames to aggregate on:
mydf <- data.frame(col1 = c("A","A","A","B","B","C","C","C"),
                   col2 = c(1, 2, 2, 1, 1, 1, 1, 2))  # data from the question
aggregate(rownames(mydf) ~ ., mydf, length)
# col1 col2 rownames(mydf)
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 C 2 1
table also works fine but will report combinations that may not be in your data as "0":
data.frame(table(mydf))
# col1 col2 Freq
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 B 2 0
# 6 C 2 1
Another nice approach is to use "data.table":
library(data.table)
as.data.table(mydf)[, .N, by = names(mydf)]
If your data is
col1 <- c("A","A","A","B","B","C","C","C")
col2 <- c(1,2,2,1,1,1,1,2)
df <- data.frame(col1,col2)
you can use dplyr:
1) group_by both variables, since your output is supposed to include every combination of them
2) count the number of observations for each group using n()
library(dplyr)
df %>% group_by(col1,col2) %>% summarize(frequency=n())
# output
col1 col2 frequency
1 A 1 1
2 A 2 2
3 B 1 2
4 C 1 2
5 C 2 1
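As a side note, dplyr also offers count() as a shorthand for this group_by/summarize pattern; the count column is called n by default (the name= argument for renaming it is available in newer dplyr versions):
library(dplyr)
count(df, col1, col2)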
