Creating a lookup based on two values - r

I have an Excel file that contains a matrix. Here is a screenshot of the matrix I want to use: https://www.flickr.com/photos/113328996#N07/23026818939/in/dateposted-public/
What I would like to do now is create some kind of lookup function. So when I have the row:
Arsenal - Aston Villa
It should look up 114.6.
Of course I could create rows with all distances like:
Arsenal - Aston Villa - 144.6
And perform a lookup on those rows, but my instincts tell me this is not the most efficient way.
Any feedback on how I can handle this most efficiently?

This kind of lookup is exactly what the basic [ operator does for data.frames and matrices in R.
Take this example data (from Here)
a <- cbind(c(0.1,0.5,0.25),c(0.2,0.3,0.65),c(0.7,0.2,0.1))
rownames(a) <- c("Lilo","Chops","Henmans")
colnames(a) <- c("Product A","Product B","Product C")
a
        Product A Product B Product C
Lilo         0.10      0.20       0.7
Chops        0.50      0.30       0.2
Henmans      0.25      0.65       0.1
The lookup is then:
a["Lilo","Product A"] # 0.1
a["Henmans","Product B"] # 0.65
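When there are many row/column pairs to look up at once, the same [ operator also accepts a two-column character matrix, where each row is matched against the row and column names. A minimal sketch, assuming the pairs sit in a data frame (the `pairs` data frame and its `who` and `what` columns are made up for illustration):

```r
# Example matrix from the answer above
a <- cbind(c(0.1, 0.5, 0.25), c(0.2, 0.3, 0.65), c(0.7, 0.2, 0.1))
rownames(a) <- c("Lilo", "Chops", "Henmans")
colnames(a) <- c("Product A", "Product B", "Product C")

# Hypothetical table of (row name, column name) pairs to look up
pairs <- data.frame(who  = c("Lilo", "Henmans", "Chops"),
                    what = c("Product A", "Product B", "Product C"),
                    stringsAsFactors = FALSE)

# Each row of the two-column character matrix selects one cell of a
pairs$value <- a[cbind(pairs$who, pairs$what)]
pairs
```

This avoids reshaping the matrix into one row per pair, which was the concern raised in the question.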

Related

Split data.frame into groups by column name

I'm new to R. I have a data frame with column names of such type:
file_001 file_002 block_001 block_002 red_001 red_002 ....etc
0.05 0.2 0.4 0.006 0.05 0.3
0.01 0.87 0.56 0.4 0.12 0.06
I want to split them into groups by the column name, to get a result like this:
group_file
file_001 file_002
0.05 0.2
0.01 0.87
group_block
block_001 block_002
0.4 0.006
0.56 0.4
group_red
red_001 red_002
0.05 0.3
0.12 0.06
...etc
My file is huge. I don't have a certain number of groups.
It needs to be just by the column name's start.
In base R, you can use sub and split.default like this to return a list of data.frames:
myDfList <- split.default(dat, sub("_\\d+", "", names(dat)))
this returns
myDfList
$block
block_001 block_002
1 0.40 0.006
2 0.56 0.400
$file
file_001 file_002
1 0.05 0.20
2 0.01 0.87
$red
red_001 red_002
1 0.05 0.30
2 0.12 0.06
split.default will split data.frames by variable according to its second argument. Here, we use sub and the regular expression "_\d+" to remove the underscore and all numeric values following it in order to return the splitting values "block", "file", and "red".
As a side note, it is typically a good idea to keep these data.frames in a list and work with them through functions like lapply. See gregor's answer to this post for some motivating examples.
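To make the side note concrete, here is a small sketch that rebuilds the sample data from the question, splits it by column-name prefix, and then works on the resulting list with lapply. The `group_means` name and the choice of colMeans are only illustrative assumptions; any per-group function could go there.

```r
# Sample data taken from the question
dat <- data.frame(file_001  = c(0.05, 0.01), file_002  = c(0.2, 0.87),
                  block_001 = c(0.4, 0.56),  block_002 = c(0.006, 0.4),
                  red_001   = c(0.05, 0.12), red_002   = c(0.3, 0.06))

# Strip the trailing "_<digits>" to get one grouping label per column
groups <- sub("_\\d+$", "", names(dat))

# One data.frame per label, collected in a named list
myDfList <- split.default(dat, groups)

# Apply the same operation to every group (illustrative: column means)
group_means <- lapply(myDfList, colMeans)
group_means
```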
Thank you lmo,
Your code didn't work quite as I wanted, but I came up with a solution thanks to your guidance.
So, to split a data frame into a list of data frames by column-name prefix:
myDfList <- split.default(dat, sub(pattern = "_.*", replacement = "", x = names(dat)))
Hope it helps people in the future!

How to select the best rank according to quantitative values of another variable

I created a dataframe that looks like this:
# Dataframe
GeneID TrID PSI Length Ranking
ENSMUSG00000089809 ENSMUST00000146396 0.20 431801 3
ENSMUSG00000089809 ENSMUST00000161516 0.23 354036 2
ENSMUSG00000089809 ENSMUST00000161148 0.57 5601 1
ENSMUSG00000044681 ENSMUST00000117098 0.05 4400 2
ENSMUSG00000044681 ENSMUST00000141196 0.10 1118 1
ENSMUSG00000044681 ENSMUST00000141601 0.75 44973 5
Now I would like to select, for each GeneID, the TrID that has the highest PSI value, together with its Ranking. The output should look like this:
# Desired Output Dataframe
GeneID TrID PSI Length Ranking
ENSMUSG00000089809 ENSMUST00000161148 0.57 5601 1
ENSMUSG00000044681 ENSMUST00000141601 0.75 44973 5
After that, I will create a distribution of the ranking values and check which PSI value each rank corresponds to. I will permute the Length values and the TrID values as a control for the distribution.
You can use base R and do:
# row indices grouped by gene (the column is named GeneID)
byGeneId = split(seq_len(nrow(Dataframe)), Dataframe$GeneID)
# within each gene, keep the row index with the largest PSI
whichTopPsi = sapply(byGeneId, function(i) i[which.max(Dataframe[i,'PSI'])])
Dataframe[whichTopPsi,]
You could also use ddply, which is more general.
library(plyr)
ddply(Dataframe, "GeneID", function(d) d[which.max(d[,'PSI']),,drop=FALSE])
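As another possible base R route (a different technique from the two shown above, offered only as a sketch): sort the rows by PSI in decreasing order and keep the first row seen for each gene. The data frame is rebuilt here from the question's sample so the block runs on its own; `by_psi` and `top` are illustrative names.

```r
# Sample data from the question
Dataframe <- data.frame(
  GeneID  = rep(c("ENSMUSG00000089809", "ENSMUSG00000044681"), each = 3),
  TrID    = c("ENSMUST00000146396", "ENSMUST00000161516", "ENSMUST00000161148",
              "ENSMUST00000117098", "ENSMUST00000141196", "ENSMUST00000141601"),
  PSI     = c(0.20, 0.23, 0.57, 0.05, 0.10, 0.75),
  Length  = c(431801, 354036, 5601, 4400, 1118, 44973),
  Ranking = c(3, 2, 1, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Highest PSI first
by_psi <- Dataframe[order(Dataframe$PSI, decreasing = TRUE), ]

# duplicated() flags every repeat of a GeneID, so negating it keeps
# only the first (highest-PSI) row per gene
top <- by_psi[!duplicated(by_psi$GeneID), ]
top
```

Note that the rows come back ordered by PSI rather than in the original gene order.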

Vectorizing with R instead of for loop

I am trying to vectorize the following task with one of the apply functions, but in vain.
I have a list and a dataframe. What I am trying to accomplish is to create subgroups in a dataframe using a lookup list.
The lookup list (which are basically percentile groups) looks like the following:
Look_Up_List
$`1`
A B C D E
0.000 0.370 0.544 0.698 9.655
$`2`
A B C D E
0.000 0.506 0.649 0.774 1.192
The current dataframe looks like this:
Score Big_group
0.1 1
0.4 1
0.3 2
The resulting dataframe must look like the following, with an additional column that matches each score to its percentile bucket from the lookup list for the corresponding Big_group:
Score Big_group Sub_Group
0.1 1 A
0.4 1 B
0.3 2 A
Thanks so much
You can create a function like this:
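myFun <- function(x) {
  # x[1] is the Score, x[2] is the Big_group of one row
  names(Look_Up_List[[as.character(x[2])]])[
    findInterval(x[1], Look_Up_List[[as.character(x[2])]])]
}
And apply it by row with apply (here mydf is the data frame holding the Score and Big_group columns):
apply(mydf, 1, myFun)
# [1] "A" "B" "A"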
# reproducible input data (list elements are named with =, not <-)
Look_Up_List <- list('1' = c(A=0.000, B=0.370, C=0.544, D=0.698, E=9.655),
                     '2' = c(A=0.000, B=0.506, C=0.649, D=0.774, E=1.192))
Current <- data.frame(Score=c(0.1, 0.4, 0.3),
                      Big_group=c(1,1,2))
# Solution 1: take the last bucket whose lower bound is below the score,
# using the lookup vector that belongs to each row's Big_group
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
  lut <- Look_Up_List[[as.character(Current$Big_group[i])]]
  max(names(lut[Current$Score[i] > lut]))
})
# Alternative solution (using findInterval, slightly slower at least for this dataset)
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
  lut <- Look_Up_List[[as.character(Current$Big_group[i])]]
  names(lut)[findInterval(Current$Score[i], lut)]
})
# show result
Current
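Since findInterval is itself vectorized over scores, one way to avoid a per-row call is to loop over the (few) groups instead of the (many) rows. This is only a sketch under the assumption that every score is at or above the first break of its group's lookup vector; otherwise findInterval returns 0 and the assignment would need extra handling.

```r
# Input data as in the question
Look_Up_List <- list(
  "1" = c(A = 0.000, B = 0.370, C = 0.544, D = 0.698, E = 9.655),
  "2" = c(A = 0.000, B = 0.506, C = 0.649, D = 0.774, E = 1.192)
)
Current <- data.frame(Score = c(0.1, 0.4, 0.3), Big_group = c(1, 1, 2))

Current$Sub_Group <- NA_character_
for (g in names(Look_Up_List)) {
  # rows belonging to this Big_group
  rows <- which(as.character(Current$Big_group) == g)
  lut  <- Look_Up_List[[g]]
  # one vectorized findInterval call handles all scores of the group
  Current$Sub_Group[rows] <- names(lut)[findInterval(Current$Score[rows], lut)]
}
Current
```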

Complex subsetting of dataframe

Consider the following dataframe:
df <- data.frame(Asset = c("A", "B", "C"), Historical = c(0.05,0.04,0.03), Forecast = c(0.04,0.02,NA))
# Asset Historical Forecast
#1 A 0.05 0.04
#2 B 0.04 0.02
#3 C 0.03 NA
as well as the variable x. x is set by the user at the beginning of the R script, and can take two values: either x = "Forecast" or x = "Historical".
If x = "Forecast", I would like to return the following: for each asset, if a forecast is available, return the appropriate number from the column "Forecast", otherwise, return the appropriate number from the column "Historical". As you can see below, both A and B have a forecast value which is returned below. C is missing a forecast value, so the historical value is returned.
Asset Return
1 A 0.04
2 B 0.02
3 C 0.03
If, however, x = "Historical", simply return the Historical column:
Asset Historical
1 A 0.05
2 B 0.04
3 C 0.03
I can't come up with an easy way of doing it, and brute force is very inefficient if you have a large number of rows. Any ideas?
Thanks!
First, pre-process your data:
df2 <- transform(df, Forecast = ifelse(!is.na(Forecast), Forecast, Historical))
Then extract the two columns of choice:
df2[c("Asset", x)]
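The two steps can be wrapped in a small helper so that both settings of x are easy to check. This is a sketch; the function name `pick_returns` is made up for illustration and is not part of the answer above.

```r
df <- data.frame(Asset = c("A", "B", "C"),
                 Historical = c(0.05, 0.04, 0.03),
                 Forecast = c(0.04, 0.02, NA))

# Hypothetical helper: fill missing forecasts with the historical value,
# then return Asset plus the column named by x
pick_returns <- function(df, x) {
  df$Forecast <- ifelse(is.na(df$Forecast), df$Historical, df$Forecast)
  df[c("Asset", x)]
}

pick_returns(df, "Forecast")    # C falls back to its historical value
pick_returns(df, "Historical")  # Historical column is returned unchanged
```

Because ifelse is vectorized, the fill step is a single pass over the column, so it should scale to a large number of rows.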

How to summarize multiple files into one file based on an assigned rule?

I have ~100 files in the following format. Each file has its own file name, but all of them are saved in the same directory. For example, filecd is as follows:
A B C D
ab 0.3 0.0 0.2 0.20
cd 0.7 0.0 0.3 0.77
ef 0.8 0.1 0.5 0.91
gh 0.3 0.5 0.6 0.78
fileabb is as follows:
A B C D
ab 0.3 0.9 1.0 0.20
gh 0.3 0.5 0.6 0.9
All these files have same number of columns but different number of rows.
For each file I want to summarize the contents as one row (0 if all cells in a column are < 0.8; 1 if ANY cell in the column is greater than or equal to 0.8), and the summarized results should be saved in a separate csv file as follows:
A B C D
filecd 1 0 0 1
fileabb 0 1 1 1
..... till 100
Instead of reading and processing each file separately, could this be done efficiently in R? Could you help me with how to do so? Thanks.
For ease of discussion, I have added the following lines as sample input files:
file1 <- data.frame(A=c(0.3, 0.7, 0.8, 0.3), B=c(0,0,0.1,0.5), C=c(0.2,0.3,0.5,0.6), D=c(0.2,0.77,0.91, 0.78))
file2 <- data.frame(A=c(0.3, 0.3), B=c(0.9,0.5), C=c(1,0.6), D=c(0.2,0.9))
Please kindly give me some more advice. Many thanks.
First make a vector of all the filenames.
filenames <- dir(your_data_dir, full.names = TRUE) # you may also need the pattern argument
Then read the data into a list of matrices.
data_list <- lapply(filenames, function(fn) as.matrix(read.delim(fn)))
# maybe with other arguments passed to read.delim
Now calculate the summary: for each column, 1 if any value is >= 0.8, otherwise 0.
summarised <- lapply(data_list, function(dfr)
{
  apply(dfr, 2, function(col) as.integer(any(col >= 0.8)))
})
Convert this list into a matrix.
summary_matrix <- do.call(rbind, summarised)
Make the rownames match the files.
rownames(summary_matrix) <- basename(filenames)
Now write out to CSV.
write.csv(summary_matrix, "my_summary_matrix.csv")
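To check the summary rule without touching the file system, the same idea can be tried on the two sample data frames from the question. This is a sketch: the list names filecd and fileabb stand in for the real file names, and colSums on a logical matrix is used here in place of apply.

```r
# Sample inputs from the question
file1 <- data.frame(A = c(0.3, 0.7, 0.8, 0.3), B = c(0, 0, 0.1, 0.5),
                    C = c(0.2, 0.3, 0.5, 0.6), D = c(0.2, 0.77, 0.91, 0.78))
file2 <- data.frame(A = c(0.3, 0.3), B = c(0.9, 0.5),
                    C = c(1, 0.6), D = c(0.2, 0.9))

# Stand-in for the list produced by reading the files
data_list <- list(filecd = file1, fileabb = file2)

# For each data frame: TRUE per column if at least one cell is >= 0.8.
# sapply gives one column per file, so transpose; + 0L turns TRUE/FALSE into 1/0
summary_matrix <- t(sapply(data_list, function(dfr) colSums(dfr >= 0.8) > 0)) + 0L
summary_matrix
```

The result has one row per file and one column per variable, which is the layout asked for in the question.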
