Comparing specific columns in 2 different files using R

I find myself having to do this very often: compare specific columns from 2 different files. The columns and formats are the same, but the columns that need comparison contain floating-point/exponential data, e.g. 0.0058104642437413175, -3.459017050577087E-4, etc.
I'm currently using the below R code:
library(sqldf)

test <- read.csv("C:/VBG_TEST/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")
prod <- read.csv("C:/VBG_PROD/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")

sqldf("select sum(V10), sum(V15) from test")
sqldf("select sum(V10), sum(V15) from prod")
I read in the files, sum the specific columns -- V10 and V15 -- and then compare the totals. This way I can ignore very small per-row differences in the floating-point data.
However, going forward, I would like to set a tolerance percentage, i.e. if abs( (prod.V10 - test.V10)/prod.V10 ) > 0.01%, print only those row numbers that exceed this tolerance limit.
Also, if the data is not in the same order, how can I do a comparison by specifying columns that will act like a composite primary key?
For e.g., if I did this in Sybase, I'd have written something like:
select A.*, B.*
from tableA A, tableB B
where abs( (A.Col15-B.Col15)/A.Col15 ) > 0.01%
and A.Col1 = B.Col1
and A.Col4 = B.Col4
and A.Col6 = B.Col6
If I try doing the same thing using sqldf in R, it does NOT work as the files contain 500K+ rows of data.
Can anyone point me to how I can do the above in R?
Many thanks,
Chapax.

Ouch, this sqldf hurts my mind -- better to use plain R capabilities than torture yourself with SQL:
which(abs(prod$V10-test$V10)/prod$V10>0.0001)
In a more general version:
which(abs(prod[,colTest]-test[,colTest])/prod[,colTest]>tolerance)
where colTest is the index of the column you want to test and tolerance is the tolerance threshold (e.g. 0.0001 for 0.01%).
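For the second part of the question (rows not in the same order, matched on a composite primary key), here is a minimal base-R sketch. The key columns V1, V4, V6 and the tested column V10 are assumptions mirroring the Sybase example; adjust the names to your actual files.

# Merge on the composite key, then flag rows exceeding the tolerance.
both <- merge(test, prod, by = c("V1", "V4", "V6"), suffixes = c(".test", ".prod"))

tolerance <- 0.0001   # 0.01%
bad <- abs((both$V10.prod - both$V10.test) / both$V10.prod) > tolerance
both[bad, ]           # rows exceeding the tolerance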

I don't know R, but I'm suggesting this as general advice: you should paginate your table and then run your query over the pages. In general it is not wise to execute this kind of comparison over a table that big in one pass.

Related

full_join by date plus one or minus one

I want to use full_join to join two tables. Below is my pseudo code:
join <- full_join(a, b, by = c("a_ID" = "b_ID" , "a_DATE_MONTH" = "b_DATE_MONTH" +1 | "a_DATE_MONTH" = "b_DATE_MONTH" -1 | "a_DATE_MONTH" = "b_DATE_MONTH"))
a_DATE_MONTH and b_DATE_MONTH are in date format "%Y-%m".
I want to do a full join based on the condition that a_DATE_MONTH can be one month prior to b_DATE_MONTH, OR one month after b_DATE_MONTH, OR exactly equal to b_DATE_MONTH. Thank you!
While SQL allows for (almost) arbitrary conditions in a join statement (such as a_month = b_month + 1 OR a_month + 1 = b_month) I have not found dplyr to allow the same flexibility.
The only way I have found to join in dplyr on anything other than a_column = b_column is to do a more general join and filter afterwards. Hence I recommend you try something like the following:
join <- full_join(a, b, by = c("a_ID" = "b_ID")) %>%
  filter(abs(a_DATE_MONTH - b_DATE_MONTH) <= 1)
This approach still produces the same records in your final results.
It would perform worse/slower if R did a complete full join before doing any filtering. However, dplyr is designed to use lazy evaluation, which means that (unless you do something unusual) both commands should be evaluated together (as they would be in a more complex SQL join).
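For reference, a minimal sketch of this join-then-filter idea when the month columns are "%Y-%m" strings; month_index is a hypothetical helper that turns them into a running month count so the difference can be compared to 1:

library(dplyr)

# Hypothetical helper: convert a "%Y-%m" string into a running month count.
month_index <- function(ym) {
  as.integer(substr(ym, 1, 4)) * 12L + as.integer(substr(ym, 6, 7))
}

join <- full_join(a, b, by = c("a_ID" = "b_ID")) %>%
  filter(abs(month_index(a_DATE_MONTH) - month_index(b_DATE_MONTH)) <= 1)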

Optimization of R Data.table combination with for loop function

I have an 'Agency_Reference' table containing a column 'agency_lookup' with 200 string entries, as below:
alpha
beta
gamma etc..
I have a dataframe 'TEST' with a million rows containing a 'Campaign' column with entries such as :
Alpha_xt2010
alpha_xt2014
Beta_xt2016 etc..
I want to loop through each entry in the reference table, find which string is present within each Campaign entry, and create a new agency_identifier column in the table.
My current code is below and is slow to execute. Requesting guidance on how to optimize it. I would like to learn how to do this the data.table way.
Agency_Reference <- data.frame(agency_lookup = c('alpha', 'beta', 'gamma', 'delta', 'zeta'))
TEST <- data.frame(Campaign = c('alpha_xt123', 'ALPHA345', 'Beta_xyz_34', 'BETa_testing', 'code_delta_'))

TEST$agency_identifier <- 0
for (agency_lookup in as.vector(Agency_Reference$agency_lookup)) {
  TEST$agency_identifier <- ifelse(grepl(tolower(agency_lookup), tolower(TEST$Campaign)),
                                   agency_lookup, TEST$agency_identifier)
}
Expected Output:
Campaign        Agency_identifier
alpha_xt123     alpha
ALPHA345        alpha
Beta_xyz_34     beta
BETa_testing    beta
code_delta_     delta
Try
TEST <- data.frame(Campaign = c('alpha_xt123', 'ALPHA345', 'Beta_xyz_34', 'BETa_testing', 'code_delta_'))
pattern <- tolower(c('alpha', 'Beta', 'gamma', 'delta', 'zeta'))

TEST$agency_identifier <- sub(pattern = paste0('.*(', paste(pattern, collapse = '|'), ').*'),
                              replacement = '\\1',
                              x = tolower(TEST$Campaign))
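Since the question asks for a data.table way, here is a minimal sketch of the same idea using one regular expression over the whole column; it assumes the first lookup string found in each Campaign is the one you want:

library(data.table)

setDT(TEST)
lookup  <- c('alpha', 'beta', 'gamma', 'delta', 'zeta')
pattern <- paste(lookup, collapse = "|")        # "alpha|beta|gamma|delta|zeta"

campaign_lc <- tolower(TEST$Campaign)
hit <- regexpr(pattern, campaign_lc)            # position of first match, -1 if none

TEST[, agency_identifier := NA_character_]
TEST[hit > 0, agency_identifier := regmatches(campaign_lc, hit)]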
This will not answer your question per se, but from what I understand you want to dissect the Campaign column and do something with the values it provides.
Take a look at Tidy data, more specifically the part "Multiple variables stored in one column". I think you'll make some great progress using tidyr::separate. That way you don't have to use a for-loop.

How to not use loops & IF-statements in R

I have two dataframes in R: one big but incomplete (import), from which I want to create a smaller, complete subset (export). Every ID in the $unique_name column is unique and does not appear twice. Other columns might be, for example, body mass, but also other categories that correspond to the unique ID. I've written this code, a double loop with an if-statement, and it does work, but it is slow:
for (j in 1:length(export$unique_name)) {
  for (i in 1:length(import$unique_name)) {
    if (toString(export$unique_name[j]) == toString(import$unique_name[i])) {
      export$body_mass[j] <- import$body_mass[i]
    }
  }
}
I'm not very good with R but I know this is a bad way to do it. Any tips on how I can do it with functions like apply() or perhaps the plyr package?
Bjørn
There are many functions to do this. Check out:
library(compare)
compare(DF1,DF2,allowAll=TRUE)
or, as mentioned by @A.Webb, merge is a pretty handy function:
merge(x = DF1, y = DF2, by.x = "Unique_ID",by.y = "Unique_ID", all.x = T, sort = F)
If you prefer SQL style statements then
library(sqldf)
sqldf('SELECT * FROM DF1 INTERSECT SELECT * FROM DF2')
Easy to implement, and it avoids the for and if conditions.
As A.Webb suggested, you need a join:
# join data on unique_name
joined=merge(export, import[c("unique_name", "body_mass")], c('unique_name'))
joined$body_mass=joined$body_mass.y # update body_mass from import to export
joined$body_mass.x=NULL # remove not needed column
joined$body_mass.y=NULL # remove not needed column
export=joined;
Note: as shown below, use the which() function. This would reduce the loop iterations.
for (j in 1:nrow(export)) {
  index <- which(import$unique_name %in% export$unique_name[j])
  if (length(index) >= 1) {   # at least one match found
    export$body_mass[j] <- import[index[1], "body_mass"]
  }
}
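A fully vectorized alternative, sketched under the assumption (stated in the question) that each unique_name appears at most once: match() looks up every export ID in import in one go, so no loop or if is needed.

# Loop-free version: find the position of each export ID in import, then copy
# body_mass across in one vectorized step, touching only the rows that match.
idx   <- match(export$unique_name, import$unique_name)
found <- !is.na(idx)
export$body_mass[found] <- import$body_mass[idx[found]]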

R equivalent of SAS OBS= option

SAS has the OBS= option to limit the number of observations to read. Once set as a system option, it applies to every dataset the program reads, which makes it useful for testing a program before running it on the full dataset.
I am wondering if there is a similar option/function in R, or do we have to specify the number of observations for each input data frame separately?
Expanding the comments into an answer, at the top of your script you can define
OBS = 100 # however many rows you want to start with
When reading in data with read.csv, read.table, etc.,
... = read.table(..., nrows = OBS)
As described in ?read.table, if you set nrows (hence OBS) to a negative number (such as the default, -1), it will be ignored.
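A minimal sketch of how this could look in a script (the file name here is hypothetical):

OBS <- 100   # cap every read at 100 rows while testing

# Hypothetical file; any read.csv/read.table call can take the same nrows cap.
dat <- read.csv("my_big_file.txt", header = FALSE, sep = "|", nrows = OBS)

OBS <- -1    # a negative value is ignored, so the same call now reads everything
dat <- read.csv("my_big_file.txt", header = FALSE, sep = "|", nrows = OBS)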
If you have fewer than 100 rows you can still use:
head(my_dataframe, 100)
The indexing form below assumes a data frame with at least 100 rows (with fewer, it pads the result with NA rows):
my_dataframe[1:100, ]
Be aware that obs= effectively means "last observation to read";
the companion option is firstobs=.
e.g. read rows 1--5: (firstobs=1)
set sashelp.class(obs=5);
e.g. read rows 5--10:
set sashelp.class(firstobs=5,obs=10);

How to quickly split values in column to create a table for plotting in R

I was wondering if anyone could offer any advice on speeding up the following in R.
I’ve got a table in a format like this
chr1, A, G, v1,v2,v3;w1w2w3, ...
...
The header is
chr, ref, alt, sample1, sample2 ...(many samples)
In each row, for each sample, I've got 3 values for v and 3 values for w, separated by ";".
I want to extract v1 and w1 for each sample and make a table that can be plotted using ggplot; it would look like this:
chr, ref, alt, sam, v1, w1
I am doing this with strsplit and rbind, one row at a time, like the following:
varsam <- c()
for (i in 1:n.var) {
  chrm <- variants[i, 1]
  pos  <- variants[i, 2]   # position column assumed to be column 2
  ref  <- as.character(variants[i, 3])
  alt  <- as.character(variants[i, 4])
  amp  <- as.character(variants[i, 5])
  for (j in 1:n.sam) {
    vs  <- strsplit(as.character(vcftable[i, j + 6]), split = ":")[[1]]
    vsc <- strsplit(vs[1], split = ",")[[1]]
    vsp <- strsplit(vs[2], split = ",")[[1]]
    varsam <- rbind(varsam, c(chrm, pos, ref, j, vsc[1], vsp[1]))
  }
}
This is very slow as you would expect. Any idea how to speed this up?
As noted by others, the first thing you need is some timings, so that you have a baseline to compare against if you intend to optimize performance. This would be my first step:
Create some timings.
Play around with different aspects of your code to see where the main time is being spent.
Basic timing analysis can be done with the system.time() function.
Beyond that, there are some candidates you might like to consider to improve performance -- but it is important to get the timings first so that you have something to compare against.
The dplyr library contains a mutate function, which can be used to create new columns, e.g. mynewtablewithextracolumn <- mutate(table, v1 = whatever you want it to be). In that statement, simply insert how each new column value should be calculated, where v1 is the new column. There are lots of examples on the internet, and a tiny sketch follows below.
In order to use dplyr, you need to call library(dplyr) in your code.
You may need to install.packages("dplyr") if it is not already installed.
You might also be best off converting your table into the appropriate type for dplyr, e.g. if your current table is a data frame, use table = tbl_df(df).
As noted, these are just some possible areas. The important thing is to get timings and explore the performance to try to get a handle on where the best place to focus is and to make sure you can measure the performance improvement.
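As a tiny illustration of both points (timing and mutate), with made-up toy data:

library(dplyr)

# Toy data purely for illustration: a million random values.
df <- data.frame(x = runif(1e6))

# Time how long it takes mutate() to add a derived column.
system.time({
  df2 <- mutate(df, y = x * 2)
})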
Thanks for the comments. I think I've found a way to improve this.
I used melt from the "reshape" package to first convert my input table to
chr, ref, alt, variable
I can then use apply to modify "variable", each row of which contains a concatenated string. This achieves good speed.
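For completeness, a minimal sketch of a fully vectorized version without the row-by-row rbind. It assumes, as in the loop above, that the sample columns start at column 7 of vcftable, that variants columns 1/3/4 hold chr/ref/alt, and that each sample string looks like "v1,v2,v3:w1,w2,w3" (the separators used in the strsplit calls):

library(reshape2)

# Put the id columns and the sample columns side by side, then melt to long form.
wide <- cbind(variants[, c(1, 3, 4)], vcftable[, -(1:6)])
names(wide)[1:3] <- c("chr", "ref", "alt")

long <- melt(wide, id.vars = c("chr", "ref", "alt"),
             variable.name = "sam", value.name = "raw")

# Split each "v1,v2,v3:w1,w2,w3" string once, then keep the first v and first w.
parts   <- strsplit(as.character(long$raw), ":", fixed = TRUE)
long$v1 <- vapply(parts, function(p) strsplit(p[1], ",", fixed = TRUE)[[1]][1], character(1))
long$w1 <- vapply(parts, function(p) strsplit(p[2], ",", fixed = TRUE)[[1]][1], character(1))

varsam <- long[, c("chr", "ref", "alt", "sam", "v1", "w1")]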
