Avoiding redundancy when selecting rows in a data frame - r

My code is littered with statements like the following:
selected <- long_data_frame_name[long_data_frame_name$col1 == "condition1" &
                                 long_data_frame_name$col2 == "condition2" &
                                 !is.na(long_data_frame_name$col3),
                                 selected_columns]
The repetition of the data frame name is tedious and error-prone. Is there a way to avoid it?

You can use with(). For instance:
sel.ID <- with(long_data_frame_name, col1==2 & col2<0.5 & col3>0.2)
selected <- long_data_frame_name[sel.ID, selected_columns]

Several ways come to mind.
If you think about it, you are subsetting your data, so use the subset function (base package):
your_subset <- subset(long_data_frame_name,
                      col1 == "cond1" & col2 == "cond2" & !is.na(col3),
                      select = selected_columns)
In my opinion this is the most self-explanatory code to accomplish your task.

Use data tables.
library(data.table)
long_data_table_name <- data.table(long_data_frame_name, key = "col1,col2,col3")
selected <- long_data_table_name[col1 == "condition1" &
                                 col2 == "condition2" &
                                 !is.na(col3),
                                 list(col4, col5, col6, col7)]
You don't have to set the key in the data.table(...) call, but if you have a large dataset, this will be much faster. Either way it will be much faster than using data frames. Finally, using J(...), as below, does require a keyed data.table, but is even faster.
selected <- long_data_table_name[J("condition1", "condition2", NA),
                                 list(col4, col5, col6, col7)]

You have several possibilities:
attach, which adds the variables of the data.frame to the search path just below the global environment. Very useful for code demonstrations, but I warn you not to do that programmatically (a short example follows the code below).
with, which temporarily creates a whole new environment.
In very limited cases you want to use other options such as within.
df = data.frame(random=runif(100))
df1 = with(df,log(random))
df2 = within(df,logRandom <- log(random))
within will examine the created environment after evaluation and add the modifications to the data. Check the help page of with to see more examples. with will just evaluate your expression.
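As a minimal sketch of attach (interactive use only, and remember to detach afterwards):
df <- data.frame(random = runif(100))
attach(df)
summary(log(random))   # 'random' is now found on the search path
detach(df)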

Related

Conditional Lookup in R

I am trying to replace the blank (missing) zipcodes in the df table with the zipcodes in another table called zipless, based on names.
What would be the best approach? A for loop is probably very slow.
I was trying with something like this, but it does not work.
df$zip_new <- ifelse(df, is.na(zip_new),
left_join(df,zipless, by = c("contbr_nm" = "contbr_nm")),
zip_new)
I was able to make it work using this approach, but I am sure it is not the best one.
I first added a new column from the lookup table and in the next step selectively used it, where necessary.
library(dplyr)
# temporarily renaming the lookup column in the lookup table
zipless <- plyr::rename(zipless, c("zip_new"="zip_new_temp"))
#adding the lookup column to the main table
df <- left_join(df, zipless, by = c("contbr_nm" = "contbr_nm"))
# take the value from the lookup column zip_new_temp where the condition is met; otherwise keep the existing value
df$zip_new <- ifelse((df$zip_new == "") &
(df$contbr_nm %in% zipless$contbr_nm),
df$zip_new_temp,
df$zip_new)
What would be a proper way to do this?
Thank you very much!
I'd suggest using match to just grab the zips you need. Something like:
miss_zips = is.na(df$zip_new)
df$zip_new[miss_zips] = zipless$zip_new[match(
df$contbr_nm[miss_zips],
zipless$contbr_nm
)]
Without sample data I'm not wholly sure of your column names, but something like that should work.
I can only recommend the data.table package for things like this. Your general approach is correct, but data.table has a much nicer syntax and is designed to handle large data sets.
In data.table it would probably look like this:
library(data.table)
zipcodes <- data.table(left_join(df, zipless, by = "contbr_nm"))
zipcodes[, zip_new := ifelse(is.na(zip_new), zip_new_temp, zip_new)]
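As a sketch (assuming the tables really are named df and zipless, with columns contbr_nm and zip_new), the lookup and the conditional overwrite can also be done in a single data.table update join, without the temporary column:
library(data.table)
setDT(df); setDT(zipless)
# i.zip_new refers to zipless's zip_new column inside the join
df[zipless, on = "contbr_nm",
   zip_new := fifelse(is.na(zip_new), i.zip_new, zip_new)]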

Efficiently combining multiple conditions on R data.frame

I'm working with data frames in R and often use a lot of conditions on my dataframes, mainly & and | operators. I'm doing that like so:
df = data.frame("col1"=c("success", "failure", "success"), "col2"=c("success", "success", "failure"), "col3"=c(1,1,100))
#multiple conditions, option 1
df[(df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1), ]
#multiple conditions, option 2
df[which((df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1)),]
However, my expressions tend to get really long and hard to read that way.
Is there a better, more readable way of doing it?
EDIT: Preferably, I'd like to work within the base R environment w/out external libraries.
I put this together based on other posts here on SO about not using subset and using | correctly, but didn't find anything addressing this specific issue.
I hope this is not too opinion-based, otherwise I'll retract my question. Thanks!
One option is to use the filter() function in the dplyr package:
library(dplyr)
filter(df, col1=="success" & col2=="success" & col3==1)
You can also use commas (equivalent to &) to separate multiple arguments:
filter(df, col1=="success", col2=="success", col3==1)
Try this (maybe not less clumsy, but with the same '&'):
> df[df$col1=="success" & df$col2=="success" & df$col3==1,]
col1 col2 col3
1 success success 1
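Since the question prefers base R, one more sketch: wrapping the conditions in with() (as in the first question above) keeps them readable without repeating df:
df[with(df, col1 == "success" & col2 == "success" & col3 == 1), ]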

How to quickly split values in column to create a table for plotting in R

I was wondering if anyone could offer any advice on speeding
the following up in R.
I’ve got a table in a format like this
chr1, A, G, v1,v2,v3;w1,w2,w3, ...
...
The header is
chr, ref, alt, sample1, sample2 ...(many samples)
In each row, for each sample, I've got 3 values for v and 3 values for w,
separated by ";".
I want to extract v1 and w1 for each sample and make a table
that can be plotted using ggplot; it would look like this
chr, ref, alt, sam, v1, w1
I am doing this with strsplit and rbind, one row at a time, like the
following
varsam <- c()
for (i in 1:n.var) {
  chrm <- variants[i, 1]
  ref <- as.character(variants[i, 3])
  alt <- as.character(variants[i, 4])
  amp <- as.character(variants[i, 5])
  for (j in 1:n.sam) {
    vs  <- strsplit(as.character(vcftable[i, j + 6]), split = ":")[[1]]
    vsc <- strsplit(vs[1], split = ",")[[1]]
    vsp <- strsplit(vs[2], split = ",")[[1]]
    varsam <- rbind(varsam, c(chrm, ref, alt, j, vsc[1], vsp[1]))
  }
}
This is very slow as you would expect. Any idea how to speed this up?
As noted by others, the first thing you need is some timings, so that you can compare performance before and after any change. This would be my first step:
Create some timings.
Play around with different aspects of your code to see where the main time is being spent.
Basic timing analysis can be done with system.time() to help with performance analysis (see the sketch at the end of this answer).
Beyond that, there are some candidates you might like to consider to improve performance - but get the timings first so that you have something to compare against.
The dplyr library contains a mutate function which can be used to create new columns, e.g. mynewtablewithextracolumn <- mutate(table, v1 = ...), where v1 is the new column and ... is however you want it to be calculated. There are lots of examples on the internet, and a small sketch follows this list.
In order to use dplyr, you would need to call library(dplyr) in your code.
You may need to install.packages("dplyr") if it is not already installed.
You might be best converting your table into the appropriate type of table for dplyr, e.g. if your current table is a data frame, use table = tbl_df(df).
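A minimal sketch of that idea (the column name value and the "," / ":" separators are assumptions based on the question):
library(dplyr)
# Toy data shaped like one sample column: "v1,v2,v3:w1,w2,w3"
example_tbl <- data.frame(value = c("1,2,3:4,5,6", "7,8,9:10,11,12"),
                          stringsAsFactors = FALSE)
# Keep everything before the first comma as v1
example_tbl <- mutate(example_tbl, v1 = sub(",.*$", "", value))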
As noted, these are just some possible areas. The important thing is to get timings and explore the performance to try to get a handle on where the best place to focus is and to make sure you can measure the performance improvement.
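A quick illustration of timing with system.time(), comparing growing a result with rbind (as in the question) against building it in one vectorized step:
x <- runif(2000)
# Growing the result one row at a time (slow)
system.time({
  out <- NULL
  for (i in seq_along(x)) out <- rbind(out, c(x[i], x[i]^2))
})
# Building it in one step (fast)
system.time({
  out <- cbind(x, x^2)
})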
Thanks for the comments. I think I've found a way to improve this.
I used melt from the "reshape" package to first convert my input table to
chr, ref, alt, variable
I can then use apply to modify "variable", each row of which contains a concatenated string. This achieves good speed.
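As a sketch of that approach (the column names and the ":" / "," separators are assumptions based on the code in the question; reshape2 is used here, and melt works similarly in the older reshape package), with vectorized string splitting instead of apply:
library(reshape2)
# Toy input: one row per variant, one concatenated column per sample
vcf <- data.frame(chr = "chr1", ref = "A", alt = "G",
                  sample1 = "1,2,3:4,5,6", sample2 = "7,8,9:10,11,12",
                  stringsAsFactors = FALSE)
long <- melt(vcf, id.vars = c("chr", "ref", "alt"),
             variable.name = "sam", value.name = "value")
parts   <- strsplit(long$value, ":", fixed = TRUE)
long$v1 <- sapply(strsplit(sapply(parts, `[`, 1), ",", fixed = TRUE), `[`, 1)
long$w1 <- sapply(strsplit(sapply(parts, `[`, 2), ",", fixed = TRUE), `[`, 1)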

How to efficiently chunk data.frame into smaller ones and process them

I have a bigger data.frame which I want to cut into small ones, depending on some "unique keys" (in MySQL terms). At the moment I am doing this with the loop below, but it takes awfully long, ~45 sec for 10k rows.
for (i in 1:nrow(identifiers_test)) {
  data_test_offer = data_test[(identifiers_test[i, "m_id"] == data_test[, "m_id"] &
                               identifiers_test[i, "a_id"] == data_test[, "a_id"] &
                               identifiers_test[i, "condition"] == data_test[, "condition"] &
                               identifiers_test[i, "time_of_change"] == data_test[, "time_of_change"]), ]
  # Sort data by highest prediction
  data_test_offer = data_test_offer[order(-data_test_offer[, "prediction"]), ]
  if (data_test_offer[1, "is_v"] == 1) {
    true_counter <- true_counter + 1
  }
}
How can I refactor this to make it more "R" - and faster?
Before applying groups you are filtering your data.frame using another data.frame. I would use merge then by.
ID <- c("m_id", "a_id", "condition", "time_of_change")
filter_data <- merge(data_test, identifiers_test, by = ID)
by(filter_data, do.call(paste, filter_data[, ID]),
   FUN = function(x) x[order(-x[, "prediction"]), ])
Of course the same thing can be written more efficiently using data.table:
library(data.table)
setkeyv(setDT(identifiers_test), ID)
setkeyv(setDT(data_test), ID)
data_test[identifiers_test][order(-prediction), .SD, by = ID]
NOTE: this answer is not tested since you don't provide data to test it.

Problem with data.table ifelse behavior

I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
  tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
                      dt[which(dt[, Time] == "t=240min"), Result],
                      ifelse(grepl("hlm", inFile, ignore.case = TRUE),
                             dt[which(dt[, Time] == "t=60"), Result],
                             dt[which(dt[, Time] == "t=30"), Result]))
  t0Value <- dt[which(dt[, Time] == "t=0"), Result]
  return(dt[, Ratio := tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by group.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: it produces a result with the attributes and length taken from test and the values taken from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
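A tiny illustration of that length behaviour:
ifelse(TRUE, c(1, 2, 3), 0)    # returns 1 - length taken from the test
if (TRUE) c(1, 2, 3) else 0    # returns c(1, 2, 3)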
[UPDATE] ...Hmm or maybe it is since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
