Conditional Lookup in R - r

I am trying to replace the blank (missing) zipcodes in the df table with the zipcodes in another table called zipless, based on names.
What would be the best approach? A for loop is probably very slow.
I was trying with something like this, but it does not work.
df$zip_new <- ifelse(df, is.na(zip_new),
left_join(df,zipless, by = c("contbr_nm" = "contbr_nm")),
zip_new)
I was able to make it work using this approach, but I am sure it is not the best one.
I first added a new column from the lookup table and in the next step selectively used it, where necessary.
library(dplyr)
#temporarly renaming the lookup column in the lookup table
zipless <- plyr::rename(zipless, c("zip_new"="zip_new_temp"))
#adding the lookup column to the main table
df <- left_join(df, zipless, by = c("contbr_nm" = "contbr_nm"))
#taking over the value from the lookup column zip_new_temp if the condition is met, else, do nothing.
df$zip_new <- ifelse((df$zip_new == "") &
(df$contbr_nm %in% zipless$contbr_nm),
df$zip_new_temp,
df$zip_new)
What would be a proper way to do this?
Thank you very much!

I'd suggest using match to just grab the zips you need. Something like:
miss_zips = is.na(df$zip_new)
df$zip_new[miss_zips] = zipless$zip_new[match(
df$contbr_nm[miss_zips],
zipless$contbr_nm
)]
Without sample data I'm not wholly sure of your column names, but something like that should work.

I can only recommend the data.table-package for things like these. But your general approach is correct. The data.table-package has a much nicer syntax and is designed to handle large data sets.
In data.table it would probably look like this:
zipcodes <- data.table(left_join(df, zipless, by = "contbr_nm"))
zipcodes[, zip_new := ifelse(is.na(zip_new), zip_new_temp, zip_new)]

Related

Creating a simple for loop in R

I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the relative Volume of every column that corresponds to Column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relative new to R so I it's not obvious to me. What I tried is something like this, but I can't actually run this:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide some data, I've followed the description you provided to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volumn_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you wan to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volum_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volum_OD is equivalent to function(x) x / Volumn_OD
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in same cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by columns 3
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[3])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
in this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] a time and applies a function.
the function is /. You usually see / used like this: A / B, but actually / is a Primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we are writing rel.Volume_unmod[3] as additional parameter.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we will just edit the columns of the dataframe and, as a result, we will have a dataframe instead of a list. Let me rephrase in a more technical way. When you are assigning rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using this assigning function: [<-. This is a function that allows to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. Then when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
Whithout a minimal working example it's hard to guess what the Variable Volume actually refers to. Apart from that there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising the variable rel.Volume with NULL will result in an error, when trying to reassign to that variable, since you haven't told R, what rel.Volume should be.
Try this, if you like (thanks #Edo for example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse-packages while you are still learing the basics. From my experience, in the long run you're better off learning base-R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse-functions later (but then, why would anybody? ;-) ).

Creating a data frame using a single line code

I need to select data for 3 variables and place them in a new data frame using a single line of code. The data frame I'm pulling from is Dance, the 3 variables are Lindy, Blues and Contra.
I have this:
Dance$new<-subset(Dance$Type==Lindy, Dance$Type==Blues, Dance$Type==Contra)
Can you tell what I'm doing wrong?
There are a number of ways you can do this, but I'd forget the subset part
danceNew <- Dance[Dance$Type=="Lindy"|Dance$Type=="Blues"|Dance$Type=="Contra",]
If you only want specific columns
danceNew <- Dance[Dance$Type=="Lindy"|Dance$Type=="Blues"|Dance$Type=="Contra",c("Col1", "Col2")]
Alternatively
danceNew <- Dance[Dance$Type %in% c("Blues", "Contra", "Lindy"),]
Again, if you only want specific columns do the same. The advantage with the final options is you can pass the values in as a variable, thereby making it more dynamic, e.g
danceNames <- c("Lindy", "Blues", "Contra")
danceNew <- Dance[Dance$Type %in% danceNames,]
you're mixing up the variables and the dataframes
this should do the trick..
if your initial dataframe is called "Dance" and the new dataframe is called "Dance.new":
Dance.new <- subset(Dance, Dance$Type=="Lindy" & Dance$Type=="Blues" & Dance$Type=="Contra"); row.names(Dance.new) <- NULL
I like using "row.names(Dance.new) <- NULL" line so I won't have the useless column of "row.names" in the new dataframe
Thanks for your help everyone. This is what ended up working for me.
dancenew<-subset(Dance, Type=="Lindy" | Type== "Blues" | Type=="Contra")

R: add column to dataframe, named based on formula

More 'feels like it should be' simple stuff which seems to be eluding me today. Thanks in advance for assistance.
Within a loop, that's within a function, I'm trying to add a column, and name it based on a formula.
I can bind a column & its name is taken from the bound object: data<-cbind(data,bothdata)
I can bind a column & manually name the bound object: data<-cbind(data,newname=bothdata)
I can bind a column which is the product of an equation & manually name the bound object: data<-cbind(data,newname2=bothdata-1)
Or another way: data <- transform(data, newColumn = bothdata-1)
What I can't do is have the name be the product of a formula. My actual formula-derived example name is paste("E_wgt",rev(which(rev(Esteps) == q))-1,"%") & equation for column: baddata - q.
A simpler one: data<-cbind(data,paste("magic",100,"beans")=bothdata-1). This fails because cbind isn't expecting the = even though it's fine in previous examples. Same fail for transform.
My first thought was assign but while I've used this successfully for creating forumla-named objects, I can't see how to get it to work for formula-named columns.
If I use an intermediary step to put the naming formula in an object container then use that, e.g.:
name <- paste("magic",100,"beans")
data<-cbind(data,name=bothdata-1)
the column name is "name" not "magic100beans". If I assign the equation result to an formula-named object:
assign(paste("magic",100,"beans"),bothdata-1)
Then try to cbind that via get:
data<-cbind(data,get(paste("magic",100,"beans")))
The column is called "get(paste("magic",100,"beans"))". Boo! Any thoughts anyone? It occurs to me that I can do cbind then separately colnames(data)[ncol(data)] <- paste("magic",100,"beans")) which I guess I'll settle for for now, but would still be interested to find if there was a direct way.
Thanks.
Chances are that cbind is overkill for your use case. In almost every instance, you can simply mutate the underlying data frame using data$newname2 <- data$bothdata - 1.
In the case where the name of the column is dynamic, you can just refer to it using the [[ operator -- data[["newcol"]] <- data$newname + 1. See ?'[' and ?'[.data.frame' for other tips and usages.
EDIT: Incorporated #Marek's suggestion for [["newcol"]] instead of [, "newcol"]
It may help you to know that data$col1 is the same than data[,"col1"] which is the same than data[,x] if x is "col1". This is how I usually access/set columns programmatically.
So this should work:
name <- paste("magic",100,"beans")
data[,name] <- obsdata-1
Note that you don't have to use the temporary variable name. This is equivalent to:
data$magic100beans <- obsdata-1
Itself equivalent, for a data.frame, to:
data<-cbind(data, magic100beans=bothdata-1)
Just so you know, you could also set the names afterwards:
old_names <- names(data)
name <- paste("magic",100,"beans")
data <- cbind(data, bothdata-1)
data <- setNames(data, c(old_names, name))
# or
names(data) <- c(old_names, name)

How to quickly split values in column to create a table for plotting in R

I was wondering if anyone could offer any advice on speeding
the following up in R.
I’ve got a table in a format like this
chr1, A, G, v1,v2,v3;w1w2w3, ...
...
The header is
chr, ref, alt, sample1, sample2 ...(many samples)
In each row for each sample I’ve got 3 values for v and 3 values for w,
separated by “;"
I want to extract v1 and w1 for each sample make a table
that can be plotted using ggplot, it would look like this
chr, ref, alt, sam, v1, w1
I am doing this by strsplit and rbind one by one like the
following
varsam <- c()
for(i in 1:n.var){
chrm <- variants[i,1]
ref <- as.character(variants[i,3])
alt <- as.character(variants[i,4])
amp <- as.character(variants[i,5])
for(j in 1:n.sam){
vs <- strsplit(as.character(vcftable[i,j+6]), split=":")[[1]
vsc <- strsplit(vs[1], split=",")[[1]]
vsp <- strsplit(vs[2], split=",")[[1]]
varsam <- rbind(varsam, c(chrm, pos, ref, j, vsc[1], vsp[1]))
}
This is very slow as you would expect. Any idea how to speed this up?
As noted by others, the first thing you need is some timings, so that you can compare performance if you intend to optimize performance. This would be my first step:
Create some timings
Play around with different aspects of your code to see where the main time is being used.
Basic timing analysis can be done with system.time() method to help with performance analysis
Beyond that, there are some candidates you might like to consider to improve performance - but importantly, it is important to get the timings first so that you have something to compare against.
the dplyr library contains a mutate function which can be used to create new columns, e.g. mynewtablewithextracolumn <- mutate(table, v1 = whatever you want it to be). In the previous statement, simply insert how to calculate each column value where v1 is a new column. There are lots of examples on the internet.
In order to use dplyr, you would need to perform a call to library(dplyr) in your code.
You may need to install.packages("dplyr") if not already installed.
In order to use dplyr, you might be best converting your table into the appropriate type of table for dplyr, e.g. if your current table is data frame, then use table = tbl_df(df) to create a table
As noted, these are just some possible areas. The important thing is to get timings and explore the performance to try to get a handle on where the best place to focus is and to make sure you can measure the performance improvement.
Thanks for the comments. I think I've found way to improve this.
I used melt in "reshape" to firstly convert my input table to
chr, ref, alt, variable
I can then use apply to modify "variable", each row for which contains a concatenated string. This achieves good speed.

Avoiding redundancy when selecting rows in a data frame

My code is littered with statements of the following taste:
selected <- long_data_frame_name[long_data_frame_name$col1 == "condition1" &
long_data_frame_name$col2 == "condition2" & !is.na(long_data_frame_name$col3),
selected_columns]
The repetition of the data frame name is tedious and error-prone. Is there a way to avoid it?
You can use with
For instance
sel.ID <- with(long_data_frame_name, col1==2 & col2<0.5 & col3>0.2)
selected <- long_data_frame_name[sel.ID, selected_columns]
Several ways come to mind.
If you think about it, you are subsetting your data. Hence use the subset function (base package):
your_subset <- subset(long_data_frame_name,
col1 == "cond1" & "cond2" == "cond2" & !is.na(col3),
select = selected_columns)
This is in my opinion the most "talking" code to accomplish your task.
Use data tables.
library(data.table)
long_data_table_name = data.table(long_data_frame_name, key="col1,col2,col3")
selected <- long_data_table_name[col1 == "condition1" &
col2 == "condition2" &
!is.na(col3),
list(col4,col5,col6,col7)]
You don't have to set the key in the data.table(...) call, but if you have a large dataset, this will be much faster. Either way it will be much faster than using data frames. Finally, using J(...), as below, does require a keyed data.table, but is even faster.
selected <- long_data_table_name[J("condition1","condition2",NA),
list(col4,col5,col6,col7)]
You have several possibilities:
attach which adds the variables of the data.frame to the search path just below the global environment. Very useful for code demonstrations but I warn you not to do that programmatically.
with which creates a whole new environment temporarilly.
In very limited cases you want to use other options such as within.
df = data.frame(random=runif(100))
df1 = with(df,log(random))
df2 = within(df,logRandom <- log(random))
within will examine the created environment after evaluation and add the modifications to the data. Check the help of with to see more examples. with will just evaluate you expression.

Resources