Lookup tables in R

I have a tibble with a ton of data in it, but most importantly, I have a column that references a row in a lookup table by number (e.g. 1, 2, 3, etc.).
df <- tibble(ref = c(1,1,1,2,5),
             data = c(33,34,35,35,32))

lkup <- tibble(CurveID = c(1,2,3,4,5),
               Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
               Intercept = c(40,38,40,38,36),
               Min = c(25,25,21,21,18),
               Max = c(36,36,38,37,32))
I need to do a calculation for each row in the original tibble based on the information in the referenced row in the lookup table.
df$result <- df$data - lkup$Intercept[lkup$CurveID == df$ref]/lkup$Slope[lkup$CurveID == df$ref]
The idea is to pull the Slope or Intercept (etc.) value from the correct row of the lookup table based on the reference number in the data table, and to do this for each data point in the column. But I keep getting an error telling me my data isn't compatible and that my objects need to be of the same length.

You could also do it with match()
df$result <- df$data - lkup$Intercept[match(df$ref, lkup$CurveID)]/lkup$Slope[match(df$ref, lkup$CurveID)]
df$result
# [1] 43.52632 44.52632 45.52632 45.85714 42.90909
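To see what match() is doing here: it returns, for each ref, the row position of the matching CurveID, and those positions then index Intercept and Slope. A quick check with the tibbles defined above:
match(df$ref, lkup$CurveID)
# [1] 1 1 1 2 5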

You could use the dplyr package to join the tibbles together. If the ref column is given the same name as the CurveID column, then left_join() will combine the two tibbles by the matching rows.
library(dplyr)
df <- tibble(CurveID = c(1,1,1,2,5),
             data = c(33,34,35,35,32))

lkup <- tibble(CurveID = c(1,2,3,4,5),
               Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
               Intercept = c(40,38,40,38,36),
               Min = c(25,25,21,21,18),
               Max = c(36,36,38,37,32))
df <- df %>% left_join(lkup, by = "CurveID")
Then do the calculation on each row:
df <- df %>%
  mutate(result = data - (Intercept/Slope)) %>%
  select(CurveID, data, result)
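As a quick check, the result column from the join approach matches the match() output above:
df$result
# [1] 43.52632 44.52632 45.52632 45.85714 42.90909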

For completeness' sake, here's one way to literally do what OP was trying:
library(slider)
df %>%
  mutate(result = data - slide_dbl(ref, ~ slice(lkup, .x)$Intercept /
                                           slice(lkup, .x)$Slope))
though since slice goes by row number, this relies on CurveID equalling the row number (we make no reference to CurveID at all). You can write it differently with filter but it ends up being more code.
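For illustration, one way that filter() version might look (a sketch, not taken from the answer above; it matches on CurveID instead of relying on row position):
library(slider)
library(dplyr)

df %>%
  mutate(result = data - slide_dbl(ref, ~ {
    row <- filter(lkup, CurveID == .x)   # look up the row by CurveID
    row$Intercept / row$Slope
  }))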

Related

Mutate ifelse on a vector

Let's say I have this data frame:
library(tidyr)
library(dplyr)

set.seed(2)
df <- iris[c(1:5, 51:55, 101:105), ]
df_long <- gather(df, key = "flower_att", value = "measurement",
                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
df_long$setosa_sub <- sample(5, size = 60, replace = TRUE)
df_long$versicolor_sub <- sample(5, size = 60, replace = TRUE)
df_long$virginica_sub <- sample(5, size = 60, replace = TRUE)
df_long$sub_q <- 0
Now I want to copy a value into the sub_q variable based on the Species variable and the corresponding sub values.
I know how to do it one by one:
df_long2 <- df_long %>%
  mutate(sub_q = ifelse(Species == "setosa", setosa_sub, sub_q)) %>%
  mutate(sub_q = ifelse(Species == "versicolor", versicolor_sub, sub_q)) %>%
  mutate(sub_q = ifelse(Species == "virginica", virginica_sub, sub_q))
But I can't figure out the right way to apply this over a vector of the Species values instead.
species_vector <- c("setosa","versicolor","virginica")
I'm actually not sure if I need to write a new function or just loop it somehow. Hope it makes sense...
I don't see anything wrong with the way you are doing it. Another way, using an apply function (sapply in this case) would work like this:
# a helper function to find the right value for the xth row
get_correct_sub <- function(x){
  col_name <- paste0(df_long$Species[x], '_sub')
  df_long[[col_name]][x]
}

# apply each row index to the helper function
df_long2 <- df_long
df_long2$sub_q <- sapply(1:nrow(df_long), get_correct_sub)
The helper function adds "_sub" to the species name, treats that as a column name, and then gets the value for that column.
Here is a datastep() solution from the libr package. I created a vector lookup to map each Species to the desired column, then stepped through the data row by row, assigning the value using the lookup. Inside datastep(), data is the input dataset and n. is the current row number:
library(libr)

# Create vector lookup
species_vector <- c("setosa" = "setosa_sub",
                    "versicolor" = "versicolor_sub",
                    "virginica" = "virginica_sub")

# Step through data row by row, and assign value using lookup
df_long2 <- df_long %>%
  datastep({
    sub_q <- data[n., species_vector[Species]]
  })
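Another minimal option, if you specifically want to loop over species_vector as the question suggests (a simple sketch using the objects defined above):
species_vector <- c("setosa", "versicolor", "virginica")

df_long2 <- df_long
for (sp in species_vector) {
  rows <- df_long2$Species == sp
  df_long2$sub_q[rows] <- df_long2[[paste0(sp, "_sub")]][rows]
}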

Log Transform many variables in R with loop

I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns - column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end
I have tried using a for loop and the mutate command but, since I'm not very well versed in R, I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[, 2:166])){
  nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr))
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" refers to the variable name in the for loop, but I'm not sure how.
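For reference, the loop itself can be made to work by iterating over column names rather than columns and building each new name with paste0. A minimal sketch along those lines, assuming nutr_all_nomiss as described in the question:
nutr_all_log <- nutr_all_nomiss
for (nutr in names(nutr_all_nomiss)[2:166]) {
  nutr_all_log[[paste0(nutr, "_log")]] <- log(nutr_all_nomiss[[nutr]])
}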
For anyone encountering this page more recently: dplyr::across() was introduced in 2020 and is built for exactly this task, applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)

# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)

# create random data for example
data_values <- purrr::rerun(.n = 165,
                            sample(x = 100,
                                   size = 237,
                                   replace = TRUE))

# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
  set_names(variable_names) %>%
  as_tibble() %>%
  mutate(diagnosis = 1:237) %>%
  relocate(diagnosis, .before = everything())

# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = list(log10 = log10),
    .names = "{.col}_{.fn}"))

# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = log10)) %>%
  rename_with(.cols = contains('nutrient_variable'),
              .fn = ~ paste0(., '_log10'))

logs_only[1:5, 1:3]
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
  mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
  diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)

for (i in 2:166) {
  nutr_all[i] <- runif(n = 237, 1, 10)
  names(nutr_all)[i] <- paste0("nutrient_", i - 1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
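A quick sanity check on the result, using the example data built above:
dim(nutr_all_log)
# [1] 237 331
names(nutr_all_log)[c(1, 2, 167)]
# [1] "diagnosis"      "nutrient_1"     "nutrient_1_log"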
The function below, written with dplyr, will do your task: it log-transforms all numeric variables in the dataset. It also checks whether a column has negative values; in its current form it will not calculate the log for those columns.
logTransformation <- function(ds)
{
  # this function creates log transformations of a data frame,
  # but only for variables which are positive in nature
  # args:
  #   ds : Dataset
  require(dplyr)
  if (!is.data.frame(ds)) { stop("ds must be a data frame") }
  ds <- ds %>%
    dplyr::select_if(is.numeric)
  # keep only the positive variables
  varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
  ds <- ds %>%
    dplyr::select(all_of(varList)) %>%
    dplyr::mutate_at(
      setNames(varList, paste0(varList, "_log")), log)
  return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include variables with negative values too, replace varList inside the function with:
varList <- names(ds)

select highest pairs from complex table

I want to make a new dataframe from a selection of rows in a complex table of pairwise comparisons. I want to select the rows such that the two highest values of each pairwise comparison are selected.
Below is an example dataset:
dataframe <- data.frame(X1 = c("OP2413iiia","OP2413iiib","OP2413iiic","OP2645ii_a","OP2645ii_b","OP2645ii_c","OP2645ii_d","OP2645ii_e","OP3088i__a","OP5043___a","OP5043___b","OP5044___a","OP5044___b","OP5044___c","OP5046___a","OP5046___b","OP5046___c","OP5046___d","OP5046___e","OP5047___a","OP5047___b","OP5048___b","OP5048___c","OP5048___d","OP5048___e","OP5048___f","OP5048___g","OP5048___h","OP5049___a","OP5049___b","OP5051DNAa","OP5051DNAb","OP5051DNAc","OP5052DNAa","OP5053DNAa"),
gr1 = c("2","2","2","3","3","3","3","3","3","4","4","4","3","4","2","3","3","3","4","2","4","3","3","3","4","2","4","2","3","3","3","4","2","4","2"),
X2 = c("OP2413iiib","OP2413iiic","OP5046___a","OP2645ii_a","OP2645ii_a","OP2645ii_a","OP2645ii_b","OP2645ii_b","OP5046___a","OP2645ii_b","OP2645ii_c","OP2645ii_c","OP2645ii_c","OP2645ii_c","OP5048___e","OP2645ii_d","OP5046___a","OP2645ii_d","OP2645ii_d","OP2645ii_d","OP2645ii_d","OP2645ii_e","OP5048___e","OP2645ii_e","OP2645ii_e","OP2645ii_e","OP2645ii_e","OP2645ii_e","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a","OP3088i__a"),
gr2 = c("3","3","3","4","4","4","2","2","2","2","4","4","4","4","4","2","2","2","2","2","2","4","4","4","4","4","4","4","3","3","3","3","3","3","3"),
value = c("1.610613e+00","1.609732e+00","8.829263e-04","1.080257e+01","1.111006e+01","1.110978e+01","4.048302e+00","5.610458e+00","5.609584e+00","9.911490e+00","1.078518e+01","1.133728e+01","1.133686e+01","1.738092e+00","9.247411e+00","5.170646e+00","6.074909e+00","6.074287e+00","6.212711e+00","3.769029e+00","5.793390e+00","1.124045e+01","1.163326e+01","1.163293e+01","7.752766e-01","1.008434e+01","1.222854e+00","6.469443e+00","1.610828e+00","1.784774e+00","1.784235e+00","9.434803e+00","4.512563e+00","9.582847e+00","4.309312e+00"))
expected_output_dataframe <- rbind(dataframe[10,],dataframe[34,],dataframe[32,],dataframe[15,],dataframe[3,],dataframe[17,])
Many thanks in advance
Cheers
Here is a method using dplyr. I created an extra column, gr_pair, to identify the pairwise groups, and converted value to numeric so the ranking uses numbers rather than strings.
library(dplyr)
library(magrittr)
dataframe %>%
  filter(gr1 != gr2) %>%            # this case is excluded from your expected output
  mutate(value = as.numeric(value), # value is stored as character in the example data
         gr_pair = paste(pmin(gr1, gr2), pmax(gr1, gr2), sep = ",")) %>%
  group_by(gr_pair) %>%
  top_n(2, value)                   # keep the top two rows in each group, ranked by value
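As a side note, top_n() is superseded in current dplyr; the same selection can be written with slice_max(), a sketch over the same data:
library(dplyr)

dataframe %>%
  filter(gr1 != gr2) %>%
  mutate(value = as.numeric(value),
         gr_pair = paste(pmin(gr1, gr2), pmax(gr1, gr2), sep = ",")) %>%
  group_by(gr_pair) %>%
  slice_max(value, n = 2) %>%
  ungroup()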

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of variables of the same class (dbl). They belong to different settings. One variable (column in the tibble) is missing: the row sum of all variables belonging to one setting.
Aim:
My aim is to produce sub-datasets with the same data structure for each setting, including the row-sum variable (I call it "s1").
Problem:
Each setting has a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables, this is a typical case for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interesting setting (this works) and
(2) compute rowSums over the variables of that setting (this does not work; why?).
Because it is a function for a specially designed dataset, the function includes two predefined variables:
day - any day of the investigation period
N - the number of cases investigated on that day
Thank you for any help.
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  # print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!!subvars) %>%
    dplyr::mutate(s1 = rowSums(!!!subvars, na.rm = TRUE))
  return(dfplot)
}
We can convert the quosures to strings with as_name and use those names to subset the dataset for the rowSums:
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  v1 <- map_chr(subvars, as_name)
  # print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!!subvars) %>%
    dplyr::mutate(s1 = rowSums(.[v1], na.rm = TRUE))
  return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
#   day  N       col1      col2          s1
# 1   1 20 -0.5458808 0.4703824 -0.07549832
# 2   2 20  0.5365853 0.3756872  0.91227249
# 3   3 20  0.4196231 0.2725374  0.69216051
Or another option would be to select the quosures and then do the rowSums:
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  # print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!!subvars) %>%
    dplyr::mutate(s1 = dplyr::select(., !!!subvars) %>%
                    rowSums(na.rm = TRUE))
  return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
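With dplyr 1.0.0 or later, the same function can be written without quosures by forwarding ... straight to select() and letting across() pick up the setting columns for the row sums. A sketch of that variant, using the same df1 from the data block below:
library(dplyr)

mkr.sumsetting <- function(..., dataset){
  dataset %>%
    dplyr::select(day, N, ...) %>%
    dplyr::mutate(s1 = rowSums(dplyr::across(-c(day, N)), na.rm = TRUE))
}

mkr.sumsetting(col1, col2, dataset = df1)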
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))

How to work around an error while reshaping a data frame with spread()

I am trying to transform a long data frame into a wide one and flag cases. I pivot it and use a temporary vector that serves as a flag. It works perfectly on small data sets (see the example; copy and paste it into your RStudio), but when I try to do it on real data it reports an error:
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide data structure is needed for further processing.
Is there any workaround for this problem? I bet a lot of people try to clean data and run into the same problem.
Please help me.
Here is the code:
The first chunk, "example", makes a small data set to show how the result is supposed to look.
The second chunk, "real data", is a sliced portion of the data set from the churn library.
library(caret)
library(tidyr)
#example
#============
df <- data.frame(var1 = (1:6),
                 var2 = (7:12),
                 factors = c("facto1", "facto2", "facto3", "facto3", "facto5", "facto1"),
                 flags = c(1, 1, 1, 1, 1, 1))
df

df2 <- spread(data = df, key = "factors", value = flags, fill = " ")
df2
#=============
# real data
#============
data(churn)
str(churnTrain)
churnTrain <- churnTrain[1:250,1:4]
churnTrain$temporary <-1
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
str(churnTrain)
head(churnTrain3)
str(churnTrain3)
#============
Spread can only put one unique value in the 'cell' that intersects the spread 'key' and the rest of the data (in the churn example, account_length, area_code and international_plan). So the real question is how to manage these duplicate entries. The answer to that depends on what you are trying to do. I provide one possible solution below. Instead of making a dummy 'temporary' variable, I instead count the number of episodes and use that as the dummy variable. This can be done very easily with dplyr:
library(tidyr)
library(dplyr)
library(C50) # this is one source for the churn data
data(churn)
churnTrain <- churnTrain[1:250,1:4]
churnTrain2 <- churnTrain %>%
  group_by(state, account_length, area_code, international_plan) %>%
  tally() %>%
  dplyr::rename(temporary = n)

churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
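Worth noting: spread() is retired in current tidyr, and its replacement pivot_wider() can handle the duplicates directly through values_fn. A sketch over the same churnTrain subset, recreating the temporary flag:
library(dplyr)
library(tidyr)

churnTrain3 <- churnTrain %>%
  mutate(temporary = 1) %>%
  pivot_wider(names_from = state,
              values_from = temporary,
              values_fn = length,   # count duplicate rows instead of erroring
              values_fill = 0)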
As others point out, you need to give spread() rows with unique identifiers. My solution is to use base R:
library(C50)
f <- function(df, key){
  if (sum(names(df) == key) == 0) stop("No such key")
  u <- unique(df[[key]])
  id <- matrix(0, dim(df)[1], length(u))
  uu <- lapply(df[[key]], function(x) which(u == x)) ## check 43697442 for details
  for (i in 1:dim(df)[1]) id[i, uu[[i]]] <- 1
  colnames(id) <- as.character(u)
  return(cbind(df, id))
}
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1"))
f(df, key='fact')
f(df, key='factors')
data(churn)
churnTrain <- churnTrain[1:250,1:4]
f(churnTrain, key='state')
Although you may see a for loop and other temporary variables inside the f function, it is actually not slow.
