Method in R to find difference between rows with varying row spacing

I want to add an extra column in a dataframe which displays the difference between certain rows, where the distance between the rows also depends on values in the table.
I found that
mutate(Col_new = Col_1 - lead(Col_1, n = x))
finds the difference for a fixed n, but only an integer can be used as input. How would you find the difference between rows when the distance between them varies?
I am trying to get the output in Col_new, which is the difference between row i and row i + n, where n takes the value in column Count. (The data is rounded, so there may be 0.01 discrepancies in Col_new.)
Col_1 Count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA

Data:
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
                           1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
                           4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
                 Count = sort(rep(1:4, 5)))
Some code that generates the intended output, but can undoubtedly be made more efficient:
library(dplyr)
df %>%
  mutate(col_2 = sapply(1:4, function(s) lead(Col_1, n = s))) %>% # matrix column holding leads 1 to 4
  rowwise() %>%
  mutate(Col_new = Col_1 - col_2[Count]) %>% # pick the lead matching this row's Count
  select(-col_2)
Output:
# A tibble: 20 × 3
# Rowwise:
Col_1 Count Col_new
<dbl> <int> <dbl>
1 0.9 1 -0.68
2 1.58 1 -0.310
3 1.89 1 0.0500
4 1.84 1 0.27
5 1.57 1 0.27
6 1.3 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.8
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.0100
13 5.04 3 0.600
14 4.99 3 0.600
15 4.71 3 0.0100
16 4.44 4 -1.84
17 4.39 4 NA
18 4.7 4 NA
19 5.38 4 NA
20 6.28 4 NA

df %>% mutate(Col_new = case_when(
  Count == 1 ~ Col_1 - lead(Col_1, n = 1),
  Count == 2 ~ Col_1 - lead(Col_1, n = 2),
  Count == 3 ~ Col_1 - lead(Col_1, n = 3),
  Count == 4 ~ Col_1 - lead(Col_1, n = 4),
  Count == 5 ~ Col_1 - lead(Col_1, n = 5)
))
Col_1 Count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
This would give you your desired results, but it is not a good solution for more cases: with 10 or more different counts, another solution is required.
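A more general alternative is to index Col_1 directly, since subscripting an R vector past its end returns NA, which yields the NA tail for free. A minimal base-R sketch, assuming Count holds non-negative integers:
df$Col_new <- df$Col_1 - df$Col_1[seq_len(nrow(df)) + df$Count] # out-of-range indices give NA
The same idea works inside mutate() as Col_new = Col_1 - Col_1[row_number() + Count].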


Avoid repeated unquoting in dplyr non-standard evaluation

Suppose we have the following data:
tib <- tibble::tibble(x = 1:10)
Then, suppose we want to make a function that takes a column as input and returns a tibble with several added columns such as:
library(dplyr)
generate_transformations <- function(data, column){
  transform <- sym(column)
  data %>%
    mutate(
      sqrt = sqrt(!!transform),
      recip = 1 / !!transform,
      log = log(!!transform)
    )
}
# Usage is great:
tib %>%
generate_transformations('x')
# A tibble: 10 x 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
Now my question is, is there a way to avoid unquoting (!!) transform repeatedly?
Yes, I could, e.g., temporarily rename column and then rename it back after I am done, but that is not my interest in this question.
I am interested if there is a way to produce a variable that does not need the !!.
While it does not work, I was looking for something like:
generate_transformations <- function(data, column){
  transform <- !!sym(column) # cannot unquote here :(
  data %>%
    mutate(
      sqrt = sqrt(transform),
      recip = 1 / transform,
      log = log(transform)
    )
}
Convert the column name to a string, subset the vector from the data, and use transform directly:
generate_transformations <- function(data, column){
  transform <- data[[rlang::as_string(ensym(column))]]
  data %>%
    mutate(
      sqrt = sqrt(transform),
      recip = 1 / transform,
      log = log(transform)
    )
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
Or create a temporary column and remove it later:
generate_transformations <- function(data, column){
  data %>%
    mutate(transform = !!rlang::ensym(column),
           sqrt = sqrt(transform),
           recip = 1 / transform,
           log = log(transform),
           transform = NULL)
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
You can do it in one step if you swap !! for {{ }} and use across():
data_transformations <- function(d, col, funs = list(sqrt = sqrt, log = log, recip = ~ 1/.)) {
  d %>% mutate(across({{ col }}, .fns = funs))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x x_sqrt x_log x_recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To name the new columns by function alone, dropping the x_ prefix as in the original function, use:
data_transformations <- function(d, col, funs = list(sqrt = sqrt, log = log, recip = ~ 1/.)) {
  d %>% mutate(across({{ col }}, .fns = funs, .names = "{.fn}"))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x sqrt log recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To handle multiple columns:
data_transformations <- function(d, cols, funs = list(sqrt = sqrt, log = log, recip = ~ 1/.)) {
  d %>% mutate(across({{ cols }}, .fns = funs))
}
d1 <- tibble(x=1:10, y=seq(2, 20, 2))
d1 %>% data_transformations(c(x, y), list(sqrt=sqrt, log=log))
# A tibble: 10 × 6
x y x_sqrt x_log y_sqrt y_log
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 1.41 0.693
2 2 4 1.41 0.693 2 1.39
3 3 6 1.73 1.10 2.45 1.79
4 4 8 2 1.39 2.83 2.08
5 5 10 2.24 1.61 3.16 2.30
6 6 12 2.45 1.79 3.46 2.48
7 7 14 2.65 1.95 3.74 2.64
8 8 16 2.83 2.08 4 2.77
9 9 18 3 2.20 4.24 2.89
10 10 20 3.16 2.30 4.47 3.00

Using pivot_longer in tidyr with a complex separator [duplicate]

In a previous post here I tried to get the equivalent of an rbind using tidyr::pivot_longer(). This is the data and the solution.
set.seed(1)
df1 <- data.frame(group = rep(letters[1:2], each = 3),
                  day = rep(1:3, 2),
                  var1_mean = round(rnorm(6), 2),
                  var1_sd = round(rnorm(6, 5), 2),
                  var2_mean = round(rnorm(6), 2),
                  var2_sd = round(rnorm(6, 5), 2))
# group day var1_mean var1_sd var2_mean var2_sd
# 1 a 1 -0.63 5.49 -0.62 5.82
# 2 a 2 0.18 5.74 -2.21 5.59
# 3 a 3 -0.84 5.58 1.12 5.92
# 4 b 1 1.60 4.69 -0.04 5.78
# 5 b 2 0.33 6.51 -0.02 5.07
# 6 b 3 -0.82 5.39 0.94 3.01
df1 %>%
  pivot_longer(cols = starts_with('var'),
               names_to = c('grp', '.value'),
               names_sep = "_")
# group day grp mean sd
# <fct> <int> <chr> <dbl> <dbl>
# 1 a 1 var1 -0.63 5.49
# 2 a 1 var2 -0.62 5.82
# 3 a 2 var1 0.18 5.74
# 4 a 2 var2 -2.21 5.59
# 5 a 3 var1 -0.84 5.58
# 6 a 3 var2 1.12 5.92
# 7 b 1 var1 1.6 4.69
# 8 b 1 var2 -0.04 5.78
# 9 b 2 var1 0.33 6.51
# 10 b 2 var2 -0.02 5.07
# 11 b 3 var1 -0.82 5.39
# 12 b 3 var2 0.94 3.01
This solution is quite contingent on the naming convention used for the mean and sd variables. If there is a different naming convention, with a more complex separator between the two important nodes of the column names, like so...
df2 <- data.frame(group = rep(letters[1:2], each = 3),
                  day = rep(1:3, 2),
                  mean_var_1 = round(rnorm(6), 2),
                  sd_var_1 = round(rnorm(6, 5), 2),
                  mean_var_2 = round(rnorm(6), 2),
                  sd_var_2 = round(rnorm(6, 5), 2))
df2
# group day mean_var_1 sd_var_1 mean_var_2 sd_var_2
# 1 a 1 0.62 6.36 -0.39 5.70
# 2 a 2 -0.06 4.90 -0.06 5.56
# 3 a 3 -0.16 5.39 1.10 4.31
# 4 b 1 -1.47 4.95 0.76 4.29
# 5 b 2 -0.48 3.62 -0.16 5.36
# 6 b 3 0.42 4.59 -0.25 5.77
How would I achieve a similar result to the first example, with a single mean and sd column and with var_1 and var_2 as the grouping variable?
If you have complicated names, you can use the names_pattern argument, where you specify how each part of the column name should be used to get the data into long format.
tidyr::pivot_longer(df2,
                    cols = contains('var'),
                    names_to = c('.value', 'grp'),
                    names_pattern = '(.*?)_(.*)')
# group day grp mean sd
# <chr> <int> <chr> <dbl> <dbl>
# 1 a 1 var_1 0.62 6.36
# 2 a 1 var_2 -0.39 5.7
# 3 a 2 var_1 -0.06 4.9
# 4 a 2 var_2 -0.06 5.56
# 5 a 3 var_1 -0.16 5.39
# 6 a 3 var_2 1.1 4.31
# 7 b 1 var_1 -1.47 4.95
# 8 b 1 var_2 0.76 4.29
# 9 b 2 var_1 -0.48 3.62
#10 b 2 var_2 -0.16 5.36
#11 b 3 var_1 0.42 4.59
#12 b 3 var_2 -0.25 5.77
'(.*?)_(.*)' captures two groups: the first, (.*?), matches everything (non-greedily) up to the first underscore in the column name, and the second, (.*), matches everything after that underscore.
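To see how the pattern splits the names before pivoting, you can test it on the column names directly; this is only an illustrative base-R check:
nm <- c('mean_var_1', 'sd_var_1', 'mean_var_2', 'sd_var_2')
regmatches(nm, regexec('(.*?)_(.*)', nm))
Each list element holds the full match followed by the two captures, e.g. 'mean_var_1' splits into 'mean' and 'var_1', which feed the '.value' and 'grp' targets respectively.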

Cumulative sum based on factor on R

I have the following dataset, and I need to accumulate and sum Variable.1 while Factor is 0, carry the running total into the row where Factor != 0, and then reset the sum.
I've tried the loop below, but it didn't work at all.
for(i in dataset$Variable.1) {
  ifelse(dataset$Factor == 0,
         dataset$teste <- dataset$Variable.1 + i,
         dataset$teste <- dataset$Variable.1)
  i <- dataset$Variable.1
  print(i)
}
Any ideas?
Below is an example of the dataset. I wish to get the "Result" column.
In the real data, I also have a negative factor (-1).
Date Factor Variable.1 Result
1 03/02/2018 0 0.75 0.75
2 04/02/2018 0 0.75 1.50
3 05/02/2018 1 0.96 2.46
4 06/02/2018 1 0.76 0.76
5 07/02/2018 0 1.35 1.35
6 08/02/2018 1 0.70 2.05
7 09/02/2018 1 2.02 2.02
8 10/02/2018 0 0.00 0.00
9 11/02/2018 0 0.00 0.00
10 12/02/2018 0 0.20 0.20
11 13/02/2018 0 0.13 0.33
12 14/02/2018 0 1.64 1.97
13 15/02/2018 0 0.03 2.00
14 16/02/2018 1 0.51 2.51
15 17/02/2018 1 0.00 0.00
16 18/02/2018 0 0.00 0.00
17 19/02/2018 0 0.83 0.83
18 20/02/2018 1 0.42 1.25
19 21/02/2018 1 0.17 0.17
20 22/02/2018 1 0.97 0.97
21 23/02/2018 0 0.92 0.92
22 24/02/2018 0 0.00 0.92
23 25/02/2018 0 0.00 0.92
24 26/02/2018 1 0.19 1.11
25 27/02/2018 1 0.87 0.87
26 28/02/2018 1 0.85 0.85
27 01/03/2018 1 1.95 1.95
28 02/03/2018 1 0.54 0.54
29 03/03/2018 1 0.00 0.00
30 04/03/2018 0 0.00 0.00
31 05/03/2018 0 1.17 1.17
32 06/03/2018 1 0.25 1.42
33 07/03/2018 1 1.45 1.45
Thanks in advance.
If you want to stick with the for-loop, you can try this code (DF is the data frame from the question):
DF$Result <- NA
prev <- 0
for(i in seq_len(nrow(DF))){
  DF$Result[i] <- DF$Variable.1[i] + prev
  if(DF$Factor[i] == 1)
    prev <- 0 # reset the running total after a Factor == 1 row
  else
    prev <- DF$Result[i]
}
Iteratively, try something like:
a <- as.data.frame(cbind(Factor = c(0, 0, 1, 1, 0, 1, 1, rep(0, 3), 1),
                         Variable.1 = c(0.75, 0.75, 0.96, 0.71, 1.35, 0.7,
                                        0.75, 0.96, 0.71, 1.35, 0.7)))
Result <- 0
aux <- NULL
for (i in 1:nrow(a)){
  if (a$Factor[i] == 0){
    Result <- Result + a$Variable.1[i]
    aux <- c(aux, Result)
  } else {
    Result <- Result + a$Variable.1[i]
    aux <- c(aux, Result)
    Result <- 0 # reset after a Factor == 1 row
  }
}
a$Results <- aux
a
Factor Variable.1 Results
1 0 0.75 0.75
2 0 0.75 1.50
3 1 0.96 2.46
4 1 0.71 0.71
5 0 1.35 1.35
6 1 0.70 2.05
7 1 0.75 0.75
8 0 0.96 0.96
9 0 0.71 1.67
10 0 1.35 3.02
11 1 0.70 3.72
A possibility using tidyverse and data.table (rleid() comes from data.table):
library(dplyr)
library(data.table) # for rleid()
df %>%
  mutate(temp = ifelse(Factor == 1 & lag(Factor) == 1, NA, 1), # mark rows after the first 1 in Factor as NA
         temp = ifelse(!is.na(temp), rleid(temp), NA)) %>% # run-length id along non-NA values
  group_by(temp) %>% # group by run
  mutate(Result = ifelse(!is.na(temp), cumsum(Variable.1), Variable.1)) %>% # cumulative sum within each run
  ungroup() %>%
  select(-temp) # drop the helper variable
Date Factor Variable.1 Result
<chr> <int> <dbl> <dbl>
1 03/02/2018 0 0.750 0.750
2 04/02/2018 0 0.750 1.50
3 05/02/2018 1 0.960 2.46
4 06/02/2018 1 0.760 0.760
5 07/02/2018 0 1.35 1.35
6 08/02/2018 1 0.700 2.05
7 09/02/2018 1 2.02 2.02
8 10/02/2018 0 0. 0.
9 11/02/2018 0 0. 0.
10 12/02/2018 0 0.200 0.200
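A shorter grouped-cumsum variant is also possible; this is a sketch assuming the total resets after every Factor == 1 row, as in the expected output (for the real data with a -1 level, Factor != 0 could replace Factor == 1):
library(dplyr)
df %>%
  group_by(g = cumsum(lag(Factor == 1, default = FALSE))) %>% # a new group starts right after each Factor == 1 row
  mutate(Result = cumsum(Variable.1)) %>%
  ungroup() %>%
  select(-g)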

Replace values in data frame based on a table in R

Data Frame:
set.seed(90)
df <- data.frame(id = 1:10, values = round(rnorm(10),1))
id values
1 1 0.1
2 2 -0.2
3 3 -0.9
4 4 -0.7
5 5 0.7
6 6 0.4
7 7 1.0
8 8 0.9
9 9 -0.6
10 10 2.4
Table:
table <- data.frame(values = c(-2.0001, 1.0023, 0.0005, 1.0002, 2.00009),
                    final_values = round(rnorm(5), 2))
values final_values
1 -2.00010 -0.81
2 1.00230 -0.08
3 0.00050 0.87
4 1.00020 1.66
5 2.00009 -0.24
I need to replace the values in the data frame based on the closest match among the values in the table.
Final Output:
id final_values
1 1 0.87
2 2 0.87
3 3 -0.08
4 4 -0.08
5 5 1.66
6 6 0.87
7 7 1.66
8 8 1.66
9 9 -0.08
10 10 -0.24
What is the best way to do this with base R?
Here is a way; you can write the result back to df:
sapply(df$values, function(x) table$final_values[which.min(abs(x - table$values))])
[1] 0.87 0.87 -0.08 -0.08 1.66 0.87 1.66 1.66 -0.08 -0.24
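A fully vectorised base-R alternative is possible; this sketch builds the complete matrix of absolute differences with outer(), so it assumes both vectors are small enough to hold in memory:
dist <- abs(outer(df$values, table$values, "-")) # 10 x 5 matrix of absolute differences
df$final_values <- table$final_values[max.col(-dist)] # max of -dist picks the closest match per row
Note that max.col() breaks ties at random by default; pass ties.method = "first" for deterministic output.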

Obtaining Probabilities in KNN Classifier in R

I have the following the data set:
TRAIN dataset
Sr A B C XX
1 0.09 0.52 11.1 high
2 0.13 0.25 11.1 low
3 0.20 0.28 11.1 high
4 0.29 0.50 11.1 low
5 0.31 0.58 11.1 high
6 0.32 0.37 11.1 high
7 0.37 0.58 11.1 low
8 0.38 0.40 11.1 low
9 0.42 0.65 11.1 high
10 0.42 0.79 11.1 low
11 0.44 0.34 11.1 high
12 0.45 0.89 11.1 low
13 0.57 0.72 11.1 low
TEST dataset
Sr A B C XX
1 0.54 1.36 9.80 low
2 0.72 0.82 9.80 low
3 0.19 0.38 9.90 high
4 0.25 0.44 9.90 high
5 0.29 0.54 9.90 high
6 0.30 0.54 9.90 high
7 0.42 0.86 9.90 low
8 0.44 0.86 9.90 low
9 0.49 0.66 9.90 low
10 0.54 0.76 9.90 low
11 0.54 0.76 9.90 low
12 0.68 1.08 9.90 low
13 0.88 0.51 9.90 high
Sr : Serial Number
A-C : Parameters
XX : Output Binary Parameter
I am trying to use the KNN classifier to develop a predictor model with 5 nearest neighbors. Following is the code that I have written:
library(class) # for knn()
train_input <- as.matrix(train[, -ncol(train)])
train_output <- as.factor(train[, ncol(train)])
test_input <- as.matrix(test[, -ncol(test)])
prediction <- knn(train_input, test_input, train_output, k = 5, prob = TRUE)
resultdf <- as.data.frame(cbind(test[, ncol(test)], prediction))
colnames(resultdf) <- c("Actual", "Predicted")
RESULT dataset
A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
I have the following concerns:
What should I do to obtain probability values? Is this the probability of getting high or low, i.e. P(high) or P(low)?
The levels are set to 1 (high) and 2 (low), which is based on the order of first appearance. If low appeared before high in the train dataset, it would have the value 1. I feel this is not good practice. Is there any way I can avoid this?
If there were more classes (more than 2) in the classifier, how would I handle this in the classifier?
I am using the class and e1071 libraries.
Thanks.
Utility function built before the "text" argument to scan was introduced:
rd.txt <- function(txt, header = TRUE, ...) {
  tconn <- textConnection(txt)
  rd <- read.table(tconn, header = header, ...)
  close(tconn)
  rd
}
RESULT <- rd.txt(" A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
")
prop.table(table(RESULT))
P
A 1 2
1 0.15385 0.23077
2 0.07692 0.53846
You can also set up prop.table to deliver row or column proportions (AKA probabilities).
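For example (a sketch; margin = 1 conditions on rows, margin = 2 on columns):
prop.table(table(RESULT), margin = 1) # P(predicted | actual), rows sum to 1
prop.table(table(RESULT), margin = 2) # P(actual | predicted), columns sum to 1
Separately, since the call used prob = TRUE, class::knn() stores the proportion of votes for the winning class; attr(prediction, "prob") retrieves it, and it refers to each row's predicted class rather than a fixed level such as high.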
