Replace values in data frame based on a table in R

Data Frame:
set.seed(90)
df <- data.frame(id = 1:10, values = round(rnorm(10),1))
id values
1 1 0.1
2 2 -0.2
3 3 -0.9
4 4 -0.7
5 5 0.7
6 6 0.4
7 7 1.0
8 8 0.9
9 9 -0.6
10 10 2.4
Table:
table <- data.frame(values = c(-2.0001,-1.0023,0.0005,1.0002,2.00009), final_values = round(rnorm(5),2))
values final_values
1 -2.00010 -0.81
2 -1.00230 -0.08
3 0.00050 0.87
4 1.00020 1.66
5 2.00009 -0.24
I need to replace each value in the data frame with the final_values entry of the table row whose values entry is the closest match.
Final Output:
id final_values
1 1 0.87
2 2 0.87
3 3 -0.08
4 4 -0.08
5 5 1.66
6 6 0.87
7 7 1.66
8 8 1.66
9 9 -0.08
10 10 -0.24
What is the best way to do this with base R?

Here is one way; you can assign the result back to df:
sapply(df$values, function(x) table$final_values[which.min(abs(x - table$values))])
[1] 0.87 0.87 -0.08 -0.08 1.66 0.87 1.66 1.66 -0.08 -0.24
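The same nearest-value lookup can also be written without calling a function once per element. This is only a sketch of one possible variant, not part of the answer above: it hard-codes the values printed in the question (taking the table's second entry as -1.0023, which is what the expected output implies), and the names `x`, `lookup`, `dist` and `nearest` are made up for the illustration.

```r
# Values as printed in the question, hard-coded so no random numbers are involved
x <- c(0.1, -0.2, -0.9, -0.7, 0.7, 0.4, 1.0, 0.9, -0.6, 2.4)
lookup <- data.frame(
  values       = c(-2.0001, -1.0023, 0.0005, 1.0002, 2.00009),
  final_values = c(-0.81, -0.08, 0.87, 1.66, -0.24)
)

# Distance from every element of x (rows) to every lookup value (columns)
dist <- abs(outer(x, lookup$values, "-"))

# Column index of the smallest distance in each row
nearest <- apply(dist, 1, which.min)

result <- lookup$final_values[nearest]
result
```

Note that outer() builds a length(x) by nrow(lookup) matrix, so this trades memory for fewer function calls; for a very long lookup table the sapply version may be the safer choice.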

Related

Method in R to find difference between rows with varying row spacing

I want to add an extra column in a dataframe which displays the difference between certain rows, where the distance between the rows also depends on values in the table.
I found out that:
mutate(Col_new = Col_1 - lead(Col_1, n = x))
can find the difference for a fixed n, but only a single integer can be used as input. How would you find the difference between rows when the distance between the rows varies?
I am trying to get the output in Col_new, which is the difference between the i and i+n row where n should take the value in column Count. (The data is rounded so there might be 0.01 discrepancies in Col_new).
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
Data:
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
Count = sort(rep(1:4, 5)))
Some code that generates the intended output, but can undoubtedly be made more efficient:
library(dplyr)
df %>%
  mutate(col_2 = sapply(1:4, function(s){lead(Col_1, n = s)})) %>%
  rowwise() %>%
  mutate(Col_new = Col_1 - col_2[Count]) %>%
  select(-col_2)
Output:
# A tibble: 20 × 3
# Rowwise:
Col_1 Count Col_new
<dbl> <int> <dbl>
1 0.9 1 -0.68
2 1.58 1 -0.310
3 1.89 1 0.0500
4 1.84 1 0.27
5 1.57 1 0.27
6 1.3 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.8
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.0100
13 5.04 3 0.600
14 4.99 3 0.600
15 4.71 3 0.0100
16 4.44 4 -1.84
17 4.39 4 NA
18 4.7 4 NA
19 5.38 4 NA
20 6.28 4 NA
library(dplyr)
# Inside mutate() the columns can be referenced directly, without df$
df %>% mutate(Col_new = case_when(
  count == 1 ~ col_1 - lead(col_1, n = 1),
  count == 2 ~ col_1 - lead(col_1, n = 2),
  count == 3 ~ col_1 - lead(col_1, n = 3),
  count == 4 ~ col_1 - lead(col_1, n = 4),
  count == 5 ~ col_1 - lead(col_1, n = 5)
))
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
This gives you the desired result, but it does not scale well: every distinct count needs its own case_when branch, so with 10 or more different counts another solution is required.
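One way to avoid writing a branch per count is to compute the target row number directly and index with it. The sketch below uses base R only and assumes the Col_1/Count data frame from the first answer; the helper name `target` is made up for the illustration.

```r
df <- data.frame(
  Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
            1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
            4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
  Count = rep(1:4, each = 5)
)

# Row i is compared with row i + Count[i]
target <- seq_len(nrow(df)) + df$Count

# Indexing past the end of a vector yields NA, which mirrors lead()'s padding
df$Col_new <- df$Col_1 - df$Col_1[target]
df
```

This works for any number of distinct counts, because the offset is read from the column instead of being spelled out in the code.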

Creating new variable in wide data format, R

I have transformed my data into a wide format using the mlogit.data function in order to be able to perform an mlogit multinomial logit regression in R. The data has three different "choices" and looks like this (in its wide format):
Observation Choice Variable A Variable B Variable C
1 1 1.27 0.2 0.81
1 0 1.27 0.2 0.81
1 -1 1.27 0.2 0.81
2 1 0.20 0.45 0.70
2 0 0.20 0.45 0.70
2 -1 0.20 0.45 0.70
However, as the variables A, B and C are linked to the different outcomes I would now like to create a new variable that looks like this:
Observation Choice Variable A Variable B Variable C Variable D
1 1 1.27 0.2 0.81 1.27
1 0 1.27 0.2 0.81 0.2
1 -1 1.27 0.2 0.81 0.81
2 1 0.20 0.45 0.70 0.20
2 0 0.20 0.45 0.70 0.45
2 -1 0.20 0.45 0.70 0.70
I have tried the following code:
Variable D <- ifelse(Choice == "1", Variable A, ifelse(Choice == "-1", Variable B, Variable C))
However, the ifelse function only considers one choice from each observation, creating this:
Observation Choice Variable A Variable B Variable C Variable D
1 1 1.27 0.2 0.81 1.27
1 0 1.27 0.2 0.81 -
1 -1 1.27 0.2 0.81 -
2 1 0.20 0.45 0.70 -
2 0 0.20 0.45 0.70 0.2
2 -1 0.20 0.45 0.70 -
Anyone know how to solve this?
Thanks!
You can create a table mapping choices to column positions and then use match to build a (row, column) index:
choice_map <-
  data.frame(choice = c(1, 0, -1), var = grep('Variable[A-C]', names(df)))
#   choice var
# 1      1   3
# 2      0   4
# 3     -1   5
df$VariableD <-
  df[cbind(seq_len(nrow(df)), with(choice_map, var[match(df$Choice, choice)]))]
df
# Observation Choice VariableA VariableB VariableC VariableD
# 1 1 1 1.27 0.20 0.81 1.27
# 2 1 0 1.27 0.20 0.81 0.20
# 3 1 -1 1.27 0.20 0.81 0.81
# 4 2 1 0.20 0.45 0.70 0.20
# 5 2 0 0.20 0.45 0.70 0.45
# 6 2 -1 0.20 0.45 0.70 0.70
Data used (removed spaces in colnames)
df <- data.table::fread('
Observation Choice VariableA VariableB VariableC
1 1 1.27 0.2 0.81
1 0 1.27 0.2 0.81
1 -1 1.27 0.2 0.81
2 1 0.20 0.45 0.70
2 0 0.20 0.45 0.70
2 -1 0.20 0.45 0.70
', data.table = FALSE)
df$`Variable D` <- sapply(seq_len(nrow(df)), function(x) {
  # Choice 1, 0, -1 maps to column 3, 4, 5 (Variable A, B, C)
  df[x, 4 - df$Choice[x]]
})
> df
Observation Choice Variable A Variable B Variable C Variable D
1 1 1 1.27 0.20 0.81 1.27
2 1 0 1.27 0.20 0.81 0.20
3 1 -1 1.27 0.20 0.81 0.81
4 2 1 0.20 0.45 0.70 0.20
5 2 0 0.20 0.45 0.70 0.45
6 2 -1 0.20 0.45 0.70 0.70
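For completeness, the nested ifelse from the question can also be made to work. ifelse is vectorised, so it does handle every row; the attempt maps -1 to Variable B, whereas the desired output pairs 0 with B and -1 with C. A minimal sketch, assuming the column names have had their spaces removed and the data is rebuilt by hand:

```r
df <- data.frame(
  Observation = c(1, 1, 1, 2, 2, 2),
  Choice      = c(1, 0, -1, 1, 0, -1),
  VariableA   = c(1.27, 1.27, 1.27, 0.20, 0.20, 0.20),
  VariableB   = c(0.20, 0.20, 0.20, 0.45, 0.45, 0.45),
  VariableC   = c(0.81, 0.81, 0.81, 0.70, 0.70, 0.70)
)

# ifelse() picks element-wise: 1 -> A, 0 -> B, anything else (-1) -> C
df$VariableD <- with(df, ifelse(Choice == 1, VariableA,
                         ifelse(Choice == 0, VariableB, VariableC)))
df
```

With only three choices this stays readable; with many more, the mapping-table approach is easier to maintain.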

Find position of elements of a dataframe inside other dataframe with R

I have the following dataframe (DF_A):
PARTY_ID PROBS_3001 PROBS_3002 PROBS_3003 PROBS_3004 PROBS_3005 PROBS_3006 PROBS_3007 PROBS_3008
1: 1000000 0.03 0.58 0.01 0.42 0.69 0.98 0.55 0.96
2: 1000001 0.80 0.37 0.10 0.95 0.77 0.69 0.23 0.07
3: 1000002 0.25 0.73 0.79 0.83 0.24 0.82 0.81 0.01
4: 1000003 0.10 0.96 0.53 0.59 0.96 0.10 0.98 0.76
5: 1000004 0.36 0.87 0.76 0.03 0.95 0.40 0.53 0.89
6: 1000005 0.15 0.78 0.24 0.21 0.03 0.87 0.67 0.64
And I have this other dataframe (DF_B):
V1 V2 V3 V4 PARTY_ID
1 0.58 0.69 0.96 0.98 1000000
2 0.69 0.77 0.80 0.95 1000001
3 0.79 0.81 0.82 0.83 1000002
4 0.76 0.96 0.96 0.98 1000003
5 0.76 0.87 0.89 0.95 1000004
6 0.64 0.67 0.78 0.87 1000005
I need to find the position of the elements of the DF_A in the DF_B to have something like this:
PARTY_ID P1 P2 P3 P4
1 1000000 3 6 9 7
...
Currently I'm working with match function but it takes a lot of time (I have 400K rows). I'm doing this:
i <- 1
while(i < nrow(DF_A)){
position <- match(DF_B[i,],DF_A[i,])
i <- i + 1
}
Although it works, it's very slow and I know that it's not the best answer to my problem. Can anyone help me please??
You can merge and then use Map in a by-group operation:
library(data.table)
df_a2 <- df_a[setDT(df_b), on = "PARTY_ID"]
df_a3 <- df_a2[, c(PARTY_ID,
                   Map(f = function(x, y) which(x == y),
                       x = list(.SD[, names(df_a), with = FALSE]),
                       y = .SD[, paste0("V", 1:4), with = FALSE])),
               by = 1:nrow(df_a2)]
setnames(df_a3, paste0("V", 1:5), c("PARTY_ID", paste0("P", 1:4)))[, nrow := NULL]
df_a3
# PARTY_ID P1 P2 P3 P4
#1: 1000000 3 6 9 7
#2: 1000001 7 6 2 5
#3: 1000002 4 8 7 5
#4: 1000003 9 3 3 8
#5: 1000003 9 6 6 8
#6: 1000004 4 3 9 6
#7: 1000005 9 8 3 7
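If only the first matching position per value is needed, the per-row loop can be replaced by whole-matrix comparisons in base R. The following is a sketch under a few assumptions: it uses just the first two rows of the question's data, compares doubles with exact `==` (fine for values typed identically, fragile for computed ones), and the helper `first_pos` is a made-up name. Unlike the data.table answer, it reports one position per value, so a row such as 1000003 is not duplicated.

```r
df_a <- data.frame(
  PARTY_ID   = c(1000000, 1000001),
  PROBS_3001 = c(0.03, 0.80), PROBS_3002 = c(0.58, 0.37),
  PROBS_3003 = c(0.01, 0.10), PROBS_3004 = c(0.42, 0.95),
  PROBS_3005 = c(0.69, 0.77), PROBS_3006 = c(0.98, 0.69),
  PROBS_3007 = c(0.55, 0.23), PROBS_3008 = c(0.96, 0.07)
)
df_b <- data.frame(
  V1 = c(0.58, 0.69), V2 = c(0.69, 0.77),
  V3 = c(0.96, 0.80), V4 = c(0.98, 0.95),
  PARTY_ID = c(1000000, 1000001)
)

# Put the rows of df_b in the same PARTY_ID order as df_a
b <- df_b[match(df_a$PARTY_ID, df_b$PARTY_ID), ]
probs <- as.matrix(df_a[, -1])

first_pos <- function(v, m) {
  hit <- m == v  # v is recycled down each column, so row i is compared with v[i]
  pos <- max.col(hit * 1, ties.method = "first") + 1L  # +1 skips the PARTY_ID column
  pos[rowSums(hit) == 0] <- NA                         # no match in that row
  pos
}

res <- data.frame(PARTY_ID = df_a$PARTY_ID,
                  sapply(b[paste0("V", 1:4)], first_pos, m = probs))
names(res)[-1] <- paste0("P", 1:4)
res
```

Each call to first_pos handles one V column for all rows at once, so the work is four matrix comparisons rather than one match call per row.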
Here is an example on 1 million rows with two columns. It takes 14 ms on my computer.
library(data.table)
# create data tables with matching ids but in different positions
x <- as.data.table(data.frame(id=sample(c(1:1000000), 1000000, replace=FALSE), y=sample(LETTERS, 1000000, replace=TRUE)))
y <- as.data.table(data.frame(id=sample(c(1:1000000), 1000000, replace=FALSE), z=sample(LETTERS, 1000000, replace=TRUE)))
# add column to both data tables which will store the position in x and y
x$x_row_nr <- 1:nrow(x)
y$y_row_nr <- 1:nrow(y)
# set key in both data frames using matching columns name
setkey(x, "id")
setkey(y, "id")
# merge data tables into one
z <- merge(x,y)
# now you can use this to look up the position in the y data table
# of the 100th record of the x data table
z[x_row_nr==100, y_row_nr]
z will contain the matching rows from both datasets with their columns attached.
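When only the row position is needed and not the merged columns, base R's match gives the same lookup in one call. A small sketch on 1,000 rows instead of a million; the seed and the size `n` are arbitrary choices for the illustration:

```r
set.seed(1)
n <- 1000
x <- data.frame(id = sample(n), y = sample(LETTERS, n, replace = TRUE))
y <- data.frame(id = sample(n), z = sample(LETTERS, n, replace = TRUE))

# For each row of x, the row number in y that holds the same id
x$y_row_nr <- match(x$id, y$id)

# Row position in y of the 100th record of x
x$y_row_nr[100]
```

match returns NA for ids that do not occur in y, and only the first position when an id occurs more than once.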

r check and replace stuck data

There are two sensors. The collected data should change over time. How can I identify when the data is stuck and replace it with data from the other sensor?
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
so the data has
d
a b c
1 0.1 0.05
2 0.2 0.20
3 0.3 0.30
4 0.4 0.40
5 0.5 0.40
6 0.6 0.40
7 0.7 0.40
8 0.8 0.40
9 0.9 0.40
10 1.0 0.40
11 1.1 0.40
12 1.2 0.40
13 1.3 0.40
14 1.4 0.40
15 1.5 0.40
16 1.6 0.40
17 1.7 0.40
18 1.8 0.40
19 1.9 0.40
20 2.0 0.40
21 2.1 0.40
22 2.2 2.20
23 2.3 2.30
24 2.4 2.40
Sensor c is stuck at 0.4 from time a = 4 to a = 21. Is there a quick way to identify this and replace the stuck part using data from sensor b?
The new column c_updated is what you want. I've created some helpful columns (c_previous and c_is_stuck) which you can remove if you want.
library(dplyr)
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d %>%
mutate(c_previous = lag(c, default = 0), # get previous measurement for sensor c
c_is_stuck = ifelse(c == c_previous, 1 ,0), # flag stuck for sensor c when current measurement is same as previous one
c_updated = ifelse(c_is_stuck == 1, b, c)) # if sensor c is stuck use measurement from sensor b
# a b c c_previous c_is_stuck c_updated
# 1 1 0.1 0.05 0.00 0 0.05
# 2 2 0.2 0.20 0.05 0 0.20
# 3 3 0.3 0.30 0.20 0 0.30
# 4 4 0.4 0.40 0.30 0 0.40
# 5 5 0.5 0.40 0.40 1 0.50
# 6 6 0.6 0.40 0.40 1 0.60
# 7 7 0.7 0.40 0.40 1 0.70
# 8 8 0.8 0.40 0.40 1 0.80
# 9 9 0.9 0.40 0.40 1 0.90
# 10 10 1.0 0.40 0.40 1 1.00
# 11 11 1.1 0.40 0.40 1 1.10
# 12 12 1.2 0.40 0.40 1 1.20
# 13 13 1.3 0.40 0.40 1 1.30
# 14 14 1.4 0.40 0.40 1 1.40
# 15 15 1.5 0.40 0.40 1 1.50
# 16 16 1.6 0.40 0.40 1 1.60
# 17 17 1.7 0.40 0.40 1 1.70
# 18 18 1.8 0.40 0.40 1 1.80
# 19 19 1.9 0.40 0.40 1 1.90
# 20 20 2.0 0.40 0.40 1 2.00
# 21 21 2.1 0.40 0.40 1 2.10
# 22 22 2.2 2.20 0.40 0 2.20
# 23 23 2.3 2.30 2.20 0 2.30
# 24 24 2.4 2.40 2.30 0 2.40
This is a pretty simple way. Duplicate the c column with an offset of 1 and check if the two values are identical. If so, take the value from b.
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d$d <- c(NA, d$c[1:23])
d$replaced <- ifelse(d$c == d$d, d$b, d$c)
a b c d replaced
1 1 0.1 0.05 NA NA
2 2 0.2 0.20 0.05 0.2
3 3 0.3 0.30 0.20 0.3
4 4 0.4 0.40 0.30 0.4
5 5 0.5 0.40 0.40 0.5
6 6 0.6 0.40 0.40 0.6
7 7 0.7 0.40 0.40 0.7
8 8 0.8 0.40 0.40 0.8
9 9 0.9 0.40 0.40 0.9
10 10 1.0 0.40 0.40 1.0
11 11 1.1 0.40 0.40 1.1
12 12 1.2 0.40 0.40 1.2
13 13 1.3 0.40 0.40 1.3
14 14 1.4 0.40 0.40 1.4
15 15 1.5 0.40 0.40 1.5
16 16 1.6 0.40 0.40 1.6
17 17 1.7 0.40 0.40 1.7
18 18 1.8 0.40 0.40 1.8
19 19 1.9 0.40 0.40 1.9
20 20 2.0 0.40 0.40 2.0
21 21 2.1 0.40 0.40 2.1
22 22 2.2 2.20 0.40 2.2
23 23 2.3 2.30 2.20 2.3
24 24 2.4 2.40 2.30 2.4
The solution below is about as basic as it gets, I think. No additional packages required. Cheers!
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d$diff.b <- c(NA, diff(d$b))
d$diff.c <- c(NA, diff(d$c))
stuck.index <- which(d$diff.c==0)
d[stuck.index, "c"] <- d[stuck.index, "b"]
# changing to original data frame format
d$diff.b <- NULL
d$diff.c <- NULL
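The approaches above flag a reading as stuck as soon as it equals the previous one, which would also catch a sensor that legitimately repeats a value once. A possible refinement is to require a minimum run length using rle. This is a sketch only; the threshold `min_run` is an assumed parameter, not something given in the question, and unlike the answers above it also replaces the first reading of the stuck run.

```r
a <- 1:24
b <- seq(0.1, 2.4, 0.1)
c <- c(0.05, 0.2, 0.3, rep(0.4, 18), 2.2, 2.3, 2.4)
d <- data.frame(a, b, c)

min_run <- 3  # assumed: at least 3 identical consecutive readings count as stuck

# Run-length encode sensor c, then expand the per-run flag back to one flag per row
runs  <- rle(d$c)
stuck <- rep(runs$lengths >= min_run, times = runs$lengths)

# Use sensor b wherever sensor c is flagged as stuck
d$c_fixed <- ifelse(stuck, d$b, d$c)
d[stuck, ]
```

rle compares values with exact equality, which suits a sensor that repeats the very same reading; noisy data would need a tolerance instead.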

Obtaining Probabilities in KNN Classifier in R

I have the following data set:
TRAIN dataset
Sr A B C XX
1 0.09 0.52 11.1 high
2 0.13 0.25 11.1 low
3 0.20 0.28 11.1 high
4 0.29 0.50 11.1 low
5 0.31 0.58 11.1 high
6 0.32 0.37 11.1 high
7 0.37 0.58 11.1 low
8 0.38 0.40 11.1 low
9 0.42 0.65 11.1 high
10 0.42 0.79 11.1 low
11 0.44 0.34 11.1 high
12 0.45 0.89 11.1 low
13 0.57 0.72 11.1 low
TEST dataset
Sr A B C XX
1 0.54 1.36 9.80 low
2 0.72 0.82 9.80 low
3 0.19 0.38 9.90 high
4 0.25 0.44 9.90 high
5 0.29 0.54 9.90 high
6 0.30 0.54 9.90 high
7 0.42 0.86 9.90 low
8 0.44 0.86 9.90 low
9 0.49 0.66 9.90 low
10 0.54 0.76 9.90 low
11 0.54 0.76 9.90 low
12 0.68 1.08 9.90 low
13 0.88 0.51 9.90 high
Sr : Serial Number
A-C : Parameters
XX : Output Binary Parameter
I am trying to use the KNN classifier to develop a predictor model with 5 nearest neighbors. Following is the code that I have written:
train_input <- as.matrix(train[,-ncol(train)])
train_output <- as.factor(train[,ncol(train)])
test_input <- as.matrix(test[,-ncol(test)])
prediction <- knn(train_input, test_input, train_output, k=5, prob=TRUE)
resultdf <- as.data.frame(cbind(test[,ncol(test)], prediction))
colnames(resultdf) <- c("Actual","Predicted")
RESULT dataset
A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
I have the following concerns:
What should I do to obtain probability values? Is this a probability of getting high or low i.e. P(high) or P(low)?
The levels are set to 1 (high) and 2 (low), which is based on the order of first appearance. If low appeared before high in the train dataset, it would have a value 1. I feel this is not good practice. Is there any way I can avoid this?
If there were more classes (more than 2) in the classifier, how would I handle this in the classifier?
I am using the class and e1071 library.
Thanks.
Utility function built before the "text" argument to scan was introduced:
rd.txt <- function (txt, header = TRUE, ...) {
  tconn <- textConnection(txt)
  rd <- read.table(tconn, header = header, ...)
  close(tconn)
  rd
}
RESULT <- rd.txt(" A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
")
> prop.table(table(RESULT))
P
A 1 2
1 0.15385 0.23077
2 0.07692 0.53846
You can also set up prop.table to deliver row or column proportions (AKA probabilities).
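On the first two concerns in the question: with prob = TRUE, knn from the class package attaches a "prob" attribute holding the proportion of votes for the winning class, so it is P(predicted class), not always P(high). Factor levels are sorted alphabetically by default rather than by first appearance, and passing levels explicitly removes any doubt. A sketch under some assumptions: only columns A and B are used as predictors (C is constant in the training data), the data is typed in by hand, and `win_share` and `p_high` are made-up names.

```r
library(class)  # recommended package that ships with standard R installations

train <- data.frame(
  A  = c(0.09, 0.13, 0.20, 0.29, 0.31, 0.32, 0.37, 0.38, 0.42, 0.42, 0.44, 0.45, 0.57),
  B  = c(0.52, 0.25, 0.28, 0.50, 0.58, 0.37, 0.58, 0.40, 0.65, 0.79, 0.34, 0.89, 0.72),
  XX = c("high", "low", "high", "low", "high", "high", "low",
         "low", "high", "low", "high", "low", "low")
)
test <- data.frame(
  A = c(0.54, 0.72, 0.19, 0.25, 0.29, 0.30, 0.42, 0.44, 0.49, 0.54, 0.54, 0.68, 0.88),
  B = c(1.36, 0.82, 0.38, 0.44, 0.54, 0.54, 0.86, 0.86, 0.66, 0.76, 0.76, 1.08, 0.51)
)

# State the level order explicitly instead of relying on the default
cl <- factor(train$XX, levels = c("high", "low"))

set.seed(1)  # knn() breaks ties at random
pred <- knn(train[, c("A", "B")], test, cl, k = 5, prob = TRUE)

# Vote share of the predicted (winning) class for each test row
win_share <- attr(pred, "prob")

# With two classes, the share for "high" is the complement whenever "low" won
p_high <- ifelse(pred == "high", win_share, 1 - win_share)
data.frame(predicted = pred, win_share, p_high)
```

The complement trick only holds for two classes; with more classes the "prob" attribute still reports the winning class's share, but the shares of the other classes cannot be recovered from it.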
