I want to replace the 0s in my dataset, using sample to randomly select a value from the same column as the replacement.
I have this example dataset:
Sepal.Length Sepal.Width Petal.Length Petal.Width species
1 0.0 3.5 0.0 0.2 setosa
2 4.9 3.0 0.0 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.0 setosa
5 5.0 0.0 0.0 0.0 setosa
6 0.0 0.0 0.0 0.4 setosa
I have tried:
ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width != 0), ir$Sepal.Width)
[1] 3.5 3.0 3.2 3.1 0.0 0.0 3.4 1.0 2.9 1.0 3.7 1.0 3.0 3.0 4.0
The zeros still remain. Running the code above for each column separately is too time-consuming, so I tried to loop over all the columns:
lapply(ir[,-5], function(x)ifelse(ir[,1:4] == 0, sample(ir[,1:4]),ir[,1:4]))
However, it creates unnecessary columns of data and the zeros still remain.
Reproducible code:
structure(list(Sepal.Length = c(0, 4.9, 4.7, 4.6, 5, 0, 4.6,
5, 4.4, 0, 5.4, 4.8, 0, 0, 0), Sepal.Width = c(3.5, 3, 3.2, 3.1,
0, 0, 3.4, 0, 2.9, 0, 3.7, 0, 3, 3, 4), Petal.Length = c(0, 0,
1.3, 1.5, 0, 0, 1.4, 1.5, 1.4, 1.5, 0, 1.6, 1.4, 1.1, 1.2), Petal.Width = c(0.2,
0.2, 0.2, 0, 0, 0.4, 0.3, 0.2, 0.2, 0, 0.2, 0, 0, 0, 0.2), species = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), row.names = c(NA,
15L), class = "data.frame")
Here is a short dplyr solution:
ir %>%
mutate(across(.cols = where(is.numeric),
~ replace(., . == 0, sample(.[. != 0], length(.[. == 0]), replace=T))))
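You may or may not need replace=T, which allows the same sampled element to be drawn more than once. Without it, sample fails when a column has more zeros than non-zero values.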
Function that replaces zeros with random non zero value from vector:
f <- function(vec){
ind <- vec == 0
vec[ind] <- sample(vec[!ind], sum(ind), TRUE)
vec
}
Apply function f to each numeric column:
library(data.table)
num_cols <- names(df)[as.vector(lapply(df, class)) == "numeric"]
setDT(df)[, (num_cols) := lapply(.SD, f), .SDcols = num_cols]
Or, using base R:
num_cols <- names(df)[as.vector(lapply(df, class)) == "numeric"]
df[num_cols] <- lapply(df[num_cols], f)
Note: it would be better to use this sample function from the book Advanced R:
sample <- function(x, size = NULL, replace = FALSE, prob = NULL) {
size <- size %||% length(x)
x[sample.int(length(x), size, replace = replace, prob = prob)]
}
because of how base::sample behaves when x is a single number: for a length-1 numeric x >= 1, sample(x) samples from 1:x instead of returning x.
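The length-1 pitfall mentioned above can be seen in a small sketch. The helper name safe_sample is hypothetical (it is not part of any package), and the toy vector is made up for illustration:

```r
set.seed(1)

x <- c(0, 7, 0)        # toy column with a single non-zero value
nonzero <- x[x != 0]   # length-1 vector: 7

# base::sample() treats a single number n >= 1 as "sample from 1:n",
# so this draws from 1:7 rather than from c(7)
naive <- sample(nonzero, 2, replace = TRUE)

# Hypothetical helper: sampling positions with sample.int() avoids the special case
safe_sample <- function(x, size, replace = FALSE) {
  x[sample.int(length(x), size, replace = replace)]
}

safe <- safe_sample(nonzero, 2, replace = TRUE)  # can only ever contain 7
```

With the index-based version, a column that has just one non-zero value is filled with that value, which is presumably what is wanted here.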
Using data.table (library(data.table)):
setDT(ir)
ir[, Sepal.Width :=
ifelse(Sepal.Width==0,
sample(Sepal.Width[Sepal.Width!=0], .N, replace=TRUE),
Sepal.Width)]
You could also have it sample from within the same species by adding a by argument:
setDT(ir)
ir[, Sepal.Width :=
ifelse(Sepal.Width==0,
sample(Sepal.Width[Sepal.Width!=0], .N, replace=TRUE),
Sepal.Width),
by=species]
Doing this for all columns:
ir[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") :=
lapply(.SD, function(x) {
ifelse(x==0, sample(x[x!=0], size=.N, replace=TRUE), x)}),
by=species]
Note that your code
ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width != 0), ir$Sepal.Width)
is sampling from TRUE and FALSE values, because ir$Sepal.Width != 0 is a logical vector and you are not using it to subset the column. You need:
ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width[ir$Sepal.Width != 0]), ir$Sepal.Width)
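Pulling the answers above together, here is a minimal base R sketch that fills the zeros in every numeric column. The function name replace_zeros and the cut-down data frame are assumptions made for illustration. It samples positions with sample.int and leaves a column untouched when it has no non-zero values to draw from:

```r
# Cut-down version of the example data (first rows only)
ir <- data.frame(
  Sepal.Length = c(0, 4.9, 4.7, 4.6, 5, 0),
  Sepal.Width  = c(3.5, 3, 3.2, 3.1, 0, 0),
  species      = factor(rep("setosa", 6))
)

# Hypothetical helper: replace zeros with random non-zero values from the same vector
replace_zeros <- function(x) {
  zero <- x == 0
  pool <- x[!zero]
  if (any(zero) && length(pool) > 0) {
    # sample positions, not values, so a length-1 pool is handled correctly
    x[zero] <- pool[sample.int(length(pool), sum(zero), replace = TRUE)]
  }
  x
}

set.seed(42)
num_cols <- names(ir)[sapply(ir, is.numeric)]
ir[num_cols] <- lapply(ir[num_cols], replace_zeros)
```

The species column is skipped because only numeric columns are selected.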
I have two lists that I would like to join together.
I want to join the second list to the first list as a new column.
The second list looks like:
[[1]]
[1] 2.46
[[2]]
[1] 2.475
[[3]]
[1] 2.4875
[[4]]
[1] 2.485
[[5]]
[1] 2.4625
[[6]]
[1] 2.4875
So I would like to join [[1]] as a new column in list 1. Expected output for 2 of the lists:
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Value
1 5.1 3.5 1.4 0.2 setosa 2.46
2 4.9 3.0 1.4 0.2 setosa 2.46
3 4.7 3.2 1.3 0.2 setosa 2.46
4 4.6 3.1 1.5 0.2 setosa 2.46
5 5.0 3.6 1.4 0.2 setosa 2.46
6 5.4 3.9 1.7 0.4 setosa 2.46
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Value
2 4.9 3.0 1.4 0.2 setosa 2.475
3 4.7 3.2 1.3 0.2 setosa 2.475
4 4.6 3.1 1.5 0.2 setosa 2.475
5 5.0 3.6 1.4 0.2 setosa 2.475
6 5.4 3.9 1.7 0.4 setosa 2.475
7 4.6 3.4 1.4 0.3 setosa 2.475
This is related to a question I have here where in the EDIT I compute the mean which is the value presented here.
EDIT:
I realise I forgot the data.
Data1
list1 <- list(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4
), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4,
1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, 0.2,
0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), row.names = c(NA,
6L), class = "data.frame"), structure(list(Sepal.Length = c(4.9,
4.7, 4.6, 5, 5.4, 4.6), Sepal.Width = c(3, 3.2, 3.1, 3.6, 3.9,
3.4), Petal.Length = c(1.4, 1.3, 1.5, 1.4, 1.7, 1.4), Petal.Width = c(0.2,
0.2, 0.2, 0.2, 0.4, 0.3), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), row.names = 2:7, class = "data.frame"),
structure(list(Sepal.Length = c(4.7, 4.6, 5, 5.4, 4.6, 5),
Sepal.Width = c(3.2, 3.1, 3.6, 3.9, 3.4, 3.4), Petal.Length = c(1.3,
1.5, 1.4, 1.7, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2,
0.4, 0.3, 0.2), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = 3:8, class = "data.frame"),
structure(list(Sepal.Length = c(4.6, 5, 5.4, 4.6, 5, 4.4),
Sepal.Width = c(3.1, 3.6, 3.9, 3.4, 3.4, 2.9), Petal.Length = c(1.5,
1.4, 1.7, 1.4, 1.5, 1.4), Petal.Width = c(0.2, 0.2, 0.4,
0.3, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = 4:9, class = "data.frame"),
structure(list(Sepal.Length = c(5, 5.4, 4.6, 5, 4.4, 4.9),
Sepal.Width = c(3.6, 3.9, 3.4, 3.4, 2.9, 3.1), Petal.Length = c(1.4,
1.7, 1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.4, 0.3,
0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = 5:10, class = "data.frame"),
structure(list(Sepal.Length = c(5.4, 4.6, 5, 4.4, 4.9, 5.4
), Sepal.Width = c(3.9, 3.4, 3.4, 2.9, 3.1, 3.7), Petal.Length = c(1.7,
1.4, 1.5, 1.4, 1.5, 1.5), Petal.Width = c(0.4, 0.3, 0.2,
0.2, 0.1, 0.2), Species = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), row.names = 6:11, class = "data.frame"))
Data 2:
list2 <- list(2.46, 2.475, 2.4875, 2.485, 2.4625, 2.4875)
Data 3:
list3 <- list(1.80438213020271, 1.81796589626978, 1.81591080488058, 1.81906569425076,
1.81978971735325, 1.86302586794048)
Data 4:
list4 <- list(0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
We can use map2:
library(dplyr)
library(purrr)
map2(list1, list2, ~ .x %>%
mutate(Value = .y))
If there are more lists, we can wrap them in a single list and use pmap:
pmap(list(list1, list2, list3, list4), ~ ..1 %>%
mutate(mean = ..2, sd = ..3, min = ..4))
Or in base R with Map:
Map(cbind, list1, Value = list2)
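As a self-contained illustration of the element-wise pairing, here is a small base R sketch. The toy frames and values lists are made up and stand in for the real lists of data frames and means:

```r
# Toy inputs: a list of data frames and a list of scalars of the same length
frames <- list(
  data.frame(x = 1:3),
  data.frame(x = 4:6)
)
values <- list(2.46, 2.475)

# Map() walks both lists in parallel; each scalar is recycled down its data frame
joined <- Map(function(d, v) {
  d$Value <- v
  d
}, frames, values)

joined[[2]]
```

Map returns a list of the same length as its inputs, so joined[[i]] is frames[[i]] with values[[i]] attached as a constant column.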
I have the following data. I need to take the average of all values up to the point where the signal falls to its minimum value, as shown in the image below, i.e. the average of all the values in the area highlighted in red.
Sample data of signal:
3
3.1
3
3.2
3
3.1
3.2
3
3
3.05
3.25
3.21
3.2
0.4
0.48
0.51
0.65
0.92
1.4
2
3.2
3
3.5
You can find the index of the minimum value, subset the values before that point, and calculate their mean:
dat <- c(3, 3.1, 3, 3.2, 3, 3.1, 3.2, 3, 3, 3.05, 3.25, 3.21, 3.2, 0.4, 0.48, 0.51, 0.65, 0.92, 1.4, 2, 3.2, 3, 3.5)
mean(dat[seq(max(which.min(dat)[1]-1, 1))])
[1] 3.100769
Or using dplyr as:
library(dplyr)
tibble(dat) %>%
filter(row_number() < which.min(dat)) %>%
summarise(mean = mean(dat))
# A tibble: 1 x 1
mean
<dbl>
1 3.10
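Since we need the average of the points before the signal falls to its minimum, create a logical vector (v1 <= min(v1)) and take its cumulative maximum, which is 0 before the minimum and 1 from the minimum onwards. NA^0 is 1 and NA^1 is NA, so raising NA to that power and multiplying by the points turns every value from the minimum onwards into NA. mean with na.rm = TRUE then averages only the earlier points: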
with(df1, mean(v1 * NA^(cummax(v1 <= min(v1))), na.rm = TRUE))
#[1] 3.100769
data
df1 <- structure(list(v1 = c(3, 3.1, 3, 3.2, 3, 3.1, 3.2, 3, 3, 3.05,
3.25, 3.21, 3.2, 0.4, 0.48, 0.51, 0.65, 0.92, 1.4, 2, 3.2, 3,
3.5)), class = "data.frame", row.names = c(NA, -23L))
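The same idea can be wrapped in a small reusable function. This is only a sketch: the name mean_before_min is hypothetical, and returning NA when the minimum is the very first value is a design choice made here (the seq(max(...)) version above would return the first value instead):

```r
# Hypothetical helper: mean of all values that come before the first minimum
mean_before_min <- function(x) {
  first_min <- which.min(x)               # index of the first minimum
  if (first_min == 1) return(NA_real_)    # nothing precedes the minimum
  mean(x[seq_len(first_min - 1)])
}

signal <- c(3, 3.1, 3, 3.2, 3, 3.1, 3.2, 3, 3, 3.05, 3.25, 3.21, 3.2,
            0.4, 0.48, 0.51, 0.65, 0.92, 1.4, 2, 3.2, 3, 3.5)

mean_before_min(signal)
```

For the sample signal this averages the first 13 values, matching the result of about 3.100769 shown above.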
I'd like to create a new column where each value is a random subset of other values from that row in my data.
# Example data:
df <- data.frame(matrix(nrow = 57, ncol = 6)) %>%
mutate(
X1 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X2 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X3 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X4 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X5 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X6 = round(rnorm(n = 57, mean = 0, sd = 1), 1)
)
# my failed attempt at a new column
df %>%
rowwise() %>%
mutate(X7 = str_c(df[, sample(1:6, 3, replace = F)]), sep = ", ")
Here is a solution using the tidyverse. The key is to split the data frame by row and apply a function that samples the values within each single-row subset. map_df does this and combines all the outputs into one data frame. df2 is the final output.
# Load package
library(tidyverse)
# Set seed
set.seed(123)
# Create example data frame
df <- data.frame(matrix(nrow = 57, ncol = 6)) %>%
mutate(
X1 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X2 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X3 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X4 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X5 = round(rnorm(n = 57, mean = 0, sd = 1), 1),
X6 = round(rnorm(n = 57, mean = 0, sd = 1), 1)
)
# Process the data
df2 <- df %>%
rowid_to_column() %>%
split(f = .$rowid) %>%
map_df(function(dt){
dt_sub <- dt %>%
select(-rowid) %>%
select(sample(1:6, 3, replace = FALSE)) %>%
unite(X7, everything(), sep = ", ")
return(dt_sub)
}) %>%
bind_cols(df) %>%
select(paste0("X", 1:7))
df2
X1 X2 X3 X4 X5 X6 X7
1 -0.6 0.6 0.5 0.1 0.9 0.1 0.1, 0.5, 0.9
2 -0.2 0.1 0.3 0.0 -1.0 0.2 0.1, 0.3, 0.2
3 1.6 0.2 0.1 2.1 2.0 1.6 1.6, 2.1, 0.1
4 0.1 0.4 -0.6 -0.7 -0.1 -0.2 0.1, 0.4, -0.6
5 0.1 -0.5 -0.8 -1.1 0.2 0.2 0.1, 0.2, -0.5
6 1.7 -0.3 -1.0 0.0 -0.7 1.2 -1, -0.7, -0.3
7 0.5 -1.0 0.1 0.3 -0.6 1.1 0.5, -0.6, -1
...
I believe that the best way is to use base R functions replicate, sample and sapply.
inx <- t(replicate(nrow(df), sample(1:6, 3, replace = F)))
df$X7 <- sapply(seq_len(nrow(df)), function(i)
paste(df[i, inx[i, ]], collapse = ", "))
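A more compact variant of the same base R idea uses apply over the rows. This is a sketch on a made-up 5 x 6 data frame (columns V1 to V6), not the 57-row data from the question:

```r
set.seed(123)

# Toy data: 5 rows, 6 numeric columns named V1..V6
df <- as.data.frame(matrix(round(rnorm(30), 1), nrow = 5))

# For each row, draw 3 of its 6 values without replacement and paste them together
df$X7 <- apply(df[, 1:6], 1, function(row) {
  paste(sample(row, 3), collapse = ", ")
})

df
```

Because every row passed to sample has six elements, the length-1 special case of sample does not arise here.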
This is a solution in dplyr:
library(dplyr)
df %>%
group_by(idx = seq(n())) %>%
do({
res <- select(., -idx)
bind_cols(res, X7 = toString(sample(unlist(res),
3, replace = FALSE)))
}) %>%
ungroup() %>%
select(-idx)
The result:
# A tibble: 57 x 7
X1 X2 X3 X4 X5 X6 X7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.4 0.4 -0.1 3.4 0.9 -0.4 0.4, 0.9, 0.4
2 1.5 0.9 -0.7 1.5 -1.1 -0.3 -0.7, 1.5, -1.1
3 -0.1 -0.5 -0.6 -0.8 -0.3 2.3 -0.3, 2.3, -0.8
4 0.7 -1.0 0.3 0.2 -0.5 -0.3 -1, 0.3, -0.3
5 0.6 0.9 0.4 1.9 -0.7 -2.0 0.4, -2, 0.9
6 0.3 0.7 1.3 0.6 1.3 -0.2 0.7, -0.2, 1.3
7 0.5 0.3 1.1 -0.2 -0.4 -0.8 0.5, 1.1, 0.3
8 0.4 -1.9 0.8 -0.6 -1.1 0.4 0.4, -1.9, -0.6
9 0.2 -1.5 -1.9 1.0 0.0 0.6 0, 1, 0.6
10 -0.2 0.7 -0.5 1.4 0.3 -0.1 -0.2, 0.3, -0.5
I have a table like this one (table 1):
X a b c d
A 1 0 1 1
B 1 0 0 1
C 0 0 1 1
D 1 1 0 1
E 0 0 1 0
And another one with an identical column "X" like this one (table 2):
X a b b.1 c d d.1
A 0.8 1.5 1.2 3 0.8 0.9
B 0.7 0.1 0.3 0.002 0.7 0.03
C 0.3 0.2 0.4 0.4 0.6 1.3
D 1.2 1.4 0.95 0.5 1.2 0.4
E 1 0.01 1.4 1.9 1.7 0.2
As you can see one column in table1 (e.g. column b) can have one or two corresponding columns in table2 (e.g. columns b and b.1)
I would like to apply the following modifications to table 1:
If the value in table1 is 1 and in the corresponding columns from table2 at least one value is > 0.9, keep the value as a 1
If the value in table1 is 1 but the corresponding value(s) in table2 are not > 0.9, replace with "NA"
If the value in table1 is 0, but at least one corresponding value(s) in table2 is > 0.9, replace with "NA"
If the value in table1 is 0, but the corresponding value(s) in table2 are not > 0.9, replace with 0
Therefore, I would get this table as a result:
X a b c d
A NA NA 1 NA
B NA 0 0 NA
C 0 0 NA 1
D 1 1 0 1
E NA NA 1 NA
Please let me know if I can clarify anything further. Thank you for your help!
Please note that the solution has to be applicable to much larger data frames!
Here is the example data:
> dput(table1)
structure(c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
1, 1, 0), .Dim = c(5L, 4L), .Dimnames = list(c("A", "B", "C",
"D", "E"), c("a", "b", "c", "d")))
> dput(table2)
structure(c(0.8, 0.7, 0.3, 1.2, 1, 1.5, 0.1, 0.2, 1.4, 0.01,
1.2, 0.3, 0.4, 0.95, 1.4, 3, 0.002, 0.4, 0.5, 1.9, 0.8, 0.7,
0.6, 1.2, 1.7, 0.9, 0.03, 1.3, 0.4, 0.2), .Dim = 5:6, .Dimnames = list(
c("A", "B", "C", "D", "E"), c("a", "b", "b.1", "c", "d",
"d.1")))
This solution requires converting the matrices to data frames and using functions from the tidyverse. There is definitely a simpler way to do this; hopefully someone can share it.
# Create example data
dt1 <- structure(c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
1, 1, 0), .Dim = c(5L, 4L), .Dimnames = list(c("A", "B", "C",
"D", "E"), c("a", "b", "c", "d")))
dt2 <- structure(c(0.8, 0.7, 0.3, 1.2, 1, 1.5, 0.1, 0.2, 1.4, 0.01,
1.2, 0.3, 0.4, 0.95, 1.4, 3, 0.002, 0.4, 0.5, 1.9, 0.8, 0.7,
0.6, 1.2, 1.7, 0.9, 0.03, 1.3, 0.4, 0.2), .Dim = 5:6, .Dimnames = list(
c("A", "B", "C", "D", "E"), c("a", "b", "b.1", "c", "d",
"d.1")))
# Load package
library(tidyverse)
# Extract row names
Row <- rownames(dt1)
# Convert dt1 and dt2 to data frames
dt1 <- as_tibble(dt1)
dt2 <- as_tibble(dt2)
# Add the row names as a new column
dt1 <- dt1 %>% mutate(Row = Row)
dt2 <- dt2 %>% mutate(Row = Row)
# Re-organize the dataset
dt1_r <- dt1 %>% gather(Class, Value, -Row)
dt2_r <- dt2 %>% gather(Class, Value, -Row)
# Keep only the letters, remove numbers and . in Class
dt2_r <- dt2_r %>% mutate(Class = gsub("\\..*", "", Class))
# Check the value in dt2 for each combination of Row and Class
# if any value is larger than 0.9, mark as 1, otherwise 0
dt3 <- dt2_r %>%
group_by(Row, Class) %>%
summarise(Threshold = ifelse(any(Value > 0.9), 1, 0))
# Merge dt3 and dt1_r by Row and Class
dt4 <- dt1_r %>% left_join(dt3, by = c("Row", "Class"))
# Create a new column to document the result based on the condition in Value and Threshold
dt5 <- dt4 %>%
group_by(Row, Class) %>%
mutate(Value2 = ifelse(Value == 1 & Threshold == 1, 1,
ifelse(Value == 1 & Threshold != 1, NA,
ifelse(Value == 0 & Threshold == 1, NA, 0)))) %>%
select(Row, Class, Value2)
# Re-organize dt5
dt5_r <- dt5 %>% spread(Class, Value2)
# Convert dt5_r to a matrix
dt6 <- dt5_r %>%
ungroup() %>%
select(-Row) %>%
as.matrix()
# Restore the row names; dt6 is the final output
rownames(dt6) <- Row
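One possibly simpler route stays with the matrices. The sketch below assumes that the columns of table2 belonging to a column of table1 share its name up to the first dot (b and b.1 both map to b). The names base_names, hit and result are introduced here for illustration:

```r
table1 <- matrix(c(1, 1, 0, 1, 0,  0, 0, 0, 1, 0,  1, 0, 1, 0, 1,  1, 1, 1, 1, 0),
                 nrow = 5,
                 dimnames = list(c("A", "B", "C", "D", "E"), c("a", "b", "c", "d")))
table2 <- matrix(c(0.8, 0.7, 0.3, 1.2, 1,    1.5, 0.1, 0.2, 1.4, 0.01,
                   1.2, 0.3, 0.4, 0.95, 1.4, 3, 0.002, 0.4, 0.5, 1.9,
                   0.8, 0.7, 0.6, 1.2, 1.7,  0.9, 0.03, 1.3, 0.4, 0.2),
                 nrow = 5,
                 dimnames = list(c("A", "B", "C", "D", "E"),
                                 c("a", "b", "b.1", "c", "d", "d.1")))

# Strip the ".1"-style suffix so b and b.1 both map to "b"
base_names <- sub("\\..*", "", colnames(table2))

# For each column of table1: does any matching table2 column exceed 0.9 in that row?
hit <- sapply(colnames(table1), function(nm) {
  rowSums(table2[, base_names == nm, drop = FALSE] > 0.9) > 0
})

# Keep the value where table1 and the threshold agree (1/TRUE or 0/FALSE), else NA
result <- ifelse(table1 == hit, table1, NA)
result
```

The four rules in the question reduce to "keep the value when it agrees with the threshold check, otherwise NA", which is what the final ifelse expresses. Everything is vectorised over the matrices, so it should carry over to larger inputs.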
I have data that looks like this:
time sucrose fructose glucose galactose molasses water
1 5 0.0 0.00 0.0 0.0 0.3 0
2 10 0.3 0.10 0.1 0.0 1.0 0
3 15 0.8 0.20 0.2 0.2 1.4 0
4 20 1.3 0.35 0.7 0.4 2.5 0
5 25 2.2 0.80 1.6 0.5 3.5 0
6 30 3.1 1.00 2.3 0.6 4.5 0
7 35 3.6 1.60 3.1 0.7 5.7 0
8 40 5.1 2.80 4.3 0.7 6.7 0
How can I make a time series plot that uses the time column? They are all increasing values.
I saw the post multiple-time-series-in-one-plot, which uses ts.plot to achieve something similar to what I want to show.
Input data for the table above:
structure(list(time = c(5, 10, 15, 20, 25, 30, 35, 40), sucrose = c(0,
0.3, 0.8, 1.3, 2.2, 3.1, 3.6, 5.1), fructose = c(0, 0.1, 0.2,
0.35, 0.8, 1, 1.6, 2.8), glucose = c(0, 0.1, 0.2, 0.7, 1.6, 2.3,
3.1, 4.3), galactose = c(0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.7),
molasses = c(0.3, 1, 1.4, 2.5, 3.5, 4.5, 5.7, 6.7), water = c(0,
0, 0, 0, 0, 0, 0, 0)), .Names = c("time", "sucrose", "fructose",
"glucose", "galactose", "molasses", "water"), row.names = c(NA,
-8L), class = "data.frame")
It doesn't seem like a ts plot is necessary. Here's how you could do it in base R:
# empty plot; ylim covers all series, not just sucrose
with(df, plot(time, sucrose, type="n", ylab="contents", ylim=range(df[-1])))
var <- names(df)[-1]
for(i in var) lines(df$time, df[,i])
A more elegant solution, however, would be to use the dplyr, tidyr and ggplot2 packages:
library(dplyr)
library(tidyr) # provides gather()
library(ggplot2)
df <- df %>%
gather(content, val, -time)
ggplot(df, aes(time, val, col=content)) + geom_line()
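If you would rather stay in base R without writing the loop, matplot draws one line per column in a single call. This is a sketch using the data from the question in wide format; the colour and legend choices are arbitrary:

```r
df <- data.frame(
  time      = c(5, 10, 15, 20, 25, 30, 35, 40),
  sucrose   = c(0, 0.3, 0.8, 1.3, 2.2, 3.1, 3.6, 5.1),
  fructose  = c(0, 0.1, 0.2, 0.35, 0.8, 1, 1.6, 2.8),
  glucose   = c(0, 0.1, 0.2, 0.7, 1.6, 2.3, 3.1, 4.3),
  galactose = c(0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.7),
  molasses  = c(0.3, 1, 1.4, 2.5, 3.5, 4.5, 5.7, 6.7),
  water     = c(0, 0, 0, 0, 0, 0, 0, 0)
)

# One column per series; matplot picks the y-axis range from all of them
y <- as.matrix(df[-1])
cols <- seq_len(ncol(y))

matplot(df$time, y, type = "l", lty = 1, col = cols,
        xlab = "time", ylab = "contents")
legend("topleft", legend = colnames(y), col = cols, lty = 1)
```

Unlike the ggplot2 version, this needs the data in its original wide format, so it should be run before df is reshaped with gather.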