I have two vectors that I would like to combine into one data frame. One of the vectors, value, needs to be split across two columns. The second vector, nc, gives the number of values for each observation. If nc is 1, only one value is given in value (it goes into val1) and 999 is to be written in the second column (val2).
What is an R-ish way to split the vector value and populate the two columns of df? I suspect I'm missing something very obvious, but I can't proceed at the moment... Many thanks!
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
# result by hand
df <- data.frame(nc = nc,
val1 = c(6, 3, 4, 1, 2, 2, 6, 5, 6, 5),
val2 = c(999, 5, 999, 6, 1, 999, 6, 4, 4, 999))
Here's an approach based on this answer:
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
splitUsing <- function(x, pos) {
unname(split(x, cumsum(seq_along(x) %in% cumsum(replace(pos, 1, pos[1] + 1)))))
}
combineValues <- function(vals, nums) {
mydf <- data.frame(cbind(nums, do.call(rbind, splitUsing(vals, nums))))
mydf$V3[mydf$nums == 1] <- 999
return(mydf)
}
df <- combineValues(value, nc)
I think this is what you are looking for. I'm not sure it is the fastest way, but it should do the trick.
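df <- data.frame(nc = nc, val1 = NA, val2 = NA)  # preallocate the result columns
count <- 0  # position in value of the last element consumed so far
for (i in seq_along(nc)) {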
count <- count + nc[i]
if(nc[i]==1) {
df$val1[i] <- value[count]
df$val2[i] <- 999
} else {
df$val1[i] <- value[count-1]
df$val2[i] <- value[count]
}
}
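For comparison, the same bookkeeping can be done without an explicit loop. This is only a sketch on a small hand-made input (the nc and value below are illustrative, not the seeded data from the question): cumsum(nc) gives the position in value of the last element belonging to each row, and subtracting nc - 1 gives the position of the first.

```r
# Illustrative input: 5 observations with 1 or 2 values each (8 values total)
nc    <- c(1, 2, 1, 2, 2)
value <- c(6, 3, 5, 4, 1, 6, 2, 1)

ends   <- cumsum(nc)       # index of the last value for each observation
starts <- ends - nc + 1    # index of the first value for each observation

df <- data.frame(
  nc   = nc,
  val1 = value[starts],
  # rows with a single value get the 999 placeholder in the second column
  val2 = ifelse(nc == 2, value[ends], 999)
)
df
```

Because everything is index arithmetic on whole vectors, this should scale better than growing the columns row by row, though I have not benchmarked it.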
In the below reproducible code, the custom balTransit() function correctly populates a values transition table using a for-loop, while the custom balTransit_1() function is supposed to do the same using lapply() but it doesn't work. What am I doing wrong in my implementation of lapply()? Run the code and you'll see results of:
balTransit (correct results):
> test
X1 X0 X2
X1 0 0 3
X0 0 50 0
X2 5 0 0
balTransit_1 (incorrect, all 0's):
> test_1
X1 X0 X2
X1 0 0 0
X0 0 0 0
X2 0 0 0
Enhanced explanation:
My main objective here is to learn how to use the apply() family of functions, for their perceived benefits. I’ve been going through simple tutorials. A secondary objective is the generation of a transition matrix from a base data frame. Once I figure this out with lapply() (or another apply() function that is most suitable), I’m going to run the various options (for-loop(), data.table(), lapply(), etc.) against the actual data set of 2.5m rows for speed testing.
What I’m doing is creating a transition matrix (technically here a data frame) showing the flow of values (balances) from one “Flags” category to another “Flags” category, over the periods specified by the user. So, in my “for-loop” reproducible example which works correctly, the user has specified a “From” period of 1 and a “To” period of 3. The transition matrix is then generated as shown in the image now posted at the bottom.
A related post yesterday, How to convert a for-loop to lapply function for parallel testing purposes?, addresses this issue for transition counts. This post addresses transition values.
Reproducible code:
# Set up data frame:
data <-
data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9),
Flags = c("X1","X0","X2","X0","X2","X0", "X2","X1","X1")
)
# Function to set-up base transition table:
transMat <- function(data){
DF <- data.frame(matrix(0, ncol=length(unique(data$Flags)), nrow=length(unique(data$Flags))))
row.names(DF) <- unique(data$Flags)
names(DF) <- unique(data$Flags)
return(DF)
}
# Function to populate cells of transition table, using for-loop:
balTransit <- function(data, from=1, to=3){
DF <- transMat(data)
for (i in unique(data$ID)){
id_from <- as.character(data$Flags[(data$ID == i & data$Period == from)])
id_to <- as.character(data$Flags[data$ID == i & data$Period == to])
column <- which(names(DF) == id_from)
row <- which(row.names(DF) == id_to)
val <- (data$Values[(data$ID == i & data$Period == from)])
DF[row, column] <- val + DF[row,column]
}
return(DF)
}
# Function to populate cells of transition table, using lapply:
balTransit_1 <- function(data, from=1, to=3){
DF_1 <- transMat(data)
lapply(seq_along(unique(data$ID)), function(i){
id_from <- as.character(data$Flags[(data$ID == i & data$Period == from)])
id_to <- as.character(data$Flags[data$ID == i & data$Period == to])
column <- which(names(DF_1) == id_from)
row <- which(row.names(DF_1) == id_to)
val <- (data$Values[(data$ID == i & data$Period == from)])
DF_1[row, column] <- DF_1[row, column] + val
})
return(DF_1)
}
# Run the 2 functions:
test <- balTransit(data,1,3)
test
test_1 <- balTransit_1(data,1,3)
test_1
To make your lapply code work just replace <- with <<-:
DF_1[row, column] <<- DF_1[row, column] + val
Please see ?assignOps for more info.
However, again, I wouldn't recommend lapply in this case, and <<- should be avoided in general.
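As for why the plain <- version returns all zeros: the anonymous function passed to lapply() runs in its own environment, so an assignment with <- creates or modifies a local copy that is discarded when the call returns. Here is a minimal sketch of the same effect using a hypothetical counter variable (not part of the original code):

```r
counter <- 0

# `<-` inside the function only creates a local `counter`; the outer one is untouched
invisible(lapply(1:3, function(i) counter <- counter + i))
after_local <- counter

# `<<-` walks up the enclosing environments and modifies the outer `counter`
invisible(lapply(1:3, function(i) counter <<- counter + i))
after_super <- counter

c(after_local = after_local, after_super = after_super)
```

The first call leaves counter at 0, the second brings it to 6. The same thing happens to DF_1 inside balTransit_1().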
Here is a data.table approach:
library(data.table)
DT <- setDT(data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9),
Flags = c("X1","X0","X2","X0","X2","X0", "X2","X1","X1")
))
unique_flags <- unique(DT$Flags)
all_flags <- setDT(expand.grid(list(first_flag = unique_flags, last_flag = unique_flags)))
resultDT <- dcast(
data = DT[, .(first_flag = first(Flags), last_flag = last(Flags), first_value = first(Values)), by = ID][
all_flags, on = c("first_flag", "last_flag")],
last_flag ~ first_flag,
fun.aggregate = sum,
value.var = "first_value"
)
for (col_i in seq_len(ncol(resultDT))){
set(resultDT, which(is.na(resultDT[[col_i]])), col_i, 0)
}
print(resultDT)
Result:
last_flag X0 X1 X2
1: X0 50 0 0
2: X1 0 0 3
3: X2 0 5 0
# step by step ------------------------------------------------------------
library(data.table)
DT <- setDT(data.frame(
ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9, 3, 6, 9),
Flags = c("X1","X0","X2","X0","X2","X0", "X2","X1","X1", "X2","X1","X1")
))
unique_flags <- unique(DT$Flags)
all_flags <- setDT(expand.grid(list(first_flag = unique_flags, last_flag = unique_flags)))
resultDT <- DT[, .(first_flag = first(Flags), last_flag = last(Flags), first_value = first(Values)), by = ID] # find relevant flags
resultDT <- resultDT[all_flags, on = c("first_flag", "last_flag")] # merge all combinations
resultDT <- dcast(resultDT, last_flag ~ first_flag, fun.aggregate = sum, value.var = "first_value") # dcast
for (col_i in seq_len(ncol(resultDT))){
set(resultDT, which(is.na(resultDT[[col_i]])), col_i, 0)
}
print(resultDT)
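If you prefer to stay in base R, the same table can be sketched with merge() and xtabs(). This is an alternative I am suggesting, not part of the data.table code above, and it assumes each ID has exactly one row for the from period and one for the to period. The names start, end, pairs and trans are just illustrative.

```r
data <- data.frame(
  ID     = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Values = c(5, 10, 15, 50, 2, 4, 3, 6, 9),
  Flags  = c("X1", "X0", "X2", "X0", "X2", "X0", "X2", "X1", "X1")
)
from <- 1
to   <- 3
flag_levels <- unique(data$Flags)

# One row per ID at the start period (flag and value) and at the end period (flag)
start <- data[data$Period == from, c("ID", "Flags", "Values")]
end   <- data[data$Period == to,   c("ID", "Flags")]
pairs <- merge(start, end, by = "ID", suffixes = c("_from", "_to"))

# Fixing the factor levels keeps empty flag combinations in the table as zeros
pairs$Flags_from <- factor(pairs$Flags_from, levels = flag_levels)
pairs$Flags_to   <- factor(pairs$Flags_to,   levels = flag_levels)

# Sum the starting values for every (to, from) flag combination
trans <- xtabs(Values ~ Flags_to + Flags_from, data = pairs)
trans
```

Rows are the to flag and columns the from flag, which matches the layout of the for-loop result in the question.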
I wrote a function that searches for outliers in each row of a data frame. What I'd like to get back is the modified data frame with the new column x$outliers_numb as a return value, not just printed output. I added return() at the end, but it doesn't seem to work. Any ideas?
library(dplyr)  # provides %>% and mutate(), used below
outliers <- function(x, s, e){
# x = dataframe
# s = index of first col to take
# e = index of last column to take
p <- x
for(i in s:e){
Q1 <- quantile(p[,i], 0.25, names = FALSE)
Q3 <- quantile(p[,i], 0.75, names = FALSE)
iqr <- IQR(p[,i])
low <- Q1 - iqr*1.5
up <- Q3 + iqr*1.5
p[,i] <- ((p[,i] < low) | (p[,i] > up))
}
p <- p %>% mutate(outliers_numb = rowSums(p[,s:e]))
x$outliers_numb <- p$outliers_numb
return(x)
}
#example
w <- data.frame(col1 = c(1, 2, 3, 4, 5, 90, 6),
col2 = c(13, 60, 13, 18, 13, 12, 0),
col3 = c(1, 899, 5, 4, 3, 8, 6))
outliers(w, 1, 3)
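The function already returns the modified data frame; calling it on its own only prints the result. Just assign it to a new variable:
dataframe_to_reuse <- outliers(w, 1, 3)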
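The reason is that R functions work on a copy of their arguments, so w itself is never changed by the call. A small sketch with a hypothetical helper add_flag() (a stand-in for outliers(), not the original function):

```r
# Hypothetical helper: adds a logical column and returns the modified copy
add_flag <- function(x) {
  x$flag <- x$col1 > 10
  x
}

w <- data.frame(col1 = c(1, 2, 90))

add_flag(w)                          # prints the modified copy, w is untouched
unchanged <- "flag" %in% names(w)    # FALSE

w <- add_flag(w)                     # keep the result by assigning it
changed <- "flag" %in% names(w)      # TRUE
```

Assigning back to w or to a new name is purely a matter of whether you still need the original.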
I want to create train, val, and test splits (60:20:20), and I repeat the process multiple times.
The test set should contain exactly 2 observations each time. Why does it sometimes contain 1 or 3 observations?
What is the role of replace in sample()? Should I set it to FALSE?
library(dplyr)
tbl <- tibble(id = 1:10)
train = list()
val = list()
test = list()
for (run in 1:5)
{
assignment <- sample(1:3, size = nrow(tbl), prob = c(0.6, 0.2, 0.2), replace = TRUE)
# Create a train, validation and test sets
train[[run]] <- tbl[assignment == 1, ]
val[[run]] <- tbl[assignment == 2, ]
test[[run]] <- tbl[assignment == 3, ]
}
If we want exactly 6, 2, and 2 values of 1, 2, and 3 in the sample, just replicate 1, 2, 3 that many times and shuffle the result:
v1 <- sample(rep(1:3, c(6, 2, 2)))
Then do a split
split(tbl, v1)
When we use prob, the group sizes vary from run to run because each observation is assigned independently with those probabilities; 6/2/2 is only the expected split. Regarding replace = TRUE, it is needed in the OP's code: the length of 1:3 is just 3, whereas size = nrow(tbl) is 10, so sampling without replacement fails because it cannot supply those 7 extra elements.
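Putting the two pieces together for the repeated runs in the question, here is a rough sketch. It uses a plain data.frame instead of a tibble to stay in base R, and the names labels and splits are mine.

```r
tbl <- data.frame(id = 1:10)
labels <- c("train", "val", "test")

set.seed(1)  # only so the sketch is reproducible
splits <- lapply(1:5, function(run) {
  # Shuffle a vector holding exactly 6 "train", 2 "val" and 2 "test" labels
  assignment <- sample(rep(labels, times = c(6, 2, 2)))
  split(tbl, factor(assignment, levels = labels))
})

# Rows per set in each run: always 6 / 2 / 2
sizes <- sapply(splits, function(s) sapply(s, nrow))
sizes
```

Note that sample() is called here without size or replace, so it simply permutes the label vector and the counts cannot drift.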
I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = T), strip1 = sample(1:12, 100, replace = T), strip2 = sample(1:12, 100, replace = T))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold and these are stored in two numeric arrays, each position in the array is for the corresponding strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy the strip can receive. What I want to be able to do is remove the rows from the data frame where BOTH strips do not have over the required threshold.
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row as although strip2 has a lower threshold than 4, strip1 has a threshold of 6 and so there isn't enough energy here.
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.
# compare the deposited energy with the threshold looked up for each strip
filter1 <- data$Energy >= threshold_strip1[data$strip1]
filter2 <- data$Energy >= threshold_strip2[data$strip2]
data <- subset(data, filter1 & filter2)
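The key step is using the strip number as an index into the threshold vector, which looks up one threshold per row. A small worked sketch on three hand-picked rows (the rows are illustrative; the first is the example from the question, and I am assuming that energy equal to the threshold counts as enough):

```r
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)

data <- data.frame(
  Energy = c(4, 7, 1),
  strip1 = c(2, 2, 6),
  strip2 = c(2, 2, 6)
)

# Indexing by strip number returns the threshold that applies to each row
needed1 <- threshold_strip1[data$strip1]   # 6 6 1
needed2 <- threshold_strip2[data$strip2]   # 3 3 2

# Keep a row only if the energy meets both thresholds
kept <- data[data$Energy >= needed1 & data$Energy >= needed2, ]
kept
```

Only the second row survives: the first fails the strip1 threshold of 6, and the third fails the strip2 threshold of 2.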
I'd maybe do...
library(data.table)
setDT(data)
# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
strip_loc = rep(seq_along(threshes), lengths(threshes)),
strip_num = unlist(lapply(threshes, seq_along)),
thresh = unlist(threshes)
)
# loop over strip locations (strip1, strip2, etc)
# marking where threshold is not met
data[, keep := TRUE]
lbDT[, {
onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
data[.(s = strip_num, th = thresh), on=onexpr, keep := FALSE]
NULL
}, by=strip_loc]
# rows that met every threshold
res = data[(keep)]
What about this? Using dplyr:
require(dplyr)
data2 <- data %>%
mutate(
strip1_value = threshold_strip1[strip1],
strip2_value = threshold_strip2[strip2],
to_keep = Energy > strip1_value & Energy > strip2_value
) %>%
filter(to_keep == TRUE)
I'm wondering how I would go about altering this code so that corresponding values of both vectors cannot be equal. As an example: if x = (1, 2, 2, 4, 8, 1, 7, 9, 5, 10) and y = (3, 2, 7, 8, 4, 10, 4, 8, 2, 1), the second values of both vectors equal 2. Is there any way I can tell R to re-sample this second spot in vector x until it is not the same as the value in vector y?
x <- c(1:10)
y <- c(1:10)
sample_x <- sample(x, length(x), replace = TRUE)  # length(10) would be 1, i.e. a single draw
z <- sample_x > y
You could do:
while(any(x == y)) x <- sample(x)
Edit: Now I realize x and y probably come from a similar sample call with replace = TRUE, here is an interesting approach that avoids a while loop. It uses indices and modulo to ensure that the two samples do not match:
N <- 1:10 # vector to choose from (assumes distinct values)
L <- 20 # sample size - this might be length(N) as in your example
n <- length(N)
i <- sample(n, L, replace = TRUE)
j <- sample(n-1, L, replace = TRUE)
x <- N[i]
y <- N[1 + (i + j - 1) %% n]
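To see why the modulo trick can never produce a match: j is between 1 and n - 1, so the index used for y is the index used for x shifted by a non-zero amount that is smaller than a full cycle. A quick sketch that wraps the lines above in a hypothetical helper, sample_distinct_pairs() (my name, and it assumes N has at least two distinct values), and checks the result:

```r
# Hypothetical wrapper around the index/modulo approach
sample_distinct_pairs <- function(N, L) {
  n <- length(N)
  i <- sample(n, L, replace = TRUE)       # index for x
  j <- sample(n - 1, L, replace = TRUE)   # shift between 1 and n - 1
  data.frame(
    x = N[i],
    y = N[1 + (i + j - 1) %% n]           # shifted index, wrapped back into 1..n
  )
}

set.seed(7)  # only for reproducibility of the sketch
pairs <- sample_distinct_pairs(N = 1:10, L = 1000)
no_matches <- all(pairs$x != pairs$y)
no_matches
```

Both columns are still drawn with replacement; only the pairing is constrained.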
while (any(ind <- x==y))
x[ind] <- sample(N, sum(ind), TRUE)
Here N is the vector you are sampling from (or a single integer n, in which case sample() draws from 1:n).
The advantage is that only the matching positions are resampled rather than all of x, so the loop converges more quickly.
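A self-contained sketch of that loop, with x and y both drawn with replacement as in the question (the seed is only there to make it reproducible):

```r
set.seed(42)
N <- 1:10
y <- sample(N, length(N), replace = TRUE)
x <- sample(N, length(N), replace = TRUE)

# Redraw only the positions where x and y currently agree
while (any(ind <- x == y)) {
  x[ind] <- sample(N, sum(ind), replace = TRUE)
}

still_equal <- any(x == y)
still_equal
```

Each pass only touches the clashing positions, so the untouched elements of x keep their original draws.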
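You can use the permn function from the combinat package to generate all permutations of a vector of length 10 (note that this is 10! = 3,628,800 permutations, so it is slow and memory-hungry):
library(combinat)
ind <- permn(10)
xy_any_equal <- sapply(ind, function(i) any(x[i] == y))
if(sum(xy_any_equal) < length(xy_any_equal)) x_perm <- x[head(ind[!xy_any_equal],1)[[1]]]
exists("x_perm")  # exists() takes the variable name as a string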