I have a data frame, and I would like one of the columns to contain a string of 5 random values between 1 and 100:
library(stringi) # provides stri_rand_strings()
expected_df <- data.frame("x" = stri_rand_strings(4, 8), "y" = round(runif(4, 13, 272)), z =(c('2 3 50 17 9', '10 3 5 100 22', '86 30 74 10 27', '6 33 4 19 66')))
I have tried to create a function that samples 5 numbers from 1 to 100, but the same 5 numbers are repeated for each row in the data frame:
rand_str<- function() {
x = c(sample(1:100, 5, replace = FALSE))
return(paste0(x,collapse = " "))
}
df <- data.frame("x" = stri_rand_strings(4, 8), "y" = round(runif(4, 13, 272)), z =rep(rand_str(),4))
I have tried using rep(rand_str(),4), but it doesn't solve the problem.
How can I create 4 rows with 5 different numbers in each?
Thanks in advance!
The function you're looking for is replicate. With replicate, you can use your original rand_str() function like this:
replicate(4, rand_str())
Alternatively, you can rewrite your rand_str() function like this:
rand_str <- function(n) replicate(n, paste(sample(100, 5, FALSE), collapse=" "))
Demo:
set.seed(1) # So you can replicate these results
rand_str(4)
# [1] "27 37 57 89 20" "90 94 65 62 6" "21 18 68 38 74" "50 72 98 37 75"
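To see why rep(rand_str(), 4) repeats the same five numbers while replicate() does not, here is a small sketch. The helper name one_draw is made up for this illustration. The point is only that rep() copies a value that was computed once, whereas replicate() re-evaluates the expression for every element.

```r
# Hypothetical helper for illustration: one string of 5 distinct numbers from 1..100
one_draw <- function() paste(sample(100, 5), collapse = " ")

set.seed(1)
copied <- rep(one_draw(), 4)          # one_draw() runs once; its result is copied 4 times
resampled <- replicate(4, one_draw()) # one_draw() runs 4 times, once per element

length(unique(copied)) # 1: every element is the same string
length(resampled)      # 4: one freshly sampled string per row
```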
For reference, if you are going to use a for loop, either of the following approaches would perform more efficiently than Steffen's answer, which grows a vector with each iteration of the loop. In R, you should pre-allocate space to store the results of your loops. When possible, specifying the storage mode (for example, specifying when a character or integer is expected in the results) will help improve the function's efficiency.
This option creates a character vector of empty strings of the required length before the loop, and each iteration of the loop replaces the empty string at the given position with the pasted result of the sample.
rand_str <- function(n) {
returnvalue <- character(n)
for (i in seq_len(n)) { # seq_len() also copes with n = 0, where 1:n would give c(1, 0)
returnvalue[i] <- paste0(sample(1:100, 5, replace = FALSE), collapse = " ")
}
returnvalue
}
This option creates an empty matrix where each row stores the results of the samples. Once the matrix has been filled, it gets pasted together using the do.call(paste, ...) idiom commonly used to paste together rows of a data.frame.
rand_str <- function(n) {
m <- matrix(NA_integer_, ncol = 5, nrow = n)
for (i in seq.int(n)) {
m[i, ] <- sample(100, 5, FALSE)
}
do.call(paste, data.frame(m))
}
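As a rough sketch of the same pre-allocation idea without an explicit loop, vapply() can be used: it allocates the result up front and checks that every value is a single character string. The function names below are assumptions made for this comparison, not part of the answers above.

```r
# Loop version with a pre-allocated character vector (same idea as the first option above)
rand_str_loop <- function(n) {
  out <- character(n)
  for (i in seq_len(n)) out[i] <- paste(sample(100, 5), collapse = " ")
  out
}

# Assumed alternative: vapply() allocates the result and type-checks each value
rand_str_vapply <- function(n) {
  vapply(seq_len(n), function(i) paste(sample(100, 5), collapse = " "), character(1))
}

# With the same seed both versions draw the same samples in the same order
set.seed(1); a <- rand_str_loop(4)
set.seed(1); b <- rand_str_vapply(4)
identical(a, b)
```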
How about this?
rand_str <- function(n) {
returnvalue <- c()
for (i in 1:n) {
x = c(sample(1:100, 5, replace = FALSE))
returnvalue <- c(returnvalue, paste0(x, collapse = " "))
}
returnvalue
}
df <- data.frame("x" = stri_rand_strings(4, 8), "y" = round(runif(4, 13, 272)), z =rand_str(4))
I am working with the R programming language.
I am trying to find where a certain pattern (e.g. ABCD) first appears in a random string (e.g. ACABCDCDBCABCDBC - answer = 6). I wrote a loop to do this:
library(stringr)
letters <- c("A", "B", "C", "D")
results <- list()
for (i in 1:100)
{
iteration_i = i
letters_i = paste(sample(letters, 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)),collapse="")
position_i = str_locate(letters_i, "ADBC")
results_tmp = data.frame(iteration_i , letters_i, position_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
This looks something like this now (note: I don't think this is correct - in row 5, I see ABCD at the beginning of the row, but it's being recorded as NA for some reason):
iteration_i letters_i start end
1 1 BACDCCCDCCCDCDDBBCBBAACACBBBBAAABDDDACAABDDABBABADCDDCDACCBBBCABCDABCDCCCDADDDBADBDCADAABDBDCDCAACCB NA NA
2 2 CACACCCCDCCBADACBBAADBCABBAAAAADBDDBCADCAAADADAAABDCABBAABABBCBDADCDDDDCDBADDBDCBCDDDBDCDDAACBBBBACA 20 23
3 3 CDCBDAABDDDDADBAAABBADAADBDDDBDADDCABADDDCDABBBCBCBBACBBDADABBCDCCACDBCDCDDBDBADBCDCADDADDDBDBAAABBD 79 82
4 4 ADBCDBADADBAAACAADACACACACBDDCACBDACCBDAAABDBAAAABBCCDBADADDADCBCABCBAABDCBCDCDACDCCDBADCBDDAADBCDAC 1 4
5 5 D**ABCD**DDCCBCDABADBBBBCDBCADCBBBDCAAACACCCBCBCADBDDABBACACBDABAAACCAAAAACCCCBCBCCABABDDADBABDDDCCDDCCC NA NA
6 6 DDDDDBDDDDBDDDABDDADAADCABCDAABBCCCDAABDDAACBDABBBBBABBCBDADBDCCAAADACCBCDDBDCAADCBBBCACDBBADDDDCABC NA NA
Currently, I am only generating 100 letters and hoping that this is enough to observe the desired pattern (sometimes this doesn't happen, notice the NA's) - is there a way to add a WHILE LOOP to what I have written to keep generating letters until the desired pattern first appears?
Can someone please show me how to do this?
Thanks!
The loop is a repeat loop, not a while loop, and it only breaks when the pattern is found. I have set the length of the results list to 2, since there's no point in making it bigger just to test the code.
library(stringr)
Letters <- c("A", "B", "C", "D")
Pattern <- "ADBC"
n <- 2L
set.seed(2022)
results <- vector("list", length = n)
for (i in seq.int(n)) {
repeat {
l <- sample(Letters, 100, replace = TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
letters_i <- paste(l, collapse = "")
position_i <- str_locate(letters_i, pattern = Pattern)
if(any(!is.na(position_i))) break
}
results_tmp <- data.frame(iteration = i, letters = letters_i, position_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
results_df
#> iteration letters start end
#> 1 1 ADBDBDBBCABBBDDBADDAADCBBADACACDCCBBADAADCDDABADCABCDCDDCCCBDDAABACCBDAAAADBDDCCCCADBCBBDABBDCCCBADD 83 86
#> 2 2 DDBDBDBCDDBDBBBDBABBCCBBCCBDBDABBAAABACABADCCBBABADBCCCDABABBDBADCADCABDDDAAACCBDCAACACACBBDDDACCDDC 50 53
Created on 2022-06-11 by the reprex package (v2.0.1)
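If the goal is to keep extending one string until the pattern first appears, rather than redrawing a fresh 100-letter string each time, something along these lines might work. This is only a sketch in base R: the helper first_hit and its chunk argument are assumed names, and the pattern is "ADBC" as in the code above (the question's text mentions "ABCD", which that code never searches for).

```r
# Sketch (assumed helper, base R only): keep appending random letters until the
# pattern shows up, then report where it first starts and ends.
first_hit <- function(pattern = "ADBC", alphabet = c("A", "B", "C", "D"), chunk = 50L) {
  s <- ""
  start <- -1L
  while (start < 0L) {
    # Append another chunk and search the whole string, so a match that
    # straddles two chunks is still found
    s <- paste0(s, paste(sample(alphabet, chunk, replace = TRUE), collapse = ""))
    start <- as.integer(regexpr(pattern, s, fixed = TRUE)) # -1 when there is no match
  }
  last <- start + nchar(pattern) - 1L
  data.frame(letters = substr(s, 1L, last), start = start, end = last,
             stringsAsFactors = FALSE)
}

set.seed(2022)
hit <- first_hit()
hit$end # number of letters generated up to and including the first match
```

The returned string is cut off right after the first match, so its length equals the end position.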
In R, you can define an arbitrary integer sequence using :, e.g.
a = c(1:3, 12:14)
print(a)
## 1 2 3 12 13 14
I'm looking for a way to do the inverse operation, e.g. given a vector of integers I want to produce a character (or character vector) that collapses the integer sequence(s) to the equivalent expressions using :, e.g.
some_function (a)
## "1:3" "12:14"
Bonus if the stride can be detected, e.g. c(1, 3, 5) becomes "1:2:5" or something like that.
Motivation: generate an integer sequence in R based on some data manipulation to identify database row selection, and pass the most concise representation of that sequence to an external program in the proper format.
We can take the run-length encoding (rle) of the differences and paste each range together, taking the stride of the sequence into account.
fun <- function(s) {
m <- c(0, diff(s)) # step from the previous element
b <- rle(m)
b$values[b$lengths == 1 & b$values != 1] <- 0 # an isolated jump starts a new group
l <- cumsum(!inverse.rle(b)) # group id for each element
d <- function(x) paste0(range(x[, 1]),
collapse = paste0(":", unique(x[-1, -1]), ":"))
f <- c(by(cbind(s, m), l, d))
sub("::.*", "", sub(":1:", ":", f)) # drop a stride of 1 and tidy single values
}
fun(c(1,1:3,12:14,c(1,3,5)))
1 2 3 4
"1" "1:3" "12:14" "1:2:5"
fun(c(1, 3, 5, 8:10, 14, 17, 20))
1 2 3
"1:2:5" "8:10" "14:3:20"
fun(1)
1
"1"
Ah, nerd heaven. Here's a first shot. You could even use this for encoding within R.
Needs testing; code always prints the stride out.
encode_ranges <- function (x) {
rle_diff <- list(
start = x[1],
rled = rle(diff(x))
)
class(rle_diff) <- "rle_diff"
rle_diff
}
decode_ranges <- function (x) {
stopifnot(inherits(x, "rle_diff"))
cumsum(c(x$start, inverse.rle(x$rled)))
}
format.rle_diff <- function (x, ...) {
stopifnot(inherits(x, "rle_diff"))
output <- character(length(x$rled$values))
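end <- x$start # last value covered so far; set up front so it is never read before being assigned
for (j in seq_along(x$rled$values)) {
stride <- x$rled$values[j]
len <- x$rled$lengths[j]
start <- end # each run begins where the previous one stopped
end <- start + stride * len
if (len == 1L) next # a lone jump between two ranges is not printed
output[j] <- paste(start, end, stride, sep = ":")
}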
output <- output[nchar(output) > 0]
paste(output, collapse = ", ")
}
print.rle_diff <- function (x, ...) cat(format(x, ...))
encode_ranges(c(1:3, 12:14))
encode_ranges(c(1, 3, 5, 8:10, 14, 17, 20))
We create a grouping variable with diff and cumsum, then use one of the group-by functions (here tapply) to paste the range of values in each group.
f1 <- function(vec) {
unname(tapply(vec, cumsum(c(TRUE, diff(vec) != 1)),
FUN = function(x) paste(range(x), collapse=":")))
}
f1(a)
#[1] "1:3" "12:14"
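To make the grouping step more concrete, here are the intermediate values for the example vector. This is only an illustration of what diff and cumsum produce; the variable names starts and grp are assumptions.

```r
a <- c(1:3, 12:14)
starts <- c(TRUE, diff(a) != 1) # TRUE wherever a new run of consecutive integers begins
grp <- cumsum(starts)           # run id per element: 1 1 1 2 2 2
split(a, grp)                   # the groups whose range() gets pasted together
```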
For the second case
b <- c(1, 3, 5)
un1 <- unique(diff(c(1, 3, 5)))
paste(b[1], un1, b[length(b)], sep=":")
#[1] "1:2:5"
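One way to sanity-check such strings is to expand them back into integers. The decoder below is a hypothetical sketch. It assumes that "a" is a single value, "a:b" a range with stride 1, and "a:k:b" a range from a to b with stride k, which is the format used in the examples here.

```r
# Hypothetical decoder: "a" -> a, "a:b" -> a, a+1, ..., b, "a:k:b" -> a, a+k, ..., b
expand_ranges <- function(x) {
  unlist(lapply(strsplit(x, ":", fixed = TRUE), function(p) {
    p <- as.numeric(p)
    switch(length(p),
           p,                             # one part: a single value
           seq(p[1], p[2]),               # two parts: start and end
           seq(p[1], p[3], by = p[2]))    # three parts: start, stride, end
  }), use.names = FALSE)
}

expand_ranges(c("1:2:5", "8:10", "14:3:20"))
# [1]  1  3  5  8  9 10 14 17 20
```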
I have a df with a label "S" for anywhere my numeric column is <35.
I'd like to use each S position and label "S-1", "S-2", "S-3" for the 3 previous rows to S, then "S+1", "S+2" for the next 2 rows of S.
like this..
N S
45
56
67 S-3
47 S-2
52 S-1
28 S
89 S+1
66 S+2
55
76
I was using this to start me off, just as an example.
n <- sample(50:100, 10, replace=T)
data <- data.frame(N=n)
data <- rbind(data, 30)
data <- rbind(data,data,data,data,data,data)
data$S <- ifelse(data$N<35, "S", "")
Any ideas..?
Here is an option using base R. We get the indices of the rows where 'N' is less than 35 ('i1') and create the 'S' column with blank ("") elements. Then we loop through 'i1', build the sequence of positions from 3 before to 2 after each index ('ind') together with the matching labels ('val'), keep only the positions that are valid row numbers of the data, and assign the labels to the 'S' column.
i1 <- which(data$N < 35)
data$S <- ""
out <- do.call(rbind, lapply(i1, function(i) data.frame(ind =(i-3): (i+2),
val = c(paste0("S-", 3:1), "S", paste0("S+", 1:2)), stringsAsFactors = FALSE)))
i2 <- out$ind %in% seq_len(nrow(data))
data$S[out$ind[i2]] <- out$val[i2]
data
set.seed(24)
n <- sample(50:100, 10, replace=T)
data <- data.frame(N=n)
data <- rbind(data, 30)
data <- rbind(data,data,data,data,data,data)
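The same index arithmetic can also be wrapped in a small helper that works directly on the numeric column. This is a sketch under assumptions: label_around and its arguments are made-up names, before and after are assumed to be at least 1, and overlapping windows are resolved simply by letting later offsets overwrite earlier ones.

```r
# Sketch of a helper (label_around and its arguments are assumed names)
label_around <- function(N, threshold = 35, before = 3L, after = 2L) {
  S <- character(length(N))                  # start with "" everywhere
  hits <- which(N < threshold)               # rows that get the plain "S"
  for (k in c(seq(-before, -1L), seq_len(after))) {
    idx <- hits + k
    idx <- idx[idx >= 1L & idx <= length(N)] # drop positions that fall off the ends
    S[idx] <- sprintf("S%+d", k)             # "S-3", "S-2", "S-1", "S+1", "S+2"
  }
  S[hits] <- "S"                             # an actual hit always wins over a neighbour label
  S
}

# The example column from the question
label_around(c(45, 56, 67, 47, 52, 28, 89, 66, 55, 76))
# [1] ""    ""    "S-3" "S-2" "S-1" "S"   "S+1" "S+2" ""    ""
```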
Without dealing with possible overlap, here is a bunch of ifelse() statements to get the job done. Maybe you can think of a more appropriate way to generalize it.
You can use lag() and lead() with the dplyr package.
library(dplyr) # provides %>%, mutate(), lag() and lead()
data %>% mutate(S = ifelse(S == "S", S,
ifelse(lag(S == "S"), "S+1",
ifelse(lag(S == "S", 2), "S+2",
ifelse(lead(S == "S"), "S-1",
ifelse(lead(S == "S", 2), "S-2", ""))))),
S = ifelse(is.na(S), "", S))
(You would get NA values in the first two rows if the first value is not <35, so replace these with "".)
N S
1 52
2 86
3 86
4 57
5 54
6 57
7 51
8 98
9 100 S-2
10 73 S-1
11 30 S
12 52 S+1
13 86 S+2
14 86
This is a long-ish answer since I break it down into pieces I would normally implement using a pipeline and lambda expressions, but it should be easy enough to follow.
I will work on row indices and compute two vectors, one containing the index closest to i on the left that has label "S", and one containing the index closest to i on the right.
indices <- 1:length(data$S)
closest_left <- rep(NA, length = length(indices))
closest_right <- rep(NA, length = length(indices))
I compute these using purrr's reduce functions but you could easily do it in a loop as well.
this_or_left <- function(left_val, i) {
res <- if (data$S[[i]] == "S") i else left_val
closest_left[[i]] <<- if (data$S[[i]] == "S") i else left_val
}
this_or_right <- function(right_val, i) {
res <- if (data$S[[i]] == "S") i else right_val
closest_right[[i]] <<- if (data$S[[i]] == "S") i else right_val
}
purrr::reduce(indices, this_or_left, .init = this_or_left(NA, 1))
purrr::reduce_right(indices, this_or_right, .init = this_or_right(NA, length(indices)))
Whether you could do it with vectorised expressions I don't know. Possibly. I didn't try.
Now, I simply have to compute the distance to the closest S and make labels from that, using empty labels if the distance is greater than 3 and label "S" if the distance is zero.
get_dist <- Vectorize(function(i) {
down <- i - closest_left[i]
up <- closest_right[i] - i
if (is.na(down) || down > up) up
else if (is.na(up) || down <= up) -down
else NA
})
make_label <- Vectorize(function(dist) {
if (abs(dist) > 3) ""
else if (dist == 0) "S"
else if (dist < 0) paste0("S", dist)
else if (dist > 0) paste0("S+", dist)
})
make_label(get_dist(indices))
Here, I used Vectorized expressions to change it up a little.
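On the question of whether the two "closest S" vectors can be computed with vectorised expressions: one possibility, sketched here on a toy logical vector instead of data$S == "S", is to combine cummax() and cummin(). The toy input is an assumption for illustration only.

```r
is_s <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE) # toy stand-in for data$S == "S"
idx <- seq_along(is_s)
n <- length(is_s)

closest_left <- cummax(ifelse(is_s, idx, 0L))                # last "S" at or before each row
closest_left[closest_left == 0L] <- NA                       # no "S" seen yet
closest_right <- rev(cummin(rev(ifelse(is_s, idx, n + 1L)))) # next "S" at or after each row
closest_right[closest_right > n] <- NA                       # no "S" further down

closest_left  # NA NA  3  3  3  3  7
closest_right #  3  3  3  7  7  7  7
```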
I want to multiply each list element in, say, l1 by b1's col1 and store the summed result in a separate column. Basically, this is what I want to do:
res = 0
for item in a
for col_item in b
res = res + item * col_item
E.g.
l1 = list(c('17-Nov-14', 10), c('17-Apr-15', 20))
b1 = data.frame(col1 = c(10, 20), res=c(0))
result = data.frame(col1= c(10, 20), res = c(2*10+4*10+3*10, 2*20+4*20+3*20))
I have working code, but it could be improved.
test <- function(param, df) {
df$res <- as.integer(param[2]) * df$col1
df
}
t <- lapply(l1, test, b1) # b1 is the data frame defined above
result <- cbind(t[[1]]$col1, t[[1]]$res + t[[2]]$res) # l1 has two elements, so there are two results to add
We can simplify the computation with a little algebra. If we factor out the element of b1$col1, then we can precompute the sum of the list and perform a vectorized multiplication against it:
b1$res <- sum(unlist(l1))*b1$col1;
b1;
## col1 res
## 1 10 90
## 2 20 180
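As a quick check of that algebra, the sketch below compares a per-row reading of the nested loop with the factored form. It uses the values 2, 4 and 3 from the expected result in the question; the variable names are assumptions.

```r
a <- c(2, 4, 3)   # the multipliers from the expected result
col1 <- c(10, 20) # b1$col1

# Per-row translation of the nested loop from the question
looped <- vapply(col1, function(col_item) {
  res <- 0
  for (item in a) res <- res + item * col_item
  res
}, numeric(1))

# Factored form: sum(a) is computed once, then multiplied across the column
factored <- sum(a) * col1

all.equal(looped, factored) # both are 90 and 180
```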
For your new problem definition, we need to extract the required element out of each list component vector:
b1$res <- sum(as.integer(sapply(l1,`[`,2L)))*b1$col1;
b1;
## col1 res
## 1 10 300
## 2 20 600
If you are looking for a method to reduce your list after lapply, you can use the Reduce function:
Reduce(function(df1, df2) data.frame(col = df1[1], res = df1[2] + df2[2]), myList)
# col1 res
# 1 10 90
# 2 20 180
Here, myList is assumed to hold the result of your lapply(...) call.
I've got a set of objects, let's say with the IDs 'A' to 'J'. And I've got two data frames which look the following way (as you can see, the second data frame is symmetric):
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,69,9,83,26,NA,67,95,74,69,67,NA,6,84,9,95,6,NA), ncol = 5, nrow = 5, dimnames = list(df1$ID, df1$ID)))
For example, take the objects 'B' and 'E'. I want to know: Is 13+28 (from df1) less than 9 (from df2)? I'd like to know this for all pairs of objects. The output should be
(a) a logical data frame structured like df2 and
(b) the number of "TRUE" values.
Most of the time I will only need result (b), but sometimes I would also need (a). So if (b) can be calculated without (a) and if this would be significantly faster, then I'd like to have both algorithms in order to select the suitable one dependent on which output I need to answer a particular question.
I'm comparing around 2000 objects, so the algorithm should be reasonably fast. So far I've been only able to implement this with two nested for-loops which is awfully slow. I bet there is a much nicer way to do this, maybe exploiting vectorisation.
This is what it currently looks like:
df3 <- as.data.frame(matrix(data = NA, ncol = nrow(df1), nrow = nrow(df1),
dimnames = list(df1$ID, df1$ID)))
for (i in 2:nrow(df3)){
for (j in 1:(i-1)){
sum.val <- df1[df1$ID == rownames(df3)[i], "Var"] + df1[df1$ID == names(df3)[j], "Var"]
df3[i,j] <- sum.val <= df2[i,j]
}
}
Is this what you want?
df3 <- outer(df1$Var, df1$Var, "+")
df3
df4 <- df3 < df2
df4
sum(df4, na.rm = TRUE)
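Because df2 is symmetric, summing the whole logical matrix counts every pair twice. If only the number of TRUE pairs is needed, a sketch like the following restricts the count to the upper triangle. It rebuilds the example data as a plain matrix; the names m2, sums, less and n_pairs are assumptions.

```r
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9, 13, 15, 11, 28))
m2 <- matrix(c(NA, 42, 83, 74, 84, 42, NA, 26, 69, 9, 83, 26, NA, 67, 95,
               74, 69, 67, NA, 6, 84, 9, 95, 6, NA),
             nrow = 5, ncol = 5, dimnames = list(df1$ID, df1$ID))

sums <- outer(df1$Var, df1$Var, "+") # all pairwise sums in one step
less <- sums < m2                    # result (a): logical matrix, NA on the diagonal

n_pairs <- sum(less[upper.tri(less)], na.rm = TRUE) # result (b): each pair counted once
n_pairs
# [1] 7
sum(less, na.rm = TRUE) # 14, because every pair appears twice in the full matrix
```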
Here's one way to do it...
# Get row and column indices
ind <- t( combn( df1$ID , 2 ) )
# Get totals
tot <- with( df1 , Var[ match( ind[,1] , ID ) ] + Var[ match( ind[,2] , ID ) ] )
# Make df2 a matrix
m <- as.matrix( df2 )
# Total number of values is simply
sum( m[ ind ] > tot )
#[1] 7
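# Flag the upper-triangle cells where the df2 value exceeds the sum from df1 (1 = TRUE).
# Assign through 'ind' itself: upper.tri() walks the matrix column by column, while
# combn() lists the pairs row by row, so m[upper.tri(m)] would put results in the wrong cells.
m[ ind ] <- m[ ind ] > tot
# A B C D E
#A NA 1 1 1 1
#B 42 NA 0 1 0
#C 83 26 NA 1 1
#D 74 69 67 NA 0
#E 84 9 95 6 NA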
This will do what you want.
# Generate the data
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,
69,9,83,26,NA,67,95,74,69,
67,NA,6,84,9,95,6,NA),
ncol = 5, nrow = 5,
dimnames = list(df1$ID, df1$ID)))
# Define a pairwise comparison index matrix using 'combn'
idx <- combn(nrow(df1), 2)
# Create a results matrix
res <- matrix(NA, ncol = ncol(df2), nrow = nrow(df2))
# Loop through 'idx' for each possible comparison (without repeats)
for(i in 1:ncol(idx)){
logiTest <- (df1$Var[idx[1,i]] + df1$Var[idx[2,i]]) < df2[idx[1,i], idx[2,i]]
res[idx[1,i], idx[2, i]] <- logiTest
res[idx[2,i], idx[1, i]] <- logiTest
}
# Count the number of 'true' comparisons
nTrues <- sum(res, na.rm = TRUE)/2
The code simply uses a pairwise comparison index (idx) to define which elements in both df1 and df2 are to be used in each iteration of the 'for loop'. It then uses this same index to define where in the 'res' matrix the answer to the logical test is to be written.
N.B. This code will break down if the order of the elements in df1 and df2 is not the same. In such cases, it would be appropriate to use the actual letters to define which values to compare.
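A minimal sketch of that name-based alignment, assuming the matrix carries the IDs as row and column names (the toy objects below are made up for illustration):

```r
ids <- c("A", "B", "C")
# Toy matrix whose rows and columns are stored in a different order than 'ids'
m <- matrix(1:9, nrow = 3, dimnames = list(c("C", "A", "B"), c("C", "A", "B")))

aligned <- m[ids, ids] # reorder rows and columns by name, not by position
rownames(aligned)      # "A" "B" "C"
```

After this step, positional indices such as those produced by combn refer to the same objects in both structures.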