For loop list of tibbles R

I want to create a list of random tibbles using a for loop. I have a large data set where I will need to apply functions to lists of tibbles and create lists of tibbles as the outputs. I understand there might be better ways to do this and would also appreciate hearing those but am trying to wrap my head around how for loops work.
I can create a list of random tibbles with each tibble in the list named:
tibble_random1 <- tibble(Number = sample((1:100), 10, replace = TRUE),
Letter = sample((LETTERS), 10, replace = TRUE),
Logical = sample(c("True", "False"), 10, replace = TRUE))
tibble_random2 <- tibble(Number = sample((1:100), 10, replace = TRUE),
Letter = sample((LETTERS), 10, replace = TRUE),
Logical = sample(c("True", "False"), 10, replace = TRUE))
tibble_random3 <- tibble(Number = sample((1:100), 10, replace = TRUE),
Letter = sample((LETTERS), 10, replace = TRUE),
Logical = sample(c("True", "False"), 10, replace = TRUE))
tibble_random <- list(tibble1 = tibble_random1,
tibble2 = tibble_random2,
tibble3 = tibble_random3)
I cannot figure out how to do this with a for loop or if a for loop is completely inappropriate for this.
Thanks.

Initialise a list of the required length, then fill in one tibble per iteration of the for loop.
tibble_random <- vector('list', 3)
for(i in seq_along(tibble_random)) {
tibble_random[[i]] <- tibble(Number = sample((1:100), 10, replace = TRUE),
Letter = sample((LETTERS), 10, replace = TRUE),
Logical = sample(c("True", "False"), 10, replace = TRUE))
}
You can also use replicate or lapply to do this without a for loop.
tibble_random <- replicate(3, tibble(Number = sample((1:100), 10, replace = TRUE),
Letter = sample((LETTERS), 10, replace = TRUE),
Logical = sample(c("True", "False"), 10, replace = TRUE)), simplify = FALSE)
To assign names to the list elements you can use:
names(tibble_random) <- paste0('tibble', seq_along(tibble_random))
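The answer mentions lapply but only shows replicate. A minimal sketch of the lapply variant could look like the following; the helper name make_random_tibble is an assumption introduced here for illustration and is not part of the original answer.

```r
library(tibble)

# Hypothetical helper (not in the original answer): builds one random tibble
make_random_tibble <- function(n = 10) {
  tibble(Number  = sample(1:100, n, replace = TRUE),
         Letter  = sample(LETTERS, n, replace = TRUE),
         Logical = sample(c("True", "False"), n, replace = TRUE))
}

# lapply() calls the helper once per index and collects the results in a list
tibble_random <- lapply(1:3, function(i) make_random_tibble())

# Name the elements tibble1, tibble2, tibble3
names(tibble_random) <- paste0("tibble", seq_along(tibble_random))
```

Wrapping the tibble construction in a function keeps the loop body short, and the same helper can be reused inside the for loop or replicate versions.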

Related

Process sets of rasters in parallel using lapp function from terra package

I have groups of rasters that I want to run a function on, I think probably using the lapp function from the {terra} package? Here is a simple example using toy data of the 'style' of thing I am hoping to accomplish.
library("terra")
rp10val = 106520
rp20val = 106520
rp50val = 154250
rp100val = 154250
rp200val = 154250
rp500val = 154250
rp1500val = 154250
sopval = 200
rp_10_vul = rast(nrow = 10, ncol = 10, vals = rep(rp10val, 10))
rp_20_vul = rast(nrow = 10, ncol = 10, vals = rep(rp20val, 10))
rp_50_vul = rast(nrow = 10, ncol = 10, vals = rep(rp50val, 10))
rp_100_vul = rast(nrow = 10, ncol = 10, vals = rep(rp100val, 10))
rp_200_vul = rast(nrow = 10, ncol = 10, vals = rep(rp200val, 10))
rp_500_vul = rast(nrow = 10, ncol = 10, vals = rep(rp500val, 10))
rp_1500_vul = rast(nrow = 10, ncol = 10, vals = rep(rp1500val, 10))
sop_tile = rast(nrow = 10, ncol = 10, vals = rep(sopval, 10))
input_raster_group <- c(rp_10_vul, rp_20_vul, rp_50_vul, rp_100_vul,
rp_200_vul, rp_500_vul, rp_1500_vul, sop_tile)
## In real world each of these lists would have rasters with different data in
input_raster_lists <- list(list(input_raster_group),
list(input_raster_group),
list(input_raster_group))
mcmapply(lapp,
input_raster_lists,
function(a,b,c,d,e,f,g,h){a+b+c+d+e+f+g+h},
mc.cores = 2)
## If working on windows, this might be better to try and run as proof of concept
# mapply(lapp,
# input_raster_lists,
# function(a,b,c,d,e,f,g,h){(a+b-c) / (d+e+f+g+h)})
Simplified data to make this easier to read
library("terra")
r10 = rast(nrow = 10, ncol = 10, vals = 10)
r20 = rast(nrow = 10, ncol = 10, vals = 20)
r50 = rast(nrow = 10, ncol = 10, vals = 50)
group <- c(r10, r20, r50)
input <- list(group, group, group)
You can use lapply to process the list elements sequentially:
x <- lapply(input, \(i) sum(i))
y <- lapply(input, \(i) app(i, sum))
z <- lapply(input, \(i) lapp(i, function(a,b,c){a+b+c}))
To use parallelization you could use e.g. parallel::parLapply or, as in your case, parallel::mcmapply.
SpatRaster objects hold a pointer (reference) to a C++ object that cannot be passed to a worker. Therefore you would need to use wrap and unwrap as I show below. I use proxy=TRUE to not force values to memory.
library(parallel)
inp <- lapply(input, \(x) wrap(x, proxy=TRUE))
f <- \(i) { unwrap(i) |> sum() |> wrap(proxy=TRUE)}
b <- mcmapply(f, inp)
out <- lapply(b, unwrap)
This approach may be useful in some cases, e.g. when you have to do many simulations on a relatively small raster that is in memory.
In most cases you would do parallelization because you are dealing with large rasters that are on disk. In that case, you could just send the filenames to the workers, and create the SpatRasters there (and write the output to disk).
There is more discussion here
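The filename-based approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the toy rasters are written to temporary files to stand in for large rasters on disk, and the names in_files, out_files and sum_layers are hypothetical, not taken from the answer.

```r
library(terra)
library(parallel)

# Toy three-layer rasters written to temporary files
# (stand-ins for large rasters that already live on disk)
in_files <- vapply(1:3, function(i) {
  r <- c(rast(nrow = 10, ncol = 10, vals = 10),
         rast(nrow = 10, ncol = 10, vals = 20),
         rast(nrow = 10, ncol = 10, vals = 50))
  f <- tempfile(fileext = ".tif")
  writeRaster(r, f)
  f
}, character(1))
out_files <- sub("\\.tif$", "_sum.tif", in_files)

# Hypothetical worker: opens the file, sums the layers, writes the result
# to disk and returns only the output filename (a plain character string)
sum_layers <- function(infile, outfile) {
  r <- rast(infile)
  app(r, sum, filename = outfile, overwrite = TRUE)
  outfile
}

# Forking is not available on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mcmapply(sum_layers, in_files, out_files, mc.cores = n_cores)

# Re-create SpatRaster objects from the output files in the main process
out <- lapply(res, rast)
```

Because only character strings travel between the main process and the workers, no wrap or unwrap step is needed in this variant.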

Arguments must have same length when using tapply

data.frame(q1 = sample(c(1, 5), 200, replace = T, prob = c(1/2, 1/2)),
gender = sample(c("M", "F"), 200, replace = T, prob = c(2/3, 1/3))
) %>% tapply(.$q1,list(.$gender),FUN=sum)
I just want to use tapply to sum by gender, but got error as below:
Error in tapply(., .$q1, list(.$gender), FUN = sum) :
arguments must have same length
Where's the problem?
For the sum example, you can use data.table syntax:
library(data.table)
df <- data.frame(q1 = sample(c(1, 5), 200, replace = TRUE, prob = c(1/2, 1/2)),
gender = sample(c("M", "F"), 200, replace = TRUE, prob = c(2/3, 1/3)))
as.data.table(df)[, sum(q1), by = gender]
This will also work with a function that has multiple return values, unlike my previous example with summarize:
as.data.table(df)[, shapiro.test(q1), by = gender]
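As for where the original error comes from: the call shown in the error message, tapply(., .$q1, list(.$gender), FUN = sum), suggests that the pipe passed the whole data frame as the first argument, so the two-column data frame was treated as X and q1 as the grouping index, and their lengths differ. A minimal base R sketch that avoids the pipe is shown below; the object name by_gender is introduced here only for illustration.

```r
set.seed(1)
df <- data.frame(q1 = sample(c(1, 5), 200, replace = TRUE, prob = c(1/2, 1/2)),
                 gender = sample(c("M", "F"), 200, replace = TRUE, prob = c(2/3, 1/3)))

# with() evaluates tapply() inside df, so columns can be referenced directly
by_gender <- with(df, tapply(q1, gender, sum))

# With magrittr loaded, wrapping the call in braces stops the pipe from
# inserting the data frame as the first argument:
# df %>% { tapply(.$q1, .$gender, sum) }

# Sanity check: the group sums add up to the overall sum
stopifnot(sum(by_gender) == sum(df$q1))
```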

Column name formatting in KableExtra in R

LaTeX commands are not formatting the column names as I intended.
library(kableExtra)
kable(test,"latex", col.names = c('Mean','\\textit{N}' ,'Strongly\nDisagree','Disagree','Neither Agree\norDisagree','Agree','Strongly\nAgree'))
The output I am getting is:
https://www.dropbox.com/s/xvl7lfh94bl2274/Kable%20Table.PNG?dl=0
I have tried both latex commands and R-markdown commands.
The N should be italicized, and Strongly Disagree, Neither Agree or Disagree, and Strongly Agree should each be broken across two lines.
test_data <- data.frame(Mean = runif(5, 3.71, 7.72),
N = sample(57:59, 5, replace = TRUE),
sd = c(1, rep(0, 4)),
d = rep(1, 5),
naod = sample(1:4, 5, replace = TRUE),
a = sample(5:12, 5, replace = TRUE),
sa = sample(37:44, 5, replace = TRUE))
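# linebreak() turns "\n" in the column names into LaTeX line breaks;
# escape = FALSE lets \textit{} and the generated LaTeX pass through unescaped
kable(test_data, "latex", booktabs = TRUE,
align = "c",
col.names = linebreak(c('Mean', '\\textit{N}', 'Strongly\n Disagree', 'Disagree', 'Neither Agree\n or Disagree', 'Agree', 'Strongly\n Agree')),
row.names = TRUE, escape = FALSE) %>%
row_spec(0, align = "c")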

apply statement to sample columns, across rows of different lengths

I'm trying to write a simple R function to sample 5-element substrings across two columns of a single data frame. The lengths of the strings are equal within each row, but they differ down the columns. The function works when I specify a row and col to act on, but I can't get the apply statement to work on each row and each column. As written, it will only pull random samples based on the length of the first instance, so if the first instance is shorter than any of the other strings, the output for the other rows is sometimes shorter than 5 elements.
example df:
BP TF
1 CGTCTCTATTCTAGGCAAGA TTTFFFFTFFFTFFFTFTTT
2 AAGTCACTCGAATTCGGATGCCCCCTAGGC TTFFFFFTFFFFTTFTFFTTTFTTTTFTFF
3 TGCTCATGACGGGAC FFFTFTFFFFTFTFT
'intended output:'
1 CTATT FFTFF
2 CCTAG TTTFT
3 TCATG TFTFF
'reproducible example code:'
#make fake data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3))
Fragment = function(string) {
nStart = sample(1:nchar(string) -5, 1)
substr(string, nStart, nStart + 4)
}
Fragment(string = my_df[1,1])#works for the first row, first col.
but this does not work:
apply(my_df, c(1,2), function(x) Fragment(string = my_df[1:nrow(my_df),1:ncol(my_df)]))
There was an error in your function:
Fragment = function(string) {
nStart = sample(1:(nchar(string) -5), 1)
substr(string, nStart, nStart + 4)
}
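It was missing parentheses around nchar(string) - 5. Because : has higher precedence than -, 1:nchar(string) -5 is evaluated as (1:nchar(string)) - 5, so the candidate start positions could be zero or negative and substr could return fewer than 5 characters.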
You can then simply use apply(my_df, c(1,2), Fragment) as suggested in the comments.
To show that this works now:
for(i in 1:10000){
stopifnot(all(5 == sapply(apply(my_df, c(1,2), Fragment), nchar)))
}
This shows that in 10000 tries, it always produced 5 characters as output.
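The precedence issue behind the bug can be seen in isolation with a short sketch; the variable names n, wrong and right are introduced here only for illustration.

```r
n <- 15  # length of the shortest example string

# Without parentheses: parsed as (1:n) - 5, so starts run from -4 to 10
wrong <- 1:n - 5

# With parentheses: starts run from 1 to 10
right <- 1:(n - 5)

range(wrong)
range(right)

# A non-positive start is treated as position 1, so the fragment is too short
short_fragment <- substr("ABCDEFGHIJ", -1, -1 + 4)
nchar(short_fragment)
```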

Grepl first letters, two patterns to match

I have the following data frame. I want to create a column called A1B1 with a 1 if there are strings starting with A1 or B1 or both in the data frame and a zero if not.
What am I doing wrong here:
set.seed(1)
Data <- data.frame(id = seq(1, 10),
Diag1 = sample(c("A123", "B123", "C123"), 10, replace = TRUE),
Diag2 = sample(c("D123", "E123", "F123"), 10, replace = TRUE),
Diag3 = sample(c("G123", "H123", "I123"), 10, replace = TRUE),
Diag4 = sample(c("A123", "B123", "C123"), 10, replace = TRUE),
Diag5 = sample(c("J123", "K123", "L123"), 10, replace = TRUE),
Diag6 = sample(c("M123", "N123", "O123"), 10, replace = TRUE),
Diag7 = sample(c("P123", "Q123", "R123"), 10, replace = TRUE))
A1orB1 <- c("^A1", "^B1")
Data$A1B1 <- apply(Data[-1],1,function(x)as.integer(sum(grepl(paste(A1orB1,collapse="|"), x))==1))
I would expect a 1 for IDs (1,2,3,4,5,8,9,10), but I only get a 1 for IDs 4, 5 and 9.
Thanks!
At present you are counting the A1... or B1... strings in each row and testing whether that count is exactly 1, so rows with two or more matches get a 0.
Perhaps replace sum() with any() (and consider breaking your code up into a few more intermediate chunks to make it more readable). Or you could just replace ==1 with >=1 in your code.
ss <- paste(A1orB1,collapse="|")
ff <- function(x) as.integer(any(grepl(ss,x)))
Data$A1B1 <- apply(Data[-1],1,ff)
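The difference between the two tests can be checked on a few hand-made rows. The vectors below are illustrative stand-ins, not rows from the question's data.

```r
pattern <- paste(c("^A1", "^B1"), collapse = "|")

# Hypothetical rows with two, one and zero matching codes
two_hits <- c("A123", "D123", "B123")
one_hit  <- c("C123", "D123", "B123")
no_hit   <- c("C123", "D123", "E123")

# sum(...) == 1 is TRUE only when exactly one code matches
sum_test <- sapply(list(two_hits, one_hit, no_hit),
                   function(x) as.integer(sum(grepl(pattern, x)) == 1))

# any(...) is TRUE when at least one code matches
any_test <- sapply(list(two_hits, one_hit, no_hit),
                   function(x) as.integer(any(grepl(pattern, x))))

sum_test  # 0 1 0
any_test  # 1 1 0
```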
