I'm creating a random number where each member of a specific group has the same value of the variable. I found a solution, but I suspect it isn't very efficient. I'm wondering if anyone has a way to do this in one line of code:
library(dplyr)
data(mtcars)
t1 <- Sys.time() #Can the next two lines be replaced by one?
a <- data.frame(random = runif(3, 0, 6),
                cyl = seq(4, 8, 2))
merged <- merge(mtcars, a, by = 'cyl')
t2 <- Sys.time()
t2 - t1
#check to make sure it worked
merged %>%
  group_by(cyl) %>%
  summarise(across(random, sd))
One-liner using ave.
res <- transform(mtcars, rand=ave(cyl, cyl, FUN=\(x) runif(1)))
Check:
with(res, tapply(rand, list(cyl), var))
# 4 6 8
# 0 0 0
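Since the question already loads dplyr, a grouped mutate is another one-pipeline option (a sketch; the single runif(1) draw is recycled across each group):
library(dplyr)
merged <- mtcars %>%
  group_by(cyl) %>%
  mutate(random = runif(1)) %>%  # one draw per cyl value, recycled within the group
  ungroup()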
I would like to create a for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$var1 == 0)["TRUE"]
n <- table(df$var1 == 1)["TRUE"]
PREV95 <- (svyciprop(~var1 == 1, level = 0.95, design = design, deff = "replace") * 100)
I need to run the same functions for 150 columns. I know that I need to put all my column names in one vector, x, but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish approach (the various flavours of apply, mapping, walking, ...).
Applying a function to the columns of a dataframe df:
a)
with base R, using the example dataset cars:
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
  summarise_all(my_function)
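Side note: summarise_all() has since been superseded; with dplyr >= 1.0 the across() spelling of the same thing would be:
cars %>%
  summarise(across(everything(), my_function))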
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it.
The function eval() does all the work. First, it filters out just the factor columns. Then it converts your factors to numeric variables so that the values can be summed as 0 and 1 (if we summed the factors directly, it would be based on the underlying level codes 1 and 2). Within the function I create another function, neg(), to give you the number of negative values by subtracting the sum of the 1s from the total length of the vector. Then I create the dataframes "n" (count of the positives), "N" (count of the negatives), and PREV95. I used pivot_longer to get the data in long format so that each stat you are looking for ends up in its own column when merged together.
Note I had to leave PREV95 out because I do not have a 'design' object to use as a parameter to run the function. I've commented it out, but you can uncomment it to add it back in. I think the logic for PREV95 should work, but I cannot check it without a 'design' object. I then used left_join to combine these dataframes and return "results"; again, I've commented out the version that you'd use to include PREV95. The function eval() takes your original dataframe as input and returns a dataframe, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0, 1), 10, TRUE)),
                 Var2 = factor(sample(c(0, 1), 10, TRUE)),
                 Var3 = factor(sample(c(0, 1), 10, TRUE)))
# note: the name eval() shadows base::eval(); a more distinctive name would be safer
eval <- function(df){
  # keep only the factor columns and convert them to 0/1 numerics
  df1 <- df %>%
    select_if(is.factor) %>%
    mutate_all(function(x) as.numeric(as.character(x)))
  # count of 0s = vector length minus the count of 1s
  neg <- function(x){
    length(x) - sum(x)
  }
  n <- df1 %>%
    summarize(across(where(is.numeric), sum)) %>%
    pivot_longer(everything(), names_to = "Var", values_to = "n")
  N <- df1 %>%
    summarize(across(where(is.numeric), function(x) neg(x))) %>%
    pivot_longer(everything(), names_to = "Var", values_to = "N")
  #PREV95 <- df1 %>%
  #  summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
  #  pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
  results <- n %>%
    left_join(N, by = "Var")
  #results <- n %>%
  #  left_join(N, by = "Var") %>%
  #  left_join(PREV95, by = "Var")
  return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0, 1), 10, TRUE)),
                 Var2 = factor(sample(c(0, 1), 10, TRUE)),
                 Var3 = factor(sample(c(0, 1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
  results[[var]][["n"]] <- sum(df[[var]] == 1)
  results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
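As an aside, the same counts can be computed without an explicit loop, and PREV95 can be folded in by building the formula from each column name (a sketch; the svyciprop part is commented out and untested here, since it needs your 'design' object):
stats <- sapply(df[VarList], function(x) c(n = sum(x == 1), N = sum(x == 0)))
t(stats)  # one row per variable, columns n and N
# hypothetical sketch for PREV95, assuming a svydesign object called 'design':
# PREV95 <- sapply(VarList, function(v)
#   as.numeric(survey::svyciprop(as.formula(paste0("~", v, " == 1")),
#                                design = design, level = 0.95)) * 100)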
I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.
# example data
ID <- seq.int(1,100)
val <- rnorm(100)
df <- cbind(ID, val) %>%
  as.data.frame()
I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.
df2 <- df %>%
  mutate(cm = cummean(val))
However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.
I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.
Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!
Here is an approach using the purrr package. Also, I'm not sure what cummean is calculating (maybe someone can share that in the comments), so I included an alternative, the column cm2, as a comparison.
library(tidyverse)
set.seed(2000)
num_iterations <- 100
num_sample <- 100
1:num_iterations %>%
  map_dfr(
    function(i) {
      tibble(
        iteration = i,
        id = 1:num_sample,
        val = rnorm(num_sample),
        cm = cummean(val),
        cm2 = cumsum(val) / seq_along(val)
      )
    }
  )
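Note that this simulates fresh rnorm() values on each iteration; to reshuffle the rows of an existing dataset instead, as the question asks, a small variant should work (a sketch using the df defined in the question):
1:num_iterations %>%
  map_dfr(
    function(i) {
      tibble(
        iteration = i,
        ID = seq_len(nrow(df)),        # how many values feed the cumulative mean
        cm = cummean(sample(df$val))   # shuffle the existing values first
      )
    }
  )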
You can use mutate with map_dfc to create 100 shuffled samples and then call cummean:
library(dplyr)
library(purrr)
df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
We may use rerun from purrr
library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
  dat %>%
    sample_n(size = n()) %>%
    mutate(cm = cummean({{valcol}}))
}
n <- 100
out <- rerun(n, f1(df, val))
The output of rerun is a list; we can name its elements with a sequence and, if we need to create a new column while binding them together, use bind_rows with .id
out1 <- bind_rows(out, .id = 'ID')
> head(out1)
ID val cm
1 1 0.3376980 0.33769804
2 1 -1.5699384 -0.61612019
3 1 1.3387892 0.03551628
4 1 0.2409634 0.08687807
5 1 0.7373232 0.21696708
6 1 -0.8012491 0.04726439
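Note that rerun has since been deprecated in purrr (as of 1.0.0); an equivalent with map would be:
out <- map(1:n, ~ f1(df, val))
out1 <- bind_rows(out, .id = 'ID')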
I am trying to use tidyverse tools (instead of for loops) on some groups to be evaluated with procedures from the mvabund package.
Basically, for the procedure I need a dataframe with just numeric columns (species abundances) first and then grouping variables for a downstream procedure.
But if I want to do this on multiple groupings, I need to include grouping variables. However, when using group_by these non-numeric variables are still present and the procedure will not run.
How can I use dplyr to pass the numeric variables to a (mvabund) function?
If I were working with just one group, the process would be as follows:
library(tidyverse)
library(mvabund)
df <- data.frame(Genus.species1 = rep(c(0, 1), each = 10),
                 Genus.species2 = rep(c(1, 0), each = 10),
                 Genus.species3 = sample(1:100, 20, replace = T),
                 Genus.species4 = sample(1:100, 20, replace = T),
                 GroupVar1 = rep(c("Site1", "Site2"), each = 2, times = 5),
                 GroupVar2 = rep(c("AA", "BB"), each = 10),
                 GroupVar3 = rep(c("A1", "B1"), times = 10))
df1 <- filter(df, GroupVar2 == "AA" & GroupVar3 == "A1") # get desired subset/group
df2 <- select(df1, -GroupVar1, -GroupVar2, -GroupVar3) # retain numeric variables
MVA.fit <- mvabund(df2) # run procedure
MVA.model <- manyglm(MVA.fit ~ df1$GroupVar1, family="negative binomial") # here I need to bring back GroupVar1 for this procedure
MVA.anova <- anova(MVA.model, nBoot=1000, test="wald", p.uni="adjusted")
MVA.anova$table[2,] # desired result
I have tried using map, do, nest, etc to no avail.
Without groupings this works
df.t <- as_tibble(df)
nest.df <- df.t %>% nest(-GroupVar1, -GroupVar2, -GroupVar3)
mva.tt <- nest.df %>%
  mutate(mva.tt = map(data, ~ mvabund(.x)))
but this next step does not
mva.tt %>% mutate(MANY = map(data, ~ manyglm(.x ~ GroupVar1, family="negative binomial")))
Moreover, once I try to remove columns that sum to zero or include groupings, everything fails.
Is there a smart way to to this with dplyr and pipes? Or is a for loop the answer?
Edit:
Originally, I asked this as well: "Also, when broken into groups, the dataframe will contain columns that are all zeroes; normally I'd remove these. Can I have dplyr groupings that vary in the number of variables?" But the comments revealed this is not possible given my proposed setup. So I am still interested in the above.
Copied the steps into a function. Also added group information to differentiate in the last line.
fun <- function(df) {
  df1 <- select(df, -GroupVar1, -GroupVar2, -GroupVar3)
  df3 <- df1 %>% select_if(~ sum(.) > 0)  # drop species columns that sum to zero
  MVA.fit <- mvabund(df3)
  MVA.model <- manyglm(MVA.fit ~ df$GroupVar1, family = "negative binomial")
  MVA.anova <- anova(MVA.model, nBoot = 1000, test = "wald", p.uni = "adjusted")
  cbind(Group2 = df$GroupVar2[1], Group3 = df$GroupVar3[1], MVA.anova$table[2,])
}
Split the dataframe into groups and apply the function
library(tidyverse)
library(mvabund)
df %>%
  group_split(GroupVar2, GroupVar3) %>%
  map_dfr(fun)
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
# Group2 Group3 Res.Df Df.diff wald Pr(>wald)
#1 AA A1 3 1 1.028206 0.7432567
#2 AA B1 3 1 2.979169 0.1608392
#3 BB A1 3 1 2.330708 0.2137862
#4 BB B1 3 1 1.952617 0.2567433
I have a large matrix of distances between pairs of zip codes (computed with the rgeosphere package). I would like to run a function that finds all zip code pairings that are <=x distance away from each other and creates a list of them. The data looks like this:
91423 92231 94321
90034 3 4.5 2.25
93201 3.75 2.5 1.5
94501 2 6 0.5
So if I ran the function to extract all zip code pairings that are <2 miles apart, I would end up with these pairs:
94321 93201
94321 94501
The goal is basically to identify all adjacent zip codes in the US to a list of zip codes I have. If there is a better way to do this I am open to suggestions.
Perhaps something like the following. It will be slow, but it should work.
hold.zips <- NULL
for(i in 1:nrow(data)){
  for (j in 1:ncol(data)){
    if(data[i,j] < distance){
      if(is.null(hold.zips)){
        hold.zips <- matrix(c(rownames(data)[i], colnames(data)[j]), ncol = 2)
      }else{
        temp <- matrix(c(rownames(data)[i], colnames(data)[j]), ncol = 2)
        hold.zips <- rbind(hold.zips, temp)
      }
    }
  }
}
This should work. Gives a nice list as output (calling your data x):
rn = rownames(x)
apply(x, 2, function(z) rn[z < 2])
# $`91423`
# character(0)
#
# $`92231`
# character(0)
#
# $`94321`
# [1] "93201" "94501"
Here is the Tidyverse solution:
library(dplyr)
library(tidyr)
# your data
dat <- matrix(c(3,3.75,2,4.5,2.5,6,2.25,1.5,0.5), nrow = 3, ncol = 3)
rownames(dat) <- c(90034, 93201, 94501)
colnames(dat) <- c(91423, 92231, 94321)
# tidyverse solution
r <- rownames(dat)
dat_tidy <- dat %>%
  as_tibble() %>%
  mutate(x = r) %>%
  select(x, everything()) %>%
  gather(key = y,
         value = distance,
         -x) %>%
  filter(distance < 2)
print(dat_tidy)
# note: if your matrix is a symmetric matrix, then
# to remove duplicates the filter would be:
# filter(x < y,
#        distance < 2)
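gather() has since been superseded by pivot_longer(); a sketch of the equivalent (assuming tidyr >= 1.0):
dat_tidy <- dat %>%
  as_tibble(rownames = "x") %>%   # keep the rownames as a column
  pivot_longer(-x, names_to = "y", values_to = "distance") %>%
  filter(distance < 2)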
I am trying to find the starting point of the largest break of a given data. Here is my example:
data <- data.frame(month = c(1:12), countx = c(60,69,10,13,65,80,59,84,43,21,18,10))
select <- data[data$countx >= 50,] #take value greater than 50 into account
# find the break
wtym <- select$month
breaks <- c(0, which(diff(wtym) != 1), length(wtym))
allbreak <- sapply(seq(length(breaks) - 1),
                   function(i) wtym[(breaks[i] + 1):breaks[i + 1]])
> allbreak
[[1]]
[1] 1 2
[[2]]
[1] 5 6 7 8
The question is: I need to do this for a large number of datasets (and the breaks obviously vary). Is there any way to automatically pick out the starting point of the largest break in a series (in this example, month 5, in break no. 2)? Any idea is highly appreciated. Thanks
Sounds like a run-length-encoding ?rle task where you are looking for runs of x < 50 and x >= 50. Here's a function:
bigbreak <- function(x, cutoff) {
  r <- rle(x >= cutoff)  # runs of TRUE (>= cutoff) and FALSE (< cutoff)
  # start of the longest TRUE run = end of the preceding run, plus 1
  # (caveat: returns numeric(0) if the longest run starts at position 1)
  cumsum(r$lengths)[which(r$lengths == max(r$lengths[r$values]) & r$values) - 1] + 1
}
bigbreak(data$countx, 50)
#[1] 5
Now let's try it on 5 million records:
set.seed(1)
x <- sample(c(50,0), 5e6, replace=TRUE)
system.time({
bigbreak(x, 50)
})
# user system elapsed
# 0.41 0.00 0.41
Under half a second, not too bad.
A solution using dplyr and data.table.
# Create example data frame
data <- data.frame(month = c(1:12), countx = c(60,69,10,13,65,80,59,84,43,21,18,10))
# Load package
library(dplyr)
library(data.table)
# Process the data
data2 <- data %>%
  mutate(Condition = countx >= 50) %>%
  mutate(RunID = rleid(Condition)) %>%
  filter(Condition) %>%
  group_by(RunID) %>%
  mutate(num = n()) %>%
  ungroup() %>%
  filter(num == max(num))
# Show the number of the first month
data2$month[1]
[1] 5