Create a sequence in R that alternately repeats values

My homework asks me to create a sequence like this:
(1, 2, 2, 3, 4, 4, 5, 6, 6, ..., 50, 50)
I'm new to R, so I would really appreciate your help.

You can use rep and alternate the number of repeats. Check ?rep for more information.
rep(1:50, ifelse(1:50 %% 2 == 1, 1, 2))
Or, with two calls to rep (either form works; length.out recycles 1:2 to 50 elements):
rep(1:50, rep(1:2, 25))
rep(1:50, rep(1:2, length.out = 50))

Short code using integer division:
2:76 %/% 1.5
This returns a numeric (double) vector rather than an integer one.
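As a quick sanity check, the approaches above can be compared side by side. This is only a minimal sketch; the variable names are illustrative and do not come from the original answers.

```r
# Three ways to build 1, 2, 2, 3, 4, 4, ..., 49, 50, 50
by_ifelse   <- rep(1:50, ifelse(1:50 %% 2 == 1, 1, 2))  # odd values once, even values twice
by_rep      <- rep(1:50, rep(1:2, length.out = 50))     # repeat counts 1, 2, 1, 2, ...
by_division <- 2:76 %/% 1.5                             # integer division trick

head(by_rep, 9)             # 1 2 2 3 4 4 5 6 6
length(by_rep)              # 75
all(by_ifelse == by_rep)    # TRUE
all(by_division == by_rep)  # TRUE

# The rep() versions are integer; the division version is double
typeof(by_rep)              # "integer"
typeof(by_division)         # "double"

# Convert if an integer vector is required
by_division_int <- as.integer(by_division)
```

If the homework checker compares types strictly, the as.integer() conversion at the end may matter; otherwise the values are the same.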

Related

R: Exclude granges overlapped by another range

I would like to compute the part of a query range that remains after excluding other ranges. Is there a way to do it? Thanks.
query <- IRanges(1, 10)
ranges2exclude <- IRanges(c(2, 6), c(3, 7))
The output I want:
IRanges(c(1, 4, 8), c(1, 5, 10))
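One way to see what the desired output means is to work on plain integer positions in base R. The sketch below is an illustration under that assumption, not an IRanges-based solution; all variable names are made up for the example, and expanding ranges into positions is only practical for small ranges. The IRanges package also provides a setdiff() method for range objects, which is probably worth trying first on the objects above.

```r
# Query range and ranges to exclude, as plain start/end vectors
query_start <- 1
query_end   <- 10
excl_start  <- c(2, 6)
excl_end    <- c(3, 7)

# Expand everything to integer positions and drop the excluded ones
query_pos <- seq(query_start, query_end)
excl_pos  <- unlist(Map(seq, excl_start, excl_end))
remaining <- setdiff(query_pos, excl_pos)

# Collapse consecutive positions back into ranges:
# a new run starts wherever the gap to the previous position is not 1
run_id <- cumsum(c(1, diff(remaining) != 1))
starts <- as.vector(tapply(remaining, run_id, min))
ends   <- as.vector(tapply(remaining, run_id, max))

starts  # 1 4 8
ends    # 1 5 10
```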

Automatically insert whitespace in RStudio script

In the same way that lines can be correctly indented using Ctrl + I or Cmd + I, is there a shortcut to automatically insert correct whitespace in RStudio scripts?
For example, for this:
df<-data.frame(x=c(1,2,3,4,5),y=c(3,4,5,6,7))
RStudio gives information saying "expected whitespace around '<-' operator" and "expected whitespace around '=' operator". Is there a shortcut to get this:
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(3, 4, 5, 6, 7))
In RStudio, you can select the code and press Ctrl+Shift+A to reformat it; see the RStudio keyboard shortcuts.
Result:
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(3, 4, 5, 6, 7))

Using foreach to create new observations and deleting erroneous observations in parallel

I am currently trying to clean a very large data set. I have working code to clean it, but it takes about three days to run without any parallelization, so I want to parallelize it. The original code works fine, but I can't figure out how to parallelize it in R using the doParallel and foreach packages or any other pre-built ones.
In particular, if I observe two data points that have the same time stamp, they should really be one data point. The non-parallelized code can accurately identify the points, flag them to be deleted later and create a new data point that is correct.
I've tried adapting the existing code by converting the for loops into foreach loops with the %do% operator from the foreach package. This works fine. Changing %do% to %dopar% causes the code to stop working. I understand that this is the incorrect way to use %dopar%, but I don't know how to correctly accomplish my goal.
library(doParallel)
library(foreach)
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0)) # Indicator for problem observations
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))
foreach (row1 = 1:nrow(df1)) %dopar% {
  for (row2 in 1:nrow(df2)) {
    if (df1[row1, "date"] == df2[row2, "date"]) { # Observations that occur on the same date should be combined
      df1[row1, "ind"] <- 1 # Tag problem observations to delete them later
      df2[row2, "ind"] <- 1
      temp_obs <- data.frame(ID = df2[row2, "ID"],
                             date = df1[row1, "date"],
                             var2 = df1[row1, "var2"],
                             var3 = df1[row1, "var3"] + df2[row2, "var3"],
                             ind = 0)
      df1 <- rbind(df1, temp_obs)
      rm(temp_obs)
    }
  }
}
The sample code demonstrates my problem in a simpler context. It loops through all observations in df1 and df2, and identifies observations with the same date. It should add a 6th observation to df1, and change the indicators from 0 to 1 in the 1st entry of df1 and the second entry of df2 to indicate that they have been matched. As is, this code does not change df1 or df2 at all. It works when %dopar% is replaced with %do%.
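The behaviour described above is expected: with %dopar%, each iteration runs on a worker that sees its own copy of df1 and df2, so assignments made inside the loop body never reach the originals. Only the value returned by each iteration comes back. One option that sidesteps the problem is to drop the nested loops and find the matching dates with a single merge() in base R. This is a different technique from foreach, offered as a minimal sketch on the sample data; the names matches and new_obs are illustrative, and the sketch assumes each date occurs at most once per data frame.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))

# One row per pair of observations that share a date;
# columns from df1 get the suffix ".1", columns from df2 get ".2"
matches <- merge(df1, df2, by = "date", suffixes = c(".1", ".2"))

# Build the combined observations, following the rules of the original loop
new_obs <- data.frame(ID   = matches$ID.2,
                      date = matches$date,
                      var2 = matches$var2.1,
                      var3 = matches$var3.1 + matches$var3.2,
                      ind  = rep(0, nrow(matches)))

# Flag the matched originals for later deletion
df1$ind[df1$date %in% matches$date] <- 1
df2$ind[df2$date %in% matches$date] <- 1

# Append the combined observations
df1 <- rbind(df1, new_obs)
```

On the sample data this adds a sixth row to df1 and flags the first row of df1 and the second row of df2. Because the matching is done in one vectorized step, it may already be fast enough without any parallelization; if a foreach loop is still needed, the same principle applies: return the new rows from each iteration and combine them afterwards rather than modifying df1 inside the loop.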

R - removing data table rows based on two values

I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = TRUE),
                   strip1 = sample(1:12, 100, replace = TRUE),
                   strip2 = sample(1:12, 100, replace = TRUE))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold. The thresholds are stored in two numeric arrays, where each position in an array corresponds to that strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy each strip can receive. What I want to do is keep a row only when the energy is over the required threshold for BOTH strips, and remove every other row from the data frame.
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row: although strip2 has a threshold of 3, which is lower than 4, strip1 has a threshold of 6, so there isn't enough energy here.
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.
# Compare the deposited energy with each strip's own threshold
filter1 <- data$Energy >= threshold_strip1[data$strip1]
filter2 <- data$Energy >= threshold_strip2[data$strip2]
data <- subset(data, filter1 & filter2)
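The key step is indexing each threshold vector by the strip number, which looks up the threshold that applies to each row. The sketch below checks this on a tiny hand-made data frame that includes the example row from the question; the data frame example and the names meets1, meets2 and kept are made up for illustration, and >= is assumed to be the intended comparison.

```r
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)

# Row 1 is the example from the question (Energy = 4, strip1 = 2, strip2 = 2)
example <- data.frame(Energy = c(4, 7, 2),
                      strip1 = c(2, 2, 6),
                      strip2 = c(2, 2, 6))

# Look up the threshold that applies to each row
threshold_strip1[example$strip1]  # 6 6 1
threshold_strip2[example$strip2]  # 3 3 2

# A row is kept only if the energy meets both thresholds
meets1 <- example$Energy >= threshold_strip1[example$strip1]
meets2 <- example$Energy >= threshold_strip2[example$strip2]
kept <- example[meets1 & meets2, ]

kept$Energy  # 7 2
```

The first row is dropped because 4 is below the strip1 threshold of 6, even though it passes the strip2 threshold of 3.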
I'd maybe do...
library(data.table)
setDT(data)
# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
  strip_loc = rep(seq_along(threshes), lengths(threshes)),
  strip_num = unlist(lapply(threshes, seq_along)),
  thresh = unlist(threshes)
)
# loop over strip locations (strip1, strip2, etc)
# marking where threshold is not met
data[, keep := TRUE]
lbDT[, {
  onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
  data[.(s = strip_num, th = thresh), on = onexpr, keep := FALSE]
  NULL
}, by = strip_loc]
What about this? Using dplyr:
library(dplyr)
data2 <- data %>%
  mutate(
    # threshold that applies to each row's strip
    strip1_value = threshold_strip1[strip1],
    strip2_value = threshold_strip2[strip2],
    to_keep = Energy > strip1_value & Energy > strip2_value
  ) %>%
  filter(to_keep)

Create conditional variable in multiple data.tables (or data.frames)

I want to execute the same action in multiple data.tables (or data.frames). For example, I want to create the same variable conditional on the same rule in all data.tables.
A simple example can be (df1=df2=df3, without loss of generality here)
df1 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 = c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 = c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 = c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
My approach was (i) to create a list of the data frame names (list.df), and (ii) to loop over this list, trying to create the variable:
list.df <- vector('list', 3)
for (j in 1:3) {
  name <- paste('df', j, sep = '')
  list.df[j] <- name
}
My (bad) attempt:
for (i in 1:3) {
  a <- get(paste(list.df[[i]], "$var1", sep = ""))
  b <- get(paste(list.df[[i]], "$var2", sep = ""))
  name <- paste(list.df[[i]], "$var.new", sep = "")
  assign(name, ifelse(a == 2 & b == 10, 1, 0))
}
Clearly R cannot create this new variable the way I am doing it, as I get an "object not found" error message.
Any clues on how to fix my bad code? I have a feeling that dplyr could help me but I don't know how.
We can use mget after building the object names with paste0, which returns the values (i.e. the data.frames) in a list. We then loop through the list with lapply and transform each dataset by creating the binary variable 'varNew'. We can either use ifelse on the logical condition or just wrap it in + to coerce TRUE/FALSE to 1/0.
lst <- lapply(mget(paste0('df', 1:3)), transform,
varNew = +(var1==2 & var2==10))
If we need to update the original objects, we can use list2env.
list2env(lst, envir = .GlobalEnv)
df1
df2
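For comparison, the loop from the question can also be repaired directly. The original fails because get() looks for an object whose name is literally "df1$var1"; it does not evaluate the $ operator. A minimal sketch of the fix, assuming the data frames live in the current environment: fetch the whole data frame by name, modify the copy, and assign it back under the same name. The loop variables name and d are illustrative.

```r
df1 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 = c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 = c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 = c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))

for (name in paste0("df", 1:3)) {
  d <- get(name)                                         # fetch the data frame by its name
  d$var.new <- ifelse(d$var1 == 2 & d$var2 == 10, 1, 0)  # add the conditional variable
  assign(name, d)                                        # write the modified copy back
}

df1$var.new  # 0 1 1 1 0
```

Keeping the data frames in a list, as in the mget/lapply answer, is usually easier to maintain than relying on get() and assign().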
