extract data one row below based on specific condition - r

I have a very large dataset (~1 GB) and would like to extract summary data under the following condition:
for loop:
if(a[i] == 999) then extract b[i+1]
else next
so that I can then call table(b) to find its distribution/composition, assuming column b is of class character and column a is of class integer.
my R code:
summary123 <- data.frame()
j = 1
k = 1
for(i in 1:nrow(df1)){
if(df1$a[i] == 999 & i != nrow(df1)){
j = i + 1
summary123[k,1] <- df1$b[j]
k = k + 1
}
else{
next
}
}
However, it is taking a long time, and I would like a faster R equivalent.

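Use lead from dplyr:
library(dplyr)
output <- lead(df1$b, 1)[which(df1$a == 999)]
If the last row of df1 also has a == 999, the final element of output is an NA introduced by the lead function, because there is no row below it. In that case, drop it:
if (isTRUE(df1$a[nrow(df1)] == 999)) output <- head(output, -1)
(head(output, -1) removes the last element; output[-1] would remove the first one instead.)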
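For comparison, the same extraction can be sketched in base R without dplyr. This is only a minimal sketch, assuming df1 has an integer column a and a character column b as described in the question; the small df1 below is made-up illustration data, and the name idx is introduced here for the example.

```r
# Hypothetical toy data standing in for the asker's df1
df1 <- data.frame(
  a = c(1L, 999L, 3L, 999L, 999L),
  b = c("u", "v", "w", "x", "y"),
  stringsAsFactors = FALSE
)

# Row numbers where a is 999; which() also drops any NA in a
idx <- which(df1$a == 999)

# Discard a match on the last row, since there is no row below it
idx <- idx[idx < nrow(df1)]

# Take b from the row directly below each match
output <- df1$b[idx + 1]

# Distribution of the extracted values
table(output)
```

Because everything here is a vector operation, no per-row loop or repeated growing of a data frame is involved, which is usually where the time goes in the original code.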

Related

How do I save a single column of data produced from a while loop in R to a dataframe?

I have written the following very simple while loop in R.
i=1
while (i <= 5) {
print(10*i)
i = i+1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
You may try this (if you want to keep the while loop):
df1 <- c()
i=1
while (i <= 5) {
print(10*i)
df1 <- c(df1, 10*i)
i = i+1
}
as.data.frame(df1)
df1
1 10
2 20
3 30
4 40
5 50
Or
df1 <- data.frame()
i=1
while (i <= 5) {
df1[i,1] <- 10 * i
i = i+1
}
df1
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
result <- 10*i
print(result)
dat$y[i] <- result
i <- i+1
}
Of course, in R we rarely need to write loops like this. Normally, we use vectorized operations to carry out such tasks faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could also be expressed as a for loop.
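As a rough sketch of that last point, the while loop above could be written as a for loop like this. It assumes the same dat and n as in the earlier snippets; pre-allocating y with NA_real_ is a choice made for this example.

```r
n <- 5
dat <- data.frame(x = 1:n)

# Pre-allocate the column as numeric missing values
dat$y <- NA_real_

# seq_len(n) is safe even when n is 0, unlike 1:n
for (i in seq_len(n)) {
  dat$y[i] <- 10 * i
}

dat$y
```

The for loop removes the manual counter update (i <- i + 1), which is one less thing to get wrong.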
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful and their use is a necessary and essential part of programming in R.

R programming: How to set while loop condition based if all required values in vector have been copied from sample?

I am new to R and I'm trying to see how many iterations are needed to fill a vector with the numbers 1 to 55 (no duplicates) from a random sample using runif.
At the moment, the vector has lots of duplicates in it, and the number of iterations being returned is the size of the vector. So I'm not sure if my logic is correct.
The aim of the if statement is to check whether the value from the sample exists in the vector and, if it does, choose the next one. But I'm not sure if it's correct, since the next number could already exist in the vector. Any help would be much appreciated.
numbers=as.integer(runif(800, min=1, max=55)) ## my sample from runif
i=sample(numbers, 1)
## setting up my vector to store 55 unique values (1 to 55)
p=rep(0,55)
## my counters
j=0
n=1
## my while loop
while (p[n] %in% 0){
## if the sample value already exists in the vector, choose the next value from the sample
if (numbers[n] %in% p) {
p[n]=numbers[n+1]
}
else {
p[n] = numbers[n]
}
n = n + 1
j = j + 1
}
I believe that the following is what you want. Instead of a while loop on p, the while loop should search for a new value in numbers.
set.seed(2021) # make the results reproducible
numbers <- sample(55, 800, TRUE)
## setting up my vector to store 55 unique values (1 to 55)
p <- integer(55)
# assign the elements of p one by one
for(j in seq_along(p)){
## if the sample value already exists in the vector,
## choose the next value from the sample
n <- 1
while (numbers[n] %in% p) {
n <- n + 1
}
if(n <= length(numbers)){
p[j] <- numbers[n]
}
}
j
#[1] 55
length(unique(p)) == length(p)
#[1] TRUE
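If the goal is only to count how many draws it takes until every value from 1 to 55 has appeared, a loop may not be needed at all. The following is a minimal sketch, assuming the same numbers vector as above: match() returns the position of the first occurrence of each value, so the largest of those positions is the number of draws required. The names first_seen and draws_needed are introduced here for illustration.

```r
set.seed(2021) # make the results reproducible
numbers <- sample(55, 800, TRUE)

# Position of the first occurrence of each value 1..55 (NA if never drawn)
first_seen <- match(1:55, numbers)

# Number of draws needed until all 55 values have been seen;
# this is NA if some value never appears in the sample
draws_needed <- max(first_seen)

# The distinct values in the order they were first drawn
p <- unique(numbers)
```

Here unique() gives the same kind of duplicate-free vector that the loop builds, and draws_needed answers the "how many iterations" part of the question directly.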

Checking the value in the dataframe in R

The following is my R code. I am trying to check whether the value a = 10 is included in the data frame. If it is included, I need to compute the length of that data frame; otherwise I want to assign the length 0.
Assume the value I am checking is 10
k1 = c(1,2,3,5,6)
k2 = c(10,12,13,15,16,18)
For example, for the k1 set I want to get the length 0, whereas for k2 the length must be 6.
I am trying to use the following code to do this:
library(tidyverse)
map_lgl(k, `%in%`, x = 10) %>% length
Why is it not working for the k1 dataset?
You can do this with a simple ifelse statement; nothing else is required.
a <- 10
ifelse(a %in% k1, length(k1), 0)
[1] 0
You could wrap this in a function and feed the different sets in:
my_func <- function(x){
ifelse(a %in% x, length(x), 0)
}
my_func(k2)
[1] 6
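To apply the same check to several vectors at once, one option is to collect them in a named list and use sapply. This is a sketch under the assumption that the vectors can be put in a list; the names k_list, lens and results are made up for this example.

```r
k1 <- c(1, 2, 3, 5, 6)
k2 <- c(10, 12, 13, 15, 16, 18)

# Collect the vectors in a named list
k_list <- list(k1 = k1, k2 = k2)

# For each vector: its length if it contains 10, otherwise 0
lens <- sapply(k_list, function(x) if (10 %in% x) length(x) else 0L)

# One row per vector, with its name and the resulting length
results <- data.frame(set = names(lens), length = unname(lens))
results
```

A plain if/else is used inside the function because the condition is a single TRUE or FALSE; ifelse is mainly useful when the test is a whole vector.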
If you have more k(i) vectors (100, for example) and you need to iterate over all of them, you can use a loop and store the results in a summary table.
I have never used map_lgl, but we can do this in plain base R, like:
k1 <- c(1,2,3,5,6)
k2 <- c(10,12,13,15,16,18)
results <- data.frame()
for(i in 1:2){
analysis <- get(paste("k",i,sep=""))
if(10 %in% analysis){
results[nrow(results)+1, 1] <- paste("k",i,sep="")
results[nrow(results), 2] <- length(analysis)
} else{
results[nrow(results)+1, 1] <- paste("k",i,sep="")
results[nrow(results), 2] <- 0
}
}
Then we get a results table with one row per vector: k1 with length 0 and k2 with length 6.

Increase performance of R for-loops after pre-allocation of data structures

I have read a bit about increasing performance of for loops in r, but I am still stuck with one taking ~140secs.
I will start with the code:
matrix <- matrix(NA, length(register[,1]), length(AK), dimnames = list(register[,1], AK))
data.cleaned <- data[data$FO %in% register[,1],]
rownames(data.cleaned) <- paste(1:nrow(data.cleaned))
for (i in 1 : nrow(data.cleaned)) {
for (j in 1 : nrow(matrix)) {
if (data.cleaned$FO[i] == rownames(matrix)[j]) {
for (k in 1 : ncol(matrix)) {
if (data.cleaned$AK[i] == colnames(matrix)[k])
{matrix[j,k] <- 1}
}
}
}
}
Unfortunately, I can't provide a reproducible example. data.cleaned is a data frame with around 11000 rows. Each row holds one observation for FO (main category) and one for AK (sub-category of FO); these are two different variables.
The goal is to fill matrix[j,k] with 1 if some row of data.cleaned contains the corresponding FO and AK observation.
Does this make sense? Please also comment if I need to specify anything or can write the post in a clearer way.
First step:
You can set
cnames.m <- colnames(matrix)
before you go into the loops. Then, inside the innermost loop, you can write
if (data.cleaned$AK[i] == cnames.m[k]) matrix[j,k] <- 1
Second step:
The inner loop is identical with
matrix[j, data.cleaned$AK[i] == cnames.m] <- 1
So there is no need to loop with k.
matrix <- matrix(NA, length(register[,1]), length(AK), dimnames = list(register[,1], AK))
data.cleaned <- data[data$FO %in% register[,1],]
rownames(data.cleaned) <- paste(1:nrow(data.cleaned))
cnames.m <- colnames(matrix)
for (i in 1 : nrow(data.cleaned)) for (j in 1 : nrow(matrix))
if (data.cleaned$FO[i] == rownames(matrix)[j]) matrix[j, data.cleaned$AK[i] == cnames.m] <- 1
One remark about object names:
It is not a good idea to name a matrix matrix (would you name a dog Dog?)
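The remaining loops over i and j can in principle be removed as well by indexing the matrix with a two-column matrix of (row, column) positions. The following is a minimal sketch on made-up data, since the question has no reproducible example; register_ids, AK_levels, obs, m, keep and cells are all hypothetical names standing in for the asker's objects.

```r
# Hypothetical toy data; the question did not include a reproducible example
register_ids <- c("F1", "F2", "F3")          # stands in for register[,1]
AK_levels    <- c("A", "B")                  # stands in for AK
obs <- data.frame(FO = c("F1", "F3", "F3", "F9"),
                  AK = c("B", "A", "B", "A"),
                  stringsAsFactors = FALSE)  # stands in for data

m <- matrix(NA, length(register_ids), length(AK_levels),
            dimnames = list(register_ids, AK_levels))

# Keep only observations whose FO and AK both appear in the matrix
keep <- obs$FO %in% rownames(m) & obs$AK %in% colnames(m)

# A two-column matrix of (row, column) positions indexes individual cells
cells <- cbind(match(obs$FO[keep], rownames(m)),
               match(obs$AK[keep], colnames(m)))
m[cells] <- 1
m
```

match() looks up each FO and AK once, so the work grows with the number of observations rather than with observations times rows times columns.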

cor() function in R with a subset

I have a table in R with three columns. I want to get the correlation of the first two columns with a subset of the third column following a specific set of conditions (values are all numeric, I want them to be > a certain number). The cor() function doesn't seem to have an argument to define such a subset.
I know that I could use the summary(lm()) function and square-root the r^2, but the issue is that I'm doing this inside a for loop and am just appending the correlation to a separate list that I have. I can't really append part of the summary of the regression easily to a list.
Here is what I am trying to do:
for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}
Obviously, though, I can't do that because the cor() function doesn't work with subsets.
(Note: x = seq(1,100) and list = NULL)
You can do this without a loop using lapply. Here's some code that will output a data frame with the month-range in one column and the correlation in another column. The do.call(rbind... business is just to take the list output from lapply and turn it into a data frame.
corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month),
function(x) {
data.frame(month_range=paste0(x," - ", max(airquality$Month)),
correlation = cor(airquality$Temp[airquality$Month >= x & airquality$Temp < 80],
airquality$Wind[airquality$Month >= x & airquality$Temp < 80]))
}))
corrs
month_range correlation
1 5 - 9 -0.3519351
2 6 - 9 -0.2778532
3 7 - 9 -0.3291274
4 8 - 9 -0.3395647
5 9 - 9 -0.3823090
You can subset the data first, and then find the correlation.
a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)
Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (see how the month requirement changes with each iteration)
list <- seq(1, 5)
for (i in 1:5){
a <- subset(airquality, Temp < 80 & Month > i)
list[i] <- cor(a$Temp, a$Wind)
}
Based on the pseudo-code you provided alone, here's something that should work:
for (i in x) {
df <- subset(data, column_c > i)
list[i] = cor(df$column_a, df$column_b)
}
However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.
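One way to sidestep that indexing question is to let sapply build the result vector, so each correlation is stored by position and labelled with its threshold. This is only a sketch using the built-in airquality data, as in the answers above; thresholds and corrs are names chosen for this example, and subsetting on Month stands in for the asker's column_c.

```r
# airquality ships with base R; Temp, Wind and Month contain no missing values
thresholds <- 5:8

# One correlation per threshold, using only rows with Month >= threshold
corrs <- sapply(thresholds, function(i) {
  rows <- airquality$Month >= i
  cor(airquality$Temp[rows], airquality$Wind[rows])
})

# Label each correlation with the threshold that produced it
names(corrs) <- thresholds
corrs
```

The result is a plain named numeric vector, which avoids both growing a list inside a loop and reusing the name list for a variable.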
