Set consecutive non-NA values to NA - r

Set every non-NA value that has a non-NA value immediately to its left to NA.
Data
a <- c(3,2,3,NA,NA,1,NA,NA,2,1,4,NA)
[1] 3 2 3 NA NA 1 NA NA 2 1 4 NA
Desired Output
[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
My working but ugly solution:
IND <- !is.na(a) & duplicated(data.table::rleidv(!is.na(a)))
a[IND]<- NA
a
There's gotta be a better solution ...

Alternatively,
a[-1][diff(!is.na(a)) == 0] <- NA; a
# [1] 3 NA NA NA NA 1 NA NA 2 NA NA NA

OK for brevity...
a[!is.na(dplyr::lag(a))]<-NA
a
[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA

You can use a simple ifelse() in which you add your vector to a lagged copy of itself. The sum is NA whenever the value or its left neighbour is NA, and in that case the value should remain the same. Otherwise both are non-NA and the value becomes NA, i.e.
ifelse(is.na(a + dplyr::lag(a)), a, NA)
#[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
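The lag-based answers above can also be sketched in base R, without dplyr. This is only a minimal illustration; `prev` and `res` are names made up for this example.

```r
a <- c(3, 2, 3, NA, NA, 1, NA, NA, 2, 1, 4, NA)

# prev[i] holds a[i - 1]; the first element has no left neighbour, so pad with NA
prev <- c(NA, head(a, -1))

# Blank out every position whose left neighbour is non-NA
res <- a
res[!is.na(prev)] <- NA
res
# [1]  3 NA NA NA NA  1 NA NA  2 NA NA NA
```

Positions that are already NA may be overwritten with NA again, which is harmless, so there is no need to test `!is.na(a)` separately.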

Related

Applying a for-loop to different levels of a variable

I have created a data frame containing 3 sites, and I have written a nested for loop to create my desired matrices. The overall objective is to find a more efficient way to do this for each of the 3 sites instead of just the one.
The outputs from the nested for loop (EDmatrix and timeLags) show the kind of result I expect for the other two sites. I would like to find a more efficient way of obtaining these matrices, and to be able to do it for all sites instead of just the one in this example.
set.seed(123)
d1 = sample.int(50, 27)
d2 = sample.int(50, 27)
d3 = sample.int(50, 27)
year <- c(1990:1998)
site <- c(rep("a", 9), rep("b", 9), rep("c", 9))
ED = function(x,y){
#x and y are vectors of spp abundances
#they must be the same length!
if(length(x)!=length(y)) stop("Bad abundances!")
out = sqrt(sum((x-y)^2))
out
}
df <- data.frame(site, year, d1 = d1, d2 = d2, d3 = d3)
Here is the code to get the expected output for only a single site, but I would like to be able to do this for all of the sites in the data frame df.
subdf = subset(df,site=="a") # subset data for one site
EDmatrix = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the dissimilarity values
timeLags = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the time lags
# First loop through all "j" years from 1 to the total number of years
for(j in 1:length(subdf$year)){
  # Now loop through all "k" years from 1 to the total number of years
  for(k in 1:length(subdf$year)){
    # grab density data for year "j"
    jdensity <- subdf[j,-c(1:2)]
    # grab density data for year "k"
    kdensity <- subdf[k,-c(1:2)]
    # calculate and store (in the EDmatrix) the ED value based on the data for year j and k
    EDmatrix[j,k] <- ED(jdensity, kdensity)
    # calculate and store (in timeLags) the time lag (the absolute value of the difference
    # in time between year j and k)
    timeLags[j,k] <- abs(subdf[j, 2] - subdf[k, 2])
  } # exit k loop
} # exit j loop
EDmatrix[lower.tri(EDmatrix, diag=T)]=NA # set duplicate entries to NA
timeLags[lower.tri(timeLags, diag=T)]=NA # set duplicate entries to NA
y = as.vector(EDmatrix) # turn the matrix into a vector
x = as.vector(timeLags)
We may use outer for this operation
library(dplyr)
library(tidyr)
library(purrr)
f1 <- function(dat, i, j) {
subdat <- dat %>%
select(starts_with('d'))
jdensity <- subdat[i, ]
kdensity <- subdat[j,]
EDtmp <- ED(jdensity, kdensity)
timetmp <- abs(dat$year[i] - dat$year[j])
tibble(EDtmp, timetmp)
}
f2 <- function(dat, s1, s2) {
mat <- outer(s1, s2, Vectorize(\(i, j) list(f1(dat, i, j))))
EDmatrix <- matrix(map_dbl(mat, ~ .x$EDtmp), length(s1), length(s1))
timeLags <- matrix(map_dbl(mat, ~ .x$timetmp), length(s1), length(s1))
EDmatrix[lower.tri(EDmatrix, diag=TRUE)]=NA
timeLags[lower.tri(timeLags, diag=TRUE)]=NA
y = as.vector(EDmatrix)
x = as.vector(timeLags)
tibble(y, x)
}
out1 <- df %>%
group_by(site) %>%
summarise(out = f2(cur_data(), row_number(), row_number()),
.groups = 'drop') %>%
unnest(out)
Checking against the OP's output:
> out1$x[out1$site == "a"]
[1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> x
[1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> out1$y[out1$site == "a"]
[1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
> y
[1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
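As a possible alternative, the pairwise Euclidean distances and year differences can be sketched with base R's dist(), which computes Euclidean distance by default. The helper name `lag_table` is made up for this example, and unlike the output above this version keeps only the upper-triangle pairs instead of padding the rest with NA.

```r
set.seed(123)
df <- data.frame(
  site = rep(c("a", "b", "c"), each = 9),
  year = rep(1990:1998, times = 3),
  d1 = sample.int(50, 27),
  d2 = sample.int(50, 27),
  d3 = sample.int(50, 27)
)

# Hypothetical helper: distances and time lags for the rows of one site
lag_table <- function(subdf) {
  ed <- as.matrix(dist(subdf[, c("d1", "d2", "d3")]))  # Euclidean distance between years
  tl <- as.matrix(dist(subdf$year))                    # absolute difference in years
  keep <- upper.tri(ed)                                # each pair once, diagonal dropped
  data.frame(site = subdf$site[1], x = tl[keep], y = ed[keep])
}

# Apply the helper to every site and stack the results
res <- do.call(rbind, lapply(split(df, df$site), lag_table))
head(res)
```

With 9 years per site this gives 36 pairs per site, so `res` has 108 rows. If the NA padding of the original matrices matters, the full matrices `ed` and `tl` can be returned instead.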

R: Combining columns under precondition

I want to have a column's values equal another column's values if the first column's value is NA in this row. So I want to change something like this
A B
3 NA
NA NA
NA NA
5 NA
NA NA
NA NA
7 5
to something like this
A B
3 3
NA NA
NA NA
5 5
NA NA
NA NA
7 5
I am fairly new to R and any other kind of programming.
As per OP's description:
equal another column's values if the first column's value is NA in
this row
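Could you please try the following and let me know if this helps you. Note that both sides are subset with the same index, so each NA in B is filled from A in the same row.
df21223$B[is.na(df21223$B)] <- df21223$A[is.na(df21223$B)]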
Output will be as follows for data frame's B part:
> df21223$B
[1] 3 NA NA 5 NA NA 7
Where Sample data is:
> df21223$A
[1] 3 NA NA 5 NA NA 7
> df21223$B
[1] NA NA NA NA NA NA NA
Try:
df$B[is.na(df$B)] <- df$A[is.na(df$B)]
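One thing to watch with this kind of assignment is row alignment: the right-hand side should be subset with the same logical index as the left-hand side, otherwise R fills the NA slots with the first values of A rather than the values from the matching rows. A minimal sketch using the data from the question (`miss` is a name made up for this example):

```r
df <- data.frame(A = c(3, NA, NA, 5, NA, NA, 7),
                 B = c(NA, NA, NA, NA, NA, NA, 5))

miss <- is.na(df$B)          # rows where B is missing
df$B[miss] <- df$A[miss]     # take A's value from the same row
df$B
# [1]  3 NA NA  5 NA NA  5
```

The last row keeps its original 5 because B was not NA there.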

Why can't I change the contents of a column in R?

> data$Accepted.Final.round
[1] NA NA NA NA NA NA NA NA 1 NA NA NA NA 1 1 1 1 0 1 0 0 1 1 1 1 1 NA 1 1 1 1
[32] NA 1 1 0 1 1 1 1 1 1 NA 1 1 0 1 1 0 1 1 1 1 1 1 1 1 NA 1 NA NA NA NA
[63] NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[94] NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
I have a dataset column that consists of NA, 1 and 0. However, when I try
data$Accepted.Final.round[data$Accepted.Final.round==NA]<-0
or
ifelse(data$Accepted.Final.round==1,1,0)
to replace NA with 0, neither line works.
Can you think of any way to fix this?
Use is.na() to determine if a value is NA. NA is contagious, meaning that doing operations with NA usually returns NA. That includes checking for equality with ==, i.e. x == NA will always return NA and not TRUE or FALSE.
x <- c(2, NA, 2)
x[is.na(x)] <- 0
The second attempt from OP was pretty close:
data$Accepted.Final.round <- ifelse(is.na(data$Accepted.Final.round),
0 ,data$Accepted.Final.round)
The documentation for ifelse says:
Usage:
ifelse(test, yes, no)
yes will be evaluated if and only if any element of test is true, and
analogously for no.
Missing values (i.e. NA) in test give missing (NA) values in the result.
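To see why both of the OP's attempts leave the NAs in place, here is a small sketch on a made-up vector `x` (not the OP's data):

```r
x <- c(NA, 1, 0, NA, 1)

# Comparing with == NA never yields TRUE or FALSE, only NA
cmp <- x == NA
cmp
# [1] NA NA NA NA NA

# An NA in the test propagates to the result, so the NAs survive
second_try <- ifelse(x == 1, 1, 0)
second_try
# [1] NA  1  0 NA  1

# Testing with is.na() gives a proper TRUE/FALSE vector
fixed <- ifelse(is.na(x), 0, x)
fixed
# [1] 0 1 0 0 1
```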

Subsets defined by k-way combinations of factors in R

I would like to apply a function (will be a custom function, but for simplicity I will say it is mean) to subgroups defined by combinations of factors. I have 20 factors, but I would like to consider, say, subgroups defined by all combinations of 1,2,3,...,k of the factors.
Here is an example for k=3
N = 100
test_data <- data.frame( factorA = factor(sample(1:4, replace = TRUE, size = N)), factorB = factor(sample(1:2, replace = TRUE, size = N)), factorC = factor(sample(1:2, replace = TRUE, size = N)), var = rnorm(n = N))
#1-way subsets
mean(test_data$var[test_data$factorA == "1"])
mean(test_data$var[test_data$factorA == "2"])
mean(test_data$var[test_data$factorA == "3"])
mean(test_data$var[test_data$factorA == "4"])
mean(test_data$var[test_data$factorB == "1"])
#and so forth...
#2-way subsets
mean(test_data$var[test_data$factorA == "1" & test_data$factorB == "1" ])
mean(test_data$var[test_data$factorA == "1" & test_data$factorB == "2" ])
mean(test_data$var[test_data$factorA == "1" & test_data$factorC == "1" ])
#and so forth...
#3-way subsets
mean(test_data$var[test_data$factorA == "1" & test_data$factorB == "1" & test_data$factorC == "1" ])
mean(test_data$var[test_data$factorA == "1" & test_data$factorB == "1" & test_data$factorC == "2" ])
#and so forth...
For each combination of k factors, compute the mean of var for all combinations of levels of those k factors. Ideally the output is labeled with the combination of factors/levels that defines each subset.
It seems that expand.grid and/or combn should be useful, but not sure how to use them in this situation.
To calculate the mean of var for all combinations of all three factors you can use the data.table by argument:
library(data.table)
N = 100
test_data <- data.frame(factorA = factor(sample(1:4, replace = TRUE, size = N)),
factorB = factor(sample(1:2, replace = TRUE, size = N)),
factorC = factor(sample(1:2, replace = TRUE, size = N)), var = rnorm(n = N))
setDT(test_data)
test_data[, .(mean_var = mean(var, na.rm = TRUE)),
by = .(factorA, factorB, factorC)]
Which gives this output:
factorA factorB factorC mean_var
1: 1 1 1 -0.304218613
2: 1 1 2 -0.122405096
3: 1 2 1 0.532219871
4: 1 2 2 -0.679400706
5: 2 1 1 0.006901209
6: 2 1 2 0.605850466
7: 2 2 1 -0.083305497
8: 2 2 2 -0.408660971
9: 3 1 1 -0.362234218
10: 3 1 2 -0.368472511
11: 3 2 1 0.243274183
12: 3 2 2 0.119927615
13: 4 1 1 -0.517337915
14: 4 1 2 -0.790908511
15: 4 2 1 -0.077665828
16: 4 2 2 -0.295695277
Updated with example data containing 20 factor columns (with two to four levels each). All possible ordered selections of three distinct factors (i.e. columns) are generated (20*19*18 = 6840) and for each selection the mean_var for each unique combination of factor levels is calculated:
library(data.table)
# Generate example data
N = 100
dt <- dcast(rbindlist(lapply(seq(1:20), function(x) {
dt_tmp <- data.table(id = 1:N, factor = paste0("factor", LETTERS[x]),
value = sample(1:sample(2:4, 1), replace = TRUE, size = N))
})), id~factor)[, ":="(var = rnorm(n = N), id = NULL)]
# Generate all combinations of three out of the 20 factors (20*19*18 = 6840)
factors <- colnames(dt[, 1:20])
tests <- CJ(k1 = factors, k2 = factors, k3 = factors)[k1 != k2 & k1 != k3 & k2 != k3]
# Iterate over every row of tests and calculate mean_var for each unique
# combination of the three factors (this takes time - output ~ 170000 rows)
dt_out <- rbindlist(lapply(seq(1:nrow(tests)), function(x) {
dt[, .(mean_var = mean(var, na.rm = TRUE)),
by = c(tests[x, k1], tests[x, k2], tests[x, k3])]
}), use.names = TRUE, fill = TRUE)
The output looks like this:
> head(dt_out, 30)
factorA factorB factorC mean_var factorD factorE factorF factorG factorH factorI factorJ factorK factorL factorM factorN factorO factorP factorQ factorR factorS factorT
1: 1 2 3 -0.595391823 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 2 1 1 -0.049915238 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3: 2 2 4 0.087206182 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4: 2 1 2 0.010622079 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5: 1 2 1 0.277414685 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6: 1 1 3 0.366482963 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
7: 2 2 3 0.017438655 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
8: 2 2 1 -1.116071505 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
9: 2 1 4 1.371340706 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
10: 2 2 2 0.045354904 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
11: 1 2 2 0.644926008 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
12: 1 2 4 -0.121767568 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
13: 1 1 2 0.261070274 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
14: 2 1 3 -0.506061865 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
15: 1 1 4 -0.075228598 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
16: 1 1 1 0.333514316 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
17: 1 2 NA -0.185980008 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
18: 2 1 NA -0.113793548 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
19: 2 2 NA 0.015100176 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
20: 1 2 NA 0.484182038 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
21: 1 1 NA -0.123811140 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
22: 1 1 NA 0.543852715 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
23: 2 2 NA -0.267626769 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
24: 2 1 NA 0.133316773 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
25: 1 2 NA 0.538964320 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
26: 2 1 NA 0.006298113 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
27: 2 2 NA 0.010152043 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
28: 2 1 NA 0.011377912 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
29: 1 1 NA 0.504610954 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
30: 2 2 NA -0.311834384 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
factorA factorB factorC mean_var factorD factorE factorF factorG factorH factorI factorJ factorK factorL factorM factorN factorO factorP factorQ factorR factorS factorT
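Since the question mentions combn, here is a minimal base R sketch that enumerates each unordered subset of k factors once and computes the group means with aggregate(). The helper name `k_way_means` and the seed are assumptions made for this example; with 20 factors the same call would produce choose(20, k) tables for each k.

```r
set.seed(1)
N <- 100
test_data <- data.frame(
  factorA = factor(sample(1:4, N, replace = TRUE)),
  factorB = factor(sample(1:2, N, replace = TRUE)),
  factorC = factor(sample(1:2, N, replace = TRUE)),
  var = rnorm(N)
)
factor_names <- c("factorA", "factorB", "factorC")

# Hypothetical helper: one table of means per size-k subset of the factors
k_way_means <- function(data, factors, k) {
  subsets <- combn(factors, k, simplify = FALSE)   # each subset appears once
  lapply(subsets, function(f) {
    # group by the chosen factor columns and average `var`
    aggregate(data["var"], by = data[f], FUN = mean)
  })
}

# 1-way, 2-way and 3-way subsets
res <- lapply(1:3, function(k) k_way_means(test_data, factor_names, k))
res[[2]][[1]]   # means of var by factorA x factorB
```

Each table carries the factor columns that define its subsets, so the output is labeled by construction. Swapping `mean` for a custom function only requires changing the `FUN` argument.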

pulling data from data frame in R

I'm trying to create a vector using data from my data frame which contains all of the numeric values in the data frame.
Basically, I want a vector that has (2,2,5,2,2,3,2,3,2,2,2,2,2).
two three four five six seven
2 NA NA NA NA NA
2 NA NA NA NA NA
NA NA NA 5 NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
NA 3 NA NA NA NA
2 NA NA NA NA NA
NA 3 NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
Just subset the dataframe for non-NA values with !is.na(df):
df <- data.frame(two = c(2, 2, NA),
three = c(NA, NA, NA),
four = c(NA, 3, NA))
df
# two three four
# 1 2 NA NA
# 2 2 NA 3
# 3 NA NA NA
is.na(df)
# two three four
# [1,] FALSE TRUE TRUE
# [2,] FALSE TRUE FALSE
# [3,] TRUE TRUE TRUE
df[!is.na(df)]
# [1] 2 2 3
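Note that indexing with a logical matrix walks the data column by column, so the values come out grouped by column. If the row order from the question matters, a sketch of one option is to transpose first (the small `df` below is made up for this example):

```r
df <- data.frame(two   = c(2, NA, 2),
                 five  = c(NA, 5, NA),
                 three = c(NA, NA, NA))

# Column-by-column order: both values of `two` come first
col_major <- df[!is.na(df)]
col_major
# [1] 2 2 5

# Transposing makes each original row a column, giving row-by-row order
m <- t(as.matrix(df))
row_major <- m[!is.na(m)]
row_major
# [1] 2 5 2
```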
