Delete specific columns with NA values in R

This is my dataframe:
set.seed(1)
df <- data.frame(A = 1:50, B = 11:60, c = 21:70)
head(df)
df.final <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
I want to delete the columns whose last 5 values contain NA. That is, only the columns that have values in rows 46 to 50 should remain; any column whose last 5 values include one or more NAs will be deleted.
Is it possible to do this with dplyr?
Any help is appreciated.

dplyr::select() accepts integer column positions. We can use that to achieve this -
result <- df.final %>% select(., which(!is.na(colSums(tail(., 5)))))
head(result)
A B
1 1 11
2 2 NA
3 3 13
4 NA 14
5 5 15
6 NA 16

Shree beat me to it, but it might come in handy:
> df.final %>% tail
A B c
45 45 55 65
46 46 NA 66
47 47 57 67
48 NA 58 68
49 NA 59 69
50 NA 60 NA
> df.final %>%
+ select_if(~ !any(is.na(tail(., n = 1)))) %>%
+ tail()
B
45 55
46 NA
47 57
48 58
49 59
50 60
Just change n above to however many trailing rows you want to check for NAs.
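For reference, a base R sketch of the same idea (not from the thread): keep only the columns that have no NA anywhere in their last 5 rows.
# keep columns whose last 5 rows contain no NA
keep <- colSums(is.na(tail(df.final, 5))) == 0
result <- df.final[, keep, drop = FALSE]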

Related

Vectorizing lagged operations

How can I vectorize the following operation in R that involves modifying column Z recursively using lagged values of Z?
library(dplyr)
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df
X Y
1 20 27
2 69 49
3 92 32
4 28 56
5 10 26
6 70 20
7 53 39
8 81 89
9 96 55
10 11 84
df <- df %>% mutate(Z=if_else(row_number()==1, initial_Z-Y, NA_real_))
df
X Y Z
1 20 27 973
2 69 49 NA
3 92 32 NA
4 28 56 NA
5 10 26 NA
6 70 20 NA
7 53 39 NA
8 81 89 NA
9 96 55 NA
10 11 84 NA
for (i in 2:nrow(df)) {
df$Z[i] <- (df$Z[i-1]*df$X[i-1]/df$X[i])-df$Y[i]
}
df
X Y Z
1 20 27 973.000000
2 69 49 233.028986
3 92 32 142.771739
4 28 56 413.107143
5 10 26 1130.700000
6 70 20 141.528571
7 53 39 147.924528
8 81 89 7.790123
9 96 55 -48.427083
10 11 84 -506.636364
So the first value of Z is set first, based on initial_Z and the first value of Y. The remaining values of Z are calculated using the lagged values of X and Z, and the current value of Y.
My actual df is large, and I need to repeat this operation thousands of times in a simulation. Using a for loop takes too much time. I prefer implementing this using dplyr, but other approaches are also welcome.
Many thanks in advance for any help.
I don't know that you can avoid the effect of for loops, but in general R should be pretty good at them. Given that, here is a Reduce variant that might suffice for you:
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df$Z <- with(df, Reduce(function(prevZ, i) {
  if (i == 1) return(prevZ - Y[i])
  prevZ*X[i-1]/X[i] - Y[i]
}, seq_len(nrow(df)), init = initial_Z, accumulate = TRUE))[-1]
df
# X Y Z
# 1 20 27 973.000000
# 2 69 49 233.028986
# 3 92 32 142.771739
# 4 28 56 413.107143
# 5 10 26 1130.700000
# 6 70 20 141.528571
# 7 53 39 147.924528
# 8 81 89 7.790123
# 9 96 55 -48.427083
# 10 11 84 -506.636364
To be clear, Reduce uses a for loop internally to get through the data. I generally don't like using indices as the values for Reduce's x, but since Reduce iterates over only one vector and we need both X and Y, passing the row indices is a required step.
The same can be accomplished with purrr::accumulate2. Note that these are just for loops under the hood; if the loop is genuinely causing a problem in R, you should consider writing it in Rcpp.
library(purrr)
df %>%
  mutate(Z = accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 - ..2, .init = 1000)[-1])
X Y Z
1 20 27 973
2 69 49 233.029
3 92 32 142.7717
4 28 56 413.1071
5 10 26 1130.7
6 70 20 141.5286
7 53 39 147.9245
8 81 89 7.790123
9 96 55 -48.42708
10 11 84 -506.6364
You could unlist(Z):
df %>%
  mutate(Z = unlist(accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 - ..2, .init = 1000))[-1])
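Since Rcpp was suggested above, here is a minimal sketch of what moving the recursion into C++ could look like (calcZ is a hypothetical helper, not from the thread):
library(Rcpp)

cppFunction('
NumericVector calcZ(NumericVector X, NumericVector Y, double initZ) {
  int n = X.size();
  NumericVector Z(n);
  // first element uses the starting value
  Z[0] = initZ - Y[0];
  // remaining elements use the previous Z and the lagged ratio of X
  for (int i = 1; i < n; ++i) {
    Z[i] = Z[i - 1] * X[i - 1] / X[i] - Y[i];
  }
  return Z;
}')

df$Z <- calcZ(df$X, df$Y, initial_Z)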

R | Mutate with condition for multiple columns

I want to calculate the mean of a row if at least three out of the six observations in that row are not NA. If four or more NAs are present, the mean should be NA.
Example which gives me the mean, ignoring the NAs:
require(dplyr)
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
df <- df %>% mutate(mean = rowMeans(df[,1:6], na.rm=TRUE))
I thought about using case_when, but I'm not sure how to use it correctly:
df <- df %>% mutate(mean = case_when( ~ rowMeans(df[,1:6], na.rm=TRUE), TRUE ~ NA))
You can try a base R solution: save the number of non-NA values per row in a new variable and then use ifelse() for the mean:
#Data
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
#Code
#Count number of non-NA values per row
df$count <- rowSums(!is.na(df[, 1:6]))
#Compute mean only where count >= 3
df$Mean <- ifelse(df$count >= 3, rowMeans(df[, 1:6], na.rm = TRUE), NA)
Output:
a b c d e f count Mean
1 1 11 21 31 41 51 6 26.00000
2 NA 12 NA NA 42 NA 2 NA
3 3 13 23 33 43 53 6 28.00000
4 4 14 24 34 44 54 6 29.00000
5 NA 15 25 NA 45 NA 3 28.33333
6 6 16 26 36 46 56 6 31.00000
7 7 17 27 37 47 57 6 32.00000
8 NA NA 28 38 NA NA 2 NA
9 9 19 29 39 49 59 6 34.00000
10 10 20 30 40 50 60 6 35.00000
You could do:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across())) < 4 ~ mean(c_across(), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>%
  ungroup()
Output:
# A tibble: 10 x 7
a b c d e f mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 21 31 41 51 26
2 NA 12 NA NA 42 NA NA
3 3 13 23 33 43 53 28
4 4 14 24 34 44 54 29
5 NA 15 25 NA 45 NA 28.3
6 6 16 26 36 46 56 31
7 7 17 27 37 47 57 32
8 NA NA 28 38 NA NA NA
9 9 19 29 39 49 59 34
10 10 20 30 40 50 60 35
This leverages rowwise and c_across, which basically means operating at the row level, so you can use functions such as sum and mean in their usual way (also together with case_when).
c_across also has a cols argument where you can specify which columns to take into account. For example, to take columns 1 to 6 into account, you can specify this as:
df %>%
  rowwise() %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across(1:6))) < 4 ~ mean(c_across(), na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>%
  ungroup()
Alternatively, if you'd like to take into account all columns except column number 2, you would use c_across(-2). You can also use column names, e.g. c_across(a:f) for the first example (all columns) or c_across(-b) for the second (all columns except b).
This selection is handled inside dplyr, but you could also use ordinary vector subsetting: take the whole c_across() (which defaults to all columns, i.e. everything()) and write e.g. c_across()[1:6] or c_across()[-2], as sketched below.
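A quick sketch of that subsetting variant (equivalent here, not part of the original answer):
df %>%
  rowwise() %>%
  mutate(
    mean = case_when(
      sum(is.na(c_across()[1:6])) < 4 ~ mean(c_across()[1:6], na.rm = TRUE),
      TRUE ~ NA_real_)
  ) %>%
  ungroup()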
We can create an index first and then do the assignment based on the index
i1 <- rowSums(!is.na(df)) >=3
df$Mean[i1] <- rowMeans(df[i1,], na.rm = TRUE)
df
# a b c d e f Mean
#1 1 11 21 31 41 51 26.00000
#2 NA 12 NA NA 42 NA NA
#3 3 13 23 33 43 53 28.00000
#4 4 14 24 34 44 54 29.00000
#5 NA 15 25 NA 45 NA 28.33333
#6 6 16 26 36 46 56 31.00000
#7 7 17 27 37 47 57 32.00000
#8 NA NA 28 38 NA NA NA
#9 9 19 29 39 49 59 34.00000
#10 10 20 30 40 50 60 35.00000
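For larger data, a vectorised sketch without rowwise() (assuming dplyr 1.0+ for across(); it combines the counting and ifelse() ideas above):
library(dplyr)
df %>%
  mutate(mean = ifelse(rowSums(!is.na(across(a:f))) >= 3,
                       rowMeans(across(a:f), na.rm = TRUE),
                       NA_real_))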

Custom function to mutate a new column for row means using starts_with()

I have a data frame for which I want to create columns of row means. Each row-mean column should be computed over a group of related columns in the data. I can tell the groups of columns apart using dplyr's starts_with(). Since I have several groups of columns to calculate row means for, I'd like to build a function to do it. For some reason, I can't get it to work.
Data
df <- data.frame("europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA),
                                                       prob = c(0.70, 0.30),
                                                       size = length(cc),
                                                       replace = TRUE) ]))
df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 51
2 NA 12 22 NA 42 52
3 3 13 23 33 43 NA
4 NA 14 NA NA 44 54
5 NA 15 25 35 45 55
6 6 NA NA 36 46 56
7 7 17 27 NA 47 57
8 NA 18 28 38 48 NA
9 9 19 29 39 49 NA
10 10 NA 30 40 NA 60
I want to create a new column for the row means of each continent, across cities: one column for the Asia cities and one for Europe. Each run of the function will be fed the name of a continent, to guide which columns to pick.
My attempt to build the function
This attempt is based on this answer.
continent_mean <-
  function(continent) {
    df %>%
      select(starts_with(as.character(continent))) %>%
      mutate(., (!!as.name(continent)) == rowMeans(., na.rm = TRUE))
  }
However, running this code results in a weird behavior, as it seemingly returns the same dataset, with just the selected columns according to starts_with(), but it doesn't generate a new column for row means.
continent_mean("asia")
asia_bangkok asia_tokyo asia_kathmandu
1 31 41 51
2 32 42 52
3 33 43 53
4 34 44 54
5 35 45 55
6 36 46 56
7 37 47 57
8 38 48 58
9 39 49 59
10 40 50 60
What am I missing here? I thought this could be due to the == rather than = in mutate(), but a single = throws an error, so it seems not to be the solution either.
Thanks!
We can use quo_name to assign column names
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
  df %>%
    select(starts_with(continent)) %>%
    mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
continent_mean(df, "asia")
# asia_bangkok asia_tokyo asia_kathmandu asia
#1 NA 41 51 46
#2 NA 42 52 47
#3 33 43 NA 38
#4 NA 44 54 49
#5 35 45 55 45
#6 36 46 56 46
#7 NA 47 57 52
#8 38 48 NA 43
#9 39 49 NA 44
#10 40 NA 60 50
Using base R, we can do a similar thing:
continent_mean <- function(df, continent) {
  df1 <- df[startsWith(names(df), continent)]
  df1[continent] <- rowMeans(df1, na.rm = TRUE)
  df1
}
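Usage mirrors the dplyr version above, e.g. continent_mean(df, "asia") or continent_mean(df, "europe").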
If we want rowMeans of all the continents together we can use split.default
sapply(split.default(df, sub("_.*", "", names(df))), rowMeans, na.rm = TRUE)
# asia europe
# [1,] 46 1
# [2,] 47 17
# [3,] 38 13
# [4,] 49 14
# [5,] 45 20
# [6,] 46 6
# [7,] 52 17
# [8,] 43 23
# [9,] 44 19
#[10,] 50 20
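For completeness, a dplyr-only sketch (assuming dplyr 1.0+ for across()) that adds one row-mean column per continent in a single mutate():
library(dplyr)
df %>%
  mutate(europe = rowMeans(across(starts_with("europe")), na.rm = TRUE),
         asia   = rowMeans(across(starts_with("asia")), na.rm = TRUE))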

Splitting columns of a dataframe to merge a repetitive variable

I normally find an answer in previous questions posted here, but I can't seem to find this one, so here is my maiden question:
I have a dataframe with one column containing repeated values. I would like to split the other columns so that each value appears only once in the first column, giving more columns than in the original dataframe.
Example:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
The original dataframe has 3 columns and 15 rows.
It would turn into a dataframe with 5 rows, and the columns would be split into 7 columns: 'test', 'time1', 'time2', 'time3', 'score1', 'score2', 'score3'.
Does anyone have an idea how this could be done?
I think dcast with rowid from the data.table package is well suited for this task:
library(data.table)
dcast(setDT(df), test ~ rowid(test), value.var = c('time','score'), sep = '')
The result:
test time1 time2 time3 score1 score2 score3
1: 1 52 3 29 21 131 45
2: 2 79 44 6 119 1 186
3: 3 67 95 39 18 459 121
4: 4 83 50 40 493 466 497
5: 5 46 14 4 465 9 24
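For comparison, a tidyr-based sketch of the same reshape (assuming tidyr 1.0+ for pivot_wider()):
library(dplyr)
library(tidyr)
df %>%
  group_by(test) %>%
  mutate(rep = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = rep, values_from = c(time, score), names_sep = "")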
Please try this:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
df$class <- c(rep('a', 5), rep('b', 5), rep('c', 5))
df <- split(x = df, f = df$class)
binded <- cbind(df[[1]], df[[2]], df[[3]])
binded <- binded[,-c(5,9)]
> binded
test time score class time.1 score.1 class.1 time.2 score.2 class.2
1 1 40 404 a 57 409 b 70 32 c
2 2 5 119 a 32 336 b 93 177 c
3 3 20 345 a 44 91 b 100 42 c
4 4 47 468 a 60 265 b 24 478 c
5 5 16 52 a 38 219 b 3 92 c
Let me know if it works for you!

Filter rows based on a threshold by grouping ID column

I have a data frame that I created with:
ID <- c("A","A","A","B","B","B","C","C","C")
Type <- c(45,46,47,45,46,47,45,46,47)
Point_A <- c(10,15,20,8,9,10,35,33,39)
df <- data.frame(ID,Type,Point_A)
ID Type Point_A
1 A 45 10
2 A 46 15
3 A 47 20
4 B 45 8
5 B 46 9
6 B 47 10
7 C 45 35
8 C 46 33
9 C 47 39
I want to calculate the median of the Point_A column grouped by ID and then remove rows based on a threshold.
For example, let's say my threshold is 11. I want to remove the rows of any ID group whose median is less than the threshold, so the desired output would be:
ID Type Point_A
1 A 45 10
2 A 46 15
3 A 47 20
4 C 45 35
5 C 46 33
6 C 47 39
While I am able to calculate the medians, I do not know how to remove the rows.
func <- function(x) (median(x,na.rm=TRUE))
df1 <- df %>%
group_by(ID) %>%
mutate_each(funs(.=func(.)),Point_A)
summarise(Point_A = f(Point_A))
Kindly let me know how to go about doing this.
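A minimal sketch of one way to do this with dplyr (using the example threshold of 11): compute the per-ID median inside filter() and keep only the groups that clear it.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(median(Point_A, na.rm = TRUE) >= 11) %>%
  ungroup()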
