I have a data frame from which I want to drop every column whose NA rate is above 70%, or in which a single dominant value accounts for over 99% of the rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:
isNARateLt70 <- function(column) { # some code
}
apply(dataframe, 2, isNARateLt70)
How can I then use the resulting vector to subset the data frame?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans() (thanks to @MrFlick for the advice to switch from colSums()/nrow(); it is shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
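The question also asks about dropping columns where one value dominates more than 99% of rows, which the code above does not cover. A minimal base R sketch of that second filter follows; the helper name dominant_share and the sample data are assumptions made for illustration only.

```r
# Hypothetical helper: share of all rows taken by the most frequent non-NA value
dominant_share <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  if (length(counts) == 0) return(0) # column is entirely NA
  max(counts) / length(x)
}

# Illustrative data: x is all NA, y is 99.5% a single value, z is varied
d <- data.frame(x = rep(NA, 200), y = c(rep(1, 199), 2), z = 1:200)

# Keep a column only if it passes both the NA-rate and the dominance check
keep <- colMeans(is.na(d)) <= 0.7 & sapply(d, dominant_share) <= 0.99
d2 <- d[keep]
names(d2)
```

Combining the two logical vectors with & keeps the subsetting step identical to the single-condition case.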
Maybe this will help too. The second argument of apply(), 2, means the function is applied column-wise, here on the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
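One caveat with apply() is that it first converts the data frame to a matrix, so a single character column turns every column into character. A hedged alternative is to build the logical vector with vapply(), which works column by column; the data frame and the threshold below are made up for illustration.

```r
# Illustrative mixed-type data frame (not from the original post)
df <- data.frame(id = c("a", "b", "c"),
                 small = c(1, 2, 3),
                 big = c(20, 30, 40))

# vapply() visits each column as-is and must return one logical per column
keep <- vapply(df, function(col) is.numeric(col) && mean(col) > 10, logical(1))

selected <- df[keep]
names(selected)
```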
In R, I have a data frame with columns 'A', 'B', 'C', 'D'. Each column has 100 rows.
I need to iterate through the columns and, for every row, sum that row and the previous row of the same column, then store the result in new columns ('AA', 'AB', etc.):
A B C D
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
to
A B C D AA AB AC AD
1 2 3 4 NA NA NA NA
2 3 4 5 3 5 7 9
3 4 5 6 5 7 9 11
4 5 6 7 7 9 11 13
5 6 7 8 9 11 13 15
6 7 8 9 11 13 15 17
Can someone explain how to create a function/loop that allows me to set the columns I want to iterate over (selected columns, not all columns) and the columns I want to set?
A base one-liner:
cbind(df, setNames(df + df[c(NA, 1:(nrow(df)-1)), ], paste0("A", names(df))))
If your data is large, this one might be the fastest because it manipulates the entire data.frame.
A dplyr solution using mutate() with across().
library(dplyr)
df %>%
mutate(across(A:D,
~ .x + lag(.x),
.names = "A{col}"))
# A B C D AA AB AC AD
# 1 1 2 3 4 NA NA NA NA
# 2 2 3 4 5 3 5 7 9
# 3 3 4 5 6 5 7 9 11
# 4 4 5 6 7 7 9 11 13
# 5 5 6 7 8 9 11 13 15
# 6 6 7 8 9 11 13 15 17
If you want to sum the previous 3 rows, the second argument of across(), i.e. .fns, should be
~ .x + lag(.x) + lag(.x, 2)
which is equivalent to the use of rollsum() in zoo:
~ zoo::rollsum(.x, k = 3, fill = NA, align = 'right')
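For an arbitrary window width without extra packages, a base R sketch could rely on stats::filter() with sides = 1, which only looks backwards. The helper name roll_sum_right and the choice of columns are assumptions for illustration, not part of the answer above.

```r
# Hypothetical helper: right-aligned rolling sum of width k, NA-padded at the start
roll_sum_right <- function(x, k) {
  as.numeric(stats::filter(x, rep(1, k), sides = 1))
}

df <- data.frame(A = 1:6, B = 2:7, C = 3:8, D = 4:9)

# Apply only to selected columns and store results under an "A" prefix
cols <- c("A", "C")
df[paste0("A", cols)] <- lapply(df[cols], roll_sum_right, k = 3)
df
```

With k = 3 the first two rows of each new column are NA, matching the lag-based formula.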
Benchmark
A benchmark with the microbenchmark package on a new data.frame with 10000 rows and 100 columns, evaluating each expression 10 times.
# Unit: milliseconds
# expr min lq mean median uq max neval
# darren_base 18.58418 20.88498 35.51341 33.64953 39.31909 80.24725 10
# darren_dplyr_lag 39.49278 40.27038 47.26449 42.89170 43.20267 76.72435 10
# arg0naut91_dplyr_rollsum 436.22503 482.03199 524.54800 516.81706 534.94317 677.64242 10
# Grothendieck_rollsumr 3423.92097 3611.01573 3650.16656 3622.50895 3689.26404 4060.98054 10
You can use dplyr's across (and set optional names) with rolling sum (as implemented e.g. in zoo):
library(dplyr)
library(zoo)
df %>%
mutate(
across(
A:D,
~ rollsum(., k = 2, fill = NA, align = 'right'),
.names = 'A{col}'
)
)
Output:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
With A:D we've specified the range of column names we want to apply the function to. The assumption above in .names argument is that you want to paste together A as prefix and the column name ({col}).
Here's a data.table solution. As you asked, it lets you select which columns to apply it to rather than applying it to all columns.
library(data.table)
x <- data.table(A=1:6, B=2:7, C=3:8, D=4:9)
selected_cols <- c('A','B','D')
new_cols <- paste0("A",selected_cols)
x[, (new_cols) := lapply(.SD, function(col) col+shift(col, 1)), .SDcols = selected_cols]
x[]
NB This is 2 or 3 times faster than the fastest other answer.
Here is a naive approach with nested for loops. Beware: it is very slow if you iterate over hundreds of thousands of rows.
i <- 1
n <- 5
df <- data.frame(A=i:(i+n), B=(i+1):(i+n+1), C=(i+2):(i+n+2), D=(i+3):(i+n+3))
for (col in colnames(df)) {
for (ind in 1:nrow(df)) {
if (ind-1==0) {next}
s <- sum(df[c(ind-1, ind), col])
df[ind, paste0('S', col)] <- s
}
}
Here is a cumsum method:
na.df <- data.frame(matrix(NA, 2, ncol(df)))
colnames(na.df) <- colnames(df)
cs1 <- cumsum(df)
cs2 <- rbind(cs1[-1:-2,], na.df)
sum.diff <- cs2-cs1
cbind(df, rbind(na.df[1,], cs1[2,], sum.diff[1:(nrow(sum.diff)-2),]))
Benchmark:
# Unit: milliseconds
# expr min lq mean median uq max neval
# darrentsai.rbind 11.5623 12.28025 23.38038 16.78240 20.83420 91.9135 100
# darrentsai.rbind.rev1 8.8267 9.10945 15.63652 9.54215 14.25090 62.6949 100
# pseudopsin.dt 7.2696 7.52080 20.26473 12.61465 17.61465 69.0110 100
# ivan866.cumsum 25.3706 30.98860 43.11623 33.78775 37.36950 91.6032 100
I believe the cumsum method spends most of its time on data.frame allocations. If properly adapted to a data.table backend, it could be the fastest.
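To show the cumsum idea without the intermediate data frames, here is a minimal sketch that works on a plain vector. The function name cumsum_roll is hypothetical, and no claim is made about its speed relative to the benchmarked answers.

```r
# Hypothetical helper: rolling sum of width k from differences of cumulative sums
cumsum_roll <- function(x, k) {
  n <- length(x)
  if (n < k) return(rep(NA_real_, n))
  cs <- cumsum(as.numeric(x))
  # cs[i] - cs[i - k] is the sum of the last k elements ending at position i
  out <- cs - c(rep(0, k), head(cs, -k))
  out[seq_len(k - 1)] <- NA   # incomplete windows at the start
  out
}

cumsum_roll(1:6, 2)
```

It can be applied to selected columns with lapply(), in the same way as the other answers.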
Specify the columns we want. We show several different ways to do that. Then use rollsumr to get the desired columns, set the column names and cbind DF with it.
library(zoo)
# jx <- names(DF) # if all columns wanted
# jx <- sapply(DF, is.numeric) # if all numeric columns
# jx <- c("A", "B", "C", "D") # specify columns by name
jx <- 1:4 # specify columns by position
r <- rollsumr(DF[jx], 2, fill = NA)
colnames(r) <- paste0("A", colnames(r))
cbind(DF, r)
giving:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
Note
The input in reproducible form:
DF <- structure(list(A = 1:6, B = 2:7, C = 3:8, D = 4:9),
class = "data.frame", row.names = c(NA, -6L))
I have a vector of values (say 1:10), and want to repeat certain values in it 2 or more times, determined by another vector (say c(3,4,6,8)). In this example, the result would be c(1,2,3,3,4,4,5,6,6,7,8,8,9,10) when repeating 2 times.
This should work for an arbitrary length range vector (like 200:600), with a second vector which is contained by the first. Is there a handy way to achieve this?
Akrun's is a more compact method, but this also will work
# get rep vector
reps <- rep(1L, 10L)
reps[c(3,4,6,8)] <- 2L
rep(1:10, reps)
[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
The insight here is that rep will take an integer vector in the second argument the same length as the first argument that indicates the number of repetitions for each element of the first argument.
Note that this solution relies on the assumption that c(3,4,6,8) gives the indices (positions) of the elements to be repeated. In that scenario, d-b's comment offers a one-liner:
rep(x, (seq_along(x) %in% c(3,4,6,8)) + 1)
If instead c(3,4,6,8) gives the values that are to be repeated, then docendo-discimus's super-compact code applies:
rep(x, (x %in% c(3,4,6,8)) * (n-1) +1)
where n may be adjusted to change the number of repetitions. If you need to call this a couple times, this could be rolled up into a function like
myReps <- function(x, y, n) rep(x, (x %in% y) * (n-1) +1)
and called as
myReps(1:10, c(3,4,6,8), 2)
in the current scenario.
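As a quick check on the longer range mentioned in the question, the same function can be tried on 200:600. The values 250 and 300 and n = 3 are arbitrary choices for illustration.

```r
# myReps as defined above: repeat the values of x that appear in y, n times each
myReps <- function(x, y, n) rep(x, (x %in% y) * (n - 1) + 1)

out <- myReps(200:600, c(250, 300), 3)

length(out)       # 401 original elements plus 2 extra copies for each of 2 values
out[out == 250]   # the three copies of 250 sit next to each other
```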
We can try
v1 <- 1:10          # the vector from the question
v2 <- c(3,4,6,8)    # values to repeat
i1 <- v1 %in% v2
sort(c(v1[!i1], rep(v1[i1], each = 2)))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
Update
For the arbitrary vector,
f1 <- function(vec1, vec2, n){
i1 <- vec1 %in% vec2
vec3 <- seq_along(vec1)
c(vec1[!i1], rep(vec1[i1], each = n))[order(c(vec3[!i1],
rep(vec3[i1], each=n)))]
}
set.seed(24)
v1N <- sample(10)
v2 <- c(3,4,6,8)
v1N
#[1] 3 10 6 4 7 5 2 9 8 1
f1(v1N, v2, 2)
#[1] 3 3 10 6 6 4 4 7 5 2 9 8 8 1
f1(v1N, v2, 3)
#[1] 3 3 3 10 6 6 6 4 4 4 7 5 2 9 8 8 8 1
Here's another approach using sapply
#DATA
x = 1:10
r = c(3,4,6,8)
n = 2 #Two repetitions of selected values
#Assuming 'r' is the index of values in x to be repeated
unlist(sapply(seq_along(x), function(i) if(i %in% r){rep(x[i], n)}else{rep(x[i],1)}))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
#Assuming 'r' is the values in 'x' to be repeated
unlist(sapply(x, function(i) if(i %in% r){rep(i, n)}else{rep(i, 1)}))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
I haven't tested these thoroughly, but they could be possible alternatives. Note that because of sort(), the order of the output will differ from the original order whenever x is not already sorted.
sort(c(x, rep(x[x %in% r], n-1))) #assuming 'r' is values
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
sort(c(x, rep(x[r], n-1))) #assuming 'r' is index
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
I suggest this solution just to highlight a neat use of the append function in base R:
ff <- function(vec, v, n) {
for(i in seq_along(v)) vec <- append(vec, rep(v[i], n-1), after = which(vec==v[i]))
vec
}
Examples:
set.seed(1)
ff(vec = sample(10), v = c(3,4,6,8), n = 2)
#[1] 3 3 4 4 5 7 2 8 8 9 6 6 10 1
ff(vec = sample(10), v = c(2,5,9), n = 4)
#[1] 3 2 2 2 2 6 10 5 5 5 5 7 8 4 1 9 9 9 9
I have an x vector with categorical variables and a y vector of numerical variables, both of the same length.
I need to create a data-frame in which all numerical observations in y are separated into groups by a categorical label in x so the end result would look something like:
x obs1 obs2 obs3
a 1 3 5
b 6 7 8
c 3 4 6
Now both aggregate and tapply require a FUN specification but I don't want to do operations on the variables.
x= {random sampling from letters of the alphabet}
y= {random numbers}
Remember, everything is a function in R. So things like c() are just function calls.
x <- rep(letters[1:3], each=3)
y <- c(1, 3, 5, 6, 7, 8, 3, 4, 6)
foo <- tapply(y, x, c)
# > foo
# $a
# [1] 1 3 5
# $b
# [1] 6 7 8
# $c
# [1] 3 4 6
Then you can use this silly pattern to get the table you're looking for (the result is a matrix; wrap it in as.data.frame() if you need a data.frame):
do.call(rbind, foo)
# [,1] [,2] [,3]
# a 1 3 5
# b 6 7 8
# c 3 4 6
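To get closer to the layout in the question (an x column followed by obs1, obs2, obs3), one possible follow-up is sketched below. It assumes every group has the same number of observations; the obs column names come from the question, while the variable names m and res are illustrative.

```r
x <- rep(letters[1:3], each = 3)
y <- c(1, 3, 5, 6, 7, 8, 3, 4, 6)

# One row per group; this only works when all groups have equal size
m <- do.call(rbind, split(y, x))
colnames(m) <- paste0("obs", seq_len(ncol(m)))

# Move the group labels from the row names into a regular column
res <- data.frame(x = rownames(m), m, row.names = NULL)
res
```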
I am not clear about something from your example: is it possible for there to be different numbers of y-values for each category in x? For example, would you consider basic data like this:
> x <- c(rep(c("a", "b", "c"), 3), "c", "c")
> y <- sample(1:20, 11)
> df <- data.frame(x, y)
> df
x y
1 a 16
2 b 4
3 c 9
4 a 2
5 b 12
6 c 17
7 a 7
8 b 10
9 c 11
10 c 1
11 c 8
Here there are more values for category c. This is not entirely what you are looking for, but it might be a start:
> library(reshape2)
> dcast(df, x ~ y)
Using y as value column: use value.var to override.
x 1 2 4 7 8 9 10 11 12 16 17
1 a NA 2 NA 7 NA NA NA NA NA 16 NA
2 b NA NA 4 NA NA NA 10 NA 12 NA NA
3 c 1 NA NA NA 8 9 NA 11 NA NA 17
The values for each of the categories appear on the right rows... the NAs are a nuisance though. How would you want the data to appear in this case? Something like
1 a 2 7 16
2 b 4 10 12
3 c 1 8 9 11 17
This will not work, of course, because each row must have the same number of columns, so you would end up with NAs for the last two elements in the top two rows.
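The NA-padded layout just described can be sketched in base R as follows. The y values are copied from the printed example above rather than re-sampled, and the names width and padded are illustrative.

```r
x <- c(rep(c("a", "b", "c"), 3), "c", "c")
y <- c(16, 4, 9, 2, 12, 17, 7, 10, 11, 1, 8)  # values from the printout above

groups <- split(y, x)
width <- max(lengths(groups))

# Pad each group with NA up to the longest group, then put one group per row
padded <- t(sapply(groups, function(v) c(v, rep(NA, width - length(v)))))
padded
```

Values appear in their original order within each group, so row c reads 9, 17, 11, 1, 8 rather than being sorted.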
However, I suspect that a list would probably be the best solution in this case anyway, in which case, consider this:
> dl <- split(y, x)
> dl[["a"]]
[1] 16 2 7
> dl$b
[1] 4 12 10
> dl[["c"]]
[1] 9 17 11 1 8
You can then operate on the elements of this list. As with all things R, there are a variety of ways to do this. For example, to get the output as a list:
> lapply(dl, sum)
$a
[1] 25
$b
[1] 26
$c
[1] 46
Or with output as a vector
> sapply(dl, sum)
a b c
25 26 46
Or, alternatively, to get the output as a data frame:
> library(plyr)
> ldply(dl, sum)
.id V1
1 a 25
2 b 26
3 c 46
These mechanisms afford a far greater degree of generality than functions like rowsum(), since you can apply essentially arbitrary functions to each of the elements in the original list.
I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result lives in two columns, as columns in data frames normally hold simple data types (a list column could hold one vector per row, but is more awkward to work with; or you could resort to complex numbers). head() with a negative argument drops that many elements from the tail; try head(1:10, -2). rep() is repetition, c() is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
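Since the question also asks about summing or averaging each offset group, one possible follow-up is to run rowSums() and rowMeans() over the two lag columns. The column names grp_sum and grp_mean are made up for this sketch.

```r
df <- data.frame(a = 1:10)
df$b1 <- c(NA, head(df$a, -1))       # value one row back
df$b2 <- c(NA, NA, head(df$a, -2))   # value two rows back

# Row-wise sum and mean of each group; rows with an incomplete group give NA
df$grp_sum  <- rowSums(df[c("b2", "b1")])
df$grp_mean <- rowMeans(df[c("b2", "b1")])
df
```

Passing na.rm = TRUE would instead aggregate over whatever values are available in the first rows.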
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8
Hoping there's a simple answer here but I can't find it anywhere.
I have a numeric matrix with row names and column names:
# 1 2 3 4
# a 6 7 8 9
# b 8 7 5 7
# c 8 5 4 1
# d 1 6 3 2
I want to melt the matrix to a long format, with the values in one column and matrix row and column names in one column each. The result could be a data.table or data.frame like this:
# col row value
# 1 a 6
# 1 b 8
# 1 c 8
# 1 d 1
# 2 a 7
# 2 c 5
# 2 d 6
...
Any tips appreciated.
Use melt from reshape2:
library(reshape2)
#Fake data
x <- matrix(1:12, ncol = 3)
colnames(x) <- letters[1:3]
rownames(x) <- 1:4
x.m <- melt(x)
x.m
Var1 Var2 value
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
...
The as.table and as.data.frame functions together will do this:
> m <- matrix( sample(1:12), nrow=4 )
> dimnames(m) <- list( One=letters[1:4], Two=LETTERS[1:3] )
> as.data.frame( as.table(m) )
One Two Freq
1 a A 7
2 b A 2
3 c A 1
4 d A 5
5 a B 9
6 b B 6
7 c B 8
8 d B 10
9 a C 11
10 b C 12
11 c C 3
12 d C 4
Assuming 'm' is your matrix...
data.frame(col = rep(colnames(m), each = nrow(m)),
row = rep(rownames(m), ncol(m)),
value = as.vector(m))
This executes extremely fast on a large matrix and also shows you a bit about how a matrix is made, how to access things in it, and how to construct your own vectors.
A modification that doesn't require you to know anything about the storage structure, and that easily extends to higher-dimensional arrays if you use the dimnames and slice.index functions:
data.frame(row=rownames(m)[as.vector(row(m))],
col=colnames(m)[as.vector(col(m))],
value=as.vector(m))
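The extension to higher-dimensional arrays mentioned above might look like the following sketch for a three-dimensional array. The array contents and the dimension names (row, col, layer) are invented for illustration.

```r
# Illustrative 2 x 3 x 4 array with named dimensions
arr <- array(1:24, dim = c(2, 3, 4),
             dimnames = list(row = c("a", "b"),
                             col = c("x", "y", "z"),
                             layer = paste0("L", 1:4)))

# slice.index(arr, k) gives, for every cell, its index along dimension k
long <- data.frame(
  row   = dimnames(arr)[[1]][as.vector(slice.index(arr, 1))],
  col   = dimnames(arr)[[2]][as.vector(slice.index(arr, 2))],
  layer = dimnames(arr)[[3]][as.vector(slice.index(arr, 3))],
  value = as.vector(arr)
)
head(long)
```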