How to remove columns with dplyr with NA in specific row? - r

This code removes all columns which contain at least one NA.
library(dplyr)
df %>%
select_if(~ !any(is.na(.)))
What do I need to modify if I want to remove only the columns that have NA in the eighth row (for my generated data below)?
set.seed(1234)
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))

In base R one can simply try:
df[,which(!is.na(df[8,]))]
Or as suggested by @RichScriven:
df[, !is.na(df[8,])]
# A B
# 1 1 11
# 2 2 12
# 3 3 13
# 4 4 NA
# 5 NA 15
# 6 6 16
# 7 7 17
# 8 8 18
# 9 9 19
# 10 10 20

You could do this:
df %>%
select_if(!is.na(.[8,]))
A B
1 1 11
2 2 12
3 3 13
4 4 NA
5 NA 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20

Another option is keep from purrr:
library(purrr)
keep(df, ~ !(is.na(.x[8])))
# A B
#1 1 11
#2 2 12
#3 3 13
#4 4 NA
#5 NA 15
#6 6 16
#7 7 17
#8 8 18
#9 9 19
#10 10 20
Or with Filter from base R
Filter(function(x) !(is.na(x[8])), df)
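
In dplyr 1.0.0 and later, select_if() is superseded by select() combined with where(). A minimal sketch of the same idea follows; the small data frame is made up for illustration (it is not the seeded data from the question), with an NA placed in row 8 of column C only.

```r
library(dplyr)

# Hypothetical data: only column C has NA in row 8
df <- data.frame(A = 1:10,
                 B = c(11:13, NA, 15:20),
                 C = c(21:27, NA, 29:30))

# where() applies the predicate to each column;
# keep the columns whose 8th value is not NA
result <- df %>% select(where(~ !is.na(.x[8])))
names(result)
```

Column B is kept even though it contains an NA, because that NA is in row 4 and not in row 8.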

R, How to generate additional observations denoted by numbered sequence

I'm currently a bit stuck, since I'm a bit unsure of how to even formulate my problem.
What I have is a dataframe of observations with a few variables.
Let's say this is my initial dataset:
test <- data.frame(var1=c("a","b"),var2=c(15,12))
What I want to end up with is something like:
test2 <- data.frame(var1_p=c("a","a","a","a","a","b","b","b","b","b"),
var2=c(15,15,15,15,15,12,12,12,12,12),
var3=c(1,2,3,4,5,1,2,3,4,5))
However, the initial number of observations and the fact that I need the numbering to run from 0 to 9 make it rather tedious to do by hand.
Does anybody have a nice alternative solution?
Thank you.
What I tried so far was:
a)
testdata$C <- 0
testdata <- for (i in testdata$Combined_Number) {add_row(testdata,C=seq(0,9))}
which results in an empty dataset.
b)
testdata$C <- with(testdata, ave(Combined_Number,flur, FUN = seq(0,9)))
which gives the following error code:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Perhaps crossing helps
library(tidyr)
crossing(df, var3 = 0:9)
-output
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
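With dplyr this is one approach (note that dplyr 1.1.0 and later deprecate multi-row results from summarize() in favor of reframe()):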
library(dplyr)
df %>%
group_by(var1) %>%
summarize(var2, var3 = 0:9, .groups="drop")
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
Data
df <- structure(list(var1 = c("a", "b"), var2 = c(15, 12)), class = "data.frame", row.names = c(NA,
-2L))
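
If you would rather avoid extra packages, the same expansion can be sketched in base R by repeating the row indices. This assumes every original row should get the same sequence 0 to 9; the names n and out are just placeholders for this example.

```r
df <- data.frame(var1 = c("a", "b"), var2 = c(15, 12))

n <- 10  # number of copies per original row (assumed from the 0-9 numbering)

# Repeat each row index n times, then attach the running number 0..n-1
out <- df[rep(seq_len(nrow(df)), each = n), ]
out$var3 <- rep(seq_len(n) - 1L, times = nrow(df))
rownames(out) <- NULL
out
```

As for attempt b) in the question: ave() expects a function for FUN, whereas seq(0,9) is already an evaluated vector, which is why R complains that it cannot find a function.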

problem with using coalesce to merge 2 columns into 1

I would like to merge values from 2 columns into 1. For example, here is sample data:
id x y
1 12
1 14
2 13
3 15
3 18
4 19
I want
id x y z
1 12 12
1 14 14
2 13 13
3 15 15
3 18 18
4 19 19
I tried using coalesce to create a new variable.
coalesce <- function(...) {
apply(cbind(...), 1, function(x) {
x[which(!is.na(x))[1]]
})
}
df$z <- coalesce(df$x, df$y)
However, the variable doesn't reflect the columns joined. Am I using this function incorrectly?
You could use the dplyr::coalesce function:
> df$z <- dplyr::coalesce(ifelse(df$x == "", NA, df$x), df$y)
> df
id x y z
1 1 12 12
2 1 14 14
3 2 13 13
4 3 15 15
5 3 18 18
6 4 19 19
>
To implement my own mycoalesce:
mycoalesce <- function(...) {apply(cbind(...), 1, max)}
And:
> df$z <- mycoalesce(df$x, df$y)
> df
id x y z
1 1 12 12
2 1 14 14
3 2 13 13
4 3 15 15
5 3 18 18
6 4 19 19
>
This might be a more crude and inefficient way than the other methods posted above, but still worth a try:
df1<-df
df1[is.na(df1)]=0
z=df1$x+df1$y
df<-cbind(df,z)
df
# ID x y z
#1 1 12 NA 12
#2 2 NA 14 14
#3 3 13 NA 13
#4 4 15 NA 15
#5 5 NA 18 18
#6 6 NA 19 19
I mainly copied the original dataframe to a new dataframe so as to preserve the NA values in the original dataframe. Also, along with @Park's assumption, I assumed that none of the IDs are missing.
Data: df<-data.frame(ID=1:6,x=c(12,NA,13,15,NA,NA),y=c(NA,14,NA,NA,18,19))
If exactly one of x and y is NA in each row and the other has a value, you can define a custom operator %+% like this:
`%+%` <- function(x, y) mapply(sum, x, y, MoreArgs = list(na.rm = TRUE))
df$z <- df$x %+% df$y
df
id x y z
1 1 12 NA 12
2 1 NA 14 14
3 2 13 NA 13
4 3 15 NA 15
5 3 NA 18 18
6 4 NA 19 19
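
For completeness, a plain base R version of "take x, fall back to y" can be sketched with ifelse(). The data frame below is an assumption pieced together from the answers above (numeric columns with real NA values); if the blanks in your data are empty strings rather than NA, is.na() will not catch them and they need converting first.

```r
# Assumed data: numeric x and y with NA where the value is missing
df <- data.frame(id = c(1, 1, 2, 3, 3, 4),
                 x  = c(12, NA, 13, 15, NA, NA),
                 y  = c(NA, 14, NA, NA, 18, 19))

# Use x where it is present, otherwise fall back to y
df$z <- ifelse(is.na(df$x), df$y, df$x)
df
```

Unlike the sum-based approaches, this does not add the two values together when both x and y happen to be present; it simply prefers x.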

How can I stack columns per x columns in R

I'm looking to transform a data frame of 660 columns into 3 columns just by stacking them on each other per 3 columns without manually re-arranging (since I have 660 columns).
In a small scale example per 2 columns with just 4 columns, I want to go from
A B C D
1 4 7 10
2 5 8 11
3 6 9 12
to
A B
1 4
2 5
3 6
7 10
8 11
9 12
Thanks
reshape to the rescue:
reshape(df, direction="long", varying=split(names(df), rep(seq_len(ncol(df)/2), 2)))
# time A B id
#1.1 1 1 4 1
#2.1 1 2 5 2
#3.1 1 3 6 3
#1.2 2 7 10 1
#2.2 2 8 11 2
#3.2 2 9 12 3
rbind.data.frame requires that all columns match up. So use setNames to replace the names of the C:D columns:
rbind( dat[1:2], setNames(dat[3:4], names(dat[1:2])) )
A B
1 1 4
2 2 5
3 3 6
4 7 10
5 8 11
6 9 12
To generalize that to multiple columns use do.call and lapply:
dat <- setNames( as.data.frame( matrix(1:36, ncol=12) ), LETTERS[1:12])
dat
#----
A B C D E F G H I J K L
1 1 4 7 10 13 16 19 22 25 28 31 34
2 2 5 8 11 14 17 20 23 26 29 32 35
3 3 6 9 12 15 18 21 24 27 30 33 36
do.call( rbind, lapply( seq(1,12, by=3), function(x) setNames(dat[x:(x+2)], LETTERS[1:3]) ))
A B C
1 1 4 7
2 2 5 8
3 3 6 9
4 10 13 16
5 11 14 17
6 12 15 18
7 19 22 25
8 20 23 26
9 21 24 27
10 28 31 34
11 29 32 35
12 30 33 36
The 12 would be replaced by 660 and everything else should work.
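
The same do.call/lapply idea can be wrapped in a small helper so the block width is a parameter. This is only a sketch: stack_every and k are names invented here, and it assumes the number of columns is an exact multiple of k.

```r
# Hypothetical helper: stack a data frame in blocks of k columns
stack_every <- function(data, k) {
  stopifnot(ncol(data) %% k == 0)          # columns must divide evenly
  starts <- seq(1, ncol(data), by = k)     # first column of each block
  target_names <- names(data)[seq_len(k)]  # reuse the first block's names
  pieces <- lapply(starts, function(s) {
    setNames(data[s:(s + k - 1)], target_names)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

df <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
stacked <- stack_every(df, 2)
stacked
```

For the 660-column case in the question, the call would be stack_every(your_data, 3).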
A classical split-apply-combine approach will scale flexibly:
as.data.frame(lapply(split(unclass(df),
names(df)[seq(ncol(df) / 2)]),
unlist, use.names = FALSE))
## A B
## 1 1 4
## 2 2 5
## 3 3 6
## 4 7 10
## 5 8 11
## 6 9 12
or with a hint of purrr,
library(purrr)
df %>% unclass() %>% # convert to non-data.frame list
split(names(.)[seq(length(.) / 2)]) %>% # split columns by indexed names
map_df(simplify) # simplify each split to vector, coerce back to data.frame
## # A tibble: 6 × 2
## A B
## <int> <int>
## 1 1 4
## 2 2 5
## 3 3 6
## 4 7 10
## 5 8 11
## 6 9 12
Here is another base R option
i1 <- c(TRUE, FALSE)
`row.names<-`(data.frame(A= unlist(df1[i1]), B = unlist(df1[!i1])), NULL)
# A B
#1 1 4
#2 2 5
#3 3 6
#4 7 10
#5 8 11
#6 9 12
Or another option is melt from data.table
library(data.table)
i1 <- seq(1, ncol(df1), by = 2)
i2 <- seq(2, ncol(df1), by = 2)
melt(setDT(df1), measure = list(i1, i2), value.name = c("A", "B"))
rbindlist from data.table package can also be used for the task and seems to be much more efficient.
# EXAMPLE DATA
df1 <- read.table(text = '
Col1 Col2 Col3 Col4
1 2 3 4
5 6 7 8
1 2 3 4
5 6 7 8', header = TRUE)
library(data.table)
library(microbenchmark)
library(purrr)
microbenchmark(
Map = as.data.frame(Map(c, df1[,1:2], df1[, 3:4])),
Reshape = reshape(df1, direction="long", varying=split(names(df1), rep(seq_len(ncol(df1)/2), 2))),
Purr = df1 %>% unclass() %>% # convert to non-data.frame list
split(names(.)[seq(length(.) / 2)]) %>% # split columns by indexed names
map_df(simplify),
DataTable = rbindlist(list(df1[,1:2], df1[, 3:4])),
Mapply = data.frame(mapply(c, df1[,1:2], df1[, 3:4], SIMPLIFY=FALSE)),
Rbind = rbind(df1[, 1:2],setnames(df1[, 3:4],names(df1[,1:2])))
)
The results are:
Unit: microseconds
expr min lq mean median uq max neval cld
Map 214.724 232.9380 246.2936 244.1240 255.9240 343.611 100 bc
Reshape 716.962 739.8940 778.7912 749.7550 767.6725 2834.192 100 e
Purr 309.559 324.6545 339.2973 334.0440 343.4290 551.746 100 d
DataTable 98.228 111.6080 122.7753 119.2320 129.2640 189.614 100 a
Mapply 233.577 258.2605 271.1881 270.7895 281.6305 339.291 100 c
Rbind 206.001 221.1515 228.5956 226.6850 235.2670 283.957 100 b

R: Summing values of columns through a loop

I'm really new to R and this forum, and need help constructing a loop.
(I'm a biology student with almost zero programming experience).
My dataframe has the following (simplified) structure:
a = "TNS"
b = NA
c = NA
d = 21
e = 37
f = 1
g = 39
h = 29
df = data.frame (a,b,c,d,e,f,g,h)
In reality my data frame consists of 210 rows and 90 columns, but those other rows are not really of interest to me right now.
What I'm looking for is a way to sum the values of every column with each other, except the first three, and add those results automatically as new columns to the end of my dataframe.
This would preferentially result in a data.frame as follows:
a = "TNS"
b = NA
c = NA
d = 21
e = 37
f = 1
g = 39
h = 29
de = 58
df = 22
dg = 60
dh = 50
ef = 38
eg = 76
eh = 66
fg = 40
fh = 30
gh = 68
df = data.frame (a,b,c,d,e,f,g,h,de,df,dg,dh,ef,eg,eh,fg,fh,gh)
It cannot pair each column more than once. And having run the loop for each pairing, I need to do it for each triplet of columns, quartet of columns, etc.
Why would I want to do this? I need to do this for 85 columns for a biodiversity research project and it would take way too much time to calculate the value for each combination by hand.
Any help would be greatly appreciated as I really don't have the experience with R to come up with a solution by myself!!!
You can use combn in conjunction with rowSums, like this:
## This creates the names for the new columns we'll be creating
nam <- combn(names(df)[-c(1, 2, 3)], 2, FUN = function(x) paste(x, collapse = ""))
## Create and assign to your original data.frame
df[nam] <- combn(names(df)[-c(1, 2, 3)], 2,
FUN = function(x) rowSums(df[x], na.rm = TRUE), simplify = FALSE)
df
# a b c d e f g h de df dg dh ef eg eh fg fh gh
# 1 TNS NA NA 3 3 10 5 9 6 13 8 12 13 8 12 15 19 14
# 2 TNS NA NA 4 2 3 6 7 6 7 10 11 5 8 9 9 10 13
# 3 TNS NA NA 6 7 7 5 8 13 13 11 14 14 12 15 12 15 13
# 4 TNS NA NA 10 4 2 2 6 14 12 12 16 6 6 10 4 8 8
# 5 TNS NA NA 3 8 3 9 6 11 6 12 9 11 17 14 12 9 15
# 6 TNS NA NA 9 5 4 7 8 14 13 16 17 9 12 13 11 12 15
# 7 TNS NA NA 10 8 1 8 1 18 11 18 11 9 16 9 9 2 9
# 8 TNS NA NA 7 10 4 2 5 17 11 9 12 14 12 15 6 9 7
# 9 TNS NA NA 7 4 9 8 8 11 16 15 15 13 12 12 17 17 16
# 10 TNS NA NA 1 8 4 5 7 9 5 6 8 12 13 15 9 11 12
Here's the sample data used for this answer:
set.seed(1)
df <- data.frame(a = "TNS", b = NA, c = NA,
matrix(sample(10, 50, TRUE), ncol = 5,
dimnames = list(NULL, c("d", "e", "f", "g", "h"))))
df
# a b c d e f g h
# 1 TNS NA NA 3 3 10 5 9
# 2 TNS NA NA 4 2 3 6 7
# 3 TNS NA NA 6 7 7 5 8
# 4 TNS NA NA 10 4 2 2 6
# 5 TNS NA NA 3 8 3 9 6
# 6 TNS NA NA 9 5 4 7 8
# 7 TNS NA NA 10 8 1 8 1
# 8 TNS NA NA 7 10 4 2 5
# 9 TNS NA NA 7 4 9 8 8
# 10 TNS NA NA 1 8 4 5 7
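
The question also asks about triplets, quartets and so on. One way to extend the combn idea is to loop over the combination size, as in the sketch below, which uses the single-row data from the question. The names value_cols and m are placeholders for this example.

```r
df <- data.frame(a = "TNS", b = NA, c = NA,
                 d = 21, e = 37, f = 1, g = 39, h = 29)

# Fix the set of value columns before adding any new ones
value_cols <- names(df)[-(1:3)]

for (m in 2:3) {  # pairs and triplets; widen the range for larger groups
  for (cols in combn(value_cols, m, simplify = FALSE)) {
    new_name <- paste(cols, collapse = "")
    df[[new_name]] <- rowSums(df[cols], na.rm = TRUE)
  }
}
df
```

Be aware that the number of combinations grows very quickly: with 85 columns there are 3,570 pairs and 98,770 triplets, so going far beyond small group sizes may not be practical.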

bind rows over similar data frames where one data frame is missing columns.

I'd like to combine data frames with similar named columns and introduce NA's where one of the data frames is missing column values (in this case the z variable is missing in df2).
df1 <- data.frame(x = 1:10,
y = 1:10,
z = 1:10)
df2 <- data.frame(y = 11:20,
x = 11:20)
# My output could look like this, with NAs added where there are missing columns.
data.frame(x = 1:20,
y = 1:20,
z = c(1:10, rep(NA, 10)))
Hadley offers a nice function for that:
library(plyr)
rbind.fill(df1, df2)
# x y z
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 10 10
#11 11 11 NA
#12 12 12 NA
#13 13 13 NA
#14 14 14 NA
#15 15 15 NA
#16 16 16 NA
#17 17 17 NA
#18 18 18 NA
#19 19 19 NA
#20 20 20 NA
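
If you prefer not to load plyr, the same result can be sketched in base R by adding the missing columns as NA before calling rbind. This version only handles columns that are missing from df2; if df1 could also lack columns, the same step would be needed in the other direction. (dplyr::bind_rows() also fills missing columns with NA.)

```r
df1 <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df2 <- data.frame(y = 11:20, x = 11:20)

# Columns present in df1 but absent from df2
missing_cols <- setdiff(names(df1), names(df2))
df2[missing_cols] <- NA

# Reorder df2's columns to match df1, then stack
combined <- rbind(df1, df2[names(df1)])
combined
```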
