R: Summing values of columns through a loop - r

I'm really new to R and this forum, and need help constructing a loop.
(I'm a biology student with almost zero programming experience).
My dataframe has the following (simplified) structure:
a = "TNS"
b = NA
c = NA
d = 21
e = 37
f = 1
g = 39
h = 29
df = data.frame (a,b,c,d,e,f,g,h)
In reality my data frame consists of 210 rows and 90 columns, but those other rows are not really of interest to me right now.
What I'm looking for is a way to sum the values of every column with each other, except the first three, and add those results automatically as new columns to the end of my dataframe.
This would preferentially result in a data.frame as follows:
a = "TNS"
b = NA
c = NA
d = 21
e = 37
f = 1
g = 39
h = 29
de = 58
df = 22
dg = 60
dh = 50
ef = 38
eg = 76
eh = 66
fg = 40
fh = 30
gh = 68
df = data.frame (a,b,c,d,e,f,g,h,de,df,dg,dh,ef,eg,eh,fg,fh,gh)
It must not pair any two columns more than once. And having run the loop for each pair, I need to do the same for each triplet of columns, each quartet of columns, etc.
Why would I want to do this? I need to do this for 85 columns for a biodiversity research project and it would take way too much time to calculate the value for each combination by hand.
Any help would be greatly appreciated as I really don't have the experience with R to come up with a solution by myself!!!

You can use combn in conjunction with rowSums, like this:
## This creates the names for the new columns we'll be creating
nam <- combn(names(df)[-c(1, 2, 3)], 2, FUN = function(x) paste(x, collapse = ""))
## Create and assign to your original data.frame
df[nam] <- combn(names(df)[-c(1, 2, 3)], 2,
FUN = function(x) rowSums(df[x], na.rm = TRUE), simplify = FALSE)
df
# a b c d e f g h de df dg dh ef eg eh fg fh gh
# 1 TNS NA NA 3 3 10 5 9 6 13 8 12 13 8 12 15 19 14
# 2 TNS NA NA 4 2 3 6 7 6 7 10 11 5 8 9 9 10 13
# 3 TNS NA NA 6 7 7 5 8 13 13 11 14 14 12 15 12 15 13
# 4 TNS NA NA 10 4 2 2 6 14 12 12 16 6 6 10 4 8 8
# 5 TNS NA NA 3 8 3 9 6 11 6 12 9 11 17 14 12 9 15
# 6 TNS NA NA 9 5 4 7 8 14 13 16 17 9 12 13 11 12 15
# 7 TNS NA NA 10 8 1 8 1 18 11 18 11 9 16 9 9 2 9
# 8 TNS NA NA 7 10 4 2 5 17 11 9 12 14 12 15 6 9 7
# 9 TNS NA NA 7 4 9 8 8 11 16 15 15 13 12 12 17 17 16
# 10 TNS NA NA 1 8 4 5 7 9 5 6 8 12 13 15 9 11 12
Here's the sample data used for this answer:
set.seed(1)
df <- data.frame(a = "TNS", b = NA, c = NA,
matrix(sample(10, 50, TRUE), ncol = 5,
dimnames = list(NULL, c("d", "e", "f", "g", "h"))))
df
# a b c d e f g h
# 1 TNS NA NA 3 3 10 5 9
# 2 TNS NA NA 4 2 3 6 7
# 3 TNS NA NA 6 7 7 5 8
# 4 TNS NA NA 10 4 2 2 6
# 5 TNS NA NA 3 8 3 9 6
# 6 TNS NA NA 9 5 4 7 8
# 7 TNS NA NA 10 8 1 8 1
# 8 TNS NA NA 7 10 4 2 5
# 9 TNS NA NA 7 4 9 8 8
# 10 TNS NA NA 1 8 4 5 7
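The question also asks about triplets, quartets and so on. Here is a minimal sketch of one way to extend the same combn/rowSums idea, assuming the sample data above and looping over the combination size m (the names cols and m are mine, not from the answer):

```r
set.seed(1)
df <- data.frame(a = "TNS", b = NA, c = NA,
                 matrix(sample(10, 50, TRUE), ncol = 5,
                        dimnames = list(NULL, c("d", "e", "f", "g", "h"))))

# Fix the set of source columns once, so newly added sums are never re-combined
cols <- names(df)[-(1:3)]

# m = 2 gives pairs, m = 3 gives triplets; raise the upper bound for larger groups
for (m in 2:3) {
  nam <- combn(cols, m, FUN = paste, collapse = "")
  df[nam] <- combn(cols, m,
                   FUN = function(x) rowSums(df[x], na.rm = TRUE),
                   simplify = FALSE)
}

names(df)
```

With 85 columns this grows quickly: choose(85, 2) is 3570 pairs and choose(85, 3) is 98770 triplets, so you will probably want to cap m.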

Related

Avoid the for loops-R

I have two data frames, x and y. For each value of x[,2], I check whether it is equal to one of the elements of y[,1]. If so, I add a third column to the first data frame that contains the corresponding value of y[,2].
I managed to do that with loops, but how can I do this using vectors?
x=data.frame(1:15,15:1)
y=data.frame(3:5,c(7.2,8.5,0.3))
for ( i in 1:nrow(x)) {
for (j in 1:nrow(y)) {
if (x[i,2]==y[j,1]){
x[i,3]=y[j,2]
}else{
}
}
}
Use a join instead of loops. Based on the loop comparison, the second column of 'x' is compared with the first column of 'y', so those columns are used in the on condition. Then assign (:=) the second column (col2) of the second dataset to create the new column 'col3' in the first dataset.
library(data.table)
setDT(x)[y, col3 := i.col2, on = .(col2 = col1)]
-output
> x
col1 col2 col3
1: 1 15 NA
2: 2 14 NA
3: 3 13 NA
4: 4 12 NA
5: 5 11 NA
6: 6 10 NA
7: 7 9 NA
8: 8 8 NA
9: 9 7 NA
10: 10 6 NA
11: 11 5 0.3
12: 12 4 8.5
13: 13 3 7.2
14: 14 2 NA
15: 15 1 NA
data
x <- data.frame(col1 = 1:15, col2 = 15:1)
y <- data.frame(col1 = 3:5, col2 = c(7.2,8.5,0.3))
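For comparison, the same lookup can be sketched in base R with match(), assuming the col1/col2 data shown above. match() returns, for each x$col2, the row position of the first equal value in y$col1 (or NA when there is none), which is then used to index y$col2:

```r
x <- data.frame(col1 = 1:15, col2 = 15:1)
y <- data.frame(col1 = 3:5, col2 = c(7.2, 8.5, 0.3))

# Position in y$col1 of each x$col2 value; NA where there is no match
pos <- match(x$col2, y$col1)

# Indexing by NA yields NA, so unmatched rows get NA automatically
x$col3 <- y$col2[pos]
x[10:14, ]
```

This assumes the key values in y$col1 are unique; with duplicates, match() only uses the first occurrence.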
Update: Many thanks to @TrainingPizza, who drew my attention to the incorrect output of my first answer and also showed how it could work:
library(dplyr)
x %>%
rowwise() %>%
mutate(col3 = ifelse(col2 %in% y$col1, y$col2[y$col1==col2], NA))
col1 col2 col3
<int> <int> <dbl>
1 1 15 NA
2 2 14 NA
3 3 13 NA
4 4 12 NA
5 5 11 NA
6 6 10 NA
7 7 9 NA
8 8 8 NA
9 9 7 NA
10 10 6 NA
11 11 5 0.3
12 12 4 8.5
13 13 3 7.2
14 14 2 NA
15 15 1 NA
First answer (not correct)
Here is a dplyr way to avoid the for loop:
library(dplyr)
x %>%
mutate(V3 = ifelse(V2 %in% y$V1, y$V2, NA))
V1 V2 V3
1 1 15 NA
2 2 14 NA
3 3 13 NA
4 4 12 NA
5 5 11 NA
6 6 10 NA
7 7 9 NA
8 8 8 NA
9 9 7 NA
10 10 6 NA
11 11 5 8.5
12 12 4 0.3
13 13 3 7.2
14 14 2 NA
15 15 1 NA
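Why the first answer goes wrong: ifelse() recycles its yes argument to the length of the test and picks values by row position, not by key. A small sketch of that behaviour, using the col1/col2 names from the data section rather than V1/V2 (the names wrong and recycled are just for illustration):

```r
x <- data.frame(col1 = 1:15, col2 = 15:1)
y <- data.frame(col1 = 3:5, col2 = c(7.2, 8.5, 0.3))

# What ifelse() effectively does with its `yes` argument: repeat it to 15 values
recycled <- rep(y$col2, length.out = nrow(x))

# Same expression as in the first answer, written without dplyr
wrong <- ifelse(x$col2 %in% y$col1, y$col2, NA)

# Rows 11 to 13 receive whatever the recycled vector holds at positions 11 to 13
cbind(col2 = x$col2, recycled, wrong)[10:14, ]
```

So row 11 (col2 = 5) gets 8.5 instead of 0.3 simply because position 11 of the recycled vector happens to be 8.5.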

tidyverse: binding list elements of same dimension

Using reduce(bind_cols), list elements of the same dimension can be combined. However, I would like to know how to combine only the elements of the same dimension (possibly with the dimension specified in some way) from a list whose elements may have different dimensions.
library(tidyverse)
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls1 <- list(df1, df3)
ls1
[[1]]
A1 A2
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
[[2]]
C
1 31
2 32
3 33
4 34
5 35
6 36
7 37
8 38
9 39
10 40
ls1 %>%
reduce(bind_cols)
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
ls2 <- list(df1, df2, df3)
ls2
[[1]]
A1 A2
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
[[2]]
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
[[3]]
C
1 31
2 32
3 33
4 34
5 35
6 36
7 37
8 38
9 39
10 40
ls2 %>%
reduce(bind_cols)
Error: Can't recycle `..1` (size 10) to match `..2` (size 20).
Run `rlang::last_error()` to see where the error occurred.
Question
Looking for a function to combine all data.frames in a list with an argument of number of rows.
One option could be:
map(split(ls2, map_int(ls2, NROW)), bind_cols)
$`10`
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
$`20`
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
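If you want the row count as an explicit argument, as the question asks, here is one possible base R sketch. bind_by_nrow is a hypothetical helper name, not an existing function; it keeps only the data frames with exactly n rows and column-binds them:

```r
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls2 <- list(df1, df2, df3)

# Hypothetical helper: bind the list elements that have exactly n rows
bind_by_nrow <- function(lst, n) {
  keep <- Filter(function(d) nrow(d) == n, lst)
  if (length(keep) == 0) return(NULL)  # nothing with that many rows
  do.call(cbind, keep)
}

res10 <- bind_by_nrow(ls2, 10)
res20 <- bind_by_nrow(ls2, 20)
res10
```

Returning NULL when nothing matches is just one design choice; stopping with an error would be equally reasonable.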
You can use -
n <- 1:max(sapply(ls2, nrow))
res <- do.call(cbind, lapply(ls2, `[`, n, ,drop = FALSE))
res
# A1 A2 B C
#1 1 10 11 31
#2 2 9 12 32
#3 3 8 13 33
#4 4 7 14 34
#5 5 6 15 35
#6 6 5 16 36
#7 7 4 17 37
#8 8 3 18 38
#9 9 2 19 39
#10 10 1 20 40
#NA NA NA 21 NA
#NA.1 NA NA 22 NA
#NA.2 NA NA 23 NA
#NA.3 NA NA 24 NA
#NA.4 NA NA 25 NA
#NA.5 NA NA 26 NA
#NA.6 NA NA 27 NA
#NA.7 NA NA 28 NA
#NA.8 NA NA 29 NA
#NA.9 NA NA 30 NA
A little bit shorter with purrr::map_dfc
purrr::map_dfc(ls2, `[`, n, , drop = FALSE)
We can use cbind.fill from rowr
library(rowr)
do.call(cbind.fill, c(ls2, fill = NA))
A base R option using tapply + sapply
tapply(
ls2,
sapply(ls2, nrow),
function(x) do.call(cbind, x)
)
gives
$`10`
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
$`20`
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
You may also use if inside reduce if you want to combine only the list elements with matching dimensions (in this case, the first item in the list has priority):
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls1 <- list(df1, df3)
ls2 <- list(df1, df2, df3)
library(tidyverse)
reduce(ls2, ~if(nrow(.x) == nrow(.y)){bind_cols(.x, .y)} else {.x})
#> A1 A2 C
#> 1 1 10 31
#> 2 2 9 32
#> 3 3 8 33
#> 4 4 7 34
#> 5 5 6 35
#> 6 6 5 36
#> 7 7 4 37
#> 8 8 3 38
#> 9 9 2 39
#> 10 10 1 40
Created on 2021-06-09 by the reprex package (v2.0.0)
Here's another tidyverse option.
We're creating a dummy ID in each data.frame based on the row_number(), then joining all data.frames by the dummy ID, and then dropping the dummy ID.
ls2 %>%
map(., ~mutate(.x, id = row_number())) %>%
reduce(full_join, by = "id") %>%
select(-id)
This gives us:
A1 A2 B C
1 1 10 11 31
2 2 9 12 32
3 3 8 13 33
4 4 7 14 34
5 5 6 15 35
6 6 5 16 36
7 7 4 17 37
8 8 3 18 38
9 9 2 19 39
10 10 1 20 40
11 NA NA 21 NA
12 NA NA 22 NA
13 NA NA 23 NA
14 NA NA 24 NA
15 NA NA 25 NA
16 NA NA 26 NA
17 NA NA 27 NA
18 NA NA 28 NA
19 NA NA 29 NA
20 NA NA 30 NA
We can also use Reduce function from base R:
lst <- list(df1, df2, df3)
# First we create id number for each underlying data set
lst |>
lapply(\(x) {x$id <- 1:nrow(x);
x
}
) -> ls2
Reduce(function(x, y) if(nrow(x) == nrow(y)){
merge(x, y, by = "id")
} else {
x
}, ls2)
id A1 A2 C
1 1 1 10 31
2 2 2 9 32
3 3 3 8 33
4 4 4 7 34
5 5 5 6 35
6 6 6 5 36
7 7 7 4 37
8 8 8 3 38
9 9 9 2 39
10 10 10 1 40

Make values not adjacent to each other NA

Some of the values >= 10 in the data frame below (31, 89, 12, 69) have digits that come in order, like 89 and 12. By that I mean the order 123456789: the digits are adjacent to each other. I would like to set the values whose digits are not adjacent to NA (31 and 69: in 31, the digit 2 is missing in between; in 69, the digits 7 and 8 are missing). How can I code this? Imagine a big dataset! :)
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,67,8,9,0,6,7,9)
B <- c(5,6,31,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,89,3,2,9,NA,12,69,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 31 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 69
10 b 9 4 6 8
It should look like:
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA NA 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 NA
10 b 9 4 6 8
Another solution defining a vector of the values to keep beforehand (only up to two-digit numbers, but could be extended):
numerals <- 1:9
vector <- 0:9
for (i in numerals) {
j <- numerals[i+1]
if (!is.na(j)) {
number <- as.numeric(paste(c(i, j), collapse = ""))
number_reverse <- as.numeric(paste(c(j, i), collapse = ""))
vector <- c(vector, number, number_reverse)
}
}
vector
[1] 0 1 2 3 4 5 6 7 8 9 12 21 23 32 34 43 45 54 56 65 67 76 78 87 89 98
Function to replace number if not in vector:
replace <- function(x) {
x <- ifelse(!x %in% vector, NA, x)
return(x)
}
Result:
library(dplyr)
df %>% mutate_at(c("A", "B", "C", "D"), replace)
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA NA 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 NA
10 b 9 4 6 8
Here is a function that tests individual numbers
MyFunction <- function(A){
NumbersToCheck <- lapply(strsplit(as.character(A),""),as.integer)
check <- lapply(2:length(unlist(NumbersToCheck)), function(X) ifelse(NumbersToCheck[[1]][X]-NumbersToCheck[[1]][X-1]==1,TRUE,FALSE))
return(ifelse(FALSE %in% check,NA,A))
}
Which can then be applied to your entire df as follows
df[,2:ncol(df)] <- lapply(2:ncol(df), function(X) unlist(lapply(df[,X],MyFunction)))
to get the following result
> df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA NA 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 NA
10 b 9 4 6 8
df[] <- lapply(df, function(col) {
# Split each value character by character
NAs <- sapply(strsplit(as.character(col), split = ""), function(chars) {
# Convert them back to integer to compare with `diff`
# and verify the increment is always 1 or -1
diff <- diff(as.integer(chars))
!all(diff == 1) && !all(diff == -1)
})
# If not, replace those values with NA
col[NAs] <- NA
col
})
#> Warning in diff(as.integer(chars)): NAs introduced by coercion
#> Warning in diff(as.integer(chars)): NAs introduced by coercion
#> ...
#> Warning in diff(as.integer(chars)): NAs introduced by coercion
df
#> id A B C D
#> 1 a 1 5 2 6
#> 2 a 2 6 3 5
#> 3 a NA NA 5 89
#> 4 a 67 9 NA 3
#> 5 a 8 8 NA 2
#> 6 b 9 1 2 9
#> 7 b 0 NA 7 NA
#> 8 b 6 9 6 12
#> 9 b 7 7 4 NA
#> 10 b 9 4 6 8
Created on 2020-03-31 by the reprex package (v0.3.0)
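The coercion warnings above appear to come from the non-numeric id column being passed through as.integer. Here is a sketch of the same diff idea restricted to numeric columns, assuming the values are non-negative whole numbers (digits_adjacent is a name I made up for illustration):

```r
id <- factor(rep(letters[1:2], each = 5))
A <- c(1, 2, NA, 67, 8, 9, 0, 6, 7, 9)
B <- c(5, 6, 31, 9, 8, 1, NA, 9, 7, 4)
C <- c(2, 3, 5, NA, NA, 2, 7, 6, 4, 6)
D <- c(6, 5, 89, 3, 2, 9, NA, 12, 69, 8)
df <- data.frame(id, A, B, C, D)

# TRUE when the digits of a value step by +1 throughout or by -1 throughout;
# single digits and NA count as "adjacent" so they are left untouched
digits_adjacent <- function(x) {
  vapply(x, function(v) {
    if (is.na(v)) return(TRUE)
    d <- diff(as.integer(strsplit(as.character(v), "")[[1]]))
    all(d == 1) || all(d == -1)
  }, logical(1))
}

# Only touch numeric columns, so the id factor never reaches as.integer
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], function(col) {
  col[!digits_adjacent(col)] <- NA
  col
})
df
```

Like the first and third answers, this keeps descending pairs such as 21; drop the all(d == -1) part if only ascending digits should survive.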

Fill column with prior nonmissing value, no ID

I'm trying to fill a missing ID column of a data frame as shown below. It's not blank in the first row it applies to and then blank until the next ID. I wrote ugly code to do this in a for loop, but wonder if there's a tidy-ier way to do this. Any suggestions?
Here's what I've got:
code data
1 A 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 B 11
12 12
13 13
14 14
15 15
16 C 16
17 17
18 18
19 19
20 20
I want:
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
Code I've got now:
# Create mock data frame
df <- data.frame(code = c("A", rep("", 9),
"B", rep("", 4),
"C", rep("", 4)),
data = 1:20)
# For loop over rows (BAD!)
for (i in seq(2, nrow(df))) {
df[i,]$code <- ifelse(df[i,]$code == "", df[i-1,]$code, df[i, ]$code)
}
There is a tidyr way to do it: the fill function. You also need to replace the zero-length strings with NA for this to work, which you can easily do using the mutate and na_if functions from dplyr.
library(dplyr)
library(tidyr)
df %>%
mutate(code = na_if(code,"")) %>%
fill(code)
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
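If you would rather avoid extra packages, here is a base R sketch of the same fill-down, assuming the first row always has a non-empty code (the name filled is just for illustration). cumsum over the "is filled" flag numbers the groups 1, 2, 3, and those numbers index the non-empty codes:

```r
df <- data.frame(code = c("A", rep("", 9),
                          "B", rep("", 4),
                          "C", rep("", 4)),
                 data = 1:20,
                 stringsAsFactors = FALSE)

# TRUE on the rows that carry a code
filled <- df$code != ""

# cumsum(filled) is 1 for the A block, 2 for the B block, 3 for the C block
df$code <- df$code[filled][cumsum(filled)]
df
```

If the first row could be empty, cumsum(filled) starts at 0 and those leading rows would be dropped from the index, so that case needs separate handling.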

How to remove columns with dplyr with NA in specific row?

This code removes all columns which contain at least one NA.
library(dplyr)
df %>%
select_if(~ !any(is.na(.)))
What do I need to modify if I only want to remove the columns that have NA in the eighth row (for my generated data below)?
set.seed(1234)
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
In base R, one can simply try:
df[,which(!is.na(df[8,]))]
Or as suggested by @RichScriven:
df[, !is.na(df[8,])]
# A B
# 1 1 11
# 2 2 12
# 3 3 13
# 4 4 NA
# 5 NA 15
# 6 6 16
# 7 7 17
# 8 8 18
# 9 9 19
# 10 10 20
You could do this:
df %>%
select_if(!is.na(.[8,]))
A B
1 1 11
2 2 12
3 3 13
4 4 NA
5 NA 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Another option is keep
library(purrr)
keep(df, ~ !(is.na(.x[8])))
# A B
#1 1 11
#2 2 12
#3 3 13
#4 4 NA
#5 NA 15
#6 6 16
#7 7 17
#8 8 18
#9 9 19
#10 10 20
Or with Filter from base R
Filter(function(x) !(is.na(x[8])), df)
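In dplyr 1.0.0 and later, select_if is superseded in favour of select() with where(). Here is a sketch of the equivalent call, using small hand-made data (not the sampled data from the question) chosen so that only column C is NA in row 8:

```r
library(dplyr)

# Hand-made illustration data: A and B have an NA elsewhere, C has its NA in row 8
df <- data.frame(A = c(1:4, NA, 6:10),
                 B = c(11:13, NA, 15:20),
                 C = c(21:27, NA, 29:30))

# Keep the columns whose eighth value is not NA
res <- df %>%
  select(where(~ !is.na(.x[8])))

names(res)
```

Columns with an NA in other rows (A and B here) are kept, which is the difference from the any(is.na(.)) version in the question.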
