Loop function for comparing pairs of columns - r

I have a very large data set including 400 string and numeric variables. I want to compare each pair of consecutive columns: 3 & 4, 5 & 6, etc. That is, I compare the third variable (.x) with the fourth (.y), the fifth with the sixth, the seventh with the eighth, and so on, in the following way: if (.y) is NA, replace the NA with the value from the corresponding row of (.x). For example, if number.y is NA, we replace the NA with the corresponding value from number.x, which would be 5. Again, if day.y is NA, we replace the NA in day.y with the corresponding value from day.x, which would be 3. How can I write a loop function to do that?
A<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
B<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
number.x<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
number.y<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
day.x<-c(1,3,4,5,6,7,8,1,NA,3,5,3)
day.y<-c(4,5,6,7,8,7,8,1,2,3,5,NA)
school.x<-c("a","b","b","c","n","f","h","NA","F","G","z","h")
school.y<-c("a","b","b","c","m","g","h","NA","NA","G","H","T")
city.x<- c(1,2,3,7,5,8,7,5,6,7,5,1)
city.y<- c(1,2,3,5,5,7,7,NA,NA,3,4,5)
df<-data.frame(A,B,number.x,number.y,day.x,day.y,school.x,school.y,city.x,city.y)

This is a somewhat hacky approach to your question. It assumes the paired columns start at the third column, that every two adjacent columns (3 & 4, 5 & 6, ...) form a pair to be compared against one another, and that the literal string "NA" in the school columns should be treated as missing.
library(dplyr)
# pair up columns 3 & 4, 5 & 6, ... (A and B are left out, as in the question)
start_group <- seq(3, length(df), by = 2)
df2 <- data.frame(id = 1:nrow(df))
for(i in start_group){
  j <- i + 1
  dnames <- df[, c(i, j)] %>%
    names
  df_ <- data.frame(col1 = df[, i],
                    col2 = df[, j]) %>%
    # treat both real NA and the literal string "NA" as missing;
    # note that paste() coerces every column to character
    mutate(col1 = ifelse(is.na(col1) | col1 == "NA", col2 %>% paste, col1 %>% paste)) %>%
    mutate(col2 = ifelse(is.na(col2) | col2 == "NA", col1 %>% paste, col2 %>% paste))
  names(df_) <- dnames
  df2 <- cbind(df2, df_)
}
df2[, -1]
number.x number.y day.x day.y school.x school.y city.x city.y
1 1 3 1 4 a a 1 1
2 2 4 3 5 b b 2 2
3 3 5 4 6 b b 3 3
4 4 6 5 7 c c 7 5
5 5 1 6 8 n m 5 5
6 6 2 7 7 f g 8 7
7 7 7 8 8 h h 7 7
8 6 6 1 1 NA NA 5 5
9 7 7 2 2 F F 6 6
10 5 5 3 3 G G 7 3
11 5 5 5 5 z H 5 4
12 6 6 3 3 h T 1 5
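For comparison, the same pairwise fill can be written more compactly with dplyr::coalesce(). This is only a minimal sketch: it assumes the pairs start at column 3, that the school columns are character rather than factor (R >= 4.0 or stringsAsFactors = FALSE), and that only real NA values (not the literal "NA" strings) need filling; df3 is just a hypothetical name for the filled copy.
library(dplyr)
df3 <- df
for (i in seq(3, ncol(df), by = 2)) {
  x <- df[[i]]
  y <- df[[i + 1]]
  df3[[i]]     <- coalesce(x, y)  # take the .y value wherever .x is NA
  df3[[i + 1]] <- coalesce(y, x)  # take the .x value wherever .y is NA
}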

Consider the following base R solution. Essentially, it loops through the distinct list of column stems (number, day, school, city) and replaces NA values in the .x columns with the corresponding values from the .y columns, and vice versa. NOTE: the school columns require conversion from factor to character, and one of their rows has NA in both the .x and .y columns.
# CONVERT SCHOOL COLUMNS TO CHARACTER (NOTE: THE LITERAL NA VALUES BECOME "NA" STRINGS)
df[, c('school.x', 'school.y')] <-
  sapply(df[, c('school.x', 'school.y')], as.character)
# SET UP FINAL DF
finaldf <- df
# OBTAIN UNIQUE LIST OF COLUMN STEMS (W/O .x AND .y SUFFIXES), SKIPPING A AND B
distinctcols <- unique(gsub("[.][x]|[.][y]", "", names(df)[3:ncol(df)]))
# LOOP THROUGH COLUMN STEMS, REPLACING NA VALUES
for (col in distinctcols) {
  # REPLACE NA .x COLUMN VALUES WITH THE CORRESPONDING .y VALUE
  finaldf[is.na(finaldf[paste0(col, '.x')]) | finaldf[paste0(col, '.x')] == "NA",
          paste0(col, '.x')] <-
    finaldf[is.na(finaldf[paste0(col, '.x')]) | finaldf[paste0(col, '.x')] == "NA",
            paste0(col, '.y')]
  # REPLACE NA .y COLUMN VALUES WITH THE CORRESPONDING .x VALUE
  finaldf[is.na(finaldf[paste0(col, '.y')]) | finaldf[paste0(col, '.y')] == "NA",
          paste0(col, '.y')] <-
    finaldf[is.na(finaldf[paste0(col, '.y')]) | finaldf[paste0(col, '.y')] == "NA",
            paste0(col, '.x')]
}
OUTPUT
number.x number.y day.x day.y school.x school.y city.x city.y
1 1 3 1 4 a a 1 1
2 2 4 3 5 b b 2 2
3 3 5 4 6 b b 3 3
4 4 6 5 7 c c 7 5
5 5 1 6 8 n m 5 5
6 6 2 7 7 f g 8 7
7 7 7 8 8 h h 7 7
8 6 6 1 1 NA NA 5 5
9 7 7 2 2 F F 6 6
10 5 5 3 3 G G 7 3
11 5 5 5 5 z H 5 4
12 6 6 3 3 h T 1 5
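As a quick sanity check (not part of the original answer), you can count how many missing or "NA" cells remain in each pair column; only the school pair keeps one each, because that row was "NA" on both sides.
sapply(finaldf[3:ncol(finaldf)], function(col) sum(is.na(col) | col == "NA"))
# number.x number.y    day.x    day.y school.x school.y   city.x   city.y
#        0        0        0        0        1        1        0        0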

Related

How to remove rows with NAs from two dataframes based on NAs from one?

I am trying to remove from df2 the same rows that contain NA in df1.
eg.
df1
A
1 1
2 NA
3 7
4 NA
df2
A B C D
1 2 4 7 10
2 3 6 1 3
3 9 5 1 3
4 4 9 2 5
Intended outcome:
df1
A
1 1
3 7
df2
A B C D
1 2 4 7 10
3 9 5 1 3
I have already tried things along the lines of...
newdf <- df2[-which(rowSums(is.na(df1))),]
and
noNA <- function(x) { x[!rowSums(!is.na(df1)) == 1]}
NMR_6mos_noNA <- as.data.frame(lapply(df2, noNA))
or
noNA <- function(x) { x[,!is.na(df1)]}
newdf3 <- as.data.frame(lapply(df2, noNA))
We can use is.na to create a logical condition and use that to subset the rows of 'df1' and 'df2'.
i1 <- !is.na(df1$A)
df1[i1, , drop = FALSE]
# A
#1 1
#3 7
df2[i1,]
# A B C D
# 1 2 4 7 10
#3 9 5 1 3
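If df1 had several columns and a row should be dropped whenever any of them is NA, complete.cases() generalises the same idea; a small sketch under that assumption:
i2 <- complete.cases(df1)      # TRUE for rows of df1 with no NA in any column
df1[i2, , drop = FALSE]
df2[i2, ]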

Dynamic select expression in function [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
In to this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to be able to work with any number of columns. So it would also work if I had S1, S2, S3, S4 or if there was an additional category ie DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
  keep <- enquo(keep)
  build_exp <- function(x){
    paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
      parse_expr()
  }
  exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
  sel_fun <- function(num){
    df %>% select(!!keep,
                  !!!exps)
  }
  map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
  df %>% select(!!keep,
                !!sym(cnames[[1]]) := paste0(patterns[[1]], num),
                !!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. It loops through the 'patterns' and the corresponding new column names ('cnames') with map2, gathers into 'long' format, renames the 'val' column to the 'cnames' passed into the function, binds the columns (bind_cols), and selects the columns of interest.
stack_col <- function(dat, pat, cname, keep) {
  purrr::map2(pat, cname, ~
    dat %>%
      dplyr::select(keep, matches(.x)) %>%
      tidyr::gather(key, val, matches(.x)) %>%
      dplyr::select(-key) %>%
      dplyr::rename(!! .y := val)) %>%
    dplyr::bind_cols(.) %>%
    dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, reshaping with multiple patterns can be done with data.table::melt:
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
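In more recent tidyr versions (>= 1.0.0) the same multi-pattern reshape can also be written with pivot_longer() and a ".value" spec; a sketch that assumes every measure column name is letters followed by digits:
library(tidyr)
pivot_longer(df, cols = -obj,
             names_to = c(".value", "set"),
             names_pattern = "([A-Za-z]+)(\\d+)")
# returns columns obj, set, S, PR; rename S to Species afterwards if desired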
This solves your problem, although it does not fix your function.
The idea is to use gather and spread on the columns that start with the specified patterns. I therefore create a regex that matches those column names, first gather all of them, extract the group from each name, rename the groups using cnames, and finally spread the groups back into separate columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
  mutate(rownumber = 1:n()) %>%
  gather(new_variable, value, matches(complete_pattern)) %>%
  mutate(group = str_extract(new_variable, complete_pattern),
         group = str_replace_all(group, cnames),
         group_number = str_extract(new_variable, "\\d+")) %>%
  select(-new_variable) %>%
  spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d

How to assign a value to a column based on a column index

Given a data frame, I would like to assign a calculated value based on a given column index.
df <- data.frame(a = c(2,4,7,3,5,3), b = c(8,3,8,2,6,1))
> df
a b
1 2 8
2 4 3
3 7 8
4 3 2
5 5 6
6 3 1
max <- apply(df, 1, which.max)
> max
[1] 2 1 2 1 2 1
addition <- apply(df, 1, sum)
> addition
[1] 10 7 15 5 11 4
Then I need some operation, which I cannot figure out, that assigns the following result to df2:
> df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
I highly appreciate your ideas and your help. Thank you.
You can use cbind to access your selected columns for each row:
df2 = df
df2[cbind(1:nrow(df2),max)] = addition
df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
Here, cbind returns a matrix of 2 columns and 6 rows that we use to subset the dataframe using matrix subsetting.
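To see what that index matrix looks like before it is used, you can build and print it separately (purely illustrative; idx is a throwaway name):
idx <- cbind(row = 1:nrow(df), col = max)   # one (row, column) pair per row of df
idx
#      row col
# [1,]   1   2
# [2,]   2   1
# [3,]   3   2
# [4,]   4   1
# [5,]   5   2
# [6,]   6   1
df[idx]    # the cells that df2[cbind(1:nrow(df2), max)] <- addition overwrites
# [1] 8 4 8 3 6 3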
You can also use vectorised ifelse directly:
with(df, cbind.data.frame(a = ifelse(a > b, a + b, a), b = ifelse(a > b, b, a + b)));
# a b
#1 2 10
#2 7 3
#3 7 15
#4 5 2
#5 5 11
#6 4 1

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a & c) so that the rows with values in column b are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b' and the logical vector based on 'b', so that all NA elements of 'b' come last within each group. Then apply duplicated on 'a' and 'c' and keep only the non-duplicated rows.
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two data sets, one containing the values of column a that occur more than once and one containing those that occur only once, using table:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use the na.omit function on data set x to delete the rows with NA, and then rbind x and x1 to get the final data set.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
You can use dplyr to do this.
df %>% distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr
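If you want distinct() to keep the rows that have a value in column b (matching the output of the base R answer above), one option is to push the NA values of b to the end of each a/c group before deduplicating; a small sketch of that idea:
library(dplyr)
df %>%
  arrange(a, c, is.na(b)) %>%            # non-NA b values come first within each a/c group
  distinct(a, c, .keep_all = TRUE)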

How to merge two datasets by the different values in R?

I have two datasets and want to merge them. How do I add to the first dataset only the rows from the second that are not already in the first?
A row should only be added to the final dataset if its id does not already exist in the other dataset. An example dataset:
x = data.frame(id = c("a","c","d","g"),
value = c(1,3,4,7))
y = data.frame(id = c("b","c","d","e","f"),
value = c(5,6,8,9,7))
The merged dataset should look like (the order is not important):
a 1
b 5
c 3
d 4
e 9
f 7
g 7
Using !, %in% and rbind: keep all of x and append only the rows of y whose id is not already in x.
rbind(x, y[!y$id %in% x$id, ])
   id value
1   a     1
2   c     3
3   d     4
4   g     7
11  b     5
41  e     9
5   f     7
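The same idea in dplyr, as a sketch: anti_join() drops the rows of y whose id already appears in x, and bind_rows() appends the rest to x (convert the factor id columns to character first, as the next answer explains).
library(dplyr)
bind_rows(x, anti_join(y, x, by = "id"))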
For your example to work, you first need to ensure that id in each data.frame is directly comparable. Since they're factors, you need to ensure they have the same levels/labels; or you can just convert them to character.
# convert factors to character
x$id <- as.character(x$id)
y$id <- as.character(y$id)
# merge
z <- merge(x,y,by="id",all=TRUE)
# keep first value, if it exists
z$value <- ifelse(is.na(z$value.x),z$value.y,z$value.x)
# keep desired columns
z <- z[,c("id","value")]
z
# id value
# 1 a 1
# 2 b 5
# 3 c 3
# 4 d 4
# 5 e 9
# 6 f 7
# 7 g 7
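The keep-x-first merge can also be sketched with dplyr verbs, again assuming character id columns; coalesce() prefers x's value whenever both datasets have the id.
library(dplyr)
full_join(x, y, by = "id") %>%
  mutate(value = coalesce(value.x, value.y)) %>%   # keep x's value when both exist
  select(id, value) %>%
  arrange(id)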
You already answered your own question, but just didn't realize it right away. :)
> merge(x,y,all=TRUE)
id value
1 a 1
2 c 3
3 c 6
4 d 4
5 d 8
6 g 7
7 b 5
8 e 9
9 f 7
EDIT
I'm a bit dense here and I'm not sure what you're getting at, so I'll provide a shotgun approach: I merged the data.frames by id and copied values from x to y where y was missing. Take whichever column you need.
> x = data.frame(id = c("a","c","d","g"),
+ value = c(1,3,4,7))
> y = data.frame(id = c("b","c","d","e","f"),
+ value = c(5,6,8,9,7))
> xy <- merge(x, y, by = "id", all = TRUE)
> xy
id value.x value.y
1 a 1 NA
2 c 3 6
3 d 4 8
4 g 7 NA
5 b NA 5
6 e NA 9
7 f NA 7
> find.na <- is.na(xy[, "value.y"])
> xy$new.col <- xy[, "value.y"]
> xy[find.na, "new.col"] <- xy[find.na, "value.x"]
> xy
id value.x value.y new.col
1 a 1 NA 1
2 c 3 6 6
3 d 4 8 8
4 g 7 NA 7
5 b NA 5 5
6 e NA 9 9
7 f NA 7 7
> xy[order(as.character(xy$id)), ]
id value.x value.y new.col
1 a 1 NA 1
5 b NA 5 5
2 c 3 6 6
3 d 4 8 8
6 e NA 9 9
7 f NA 7 7
4 g 7 NA 7
