Sort dataframe in a function - r

I am trying to create a function which takes a dataframe and the columns by which I want to sort as arguments. This is what I have come up with:
sortDf <- function(df, columns){
df <- df[order(df[,columns]),]
return(df)
}
This is my use case:
set.seed(24)
dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
df <- as.data.frame(dataset)
sortedDf <- sortDf(df, c('V1', 'V2'))
However, I get this as a result:
V1 V2 V3 V4 V5
3 1 1 5 3 4
5 1 5 2 5 2
NA NA NA NA NA NA
NA.1 NA NA NA NA NA
NA.2 NA NA NA NA NA
NA.3 NA NA NA NA NA
1 5 2 1 2 5
4 5 2 1 2 1
NA.4 NA NA NA NA NA
2 NA 4 NA 1 4
The dataframe is kind of sorted, but where do the 'NA' rows come from and how can I remove them? What am I doing wrong? I also want to sort in descending order. Thanks in advance.

We can create a different function that passes each column to order as a separate argument via do.call:
f1 <- function(dat, cols){
dat[do.call(order, dat[cols]),]
}
f1(df, c("V1", "V2"))
# V1 V2 V3 V4 V5
#2 1 1 2 1 3
#1 1 5 3 5 NA
#5 3 1 1 NA 1
#4 3 4 4 3 NA
#3 4 4 4 NA 4
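In the OP's code, order is applied to a data.frame instead of to vectors. As the 10-row output shows, the two selected columns end up being treated as one vector of 10 values, so order returns indices up to 10, and the indices beyond the 5 rows of df produce the all-NA rows. The columns can instead be passed to order either individually or through do.call, i.e.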
df[order(df$V1, df$V2),]
# V1 V2 V3 V4 V5
#2 1 1 2 1 3
#1 1 5 3 5 NA
#5 3 1 1 NA 1
#4 3 4 4 3 NA
#3 4 4 4 NA 4
gives the same result as f1. So either the columns are listed individually (which becomes tedious when there are many columns) or do.call is used.
This can also be implemented with dplyr (0.7.0 and later) using tidy evaluation. The input character vector is converted to a list of symbols (rlang::syms), which is then unquoted and spliced (!!!) into arrange
library(dplyr)
f2 <- function(dat, cols){
  # character vector -> list of symbols
  cols <- rlang::syms(cols)
  dat %>%
    arrange(!!! cols)
}
f2(df, c("V1", "V2"))
# V1 V2 V3 V4 V5
#1 1 1 2 1 3
#2 1 5 3 5 NA
#3 3 1 1 NA 1
#4 3 4 4 3 NA
#5 4 4 4 NA 4
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5))
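The question also asks for a descending sort, which the answers above do not show. Here is a minimal sketch in base R. The names sort_df, decreasing and toy are hypothetical, introduced only for illustration. It passes the selected columns to order through do.call, as f1 does, and forwards decreasing and na.last so that rows with NA in the sort keys stay at the end.

```r
# Hypothetical helper: sort a data.frame by the named columns.
sort_df <- function(df, columns, decreasing = FALSE) {
  # unname() so that column names are never mistaken for order() arguments
  keys <- unname(as.list(df[columns]))
  ord <- do.call(order, c(keys, list(decreasing = decreasing, na.last = TRUE)))
  df[ord, , drop = FALSE]
}

# Small illustrative data set (not the one from the question)
toy <- data.frame(V1 = c(2, 1, NA, 2), V2 = c(1, 5, 3, 4), V3 = 1:4)
sort_df(toy, c("V1", "V2"))                     # ascending, NA rows last
sort_df(toy, c("V1", "V2"), decreasing = TRUE)  # descending, NA rows last
```

Because the result is indexed with row positions that all exist in df, no extra NA rows can appear.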

Related

Removing rows with all NA's, from data.frames within a list

Example Data:
df1 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(NA, NA, NA), c(4,6,7), c(4, 8, NA)))
df2 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)))
dfList <- list(df1,df2)
colnames <- c("A","B","C")
dfList[[1]]
V1 V2 V3
1 1 2 3
2 1 NA 4
3 NA NA NA
4 4 6 7
5 4 8 NA
dfList[[2]]
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
4 NA NA NA
5 4 8 NA
How do I remove the rows that are empty/have ALL values NA, within each of the data.frames in the list?
Desired outcome:
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
4 4 8 NA
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
4 4 8 NA
You can use lapply to iterate over the list and rowSums to drop rows with all NA values.
lapply(dfList, function(x) x[rowSums(!is.na(x)) != 0, ])
#[[1]]
# V1 V2 V3
#1 1 2 3
#2 1 NA 4
#4 4 6 7
#5 4 8 NA
#[[2]]
# V1 V2 V3
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#5 4 8 NA
Using the tidyverse (purrr::map) with janitor::remove_empty:
library(tidyverse)
library(janitor)
map(dfList, remove_empty, which = c("rows"))
[[1]]
V1 V2 V3
1 1 2 3
2 1 NA 4
4 4 6 7
5 4 8 NA
[[2]]
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
5 4 8 NA
Here is another solution with all()
lapply(dfList, function(d) d[!apply(is.na(d), 1, all),])

R First Row Value Meets Criteria

data = data.frame(STUDENT=c(1,2,3,4,5,6,7,8),
CAT=c(NA,NA,1,2,3,NA,NA,0),
DOG=c(NA,NA,2,3,2,NA,1,NA),
MOUSE=c(2,3,NA,NA,NA,NA,NA,NA),
WANT=c(2,3,2,2,3,NA,NA,NA))
I have 'data' and wish to create the 'WANT' variable: for each row, it takes the first non-NA value across CAT, DOG and MOUSE that is not equal to 1 or 0 and stores it in 'WANT'. The data above shows what I hope to get.
We can use coalesce after changing the values 0, 1 in the selected columns to NA, then bind the column with the original dataset
library(dplyr)
data %>%
transmute(across(CAT:MOUSE, ~ replace(., . %in% 0:1, NA))) %>%
transmute(WANT2 = coalesce(!!! .)) %>%
bind_cols(data, .)
# STUDENT CAT DOG MOUSE WANT WANT2
#1 1 NA NA 2 2 2
#2 2 NA NA 3 3 3
#3 3 1 2 NA 2 2
#4 4 2 3 NA 2 2
#5 5 3 2 NA 3 3
#6 6 NA NA NA NA NA
#7 7 NA 1 NA NA NA
#8 8 0 NA NA NA NA
Or use data.table with fcoalesce. Convert the data.frame to a data.table (setDT(data)), specify the columns of interest in .SDcols, loop over .SD to replace the values 0 and 1 with NA, apply fcoalesce, and assign (:=) the result to a new column 'WANT2'
library(data.table)
setDT(data)[, WANT2 := do.call(fcoalesce, lapply(.SD, function(x)
replace(x, x %in% 0:1, NA))), .SDcols = CAT:MOUSE]
Or with base R, we can use a vectorized option with row/column indexing to extract the first non-NA element after replacing the values 0 and 1 with NA
m1 <- !is.na(replace(data[2:4], data[2:4] == 1|data[2:4] == 0, NA))
data$WANT2 <- data[2:4][cbind(seq_len(nrow(m1)), max.col(m1, "first"))]
# rows with no valid value fall back to the first column, so mask 0 and 1 again
data$WANT2[data$WANT2 %in% 0:1] <- NA
Try this (taking the first match with [1] returns NA, without a warning, for rows that have no valid value):
data$Want2 <- apply(data[,-c(1,5)],1,function(x) x[which(!is.na(x) & x!=0 & x!=1)[1]])
STUDENT CAT DOG MOUSE WANT Want2
1 1 NA NA 2 2 2
2 2 NA NA 3 3 3
3 3 1 2 NA 2 2
4 4 2 3 NA 2 2
5 5 3 2 NA 3 3
6 6 NA NA NA NA NA
7 7 NA 1 NA NA NA
8 8 0 NA NA 0 NA
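The base R idea above (mask the unwanted values, then pick the first remaining column per row) can also be wrapped in a small function. This is only a sketch: first_valid, cols, invalid and pets are hypothetical names, and it assumes that all selected columns are numeric. Extracting from the masked matrix rather than from the original data means rows without any valid value come out as NA automatically.

```r
# Hypothetical helper: first value per row that is neither NA nor in `invalid`.
first_valid <- function(df, cols, invalid = c(0, 1)) {
  m <- as.matrix(df[cols])
  m[m %in% invalid] <- NA                      # mask unwanted values
  idx <- max.col(!is.na(m), ties.method = "first")  # first valid column per row
  m[cbind(seq_len(nrow(m)), idx)]              # all-NA rows yield NA
}

# Same columns as in the question
pets <- data.frame(STUDENT = 1:8,
                   CAT   = c(NA, NA, 1, 2, 3, NA, NA, 0),
                   DOG   = c(NA, NA, 2, 3, 2, NA, 1, NA),
                   MOUSE = c(2, 3, NA, NA, NA, NA, NA, NA))
pets$WANT2 <- first_valid(pets, c("CAT", "DOG", "MOUSE"))
```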

Iteratively shift variables in data

Some example of my data:
library(tidyverse)
set.seed(1234)
df <- tibble(
v1 = c(1:6),
v2 = rnorm(6, 5, 2) %>% round,
v3 = rnorm(6, 4, 2) %>% round,
v4 = rnorm(6, 4, 1) %>% round %>% lag(1),
v5 = rnorm(6, 6, 2) %>% round %>% lag(2),
v6 = rnorm(6, 5, 3) %>% round %>% lag(3),
v7 = rnorm(6, 5, 3) %>% round %>% lag(4))
v1 v2 v3 v4 v5 v6 v7
1 1 3 3 NA NA NA NA
2 2 6 3 3 NA NA NA
3 3 7 3 4 4 NA NA
4 4 0 2 5 11 3 NA
5 5 6 3 4 6 1 8
6 6 6 2 3 5 7 4
I want to shift the columns along the diagonal that separates the NA values from the filled data.
So, desired output looks like this:
v1 v2 v3 v4 v5 v6 v7
1 NA NA 3 3 4 3 8
2 NA 3 3 4 11 1 4
3 1 6 3 5 6 7 NA
4 2 7 2 4 5 NA NA
5 3 0 3 4 NA NA NA
6 4 6 2 NA NA NA NA
7 5 6 NA NA NA NA NA
8 6 NA NA NA NA NA NA
Relative to v3, each column is shifted down (columns to the left) or up (columns to the right) by 1, 2, 3, etc. rows.
I tried to achieve this inside dplyr::mutate_all(), but I failed to iterate over the columns with the lag() and lead() functions.
EDIT: following @wibeasley's advice, I came up with this:
df %>%
mutate(dummy1 = c(3:8)) %>%
gather("var", "val", -dummy1) %>%
mutate(
dummy2 = sub("v", "", var, fixed = TRUE),
dummy3 = dummy1 - as.numeric(dummy2) + 1) %>%
select(-dummy1, -dummy2) %>%
spread(var, val) %>%
slice(-c(1:4)) %>% select(-dummy3)
Looks ugly, but works.
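We can use lapply to process each column, moving the NA values to the end. Note that this keeps the original six rows rather than producing the eight-row layout shown in the question.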
df[] <- lapply(df, function(x) c(x[!is.na(x)], x[is.na(x)]))
df
# # A tibble: 6 x 7
# v1 v2 v3 v4 v5 v6 v7
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3 4 3 8
# 2 2 6 3 4 11 1 4
# 3 3 7 3 5 6 7 NA
# 4 4 0 2 4 5 NA NA
# 5 5 6 3 3 NA NA NA
# 6 6 6 2 NA NA NA NA
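To get the eight-row layout from the question, each column can be shifted by its distance from the anchor column and padded with NA. The following is a rough sketch in base R. shift_diag, anchor and small are hypothetical names, and it assumes the anchor is given as a column position (3 for v3 in the question).

```r
# Hypothetical helper: shift column j down by (anchor - j) rows,
# or up by (j - anchor) rows, padding with NA to a common length.
shift_diag <- function(df, anchor = 3) {
  n_out <- nrow(df) + anchor - 1          # room for the largest downward shift
  out <- lapply(seq_along(df), function(j) {
    x <- df[[j]]
    k <- anchor - j
    if (k >= 0) {
      x <- c(rep(NA, k), x)               # shift down: prepend k NAs
    } else {
      x <- x[-seq_len(-k)]                # shift up: drop the first -k values
    }
    length(x) <- n_out                    # pad the tail with NA
    x
  })
  as.data.frame(setNames(out, names(df)))
}

# Small illustrative example with the anchor at the second column
small <- data.frame(v1 = 1:3, v2 = c(4, 5, 6), v3 = c(NA, 7, 8))
shift_diag(small, anchor = 2)
```

For the data in the question this would be called as shift_diag(df, anchor = 3).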

Fill missing values with new data R-Python

I have two datasets, x and y:
> x
a index b
1 1 1 5
2 NA 2 6
3 2 3 NA
4 NA 4 9
> y
index a
1 2 100
2 4 101
>
I would like to fill the missing values of x with the values contained in y.
I have tried to use the merge function but the result is not what I want.
> merge(x,y, by = 'index', all=T)
index a.x b a.y
1 1 1 5 NA
2 2 NA 6 100
3 3 2 7 NA
4 4 NA 9 101
In the real problem there are additional limitations:
1 - y does not fill all the missing values
2 - x and y have in common more variables (so not only a and index)
EDIT : More realistic example
> x
a index b c
1 1 1 5 NA
2 NA 2 6 NA
3 2 3 NA 5
4 NA 4 9 NA
5 NA 5 10 6
> y
index a c
1 2 100 4
2 4 101 NA
>
A solution in either Python or R would be accepted.
I used your merge idea and did the following using dplyr. I am sure there will be better ways of doing this task.
index <- 1:5
a <- c(1, NA, 2, NA, NA)
b <- c(5,6,NA,9,10)
c <- c(NA,NA,5,NA,6)
ana <- data.frame(index, a,b,c, stringsAsFactors=F)
index <- c(2,4)
a <- c(100, 101)
c <- c(4, NA)
bob <- data.frame(index, a,c, stringsAsFactors=F)
> ana
index a b c
1 1 1 5 NA
2 2 NA 6 NA
3 3 2 NA 5
4 4 NA 9 NA
5 5 NA 10 6
> bob
index a c
1 2 100 4
2 4 101 NA
library(dplyr)
ana %>%
merge(., bob, by = "index", all = TRUE) %>%
mutate(a.x = ifelse(a.x %in% NA, a.y, a.x)) %>%
mutate(c.x = ifelse(c.x %in% NA, c.y, c.x))
index a.x b c.x a.y c.y
1 1 1 5 NA NA NA
2 2 100 6 4 100 4
3 3 2 NA 5 NA NA
4 4 101 9 NA 101 NA
5 5 NA 10 6 NA NA
I overwrote a.x (ana$a) with a.y (bob$a) using mutate, wherever a.x was NA. I did a similar thing for c.x (ana$c). If you remove a.y and c.y at the end, that will be the outcome you expect, I think.
Try:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
m1$b = x$b[match(m1$index, x$index)]
m1$c = x$c[match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
Or, if there are many other columns like b and c:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
for(nn in names(x)[3:4]) m1[,nn] = x[,nn][match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
If there are multiple columns to replace, you could try converting from wide to long form, as in the first two methods below, and replace in one step
m1 <- merge(x,y, by="index", all=TRUE)
m1L <- reshape(m1, idvar="index", varying=grep("\\.", colnames(m1)), direction="long", sep=".")
row.names(m1L) <- 1:nrow(m1L)
lst1 <- split(m1L, m1L$time)
indx <- is.na(lst1[[1]][,4:5])
lst1[[1]][,4:5][indx] <- lst1[[2]][,4:5][indx]
res <- lst1[[1]][,c(4,1,2,5)]
res
# a index b c
#1 1 1 5 NA
#2 100 2 6 4
#3 2 3 NA 5
#4 101 4 9 NA
#5 NA 5 10 6
Or you could use dplyr with tidyr
library(dplyr)
library(tidyr)
z <- left_join(x, y, by="index") %>%
gather(Var, Val, matches("\\.")) %>%
separate(Var, c("Var1", "Var2"))
indx1 <- which(is.na(z$Val) & z$Var2=="x")
z$Val[indx1] <- z$Val[indx1+nrow(z)/2]
z %>%
spread(Var1, Val) %>%
filter(Var2=="x") %>%
select(-Var2)
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
Or split the columns by matching names before the . and use sapply to replace the NA's.
indx <- grep("\\.", colnames(m1),value=TRUE)
res <- cbind(m1[!names(m1) %in% indx],
sapply(split(indx, gsub("\\..*", "", indx)), function(x) {
x1 <- m1[x]
indx1 <- is.na(x1[,1])
x1[,1][indx1] <- x1[,2][indx1]
x1[,1]} ))
res
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
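The merge-based approaches above can also be replaced by a direct lookup with match, which avoids the .x/.y suffixes altogether. This is a minimal sketch rather than a definitive implementation: fill_from and key are hypothetical names, and it assumes that the key column is unique in y. Every column shared by x and y (other than the key) is filled only where x is NA and y has a matching row.

```r
# Hypothetical helper: fill NAs in x from y, matching rows on `key`.
fill_from <- function(x, y, key = "index") {
  shared <- setdiff(intersect(names(x), names(y)), key)
  pos <- match(x[[key]], y[[key]])        # row of y for each row of x (or NA)
  for (col in shared) {
    miss <- is.na(x[[col]]) & !is.na(pos) # missing in x and present in y
    x[[col]][miss] <- y[[col]][pos[miss]]
  }
  x
}

# The "more realistic" example from the question
x <- data.frame(a = c(1, NA, 2, NA, NA), index = 1:5,
                b = c(5, 6, NA, 9, 10), c = c(NA, NA, 5, NA, 6))
y <- data.frame(index = c(2, 4), a = c(100, 101), c = c(4, NA))
filled <- fill_from(x, y)
```

Values that y does not provide (for example a in row 5) stay NA, and columns that only exist in x (b) are left untouched.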

Deleting columns based on the value of a row

Given two data frames:
C1<-c(3,4,4,4,5)
C2<-c(3,7,3,4,5)
C3<-c(5,6,3,7,4)
DF<-data.frame(C1=C1,C2=C2,C3=C3)
DF
C1 C2 C3
1 3 3 5
2 4 7 6
3 4 3 3
4 4 4 7
5 5 5 4
and
V1<-c(3,2,2,4,5)
V2<-c(3,7,3,5,2)
V3<-c(5,2,5,7,5)
V4<-c(1,1,2,3,4)
V5<-c(1,2,6,7,5)
DF2<-data.frame(V1=V1,V2=V2,V3=V3,V4=V4,V5=V5)
DF2
V1 V2 V3 V4 V5
1 3 3 5 1 1
2 2 7 2 1 2
3 2 3 5 2 6
4 4 5 7 3 7
5 5 2 5 4 5
Looking at each equivalent row in both data frames, there is a relationship between the value in C3 and the number of columns I want to drop in that same row in DF2.
The relationship between the value in C3 and the # of columns in DF2 to drop looks like this
If C3≥7 drop V5
If C3=6.0:6.9 drop V4 and up (so basically V5,V4)
If C3=5.0:5.9 drop V3 and up (so basically V5,V4,V3)
If C3=4.0:4.9 drop V2 and up (so basically V5,V4,V3,V2)
If C3≤3.9 drop entire row
For this example, based on the values of C3, I would want DF2 to look like this
V1 V2 V3 V4 V5
1 3 3
2 2 7 2
4 4 5 7 3
5 5
I've tried to write a simple script to do this (I'm pretty new, so I like to keep things simple so I can see what's going on), but I'm throwing errors left and right, so I'd appreciate some advice on how to proceed.
I like kohske's answer, but if your rules for setting to NA don't have a nice mathematical property to them or you need to define your rules arbitrarily, this approach should give you that flexibility. First, define a function that returns the columns to drop based on your rules:
f <- function(x) {
if(x >= 7){
out <- 5
}else if(x >= 6.0){
out <- 4:5
} else if( x >= 5.0){
out <- 3:5
} else if (x >= 4.0){
out <- 2:5
} else {
out <- 1:5
}
return(out)
}
Next, create a list for the column indices to drop:
z <- lapply(DF$C3, f)
Finally, loop through each row setting the corresponding columns to NA:
for(j in seq(length(z))){
DF2[j, z[[j]]] <- NA
}
#-----
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
3 NA NA NA NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
Perhaps the easiest way is something like:
DF3 <- DF2
for (i in seq_len(nrow(DF3))) {
DF3[i, seq_len(ncol(DF3)) >= DF[i, ]$C3 - 2] <- NA
}
DF3
then,
> DF3
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
3 NA NA NA NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
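A slight variation on kohske's answer using defined cut points:
# left-closed intervals [-Inf,4), [4,5), [5,6), [6,7), [7,Inf) give the first column to drop (1 to 5)
breaksx <- cut(DF$C3, c(-Inf,4,5,6,7,Inf), right=FALSE, labels=FALSE)
for (i in seq_len(nrow(DF2))) {
  DF2[i,breaksx[i]:ncol(DF2)] <- NA
}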
Result:
> DF2
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
3 NA NA NA NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
To remove the rows which are all NAs
DF2[apply(DF2,1,function(x) !all(is.na(x))),]
Result:
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
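The row-by-row loops above can also be written without a loop by building a logical mask over the whole of DF2. The following is a sketch under the rules stated in the question. first_drop, mask, masked and result are hypothetical names, and the data frames are re-created here so the block runs on its own. findInterval counts how many of the thresholds 4, 5, 6, 7 each C3 value has reached, which gives the first column to blank out.

```r
# Data from the question (only C3 is needed from DF)
DF  <- data.frame(C3 = c(5, 6, 3, 7, 4))
DF2 <- data.frame(V1 = c(3, 2, 2, 4, 5), V2 = c(3, 7, 3, 5, 2),
                  V3 = c(5, 2, 5, 7, 5), V4 = c(1, 1, 2, 3, 4),
                  V5 = c(1, 2, 6, 7, 5))

# C3 < 4 -> 1 (whole row), [4,5) -> 2, [5,6) -> 3, [6,7) -> 4, >= 7 -> 5
first_drop <- findInterval(DF$C3, c(4, 5, 6, 7)) + 1

# TRUE where the column index is at or beyond the row's first dropped column;
# first_drop is recycled down the columns, so it lines up with the rows
mask <- col(as.matrix(DF2)) >= first_drop

masked <- DF2
masked[mask] <- NA

# Drop rows that are now entirely NA
result <- masked[rowSums(!is.na(masked)) > 0, ]
```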

Resources