Deleting columns based on the value of a row - r

Given two data frames:
C1<-c(3,4,4,4,5)
C2<-c(3,7,3,4,5)
C3<-c(5,6,3,7,4)
DF<-data.frame(C1=C1,C2=C2,C3=C3)
DF
C1 C2 C3
1 3 3 5
2 4 7 6
3 4 3 3
4 4 4 7
5 5 5 4
and
V1<-c(3,2,2,4,5)
V2<-c(3,7,3,5,2)
V3<-c(5,2,5,7,5)
V4<-c(1,1,2,3,4)
V5<-c(1,2,6,7,5)
DF2<-data.frame(V1=V1,V2=V2,V3=V3,V4=V4,V5=V5)
DF2
V1 V2 V3 V4 V5
1 3 3 5 1 1
2 2 7 2 1 2
3 2 3 5 2 6
4 4 5 7 3 7
5 5 2 5 4 5
Looking at each equivalent row in both data frames, there is a relationship between the value in C3 and the number of columns I want to drop in that same row in DF2.
The relationship between the value in C3 and the number of columns in DF2 to drop looks like this:
If C3≥7 drop V5
If C3=6.0:6.9 drop V4 and up (so basically V5,V4)
If C3=5.0:5.9 drop V3 and up (so basically V5,V4,V3)
If C3=4.0:4.9 drop V2 and up (so basically V5,V4,V3,V2)
If C3≤3.9 drop entire row
For this example, based on the values of C3, I would want DF2 to look like this:
V1 V2 V3 V4 V5
1 3 3
2 2 7 2
4 4 5 7 3
5 5
I've tried to write a simple script to do this (I'm pretty new, so I like to keep things simple so I can see what's going on), but I'm throwing errors left and right, so I'd appreciate some advice on how to proceed.

I like kohske's answer, but if your rules for setting values to NA don't have a nice mathematical property, or you need to define them arbitrarily, this approach should give you that flexibility. First, define a function that returns the columns to drop based on your rules:
f <- function(x) {
  if (x >= 7) {
    out <- 5
  } else if (x >= 6.0) {
    out <- 4:5
  } else if (x >= 5.0) {
    out <- 3:5
  } else if (x >= 4.0) {
    out <- 2:5
  } else {
    out <- 1:5
  }
  return(out)
}
Next, create a list for the column indices to drop:
z <- lapply(DF$C3, f)
Finally, loop through each row setting the corresponding columns to NA:
for (j in seq_along(z)) {
  DF2[j, z[[j]]] <- NA
}
#-----
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
3 NA NA NA NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
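When the rules are ordered cut points like the ones in the question, the loop can also be avoided. The following is a minimal sketch rather than part of the answer above: it assumes the thresholds 4, 5, 6 and 7 from the question and uses base R's findInterval() and outer() to build a logical mask. The names first_drop and mask are made up for illustration.

```r
# Example data from the question (only C3 is needed from DF)
DF  <- data.frame(C3 = c(5, 6, 3, 7, 4))
DF2 <- data.frame(V1 = c(3, 2, 2, 4, 5),
                  V2 = c(3, 7, 3, 5, 2),
                  V3 = c(5, 2, 5, 7, 5),
                  V4 = c(1, 1, 2, 3, 4),
                  V5 = c(1, 2, 6, 7, 5))

# findInterval() counts how many thresholds each C3 value has reached:
# 0 for C3 < 4, 1 for [4, 5), 2 for [5, 6), 3 for [6, 7), 4 for C3 >= 7.
# Adding 1 gives the index of the first column to blank out.
first_drop <- findInterval(DF$C3, c(4, 5, 6, 7)) + 1

# mask[i, j] is TRUE when column j of row i should become NA
mask <- outer(first_drop, seq_along(DF2), "<=")
DF2[mask] <- NA
DF2
```

For the example data this should reproduce the table shown above; arbitrary, non-ordered rules still need a function like f.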

Perhaps the easiest way is something like this:
DF3 <- DF2
for (i in seq_len(nrow(DF3))) {
  DF3[i, seq_len(ncol(DF3)) >= DF[i, ]$C3 - 2] <- NA
}
which gives:
> DF3
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
3 NA NA NA NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
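The C3 - 2 shortcut works because, for the integer values in the example, the first column to blank is exactly C3 - 2 (5 gives 3, 6 gives 4, and so on). Below is a hedged sketch of how the same arithmetic could be extended to non-integer or out-of-range values, assuming the rules in the question mean "round C3 down, then clamp". The helper first_drop and the small DF/DF3 data are hypothetical, not part of the answer.

```r
# Hypothetical helper: index of the first column to set to NA.
# floor() maps e.g. 5.5 to 5; pmax()/pmin() clamp the result to 1..ncols so
# that C3 < 4 blanks the whole row and C3 >= 7 blanks only the last column.
first_drop <- function(c3, ncols = 5) {
  pmin(pmax(floor(c3) - 2, 1), ncols)
}

first_drop(c(5, 6, 3, 7, 4))  # 3 4 1 5 2, as in the example
first_drop(c(5.5, 8.2, 1))    # 3 5 1

# Made-up data with non-integer and out-of-range C3 values
DF  <- data.frame(C3 = c(5.5, 8.2, 1))
DF3 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9, V4 = 10:12, V5 = 13:15)

start <- first_drop(DF$C3, ncol(DF3))
for (i in seq_len(nrow(DF3))) {
  DF3[i, start[i]:ncol(DF3)] <- NA
}
DF3
```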

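A slight variation on kohske's answer using defined cut points. The intervals are closed on the left (right=FALSE) so that non-integer values such as 5.5 follow the stated rules, and the last bin catches everything from 7 upwards:
breaksx <- cut(DF$C3, c(-Inf,4,5,6,7,Inf), labels=FALSE, right=FALSE)
for (i in seq_len(nrow(DF2))) {
  DF2[i, breaksx[i]:ncol(DF2)] <- NA
}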
Result:
> DF2
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
3 NA NA NA NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
To remove the rows that are all NA:
DF2[apply(DF2,1,function(x) !all(is.na(x))),]
Result:
V1 V2 V3 V4 V5
1 3 3 NA NA NA
2 2 7 2 NA NA
4 4 5 7 3 NA
5 5 NA NA NA NA
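For larger data, the same filter can be sketched with rowSums(), which avoids calling an R function once per row. This is an alternative to the apply() call above, not part of the original answer; the small DF2 below is made-up illustration data.

```r
DF2 <- data.frame(V1 = c(3, 2, NA, 4),
                  V2 = c(3, 7, NA, NA),
                  V3 = c(NA, 2, NA, NA))

# !is.na(DF2) is a logical matrix; a row sum of 0 means every value is NA
keep <- rowSums(!is.na(DF2)) > 0
DF2[keep, ]
```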

Related

Create another data set if elements are the same

I have two data sets, A and B (shown below), and want to create a third data set called C based on this condition: if the corresponding elements of A and B are the same (matched), that value should go into C; if they do not match, the element should be NA/missing.
A
2 5 9 3
5 3 2 1
2 1 1 3
B
2 7 9 3
5 3 6 1
2 2 2 3
The expected C should look like:
2 NA 9 3
5 3 NA 1
2 NA NA 3
Both data sets have the same dimensions. Any suggestions, please?
`is.na<-`(A,!A==B)
V1 V2 V3 V4
1 2 NA 9 3
2 5 3 NA 1
3 2 NA NA 3
This should work for both data frames and matrices.
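The backtick form above calls the replacement function `is.na<-` directly. Here is a minimal sketch of what that call amounts to, written with ordinary assignment syntax. Matrices are used only to keep the example short, the values are the ones from the question, and C1/C2 are illustrative names.

```r
A <- matrix(c(2, 5, 9, 3,
              5, 3, 2, 1,
              2, 1, 1, 3), nrow = 3, byrow = TRUE)
B <- matrix(c(2, 7, 9, 3,
              5, 3, 6, 1,
              2, 2, 2, 3), nrow = 3, byrow = TRUE)

# Direct call of the replacement function, as in the answer.
# !A==B parses as !(A == B), which is the same mask as A != B.
C1 <- `is.na<-`(A, A != B)

# The same operation in the usual two-step form
C2 <- A
is.na(C2) <- A != B

identical(C1, C2)  # TRUE
C1
```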
If A and B are data frames:
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# 1 2 NA 9 3
# 2 5 3 NA 1
# 3 2 NA NA 3
If A and B are matrices:
A <- as.matrix(A)
B <- as.matrix(B)
C <- A
C[C != B] <- NA
C
# V1 V2 V3 V4
# [1,] 2 NA 9 3
# [2,] 5 3 NA 1
# [3,] 2 NA NA 3
DATA
A <- read.table(text = "2 5 9 3
5 3 2 1
2 1 1 3",
header = FALSE)
B <- read.table(text = "2 7 9 3
5 3 6 1
2 2 2 3",
header = FALSE)

Sort dataframe in a function

I am trying to create a function which takes a dataframe and the columns by which I want to sort as arguments. This is what I have come up with:
sortDf <- function(df, columns){
df <- df[order(df[,columns]),]
return(df)
}
This is my use case:
set.seed(24)
dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
df <- as.data.frame(dataset)
sortedDf <- sortDf(df, c('V1', 'V2'))
However, I get this as a result:
V1 V2 V3 V4 V5
3 1 1 5 3 4
5 1 5 2 5 2
NA NA NA NA NA NA
NA.1 NA NA NA NA NA
NA.2 NA NA NA NA NA
NA.3 NA NA NA NA NA
1 5 2 1 2 5
4 5 2 1 2 1
NA.4 NA NA NA NA NA
2 NA 4 NA 1 4
The data frame is kind of sorted, but where do the 'NA' rows come from and how can I remove them? What am I doing wrong? I want to sort in descending order. Thanks in advance.
We can create a different function:
f1 <- function(dat, cols){
dat[do.call(order, dat[cols]),]
}
f1(df, c("V1", "V2"))
# V1 V2 V3 V4 V5
#2 1 1 2 1 3
#1 1 5 3 5 NA
#5 3 1 1 NA 1
#4 3 4 4 3 NA
#3 4 4 4 NA 4
In the OP's code, the order is applied on a data.frame instead of a vector. It can be used either separately or within do.call i.e.
df[order(df$V1, df$V2),]
# V1 V2 V3 V4 V5
#2 1 1 2 1 3
#1 1 5 3 5 NA
#5 3 1 1 NA 1
#4 3 4 4 3 NA
#3 4 4 4 NA 4
gives the same result as f1 above. So either the columns can be listed individually (which gets unwieldy when there are many columns), or do.call can be used.
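The question also asks for a descending sort, which f1 does not cover. A minimal base R sketch, assuming every column should be sorted in the same direction: sort_df and its decreasing argument are illustrative names, and the small data frame is made up so the result does not depend on the random seed.

```r
sort_df <- function(dat, cols, decreasing = FALSE) {
  # unname() so that column names are never mistaken for arguments of order()
  keys <- unname(as.list(dat[cols]))
  dat[do.call(order, c(keys, list(decreasing = decreasing))), , drop = FALSE]
}

df <- data.frame(V1 = c(1, 3, 1, NA, 3),
                 V2 = c(5, 4, 1, 2, 1),
                 V3 = 1:5)

# Descending by V1, ties broken by V2 (also descending); NAs are placed last
sort_df(df, c("V1", "V2"), decreasing = TRUE)
```

With order()'s default na.last = TRUE, the row with a missing V1 ends up at the bottom in both directions.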
This can also be implemented using the devel version of dplyr (soon to be released as 0.6.0) with quosures. The input vector is converted to quosures (parse_quosures), which are then unquoted (!!!) and evaluated in arrange:
library(dplyr)
f2 <- function(dat, cols){
cols <- rlang::parse_quosures(paste(cols, collapse=";"))
dat %>%
arrange(!!! cols)
}
f2(df, c("V1", "V2"))
# V1 V2 V3 V4 V5
#1 1 1 2 1 3
#2 1 5 3 5 NA
#3 3 1 1 NA 1
#4 3 4 4 3 NA
#5 4 4 4 NA 4
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5))

Replace the rows in dataframe with condition

In relation to the question "Dynamically replace row in dataframe with vector":
I have a data.frame for example:
d <- read.table(text=' V1 V2 V3 V4 V5 V6 V7
1 1 a 2 3 4 9 6
2 1 b 2 2 4 5 NA
3 1 c 1 3 4 5 8
4 1 d 1 2 3 6 9
5 2 a 1 2 3 4 5
6 2 b 1 4 5 6 7
7 2 c 1 2 3 5 8
8 2 d 2 3 6 7 9', header=TRUE)
Now I want to take one row, for example the first one (1 a), and:
Get the min and max value from that row. In this case min=2 and max=9 (note that some values in between are missing; for example, there is no 5, 7, or 8 in that row).
Replace that row with the full sequence, missing values included, and extend it: the row will be longer than all the others, as it will go from 2 to 9 (2,3,4,5,6,7,8,9). The whole data.frame should then be automatically extended with NA columns for the other rows that are not as long as the one I replaced.
Now the following code does achieve this:
row.to.change <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add > 0) {
d <- cbind(d, replicate(num.add, rep(NA, nrow(d))))
} else if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
and finally renames the extended data.frame headers as the default ones:
d[row.to.change,c(-1, -2)] <- new.row
colnames(d) <- paste0("V", seq_len(ncol(d)))
Now: this works for the row that I specify in row.to.change, but how can I make it work for, say, all rows that have a 'b' in the second column? Something like "do this where d$V2 == 'b'"? Keep in mind that the data.frame may be 5000 rows long.
You have already solved it. Just wrap your code in a function and then apply it to each row of your data.
rtc=function(row.to.change){# <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
new.row
}
#d2=d
newr=lapply(1:nrow(d),rtc) # for the whole data
# for specific condition, like lines with "b" in V2 change to:
# newr=lapply(1:nrow(d),function(z)if(d$V2[z]=="b")rtc(z) else as.numeric(d[z,c(-1, -2)]))
mxl=max(sapply(newr,length))
newr=lapply(newr,function(z)if(length(z)<mxl)c(z,rep(NA,mxl-length(z))) else z)
if (ncol(d)-2 < mxl) {
d <- cbind(d, replicate(mxl-ncol(d)+2, rep(NA, nrow(d))))
}
d[,c(-1, -2)] <- do.call(rbind,newr)
colnames(d) <- paste0("V", seq_len(ncol(d)))
d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1 a 2 3 4 5 6 7 8 9 NA
2 1 b 2 3 4 5 NA NA NA NA NA
3 1 c 1 2 3 4 5 6 7 8 NA
4 1 d 1 2 3 4 5 6 7 8 9
5 2 a 1 2 3 4 5 NA NA NA NA
6 2 b 1 2 3 4 5 6 7 NA NA
7 2 c 1 2 3 4 5 6 7 8 NA
8 2 d 2 3 4 5 6 7 8 9 NA
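A more compact way to sketch the conditional version (only rows where V2 == "b" are expanded) is to treat the numeric part as a matrix and pad every row to a common width. This is an illustrative rewrite, assuming the first two columns are identifiers and the remaining ones are numeric. The names vals, target, rows, padded and out are made up here, and the data is a three-row subset of the question's example.

```r
d <- data.frame(V1 = c(1, 1, 2),
                V2 = c("a", "b", "b"),
                V3 = c(2, 2, 1),
                V4 = c(3, 2, 4),
                V5 = c(4, 4, 5),
                V6 = c(9, 5, 6),
                V7 = c(6, NA, 7))

vals   <- as.matrix(d[-(1:2)])   # numeric part only
target <- d$V2 == "b"            # rows to expand

# Expand the target rows to min:max, leave the others unchanged
rows <- lapply(seq_len(nrow(vals)), function(i) {
  x <- vals[i, ]
  if (target[i]) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE)) else unname(x)
})

# Pad every row with NA up to the longest one and rebuild the data frame
width  <- max(lengths(rows))
padded <- t(vapply(rows,
                   function(x) c(x, rep(NA_real_, width - length(x))),
                   numeric(width)))
out <- cbind(d[1:2], as.data.frame(padded))
names(out) <- paste0("V", seq_len(ncol(out)))
out
```

Building all rows first and binding once avoids growing d inside a loop, which matters more as the number of rows increases.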

r- how to shift a varying number of NAs from the bottom to the top of columns in a dataframe [duplicate]

For example, I'd like to take this:
dat <- as.data.frame(rbind(c(1:6), c(6,5,4,3,NA,1), c(1,NA,3,4,NA,6)))
dat
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 6 5 4 3 NA 1
3 1 NA 3 4 NA 6
And create this:
dat <- as.data.frame(rbind(c(1,NA,3,4,NA,6), c(6,2,4,3,NA,1), c(1,5,3,4,5,6)))
dat
V1 V2 V3 V4 V5 V6
1 1 NA 3 4 NA 6
2 6 2 4 3 NA 1
3 1 5 3 4 5 6
Try
dat[] <- apply(dat,2, function(x) c(x[is.na(x)], x[!is.na(x)]))
dat
# V1 V2 V3 V4 V5 V6
#1 1 NA 3 4 NA 6
#2 6 2 4 3 NA 1
#3 1 5 3 4 5 6
Or a better method would be
dat[] <- lapply(dat, function(x) c(x[is.na(x)], x[!is.na(x)]))
Or using data.table (suggested by @David Arenburg)
library(data.table)
setDT(dat)[, names(dat) := lapply(.SD, function(x)
c(x[is.na(x)], x[!is.na(x)]))]
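One reason to prefer lapply() over apply() here is that apply() first converts the data frame to a matrix, which forces all columns to a common type. The sketch below illustrates this on made-up mixed-type data. The helper nas_first is hypothetical, and uses order() on the logical !is.na(x) (a stable sort, so the non-missing values keep their relative order) as an alternative way of writing the same reordering.

```r
# Hypothetical helper: NAs first, other values in their original order
nas_first <- function(x) x[order(!is.na(x))]

dat <- data.frame(n = c(1, NA, 3),
                  s = c("a", "b", NA),
                  stringsAsFactors = FALSE)

# apply() works on as.matrix(dat), which is a character matrix here
m <- apply(dat, 2, nas_first)
is.character(m)

# lapply() handles each column separately, so column types are kept
fixed <- dat
fixed[] <- lapply(dat, nas_first)
str(fixed)
```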
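Something like the following moves the NAs to the end of each column instead (swap the two terms inside c() to move them to the top):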
move_NAs_to_end <- function(v) { c(v[!is.na(v)], v[is.na(v)]) }
apply(dat, 2, move_NAs_to_end)
V1 V2 V3 V4 V5 V6
[1,] 1 2 3 4 5 6
[2,] 6 5 4 3 NA 1
[3,] 1 NA 3 4 NA 6

Selecting values in a dataframe based on a priority list

I am new to R, so I am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need:
To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
Using the priority list, to take the difference between this maximum and the priority value (the first non-NA column in pl), ignoring NAs and returning the absolute difference
This is probably clearer with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first line the maximum is from column A (15) and the priority value is from column D (9), since E is NA. The answer I want should look like this.
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
# max() returns -Inf (with a warning) for a row that is all NA; this is reset to NA below
df$MAX <- apply(df[,pl], 1, max, na.rm = TRUE)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
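A variation that avoids the -Inf warning for all-NA rows is to handle the empty case explicitly inside one per-row function. This is a sketch under the assumption that the data frame matches the table in the question; row_stats and stats are illustrative names.

```r
pl <- c("E", "D", "A", "B")
df <- data.frame(A = c(15, 3, NA, 0, 1),
                 B = c(5, 2, NA, 1, 2),
                 C = c(20, NA, 3, 0, 3),
                 D = c(9, 5, NA, 7, NA),
                 E = c(NA, 1, NA, 8, NA),
                 F = c(6, 3, NA, NA, 1),
                 G = c(1, 2, NA, 6, 6))

# x holds one row's values in priority order
row_stats <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(c(MAX = NA_real_, MAX_PR = NA_real_))
  }
  # x[[1]] is the highest-priority non-NA value; the difference is never
  # negative because max(x) is at least as large as any element of x
  c(MAX = max(x), MAX_PR = max(x) - x[[1]])
}

stats <- t(apply(as.matrix(df[pl]), 1, row_stats))
df <- cbind(df, stats)
df[c("MAX", "MAX_PR")]
```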
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code does not give sensible results for a row that is all NA: max() returns -Inf with a warning in that case, and there is no check against it.
Here is a simple version. First I take only the pl columns; then, for each row, I remove the NAs and compute the maximum and its difference from the first remaining value (dat is the original data frame).
df <- dat[,pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1

Resources