R: selecting row values based on row range

I have a data frame (df) with 4 columns of values (V1 to V4) that I need to filter based on two other columns (max and min). My aim is to assign NA to the values that fall outside the range set by the max and min columns for each row, and then calculate the mean of the remaining values.
V1 V2 V3 V4 max min
1 3 6 8 7 5
23 30 5 17 30 16
The expected output would be:
V1 V2 V3 V4 max min mean
NA NA 6 NA 7 5 6
23 30 NA 17 30 16 35
So far, I can only do this by using the following script to assign NAs...
df$V1 <- ifelse(df$V1 > df$max | df$V1 < df$min, NA, df$V1)
df$V2 <- ifelse(df$V2 > df$max | df$V2 < df$min, NA, df$V2)
df$V3 <- ifelse(df$V3 > df$max | df$V3 < df$min, NA, df$V3)
df$V4 <- ifelse(df$V4 > df$max | df$V4 < df$min, NA, df$V4)
...and then the following to calculate the mean:
df$mean <- rowMeans(df[, 1:4], na.rm = TRUE)
The problem is that the number of columns in the real data will be much larger than 4 and this method seems to require far too much repetition. Is there a better way of doing this in R?
I have tried using data.table to subset the valid values and then use the apply function, without success:
df <- df[df[,1:4] <= df$max | df[,1:4] >= df$min, ]
apply(df[,1:4], 1, function(x) mean(x))
Thank you.

For instance, you could try the following, which works by melting your data first.
# getting your data:
df <- read.table(text="V1 V2 V3 V4 max min
1 3 6 8 7 5
23 30 5 17 30 16", header=T)
# melting the data:
library(reshape2)
df2 <- melt(df, id.vars = c("max", "min"))
df2
max min variable value
1 7 5 V1 1
2 30 16 V1 23
3 7 5 V2 3
4 30 16 V2 30
5 7 5 V3 6
6 30 16 V3 5
7 7 5 V4 8
8 30 16 V4 17
# I create a new vector with NAs, but you could easily just overwrite the values:
df2$val <- with(df2, ifelse(value > max | value < min, NA, value))
# Cast the data into the old form again.
df3 <- dcast(df2, max + min ~ variable, value.var = "val")
# calculate the rowMeans:
df3$mean <- rowMeans(df3[, 3:6], na.rm = TRUE)
# Doing some cosmetics here to get the same column ordering. Choose your preferred way of rearranging the columns, if required at all.
df3 <- df3[, c(paste0("V", 1:4),"max", "min", "mean") ]
df3
V1 V2 V3 V4 max min mean
1 NA NA 6 NA 7 5 6.00000
2 23 30 NA 17 30 16 23.33333
Note that the only difference is that the mean of the second row is lower. I am not sure how you got a value of 35 there.

Try the following, which loops apply over the four value columns and blanks out the entries outside the row-wise max (column 5) and min (column 6):
df <- read.table(header=TRUE, text="V1 V2 V3 V4 max min
1 3 6 8 7 5
23 30 5 17 30 16")
df.new <- apply(df[, 1:4], 2, function(x) ifelse(x > df[, 5] | x < df[, 6], NA, x))
df.new <- cbind(df.new, df[, 5:6])
df.new$mean <- rowMeans(df.new[1:4], na.rm = TRUE)
df.new

Here is a simple solution with a for loop to fill in the NAs and rowMeans to calculate the mean of each row.
# loop through rows and fill in NA for values outside of min/max
for(i in 1:nrow(df))
is.na(df[i, 1:4]) <- df[i, 1:4] < df[i, "min"] | df[i, 1:4] > df[i, "max"]
# calculate mean of each row
df$mean <- rowMeans(df[, 1:4], na.rm=TRUE)
this returns
df
V1 V2 V3 V4 max min mean
1 NA NA 6 NA 7 5 6.00000
2 23 30 NA 17 30 16 23.33333
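For completeness, a fully vectorised sketch of the same idea (assuming, as above, that the value columns are columns 1 to 4): comparing an n-by-4 matrix against a length-n vector recycles the vector down the rows, so each value is checked against its own row's limits.
vals <- as.matrix(df[, 1:4])                 # matrix of the value columns
vals[vals > df$max | vals < df$min] <- NA    # row-wise comparison via column-major recycling
df[, 1:4] <- vals
df$mean <- rowMeans(vals, na.rm = TRUE)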

Repeating blocks of rows in a data frame based on another value in the data frame

There are a number of questions here about repeating rows a prespecified number of times in R, but I can't find one to address the specific question I'm asking.
I have a dataframe of responses from a survey in which each respondent answers somewhere between 5 and 10 questions. As a toy example:
df <- data.frame(ID = rep(1:2, each = 5),
                 Response = sample(LETTERS[1:4], 10, replace = TRUE),
                 Weight = rep(c(2,3), each = 5))
> df
ID Response Weight
1 1 D 2
2 1 C 2
3 1 D 2
4 1 D 2
5 1 B 2
6 2 D 3
7 2 C 3
8 2 B 3
9 2 D 3
10 2 B 3
I would like to repeat respondent 1's answers twice, as a block, and then respondent 2's answers 3 times, as a block, and I want each block of responses to have a unique ID. In other words, I want the end result to look like this:
ID Response Weight
1 11 D 2
2 11 C 2
3 11 D 2
4 11 D 2
5 11 B 2
6 12 D 2
7 12 C 2
8 12 D 2
9 12 D 2
10 12 B 2
11 21 D 3
12 21 C 3
13 21 B 3
14 21 D 3
15 21 B 3
16 22 D 3
17 22 C 3
18 22 B 3
19 22 D 3
20 22 B 3
21 23 D 3
22 23 C 3
23 23 B 3
24 23 D 3
25 23 B 3
The way I'm doing this is currently really clunky, and, given that I have >3000 respondents in my dataset, is unbearably slow.
Here's my code:
df.expanded <- NULL
for(i in unique(df$ID)) {
  x <- df[df$ID == i, ]
  y <- x[rep(seq_len(nrow(x)), x$Weight), 1:3]
  y$order <- rep(1:max(x$Weight), nrow(x))
  y <- y[with(y, order(order)), ]
  y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
  df.expanded <- rbind(df.expanded, y)
}
Is there a faster way to do this?
There is an easier solution. I suppose you want to duplicate rows based on Weight as shown in your code.
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
Is this method faster? Let's see:
library(microbenchmark)
microbenchmark(
  m1 = {
    df.expanded <- NULL
    for(i in unique(df$ID)) {
      x <- df[df$ID == i, ]
      y <- x[rep(seq_len(nrow(x)), x$Weight), 1:3]
      y$order <- rep(1:max(x$Weight), nrow(x))
      y <- y[with(y, order(order)), ]
      y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
      df.expanded <- rbind(df.expanded, y)
    }
  },
  m2 = {
    df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
    df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
    # sort the rows
    df2 <- df2[order(df2$ID), ]
  }
)
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730 100
# m2 171.731 194.199 245.7246 214.3725 283.145 506.184 100
There might be other more efficient ways.
Another approach would be to use data.table.
Assuming you're starting with "DT" as your data.table, try:
library(data.table)
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
I haven't pasted the ID columns together, but instead, created a secondary column. That seems a little bit more flexible to me.
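If you do want a single pasted ID as in the question, a hypothetical follow-up (the newID name is just for illustration) could be:
out <- DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
out[, newID := paste0(ID, .id)]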
Data for testing. Change n to create a larger dataset to play with.
set.seed(1)
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
                 Response = sample(LETTERS[1:5], sum(weights), TRUE),
                 Weight = rep(weights, weights))
DT <- as.data.table(df)
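For completeness, here is a sketch of the same block expansion with tidyr (assuming a recent tidyr/dplyr; uncount repeats each row Weight times and records the copy number in an .id column):
library(dplyr)
library(tidyr)
df %>%
  uncount(Weight, .remove = FALSE, .id = "copy") %>%  # repeat each row Weight times
  arrange(ID, copy) %>%                               # group the copies into blocks
  mutate(newID = paste0(ID, copy))                    # e.g. 11, 12, 21, 22, 23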

How can I delete the columns that contain NA or whose variance equals 0?

I want to scale my data before doing a PCA, but unfortunately I found that some columns contain NA and that the variance of some columns equals 0. I want to delete these columns. This is an example of my data:
df <- data.frame( v1 = 1:10 , v2 = rep( 0 , 10 ) , v3 = sample( c( 1:3 , NA ) , 10 , repl = TRUE ), v4 = 1:10 )
I want to delete the v2 and v3 columns at the same time. How can I do that?
I know how to first delete the columns that contain NA and then delete the columns whose variance equals 0.
colsd <- apply(df, 2, sd)
df2 <- df[!is.na(colsd)]
colsd2 <- apply(df2, 2, sd)
df3 <- df2[!colsd2 == 0]
but it looks redundant. I just want to know whether I can implement this more efficiently, maybe in one line. Thank you for any response.
You can try something like:
> df[!sapply(df, var) %in% c(0, NA)]
v1 v4
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
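A base R alternative along the same lines, keeping only the columns with no NA and nonzero variance (a sketch; Filter applies the predicate column by column):
Filter(function(x) !anyNA(x) && var(x) != 0, df)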

R: colSums when not all columns are numeric

I have the following data frame
Type CA AR
alpha 1 5
beta 4 9
gamma 3 8
I want to get the column and row sums such that it looks like this:
Type CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
I am able to do rowSums (as shown above), I guess because they are all numeric.
colSums(df)
However, when I do colSums I get the error 'x must be numeric.' I realize that this is because the "Type" column is not numeric.
If I run the following code, trying to put the values into the 4th row (with only the 2nd through 4th columns summed)
df[4, ] = colSums(df[c(2:4)])
then I get an error that the replacement isn't the same size as the data.
Does anyone know how to work around this? I want to print the column sums for columns 2-4, and either leave the 1st column's total blank or print "Total" there.
Thanks in advance!!
Check out numcolwise() in the plyr package.
library(plyr)
df <- data.frame(
  Type = c("alpha", "beta", "gamma"),
  CA = c(1, 4, 3),
  AR = c(5, 9, 8)
)
numcolwise(sum)(df)
Result:
CA AR
1 8 22
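If you then want to attach those sums as a "Total" row, a hypothetical follow-up (assuming Type is a character column, as it is by default since R 4.0) would be:
totals <- numcolwise(sum)(df)           # one-row data frame of the numeric column sums
rbind(df, data.frame(Type = "Total", totals))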
Use a matrix:
m <- as.matrix(df[,-1])
rownames(m) <- df$Type
# CA AR
# alpha 1 5
# beta 4 9
# gamma 3 8
Then add margins:
addmargins(m,FUN=c(Total=sum),quiet=TRUE)
# CA AR Total
# alpha 1 5 6
# beta 4 9 13
# gamma 3 8 11
# Total 8 22 30
The simpler addmargins(m) also works, but defaults to labeling the margins with "Sum".
You are right, it is because the first column is not numeric.
Try to use the first column as rownames:
df <- data.frame(row.names = c("alpha", "beta", "gamma"), CA = c(1, 4, 3), AR = c(5, 9, 8))
df$Total <- rowSums(df)
df['Total',] <- colSums(df)
df
The output will be:
CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
If you need the word 'Type', just remove the rownames and add the column back:
Type <- rownames(df)
df <- data.frame(Type, df, row.names=NULL)
df
And it's output:
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
Use:
df$Total <- df$CA + df$AR
A more general solution:
data$Total <- Reduce('+',data[, sapply(data, is.numeric)])
EDIT: I realize I completely misunderstood the question. You are indeed looking for the sum of rows, and I gave the sum of columns.
To do rows instead:
data <- data.frame(x = 1:3, y = 4:6, z = as.character(letters[1:3]))
data$z <- as.character(data$z)
rbind(data,sapply(data, function(y) ifelse(test = is.numeric(y), Reduce('+',y), "Total")))
If you do not know which columns are numeric, but rather want the sums across rows then do this:
df$Total = rowSums( df[ sapply(df, is.numeric)] )
The is.numeric function will return a logical value which is valid for selecting columns and sapply will return the logical values as a vector.
To add a set of column totals and a grand total we need to rewind to the point where the dataset was created and prevent the "Type" column from being constructed as a factor:
dat <- read.table(text="Type CA AR
alpha 1 5
beta 4 9
gamma 3 8 ",stringsAsFactors=FALSE)
dat$Total = rowSums( dat[ sapply(dat, is.numeric)] )
rbind( dat, append(c(Type="Total"),
as.list(colSums( dat[ sapply(dat, is.numeric)] ))))
#----------
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
That's a data.frame:
> str( rbind( dat, append(c(Type="Total"), as.list(colSums( dat[ sapply(dat, is.numeric)] )))) )
'data.frame': 4 obs. of 4 variables:
$ Type : chr "alpha" "beta" "gamma" "Total"
$ CA : num 1 4 3 8
$ AR : num 5 9 8 22
$ Total: num 6 13 11 30
I think this should solve your problem:
x<-data.frame(type=c('alpha','beta','gama'), x=c(1,2,3), y=c(4,5,6))
x[,'Total'] <- rowSums(x[,c(2:3)])
x<-rbind(x,c(type = c('Total'), c(colSums(x[,c(2:4)]))))
library(tidyverse)
df <- data.frame(
  Type = c("alpha", "beta", "gamma"),
  CA = c(1, 4, 3),
  AR = c(5, 9, 8)
)
df2 <- colSums(df[, c("CA", "AR")])
# CA AR
# 8 22
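Since the tidyverse is loaded above anyway, here is a dplyr sketch of both totals (assuming dplyr >= 1.0 for across()):
df %>%
  mutate(Total = CA + AR) %>%                           # row totals
  bind_rows(summarise(., Type = "Total",
                      across(where(is.numeric), sum)))  # column totals as a final row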

Apply a correction factor to one column based on the value of a second column

Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B is less than 4, multiply A by 1
If B is at least 4 and less than 6, multiply A by 2
If B is 6 or greater, multiply A by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.
Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by @agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
library(microbenchmark)
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000
Similar to @Ferdinand's solution, but using transform:
transform(DF, newcol = ifelse(B < 4, A,
                              ifelse(B >= 6, 4*A, 2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28
I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf)) ])
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.
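For readers using dplyr, the same mapping can be written as a case_when sketch (case_when returns the value for the first condition that is TRUE, so the branch order matters):
library(dplyr)
DF %>% mutate(A = A * case_when(B < 4 ~ 1,
                                B < 6 ~ 2,
                                TRUE  ~ 4))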

Returning above and below rows of specific rows in r dataframe

Consider any dataframe
col1 col2 col3 col4
row.name11 A 23 x y
row.name12 A 29 x y
row.name13 B 17 x y
row.name14 A 77 x y
I have a list of rownames which I want to return from this dataframe. Let's say I have row.name12 and row.name13 in the list. I can easily return these rows from the dataframe, but I also want to return the 4 rows above and the 4 rows below them. That means I want to return everything from row.name8 to row.name17. I think it is similar to grep -A -B in the shell.
Possible solution: Is there any way to return the row number by row name? Because if I have the row number, I can easily subtract 4 from it and add 4 to it and return those rows.
Note: Here rownames are just examples. Rownames could be anything like RED, BLUE, BLACK, etc.
Try this:
extract.with.context <- function(x, rows, after = 0, before = 0) {
  match.idx <- which(rownames(x) %in% rows)
  span <- seq(from = -before, to = after)
  extend.idx <- c(outer(match.idx, span, `+`))
  extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx)
  extend.idx <- sort(unique(extend.idx))
  return(x[extend.idx, , drop = FALSE])
}
dat <- data.frame(x = 1:26, row.names = letters)
extract.with.context(dat, c("a", "b", "j", "y"), after = 3, before = 1)
# x
# a 1
# b 2
# c 3
# d 4
# e 5
# i 9
# j 10
# k 11
# l 12
# m 13
# x 24
# y 25
# z 26
Perhaps a combination of which() and %in% would help you:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1, 1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name14 A 77 x y
In the above, we are trying to identify which row names in "dat" are "row.name13" (using which()), and the + c(-1, 1) tells R to return the row before and the row after. If you wanted to include the row, you could do something like + c(-1:1).
To get the range of rows, switch the comma to a colon:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1:1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
Update
Matching a list is a little bit trickier, but without thinking about it too much, here is a possibility:
myRows <- c("row.name12", "row.name13")
rowRanges <- lapply(which(rownames(dat) %in% myRows), function(x) x + c(-1:1))
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 3 4
#
lapply(rowRanges, function(x) dat[x, ])
# [[1]]
# col1 col2 col3 col4
# row.name11 A 23 x y
# row.name12 A 29 x y
# row.name13 B 17 x y
#
# [[2]]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
This outputs a list of data.frames which might be handy since you might have duplicated rows (as there are in this example).
Update 2: Using grep if it is more appropriate
Here is a variation of your question, one which would be less convenient to solve using the which()...%in% approach.
set.seed(1)
dat1 <- data.frame(ID = 1:25, V1 = sample(100, 25, replace = TRUE))
rownames(dat1) <- paste("rowname", sample(apply(combn(LETTERS[1:4], 2),
                                                2, paste, collapse = ""),
                                          25, replace = TRUE),
                        sprintf("%02d", 1:25), sep = ".")
head(dat1)
# ID V1
# rowname.AD.01 1 27
# rowname.AB.02 2 38
# rowname.AD.03 3 58
# rowname.CD.04 4 91
# rowname.AD.05 5 21
# rowname.AD.06 6 90
Now, imagine you wanted to identify the rows with AB and AC, but you don't have a list of the numeric suffixes.
Here's a little function that can be used in such a scenario. It borrows a little from @Spacedman to make sure that the rows returned are within the range of the data (as per @flodel's suggestion).
getMyRows <- function(data, matches, range) {
  rowMatches = lapply(unlist(lapply(matches, function(x)
    grep(x, rownames(data)))), function(y) y + range)
  rowMatches = lapply(rowMatches, function(x) x[x > 0 & x <= nrow(data)])
  lapply(rowMatches, function(x) data[x, ])
}
You can use it as follows (but I won't print the results here). First, specify the dataset, then the pattern(s) you want matched, then the range (in this example, three rows before and four rows after).
getMyRows(dat1, c("AB", "AC"), -3:4)
Applying it to the earlier example of matching row.name12 and row.name13, you can use it as follows: getMyRows(dat, c(12, 13), -1:1).
You can also modify the function to make it more general (for example, to specify matching with a column instead of row names).
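Here is a hypothetical sketch of that generalisation (the function name and the col argument are illustrative, not from the answer above), matching patterns against a column instead of the row names:
getMyRowsByCol <- function(data, col, matches, range) {
  hits <- unlist(lapply(matches, function(x) grep(x, data[[col]])))   # rows matching any pattern
  rowMatches <- lapply(hits, function(y) y + range)                   # extend each hit by the range
  rowMatches <- lapply(rowMatches, function(x) x[x > 0 & x <= nrow(data)])  # clip to valid rows
  lapply(rowMatches, function(x) data[x, ])
}
# e.g. getMyRowsByCol(dat1, "ID", "^1$", -3:4)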
Create some sample data:
> dat=data.frame(col1=letters,col2=sample(26),col3=sample(letters))
> dat
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
...
Set our target vector (note I chose an edge case and overlapping cases), and find the matching rows:
> target=c("a","e","g","s")
> match = which(dat$col1 %in% target)
Create sequences from -2 to +2 of the matches (adjust for your needs) and merge:
> getThese = unique(as.vector(mapply(seq,match-2,match+2)))
> getThese
[1] -1 0 1 2 3 4 5 6 7 8 9 17 18 19 20 21
Fix the edge cases:
> getThese = getThese[getThese > 0 & getThese <= nrow(dat)]
> dat[getThese,]
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
4 d 22 d
5 e 2 j
6 f 9 l
7 g 1 w
8 h 21 n
9 i 17 p
17 q 18 a
18 r 10 m
19 s 24 o
20 t 13 e
21 u 3 k
>
Remember our targets were a, e, g and s. You've now got those plus two rows above and two rows below for each, with no duplicates.
If you are using row names, just create 'match' from those. I was using a column.
I'd write a bunch more tests using the testthat package if this were my problem.
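A minimal sketch of such a test, assuming the steps above were wrapped in a function extract_context(dat, targets, before, after) (the name is hypothetical):
library(testthat)
test_that("context never runs past the data", {
  res <- extract_context(dat, c("a", "z"), before = 2, after = 2)
  expect_true(all(rownames(res) %in% rownames(dat)))  # edge cases are clipped
})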
Another option would be to use filter. If stats::filter is masked, e.g. by dplyr::filter, you have to call it as stats::filter.
dat <- data.frame(x = seq_along(letters), row.names = letters)
i <- rownames(dat) %in% c("a", "b", "j", "y") #Get the matches
nAfter <- 3
nBefore <- 1
fi <- seq(-nBefore, nAfter)
n <- max(abs(fi))
fi <- seq(-n, n) %in% fi
dat[head(tail(filter(c(rep(FALSE, n), i, rep(FALSE, n)), fi), -n), -n) > 0,, drop = FALSE]
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#i 9
#j 10
#k 11
#l 12
#m 13
#x 24
#y 25
#z 26
I would simply proceed as follows:
dat[(grep("row.name12",row.names(dat))-4):(grep("row.name13",row.names(dat))+4),]
grep("row.name12",row.names(dat)) gives you the row number that have "row.name12" as name, so
(grep("row.name12",row.names(dat))-4):(grep("row.name13",row.names(dat))+4)
gives you a serie of row numbers ranging from the 4th row preceding the row named "row.name12" to the 4th row after the one named "row.name13".
