reference x's column in R's apply function - r

I have a df like this:
a <- c(4,5,3,5,1)
b <- c(8,9,7,3,5)
c <- c(6,7,5,4,3)
df <- data.frame(rbind(a,b,c))
I want a new df, df2, containing the difference between the values in each cell in rows a and b and the value in row c in their respective columns.
df2 would look like this:
a <- c(-2,-2,-2,1,-2)
b <- c(2,2,2,-1,2)
df2 <- data.frame(rbind(a,b))
Here is where I'm getting stuck:
df2 <- data.frame(apply(df,c(1,2),function(x) x - df[nrow(df),the col index of x]))
How do I reference the column index of x? Is there something like JavaScript's this?

We can do this by replicating the third row so the lengths match, then subtracting it from the first two rows:
out <- df[c("a", "b"),] - df["c",][col(df[c("a", "b"),])]
identical(df2, out)
#[1] TRUE
Or explicitly using rep
df[c("a", "b"),] - rep(unlist(df["c",]), each = 2)
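As an alternative sketch (not from the answer above), base R's sweep() subtracts a per-column statistic without any manual replication. It assumes the same df as in the question and converts it to a matrix first; the names m and out are only illustrative.

```r
a <- c(4, 5, 3, 5, 1)
b <- c(8, 9, 7, 3, 5)
c <- c(6, 7, 5, 4, 3)
df <- data.frame(rbind(a, b, c))

# Work on a numeric matrix so that row "c" is a plain vector
m <- as.matrix(df)

# For every column (MARGIN = 2), subtract that column's value from row "c"
out <- sweep(m[c("a", "b"), ], 2, m["c", ])

# Back to a data frame, keeping the row names a and b
out_df <- data.frame(out)
out_df
```

sweep() uses subtraction by default, so no FUN argument is needed here.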

Related

In which column there is a value of a specific variable

I have this dataframe:
a <- c(2,5,90,77,56,65,85,75,12,24,52,32)
b <- c(45,78,98,55,63,12,23,38,75,68,99,73)
c <- c(77,85,3,22,4,69,86,39,78,36,96,11)
d <- c(52,68,4,25,79,120,97,20,7,19,37,67)
e <- c(14,73,91,87,94,38,1,685,47,102,666,74)
df <- data.frame(a,b,c,d,e)
and this variable:
bb <- 120
I need to know the column number of df that contains the value of the variable "bb". How can I do this?
Thx everyone!
We could use which with arr.ind = TRUE to extract the row/col index after creating a logical matrix. Then, extract the second column to get the column index
which(df == bb, arr.ind = TRUE)[,2]
col
4
If there are duplicate elements in the column for the value compared, wrap with unique to return the unique column index
unique(which(df == bb, arr.ind = TRUE)[,2])
[1] 4
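I think we could use grep. Note that grep treats bb as a regular expression and matches it against the text of each column, so it would also match a value such as 1200: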
grep(bb, df)
[1] 4
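If an exact match per column is wanted, a minimal sketch in base R could test each column separately. It assumes the df and bb from the question; has_bb and col_idx are illustrative names.

```r
a <- c(2,5,90,77,56,65,85,75,12,24,52,32)
b <- c(45,78,98,55,63,12,23,38,75,68,99,73)
c <- c(77,85,3,22,4,69,86,39,78,36,96,11)
d <- c(52,68,4,25,79,120,97,20,7,19,37,67)
e <- c(14,73,91,87,94,38,1,685,47,102,666,74)
df <- data.frame(a,b,c,d,e)
bb <- 120

# One TRUE/FALSE per column: does the column contain bb exactly?
has_bb <- vapply(df, function(column) any(column == bb), logical(1))

# Positions of the matching columns (the result also carries the column names)
col_idx <- which(has_bb)
col_idx
```

Because the comparison uses ==, a value like 1200 would not count as a match.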

extract highest and lowest values for columns in R, as well as row identifiers

Say I have some data of the following kind:
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame is gonna be of dimensions m by 10, where m is very likely to be more than 20. But for every column, I want only 20 valid values.
The only way I can think of doing this is doing it manually per column, using dplyr and arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
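# function to keep the lowest 10 and highest 10 values, setting the rest to NA
lowHigh <- function(x)
{
  test <- x
  # rank() gives each element's position in sorted order;
  # order() would give the sort permutation instead
  r <- rank(x, ties.method = "first")
  test[!(r <= 10 | r > (length(x) - 10))] <- NA
  test
}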
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows where there is value of interest
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
Please note that there is definitely some smarter way of doing it...
Here is the data matrix with the 10 highest and 10 lowest values in each column:
x<-apply(df,2,function(k) k[order(k,decreasing=TRUE)[c(1:10,(length(k)-9):length(k))]])
x is your 20 by 10 matrix.
Your row-name requirement cannot be met in this matrix: each column selects a different set of 20 rows, so a single set of 20 row names cannot be right for all 10 columns. Instead, here is the matching matrix of row indices:
x_roworder<-apply(df,2,function(k) order(k,decreasing=TRUE)[c(1:10,(length(k)-9):length(k))])
This gives you, for each column, the corresponding rows in the original data.
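To see the masking idea on a small scale, here is a sketch that keeps row names and blanks out everything outside the extremes of each column. The sizes (100 rows, 5 columns, 3 values kept at each end) and the names n_keep, keep and masked are assumptions made only for the illustration.

```r
set.seed(1)
df <- as.data.frame(matrix(rnorm(5 * 100, 1, 0.5), ncol = 5))
n_keep <- 3  # how many values to keep at each end of every column

# Logical matrix: TRUE where a value is among the n_keep lowest or highest
keep <- sapply(df, function(k) {
  r <- rank(k, ties.method = "first")
  r <= n_keep | r > length(k) - n_keep
})

# Blank out everything else, then drop rows with no retained value
masked <- as.matrix(df)
masked[!keep] <- NA
masked <- masked[rowSums(keep) > 0, , drop = FALSE]

dim(masked)
```

Each column of masked then holds exactly 2 * n_keep non-missing values, and the row names are those of the original df.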
I offer a couple of answers to this.
A base R implementation (I have used %>% from magrittr to make it easier to read):
ix = lapply(df, function(x) order(x)[-(1:(length(x)-20)+10)]) %>%
  unlist %>% unique %>% sort
df[ix,]
This abuses the fact that data frames are lists, finds the row id satisfying the condition for each column, then takes the unique ones in order as the row indices you want to keep. This should retain any row names attached to df
An alternative using dplyr (since you mentioned it), which if I remember correctly doesn't particularly like row names:
library(dplyr)
library(tidyr) # gather() comes from tidyr
# add id as a variable
df$id = 1:nrow(df) # or row names
df %>%
  gather("col", value, -id) %>%
  group_by(col) %>%
  filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
  ungroup %>%
  distinct(id) %>% # keep each id once, even if it qualifies in several columns
  left_join(df, by = "id")
Edited: To fix code alignment and make a neater filter
I'm not entirely sure what you're expecting for your return / output. But this will get you the appropriate indices
# example data
set.seed(41234L)
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract ID's for top 10 and bottom 10 values
l1 <- lapply(df[,2:11], function(x,y, n) {
  xy <- data.frame(x,y)
  xy <- xy[order(xy[,1]),]
  return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't zero out or set to NA the values in each row that fall outside the top or bottom 10 of their column. But it's unclear what your question is trying to do.
Note
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
Benchmark
library(microbenchmark)
microbenchmark(ira= {
  test2 <- apply(df[,2:11], 2, lowHigh);
  rownames(test2) <- rownames(df);
  finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex= {
  l1 <- lapply(df[,2:11], function(x,y, n) {
    xy <- data.frame(x,y)
    xy <- xy[order(xy[,1]),]
    return(xy[c(1:10, (n-9):n),2])
  }, y= df[,1], n = N);
  l2 <- unique(do.call("c", l1));
  df2 <- df[l2,]
}, times= 50L)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
  ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525    50   b
 alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280    50  a

R: Concatenated values in column B based on values in column A

QUESTION: Using R, how would you create values in column B prefixed with a constant "1" + n 0's where n is the value in each row in column A?
#R CODE EXAMPLE
df <- as.data.frame(1:3);colnames(df)[1] <- "A";
print(df);
# A
# 1
# 2
# 3
preFixedValue <- 1; repeatedValue <- 0;
#pseudo code: create values in column B with n 0's prefixed with 1
df <- cbind(df,paste(rep(c(preFixedValue,repeatedValue), times = c(1,df[1:nrow(df),])),collapse = ""));
#expected/desired result
# A B
# 1 10
# 2 100
# 3 1000
USE CASE: Real data contains hundreds of rows in column A with random integers, not just three sequential int's as shown in the code above.
Below is an example using Excel to demonstrate what I want to do in R.
The rowwise() function in dplyr lets you make variables from column values in each row.
require(dplyr)
df <- data.frame(A = 1:3, B = NA)
preFixedValue <- 1; repeatedValue <- 0;
df <- df %>%
  rowwise() %>%
  mutate(B = as.numeric(paste0(c(preFixedValue, rep(repeatedValue, A)), collapse = "")))
For maximum flexibility in choosing the prefixed and repeated values, and for a simple one-line syntax (width is the total length of the result, hence A + 1):
library(stringr)
df$B <- str_pad(preFixedValue, width = df$A + 1, pad = as.character(repeatedValue), side = "right")
Would something like this work?
B<-10^(df$A)
df<-cbind(df,B)
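A base R sketch of the same idea, assuming the result may stay a character column: strrep() repeats the padding value A times per row, and paste0() puts the prefix in front. The variable names follow the question.

```r
df <- data.frame(A = 1:3)
preFixedValue <- 1
repeatedValue <- 0

# strrep() is vectorised over the repeat count, so each row gets A copies of "0"
df$B <- paste0(preFixedValue, strrep(as.character(repeatedValue), df$A))
df
```

Unlike 10^A, this also works when the prefixed or repeated value is something other than 1 and 0; wrap the result in as.numeric() if numbers are needed.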

Filter data table by dynamic column name

Let's say I have a data.table with columns A, B and C.
I'd like to write a function that applies a filter (for example A>1), but "A" needs to be dynamic (the function's parameter): if I pass A, it does A>1; if I pass B, it does B>1, and so on (A and B always being column names, of course).
Example:
Let's say my data is below. I'd like to do "A==1" and have it return the green line, or do "B==1 & C==1" and have it return the blue line.
Can this be done?
Thanks
You can try
f1 <- function(dat, colName){dat[eval(as.name(colName))>1]}
setDT(df1)
f1(df1, 'A')
f1(df1, 'B')
If you need to make the value also dynamic
f2 <- function(dat, colName, value){dat[eval(as.name(colName))>value]}
f2(df1, 'A', 1)
f2(df1, 'A', 5)
data
set.seed(24)
df1 <- data.frame(A=sample(-5:10, 20, replace=TRUE),
B=rnorm(20), C=LETTERS[1:20], stringsAsFactors=FALSE)
Try:
dt = data.table(A=c(1,1,2,3,1), B=c(4,5,1,1,1))
f=function(dt, colName) dt[dt[[colName]]>1,]
#> f(dt, 'A')
# A B
#1: 2 1
#2: 3 1
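The [[ lookup used above generalises to other comparisons. Below is a minimal sketch with a hypothetical helper, filter_rows (not part of data.table or dplyr), written for a plain data.frame; the comparison operator is passed in as a function.

```r
# Hypothetical helper: filter rows by a column chosen at run time
filter_rows <- function(dat, colName, value, op = `>`) {
  keep <- op(dat[[colName]], value)   # look the column up by its name
  dat[which(keep), , drop = FALSE]    # which() drops rows where the test is NA
}

d <- data.frame(A = c(1, 1, 2, 3, 1), B = c(4, 5, 1, 1, 1))

filter_rows(d, "A", 1)         # rows where A > 1
filter_rows(d, "B", 1, `==`)   # rows where B == 1

# Conditions on several columns can be combined by chaining calls
both <- filter_rows(filter_rows(d, "A", 1, `==`), "B", 1, `==`)
both
```

Passing the operator as a function keeps the helper free of eval() and parse(), at the cost of supporting only one comparison per call.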
If your data is
a <- c(1:9)
b <- c(10:18)
# create a data.frame
df <- data.frame(a,b)
# or a data.table
dt <- data.table(a,b)
you can store your condition(s) in a variable x
x <- quote(a >= 3)
and filter the data.frame using dplyr, unquoting the stored expression with !! (passing x directly to [] won't work)
library(dplyr)
filter(df, !!x)
or using data.table as suggested by @Frank
library(data.table)
dt[eval(x),]
Why write a function? You can do this...
Specifically:
d.new=d[d$A>1,]
where d is the data frame, d$A is the variable, and d.new is the new data frame.
More generally:
data=d #data frame
variable=d$A #variable
minValue=1 #minimum value
d.new=data[variable>minValue,] #create new data frame (d.new) filtered by min value
To create a new column:
If you don't want to actually create a new dataframe but want to create an indicator variable you can use ifelse. This is most similar to coloring rows as shown in your example. Code below:
d$indicator1=ifelse(d$X1>0,1,0)

How can we rank rows of a matrix based on the mean of each column?

I want to rank each row of my data based on the mean of each column
Here you can find example data:
https://gist.github.com/anonymous/2c69
I calculate the mean of each column and the mean of each row with
C <- colMeans(data, na.rm = FALSE, dims = 1)
R <- rowMeans(data, na.rm = FALSE, dims = 1)
Then I want to divide each row mean by each column mean and somehow rank the results. Any ideas?
After reading the dataset with read.table, remove the first (character) column (data2 <- data[,-1]). Compute the row means and column means, expand each to the full row/column layout (e.g. rowMeans(...)[row(data2)]), divide, and reshape the result into a matrix "m1". To get the ranks of the rows within each column of "m1", apply rank to every column with dplyr (across() replaces the older mutate_each/funs).
data <- read.table('Nemo.txt', header=TRUE, stringsAsFactors=FALSE)
data2 <- data[,-1]
m1 <- rowMeans(data2, na.rm=FALSE, dims=1)[row(data2)]/colMeans(data2,
      na.rm=FALSE, dims=1)[col(data2)]
dim(m1) <- dim(data2)
library(dplyr)
d1 <- as.data.frame(m1)
d1 %>%
  mutate(across(everything(), ~ rank(.x, ties.method='min')))
But, suppose we need to get the aggregate rank of each row, (not sure if this is what you want), perhaps we can get the row means of "m1" and rank it.
rnk <- rank(rowMeans(m1))
head(rnk)
#[1] 1234 1557 1052 1176 575 290
Then sort the original data by rnk as follows (order() turns the ranks into row positions):
rankeddata <- data[order(rnk),]
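The row-mean over column-mean ratio can also be built in one step with outer(). The sketch below uses a small made-up matrix in place of the Nemo.txt data, so the values and the names ratio and ranked are illustrative only.

```r
# Toy matrix standing in for the numeric part of the data (made-up values)
data2 <- matrix(c(7, 8, 10,
                  1, 2, 3,
                  4, 5, 6), nrow = 3, byrow = TRUE)

# ratio[i, j] = mean of row i divided by mean of column j
ratio <- outer(rowMeans(data2), colMeans(data2), "/")

# Aggregate rank per row, then sort the rows from lowest to highest rank
rnk <- rank(rowMeans(ratio))
ranked <- data2[order(rnk), , drop = FALSE]
ranked
```

outer() produces the full matrix directly, so the dim(m1) <- dim(data2) step is not needed.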

Resources