How to apply scale rule for many columns in new dataset - r

I have the following task:
a = data.frame(a= c(1,2,3,4,5,6)) # dataset
range01 <- function(x){(x-min(a$a))/(max(a$a)-min(a$a))} # rule for scale
b = data.frame(a = 6) # newdaset
lapply(b$a, range01) # range01 works on the new data because it uses min(a$a) and max(a$a) from the original dataset
But how can I apply this when I have many columns in my dataset, like below?
a = data.frame(a= c(1,2,3,4,5,6))
b = data.frame(b= c(1,2,3,3,2,1))
c = data.frame(c= c(6,2,4,4,5,6))
df = cbind(a,b,c)
df
new = data.frame(a = 1, b = 2, c = 3)
Of course I can write a rule for every variable:
range01a <- function(x){(x-min(df$a))/(max(df$a)-min(df$a))}
But that is very tedious. How can I make it more convenient?

You can redefine your scale function so that it takes two arguments: the values to be scaled and the scaler. Then use Map on the two data frames:
scale_custom <- function(x, scaler) (x - min(scaler)) / (max(scaler) - min(scaler))
Map(scale_custom, new, df)
#$a
#[1] 0
#$b
#[1] 0.5
#$c
#[1] 0.25
If you need the data frame as result:
as.data.frame(Map(scale_custom, new, df))
# a b c
#1 0 0.5 0.25
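Since Map() is base R shorthand for mapply(..., SIMPLIFY = FALSE), an equivalent one-liner is:
as.data.frame(mapply(scale_custom, new, df, SIMPLIFY = FALSE))
# a b c
#1 0 0.5 0.25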

You can exploit the fact that new and df have the same column names. This can be helpful if the columns of the two data frames are not in the same order.
sapply(names(new), function(x) (new[x]-min(df[x]))/(max(df[x])-min(df[x])))
#$a.a
#[1] 0
#$b.b
#[1] 0.5
#$c.c
#[1] 0.25
To put the result in a data frame:
data.frame(lapply(names(new), function(x) (new[x]-min(df[x]))/(max(df[x])-min(df[x]))))
# a b c
#1 0 0.5 0.25
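If you will be scaling new data repeatedly, it may be even more convenient to capture the per-column ranges of df once and reuse them. A minimal sketch (the names rng and scale_new are made up here for illustration):
rng <- lapply(df, range) # per-column c(min, max) taken from the original data
scale_new <- function(newdata, rng) {
  as.data.frame(Map(function(x, r) (x - r[1]) / (r[2] - r[1]), newdata, rng))
}
scale_new(new, rng)
# a b c
#1 0 0.5 0.25
Note that Map pairs newdata and rng by position, so the columns must be in the same order.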

Related

Backwards rollapply with zoo object

Suppose I have a zoo object:
> df <- data.frame(col1=c(1,2,3,4), col2=c("a","b","c","d"))
> v <- zoo(df, order.by = df$col2)
> v
col1 col2
a 1 a
b 2 b
c 3 c
d 4 d
I can calculate the rolling mean as:
> rollapply(v, 2, by.column = F, function(x) { mean(as.numeric(x[,"col1"])) })
a b c
1.5 2.5 3.5
How do I rollapply mean in DESCENDING order? (please no solutions where you just reverse the results AFTER applying the regular rollapply)
I would like my output to look like:
d c b
3.5 2.5 1.5
The oo in zoo stands for ordered observations, and such objects are always ordered by the index; however, what is shown in the question is not ordered by the index, so it cannot be a valid zoo object.
Also, the line starting v <- in the question is likely not what is wanted, since it asks for a mix of numeric and character data (the whole object gets coerced to character). Fixing that line and creating a zoo series with the order shown, we have:
library(zoo)
v <- read.zoo(df, index = "col2", FUN = c)
r <- rollapplyr(v, 2, mean)
fortify.zoo(r)[length(r):1, ]
giving:
Index r
3 d 3.5
2 c 2.5
1 b 1.5
Per G. Grothendieck:
rollapply(rev.zoo(v), 2, by.column = F, function(x) { mean(as.numeric(x[,"col1"])) })

How to use lapply to transform specific values in a list of dataframes

I'm looking for help to transform a for loop into an lapply or similar function.
I have a list of similar data.frames, each containing
an indicator column ('a')
a value column ('b')
I want to invert the values in column b for each data frame, but only for specific indicators. For example, invert all values in 'b' that have an indicator of 2 in column a.
Here are some sample data:
x = data.frame(a = c(1, 2, 3, 2), b = (seq(from = .1, to = 1, by = .25)))
y = data.frame(a = c(1, 2, 3, 2), b = (seq(from = 1, to = .1, by = -.25)))
my_list <- list(x = x, y = y)
my_list
$x
a b
1 1 0.10
2 2 0.35
3 3 0.60
4 2 0.85
$y
a b
1 1 1.00
2 2 0.75
3 3 0.50
4 2 0.25
My desired output looks like this:
my_list
$x
a b
1 1 0.10
2 2 0.65
3 3 0.60
4 2 0.15
$y
a b
1 1 1.00
2 2 0.25
3 3 0.50
4 2 0.75
I can achieve the desired output with the following for loop.
for (i in 1:length(my_list)){
  my_list[[i]][my_list[[i]]['a'] == 2, 'b'] <-
    1 - my_list[[i]][my_list[[i]]['a'] == 2, 'b']
}
BUT. When I try to roll this into lapply form like so:
invertfun <- function(inputDF){
  inputDF[inputDF['a'] == 2, 'b'] <- 1 - inputDF[inputDF['a'] == 2, 'b']
}
resultList <- lapply(X = my_list, FUN = invertfun)
I get a list with only the inverted values:
resultList
$x
[1] 0.65 0.15
$y
[1] 0.25 0.75
What am I missing here? I've tried to apply (pun intended) the insights from:
how to use lapply instead of a for loop, to perform a calculation on a list of dataframes in R
I'd appreciate any insights or alternative solutions. I'm trying to take my R skills to the next level, and the apply family of functions seems to be the key.
We could use lapply to loop over each element of the list and change the b column based on the value in the a column.
my_list[] <- lapply(my_list, function(x) transform(x, b = ifelse(a==2, 1-b, b)))
my_list
#[[1]]
# a b
#1 1 0.10
#2 2 0.65
#3 3 0.60
#4 2 0.15
#[[2]]
# a b
#1 1 1.00
#2 2 0.25
#3 3 0.50
#4 2 0.75
The same could be done using map from purrr
library(purrr)
map(my_list, function(x) transform(x, b = ifelse(a==2, 1-b, b)))
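purrr also accepts formula shorthand for the anonymous function, which is slightly shorter:
map(my_list, ~ transform(.x, b = ifelse(a == 2, 1 - b, b)))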
See Ronak's answer above for a fairly elegant solution using transform() or map(), but for those who are following in my footsteps, my original solution would work if I added a line in the custom function to return the full data frame like so:
invertfun <- function(inputDF){
  inputDF[inputDF['a'] == 2, 'b'] <- 1 - inputDF[inputDF['a'] == 2, 'b']
  return(inputDF)
}
resultList <- lapply(X = my_list, FUN = invertfun)
UPDATE - On further testing, this solution throws an Error in x[[jj]][iseq] <- vjj : replacement has length zero when the desired 'a' value doesn't exist in one of the data frames. So it is best not to go down this road; use the accepted answer above.
lapply is typically not the best way to iteratively modify a list. lapply generates a loop internally in any case, so it is usually easier to read if you do something more explicit:
for (i in seq_along(my_list)) {
  my_list[[i]] <- within(my_list[[i]], {
    b[a == 2] <- 1 - b[a == 2]
  })
}
If we replace within with with in the example above, we get the output from your initial solution, i.e. lapply(X = my_list, FUN = invertfun).
That is, instead of modifying the list in place the latter solutions replace the list elements with new vectors.
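To see that difference concretely, here is a tiny made-up example (not from the question's data):
d <- data.frame(a = c(1, 2), b = c(0.1, 0.2))
within(d, b[a == 2] <- 1 - b[a == 2]) # returns the whole data frame, with b = 0.1, 0.8
with(d, b[a == 2] <- 1 - b[a == 2])   # returns (invisibly) only the replacement value, 0.8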

Store R loop result and combine it with new result

I'm pretty new to R loops, so sorry if this question is too simple. I'm trying to write a loop to subset data. The codes are:
a <- sample(rep(1:5, 10), 10)
b <- sample(rep(1:5, 10), 10)
c <- data.frame(a, b)
s <- c(1,2)
for (i in s){
  x <- data.frame()
  x <- rbind(x, c[which(a == i), ])
}
The resulting x only includes the rows for a = 2. But when I dropped the assignment to x and used print() instead, it showed the rows for both a = 1 and a = 2. I don't know what's wrong with the loop. Thanks!!
You can avoid the for loop and subset the rows by matching the values of a1 against s1:
set.seed(1L)
a1 <- sample(rep(1:5, 10), 10)
b1 <- sample(rep(1:5, 10), 10)
c1 <- data.frame(a1, b1)
s1 <- c(1,2)
a1 %in% s1
# [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
c1[ a1 %in% s1, ]
# a1 b1
# 6 1 3
# 7 2 2
# 9 2 1
There are already good comments and an answer for this, but I still wanted to clarify a few points that may help the OP.
For loops are not very R-like, as loops are not very efficient in many cases. Even so, if you want to fix the problem in your loop, just modify it as follows:
# Calling seed will ensure same output from function like sample. This will
# generate consistent result in every attempt
set.seed(1)
a <- sample(rep(1:5, 10), 10)
b <- sample(rep(1:5, 10), 10)
c <- data.frame(a, b) # good to name it df
s <- c(1,2)
# Fix the for loop
x <- data.frame() # assign x outside the for loop
for (i in s){
  x <- rbind(x, c[which(a == i), ])
}
#Result
> x
# a b
#6 1 3
#7 2 2
#9 2 1
# R-like approach
> c[c$a %in% s,] # use the 'a' column of the 'c' data frame directly in the condition
# a b
#6 1 3
#7 2 2
#9 2 1
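As a side note, the same filter can be written with base subset(), which saves repeating the data frame name (a sketch using the objects above):
subset(c, a %in% s)
# a b
#6 1 3
#7 2 2
#9 2 1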

Get indices of repeated instances of elements of a vector in other vector (both very large)

I have two vectors, one (A) of about 100 million non-unique elements (integers), the other (B) of the 1 million unique values that occur in A. I am trying to get a list containing the indices of the repeated instances of each element of B in A.
A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3
# would result in this:
[[1]]
[1] 2 3 4 6 7
[[2]]
[1] 1 5 10
[[3]]
[1] 8 9
I first, naively, tried this:
b_indices <- lapply(B, function(b) which(A == b))
which is horribly inefficient and apparently would take years to complete.
The second thing I tried was to create a list of empty vectors, indexed with all elements of B, and then loop through A, appending the index to the corresponding vector for each element of A. Although technically O(n), I'm not sure about the cost of repeatedly appending elements. This approach would apparently take ~2-3 days, which is still too slow...
Is there anything that could work faster?
This is fast: order() returns the permutation that sorts A, and splitting that permutation by the sorted values A[A1] collects, for each value, exactly the indices at which it occurs.
A1 <- order(A, method = "radix")
split(A1, A[A1])
#$`1`
#[1] 2 3 4 6 7
#
#$`2`
#[1] 1 5 10
#
#$`3`
#[1] 8 9
B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)
system.time({
  A1 <- order(A, method = "radix")
  res <- split(A1, A[A1])
})
# user system elapsed
#8.650 1.056 9.704
data.table is arguably the most efficient way of dealing with Big Data in R, and it would even let you avoid having to use that 1-million-length vector altogether!
require(data.table)
a <- data.table(x=rep(c("a","b","c"),each=3))
a[ , list( yidx = list(.I) ) , by = x ]
x yidx
1: a 1,2,3
2: b 4,5,6
3: c 7,8,9
Using your example data:
a <- data.table(x=c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2))
a[ , list( yidx = list(.I) ) , by = x ]
x yidx
1: 2 1, 5,10
2: 1 2,3,4,6,7
3: 3 8,9
Add this to your benchmarks. I dare say it should be significantly faster than using the built-in functions, if you test it at scale. The bigger the data the better the relative performance of data.table in my experience.
In my benchmark it takes only about 46% as long as the order-based solution on my Debian laptop, and only 5% as long on my Windows laptop with 8GB RAM and a 2.x GHz CPU.
B <- seq_len(1e6)
set.seed(42)
A <- data.table(x = sample(B, 1e8, TRUE))
system.time({
  res <- A[ , list( yidx = list(.I) ) , by = x ]
})
user system elapsed
4.25 0.22 4.50
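If you need a plain list ordered by the values of B, like the order/split result, a possible follow-up (a sketch, reusing res from above):
res <- res[order(x)]             # data.table evaluates x inside res, ordering the groups by value
out <- setNames(res$yidx, res$x) # named list of index vectors, e.g. out[["1"]]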
We can also use dplyr
library(dplyr)
data_frame(A) %>%
  mutate(B = row_number()) %>%
  group_by(A) %>%
  summarise(B = list(B)) %>%
  .$B
#[[1]]
#[1] 2 3 4 6 7
#[[2]]
#[1] 1 5 10
#[[3]]
#[1] 8 9
On a smaller dataset of size 1e5, system.time gives:
# user system elapsed
# 0.01 0.00 0.02
but with the larger example shown in the other post it is slower. However, this is dplyr...
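Note that data_frame() has since been deprecated in favor of tibble(), so a present-day version of the same pipeline (a sketch) would be:
library(dplyr)
tibble(A = A) %>%
  mutate(B = row_number()) %>%
  group_by(A) %>%
  summarise(B = list(B)) %>%
  pull(B)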

How to find which elements of one set are in another set?

I have two sets: A with columns x, y, and B also with columns x, y.
I need to find the indices of the rows of A that are present in B (both x and y must match).
I have come up with a simple solution (see below), but this comparison runs inside a loop, and paste adds a lot of extra time.
B <- data.frame(x = sample(1:1000, 1000), y = sample(1:1000, 1000))
A <- B[sample(1:1000, 10),]
#change some elements
A$x[c(1,3,7,10)] <- A$x[c(1,3,7,10)] + 0.5
A$xy <- paste(A$x, A$y, sep='ZZZ')
B$xy <- paste(B$x, B$y, sep='ZZZ')
indx <- which(A$xy %in% B$xy)
indx
For example, for a single observation an alternative to paste is almost 3 times faster:
ind <- sample(1:1000, 1)
xx <- B$x[ind]
yy <- B$y[ind]
ind <- which(with(B, x==xx & y==yy))
# [1] 0.0160000324249268 seconds
xy <- paste(xx,'ZZZ',yy, sep='')
ind <- which(B$xy == xy)
# [1] 0.0469999313354492 seconds
How about using merge() to do the matching for you?
A$id <- seq_len(nrow(A))
sort(merge(A, B)$id)
# [1] 2 4 5 6 8 9
Edit:
Or, to get rid of two unnecessary sorts, use the sort= option to merge()
merge(A, B, sort=FALSE)$id
# [1] 2 4 5 6 8 9
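If merge() is still too slow at scale, a duplicated()-based alternative is worth trying. A sketch, assuming the rows of A are themselves unique (otherwise repeats within A would also be flagged):
both <- rbind(B[c("x", "y")], A[c("x", "y")])      # stack the key columns, B above A
indx <- which(duplicated(both)[-seq_len(nrow(B))]) # rows of A already seen in B
indx
# [1] 2 4 5 6 8 9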
