I have a df containing 3 variables, and I want to create an extra variable for each possible product combination.
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
I want to create a new df (output) containing the result of a*b, a*c, b*c.
output <- data.frame(d = test$a * test$b
, e = test$a * test$c
, f = test$b * test$c)
This is easy to do manually with a small number of columns, but beyond about 5 columns it becomes lengthy and error-prone, especially when column names contain prefixes, suffixes or codes.
It would be a bonus if I could also control how many columns are combined at a time. In the example above I only multiplied pairs of columns, but it would be great to set that as a parameter, for example to add an extra variable a*b*c if needed.
My initial idea was to use expand.grid() with the column names and then somehow look up the corresponding columns to compute each product, but I hope there's an easier way that I am not aware of.
You can use combn to create the combinations of column names taken 2 at a time, then multiply each pair to create the new columns.
cbind(test, do.call(cbind, combn(names(test), 2, function(x) {
setNames(data.frame(do.call(`*`, test[x])), paste0(x, collapse = '-'))
}, simplify = FALSE)))
#   a b c a-b a-c b-c
#1 0.4098568 -0.3514020 2.5508854 -0.1440245 1.045498 -0.8963863
#2 1.4066395 0.6693990 0.1858557 0.9416031 0.261432 0.1244116
#3 0.7150305 -1.1247699 2.8347166 -0.8042448 2.026909 -3.1884040
#4 0.8932950 1.6330398 0.3731903 1.4587864 0.333369 0.6094346
#5 -1.4895243 1.4124826 1.0092224 -2.1039271 -1.503261 1.4255091
#6 0.8239685 0.1347528 1.4274288 0.1110321 1.176156 0.1923501
#7 0.7803712 0.8685688 -0.5676055 0.6778060 -0.442943 -0.4930044
#8 -1.5760181 2.0014636 1.1844449 -3.1543428 -1.866707 2.3706233
#9 1.4414434 1.1134435 -1.4500410 1.6049658 -2.090152 -1.6145388
#10 0.3526583 -0.1238261 0.8949428 -0.0436683 0.315609 -0.1108172
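The question also asked for control over how many columns are multiplied at once. The call above relies on do.call(`*`, ...), which only takes two operands, so here is a minimal sketch of one way to generalise it with Reduce. The helper name col_products and its arguments m and sep are made up for illustration; they are not part of the original answer.

```r
# Sketch: products of every m-column combination.
# col_products, m and sep are hypothetical names chosen for this example.
col_products <- function(df, m = 2, sep = "_") {
  combos <- combn(names(df), m, simplify = FALSE)
  # Reduce() multiplies any number of columns element-wise
  out <- lapply(combos, function(cols) Reduce(`*`, df[cols]))
  names(out) <- vapply(combos, paste, character(1), collapse = sep)
  as.data.frame(out)
}

set.seed(1)
test <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
pairs   <- col_products(test, 2)  # columns a_b, a_c, b_c
triples <- col_products(test, 3)  # single column a_b_c
```

An underscore is used as the separator because as.data.frame() would rewrite a name such as a-b into a syntactically valid one by default.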
Could this one also be a solution? Ronak's solution is more elegant, though!
library(dplyr)
# your data
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
# new dataframe output
# use * for element-wise products; prod() would collapse everything to a single number
output <- test %>%
mutate(a_b = a * b,
a_c = a * c,
b_c = b * c
) %>%
select(-a, -b, -c)
I have a series of dataframes that are noncumulative. During the cleaning process I want to loop through each dataframe, check whether certain columns exist, and create them if they aren't present. I can't for the life of me figure out a method to do this. I am not package shy and prefer packages to base.
Any direction is much appreciated.
You can use this dummy data frame df, with colToAdd holding the names of the columns to check and add if they do not exist:
df <- data.frame(A = rnorm(5) , B = rnorm(5) , C = rnorm(5))
colToAdd <- c("B" , "D")
Then check each name: if the column already exists the function returns NULL, otherwise it returns the new column, e.g. rnorm(5).
add <- sapply(colToAdd , \(x) if(!(x %in% colnames(df))) rnorm(5))
data.frame(do.call(cbind , c(df , add)))
output
A B C D
1 1.5681665 -0.1767517 0.6658019 -0.8477818
2 -0.5814281 -1.0720196 0.5343765 -0.8259426
3 -0.5649507 -1.1552189 -0.8525945 1.0447395
4 1.2024881 -0.6584889 -0.1551638 0.5726059
5 0.7927576 0.5340098 -0.5139548 -0.7805733
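The sapply approach above can change shape when every requested column is missing, because sapply then simplifies its result to a matrix. A possible alternative, sketched here, is to compute the missing names with setdiff and add them one at a time. Filling the new columns with NA is an assumption for this example; any default could be substituted.

```r
set.seed(1)
df <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5))
colToAdd <- c("B", "D")

# Only the requested names that are not already columns of df
missing_cols <- setdiff(colToAdd, names(df))

# Assumption: new columns start out as NA; swap in any default you need.
# The loop does nothing when no column is missing.
for (nm in missing_cols) {
  df[[nm]] <- NA_real_
}

names(df)
```

Existing columns such as B are left untouched, and the same loop can be wrapped in lapply to process a list of data frames.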
I have managed to read in a data file and subset the 2 columns of info that I want to work with. I am now stuck because I need to split the data into chunks of varying sizes, apply a function (mean, sd) to each chunk, save the results and plot the sd from each. This is generally known as block averaging. Right now I have a data frame with 2 columns and 10005 rows. The head of it looks like this:
Frame CA
1 0.773
Is there an efficient way that I could subset pieces of the data from a:b so that I can dictate how the data is broken up by the "Frame" column? I have found really good answers on here but I am not sure what they mean fully or if they would work.
chunk <- function(x, n)
(mapply(function(a, b) (x[a:b]), seq.int(from=1, to=length(x), by=n),
pmin(seq.int(from=1, to=length(x), by=n)+(n-1),
length(x)), SIMPLIFY=FALSE))
I'm not sure if this is what you're looking for, but with a closure a data frame can be subset by arbitrary indices.
(If Frame can be subset by a:b, it is likely to be a sequence, so the subset can probably be made by row index.)
df <- data.frame(group = sample(c("a", "b"), 20, replace = TRUE),
val = rnorm(20))
# closure - returns a function that accepts from and to
subsetter <- function(from, to) {
function(x) {
x[from:to, ]
}
}
# from and to are specified
sub1 <- subsetter(2, 4)
sub2 <- subsetter(1, 5)
# each function returns rows from:to of the data it is given
sub1(df)
#group val
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
sub2(df)
# group val
#1 b 0.4780080
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
#5 b 0.4449554
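For the block-averaging part of the question, one possible sketch is to assign each row an integer block id and summarise by that id with tapply. The data frame dat is a random stand-in for the real Frame/CA data, and block_stats is a hypothetical helper name; neither comes from the original posts.

```r
set.seed(1)
# Hypothetical stand-in for the 2-column, 10005-row data in the question
dat <- data.frame(Frame = 1:10005, CA = runif(10005))

# block_stats is an illustrative helper, not an existing function
block_stats <- function(dat, block_size) {
  # Integer block id per row; the final block may be shorter than the rest
  block <- (seq_len(nrow(dat)) - 1) %/% block_size + 1
  data.frame(
    block = sort(unique(block)),
    mean  = as.vector(tapply(dat$CA, block, mean)),
    sd    = as.vector(tapply(dat$CA, block, sd))
  )
}

stats <- block_stats(dat, 1000)
nrow(stats)  # 11 blocks: ten of 1000 rows and one of 5 rows
```

Calling block_stats with several different block_size values gives the varying chunk sizes asked for, and stats$sd can then be passed to plot().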
Just learning dplyr (and R) and I do not understand why this fails or what the correct approach to this is. I am looking for a general explanation rather than something specific to this contrived dataset.
Assume I have 3 file sizes with multipliers and I'd like to combine them into a single numeric column.
require(dplyr)
m <- data.frame(
K = 1E3,
M = 1E6,
G = 1E9
)
s <- data.frame(
size = 1:3,
mult = c('K', 'M', 'G')
)
Now I want to multiply each size by its multiplier, so I tried:
mutate(s, total = size * m[[mult]])
#Error in .subset2(x, i, exact = exact) :
# recursive indexing failed at level 2
which throws an error. I also tried:
mutate(s, total = size * as.numeric(m[mult]))
#1 1 K 1e+06
#2 2 M 2e+09
#3 3 G 3e+03
which is worse than an error (wrong answer)!
I tried a lot of other permutations but could not find the answer.
Thanks in advance!
Edit:
(or should this be another question)
akrun's answer worked great and I thought I understood it, but if I add a row with a missing multiplier
s <- rbind(s, c(4, NA))
and then update the mutate to
mutate(s, total = size *
ifelse(is.na(mult), 1,
unlist(m[as.character(mult)])))
it falls apart again with an "undefined columns selected" error.
The 'mult' column is of class 'factor' (the default for strings in data.frame() before R 4.0). Convert it to 'character' to subset 'm', `unlist` the result, and then multiply by 'size'.
mutate(s, new= size*unlist(m[as.character(mult)]))
# size mult new
#1 1 K 1e+03
#2 2 M 2e+06
#3 3 G 3e+09
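To see why the original attempt gave a different result, look at how a 'factor' column indexes 'm': it uses the integer codes of the 'levels' (sorted alphabetically as G, K, M), not the labels.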
m[s$mult]
# M G K
#1 1e+06 1e+09 1000
We get the same order of output by using match between names(m) and levels(s$mult):
m[match(names(m), levels(s$mult))]
# M G K
#1 1e+06 1e+09 1000
So, this might be the reason why you got a different result.
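For the NA case raised in the question's edit, indexing the data frame m with an NA column name appears to be what triggers "undefined columns selected". A possible workaround, sketched below, is to flatten m into a named vector first, since name lookup on a vector returns NA instead of failing. Treating a missing multiplier as 1 is an assumption taken from the asker's ifelse, and stringsAsFactors = TRUE is set only to mimic the factor column discussed above.

```r
m <- data.frame(K = 1e3, M = 1e6, G = 1e9)
s <- data.frame(size = 1:4, mult = c("K", "M", "G", NA),
                stringsAsFactors = TRUE)

# Flatten the one-row data frame into a named numeric vector
lookup <- unlist(m)

# Index by label, not by factor code; an NA name yields NA, not an error
found <- unname(lookup[as.character(s$mult)])

# Assumption: a missing multiplier means "multiply by 1"
s$total <- s$size * ifelse(is.na(found), 1, found)
s$total  # 1000, 2e6, 3e9 and 4
```

The same expression can be used inside mutate(), provided lookup is created beforehand.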
If you don't mind changing the data structure of m, you could use
# change m to a table
m = as.data.frame(t(m))
m$mult = rownames(m)
colnames(m)[which(colnames(m) == "V1")] = "value"
# to avoid indexing
s %>%
inner_join(m) %>%
mutate(total = size*value) %>%
select(size, mult, total)
to keep things more dplyr based.
EDIT: Though it works, you may need to be a little careful about the data types of the columns being joined.
I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4], 5), y=1:20)
and the new values:
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only returns the first match, so this throws the error "number of items to replace is not a multiple of replacement length". Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper, but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Replacement value for each row of df (NA where there is no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Use the replacement where one exists, otherwise keep y
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Note that this is not a very robust solution. It depends on your exact data structure here (the repeating 'c', 'd' pattern, in the same order as in eds), but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]
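To illustrate the robustness concern, here is a minimal sketch of a match-based update that does not depend on the order of rows in either table. The rows of eds are deliberately reversed for the example; that reordering is an assumption added here, not part of the original data.

```r
df  <- data.frame(x = rep(letters[1:4], 5), y = 1:20)
eds <- data.frame(x = c("d", "c"), val = c(102, 101))  # deliberately reordered

# Row of eds that each row of df maps to (NA where there is no match)
hit <- match(df$x, eds$x)
has_match <- !is.na(hit)

# Overwrite only the matched rows, each with its own replacement value
df$y[has_match] <- eds$val[hit[has_match]]
head(df, 5)
```

Because every row of df looks up its own value, each 'c' still receives 101 and each 'd' receives 102 regardless of how eds is sorted.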
In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind, as it slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colnames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
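If the vectors already exist as separate variables, one way to get them into a named list is mget, which looks objects up by name and returns them keyed by those names. This is a minimal sketch; barHeight and the id column are hypothetical additions made only so the example has more than one vector.

```r
df1 <- data.frame(id = 1:4)
barWidth  <- c(0.5, 0.6, 0.7, 0.8)
barHeight <- c(10, 20, 30, 40)   # hypothetical second vector

# Collect the vectors once, keyed by their own variable names
newcols <- mget(c("barWidth", "barHeight"))

# A single assignment adds every column with its name intact
df1[names(newcols)] <- newcols
names(df1)
```

This avoids both cbind and a separate renaming step, since the list names become the column names.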
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1.