I am new to R and would find some tips very helpful.
I have populated matrix X that has lists of rownames which are numeric.
These correspond to matrix (Y).
I would like to summate all the rows in matrix Y based on the rownames in Matrix X.
So X[,1] may contain a list of rownames which I want to extract the row sums of these particular rows in matrix Y.
I think where I'm having difficulty is where to put the rownames() in the statements - I've tried many different combinations using functions, with and if. Any guidance or tips would be very gratefully received. Thank you.
I have provided a simplified version of the problem below:
X Y
1 2 10 10 10
3 3 20 20 20
5 4 30 30 30
40 40 40
50 50 50
Z[1] (X[,1]) should equal [10+10+10]+[30+30+30]+[50+50+50]
Z[2] (X[,2]) should equal [20+20+20]+[30+30+30]+[40+40+40]
Z should be a vector of sums of Y's rows depending on the column of X's row name values.
You can achieve this as follows:
x <- data.frame(x)
sapply(x, function(r) sum(y[r, ]))
Output is:
X1 X2
270 270
Alternatively, you can name columns of matrix x and supply them to sapply. In this case, I went with easy conversion of x to data frame.
A solution based on data.table and reshape2 packages:
library(data.table)
library(reshape2)
X <- matrix(c(1,3,5,2,3,4), nrow = 3, ncol = 2)
Y <- 10*matrix(rep(1:5, each = 3), nrow = 5, byrow = TRUE)
# Convert to data.table
X.DT <- data.table(X)
Y.DT <- data.table(Y)
Z.DT <-
# First melt the X to get the column names as grouping 'variable'
# and the numeric values in 'value'
melt(X.DT, measure.vars = names(X.DT))[
# Sum the values of Y selected by the indicies stored in X
, .(Z = sum(Y.DT[value]))
, by = variable
]
Z.DT
Result looks like this:
variable Z
1: V1 270
2: V2 270
And if you need the result as a simple vector Z then you can do it like this:
Z <- Z.DT[,Z]
Z
[1] 270 270
For reference, the intermediary data.table that is returned by the melt function looks like this:
> melt(X.DT, measure.vars = names(X.DT))
variable value
1: V1 1
2: V1 3
3: V1 5
4: V2 2
5: V2 3
6: V2 4
Related
I have a dataframe (df1) and have calculated the deciles for each row using the following:
#create a function to calculate the deciles
decilefun <- function(x) as.integer(cut(x, unique(quantile(x, probs=0:10/10)), include.lowest=TRUE))
# convert df1 to matrix
mat1 <- as.matrix(df1)
#apply the function I created above to calculate deciles
df1_deciles <- apply(mat1, 1, decilefun)
#add the rownames back in
rownames(df1_deciles) <- row.names(df1)
#convert to dataframe
df1_deciles <- as.data.frame(df1_deciles)
str(df1_deciles) # to show what the data looks like
#'data.frame': 157 obs. of 3321 variables:
# $ Variable1 : int 10 10 4 4 5 8 8 8 6 3 ...
# $ Variable2 : int 8 3 9 7 2 8 9 5 8 2 ...
# $ Variable3 : int 8 4 7 7 2 9 10 3 8 3 ...
I have another dataframe (df2) with the same rownames (Variable1, Variable2,etc...) but different number of columns.
I would like to use the same decile cuts which were used for df1 on this second dataframe but I'm not sure how to do it. I am actually not even sure how to determine/export what the cuts where on the original data which resulted on the df1_deciles dataframe I created. What I mean by this is, how do I export an object which tells me what range of values for Variable1 on df1 were assigned to a decile value = 1 or a decile value = 2, and so on.
I do not want to use the 'decilefun' function I created on df2, but instead want to use the variability and range information from df1.
This is my first question on the platform so I hope it is clear and I hope I have provided enough information. I have tried to find answers on the platform but have not found one. I appreciate any help on this.
Using data.table:
##
# create an artificial dataset with the structure you describe
#
set.seed(1)
df1 <- data.frame(Variable.1=rnorm(1000), variable.2=runif(1000), variable.3=rgamma(1000, scale=10, shape=5))
df1 <- t(df1)
##
#
df2 <- data.frame(Variable.1=rnorm(1000, -1), variable.2=runif(1000), variable.3=rgamma(1000, scale=20, shape=5))
df2 <- t(df2)
##
# you start here
# assumes df1 and df2 have structure described in problem
# data in rows, not columns
#
library(data.table)
df1 <- as.data.table(t(df1)) # transpose: put data in columns
brks <- lapply(df1, quantile, probs=(0:10)/10, labels=FALSE) # list of deciles for each row in df1
df2 <- as.data.table(df2, keep.rownames = TRUE) # keep df2 data in rows: 1000 columns here
result <- df2[ # this does all the work
, .(value= unlist(.SD),
decile=cut(unlist(.SD), breaks=c(-Inf, brks[[rn]], +Inf), labels=c('below', names(brks[[rn]])[2:11], 'above'))
)
, by=.(rn)]
result[, .N, keyby=.(rn, decile)] # validate that result is reasonable
Applying deciles from one dataset to another has the nuance the some values in the new dataset might be outside the range of the original data. The test data here demonstrates this problem. Variable.1 in df2 has values lower than any in df1, and variable.3 in df2 has values larger than any in df1.
I have a data frame with two string variables, and would like to convert them to numeric values using a separate "key" data frame. The below example is simplified, but I need to be able to apply it to replace the contents of the V1 and V2 variables based on an arbitrary key that will not always be a=1, b=2 etc...
Example:
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters, 1:26)
I need to reference the first element of V1 against the key, replace with the according value (e.g. a = 1, b = 2, etc.), do the same for the second element, and then when done with V1 move on and do the same for V2.
I've been struggling to work out a solution using lapply() and sub() but keep getting stuck because I can't see a way to pass the sub() function more than a 1:1 comparison. Is there a different function I should be using?
Forgive me- I'm sure the solution must be simple but I'm quite new to R still.
Here are two approaches with base R to make it:
using sapply()
x[] <- with(key, sapply(x, function(v) values[match(v,letters)]))
or
x <- data.frame(with(key, sapply(x, function(v) values[match(v,letters)])))
using as.matrix (similar to the unlist() approach by #Ronak Shah)
x[] <- with(key, values[match(as.matrix(x),letters)])
You can create a lookup table with data.table and then apply the mapping along the columns of your data frame with apply:
library(data.table)
key <- data.table(letters = letters, value = 1:26, key = "letters")
apply(x, 2, function(x) key[x]$value)
>
V1 V2
1 y a
2 d u
3 g u
4 a j
5 b v
6 w n
7 k j
8 n g
9 r i
10 s o
You could unlist and match in base R
x[] <- key$values[match(unlist(x), key$letters)]
x
# V1 V2
#1 25 1
#2 4 21
#3 7 21
#4 1 10
#5 2 22
#6 23 14
#7 11 10
#8 14 7
#9 18 9
#10 19 15
Or using dplyr
library(dplyr)
x %>% mutate_all(~key$values[match(., key$letters)])
data
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters = letters, values = 1:26)
You could use apply with both row and column margins, e.g, as.data.frame(apply(x, c(1,2), function(l) key[key$letters == l,c(2)])).
If I split my data matrix into rows according to class labels in another vector y like this, the result is something with 'names' like this:
> X <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> y <- c(1,3,1,3)
> X_split <- split(as.data.frame(X),y)
$`1`
V1 V2
1 1 5
3 3 7
$`3`
V1 V2
2 2 6
4 4 8
I want to loop through the results and do some operations on each matrix, for example sum the elements or sum the columns. How do I access each matrix in a loop so I can that?
labels = names(X_split)
for (k in labels) {
# How do I get X_split[k] as a matrix?
sum_class = sum(X_split[k]) # Doesn't work
}
In fact, I don't really want to deal with dataframes and named arrays at all. Is there a way I can call split without as.data.frame and get a list of matrices or something similar?
To split without converting to a data frame
X_split <- list(X[c(1, 3), ], X[c(2, 4), ])
More generally, to write it in terms of a vector y of length nrow(X), indicating the group to which each row belongs, you can write this as
X_split <- lapply(unique(y), function(i) X[y == i, ])
To sum the results
X_sum <- lapply(X_split, sum)
# [[1]]
# [1] 16
# [[2]]
# [1] 20
(or use sapply if you want the result as a vector)
Another option is not to split in the first place and just sum per y. Here's a possible data.table approach
library(data.table)
as.data.table(X)[, sum(sapply(.SD, sum)), by = y]
# y V1
# 1: 1 16
# 2: 3 20
Pretty sure operating directly on the matrix is most efficient:
tapply(rowSums(X),y,sum)
# 1 3
# 16 20
I have a huge data frame. I am stuck with if function. Let me first present the simple example and then I lay down my problem:
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
Problem: I want to run if function which sums column z values based for row which has same y and a only if the second row of each group has corresponding z equals 1
I am sorry but I am quite new in R so not able to present any reasonable codes which I have done by my own.
Any help would be highly appreciated.
As mentioned, your problem isn't clearly stated.
Perhaps you are looking to do something like this:
x$new <- with(x, ave(z, y, a, FUN = function(k)
ifelse(k[2] == 1, sum(k), NA)))
x
# z y a new
# 1 0 2 1 3
# 2 1 2 1 3
# 3 2 2 1 3
# 4 3 3 2 NA
# 5 4 3 2 NA
# 6 5 3 2 NA
Here, I've created a new column "new" which sums the values of "z" grouped by "y" and "a", but only if the second value in the group is equal to 1.
Since you say your data frame is quite large, you might want to convert your data frame to a data.table object using the data.table package. You will likely find that the required operations are much faster if you have a great many rows. However, the construction of the code for your case is not straight forward with data.table.
If I understnad what you want to do (which is not entirely clear to me) you could try the following:
library(data.table)
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
xx <- as.data.table(x) # Make a data.table object
setkey(xx, z) # Make the z column a key
xx[1, sum(a)] # Sum all values in column a where the key z = 1
[1] 1
# Now try the other sum you mention
xx[, sum(z), by = list(z = y)] # A column sum over groups defined by z = y
z V1
1: 2 2
2: 3 3
sum(xx[, sum(z), by = list(z = y)][, V1]) # Summing over the sums for each group should do it
[1] 5
To create the sum over the column a where z = 1, I made the z column a key. The syntax xx[1, sum(a)] sums a where the key value (z value) is 1.
I can create groups with the data.table object with by, which is analogous to a SQL WHERE clause if you are familiar with SQL. However, the result is the sum of the column z for each of groups created. This may be inefficient if you have a great many possible matching values where z = y. The outer sum adds the values for each group in the sub-selected V1 column of the inner result.
If you are going to use data.table in a serious way study the informative vignettes available for that package.
M Dowle, T Short, S Lianoglou, A Srinivasan with contributions from R Saporta and E Antonyan (2014). data.table: Extensions of data.frame. R package version 1.9.2. http://CRAN.R-project.org/package=data.table
I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that replicates the values of weight for each time an element in x matches y such that it produces the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks
You want the match function.
match(y, x)
to return the indicies of the matches, the use that to build your new weight vector
weight[match(y, x)]
#Using plyr
library(plyr)
df<-as.data.frame(cbind(x,weight)) # converting to dataframe
df<-rename(df,c(x="y")) # rename x as y for joining dataframes
y<-as.data.frame(y) # converting to dataframe
mydata <- join(df, y, by = "y",type="right")
> mydata
y weight
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6