I have two data frames and I would like to run independent two-group t-tests on their rows (i.e. t.test(y1, y2), where y1 is a row in dataframe1 and y2 is the matching row in dataframe2).
What's the best way of accomplishing this?
EDIT:
I just found the indexing syntax dataframe1[i,] and dataframe2[i,]. This will work in a loop. Is that the best solution?
The approach you outlined is reasonable; just make sure to preallocate your storage vector. I'd double-check that you really want to compare the rows instead of the columns. In most datasets I work with, each row is a unit of observation and the columns hold the separate responses or variables of interest. Regardless, it's your data, so if that's what you need to do, here's an approach:
#Fake data
df1 <- data.frame(matrix(runif(100),10))
df2 <- data.frame(matrix(runif(100),10))
#Preallocate results
testresults <- vector("list", nrow(df1))
#For loop
for (j in seq_len(nrow(df1))){
testresults[[j]] <- t.test(unlist(df1[j,]), unlist(df2[j,])) #unlist turns the one-row data frame into a numeric vector
}
You now have a list that is as long as you have rows in df1. I would then recommend using lapply and sapply to easily extract things out of the list object.
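To illustrate the sapply step mentioned above, here is a minimal sketch that pulls the p-value and t statistic out of each test result. The names pvals, tstats and summary_df are made up for this example, and set.seed is only there to make the fake data reproducible.

```r
# Fake data, as in the answer above (seed added for reproducibility)
set.seed(1)
df1 <- data.frame(matrix(runif(100), 10))
df2 <- data.frame(matrix(runif(100), 10))

# Row-wise t-tests stored in a preallocated list
testresults <- vector("list", nrow(df1))
for (j in seq_len(nrow(df1))) {
  testresults[[j]] <- t.test(unlist(df1[j, ]), unlist(df2[j, ]))
}

# Each element is an "htest" object; extract one component per test
pvals  <- sapply(testresults, function(res) res$p.value)
tstats <- sapply(testresults, function(res) unname(res$statistic))

# Collect the pieces in one data frame, one line per original row
summary_df <- data.frame(row = seq_along(pvals), t = tstats, p = pvals)
```

Running names(testresults[[1]]) lists the other components (conf.int, estimate, and so on) that can be extracted the same way.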
It would make more sense to have your data stored as columns.
You can transpose a data.frame by
df1_t <- as.data.frame(t(df1))
df2_t <- as.data.frame(t(df2))
Then you can use mapply to cycle through the two data.frames a column at a time
t.test_results <- mapply(t.test, x = df1_t, y = df2_t, SIMPLIFY = FALSE)
Or you could use Map, which is a simple wrapper for mapply with SIMPLIFY = FALSE (thus saving keystrokes!)
t.test_results <- Map(t.test, x = df1_t, y = df2_t)
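As a sanity check, the following sketch compares the column-wise Map approach on the transposed data with a plain row-wise loop. The names res_map, p_map and p_loop are illustrative only.

```r
set.seed(42)
df1 <- data.frame(matrix(runif(100), 10))
df2 <- data.frame(matrix(runif(100), 10))

# Transpose so that each original row becomes a column
df1_t <- as.data.frame(t(df1))
df2_t <- as.data.frame(t(df2))

# Column-wise tests on the transposed data
res_map <- Map(t.test, x = df1_t, y = df2_t)
p_map   <- sapply(res_map, function(res) res$p.value)

# Row-wise tests on the original data, for comparison
p_loop <- sapply(seq_len(nrow(df1)), function(j) {
  t.test(unlist(df1[j, ]), unlist(df2[j, ]))$p.value
})

# Both routes should give the same p-values
same <- isTRUE(all.equal(unname(p_map), p_loop))
```

If same is TRUE, the two approaches are interchangeable for this data and the choice comes down to how the data is naturally stored.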
Related
I have a list of data frames with same column names where each dataframe corresponds to a month
June_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(100,200,250,450), Metric2=c(1000,2000,5000,6000))
July_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(140,250,125,400), Metric2=c(2000,3000,2000,3000))
Aug_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(200,150,250,600), Metric2=c(1500,2000,4000,2000))
Sep_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(500,500,1000,100), Metric2=c(500,4000,6000,8000))
lst1 <- list(Aug_2018,June_2018,July_2018,Sep_2018)
names(lst1) <- c("Aug_2018","June_2018","July_2018","Sep_2018")
I intend to create a new column in each of the data frames in the list named Percent_Change_Metric1 and Percent_Change_Metric2 by doing below calculation
for (i in names(lst1)){
lst1[[i]]$Percent_Change_Metric1 <- ((lst1[[i+1]]$Metric1-lst1[[i]]$Metric1)*100/lst1[[i]]$Metric1)
lst1[[i]]$Percent_Change_Metric2 <- ((lst1[[i+1]]$Metric2-lst1[[i]]$Metric2)*100/lst1[[i]]$Metric2)
}
However, the i in the for loop iterates over names(lst1), which are character strings, so lst1[[i+1]] won't work.
Also, the data frames in my list are in random order rather than ordered by month-year, so subtracting successive data frames' columns would not give the correct result.
Please advise:
1. How do I go about adding Percent_Change_Metric1 and Percent_Change_Metric2?
2. How do I choose the data frame corresponding to the next month to arrive at the correct Percent_Change?
Thanks for the guidance
Here is one option with base R
lst1[-length(lst1)] <- Map(function(x, y)
transform(y, Percent_Change_Metric1 = (x$Metric1 - Metric1) * 100/Metric1,
Percent_Change_Metric2 = (x$Metric2 - Metric2) * 100/Metric2),
lst1[-1], lst1[-length(lst1)])
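The Map call above pairs each data frame with the one that follows it in the list, so the list has to be in chronological order first. A minimal sketch of one way to do that, assuming every name has the form Month_Year with an English month name (mk is a hypothetical helper used only to shorten the example data):

```r
# Hypothetical helper to build the example data frames from the question
mk <- function(m1, m2) {
  data.frame(Features = c("abc", "def", "ghi", "jkl"), Metric1 = m1, Metric2 = m2)
}
lst1 <- list(
  Aug_2018  = mk(c(200, 150, 250, 600),  c(1500, 2000, 4000, 2000)),
  June_2018 = mk(c(100, 200, 250, 450),  c(1000, 2000, 5000, 6000)),
  July_2018 = mk(c(140, 250, 125, 400),  c(2000, 3000, 2000, 3000)),
  Sep_2018  = mk(c(500, 500, 1000, 100), c(500, 4000, 6000, 8000))
)

# Sort by year, then by month (first three letters matched against month.abb)
yr <- as.integer(sub(".*_", "", names(lst1)))
mo <- match(substr(names(lst1), 1, 3), month.abb)
lst1 <- lst1[order(yr, mo)]

# Percent change of each month relative to the following month
n <- length(lst1)
lst1[-n] <- Map(function(nxt, cur) {
  cur$Percent_Change_Metric1 <- (nxt$Metric1 - cur$Metric1) * 100 / cur$Metric1
  cur$Percent_Change_Metric2 <- (nxt$Metric2 - cur$Metric2) * 100 / cur$Metric2
  cur
}, lst1[-1], lst1[-n])
```

The last month (Sep_2018 here) has no successor, so it gets no Percent_Change columns. Matching on month.abb avoids depending on the locale, which date parsing of month names would.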
My data frame contains 22 columns: "DATE", "INDEX" and S1, S2, S3 ... S20. There are over 4322 rows. I want to calculate log returns and store the results in a data frame. That should give me 4321 rows.
I run this code, but I am sure there is a more elegant and much shorter way to do the calculation.
# count the sum of rows in order to make the following formula work appropriately - (n-1)
n <- nrow(df)
# calculating the log returns (natural logarithm), of INDEX and S1-20
LogRet_INDEX <- log(df$INDEX[2:n])-log(df$INDEX[1:(n-1)])
LogRet_S1 <- log(df$S1[2:n])-log(df$S1[1:(n-1)])
LogRet_S2 <- log(df$S2[2:n])-log(df$S2[1:(n-1)])
LogRet_S3 <- log(df$S3[2:n])-log(df$S3[1:(n-1)])
LogRet_S4 <- log(df$S4[2:n])-log(df$S4[1:(n-1)])
LogRet_S5 <- log(df$S5[2:n])-log(df$S5[1:(n-1)])
LogRet_S6 <- log(df$S6[2:n])-log(df$S6[1:(n-1)])
LogRet_S7 <- log(df$S7[2:n])-log(df$S7[1:(n-1)])
LogRet_S8 <- log(df$S8[2:n])-log(df$S8[1:(n-1)])
LogRet_S9 <- log(df$S9[2:n])-log(df$S9[1:(n-1)])
LogRet_S10 <- log(df$S10[2:n])-log(df$S10[1:(n-1)])
LogRet_S11 <- log(df$S11[2:n])-log(df$S11[1:(n-1)])
LogRet_S12 <- log(df$S12[2:n])-log(df$S12[1:(n-1)])
LogRet_S13 <- log(df$S13[2:n])-log(df$S13[1:(n-1)])
LogRet_S14 <- log(df$S14[2:n])-log(df$S14[1:(n-1)])
LogRet_S15 <- log(df$S15[2:n])-log(df$S15[1:(n-1)])
LogRet_S16 <- log(df$S16[2:n])-log(df$S16[1:(n-1)])
LogRet_S17 <- log(df$S17[2:n])-log(df$S17[1:(n-1)])
LogRet_S18 <- log(df$S18[2:n])-log(df$S18[1:(n-1)])
LogRet_S19 <- log(df$S19[2:n])-log(df$S19[1:(n-1)])
LogRet_S20 <- log(df$S20[2:n])-log(df$S20[1:(n-1)])
# adding the results from the previous calculation (log returns) to a data frame
LogRet_df <- data.frame(LogRet_INDEX, LogRet_S1, LogRet_S2, LogRet_S3, LogRet_S4, LogRet_S5, LogRet_S6, LogRet_S7, LogRet_S8, LogRet_S9, LogRet_S10, LogRet_S11, LogRet_S12, LogRet_S13, LogRet_S14, LogRet_S15, LogRet_S16, LogRet_S17, LogRet_S18, LogRet_S19, LogRet_S20)
Is there a possibility to make this code shorter? Maybe some kind of loop or using a for argument? Since I am quite new to R, I try to improve my knowledge.
Any kind of help is highly appreciated!
You can use sapply to apply a function to each column of the data.frame.
The code below does the following: 1) takes columns 2 to 22 from the data frame called df; 2) for each of these columns, takes the logarithm and then the difference between neighboring rows; 3) converts the result to a data.frame called df2.
df2 <- as.data.frame(sapply(df[2:22], function(x) diff(log(x))))
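A small worked example of the same idea, using made-up prices and only three series so the result is easy to check by hand. Selecting columns by name instead of by position, prefixing the names with LogRet_ and keeping the date are additions for illustration, not part of the one-liner above.

```r
# Made-up data with the same layout as in the question (fewer columns)
df <- data.frame(
  DATE  = as.Date("2020-01-01") + 0:4,
  INDEX = c(100, 101, 103, 102, 105),
  S1    = c(10, 11, 12, 11, 13),
  S2    = c(50, 49, 51, 52, 50)
)

# Every column except DATE holds a price series
price_cols <- setdiff(names(df), "DATE")

# diff(log(x)) is log(x[2:n]) - log(x[1:(n-1)]), i.e. the log return
LogRet_df <- as.data.frame(lapply(df[price_cols], function(x) diff(log(x))))
names(LogRet_df) <- paste0("LogRet_", price_cols)

# The first date has no return, so align with the remaining dates
LogRet_df$DATE <- df$DATE[-1]
```

With n rows of prices this gives n - 1 rows of returns, matching the 4322 to 4321 reduction described in the question.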
I have a 18-by-48 matrix.
Is there a way to save each of the 18 rows automatically in a separate variable (e.g., from r1 to r18) ?
I'd definitely advise against splitting a data.frame or matrix into its constituent rows. If I absolutely had to split the rows up, I'd put them in a list and operate from there.
If you desperately had to split it up, you could do something like this:
toy <- matrix(1:(18*48),18,48)
variables <- list()
for(i in 1:nrow(toy)){
variables[[paste0("variable", i)]] <- toy[i,]
}
list2env(variables, envir = .GlobalEnv)
I'd be inclined to stop after the for loop and avoid the list2env. But I think this should give you your result.
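For the list-only route, a minimal sketch might look like the following. The names rows and row_means are illustrative, and the r1 to r18 naming follows the question.

```r
toy <- matrix(1:(18 * 48), 18, 48)

# One list element per row, named r1 ... r18
rows <- lapply(seq_len(nrow(toy)), function(i) toy[i, ])
names(rows) <- paste0("r", seq_len(nrow(toy)))

# Individual rows are reachable by name without creating 18 variables
third_row <- rows$r3

# The list can be processed in one go, e.g. the mean of every row
row_means <- sapply(rows, mean)
```

Keeping the rows in one list means any later step can loop over them with lapply or sapply instead of referring to 18 separate variable names.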
I believe you can select a row r from your dataframe d by indexing without a column specified:
var <- d[r,]
Thus you can extract all of the rows into a variable by using
var <- d[seq_len(nrow(d)),]
where var[1,] is the first row, var[2,] the second, etc. (Note that length(d) on a data frame returns the number of columns, so use nrow(d) to count rows.) Not sure if this is exactly what you are looking for. Why would you want 18 different variables, one for each row?
Assuming your matrix is called mat:
result <- data.frame(t(mat))
colnames(result) <- paste("r", 1:18, sep="")
attach(result)
I've done a little bit of digging for this result but most of the questions on here have information in regards to the cbind function, and basic matrix concatenation. What I'm looking to do is a little more complicated.
Let's say, for example, I have an NxM matrix whose first column is a unique identifier for each of the rows (and luckily in this instance is sorted by that identifier). For reasons which are inconsequential to this inquiry, I'm splitting the rows of this matrix into (n_i)xM matrices such that the sum of n_i = N.
I'm intending to run separate analysis on each of these sub-matrices and then combine the data together again with the usage of the unique identifier.
An example:
Let's say I have matrix data which is 10xM. After my split, I'll receive matrices subdata1 and subdata2. If you were to look at the contents of the matrices:
data[,1] = 1:10
subdata1[,1] = c(1,3,4,6,7)
subdata2[,1] = c(2,5,8,9,10)
I then manipulate the columns of subdata1 and subdata2, but preserve the information in the first column. I would like to combine these matrices again such that finaldata[,1] = 1:10, where finaldata is the result of the combination.
I realize now that I could use rbind and then sort the matrix, but for large matrices that is very inefficient.
I know R has some great functions for data management. Is there a workaround for this problem?
I may not fully understand your question, but as an example of general use, I would typically convert the matrices to dataframes and then do something like this:
combi <- rbind(dataframe1, dataframe2)
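For reference, here is a minimal sketch of the rbind-then-sort route the question mentions, staying with matrices. The column names id and value and the numbers are made up for the example.

```r
# Two sub-matrices whose first column is the unique identifier
sub1 <- cbind(id = c(1, 3, 4, 6, 7),  value = c(10, 30, 40, 60, 70))
sub2 <- cbind(id = c(2, 5, 8, 9, 10), value = c(20, 50, 80, 90, 100))

# Stack the pieces, then reorder the rows by identifier
combi     <- rbind(sub1, sub2)
finaldata <- combi[order(combi[, "id"]), , drop = FALSE]
```

Whether the sort is actually a bottleneck depends on the data; it is worth timing on a realistic matrix before ruling this approach out.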
If you know they are matrices, you can do this with multidimensional arrays:
X <- matrix(1:100, 10,10)
s1 <- X[seq(1, 9,2), ]
s2 <- X[seq(2,10,2), ]
XX <- array(NA, dim=c(2,5,10) )
XX[1, ,] <- s1 #Note two commas, as it's a 3D array
XX[2, ,] <- s2
dim(XX) <- c(10,10)
XX
This will copy each element of s1 and s2 into the appropriate slice of the array, then drop the extra dimension. There's a decent chance that rbind is actually faster, but this way you won't need to re-sort it.
Caveat: this approach needs equal-sized splits whose rows interleave in the original order (here, alternating odd and even rows).
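For splits that are neither equal-sized nor regularly interleaved, one alternative is to preallocate the result and write each piece back at the positions given by its identifiers, which avoids sorting altogether. This is a sketch under the assumption that the identifiers are unique; full, subdata1 and subdata2 are made-up example objects.

```r
# Original matrix: identifier in the first column
full <- cbind(id = 1:10, x = (1:10) * 2)

# The irregular split from the question
subdata1 <- full[c(1, 3, 4, 6, 7), , drop = FALSE]
subdata2 <- full[c(2, 5, 8, 9, 10), , drop = FALSE]

# Some per-piece manipulation that leaves the identifier untouched
subdata1[, "x"] <- subdata1[, "x"] + 100
subdata2[, "x"] <- subdata2[, "x"] - 100

# Preallocate, then place each piece at the rows its identifiers map to
finaldata <- matrix(NA_real_, nrow = nrow(full), ncol = ncol(full),
                    dimnames = list(NULL, colnames(full)))
finaldata[match(subdata1[, "id"], full[, "id"]), ] <- subdata1
finaldata[match(subdata2[, "id"], full[, "id"]), ] <- subdata2
```

When the identifiers are exactly 1:N, as in the question, the match calls can be dropped and the identifier column used directly as the row index.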
merge is a very nice function: It merges matrices and data.frames, and returns a data.frame.
Given rather big character matrices, is there another good way to merge them without converting to data.frame?
Comment 1:
A small function to merge a named vector with a matrix or data.frame. Elements of the vector can link to multiple entries in the matrix:
expand <- function(v,m,by.m,v.name='v',...) {
df <- do.call(rbind,lapply(names(v),function(x) {
pos <- which(m[,by.m] %in% v[x])
cbind(x,m[pos,],...)
}))
colnames(df)[1] <- v.name
df
}
Example:
v <- rep(letters,each=3)[seq_along(letters)]
names(v) <- letters
m <- data.frame(a=unique(v),b=seq_along(unique(v)),stringsAsFactors=FALSE)
expand(v,m,'a')
You can use a combination of match and cbind to do the equivalent of merge without conversion to data frame, a simple example:
st1 <- state.x77[ sample(1:50), ]
st2 <- as.matrix( USArrests )[ sample(1:50), ]
tmp1 <- match(rownames(st1), rownames(st2) )
st3 <- cbind( st1, st2[tmp1,] )
head(st3)
Keeping track of which columns you want, and merging with many-to-one relationships or rows missing from one group, requires a bit more thought but is still possible.
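A minimal sketch of the many-to-one and missing-row cases, using two small made-up character matrices joined on a key column rather than on row names. The names orders, custs, left and inner are illustrative.

```r
# Character matrices sharing the key column "cust" (made-up data)
orders <- cbind(id   = c("o1", "o2", "o3", "o4"),
                cust = c("a", "b", "a", "z"))
custs  <- cbind(cust = c("a", "b", "c"),
                city = c("Oslo", "Rome", "Lima"))

# Position of each order's customer in custs; NA where there is no match
idx <- match(orders[, "cust"], custs[, "cust"])

# Keep every order, with NA for unmatched keys (similar to all.x = TRUE)
left <- cbind(orders, city = custs[idx, "city"])

# Keep only matched orders (similar to merge's default behaviour)
inner <- left[!is.na(idx), , drop = FALSE]
```

Customer "a" appears in two orders and is looked up twice, which covers the many-to-one case. Because match returns only the first hit, this does not extend to keys duplicated on the lookup side.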
No, not without either (a) overwriting the merge function or (b) creating a new merge.matrix() S3 function (this would be the right approach to the problem).
You can see in the merge help:
Value
A data frame.
Also, the merge.default function:
> merge.default
function (x, y, ...)
merge(as.data.frame(x), as.data.frame(y), ...)
There is now a merge.Matrix function in the Matrix.utils package. This works on combinations of matrices as well as capital M Matrices, data.frames, etc.
The match solution is nice, but as someone pointed out does not work on m:n relationships. It also does not implement the other features of merge, including all.x, all.y, etc.