I have output from older software that wraps the record for each transaction into multiple rows. I want to unwrap these rows into one flat dataframe. I have found solutions to unwrap columns, but not rows, and can do what I need in a loop, but the output is large and I would prefer a faster solution than a loop.
Example: I read into R from a .csv file 6 pieces of information about each of two transactions ("tran") that come wrapped into four rows.
The following represents and mimics my data as I read it into R from a .csv file:
V1 <- c("tran1.col1", "tran1.col4","tran2.col1", "tran2.col4")
V2 <- c("tran1.col2", "tran1.col5", "tran2.col2", "tran2.col5")
V3 <- c("tran1.col3", "tran1.col6", "tran2.col3", "tran2.col6")
df <- as.data.frame(matrix(c(V1, V2, V3), ncol = 3))
I am looking to transform the above to the following:
X1 <- c("tran1.col1", "tran2.col1")
X2 <- c("tran1.col2", "tran2.col2")
X3 <- c("tran1.col3", "tran2.col3")
X4 <- c("tran1.col4", "tran2.col4")
X5 <- c("tran1.col5", "tran2.col5")
X6 <- c("tran1.col6", "tran2.col6")
df.x <- as.data.frame(matrix(c(X1, X2, X3, X4, X5, X6), ncol = 6))
I've looked at tidy routines to gather and spread data as well as melt and dcast in reshape2, but as far as I can tell, I need to unwrap the rows first.
If all your inputs have 6 pieces of information by however many transactions, then the following should work.
vec <- as.character(unlist(t(df)))
df.x <- as.data.frame(matrix(vec, ncol = 6, byrow = TRUE))
To break it down to explain what's happening ...
# Transpose the df (this coerces it to a character matrix)
mat <- t(df)
# Reading the transposed matrix column by column now yields the values in
# the correct sequence (i.e. tran1.col1, tran1.col2 .. tran2.col1,
# tran2.col2), so flatten it to a vector
vec <- as.vector(mat)
# Now we can coerce it back to a data.frame, defining the number of columns
# and filling it by row (rather than by column)
df.x <- as.data.frame(matrix(vec, ncol = 6, byrow = TRUE))
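The steps above can be wrapped in a small helper for reuse. This is only a minimal sketch, assuming every record spans the same number of rows and no cells are missing; the name unwrap_rows and its n_fields argument are hypothetical and not part of the original answer.

```r
# Hypothetical helper: flatten records that wrap across several rows.
# Assumes the total number of cells is a multiple of n_fields.
unwrap_rows <- function(df, n_fields) {
  # t() coerces the data frame to a matrix; as.vector() then reads it
  # column by column, which is row by row of the original df
  cells <- as.vector(t(df))
  stopifnot(length(cells) %% n_fields == 0)
  as.data.frame(matrix(cells, ncol = n_fields, byrow = TRUE),
                stringsAsFactors = FALSE)
}

# Same example data as in the question
V1 <- c("tran1.col1", "tran1.col4", "tran2.col1", "tran2.col4")
V2 <- c("tran1.col2", "tran1.col5", "tran2.col2", "tran2.col5")
V3 <- c("tran1.col3", "tran1.col6", "tran2.col3", "tran2.col6")
df <- as.data.frame(matrix(c(V1, V2, V3), ncol = 3), stringsAsFactors = FALSE)

df.x <- unwrap_rows(df, n_fields = 6)
```

The stopifnot() check is there so that a file with a truncated last record fails loudly instead of being silently recycled by matrix().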
I am trying to collect some data from multiple subsets of a data set and need to create a data frame to collect the results. My problem is that I don't know how to create an empty data frame with a defined number of columns without actually having data to put into it.
collect1 <- c() ## i'd like to create empty df w/ 3 columns: `id`, `max1` and `min1`
for(i in 1:10){
collect1$id <- i
ss1 <- subset(df1, df1$id == i)
collect1$max1 <- max(ss1$value)
collect1$min1 <- min(ss1$value)
}
I feel very dumb asking this question (I almost feel like I've asked it on SO before but can't find it) but would greatly appreciate any help.
Would a dataframe of NAs work?
something like:
data.frame(matrix(NA, nrow = 2, ncol = 3))
If you need to be more specific about the data type, you may prefer NA_integer_, NA_real_, NA_complex_, or NA_character_ instead of plain NA, which is logical.
Something else that may be more specific than the NAs is:
data.frame(matrix(vector(mode = 'numeric',length = 6), nrow = 2, ncol = 3))
where the mode can be of any type. See ?vector
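To illustrate the typed NA constants mentioned above, here is a minimal sketch that preallocates two rows with a different type per column. The object name placeholder and its column names are made up for the example.

```r
# Preallocate two rows of typed missing values, one type per column
placeholder <- data.frame(id   = rep(NA_integer_, 2),
                          max1 = rep(NA_real_, 2),
                          name = rep(NA_character_, 2),
                          stringsAsFactors = FALSE)

# Inspect the column types
sapply(placeholder, class)
```

Building the columns separately, rather than through a single matrix, is what allows each column to keep its own type.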
Just create a data frame of empty vectors:
collect1 <- data.frame(id = character(0), max1 = numeric(0), min1 = numeric(0))
But if you know how many rows you're going to have in advance, you should just create the data frame with that many rows to start with.
You can do something like:
N <- 10
collect1 <- data.frame(id = integer(N),
max1 = numeric(N),
min1 = numeric(N))
Now be careful: in the rest of your code, you forgot to use the row index when filling the data.frame row by row. It should be:
for(i in seq_len(N)){
collect1$id[i] <- i
ss1 <- subset(df1, df1$id == i)
collect1$max1[i] <- max(ss1$value)
collect1$min1[i] <- min(ss1$value)
}
Finally, I would say that there are many alternatives for doing what you are trying to accomplish, some would be much more efficient and use a lot less typing. You could for example look at the aggregate function, or ddply from the plyr package.
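As a rough illustration of the aggregate route, the sketch below computes both statistics per id in one call, with no preallocation and no loop. The contents of df1 are invented here, since the question does not show the real data.

```r
# Made-up stand-in for df1: ten ids with five values each
set.seed(1)
df1 <- data.frame(id = rep(1:10, each = 5), value = rnorm(50))

# One row per id; the function returns both statistics at once
collect1 <- aggregate(value ~ id, data = df1,
                      FUN = function(v) c(max1 = max(v), min1 = min(v)))

# aggregate() stores the two statistics as a matrix column; flatten it
# into ordinary columns named id, max1 and min1
collect1 <- data.frame(id = collect1$id, collect1$value)
```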
You may use NULL instead of NA. This creates a truly empty data frame.
Here is a solution if you want an empty data frame with a defined number of rows and NO columns:
df = data.frame(matrix(NA, ncol=1, nrow=10))[-1]
df = data.frame(matrix("", ncol = 3, nrow = 10))
The solution given in another forum might help. Basically, it is:
Cols <- paste("A", 1:5, sep="")
DF <- read.table(textConnection(""), col.names = Cols,colClasses = "character")
> str(DF)
'data.frame': 0 obs. of 5 variables:
$ A1: chr
$ A2: chr
$ A3: chr
$ A4: chr
$ A5: chr
You can change the colClasses to fit your needs.
Original link is
https://stat.ethz.ch/pipermail/r-help/2008-August/169966.html
A more general method to create a data frame with an arbitrary number of columns is to create a 1-by-n data frame from a matrix of the same dimensions. Then, you can immediately drop the first row:
> v <- data.frame(matrix(NA, nrow=1, ncol=10))
> v <- v[-1, , drop=FALSE]
> v
[1] X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<0 rows> (or 0-length row.names)
If only the column names are available like :
cnms <- c("Nam1","Nam2","Nam3")
To create an empty data frame with the above variable names, first create a data.frame object:
emptydf <- data.frame()
Now assign a zero-length vector to every column name, thus creating an empty data frame with the given variable names:
for (i in seq_along(cnms)) {
emptydf[[cnms[i]]] <- logical(0)
}
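A loop-free way to reach the same goal is sketched below, assuming only the column names are known. It relies on setNames() and a zero-row matrix; the resulting columns are logical, so they take on another type once data is added.

```r
cnms <- c("Nam1", "Nam2", "Nam3")

# A 0-row matrix with one column per name, converted to a data frame
# and then renamed in a single step
emptydf <- setNames(data.frame(matrix(nrow = 0, ncol = length(cnms))), cnms)

str(emptydf)
```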
seq_len(nrow(df)) may help: nrow() tells you how many rows are in your data, and seq_len() generates the matching indices, so you can create a data.frame with the desired number of rows
listdf <- data.frame(ID=seq_len(nrow(df)),
var1=seq_len(nrow(df)), var2=seq_len(nrow(df)))
I have come across the same problem and have a cleaner solution. Instead of creating an empty data.frame you can instead save your data as a named list. Once you have added all results to this list you convert it to a data.frame after.
For the case of adding features one at a time this works best.
mylist = list()
for(column in 1:10) mylist[[paste0("V", column)]] = rnorm(10)
mydf = data.frame(mylist)
For the case of adding rows one at a time this becomes tricky due to mixed types. If all types are the same it is easy.
mylist = list()
for(row in 1:10) mylist[[row]] = rnorm(10)
mydf = data.frame(do.call(rbind, mylist))
I haven't found a simple way to add rows of mixed types. In this case, if you must do it this way, the empty data.frame is probably the best solution.
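One possible workaround for rows of mixed types, offered as a sketch rather than a definitive answer, is to store each row as a one-row data.frame in the list and bind them all at the end. The column names id, label and score are invented for the example.

```r
# Collect each result as a one-row data.frame, so every column keeps its type
rows <- vector("list", 3)
for (i in seq_along(rows)) {
  rows[[i]] <- data.frame(id = i, label = letters[i], score = i / 2,
                          stringsAsFactors = FALSE)
}

# Bind all rows once, after the loop has finished
mydf <- do.call(rbind, rows)
```

Binding once at the end avoids growing a data.frame inside the loop, which gets slow as the number of rows increases.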
I can't seem to find this specifically (I looked here: How to split a character vector into data frame?) and a few other places.
I am trying to split a character vector in R into a data frame, with a set number of columns, filling in NA for any extras or missing. As below (reproducible):
###Reproduce column vector
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2", "ee2", "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")
###Desired data frame separating 6 columns
df.desired <- data.frame(col1=c("a1","aa2","x1","rr2"),col2=c("b1","bb2","x2","tt3"),col3=c("c1","cc2","x3","bb4"),col4=c("d1","dd2","x4",NA),col5=c("e1","ee2","x5",NA),col6=c("f1","ff2","x6",NA),stringsAsFactors = F)
Thanks in advance!
1) base Create a matrix of NA values of the requisite dimensions and then fill it with cv up to its length. Transpose that and convert to a data frame.
mat <- t(replace(matrix(NA, 6, ceiling(length(cv) / 6)), seq_along(cv), cv))
as.data.frame(mat, stringsAsFactors = FALSE)
2) another base solution Using the cv2 copy of cv expand its length to that required and then reshape it into a matrix. We used cv2 in order to preserve the original cv but if you don't mind adding NAs to the end of cv then you could just use it instead of creating cv2 reducing the code by one line (two lines if we can use mat rather than needing a data frame). This solution avoids needing to use transpose by making use of the byrow argument of matrix.
cv2 <- cv
length(cv2) <- 6 * ceiling(length(cv) / 6)
mat <- matrix(cv2,, 6, byrow = TRUE)
as.data.frame(mat, stringsAsFactors = FALSE)
3) base solution using ts This one gets the row and column indexes by extracting them from the times of a ts object rather than calculating the dimensions via numeric calculation. To do that create the times, tt, of a ts object from cv. tt itself is a ts object for which as.integer(tt) is the row index numbers and cycle(tt) is the column index numbers. Finally use tapply with that:
tt <- time(ts(cv, frequency = 6))
mat <- tapply(cv, list(as.integer(tt), cycle(tt)), c)
as.data.frame(mat, stringsAsFactors = FALSE)
4) rollapply Like (3) this one does not explicitly calculate the dimensions of mat. It uses rollapplyr in the zoo package with a simple function, Fill, to avoid this. The Fill function returns its argument x padded out with NAs on the right to a length of 6.
library(zoo)
Fill <- function(x) { length(x) <- 6; x }
mat <- rollapplyr(cv, 6, by = 6, Fill, align = "left", partial = TRUE)
as.data.frame(mat, stringsAsFactors = FALSE)
In all alternatives above omit the last line if a matrix mat is adequate as the result.
Added
As of R 4.0, stringsAsFactors = FALSE is the default, so it could be omitted above.
1) base R - split the vector using a grouping variable created with gl and then append NA at the end with length<-
lst <- split(cv, as.integer(gl(length(cv), 6, length(cv))))
as.data.frame(do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
# V1 V2 V3 V4 V5 V6
#1 a1 b1 c1 d1 e1 f1
#2 aa2 bb2 cc2 dd2 ee2 ff2
#3 x1 x2 x3 x4 x5 x6
#4 rr2 tt3 bb4 <NA> <NA> <NA>
Let's say there is a matrix - 'mat' which has 115 columns.
There is another matrix - 'res_mat' which has a column having 38 column names of the previous matrix 'mat'.
I want to create a third matrix - 'fin_mat' which will be a subset of the first matrix 'mat' having the columns which are stored as values in the column of the second matrix 'res_mat'.
Or in other words, I have a list of column names which is stored in a variable. How can I create a subset of the first matrix containing the columns which are stored in a variable?
Doesn't seem very difficult. If I understand your question correctly, something like this will do it.
# First make up some matrix
mat <- matrix(1:24, ncol = 6)
colnames(mat) <- paste0("Col", 1:6)
# These would be the columns to keep
res_mat <- matrix(c("Col1", "Col3", "Col4"), ncol = 1)
fin_mat <- mat[, res_mat[, 1]]
fin_mat
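One detail worth checking with this kind of subsetting: if the list of names happens to contain a single column, a matrix subset collapses to a plain vector unless drop = FALSE is given. A small sketch, where one_col is a name made up for the example:

```r
mat <- matrix(1:24, ncol = 6)
colnames(mat) <- paste0("Col", 1:6)

one_col <- "Col3"

# Without drop = FALSE the single-column result is a plain vector
is.matrix(mat[, one_col])

# With drop = FALSE the result stays a 4 x 1 matrix and keeps its column name
fin_mat <- mat[, one_col, drop = FALSE]
dim(fin_mat)
```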
One way would be to use the dplyr package with the functions "select" and "one_of". one_of lets you select columns based on their names (given as strings).
Here is a simple example with the iris table, in which I extract the columns named "Sepal.Length" and "Sepal.Width".
library(dplyr)
mat1 <- iris
mat2 <- data.frame(names = c("Sepal.Length", "Sepal.Width")) %>%
mutate(names = as.character(names)) #make sure the names are characters
results <- mat1 %>% select(one_of(mat2$names))
It can be done pretty easily. In the code below, I am creating a data frame mat and another one, res_mat. mat has the data, and res_mat has a single column named select_these_cols. The mat data frame has 10 columns named a, b, c, ..., j. The select_these_cols column of res_mat has five rows with entries a, b, c, d, e. All that needs to be done is to pass res_mat$select_these_cols to mat as a column index.
a <- (matrix(rnorm(1000), nrow = 100, ncol = 10))
mat <- as.data.frame(a)
names(mat) <- letters[1:10]
res_mat <- data.frame(x = letters[1:5])
names(res_mat) <- 'select_these_cols'
fin_mat <- mat[res_mat$select_these_cols] # subsetting operation
I have 7 dataframes of experiments, each of which is subdivided into 15 repetitions (or iterations). I am now interested in all 105 instances of variable x for calculations later on in the analysis.
Imagine you have the following dataframes with randomized numbers and, for the sake of simplicity, pretend that all dataframes contain different numbers:
set.seed(2)
a <- runif(100, -1.5, 1.5)
b <- pnorm(rnorm(100))
c <- rnorm(100)
d <- rnorm(100)
e <- dnorm(rnorm(100))
iteration <- sort(sample(1:7, 100, replace=T), decreasing=F)
x <- f <- sample(1:1000, 100, replace=T)
df1 <- data.frame(a,b,c,d,e,iteration,x)
df2 <- data.frame(a,b,c,d,e,iteration,x)
df3 <- data.frame(a,b,c,d,e,iteration,x)
df4 <- data.frame(a,b,c,d,e,iteration,x)
df5 <- data.frame(a,b,c,d,e,iteration,x)
df6 <- data.frame(a,b,c,d,e,iteration,x)
df7 <- data.frame(a,b,c,d,e,iteration,x)
How can I break down all 105 combinations of variable x (df1$x of iteration 1, df1$x of iteration 2, ..., df7$x of iteration 7) so that I can calculate the following example nonsense equation for all 105 combinations?
mean(df1$x of iteration 1) - sd(df1$x of iteration 1)
mean(df1$x of iteration 2) - sd(df1$x of iteration 2)
...
mean(df7$x of iteration 7) - sd(df7$x of iteration 7)
I have the following command in order to "extract" variable df1$x of iteration 1 but this would involve 208 more lines to come for the remaining variables:
df_1 <- df1[which(df1$iteration=='1'),]
df_1_final <- df_1[grepl("1", df_1$iteration), c(6, 7)]
Does this make sense? Is there not a better way to do that in Gnu R?
A possibility using dplyr. It is probably easier to work with your data.frames in a list (from comments by @akrun)
library(dplyr)
bind_rows(mget(paste0('df', 1:7))) %>% # put your data.frames in a list -> data.frame
mutate(group=rep(1:7, each=100)) %>% # add a grouping column
group_by(group, iteration) %>% # group
summarise(mean(x) - sd(x)) # do your stuff
or in data.table
library(data.table)
rbindlist(mget(paste0('df', 1:7)))[,mean(x)-sd(x) ,.(gr=rep(1:7,each=100),iteration)]
You could create a nonsense equation function and then use it in tapply(), with iteration as the INDEX argument, for each df. So for df1: tapply(df1$x, INDEX = df1$iteration, nonsenseFunction), which will return an array with the result for each group (iteration) of df1.
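The tapply() idea can be sketched for several data frames at once by keeping them in a named list. The data below are made-up stand-ins for the real experiments, with only two data frames to keep it short, and make_df is a helper invented for the example.

```r
# The "nonsense equation" from the question as a function
nonsenseFunction <- function(v) mean(v) - sd(v)

# Made-up stand-ins: 7 iterations with 10 observations each
set.seed(2)
make_df <- function() {
  data.frame(iteration = rep(1:7, each = 10),
             x = sample(1:1000, 70, replace = TRUE))
}
dfs <- list(df1 = make_df(), df2 = make_df())

# One column per data frame, one row per iteration
res <- sapply(dfs, function(d) {
  tapply(d$x, INDEX = d$iteration, nonsenseFunction)
})
```

With the real data, mget(paste0("df", 1:7)) would build the list from the existing objects df1 to df7.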