stack data frame by rows

stack data frame by rows - r

I have a data set like this:
df <- data.frame(v1 = rnorm(12),
v2 = rnorm(12),
v3 = rnorm(12),
time = rep(1:3,4))
It looks like this:
> head(df)
v1 v2 v3 time
1 0.1462583 -1.1536425 3.0319594 1
2 1.4017828 -1.2532555 -0.1707027 2
3 0.3767506 0.2462661 -1.1279605 3
4 -0.8060311 -0.1794444 0.1616582 1
5 -0.6395198 0.4637165 -0.9608578 2
6 -1.6584524 -0.5872627 0.5359896 3
I now want to stack row 1-3 in a new column, then rows 4-6, then 7-9 and so on.
This is may naive way to do it, but there must be fast way, that doesn't use that many helper variables and loops:
l <- list()
for(i in 1:length(df)) {
l[[i]] <- stack(df[i:(i+2), -4])$values #time column is removed, was just for illustration
}
result <- do.call(cbind, l)
only base R should be used.

We can use split on the 'time' column
sapply(split(df[-4], cumsum(df$time == 1)), function(x) stack(x)$values)
Or instead of stack, unlist could be faster
sapply(split(df[-4], cumsum(df$time == 1)), unlist)
Based on the OP's code, it seems to be subsetting the rows based on the sequence of column
sapply(1:length(df), function(i) unlist(df[i:(i+2), -4]))

Related

Create dataframe from smallest vector available

I want to create a dataframe from a list of dataframes, specifically from a certain column of those dataframes. However each dataframe contains a different number of observations, so the following code gives me an error.
diffs <- data.frame(sensor1 = sensores[[1]]$Diff,
sensor2 = sensores[[2]]$Diff,
sensor3 = sensores[[3]]$Diff,
sensor4 = sensores[[4]]$Diff,
sensor5 = sensores[[5]]$Diff)
The error:
Error in data.frame(sensor1 = sensores[[1]]$Diff, sensor2 = sensores[[2]]$Diff, :
arguments imply differing number of rows: 29, 19, 36, 26
Is there some way to force data.frame() to take the minimal number or rows available from each one of the columns, in this case 19?
Maybe there is a built-in function in R that can do this, any solution is appreciated but I'd love to get something as general and as clear as possible.
Thank you in advance.

I can think of two approaches:
Example data:
df1 <- data.frame(A = 1:3)
df2 <- data.frame(B = 1:4)
df3 <- data.frame(C = 1:5)
Compute the number of rows of the smallest dataframe:
min_rows <- min(sapply(list(df1, df2, df3), nrow))
Use subsetting when combining:
diffs <- data.frame(a = df1[1:min_rows,], b = df2[1:min_rows,], c = df3[1:min_rows,] )
diffs
a b c
1 1 1 1
2 2 2 2
3 3 3 3
Alternatively, use merge:
rowmerge <- function(x,y){
# create row indicators for the merge:
x$ind <- 1:nrow(x)
y$ind <- 1:nrow(y)
out <- merge(x,y, all = T, by = "ind")
out["ind"] <- NULL
return(out)
}
Reduce(rowmerge, list(df1, df2, df3))
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 NA 4 4
5 NA NA 5
To get rid of the rows with NAs, remove the all = T.
For your particular case, you would probably call Reduce(rowmerge, sensores), assuming that sensores is a list of dataframes.
Note: if you already have an index somewhere (e.g. a timestamp of some sort), then it would be advisable to simply merge on that index instead of creating ind.

Removing rows in data.frame having columns subsumed in others

I am trying to achieve something similar to unique in a data.frame where column each element of a column in a row are vectors. What I want to do is if the elements of the vector in the column of that hat row a subset or equal to another remove the row with smaller number of elements. I can achieve this with a nested for loop but since data contains 400,000 rows the program is very inefficient.
Sample data
# Set the seed for reproducibility
set.seed(42)
# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20),
grps = sample(1:5, 20, replace = TRUE),
supergrp = sample(LETTERS[1:4], replace = TRUE))
# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)
# Arrange by number of items for each grp and supergroup
indx <- order(lengths(temp$items), decreasing = T)
temp <- temp[indx, ,drop=FALSE]
Temp looks like
grps supergrp items
1 4 D a, c, d
2 3 D c, d
3 5 D a, d
4 1 A b
5 2 A b
6 3 A b
7 4 A b
8 5 A b
9 1 D d
10 2 D c
Now you can see that second combination of supergrp and items in second and third row is contained in first row. So, I want to delete the second and third rows from the result. Similarly, rows 5 to 8 are contained in row 4. Finally, rows 9 and 10 are contained in the first row, so I want to delete rows 9 and 10.
Hence, my result would look like:
grps supergrp items
1 4 D a, c, d
4 1 A b
My implementation is as follows::
# initialise the result dataframe by first row of old data frame
newdf <-temp[1, ]
# For all rows in the the original data
for(i in 1:nrow(temp))
{
# Index to check if all the items are found
indx <- TRUE
# Check if item in the original data appears in the new data
for(j in 1:nrow(newdf))
{
if(all(c(temp$supergrp[[i]], temp$items[[i]]) %in%
c(newdf$supergrp[[j]], newdf$items[[j]]))){
# set indx to false if a row with same items and supergroup
# as the old data is found in the new data
indx <- FALSE
}
}
# If none of the rows in new data contain items and supergroup in old data append that
if(indx){
newdf <- rbind(newdf, temp[i, ])
}
}
I believe there is an efficient way to implement this in R; may be using the tidy framework and dplyr chains but I am missing the trick. Apologies for a longish question. Any input would be highly appreciated.

I would try to get the items out of a list column and store them in a longer dataframe. Here is my somewhat hacky solution:
library(stringr)
items <- temp$items %>%
map(~str_split(., ",")) %>%
map_df(~data.frame(.))
out <- bind_cols(temp[, c("grps", "supergrp")], items)
out %>%
gather(item_name, item, -grps, -supergrp) %>%
select(-item_name, -grps) %>%
unique() %>%
filter(!is.na(item))

R: efficient way to apply a function according to the columns of a dataframe

I feel extremely stupid now but I can't come up with more than a for loop...
I have a data frame with numerical and factorial columns. I simply want the numerical columns to be scaled and the factorial columns to be kept as they are. For example
> set.seed(160)
> df1 <- data.frame(as.data.frame(matrix(rnorm(8), ncol=2)),
V3=factor(c("A", "A", "B", "B")))
> df1
V1 V2 V3
1 0.6185496 -0.6410203 A
2 -0.8722777 2.6520986 A
3 0.8529240 -1.4156009 B
4 0.3678875 -1.1615607 B
I'd like to get
> df1
V1 V2 V3
1 0.4901808 -0.2642698 A
2 -1.4493527 1.4780179 A
3 0.7950968 -0.6740765 B
4 0.1640750 -0.5396717 B
with a more efficient command than
for(i in 1:ncol(df1)) {
if(is.factor(df1[,i])) {df1[,i] <- df1[,i]}
else{df1[,i] <- scale(df1[,i])}
}
I tried various combinations of lapply(), sapply(), if(), ifelse() but nothing seemed to work (apply doesn't work because the df gets transformed into a matrix and I lose the factor/numeric structure). Any suggestions?
NB: I am not trying to apply a function based on the values in the columns but based on the type of column.

You can try the following, which is similar to a suggestion in the comments:
df1[sapply(df1, is.numeric)] <- scale(df1[sapply(df1, is.numeric)])
#> df1
# V1 V2 V3
#1 0.4901808 -0.2642698 A
#2 -1.4493527 1.4780179 A
#3 0.7950968 -0.6740765 B
#4 0.1640750 -0.5396717 B

This should work.
df1[] <- sapply(df1, function(i) if(is.numeric(i)) scale(i) else i)

How can this code be compacted?

Can the following code be made more "R like"?
Given data.frame inDF:
V1 V2 V3 V4
1 a ha 1;2;3 A
2 c hb 4 B
3 d hc 5;6 C
4 f hd 7 D
Inside df I want to
find all rows which for the "V3" column has multiple values
separated by ";"
then replicate the respective rows a number of times equal with the number of individual values in the "V3" column,
and then each replicated row receives in the "V3" column only one the initial values
Shortly, the output data.frame (= outDF) will look like:
V1 V2 V3 V4
1 a ha 1 A
1 a ha 2 A
1 a ha 3 A
2 c hb 4 B
3 d hc 5 C
3 d hc 6 C
4 f hd 7 D
So, if from inDF I want to get to outDF, I would write the following code:
#load inDF from csv file
inDF <- read.csv(file='example.csv', header=FALSE, sep=",", fill=TRUE)
#search in inDF, on the V3 column, all the cells with multiple values
rowlist <- grep(";", inDF[,3])
# create empty data.frame and add headers from "headDF"
xDF <- data.frame(matrix(0, nrow=0, ncol=4))
colnames(xDF)=colnames(inDF)
#take every row from the inDF data.frame which has multiple values in col3 and break it in several rows with only one value
for(i in rowlist[])
{
#count the number of individual values in one cell
value_nr <- str_count(inDF[i,3], ";"); value_nr <- value_nr+1
# replicate each row a number of times equal with its value number, and transform it to character
extracted_inDF <- inDF[rep(i, times=value_nr[]),]
extracted_inDF <- data.frame(lapply(extracted_inDF, as.character), stringsAsFactors=FALSE)
# split the values in V3 cell in individual values, place them in a list
value_ls <- str_split(inDF[i, 3], ";")
#initialize f, to use it later to increment both row number and element in the list of values
f = 1
# replace the multiple values with individual values
for(j in extracted_inDF[,3])
{
extracted_inDF[f,3] <- value_ls[[1]][as.integer(f)]
f <- f+1
}
#put all the "demultiplied" rows in xDF
xDF <- merge(extracted_inDF[], xDF[], all=TRUE)
}
# delete the rows with multiple values from the inDF
inDF <- inDF[-rowlist[],]
#create outDF
outDF <- merge(inDF, xDF, all=TRUE)
Could you please

I'm not sure that I'm one to speak about whether you are using R in the "right" or "wrong" way... I mostly just use it to answer questions on Stack Overflow. :-)
However, there are many ways in which your code could be improved. For starters, YES, you should try to become familiar with the predefined functions. They will often be much more efficient, and will make your code much more transparent to other users of the same language. Despite your concise description of what you wanted to achieve, and my knowing an answer virtually right away, I found your code daunting to look through.
I would break up your problem into two main pieces: (1) splitting up the data and (2) recombining it with your original dataset.
For part 1: You obviously know some of the functions you need--or at least the main one you need: strsplit. If you use strsplit, you'll see that it returns a list, but you need a simple vector. How do you get there? Look for unlist. The first part of your problem is now solved.
For part 2: You first need to determine how many times you need to replicate each row of your original dataset. For this, you drill through your list (for example, with l/s/v-apply) and count each item's length. I picked sapply since I knew it would create a vector that I could use with rep.
Then, if you've played with data.frames enough, particularly with extracting data, you would have come to realize that mydf[c(1, 1, 1, 2), ] will result in a data.frame where the first row is repeated two additional times. Knowing this, we can use the length calculation we just made to "expand" our original data.frame.
Finally, with that expanded data.frame, we just need to replace the relevant column with the unlisted values.
Here is the above in action. I've named your dataset "mydf":
V3 <- strsplit(mydf$V3, ";", fixed=TRUE)
sapply(V3, length) ## How many times to repeat each row?
# [1] 3 1 2 1
## ^^ Use that along with `[` to "expand" your data.frame
mydf2 <- mydf[rep(seq_along(V3), sapply(V3, length)), ]
mydf2$V3 <- unlist(V3)
mydf2
# V1 V2 V3 V4
# 1 a ha 1 A
# 1.1 a ha 2 A
# 1.2 a ha 3 A
# 2 c hb 4 B
# 3 d hc 5 C
# 3.1 d hc 6 C
# 4 f hd 7 D
To share some more options...
The "data.table" package can actually be pretty useful for something like this.
library(data.table)
DT <- data.table(mydf)
DT2 <- DT[, list(new = unlist(strsplit(as.character(V3), ";", fixed = TRUE))), by = V1]
merge(DT, DT2, by = "V1")
Alternatively, concat.split.multiple from my "splitstackshape" package pretty much does it in one step, but if you want your exact output, you'll need to drop the NA values and reorder the rows.
library(splitstackshape)
df2 <- concat.split.multiple(mydf, split.cols="V3", seps=";", direction="long")
df2 <- df2[complete.cases(df2), ] ## Optional, perhaps
df2[order(df2$V1), ] ## Optional, perhaps

In this case, you can use the split-apply-combine paradigm for reshaping the data.
You want to split inDF by its rows, since you want to operate on each row separately. I've used the split function here to split it up by row:
spl = split(inDF, 1:nrow(inDF))
spl is a list that contains a 1-row data frame for each row in inDF.
Next, you'll want to apply a function to transform the split up data into the final format you need. Here, I'll use the lapply function to transform the 1-row data frames, using strsplit to break up the variable V3 into its appropriate parts:
transformed = lapply(spl, function(x) {
data.frame(V1=x$V1, V2=x$V2, V3=strsplit(x$V3, ";")[[1]], V4=x$V4)
})
tranformed is now a list where the first element has a 3-row data frame, the third element has a 2-row data frame, and the second and fourth have 1-row data frames.
The last step is to combine this list together into outDF, using do.call with the rbind function. That has the same effect of calling rbind with all of the elements of the transformed list.
outDF = do.call(rbind, transformed)
This yields the desired final data frame:
outDF
# V1 V2 V3 V4
# 1.1 a ha 1 A
# 1.2 a ha 2 A
# 1.3 a ha 3 A
# 2 c hb 4 B
# 3.1 d hc 5 C
# 3.2 d hc 6 C
# 4 f hd 7 D

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks

This loops by column, setting dt1 by reference and (hopefully) should be quick.
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2

This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider)) #repalce all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading not all columns have the same name but I am assuming the same order. this may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5

EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),names(p):=as.list(p),with=FALSE]
}
or, can you use matrix instead of data.frame? If so it could be a single line using A[B] syntax where B is a 2-column matrix containing the row and column numbers to update.

The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex