How to reshape multiple rows to a single row with several columns - r

This may seem an obvious question for someone who has practice with the reshape package, but I'm trying to get used to its functions and I can't figure out the right syntax!
Let's have the following data frame,
df <- data.frame(matrix(1:12,ncol=3),row.names=letters[1:4])
X1 X2 X3
a 1 5 9
b 2 6 10
c 3 7 11
d 4 8 12
How can we bind the rows into columns to get the following result?
X1.a X2.a X3.a X1.b X2.b X3.b X1.c X2.c X3.c X1.d X2.d X3.d
1 5 9 2 6 10 3 7 11 4 8 12
Thank you

This too would work:
vec <- c(t(df))
names(vec) <- c(outer(colnames(df), rownames(df), paste, sep="."))
## > vec
## X1.a X2.a X3.a X1.b X2.b X3.b X1.c X2.c X3.c X1.d X2.d X3.d
## 1 5 9 2 6 10 3 7 11 4 8 12

Since you want it as a vector, there's no need for reshape per se. You can just unlist it and then use setNames to set the names accordingly.
df.t <- as.data.frame(t(df))
vec <- unlist(df.t, use.names=FALSE) # gives a vector not matrix/data.frame
vec.names <- do.call(paste, c(expand.grid(rownames(df.t), colnames(df.t)), sep="."))
vec <- setNames(vec, vec.names)
# X1.a X2.a X3.a X1.b X2.b X3.b X1.c X2.c X3.c X1.d X2.d X3.d
# 1 5 9 2 6 10 3 7 11 4 8 12

Here's one, using melt() and cast() from the reshape package:
library(reshape)
m <- melt(cbind(df, rn = rownames(df)), id.vars = 'rn')
cast(m, ~ rn + variable)
## value a_X1 a_X2 a_X3 b_X1 b_X2 b_X3 c_X1 c_X2 c_X3 d_X1 d_X2 d_X3
## 1 (all) 1 5 9 2 6 10 3 7 11 4 8 12
Or as Arun indicates, acast() from reshape2 gives a matrix (without the additional value column):
library(reshape2)
acast(m, . ~ variable + rn)
## X1_a X1_b X1_c X1_d X2_a X2_b X2_c X2_d X3_a X3_b X3_c X3_d
## [1,] 1 2 3 4 5 6 7 8 9 10 11 12
(Note that the ordering differs from the cast() result because the formula is flipped: variable + rn instead of rn + variable.)

Related

Iterate through columns to sum the previous 2 numbers of each row

In R, I have a dataframe with columns 'A', 'B', 'C', 'D'. The columns have 100 rows.
I need to iterate through the columns and, for every row, sum that row's value and the previous row's value in the same column, then store the result in new columns ('AA', 'AB', etc.):
A B C D
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
to
A B C D AA AB AC AD
1 2 3 4 NA NA NA NA
2 3 4 5 3 5 7 9
3 4 5 6 5 7 9 11
4 5 6 7 7 9 11 13
5 6 7 8 9 11 13 15
6 7 8 9 11 13 15 17
Can someone explain how to create a function/loop that allows me to set the columns I want to iterate over (selected columns, not all columns) and the columns I want to set?
A base one-liner:
cbind(df, setNames(df + df[c(NA, 1:(nrow(df)-1)), ], paste0("A", names(df))))
If your data is large, this one might be the fastest because it operates on the whole data.frame at once.
A dplyr solution using mutate() with across():
library(dplyr)
df %>%
  mutate(across(A:D,
                ~ .x + lag(.x),
                .names = "A{col}"))
# A B C D AA AB AC AD
# 1 1 2 3 4 NA NA NA NA
# 2 2 3 4 5 3 5 7 9
# 3 3 4 5 6 5 7 9 11
# 4 4 5 6 7 7 9 11 13
# 5 5 6 7 8 9 11 13 15
# 6 6 7 8 9 11 13 15 17
If you want to sum the previous 3 rows, the second argument of across(), i.e. .fns, should be
~ .x + lag(.x) + lag(.x, 2)
which is equivalent to the use of rollsum() in zoo:
~ zoo::rollsum(.x, k = 3, fill = NA, align = 'right')
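For concreteness, here is the 3-row variant in full; a minimal sketch assuming the same df as in the question:
library(dplyr)
df <- data.frame(A = 1:6, B = 2:7, C = 3:8, D = 4:9)
df %>%
  mutate(across(A:D,
                ~ .x + lag(.x) + lag(.x, 2),  # rolling window of 3
                .names = "A{col}"))
# the first two rows of AA:AD come out NA, since the window is incomplete there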
Benchmark
A benchmark test with the microbenchmark package on a new data.frame with 10000 rows and 100 columns, evaluating each expression 10 times.
# Unit: milliseconds
# expr min lq mean median uq max neval
# darren_base 18.58418 20.88498 35.51341 33.64953 39.31909 80.24725 10
# darren_dplyr_lag 39.49278 40.27038 47.26449 42.89170 43.20267 76.72435 10
# arg0naut91_dplyr_rollsum 436.22503 482.03199 524.54800 516.81706 534.94317 677.64242 10
# Grothendieck_rollsumr 3423.92097 3611.01573 3650.16656 3622.50895 3689.26404 4060.98054 10
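For reference, a minimal sketch of how such a benchmark can be set up. The dimensions follow the description above; the random test data is my assumption, and only the base one-liner is shown as an example entry (the other answers would be added as further named expressions):
library(microbenchmark)
set.seed(1)
big <- as.data.frame(matrix(rnorm(10000 * 100), ncol = 100))  # assumed test data
microbenchmark(
  darren_base = cbind(big, setNames(big + big[c(NA, 1:(nrow(big) - 1)), ],
                                    paste0("A", names(big)))),
  times = 10
)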
You can use dplyr's across (and set optional names) with rolling sum (as implemented e.g. in zoo):
library(dplyr)
library(zoo)
df %>%
  mutate(
    across(
      A:D,
      ~ rollsum(., k = 2, fill = NA, align = 'right'),
      .names = 'A{col}'
    )
  )
Output:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
With A:D we've specified the range of column names we want to apply the function to. The .names argument above assumes you want to paste together the prefix A and the column name ({col}).
Here's a data.table solution. As you asked, it lets you select which columns to apply it to, rather than applying it to all columns.
library(data.table)
x <- data.table(A=1:6, B=2:7, C=3:8, D=4:9)
selected_cols <- c('A','B','D')
new_cols <- paste0("A",selected_cols)
x[, (new_cols) := lapply(.SD, function(col) col+shift(col, 1)), .SDcols = selected_cols]
x[]
NB This is 2 or 3 times faster than the fastest other answer.
Here is a naive approach with nested for loops. Beware: it is very slow if you iterate over hundreds of thousands of rows.
i <- 1
n <- 5
df <- data.frame(A = i:(i+n), B = (i+1):(i+n+1), C = (i+2):(i+n+2), D = (i+3):(i+n+3))
for (col in colnames(df)) {
  for (ind in 1:nrow(df)) {
    if (ind - 1 == 0) {next}
    s <- sum(df[c(ind - 1, ind), col])
    df[ind, paste0('S', col)] <- s
  }
}
Here is a cumsum method:
na.df <- data.frame(matrix(NA, 2, ncol(df)))
colnames(na.df) <- colnames(df)
cs1 <- cumsum(df)                  # column-wise cumulative sums
cs2 <- rbind(cs1[-1:-2,], na.df)   # cumulative sums shifted up by 2 rows
sum.diff <- cs2 - cs1              # cs1[i+2] - cs1[i] = x[i+1] + x[i+2]
cbind(df, rbind(na.df[1,], cs1[2,], sum.diff[1:(nrow(sum.diff)-2),]))
Benchmark:
# Unit: milliseconds
# expr min lq mean median uq max neval
# darrentsai.rbind 11.5623 12.28025 23.38038 16.78240 20.83420 91.9135 100
# darrentsai.rbind.rev1 8.8267 9.10945 15.63652 9.54215 14.25090 62.6949 100
# pseudopsin.dt 7.2696 7.52080 20.26473 12.61465 17.61465 69.0110 100
# ivan866.cumsum 25.3706 30.98860 43.11623 33.78775 37.36950 91.6032 100
I believe most of the cumsum method's time is wasted on data.frame allocations. If correctly adapted to a data.table backend, it could be the fastest; a sketch of such an adaptation follows.
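A minimal, unbenchmarked sketch of that adaptation, assuming the 6-row input from the question; data.table's shift() differences the cumulative sums in place, so no intermediate data.frames are allocated:
library(data.table)
dt <- data.table(A = 1:6, B = 2:7, C = 3:8, D = 4:9)
roll2 <- function(col) {
  cs <- cumsum(col)
  out <- cs - shift(cs, 2, fill = 0)  # cs[i] - cs[i-2] = x[i-1] + x[i]
  out[1] <- NA                        # no predecessor for the first row
  out
}
new_cols <- paste0("A", names(dt))
dt[, (new_cols) := lapply(.SD, roll2)]
dt[]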
Specify the columns we want. We show several different ways to do that. Then use rollsumr to get the desired columns, set the column names and cbind DF with it.
library(zoo)
# jx <- names(DF) # if all columns wanted
# jx <- sapply(DF, is.numeric) # if all numeric columns
# jx <- c("A", "B", "C", "D") # specify columns by name
jx <- 1:4 # specify columns by position
r <- rollsumr(DF[jx], 2, fill = NA)
colnames(r) <- paste0("A", colnames(r))
cbind(DF, r)
giving:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
Note
The input in reproducible form:
DF <- structure(list(A = 1:6, B = 2:7, C = 3:8, D = 4:9),
                class = "data.frame", row.names = c(NA, -6L))

Moving down columns in data frames in R

Suppose I have the following data frame:
df<-data.frame(step1=c(1,2,3,4),step2=c(5,6,7,8),step3=c(9,10,11,12),step4=c(13,14,15,16))
step1 step2 step3 step4
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
4 4 8 12 16
and what I have to do is something like the following:
df2<-data.frame(col1=c(1,2,3,4,5,6,7,8,9,10,11,12),col2=c(5,6,7,8,9,10,11,12,13,14,15,16))
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
How can I do that? Consider that more steps can be included (e.g., 20 steps).
Thanks!!
We can design a function to achieve this task; df_final is the final output. Notice the bin argument, which lets users specify how many consecutive columns are stacked into each output column.
# A function to conduct the data transformation
trans_fun <- function(df, bin = 3){
  # Calculate the number of new columns
  new_ncol <- (ncol(df) - bin) + 1
  # Create a list of data frames, one per sliding window of `bin` columns
  df_list <- lapply(1:new_ncol, function(num){
    return(df[, num:(num + bin - 1)])
  })
  # Convert each data frame to a vector
  dt_list2 <- lapply(df_list, unlist)
  # Convert dt_list2 to a data frame
  df_final <- as.data.frame(dt_list2)
  # Set the column and row names of df_final
  colnames(df_final) <- paste0("col", 1:new_ncol)
  rownames(df_final) <- 1:nrow(df_final)
  return(df_final)
}
# Apply the trans_fun
df_final <- trans_fun(df)
df_final
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16
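Since more steps can be included, the same call generalizes; a quick check on a hypothetical 20-column input:
# hypothetical wider input: 4 rows x 20 step columns
df20 <- as.data.frame(matrix(1:80, nrow = 4))
colnames(df20) <- paste0("step", 1:20)
dim(trans_fun(df20))  # 12 x 18: (20 - 3 + 1) windows of 3 stacked columns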
Here is a method using dplyr and reshape2 - this assumes all of the columns are the same length.
library(dplyr)
library(reshape2)
Drop the last column from the dataframe
df[, -ncol(df)] %>%
  melt() %>%
  dplyr::select(col1 = value) -> col1
Drop the first column from the dataframe
df %>%
  dplyr::select(-step1) %>%
  melt() %>%
  dplyr::select(col2 = value) -> col2
Combine the dataframes
bind_cols(col1, col2)
This should do the work:
df2 <- data.frame(col1 = 1:(3 * length(df$step1)))  # 12 rows: 3 stacked columns of 4
df2$col1 <- c(df$step1, df$step2, df$step3)
df2$col2 <- c(df$step2, df$step3, df$step4)
Things to note:
The important thing to see in the first line of the code is the need to create a table with the right number of rows.
Assigning to a column that does not exist will create one with that name.
Deleting a column in R should be done like this: df2$col <- NULL. (A short demonstration follows.)
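A quick demonstration of those three points, with hypothetical column names:
df2 <- data.frame(col1 = 1:12)  # create the frame with the right number of rows
df2$col2 <- 13:24               # assigning to a non-existent column creates it
df2$col2 <- NULL                # deleting a column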
Are you not just looking to do:
df2 <- data.frame(col1 = unlist(df[, -ncol(df)]),
                  col2 = unlist(df[, -1]))
rownames(df2) <- NULL
df2
col1 col2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
6 6 10
7 7 11
8 8 12
9 9 13
10 10 14
11 11 15
12 12 16

R - Subset dataframe to include only subjects with more than 1 record

I'd like to subset a dataframe to include all records for subjects that have >1 record, and exclude those subjects with only 1 record.
Let's take the following dataframe:
mydata <- data.frame(subject_id = factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10)),
                     variable = rnorm(15))
The code below gives me the subjects with >1 record using duplicated():
duplicates <- mydata[duplicated(mydata$subject_id),]$subject_id
But I want to retain in my subset all records for each subject with >1 record, so I tried:
mydata[mydata$subject_id==as.factor(duplicates),]
Which does not return the result I'm expecting.
Any ideas?
A data.table solution
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata <- as.data.frame(cbind(subject_id, variable))
library(data.table)
setDT(mydata)[, .SD[.N > 1], by = subject_id]  # Thanks, David.
# subject_id variable
# 1: 4 -1.3325937
# 2: 4 -0.4465668
# 3: 5 0.5696061
# 4: 5 -2.8897176
# 5: 6 -0.8690183
# 6: 6 -0.4617027
# 7: 9 -0.1503822
# 8: 9 -0.6281268
# 9: 9 1.3232209
A simple alternative is to use dplyr:
library(dplyr)
dfr <- data.frame(a=sample(1:2,10,rep=T), b=sample(1:5,10, rep=T))
dfr <- group_by(dfr, b)
dfr
# Source: local data frame [10 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 5
# 4 2 1
# 5 1 2
# 6 1 3
# 7 2 1
# 8 2 4
# 9 1 4
# 10 2 4
filter(dfr, n() > 1)
# Source: local data frame [8 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 1
# 4 1 2
# 5 2 1
# 6 2 4
# 7 1 4
# 8 2 4
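Applied to the question's mydata, the same idea reads:
library(dplyr)
mydata %>%
  group_by(subject_id) %>%
  filter(n() > 1)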
Here you go (I changed your variable to var <- rnorm(15)):
set.seed(11)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
var <- rnorm(15)
mydata <- as.data.frame(cbind(subject_id, var))
x1 <- c(names(table(mydata$subject_id)[table(mydata$subject_id) > 1]))
x2 <- which(mydata$subject_id %in% x1)
mydata[x2,]
subject_id var
4 4 0.3951076
5 4 -2.4129058
6 5 -1.3309979
7 5 -1.7354382
8 6 0.4020871
9 6 0.4628287
12 9 -2.1744466
13 9 0.4857337
14 9 1.0245632
Try:
> mydata[mydata$subject_id %in% mydata[duplicated(mydata$subject_id),]$subject_id,]
subject_id variable
4 4 -1.3325937
5 4 -0.4465668
6 5 0.5696061
7 5 -2.8897176
8 6 -0.8690183
9 6 -0.4617027
12 9 -0.1503822
13 9 -0.6281268
14 9 1.3232209
I had to edit your data frame a little bit:
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata<-as.data.frame(cbind(subject_id, variable))
Now to get all the rows for subjects that appear more than once:
mydata[duplicated(mydata$subject_id)
| duplicated(mydata$subject_id, fromLast = TRUE), ]
# subject_id variable
# 4 4 -1.3325937
# 5 4 -0.4465668
# 6 5 0.5696061
# 7 5 -2.8897176
# 8 6 -0.8690183
# 9 6 -0.4617027
# 12 9 -0.1503822
# 13 9 -0.6281268
# 14 9 1.3232209
Edit: this would also work, using your duplicates vector:
mydata[mydata$subject_id %in% duplicates, ]

subset with pattern

Say I have a data frame df
df <- data.frame( a1 = 1:10, b1 = 2:11, c2 = 3:12 )
I wish to subset the columns, but with a pattern
df1 <- subset( df, select= (pattern = "1") )
To get
> df1
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Is this possible?
It is possible to do this via
subset(df, select = grepl("1", names(df)))
For automating this as a function, one can use [ to do the subsetting. Couple that with one of R's regular expression functions and you have all you need.
By way of an example, here is a custom function implementing the ideas I mentioned above.
Subset <- function(df, pattern) {
  ind <- grepl(pattern, names(df))
  df[, ind]
}
Note this does no error checking etc. and just relies upon grepl to return a logical vector indicating which columns match pattern, which is then passed to [ to subset by columns. Applied to your df this gives:
> Subset(df, pattern = "1")
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Same same but different:
df2 <- df[, grep("1", names(df))]
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Base R now has a convenience function endsWith():
df[, endsWith(names(df), "1")]
In data.table you can do:
library(data.table)
setDT(df)
df[, .SD, .SDcols = patterns("1")]
# Or more precisely
df[, .SD, .SDcols = patterns("1$")]
In dplyr:
library(dplyr)
select(df, ends_with("1"))

Performing calculations on binned counts in R

I have a dataset stored in a text file in the format of bins of values followed by counts, like this:
var_a 1:5 5:12 7:9 9:14 ...
indicating that in the dataset var_a took the value 1 five times, the value 5 twelve times, and so on. Each variable is on its own line in that format.
I'd like to be able to perform calculations on this dataset in R, like quantiles, variance, and so on. Is there an easy way to load the data from the file and calculate these statistics? Ultimately I'd like to make a box-and-whisker plot for each variable.
Cheers!
You could use readLines to read in the data file:
.x <- readLines(datafile)
I will create some dummy data, as I don't have the file. This should be equivalent to the output of readLines:
## dummy
.x <- c("var_a 1:5 5:12 7:9 9:14", 'var_b 1:5 2:12 3:9 4:14')
I split on spaces to separate the fields:
#split by space
space_split <- strsplit(.x, ' ')
# get the variable names (first in each list)
variable_names <- lapply(space_split,'[[',1)
# get the variable contents (everything but the first element in each list)
variable_contents <- lapply(space_split,'[',-1)
# a function to do the appropriate replicates
do_rep <- function(x){rep.int(x[1], x[2])}
# recreate the variables
variables <- lapply(variable_contents, function(x){
  .list <- strsplit(x, ':')
  unlist(lapply(lapply(.list, as.numeric), do_rep))
})
names(variables) <- variable_names
You could get the variance for each variable using:
lapply(variables, var)
## $var_a
## [1] 6.848718
##
## $var_b
## [1] 1.138462
or get boxplots (boxplot draws one box per element of the list):
boxplot(variables)
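The question also asks about quantiles; the same lapply pattern applies:
lapply(variables, quantile)               # default quartiles per variable
sapply(variables, quantile, probs = 0.5)  # e.g. just the median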
Not knowing the actual form that your data is in, I would probably use something like readLines to get each line in as a vector, then do something like the following:
# Some sample data
temp = c("var_a 1:5 5:12 7:9 9:14",
         "var_b 1:7 4:9 3:11 2:10",
         "var_c 2:5 5:14 6:6 3:14")
# Extract the names
NAMES = gsub("[0-9: ]", "", temp)
# Extract the data
temp_1 = strsplit(temp, " |:")
temp_1 = lapply(temp_1, function(x) as.numeric(x[-1]))
# "Expand" the data
temp_1 = lapply(1:length(temp_1),
                function(x) rep(temp_1[[x]][seq(1, length(temp_1[[x]]), by = 2)],
                                temp_1[[x]][seq(2, length(temp_1[[x]]), by = 2)]))
names(temp_1) = NAMES
temp_1
# $var_a
# [1] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#
# $var_b
# [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
#
# $var_c
# [1] 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3
