Binning and Naming New Columns with Mean of Binned Columns - r

This probably has been asked already, but I could not find it. I have a data set, where column names are numbers, and row names are sample names (see below).
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0
I want to bin the columns, say every 4 columns, by summation, and then name the new columns with the mean of the binned columns. For the above table I would end up with:
"599.785" "599.816"
"A" 0 7
"B" 0 4
"C" 0 4
"D" 3 4
The new column names, 599.785 and 599.816, are the averages of the column names that were binned. I think something like cut would work for a vector of numbers, but I am not sure how to implement it for large data frames. Thanks for any help!

cn <- c("599.773", "599.781", "599.789", "599.797",
        "599.804", "599.812", "599.82", "599.828")
mat <- matrix(scan(text = "
0 0 0 0 0 2 1 4
0 0 0 0 0 1 0 3
0 0 0 0 2 1 0 1
3 0 0 0 3 1 0 0"), nrow = 4, byrow = TRUE)
colnames(mat) <- cn
rownames(mat) <- LETTERS[1:4]
## sum the selected columns within each row
sRows <- function(mat, cols) rowSums(mat[, cols, drop = FALSE])
## first column of each bin of 4 columns: 1, 5, ...
starts <- seq(1, ncol(mat), by = 4)
sapply(starts, function(base) sRows(mat, base:(base + 3)))
  [,1] [,2]
A    0    7
B    0    4
C    0    4
D    3    4
accum <- sapply(starts, function(base)
  sRows(mat, base:(base + 3)))
## name each bin with the mean of the column names it covers
colnames(accum) <- sapply(starts, function(base)
  mean(as.numeric(colnames(mat)[base:(base + 3)])))
accum
#-------
  599.785 599.816
A       0       7
B       0       4
C       0       4
D       3       4

First of all, using numeric values as column names is not good practice.
Even so, here is a solution that produces the output the OP asked for.
## read data without checking names
dt <- read.table(text='
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0',header=TRUE, check.names =FALSE)
cols <- as.numeric(colnames(dt))
## create a grouping vector: TRUE for the first half of the columns, FALSE for the second half
ff <- rep(c(TRUE,FALSE),each=length(cols)/2)
## using tapply to group operations by ff
vals <- do.call(cbind,tapply(cols,ff,
function(x)
rowSums(dt[,paste0(x)])))
nn <- tapply(cols,ff,mean)
## name the columns with the bin means (tapply orders the groups FALSE, TRUE, so the second bin comes first)
colnames(vals) <- nn[colnames(vals)]
vals
599.816 599.785
A 7 0
B 4 0
C 4 0
D 4 3
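The grouping vector above only splits the columns into two halves. A minimal sketch of a more general version, assuming a fixed number of columns per bin (bin_size and grp are names introduced here for illustration, not from the answers), could use integer division to label the bins and rowsum() to add them up:

```r
# Data from the question, as a matrix with numeric-looking column names
mat <- matrix(c(0, 0, 0, 0, 0, 2, 1, 4,
                0, 0, 0, 0, 0, 1, 0, 3,
                0, 0, 0, 0, 2, 1, 0, 1,
                3, 0, 0, 0, 3, 1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:4],
                              c("599.773", "599.781", "599.789", "599.797",
                                "599.804", "599.812", "599.82", "599.828")))

bin_size <- 4  # assumed bin width (hypothetical parameter)

# Bin label for every column: 1 1 1 1 2 2 2 2
grp <- (seq_len(ncol(mat)) - 1) %/% bin_size + 1

# rowsum() adds up rows by group, so transpose, sum, and transpose back
binned <- t(rowsum(t(mat), grp))

# Name each bin with the mean of the original column names it covers
colnames(binned) <- as.vector(tapply(as.numeric(colnames(mat)), grp, mean))
binned
```

Because the bins are labelled 1, 2, ... in column order, the result keeps the original left-to-right order. If the number of columns is not a multiple of bin_size, the last bin is simply narrower.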

Related

How to set a loop to assign lots of variables

I just started using R for a psych class, so please go easy on me. I watched a bunch of youtube videos on For loops, but none have answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame together, and save them as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this (with column headings reflecting the letter of the data frame, i.e. B has QB1, QB2, etc.):
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0
To do this in a for loop, you need assign() from base R plus eval_tidy() and sym() from rlang. Basically, you need to evaluate strings as variable names.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
# assign value to a string as variable
assign(
df_name,
as.data.frame(
matrix(
data = sample(
c(0,1),
size = nrows * ncols,
replace = TRUE
),
ncol = 25
)
)
)
# rename columns
assign(
df_name,
setNames(eval_tidy(sym(df_name)),paste0("Q",df_name,1:ncols))
)
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
new_df_name <- paste0("Q",i)
# initialize Qi with the same number of rows as A,B,C,D ...
assign(
new_df_name,
data.frame(tmp = matrix(NA,nrow = nrows))
)
# loop A,B,C,D ... and bind them
for(df_name in df_names){
assign(
new_df_name,
cbind(
eval_tidy(sym(new_df_name)),
eval_tidy(sym(df_name))[,i,drop = FALSE]
)
)
}
# drop tmp to clean up
assign(
new_df_name,
eval_tidy(sym(new_df_name))[,-1]
)
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The code would be much simpler if you saved the results in a list using map(). Most of the complexity comes from assigning values to separate variables.
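As a rough sketch of that list-based idea, here is a version that uses base R lapply() instead of purrr's map(), so no extra package is needed. The toy data and the helper name make_df are made up for illustration:

```r
set.seed(1)  # toy data only, not the asker's data

# Hypothetical helper: a small data frame with columns Q<letter>1 ... Q<letter>25
make_df <- function(letter, nrows = 4, ncols = 25) {
  m <- matrix(sample(0:2, nrows * ncols, replace = TRUE), ncol = ncols)
  setNames(as.data.frame(m), paste0("Q", letter, seq_len(ncols)))
}

# Keep the four data frames in one list instead of four separate variables
dfs <- lapply(c("A", "B", "C", "D"), make_df)

# One list element per question: column i taken from every data frame
Q <- lapply(seq_len(25), function(i) {
  do.call(cbind, lapply(dfs, function(d) d[i]))
})
names(Q) <- paste0("Q", seq_len(25))

names(Q$Q2)
```

Each combined data frame is then available as Q$Q1, Q[["Q2"]], and so on, and the whole set can be processed with another lapply() call rather than by name.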
You can combine some dplyr verbs in a for loop to combine the columns from each data set and assign them to 25 new objects.
library(dplyr)
library(tidyr)
# q1..q4 are the toy data frames created at the end of this answer
# merge data, gather, split by var numbers, assign each df to environment
for (i in 1:25) {
df <- cbind(q1,q2,q3,q4) %>% mutate(id=row_number()) %>%
gather(k,v,-id) %>%
mutate(num=sub('A|B|C|D','',k)) %>%
filter(num==i) %>% select(-num) %>% spread(k,v)
assign(paste0('df',i),df)
}
ls(pattern = '^df[0-9]+$')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))

Replace a sequence in data frame column

I have a data frame in R that looks somewhat like this:
A | B
0 0
1 0
0 0
0 0
0 1
0 1
1 0
1 0
1 0
I now want to replace all sequences of more than one "1" in the columns so that only the first "1" is kept and the others are replaced by "0", so that the result looks like this
A | B
0 0
1 0
0 0
0 0
0 1
0 0
1 0
0 0
0 0
I hope you understand what I mean (English is not my mother tongue, and the R "vocabulary" in particular is a bit hard for me, which is probably why I couldn't find a solution through googling). Thank you in advance!
Try this solution:
Input data
df<-data.frame(
A=c(1,0,0,0,0,0,1,1,1,0),
B=c(1,1,0,1,0,0,1,1,0,0))
f<-function(X)
{
return(as.numeric((diff(c(0,X)))>0))
}
Your output
data.frame(lapply(df,f))
A B
1 1 1
2 0 0
3 0 0
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 0 0
10 0 0
You can use ave and create groups based on the difference of your values, so that each run of consecutive 1s or 0s becomes its own group, and then replace the duplicates within each group with 0, i.e.
df[] <- lapply(df, function(i)ave(i, cumsum(c(1, diff(i) != 0)),
FUN = function(i) replace(i, duplicated(i), 0)))
which gives,
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
Here's a simple one line answer:
> df * rbind(c(0,0), sapply(df, diff))
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
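This takes advantage of the fact that diff is 1 only where a 0 is followed by a 1, so every unwanted 1 in the original data gets multiplied by 0. Note that the padding row c(0,0) also zeroes a 1 in the very first row.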
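To make the intermediate step visible, here is a small worked sketch on the data from the question. The name steps is introduced here for illustration only:

```r
# Data from the question
df <- data.frame(A = c(0, 1, 0, 0, 0, 0, 1, 1, 1),
                 B = c(0, 0, 0, 0, 1, 1, 0, 0, 0))

# Row-to-row differences, padded with a zero row to restore the length:
#   1 marks a 0 -> 1 step, -1 a 1 -> 0 step, 0 means no change
steps <- rbind(c(0, 0), sapply(df, diff))

# Only a 1 that directly follows a 0 survives the multiplication
res <- df * steps
res
```

A -1 in steps always lines up with a 0 in df, so it never shows up in the result.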
Here is an option with rleid
library(data.table)
df1[] <- lapply(df1, function(x) +(x==1& !ave(x, rleid(x), FUN = duplicated)))
df1
# A B
#1 0 0
#2 1 0
#3 0 0
#4 0 0
#5 0 1
#6 0 0
#7 1 0
#8 0 0
#9 0 0
Here's a more function-based approach. There are shorter answers here, but it's good to see how this could be implemented under the hood:
# helper function
make_zero <- function(val)
{
  get_index <- integer(0)
  for(i in seq_along(val))
  {
    # collect the indexes of the current run of 1s; reset on any other value
    if(val[i] == 1) get_index <- c(get_index, i)
    else get_index <- integer(0)
    # indexes are consecutive: set everything in the run except its first 1 to 0
    if(all(diff(get_index) == 1))
    {
      val[get_index[-1]] <- 0
    }
  }
  return (val)
}
df <- sapply(df, make_zero)
df
A B
[1,] 0 0
[2,] 1 0
[3,] 0 0
[4,] 0 0
[5,] 0 1
[6,] 0 0
[7,] 1 0
[8,] 0 0
[9,] 0 0
Explanation:
1. We save the indexes of consecutive 1s in get_index.
2. Next, we check if the difference between indexes is 1.
3. If found, we update the value in the column.

How can I create this special sequence?

I would like to create the following vector sequence.
0 1 0 0 2 0 0 0 3 0 0 0 0 4
My thought was to create 0 first with rep() but not sure how to add the 1:4.
Create a diagonal matrix, take the upper triangle including the diagonal (R reads it column by column), and remove the first element:
d <- diag(0:4)
d[upper.tri(d, TRUE)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
If you prefer a one-liner that makes no global assignments, wrap it up in a function:
(function() { d <- diag(0:4); d[upper.tri(d, TRUE)][-1L] })()
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
And for code golf purposes, here's another variation using d from above:
d[!lower.tri(d)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
rep and rbind up to their old tricks:
rep(rbind(0,1:4),rbind(1:4,1))
#[1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
This essentially creates 2 matrices, one for the value, and one for how many times the value is repeated. rep does not care if an input is a matrix, as it will just flatten it back to a vector going down each column in order.
rbind(0,1:4)
# [,1] [,2] [,3] [,4]
#[1,] 0 0 0 0
#[2,] 1 2 3 4
rbind(1:4,1)
# [,1] [,2] [,3] [,4]
#[1,] 1 2 3 4
#[2,] 1 1 1 1
You can use rep() to create a sequence that repeats each value i a total of i + 1 times:
n <- 4
myseq <- rep(seq_len(n), seq_len(n) + 1)
# [1] 1 1 2 2 2 3 3 3 3 4 4 4 4 4
Then you can use diff() to find the elements you want. You need to append a 1 to the end of the diff() output, since you always want the last value.
c(diff(myseq), 1)
# [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
Then you just need to multiply the original sequence with the diff() output.
myseq <- myseq * c(diff(myseq), 1)
myseq
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
unlist(lapply(1:4, function(i) c(rep(0,i),i)))
# the sequence
s = 1:4
# create zeros vector
vec = rep(0, sum(s+1))
# assign the sequence to the corresponding position in the zeros vector
vec[cumsum(s+1)] <- s
vec
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
Or to be more succinct, use replace:
replace(rep(0, sum(s+1)), cumsum(s+1), s)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
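The same cumsum()/replace() idea can be wrapped in a function for any n. This is only a sketch, and zero_ramp is a hypothetical helper name, not something from the answers:

```r
# Hypothetical helper: i zeros followed by the value i, for i = 1..n
zero_ramp <- function(n) {
  s <- seq_len(n)
  # Each block has length i + 1, so block i ends at position cumsum(s + 1)[i]
  replace(numeric(sum(s + 1)), cumsum(s + 1), s)
}

zero_ramp(4)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
```

Using seq_len(n) rather than 1:n means zero_ramp(0) returns an empty vector instead of failing.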

Expand a single column to a wide/model matrix format

Suppose I have a column in a matrix or data.frame as follows:
# stringsAsFactors=TRUE so that levels() below also works on R >= 4.0
df <- data.frame(col1=sample(letters[1:3], 10, TRUE), stringsAsFactors=TRUE)
I want to expand this out to multiple columns, one for each level in the column, with 0/1 entries indicating presence or absence of level for each row
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
for (i in 1:length(levels(df$col1))) {
curLetter <- levels(df$col1)[i]
newdf[which(df$col1 == curLetter), curLetter] <- 1
}
newdf
I know there's a simple clever solution to this, but I can't figure out what it is.
I've tried expand.grid on df, which returns itself as is. Similarly melt in the reshape2 package on df returned df as is. I've also tried reshape but it complains about incorrect dimensions or undefined columns.
Obviously, model.matrix is the most direct candidate here, but I'll present three alternatives: table, lapply, and dcast (the last one since this question is tagged reshape2).
table
table(sequence(nrow(df)), df$col1)
#
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
lapply
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
newdf[] <- lapply(names(newdf), function(x)
{ newdf[[x]][df[,1] == x] <- 1; newdf[[x]] })
newdf
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
dcast
library(reshape2)
dcast(df, sequence(nrow(df)) ~ df$col1, fun.aggregate=length, value.var = "col1")
# sequence(nrow(df)) a b c
# 1 1 1 0 0
# 2 2 0 1 0
# 3 3 0 1 0
# 4 4 0 0 1
# 5 5 1 0 0
# 6 6 0 0 1
# 7 7 0 0 1
# 8 8 0 1 0
# 9 9 0 1 0
# 10 10 1 0 0
It's very easy with model.matrix
model.matrix(~ df$col1 + 0)
The term + 0 means that the intercept is not included. Hence, you receive a dummy variable for each factor level.
The result:
df$col1a df$col1b df$col1c
1 0 0 1
2 0 1 0
3 0 0 1
4 1 0 0
5 0 1 0
6 1 0 0
7 1 0 0
8 0 1 0
9 1 0 0
10 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`df$col1`
[1] "contr.treatment"
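model.matrix() prefixes every column with the term name (df$col1a and so on above). A small sketch, assuming you would rather have the bare level names as column names, passes the data frame through the data argument and renames the columns afterwards. The five-row input here is made up for illustration:

```r
# Small made-up input with a factor column
df <- data.frame(col1 = factor(c("a", "b", "b", "c", "a")))

# "+ 0" drops the intercept, giving one indicator column per level
mm <- model.matrix(~ col1 + 0, data = df)

# Columns come back as col1a, col1b, col1c; replace them with the levels
colnames(mm) <- levels(df$col1)
mm
```

The result is still a matrix with the "assign" and "contrasts" attributes attached; as.data.frame(mm) converts it if a data frame is needed.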

Converting a column of type 'list' to multiple columns in a data frame

I have a data frame with one column which is a list, like so:
>head(movies$genre_list)
[[1]]
[1] "drama" "action" "romance"
[[2]]
[1] "crime" "drama"
[[3]]
[1] "crime" "drama" "mystery"
[[4]]
[1] "thriller" "indie"
[[5]]
[1] "thriller"
[[6]]
[1] "drama" "family"
I want to convert this one column to multiple columns, one for each unique element across the lists (in this case, genres), and have them as binary columns. I'm looking for an elegant solution, which doesn't involve first finding out how many genres are there, and then creating a column for each, and then checking each list element to then populate the genre columns. I tried unlist, but it doesn't work with a vector of lists in the way I want.
Thanks!
Here are a few approaches:
movies <- data.frame(genre_list = I(list(
c("drama", "action", "romance"),
c("crime", "drama"),
c("crime", "drama", "mystery"),
c("thriller", "indie"),
c("thriller"),
c("drama", "family"))))
Update, years later....
You can use the mtabulate function from "qdapTools" or the unexported charMat function from my "splitstackshape" package.
Syntax would be:
library(qdapTools)
mtabulate(movies$genre_list)
# action crime drama family indie mystery romance thriller
# 1 1 0 1 0 0 0 1 0
# 2 0 1 1 0 0 0 0 0
# 3 0 1 1 0 0 1 0 0
# 4 0 0 0 0 1 0 0 1
# 5 0 0 0 0 0 0 0 1
# 6 0 0 1 1 0 0 0 0
or
splitstackshape:::charMat(movies$genre_list, fill = 0)
# action crime drama family indie mystery romance thriller
# [1,] 1 0 1 0 0 0 1 0
# [2,] 0 1 1 0 0 0 0 0
# [3,] 0 1 1 0 0 1 0 0
# [4,] 0 0 0 0 1 0 0 1
# [5,] 0 0 0 0 0 0 0 1
# [6,] 0 0 1 1 0 0 0 0
Update: A couple of more direct approaches
Improved option 1: Use table somewhat directly:
table(rep(1:nrow(movies), sapply(movies$genre_list, length)),
unlist(movies$genre_list, use.names=FALSE))
Improved option 2: Use a for loop.
x <- unique(unlist(movies$genre_list, use.names=FALSE))
m <- matrix(0, ncol = length(x), nrow = nrow(movies), dimnames = list(NULL, x))
for (i in 1:nrow(m)) {
m[i, movies$genre_list[[i]]] <- 1
}
m
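Another way to fill the same kind of indicator matrix without an explicit loop is to test every record against the full genre vector with %in%. This is a sketch on three of the records from the question, stored in a plain list called genre_list (a name chosen here for the standalone example):

```r
# Three of the records from the question, as a plain list
genre_list <- list(c("drama", "action", "romance"),
                   c("crime", "drama"),
                   "thriller")

# All genres that occur anywhere, in alphabetical order
genres <- sort(unique(unlist(genre_list)))

# For each record, flag which of the genres it contains;
# sapply() returns genres in rows, so transpose to get one row per record
m <- t(sapply(genre_list, function(g) as.integer(genres %in% g)))
colnames(m) <- genres
m
```

Like the for loop, this produces 0/1 flags rather than counts, so a genre listed twice in one record would still give 1.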
Below is the OLD answer
Convert the list to a list of tables (in turn converted to data.frames):
tables <- lapply(seq_along(movies$genre_list), function(x) {
temp <- as.data.frame.table(table(movies$genre_list[[x]]))
names(temp) <- c("Genre", paste("Record", x, sep = "_"))
temp
})
Use Reduce to merge the resulting list. If I understand your end goal correctly, this results in the transposed form of the result you are interested in.
merged_tables <- Reduce(function(x, y) merge(x, y, all = TRUE), tables)
merged_tables
# Genre Record_1 Record_2 Record_3 Record_4 Record_5 Record_6
# 1 action 1 NA NA NA NA NA
# 2 drama 1 1 1 NA NA 1
# 3 romance 1 NA NA NA NA NA
# 4 crime NA 1 1 NA NA NA
# 5 mystery NA NA 1 NA NA NA
# 6 indie NA NA NA 1 NA NA
# 7 thriller NA NA NA 1 1 NA
# 8 family NA NA NA NA NA 1
Transposing and converting NA to 0 is pretty straightforward. Just drop the first column and re-use it as the column names for the new data.frame:
movie_genres <- setNames(data.frame(t(merged_tables[-1])), merged_tables[[1]])
movie_genres[is.na(movie_genres)] <- 0
movie_genres
Using the same input as in the other replies here are some alternatives:
1) factor/table/rbind
> levs <- levels(factor(unlist(movies[[1]])))
> as.data.frame(do.call(rbind, lapply(lapply(movies[[1]], factor, levs), table)))
action crime drama family indie mystery romance thriller
1 1 0 1 0 0 0 1 0
2 0 1 1 0 0 0 0 0
3 0 1 1 0 0 1 0 0
4 0 0 0 0 1 0 0 1
5 0 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0
2) make.groups/xtabs
> library(lattice)
> m <- do.call(make.groups, movies[[1]])
> as.data.frame.matrix(xtabs(~ which + data, m))
action crime drama family indie mystery romance thriller
c("drama", "action", "romance") 1 0 1 0 0 0 1 0
c("crime", "drama") 0 1 1 0 0 0 0 0
c("crime", "drama", "mystery") 0 1 1 0 0 1 0 0
c("thriller", "indie") 0 0 0 0 1 0 0 1
thriller 0 0 0 0 0 0 0 1
c("drama", "family") 0 0 1 1 0 0 0 0
2a) make.groups/dcast This one is a variation on alternative 2 using dcast from reshape2 in place of as.data.frame.matrix and xtabs. Melted data frame m is from alternative 2.
library(reshape2)
dcast(m, which ~ data, fun.aggregate = length, value.var = "which")
UPDATE: Added alternative 2.
UPDATE 2: Added alternative 2a.
