I want to find the best "R way" to flatten a dataframe that looks like this:
CAT COUNT TREAT
A 1,2,3 Treat-a, Treat-b
B 4,5 Treat-c,Treat-d,Treat-e
So it will be structured like this:
CAT COUNT1 COUNT2 COUNT3 TREAT1 TREAT2 TREAT3
A 1 2 3 Treat-a Treat-b NA
B 4 5 NA Treat-c Treat-d Treat-e
Example code to generate the source dataframe:
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5)
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
I believe I need a combination of rbind and unlist? Any help would be greatly appreciated.
- Tim
Here is a solution using base R. It accepts vectors of any length inside your lists, and there is no need to specify which columns of the dataframe you want to collapse. Part of the solution was generated using this answer.
df2 <- do.call(cbind, lapply(df, function(x) {
  # check if it is a list, otherwise just return as is
  if (is.list(x)) {
    return(data.frame(t(sapply(x, '[', seq(max(sapply(x, length)))))))
  } else {
    return(x)
  }
}))
As of R 3.2 there is lengths() to replace sapply(x, length) as well:
df3 <- do.call(cbind.data.frame, lapply(df, function(x) {
  # check if it is a list, otherwise just return as is
  if (is.list(x)) {
    data.frame(t(sapply(x, '[', seq(max(lengths(x))))))
  } else {
    x
  }
}))
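For reference, lengths() simply returns the length of each element of a list, so you can check it directly on the sample data:
lengths(df$COUNT)
# [1] 3 2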
data used:
df <- structure(list(CAT = structure(1:2, .Label = c("A", "B"), class = "factor"),
COUNT = list(1:3, 4:5), TREAT = list(c("Treat-a", "Treat-b"
), c("Treat-c", "Treat-d", "Treat-e"))), .Names = c("CAT",
"COUNT", "TREAT"), row.names = c(NA, -2L), class = "data.frame")
Here is another way in base R.
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5)
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
Create a helper function to do the work
f <- function(l) {
  if (!is.list(l)) return(l)
  do.call('rbind', lapply(l, function(x) `length<-`(x, max(lengths(l)))))
}
Always test your code
f(df$TREAT)
# [,1] [,2] [,3]
# [1,] "Treat-a" "Treat-b" NA
# [2,] "Treat-c" "Treat-d" "Treat-e"
Apply it
df[] <- lapply(df, f)
df
# CAT COUNT.1 COUNT.2 COUNT.3 TREAT.1 TREAT.2 TREAT.3
# 1 A 1 2 3 Treat-a Treat-b <NA>
# 2 B 4 5 NA Treat-c Treat-d Treat-e
There's a deleted answer here that indicates that "splitstackshape" could be used for this. It can, but the deleted answer used the wrong function. Instead, it should use the listCol_w function. Unfortunately, in its present form, this function is not vectorized across columns, so you would need to nest the calls to listCol_w for each column that needs to be flattened.
Here's the approach:
library(splitstackshape)
listCol_w(listCol_w(df, "COUNT", fill = NA), "TREAT", fill = NA)
## CAT COUNT_fl_1 COUNT_fl_2 COUNT_fl_3 TREAT_fl_1 TREAT_fl_2 TREAT_fl_3
## 1: A 1 2 3 Treat-a Treat-b NA
## 2: B 4 5 NA Treat-c Treat-d Treat-e
Note that fill = NA has been specified because it defaults to fill = NA_character_, which would otherwise coerce all the values to character.
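As a quick check of that coercion point (relying on the default described above), compare the column types with and without fill = NA:
str(listCol_w(df, "COUNT"))             # COUNT_fl_* columns come back as character
str(listCol_w(df, "COUNT", fill = NA))  # COUNT_fl_* columns stay integer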
Another alternative would be to use transpose from "data.table". Here's a possible implementation (looks scary, but using the function is easy). Benefits are that (1) you can specify the columns to flatten, (2) you can decide whether you want to drop the original column or not, and (3) it's fast.
flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = cols]
  if (isTRUE(drop)) indt[, (cols) := NULL]
  indt[]
}
Usage would be...
Keeping original columns:
flatten(df, c("COUNT", "TREAT"))
# CAT COUNT TREAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
# 1: A 1,2,3 Treat-a,Treat-b 1 2 3 Treat-a Treat-b NA
# 2: B 4,5 Treat-c,Treat-d,Treat-e 4 5 NA Treat-c Treat-d Treat-e
Dropping original columns:
flatten(df, c("COUNT", "TREAT"), TRUE)
# CAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
# 1: A 1 2 3 Treat-a Treat-b NA
# 2: B 4 5 NA Treat-c Treat-d Treat-e
See this gist for a comparison with the other solutions proposed.
I'm trying to write a function which takes a data.table, a list of columns and a list of values and selects rows such that each column is filtered by the respective value.
So, given the following data.table:
> set.seed(1)
> dt = data.table(sample(1:5, 10, replace = TRUE),
sample(1:5, 10, replace = TRUE),
sample(1:5, 10, replace = TRUE))
> dt
V1 V2 V3
1: 1 5 5
2: 4 5 2
3: 1 2 2
4: 2 2 1
5: 5 1 4
6: 3 5 1
7: 2 5 4
8: 3 1 3
9: 3 1 2
10: 1 5 2
A call to filterDT(dt, c(V1, V3), c(1, 2)) would select the rows where V1 = 1 and V3 = 2 (rows 3 and 10 above).
My best thought was to use .SD and .SDcols to stand in for the desired columns and then do a comparison within i (from dt[i,j,by]):
> filterDT <- function(dt, columns, values) {
dt[.SD == values, , .SDcols = columns]
}
> filterDT(dt, c("V1", "V3"), c(1, 2))
Empty data.table (0 rows and 3 cols): V1,V2,V3
Unfortunately, this doesn't work, even if only filtering by one column.
I've noticed all examples of .SD I've found online use it in j, which tells me I'm probably doing something very wrong.
Any suggestions?
Assuming that the 'values' to be filtered are the ones corresponding to the 'columns' selected, we can do a comparison with Map and Reduce with &
dt[dt[ , Reduce(`&`, Map(`==`, .SD, values)) , .SDcols = columns]]
As a function
filterDT <- function(dt, columns, values) {
dt[dt[ , Reduce(`&`, Map(`==`, .SD, values)) , .SDcols = columns]]
}
filterDT(dt, c("V1", "V3"), c(1, 2))
# V1 V2 V3
#1: 1 4 2
Or another option is setkey
setkeyv(dt, c("V1", "V3"))
dt[.(1, 2)]
# V1 V2 V3
#1: 1 4 2
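If you would rather not change the key of dt, recent versions of data.table also let you express the same join with on=:
dt[.(1, 2), on = c("V1", "V3")]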
I think you should be able to write a function that joins using an arbitrary number of columns:
#' Filter a data.table on an arbitrary number of columns
#'
#' @param dt data.table to filter
#' @param ... named columns to filter on and their values
filter_dt <- function(dt, ...) {
  filter_criteria <- as.data.table(list(...))
  dt[filter_criteria, on = names(filter_criteria), nomatch = 0]
}
# A few examples:
filter_dt(dt, V1=1, V3=2)
filter_dt(dt, V1=2, V2=2, V3=5)
filter_dt(dt, V1=c(5,4,4), V3=c(1,2,5))
Basically the function constructs a new data.table from the arguments supplied to ..., each argument becoming a column in the new data.table filter_criteria. This is then supplied to the i argument of dt with the column names of filter_criteria used as the columns in the join.
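To make that concrete, the first example call is equivalent to this manual join (filter_criteria here mirrors the variable built inside the function):
filter_criteria <- data.table(V1 = 1, V3 = 2)
dt[filter_criteria, on = names(filter_criteria), nomatch = 0]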
I have a very basic problem and can't find a solution, so sorry in advance for the beginner question.
I have a data frame with several ID columns and 30 numerical columns. I want to multiply all values of those 30 columns by the same factor, while keeping the rest of the data frame unchanged. I figured that dplyr and transmute_all or transmute_at are my friends, but I can't find a way to express the function Column1:Column30 * factor. All the examples I've found use simple functions like mean, and that doesn't help me with this expression.
I would use mutate_at. For example:
library(dplyr)
mtcars %>%
  mutate_at(vars(mpg:qsec),
            .funs = funs(. * 3))
I'll give a solution with data.table; the dplyr version should be close to identical.
library(data.table)
# convert to data.table format to use data.table syntax
setDT(my_df)
# .SD refers to all the columns mentioned in the .SDcols argument
# (all columns by default when this argument is not specified)
# - instead of using backticks around *, you could use quotes: "*"
my_df[ , lapply(.SD, `*`, factor), .SDcols = Column1:Column30]
On some made-up data
set.seed(0123498)
# create fake data
DT = setDT(replicate(8, rnorm(5), simplify = FALSE))
DT
# V1 V2 V3 V4 V5 V6 V7 V8
# 1: -0.2685077 -1.06491111 0.7307661 0.09880937 0.2791274 -0.5589676 1.5320685 0.4730013
# 2: 1.0783236 -0.17810929 -0.2578453 0.95940860 1.0990367 -0.6983235 0.9530062 -1.3800769
# 3: 1.1730611 -0.48828441 -1.6314077 -0.76117268 -0.5753245 -0.7370099 0.3982160 -0.8088035
# 4: 0.2060451 -0.07105785 -1.1878591 -0.83464592 2.1872117 -0.4390479 0.1428239 1.2634280
# 5: 1.6142695 0.46381602 0.5315299 2.34790945 -1.2977851 1.0428450 1.9292390 0.5337248
scalar = 3
DT[ , lapply(.SD, "*", scalar), .SDcols = V4:V6]
# V4 V5 V6
# 1: 0.2964281 0.8373822 -1.676903
# 2: 2.8782258 3.2971101 -2.094970
# 3: -2.2835180 -1.7259734 -2.211030
# 4: -2.5039378 6.5616352 -1.317144
# 5: 7.0437283 -3.8933554 3.128535
If it's all numeric columns you want to multiply (or if you can easily write a test), I'd use lapply with an is.numeric test:
Calling the data frame dd (and using iris to demonstrate):
dd = iris
dd[] = lapply(dd, FUN = function(x) if (is.numeric(x)) return(x * 2) else return(x))
This is equivalent to a simple for loop, which also works just fine.
for (i in 1:ncol(dd)) {
  if (is.numeric(dd[[i]])) dd[[i]] = dd[[i]] * 2
}
Another way is to use lapply only on the relevant columns, e.g.:
dd[1:30] = lapply(dd[1:30], "*", 2)
Since dplyr version 1.0, you can use across():
dd = iris
dd = dd %>%
  mutate(across(where(is.numeric), function(x) x * 2))
Maybe this will help you, using just base R:
> set.seed(100)
> df = data.frame(id=rep(1:5), val1=rnorm(5), val2=rnorm(5), val3=rnorm(5))
> df
id val1 val2 val3
1 1 -0.50219235 0.3186301 0.08988614
2 2 0.13153117 -0.5817907 0.09627446
3 3 -0.07891709 0.7145327 -0.20163395
4 4 0.88678481 -0.8252594 0.73984050
5 5 0.11697127 -0.3598621 0.12337950
# Multiply by 2 all columns except id column
> df[, !colnames(df) %in% c("id")] <- df[, !colnames(df) %in% c("id")] * 2
> df
id val1 val2 val3
1 1 -1.0043847 0.6372602 0.1797723
2 2 0.2630623 -1.1635814 0.1925489
3 3 -0.1578342 1.4290654 -0.4032679
4 4 1.7735696 -1.6505189 1.4796810
5 5 0.2339425 -0.7197243 0.2467590
You could just use apply
my_df <- data.frame(...)  # some data
my_scaled_df <- apply(my_df, 2, transformation_logic)
Note that apply() returns a matrix, so this works best when all the columns being transformed are numeric.
For this you can try:
y <- xx[-(1:2)] * 100
Here xx[-(1:2)] drops the first two columns of xx, which are the non-numeric ones, so they are excluded from the calculation.
After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
The vec contains the information on which subcategories were scored, and each "stop" marks the end of a reviewer's ratings. As such, I would like to organize the result into a data frame with this structure (expected output):
   a  b  c
1  2  5  1
2  1  3 NA
3 NA NA NA
4 NA NA  2
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.
@alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
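Since tapply() returns a matrix here, you can wrap the result if you prefer a data.frame (same i and j as above):
res <- tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity)
as.data.frame(res)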
base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec) != "stop"]
row <- 1
rating <- 1
df <- data.frame(lapply(categories, function(x) NA_integer_))
colnames(df) <- categories
for (i in vec) {
  if (i == "stop") {
    row <- row + 1
  } else {
    df[row, i] <- ratings[[rating]]
    rating <- rating + 1
  }
}
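Checking the result, which matches the expected output:
df
#    a  b  c
# 1  2  5  1
# 2  1  3 NA
# 3 NA NA NA
# 4 NA NA  2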
Here is one option
library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1,
-1)))))[value != 'stop', ratings := ratings
][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
L1 ~value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
I want to add a new column with "NA"s in my dataframe:
A B
1 14379 32094
2 151884 174367
3 438422 449382
But I need it to be located between col. A and B, like this:
A C B
1 14379 NA 32094
2 151884 NA 174367
3 438422 NA 449382
I know how to add col. C after col. B, but that is not helpful to me... Anyone know how to do it?
In 2 steps, you can reorder the columns:
dat$C <- NA
dat <- dat[, c("A", "C", "B")]
A C B
1 0.596068 NA -0.7783724
2 -1.464656 NA -0.8425972
You can also use append
dat <- data.frame(A = rnorm(2), B = rnorm(2))
as.data.frame(append(dat, list(C = NA), after = 1))
A C B
1 -0.7046408 NA 0.2117638
2 0.8402680 NA -2.0109721
If you use data.table you can use the function setcolorder. Note that NA is stored as a logical variable; if you want the column initiated as an integer, double or character column, you can use NA_integer_, NA_real_ or NA_character_ instead, e.g.
library(data.table)
DT <- data.table(DF)
# add column `C` = NA
DT[, C := NA]
setcolorder(DT, c('A','C','B'))
DT
## A C B
## 1: 14379 NA 32094
## 2: 151884 NA 174367
## 3: 438422 NA 449382
You could do this in one line:
setcolorder(DT[, C := NA], c('A', 'C', 'B'))
You can also use the package tibble, which has a very convenient function (among many others) for this: add_column().
library(tibble)
df <- data.frame("a" = 1:5, "b" = 6:10)
add_column(df, c = rep(NA, nrow(df)), .after = 1)
That function is easy to use, and you can use the argument .before instead.
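For example, the same insertion expressed with .before instead of .after:
add_column(df, c = rep(NA, nrow(df)), .before = "b")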
I wrote a function to append columns onto (into) a data.frame. It allows you to name the column as well, and does a few checks...
append_col <- function(x, cols, after = length(x)) {
  x <- as.data.frame(x)
  if (is.character(after)) {
    ind <- which(colnames(x) == after)
    if (length(ind) == 0) stop(after, " not found in colnames(x)\n")
  } else if (is.numeric(after)) {
    ind <- after
  }
  stopifnot(all(ind <= ncol(x)))
  cbind(x, cols)[, append(1:ncol(x), ncol(x) + 1:length(cols), after = ind)]
}
examples:
# create data
df <- data.frame("a"=1:5, "b"=6:10)
# append column
append_col(df, list(c=1:5))
# append after a column index
append_col(df, list(c=1:5), after=1)
# or after a named column
append_col(df, list(c=1:5), after="a")
# multiple columns / single values work as expected
append_col(df, list(c=NA, d=4:8), after=1)
(One advantage of calling cbind at the end of the function and indexing is that characters within the data.frame are not coerced to factors as would be the case if using as.data.frame(append(x, cols, after=ind)))