I want to add a new column filled with NA values to my data frame:
A B
1 14379 32094
2 151884 174367
3 438422 449382
But I need it to be located between col. A and B, like this:
A C B
1 14379 NA 32094
2 151884 NA 174367
3 438422 NA 449382
I know how to add col. C after col. B, but that is not helpful to me... Anyone know how to do it?
In 2 steps, you can reorder the columns:
dat$C <- NA
dat <- dat[, c("A", "C", "B")]
A C B
1 0.596068 NA -0.7783724
2 -1.464656 NA -0.8425972
You can also use append
dat <- data.frame(A = rnorm(2), B = rnorm(2))
as.data.frame(append(dat, list(C = NA), after = 1))
A C B
1 -0.7046408 NA 0.2117638
2 0.8402680 NA -2.0109721
If you use data.table, you can use the function setcolorder. Note that NA is stored as a logical value; if you want the column initialized as an integer, double, or character column, use NA_integer_, NA_real_, or NA_character_ instead.
For example:
library(data.table)
DT <- data.table(DF)
# add column `C` = NA
DT[, C := NA]
setcolorder(DT, c('A','C','B'))
DT
## A C B
## 1: 14379 NA 32094
## 2: 151884 NA 174367
## 3: 438422 NA 449382
You could do this in one line:
setcolorder(DT[, C := NA], c('A','C','B'))
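To follow up on the note about NA types above, here is a minimal sketch (recreating the question's data directly as a data.table, which is an assumption about how DF was built) of initializing the new column as a typed NA so it is double rather than logical from the start:
library(data.table)
DT <- data.table(A = c(14379, 151884, 438422), B = c(32094, 174367, 449382))
DT[, C := NA_real_]              # typed NA: column C is double, not logical
setcolorder(DT, c('A', 'C', 'B'))
str(DT$C)                        # num [1:3] NA NA NA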
You can also use the tibble package, which has a very handy function (among many others) for this: add_column().
library(tibble)
df <- data.frame("a" = 1:5, "b" = 6:10)
add_column(df, c = rep(NA, nrow(df)), .after = 1)
The function is easy to use, and you can use the argument .before instead of .after.
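As a small sketch of the .before variant (same toy df as above), positioning the new column by name instead of by index; add_column recycles length-one values, so a bare NA is enough here:
library(tibble)
df <- data.frame(a = 1:5, b = 6:10)
# insert the NA column just before column "b" (equivalent to .after = 1 in this data frame)
add_column(df, c = NA, .before = "b")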
I wrote a function to append columns onto (into) a data.frame. It allows you to name the column as well, and does a few checks...
append_col <- function(x, cols, after = length(x)) {
  x <- as.data.frame(x)
  if (is.character(after)) {
    ind <- which(colnames(x) == after)
    if (length(ind) == 0) stop(after, " not found in colnames(x)\n")
  } else if (is.numeric(after)) {
    ind <- after
  }
  stopifnot(all(ind <= ncol(x)))
  cbind(x, cols)[, append(1:ncol(x), ncol(x) + 1:length(cols), after = ind)]
}
examples:
# create data
df <- data.frame("a"=1:5, "b"=6:10)
# append column
append_col(df, list(c=1:5))
# append after a column index
append_col(df, list(c=1:5), after=1)
# or after a named column
append_col(df, list(c=1:5), after="a")
# multiple columns / single values work as expected
append_col(df, list(c=NA, d=4:8), after=1)
(One advantage of calling cbind at the end of the function and then indexing is that character columns in the data.frame are not coerced to factors, as they would be when using as.data.frame(append(x, cols, after = ind)) in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE.)
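A quick illustration of that point (an added example, with behavior as assumed for R versions before 4.0.0, where data.frame() and as.data.frame() defaulted to stringsAsFactors = TRUE):
x <- data.frame(a = 1:2, b = c("p", "q"), stringsAsFactors = FALSE)
# rebuilding via as.data.frame(append(...)) re-coerces the character column (pre-R 4.0.0)
class(as.data.frame(append(x, list(c = NA), after = 1))$b)  # "factor" on R < 4.0.0
# cbind-ing onto the existing data.frame keeps it as character
class(append_col(x, list(c = NA), after = 1)$b)             # "character"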
After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
The vector vec contains the subcategories that were scored, and each "stop" marks the end of a reviewer's ratings. I would like to organize the result into a data frame with this structure:
Expected output:
   a  b  c
1  2  5  1
2  1  3 NA
3 NA NA NA
4 NA NA  2
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should have taken me.
@alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
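The tapply call returns a matrix; if the desired final object is a data frame (an assumption about what "organize into a data frame" means here), a minimal follow-up is to wrap the result:
res <- tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity)
as.data.frame(res)   # same values, now a data.frame with columns a, b, c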
base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec)!="stop"]
row = 1
df = data.frame(lapply(categories, function(x){NA_integer_}))
colnames(df) <- categories
rating = 1
for (i in vec) {
  if (i == 'stop') {
    row <- row + 1
  } else {
    df[row, i] <- ratings[[rating]]
    rating <- rating + 1
  }
}
Here is one option
library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1,
-1)))))[value != 'stop', ratings := ratings
][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
L1 ~value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
I am looking to turn a dataframe (or datatable) such as
dt <- data.table(a = c(1,2,4), b = c(NA,3,5), d = c(NA,8,NA))
into something with one column, such as
dt <- data.table(combined = list(list(1,NA,NA), list(2,3,8), list(4,5,NA)))
None of the following work:
dt[,combined := as.list(a,b,d)]
dt[,combined := do.call(list,list(a,b,d))]
dt[,combined := cbind(a,b,d)]
dt[,combined := lapply(list(a,b,d),list)]
Note that this is different from the question here, data.frame rows to a list, which returns a different shaped object (I think it's just a plain list, with each row as an item in the list, rather than a vector of lists)
You can use purrr::transpose(), which transposes a list of vectors to a list of lists:
dt[, combined := purrr::transpose(.(a,b,d))]
dt
# a b d combined
#1: 1 NA NA <list>
#2: 2 3 8 <list>
#3: 4 5 NA <list>
combined = list(list(1,NA_real_,NA_real_),list(2,3,8),list(4,5,NA_real_))
identical(dt$combined, combined)
# [1] TRUE
If you don't want to use an extra package, you can use data.table::transpose with a little extra effort:
dt[, combined := lapply(transpose(.(a,b,d)), as.list)]
identical(dt$combined, combined)
# [1] TRUE
To make @David's comment more explicit and to generalize the data.table approach to a standard-evaluation (SE) version, which lets you pass column names in as a character vector and avoids hard-coding them (to learn more about SE vs NSE, see vignette("nse")), you can do:
dt[, combined := lapply(transpose(.SD), as.list), .SDcols = c("a","b","d")]
This makes all sublists named, but the values correspond to the combined list:
identical(lapply(dt$combined, setNames, NULL), combined)
# [1] TRUE
If you don't want to use any functions:
dt[, combined := .(.(.SD)), by = 1:nrow(dt)]
# because you want to transform each row to a list, normally you can group the data frame
# by the row id, and turn each row into a list, and store the references in a new list
# which will be a column in the resulted data.table
dt$combined
#[[1]]
# a b d
#1: 1 NA NA
#[[2]]
# a b d
#1: 2 3 8
#[[3]]
# a b d
#1: 4 5 NA
Or: dt[, combined := .(.(.(a,b,d))), by = 1:nrow(dt)], which gets you closer to the exact desired output.
dfOrig <- data.frame(rbind("1",
"C",
"531404",
"3",
"B",
"477644"))
library(data.table)  # for setnames()
setnames(dfOrig, "Value")
I have a single column vector, which actually comprises two observations of three variables. How do I convert it to a data.frame with the following structure:
ID Code Tag
"1" "C" "531404"
"3" "B" "477644"
Obviously, this is just a toy example to illustrate a real-world problem with many more observations and variables.
Here's another approach - it does rely on the dfOrig column being ordered 1,2,3,1,2,3 etc.
x <- c("ID", "Code", "Tag") # new column names
n <- length(x) # number of columns
res <- data.frame(lapply(split(as.character(dfOrig$Value), rep(x, nrow(dfOrig)/n)),
type.convert))
The resulting data is:
> str(res)
#'data.frame': 2 obs. of 3 variables:
# $ Code: Factor w/ 2 levels "B","C": 2 1
# $ ID : int 1 3
# $ Tag : int 531404 477644
As you can see, the column classes have been converted. In case you want the Code column to be character instead of factor, pass as.is = TRUE to type.convert (and, on older R versions where stringsAsFactors defaulted to TRUE, also stringsAsFactors = FALSE in the data.frame call), since it is type.convert that creates the factor, as sketched below.
And it looks like this:
> res
# Code ID Tag
#1 C 1 531404
#2 B 3 477644
Note: You have to get the column name order in x in line with the order of the entries in dfOrig$Value.
If you want to get the column order of res as specified in x, you can use the following:
res <- res[, match(x, names(res))]
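As a minimal sketch of the character-column variant mentioned above (same dfOrig, x, and n as before), passing as.is = TRUE through lapply to type.convert, plus stringsAsFactors = FALSE for older R versions:
res <- data.frame(lapply(split(as.character(dfOrig$Value), rep(x, nrow(dfOrig)/n)),
                         type.convert, as.is = TRUE),
                  stringsAsFactors = FALSE)
str(res$Code)   # chr [1:2] "C" "B"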
Maybe convert to a matrix with the desired number of columns:
# set number of columns
myNcol <- 3
# convert to matrix, then dataframe
res <- data.frame(matrix(dfOrig$Value, ncol = myNcol, byrow = TRUE),
stringsAsFactors = FALSE)
# convert the type and add column names
res <- as.data.frame(lapply(res, type.convert),
col.names = c("resID", "Code", "Tag"))
res
# resID Code Tag
# 1 1 C 531404
# 2 3 B 477644
You can create a sequence of numbers:
x <- seq_len(nrow(dfOrig)) %% 3   # change 3 to the number of columns you need
data.frame(ID = dfOrig$Value[x == 1],
Code = dfOrig$Value[x == 2],
Tag = dfOrig$Value[x == 0])
#ID Code Tag
#1 1 C 531404
#2 3 B 477644
Another approach would be splitting the dataframe according to the sequence generated above and then binding the columns using do.call
x <- seq_len(nrow(dfOrig)) %% 3
res <- do.call("cbind", split(dfOrig,x))
You can definitely change the column names
colnames(res) <- c("Tag", "Id", "Code")
# Tag Id Code
#3 531404 1 C
#6 477644 3 B
I want to find the best "R way" to flatten a dataframe that looks like this:
CAT COUNT TREAT
A 1,2,3 Treat-a, Treat-b
B 4,5 Treat-c,Treat-d,Treat-e
So it will be structured like this:
CAT COUNT1 COUNT2 COUNT3 TREAT1 TREAT2 TREAT3
A 1 2 3 Treat-a Treat-b NA
B 4 5 NA Treat-c Treat-d Treat-e
Example code to generate the source dataframe:
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5)
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
I believe I need a combination of rbind and unlist? Any help would be greatly appreciated.
- Tim
Here is a solution using base R that accepts vectors of any length inside your list, with no need to specify which columns of the data frame you want to collapse. Part of the solution was generated using this answer.
df2 <- do.call(cbind,lapply(df,function(x){
#check if it is a list, otherwise just return as is
if(is.list(x)){
return(data.frame(t(sapply(x,'[',seq(max(sapply(x,length)))))))
} else{
return(x)
}
}))
As of R 3.2, there is lengths() to replace sapply(x, length) as well:
df3 <- do.call(cbind.data.frame, lapply(df, function(x) {
# check if it is a list, otherwise just return as is
if (is.list(x)) {
data.frame(t(sapply(x,'[', seq(max(lengths(x))))))
} else {
x
}
}))
data used:
df <- structure(list(CAT = structure(1:2, .Label = c("A", "B"), class = "factor"),
COUNT = list(1:3, 4:5), TREAT = list(c("Treat-a", "Treat-b"
), c("Treat-c", "Treat-d", "Treat-e"))), .Names = c("CAT",
"COUNT", "TREAT"), row.names = c(NA, -2L), class = "data.frame")
Here is another way in base R:
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5)
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
Create a helper function to do the work
f <- function(l) {
if (!is.list(l)) return(l)
do.call('rbind', lapply(l, function(x) `length<-`(x, max(lengths(l)))))
}
Always test your code
f(df$TREAT)
# [,1] [,2] [,3]
# [1,] "Treat-a" "Treat-b" NA
# [2,] "Treat-c" "Treat-d" "Treat-e"
Apply it
df[] <- lapply(df, f)
df
# CAT COUNT.1 COUNT.2 COUNT.3 TREAT.1 TREAT.2 TREAT.3
# 1 A 1 2 3 Treat-a Treat-b <NA>
# 2 B 4 5 NA Treat-c Treat-d Treat-e
There's a deleted answer here that indicates that "splitstackshape" could be used for this. It can, but the deleted answer used the wrong function. Instead, it should use the listCol_w function. Unfortunately, in its present form, this function is not vectorized across columns, so you would need to nest the calls to listCol_w for each column that needs to be flattened.
Here's the approach:
library(splitstackshape)
listCol_w(listCol_w(df, "COUNT", fill = NA), "TREAT", fill = NA)
## CAT COUNT_fl_1 COUNT_fl_2 COUNT_fl_3 TREAT_fl_1 TREAT_fl_2 TREAT_fl_3
## 1: A 1 2 3 Treat-a Treat-b NA
## 2: B 4 5 NA Treat-c Treat-d Treat-e
Note that fill = NA has been specified because it defaults to fill = NA_character_, which would otherwise coerce all the values to character.
Another alternative would be to use transpose from "data.table". Here's a possible implementation (looks scary, but using the function is easy). Benefits are that (1) you can specify the columns to flatten, (2) you can decide whether you want to drop the original column or not, and (3) it's fast.
flatten <- function(indt, cols, drop = FALSE) {
require(data.table)
if (!is.data.table(indt)) indt <- as.data.table(indt)
x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
nams <- paste(rep(cols, x), sequence(x), sep = "_")
indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = cols]
if (isTRUE(drop)) {
indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE),
.SDcols = cols][, (cols) := NULL]
}
indt[]
}
Usage would be...
Keeping original columns:
flatten(df, c("COUNT", "TREAT"))
# CAT COUNT TREAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
# 1: A 1,2,3 Treat-a,Treat-b 1 2 3 Treat-a Treat-b NA
# 2: B 4,5 Treat-c,Treat-d,Treat-e 4 5 NA Treat-c Treat-d Treat-e
Dropping original columns:
flatten(df, c("COUNT", "TREAT"), TRUE)
# CAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
# 1: A 1 2 3 Treat-a Treat-b NA
# 2: B 4 5 NA Treat-c Treat-d Treat-e
See this gist for a comparison with the other solutions proposed.