How to duplicate columns in a data frame - r

Is there any efficient way, without using for loops, to duplicate the columns in a data frame? For example, if I have the following data frame:
Var1 Var2
1 1 0
2 2 0
3 1 1
4 2 1
5 1 2
6 2 2
And I specify that column Var1 should be repeated twice, and column Var2 three times, then I would like to get the following:
Var1 Var1 Var2 Var2 Var2
1 1 1 0 0 0
2 2 2 0 0 0
3 1 1 1 1 1
4 2 2 1 1 1
5 1 1 2 2 2
6 2 2 2 2 2
Any help would be greatly appreciated!

We can replicate the column names with rep and use the result as an index to duplicate the columns. Because selecting the same column more than once from a data.frame makes the resulting names unique (as make.unique does), the duplicated columns in 'df2' get .1, .2 suffixes. If we don't want that, we can remove the suffix with sub. Note that this pattern also truncates any original column name that already contains a dot.
df2 <- df1[rep(names(df1), c(2,3))]
names(df2) <- sub('\\..*', '', names(df2))
df2
# Var1 Var1 Var2 Var2 Var2
#1 1 1 0 0 0
#2 2 2 0 0 0
#3 1 1 1 1 1
#4 2 2 1 1 1
#5 1 1 2 2 2
#6 2 2 2 2 2
Or as @Frank mentioned in the comments, we can also do
`[.noquote`(df1,c(1,1,2,2,2))
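
For a self-contained version of the same idea, the sketch below indexes by column position and then restores the names from the index, so no regular expression is needed. The names `times` and `idx` are made up for this illustration, and `df1` is rebuilt from the data in the question.

```r
# Rebuild the example data from the question
df1 <- data.frame(Var1 = c(1, 2, 1, 2, 1, 2),
                  Var2 = c(0, 0, 1, 1, 2, 2))

# How many copies of each column we want (hypothetical helper vector)
times <- c(Var1 = 2, Var2 = 3)

# Column positions, each repeated the requested number of times: 1 1 2 2 2
idx <- rep(seq_along(df1), times = times[names(df1)])

# Selecting a column more than once makes the names unique (Var1, Var1.1, ...)
df2 <- df1[idx]

# Put the original (duplicated) names back by looking them up from the index
names(df2) <- names(df1)[idx]
df2
```

Keep in mind that a data frame with duplicated names is legal but awkward to work with: `df2$Var1` and `df2[["Var1"]]` return only the first matching column.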

Related

How do I add column values based on matching IDs in R?

I have two data frames:
A:
ID Var1 Var2 Var3
1 0 3 4
2 1 5 0
3 1 6 7
B:
ID Var1 Var2 Var3
1 2 4 2
2 2 1 1
3 0 2 1
4 1 0 3
I want to add the columns from A and B based on matching IDs to get data frame C, and keep row 4 from B (even though it does not have a matching ID in A):
ID Var1 Var2 Var3
1 2 7 6
2 3 6 1
3 1 8 8
4 1 0 3
rbind and aggregate by ID:
aggregate(. ~ ID, data=rbind(A,B), sum)
# ID Var1 Var2 Var3
#1 1 2 7 6
#2 2 3 6 1
#3 3 1 8 8
#4 4 1 0 3
In data.table you can similarly do:
library(data.table)
setDT(rbind(A,B))[, lapply(.SD, sum), by=ID]
And there would be analogous solutions in dplyr and sql or whatever else. Bind the rows, group by ID, sum.
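
To make the "bind the rows, group by ID, sum" recipe reproducible, here is a minimal base R sketch that rebuilds A and B from the question. The names `stacked` and `C` are chosen for this illustration only.

```r
# Rebuild the two data frames from the question
A <- data.frame(ID = 1:3,
                Var1 = c(0, 1, 1), Var2 = c(3, 5, 6), Var3 = c(4, 0, 7))
B <- data.frame(ID = 1:4,
                Var1 = c(2, 2, 0, 1), Var2 = c(4, 1, 2, 0), Var3 = c(2, 1, 1, 3))

# Step 1: stack the rows (both frames must have the same column names)
stacked <- rbind(A, B)

# Step 2 and 3: group by ID and sum every other column within each group
C <- aggregate(. ~ ID, data = stacked, FUN = sum)
C
```

ID 4 appears only in B, so its group contains a single row and the values pass through unchanged. One thing to watch for with real data: the formula interface of aggregate drops rows containing NA by default, so missing values may need handling before this step.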

R convert values in rows with string of multiple values to columns in a dataframe [duplicate]

I have a data frame like this (each row in B is a string of values joined with the $ symbol):
A B
a 1$2$3
b 2$4$5
c 3$2$5
Now I want something like this (I want to create columns that say whether each value is present in that row of column B or not):
A B 1 2 3 4 5
a 1$2$3 1 1 1 0 0
b 2$4$5 0 1 0 1 1
c 3$2$5 0 1 1 0 1
I want to do this without using any loops in R. Please help me
Thanks in advance
Here's another attempt. First, I get all the unique values across the B column and then combine table with factor while specifying these levels for all the splits of the B column (edited after some comments from @akrun)
temp <- strsplit(as.character(df$B), "\\$") # Save the split column
lvls <- unique(unlist(temp)) # Get unique values
df[lvls] <- do.call(rbind, lapply(temp, function(x) table(factor(x, levels = lvls))))
df
# A B 1 2 3 4 5
# 1 a 1$2$3 1 1 1 0 0
# 2 b 2$4$5 0 1 0 1 1
# 3 c 3$2$5 0 1 1 0 1
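
The same split-then-compare-against-levels idea can be written with %in% instead of table. This is only a sketch on the question's data; `parts`, `ind` and `out` are illustrative names.

```r
# Rebuild the example data from the question
df <- data.frame(A = c("a", "b", "c"),
                 B = c("1$2$3", "2$4$5", "3$2$5"),
                 stringsAsFactors = FALSE)

# Split each string on a literal "$" (fixed = TRUE avoids regex escaping)
parts <- strsplit(df$B, "$", fixed = TRUE)

# All distinct values across the column; note the sort is alphabetical,
# so "10" would come before "2" if such values were present
lvls <- sort(unique(unlist(parts)))

# One row per original row, one 0/1 column per distinct value
ind <- do.call(rbind, lapply(parts, function(x) as.integer(lvls %in% x)))
colnames(ind) <- lvls

out <- cbind(df, ind)
out
```

Unlike table, %in% only records presence, so a value repeated within one string still yields 1 rather than a count.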
One option would be to split the "B" column by $ into a list and convert the character values to numeric. We then stack the list into a data.frame, change the 'ind' column to numeric, use sparseMatrix to build a binary matrix, and cbind it with the original dataset to get the expected output.
lst <- lapply(strsplit(as.character(df1$B), "[$]"), as.numeric)
df2 <- stack(setNames(lst, seq_along(lst)))
df2$ind <- as.numeric(as.character(df2$ind))
library(Matrix)
cbind(df1, as.matrix(sparseMatrix(df2$ind, df2$values, x=1)))
# A B 1 2 3 4 5
#1 a 1$2$3 1 1 1 0 0
#2 b 2$4$5 0 1 0 1 1
#3 c 3$2$5 0 1 1 0 1
You can also try cSplit_e from my "splitstackshape" package:
library(splitstackshape)
cSplit_e(mydf, "B", "$", fill = 0)
# A B B_1 B_2 B_3 B_4 B_5
# 1 a 1$2$3 1 1 1 0 0
# 2 b 2$4$5 0 1 0 1 1
# 3 c 3$2$5 0 1 1 0 1
Or, mtabulate from "qdapTools":
library(qdapTools)
cbind(mydf, mtabulate(strsplit(mydf$B, "\\$")))
# A B 1 2 3 4 5
# 1 a 1$2$3 1 1 1 0 0
# 2 b 2$4$5 0 1 0 1 1
# 3 c 3$2$5 0 1 1 0 1

specifying column name when one column is selected using grep in r

I am having an issue with the grep function. Specifically, when I tell R to get all the columns that start with a certain letter using the function, and there is only one such column, all that is yielded is the data with the code as the column name like this:
> head(newdat1)
i1 b2 b1 b17
1 1 1 2 0
2 1 1 2 0
3 1 1 2 0
4 1 1 2 0
5 2 1 1 0
6 3 1 1 1
datformeanfill<-as.data.frame(newdat1[,grep("^i", colnames(newdat1))])
> head(datformeanfill)
newdat1[, grep("^i", colnames(newdat1))]
1 1
2 1
3 1
4 1
5 2
6 3
As opposed to if I have two or more columns that start with the same letter:
datnotformeanfill<-as.data.frame(newdat1[,grep("^b", colnames(newdat1))])
> head(datnotformeanfill)
b2 b1 b17
1 1 2 1
2 1 2 1
3 1 2 1
4 1 2 1
5 1 1 1
6 1 1 2
Where we see the column names are maintained, and it does the same if I have multiple "i". Please help thanks!
Use
datformeanfill <- newdat1[,grep("^i", colnames(newdat1)), drop=FALSE]
to ensure you always get back a data.frame. See ?'[.data.frame' for the details.
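
A small sketch of the difference, using the six rows shown in the question. The names `cols`, `as_vector`, `as_frame` and `as_frame2` are chosen for this illustration.

```r
# Rebuild the head of the data from the question
newdat1 <- data.frame(i1  = c(1, 1, 1, 1, 2, 3),
                      b2  = c(1, 1, 1, 1, 1, 1),
                      b1  = c(2, 2, 2, 2, 1, 1),
                      b17 = c(0, 0, 0, 0, 0, 1))

# Only one column name starts with "i"
cols <- grep("^i", colnames(newdat1))

# Matrix-style indexing drops a single column to a plain vector,
# which is why the column name was lost in the question
as_vector <- newdat1[, cols]

# drop = FALSE keeps the data frame and therefore the name "i1"
as_frame <- newdat1[, cols, drop = FALSE]

# List-style indexing (no comma) also returns a data frame
as_frame2 <- newdat1[cols]

str(as_vector)
str(as_frame)
```

Either of the last two forms behaves the same whether the pattern matches one column or many, so the downstream code does not need a special case.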

Filter rows based on multiple column conditions R

Suppose I have a dataset with 100-odd columns and I need to keep only those rows that meet one condition applied across all 100 columns. How do I do this?
Suppose it's like below. I need to keep only rows where any of Col1, Col2, Col3 or Col4 is > 0
Col1 Col2 Col3 Col4
1 1 3 4
0 0 4 2
4 3 4 3
2 1 0 2
1 2 0 3
0 0 0 0
In the above example, all rows except the last will make it. I need to place the results in the same data frame as the original. I'm not sure if I can use lapply to loop through the columns where > 0, or if I can use subset. Any help is appreciated.
Can I use column indices and do df<-subset(df,c(2:100)>0)? This doesn't give me the right result.
Suppose your data.frame is DF then using [ will do the work for you.
> DF[DF[,1]>0 | DF[,2] >0 | DF[,3] >0 | DF[,4] >0, ]
Col1 Col2 Col3 Col4
1 1 1 3 4
2 0 0 4 2
3 4 3 4 3
4 2 1 0 2
5 1 2 0 3
If you have hundreds of columns you can use this alternative approach (it assumes there are no negative values; otherwise use rowSums(DF > 0) > 0)
> DF[rowSums(DF) != 0, ]
Col1 Col2 Col3 Col4
1 1 1 3 4
2 0 0 4 2
3 4 3 4 3
4 2 1 0 2
5 1 2 0 3
dat <- read.table(header = TRUE, text = "
Col1 Col2 Col3 Col4
1 1 3 4
0 0 4 2
4 3 4 3
2 1 0 2
1 2 0 3
0 0 0 0
")
You can use data.table to automatically accommodate however many columns your data.frame happens to have. Here's one way, but there's probably a more elegant method of doing this with data.table:
require(data.table)
dt <- data.table(dat)
dt[rowSums(dt>0)>0]
# Col1 Col2 Col3 Col4
# 1: 1 1 3 4
# 2: 0 0 4 2
# 3: 4 3 4 3
# 4: 2 1 0 2
# 5: 1 2 0 3
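
The same row test works in base R without any package. The sketch below rebuilds the question's data and also shows, on a made-up frame `neg`, why counting positive cells is safer than summing values when negatives may be present. `keep`, `dat_kept` and `neg` are illustrative names.

```r
# Rebuild the example data from the question
dat <- data.frame(Col1 = c(1, 0, 4, 2, 1, 0),
                  Col2 = c(1, 0, 3, 1, 2, 0),
                  Col3 = c(3, 4, 4, 0, 0, 0),
                  Col4 = c(4, 2, 3, 2, 3, 0))

# dat > 0 is a logical matrix; rowSums counts the TRUE cells per row,
# so the test reads "at least one column is greater than zero"
keep <- rowSums(dat > 0) > 0
dat_kept <- dat[keep, ]
dat_kept

# Hypothetical data with negatives: each row has a positive value,
# but the row totals cancel out to zero
neg <- data.frame(x = c(-1, 1), y = c(1, -1))
rowSums(neg) != 0      # misses both rows
rowSums(neg > 0) > 0   # keeps both rows
```

If the data can contain NA, `rowSums(dat > 0, na.rm = TRUE) > 0` treats missing cells as not meeting the condition.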
