Combine two identical dataframe columns into comma separated columns in R

I have two identically structured dataframes (same number of rows and columns, and the same headers). What I would like to do is to combine the two into one dataframe that has comma separated columns.
I know how to do it with these dummy data frames, but using it on my own data would be very cumbersome.
These are my dummy data frames; the headers of my "real" data are "1", "2", "3" etc., while those of the dummy data frames are "X1", "X2", "X3" etc.
> data1
X1 X2 X3 X4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
> data2
X1 X2 X3 X4
1 8 9 13 14
2 9 10 14 15
3 10 11 15 16
What I would like:
> data3
new1 new2 new3 new4
1 1,8 2,9 3,13 4,14
2 2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16
This is how I managed to get that output, but I think it is too cumbersome for a large dataset:
data1<- data.frame('1'=1:3, '2'=2:4, '3'=3:5,'4'=4:6)
data2<- data.frame('1'=8:10, '2'=9:11, '3'=13:15,'4'=14:16)
names(data1) <- c("1a","2a","3a","4a")
names(data2) <- c("1b","2b","3b","4b")
data3<- cbind(data1,data2)
cols.1 <- c('1a','1b'); cols.2 <-c('2a','2b')
cols.3 <- c('3a','3b'); cols.4 <-c('4a','4b')
data3$new1 <- apply( data3[ , cols.1] , 1 , paste , collapse = "," )
data3$new2 <- apply( data3[ , cols.2] , 1 , paste , collapse = "," )
data3$new3 <- apply( data3[ , cols.3] , 1 , paste , collapse = "," )
data3$new4 <- apply( data3[ , cols.4] , 1 , paste , collapse = "," )
data3 <-data3[,c(9:12)]
Is there a way in which I can iterate this, perhaps with a for loop? Any help would be appreciated.
These posts are somewhat similar:
The same question, but for rows instead of columns:
how to convert column values into comma separated row values
Similar, but didn't work on my large dataset:
Paste multiple columns together

Using only base R:
data1 <- data.frame(x1 = 1:3, x2 = 2:4, x3 = 3:5, x4 = 4:6)
data2 <- data.frame(x1 = 8:10, x2 = 9:11, x3 = 13:15, x4 = 14:16)
data3 <- mapply(function(x, y){paste(x,y, sep = ",")}, data1, data2)
data3 <- as.data.frame(data3)
x1 x2 x3 x4
1 1,8 2,9 3,13 4,14
2 2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16

Here's a basic for loop approach:
newdf <- data.frame(matrix(ncol = ncol(data1), nrow = nrow(data1)))
for (i in 1:ncol(data1)) {
  newdf[, i] <- paste(data1[, i], data2[, i], sep = ",")
}
#> newdf
# X1 X2 X3 X4
# 1 1,8 2,9 3,13 4,14
# 2 2,9 3,10 4,14 5,15
# 3 3,10 4,11 5,15 6,16
Line-by-line explanation:
Initialize a new, empty dataframe of the appropriate dimensions:
newdf <- data.frame(matrix(ncol = ncol(data1), nrow = nrow(data1)))
Loop through columns 1, 2, ..., n and fill each column with the paste() results:
for (i in 1:ncol(data1)) {
  newdf[, i] <- paste(data1[, i], data2[, i], sep = ",")
}
Disclaimer: this may be very slow on large datasets - a dplyr or data.table approach (or perhaps a vapply()/sapply()/apply() statement) will be faster, if you are interested in learning those methods.
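As a sketch of the vectorized alternative hinted at above (base R only; `Map()` pairs up the columns of the two data frames, so no index bookkeeping is needed):

```r
# pair up columns of data1 and data2 and paste them element-wise;
# the length-1 sep = "," is recycled for every pair of columns
data3 <- as.data.frame(Map(paste, data1, data2, sep = ","))
```

The result matches the mapply() output shown earlier, since Map() is just mapply() with SIMPLIFY = FALSE.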

Related

How to partition to multiple .csv from df based on whitespace row?

I'm working with a dataset that has a timestamp, 3 numeric vectors, and a character vector.
Basically, each "set" of data is delineated by a delimiter row. I need each series of rows to be saved as a .csv whenever a row shows every column as empty (x = \t\r\n). There are about 370 such sets in my dataset.
For example,
library(dplyr)
data <- data.frame(x1 = 1:4,
                   x2 = 4:1,
                   x3 = 3,
                   x4 = c("text", "no text", "example", "hello"))
new_row <- c("\t\r\n", "\t\r\n", "\t\r\n", "\t\r\n")
data1 <- rbind(data, new_row)
data2 <- data.frame(x1 = 1:4,
                    x2 = 4:1,
                    x3 = 4,
                    x4 = c("text", "no text", "example", "hello"))
data2 <- rbind(data2, new_row)
data3 <- rbind(data1, data2)
View(data3)
This is what my data set looks like (without the timestamp). I need every set of consecutive rows after a row full of \t\r\n to be exported as an individual .csv.
I'm doing text analysis. Each group of rows, with highly variable group size, represents a thread on a different subject. I need to analyze these individual threads.
What is the best way to go about doing this? I haven't had this problem before.
ind <- grepl("\t", data3$x4)
ind <- replace(cumsum(ind), ind, -1)
ind
# [1] 0 0 0 0 -1 1 1 1 1 -1
data4 <- split(data3, ind)
data4
# $`-1`
# x1 x2 x3 x4
# 5 \t\r\n \t\r\n \t\r\n \t\r\n
# 10 \t\r\n \t\r\n \t\r\n \t\r\n
# $`0`
# x1 x2 x3 x4
# 1 1 4 3 text
# 2 2 3 3 no text
# 3 3 2 3 example
# 4 4 1 3 hello
# $`1`
# x1 x2 x3 x4
# 6 1 4 4 text
# 7 2 3 4 no text
# 8 3 2 4 example
# 9 4 1 4 hello
The use of -1 was solely to keep the "\t\r\n" rows from being included in each of their respective groups, and we know that cumsum(ind) should start at 0. You can obviously drop the first frame :-)
From here, you can export with
data4 <- data4[-1]
ign <- Map(write.csv, data4, sprintf("file_%03d.csv", seq_along(data4)))
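Since the question already loads dplyr, the same split can be sketched with `group_split()` (my sketch, not part of the original answer; it drops the delimiter rows in the same step):

```r
library(dplyr)

sep <- grepl("\t", data3$x4)            # flag the delimiter rows
pieces <- data3 %>%
  mutate(grp = cumsum(sep)) %>%         # running group id, bumped at each delimiter
  filter(!sep) %>%                      # discard the "\t\r\n" rows themselves
  group_split(grp, .keep = FALSE)       # one tibble per thread
```

Each element of `pieces` can then be handed to write.csv exactly as in the Map() call above.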

How to split a dataframe and attach the split part in a new column?

I want to split a dataframe on changing values in the first column and afterwards attach the split parts as new columns. An example is given below. However, I end up with a list that I can't process back into a handy dataframe.
The desired output should look like df_goal, which is not yet properly formatted.
#data
x <- c(1, 2, 3)
y <- c(20200101, 20200101, 20200101)
z <- c(4.5, 5, 7)
x_name <- "ID"
y_name <- "Date"
z_name <- "value"
df <- data.frame(x, y, z)
names(df) <- c(x_name, y_name, z_name)
#processing
df$Date <- format(as.Date(as.character(df$Date), format = "%Y%m%d"))
df01 <- split(df, f = df$ID)
#goal
a <-c(1)
b <-c(20200101)
c <-c(4.5)
d <-c(2)
e <-c(20200101)
f <-c(5)
g <-c(3)
h <-c(20200101)
i <-c(7)
df_goal <- data.frame(a,b,c,d,e,f,g,h,i)
You can use Reduce and cbind to cbind each row of a data.frame in one row and keep the type of the columns.
Reduce(function(x,y) cbind(x, df[y,]), 2:nrow(df), df[1,])
# ID Date value ID Date value ID Date value
#1 1 20200101 4.5 2 20200101 5 3 20200101 7
#Equivalent for the sample dataset: cbind(cbind(df[1,], df[2,]), df[3,])
or do.call with split:
do.call(cbind, split(df, 1:nrow(df)))
# 1.ID 1.Date 1.value 2.ID 2.Date 2.value 3.ID 3.Date 3.value
#1 1 20200101 4.5 2 20200101 5 3 20200101 7
#Equivalent for the sample dataset: cbind(df[1,], df[2,], df[3,])
In case you have several rows per ID you can try:
x <- split(df, df$ID)
y <- max(unlist(lapply(x, nrow)))
do.call(cbind, lapply(x, function(i) i[1:y,]))
This is a possible solution for your example :
new_df = data.frame(list(df[1,],df[2,],df[3,]))
And if you want to generalize that on a bigger data.frame :
new_list <- list()
for (i in 1:nrow(df)) {
  new_list[[i]] <- df[i, ]
}
new_df <- data.frame(new_list)
One option could be:
setNames(Reduce(c, asplit(df, 1)), letters[1:Reduce(`*`, dim(df))])
a b c d e f g h i
1.0 20200101.0 4.5 2.0 20200101.0 5.0 3.0 20200101.0 7.0
Maybe you can try the following code
df_goal <- data.frame(t(c(t(df))))
such that
> df_goal
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 1 20200101 4.5 2 20200101 5 3 20200101 7

Removing characters from column value and adding a new letter

I have the following data frame df1. I want to remove "/" from all values in column x2 and add letter v at the end of each value in x2.
df1
x1 x2
1 aa/bb/cc
2 ff/bb/cc
3 uu/bb/cc
Resulting df2
df2
x1 x2
1 aabbccv
2 ffbbccv
3 uubbccv
You can use gsub to remove the / and paste0 to add the v in each row:
df2 <- transform(df1, x2 = paste0(gsub("/", "", x2, fixed = TRUE), "v"))
df2
# x1 x2
#1 1 aabbccv
#2 2 ffbbccv
#3 3 uubbccv
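If you are already in a dplyr workflow, the same transformation can be sketched with mutate() (my addition, assuming dplyr is available; the gsub()/paste0() logic is identical to the answer above):

```r
library(dplyr)

df2 <- df1 %>%
  mutate(x2 = paste0(gsub("/", "", x2, fixed = TRUE), "v"))
```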

Replace strings in all dataframe cells by corresponding entries in another data frame

I have a dataframe with a differing number of names in each cell, which I want to replace with the corresponding numbers from another dataframe. Afterwards, I want to proceed and calculate the mean and maximum, but that's not part of my problem.
df_with_names <-read.table(text="
id names
1 AA,BB
2 AA,CC,DD
3 BB,CC
4 AA,BB,CC,DD
",header=TRUE,sep="")
The dataframe with the corresponding numbers looks like
df_names <-read.table(text="
name number_1 number_2
AA 20 30
BB 12 14
CC 13 29
DD 14 27
",header=TRUE,sep="")
At the end of the first step it should be
id number_1 number_2
1 20,12 30,14
2 20,13,14 30,29,27
3 12,13 14,29
4 20,12,13,14 30,14,29,27
From here I know how to proceed but I don't know how to get there.
I tried to separate the names of each row in a loop into a dataframe and then replace the names, but I always fail to get the right column of df_with_names. After a while, I began to doubt that replace() is the function I am looking for. Who can help?
library(data.table)
dt1 = as.data.table(df_with_names)
dt2 = as.data.table(df_names)
setkey(dt2, name)
dt2[setkey(dt1[, strsplit(as.character(names), split = ","), by = id], V1)][,
lapply(.SD, paste0, collapse = ","), keyby = id]
# id name number_1 number_2
#1: 1 AA,BB 20,12 30,14
#2: 2 AA,CC,DD 20,13,14 30,29,27
#3: 3 BB,CC 12,13 14,29
#4: 4 AA,BB,CC,DD 20,12,13,14 30,14,29,27
The above first splits the names along the comma in the first data.table, then joins that with the second one (after setting keys appropriately) and collapses all of the resulting columns back with a comma.
Another all in one:
data2match <- strsplit(df_with_names$names, ',')
lookup <- function(lookfor, in_df, return_col, search_col = 1) {
  in_df[, return_col][match(lookfor, in_df[, search_col])]
}
output <-
# for each number_x column....
sapply(names(df_names)[-1],
function(y) {
# for each set of names
sapply(data2match,
function(x) paste(sapply(x, lookup, df_names,
y, USE.NAMES=F), collapse=','))
})
data.frame(id=1:nrow(output), output)
Produces:
id number_1 number_2
1 1 20,12 30,14
2 2 20,13,14 30,29,27
3 3 12,13 14,29
4 4 20,12,13,14 30,14,29,27
Note: make sure both dataframes are ordered by id otherwise you may see unexpected results
listing <- df_with_names
listing <- strsplit(as.character(listing$names),",")
col1 <- lapply(listing, function(x) df_names[(df_names[[1]] %in% x),2])
col2 <- lapply(listing, function(x) df_names[(df_names[[1]] %in% x),3])
col1 <- unlist(lapply(col1, paste0, collapse = ","))
col2 <- unlist(lapply(col2, paste0, collapse = ","))
data.frame(number_1 = col1, number_2 = col2 )
number_1 number_2
1 20,12 30,14
2 20,13,14 30,29,27
3 12,13 14,29
4 20,12,13,14 30,14,29,27
I don't like names like "names" or "name", so I renamed the lookup column to "nam":
do.call( rbind, # reassembles the individual lists
apply(df_with_names, 1, # for each row in df_with_names
function(x) lapply( # lapply(..., paste) to each column
# Next line will read each comma separated value and
# and match to rows of df_names[] and return cols 2:3
df_names[ df_names$nam %in% scan(text=x[2], what="", sep=",") ,
2:3, drop=FALSE] , # construct packet of text digits
paste0, collapse=",") ) )
number_1 number_2
[1,] "20,12" "30,14"
[2,] "20,13,14" "30,29,27"
[3,] "12,13" "14,29"
[4,] "20,12,13,14" "30,14,29,27"
(I'm surprised that scan(text= ... a factor variable actually succeeded.)
Another method (here df1 is df_with_names and df2 is df_names):
df3 = data.frame(id = df1$id,
                 number_1 = as.character(df1$names),
                 number_2 = as.character(df1$names), stringsAsFactors = FALSE)
for (n1 in 1:nrow(df3))
  for (n2 in 1:nrow(df2)) {
    df3[n1, 2] = sub(df2[n2, 1], df2[n2, 2], df3[n1, 2])
    df3[n1, 3] = sub(df2[n2, 1], df2[n2, 3], df3[n1, 3])
  }
df3
# id number_1 number_2
#1 1 20,12 30,14
#2 2 20,13,14 30,29,27
#3 3 12,13 14,29
#4 4 20,12,13,14 30,14,29,27
I think it would actually be worth your while to rearrange your df_with_names dataset to make things more straightforward:
spl <- strsplit(as.character(df_with_names$names), ",")
df_with_names <- data.frame(
id=rep(df_with_names$id, sapply(spl, length)),
name=unlist(spl)
)
# id name
#1 1 AA
#2 1 BB
#3 2 AA
#4 2 CC
#5 2 DD
#6 3 BB
#7 3 CC
#8 4 AA
#9 4 BB
#10 4 CC
#11 4 DD
aggregate(
. ~ id,
data=merge(df_with_names, df_names, by="name")[-1],
FUN=function(x) paste(x,collapse=",")
)
# id number_1 number_2
#1 1 20,12 30,14
#2 2 20,13,14 30,29,27
#3 3 12,13 14,29
#4 4 20,12,13,14 30,14,29,27
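As a side note, the long-format rearrangement in the last answer can be written more compactly with tidyr's `separate_rows()` (my sketch, assuming tidyr is available; it replaces the manual strsplit()/rep() step):

```r
library(tidyr)

# one row per (id, name) pair, ready for the merge()/aggregate() step above
long <- separate_rows(df_with_names, names, sep = ",")
```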

Create new data frame depending on the most extreme value in rows

I have the following data frame and I would like to create a new one that will be like the one below.
ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7
New data frame
ID1 ID2 ID3 ID4
X x3 x2 x2 x2
Y y2 y2 y1 y2
Z z2 z2 z2 z2
Basically the idea is the following:
For each ID column I want to find which of the rownames (x1_X, x2_X, x3_X) has the most extreme value, and assign that subgroup to the name X, since the rownames encode subgroups.
My data frame is huge: 1700 columns and 100000 rows.
First we need to split the group and subgroup labels:
grp <- strsplit(row.names(df), "_")
And if performance is an issue, I think data.table is our best choice:
library(data.table)
df$group <- sapply(grp, "[", 2)
subgroup <- sapply(grp, "[", 1)
dt <- data.table(df)
And we now have access to the single line:
result <- dt[,lapply(.SD, function(x) subgroup[.I[which.max(x)]]), by=group]
Which splits the data.table by the character after the underscore (by=group) and then, for every column of the rectangular subset (.SD) we get the index in the sub-rectangle (which.max), and then map it back to the whole data.table (.I), and then extract the relevant subgroup (subgroup).
The data.table package is meant to be quite efficient, though you might want to look into indexing your data.table if you're going to be querying it multiple times.
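A minimal sketch of that indexing suggestion (assuming a reasonably recent data.table; `setindex()` builds a secondary index without physically reordering the table):

```r
library(data.table)

setindex(dt, group)   # secondary index on the grouping column
indices(dt)           # list the indices that now exist
```

Subsequent subsets such as dt[group == "X"] can then use the index instead of scanning every row.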
Your table:
df <- read.table (text= " ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7", header = T)
Split rownames to get groups:
library(plyr)
df_names <- ldply(strsplit(rownames(df), "_"))
colnames(df_names) <- c("group1", "group2")
df2 <- cbind(df, df_names)
Create new table:
df_new <- data.frame(matrix(nrow = length(unique(df2$group2)),
                            ncol = ncol(df)))
colnames(df_new) <- colnames(df)
rownames(df_new) <- unique(df_names[["group2"]])
Filling new table with a loop:
for (i in 1:ncol(df_new)) {
  for (k in 1:nrow(df_new)) {
    col0 <- colnames(df_new)[i]
    row0 <- rownames(df_new)[k]
    sub0 <- df2[df2$group2 == row0, c(col0, "group1")]
    df_new[k, i] <- sub0[sub0[1] == max(sub0[1]), 2]
  }
}
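For comparison, here is a loop-free base-R sketch of the same fill step (my addition; it assumes `df` as read in above, with the group and subgroup labels taken straight from the rownames):

```r
grp <- sub(".*_", "", rownames(df))   # group after the underscore: "X", "Y", "Z"
sg  <- sub("_.*", "", rownames(df))   # subgroup before it: "x1", "x2", ...

# for each column, pick the subgroup label at the within-group maximum
df_new <- sapply(df, function(col)
  tapply(seq_along(col), grp, function(i) sg[i[which.max(col[i])]]))

#   ID1  ID2  ID3  ID4
# X "x3" "x2" "x2" "x2"
# Y "y2" "y2" "y1" "y2"
# Z "z2" "z2" "z2" "z2"
```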
