R: Transform data frame into pseudoCSV

Let's say we have a two-column data frame like this:
A 1
A 2
A 4
A 5
B 2
B 13
C 1
C 3
C 6
C 18
D 8
E 2
E 112
...
Is there a quick way in R to transform it into a two-column data frame like this?
A 1;2;4;5
B 2;13
C 1;3;6;18
D 8
E 2;112
And how can I convert it back to the original structure?
Thank you

A base R option would be (from @David Arenburg's comments)
res1 <- aggregate(Col2 ~ Col1, df1, paste, collapse = ";")
Or using data.table
library(data.table)
res2 <- setDT(df1)[, list(Col2=paste(Col2, collapse=";")), Col1]
Or with dplyr
library(dplyr)
res3 <- df1 %>%
  group_by(Col1) %>%
  summarise(Col2 = paste(Col2, collapse = ";"))
Update
To convert the output back to the original structure
library(splitstackshape)
cSplit(res2, 'Col2', ';', 'long')
data
df1 <- structure(list(Col1 = c("A", "A", "A", "A", "B", "B", "C", "C",
"C", "C", "D", "E", "E"), Col2 = c(1L, 2L, 4L, 5L, 2L, 13L, 1L,
3L, 6L, 18L, 8L, 2L, 112L)), .Names = c("Col1", "Col2"),
class = "data.frame", row.names = c(NA, -13L))

paste() with collapse = ";" is used inside aggregate() to concatenate V2. To return to the original structure, strsplit() splits V2 for each key inside lapply(); do.call() just binds the resulting list of data frames row-wise.
df <- read.table(header = FALSE, text = "
A 1
A 2
A 4
A 5
B 2
B 13
C 1
C 3
C 6
C 18
D 8
E 2
E 112")
df1 <- aggregate(df, by = list(df$V1), FUN = function(x) paste(x, collapse = ";"))[,-2]
names(df1) <- c("V1", "V2")
df1
# V1 V2
#1 A 1;2;4;5
#2 B 2;13
#3 C 1;3;6;18
#4 D 8
#5 E 2;112
df <- do.call(rbind, lapply(unique(df1$V1), function(x) {
  # split the collapsed string for key x and pair every piece with x
  df <- data.frame(x, strsplit(df1[df1$V1 == x, 2], ";"))
  names(df) <- c("V1", "V2")
  df
}))
df
# V1 V2
#1 A 1
#2 A 2
#3 A 4
#4 A 5
#5 B 2
#6 B 13
#7 C 1
#8 C 3
#9 C 6
#10 C 18
#11 D 8
#12 E 2
#13 E 112
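The reverse step can also be written without looping over the keys. The following is only a minimal base R sketch, assuming the collapsed data frame has the columns V1 and V2 shown above; the object names collapsed, parts and long are illustrative and not part of the original answer.

```r
# Collapsed input, in the same shape as the aggregate() result above
collapsed <- data.frame(
  V1 = c("A", "B", "C", "D", "E"),
  V2 = c("1;2;4;5", "2;13", "1;3;6;18", "8", "2;112"),
  stringsAsFactors = FALSE
)

# Split every V2 string on ";" (a list with one character vector per row)
parts <- strsplit(collapsed$V2, ";", fixed = TRUE)

# Repeat each key once per piece and flatten the pieces into one column
long <- data.frame(
  V1 = rep(collapsed$V1, lengths(parts)),
  V2 = as.integer(unlist(parts)),  # assumes whole-number values
  stringsAsFactors = FALSE
)
long
```

The as.integer() call assumes the values are whole numbers; drop it to keep them as character strings.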

Related

Rename columns of a dataframe based on another dataframe except columns not in that dataframe in R

Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)], based on another answer, but it returns "E" "Q" "R" "Z" NA. As you can see, column G becomes NA; I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
Using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7. These are positions in df2, and 6 and 7 do not exist in df1, which has only 5 columns.
To index into df1, we should swap the arguments of match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This works, as long as the matching names appear in the same relative order in names(df1) and df2$Col1.
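For a lookup table whose rows are in an arbitrary order, a single match() call with a logical mask avoids having to keep two index vectors aligned. This is a minimal sketch rather than part of the answers here; the df2 below is deliberately shuffled for illustration, and idx and found are made-up helper names.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)

# Lookup table deliberately in a different order than names(df1)
df2 <- data.frame(
  Col1 = c("H", "D", "K", "A", "C", "B"),
  Col2 = c("a1", "Z", "a3", "E", "R", "Q"),
  stringsAsFactors = FALSE
)

# Position of each column name of df1 inside df2$Col1 (NA when absent)
idx <- match(names(df1), df2$Col1)
found <- !is.na(idx)

# Rename only the columns that were found; the rest keep their names
names(df1)[found] <- df2$Col2[idx[found]]
names(df1)
```

Because idx is computed once and both sides are subset with the same mask, each new name lands on the column it was looked up for, whatever the row order of df2.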
A solution using the rename_with function from the dplyr package.
library(dplyr)
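# keep only the lookup rows whose old name exists in df1
df3 <- df2 %>%
  filter(Col1 %in% names(df1))
# match() returns the new names in the order of the selected columns
df4 <- df1 %>%
  rename_with(.cols = all_of(df3$Col1), .fn = function(x) df3$Col2[match(x, df3$Col1)])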
df4
# E Q R Z G
# 1 1 2 3 4 5

Transpose Rows in batches to Columns in R

My data.frame df looks like this:
A 1
A 2
A 5
B 2
B 3
B 4
C 3
C 7
C 9
I want it to look like this:
A B C
1 2 3
2 3 7
5 4 9
I have tried spread() but probably not in the right way. Any ideas?
We can use unstack from base R
unstack(df1, col2 ~ col1)
# A B C
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or with split
data.frame(split(df1$col2, df1$col1))
Or if we use spread or pivot_wider, first create a sequence column so that each value has a unique row position within its group
library(dplyr)
library(tidyr)
df1 %>%
group_by(col1) %>%
mutate(rn = row_number()) %>%
ungroup %>%
pivot_wider(names_from = col1, values_from = col2) %>%
# or use
# spread(col1, col2) %>%
select(-rn)
# A tibble: 3 x 3
# A B C
# <int> <int> <int>
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or using dcast
library(data.table)
dcast(setDT(df1), rowid(col1) ~ col1)[, .(A, B, C)]
data
df1 <- structure(list(col1 = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), col2 = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 7L, 9L)),
class = "data.frame", row.names = c(NA,
-9L))
In data.table, we can use dcast:
library(data.table)
dcast(setDT(df), rowid(col1)~col1, value.var = 'col2')[, col1 := NULL][]
# A B C
#1: 1 2 3
#2: 2 3 7
#3: 5 4 9
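The sequence column mentioned above can also be built in base R with ave(). The sketch below is an illustration under one extra assumption that is not in the question: the groups have unequal sizes (B has only two values), a case where data.frame(split(...)) would fail. The names rn and wide are illustrative.

```r
df1 <- data.frame(
  col1 = c("A", "A", "A", "B", "B", "C", "C", "C"),
  col2 = c(1L, 2L, 5L, 2L, 3L, 3L, 7L, 9L),
  stringsAsFactors = FALSE
)

# Running row number within each group of col1: 1 2 3 1 2 1 2 3
rn <- ave(seq_along(df1$col1), df1$col1, FUN = seq_along)

# One cell per (row number, group); missing combinations become NA
wide <- as.data.frame(
  tapply(df1$col2, list(rn, df1$col1), function(v) v[1])
)
wide
```

Shorter groups are padded with NA at the bottom, so the result stays rectangular.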

How to sum df when it contains characters?

I am trying to prep my data and I am stuck on one issue. Let's say I have the following data frame:
df1
Name C1 Val1
A a x1
A a x2
A b x3
A c x4
B d x5
B d x6
...
and I want to narrow down the df to
df2
Name C1 Val
A a,b,c x1+x2+x3+x4
B d x5+x6
...
where the entries in C1 (a, b, c, d) are character values and the x values are numeric.
I have tried sapply, rowsum and
df2 <- aggregate(df1, list(df1[,1]), FUN = summary)
but none of them collects the character values into a list for each Name.
Can someone help me get df2?
# sum numeric columns; collapse the unique values of every other column
m <- function(x) if (is.numeric(x <- type.convert(x, as.is = TRUE))) sum(x) else toString(unique(x))
aggregate(. ~ Name, df1, m)
Name C1 Val1
1 A a, b, c 10
2 B d 11
where
df1
Name C1 Val1
1 A a 1
2 A a 2
3 A b 3
4 A c 4
5 B d 5
6 B d 6
This is your df; I gave Val1 the values 1 to 6.
df <-
structure(list(Name = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), C1 = structure(c(1L, 1L, 2L, 3L, 4L,
4L), .Label = c("a", "b", "c", "d"), class = "factor"), Val1 = 1:6), row.names = c(NA,
-6L), class = "data.frame")
We just use summarise:
library(dplyr)
df %>%
  group_by(Name) %>%
  summarise(C1 = paste(unique(C1), collapse = ","), Val1 = sum(Val1))
# A tibble: 2 x 3
Name C1 Val1
<fct> <chr> <int>
1 A a,b,c 10
2 B d 11
Quick and easy dplyr solution:
library(dplyr)
library(stringr)
df1 %>%
mutate(Val1_num = as.numeric(str_extract(Val1, "\\d+"))) %>%
group_by(Name) %>%
summarise(C1 = paste(unique(C1), collapse = ","),
Val1 = paste(unique(Val1), collapse = "+"),
Val1_num = sum(Val1_num))
#> # A tibble: 2 x 4
#> Name C1 Val1 Val1_num
#> <chr> <chr> <chr> <dbl>
#> 1 A a,b,c x1+x2+x3+x4 10
#> 2 B d x5+x6 11
Or in base:
df2 <- aggregate(df1, list(df1[,1]), FUN = function(x) {
if (all(grepl("\\d", x))) {
sum(as.numeric(gsub("[^[:digit:]]", "", x)))
} else {
paste(unique(x), collapse = ",")
}
})
df2
#> Group.1 Name C1 Val1
#> 1 A A a,b,c 10
#> 2 B B d 11
data
df1 <- read.csv(text = "
Name,C1,Val1
A,a,x1
A,a,x2
A,b,x3
A,c,x4
B,d,x5
B,d,x6", stringsAsFactors = FALSE)
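If the two columns need different treatment anyway, another way to look at the problem is to aggregate each column on its own and merge the results. This is a minimal base R sketch, assuming Val1 already holds numbers (as in the first answer); labels, totals and df2 are illustrative names.

```r
df1 <- data.frame(
  Name = c("A", "A", "A", "A", "B", "B"),
  C1   = c("a", "a", "b", "c", "d", "d"),
  Val1 = 1:6,  # assumed numeric stand-ins for x1..x6
  stringsAsFactors = FALSE
)

# Collapse the unique character values per Name
labels <- aggregate(C1 ~ Name, data = df1,
                    FUN = function(x) paste(unique(x), collapse = ","))

# Sum the numeric values per Name
totals <- aggregate(Val1 ~ Name, data = df1, FUN = sum)

# Join the two summaries on Name
df2 <- merge(labels, totals, by = "Name")
df2
```

This avoids the type check inside the aggregation function and the duplicated Group.1 column, at the cost of one aggregate() call per kind of column.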

How to use R to get all pairs from two columns with an index

I would like to use R to get all pairs within each index from a two-column table. It may need a loop to do this. For example, turn these two columns, holding the gene name and the index:
a 1,
b 1,
c 1,
d 2,
e 2
into a new matrix
a b 1,
b c 1,
a c 1,
d e 2
Can anyone help?
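A tidyverse option using combn on a grouped data.frame (as_data_frame() and the .sep argument of unnest() have since been deprecated in tibble and tidyr; as_tibble() and names_sep are their replacements):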
library(tidyverse)
df %>% group_by(index) %>%
summarise(gene = list(as_data_frame(t(combn(gene, 2))))) %>%
unnest(.sep = '_')
## # A tibble: 4 × 3
## index gene_V1 gene_V2
## <int> <chr> <chr>
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
The same logic can be replicated in base R:
df2 <- aggregate(gene ~ index, df, function(x){t(combn(x, 2))})
do.call(rbind, apply(df2, 1, data.frame))
## index gene.1 gene.2
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
Data
df <- structure(list(gene = c("a", "b", "c", "d", "e"), index = c(1L,
1L, 1L, 2L, 2L)), .Names = c("gene", "index"), row.names = c(NA,
-5L), class = "data.frame")
Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'index', get the combn of 'gene', transpose it, and set the names of the 2nd and 3rd columns (if needed).
library(data.table)
setnames(setDT(df)[, transpose(combn(gene, 2, FUN = list)),
by = index], 2:3, paste0("gene", 1:2))[]
# index gene1 gene2
#1: 1 a b
#2: 1 a c
#3: 1 b c
#4: 2 d e
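The approaches above call combn(gene, 2) for every group, which raises an error when an index has only one gene. The following base R sketch shows one possible way to skip such groups; the extra gene "f" with index 3 is an assumption added for illustration, and by_index and pairs are made-up names.

```r
df <- data.frame(
  gene  = c("a", "b", "c", "d", "e", "f"),
  index = c(1L, 1L, 1L, 2L, 2L, 3L),  # index 3 has a single gene
  stringsAsFactors = FALSE
)

# One character vector of genes per index value
by_index <- split(df$gene, df$index)

pairs <- do.call(rbind, lapply(names(by_index), function(i) {
  g <- by_index[[i]]
  if (length(g) < 2) return(NULL)  # no pair possible, skip this index
  m <- t(combn(g, 2))              # one row per pair
  data.frame(index = as.integer(i), gene1 = m[, 1], gene2 = m[, 2],
             stringsAsFactors = FALSE)
}))
pairs
```

Groups that return NULL are simply dropped by rbind, so index 3 does not appear in the result.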

Merging files on the basis of columns

I have multiple files with many rows and three columns, and I need to merge them wherever the first two columns match.
File 1
12 13 a
13 15 b
14 17 c
4 9 d
. . .
. . .
81 23 h
File 2
12 13 e
3 10 b
14 17 c
4 9 j
. . .
. . .
1 2 k
File 3
12 13 m
13 15 k
1 7 x
24 9 d
. . .
. . .
1 2 h
and so on.
I want to merge them to obtain the following result
12 13 a e m
13 15 b k
14 17 c c
4 9 d j
3 10 b
24 9 d
. . .
. . .
81 23 h
1 2 k
1 7 x
The first thing that usually comes to mind with these types of problems is merge, perhaps in conjunction with a Reduce(function(x, y) merge(x, y, by = "somecols", all = TRUE), yourListOfDataFrames).
However, merge is not always the most efficient function, especially since it looks like you want to "collapse" all the values to fill in the rows from left to right, which would not be the default merge behavior.
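For comparison, the Reduce plus merge idea could look roughly like the sketch below. It rebuilds the three sample data frames used in this answer with data.frame(); the renaming of the third column to V3_1, V3_2, V3_3 is an assumption made only to keep the merged column names distinct.

```r
file1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

files <- list(file1, file2, file3)

# Give each value column its own name so merge() has no name clashes
for (i in seq_along(files)) names(files[[i]])[3] <- paste0("V3_", i)

# Full outer join on the two key columns, applied pairwise over the list
merged <- Reduce(function(x, y) merge(x, y, by = c("V1", "V2"), all = TRUE),
                 files)
merged
```

Each file keeps its own column here, so the row for 13 15 holds b in V3_1, NA in V3_2 and k in V3_3 rather than being filled from left to right.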
Instead, I suggest you stack everything into one long data.frame and reshape it after you have added an index variable.
Here are two approaches:
Option 1: "dplyr" + "tidyr"
Use mget to put all of your data.frames into a list.
Use bind_rows to convert that list into a single data.frame (bind_rows replaces rbind_all, which is no longer available in current versions of "dplyr").
Use sequence(n()) in mutate from "dplyr" to group the data and create an index.
Use spread from "tidyr" to transform from a "long" format to a "wide" format.
library(dplyr)
library(tidyr)
combined <- bind_rows(mget(ls(pattern = "^file\\d")))
combined %>%
group_by(V1, V2) %>%
mutate(time = sequence(n())) %>%
ungroup() %>%
spread(time, V3, fill = "")
# Source: local data frame [7 x 5]
#
# V1 V2 1 2 3
# 1 1 7 x
# 2 3 10 b
# 3 4 9 d j
# 4 12 13 a e m
# 5 13 15 b k
# 6 14 17 c c
# 7 24 9 d
Option 2: "data.table"
Use mget to put all of your data.frames into a list.
Use rbindlist to convert that list into a single data.table.
Use sequence(.N) to generate your sequence by your groups.
Use dcast.data.table to convert the "long" data.table into a "wide" one.
library(data.table)
dcast.data.table(
rbindlist(mget(ls(pattern = "^file\\d")))[,
time := sequence(.N), by = list(V1, V2)],
V1 + V2 ~ time, value.var = "V3", fill = "")
# V1 V2 1 2 3
# 1: 1 7 x
# 2: 3 10 b
# 3: 4 9 d j
# 4: 12 13 a e m
# 5: 13 15 b k
# 6: 14 17 c c
# 7: 24 9 d
Both of these answers assume we are starting with the following sample data:
file1 <- structure(
list(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
V3 = c("a", "b", "c", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file2 <- structure(
list(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
V3 = c("e", "b", "c", "j")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
file3 <- structure(
list(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
V3 = c("m", "k", "x", "d")), .Names = c("V1", "V2", "V3"),
class = "data.frame", row.names = c(NA, -4L))
