How can I reshape my data, moving rows to new columns? - r

I know that my problem is trivial, but I'm currently learning different ways to reshape data, so please bear with me.
I have data like this:
Input = (
'col1 col2
A 2
B 4
A 7
B 3
A 4
B 2
A 4
B 6
A 3
B 3')
df = read.table(textConnection(Input), header = T)
> df
col1 col2
1 A 2
2 B 4
3 A 7
4 B 3
5 A 4
6 B 2
7 A 4
8 B 6
9 A 3
10 B 3
And I'd like to have something like this, where the column names are not important:
col1 v1 v2 v3 v4 v5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
So far, I did something like:
res_1 <- aggregate(col2 ~., df, toString)
col1 col2
1 A 2, 7, 4, 4, 3
2 B 4, 3, 2, 6, 3
This works, but the values end up comma-separated in a single column instead of in separate columns, so I split them:
res_2 <- do.call("rbind", strsplit(res_1$col2, ","))
[,1] [,2] [,3] [,4] [,5]
[1,] "2" " 7" " 4" " 4" " 3"
[2,] "4" " 3" " 2" " 6" " 3"
And finally I combine the two and remove the unnecessary column:
final <- cbind(res_1,res_2)
final$col2 <- NULL
col1 1 2 3 4 5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
So I have my desired output, but I'm not satisfied with the method; I'm sure there is a short, simple command for this. As I said, I'd like to learn more elegant options using different packages.
Thanks!

You can simply do,
do.call(rbind, split(df$col2, df$col1))
# [,1] [,2] [,3] [,4] [,5]
#A 2 7 4 4 3
#B 4 3 2 6 3
You can wrap it in data.frame() to convert the matrix to a data frame.
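As a minimal sketch of that wrapping step: the data below mirrors the question, while the v1..v5 column names and the `wide` variable are my own choices for illustration, not part of the original answer.

```r
# Data as in the question
df <- data.frame(col1 = rep(c("A", "B"), 5),
                 col2 = c(2, 4, 7, 3, 4, 2, 4, 6, 3, 3))

# One matrix row per group, as in the answer above
m <- do.call(rbind, split(df$col2, df$col1))

# Turn the row names into a real column and name the value columns
wide <- data.frame(col1 = rownames(m), m, row.names = NULL)
names(wide)[-1] <- paste0("v", seq_len(ncol(m)))
wide
```

Note that rbind() only lines up cleanly here because every group has the same number of values.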

The question is tagged with reshape2 and reshape so we show the use of that package and the base reshape function. Also the use of dplyr/tidyr is illustrated. Finally we show a data.table solution and a second base R solution using xtabs.
1) reshape2 Add a group column and then convert from long to wide form:
library(reshape2)
df2 <- transform(df, group = paste0("v", ave(1:nrow(df), col1, FUN = seq_along)))
dcast(df2, col1 ~ group, value.var = "col2")
giving:
col1 v1 v2 v3 v4 v5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
2) reshape Using df2 from (1) we have the following base R solution using the reshape function:
wide <- reshape(df2, dir = "wide", idvar = "col1", timevar = "group")
names(wide) <- sub(".*\\.", "", names(wide))
wide
giving:
col1 v1 v2 v3 v4 v5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
3) dplyr/tidyr
library(dplyr)
library(tidyr)
df %>%
group_by(col1) %>%
mutate(group = paste0("v", row_number())) %>%
ungroup %>%
pivot_wider(names_from = "group", values_from = "col2")
giving:
# A tibble: 2 x 6
col1 v1 v2 v3 v4 v5
<fct> <int> <int> <int> <int> <int>
1 A 2 7 4 4 3
2 B 4 3 2 6 3
4) data.table
library(data.table)
as.data.table(df)[, as.list(col2), by = col1]
giving:
col1 V1 V2 V3 V4 V5
1: A 2 7 4 4 3
2: B 4 3 2 6 3
5) xtabs Another base R solution uses df2 from (1) and xtabs. This produces an object of class c("xtabs", "table"). Note that it labels the dimensions.
xtabs(col2 ~., df2)
giving:
group
col1 v1 v2 v3 v4 v5
A 2 7 4 4 3
B 4 3 2 6 3

Related

Crossing .name_repair with duplicated column names

I would like to combine two data frames using crossing, but some of their columns have the same names. For those columns, I would like to add "_nameofdataframe" to the names. Here are some reproducible data frames (dput below):
> df1
person V1 V2 V3
1 A 1 3 3
2 B 4 4 5
3 C 2 1 1
> df2
V2 V3
1 2 5
2 1 6
3 1 2
When I run the following code it will return duplicated column names:
library(tidyr)
crossing(df1, df2, .name_repair = "minimal")
#> # A tibble: 9 × 6
#> person V1 V2 V3 V2 V3
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 3 3 1 2
#> 2 A 1 3 3 1 6
#> 3 A 1 3 3 2 5
#> 4 B 4 4 5 1 2
#> 5 B 4 4 5 1 6
#> 6 B 4 4 5 2 5
#> 7 C 2 1 1 1 2
#> 8 C 2 1 1 1 6
#> 9 C 2 1 1 2 5
As you can see, the returned column names are duplicated. My desired output looks like this:
person V1 V2_df1 V3_df1 V2_df2 V3_df2
1 A 1 3 3 1 2
2 A 1 3 3 1 6
3 A 1 3 3 2 5
4 B 4 4 5 1 2
5 B 4 4 5 1 6
6 B 4 4 5 2 5
7 C 2 1 1 1 2
8 C 2 1 1 1 6
9 C 2 1 1 2 5
So I was wondering if anyone knows a more automatic way to give the duplicated column names a name like in the desired output above with crossing?
dput of df1 and df2:
df1 <- structure(list(person = c("A", "B", "C"), V1 = c(1, 4, 2), V2 = c(3,
4, 1), V3 = c(3, 5, 1)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(V2 = c(2, 1, 1), V3 = c(5, 6, 2)), class = "data.frame", row.names = c(NA,
-3L))
As you probably know, the .name_repair parameter can take a function. The problem is crossing() only passes that function one argument, a vector of the concatenated column names() of both data frames. So we can't easily pass the names of the data frame objects to it. It seems to me that there are two solutions:
Manually add the desired suffix to an anonymous function.
Create a wrapper function around crossing().
1. Manually add the desired suffix to an anonymous function
We can simply supply the suffixes as a default argument of the anonymous function passed to .name_repair, e.g. suffix = c("_df1", "_df2").
crossing(
df1,
df2,
.name_repair = \(x, suffix = c("_df1", "_df2")) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
# person V1 V2_df1 V3_df1 V2_df2 V3_df2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 3 3 1 2
# 2 A 1 3 3 1 6
# 3 A 1 3 3 2 5
# 4 B 4 4 5 1 2
# 5 B 4 4 5 1 6
# 6 B 4 4 5 2 5
# 7 C 2 1 1 1 2
# 8 C 2 1 1 1 6
# 9 C 2 1 1 2 5
The disadvantage of this is that there is room for error when typing the suffix, and that we might forget to change it if we rename the data frames.
Also note that we are checking for names which appear twice. If one of your original data frames already has broken (duplicated) names then this function will also rename those columns. But I think it would be unwise to try to do any type of join if either data frame did not have unique column names.
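One way to avoid relying on the "appears exactly twice" check is to use positions instead. The sketch below assumes, as described above, that the repair function receives the names of the first data frame followed by the names of the second. `make_repair` is a hypothetical helper, not part of tidyr.

```r
df1 <- data.frame(person = c("A", "B", "C"), V1 = c(1, 4, 2),
                  V2 = c(3, 4, 1), V3 = c(3, 5, 1))
df2 <- data.frame(V2 = c(2, 1, 1), V3 = c(5, 6, 2))

# Hypothetical helper: returns a repair function that suffixes only the
# names shared between the first n_left names and the remaining ones
make_repair <- function(n_left, suffix = c("_df1", "_df2")) {
  function(x) {
    from_left <- seq_along(x) <= n_left
    shared <- x %in% intersect(x[from_left], x[!from_left])
    x[shared & from_left] <- paste0(x[shared & from_left], suffix[1])
    x[shared & !from_left] <- paste0(x[shared & !from_left], suffix[2])
    x
  }
}

repair <- make_repair(ncol(df1))
repaired <- repair(c(names(df1), names(df2)))
repaired
# Intended use (requires tidyr):
# tidyr::crossing(df1, df2, .name_repair = make_repair(ncol(df1)))
```

Because the split point is given explicitly, a name that is already duplicated inside one data frame is not mistaken for a clash between the two.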
2. Create a wrapper function around crossing()
This might be more in the spirit of the tidyverse. The crossing() docs to which you linked state that crossing() is a wrapper around expand_grid(). The source for expand_grid() shows that it is basically a wrapper which uses map() to apply vctrs::vec_rep() to some inputs. So if we want to add another function to the call stack, there are two ways I can think of:
Using deparse(substitute())
crossing_fix_names <- function(df_1, df_2) {
suffixes <- paste0(
"_",
c(deparse(substitute(df_1)), deparse(substitute(df_2)))
)
crossing(
df_1,
df_2,
.name_repair = \(x, suffix = suffixes) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
}
# Output the same as above
crossing_fix_names(df1, df2)
The disadvantage of this is that deparse(substitute()) is ugly and can occasionally have surprising behaviour. The advantage is we do not need to remember to manually add the suffixes.
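To make the "surprising behaviour" concrete, here is a small sketch. `label_of` is a hypothetical helper that only returns the label deparse(substitute()) would produce.

```r
# Hypothetical helper: report the text of whatever was passed as x
label_of <- function(x) deparse(substitute(x))

df1 <- data.frame(V2 = c(3, 4, 1))

name_label <- label_of(df1)           # "df1"
expr_label <- label_of(head(df1, 2))  # "head(df1, 2)"
name_label
expr_label
```

So a wrapper built this way gives clean suffixes only when it is called with bare variable names; passing an expression would put the whole expression text into the column names.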
Using match.call()
crossing_fix_names2 <- function(df_1, df_2) {
args <- as.list(match.call())
suffixes <- paste0(
"_",
c(
args$df_1,
args$df_2
)
)
crossing(
df_1,
df_2,
.name_repair = \(x, suffix = suffixes) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
}
# Also the same output
crossing_fix_names2(df1, df2)
As we don't have the drawbacks of deparse(substitute()) and we don't have to manually specify the suffix, I think this is probably the best approach.
Test for the condition using the data from the dput output:
colnames(df1) %in% colnames(df2)
[1] FALSE FALSE TRUE TRUE
Rename the columns of df2:
colnames(df2) <- paste0(colnames(df2), '_df2')
Then cbind:
cbind(df1,df2)
person V1 V2 V3 V2_df2 V3_df2
1 A 1 3 3 2 5
2 B 4 4 5 1 6
3 C 2 1 1 1 2
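Not so elegant, but the columns are easy to tell apart later. Note that cbind() pairs the rows by position rather than forming all combinations as crossing() does.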
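If all combinations of rows are still wanted, a base R sketch is merge() with by = NULL, which returns the Cartesian product and applies the suffixes argument to clashing names. This is a different technique from crossing(): unlike crossing(), merge() does not de-duplicate or sort its inputs, so the row order differs.

```r
df1 <- data.frame(person = c("A", "B", "C"), V1 = c(1, 4, 2),
                  V2 = c(3, 4, 1), V3 = c(3, 5, 1))
df2 <- data.frame(V2 = c(2, 1, 1), V3 = c(5, 6, 2))

# by = NULL requests the Cartesian product; suffixes disambiguate shared names
crossed <- merge(df1, df2, by = NULL, suffixes = c("_df1", "_df2"))
names(crossed)
nrow(crossed)
```

The suffixes still have to be typed by hand, so this shares the drawback of the first approach above.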

split columns in "the middle" in R

I have an example data frame as such:
df_1 <- as.data.frame(cbind(c(14, 27, 38), c(25, 33, 52), c(85, 12, 23)))
Now, I want to split all these columns down the middle so that I get something that looks like this:
df_2 <- as.data.frame(cbind(c(1, 2, 3), c(4, 7, 8), c(2,3,5), c(5, 3, 2), c(8, 1, 2), c(5, 2, 3)))
So my question then is: Is there a command/package that can do this automatically?
In my real data frame I am looking to split columns by name, from an earlier regression where I got the names by inserting:
paste0(names(df)[i], "~", names(df)[j]) into my loop.
My thought, however, is that this will be quite easy once I find the right command for the data frames given above.
Thanks in advance!
You can use strsplit in base R:
as.data.frame(t(apply(df_1, 1, \(x) as.numeric(unlist(strsplit(as.character(x), ""))))))
V1 V2 V3 V4 V5 V6
1 1 4 2 5 8 5
2 2 7 3 3 1 2
3 3 8 5 2 2 3
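For the numeric example specifically, the split can also be sketched with integer arithmetic instead of string handling. This assumes every value has exactly two digits, which holds for df_1 but may not hold for real data.

```r
df_1 <- as.data.frame(cbind(c(14, 27, 38), c(25, 33, 52), c(85, 12, 23)))

# For each column, put the tens digit and the ones digit side by side
digit_cols <- lapply(unname(as.list(df_1)),
                     function(v) cbind(v %/% 10, v %% 10))

df_2 <- as.data.frame(do.call(cbind, digit_cols))
names(df_2) <- paste0("V", seq_along(df_2))
df_2
```

This keeps the result numeric throughout, so no conversion back from character is needed.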
Another possible solution:
library(tidyverse)
map(df_1, ~ str_split(.x, "", simplify = T)) %>% as.data.frame %>%
`names<-`(str_c("V", 1:ncol(.))) %>% type.convert(as.is = T)
#> V1 V2 V3 V4 V5 V6
#> 1 1 4 2 5 8 5
#> 2 2 7 3 3 1 2
#> 3 3 8 5 2 2 3
Thanks for the answers, they were a lot of help!
I ended up using the tidyr package with command:
test <- as.data.frame(separate(data = test, col = "V1", into = c("col_1", "col_2"), sep = "\\~"))
This worked great for me since I ran a regression earlier and had a good operator for separation: "~"
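A base R sketch of the same split on "~" is shown below. The `test` data frame and its values are made up for illustration; only the "~" separator and the col_1/col_2 names come from the post above.

```r
# Hypothetical input: names built with paste0(a, "~", b)
test <- data.frame(V1 = c("mpg~cyl", "mpg~disp", "hp~wt"),
                   stringsAsFactors = FALSE)

# Split each string at the literal "~" and bind the pieces into a matrix
parts <- do.call(rbind, strsplit(test$V1, "~", fixed = TRUE))

test$col_1 <- parts[, 1]
test$col_2 <- parts[, 2]
test
```

This relies on every string containing exactly one "~"; otherwise the pieces would not line up into two columns.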
A base R option would be to use read.fwf:
v1 <- do.call(paste0, df_1)
read.fwf(textConnection(v1), widths = rep(1, max(nchar(v1))))
Output
V1 V2 V3 V4 V5 V6
1 1 4 2 5 8 5
2 2 7 3 3 1 2
3 3 8 5 2 2 3
Another option is to use the splitstackshape package:
library(magrittr)    # for %>%
library(data.table)  # for setnames()
df_2 <- df_1 %>%
splitstackshape::cSplit(., names(.), sep = "", stripWhite = F, type.convert = F) %>%
setnames(paste0("V", 1:ncol(.)))
Output
df_2
V1 V2 V3 V4 V5 V6
1: 1 4 2 5 8 5
2: 2 7 3 3 1 2
3: 3 8 5 2 2 3

How to remove just a part of a string in a data frame in R?

I would like to remove part of a string from a V2 column in a df.
df
V1 V2
3 scale_KD_1
10 scale_KD_5
4 scale_KD_10
7 scale_KD_7
The desired outcome would be:
df
V1 V2
3 1
10 5
4 10
7 7
Using the readr and stringr packages (dplyr supplies mutate() and %>%):
library(dplyr)
library(readr)
df %>% mutate(V2 = parse_number(V2))
V1 V2
1 3 1
2 10 5
3 4 10
4 7 7
library(stringr)
df %>% mutate(V2 = str_remove(V2, '.*_'))
V1 V2
1 3 1
2 10 5
3 4 10
4 7 7
There are many ways to accomplish this. Just check which one is faster. Besides the ones mentioned by @Karthik S, you can try these ones:
library(dplyr)
library(stringr)
df %>%
mutate(V2 = str_extract(V2, '\\d+$'))
df %>%
mutate(V2 = str_remove(V2, '\\D+'))
V1 V2
1 3 1
2 10 5
3 4 10
4 7 7
You can use sub to remove everything until _:
df$V2 <- sub(".*_", "", df$V2)
#df$V2 <- sub("\\D*", "", df$V2) #Some Alternatives
#df$V2 <- sub("[^[:digit:]]*", "", df$V2)
df
# V1 V2
#1 3 1
#2 10 5
#3 4 10
#4 7 7
Data:
df <- read.table(header=T, text=" V1 V2
3 scale_KD_1
10 scale_KD_5
4 scale_KD_10
7 scale_KD_7")
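One detail worth noting: sub() and the stringr functions return character values, whereas parse_number() returns numbers. If a numeric column is wanted from the base R approach, a minimal sketch is to wrap the result in as.integer():

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "V1 V2
3 scale_KD_1
10 scale_KD_5
4 scale_KD_10
7 scale_KD_7")

# Drop everything up to the last underscore, then convert to integer
df$V2 <- as.integer(sub(".*_", "", df$V2))
str(df)
```

After this, V2 sorts numerically (1, 5, 7, 10) rather than alphabetically.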

Master view of multiple dataframes with common columns

I have three dataframes like below:
df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','B','C','E','I'),col2=c(4,6,8,2,9))
df1 <- data.frame(col1=c('A','D','C','E','I'),col2=c(4,7,8,2,9))
The differences between any two files could be as below:
anti_join(df2, df3)
# Joining, by = c("col1", "col2")
# col1 col2
# 1 B 6
# 2 I 9
anti_join(df3, df2)
# Joining, by = c("col1", "col2")
# [1] col1 col2
# <0 rows> (or 0-length row.names)
anti_join(df1, df2)
# Joining, by = c("col1", "col2")
# col1 col2
# 1 D 7
anti_join(df2, df1)
# Joining, by = c("col1", "col2")
# col1 col2
# 1 B 6
I would like to create a master dataframe with all the values in col1 and col2 specific to each dataframe. If there is no such value present, it should populate NA.
col1 df1_col2 df2_col2 df3_col2
1 A 4 4 4
2 B NA 6 NA
3 C 8 8 8
4 E 2 2 2
5 I 9 9 NA
6 D 7 NA NA
The essence of the above output could be established from the above anti_join commands. However, it does not provide the complete picture at once. Any thoughts on how to achieve this?
Edit: For multiple values in col2 for col1, the output is a little messier. For example, A has values 4, 3.
df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','A','B','C','E','I'),col2=c(4,3,6,8,2,9))
df1 <- data.frame(col1=c('A','A','D','C','E','I'),col2=c(4,3,7,8,2,9))
lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
lst_of_frames %>%
imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
reduce(full_join, by = "col1")
It gives the below output.
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 A 4 3 4
# 3 A 3 4 4
# 4 A 3 3 4
# 5 D 7 NA NA
# 6 C 8 8 8
# 7 E 2 2 2
# 8 I 9 9 NA
# 9 B NA 6 NA
The interesting part of the output is:
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 A 4 3 4
# 3 A 3 4 4
# 4 A 3 3 4
whereas the expected output is:
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 A 3 3 NA
You may use the full_join function from the dplyr package.
library(dplyr)
df_master <- df1 %>%
full_join(df2, by = "col1") %>%
full_join(df3, by = "col1") %>%
select(col1, df1_col2 = col2.x,
df2_col2 = col2.y,
df3_col2 = col2)
col1 df1_col2 df2_col2 df3_col2
1 A 4 4 4
2 D 7 NA NA
3 C 8 8 8
4 E 2 2 2
5 I 9 9 NA
6 B NA 6 NA
Similar to @tamtam's answer, but a little more programmatic if you have a dynamic list of frames.
lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
# lst_of_frames <- tibble::lst(df1, df2, df3) # thanks, @user63230
library(dplyr)
library(purrr) # imap, reduce
lst_of_frames %>%
imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
reduce(full_join, by = "col1")
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 D 7 NA NA
# 3 C 8 8 8
# 4 E 2 2 2
# 5 I 9 9 NA
# 6 B NA 6 NA
It's important (for automatically renaming the columns) that the list of frames be a named list. I used the variable names, list(df1 = df1), but it could just as easily be list(A = df1), which would produce a column named A_col2 in the end.
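For the edited question, where col1 can repeat, one possible sketch is to join on the value as well as on col1, so that a row only matches when both agree. This is my reading of the expected output, not something confirmed in the thread. It uses base merge() rather than full_join(), and `key` is a temporary helper column I introduce here.

```r
df3 <- data.frame(col1 = c("A", "C", "E"), col2 = c(4, 8, 2))
df2 <- data.frame(col1 = c("A", "A", "B", "C", "E", "I"), col2 = c(4, 3, 6, 8, 2, 9))
df1 <- data.frame(col1 = c("A", "A", "D", "C", "E", "I"), col2 = c(4, 3, 7, 8, 2, 9))
frames <- list(df1 = df1, df2 = df2, df3 = df3)

# Copy the value into a join key, then rename col2 after its frame
keyed <- Map(function(d, nm) {
  d$key <- d$col2
  names(d)[names(d) == "col2"] <- paste0(nm, "_col2")
  d
}, frames, names(frames))

# Full outer join on both col1 and the value
master <- Reduce(function(a, b) merge(a, b, by = c("col1", "key"), all = TRUE),
                 keyed)
master$key <- NULL
master
```

With this data, A gets one row for the value 4 (present in all three frames) and one for the value 3 (NA for df3). The rows come back sorted by col1, unlike the full_join() output above.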

Flatten data frame and shift rows to columns

I have a data frame like so:
df <- data.frame(
id = c(1, 1, 2, 2),
V1 = c(1:4),
V2 = c(5:8),
V3 = c(9:12))
Printed to the console it looks like this:
# id V1 V2 V3
# 1 1 1 5 9
# 2 1 2 6 10
# 3 2 3 7 11
# 4 2 4 8 12
Now, I would like to transform it to this shape:
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
How can I do this with base R or the tidyverse?
A possible tidyverse solution:
library(dplyr)
library(tidyr)
wide <- df %>%
group_by(id) %>%
mutate(obs = row_number()) %>%
gather(var, val, V1:V3) %>%
unite(comb, obs, var) %>%
spread(comb, val)
colnames(wide)[-1] <- paste("V", seq(1,ncol(wide) -1), sep = "")
# A tibble: 2 x 7
# Groups: id [2]
# id V1 V2 V3 V4 V5 V6
#1 1 1 5 9 2 6 10
#2 2 3 7 11 4 8 12
You could do it using by(), for example:
df2 <- do.call(rbind,
by(df, df$id, function(x) c(x[1, "id"], as.vector(t(x[names(x) != "id"]))))
)
colnames(df2) <- c("id", paste0("V", seq(ncol(df2)-1)))
id V1 V2 V3 V4 V5 V6
1 1 1 5 9 2 6 10
2 2 3 7 11 4 8 12
Base R (this version assumes exactly two rows per id):
lists <- Map(function(x) data.frame(c(x[1,], x[2,-1])), split(df, df$id))
df2 <- do.call(rbind, lists)
To change the column names:
colnames(df2) <- c("id", paste0("V", seq_along(df2[-1])))
And the result:
# > df2
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
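Another base R sketch, which does not depend on there being exactly two rows per id, adds a within-id row counter and then uses reshape(). The `obs` column is a helper name I introduce for illustration.

```r
df <- data.frame(
  id = c(1, 1, 2, 2),
  V1 = c(1:4),
  V2 = c(5:8),
  V3 = c(9:12))

# Number the rows within each id, then spread them into columns
df$obs <- ave(df$id, df$id, FUN = seq_along)
wide <- reshape(df, direction = "wide", idvar = "id", timevar = "obs")

# Replace the V1.1, V2.1, ... names with V1..V6
names(wide)[-1] <- paste0("V", seq_len(ncol(wide) - 1))
wide
```

If some ids had fewer rows than others, reshape() would fill the missing cells with NA.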
