Gathering columns with unequal length with the same name in R

I want to gather data from experiments from a data frame into columns. The data is in the following form;
and I want to arrange it in the format given below;
Is there any simple method available to do this in R/RStudio? I tried tidyr, rbind and cbind as suggested in different examples, but I was unable to make them work and found they were not quite relevant.
It would be great if someone could help me understand.
Thanks

Using the data.table function melt() with patterns() in measure.vars, you can try the following:
library(data.table)
#Some data in the same format as you have
df <- read.table(text="
1 0.8525099 0.5598105 0.4242143 0 0.06016425 0.678719492 0.4852765 0.4970301 0.1657070
2 0.1237982 0.2853534 0.8281460 0.42586728 0.31214568 0.647306659 0.5445816 0.4250520 0.9975251
3 0.4907858 0.4925835 0.6689135 0.06042183 0.47391134 0.002571686 0.5267215 0.4291427 NA
4 0.8524778 0.1091856 0.6529887 0.24606198 0.44869099 0.540201766 0.6263992 0.1448730 NA
")
#assigning names to columns
colnames(df) <- c("Run", rep(c("Logging", "Salinity","surface"),3))
setDT(df) #converting df into a data.table
df #Similar as your initial data frame
Run Logging Salinity surface Logging Salinity surface Logging Salinity surface
1: 1 0.8525099 0.5598105 0.4242143 0.00000000 0.06016425 0.678719492 0.4852765 0.4970301 0.1657070
2: 2 0.1237982 0.2853534 0.8281460 0.42586728 0.31214568 0.647306659 0.5445816 0.4250520 0.9975251
3: 3 0.4907858 0.4925835 0.6689135 0.06042183 0.47391134 0.002571686 0.5267215 0.4291427 NA
4: 4 0.8524778 0.1091856 0.6529887 0.24606198 0.44869099 0.540201766 0.6263992 0.1448730 NA
df2 <- melt(df, # melting data, converting from wide to long
            id.vars = 1, # keep the first column "Run" fixed as the id
            measure.vars = patterns(Logging = "Logging", # look for patterns in the column names to use as measure columns
                                    Salinity = "Salinity",
                                    surface = "surface")
)
#Output
df2
Run variable Logging Salinity surface
1: 1 1 0.85250990 0.55981050 0.424214300
2: 2 1 0.12379820 0.28535340 0.828146000
3: 3 1 0.49078580 0.49258350 0.668913500
4: 4 1 0.85247780 0.10918560 0.652988700
5: 1 2 0.00000000 0.06016425 0.678719492
6: 2 2 0.42586728 0.31214568 0.647306659
7: 3 2 0.06042183 0.47391134 0.002571686
8: 4 2 0.24606198 0.44869099 0.540201766
9: 1 3 0.48527650 0.49703010 0.165707000
10: 2 3 0.54458160 0.42505200 0.997525100
11: 3 3 0.52672150 0.42914270 NA
12: 4 3 0.62639920 0.14487300 NA
And finally remove the column variable:
#Removing column variable (second column in df2) you get your result
df2[, -2]
Run Logging Salinity surface
1: 1 0.85250990 0.55981050 0.424214300
2: 2 0.12379820 0.28535340 0.828146000
3: 3 0.49078580 0.49258350 0.668913500
4: 4 0.85247780 0.10918560 0.652988700
5: 1 0.00000000 0.06016425 0.678719492
6: 2 0.42586728 0.31214568 0.647306659
7: 3 0.06042183 0.47391134 0.002571686
8: 4 0.24606198 0.44869099 0.540201766
9: 1 0.48527650 0.49703010 0.165707000
10: 2 0.54458160 0.42505200 0.997525100
11: 3 0.52672150 0.42914270 NA
12: 4 0.62639920 0.14487300 NA

You can use tidyr::pivot_longer, using @Chriss Paul's data:
tidyr::pivot_longer(df, cols = -Run, names_to = '.value')
# Run Logging Salinity surface
# <int> <dbl> <dbl> <dbl>
# 1 1 0.853 0.560 0.424
# 2 1 0 0.0602 0.679
# 3 1 0.485 0.497 0.166
# 4 2 0.124 0.285 0.828
# 5 2 0.426 0.312 0.647
# 6 2 0.545 0.425 0.998
# 7 3 0.491 0.493 0.669
# 8 3 0.0604 0.474 0.00257
# 9 3 0.527 0.429 NA
#10 4 0.852 0.109 0.653
#11 4 0.246 0.449 0.540
#12 4 0.626 0.145 NA
PS - It is not advised to have data with duplicate column names.
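If you do end up with duplicated names (for example straight from read.table), one way to repair them before reshaping is base R's make.unique(); a minimal sketch, assuming the Logging/Salinity/surface layout above:

```r
# The duplicated header from the example above
nms <- c("Run", rep(c("Logging", "Salinity", "surface"), 3))

# make.unique() leaves the first occurrence alone and suffixes repeats,
# producing names that names_sep / names_pattern can split later
make.unique(nms, sep = "_")
# "Run" "Logging" "Salinity" "surface" "Logging_1" "Salinity_1" ...
```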

I created a similar data set and applied the following code based on binding, since you mentioned it in your question. It may look verbose, but it gets you to the desired output:
library(dplyr)
df <- tibble(
runs = c(1, 2, 3, 4),
col1 = c(3, 4, 5, 5),
col2 = c(5, 3, 1, 4),
col3 = c(6, 4, 9, 2),
col1 = c(0, 2, 2, 1),
col2 = c(2, 3, 1, 7),
col3 = c(2, 4, 9, 9),
col1 = c(3, 4, 5, 7),
col2 = c(3, 3, 1, 4),
col3 = c(3, 2, NA, NA), .name_repair = "minimal")
df %>%
  select(1:4) %>%
  bind_rows(df %>%
              select(c(1, 5:7))) %>%
  bind_rows(df %>%
              select(c(1, 8:10))) %>%
  select(runs, col1:col3)
OK, there are two other ways I thought you might be interested in, since it was your question. These are not entirely my own code (I got help with them), but they are great alternative ways of dealing with the same problem:
df %>%
pivot_longer(cols = starts_with("col"), names_to = c(".value")) %>% # The `.value` sentinel indicates that this component of the name defines the name of the column containing the cell values, overriding values_to.
group_by(runs) %>%
mutate(id = row_number()) %>%
ungroup() %>%
arrange(id) %>%
select(-id)
# A tibble: 12 x 4
runs col1 col2 col3
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 6
2 2 4 3 4
3 3 5 1 9
4 4 5 4 2
5 1 0 2 2
6 2 2 3 4
7 3 2 1 9
8 4 1 7 9
9 1 3 3 3
10 2 4 3 2
11 3 5 1 NA
12 4 7 4 NA
The above code was proposed by @Ashley G, which was amazing.
And the base R alternative:
data.frame(df$runs,
sapply(split.default(df[-1], names(df)[-1]), unlist),
row.names = NULL)
# Here `split.default` splits the columns of the data frame, whereas `split` splits the rows.
df.runs col1 col2 col3
1 1 3 5 6
2 2 4 3 4
3 3 5 1 9
4 4 5 4 2
5 1 0 2 2
6 2 2 3 4
7 3 2 1 9
8 4 1 7 9
9 1 3 3 3
10 2 4 3 2
11 3 5 1 NA
12 4 7 4 NA
The base R code was written by @Ronak Shah, for which I'm very grateful.

Merging multiple connected columns

I have two different columns for several samples, which are connected. I want to merge all columns of type 1 into one column and all columns of type 2 into another column, but the rows should stay connected.
Example:
a1 <- c(1, 2, 3, 4, 5)
b1 <- c(1, 4, 9, 16, 25)
a2 <- c(2, 4, 6, 8, 10)
b2 <- c(4, 8, 12, 16, 20)
df1 <- data.frame(a1, b1, a2, b2)
a1 b1 a2 b2
1 1 1 2 4
2 2 4 4 8
3 3 9 6 12
4 4 16 8 16
5 5 25 10 20
I want to have it like this:
a b
1 1 1
2 2 4
3 2 4
4 3 9
5 4 8
6 4 16
7 5 25
8 6 12
9 8 16
10 10 20
My case
This is the example in my case. I have a lot of columns with different names and I want to extract abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new data frame, with all abs_dist in one column and all mean_vel in one column, but still connected.
I tried with unlist, but then of course the connection gets lost.
Thanks in advance.
A base R option using reshape
subset(
reshape(
setNames(df1, gsub("(\\d)", ".\\1", names(df1))),
direction = "long",
varying = 1:ncol(df1)
),
select = -c(time, id)
)
gives
a b
1.1 1 1
2.1 2 4
3.1 3 9
4.1 4 16
5.1 5 25
1.2 2 4
2.2 4 8
3.2 6 12
4.2 8 16
5.2 10 20
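The gsub() step above is what lets reshape() infer the groups: it inserts a dot before each digit so the varying names split into a measure part ("a", "b") and a time part ("1", "2"). Checking just that step in isolation:

```r
# Insert a "." before every digit in the column names
gsub("(\\d)", ".\\1", c("a1", "b1", "a2", "b2"))
# [1] "a.1" "b.1" "a.2" "b.2"
```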
An option with pivot_longer from tidyr, specifying names_sep as a regex lookaround to match between a lower case letter ([a-z]) and a digit in the column names:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])") %>%
select(-grp)
-output
# A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 1 1
# 2 2 4
# 3 2 4
# 4 4 8
# 5 3 9
# 6 6 12
# 7 4 16
# 8 8 16
# 9 5 25
#10 10 20
With the edited post, we need to change the names_sep, i.e. the delimiter is now _ between a lower case letter and a digit:
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])_(?=[0-9])") %>%
select(-grp)
Or, with base R, use split.default on the substring of the column names to split into a list of data.frames, then unlist each list element by looping over the list and convert to a data.frame:
data.frame(lapply(split.default(df1, sub("\\d+", "", names(df1))),
unlist, use.names = FALSE))
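As a sketch of how this split.default() approach carries over to the edited example (data invented to match the abs_dist_*/mean_vel_* naming): stripping the trailing digits leaves "abs_dist_" and "mean_vel_", so each group of columns is stacked into one column while rows stay connected:

```r
# Hypothetical data in the shape described in the edited question
df1 <- data.frame(abs_dist_1 = 1:3, mean_vel_1 = 4:6,
                  abs_dist_2 = 7:9, mean_vel_2 = 10:12)

# sub("\\d+", "", ...) drops the digits; the resulting column names keep
# a trailing underscore, which could be cleaned with sub("_$", "", ...)
out <- data.frame(lapply(split.default(df1, sub("\\d+", "", names(df1))),
                         unlist, use.names = FALSE))
out  # two columns (abs_dist_, mean_vel_), six rows
```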
For the sake of completeness, here is a solution which uses data.table::melt() and the patterns() function to specify columns which belong together:
library(data.table)
melt(setDT(df1), measure.vars = patterns(a = "a", b = "b"))[
order(a,b), !"variable"]
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
This reproduces the expected result for OP's sample dataset.
A more realistic example: reshape only selected columns
With the edit of the question, the OP has clarified that the production data contains many more columns than those which need to be reshaped:
I have a lot of columns with different names and I want to extract
abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new
data frame, with all abs_dist in one column and all mean_vel in one
column, but still connected.
So, the OP wants to extract and reshape the columns of interest in one go while ignoring all other data in the dataset.
To simulate this situation, we need a more elaborate dataset which includes other columns as well:
df2 <- cbind(df1, c1 = 11:15, c2 = 21:25)
df2
a1 b1 a2 b2 c1 c2
1 1 1 2 4 11 21
2 2 4 4 8 12 22
3 3 9 6 12 13 23
4 4 16 8 16 14 24
5 5 25 10 20 15 25
With a modified version of the code above
library(data.table)
cols <- c("a", "b")
result <- melt(setDT(df2), measure.vars = patterns(cols), value.name = cols)[, ..cols]
setorderv(result, cols)
result
we get
a b
1: 1 1
2: 2 4
3: 3 9
4: 4 16
5: 5 25
6: 2 4
7: 4 8
8: 6 12
9: 8 16
10: 10 20
For the production dataset as pictured in the edit, the OP needs to set
cols <- c("abs_dist", "mean_vel")
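A quick sketch of that on invented data with the production column names plus an unrelated column (which melt() simply treats as an id variable and the final column selection drops):

```r
library(data.table)

# Hypothetical data: the two measures of interest plus an unrelated column
dt <- data.table(abs_dist_1 = 1:3, mean_vel_1 = 4:6,
                 abs_dist_2 = 7:9, mean_vel_2 = 10:12,
                 other = c("x", "y", "z"))  # ignored by the reshape

cols <- c("abs_dist", "mean_vel")
result <- melt(dt, measure.vars = patterns("abs_dist", "mean_vel"),
               value.name = cols)[, ..cols]
setorderv(result, cols)
result
```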

dplyr: Mutate a new column with sequential repeated integers of n time in a dataframe

I am struggling with a maybe easy question. I have a dataframe of 1 column with n rows (n is a multiple of 3). I would like to add a second column with integers like: 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,... How can I achieve this with dplyr as a general solution for any number of rows (always a multiple of 3)?
I tried this:
df <- tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(1:4, each=3))
This works, but I would like a solution for n rows with each = 3. Many thanks!
You can specify the each and length.out parameters in rep().
library(dplyr)
tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(row_number(), each=3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl
library(dplyr)
df %>%
mutate(col2 = as.integer(gl(n(), 3, n())))
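gl(n, k, length) generates a factor with levels 1..n, each repeated k times, truncated to the given length; so with n() = 12 rows and groups of 3, the call above yields exactly the wanted sequence:

```r
# gl(12, 3, 12): levels 1..12, each repeated 3 times, truncated to 12 values
as.integer(gl(12, 3, 12))
# [1] 1 1 1 2 2 2 3 3 3 4 4 4
```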
Since integer division by 3 (%/% 3) over the sequence 0:n results in 0, 0, 0, 1, 1, 1, ..., adding 1 generates the desired sequence automatically, so simply this will also do:
df %>% mutate(col2 = 1+ (row_number()-1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
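Outside of mutate(), the same sequence is also easy to build directly in base R; a sketch assuming the row count is a multiple of 3:

```r
n <- 12  # stands in for nrow(df), assumed to be a multiple of 3
rep(seq_len(n / 3), each = 3)
# [1] 1 1 1 2 2 2 3 3 3 4 4 4
```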

using merge to create blank rows

I am running multiple simulations with the same input parameters. Some simulations complete earlier than others and I need to extend the results of the shorter simulations so that I can analyse the data with all runs included. This means filling up 'short' runs with repeats of the final values until they are the same length as the 'long' runs with the same input parameters.
I would like a dplyr solution because the real datasets are massive and dplyr has fast joins.
Here is my attempt.
library(dplyr)
sims <- data.frame("run" = c(1, 1, 1, 2, 2, 3, 3),
"type" = c("A", "A", "A", "A", "A", "B", "B"),
"step" = c(0, 1, 2, 0, 1, 0, 1),
"value" = seq(1:7))
allSteps <- data.frame("type" = c("A", "A", "A", "B", "B"),
"step" = c(0, 1, 2, 0, 1))
merged <- full_join(sims, allSteps,
by = c("type", "step"))
This gets the output:
run type step value
1 A 0 1
1 A 1 2
1 A 2 3
2 A 0 4
2 A 1 5
3 B 0 6
3 B 1 7
But I actually want the following because run 2 is of type A and should therefore be expanded to the same length as run 1 (also type A):
run type step value
1 A 0 1
1 A 1 2
1 A 2 3
2 A 0 4
2 A 1 5
2 A 2 NA # extra line here
3 B 0 6
3 B 1 7
I will then use fill to get to my desired result of:
run type step value
1 A 0 1
1 A 1 2
1 A 2 3
2 A 0 4
2 A 1 5
2 A 2 5 # filled replacement of NA
3 B 0 6
3 B 1 7
I am sure this is a duplicate of some question but the various search terms I used didn't manage to surface it.
We don't really need the data.frame allSteps if at least one of the runs contains the full sequence for each type. Instead we can use tidyr::expand() in combination with a self-join:
library(tidyr)
sims %>% group_by(type) %>%
expand(run, step) %>%
full_join(sims, by = c("type", "step", "run")) %>%
select(2,1,3,4)
# run type step value
# <dbl> <fctr> <dbl> <int>
#1 1 A 0 1
#2 1 A 1 2
#3 1 A 2 3
#4 2 A 0 4
#5 2 A 1 5
#6 2 A 2 NA
#7 3 B 0 6
#8 3 B 1 7
Using tidyr::complete to get missing combinations, then use fill to fill NAs with last non-NA value:
library(tidyr)
sims %>%
group_by(type) %>%
complete(run, step) %>%
select(run, type, step, value) %>%
ungroup() %>%
fill(value)
# # A tibble: 8 x 4
# run type step value
# <dbl> <fct> <dbl> <int>
# 1 1.00 A 0 1
# 2 1.00 A 1.00 2
# 3 1.00 A 2.00 3
# 4 2.00 A 0 4
# 5 2.00 A 1.00 5
# 6 2.00 A 2.00 5
# 7 3.00 B 0 6
# 8 3.00 B 1.00 7
We can split the data frame by run and do a right_join on allSteps for each of them to have all the combinations you desire. Then we join back and fill.
It's a bit more general than current solutions in that you could have steps in allSteps that may not be in sims or in the sims subset you're working on.
library(tidyverse)
sims %>%
split(.$run) %>%
map_dfr(right_join,allSteps,.id = "id") %>%
group_by(type,id) %>%
fill(run,value,.direction="down") %>%
ungroup %>%
filter(!is.na(run)) %>%
select(-id)
# # A tibble: 8 x 4
# run type step value
# <dbl> <fctr> <dbl> <int>
# 1 1 A 0 1
# 2 1 A 1 2
# 3 1 A 2 3
# 4 2 A 0 4
# 5 2 A 1 5
# 6 2 A 2 5
# 7 3 B 0 6
# 8 3 B 1 7

Collapsing rows with dplyr

I am new to R and am trying to collapse rows based on row values with dplyr. The following example shows the sample data.
set.seed(123)
df<-data.frame(A=c(rep(1:4,4)),
B=runif(16,min=0,max=1),
C=rnorm(16, mean=1,sd=0.5))
A B C
1 1 0.36647435 0.7485365
2 2 0.51864614 0.8654337
3 3 0.04596929 0.9858012
4 4 0.15479619 1.1294208
5 1 0.76712372 1.2460700
6 2 0.17666676 0.7402996
7 3 0.89759874 1.2699954
8 4 0.90267735 0.7101804
9 1 0.91744223 0.3451281
10 2 0.25472599 0.8604743
11 3 0.10933985 0.8696796
12 4 0.71656017 1.2648846
13 1 0.21157810 1.3170205
14 2 0.14947268 1.2789700
15 3 0.92251060 1.5696901
16 4 0.30090579 1.7642853
I want to summarize/collapse the pairs of rows where column A has the values 1 and 2 into a single row (the mean of those two rows). Therefore the final result will have only 12 rows, because the other 4 rows have been collapsed.
I tried to use the following dplyr function but to little avail.
install.packages ("tidyverse")
library (tidyverse)
df %>% summarize_each( fun(i){ for i %in% c(1,2)funs(mean) })
The expected output is something like:
A B C
1 1.5 0.4425602 0.8069851
3 3 0.04596929 0.9858012
4 4 0.15479619 1.1294208
5 1.5 0.4718952 0.9931848
7 3 0.89759874 1.2699954
8 4 0.90267735 0.7101804
9 1.5 0.5860841 0.6028012
11 3 0.10933985 0.8696796
12 4 0.71656017 1.2648846
13 1.5 0.1805254 1.297995
15 3 0.92251060 1.5696901
16 4 0.30090579 1.7642853
Thank you in advance.
By making the implicit, order-based groupings explicit, the summary can be done with a single summarise_all() call.
# Generate the data
set.seed(1)
df <- data.frame(
A = c(rep(1:4, 4)),
B = runif(16, min = 0, max = 1),
C = rnorm(16, mean = 1, sd = 0.5)
)
library(dplyr)
new <- df %>%
group_by(grp = rep(
1:4, # vector containing names of groups to create
each = 4 # number of elements in each group
)) %>%
group_by(mean_grp = cumsum(A > 2) + 1, add = T) %>%
summarise_all(mean) %>%
ungroup()
new
#> # A tibble: 12 x 5
#> grp mean_grp A B C
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1.5 0.3188163 1.067598241
#> 2 1 2 3.0 0.5728534 1.755890584
#> 3 1 3 4.0 0.9082078 1.194921618
#> 4 2 1 1.5 0.5500358 0.291014883
#> 5 2 2 3.0 0.9446753 1.562465459
#> 6 2 3 4.0 0.6607978 0.977533195
#> 7 3 1 1.5 0.3454502 1.231911487
#> 8 3 2 3.0 0.2059746 1.410610598
#> 9 3 3 4.0 0.1765568 1.296950661
#> 10 4 1 1.5 0.5355633 1.425278418
#> 11 4 2 3.0 0.7698414 1.037282492
#> 12 4 3 4.0 0.4976992 0.005324152
I would recommend keeping the grouping variables in your data after the
summary (everything is simpler if you include them in the first place),
but if you want to, you can drop them with
new %>% select(-grp, -mean_grp).
PS. In order to avoid having "magic numbers" (such as the 1:4 and each = 4 when creating grp) included in the code, you could also create the first grouping variable as:
grp = cumsum(A < lag(A, default = A[1])) + 1
Assuming that the original data are ordered such that a new group starts each time the value of A is less than the previous value of A.
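Checking that suggestion against the sample data, where A = rep(1:4, 4), the counter increments exactly at each restart of A:

```r
library(dplyr)

A <- rep(1:4, 4)
# A new group starts whenever A drops below the previous value of A
grp <- cumsum(A < lag(A, default = A[1])) + 1
grp
# [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
```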
One option would be to process the rows with A equal to 1 or 2 separately from the other rows and then bind them back together:
set.seed(3)
df<-data.frame(A=c(rep(1:4,4)),B=runif(16,min=0,max=1),c=rnorm(16, mean=1,sd=0.5))
df %>%
filter(A %in% 1:2) %>%
group_by(tmp=cumsum(A==1)) %>%
summarise_all(mean) %>%
ungroup %>% select(-tmp) %>%
bind_rows(df %>% filter(!A %in% 1:2))
A B c
<dbl> <dbl> <dbl>
1 1.5 0.4877790 1.0121278
2 1.5 0.6032474 0.8840735
3 1.5 0.6042946 0.5996850
4 1.5 0.5456424 0.6198039
5 3.0 0.3849424 0.6276092
6 4.0 0.3277343 0.4343907
7 3.0 0.1246334 1.0760229
8 4.0 0.2946009 0.8461718
9 3.0 0.5120159 1.6121568
10 4.0 0.5050239 1.0999058
11 3.0 0.8679195 0.8981359
12 4.0 0.8297087 0.1667626

Creating a new data frame using existing data

I would like to create a new data frame from my existing data frame "ab". The new data frame should look like "Newdf".
a<- c(1:5)
b<-c(11:15)
ab<-data.frame(C1=a,c2=b)
ab
df<-c(1,11,2,12,3,13,4,14,5,15)
CMT<-c(1:2)
CMT1<-rep.int(CMT,times=5)
Newdf<-data.frame(DV=df,Comp=CMT1)
Newdf
Can we use dplyr package? If yes, how?
More importantly than dplyr, you'd need tidyr:
library(tidyr)
library(dplyr)
ab %>%
gather(Comp, DV) %>%
mutate(Comp = recode(Comp, "C1" = 1, "c2" = 2))
# Comp DV
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 11
# 7 2 12
# 8 2 13
# 9 2 14
# 10 2 15
Using dplyr and tidyr gives you something close...
library(tidyr)
library(dplyr)
df2 <- ab %>%
mutate(Order=1:n()) %>%
gather(key=Comp,value=DV,C1,c2) %>%
arrange(Order) %>%
mutate(Comp=recode(Comp,"C1"=1,"c2"=2)) %>%
select(DV,Comp)
df2
DV Comp
1 1 1
2 11 2
3 2 1
4 12 2
5 3 1
6 13 2
7 4 1
8 14 2
9 5 1
10 15 2
Although the OP has asked for a dplyr solution, I felt challenged to look for a data.table solution. So, FWIW, here is an alternative approach using melt().
Note that this solution does not depend on specific column names in ab as the two other dplyr solutions do. In addition, it should work for more than two columns in ab as well (untested).
library(data.table)
melt(setDT(ab, keep.rownames = TRUE), id.vars = "rn", value.name = "DV"
)[, Comp := rleid(variable)
][order(rn)][, c("rn", "variable") := NULL][]
# DV Comp
# 1: 1 1
# 2: 11 2
# 3: 2 1
# 4: 12 2
# 5: 3 1
# 6: 13 2
# 7: 4 1
# 8: 14 2
# 9: 5 1
#10: 15 2
Data
ab <- structure(list(C1 = 1:5, c2 = 11:15), .Names = c("C1", "c2"),
row.names = c(NA, -5L), class = "data.frame")
ab
# C1 c2
#1 1 11
#2 2 12
#3 3 13
#4 4 14
#5 5 15
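Since the "more than two columns" claim above was left untested, here is a quick sketch with a third column added to ab (values invented); the same melt() chain still interleaves all three columns per row:

```r
library(data.table)

ab3 <- data.frame(C1 = 1:5, c2 = 11:15, c3 = 21:25)

res <- melt(setDT(ab3, keep.rownames = TRUE), id.vars = "rn", value.name = "DV"
            )[, Comp := rleid(variable)
              ][order(rn)][, c("rn", "variable") := NULL][]
res
# DV runs 1, 11, 21, 2, 12, 22, ... with Comp cycling 1, 2, 3
```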
