How to change tabular data to a different format in R

I want to change the data below
pos BZ_SP BZ_SP_m1 BZ_SP_m2 CL_SP CL_SP_m1 CL_SP_m2
1 -300000 2 3 2540544 1 2
2 0 0 0 -118621 3 4
so that it looks like this:
CurveGroup SpreadId SpreadMonth1 SpreadMonth2 Position
BZ_SP 1 2 3 -300000
CL_SP 1 1 2 2540544
BZ_SP 2 0 0 0
CL_SP 2 3 4 -118621

Gather the input into long form, then separate the variable into CurveGroup and a suffix. Spread the result back out to wide form, then rename and rearrange the columns.
library(dplyr)
library(tidyr)
DF %>%
gather(variable, value, -pos) %>%
separate(variable, c("CurveGroup", "suffix"), sep = 5, fill = "right") %>%
spread(suffix, value) %>%
select(CurveGroup, SpreadId = "pos", SpreadMonth1 = "_m1", SpreadMonth2 = "_m2",
Position = "V1")
giving:
CurveGroup SpreadId SpreadMonth1 SpreadMonth2 Position
1 BZ_SP 1 2 3 -300000
2 CL_SP 1 1 2 2540544
3 BZ_SP 2 0 0 0
4 CL_SP 2 3 4 -118621
Note: The input DF in reproducible form is:
DF <- structure(list(pos = 1:2, BZ_SP = c(-300000L, 0L), BZ_SP_m1 = c(2L,
0L), BZ_SP_m2 = c(3L, 0L), CL_SP = c(2540544L, -118621L), CL_SP_m1 = c(1L,
3L), CL_SP_m2 = c(2L, 4L)), .Names = c("pos", "BZ_SP", "BZ_SP_m1",
"BZ_SP_m2", "CL_SP", "CL_SP_m1", "CL_SP_m2"),
class = "data.frame", row.names = c(NA, -2L))
Update: Simplified.
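For comparison, the same reshaping can be sketched in base R without gather/spread. This is only an illustration, not the answer's method: it assumes that every curve group column (here BZ_SP and CL_SP) has companion columns ending in _m1 and _m2, and the helper names groups and pieces are made up for the example.

```r
# Input from the question, rebuilt with data.frame() for brevity.
DF <- data.frame(
  pos = 1:2,
  BZ_SP = c(-300000L, 0L), BZ_SP_m1 = c(2L, 0L), BZ_SP_m2 = c(3L, 0L),
  CL_SP = c(2540544L, -118621L), CL_SP_m1 = c(1L, 3L), CL_SP_m2 = c(2L, 4L)
)

# Curve groups are the columns without an "_m<digit>" suffix (assumed convention).
groups <- grep("_m[0-9]+$", names(DF)[-1], invert = TRUE, value = TRUE)

# Build one block of rows per curve group, then stack the blocks.
pieces <- lapply(groups, function(g) {
  data.frame(
    CurveGroup   = g,
    SpreadId     = DF$pos,
    SpreadMonth1 = DF[[paste0(g, "_m1")]],
    SpreadMonth2 = DF[[paste0(g, "_m2")]],
    Position     = DF[[g]],
    stringsAsFactors = FALSE
  )
})
out <- do.call(rbind, pieces)

# Order by SpreadId, then CurveGroup, to match the layout asked for.
out <- out[order(out$SpreadId, out$CurveGroup), ]
rownames(out) <- NULL
out
```

Because the column names are looked up by pasting the suffix onto the group name, a group that lacks one of the _m columns would make data.frame() fail, so the naming assumption should be checked first.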

Related

How to replace the values in a binary matrix with values from a dataframe?

The matrix I have looks something like this:
Plot A B C
1 1 0 0
2 1 0 1
3 1 1 0
And I have a dataframe that looks like this
A 5
B 4
C 2
What I would like to do is replace the "1" values in the matrix with the corresponding values in the dataframe, like this:
Plot A B C
1 5 0 0
2 5 0 2
3 5 4 0
Any suggestions on how to do this in R? Thank you!
An option with tidyverse
library(dplyr)
df1 %>%
mutate(across(all_of(df2$col1),
~ replace(.x, .x== 1, df2$col2[match(cur_column(), df2$col1)])))
-output
Plot A B C
1 1 5 0 0
2 2 5 0 2
3 3 5 4 0
data
df1 <- structure(list(Plot = 1:3, A = c(1L, 1L, 1L), B = c(0L, 0L, 1L
), C = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(col1 = c("A", "B", "C"), col2 = c(5, 4, 2)),
class = "data.frame", row.names = c(NA,
-3L))
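The same replacement can be sketched in base R, in case dplyr is not available. This is an illustrative alternative rather than the answer's code; it reuses the df1 and df2 objects defined above and assumes every name in df2$col1 is a column of df1.

```r
df1 <- data.frame(Plot = 1:3, A = c(1L, 1L, 1L), B = c(0L, 0L, 1L), C = c(0L, 1L, 0L))
df2 <- data.frame(col1 = c("A", "B", "C"), col2 = c(5, 4, 2), stringsAsFactors = FALSE)

out <- df1
# Walk over each listed column together with its replacement value,
# and swap every 1 in that column for the value.
out[df2$col1] <- Map(
  function(column, value) replace(column, column == 1, value),
  df1[df2$col1],
  df2$col2
)
out
```

Map() pairs the columns with the values by position, so the order of df2$col1 and df2$col2 must line up, which it does when both come from the same data frame.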

How to get merged data frame from two data frames having some same columns(R)

I want to merge two data frames so that the values in the columns they share are added together, row by row, while the columns that exist in only one of them are kept.
For example:
df1
No A B C D
1 1 0 1 0
2 0 1 2 1
3 0 0 1 0
df2
No A B E F
1 1 0 1 1
2 0 1 2 1
3 2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
1 2 0 1 0 1 1
2 0 2 2 1 2 1
3 2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
left_join(df2, by = "No") %>%
mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
group_by(No) %>%
summarise(across(c(A, B, C, D, E, `F`), sum, na.rm = TRUE),
.groups = "drop")
If a particular column doesn't exist in one dataframe (i.e. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (to be passed to sum()) means that these values will get treated like zeros.
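A tiny base R sketch of why na.rm = TRUE matters here. The vector e_values is a hypothetical stand-in for column E within the group No == 1 after stacking: NA for the row that came from df1 and 1 for the row that came from df2.

```r
e_values <- c(NA, 1L)

with_na    <- sum(e_values)                 # NA: a missing input poisons the sum
without_na <- sum(e_values, na.rm = TRUE)   # 1: the NA is dropped before summing

# If every value in a group is NA, the sum with na.rm = TRUE is 0.
all_missing <- sum(c(NA_integer_, NA_integer_), na.rm = TRUE)
```

The last line is worth keeping in mind: a group in which a column is missing everywhere comes out as 0 rather than NA, which may or may not be what is wanted.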
Using data.table :
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (R 4.1.0 or later, for the native pipe and the \(x) lambda syntax). Collect the data frames into a list ('lst1') and take the union of their column names ('nm1'). Then loop over the list, using setdiff to add a zero-filled column for every name that a list element is missing, rbind the elements, and use aggregate to get the sum grouped by 'No'.
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
{\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
{\(x) do.call(rbind, x)}() |>
{\(dat) aggregate(.~ No, data = dat, FUN = sum, na.rm = TRUE,
na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0

R function for collapsing multiple ranges of different columns from wide to long format?

I have a dataset with several ranges of columns in each row (each row corresponds to one individual), as below. Each column type has 3 levels (0, 1 and 2).
id col1_0 col1_1 col1_2 col2_0 col2_1 col2_2 col3_0 col3_1 col3_2
1 0 1 3 2 2 3 3 4 5
2 1 1 2 2 4 7 4 5 5
.
.
etc.
What I need is to collapse all col1 columns into one column, all col2 columns into another and all col3 columns into a third, for each id, as below.
id x col1 col2 col3
1 0 0 2 3
1 1 1 2 4
1 2 3 3 5
2 0 1 2 4
2 1 1 4 5
2 2 2 7 5
.
.
etc.
In addition, I would also need to create an x-column with values 0,1 and 2, for each id. However, I only manage to collapse the first range of columns (col1) with the code below.
library(tidyverse)
longer_data <- dataframe %>%
group_by(id) %>%
pivot_longer(col1_0:col1_2, names_to = "x1", values_to = "col1")
x1 here holds the original column names, so I would need an additional x column that keeps only the trailing number of each original column name.
Is there a way to achieve this? Many thanks in advance!
We don't need group_by here. pivot_longer can do this directly when names_sep and the special .value entry are given in names_to. The order of .value and x matters: the part of each name before the _ (col1, col2, col3) becomes an output column holding the values, and the suffix after the _ goes into the new column 'x'.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -id, names_to = c('.value', 'x'), names_sep = "_")
-output
# A tibble: 6 x 5
# id x col1 col2 col3
# <int> <chr> <int> <int> <int>
#1 1 0 0 2 3
#2 1 1 1 2 4
#3 1 2 3 3 5
#4 2 0 1 2 4
#5 2 1 1 4 5
#6 2 2 2 7 5
data
df1 <- structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)),
class = "data.frame", row.names = c(NA,
-2L))
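To see what names_sep = "_" does to the column names in the pivot_longer call above, the split can be imitated in base R. This only illustrates the naming logic and is not how tidyr is implemented; nms and parts are names made up for the example.

```r
nms <- c("col1_0", "col1_1", "col1_2", "col2_0", "col2_1", "col2_2")

# Split each name at the underscore: first piece -> ".value", second piece -> "x".
parts <- do.call(rbind, strsplit(nms, "_", fixed = TRUE))
colnames(parts) <- c(".value", "x")
parts

value_cols <- unique(parts[, ".value"])  # these become the output columns
x_values   <- unique(parts[, "x"])       # these become the entries of column x
```

The pieces are character strings, which is why x shows up as <chr> in the tibble printed above; as.integer() can be applied afterwards if a numeric x is needed.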
Here is a base R option using reshape, where timevar="x" creates a column named x, and sep="_" helps to fetch the last numbers of the original column names.
res <- reshape(
df,
direction = "long",
idvar = "id",
varying = -1,
timevar = "x",
sep = "_"
)
res <- res[order(res$id), ]
Output
> res
id x col1 col2 col3
1.0 1 0 0 2 3
1.1 1 1 1 2 4
1.2 1 2 3 3 5
2.0 2 0 1 2 4
2.1 2 1 1 4 5
2.2 2 2 2 7 5
Data
> dput(df)
structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)), class = "data.frame", row.names = c(NA,
-2L))

Creating duplicated data frames with different ID

I have a question for the community and hoping for some help.
I am trying to duplicate a data frame like the one below:
ID Time Solve
1 0 1
1 2 2
1 4 3
1 6 1
I am trying to duplicate the above data frame 100 times so, it would read as below:
ID Time Solve
1 0 1
1 2 2
1 4 3
1 6 1
2 0 1
2 2 2
2 4 3
2 6 1
3 0 1
3 2 2
3 4 3
3 6 1
4 0 1
4 2 2
4 4 3
4 6 1
.....
100 0 1
100 2 2
100 4 3
100 6 1
Does anyone have a good solution for this or a resource to read up on this?
Thanks!
We can use replicate
out <- do.call(rbind, replicate(100, df1, simplify = FALSE))
out$ID <- as.integer(gl(nrow(out), nrow(df1), nrow(out)))
Or another option is rep
out <- df1[rep(seq_len(nrow(df1)), 100),]
out$ID <- as.integer(gl(nrow(out), nrow(df1), nrow(out)))
Or make use of uncount
library(tidyr)
library(dplyr)
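uncount(df1, 100) %>%
# uncount() repeats each row in place, so number the copies and then sort by copy
mutate(ID = rep(seq_len(100), times = nrow(df1))) %>%
arrange(ID)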
Or another option is
df1 %>%
nest_by(ID) %>%
uncount(100) %>%
mutate(ID = row_number()) %>%
unnest(c(data))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L), Time = c(0L, 2L, 4L, 6L
), Solve = c(1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-4L))
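As a quick check of the rep approach, here is a self-contained base R sketch that builds the IDs with rep(..., each = ) instead of gl() and inspects the result. The variable n_copies is introduced only for this example.

```r
df1 <- data.frame(ID = 1L, Time = c(0L, 2L, 4L, 6L), Solve = c(1L, 2L, 3L, 1L))
n_copies <- 100

# Repeat the whole block of row indices n_copies times: 1,2,3,4,1,2,3,4,...
out <- df1[rep(seq_len(nrow(df1)), times = n_copies), ]

# Give each copy of the block its own ID: 1,1,1,1,2,2,2,2,...
out$ID <- rep(seq_len(n_copies), each = nrow(df1))
rownames(out) <- NULL

head(out, 8)
```

Note the difference between times = (repeat the whole sequence) and each = (repeat every element in place); mixing them up keeps the row count right but scrambles which rows belong to which ID.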

group_by() and percentages: summarise() drops the columns I also need

I have this df:
> df <- data.frame(Adults = sample(0:5, 10, replace = TRUE),
+ Children = sample(0:2, 10, replace = TRUE),
+ Teens = sample(1:3, 10, replace = TRUE),
+ stringsAsFactors = FALSE)
> df
Adults Children Teens
1 5 0 1
2 5 1 2
3 5 2 3
4 5 2 2
5 0 1 2
6 5 1 3
7 0 2 3
8 4 2 1
9 4 0 1
10 1 2 1
We can see that Children doesn't have 3,4,5 values and Teens doesn't have 0,4,5 values. However, we know that Adults, Children, and Teens could have from 0 to 5.
When I use group_by() with summarise(), summarise() drops the columns I'm not grouping by. The code I tried:
df %>%
group_by(Adults) %>% mutate(n_Adults = n()) %>%
group_by(Teens) %>% mutate(n_Teens = n()) %>%
group_by(Children) %>% mutate(n_Children = n())
And when I group by c(0,1,2,3,4,5) (in order to have all the possible values) it gives me this error:
Error in mutate_impl(.data, dots) : Column `c(0, 1, 2, 3, 4, 5)` must be length 10 (the number of rows) or one, not 6
I'm looking for this output:
Values n_Adults n_Children n_Teens p_Adults p_Children p_Teens
0 2 2 0 0.2 0.2 0
1 1 3 4 0.1 0.1 0.4
2 0 5 3 0 0 0.3
3 0 0 3 0 0 0.3
4 2 0 0 0.2 0.2 0
5 5 0 0 0.5 0.5 0
Where n_ is the count of the respective column and p_ is the percentage of the respective column.
We can gather the data into 'long' format and get the frequencies with count, after converting 'value' to a factor with levels 0:5. Then spread back to 'wide' format (filling absent combinations with 0), create the 'p' columns by dividing each count column by its sum and, if needed, change the column names with rename_at.
library(tidyverse)
gather(df) %>%
count(key, value = factor(value, levels = 0:5)) %>%
spread(key, n, fill = 0) %>%
mutate_at(2:4, list(p = ~./sum(.)))%>%
rename_at(2:4, ~ paste0(.x, "_n"))
data
df <- structure(list(Adults = c(1L, 1L, 4L, 3L, 3L, 5L, 1L, 4L, 4L,
1L), Children = c(1L, 1L, 2L, 2L, 0L, 2L, 0L, 0L, 1L, 0L), Teens = c(1L,
2L, 3L, 1L, 1L, 3L, 1L, 2L, 2L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
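The idea of fixing the factor levels at 0:5 so that absent values still get a count can also be sketched in base R with table(). This is an alternative illustration using the data above, not the answer's code; counts, props and res are names chosen for the example.

```r
df <- data.frame(
  Adults   = c(1L, 1L, 4L, 3L, 3L, 5L, 1L, 4L, 4L, 1L),
  Children = c(1L, 1L, 2L, 2L, 0L, 2L, 0L, 0L, 1L, 0L),
  Teens    = c(1L, 2L, 3L, 1L, 1L, 3L, 1L, 2L, 2L, 1L)
)

# Count each column against the full set of levels, so missing values count as 0.
counts <- sapply(df, function(col) as.vector(table(factor(col, levels = 0:5))))

# Column-wise proportions: each count divided by its column total.
props <- prop.table(counts, margin = 2)

colnames(counts) <- paste0("n_", names(df))
colnames(props)  <- paste0("p_", names(df))

res <- data.frame(Values = 0:5, counts, props)
res
```

Values outside 0:5 would become NA in the factor and be left out of the counts, so the levels should cover everything that can occur.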
library(reprex)
library(tidyverse)
set.seed(20)
df <- data.frame(Adults = sample(0:5, 10, replace = TRUE),
Children = sample(0:2, 10, replace = TRUE),
Teens = sample(1:3, 10, replace = TRUE),
stringsAsFactors = FALSE)
df
#> Adults Children Teens
#> 1 5 2 2
#> 2 4 2 1
#> 3 1 0 2
#> 4 3 2 1
#> 5 5 0 1
#> 6 5 1 1
#> 7 0 0 3
#> 8 0 0 3
#> 9 1 0 1
#> 10 2 2 3
df_adults <- df %>%
count(Adults) %>%
rename( n_Adults = n)
df_children <- df %>%
count(Children) %>%
rename( n_Children = n)
df_teens <- df %>%
count(Teens) %>%
rename( n_Teens = n)
df_new <- data.frame(unique_id = 0:5)
df_new <- left_join(df_new,df_adults, by = c("unique_id"="Adults"))
df_new <- left_join(df_new,df_children, by = c("unique_id"="Children"))
df_new <- left_join(df_new,df_teens, by = c("unique_id"="Teens"))
df_new <- df_new %>%
replace_na(list( n_Adults=0, n_Children=0, n_Teens=0))
df_new %>%
mutate(p_Adults = n_Adults/sum(n_Adults),p_Children = n_Children/sum(n_Children), p_Teens = n_Teens/sum(n_Teens))
#> unique_id n_Adults n_Children n_Teens p_Adults p_Children p_Teens
#> 1 0 2 5 0 0.2 0.5 0.0
#> 2 1 2 1 5 0.2 0.1 0.5
#> 3 2 1 4 2 0.1 0.4 0.2
#> 4 3 1 0 3 0.1 0.0 0.3
#> 5 4 1 0 0 0.1 0.0 0.0
#> 6 5 3 0 0 0.3 0.0 0.0
Created on 2019-02-25 by the reprex package (v0.2.1)
