Expand data frame and add a new variable - r

I have a data frame structured like this:
+----------+------+--------+-------+
| Location | year | group1 | Value |
+----------+------+--------+-------+
| a | 2020 | 1 | x |
| a | 2020 | 2 | y |
| a | 2020 | 3 | z |
| a | 2021 | 1 | x |
| a | 2021 | 2 | y |
| a | 2021 | 3 | z |
| b | 2020 | 1 | x |
| b | 2020 | 2 | y |
| b | 2020 | 3 | z |
+----------+------+--------+-------+
I would like to expand the data frame to include 3 rows for every location, year, and group1 combination and generate a group2 variable that identifies these new combinations (1-3). Ideally, the data frame will look like this:
+----------+------+--------+-------+--------+
| Location | year | group1 | Value | group2 |
+----------+------+--------+-------+--------+
| a | 2020 | 1 | x | 1 |
| a | 2020 | 1 | x | 2 |
| a | 2020 | 1 | x | 3 |
| a | 2020 | 2 | y | 1 |
| a | 2020 | 2 | y | 2 |
| a | 2020 | 2 | y | 3 |
| ... | ... |... |... |... |
+----------+------+--------+-------+--------+
I was able to expand the dataframe to the correct number of total rows using the following code:
df[rep(seq_len(nrow(df)),3), 1:4]
But I couldn't figure out how to add the group2 variable shown above.

With tidyr you can use expand - this will expand your data frame to all combinations of values with your sequence of 1 to 3:
library(tidyverse)
df %>%
group_by(Location, year, group1, Value) %>%
expand(group2 = 1:3)
Output
Location year group1 Value group2
<fct> <dbl> <int> <fct> <int>
1 a 2020 1 x 1
2 a 2020 1 x 2
3 a 2020 1 x 3
4 a 2020 2 y 1
5 a 2020 2 y 2
6 a 2020 2 y 3
...
Your approach looks close, and I suppose you could just add on group2 like this:
cbind(df[rep(seq_len(nrow(df)), each = 3), ], group2 = 1:3)
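As a hedged base R alternative to the approaches above, a cross join via merge() with by = NULL pairs every original row with each value of group2. This is only a sketch: the data frame is rebuilt from the question's table, and the name expanded is illustrative.

```r
# Data rebuilt from the question's table (assumed types)
df <- data.frame(
  Location = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
  year     = c(2020, 2020, 2020, 2021, 2021, 2021, 2020, 2020, 2020),
  group1   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Value    = c("x", "y", "z", "x", "y", "z", "x", "y", "z")
)

# by = NULL asks merge() for the Cartesian product of the two data frames
expanded <- merge(df, data.frame(group2 = 1:3), by = NULL)

# Sort so the three group2 rows of each original row sit together
expanded <- expanded[order(expanded$Location, expanded$year,
                           expanded$group1, expanded$group2), ]
rownames(expanded) <- NULL
head(expanded, 6)
```

The explicit sort is only there to match the layout requested in the question; the cross join itself already contains all 27 rows.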

Here is the solution you are looking for
library(dplyr)
# 1. Data set
df <- data.frame(
location = c("a","a","a","a","a","a","b","b","b"),
year = c(2020,2020,2020,2021,2021,2021,2020,2020,2020),
group1 = c(1,2,3,1,2,3,1,2,3),
value = c("x","y","z","x","y","z","x","y","z"),
stringsAsFactors = FALSE)
# 2. Your code to expand data frame
df <- df[rep(seq_len(nrow(df)), 3), 1:4]
# 3. Arrange
df <- df %>% arrange(location, year, group1, value)
# 4. Add 'group2'
df <- df %>%
group_by(location, year, group1, value) %>%
mutate(group2 = row_number()) %>% # 1, 2, 3 within each group; also safe when group1 is 0
arrange(location, year, group1, value, group2)
Hope it works

We can use crossing from tidyr
library(tidyr)
library(dplyr)
crossing(df1, group2 = 1:3)
# A tibble: 27 x 5
# Location year group1 Value group2
# <chr> <int> <int> <chr> <int>
# 1 a 2020 1 x 1
# 2 a 2020 1 x 2
# 3 a 2020 1 x 3
# 4 a 2020 2 y 1
# 5 a 2020 2 y 2
# 6 a 2020 2 y 3
# 7 a 2020 3 z 1
# 8 a 2020 3 z 2
# 9 a 2020 3 z 3
#10 a 2021 1 x 1
# … with 17 more rows
Or create a list column and then unnest
df1 %>%
mutate(group2 = list(1:3)) %>%
unnest(c(group2))
data
df1 <- structure(list(Location = c("a", "a", "a", "a", "a", "a", "b",
"b", "b"), year = c(2020L, 2020L, 2020L, 2021L, 2021L, 2021L,
2020L, 2020L, 2020L), group1 = c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L), Value = c("x", "y", "z", "x", "y", "z", "x", "y", "z"
)), class = "data.frame", row.names = c(NA, -9L))

Related

Merge rows with different values into a single row in R [duplicate]

I have a dataset that looks like this:
ID | age | disease
smith192 | 17 | lung_cancer
green484 | 12 | diabetes
green484 | 13 | heart_irregularities
tom584 | 12 | colon_cancer
tom584 | 14 | diabetes
tom584 | 15 | malnutrition
And I would like R to organize it into this:
ID | age_1 | disease_1 | age_2 | disease_2 | age_3 | disease_3 |
smith192 | 17 | lung_cancer | NA | NA | NA | NA |
green484 | 12 | diabetes | 13 | heart_irregularities | NA | NA |
tom584 | 12 | colon_cancer | 14 | diabetes | 15 | malnutrition |
Any help would be greatly appreciated!
You could create disease indices for each ID and then pivot the data to wide.
base
df |>
transform(n = ave(ID, ID, FUN = seq)) |>
reshape(direction = "wide", idvar = "ID", timevar = "n", v.names = c("age", "disease"))
# ID age.1 disease.1 age.2 disease.2 age.3 disease.3
# 1 smith192 17 lung_cancer NA <NA> NA <NA>
# 2 green484 12 diabetes 13 heart_irregularities NA <NA>
# 4 tom584 12 colon_cancer 14 diabetes 15 malnutrition
tidyverse
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
mutate(n = 1:n()) %>%
ungroup() %>%
pivot_wider(id_cols = ID, names_from = n, values_from = c(age, disease))
# # A tibble: 3 × 7
# ID age_1 age_2 age_3 disease_1 disease_2 disease_3
# <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 smith192 17 NA NA lung_cancer NA NA
# 2 green484 12 13 NA diabetes heart_irregularities NA
# 3 tom584 12 14 15 colon_cancer diabetes malnutrition
Data
df <- structure(list(ID = c("smith192", "green484", "green484", "tom584",
"tom584", "tom584"), age = c(17, 12, 13, 12, 14, 15), disease = c("lung_cancer",
"diabetes", "heart_irregularities", "colon_cancer", "diabetes",
"malnutrition")), class = "data.frame", row.names = c(NA, -6L))
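Both answers depend on a running index within each ID. The following minimal sketch isolates that step in base R; it uses seq_along() on the row positions rather than on the character ID column, which is an assumption made here so that the index comes out as an integer.

```r
# Same data as in the answer above
df <- data.frame(
  ID      = c("smith192", "green484", "green484", "tom584", "tom584", "tom584"),
  age     = c(17, 12, 13, 12, 14, 15),
  disease = c("lung_cancer", "diabetes", "heart_irregularities",
              "colon_cancer", "diabetes", "malnutrition")
)

# ave() applies seq_along() within each ID, so the counter restarts at 1 per ID
df$n <- ave(seq_along(df$ID), df$ID, FUN = seq_along)
df$n
# 1 1 2 1 2 3
```

The column n can then serve as the timevar in reshape() or as names_from in pivot_wider().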

R: reshape from "wide" to "long", keeping some variables "wide"

I have a data file in wide format, with a set of recurring variables (var1 and var2, below)
data have:
| ID | background vars| var1.A | var2.A | var1.B | var2.B | var1.C | var2.C |
| -: | :------------- |:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | data1 | 1 | 2 | 3 | 4 | 5 | 6 |
| 2 | data2 | 7 | 8 | 9 | 10 | 11 | 12 |
I need to reshape it "half way" into long format, i.e. keep each var group together (wide) and put each recurrence on a separate row (long).
data want:
| ID | background vars | recurrence | var1 | var2 |
| -: | :-------------- |:----------:|:------:|:------:|
| 1 | data1 | A | 1 | 2 |
| 1 | data1 | B | 3 | 4 |
| 1 | data1 | C | 5 | 6 |
| 2 | data2 | A | 7 | 8 |
| 2 | data2 | B | 9 | 10 |
| 2 | data2 | C | 11 | 12 |
I found some solutions for this using reshape(), gather() and melt().
However, all of these collapse ALL variables to long format and do not allow some variables to be kept "wide".
How can data be shaped this way using R?
Use the keyword '.value' in the names_to argument to keep that part of the column name in wide format:
tidyr::pivot_longer(df, c(-ID, -`background vars`),
names_sep = '\\.',
names_to = c('.value', 'recurrence'))
#> # A tibble: 6 x 5
#> ID `background vars` recurrence var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
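To make the effect of '.value' more concrete, here is a hand-rolled base R sketch of the same reshaping. It is not how tidyr implements it; the names suffixes, pieces and half_long are illustrative, and the background column is assumed to be called background.vars.

```r
df <- data.frame(
  ID = 1:2,
  background.vars = c("data1", "data2"),
  var1.A = c(1L, 7L), var2.A = c(2L, 8L),
  var1.B = c(3L, 9L), var2.B = c(4L, 10L),
  var1.C = c(5L, 11L), var2.C = c(6L, 12L)
)

suffixes <- c("A", "B", "C")

# For each recurrence, keep var1 and var2 side by side (wide)
# and record the suffix in its own column (long)
pieces <- lapply(suffixes, function(s) {
  data.frame(
    ID = df$ID,
    background.vars = df$background.vars,
    recurrence = s,
    var1 = df[[paste0("var1.", s)]],
    var2 = df[[paste0("var2.", s)]]
  )
})

# Stack the per-recurrence pieces and order them by ID, then recurrence
half_long <- do.call(rbind, pieces)
half_long <- half_long[order(half_long$ID, half_long$recurrence), ]
rownames(half_long) <- NULL
half_long
```

The part of each column name before the dot becomes an output column, which is what '.value' requests, and the part after the dot becomes the recurrence value.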
If you need your code to be easily readable and you feel that ".value" in @Allan's example is a little opaque, you might consider a two-step pivot: simply pivot_longer() and then immediately pivot_wider() with different parameters:
df <- structure(
list(
ID = 1:2,
background.vars = c("data1", "data2"),
var1.A = c(1L, 7L),
var2.A = c(2L, 8L),
var1.B = c(3L, 9L),
var2.B = c(4L, 10L),
var1.C = c(5L, 11L),
var2.C = c(6L, 12L)),
class = "data.frame",
row.names = c(NA, -2L)
)
require(tidyr)
#> Loading required package: tidyr
long.df <-
pivot_longer(df,
c(-ID, -`background.vars`), #lengthen all columns but these
names_sep = "\\.", #split column names wherever there is a '.'
names_to = c("var", "letter"))
long.df
#> # A tibble: 12 × 5
#> ID background.vars var letter value
#> <int> <chr> <chr> <chr> <int>
#> 1 1 data1 var1 A 1
#> 2 1 data1 var2 A 2
#> 3 1 data1 var1 B 3
#> 4 1 data1 var2 B 4
#> 5 1 data1 var1 C 5
#> 6 1 data1 var2 C 6
#> 7 2 data2 var1 A 7
#> 8 2 data2 var2 A 8
#> 9 2 data2 var1 B 9
#> 10 2 data2 var2 B 10
#> 11 2 data2 var1 C 11
#> 12 2 data2 var2 C 12
pivot_wider(long.df, names_from = "var")
#> # A tibble: 6 × 5
#> ID background.vars letter var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
Created on 2022-05-24 by the reprex package (v2.0.1)

How can I group_by() and then concatenate values of one column into a single column in R using dplyr?

I have data in the form of:
M | Y | title | terma | termb | termc
4 | 2009 | titlea | 2 | 0 | 1
6 | 2001 | titleb | 0 | 1 | 0
4 | 2009 | titlec | 1 | 0 | 1
I'm using dplyr's group_by() and summarise() to count instances of terms for each title:
data %>%
gather(key = term, value = total, terma:termc) %>%
group_by(M, Y, title, term) %>%
summarise(total = sum(total))
Which gives me something like this:
M | Y | title | term | count
4 | 2009 | titlea | terma | 2
4 | 2009 | titlea | termc | 1
6 | 2001 | titleb | termb | 1
4 | 2009 | titlec | terma | 1
4 | 2009 | titlec | termc | 1
Instead, I would like to be able to group by M, Y, and term, then concatenate any titles that are grouped and add their totals together. Desired output would look like this:
M | Y | title | term | count
4 | 2009 | titlea, titlec | terma | 3
4 | 2009 | titlea, titlec | termc | 2
6 | 2001 | titleb | termb | 1
How can I do this? Any help appreciated!
@akrun was very close. This ended up working:
data %>%
pivot_longer(cols = terma:termc, names_to = 'term', values_to = 'count') %>%
filter(count != 0) %>%
group_by(M, Y, term) %>%
summarise(title = toString(title), count = sum(count))
We can do
library(dplyr)
library(tidyr)
data %>%
mutate(across(starts_with('term'), ~ na_if(.x, 0L))) %>% # turn 0 counts into NA so they can be dropped
pivot_longer(cols = starts_with('term'), names_to = 'term',
values_to = 'count', values_drop_na = TRUE) %>%
group_by(M, Y, term) %>%
summarise(title = toString(title), count = sum(count))
# A tibble: 3 x 5
# Groups: M, Y [2]
# M Y term title count
# <int> <int> <chr> <chr> <int>
#1 4 2009 terma titlea, titlec 3
#2 4 2009 termc titlea, titlec 2
#3 6 2001 termb titleb 1
data
data <- structure(list(M = c(4L, 6L, 4L), Y = c(2009L, 2001L, 2009L),
title = c("titlea", "titleb", "titlec"), terma = c(2L, 0L,
1L), termb = c(0L, 1L, 0L), termc = c(1L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
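For comparison, the grouping and concatenation step can be sketched in base R with aggregate() and toString(). The data frame long below is an assumption: it is the already-pivoted data with zero counts removed, typed in by hand from the question's intermediate table.

```r
# Long-format data with zero counts already dropped (assumed starting point)
long <- data.frame(
  M     = c(4L, 4L, 6L, 4L, 4L),
  Y     = c(2009L, 2009L, 2001L, 2009L, 2009L),
  title = c("titlea", "titlea", "titleb", "titlec", "titlec"),
  term  = c("terma", "termc", "termb", "terma", "termc"),
  count = c(2L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# toString() collapses a vector into one comma-separated string
toString(c("titlea", "titlec"))
# "titlea, titlec"

# Concatenate titles and sum counts per M, Y and term, then combine the results
titles <- aggregate(title ~ M + Y + term, data = long, FUN = toString)
counts <- aggregate(count ~ M + Y + term, data = long, FUN = sum)
result <- merge(titles, counts, by = c("M", "Y", "term"))
result
```

Two aggregate() calls are used because the two columns need different summary functions; the dplyr answers handle both in a single summarise().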

R : How to tag a subject if one of their columns has a certain value

This is what my data looks like:
+---------+--+----------+--+
| Subj_ID | | Location | |
+---------+--+----------+--+
| 1 | | 1 | |
| 1 | | 2 | |
| 1 | | 3 | |
| 2 | | 1 | |
| 2 | | 4 | |
| 2 | | 2 | |
| 3 | | 1 | |
| 3 | | 2 | |
| 3 | | 5 | |
+---------+--+----------+--+
In this dataset, only subject 1 has a location value of 3, so I want to label subject 1 as YES for intervention. Since subjects 2 and 3 don't have a location value of 3, they need to be labeled as NO.
This is what I want the data to look like.
+---------+--+----------+--------------+
| Subj_ID | | Location | Intervention |
+---------+--+----------+--------------+
| 1 | | 1 | YES |
| 1 | | 2 | YES |
| 1 | | 3 | YES |
| 2 | | 1 | NO |
| 2 | | 4 | NO |
| 2 | | 2 | NO |
| 3 | | 1 | NO |
| 3 | | 2 | NO |
| 3 | | 5 | NO |
+---------+--+----------+--------------+
Thanks in advance for the help! Dplyr preferred if possible.
An option with dplyr: after grouping by 'Subj_ID', check whether 3 is %in% Location, which returns a single TRUE/FALSE, then convert that to a numeric index to pick the value from "NO", "YES"
library(dplyr)
df1 %>%
group_by(Subj_ID) %>%
mutate(Intervention = c("NO", "YES")[(3 %in% Location)+1])
# A tibble: 9 x 3
# Groups: Subj_ID [3]
# Subj_ID Location Intervention
# <int> <dbl> <chr>
#1 1 1 YES
#2 1 2 YES
#3 1 3 YES
#4 2 1 NO
#5 2 4 NO
#6 2 2 NO
#7 3 1 NO
#8 3 2 NO
#9 3 5 NO
Or use any
df1 %>%
group_by(Subj_ID) %>%
mutate(Intervention = case_when(any(Location == 3) ~ "YES", TRUE ~ "NO"))
Or using base R
df1$Intervention <- with(df1, c("NO", "YES")[1 + (Subj_ID %in%
Subj_ID[Location == 3])])
data
df1 <- data.frame(Subj_ID = rep(1:3, each = 3),
Location = c(1:3, 1, 4, 2, 1, 2, 5))
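The logical-to-index trick used in this answer can be unpacked step by step. The sketch below uses tapply() to get one TRUE/FALSE per subject; the intermediate names has_three and labels are illustrative only.

```r
df1 <- data.frame(Subj_ID = rep(1:3, each = 3),
                  Location = c(1:3, 1, 4, 2, 1, 2, 5))

# One TRUE/FALSE per subject: does any of their rows have Location 3?
has_three <- tapply(df1$Location == 3, df1$Subj_ID, any)

# FALSE + 1 is 1 (picks "NO"), TRUE + 1 is 2 (picks "YES")
labels <- c("NO", "YES")[has_three + 1]
names(labels) <- names(has_three)

# Look up each row's label by its subject id
df1$Intervention <- unname(labels[as.character(df1$Subj_ID)])
df1
```

Adding 1 to a logical works because R coerces FALSE to 0 and TRUE to 1, so the result is a valid position in the two-element label vector.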
We can use match for each Subj_ID to check if 3 is present in any Location.
library(dplyr)
df %>%
group_by(Subj_ID) %>%
mutate(Intervention = c('Yes', 'No')[is.na(match(3,Location)) + 1])
#Can also use
#mutate(Intervention = c('No', 'Yes')[(match(3,Location, nomatch = 0L) > 0) + 1])
# Subj_ID Location Intervention
# <int> <dbl> <chr>
#1 1 1 Yes
#2 1 2 Yes
#3 1 3 Yes
#4 2 1 No
#5 2 4 No
#6 2 2 No
#7 3 1 No
#8 3 2 No
#9 3 5 No
data
df <- structure(list(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
Location = c(1, 2, 3, 1, 4, 2, 1, 2, 5)), class = "data.frame",
row.names = c(NA, -9L))

R programming: ifelse on multiple tables [duplicate]

I am just venturing into R programming and finding my way around.
Let's say I have a table as below:
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
Y | A | 1
Y | B | 2
Y | C | 5
Z | A | 3
Z | B | 6
Z | C | 2
I need to change the sales values of certain products based on another table. Please find below:
Product | Sales
A | 10
B | 7
C | 15
My final table should be:
Store | Product | Sales
X | A | 10
X | B | 7
X | C | 15
Y | A | 10
Y | B | 7
Y | C | 15
Z | A | 10
Z | B | 7
Z | C | 15
I have 2 methods of doing this now:
1) Using joins
2) Using an if-else statement inside a for loop to subset the
Is there any other way to do this more effectively and in fewer steps?
Thanks in advance!
EDIT: I forgot to mention an exception earlier. What if my dataset is like below?
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
X | D | 4
Y | A | 1
Y | B | 2
Y | C | 5
Y | D | 2
Z | A | 3
Z | B | 6
Z | C | 2
Z | D | 3
There's an extra product(D) with sales. I want to retain the value of sales for that product if it is not present in the 2nd table which is:
Product | Sales
A | 10
B | 7
C | 15
How about this join?
Since you want to change the Sales values of certain Products only, I have included just two products in lookup_df to illustrate this.
library(dplyr)
df %>%
left_join(lookup_df, by = "Product", suffix = c("_Original", "_New")) %>%
mutate(Sales_New = coalesce(Sales_New, Sales_Original))
Output is:
Store Product Sales_Original Sales_New
1 X A 2 10
2 X B 1 1
3 X C 3 15
4 Y A 1 10
5 Y B 2 2
6 Y C 5 15
7 Z A 3 10
8 Z B 6 6
9 Z C 2 15
Sample data:
df <- structure(list(Store = c("X", "X", "X", "Y", "Y", "Y", "Z", "Z",
"Z"), Product = c("A", "B", "C", "A", "B", "C", "A", "B", "C"
), Sales = c(2L, 1L, 3L, 1L, 2L, 5L, 3L, 6L, 2L)), .Names = c("Store",
"Product", "Sales"), class = "data.frame", row.names = c(NA,
-9L))
lookup_df <- structure(list(Product = c("A", "C"), Sales = c(10L, 15L)), .Names = c("Product", "Sales"), class = "data.frame", row.names = c(NA,
-2L))
# Product Sales
#1 A 10
#2 C 15
If you use a lookup-vector, it is relatively short:
d <- read.table(text = "
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
Y | A | 1
Y | B | 2
Y | C | 5
Z | A | 3
Z | B | 6
Z | C | 2", sep = "|", header = T, stringsAsFactors = F)
lookup <- read.table(text = "Product | Sales
A | 10
B | 7
C | 15", sep = "|", header = T, stringsAsFactors = F)
lookup$Product <- gsub("^\\s+|\\s+$", "", lookup$Product) # remove spaces
lookup <- setNames(lookup$Sales, lookup$Product) # convert to vector
d$Product <- gsub("^\\s+|\\s+$", "", d$Product) # remove spaces
d$Sales <- lookup[d$Product] # main part
d
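For the case in the question's EDIT, where product D has no entry in the second table, a plain lookup returns NA for D. A possible extension, sketched here with the data typed in directly rather than parsed with read.table(), falls back to the original value; new_sales is an illustrative name.

```r
# Data from the question's EDIT, including product D
d <- data.frame(
  Store   = rep(c("X", "Y", "Z"), each = 4),
  Product = rep(c("A", "B", "C", "D"), times = 3),
  Sales   = c(2, 1, 3, 4, 1, 2, 5, 2, 3, 6, 2, 3),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are products, values are the new sales
lookup <- c(A = 10, B = 7, C = 15)

# Indexing by name gives NA for products missing from the lookup (here: D)
new_sales <- unname(lookup[d$Product])

# Keep the original Sales wherever the lookup has no replacement
d$Sales <- ifelse(is.na(new_sales), d$Sales, new_sales)
d
```

This mirrors what coalesce() does in the join-based answer: take the new value when one exists, otherwise keep the old one.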

Resources