Changing values of many columns at once -- model.matrix()? - r

Here is dput() of a structure I currently have.
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA"), USA = c(0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -5L))
I want to edit this data frame to get the result below, and then apply the same approach to a dataset with 100k+ observations. Specifically, I want to use the information in df$country, which gives a country assigned to a particular ID (e.g., id == 1 and country == "Japan"), to set the column whose name matches that country (e.g., the column named "Japan") to 1 for every row with that ID. Note that IDs are not unique!
This is what I'd like to end up with:
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA"), USA = c(1, 1, 0, 1, 1), Germany = c(0, 0, 1, 1, 1), Japan = c(1, 1, 0, 0, 0)), class = "data.frame", row.names = c(NA, -5L))
The following code gives a close result:
df[levels(factor(df$country))] = model.matrix(~country - 1, df)
But it ends up giving me the following incorrect structure:
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan",
"Germany", "Germany", "USA"), USA = c(1, 0, 0, 0, 1), Germany = c(0,
0, 1, 1, 0), Japan = c(0, 1, 0, 0, 0)), row.names = c(NA, -5L
), class = "data.frame")
How can I edit the above command in order to yield my desired result? I cannot use pivot because, in actuality, I'm working with many datasets that have different values in the "country" column that, once pivoted, will yield datasets with non-uniform columns/structures, which will impede data analysis later on.
Thank you for any help!

Perhaps this helps
library(dplyr)
df %>%
  mutate(across(USA:Japan, ~ +(country == cur_column()))) %>%
  group_by(id) %>%
  mutate(across(USA:Japan, max)) %>%
  ungroup()
Output:
# A tibble: 5 × 5
id country USA Germany Japan
<dbl> <chr> <int> <int> <int>
1 1 USA 1 0 1
2 1 Japan 1 0 1
3 2 Germany 0 1 0
4 4 Germany 1 1 0
5 4 USA 1 1 0
Or modify the model.matrix() result as follows:
m1 <- model.matrix(~country - 1, df)
m1[] <- ave(c(m1), df$id[row(m1)], col(m1), FUN = max)
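The two lines above leave the result in m1 rather than in df. A minimal sketch of the full round trip, assuming the sample data from the question and that the dummy columns should be overwritten in place (the sub() call that strips the "country" prefix from the column names is an addition for illustration, not part of the original answer):

```r
# Sample data from the question
df <- data.frame(
  id = c(1, 1, 2, 4, 4),
  country = c("USA", "Japan", "Germany", "Germany", "USA"),
  USA = 0, Germany = 0, Japan = 0
)

# One indicator column per country (row-level one-hot encoding)
m1 <- model.matrix(~ country - 1, df)

# Within each (id, column) pair, spread the maximum to every row of that id
m1[] <- ave(c(m1), df$id[row(m1)], col(m1), FUN = max)

# model.matrix() names its columns "countryUSA" etc.; strip the prefix
colnames(m1) <- sub("^country", "", colnames(m1))

# Overwrite the matching columns of df
df[colnames(m1)] <- m1
df
```

Because the columns are assigned by name, their order in df does not need to match the alphabetical order that model.matrix() produces.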

You can use base R
re <- rle(df$id)
for (j in re$values) {
  y <- which(j == df$id)
  df[y, match(df$country[y], colnames(df))] <- 1
}
Output
id country USA Germany Japan
1 1 USA 1 0 1
2 1 Japan 1 0 1
3 2 Germany 0 1 0
4 4 Germany 1 1 0
5 4 USA 1 1 0
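With 100k+ rows, the per-id loop can also be replaced by a lookup into an id-by-country incidence table. This is a sketch of an alternative, not the method of the answer above. It assumes the dummy columns already exist in df, and through intersect() it only touches columns that are present, so the column layout stays the same across datasets:

```r
# Sample data from the question
df <- data.frame(
  id = c(1, 1, 2, 4, 4),
  country = c("USA", "Japan", "Germany", "Germany", "USA"),
  USA = 0, Germany = 0, Japan = 0
)

# TRUE where a given id occurs with a given country at least once
inc <- unclass(table(df$id, df$country)) > 0

# Only fill dummy columns that already exist in df
cols <- intersect(colnames(inc), names(df))

for (cn in cols) {
  # Look up each row's id in the incidence table
  df[[cn]] <- as.integer(inc[as.character(df$id), cn])
}
df
```

The loop here runs once per country column rather than once per id, which is usually the smaller of the two counts.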

Are you looking for a solution like this for your closed question, "CRAN R - Assign the value '1' to many dummy variables at once"?
The solution provided by @akrun solves the question here, but you may be looking for something like this:
library(dplyr)
library(tidyr)  # for fill()
df %>%
  group_by(id) %>%
  mutate(across(-country, ~case_when(country == cur_column() ~ 1))) %>%
  fill(-country, .direction = "updown") %>%
  mutate(across(-country, ~ifelse(is.na(.), 0, .))) %>%
  ungroup()
id country USA Germany Japan
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 USA 1 0 1
2 1 Japan 1 0 1
3 2 Germany 0 1 0
4 4 Germany 1 1 0
5 4 USA 1 1 0

Related

Create multiple variables by id using the same data set in dplyr [closed]

I have a data set like this:
df <- data.frame(year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007"), id = c("X", "X", "X", "X", "Z", "Z", "Z"), product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon"), market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL"), value = c(1, 2, 3, 4, 5, 6, 7))
I want to create the following variables by id:
years_PM = number of years in this product and market (including year t-1)
value_PM = total value in this product and market (including year t-1)
years_OPM = number of years OTHER PRODUCTS in OTHER MARKETS (including year t-1)
years_SP_OM = number of years SAME PRODUCT in OTHER MARKETS (including year t-1)
history = takes value 1 if a given id has a history (including year t-1)
year_id = number of years of same id (including year t-1)
year_id_consecutive = number of years of same id. If there are more than 2 consecutive years without observing the same id, then it will assign a 0 (including year t-1) - e.g. the count will start from 0 (as a new observation).
n_id_PM = number of id's (different than the one observed) in this product and market (in year t-1)
Therefore, the new dataset will look like:
df_new <- data.frame(year = c("2000", "2000", "2000", "2002", "2000",
"2002", "2007"), id = c("X", "X", "X", "X", "Z", "Z", "Z"), product = c("apple",
"orange", "orange", "orange", "cake", "cake", "bacon"), market = c("CHN",
"USA", "USA", "USA", "SPA", "CHL", "CHL"), value = c(1, 2, 3,
4, 5, 6, 7), years_PM = c(0, 0, 0, 1, 0, 0, 0), value_PM = c(0,
0, 0, 5, 0, 0, 0), years_OPM = c(0, 0, 0, 1, 0, 0, 0), years_SP_OM = c(0,
0, 0, 0, 0, 1, 0),
history = c(0, 0, 0, 1, 0, 1, 1), year_id = c(0, 0, 0, 1,
0, 1, 2), year_id_consecutive = c(0, 0, 0, 1, 0, 1, 0), n_id_PM = c(0,
0, 0, 0, 0, 0, 0))
I have used summarise, but it reduces the number of rows. I don't want to merge multiple datasets afterwards. Moreover, mutate did not do the trick either.
Any idea how to use dplyr to create them more directly?
Don't use summarize (as has been said multiple times); it will (almost) always reduce your data.
Here's a shot, given the various variables you've asked for in the three iterations of this question.
library(dplyr)
df %>%
  mutate(year = as.integer(year)) %>%
  group_by(product, market) %>%
  mutate(
    FPFM = +(year == min(year)),
    years_PM = sapply(year, function(y) n_distinct(year[year < y])),
    value_PM = sapply(year, function(y) sum(value[year < y])),
    n_id_PM = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  group_by(product) %>%
  mutate(
    FP = +(year == min(year)),
    years_P = sapply(year, function(y) n_distinct(unique(year[year < y]))),
    value_P = sapply(year, function(y) sum(value[year < y])),
    n_id_P = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  group_by(market) %>%
  mutate(
    FM = +(year == min(year)),
    years_M = sapply(year, function(y) n_distinct(unique(year[year < y]))),
    value_M = sapply(year, function(y) sum(value[year < y])),
    n_id_M = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  ungroup() %>%
  mutate(
    years_OPM = mapply(function(y, p, m) n_distinct(year[year < y & product != p & market != m]),
                       year, product, market),
    years_SP_OM = mapply(function(y, p, m) n_distinct(year[year < y & product == p & market != m]),
                         year, product, market),
    years_OP_SM = mapply(function(y, p, m) n_distinct(year[year < y & product != p & market == m]),
                         year, product, market)
  ) %>%
  group_by(id) %>%
  mutate(
    history = +(lengths(sapply(year, function(y) year[year < y])) > 0),
    year_id = sapply(year, function(y) n_distinct(year[year < y])),
    year_id_consecutive = sapply(year, function(y) {
      years <- year[year < y]
      if (length(years)) {
        +(length(setdiff(seq(min(years), max(years)), years)) < 2)
      } else 0L
    })
  ) %>%
  ungroup()
# # A tibble: 7 × 23
# year id product market value FPFM years_PM value_PM n_id_PM FP years_P value_P n_id_P FM years_M value_M n_id_M years_OPM years_SP_OM years_OP_SM history year_id year_id_consecutive
# <int> <chr> <chr> <chr> <dbl> <int> <int> <dbl> <int> <int> <int> <dbl> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
# 1 2000 X apple CHN 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 2 2000 X orange USA 2 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 3 2000 X orange USA 3 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 4 2002 X orange USA 4 0 1 5 1 0 1 5 1 0 1 5 1 1 0 0 1 1 1
# 5 2000 Z cake SPA 5 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 6 2002 Z cake CHL 6 1 0 0 0 0 1 5 1 1 0 0 0 1 1 0 1 1 1
# 7 2007 Z bacon CHL 7 1 0 0 0 1 0 0 0 0 1 6 1 2 0 1 1 2 1
Some of the values are different from yours, but I think it's likely due to either errors in your expected output or misunderstanding/miscommunication of each column's intent.
The pattern should be clear for each: group_by the relevant variables, and as necessary iterate over year or some other variable (to limit to previous years) and count/sum/whatever.
I took the liberty of fixing year to be an integer.
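The group-then-look-back pattern can be sketched in isolation for a single variable. The snippet below is a base R analogue that uses ave() instead of group_by() (a swapped-in technique, chosen only to keep the example free of dependencies); count_prior_years is a hypothetical helper name, and year is entered as an integer from the start:

```r
df <- data.frame(
  year = c(2000L, 2000L, 2000L, 2002L, 2000L, 2002L, 2007L),
  id = c("X", "X", "X", "X", "Z", "Z", "Z"),
  product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon"),
  market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL"),
  value = c(1, 2, 3, 4, 5, 6, 7)
)

# Hypothetical helper: for each year in a group, count the distinct earlier years
count_prior_years <- function(year) {
  vapply(year, function(y) length(unique(year[year < y])), integer(1))
}

# Apply the helper within each product/market group; rows keep their order
df$years_PM <- ave(df$year, df$product, df$market, FUN = count_prior_years)
df$years_PM
```

vapply() is used rather than sapply() so that empty product/market combinations, which ave() also visits, return an empty integer vector instead of a list.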


dplyr case_when does not return expected result when using multiple columns

I'm trying to create a new column, id, with mutate(), based on a match found using case_when(). When a match is found, it should add an ID value; if not, it should leave NA. This works well when I have fewer columns in the case_when() test, but when I use multiple columns the output is not as expected. Below is a reproducible example:
set.seed(10)
library(dplyr)
values <- c(0, 1)
country <- c("USA", "Germany","UK","Russia","China")
role <- c("admin", "developer","UI designer","HR","manager")
Df <- dplyr::tibble(
cname = sample(country, 10, replace = TRUE),
role = sample(role, 10, replace = TRUE),
b = sample(values, 10, replace = TRUE),
c = sample(values, 10, replace = TRUE),
d = sample(values, 10, replace = TRUE),
e = sample(values, 10, replace = TRUE),
f = sample(values, 10, replace = TRUE),
g = sample(values, 10, replace = TRUE)
)
Df
# A tibble: 10 x 8
cname role b c d e f g
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 China manager 0 0 1 0 1 1
2 UK developer 0 1 1 1 1 0
3 USA developer 0 0 0 0 1 0
4 Germany HR 1 0 0 0 1 0
5 UK developer 0 1 1 1 0 0
6 Germany admin 1 1 1 1 1 0
7 Germany UI designer 1 0 0 1 0 0
8 UK HR 1 0 1 0 1 1
9 USA HR 1 0 1 0 1 1
10 China manager 0 0 0 0 0 0
I need an ID for each department column:
b = 1, c = 2, d = 3, e = 4, f = 5, g = 6
Expected output (I have removed the other columns, but it is fine to retain the 0/1 columns as well):
cname role Department Department_id
China manager d 3
China manager f 5
China manager g 6
UK developer c 2
UK developer d 3
UK developer e 4
UK developer f 5
UK developer g 6
USA developer f 1
Created on 2021-09-01 by the reprex package (v2.0.1)
Update
Still based on camille's comment:
library(dplyr)
library(tidyr)  # for pivot_longer() and drop_na()
df %>%
  pivot_longer(-c(cname, role),
               names_to = "Departement",
               values_to = "Departement_ID") %>%
  group_by(cname, role, Departement) %>%
  summarise(Departement_ID = ifelse(any(Departement_ID == 1),
                                    which(names(df) == unique(Departement)) - 2,
                                    NA_integer_)) %>%
  drop_na()
returns
# A tibble: 28 x 4
# Groups: cname, role [8]
cname role Departement Departement_ID
<chr> <chr> <chr> <dbl>
1 China developer d 3
2 China developer e 4
3 China developer f 5
4 Germany HR b 1
5 Germany HR c 2
6 Germany manager b 1
7 Germany manager c 2
8 Germany manager d 3
9 Germany manager e 4
10 Germany manager f 5
11 Russia manager b 1
12 Russia manager d 3
13 UK admin b 1
14 UK admin c 2
15 UK admin e 4
16 UK admin f 5
17 UK admin g 6
18 UK developer b 1
19 UK developer c 2
20 UK developer d 3
21 UK developer f 5
22 UK manager b 1
23 UK manager c 2
24 UK manager d 3
25 UK manager e 4
26 UK manager f 5
27 USA manager e 4
28 USA manager g 6
Data
df <- structure(list(cname = c("UK", "USA", "Germany", "Russia", "UK",
"Germany", "Germany", "Germany", "China", "UK"), role = c("developer",
"manager", "manager", "manager", "admin", "HR", "developer",
"manager", "developer", "manager"), b = c(1, 0, 0, 1, 1, 1, 0,
1, 0, 1), c = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1), d = c(1, 0, 1,
1, 0, 0, 0, 0, 1, 1), e = c(0, 1, 0, 0, 1, 0, 0, 1, 1, 1), f = c(1,
0, 1, 0, 1, 0, 0, 1, 1, 1), g = c(0, 1, 0, 0, 1, 0, 0, 0, 0,
0)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
If I understood correctly, you want the position of each letter in the alphabet (minus 1, so that b = 1). Here is a solution I came up with:
library(tidyverse)
Df %>%
  # Pivot the data so that all department letters are in a single column
  pivot_longer(cols = -c(cname, role)) %>%
  # Keep only the departments flagged with 1, as in the expected output
  filter(value == 1) %>%
  # Evaluate the next step one row at a time
  rowwise() %>%
  # id = position of the letter in the alphabet, minus 1
  mutate(id = which(str_detect(letters, name)) - 1)
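The same long format can be sketched in base R with which(..., arr.ind = TRUE), which returns the row and column position of every 1 directly. This is only an illustration under stated assumptions: the data below are the first three rows of the printed Df, typed in by hand, and dept_cols, hits and out are hypothetical names.

```r
# First three rows of the printed Df, entered manually for illustration
Df <- data.frame(
  cname = c("China", "UK", "USA"),
  role = c("manager", "developer", "developer"),
  b = c(0, 0, 0), c = c(0, 1, 0), d = c(1, 1, 0),
  e = c(0, 1, 0), f = c(1, 1, 1), g = c(1, 0, 0)
)

dept_cols <- c("b", "c", "d", "e", "f", "g")

# Row/column positions of every cell equal to 1
hits <- which(as.matrix(Df[dept_cols]) == 1, arr.ind = TRUE)

# Sort by original row, then by department position
hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

# The column position within dept_cols is the department ID (b = 1, ..., g = 6)
out <- data.frame(
  Df[hits[, "row"], c("cname", "role")],
  Department = dept_cols[hits[, "col"]],
  Department_id = unname(hits[, "col"]),
  row.names = NULL
)
out
```

Since the ID comes from the position in dept_cols, the mapping can be changed by reordering that vector, without relying on the alphabet.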

Subtracting each column from its previous one in a data frame

I have a very simple case here in which I would like to subtract from each column the one before it. In other words, I am looking for a sliding subtraction: the first column stays as is, then the first column is subtracted from the second, the second from the third, and so on until the last column.
Here is my sample data set:
structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))
and my desired output:
structure(list(x = c(1, 0, 0, 0), y = c(0, 0, 1, 1), z = c(-1,
1, 0, 0)), class = "data.frame", row.names = c(NA, -4L))
I am personally looking for a solution with purrr family of functions. I also thought about slider but I'm not quite familiar with the latter one. So I would appreciate any help and idea with these two packages in advance. Thank you very much.
A simple dplyr-only solution:
cur_data() inside mutate()/summarise() returns a copy of the whole current data frame, so just subtract cur_data()[-ncol(.)] from cur_data()[-1]. You can do something similar with purrr's pmap_dfr().
df <- structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))
library(dplyr)
df %>%
mutate(cur_data()[-1] - cur_data()[-ncol(.)])
#> x y z
#> 1 1 0 -1
#> 2 0 0 1
#> 3 0 1 0
#> 4 0 1 0
Similarly, with purrr:
library(purrr)
pmap_dfr(df, ~c(c(...)[1], c(...)[-1] - c(...)[-ncol(df)]))
I think you are looking for pmap_df with lag to subtract the previous value.
library(purrr)
library(dplyr)
pmap_df(df, ~{x <- c(...);x - lag(x, default = 0)})
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 0 -1
#2 0 0 1
#3 0 1 0
#4 0 1 0
Verbose, but simple:
df %>%
  select(x) %>%
  bind_cols(df %>%
              select(-1) %>%
              map2_dfc(df %>%
                         select(-ncol(df)), ~ .x - .y))
# x y z
#1 1 0 -1
#2 0 0 1
#3 0 1 0
#4 0 1 0
We can do this in base R, with no packages needed (df1 is the sample data frame from the question):
cbind(df1[1], df1[-1] - df1[-ncol(df1)])
Output:
x y z
1 1 0 -1
2 0 0 1
3 0 1 0
4 0 1 0
Or using dplyr
library(dplyr)
df1 %>%
mutate(.[-1] - .[-ncol(.)])
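When all columns are numeric, the same sliding subtraction can also be done on a matrix. This is a sketch of a variant, not one of the answers above; m and res are hypothetical names, and the check at the end compares against the desired output from the question:

```r
df <- data.frame(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0, 1, 1, 1))

m <- as.matrix(df)

# The right-hand side is computed from the original m before anything is overwritten:
# every column except the first, minus every column except the last
m[, -1] <- m[, -1] - m[, -ncol(m)]

res <- as.data.frame(m)

# Desired output from the question
expected <- data.frame(x = c(1, 0, 0, 0), y = c(0, 0, 1, 1), z = c(-1, 1, 0, 0))
isTRUE(all.equal(res, expected, check.attributes = FALSE))
```

The first column is never assigned to, so it stays as is, which is the behaviour the question asks for.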

How to build a string variable to capture multi-column info

I have a df that can be built with this code:
structure(list(ID = c(1, 2, 3, 4, 5), Pass = c(0, 1, 1, 1, 1),
Math = c(0, 0, 1, 1, 1), ELA = c(0, 1, 0, 1, 0), PE = c(0,
0, 1, 1, 1)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Here Pass indicates whether a student passed any test. Now I want to build a new variable, Result, that captures each student's test results. What should I do?
Try the base R code below
q <- with(data.frame(which(df[-(1:2)] == 1, arr.ind = TRUE)),
tapply(names(df[-(1:2)])[col], factor(row, levels = 1:nrow(df)), toString))
df$Result <- ifelse(is.na(q), "Not Pass", paste0("Pass: ", q))
which gives
> df
# A tibble: 5 x 6
ID Pass Math ELA PE Result
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 0 0 0 Not Pass
2 2 1 0 1 0 Pass: ELA
3 3 1 1 0 1 Pass: Math, PE
4 4 1 1 1 1 Pass: Math, ELA, PE
5 5 1 1 0 1 Pass: Math, PE
Using dplyr with rowwise
library(dplyr)
library(stringr)
df1 %>%
rowwise %>%
mutate(Result = if(as.logical(Pass))
str_c('Pass: ', toString(names(select(., Math:PE))[as.logical(c_across(Math:PE))])) else 'Not pass' ) %>%
ungroup
# A tibble: 5 x 6
# ID Pass Math ELA PE Result
# <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 1 0 0 0 0 Not pass
#2 2 1 0 1 0 Pass: ELA
#3 3 1 1 0 1 Pass: Math, PE
#4 4 1 1 1 1 Pass: Math, ELA, PE
#5 5 1 1 0 1 Pass: Math, PE
data
df1 <- structure(list(ID = c(1, 2, 3, 4, 5), Pass = c(0, 1, 1, 1, 1),
Math = c(0, 0, 1, 1, 1), ELA = c(0, 1, 0, 1, 0), PE = c(0,
0, 1, 1, 1)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Here's one solution:
library(dplyr)
library(tidyr)  # for pivot_longer() and pivot_wider()
library(magrittr)
library(stringr)
df <- structure(list(ID = c(1, 2, 3, 4, 5), Pass = c(0, 1, 1, 1, 1),
Math = c(0, 0, 1, 1, 1), ELA = c(0, 1, 0, 1, 0), PE = c(0,
0, 1, 1, 1)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
df %<>% pivot_longer(cols = -c(ID, Pass), names_to = "sub", values_to = "done")
df %<>% group_by(ID) %>% mutate(Result = paste0(ifelse(done == 1, sub, NA), collapse = ", ")) %>% ungroup()
df %<>% pivot_wider(names_from = sub, values_from = done)
df %<>% mutate(Result = paste0("Pass: ", str_replace_all(Result, "NA[, ]*", "")))
df %<>% mutate(Result = ifelse(str_detect(Result, "Pass: $"), "Not pass", str_replace_all(Result, ",[\\s]*$", "")))
df
# # A tibble: 5 x 6
# ID Pass Result Math ELA PE
# <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 0 Not pass 0 0 0
# 2 2 1 Pass: ELA 0 1 0
# 3 3 1 Pass: Math, PE 1 0 1
# 4 4 1 Pass: Math, ELA, PE 1 1 1
# 5 5 1 Pass: Math, PE 1 0 1
I can provide an explanation of what the code is doing if necessary.
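For comparison, the same Result column can be sketched in a few lines of base R by working row by row over the subject columns. This is an alternative under stated assumptions rather than a rewrite of the code above: the data are rebuilt as a plain data frame, and subjects and passed are hypothetical names.

```r
df <- data.frame(
  ID = 1:5,
  Pass = c(0, 1, 1, 1, 1),
  Math = c(0, 0, 1, 1, 1),
  ELA = c(0, 1, 0, 1, 0),
  PE = c(0, 0, 1, 1, 1)
)

subjects <- c("Math", "ELA", "PE")

# For each row, join the names of the subjects flagged with 1 ("" if none)
passed <- apply(df[subjects] == 1, 1, function(r) toString(subjects[r]))

# Label rows with no passed subject separately
df$Result <- unname(ifelse(passed == "", "Not pass", paste0("Pass: ", passed)))
df
```

This version derives "Not pass" from the subject columns themselves, so it does not depend on the Pass column being consistent with them.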
