I have example data as follows:
dat <- structure(list(
zipcode = c(1001, 1002, 1003, 1004, 1101, 1102, 1103, 1104, 1201, 1202, 1203, 1302),
areacode = c(4, 4, NA, 4, 4, 4, NA, 1, 4, 4, NA, 4),
type = structure(c(1L, 1L, NA, 1L, 2L, 2L, NA, 1L, 1L, 1L, NA, 1L),
.Label = c("clay", "sand"), class = "factor"),
region = c(3, 3, NA, 3, 3, 3, NA, 3, 3, 3, NA, 3),
do_not_fill = c(1, NA, NA, 1, 1, NA, NA, 1, NA, NA, NA, 1)),
class = c("data.table", "data.frame"), row.names = c(NA, -4L))
zipcode areacode type region do_not_fill
1: 1001 4 clay 3 1
2: 1002 4 clay 3 NA
3: 1003 NA <NA> NA NA
4: 1004 4 clay 3 1
5: 1101 4 sand 3 1
6: 1102 4 sand 3 NA
7: 1103 NA <NA> NA NA
8: 1104 1 clay 3 1
9: 1201 4 clay 3 NA
10: 1202 4 clay 3 NA
11: 1203 NA <NA> NA NA
12: 1302 4 clay 3 1
I want to fill ONLY the columns areacode, type and region based on two conditions.
The areacode has to be the same before and after the NA.
The first two digits of the zipcode have to be the same before and after the NA.
Based on this solution and this solution, I attempted the following (data.table solutions are welcome, and even preferred):
library(dplyr)
dat |>
mutate(type = as.character(type)) |>
mutate(across(1:4,
~ ifelse(is.na(.) & lag(areacode) == lead(areacode) &
lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))),
lag(.),
.)))
But somewhere I am doing something wrong, because I get:
Error:
! `n` and `row.names` must be consistent.
Run `rlang::last_error()` to see where the error occurred.
Desired output:
zipcode areacode type region do_not_fill
1: 1001 4 clay 3 1
2: 1002 4 clay 3 NA
3: 1003 4 clay 3 NA
4: 1004 4 clay 3 1
5: 1101 4 sand 3 1
6: 1102 4 sand 3 NA
7: 1103 NA <NA> NA NA
8: 1104 1 clay 3 1
9: 1201 4 clay 3 NA
10: 1202 4 clay 3 NA
11: 1203 NA <NA> NA NA
12: 1302 4 clay 3 1
EDIT
as_tibble(dat) |>
mutate(type = as.character(type)) |>
mutate(across(1:4,
~ ifelse(is.na(.) & lag(areacode) == lead(areacode) &
lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))),
lag(.),
.)))
# A tibble: 12 x 5
zipcode areacode type region do_not_fill
<dbl> <dbl> <chr> <dbl> <dbl>
1 1001 4 clay 3 1
2 1002 4 clay 3 NA
3 1003 4 clay 3 NA
4 1004 4 clay 3 1
5 1101 4 sand 3 1
6 1102 4 sand 3 NA
7 1103 NA NA NA NA
8 1104 1 clay 3 1
9 1201 4 clay 3 NA
10 1202 4 clay 3 NA
11 1203 NA NA NA NA
12 1302 4 clay 3 1
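You need to convert it to a tibble first. The dput declares row.names = c(NA, -4L), so the object claims to have 4 rows although every column holds 12 values; as_tibble() rebuilds consistent row names.
Have a look at the row names: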
rownames(as_tibble(dat))
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
rownames(dat)
[1] "1" "2" "3" "4"
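To see the mismatch in isolation, here is a minimal sketch with a made-up one-column object (the name x and the column a are only for illustration). It is built the same way as the dput in the question, with row.names declaring 4 rows while the column holds 12 values:

```r
# Hypothetical toy object: 12 values, but row.names claims 4 rows
x <- structure(list(a = 1:12),
               class = "data.frame",
               row.names = c(NA, -4L))

n_declared <- nrow(x)       # derived from the row.names attribute
n_actual   <- length(x$a)   # actual length of the column

c(declared = n_declared, actual = n_actual)
```

If the two numbers differ, the object is inconsistent, which is what the dplyr error message is complaining about.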
as_tibble(dat) |>
mutate(type = as.character(type)) |>
mutate(across(1:4,
~ ifelse(is.na(.) & lag(areacode) == lead(areacode) &
lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))),
lag(.),
.)))
# A tibble: 12 x 5
zipcode areacode type region do_not_fill
<dbl> <dbl> <chr> <dbl> <dbl>
1 1001 4 clay 3 1
2 1002 4 clay 3 NA
3 1003 4 clay 3 NA
4 1004 4 clay 3 1
5 1101 4 sand 3 1
6 1102 4 sand 3 NA
7 1103 NA NA NA NA
8 1104 1 clay 3 1
9 1201 4 clay 3 NA
10 1202 4 clay 3 NA
11 1203 NA NA NA NA
12 1302 4 clay 3 1
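This can be done in data.table with the same logic, using shift() in place of lag() and lead() and restricting .SDcols to the columns that should be filled (so zipcode and do_not_fill stay untouched):
library(data.table)
setDT(dat)  # dat was built with structure(), so let data.table set up its internals
cols <- c("areacode", "type", "region")
dat[, (cols) := lapply(.SD, \(v) fifelse(
  is.na(v) & shift(areacode) == shift(areacode, type = "lead") &
    shift(substr(zipcode, 1, 2)) == shift(substr(zipcode, 1, 2), type = "lead"),
  shift(v), v)), .SDcols = cols]
dat[]
zipcode areacode type region do_not_fill
<num> <num> <fctr> <num> <num>
1: 1001 4 clay 3 1
2: 1002 4 clay 3 NA
3: 1003 4 clay 3 NA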
4: 1004 4 clay 3 1
5: 1101 4 sand 3 1
6: 1102 4 sand 3 NA
7: 1103 NA <NA> NA NA
8: 1104 1 clay 3 1
9: 1201 4 clay 3 NA
10: 1202 4 clay 3 NA
11: 1203 NA <NA> NA NA
12: 1302 4 clay 3 1
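For comparison, the same rule can be sketched in base R without any package. This is only an illustration on a shortened toy version of the data; the helper names prev_val and next_val are made up here, and type is kept as character for simplicity:

```r
# Toy data, shortened from the question
toy <- data.frame(
  zipcode     = c(1002, 1003, 1004, 1102, 1103, 1104, 1202, 1203, 1302),
  areacode    = c(4, NA, 4, 4, NA, 1, 4, NA, 4),
  type        = c("clay", NA, "clay", "sand", NA, "clay", "clay", NA, "clay"),
  do_not_fill = c(NA, NA, 1, NA, NA, 1, NA, NA, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: value from the row above / the row below
prev_val <- function(x) c(NA, x[-length(x)])
next_val <- function(x) c(x[-1], NA)

zip2 <- substr(toy$zipcode, 1, 2)   # first two digits of the zipcode

# Fill a row only if both neighbours agree on areacode and on the zip prefix;
# %in% TRUE turns NA comparisons into FALSE
fill <- is.na(toy$areacode) &
  (prev_val(toy$areacode) == next_val(toy$areacode)) %in% TRUE &
  (prev_val(zip2) == next_val(zip2)) %in% TRUE

cols <- c("areacode", "type")
toy[fill, cols] <- toy[which(fill) - 1, cols]   # copy from the row above
toy
```

Computing the condition once and reusing it for every column avoids the situation where an already filled column changes the condition for the next one.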
I would like to make some new variables in R based on multiple (>100) other variables.
My dataset looks like this
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br
1 1 1 1 2 2 1 6 0 1 6 1
2 2 2 3 2 5 1 3
3 3 0 0 <NA> 4 1 0 0 <NA> 2 2
4 4 NA 1 2 2 NA 1 1 4
5 5 NA 4 2 3 5 NA 4 3 4 3
The variables diag_x_ais can take integers from 0-6, and diag_x_br can take integers between 1-6.
I would like to make 6 new variables corresponding to the 6 possible diag_x_br values, i.e. the new variables would be called br_1, br_2 ... br_6. These new variables shall then be filled with the maximum value of the corresponding diag_x_ais variables, i.e.
if diag_1_br, diag_2_br, and diag_4_br are all 3, then br_3 should take the maximum value of diag_1_ais, diag_2_ais, and diag_4_ais.
Please also see the example dataset below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA
Hereafter, I would like a final variable which calculates the sum of the up to three largest br_x variables, example displayed below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 sum3
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA 12
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA 4
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA 4
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA 9
My actual dataset has 60 diag_x_ais variables and 60 diag_x_br variables and 4000 rows.
I hope that someone can help me do this in R. Thank you!
I think you could use the following solution. I made a slight modification so that we only sum the first 3 max values:
library(dplyr)
library(purrr)
df %>%
bind_cols(as.data.frame(t(map_dfr(1:6, function(a) pmap_dfc(df, ~ {x <- c(...)[grepl("br", names(df))]
inds <- which(x == a)
if(length(inds) != 0) {
y <- c(...)[grepl("ais", names(df))]
max(y[inds])
} else {
NA
}})))) %>%
setNames(paste0("br", 1:6))) %>%
rowwise() %>%
mutate(sum = sum(sort(as.numeric(c_across(starts_with("br"))), decreasing = TRUE)[1:3], na.rm = TRUE)) %>%
select(starts_with("br"), sum)
Resulting output
# A tibble: 5 x 7
# Rowwise:
br1 br2 br3 br4 br5 br6 sum
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2 NA NA NA NA 2 4
2 3 4 5 NA 2 NA 12
3 NA 4 NA NA NA NA 4
4 2 NA NA 2 NA NA 4
5 NA NA 5 4 NA NA 9
You could use some heavy data transformation, which is most likely not very efficient on large datasets. Your dataset contains empty values, NA and 0. I didn't handle them specially (I only replaced the empty values with NA to make importing easier).
library(tidyr)
library(dplyr)
data %>%
pivot_longer(-sub_id,
names_to = c("name", "cat"),
names_pattern = ".*_(\\d+)_(.*)") %>%
pivot_wider(names_from = "cat") %>%
group_by(sub_id, br) %>%
summarise(value = max(ais), .groups = "drop") %>%
filter(br %in% 1:6) %>%
group_by(sub_id) %>%
mutate(sum = sum(tail(sort(value), 3))) %>%
pivot_wider(names_from = br,
names_glue = "br_{br}") %>%
select(sub_id, paste0("br_", 1:6), sum)
This returns
# A tibble: 5 x 8
sub_id br_1 br_2 br_3 br_4 br_5 br_6 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA NA NA 2 4
2 2 3 4 5 NA 2 NA 12
3 3 NA 4 NA NA NA NA 4
4 4 2 NA NA 2 NA NA 4
5 5 NA NA 5 4 NA NA 9
Piping an additional right_join(data, by = "sub_id") gives you your example output (apart from the order of the columns).
I took an idea from this answer.
Data
data <- structure(list(sub_id = c(1, 2, 3, 4, 5), diag_1_ais = c(1, 2,
0, NA, NA), diag_2_ais = c(1, 1, 0, NA, 4), diag_3_ais = c(2,
4, NA, 1, 2), diag_4_ais = c(2, 3, 4, 2, 3), diag_5_ais = c(1,
5, 1, 2, 5), diag_1_br = c(6, 5, 0, NA, NA), diag_2_br = c(0,
2, 0, NA, 4), diag_3_br = c(1, 2, NA, 1, 3), diag_4_br = c(6,
1, 2, 1, 4), diag_5_br = c(1, 3, 2, 4, 3)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
For the first part:
data <- data.frame(sub_id = c(1,2,3,4,5),
diag_1_ais = c(1,2,0,NA,NA),
diag_2_ais = c(1,1,0,NA,4),
diag_3_ais = c(2,4,NA,1,2),
diag_4_ais = c(2,3,4,2,3),
diag_5_ais = c(1,5,1,2,5),
diag_1_br = c(6,5,0,NA,NA),
diag_2_br = c(0,2,0,NA,4),
diag_3_br = c(1,2,NA,1,3),
diag_4_br = c(6,1,2,1,4),
diag_5_br = c(1,3,2,4,3))
calc_br <- function(data, value, firstBr, lastBr) {
br <- c()
for (i in 1:nrow(data)){
if (length(which(data[i,c(firstBr:lastBr)] %in% value))!=0){
br <- c(br, c(max(data[i,which(data[i,c(firstBr:lastBr)] %in% value)+1])))
}
else {
br <- c(br, c(NA))
}
}
result <- br
}
firstBr = 7
lastBr = 11
data$br_1 <- calc_br(data,1,firstBr,lastBr)
data$br_2 <- calc_br(data,2,firstBr,lastBr)
data$br_3 <- calc_br(data,3,firstBr,lastBr)
data$br_4 <- calc_br(data,4,firstBr,lastBr)
data$br_5 <- calc_br(data,5,firstBr,lastBr)
data$br_6 <- calc_br(data,6,firstBr,lastBr)
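This should yield the same results as in your example. For your real data you should only have to change firstBr and lastBr (to 62 and 121, I would guess, assuming the columns are ordered as sub_id, then the 60 ais columns, then the 60 br columns).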
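Rather than guessing the positions, they can be looked up from the column names. A small sketch, assuming the real data follows the naming pattern diag_x_ais / diag_x_br and the column order sub_id, ais columns, br columns (col_names is a stand-in for names(data)):

```r
# Assumed layout: sub_id, then 60 "ais" columns, then 60 "br" columns
col_names <- c("sub_id",
               paste0("diag_", 1:60, "_ais"),
               paste0("diag_", 1:60, "_br"))

br_idx  <- grep("_br$", col_names)   # positions of the diag_x_br columns
firstBr <- min(br_idx)
lastBr  <- max(br_idx)

c(firstBr = firstBr, lastBr = lastBr)
```

On the real data frame the lookup would be grep("_br$", names(data)).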
For the second part this should do the trick:
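br_cols <- (lastBr + 1):(lastBr + 6)  # columns br_1 ... br_6
br_sum <- c()
for (i in 1:nrow(data)){
  # unlist() turns the one-row data frame into a plain vector; sort() drops NAs
  top3 <- head(sort(unlist(data[i, br_cols]), decreasing = TRUE), 3)
  br_sum <- c(br_sum, sum(top3))
}
data$br_sum <- br_sum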
For completeness, here are my results:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br
1 1 1 1 2 2 1 6
2 2 2 1 4 3 5 5
3 3 0 0 NA 4 1 0
4 4 NA NA 1 2 2 NA
5 5 NA 4 2 3 5 NA
diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 br_sum
1 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 3 3 4 5 NA 2 NA 12
3 0 NA 2 2 NA 4 NA NA NA NA 4
4 NA 1 1 4 2 NA NA 2 NA NA 4
5 4 3 4 3 NA NA 5 4 NA NA 9
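If the row-wise loops turn out to be slow on the full data, the same idea can be sketched with matrices. This is an illustration on a cut-down toy data set with three diag columns, not a drop-in replacement; unlike the loop above it ignores NA ais values when taking the maximum, which is an assumption on my part:

```r
# Toy data: first three rows and three diag columns of the example
data <- data.frame(
  sub_id     = 1:3,
  diag_1_ais = c(1, 2, 0), diag_2_ais = c(1, 1, 0), diag_3_ais = c(2, 4, NA),
  diag_1_br  = c(6, 5, 0), diag_2_br  = c(0, 2, 0), diag_3_br  = c(1, 2, NA)
)

ais <- as.matrix(data[grepl("_ais$", names(data))])
br  <- as.matrix(data[grepl("_br$",  names(data))])

# One column per br value: row-wise maximum of ais where br equals k
br_max <- sapply(1:6, function(k) {
  masked <- ifelse(br == k, ais, NA)   # keep ais only where br == k
  apply(masked, 1, function(r) {
    if (all(is.na(r))) NA_real_ else max(r, na.rm = TRUE)
  })
})
colnames(br_max) <- paste0("br_", 1:6)

# Sum of the (up to) three largest br_x values per row; sort() drops NAs
sum3 <- apply(br_max, 1, function(r) sum(head(sort(r, decreasing = TRUE), 3)))

cbind(data["sub_id"], br_max, sum3 = sum3)
```

Selecting the columns by name pattern means the code does not depend on firstBr and lastBr, as long as the ais and br columns appear in the same order.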
I am looking to remove strings across my data.table based on a partial match:
$$ER
Since these strings differ across the entire table and my table is reasonably large, I need an efficient solution. I have tried data.table's %like%, but it is far too slow. gsub should do fine, but I have trouble referencing the "$$" in "$$ER".
structure(list(Country = c("NL", "NL", "NL", "NL", "DE", "DE",
"DE", "GB", "GB"), Value1 = c("$$ER: Data not found", NA, NA,
NA, "$$ERROR: NOT AVAILABLE", NA, NA, "3", "4"), Value2 = c("$$ER: Data not found",
NA, NA, NA, "$$ERROR: NOT AVAILABLE", NA, NA, "3", "4"), Value3 = c(10,
15, 12, 9, 8, 20, 23, 3, 4)), class = "data.frame", row.names = c(NA,
-9L))
Country Value1 Value2 Value3
1 NL $$ER: Data not found $$ER: Data not found 10
2 NL <NA> <NA> 15
3 NL <NA> <NA> 12
4 NL <NA> <NA> 9
5 DE $$ERROR: NOT AVAILABLE $$ERROR: NOT AVAILABLE 8
6 DE <NA> <NA> 20
7 DE <NA> <NA> 23
8 GB 3 3 3
9 GB 4 4 4
Desired output:
Country Value1 Value2 Value3
1 NL NA NA 10
2 NL NA NA 15
3 NL NA NA 12
4 NL NA NA 9
5 DE NA NA 8
6 DE NA NA 20
7 DE NA NA 23
8 GB 3 3 3
9 GB 4 4 4
An alternative would be to use grepl:
df[apply(df, 2, function(i) grepl('$$ER', i, fixed = TRUE))] <- NA
which would yield the following:
# Country Value1 Value2 Value3
# 1 NL <NA> <NA> 10
# 2 NL <NA> <NA> 15
# 3 NL <NA> <NA> 12
# 4 NL <NA> <NA> 9
# 5 DE <NA> <NA> 8
# 6 DE <NA> <NA> 20
# 7 DE <NA> <NA> 23
# 8 GB 3 3 3
# 9 GB 4 4 4
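The trouble with referencing "$$" comes from "$" being a regular-expression anchor. A short sketch of the two usual ways around it, on a made-up vector x: either switch the pattern to a literal string with fixed = TRUE, or escape each dollar sign:

```r
x <- c("$$ER: Data not found", "3", NA, "$$ERROR: NOT AVAILABLE")

# Without fixed = TRUE, "$" is a regex anchor and not a literal dollar sign.

# Option 1: treat the pattern as a plain string
hit_fixed <- grepl("$$ER", x, fixed = TRUE)

# Option 2: keep it a regex, escape the dollars, and anchor at the start
hit_regex <- grepl("^\\$\\$ER", x)

# Replace the matches with NA
x_clean <- replace(x, hit_fixed, NA)
x_clean
```

Note that grepl() returns FALSE for NA input, so existing NA values are simply left as they are.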
You can use startsWith in sapply testing for $$ER.
D[2:3][sapply(D[2:3], startsWith, "$$ER")] <- NA
D
# Country Value1 Value2 Value3
#1 NL <NA> <NA> 10
#2 NL <NA> <NA> 15
#3 NL <NA> <NA> 12
#4 NL <NA> <NA> 9
#5 DE <NA> <NA> 8
#6 DE <NA> <NA> 20
#7 DE <NA> <NA> 23
#8 GB 3 3 3
#9 GB 4 4 4
But maybe you want to use as.numeric:
D[2:3] <- sapply(D[2:3], as.numeric)
D
# Country Value1 Value2 Value3
#1 NL NA NA 10
#2 NL NA NA 15
#3 NL NA NA 12
#4 NL NA NA 9
#5 DE NA NA 8
#6 DE NA NA 20
#7 DE NA NA 23
#8 GB 3 3 3
#9 GB 4 4 4
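One caveat: as.numeric() on its own turns every non-numeric string into NA (with a warning), not only the ones starting with $$ER. If that matters, a more cautious sketch converts only the values that look like numbers. The pattern below is a crude assumption (plain unsigned numbers) and the vector x is made up:

```r
x <- c("$$ER: Data not found", NA, "3", "4")

# Crude check, assuming plain unsigned numbers such as "3" or "4.5"
is_num <- grepl("^[0-9.]+$", x)

num <- rep(NA_real_, length(x))
num[is_num] <- as.numeric(x[is_num])   # no coercion warning, only clean input
num
```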
Using data.table -
library(data.table)
setDT(df)[, (2:3) := lapply(.SD, function(x)
as.numeric(replace(x, grepl('$$ER', x, fixed = TRUE), NA))), .SDcols = 2:3]
df
# Country Value1 Value2 Value3
#1: NL NA NA 10
#2: NL NA NA 15
#3: NL NA NA 12
#4: NL NA NA 9
#5: DE NA NA 8
#6: DE NA NA 20
#7: DE NA NA 23
#8: GB 3 3 3
#9: GB 4 4 4
I have a dataframe of services. I need to add a column "order" that groups the rows by the following rule:
Group the services into orders: if another service value "A" occurs within the 5 rows after a service "A", both belong to the same order, and all rows in between get that order ID, including the ones without a service value. If no service value occurs within the next 5 rows, the order ends and the next service "A" starts a new order.
dput(data)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
14, 15, 16), time = structure(1:15, .Label = c("13:20:01", "13:20:02",
"13:20:03", "13:20:04", "13:20:05", "13:20:06", "13:20:07", "13:20:08",
"13:20:09", "13:20:10", "13:20:11", "13:20:12", "13:20:13", "13:20:14",
"13:20:15"), class = "factor"), apples = c(2, 2, 2, 3, 3, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2), service = structure(c(NA, 1L, 1L,
NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L, NA, 1L), .Label = "A", class = "factor")), class = "data.frame", row.names = c(NA,
-15L))
overview
id time apples service
1 13:20:01 2
2 13:20:02 2 A
3 13:20:03 2 A
4 13:20:04 3
5 13:20:05 3 A
6 13:20:06 2
7 13:20:07 2 A
8 13:20:08 2
9 13:20:09 2
10 13:20:10 2
11 13:20:11 2
12 13:20:12 2
14 13:20:13 2 A
15 13:20:14 2
16 13:20:15 2 A
This is the format I'm looking for. ID 2 to ID 7 is one order and ID 14 to ID 16 is another.
id time apples service Order
1 13:20:01 2
2 13:20:02 2 A 1
3 13:20:03 2 A 1
4 13:20:04 3 1
5 13:20:05 3 A 1
6 13:20:06 2 1
7 13:20:07 2 A 1
8 13:20:08 2
9 13:20:09 2
10 13:20:10 2
11 13:20:11 2
12 13:20:12 2
14 13:20:13 2 A 2
15 13:20:14 2 2
16 13:20:15 2 A 2
I tried it with a for loop. I suspect there is a way to use mutate and add the "range" condition.
Thanks for your help!
This is the output that tspano's code produces for me:
# A tibble: 15 x 11
id time apples service start end g0 g1 g2 g3 order
<dbl> <fct> <dbl> <fct> <dbl> <dbl> <chr> <int> <chr> <int> <int>
1 1 13:20:01 2 NA 0 3 NA 0 NA 0 NA
2 2 13:20:02 2 A 1 3 start 1 NA 0 NA
3 3 13:20:03 2 A 2 3 NA 1 NA 0 NA
4 4 13:20:04 3 NA 2 2 NA 1 NA 0 NA
5 5 13:20:05 3 A 3 2 NA 1 NA 0 NA
6 6 13:20:06 2 NA 3 1 NA 1 NA 0 NA
7 7 13:20:07 2 A 3 1 NA 1 NA 0 NA
8 8 13:20:08 2 NA 2 0 end 2 NA 0 NA
9 9 13:20:09 2 NA 2 1 NA 2 NA 0 NA
10 10 13:20:10 2 NA 1 1 NA 2 NA 0 NA
11 11 13:20:11 2 NA 1 2 NA 2 NA 0 NA
12 12 13:20:12 2 NA 0 2 NA 2 NA 0 NA
13 14 13:20:13 2 A 1 2 start 3 NA 0 NA
14 15 13:20:14 2 NA 1 1 NA 3 NA 0 NA
15 16 13:20:15 2 A 2 1 NA 3 NA 0 NA
Here is a solution using RcppRoll, which should be faster than an R for loop:
library(dplyr)
data %>%
mutate(start = RcppRoll::roll_sum(c(rep(F,4),(service=="A") %in% T), n = 5, align = "right"),
end = RcppRoll::roll_sum(c((service=="A") %in% T, rep(F,4)), n = 5, align = "left"),
g0 = case_when(start>0 & (lag(start)==0) %in% c(T,NA) ~ "start",
end ==0 ~ "end",
T ~ NA_character_)
) %>%
group_by(g1 = cumsum(!is.na(g0))) %>%
mutate(g2 = if_else(first(g0)=="end", NA_character_, "order")) %>%
ungroup() %>%
group_by(g3 = cumsum(!is.na(g2) & is.na(lag(g2))) ) %>%
mutate(order = if_else(is.na(g2), NA_integer_, g3)) %>%
ungroup() %>%
select(id, time, apples, service, order)
If you remove the last select you can see I have several intermediate results that should make the logic clear.
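The grouping rule itself can also be sketched in base R, which may help when checking the result. This is my reading of the rule, not the code above: a new order starts whenever the gap between two consecutive "A" rows is larger than 5, and a single isolated "A" would form an order of its own (an assumption). The variable names are made up:

```r
service <- c(NA, "A", "A", NA, "A", NA, "A", NA, NA, NA, NA, NA, "A", NA, "A")

pos <- which(service %in% "A")            # rows that carry a service
run <- cumsum(c(TRUE, diff(pos) > 5))     # new run when the gap exceeds 5 rows

order_id <- rep(NA_integer_, length(service))
for (r in unique(run)) {
  rows <- range(pos[run == r])            # first and last "A" of this run
  order_id[rows[1]:rows[2]] <- r          # fill everything in between
}
order_id
```

The result can be attached with data$order <- order_id, assuming service is taken from data$service.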
I have a dataframe which looks like this:
`Row Labels` Female Male
<chr> <chr> <chr>
1 London <NA> <NA>
2 42 <NA> 1
3 Paris <NA> <NA>
4 36 1 <NA>
5 Belgium <NA> <NA>
6 18 1
7 21 <NA> 1
8 Madrid <NA> <NA>
9 20 1 <NA>
10 Berlin <NA> <NA>
11 37 <NA> 1
12 23 1
13 25 1
14 44 1
The code I used to produce this dataframe looks like this:
structure(list(`Row Labels` = c("London", "42", "Paris","36", "Belgium","18" ,"21", "Madrid", "20", "Berlin", "37","23","25","44"),
Female = c(NA, NA, NA, "1", NA, NA,NA, NA, "1", NA, NA,"1","1","1"), Male = c(NA,"1", NA, NA, NA, "1", NA, NA, NA, "1",NA,NA,NA,NA)),
.Names = c("Row Labels","Female", "Male"), row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
I would like to know how I can change multiple rows in this dataframe to become columns.
My ideal output looks like this:
'Row Labels' Female Male 42 36 21 20 37 18 23 25 44
London 1 1
Paris 1 1
Belgium 1 1 1 1
Madrid 1 1
Berlin 3 1 1 1 1 1
Seems very mechanical. Calling your data d:
d1 = d[seq(1, nrow(d), by = 2), ]
d2 = d[seq(2, nrow(d), by = 2), ]
d1[, c("Male", "Female")] = d2[, c("Male", "Female")]
d3 = matrix(nrow = nrow(d2), ncol = nrow(d2))
diag(d3) = 1
colnames(d3) = d2$`Row Labels`
cbind(d2, d3)
# Row Labels Female Male 42 36 21 20 37
# 1 42 <NA> 1 1 NA NA NA NA
# 2 36 1 <NA> NA 1 NA NA NA
# 3 21 <NA> 1 NA NA 1 NA NA
# 4 20 1 <NA> NA NA NA 1 NA
# 5 37 <NA> 1 NA NA NA NA 1
Using tidyverse.
library(dplyr)
library(tidyr)
#cumsum based on country names
df %>% group_by(gr=cumsum(grepl('\\D+',`Row Labels`))) %>%
#Sum Female and Male
mutate_at(vars('Female','Male'), list(~sum(as.numeric(.), na.rm = T))) %>%
#Create RL from country name and number where we are at numbers
mutate(RL=ifelse(row_number()>1, paste0(first(`Row Labels`),',',`Row Labels`), NA)) %>%
filter(!is.na(RL)) %>%
select(RL, gr, Male, Female) %>%
separate(RL, into = c('RL','Age')) %>% mutate(flag=1) %>% spread(Age, flag) %>%
ungroup() %>% select(-gr)
# A tibble: 5 x 12
RL Male Female `18` `20` `21` `23` `25` `36` `37` `42` `44`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Belgium 1 0 1 NA 1 NA NA NA NA NA NA
2 Berlin 1 3 NA NA NA 1 1 NA 1 NA 1
3 London 1 0 NA NA NA NA NA NA NA 1 NA
4 Madrid 0 1 NA 1 NA NA NA NA NA NA NA
5 Paris 0 1 NA NA NA NA NA 1 NA NA NA
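The key step in the pipeline above is the cumsum(grepl(...)) trick that assigns every number row to the city row above it. A stripped-down base R sketch of just that step, on a made-up label vector:

```r
labels  <- c("London", "42", "Paris", "36", "Belgium", "18", "21")

is_city <- grepl("\\D", labels)     # TRUE where the label contains a non-digit
grp     <- cumsum(is_city)          # running group number, one per city
city    <- labels[is_city][grp]     # city name repeated for each row of its group

long <- data.frame(city = city[!is_city],
                   age  = labels[!is_city],
                   stringsAsFactors = FALSE)
long
```

From this long form, table(long$city, long$age) gives a city by age cross-tabulation.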
I have some data that looks like this:
samp
# A tibble: 5 x 2
ID Source
<dbl> <chr>
1 34221 75
2 33861 75
3 59741 126,123
4 56561 111,105
5 55836 36,34,34,36,22
For each distinct value, I want to create a new column. If the value occurs in a row, that column should contain an "x"; otherwise it should stay empty.
Example (pseudo code) of the expected result:
ID 75 126 123 111 105 36 34 22
1 34221 x
2 33861 x
3 59741 x x
4 56561 x x
5 55836 x x x
I tried the separate function from the tidyr package, like this for a start:
into = unique(unlist(strsplit(samp$Source, ",")))
samp %>% separate(col = "Source", into = into, sep = ",")
However, this doesn't work: if a row holds more than one value, the values are not assigned to their respective columns (e.g. for ID 59741 the value 126 ends up in column 75 instead of column 126).
# A tibble: 5 x 9
ID `75` `126` `123` `111` `105` `36` `34` `22`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 34221 75 NA NA NA NA NA NA NA
2 33861 75 NA NA NA NA NA NA NA
3 59741 126 123 NA NA NA NA NA NA
4 56561 111 105 NA NA NA NA NA NA
5 55836 36 34 34 36 22 NA NA NA
Here is a dput:
structure(list(ID = c(34221, 33861, 59741, 56561, 55836), Source = c("75",
"75", "126,123", "111,105", "36,34,34,36,22")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Could also do:
library(tidyverse)
df %>%
mutate(Source = strsplit(Source, ","),
dummy = "x") %>%
unnest(Source) %>% distinct() %>%
spread(Source, dummy)
Output:
ID `105` `111` `123` `126` `22` `34` `36` `75`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 33861 NA NA NA NA NA NA NA x
2 34221 NA NA NA NA NA NA NA x
3 55836 NA NA NA NA x x x NA
4 56561 x x NA NA NA NA NA NA
5 59741 NA NA x x NA NA NA NA
The package splitstackshape is very handy for such operations, i.e.
library(splitstackshape)
cSplit_e(df, "Source", mode = "binary", type = "character", fill = 0, drop = TRUE)
which gives,
ID Source_105 Source_111 Source_123 Source_126 Source_22 Source_34 Source_36 Source_75
1 34221 0 0 0 0 0 0 0 1
2 33861 0 0 0 0 0 0 0 1
3 59741 0 0 1 1 0 0 0 0
4 56561 1 1 0 0 0 0 0 0
5 55836 0 0 0 0 1 1 1 0
Another option is using tidyr::separate_rows
library(dplyr)
library(tidyr)
df %>% separate_rows(Source,sep=',') %>% distinct() %>%
mutate(dummy='X') %>% spread(Source,dummy)
ID 105 111 123 126 22 34 36 75
1 33861 <NA> <NA> <NA> <NA> <NA> <NA> <NA> X
2 34221 <NA> <NA> <NA> <NA> <NA> <NA> <NA> X
3 55836 <NA> <NA> <NA> <NA> X X X <NA>
4 56561 X X <NA> <NA> <NA> <NA> <NA> <NA>
5 59741 <NA> <NA> X X <NA> <NA> <NA> <NA>
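If you prefer to avoid extra packages, the same table can be sketched in base R with strsplit() and %in%. This is only an illustration on a shortened copy of the data; it keeps the original row order and uses an empty string where the value is absent:

```r
samp <- data.frame(ID     = c(34221, 59741, 55836),
                   Source = c("75", "126,123", "36,34,34,36,22"),
                   stringsAsFactors = FALSE)

parts  <- strsplit(samp$Source, ",", fixed = TRUE)   # one character vector per row
values <- unique(unlist(parts))                      # all distinct values

# One row per ID, one column per distinct value
flags <- t(sapply(parts, function(p) ifelse(values %in% p, "x", "")))
colnames(flags) <- values

result <- cbind(samp["ID"], flags)
result
```

Duplicated values within a row (such as 34 and 36 above) need no special handling here, because %in% only asks whether a value is present.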