I have some data that looks like this:
samp
# A tibble: 5 x 2
ID Source
<dbl> <chr>
1 34221 75
2 33861 75
3 59741 126,123
4 56561 111,105
5 55836 36,34,34,36,22
For each distinct value in Source, I want to create a new column. If the value occurs in a row, that cell should get an "x"; otherwise the cell should stay empty.
Example (pseudo code) of the expected result:
ID 75 126 123 111 105 36 34 22
1 34221 x
2 33861 x
3 59741 x x
4 56561 x x
5 55836 x x x
I tried the separate function from the tidyr package, starting like this:
into = unique(unlist(strsplit(samp$Source, ",")))
samp %>% separate(col = "Source", into = into, sep = ",")
However, this doesn't work: when a row holds more than one value, the values are not assigned to their respective columns (e.g. for ID 59741 the value 126 ends up in column 75 and not in column 126).
# A tibble: 5 x 9
ID `75` `126` `123` `111` `105` `36` `34` `22`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 34221 75 NA NA NA NA NA NA NA
2 33861 75 NA NA NA NA NA NA NA
3 59741 126 123 NA NA NA NA NA NA
4 56561 111 105 NA NA NA NA NA NA
5 55836 36 34 34 36 22 NA NA NA
Here is a dput:
structure(list(ID = c(34221, 33861, 59741, 56561, 55836), Source = c("75",
"75", "126,123", "111,105", "36,34,34,36,22")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Could also do:
library(tidyverse)
df %>%
mutate(Source = strsplit(Source, ","),
dummy = "x") %>%
unnest(cols = Source) %>% distinct() %>%
spread(Source, dummy)
Output:
ID `105` `111` `123` `126` `22` `34` `36` `75`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 33861 NA NA NA NA NA NA NA x
2 34221 NA NA NA NA NA NA NA x
3 55836 NA NA NA NA x x x NA
4 56561 x x NA NA NA NA NA NA
5 59741 NA NA x x NA NA NA NA
The package splitstackshape is very handy for such operations, i.e.
library(splitstackshape)
cSplit_e(df, "Source", mode = "binary", type = "character", fill = 0, drop = TRUE)
which gives,
ID Source_105 Source_111 Source_123 Source_126 Source_22 Source_34 Source_36 Source_75
1 34221 0 0 0 0 0 0 0 1
2 33861 0 0 0 0 0 0 0 1
3 59741 0 0 1 1 0 0 0 0
4 56561 1 1 0 0 0 0 0 0
5 55836 0 0 0 0 1 1 1 0
Another option is using tidyr::separate_rows
library(dplyr)
library(tidyr)
df %>% separate_rows(Source,sep=',') %>% distinct() %>%
mutate(dummy='X') %>% spread(Source,dummy)
ID 105 111 123 126 22 34 36 75
1 33861 <NA> <NA> <NA> <NA> <NA> <NA> <NA> X
2 34221 <NA> <NA> <NA> <NA> <NA> <NA> <NA> X
3 55836 <NA> <NA> <NA> <NA> X X X <NA>
4 56561 X X <NA> <NA> <NA> <NA> <NA> <NA>
5 59741 <NA> <NA> X X <NA> <NA> <NA> <NA>
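For completeness, here is a base R sketch that needs no extra packages. It assumes the data are in samp as in the question; the names parts, vals and flags are made up for illustration, and empty cells are filled with "" rather than NA.

```r
# Example data from the question
samp <- data.frame(
  ID = c(34221, 33861, 59741, 56561, 55836),
  Source = c("75", "75", "126,123", "111,105", "36,34,34,36,22"),
  stringsAsFactors = FALSE
)

# Split every Source string into its individual values
parts <- strsplit(samp$Source, ",", fixed = TRUE)

# Distinct values, in order of first appearance
vals <- unique(unlist(parts))

# One column per distinct value: "x" if the row contains it, "" otherwise
flags <- sapply(vals, function(v) {
  has_v <- vapply(parts, function(p) v %in% p, logical(1))
  ifelse(has_v, "x", "")
})

# check.names = FALSE keeps the numeric-looking column names as they are
res <- data.frame(ID = samp$ID, flags, check.names = FALSE,
                  stringsAsFactors = FALSE)
res
```

Because membership is tested with %in%, duplicates within a row (such as 36,34,34,36,22) need no separate distinct() step.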
I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which unit is a neighbour. I would like to get the dum variable of the neighbour into the main dataframe. (Instead of spatial units it could be any kind of relations within a dataframe - business partners, relatives, related genes etc.)
Some simplified data look like this:
library(dplyr)
set.seed(999)
df_base <- data.frame(id = seq(1:100),
dum= sample(c(rep(0,50), rep(1,50)),100),
nbid_1=sample(1:100,100),
nbid_2=sample(1:100,100),
nbid_3=sample(1:100,100)) %>%
mutate(nbid_1 = replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
nbid_2 = replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
nbid_3 = replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))
(In these simplified data, unlike in the real data, neighbours 1, 2 and 3 can be the same, but that does not matter for the question.)
My approach was to duplicate and then join the data, which would look like this:
df1 <- df_base
df2 <- df_base %>%
select(-c(nbid_1,nbid_2,nbid_3)) %>%
rename(nbdum=dum)
df <- left_join(df1,df2,by=c("nbid_1"="id")) %>%
rename(nbdum1=nbdum) %>%
left_join(.,df2,by=c("nbid_2"="id")) %>%
rename(nbdum2=nbdum) %>%
left_join(.,df2,by=c("nbid_3"="id")) %>%
rename(nbdum3=nbdum)
df is the result that I am looking for - from here I can create an overall neighbour dummy or a count.
However, this approach is neither elegant nor feasible with the real data, which have many more neighbours.
How can I solve this in a less clumsy way?
Thanks in advance for your ideas!!
A key clue is that when you see var_1, var_2, ..., var_n, it suggests that the data can be transformed to be longer. See pivot_longer() or data.table::melt() where molten data is discussed frequently.
For your example, we can pivot longer and then join the df2 table back. I am unsure whether the wide format is needed, but after the join we can pivot back to wide with pivot_wider().
library(dplyr)
library(tidyr)
df1 %>%
select(!id) %>%
pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_")%>%
mutate(original_id = rep(1:100, each = 3))%>%
left_join(df2, by = c("value" = "id"))%>%
pivot_wider(id_cols = original_id, values_from = c(value, nbdum))
#> # A tibble: 100 × 7
#> original_id value_1 value_2 value_3 nbdum_1 nbdum_2 nbdum_3
#> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 25 90 23 0 0 1
#> 2 2 12 NA NA 1 NA NA
#> 3 3 11 40 47 0 0 0
#> 4 4 94 87 NA 0 1 NA
#> 5 5 46 77 NA 1 0 NA
#> 6 6 98 82 NA 1 0 NA
#> 7 7 43 NA NA 1 NA NA
#> 8 8 74 NA 7 0 NA 1
#> 9 9 57 NA NA 1 NA NA
#> 10 10 49 72 NA 0 0 NA
#> # … with 90 more rows
## compare to original
as_tibble(df)
#> # A tibble: 100 × 8
#> id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
#> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 0 25 90 23 0 0 1
#> 2 2 1 12 NA NA 1 NA NA
#> 3 3 1 11 40 47 0 0 0
#> 4 4 1 94 87 NA 0 1 NA
#> 5 5 0 46 77 NA 1 0 NA
#> 6 6 1 98 82 NA 1 0 NA
#> 7 7 1 43 NA NA 1 NA NA
#> 8 8 0 74 NA 7 0 NA 1
#> 9 9 0 57 NA NA 1 NA NA
#> 10 10 0 49 72 NA 0 0 NA
#> # … with 90 more rows
As you just seem to be indexing dum with your neighbor variables you should be able to do:
library(dplyr)
df_base %>%
mutate(across(starts_with("nbid"), ~ dum[.x], .names = "nbdum{1:3}"))
id dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
1 1 0 25 90 23 0 0 1
2 2 1 12 NA NA 1 NA NA
3 3 1 11 40 47 0 0 0
4 4 1 94 87 NA 0 1 NA
5 5 0 46 77 NA 1 0 NA
6 6 1 98 82 NA 1 0 NA
7 7 1 43 NA NA 1 NA NA
8 8 0 74 NA 7 0 NA 1
9 9 0 57 NA NA 1 NA NA
10 10 0 49 72 NA 0 0 NA
...
Or same idea in base R:
df_base[paste0("nbdum", 1:3)] <- sapply(df_base[startsWith(names(df_base), "nbid")], \(x) df_base$dum[x])
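One caveat: indexing dum directly with the nbid columns relies on id being equal to the row number, which holds in the example data. If the real ids are arbitrary, a sketch along these lines using match() may be safer. The toy data and the names nb_cols and new_col are only illustrative.

```r
# Toy data in which id is NOT the row number
df_base <- data.frame(
  id     = c(10, 20, 30, 40),
  dum    = c(0, 1, 1, 0),
  nbid_1 = c(20, 30, NA, 10),
  nbid_2 = c(NA, 40, 10, 20)
)

# All neighbour id columns, however many there are
nb_cols <- grep("^nbid_", names(df_base), value = TRUE)

for (col in nb_cols) {
  # nbid_1 -> nbdum1, nbid_2 -> nbdum2, ...
  new_col <- sub("^nbid_", "nbdum", col)
  # match() finds the row of each neighbour id; NA ids stay NA
  df_base[[new_col]] <- df_base$dum[match(df_base[[col]], df_base$id)]
}

# Example follow-up: number of neighbours with dum == 1
df_base$nb_count <- rowSums(df_base[sub("^nbid_", "nbdum", nb_cols)], na.rm = TRUE)
df_base
```

The loop scales with the number of nbid_ columns, so nothing has to change when the real data have more neighbours.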
I have a dataframe like this:
df <- tibble::tibble(id = c(rep('A', 10), rep('B', 10)),
value = c(1:3, rep(NA, 2), 1:2, rep(NA, 3), 1, rep(NA, 4), 1:3, rep(NA, 2)))
I need to count the number of consecutive NAs in the value column. The count needs to be grouped by id, and it needs to restart at 1 every time a new NA or a new series of NAs is encountered. The expected output should look like this:
df$expected_output <- c(rep(NA, 3), 1:2, rep(NA, 2), 1:3, NA, 1:4, rep(NA, 3), 1:2)
If anyone can give me a dplyr solution that would also be great :)
I've tried a few things, but nothing gives any sort of sensible result. Thanks in advance!
A solution using dplyr and data.table.
library(dplyr)
library(data.table)
df2 <- df %>%
group_by(id) %>%
mutate(info = rleid(value)) %>%
group_by(id, info) %>%
mutate(expected_output = row_number()) %>%
ungroup() %>%
mutate(expected_output = ifelse(!is.na(value), NA, expected_output)) %>%
select(-info)
df2
# # A tibble: 20 x 3
# id value expected_output
# <chr> <dbl> <int>
# 1 A 1 NA
# 2 A 2 NA
# 3 A 3 NA
# 4 A NA 1
# 5 A NA 2
# 6 A 1 NA
# 7 A 2 NA
# 8 A NA 1
# 9 A NA 2
# 10 A NA 3
# 11 B 1 NA
# 12 B NA 1
# 13 B NA 2
# 14 B NA 3
# 15 B NA 4
# 16 B 1 NA
# 17 B 2 NA
# 18 B 3 NA
# 19 B NA 1
# 20 B NA 2
We can use rle to get the lengths of the runs that are or are not NA, then use purrr::map2 to apply seq to the NA runs (which gives the growing count) and fill the other runs with NA using rep.
library(tidyverse)
count_na <- function(x) {
r <- rle(is.na(x))
consec <- map2(r$lengths, r$values, ~ if (.y) seq(.x) else rep(NA, .x))
unlist(consec)
}
df %>%
group_by(id) %>%
mutate(expected_output = count_na(value)) %>%
ungroup()
#> # A tibble: 20 × 3
#> id value expected_output
#> <chr> <dbl> <int>
#> 1 A 1 NA
#> 2 A 2 NA
#> 3 A 3 NA
#> 4 A NA 1
#> 5 A NA 2
#> 6 A 1 NA
#> 7 A 2 NA
#> 8 A NA 1
#> 9 A NA 2
#> 10 A NA 3
#> 11 B 1 NA
#> 12 B NA 1
#> 13 B NA 2
#> 14 B NA 3
#> 15 B NA 4
#> 16 B 1 NA
#> 17 B 2 NA
#> 18 B 3 NA
#> 19 B NA 1
#> 20 B NA 2
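Here is a solution using rle. Note that it ignores id, so it only gives the right result when no run of NAs continues from one id into the next (as is the case in the example data):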
x <- rle(is.na(df$value))
df$new[is.na(df$value)] <- sequence(x$lengths[x$values])
# A tibble: 20 x 3
id value new
<chr> <dbl> <int>
1 A 1 NA
2 A 2 NA
3 A 3 NA
4 A NA 1
5 A NA 2
6 A 1 NA
7 A 2 NA
8 A NA 1
9 A NA 2
10 A NA 3
11 B 1 NA
12 B NA 1
13 B NA 2
14 B NA 3
15 B NA 4
16 B 1 NA
17 B 2 NA
18 B 3 NA
19 B NA 1
20 B NA 2
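If a run of NAs could span the boundary between two ids, the same rle idea can be applied per group with ave(). This is only a sketch in base R; the helper name count_na_run and the column name na_run are made up for illustration.

```r
# Example data from the question (as a plain data.frame)
df <- data.frame(
  id = rep(c("A", "B"), each = 10),
  value = c(1:3, rep(NA, 2), 1:2, rep(NA, 3), 1, rep(NA, 4), 1:3, rep(NA, 2))
)

# Running position within each run of NA / non-NA values,
# blanked out wherever the value is not NA
count_na_run <- function(x) {
  na <- is.na(x)
  out <- sequence(rle(na)$lengths)
  out[!na] <- NA_integer_
  out
}

# ave() applies the helper separately within each id
df$na_run <- ave(df$value, df$id, FUN = count_na_run)
df
```

Since ave() writes the result back into a copy of value, na_run comes out as double rather than integer; wrap it in as.integer() if the type matters.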
Yet another solution:
library(tidyverse)
df %>%
mutate(aux =data.table::rleid(value)) %>%
group_by(id, aux) %>%
mutate(eout = ifelse(is.na(value), row_number(), NA_real_)) %>%
ungroup() %>% select(-aux)
#> # A tibble: 20 × 4
#> id value expected_output eout
#> <chr> <dbl> <int> <dbl>
#> 1 A 1 NA NA
#> 2 A 2 NA NA
#> 3 A 3 NA NA
#> 4 A NA 1 1
#> 5 A NA 2 2
#> 6 A 1 NA NA
#> 7 A 2 NA NA
#> 8 A NA 1 1
#> 9 A NA 2 2
#> 10 A NA 3 3
#> 11 B 1 NA NA
#> 12 B NA 1 1
#> 13 B NA 2 2
#> 14 B NA 3 3
#> 15 B NA 4 4
#> 16 B 1 NA NA
#> 17 B 2 NA NA
#> 18 B 3 NA NA
#> 19 B NA 1 1
#> 20 B NA 2 2
A data frame df has one row for every distinct value of its numeric vector id0, but trailing zeros in id0 indicate important groups along which the file must be transformed. Here are 12 observations of df:
row id0 id0_ntz
a 111000 3
b 111010 1
c 112345 0
d 111974 0
e 112090 1
f 114000 3
g 114099 0
h 555001 0
i 555012 0
j 461000 3
k 461020 1
l 111090 0
Let's call every value of id0 with three trailing zeros (i.e., where id0_ntz == 3) a "big id" and every value that doesn't fit this pattern a "little id." The 12 obs above include three big ids (row values a, f, and j). For each big id, I need to:
1. Find every other value of id0 whose first three digits match those of the ith big id.
2. Add the value of id0 for each match to one of J new columns called idj, where the suffix j runs from 1 to J and J is the number of matching little ids nested in the ith big id.
If df only included the 12 rows shown above, the correct result would look like this:
row id0 id0_ntz id1 id2 id3
a 111000 3 111010 111974 111090
b 111010 1 NA NA NA
c 112345 0 NA NA NA
d 111974 0 NA NA NA
e 112090 1 NA NA NA
f 114000 3 114099 NA NA
g 114099 0 NA NA NA
h 555001 0 NA NA NA
i 555012 0 NA NA NA
j 461000 3 461020 NA NA
k 461020 1 NA NA NA
l 111090 0 NA NA NA
I'm open to any solution that solves this problem dynamically (i.e., is agnostic to the number of big ids, little ids, and resulting idj vectors).
P.S.: I need to do the same thing again where id0_ntz == 2, then 1, but an acceptable answer to this posted question only requires a solution that solves the problem where id0_ntz == 3.
This will serve your purpose
df <- read.table(text = 'row id0 id0_ntz
a 111000 3
b 111010 1
c 112345 0
d 111974 0
e 112090 1
f 114000 3
g 114099 0
h 555001 0
i 555012 0
j 461000 3
k 461020 1
l 111090 0', header = T)
df$id0 <- as.character(df$id0)
library(tidyverse)
df %>%
filter(id0_ntz == 3) %>%
mutate(big_id = substr(id0, 1, 3)) -> big_id
df %>% mutate(id0 = as.character(id0)) %>%
left_join(df %>% mutate(id = as.character(id0),
dummy = match(substr(id, 1, 3), big_id$big_id)) %>%
filter(!is.na(dummy)) %>%
group_by(dummy) %>%
mutate(d2 = paste0('id', row_number() - 1)) %>% select(-id0) %>%
pivot_wider(id_cols = dummy, names_from = d2, values_from = id),
by = c('id0')) %>%
select(-dummy)
#> row id0 id0_ntz id1 id2 id3
#> 1 a 111000 3 111010 111974 111090
#> 2 b 111010 1 <NA> <NA> <NA>
#> 3 c 112345 0 <NA> <NA> <NA>
#> 4 d 111974 0 <NA> <NA> <NA>
#> 5 e 112090 1 <NA> <NA> <NA>
#> 6 f 114000 3 114099 <NA> <NA>
#> 7 g 114099 0 <NA> <NA> <NA>
#> 8 h 555001 0 <NA> <NA> <NA>
#> 9 i 555012 0 <NA> <NA> <NA>
#> 10 j 461000 3 461020 <NA> <NA>
#> 11 k 461020 1 <NA> <NA> <NA>
#> 12 l 111090 0 <NA> <NA> <NA>
Created on 2021-05-28 by the reprex package (v2.0.0)
I'd use the approach below. It's quite short.
library(tidyverse)
df %>%
group_by(big_id = substr(id0, 1, 3)) %>%
mutate(id = ifelse(substr(id0, 4, 6) == "000",
list(setdiff(unique(id0),
paste0(big_id, "000"))),
list())) %>%
unnest_wider(col = id,
names_sep = "")
#> # A tibble: 12 x 7
#> # Groups: big_id [5]
#> row id0 id0_ntz big_id id1 id2 id3
#> <chr> <int> <int> <chr> <int> <int> <int>
#> 1 a 111000 3 111 111010 111974 111090
#> 2 b 111010 1 111 NA NA NA
#> 3 c 112345 0 112 NA NA NA
#> 4 d 111974 0 111 NA NA NA
#> 5 e 112090 1 112 NA NA NA
#> 6 f 114000 3 114 114099 NA NA
#> 7 g 114099 0 114 NA NA NA
#> 8 h 555001 0 555 NA NA NA
#> 9 i 555012 0 555 NA NA NA
#> 10 j 461000 3 461 461020 NA NA
#> 11 k 461020 1 461 NA NA NA
#> 12 l 111090 0 111 NA NA NA
Created on 2021-05-27 by the reprex package (v0.3.0)
library(tidyverse)
df <- read.table(text = 'row id0 id0_ntz
a 111000 3
b 111010 1
c 112345 0
d 111974 0
e 112090 1
f 114000 3
g 114099 0
h 555001 0
i 555012 0
j 461000 3
k 461020 1
l 111090 0', header = T)
df %>%
mutate(id = id0 %/% 1000 * 1000) %>%
group_by(id) %>%
mutate(row_id = row_number() - 1) %>%
ungroup() %>%
filter(row_id != 0) %>%
pivot_wider(id_cols = id, names_from = row_id, values_from = id0, names_prefix = "id") %>%
right_join(df, by = c("id" = "id0")) %>%
rename(id0 = id) %>%
arrange(row)
#> # A tibble: 12 x 6
#> id0 id1 id2 id3 row id0_ntz
#> <dbl> <int> <int> <int> <chr> <int>
#> 1 111000 111010 111974 111090 a 3
#> 2 111010 NA NA NA b 1
#> 3 112345 NA NA NA c 0
#> 4 111974 NA NA NA d 0
#> 5 112090 NA NA NA e 1
#> 6 114000 114099 NA NA f 3
#> 7 114099 NA NA NA g 0
#> 8 555001 NA NA NA h 0
#> 9 555012 NA NA NA i 0
#> 10 461000 461020 NA NA j 3
#> 11 461020 NA NA NA k 1
#> 12 111090 NA NA NA l 0
Created on 2021-05-27 by the reprex package (v2.0.0)
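The two steps from the question can also be sketched in base R without reshaping. This assumes id0 is numeric with six digits, so that id0 %/% 1000 yields the three-digit prefix; the names prefix, is_big, little and wide are made up for illustration.

```r
# Example data from the question
df <- data.frame(
  row = letters[1:12],
  id0 = c(111000, 111010, 112345, 111974, 112090, 114000,
          114099, 555001, 555012, 461000, 461020, 111090),
  id0_ntz = c(3, 1, 0, 0, 1, 3, 0, 0, 0, 3, 1, 0)
)

prefix <- df$id0 %/% 1000   # first three digits
is_big <- df$id0_ntz == 3   # rows holding a "big id"
big_rows <- which(is_big)

# Step 1: for each big id, collect the little ids sharing its prefix
little <- lapply(big_rows, function(i) df$id0[prefix == prefix[i] & !is_big])

# Step 2: spread them over as many idj columns as the largest group needs
n_max <- max(0, lengths(little))
wide <- matrix(NA_real_, nrow = nrow(df), ncol = n_max,
               dimnames = list(NULL, paste0("id", seq_len(n_max))))
for (k in seq_along(little)) {
  ids <- little[[k]]
  wide[big_rows[k], seq_along(ids)] <- ids
}

res <- cbind(df, wide)
res
```

The number of idj columns is derived from the data, so the sketch does not depend on how many big or little ids there are. For the id0_ntz == 2 case, the divisor and the filter value would presumably have to change together.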
I have a dataframe df where:
Days Treatment A Treatment B Treatment C
0 5 1 1
1 0 2 3
2 1 1 0
For example, 5 individuals receiving Treatment A survived 0 days, 1 survived 2 days, and so on. However, I would like each of those individuals to become its own row, with the cell holding the number of days that individual survived:
Patient #   A   B   C
1           0
2           0
3           0
4           0
5           0
6           2
7               0
8               1
9               1
10              2
11                  0
12                  1
13                  1
14                  1
Let Patient # = an arbitrary value.
I am sorry if this is not descriptive enough, but I appreciate any and all help you have to offer! I have the dataset in Excel at the moment, but I can place it into R if that's easier.
We can replicate the 'Days' values by the counts in each treatment column (giving a list with one vector per treatment), create a matching list of running patient numbers with relist, use Map to construct one data.frame per treatment, and finally combine them with bind_rows.
library(dplyr)
lst1 <- lapply(df[-1], function(x) rep(df$Days, x))
bind_rows(Map(function(x, y, z) setNames(data.frame(x, y),
c("Patient", z)), relist(seq_along(unlist(lst1)),
skeleton = lst1), lst1, sub("Treatment\\s+", "", names(lst1))))
-output
# Patient A B C
#1 1 0 NA NA
#2 2 0 NA NA
#3 3 0 NA NA
#4 4 0 NA NA
#5 5 0 NA NA
#6 6 2 NA NA
#7 7 NA 0 NA
#8 8 NA 1 NA
#9 9 NA 1 NA
#10 10 NA 2 NA
#11 11 NA NA 0
#12 12 NA NA 1
#13 13 NA NA 1
#14 14 NA NA 1
Or another option with reshaping into 'long' and then to 'wide'
library(tidyr)
df %>%
pivot_longer(cols = -Days) %>%
separate(name, into = c('name1', 'name2')) %>%
group_by(name2) %>%
summarise(value = rep(Days, value), .groups = 'drop') %>%
mutate(Patient = row_number()) %>%
pivot_wider(names_from = name2, values_from = value)
-output
# A tibble: 14 x 4
# Patient A B C
# <int> <int> <int> <int>
# 1 1 0 NA NA
# 2 2 0 NA NA
# 3 3 0 NA NA
# 4 4 0 NA NA
# 5 5 0 NA NA
# 6 6 2 NA NA
# 7 7 NA 0 NA
# 8 8 NA 1 NA
# 9 9 NA 1 NA
#10 10 NA 2 NA
#11 11 NA NA 0
#12 12 NA NA 1
#13 13 NA NA 1
#14 14 NA NA 1
data
df <- structure(list(Days = 0:2, `Treatment A` = c(5L, 0L, 1L),
`Treatment B` = c(1L,
2L, 1L), `Treatment C` = c(1L, 3L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
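If the wide layout with one column per treatment is not strictly required, a long layout with one row per patient may be easier to work with later on. The following is only a base R sketch under that assumption; the names counts and long are made up for illustration.

```r
# Example data from above
df <- data.frame(Days = 0:2,
                 `Treatment A` = c(5L, 0L, 1L),
                 `Treatment B` = c(1L, 2L, 1L),
                 `Treatment C` = c(1L, 3L, 0L),
                 check.names = FALSE)

counts <- df[-1]  # one column of patient counts per treatment

long <- data.frame(
  # Treatment label repeated once per patient in that treatment
  Treatment = rep(sub("^Treatment\\s+", "", names(counts)),
                  times = colSums(counts)),
  # Days value repeated according to the count in each cell
  Days = unlist(lapply(counts, function(n) rep(df$Days, times = n)),
                use.names = FALSE)
)
long$Patient <- seq_len(nrow(long))  # arbitrary running patient number
long
```

Each row is then one patient, with the treatment stored as a value instead of being encoded in the column position.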
I have a dataframe which looks like this:
`Row Labels` Female Male
<chr> <chr> <chr>
1 London <NA> <NA>
2 42 <NA> 1
3 Paris <NA> <NA>
4 36 1 <NA>
5 Belgium <NA> <NA>
6 18 1
7 21 <NA> 1
8 Madrid <NA> <NA>
9 20 1 <NA>
10 Berlin <NA> <NA>
11 37 <NA> 1
12 23 1
13 25 1
14 44 1
The code I used to produce this dataframe looks like this:
structure(list(`Row Labels` = c("London", "42", "Paris","36", "Belgium","18" ,"21", "Madrid", "20", "Berlin", "37","23","25","44"),
Female = c(NA, NA, NA, "1", NA, NA,NA, NA, "1", NA, NA,"1","1","1"), Male = c(NA,"1", NA, NA, NA, "1", NA, NA, NA, "1",NA,NA,NA,NA)),
.Names = c("Row Labels","Female", "Male"), row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
I would like to know how I can change multiple rows in this dataframe to become columns.
My ideal output looks like this:
'Row Labels' Female Male 42 36 21 20 37 18 23 25 44
London 1 1
Paris 1 1
Belgium 1 1 1 1
Madrid 1 1
Berlin 3 1 1 1 1 1
Seems very mechanical. Calling your data d:
d1 = d[seq(1, nrow(d), by = 2), ]
d2 = d[seq(2, nrow(d), by = 2), ]
d1[, c("Male", "Female")] = d2[, c("Male", "Female")]
d3 = matrix(nrow = nrow(d2), ncol = nrow(d2))
diag(d3) = 1
colnames(d3) = d2$`Row Labels`
cbind(d2, d3)
# Row Labels Female Male 42 36 21 20 37
# 1 42 <NA> 1 1 NA NA NA NA
# 2 36 1 <NA> NA 1 NA NA NA
# 3 21 <NA> 1 NA NA 1 NA NA
# 4 20 1 <NA> NA NA NA 1 NA
# 5 37 <NA> 1 NA NA NA NA 1
Using tidyverse.
library(dplyr)
library(tidyr)
#cumsum based on country names
df %>% group_by(gr=cumsum(grepl('\\D+',`Row Labels`))) %>%
#Sum Female and Male
mutate_at(vars('Female','Male'), list(~sum(as.numeric(.), na.rm = T))) %>%
#Create RL from country name and number where we are at numbers
mutate(RL=ifelse(row_number()>1, paste0(first(`Row Labels`),',',`Row Labels`), NA)) %>%
filter(!is.na(RL)) %>%
select(RL, gr, Male, Female) %>%
separate(RL, into = c('RL','Age')) %>% mutate(flag=1) %>% spread(Age, flag) %>%
ungroup() %>% select(-gr)
# A tibble: 5 x 12
RL Male Female `18` `20` `21` `23` `25` `36` `37` `42` `44`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Belgium 1 0 1 NA 1 NA NA NA NA NA NA
2 Berlin 1 3 NA NA NA 1 1 NA 1 NA 1
3 London 1 0 NA NA NA NA NA NA NA 1 NA
4 Madrid 0 1 NA 1 NA NA NA NA NA NA NA
5 Paris 0 1 NA NA NA NA NA 1 NA NA NA
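The grouping step in the tidyverse answer (the cumsum over grepl) is the core of the approach, so here it is in isolation as a small base R sketch. The shortened labels vector and the names is_header, group and city are only illustrative.

```r
# A shortened version of the `Row Labels` column
labels <- c("London", "42", "Paris", "36", "Belgium", "18", "21")

# A row is a header (city) if it contains any non-digit character
is_header <- grepl("\\D", labels)

# cumsum() increases by 1 at every header, so each city and the
# ages below it share one group number
group <- cumsum(is_header)

# Look up the city name belonging to every row
city <- labels[is_header][group]

# Keep only the age rows, now labelled with their city
res <- data.frame(city = city[!is_header],
                  age = as.integer(labels[!is_header]))
res
```

Once every age row carries its city, the remaining steps (summing Female and Male per city, spreading the ages into columns) are ordinary group-wise operations.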