I have a large data.table in the format below
Name Value 1 2 3 4 5
A 58 1 NA NA NA NA
B 47 NA 1 NA NA NA
C 89 NA NA 1 NA NA
D 68 NA NA NA 1 NA
E 75 NA NA NA NA 1
I would like to forward fill the rows of the data table (i.e. fill across the columns within each row) to achieve the result below. I already know how to forward fill columns.
Name Value 1 2 3 4 5
A 58 1 1 1 1 1
B 47 NA 1 1 1 1
C 89 NA NA 1 1 1
D 68 NA NA NA 1 1
E 75 NA NA NA NA 1
Help!
data.table has its own nafill function. Melt to long format, fill within each Name, then cast back to wide:
library(data.table) # v >= 1.12.8
library(magrittr)
melt(dt, id = 1:2) %>%
  .[, value := nafill(value, "locf"), by = Name] %>%
  dcast(., ... ~ variable)
# Name Value 1 2 3 4 5
# 1: A 58 1 1 1 1 1
# 2: B 47 NA 1 1 1 1
# 3: C 89 NA NA 1 1 1
# 4: D 68 NA NA NA 1 1
# 5: E 75 NA NA NA NA 1
Data
dt <- fread("Name Value 1 2 3 4 5
A 58 1 NA NA NA NA
B 47 NA 1 NA NA NA
C 89 NA NA 1 NA NA
D 68 NA NA NA 1 NA
E 75 NA NA NA NA 1")
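For comparison, the same row-wise fill can be sketched in base R without reshaping. This is only an illustration: the helper name locf and the small data frame wide are made up for the example, and it works on a plain data.frame rather than a data.table.

```r
# Hypothetical helper: last observation carried forward within one vector
locf <- function(x) {
  # index of the most recent non-NA value so far (0 if none seen yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # prepend NA so that index 0 maps to NA
  c(NA, x)[idx + 1]
}

# Small made-up data frame with the same layout as in the question
wide <- data.frame(Name = c("A", "B", "C"), Value = c(58L, 47L, 89L),
                   `1` = c(1L, NA, NA), `2` = c(NA, 1L, NA), `3` = c(NA, NA, 1L),
                   check.names = FALSE)

cols <- 3:5
# apply() returns one column per input row, so transpose back
wide[cols] <- t(apply(as.matrix(wide[cols]), 1, locf))
wide
```

The reshaping approaches scale better to mixed column types; this sketch assumes all filled columns share one type.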
Use fill() from tidyr to fill in missing values with the previous value.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(3:7) %>%
  group_by(Name) %>%
  fill(value) %>%
  ungroup() %>%
  pivot_wider()
# # A tibble: 5 x 7
# Name Value `1` `2` `3` `4` `5`
# <fct> <int> <int> <int> <int> <int> <int>
# 1 A 58 1 1 1 1 1
# 2 B 47 NA 1 1 1 1
# 3 C 89 NA NA 1 1 1
# 4 D 68 NA NA NA 1 1
# 5 E 75 NA NA NA NA 1
Note: For this data, the output above is the same as that of
df %>% fill(3:7, .direction = "up")
but the logic is different. The former fills rows forward and the latter fills columns backward, so the two will differ in other cases.
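A small sketch of a case where the two differ. The data frame df2 is made up for illustration: the non-NA values are no longer on the diagonal.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: row A has its value in column 2, row B in column 1
df2 <- data.frame(Name = c("A", "B"), Value = c(58L, 47L),
                  `1` = c(NA, 1L), `2` = c(1L, NA),
                  `3` = c(NA_integer_, NA_integer_),
                  check.names = FALSE)

# Filling rows forward (left to right within each Name)
row_forward <- df2 %>%
  pivot_longer(3:5) %>%
  group_by(Name) %>%
  fill(value) %>%
  ungroup() %>%
  pivot_wider()

# Filling columns backward (bottom to top within each column)
col_backward <- df2 %>% fill(3:5, .direction = "up")

row_forward
col_backward
```

Here row A becomes NA 1 1 when filled forward along the row, but 1 1 NA when the columns are filled upward.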
Data
df <- structure(list(Name = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), Value = c(58L, 47L, 89L, 68L, 75L
), `1` = c(1L, NA, NA, NA, NA), `2` = c(NA, 1L, NA, NA, NA),
`3` = c(NA, NA, 1L, NA, NA), `4` = c(NA, NA, NA, 1L, NA),
`5` = c(NA, NA, NA, NA, 1L)), class = "data.frame", row.names = c(NA, -5L))
Related
I have a dataframe df where:
Days Treatment A Treatment B Treatment C
0 5 1 1
1 0 2 3
2 1 1 0
For example, there were 5 individuals receiving Treatment A who survived 0 days and 1 who survived 2 days, etc. However, I would like each of those individuals to become its own row, with the cell in the treatment column giving the number of days they survived:
Patient #   A   B   C
1           0
2           0
3           0
4           0
5           0
6           2
7               0
8               1
9               1
10              2
11                  0
12                  1
13                  1
14                  1
Let Patient # = an arbitrary value.
I am sorry if this is not descriptive enough, but I appreciate any and all help you have to offer! I have the dataset in Excel at the moment, but I can place it into R if that's easier.
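We can replicate the 'Days' values by the counts in each treatment column and store the results in a list, then create a list of running patient ids with the same structure, use Map to construct one data.frame per treatment, and finally combine them with bind_rows
library(dplyr)
# one vector of survival days per treatment, repeated by the patient counts
lst1 <- lapply(df[-1], function(x) rep(df$Days, x))
# running patient ids with the same list structure as lst1
ids <- relist(seq_along(unlist(lst1)), skeleton = lst1)
bind_rows(Map(function(x, y, z) setNames(data.frame(x, y), c("Patient", z)),
              ids, lst1, sub("Treatment\\s+", "", names(lst1))))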
-output
# Patient A B C
#1 1 0 NA NA
#2 2 0 NA NA
#3 3 0 NA NA
#4 4 0 NA NA
#5 5 0 NA NA
#6 6 2 NA NA
#7 7 NA 0 NA
#8 8 NA 1 NA
#9 9 NA 1 NA
#10 10 NA 2 NA
#11 11 NA NA 0
#12 12 NA NA 1
#13 13 NA NA 1
#14 14 NA NA 1
Or another option: reshape into 'long' format and then back to 'wide'
library(tidyr)
df %>%
  pivot_longer(cols = -Days) %>%
  separate(name, into = c('name1', 'name2')) %>%
  group_by(name2) %>%
  # with dplyr >= 1.1.0, reframe(value = rep(Days, value)) avoids the deprecation warning
  summarise(value = rep(Days, value), .groups = 'drop') %>%
  mutate(Patient = row_number()) %>%
  pivot_wider(names_from = name2, values_from = value)
-output
# A tibble: 14 x 4
# Patient A B C
# <int> <int> <int> <int>
# 1 1 0 NA NA
# 2 2 0 NA NA
# 3 3 0 NA NA
# 4 4 0 NA NA
# 5 5 0 NA NA
# 6 6 2 NA NA
# 7 7 NA 0 NA
# 8 8 NA 1 NA
# 9 9 NA 1 NA
#10 10 NA 2 NA
#11 11 NA NA 0
#12 12 NA NA 1
#13 13 NA NA 1
#14 14 NA NA 1
data
df <- structure(list(Days = 0:2, `Treatment A` = c(5L, 0L, 1L),
`Treatment B` = c(1L,
2L, 1L), `Treatment C` = c(1L, 3L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
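A further variant, sketched here under the assumption that tidyr's uncount() is available (tidyr >= 0.8.0), expands the counts directly instead of calling rep() inside summarise. The column names trt and n and the object res are made up for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Days = 0:2,
                 `Treatment A` = c(5L, 0L, 1L),
                 `Treatment B` = c(1L, 2L, 1L),
                 `Treatment C` = c(1L, 3L, 0L),
                 check.names = FALSE)

res <- df %>%
  # long format: one row per (Days, treatment) with the patient count in n
  pivot_longer(-Days, names_to = "trt", names_prefix = "Treatment ",
               values_to = "n") %>%
  # repeat each row n times; rows with n = 0 are dropped
  uncount(n) %>%
  arrange(trt, Days) %>%
  mutate(Patient = row_number()) %>%
  pivot_wider(id_cols = Patient, names_from = trt, values_from = Days)

res
```

The patient numbering is arbitrary, as the question allows; it simply follows the sort order by treatment and days.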
I want my_var to be 0 if my_var_a to my_var_c are all NA
# A tibble: 4 x 5
my_var my_var_a my_var_b my_var_c my_var_others
<int> <int> <int> <int> <int>
1 0 NA NA NA NA
2 1 NA 1 NA NA
3 0 NA NA NA NA
4 NA NA NA NA NA
I get my desired result using:
library(tidyverse)
df %>% mutate(my_var = if_else(apply(select(., my_var_a:my_var_c), 1, function(x) all(is.na(x))), 0L, my_var))
Is there a less complicated way of doing that or at least a way using purrr? I looked into pmap but couldn't figure out how it would replace apply.
The code above results in:
my_var my_var_a my_var_b my_var_c my_var_others
<int> <int> <int> <int> <int>
1 0 NA NA NA NA
2 1 NA 1 NA NA
3 0 NA NA NA NA
4 0 NA NA NA NA
This is the data frame:
structure(list(my_var = c(0L, 1L, 0L, NA), my_var_a = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), my_var_b = c(NA, 1L,
NA, NA), my_var_c = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_), my_var_others = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
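We can use pmap_int from purrr to iterate over multiple columns row-wise. Note that this recomputes my_var for every row (1 if any of my_var_a to my_var_c is non-NA, 0 otherwise) instead of keeping its existing value; for the example data the result is the same.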
library(dplyr)
library(purrr)
df %>% mutate(my_var = pmap_int(select(., my_var_a:my_var_c), ~any(!is.na(c(...)))))
# my_var my_var_a my_var_b my_var_c my_var_others
# <int> <int> <int> <int> <int>
#1 0 NA NA NA NA
#2 1 NA 1 NA NA
#3 0 NA NA NA NA
#4 0 NA NA NA NA
In base R, we can use rowSums and assign 1 to rows where there is at least one non-NA value.
cols <- paste0("my_var_",letters[1:3])
df$my_var <- +(rowSums(is.na(df[cols])) < length(cols))
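If the existing my_var should be kept wherever the three columns are not all NA, as the question describes, a sketch using if_all() may be closer to the original apply() call. This assumes dplyr >= 1.0.4; the data frame df2 is made up, with my_var = 5 in row 2 to show that existing values survive.

```r
library(dplyr)

# Hypothetical data: row 2 has a non-NA my_var_b and an existing my_var of 5
df2 <- data.frame(my_var   = c(0L, 5L, NA),
                  my_var_a = NA_integer_,
                  my_var_b = c(NA, 1L, NA),
                  my_var_c = NA_integer_)

res <- df2 %>%
  # set my_var to 0 only where my_var_a to my_var_c are all NA
  mutate(my_var = if_else(if_all(my_var_a:my_var_c, is.na), 0L, my_var))

res$my_var
```

Row 2 keeps its value of 5, whereas the any(!is.na(...)) approaches would overwrite it with 1.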
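Checking all(is.na(x)) yields TRUE where you want 0, so put ! in front. The ^1 converts the logical result to numeric. Only columns 2 to 4 (my_var_a to my_var_c) are checked, as the question asks. Fairly uncomplicated in base R (dat is the data frame from the question).
dat <- transform(dat, my_var=apply(dat[2:4], 1, function(x) !all(is.na(x)))^1)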
dat
# my_var my_var_a my_var_b my_var_c my_var_others
# 1 0 NA NA NA NA
# 2 1 NA 1 NA NA
# 3 0 NA NA NA NA
# 4 0 NA NA NA NA
I have a dataframe which looks like this:
`Row Labels` Female Male
<chr> <chr> <chr>
1 London <NA> <NA>
2 42 <NA> 1
3 Paris <NA> <NA>
4 36 1 <NA>
5 Belgium <NA> <NA>
6 18 1
7 21 <NA> 1
8 Madrid <NA> <NA>
9 20 1 <NA>
10 Berlin <NA> <NA>
11 37 <NA> 1
12 23 1
13 25 1
14 44 1
The code I used to produce this dataframe looks like this:
structure(list(`Row Labels` = c("London", "42", "Paris","36", "Belgium","18" ,"21", "Madrid", "20", "Berlin", "37","23","25","44"),
Female = c(NA, NA, NA, "1", NA, NA,NA, NA, "1", NA, NA,"1","1","1"), Male = c(NA,"1", NA, NA, NA, "1", NA, NA, NA, "1",NA,NA,NA,NA)),
.Names = c("Row Labels","Female", "Male"), row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
I would like to know how I can change multiple rows in this dataframe to become columns.
My ideal output looks like this:
'Row Labels' Female Male 42 36 21 20 37 18 23 25 44
London 1 1
Paris 1 1
Belgium 1 1 1 1
Madrid 1 1
Berlin 3 1 1 1 1 1
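Seems very mechanical, assuming city rows and number rows strictly alternate (which does not hold for the data as posted, where Belgium and Berlin are each followed by several number rows). Calling your data d: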
d1 = d[seq(1, nrow(d), by = 2), ]
d2 = d[seq(2, nrow(d), by = 2), ]
d1[, c("Male", "Female")] = d2[, c("Male", "Female")]
d3 = matrix(nrow = nrow(d2), ncol = nrow(d2))
diag(d3) = 1
colnames(d3) = d2$`Row Labels`
cbind(d2, d3)
# Row Labels Female Male 42 36 21 20 37
# 1 42 <NA> 1 1 NA NA NA NA
# 2 36 1 <NA> NA 1 NA NA NA
# 3 21 <NA> 1 NA NA 1 NA NA
# 4 20 1 <NA> NA NA NA 1 NA
# 5 37 <NA> 1 NA NA NA NA 1
Using tidyverse.
library(dplyr)
library(tidyr)
#cumsum based on country names
df %>% group_by(gr=cumsum(grepl('\\D+',`Row Labels`))) %>%
#Sum Female and Male
mutate_at(vars('Female','Male'), list(~sum(as.numeric(.), na.rm = TRUE))) %>%
#Create RL from country name and number where we are at numbers
mutate(RL=ifelse(row_number()>1, paste0(first(`Row Labels`),',',`Row Labels`), NA)) %>%
filter(!is.na(RL)) %>%
select(RL, gr, Male, Female) %>%
separate(RL, into = c('RL','Age')) %>% mutate(flag=1) %>% spread(Age, flag) %>%
ungroup() %>% select(-gr)
# A tibble: 5 x 12
RL Male Female `18` `20` `21` `23` `25` `36` `37` `42` `44`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Belgium 1 0 1 NA 1 NA NA NA NA NA NA
2 Berlin 1 3 NA NA NA 1 1 NA 1 NA 1
3 London 1 0 NA NA NA NA NA NA NA 1 NA
4 Madrid 0 1 NA 1 NA NA NA NA NA NA NA
5 Paris 0 1 NA NA NA NA NA 1 NA NA NA
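The grouping step cumsum(grepl(...)) is the core of this approach, so here is a minimal base R sketch of just that step. The vector labels is a shortened, made-up version of the `Row Labels` column.

```r
# Shortened example of the `Row Labels` column
labels <- c("London", "42", "Paris", "36", "Belgium", "18", "21")

# TRUE where the label contains a non-digit character, i.e. a city name
is_city <- grepl("\\D", labels)

# Running count of city rows: every row gets the id of the city above it
gr <- cumsum(is_city)

# Look up the city for each number row
cities <- labels[is_city]
pairs <- data.frame(city = cities[gr][!is_city],
                    age  = labels[!is_city])
pairs
```

Each number row inherits the group id of the most recent city row, which is why Belgium ends up with both 18 and 21.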
I have some data that looks like this:
samp
# A tibble: 5 x 2
ID Source
<dbl> <chr>
1 34221 75
2 33861 75
3 59741 126,123
4 56561 111,105
5 55836 36,34,34,36,22
For each of the distinct values, I want to create a new column. If the value occurs in a row, I want to put an "x" in that column; otherwise the cell should stay empty.
Example (pseudo code) of the expected result:
ID 75 126 123 111 105 36 34 22
1 34221 x
2 33861 x
3 59741 x x
4 56561 x x
5 55836 x x x
I tried it with the separate function of the tidyr package, like this for a start:
into = unique(unlist(strsplit(samp$Source, ",")))
samp %>% separate(col = "Source", into = into, sep = ",")
However, this doesn't work: if there is more than one value in a row, the values are not assigned to their respective columns (e.g. for ID 59741 the value 126 ends up in column 75 and not in column 126).
A tibble: 5 x 9
ID `75` `126` `123` `111` `105` `36` `34` `22`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 34221 75 NA NA NA NA NA NA NA
2 33861 75 NA NA NA NA NA NA NA
3 59741 126 123 NA NA NA NA NA NA
4 56561 111 105 NA NA NA NA NA NA
5 55836 36 34 34 36 22 NA NA NA
Here is a dput:
structure(list(ID = c(34221, 33861, 59741, 56561, 55836), Source = c("75",
"75", "126,123", "111,105", "36,34,34,36,22")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Could also do (df is the samp tibble from the question):
library(tidyverse)
df %>%
  mutate(Source = strsplit(Source, ","),
         dummy = "x") %>%
  unnest(Source) %>% distinct() %>%
  spread(Source, dummy)
Output:
ID `105` `111` `123` `126` `22` `34` `36` `75`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 33861 NA NA NA NA NA NA NA x
2 34221 NA NA NA NA NA NA NA x
3 55836 NA NA NA NA x x x NA
4 56561 x x NA NA NA NA NA NA
5 59741 NA NA x x NA NA NA NA
The package splitstackshape is very handy for such operations, i.e.
library(splitstackshape)
cSplit_e(df, "Source", mode = "binary", type = "character", fill = 0, drop = TRUE)
which gives,
ID Source_105 Source_111 Source_123 Source_126 Source_22 Source_34 Source_36 Source_75
1 34221 0 0 0 0 0 0 0 1
2 33861 0 0 0 0 0 0 0 1
3 59741 0 0 1 1 0 0 0 0
4 56561 1 1 0 0 0 0 0 0
5 55836 0 0 0 0 1 1 1 0
Another option is using tidyr::separate_rows
library(dplyr)
library(tidyr)
df %>% separate_rows(Source,sep=',') %>% distinct() %>%
mutate(dummy='X') %>% spread(Source,dummy)
ID 105 111 123 126 22 34 36 75
1 33861 <NA> <NA> <NA> <NA> <NA> <NA> <NA> X
2 34221 <NA> <NA> <NA> <NA> <NA> <NA> <NA> X
3 55836 <NA> <NA> <NA> <NA> X X X <NA>
4 56561 X X <NA> <NA> <NA> <NA> <NA> <NA>
5 59741 <NA> <NA> X X <NA> <NA> <NA> <NA>
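For completeness, a base R sketch that keeps the original row order and the column order from the question (order of first appearance). The object names parts, vals and m are made up for this example.

```r
samp <- data.frame(ID = c(34221, 33861, 59741, 56561, 55836),
                   Source = c("75", "75", "126,123", "111,105", "36,34,34,36,22"))

# Split each Source string into its values
parts <- strsplit(samp$Source, ",")

# Distinct values in order of first appearance
vals <- unique(unlist(parts))

# One row per ID, one column per distinct value: "x" if present, "" otherwise
m <- t(sapply(parts, function(p) ifelse(vals %in% p, "x", "")))
colnames(m) <- vals

cbind(samp["ID"], m)
```

Duplicates within a row (such as 36 and 34 for ID 55836) need no special handling here, since %in% only tests membership.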
I'm fairly used to adding in missing cases for data but this use case escapes me.
I have a number of dataframes (which differ slightly), an example would be:
> t1
3 4 5
2 1 0 0
3 0 2 2
4 2 6 4
5 1 2 1
structure(list(`3` = c(1L, 0L, 2L, 1L), `4` = c(0L, 2L, 6L, 2L
), `5` = c(0L, 2L, 4L, 1L)), .Names = c("3", "4", "5"), row.names = c("2",
"3", "4", "5"), class = "data.frame")
Row names & Column names should be from 1:5 and, obviously, where these were missing the cell value set to NA. For the example above this would give:
> t1
1 2 3 4 5
1 NA NA NA NA NA
2 NA NA 1 0 0
3 NA NA 0 2 2
4 NA NA 2 6 4
5 NA NA 1 2 1
In each case ANY one or more rows AND/OR columns might be missing.
I can readily get the missing columns using the method described by Josh O'Brien here but am missing the row method.
Can anyone help?
We can do this much more easily in base R by creating a matrix of NAs with the required dimensions and then assigning the values of 't1' based on its row names and column names
m1 <- matrix(NA, ncol=5, nrow=5, dimnames = list(1:5, 1:5))
m1[row.names(t1), colnames(t1)] <- unlist(t1)
m1
# 1 2 3 4 5
#1 NA NA NA NA NA
#2 NA NA 1 0 0
#3 NA NA 0 2 2
#4 NA NA 2 6 4
#5 NA NA 1 2 1
Or using tidyverse
library(tidyverse)
rownames_to_column(t1, "rn") %>%
gather(Var, Val, -rn) %>%
mutate_at(vars(rn, Var), as.integer) %>%
complete(rn = seq_len(max(rn)), Var = seq_len(max(Var))) %>%
spread(Var, Val)
# A tibble: 5 × 6
# rn `1` `2` `3` `4` `5`
#* <int> <int> <int> <int> <int> <int>
#1 1 NA NA NA NA NA
#2 2 NA NA 1 0 0
#3 3 NA NA 0 2 2
#4 4 NA NA 2 6 4
#5 5 NA NA 1 2 1
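Since the question mentions several such data frames, the matrix approach can be wrapped in a small function and applied to a list. This is a sketch: the function name pad_to_grid and the second table t2 are made up, and it assumes every row and column name of the input is one of the ids.

```r
# Hypothetical helper: place x into an NA grid with the given row/column ids
pad_to_grid <- function(x, ids = 1:5) {
  ids <- as.character(ids)
  m <- matrix(NA_integer_, nrow = length(ids), ncol = length(ids),
              dimnames = list(ids, ids))
  # fill in the cells that exist in x, matched by name
  m[rownames(x), colnames(x)] <- as.matrix(x)
  m
}

t1 <- data.frame(`3` = c(1L, 0L, 2L, 1L), `4` = c(0L, 2L, 6L, 2L),
                 `5` = c(0L, 2L, 4L, 1L),
                 row.names = c("2", "3", "4", "5"), check.names = FALSE)

# A second, made-up table with different rows and columns missing
tabs <- list(t1 = t1, t2 = t1[1:2, 2:3])
padded <- lapply(tabs, pad_to_grid)
padded$t2
```

The result is a list of 5 x 5 integer matrices; wrap the return value in as.data.frame() if data frames are needed downstream.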
Based on the solution by Josh O'Brien that you mentioned, you can do the same for rows by using rownames instead of names. The code below fills the missing cells with 0; assign NA instead if that is what you need.
df <- data.frame(a=1:4, e=4:1)
colnms <- c("a", "b", "d", "e")
rownms <- c("1", "2", "3", "4", "5")
rownames(df) <- c("1", "3", "4", "5")
## find missing columns and replace with zero, and order them
Missing <- setdiff(colnms, names(df))
df[Missing] <- 0
df <- df[colnms]
df
## do the same for rows
MissingR <- setdiff(rownms, rownames(df))
df[MissingR,] <- 0
df <- df[rownms,]
df
# > df
# a b d e
#1 1 0 0 4
#2 0 0 0 0
#3 2 0 0 3
#4 3 0 0 2
#5 4 0 0 1