R: complete/expand a dataset with a new column added

I have a dataset that looks like this:
(Visualising the datasets below may help you to understand the question)
original <- data.frame(
  ID = c(rep("John", 3), "Steve"),
  A = c(rep(3, 3), 1),
  B = c(rep(4, 3), 2),
  b = c(2, 3, 2, 2),
  detail = c(rep("GOOOOD", 4))
)
Values in variables A, B, and b are all integers. Variable b is incomplete in this dataset; it actually runs from 1 up to the value of B.
I need to complete this dataset with a new variable a added; the completed dataset will look like this:
completed1 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)
Values in a are integers too, running from 1 to the value of A. Values of b are nested within each value of a, and values of a are nested within each level of ID.
I think the most relevant functions for completing a dataset in this way are tidyr::complete() and tidyr::expand(), but they can only complete combinations of values in existing variables; they cannot add a new column (variable).
I know the challenge is that there are multiple valid positions for the values of detail relative to the newly added a, because of the nested relationship; for example, the completed dataset could also be this:
completed2 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", rep(NA, 4), "GOOOOD", NA, NA, "GOOOOD", rep(NA, 3), "GOOOOD")
)
Where the values in detail end up in the completed dataset does not matter to me. My actual dataset has more than 40,000 rows, so I really need something automated.
Is it possible to do this?
Thanks very much!!!

It's pretty messy using a for loop, and it places GOOOOD at random positions:
library(dplyr)
library(tidyr)
library(tibble)

# Expand the full (a, b) grid for each ID
comp_dummy <- original %>%
  group_by(ID) %>%
  expand(A = A, a = 1:A, B = B, b = 1:B)

# Count how many detail values each (ID, A, B, b) combination must receive
original <- original %>%
  group_by(ID, A, B, b) %>%
  summarise(n = n())

vec <- rep(NA_character_, nrow(comp_dummy))
for (i in 1:nrow(original)){
  x <- original[i, ]
  # Row positions in the grid that match this combination
  y <- comp_dummy %>%
    rownames_to_column(., "row") %>%
    filter(ID == x$ID, A == x$A, B == x$B, b == x$b) %>%
    pull(row)
  # Place n of the details at randomly chosen matching positions
  z <- sample(y, x$n, replace = FALSE) %>% as.numeric()
  print(z)
  vec[z] <- "GOOOOD"
}
comp_dummy$detail <- vec
comp_dummy
ID A a B b detail
<chr> <dbl> <int> <dbl> <int> <chr>
1 John 3 1 4 1 NA
2 John 3 1 4 2 GOOOOD
3 John 3 1 4 3 NA
4 John 3 1 4 4 NA
5 John 3 2 4 1 NA
6 John 3 2 4 2 NA
7 John 3 2 4 3 NA
8 John 3 2 4 4 NA
9 John 3 3 4 1 NA
10 John 3 3 4 2 GOOOOD
11 John 3 3 4 3 GOOOOD
12 John 3 3 4 4 NA
13 Steve 1 1 2 1 NA
14 Steve 1 1 2 2 GOOOOD
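A more deterministic sketch of the same idea, without the loop (my addition; grid and counts are just illustrative names, and it starts again from the raw original data.frame): count how many detail values each (ID, b) cell must receive and fill the lowest values of a first, so the original counts are preserved.
library(dplyr)
library(tidyr)
# How many detail values each (ID, A, B, b) combination must receive
counts <- original %>% count(ID, A, B, b, name = "k")
# The full (a, b) grid per ID
grid <- original %>%
  distinct(ID, A, B) %>%
  group_by(ID) %>%
  expand(A = A, a = 1:A, B = B, b = 1:B) %>%
  ungroup()
# Fill the first k slots of each (ID, b) cell
grid %>%
  left_join(counts, by = c("ID", "A", "B", "b")) %>%
  group_by(ID, b) %>%
  mutate(detail = ifelse(row_number() <= coalesce(k, 0L), "GOOOOD", NA_character_)) %>%
  ungroup() %>%
  select(-k)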

I wonder whether doing the complete twice, first for a and then for b, can be a solution. You can adjust the nesting, or the group_by, if needed.
Depending on whether the maximum a should come from A within the ID group, adjust or remove the group_by (and similarly for b within the a group).
library(dplyr)
library(tidyr)
original %>%
  dplyr::mutate(a = 1) %>%
  dplyr::group_by(ID) %>%
  tidyr::complete(a = 1:max(A), nesting(ID, A, B, b), fill = list(detail = NA_character_)) %>%
  dplyr::group_by(a) %>%
  tidyr::complete(b = 1:max(B), nesting(ID, A, B, a), fill = list(detail = NA_character_)) %>%
  dplyr::ungroup()
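A quick sanity check may help (a sketch; it assumes the result of the pipeline above was saved as completed): each ID should contribute A * B rows, and the number of non-NA detail values per ID should match the original data.
completed %>%
  dplyr::group_by(ID, A, B) %>%
  dplyr::summarise(rows = dplyr::n(), details_kept = sum(!is.na(detail)), .groups = "drop")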

A base R solution
do.call(
  rbind,
  by(original, list(original$ID), function(x){
    tmp <- merge(
      unique(x),
      setNames(
        expand.grid(
          unique(x$ID),
          x$A[1],
          1:max(x$A),
          x$B[1],
          1:max(x$B)
        ),
        c("ID", "A", "a", "B", "b")
      ),
      by = c("ID", "A", "B", "b"),
      all = TRUE
    )
    tmp[order(tmp$a, tmp$b), c("ID", "A", "a", "B", "b", "detail")]
  })
)
resulting in (note that because the merge matches on b within each ID regardless of a, every original detail row is repeated for each value of a, so GOOOOD appears more often here than in the original data):
ID A a B b detail
John.1 John 3 1 4 1 <NA>
John.5 John 3 1 4 2 GOOOOD
John.8 John 3 1 4 3 GOOOOD
John.11 John 3 1 4 4 <NA>
John.2 John 3 2 4 1 <NA>
John.4 John 3 2 4 2 GOOOOD
John.9 John 3 2 4 3 GOOOOD
John.12 John 3 2 4 4 <NA>
John.3 John 3 3 4 1 <NA>
John.6 John 3 3 4 2 GOOOOD
John.7 John 3 3 4 3 GOOOOD
John.10 John 3 3 4 4 <NA>
Steve.1 Steve 1 1 2 1 <NA>
Steve.2 Steve 1 1 2 2 GOOOOD

Related

How to create a new dataframe of consolidated values from multiple columns in R

I have a dataframe, df1, that looks like the following:
sample  99_Ape_1  93_Cat_1  87_Ape_2  84_Cat_2  90_Dog_1  92_Dog_2
A       2         3         1         7         4         6
B       5         9         7         0         3         7
C       6         8         9         2         3         0
D       3         9         0         5         8         3
I want to consolidate the dataframe by summing the values based on the animal named in each column header, i.e. by "Ape", "Cat", "Dog", and end up with the following dataframe:
sample  Ape  Cat  Dog
A       3    10   10
B       12   9    10
C       15   10   3
D       3    14   11
I have created a list of all the animals, called "animals_list".
I have then created a list of dataframes, subsetting each animal into a separate dataframe, with:
species_extract <- list()
for (i in 1:length(animals_list)){
  species_extract[[i]] <- df1[, grep(animals_list[i], names(df1))]
}
I am then trying to sum each variable in the row by sample:
for (i in 1:length(species_extract)){
  species_extract[[i]]$total <- rowSums(species_extract[[i]])
}
and then create a dataframe 'animal_total' by binding all values in the new 'total' column.
animal_total <- NULL
for (i in 1:length(species_extract)){
  animal_total[i] <- cbind(species_extract[[i]]$total)
}
Unfortunately, this doesn't seem to work at all and I think I may have taken the wrong route. Any help would be really appreciated!
EDIT: my dataframe has over 300 animals, so making use of my list of identifiers (animals_list) would be highly appreciated! I would also note that some column names do not follow the structure "number_animal_number", and therefore I can't use a repetitive search (sorry!).
A data.table approach:
library(data.table)
library(rlist)
# set data to data.table format
setDT(df1)
# split columns 2:n by regex on column names
L <- split.default(df1[, -1], gsub(".*_(.*)_.*", "\\1", names(df1)[-1]))
# bind together again
data.table(sample = df1$sample,
           as.data.table(list.cbind(lapply(L, rowSums))))
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
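The gsub() call does the grouping work there; a quick illustration of how it pulls the animal name out of each column name:
gsub(".*_(.*)_.*", "\\1", c("99_Ape_1", "93_Cat_1", "90_Dog_1"))
# [1] "Ape" "Cat" "Dog"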
Update: After clarification:
This may work depending on the other names of your animals, but this is a start:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  pivot_longer(
    cols = -sample
  ) %>%
  mutate(name1 = str_extract(name, '(?<=\\_)(.*?)(?=\\_)')) %>%
  group_by(sample, name1) %>%
  summarise(sum = sum(value)) %>%
  pivot_wider(
    names_from = name1,
    values_from = sum
  )
Output:
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
First answer:
Here is how we could do it with dplyr:
library(dplyr)
df %>%
  mutate(Cat = rowSums(select(., contains("Cat"))),
         Ape = rowSums(select(., contains("Ape"))),
         Dog = rowSums(select(., contains("Dog")))) %>%
  select(sample, Cat, Ape, Dog)
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
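With over 300 animals (as the edit mentions), hard-coding one mutate() line per animal won't scale. A hedged sketch that loops over the question's animals_list instead (assuming it is a character vector of animal names, each matching its columns the way grep() did in the question):
library(dplyr)
library(purrr)
bind_cols(df["sample"],
          map_dfc(set_names(animals_list),
                  ~ rowSums(select(df, contains(.x)))))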
An alternative data.table solution
library(data.table)
# Construct data table
dt <- as.data.table(list(sample = c("A", "B", "C", "D"),
                         `99_Ape_1` = c(2, 5, 6, 3),
                         `93_Cat_1` = c(3, 9, 8, 9),
                         `87_Ape_2` = c(1, 7, 9, 0),
                         `84_Cat_2` = c(7, 0, 2, 5),
                         `90_Dog_1` = c(4, 3, 3, 8),
                         `92_Dog_2` = c(6, 7, 0, 3)))
# Alternatively convert existing dataframe
# dt <- setDT(df)
# Use Regex pattern to drop ids from column names
names(dt) <- gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", names(dt))
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to colums)
dcast(dt, sample ~ variable)
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
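As a quick check of that name-cleaning regex (the first alternative strips a three-character prefix of digits/underscores, the second an underscore plus one trailing digit):
gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", c("99_Ape_1", "84_Cat_2"))
# [1] "Ape" "Cat"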
Alternatively, leaving the column names as is (after comment from OP to previous answer) and assuming that there are multiple observations of the same samples:
dt <- as.data.table(list(sample = c("A", "B", "C", "D", "A"),
                         `99_Ape_1` = c(2, 5, 6, 3, 1),
                         `93_Cat_1` = c(3, 9, 8, 9, 1),
                         `87_Ape_2` = c(1, 7, 9, 0, 1),
                         `84_Cat_2` = c(7, 0, 2, 5, 1),
                         `90_Dog_1` = c(4, 3, 3, 8, 1),
                         `92_Dog_2` = c(6, 7, 0, 3, 1)))
dt
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 2 3 1 7 4 6
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
# 5: A 1 1 1 1 1 1
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to colums)
dcast(dt, sample ~ variable)
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 3 4 2 8 5 7
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3

How to replace any NAs in dataframe with the previous value in the same row in R

I have a data frame that contains several scattered NA values. I would like to fill each NA with the value immediately to its left in the same row, or with the following value to its right if there is no non-NA value to the left. It seems like zoo::na.locf or tidyr::fill() could help, but they only fill using the previous/next value above or below in the same column.
I currently have this code but it's only filling based on above values in same column:
lapply(df, function(x) zoo::na.locf(zoo::na.locf(x, na.rm = FALSE), fromLast = TRUE))
My dataframe df looks like this:
C1 C2 C3 C4
1 2 1 9 2
2 NA 5 1 1
3 1 NA 3 8
4 3 NA NA 4
structure(list(C1 = c(2, NA, 1, 3), C2 = c(1, 5, NA, NA), C3 = c(9,
1, 3, NA), C4 = c(2, 1, 8, 4)), row.names = c(NA, 4L), class = "data.frame")
After filling the NA values, I would like it to look like this:
C1 C2 C3 C4
1 2 1 9 2
2 5 5 1 1
3 1 1 3 8
4 3 3 3 4
This is indeed not the usual way to store data, but if you just transpose, you can use tidyr::fill(). The only downside is that it adds quite a bit of wrapping code.
xx <- structure(list(C1 = c(2, NA, 1, 3), C2 = c(1, 5, NA, NA), C3 = c(9,
1, 3, NA), C4 = c(2, 1, 8, 4)), row.names = c(NA, 4L), class = "data.frame")
library(dplyr)
library(tidyr)
library(purrr)  # for set_names()
xx %>%
  t() %>%
  as_tibble() %>%
  tidyr::fill(everything(), .direction = "downup") %>%
  t() %>%
  as_tibble() %>%
  set_names(names(xx))
# A tibble: 4 x 4
# C1 C2 C3 C4
# <dbl> <dbl> <dbl> <dbl>
#1 2 1 9 2
#2 5 5 1 1
#3 1 1 3 8
#4 3 3 3 4
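One caveat to the transpose trick (my note, not from the original answer): t() goes through a matrix, so a data.frame with mixed column types would be silently coerced to character. It is safe here because all four columns are numeric.
# A tiny demonstration of the coercion with a hypothetical mixed-type frame
typeof(t(data.frame(a = 1:2, b = c("x", "y"))))
# [1] "character"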
With apply and zoo::na.locf0: apply(df, 1, ...) runs row-wise, t() restores the orientation, and assigning into df[] keeps the data.frame structure.
library(zoo)
df[] <- t(apply(df, 1, function(x) na.locf0(na.locf0(x), fromLast = TRUE)))
Output:
df
# C1 C2 C3 C4
#1 2 1 9 2
#2 5 5 1 1
#3 1 1 3 8
#4 3 3 3 4
na.locf can directly work on dataframes but it works column-wise. If you want to make it run row-wise you can transpose the dataframe. You can also use fromLast = TRUE to fill the data from opposite direction. Finally, we use coalesce to select the first non-NA value from the two vectors.
library(zoo)
df[] <- dplyr::coalesce(c(t(na.locf(t(df), na.rm = FALSE))),
                        c(t(na.locf(t(df), na.rm = FALSE, fromLast = TRUE))))
df
# C1 C2 C3 C4
#1 2 1 9 2
#2 5 5 1 1
#3 1 1 3 8
#4 3 3 3 4

R: Merge two data frames based on value in column and return all values of both data frames

Let's say I have the following dfs
df1:
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
df2:
a b c d
1 2 3 4
2 2 3 4
3 2 3 4
Now I want to merge both dfs conditional on column "a" to give me the following df:
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
2 2 3 4
3 2 3 4
In my dataset I tried using
merge <- merge(x = df1, y = df2, by = "a", all = TRUE)
However, while df1 has 50,000 entries and df2 has 100,000 entries, and there are definitely matching values in column a, the merged df has over one million entries. I do not understand this. As I understand it, there should be at most 150,000 entries in the merged df, and that maximum is reached when no values in column a are shared between the two dfs.
I think what you want to do is not merge but rather rbind the two dataframes and remove the duplicated rows:
DATA:
df1 <- data.frame(a = c(1, 4, 9),
                  b = c(2, 3, 7),
                  c = c(3, 3, 3),
                  d = c(4, 4, 4))
df2 <- data.frame(a = c(1, 2, 3),
                  b = c(2, 2, 2),
                  c = c(3, 3, 3),
                  d = c(4, 4, 4))
SOLUTION:
Row-bind df1 and df2:
df3 <- rbind(df1, df2)
Remove the duplicate rows:
df3 <- df3[!duplicated(df3), ]
RESULT:
df3
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
5 2 2 3 4
6 3 2 3 4
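As an aside, the reason the original merge() exploded past a million rows is almost certainly duplicate keys: merge() returns every pairwise combination of matching rows, so a value of a occurring m times in df1 and n times in df2 contributes m * n rows to the result. A tiny illustration:
x <- data.frame(a = c(1, 1), v1 = c("p", "q"))
y <- data.frame(a = c(1, 1, 1), v2 = c("r", "s", "t"))
nrow(merge(x, y, by = "a"))
# [1] 6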
With tidyverse, we can do bind_rows and distinct
library(dplyr)
bind_rows(df1, df2) %>%
distinct
data
df1 <- structure(list(a = c(1, 4, 9), b = c(2, 3, 7), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
It is also possible with dplyr::union(), which combines the rows and drops duplicates in one step:
dplyr::union(df1, df2)
Here is another base R solution using rbind + %in%, which appends only the rows of df2 whose a is not already in df1:
dfout <- rbind(df1,subset(df2,!a %in% df1$a))
such that
> rbind(df1,subset(df2,!a %in% df1$a))
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
21 2 2 3 4
31 3 2 3 4

Replace all NA values for variable with one row equal to 0

This is slightly difficult to phrase; as far as I saw, none of the similar questions answered my problem.
I have a data.frame such as:
df1 <- data.frame(id = rep(c("a", "b", "c"), each = 4),
                  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3))
df1
id val
1 a NA
2 a NA
3 a NA
4 a NA
5 b 1
6 b 2
7 b 2
8 b 3
9 c NA
10 c 2
11 c NA
12 c 3
and I want to get rid of all the NA values (easy enough using e.g. filter()), but make sure that if this removes every row of one id (in this case it removes every instance of "a"), one extra row is inserted with that id and (e.g.) val = 0,
so that:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c 2
7 c 3
It's obviously easy enough to do this in a roundabout way, but I was wondering if there's a tidy/elegant way. I thought tidyr::complete() might help, but I'm not entirely sure how to apply it to a case like this.
I don't care about the order of the rows
Cheers!
Edit: updated with a clearer desired output; this might make answers submitted before the edit a bit less clear.
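Since the question mentions tidyr::complete(), here is a minimal sketch of that route (it relies on id being a factor, as in the example, so that its unused levels are restored after filtering; row order aside, it matches the desired output):
library(dplyr)
library(tidyr)
df1 %>%
  filter(!is.na(val)) %>%
  complete(id, fill = list(val = 0))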
Another idea using dplyr,
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(val = ifelse(row_number() == 1 & all(is.na(val)), 0, val)) %>%
  na.omit()
which gives,
# A tibble: 5 x 2
# Groups: id [2]
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
We may do
df1 %>% group_by(id) %>% do(if(all(is.na(.$val))) replace(.[1, ], 2, 0) else na.omit(.))
# A tibble: 5 x 2
# Groups: id [2]
# id val
# <fct> <dbl>
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
After grouping by id: if everything in val is NA, we keep only the first row with its second element replaced by 0; otherwise the same data is returned after applying na.omit.
In a more readable format that would be
df1 %>% group_by(id) %>%
do(if(all(is.na(.$val))) data.frame(id = .$id[1], val = 0) else na.omit(.))
(Here I presume that you indeed want to get rid of all NA values; otherwise there is no need for na.omit.)
# Replace every NA with 0, then keep a row only if it is the first
# occurrence of its id or has a non-zero val
df1[is.na(df1)] <- 0
df1[!(duplicated(df1$id) & df1$val == 0), ]
id val
1 a 0
5 b 1
6 b 2
7 b 2
8 b 3
A base R option is to find the groups that are all NA, transform them by changing their val to 0, and keep only the unique rows so that there is one row per such group. We then rbind this dataframe with the groups which are not all NA.
all_NA <- with(df1, ave(is.na(val), id, FUN = all))
rbind(unique(transform(df1[all_NA, ], val = 0)), df1[!all_NA, ])
# id val
#1 a 0
#5 b 1
#6 b 2
#7 b 2
#8 b 3
The dplyr option looks ugly, but one way is to build two sets of groups: one with the groups containing no NA values and one with the groups that are entirely NA. For the all-NA groups we add a single row with the id and val = 0, then bind the two together.
library(dplyr)
bind_rows(df1 %>%
            group_by(id) %>%
            filter(all(!is.na(val))),
          df1 %>%
            group_by(id) %>%
            filter(all(is.na(val))) %>%
            ungroup() %>%
            summarise(id = unique(id),
                      val = 0)) %>%
  arrange(id)
# id val
# <fct> <dbl>
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Changed the df to make the example more exhaustive:
df1 <- data.frame(id = rep(c("a", "b", "c"), each = 4),
                  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3))
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(case = sum(is.na(val)) == n(), row_num = row_number()) %>%
  mutate(val = ifelse(is.na(val) & case, 0, val)) %>%
  filter(!(case & row_num != 1)) %>%
  select(id, val)
Output
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Another base approach, one that doesn't maintain the order of the rows and takes advantage of factors remembering lost values:
df1 <- na.omit(df1)
df1 <- rbind(
  df1,
  data.frame(
    id = levels(df1$id)[!levels(df1$id) %in% df1$id],
    val = 0)
)
I personally prefer the dplyr approach given by Sotos, as I don't like rbind-ing data.frames back together, so it's a matter of taste, but this isn't unbearably complicated to my eye. It's easy enough to adapt to a character id column with a unique(df1$id) variable.
Here is an option too:
df1 %>%
  mutate_if(is.factor, as.character) %>%
  mutate_all(funs(replace(., is.na(.), 0))) %>%
  slice(4:nrow(.))
This gives:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
Alternative:
df1 %>%
  mutate_if(is.factor, as.character) %>%
  mutate_all(funs(replace(., is.na(.), 0))) %>%
  unique()
UPDATE based on other requirements:
Some users suggested testing on this dataframe. This answer assumes you'll inspect the data by hand, which might be less useful at scale, but here goes:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4), val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1 %>%
  mutate_if(is.factor, as.character) %>%
  mutate(val = ifelse(id == "a", 0, val)) %>%
  slice(4:nrow(.))
This yields:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Here is a base R solution.
res <- lapply(split(df1, df1$id), function(DF){
  if(anyNA(DF$val)) {
    i <- is.na(DF$val)
    DF$val[i] <- 0
    DF <- rbind(DF[i & !duplicated(DF[i, ]), ], DF[!i, ])
  }
  DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# id val
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Edit.
A dplyr solution could be the following.
It was tested with the original dataset posted by the OP, with the dataset in Vivek Kalyanarangan's answer and with the dataset in markus' comment, renamed df2 and df3, respectively.
library(dplyr)
na2zero <- function(DF){
  DF %>%
    group_by(id) %>%
    mutate(val = ifelse(is.na(val), 0, val),
           crit = val == 0 & duplicated(val)) %>%
    filter(!crit) %>%
    select(-crit)
}
na2zero(df1)
na2zero(df2)
na2zero(df3)
One may try this:
df1 <- data.frame(id = rep(c("a", "b", "c"), each = 4),
                  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3))
df1
# id val
#1 a NA
#2 a NA
#3 a NA
#4 a NA
#5 b 1
#6 b 2
#7 b 2
#8 b 3
#9 c NA
#10 c 2
#11 c NA
#12 c 3
The task is to remove all rows for an id if and only if val for that id is all NAs, and to add a new row with this id and val = 0.
In this example, id = a.
Note: val for c also has NAs, but not all of c's values are NA, so we only remove the rows for c where val is NA.
So let's create another column, say val2, where 0 indicates the id is all NAs and 1 otherwise.
library(dplyr)
df1 <- df1 %>%
  group_by(id) %>%
  mutate(val2 = if_else(condition = all(is.na(val)), true = 0, false = 1))
df1
# A tibble: 12 x 3
# Groups: id [3]
# id val val2
# <fct> <dbl> <dbl>
#1 a NA 0
#2 a NA 0
#3 a NA 0
#4 a NA 0
#5 b 1 1
#6 b 2 1
#7 b 2 1
#8 b 3 1
#9 c NA 1
#10 c 2 1
#11 c NA 1
#12 c 3 1
Get the list of ids whose val is NA throughout.
all_na = unique(df1$id[df1$val2 == 0])
Then remove the rows with val = NA from the dataframe df1.
df1 = na.omit(df1)
df1
# A tibble: 6 x 3
# Groups: id [2]
# id val val2
# <fct> <dbl> <dbl>
# 1 b 1 1
# 2 b 2 1
# 3 b 2 1
# 4 b 3 1
# 5 c 2 1
# 6 c 3 1
And create a new dataframe with the ids in all_na and val = 0:
all_na_df = data.frame(id = all_na, val = 0)
all_na_df
# id val
# 1 a 0
Then combine these two dataframes:
df1 = bind_rows(all_na_df, df1[,c('id', 'val')])
df1
# id val
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
# 6 c 2
# 7 c 3
Hope this helps; edits are most welcome :-)

Unequal rows in list from unstack() - how to create a dataframe

I am trying to do a robust ANOVA analysis in R. This requires that my two variables are in a very specific format. Basically, I need to unstack two columns in my current dataframe and form an outcome-frequency dataframe based on the predictor (categorical variable). This would usually happen automatically using the unstack() function, i.e.
newDataFrame <- unstack(oldDataFrame, scores ~ columns)
However, the returned list has an unequal number of rows per category. Here is an example:
$A
[1] 2 4 2 3 3
$B
[1] 3 3
$C
[1] 5
$D
[1] 4 4 3
A, B, C and D are my categories, and the numbers are the outcome. The outcome has to be 1, 2, 3, 4, 5 or 6.
What I am working towards is the categories as the headers and the outcome as a reference column, with the frequencies in the body, so that the dataframe looks like this:
A B C D
1 NA NA NA NA
2 2 NA NA NA
3 2 2 NA 1
4 1 NA NA 2
5 NA NA 1 NA
6 NA NA NA NA
What I have tried:
On another SO post, I found this:
library(stringi)
res <- as.data.frame(t(stri_list2matrix(myUnstackedList)))
colnames(res) <- unique(unlist(sapply(myUnstackedList, names)))
Outcome:
res
1 2 4 2 3 3
2 3 3 <NA> <NA> <NA>
3 5 <NA> <NA> <NA> <NA>
4 4 4 3 <NA> <NA>
Note that the categories A, B, C, D have been changed to 1, 2, 3, 4
Also tried this (another SO post):
df <- as.data.frame(plyr::ldply(myUnstackedList, rbind))
Outcome:
df
outcome group score
2 A 2
3 A 2
4 A 1
3 B 2
etc
Any tips?
This gets you most of the way to your answer:
library(dplyr)
library(tidyr)
test <- list(A = c(2, 4, 2, 3, 3),
             B = c(3, 3),
             C = c(5),
             D = c(4, 4, 3))
test <- lapply(1:length(test), function(i){
  x <- data.frame(names(test)[i], test[i],
                  stringsAsFactors = FALSE)
  names(x) <- c("ID", "Value")
  x})
test <- bind_rows(test) %>% table %>% as.data.frame
test <- spread(test, key = ID, value = Freq)
replace(test, test == 0, NA)
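The question's desired output also keeps outcomes 1 and 6 as all-NA rows even though they never occur. A hedged tweak for that (written against the question's original list, here called myUnstackedList): make the outcome a factor over the full 1:6 scale before tabulating, since table() keeps empty factor levels.
library(tidyr)
# Long format: one row per observation, outcome as a factor over 1:6
long <- data.frame(ID = rep(names(myUnstackedList), lengths(myUnstackedList)),
                   Value = factor(unlist(myUnstackedList), levels = 1:6))
res <- spread(as.data.frame(table(long)), key = ID, value = Freq)
replace(res, res == 0, NA)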
I'm not sure what the issue was with your previous dplyr attempt; however, I offer:
library(tidyr)
library(dplyr)
df <- tibble(
  outcome = c(1:5, 1:2, 1, 1:3),
  group = c(rep("A", 5), rep("B", 2), "C", rep("D", 3)),
  score = c(2, 4, 2, 3, 3, 3, 3, 5, 4, 4, 3)
)
df %>%
  group_by(outcome) %>%
  spread(group, score) %>%
  ungroup() %>%
  select(-outcome)
# # A tibble: 5 x 4
# A B C D
# * <dbl> <dbl> <dbl> <dbl>
# 1 2 3 5 4
# 2 4 3 NA 4
# 3 2 NA NA 3
# 4 3 NA NA NA
# 5 3 NA NA NA
