Combine different rows' values in a table in R

I need to re-format a table in R.
I have a table like this.
ID category
1 a
1 b
2 c
3 d
4 a
4 c
5 a
And I want to reform it as
ID category1 category2
1 a b
2 c null
3 d null
4 a c
5 a null
Is this doable in R?

This is a very straightforward "long to wide" type of reshaping problem, but you need a secondary "id" (or "time") variable.
You can try getanID from my "splitstackshape" package together with dcast to reshape from long to wide. getanID creates a new column called ".id" that can be used as your "time" variable:
library(splitstackshape)
dcast.data.table(getanID(mydf, "ID"), ID ~ .id, value.var = "category")
# ID 1 2
# 1: 1 a b
# 2: 2 c NA
# 3: 3 d NA
# 4: 4 a c
# 5: 5 a NA
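If you want the column names from the desired output (category1, category2) rather than 1 and 2, one option is to rewrite the counter before casting; a sketch building on the same getanID idea (the tmp name is just for illustration):
library(splitstackshape)               # attaches data.table as well
tmp <- getanID(mydf, "ID")             # adds the within-ID counter column ".id"
tmp[, .id := paste0("category", .id)]  # turn 1, 2 into "category1", "category2"
dcast(tmp, ID ~ .id, value.var = "category")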

Same as Ananda's, but using dplyr and tidyr:
library(tidyr)
library(dplyr)
mydf %>%
  group_by(ID) %>%
  mutate(cat_row = paste0("category", 1:n())) %>%
  spread(key = cat_row, value = category)
# Source: local data frame [5 x 3]
#
# ID category1 category2
# 1 1 a b
# 2 2 c NA
# 3 3 d NA
# 4 4 a c
# 5 5 a NA
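spread() is now superseded in tidyr; a sketch of the same approach using pivot_wider() (assuming tidyr >= 1.0.0):
library(dplyr)
library(tidyr)
mydf %>%
  group_by(ID) %>%
  mutate(cat_row = paste0("category", row_number())) %>%  # within-group counter
  ungroup() %>%
  pivot_wider(names_from = cat_row, values_from = category)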

Related

de-duplicate rows and align columns in R

I have a dataframe where there are duplicate samples, but the reason for this is that only one variable appears per row:
Sample Var1 Var2
A      1    NA
B      NA   1
A      NA   3
C      NA   2
C      5    NA
B      4    NA
I would like to end up with the row names de-duplicated and corresponding column values side-by-side:
Sample Var1 Var2
A      1    3
B      4    1
C      5    2
I've tried the group_by() function and that failed miserably!
I would very much appreciate any assistance and happy to clarify anything further if required!
We could use group_by and summarise for this task. Getting the max() will give us the desired output:
library(dplyr)
df %>%
  group_by(Sample) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
Sample Var1 Var2
<chr> <int> <int>
1 A 1 3
2 B 4 1
3 C 5 2
data.table approach
library(data.table)
DT <- fread("Sample Var1 Var2
A 1 NA
B NA 1
A NA 3
C NA 2
C 5 NA
B 4 NA")
# or setDT(DT) if DT is not a data.table format
# melt to long format, and remove NA's
DT.melt <- melt(DT, id.vars = "Sample", na.rm = TRUE)
# cast to wide again
dcast(DT.melt, Sample ~ variable, fill = NA)
# Sample Var1 Var2
# 1: A 1 3
# 2: B 4 1
# 3: C 5 2
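For reference, the same melt-then-cast idea can also be written with tidyr's pivot functions (a sketch, assuming tidyr >= 1.0.0):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-Sample, values_drop_na = TRUE) %>%     # long format, dropping NAs
  pivot_wider(names_from = name, values_from = value)  # back to one row per Sample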
Using collapse
library(collapse)
fmax(df1[-1], g = df1$Sample)
Var1 Var2
A 1 3
B 4 1
C 5 2
Or in dplyr 1.1.0, we can also use .by in reframe
df1 %>%
  reframe(across(where(is.numeric), ~ max(.x, na.rm = TRUE)), .by = 'Sample')
Sample Var1 Var2
1 A 1 3
2 B 4 1
3 C 5 2

Turning a data frame and a list into long format with dplyr

Here is a puzzle.
Assume you have a data frame and a list. The list has as many elements as the df has rows:
library(purrr)
dd <- data.frame(ID=1:3, Name=LETTERS[1:3])
dl <- map(4:6, rnorm) %>% set_names(letters[1:3])
Is there a simple way (preferably with dplyr / tidyverse) to make a long format, such that the elements of the list are joined with the corresponding rows of the data frame? Here is what I have in mind illustrated with not-so-elegant way:
rows <- map(1:length(dl), ~ rep(., length(dl[[.]]))) %>% unlist()
dd <- dd[rows,]
dd$value <- unlist(dl)
As you can see, for each vector in dl, we replicated the corresponding row as many times as necessary to accommodate each value.
In base R, you can get your result with stack followed by merge:
res <- merge(stack(dl), dd, by.x="ind", by.y="Name")
head(res)
# ind values ID
#1 A -0.79616693 1
#2 A 0.37720953 1
#3 A 1.30273712 1
#4 A 0.19483859 1
#5 B 0.18770716 2
#6 B -0.02226917 2
NB: I assumed the names of dl were meant to be uppercase, but if they are indeed lowercase, the following line should be used instead:
res <- merge(stack(setNames(dl, toupper(names(dl)))), dd, by.x="ind", by.y="Name")
Since a dplyr solution has already been provided, another option is to subset dl for each Name value in dd using data.table grouping
library(data.table)
setDT(dd)
dd[, .(values = dl[[tolower(Name)]]), by = .(ID, Name)]
# ID Name values
# 1: 1 A -1.09633600
# 2: 1 A -1.26238190
# 3: 1 A 1.15220845
# 4: 1 A -1.45741071
# 5: 2 B -0.49318131
# 6: 2 B 0.59912670
# 7: 2 B -0.73117632
# 8: 2 B -1.09646143
# 9: 2 B -0.79409753
# 10: 3 C -0.08205888
# 11: 3 C 0.21503398
# 12: 3 C -1.17541571
# 13: 3 C -0.10020616
# 14: 3 C -1.01152362
# 15: 3 C -1.03693337
We can create a list column and unnest
library(tidyverse)
dd %>%
  mutate(value = dl) %>%
  unnest()
# ID Name value
#1 1 A 1.57984385
#2 1 A 0.66831102
#3 1 A -0.45472145
#4 1 A 2.33807619
#5 2 B 1.56716709
#6 2 B 0.74982763
#7 2 B 0.07025534
#8 2 B 1.31174561
#9 2 B 0.57901536
#10 3 C -1.36629653
#11 3 C -0.66437155
#12 3 C 2.12506187
#13 3 C 1.20220402
#14 3 C 0.10687018
#15 3 C 0.15973401
Note that if compactness of code is the criterion, we can drop the %>% entirely:
unnest(mutate(dd, value = dl))
Or another option is uncount and mutate
dd %>%
  uncount(lengths(dl)) %>%
  mutate(value = flatten_dbl(unname(dl)))
If a join based on the names of 'dl' is needed:
enframe(dl, name = 'Name') %>%
  mutate(Name = toupper(Name)) %>%
  left_join(dd) %>%
  unnest()
In base R, we can replicate the rows of 'dd' with lengths of 'dl' and transform to create the 'value' as unlisted 'dl'
transform(dd[rep(seq_len(nrow(dd)), lengths(dl)),], value = unlist(dl))

count by all variables / count distinct with dplyr

Say I have this data.frame :
library(dplyr)
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
# x y
# 1 a a
# 2 b b
# 3 b b
# 4 c c
# 5 c c
# 6 c c
I can group and count easily by mentioning the names:
df1 %>%
  count(x, y)
# A tibble: 3 x 3
# x y n
# <fctr> <fctr> <int>
# 1 a a 1
# 2 b b 2
# 3 c c 3
How do I group by everything without mentioning individual column names, in the most compact/readable way?
We can pass the input itself to the ... argument and splice it with !!! :
df1 %>% count(., !!!.)
#> x y n
#> 1 a a 1
#> 2 b b 2
#> 3 c c 3
With base R we could do: aggregate(setNames(df1[1], "n"), df1, length)
For those who wouldn't get the voodoo you are using in the accepted answer, if you don't need to use dplyr, you can do it with data.table:
setDT(df1)
df1[, .N, names(df1)]
# x y N
# 1: a a 1
# 2: b b 2
# 3: c c 3
Have you considered the (now superseded) group_by_all()?
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
df1 %>% group_by_all() %>% count
df1 %>% group_by(across()) %>% count()
df1 %>% count(across()) # don't know why this returns a data.frame and not tibble
See the colwise vignette "other verbs" section for explanation... though honestly I get turned around myself sometimes.

Fill sequence by factor

I need to fill in the missing values of the $Year sequence within each level of the $Country factor. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>%
  group_by(Country) %>%
  complete(Year = full_seq(Year, 1), fill = list(Count = 0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
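It does: the tidyverse counterpart is tidyr::complete (used in the first answer above), and its fill list simply grows with the number of columns to pad. A sketch, where Other is a hypothetical extra column:
library(dplyr)
library(tidyr)
d %>%                       # d: the original data, here assumed to also have an Other column
  group_by(Country) %>%
  complete(Year = full_seq(Year, 1),
           fill = list(Count = 0, Other = 0))  # one default per column to pad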
Another base R idea is to split on Country, use setdiff to find the missing values from seq(max(Year)), and rbind them to the original data frame. Use do.call to rbind the list back into a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
  x <- rbind(i, data.frame(Country = i$Country[1],
                           Year = setdiff(seq(max(i$Year)), i$Year),
                           Count = 0))
  x[with(x, order(Year)), ]
}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
setkey(DT, Country, Year)
DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  group_by(Country) %>%
  do(data_frame(Country = unique(.$Country),
                Year = full_seq(.$Year, 1))) %>%
  full_join(dt, by = c("Country", "Year")) %>%
  replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the named list that is returned, merges this onto the original (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1

Column Split based on certain separator format

I have data in a delimited format that I need to split:
df = data.frame(id=c(1,2),name=c('A~B~C','A~B~D'),value=c('1~2~3','1~~2'))
id name value
1 A~B~C 1~2~3
2 A~B~D 1~~2
The expected output is below, where each column name is the original column name followed by the text from the name column:
id value_A value_B value_C value_D
1  1       2       3
2  1                       2
I managed to achieve the splitting by using many nested for loops to process my data row by row. It works on small sample data, but once the data gets huge, the time is an issue.
Also, there could be more than one value column, but they should all map onto the same name column.
Example output:
id value_A value_B value_C value1_A value1_B value1_C
1 1 2 3 1 2 3
2 1 2 3 1 2 3
You can try tidyr:
library(tidyverse)
df %>%
  separate_rows(name, value, sep = "~") %>%
  spread(name, value)
  id A B    C    D
1  1 1 2    3 <NA>
2  2 1   <NA>    2
Instead of NA, you can fill empty cells with anything you specify via fill = ""
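For example, to get empty strings instead of NA (a sketch of the same pipeline, with the tidyverse loaded as above):
df %>%
  separate_rows(name, value, sep = "~") %>%
  spread(name, value, fill = "")   # empty cells become "" rather than NA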
Or base R and reshape2:
a <- strsplit(as.character(df$name), "~")
b <- strsplit(as.character(df$value), "~")
df2 <- do.call(rbind.data.frame, Map(cbind, df$id, a, b))
library(reshape2)
dcast(df2, V1~V2, value.var = "V3")
  V1 A B    C    D
1  1 1 2    3 <NA>
2  2 1   <NA>    2
Here is an option using cSplit/dcast. Split the rows into 'long' format with cSplit and dcast back to 'wide' format:
library(splitstackshape)
dcast(cSplit(df, c('name', 'value'), '~', 'long')[!is.na(value)],
      id ~ paste0('value_', name))
# id value_A value_B value_C value_D
#1: 1 1 2 3 NA
#2: 2 1 NA NA 2
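For completeness, a sketch of the same split-then-widen idea with current tidyr (pivot_wider instead of spread), which also produces the value_ prefixed column names from the question; treating the blank piece of "1~~2" as missing via na_if is an assumption:
library(dplyr)
library(tidyr)
df %>%
  mutate(across(c(name, value), as.character)) %>%  # in case these columns are factors
  separate_rows(name, value, sep = "~") %>%
  mutate(value = na_if(value, "")) %>%              # the empty piece becomes NA
  pivot_wider(names_from = name, values_from = value, names_prefix = "value_")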
