Best way to unmelt dataframe in r [duplicate] - r

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 2 years ago.
My dataframe looks like
df <- data.frame(Role = c("a","a","b", "b", "c", "c"), Men = c(1,0,3,1,2,4), Women = c(2,1,1,4,3,1))
df.melt <- melt(df)
I only have access to the version that looks like df.melt. How do I get it back into the df form?
Using dcast just gives me errors; I can't figure out its syntax.

We need a sequence column to identify the rows, as the melted data contains duplicates within each level of 'variable':
library(tidyr)
library(dplyr)
library(data.table)
df.melt %>%
mutate(rn = rowid(variable)) %>%
pivot_wider(names_from = variable, values_from = value) %>%
select(-rn)
# A tibble: 6 x 3
# Role Men Women
# <chr> <dbl> <dbl>
#1 a 1 2
#2 a 0 1
#3 b 3 1
#4 b 1 4
#5 c 2 3
#6 c 4 1
If "best" means most efficient, dcast from data.table is fast:
library(data.table)
dcast(setDT(df.melt), rowid(variable) + Role ~
variable, value.var = 'value')[, variable := NULL][]
# Role Men Women
#1: a 1 2
#2: a 0 1
#3: b 3 1
#4: b 1 4
#5: c 2 3
#6: c 4 1
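To see why the extra column is needed, here is a minimal base R sketch. It rebuilds df.melt by hand (an assumption: the real reshape2::melt output stores variable as a factor, whereas this copy uses a plain character column) and uses ave as a stand-in for data.table's rowid:

```r
# Hand-built copy of the melted data (assumed to match melt(df) apart from column types)
df.melt <- data.frame(
  Role     = rep(c("a", "a", "b", "b", "c", "c"), times = 2),
  variable = rep(c("Men", "Women"), each = 6),
  value    = c(1, 0, 3, 1, 2, 4, 2, 1, 1, 4, 3, 1)
)

# Role + variable does not identify a row: each pair occurs twice
dup_before <- any(duplicated(df.melt[c("Role", "variable")]))

# Running counter within each level of variable
df.melt$rn <- ave(seq_len(nrow(df.melt)), df.melt$variable, FUN = seq_along)

# rn + variable is unique, so a cast keyed on rn is unambiguous
dup_after <- any(duplicated(df.melt[c("rn", "variable")]))
```

With rn in place, every cell of the wide table receives exactly one value, so no aggregation function is needed.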

Here is a base R option using unstack
cbind(
Role = df.melt[1:(nrow(df.melt) / length(unique(df.melt$variable))), 1],
unstack(rev(df.melt[-1]))
)
which gives
Role Men Women
1 a 1 2
2 a 0 1
3 b 3 1
4 b 1 4
5 c 2 3
6 c 4 1
Another option is using reshape
subset(
reshape(
transform(
df.melt,
id = ave(1:nrow(df.melt), Role, variable, FUN = seq_along)
),
direction = "wide",
idvar = c("Role", "id"),
timevar = "variable"
),
select = -id
)
which gives
Role value.Men value.Women
1 a 1 2
2 a 0 1
3 b 3 1
4 b 1 4
5 c 2 3
6 c 4 1
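If the value. prefix on the column names is unwanted, it can be stripped afterwards. A small sketch, where wide is a hypothetical name for the reshape result, rebuilt by hand here so the snippet runs on its own:

```r
# Hypothetical stand-in for the output of the reshape() call
wide <- data.frame(
  Role        = c("a", "a", "b", "b", "c", "c"),
  value.Men   = c(1, 0, 3, 1, 2, 4),
  value.Women = c(2, 1, 1, 4, 3, 1)
)

# Drop the "value." prefix that reshape() adds to the widened columns
names(wide) <- sub("^value\\.", "", names(wide))
```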

Related

Frequency count for multiple columns with same values

I'd like to make a frequency count individually for multiple columns with same possible values. The idea is to keep all columns from original data table, just adding a new one for levels and aggregating.
Here is an example of input data:
foo <- data.table(a = c(1,3,2,3,3), b = c(2,3,3,1,1), c = c(3,1,2,3,2))
# a b c
#1: 1 2 3
#2: 3 3 1
#3: 2 3 2
#4: 3 1 3
#5: 3 1 2
And desired output:
data.table(levels = 1:3, a = c(1,1,3), b = c(2,1,2), c = c(1,2,2))
# levels a b c
#1: 1 1 2 1
#2: 2 1 1 2
#3: 3 3 2 2
Thanks for helping !
We may use
library(data.table)
dcast(melt(foo)[, .N, .(variable, levels = value)],
levels ~ variable, value.var = 'N')
Output:
Key: <levels>
levels a b c
<num> <int> <int> <int>
1: 1 1 2 1
2: 2 1 1 2
3: 3 3 2 2
Or using base R
table(stack(foo))
ind
values a b c
1 1 2 1
2 1 1 2
3 3 2 2
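The table object is not a data frame with a levels column. A possible way to get closer to the desired output, sketched with a plain data.frame copy of foo (the names tab, counts and res are illustrative only):

```r
# Plain data.frame copy of the example input
foo <- data.frame(a = c(1, 3, 2, 3, 3), b = c(2, 3, 3, 1, 1), c = c(3, 1, 2, 3, 2))

# Cross-tabulate stacked values (rows) against source column (columns)
tab <- table(stack(foo))

# Convert the 2-d table to a data frame, one column per original column
counts <- as.data.frame.matrix(tab)

# Move the row names into an explicit levels column
res <- cbind(levels = as.integer(rownames(counts)), counts)
rownames(res) <- NULL
```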
You could also use recast from reshape2:
reshape2::recast(foo, value~variable)
# No id variables; using all as measure variables
# Aggregation function missing: defaulting to length
value a b c
1 1 1 2 1
2 2 1 1 2
3 3 3 2 2
or even
reshape2::recast(foo, value~variable, length)
Here is an option using purrr and dplyr from the tidyverse:
library(purrr)
library(dplyr)
foo %>%
imap(~ as.data.frame(table(.x, dnn = "levels"), responseName = .y)) %>%
reduce(left_join, by = "levels")
Alternatively, you could use the pivot functions from tidyr:
library(dplyr)
library(tidyr)
foo %>%
pivot_longer(everything(),
values_to = "levels") %>%
count(name, levels) %>%
pivot_wider(id_cols = levels,
names_from = name,
values_from = n)
Or with data.table and the native pipe:
foo |>
melt() |>
dcast(value ~ variable, fun.aggregate = length)
# value a b c
# 1: 1 1 2 1
# 2: 2 1 1 2
# 3: 3 3 2 2

Manipulating large dataset with dcast

Apologies if this is a repeat question but I could not find the specific answer I am looking for. I have a dataframe with counts of different species caught on a given trip. A simplified example with 5 trips and 4 species is below:
trip = c(1,1,1,2,2,3,3,3,3,4,5,5)
species = c("a","b","c","b","d","a","b","c","d","c","c","d")
count = c(5,7,3,1,8,10,1,4,3,1,2,10)
dat = cbind.data.frame(trip, species, count)
dat
> dat
trip species count
1 1 a 5
2 1 b 7
3 1 c 3
4 2 b 1
5 2 d 8
6 3 a 10
7 3 b 1
8 3 c 4
9 3 d 3
10 4 c 1
11 5 c 2
12 5 d 10
I am only interested in the counts of species b for each trip. So I want to manipulate this data frame so I end up with one that looks like this:
trip2 = c(1,2,3,4,5)
species2 = c("b","b","b","b","b")
count2 = c(7,1,1,0,0)
dat2 = cbind.data.frame(trip2, species2, count2)
dat2
> dat2
trip2 species2 count2
1 1 b 7
2 2 b 1
3 3 b 1
4 4 b 0
5 5 b 0
I want to keep all trips, including trips where species b was not observed. So I can't just subset the data by species b. I know I can cast the data so species are the columns and then just remove the columns for the other species like so:
library(dplyr)
library(reshape2)
test = dcast(dat, trip ~ species, value.var = "count", fun.aggregate = sum)
test
> test
trip a b c d
1 1 5 7 3 0
2 2 0 1 0 8
3 3 10 1 4 3
4 4 0 0 1 0
5 5 0 0 2 10
However, my real dataset has several hundred species caught on thousands of trips, and if I try to cast that many species to columns R chokes. There are way too many columns. Is there a way to specify in dcast that I only want to cast species b? Or is there another way to do this that doesn't require casting the data? Thank you.
Here is a data.table approach which I suspect will be very fast for you:
library(data.table)
setDT(dat)
result <- dat[,.(species = "b", count = sum(.SD[species == "b",count])),by = trip]
result
trip species count
1: 1 b 7
2: 2 b 1
3: 3 b 1
4: 4 b 0
5: 5 b 0
We can use tidyverse
library(dplyr)
library(tidyr)
dat %>%
filter(species == 'b') %>%
group_by(trip, species) %>%
summarise(count = sum(count)) %>%
ungroup %>%
complete(trip = unique(dat$trip), fill = list(species = 'b', count = 0))
# A tibble: 5 x 3
# trip species count
# <dbl> <chr> <dbl>
#1 1 b 7
#2 2 b 1
#3 3 b 1
#4 4 b 0
#5 5 b 0
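A base R variant of the same idea, given as a sketch rather than a benchmarked solution: zero out the counts of every other species and sum per trip, so the wide table is never built. The names b_count and dat2 are illustrative.

```r
trip    <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5)
species <- c("a", "b", "c", "b", "d", "a", "b", "c", "d", "c", "c", "d")
count   <- c(5, 7, 3, 1, 8, 10, 1, 4, 3, 1, 2, 10)
dat <- data.frame(trip, species, count)

# Multiplying by the logical mask keeps counts for "b" and turns the rest into 0,
# so every trip still contributes a group to tapply()
b_count <- tapply(dat$count * (dat$species == "b"), dat$trip, sum)

dat2 <- data.frame(
  trip    = as.numeric(names(b_count)),
  species = "b",
  count   = as.vector(b_count)
)
```

Trips 4 and 5 are kept with a count of 0 because their rows are still present when the groups are formed.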

R: reshape dataframe with duplicated variable names labeled var.1, var.2 [duplicate]

This question already has answers here:
R: reshaping wide to long [duplicate]
(1 answer)
Using tidyr to combine multiple columns [duplicate]
(1 answer)
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I'm hoping to reshape a dataframe in R so that a set of columns read in with duplicated names, and then renamed as var, var.1, var.2, anothervar, anothervar.1, anothervar.2 etc. can be treated as independent observations. I would like the number appended to the variable name to be used as the observation so that I can melt my data.
For example,
dat <- data.frame(ID=1:3, var=c("A", "A", "B"),
anothervar=c(5,6,7),var.1=c("C","D","E"),
anothervar.1 = c(1,2,3))
> dat
ID var anothervar var.1 anothervar.1
1 1 A 5 C 1
2 2 A 6 D 2
3 3 B 7 E 3
How can I reshape the data so it looks like the following:
ID obs var anothervar
1 1 A 5
1 2 C 1
2 1 A 6
2 2 D 2
3 1 B 7
3 2 E 3
Thank you for your help!
We can use melt from data.table, which accepts multiple patterns in its measure argument:
library(data.table)
melt(setDT(dat), measure = patterns("^var", "anothervar"),
variable.name = "obs", value.name = c("var", "anothervar"))[order(ID)]
# ID obs var anothervar
#1: 1 1 A 5
#2: 1 2 C 1
#3: 2 1 A 6
#4: 2 2 D 2
#5: 3 1 B 7
#6: 3 2 E 3
As for a tidyverse solution, we can use unite with gather
library(dplyr)
library(tidyr)
dat %>%
unite("1", var, anothervar) %>%
unite("2", var.1, anothervar.1) %>%
gather(obs, value, -ID) %>%
separate(value, into = c("var", "anothervar"))
# ID obs var anothervar
#1 1 1 A 5
#2 2 1 A 6
#3 3 1 B 7
#4 1 2 C 1
#5 2 2 D 2
#6 3 2 E 3
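For completeness, a base R sketch with reshape, which takes the column groups as an explicit list instead of patterns. The letters are quoted here so that the data frame can actually be built; long is an illustrative name.

```r
dat <- data.frame(ID = 1:3, var = c("A", "A", "B"),
                  anothervar = c(5, 6, 7), var.1 = c("C", "D", "E"),
                  anothervar.1 = c(1, 2, 3))

# Each list element names the columns that collapse into one long column
long <- reshape(dat, direction = "long", idvar = "ID",
                varying = list(c("var", "var.1"),
                               c("anothervar", "anothervar.1")),
                v.names = c("var", "anothervar"),
                timevar = "obs", times = 1:2)

# Sort by ID then observation, and drop the "1.1"-style row names
long <- long[order(long$ID, long$obs), ]
rownames(long) <- NULL
```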

Fill sequence by factor

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
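The question gives the table but no code to build it. A minimal sketch of the input, assuming plain character and numeric columns (the answers refer to it variously as df, d, DT, dt or dat), plus a quick check of which years are missing per country:

```r
# Example input from the question
df <- data.frame(Country = c("A", "A", "A", "B", "B"),
                 Year    = c(1, 2, 4, 1, 3),
                 Count   = c(1, 1, 2, 1, 1))

# Years absent from each country's min..max range
gaps <- lapply(split(df$Year, df$Country),
               function(y) setdiff(seq(min(y), max(y)), y))
```

Here gaps holds year 3 for country A and year 2 for country B, which are the two rows the desired output adds.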
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
A data.table option:
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to @PoGibas's answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea is to split on Country, use setdiff to find the values missing from seq(max(Year)), and rbind them to the original data frame. do.call then rbinds the list back into a single data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
A keyed data.table join also works, although the missing counts come back as NA rather than 0:
setkey(DT, Country, Year)
DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(tibble(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the named list that is returned, merges it onto the original (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1

How to Index subjects using R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 7 years ago.
I am working in R and I have a Data set that has multiple entries for each subject. I want to create an index variable that indexes by subject. For example:
Subject Index
1 A 1
2 A 2
3 B 1
4 C 1
5 C 2
6 C 3
7 D 1
8 D 2
9 E 1
The first A entry is indexed as 1, while the second A entry is indexed as 2. The first B entry is indexed as 1, etc.
Any help would be excellent!
Here's a quick data.table approach
library(data.table)
setDT(df)[, Index := seq_len(.N), by = Subject][]
# Subject Index
# 1: A 1
# 2: A 2
# 3: B 1
# 4: C 1
# 5: C 2
# 6: C 3
# 7: D 1
# 8: D 2
# 9: E 1
Or with base R
with(df, ave(seq_along(Subject), Subject, FUN = seq_along))
## [1] 1 2 1 1 2 3 1 2 1
Or with dplyr (don't run this on a data.table class)
library(dplyr)
df %>%
group_by(Subject) %>%
mutate(Index = row_number())
Using dplyr
library(dplyr)
df %>% group_by(Subject) %>% mutate(Index = 1:n())
You get:
#Source: local data frame [9 x 2]
#Groups: Subject
#
# Subject Index
#1 A 1
#2 A 2
#3 B 1
#4 C 1
#5 C 2
#6 C 3
#7 D 1
#8 D 2
#9 E 1
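None of the answers defines df. A self-contained base R sketch, assuming Subject is a character column, that also shows a second route through rle. The rle version only works when each subject's rows are contiguous, as they are in the example; idx_ave and idx_rle are illustrative names.

```r
# Example input from the question
df <- data.frame(Subject = c("A", "A", "B", "C", "C", "C", "D", "D", "E"))

# Counter within each subject, independent of row order
idx_ave <- ave(seq_along(df$Subject), df$Subject, FUN = seq_along)

# Run lengths of consecutive equal subjects, expanded to 1..n per run
# (assumes the rows of each subject sit next to each other)
idx_rle <- sequence(rle(df$Subject)$lengths)

df$Index <- idx_ave
```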
