R: How to reshape to wide based on multiple columns

I asked a similar question to this here:
Previous post
Now my dataset has expanded a little, so I want to preserve two columns of data in the long format. Sorry, I couldn't figure out how to extend the answers provided there to this situation.
> id <- c(1000, 1000, 1000, 1001, 1001, 1001)
> type <- c("A", "B", "B", "C", "C", "A")
> zipcode <- c(14201, 32940, 94105, 22020, 94104, 14201)
> dates <- c("10/5/2019", "10/5/2019", "10/5/2019", "9/17/2020", "9/17/2020", "9/17/2020")
> df <- as.data.frame(cbind(id, type, dates, zipcode))
> df
id type dates zipcode
1 1000 A 10/5/2019 14201
2 1000 B 10/5/2019 32940
3 1000 B 10/5/2019 94105
4 1001 C 9/17/2020 22020
5 1001 C 9/17/2020 94104
6 1001 A 9/17/2020 14201
I would like df to look something like this (it doesn't have to be exactly the same):

You can try reshape like below:
reshape(
  transform(
    df,
    q = ave(1:nrow(df), id, dates, FUN = seq_along)
  ),
  direction = "wide",
  idvar = c("id", "dates"),
  timevar = "q"
)
which gives
id dates type.1 zipcode.1 type.2 zipcode.2 type.3 zipcode.3
1 1000 10/5/2019 A 14201 B 32940 B 94105
4 1001 9/17/2020 C 22020 C 94104 A 14201
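Here the ave() call simply builds a within-group row counter that reshape() then uses as the time variable; a quick toy illustration:
# seq_along numbers the rows within each group
ave(1:6, c(1, 1, 1, 2, 2, 2), FUN = seq_along)
# [1] 1 2 3 1 2 3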

Using data.table
library(data.table)
dcast(setDT(df), id + dates ~ rowid(id, dates), value.var = c('type', 'zipcode'))
# id dates type_1 type_2 type_3 zipcode_1 zipcode_2 zipcode_3
#1: 1000 10/5/2019 A B B 14201 32940 94105
#2: 1001 9/17/2020 C C A 22020 94104 14201
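rowid(id, dates) plays the same role here, generating the within-group sequence that becomes the _1/_2/_3 suffix; a quick illustration with a toy vector:
# rowid() numbers rows within each combination of its arguments
rowid(c(1, 1, 2, 2, 2))
# [1] 1 2 1 2 3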

A tidyverse approach can be:
library(tidyverse)
#Code
df2 <- df %>%
  pivot_longer(-c(id, dates)) %>%
  group_by(id, name) %>%
  mutate(name = paste0(name, 1:n())) %>%
  pivot_wider(names_from = name, values_from = value)
Output:
# A tibble: 2 x 8
# Groups: id [2]
id dates type1 zipcode1 type2 zipcode2 type3 zipcode3
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1000 10/5/2019 A 14201 B 32940 B 94105
2 1001 9/17/2020 C 22020 C 94104 A 14201
Update: In case the data types are troublesome, you can set a common format for all variables first.
#Code 2
df2 <- df %>%
  mutate(across(everything(), ~ as.character(.))) %>%
  pivot_longer(-c(id, dates)) %>%
  group_by(id, name) %>%
  mutate(name = paste0(name, 1:n())) %>%
  pivot_wider(names_from = name, values_from = value)
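If you prefer to skip the intermediate long format, pivot_wider() can also take both value columns at once; a sketch assuming the same df:
library(dplyr)
library(tidyr)
df %>%
  group_by(id, dates) %>%
  mutate(q = row_number()) %>% # within-group counter, as in the reshape answer
  pivot_wider(names_from = q, values_from = c(type, zipcode))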

Related

Subset and group dataframe by matching columns and values R

I have 2 data frames; df1 contains a GroupID and continuous variables like so:
GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043
And df2 contains cutoff values (ct) for each variable:
Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294
What I want to do is, for each variable in df1, find the number of rows where the value is greater than the cutoff value in its associated column in df2, and return that count for each GroupID, so the output would look like this:
GroupID N-Var1 N-Var2 N-Var3 N-Var4
1 62 78 33 99
2 69 25 77 12
3 55 45 27 62
df1 is ~2 million rows unevenly distributed by GroupID, with 30 variable columns I need the count for. I am just looking for a more efficient way than typing out the same function for all 30 variables.
Here's a way in dplyr:
library(dplyr)
df1 %>%
  group_by(GroupID) %>%
  summarise(across(everything(), ~ sum(.x > df2[grepl(cur_column(), colnames(df2))][, 1])))
GroupID Var1 Var2 Var3 Var4
<int> <int> <int> <int> <int>
1 1 1 1 0 2
2 2 1 2 0 2
3 3 1 2 0 2
data
df1 <- read.table(header = T, text = "GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043 ")
df2 <- read.table(header = T, text = "Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")
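For data this size, a vectorised base R sketch may also be worth trying, assuming df2's columns line up with df1's Var columns in order:
m <- as.matrix(df1[-1])                # drop GroupID, keep the Var columns
hits <- sweep(m, 2, unlist(df2), `>`)  # logical matrix: value > its column's cutoff
rowsum(hits + 0, df1$GroupID)          # count of TRUEs per GroupID and variable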
A data.table approach that should scale well:
library(data.table)
# if df1 and df2 are not data.tables, use
# setDT(df1); setDT(df2)
# we need matching column names in df1 and df2 to join easily
setnames(df2, names(df1)[2:5])
# melt df1 and df2 to long format
df1.long <- melt(df1, id.vars = "GroupID")
df2.long <- melt(df2, measure.vars = names(df2))
# join ct-values
df1.long[df2.long, ct := i.value, on = .(variable)]
# summarise
ans <- df1.long[, sum(value > ct), by = .(GroupID, variable)]
# cast to wide
dcast(ans, GroupID ~ variable, value.var = "V1")
# GroupID Var1 Var2 Var3 Var4
# 1: 1 1 1 0 2
# 2: 2 1 2 0 2
# 3: 3 1 2 0 2
sample data
df1 <- fread("GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043 ")
df2 <- fread("Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")

Counting the average of duplicates per id in R

My data looks like this:
id  date
1   a
1   a
1   b
1   c
1   c
1   c
2   z
2   z
2   e
2   x
I want to calculate the average number of duplicates per id, i.e. for id = 1 we have 2 a's, 1 b and 3 c's, so I want the output to be (2 + 1 + 3) / 3 = 2.
The result should be like this:
id  mean
1   2
2   1.333
You can use mean(table(date)) to get the average of the counts, applying it for each id value.
Using dplyr -
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(mean = mean(table(date)))
# id mean
# <int> <dbl>
#1 1 2
#2 2 1.33
Or with base R aggregate.
aggregate(date~id, df, function(x) mean(table(x)))
You can try a tidyverse approach:
library(tidyverse)
d %>%
  group_by(id) %>%
  count(date) %>%
  summarise(mean = mean(n))
# A tibble: 2 x 2
id mean
<int> <dbl>
1 1 2
2 2 1.33
Using base R you can try
foo <- function(x) mean(rle(x)$length)
aggregate(d$date, by=list(d$id), foo)
The data
d <- read.table(text ="id date
1 a
1 a
1 b
1 c
1 c
1 c
2 a
2 a
2 e
2 z", header=T)
Using the data.table package:
library(data.table)
# dt <- as.data.table(your_data_frame)  # convert the data.frame to a data.table
dt[, .(N = .N), by = .(id, date)][, .(mean = mean(N)), by = id]
Another data.table option
> setDT(df)[, .(Mean = .N / uniqueN(date)), id]
id Mean
1: 1 2.000000
2: 2 1.333333
or
dcast(setDT(df), id ~ date, fill = NA)[, .(Mean = rowMeans(.SD, na.rm = TRUE)), id]
gives
id Mean
1: 1 2.000000
2: 2 1.333333
We can use
library(dplyr)
df1 %>%
  group_by(id) %>%
  summarise(Mean = count(cur_data(), date) %>%
              pull(n) %>%
              mean)
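Note that newer dplyr releases (1.1.0+) soft-deprecate cur_data() in favour of pick(); the same idea with pick() would look roughly like:
df1 %>%
  group_by(id) %>%
  summarise(Mean = mean(count(pick(date), date)$n))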
A table approach using the base package:
at <- table(a$id, a$date)
apply(at, 1, function(x) sum(x) / sum(x != 0))
# 1 2
#2.000000 1.333333
The dataset:
a <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                date = c('a', 'a', 'b', 'c', 'c', 'c', 'a', 'a', 'e', 'z'))
Here is a package-free solution:
a <- cbind(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2), c('a', 'a', 'b', 'c', 'c', 'c', 'a', 'a', 'e', 'z'))
b <- matrix(ncol = 2)[-1, ]
for (i in unique(a[, 1])) {
  b <- rbind(b, c(i, sum(table(a[a[, 1] == i, 2])) / length(table(a[a[, 1] == i, 2]))))
}
The output:
[,1] [,2]
[1,] "1" "2"
[2,] "2" "1.33333333333333"
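The same result without the explicit loop, reusing the mean(table(...)) idea with tapply on the data.frame version of the data from the table answer above:
a <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                date = c('a', 'a', 'b', 'c', 'c', 'c', 'a', 'a', 'e', 'z'))
tapply(a$date, a$id, function(x) mean(table(x)))
#        1        2
# 2.000000 1.333333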

Simultaneous Count and Sort in R

I am trying to obtain counts of a certain categorical variable in 2 separate columns, with each column reflecting the presence or absence of an indicator variable. This is for a very large data frame. Here is an example data frame to further illustrate what I'm trying to do.
X <- (1:10)
Y <- c('a','b','a','c','b','b','a','a','c','c')
Z <- c(0,1,1,1,0,1,0,1,1,1)
test_df <- data.frame(X,Y,Z)
I would like to make a new data frame grouped by 'a', 'b', and 'c' with 2 columns to the right: one with counts of the letter for Z == 1 and one with counts of that letter for Z == 0.
The dplyr way:
library(dplyr)
library(tidyr)
#Code
res <- test_df %>%
  group_by(Y, Z) %>%
  summarise(N = n()) %>%
  pivot_wider(names_from = Z, values_from = N, values_fill = 0)
Output:
# A tibble: 3 x 3
# Groups: Y [3]
Y `0` `1`
<chr> <int> <int>
1 a 2 2
2 b 1 2
3 c 0 3
We can use values_fn in pivot_wider to do this in a single step
library(dplyr)
library(tidyr)
test_df %>%
  pivot_wider(names_from = Z, values_from = X,
              values_fn = length, values_fill = 0)
# A tibble: 3 x 3
# Y `0` `1`
# <chr> <int> <int>
#1 a 2 2
#2 b 1 2
#3 c 0 3
A base R option using aggregate + reshape
replace(
u <- reshape(
aggregate(X ~ ., test_df, length),
idvar = "Y",
timevar = "Z",
direction = "wide"
),
is.na(u),
0
)
giving
Y X.0 X.1
1 a 2 2
2 b 1 2
5 c 0 3
One way with data.table:
library(data.table)
setDT(test_df)
test_df[, z1 := sum(Z == 1), by = Y]
test_df[, z0 := sum(Z == 0), by = Y]
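Note this keeps all original rows and just attaches the counts as new columns; for one row per letter like the other answers, a grouped summary does it:
test_df[, .(z0 = sum(Z == 0), z1 = sum(Z == 1)), by = Y]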
In base R you can use table:
table(test_df$Y, test_df$Z)
# 0 1
# a 2 2
# b 1 2
# c 0 3
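If you need the result as a data frame rather than a table object, base as.data.frame.matrix() converts it:
out <- as.data.frame.matrix(table(test_df$Y, test_df$Z))
out$Y <- rownames(out)  # recover the grouping column if needed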

Merge two data frames based on values between other values

I would like to use a category from one data frame and apply it to another based on a similar column (merge). But the merge needs to consider a range of data points that fall between two columns. I have an example below.
library(tibble)
set.seed(123)
df_1 <- tibble(
  x = c(0, 500, 1000, 1500, 2000),
  y = c(499, 999, 1499, 1999, 99999),
  desc = LETTERS[1:5]
)
> df_1
# A tibble: 5 x 3
x y desc
<dbl> <dbl> <chr>
1 0 499 A
2 500 999 B
3 1000 1499 C
4 1500 1999 D
5 2000 99999 E
df_2 <- tibble(
  code = sample(1:2500, 5, F)
)
> df_2
# A tibble: 5 x 1
code
<int>
1 719
2 1970
3 1022
4 2205
5 2348
## desired output
df_2 %>%
  mutate(desc = c('B', 'D', 'C', 'E', 'E'))
# A tibble: 5 x 2
code desc
<int> <chr>
1 719 B
2 1970 D
3 1022 C
4 2205 E
5 2348 E
My first thought was to split df_1 and merge somehow, but I'm stuck on how to deal with the range of values found in x and y. Any ideas?
This is an easy problem to handle in SQL, so one option would be to use the sqldf package, with this query:
SELECT t2.code, COALESCE(t1.desc, '') AS desc
FROM df_2 t2
LEFT JOIN df_1 t1
ON t2.code BETWEEN t1.x AND t1.y;
R code:
library(sqldf)
sql <- paste0("SELECT t2.code, COALESCE(t1.desc, '') AS desc ",
              "FROM df_2 t2 LEFT JOIN df_1 t1 ON t2.code BETWEEN t1.x AND t1.y")
result <- sqldf(sql)
library(tidyverse)
set.seed(123)
df_1 <- tibble(
  x = c(0, 500, 1000, 1500, 2000),
  y = c(499, 999, 1499, 1999, 99999),
  desc = LETTERS[1:5]
)
df_2 <- tibble(
  code = sample(1:2500, 5, F)
)
df_1 %>%
  mutate(code = map2(x, y, ~ seq(.x, .y, 1))) %>% # create a sequence of numbers with step = 1
  unnest(code) %>%                                # unnest the list-column of sequences
  inner_join(df_2, by = "code") %>%               # join df_2
  select(-x, -y)                                  # remove the range columns
# # A tibble: 5 x 2
# desc code
# <chr> <dbl>
# 1 B 719
# 2 C 1022
# 3 D 1970
# 4 E 2205
# 5 E 2348
This seems to work, but is not very tidyverse-ish:
df_2 %>% mutate(v = with(df_1, desc[ findInterval(code, x) ]))
code v
1 719 B
2 1970 D
3 1022 C
4 2205 E
5 2348 E
This only uses the x column, so the assumption is that there are no gaps in the ranges (y is always one below the next x).
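A data.table non-equi join handles gaps in the ranges as well; a sketch assuming df_1 and df_2 as defined in the question:
library(data.table)
setDT(df_1); setDT(df_2)
# keep the row of df_1 whose [x, y] range contains each code
df_1[df_2, .(code = i.code, desc = desc), on = .(x <= code, y >= code)]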

Reshape2: multiple observations for variable

I have the following sample data:
d <- data.frame(id=c(1,1,1,2,2), time=c(1,1,1,1,1), var=runif(5))
id time var
1 1 1 0.373448545
2 1 1 0.007007124
3 1 1 0.840572603
4 2 1 0.684893481
5 2 1 0.822581501
I want to reshape this data.frame to wide format using dcast such that the output is the following:
id var.1 var.2 var.3
1 1 0.3734485 0.007007124 0.8405726
2 2 0.6848935 0.822581501 NA
Does anyone have any ideas?
Create a sequence column, seq, by id and then use dcast:
library(reshape2)
set.seed(123)
d <- data.frame(id=c(1,1,1,2,2), time=c(1,1,1,1,1), var=runif(5))
d2 <- transform(d, seq = ave(id, id, FUN = seq_along))
dcast(d2, id ~ seq, value.var = "var")
giving:
id 1 2 3
1 1 0.28758 0.78831 0.40898
2 2 0.88302 0.94047 NaN
A dplyr/tidyr option with spread would be
library(dplyr)
library(tidyr)
d %>%
  group_by(id) %>%
  mutate(n1 = paste0("var.", row_number())) %>%
  spread(n1, var) %>%
  select(-time)
# id var.1 var.2 var.3
# (int) (dbl) (dbl) (dbl)
#1 1 0.3734485 0.007007124 0.8405726
#2 2 0.6848935 0.822581501 NA
OK, here's a working solution. The key is to add a counting variable. My solution for this is a bit complicated; maybe you can come up with something better.
library(dplyr)
library(magrittr)
library(reshape2)
d <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3), time = rep(1, 9), var = runif(9))
group_by(d, id) %>%
  summarise(n = n()) %>%
  data.frame() -> count
f <- c()
for (i in 1:nrow(count)) {
  f <- c(f, 1:count$n[i])
}
d <- data.frame(d, f)
dcast(d, id ~ f, value.var = "var")
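With current tidyr, the whole reshape collapses to one pipeline; a sketch using pivot_wider() on the d from the question:
library(dplyr)
library(tidyr)
d %>%
  group_by(id) %>%
  mutate(seq = row_number()) %>%
  pivot_wider(id_cols = id, names_from = seq,
              values_from = var, names_prefix = "var.")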
