Trying to do a simple pivot in R, much like you would in SQL.
I understand this question has been asked before; however, I am having trouble with duplicate rows.
Pivoting data in R
Currently the data is in this format (the characters are just placeholders for ease of viewing; the actual data is numerical):
V1 V2 V3 V4
A B C Sales
D E F Sales
G H I Technical
J K L Technical
And it needs to be transformed into this format:
Variable Sales Technical
V1 A G
V1 D J
V2 B H
V2 E K
V3 C I
V3 F L
I've tried both reshape and tidyr packages and they either aggregate the data in the case of reshape or throw errors for duplicate row identifiers in the case of tidyr.
I don't care about duplicate row identifiers; in fact, it's necessary to identify them as factors for analysis.
Am I going about this the wrong way? Are these the correct packages to be using or can anyone suggest another method?
I hope this will work:
df %>%
  gather(Variable, Value, V1:V3) %>%
  group_by(V4, Variable) %>%
  mutate(g = row_number()) %>%
  spread(V4, Value) %>%
  ungroup() %>%
  select(-g)
# # A tibble: 6 x 3
# Variable Sales Technical
# * <chr> <chr> <chr>
# 1 V1 A G
# 2 V1 D J
# 3 V2 B H
# 4 V2 E K
# 5 V3 C I
# 6 V3 F L
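Since gather() and spread() have since been superseded, the same reshape can be sketched with pivot_longer()/pivot_wider() from tidyr >= 1.0.0. This assumes the example data above is stored in df:

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(V1 = c("A", "D", "G", "J"),
                 V2 = c("B", "E", "H", "K"),
                 V3 = c("C", "F", "I", "L"),
                 V4 = c("Sales", "Sales", "Technical", "Technical"))

out <- df %>%
  pivot_longer(V1:V3, names_to = "Variable", values_to = "Value") %>%
  group_by(V4, Variable) %>%
  mutate(g = row_number()) %>%   # disambiguate the duplicate identifiers
  pivot_wider(names_from = V4, values_from = Value) %>%
  ungroup() %>%
  select(-g) %>%
  arrange(Variable)
```

The helper column g plays the same role as in the gather/spread version: it keeps pivot_wider from complaining about duplicate row identifiers.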
Another option is melt/dcast from data.table:
library(data.table)
dcast(melt(setDT(df1), id.var = 'V4'), variable + rowid(V4) ~
V4, value.var = 'value')[, V4 := NULL][]
# variable Sales Technical
#1: V1 A G
#2: V1 D J
#3: V2 B H
#4: V2 E K
#5: V3 C I
#6: V3 F L
I have a df such as
df <-read.table(text="
v1 v2 v3 v4 v5
1 A B X C
2 A B C X
3 A C C C
4 B D V A
5 B Z Z D", header=T)
How can I filter rows where any of the variables v2 to v5 has an "X"? I've seen some examples using filter_at, but those seem to work only for numeric conditions:
filter_at(vars(contains("prefix")), all_vars(.>5))
and replacing > 5 with "X" does not work.
With dplyr 1.0.4, we can use if_any
library(dplyr)
df %>%
filter(if_any(v2:v5, ~ . == 'X'))
# v1 v2 v3 v4 v5
#1 1 A B X C
#2 2 A B C X
You can use filter_at with any_vars to select rows that have at least one value of "X".
library(dplyr)
df %>% filter_at(vars(v2:v5), any_vars(. == 'X'))
# v1 v2 v3 v4 v5
#1 1 A B X C
#2 2 A B C X
However, filter_at has been superseded, so to translate this into across you can do:
df %>% filter(Reduce(`|`, across(v2:v5, ~. == 'X')))
It is also easier in base R:
df[rowSums(df[-1] == 'X') > 0, ]
I have a very large dataset with 3 columns of interest: id, house, and people. Each id can have multiple houses and each house can have multiple people. I want to create an edge list using what @David Arenburg has shared here: Creating edge list with additional variables in R
However, the issue is that the edges are given as both 'a;b' and 'b;a'. I would like to have each pair only once, since a large set of a and b could produce thousands of a;b, b;a combinations.
I would like each pair only once because I want to count how many times the people share a house.
Given the dataset
id=c(rep("ID1",3), rep("ID2",6), "ID3", rep("ID4",5))
house=c(rep("house1",2), "house2", rep("house3",2), rep("house4",4), "house5", rep("house6",3), "house7", "house8")
people=c("a","b","c","d","e","d","e","d","e","f","g","h","h","h","h")
df1 <- data.frame(id,house, people)
The following code by #David Arenburg gives us the edge-list
df1 = setDT(df1)[, if(.N > 1) tstrsplit(combn(as.character(people),
2, paste, collapse = ";"), ";"),
.(id, house)]
The results
id house V1 V2
1: ID1 house1 a b
2: ID2 house3 d e
3: ID2 house4 d e
4: ID2 house4 d d
5: ID2 house4 d e
6: ID2 house4 e d
7: ID2 house4 e e
8: ID2 house4 d e
9: ID4 house6 g h
10: ID4 house6 g h
11: ID4 house6 h h
As you can see, between V1 & V2 the same house has both 'd;e' and 'e;d', which I would like to avoid. For a large amount of data those combinations could number in the 1000s.
Thanks for your help
I'm sure there's a more concise base R way, but here's one dplyr approach, where we sort the two values to make it easier to eliminate repeats.
library(dplyr)
df %>%
mutate(V1s = if_else(V1 < V2, V1, V2),
V2s = if_else(V1 < V2, V2, V1)) %>%
distinct(id, house, V1s, V2s)
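For the more concise base R way hinted at above, pmin() and pmax() operate elementwise on character vectors, so each pair can be put into canonical order without if_else(). A minimal sketch on a toy subset of the edge list (hypothetical data, not the full result):

```r
# Toy edge list containing both orderings of the same pair
df <- data.frame(id    = c("ID2", "ID2", "ID2"),
                 house = c("house4", "house4", "house4"),
                 V1    = c("d", "e", "d"),
                 V2    = c("e", "d", "e"))

# pmin()/pmax() sort each pair elementwise into a canonical order
df$V1s <- pmin(df$V1, df$V2)
df$V2s <- pmax(df$V1, df$V2)

dedup <- unique(df[, c("id", "house", "V1s", "V2s")])
```

Both 'd;e' and 'e;d' collapse to the single canonical row ('d', 'e').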
Here's a possibility following from the excellent answer that @David Arenburg provided.
The overall strategy:
Create a new variable with the ordered edge (that is, convert "e -> d" to "d -> e")
Get the unique values of each combination of id, house and the new variable.
Drop the variable
library(data.table)
# keep Arenburg's solution and chain a couple of additional commands:
setDT(df1)[,
if(.N > 1) tstrsplit(combn(as.character(people),
2, paste, collapse = ";"), ";"),
.(id, house)][,
edge := apply(.SD,
1,
function(x) paste(sort(c(x[1],
x[2])),
collapse = ",")),
.SDcols = c("V1", "V2")][,
.SD[1, ],
by = .(id, house, edge)][
, edge := NULL][]
id house V1 V2
1: ID1 house1 a b
2: ID2 house3 d e
3: ID2 house4 d e
4: ID2 house4 d d
5: ID2 house4 e e
6: ID4 house6 g h
7: ID4 house6 h h
Notice that you could drop the rows in which V1 == V2 too, as those are irrelevant edges. That could be accomplished with [V1 != V2, ] at the end of the previous chain.
I've got a data.table that I'd like to dcast based on three columns (V1, V2, V3). There are, however, some duplicates in V3, and I need an aggregate function that looks at a fourth column V4 and picks the value of V3 corresponding to the maximum value of V4. I'd like to do this without having to aggregate DT separately prior to dcasting. Can this aggregation be done in the aggregate function of dcast, or do I need to aggregate the table separately first?
Here is my data.table DT:
> DT <- data.table(V1 = c('a','a','a','b','b','c')
, V2 = c(1,2,1,1,2,1)
, V3 = c('st', 'cc', 'B', 'st','st','cc')
, V4 = c(0,0,1,0,1,1))
> DT
V1 V2 V3 V4
1: a 1 st 0
2: a 2 cc 0
3: a 1 B 1 ## --> i want this row to be picked in dcast when V1 = a and V2 = 1 because V4 is largest
4: b 1 st 0
5: b 2 st 1
6: c 1 cc 1
and the dcast function could look something like this:
> dcast(DT
, V1 ~ V2
, value.var = "V3"
#, fun.aggregate = V3[max.which(V4)] ## ?!?!?!??!
)
My desired output is:
> desired
V1 1 2
1: a B cc
2: b st st
3: c cc <NA>
Please note that aggregating DT before dcasting to get rid of the duplicates will solve the issue. I'm just wondering if dcasting can be done with the duplicates.
Here is one option where you take the relevant subset before dcasting:
DT[order(V4, decreasing = TRUE)
][, dcast(unique(.SD, by = c("V1", "V2")), V1 ~ V2, value.var = "V3")]
# V1 1 2
# 1: a B cc
# 2: b st st
# 3: c cc <NA>
Alternatively, order first and use a custom aggregate function in dcast():
dcast(
DT[order(V4, decreasing = TRUE)],
V1 ~ V2,
value.var = "V3",
fun.aggregate = function(x) x[1]
)
A dplyr/tidyr option would be to group_by V1 and V2, slice the row with the maximum V4 in each group, and then spread to wide format.
library(dplyr)
library(tidyr)
DT %>%
group_by(V1, V2) %>%
slice(which.max(V4)) %>%
select(-V4) %>%
spread(V2, V3)
# V1 `1` `2`
# <chr> <chr> <chr>
#1 a B cc
#2 b st st
#3 c cc NA
This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 6 years ago.
Let's consider the following data
data <- data.frame(V1 = c("A","A","A","B","B","C","C"), V2 = c("B","B","B","C","C","D","D"))
> data
V1 V2
1 A B
2 A B
3 A B
4 B C
5 B C
6 C D
7 C D
Now we aggregate data by both columns and obtain
library(dplyr)
group_by(data, V1, V2) %>% summarise(n())
V1 V2 n()
(fctr) (fctr) (int)
1 A B 3
2 B C 2
3 C D 2
Now we want to turn this data back into the original data. Is there a function for this procedure?
We can use base R to do this
data1 <- as.data.frame(data1)
data1[rep(1:nrow(data1), data1[,3]),-3]
This is one of the cases where I would opt for base R. Having said that, there are package solutions for this type of problem, i.e. expandRows (a wrapper for the above) from splitstackshape
library(splitstackshape)
data %>%
group_by(V1, V2) %>%
summarise(n=n()) %>%
expandRows(., "n")
Or if we want to stick to a similar option as in base R within %>%
data %>%
group_by(V1, V2) %>%
summarise(n=n()) %>%
do(data.frame(.[rep(1:nrow(.), .$n),-3]))
# V1 V2
# (fctr) (fctr)
#1 A B
#2 A B
#3 A B
#4 B C
#5 B C
#6 C D
#7 C D
data
data1 <- group_by(data, V1, V2) %>% summarise(n())
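In more recent tidyr (>= 0.8), tidyr::uncount() performs this inverse of a count directly, which may read more cleanly than the base R indexing. A sketch, assuming the count column is named n:

```r
library(dplyr)
library(tidyr)

data <- data.frame(V1 = c("A", "A", "A", "B", "B", "C", "C"),
                   V2 = c("B", "B", "B", "C", "C", "D", "D"))

counts   <- count(data, V1, V2)   # aggregate: one row per pair, plus n
restored <- uncount(counts, n)    # repeat each row n times
```

uncount() drops the weights column by default, so restored has the same shape as the original data.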
A really simple question, but I couldn't find a solution:
I have a data.frame like
V1 <- c("A","A","B","B","C","C")
V2 <- c("D","D","E","E","F","F")
V3 <- c(10:15)
df <- data.frame(V1, V2, V3)
i.e.
V1 V2 V3
A D 10
A D 11
B E 12
B E 13
C F 14
C F 15
And I would like
V1 V2 V3.1 V3.2
A D 10 11
B E 12 13
C F 14 15
I have tried reshape{stats} and reshape2.
As I had mentioned, all that you need is a "time" variable and you should be fine.
Mark Miller shows the base R approach, and creates the time variable manually.
Here's a way to automatically create the time variable, and the equivalent command for dcast from the "reshape2" package:
## Creating the "time" variable. This does not depend
## on the rows being in a particular order before
## assigning the variables
df <- within(df, {
A <- do.call(paste, df[1:2])
time <- ave(A, A, FUN = seq_along)
rm(A)
})
## This is the "reshaping" step
library(reshape2)
dcast(df, V1 + V2 ~ time, value.var = "V3")
# V1 V2 1 2
# 1 A D 10 11
# 2 B E 12 13
# 3 C F 14 15
Self-promotion alert
Since this type of question has cropped up several times, and since a lot of datasets don't always have a unique ID, I have implemented a variant of the above as a function called getanID in my "splitstackshape" package. In its present version, it hard-codes the name of the "time" variable as ".id". If you were using that, the steps would be:
library(splitstackshape)
library(reshape2)
df <- getanID(df, id.vars=c("V1", "V2"))
dcast(df, V1 + V2 ~ .id, value.var = "V3")
V1 <- c("A","A","B","B","C","C")
V2 <- c("D","D","E","E","F","F")
V3 <- c(10:15)
time <- rep(c(1,2), 3)
df <- data.frame(V1,V2,V3,time)
df
reshape(df, idvar = c('V1','V2'), timevar='time', direction = 'wide')
V1 V2 V3.1 V3.2
1 A D 10 11
3 B E 12 13
5 C F 14 15
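With current tidyr, the same wide reshape can be sketched with pivot_wider(), creating the time index with row_number() instead of building it by hand with rep():

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
                 V2 = c("D", "D", "E", "E", "F", "F"),
                 V3 = 10:15)

wide <- df %>%
  group_by(V1, V2) %>%
  mutate(time = row_number()) %>%   # 1, 2 within each V1/V2 pair
  ungroup() %>%
  pivot_wider(names_from = time, values_from = V3, names_prefix = "V3.")
```

names_prefix = "V3." reproduces the V3.1/V3.2 column names that reshape() produces with direction = 'wide'.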