Let's say I have the following data frame:
df:
user actions
1 A
1 B
1 c
2 A
2 D
3 B
4 C
4 D
I want to convert it to this format:
new_df:
user action1 action2 action3
1 A B C
2 A D NA
3 B NA NA
4 C D NA
Please note that the number of action columns in new_df equals the maximum number of actions for any user. Users with fewer actions than that maximum should be padded with NA.
How can I do it?
You can use rle to create a column holding the labels actions1, actions2, etc. (this assumes each user's rows are contiguous), then use dcast from the data.table package to reshape the data to wide format.
df$coln <- paste0("actions", unlist(lapply(rle(df$user)$lengths, seq_len)))
data.table::dcast(df, user ~ coln, value.var="actions")
In response to the OP's comment, you can zero-pad the index as follows, so that the columns still sort correctly when a user has ten or more actions:
df$coln <- paste0("actions", sprintf("%02d", unlist(lapply(rle(df$user)$lengths, seq_len))))
Using the data.table package:
df <- read.table(text="user actions
1 A
1 B
1 C
1 D
1 E
1 F
1 G
1 H
1 I
1 J
1 K
2 A
2 D
3 B
4 C
4 D", header=TRUE)
library(data.table)
setDT(df)
dcast(df[, coln := sprintf("actions%02d", seq_len(.N)), by = .(user)],
      user ~ coln, value.var = "actions")
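The rle step above relies on each user's rows being contiguous. As a hedged base R sketch (not part of the original answer), ave() can build the within-user counter regardless of row order, and reshape() can then widen the data. The column name idx is introduced here only for illustration, and the example data is the question's data with the rows deliberately shuffled.

```r
# Question's data, shuffled so that users are not contiguous
df <- data.frame(
  user    = c(1, 2, 1, 4, 1, 2, 3, 4),
  actions = c("A", "A", "B", "C", "C", "D", "B", "D"),
  stringsAsFactors = FALSE
)

# Within-user counter (1, 2, 3, ...) in order of appearance;
# `idx` is a hypothetical helper column
df$idx <- ave(seq_along(df$user), df$user, FUN = seq_along)

# Long to wide: one row per user, one column per counter value
wide <- reshape(df, idvar = "user", timevar = "idx",
                direction = "wide", sep = "")
wide <- wide[order(wide$user), ]
wide
```

Users with fewer actions than the maximum get NA in the remaining columns, as the question asks.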
A solution using the tidyverse approach:
df <- read.table(text = "user actions
1 A
1 B
1 c
2 A
2 D
3 B
4 C
4 D", header = TRUE)
library(tidyr)
library(dplyr)
df %>%
  group_by(user) %>%
  mutate(index = paste0("action", row_number())) %>%
  spread(index, actions)
#> # A tibble: 4 x 4
#> # Groups: user [4]
#> user action1 action2 action3
#> <int> <fct> <fct> <fct>
#> 1 1 A B c
#> 2 2 A D <NA>
#> 3 3 B <NA> <NA>
#> 4 4 C D <NA>
Created on 2018-04-11 by the reprex package (v0.2.0).
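In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same pipeline written with pivot_wider(), assuming a recent tidyr is installed; it is a variation on the answer above rather than the answerer's own code.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  user    = c(1, 1, 1, 2, 2, 3, 4, 4),
  actions = c("A", "B", "C", "A", "D", "B", "C", "D")
)

wide <- df %>%
  group_by(user) %>%
  mutate(index = paste0("action", row_number())) %>%  # action1, action2, ... per user
  ungroup() %>%
  pivot_wider(names_from = index, values_from = actions)  # absent cells become NA
wide
```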
I have data as follows:
library(data.table)
datA <- fread("A B C
1 1 1
2 2 2")
datB <- fread("A B C
1 1 1
2 2 2
3 3 3")
I want to find the rows that occur only once in the combined data (here the row 3 3 3, because every other row occurs more than once).
I tried:
dat <- rbind(datA, datB)
unique(dat)
!duplicated(dat)
I also tried:
setDT(dat)[, if (.N == 1) .SD, ]
But that returns NULL.
How should I do this?
You can use fsetdiff:
rbind.data.frame(fsetdiff(datA, datB, all = TRUE),
                 fsetdiff(datB, datA, all = TRUE))
In general, this is called an anti_join:
library(dplyr)
bind_rows(anti_join(datA, datB),
          anti_join(datB, datA))
A B C
1: 4 4 4
2: 3 3 3
Data: I added a row in datA to show how to keep rows from both data sets (a simple anti-join does not work otherwise):
library(data.table)
datA <- fread("A B C
1 1 1
2 2 2
4 4 4")
datB <- fread("A B C
1 1 1
2 2 2
3 3 3")
One possible solution
library(data.table)
datB[!datA, on=c("A", "B", "C")]
A B C
<int> <int> <int>
1: 3 3 3
Or (if you are interested in the symmetric difference)
funion(fsetdiff(datB, datA), fsetdiff(datA, datB))
A B C
<int> <int> <int>
1: 3 3 3
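The attempt in the question returns NULL because, without by, the whole table is a single group with .N equal to 5. A hedged sketch of a counting-based variant in data.table follows: group on every column, count, and keep the rows counted once. The names counts and singles are introduced here for illustration only.

```r
library(data.table)

datA <- data.table(A = c(1L, 2L), B = c(1L, 2L), C = c(1L, 2L))
datB <- data.table(A = 1:3, B = 1:3, C = 1:3)

dat <- rbind(datA, datB)

# Count how often each distinct row occurs by grouping on all columns
counts <- dat[, .N, by = names(dat)]

# Keep rows seen exactly once, then drop the helper count column
singles <- counts[N == 1][, N := NULL][]
singles
```

Unlike the set-difference answers, this also treats a row duplicated within a single table as non-unique.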
Another dplyr option is to group by all columns and keep only the rows that appear once:
library(data.table)
library(dplyr)
datA %>%
  bind_rows(., datB) %>%
  group_by(across(everything())) %>%
  filter(n() == 1)
#> # A tibble: 1 × 3
#> # Groups: A, B, C [1]
#> A B C
#> <int> <int> <int>
#> 1 3 3 3
Created on 2022-11-09 with reprex v2.0.2
Is it possible to drop all list columns from a data frame using dplyr's select, similar to dropping a single column?
df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list('bob', 'cratchit', 'rules!','and', 'tiny tim too"')
)
df %>%
  select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a workable workaround, but I wanted to know whether it can be done with select_if.
df %>%
  select(-which(map(df,class) == 'list'))
Use Negate
df %>%
  select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate, which gives the same result.
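In dplyr 1.0.0 and later, select_if() is superseded by select() combined with where(). The following is a minimal sketch of that form, assuming a recent dplyr; the name no_lists is used here only for illustration.

```r
library(dplyr)

df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list("bob", "cratchit", "rules!", "and", "tiny tim too")
)

# where() turns a predicate into a column selector; `!` inverts the selection
no_lists <- df %>%
  select(!where(is.list))
no_lists
```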
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
I wish to add the first feature in the following dataset in a new column
mydf <- data.frame (customer= c(1,2,1,2,2,1,1) , feature =c("other", "a", "b", "c", "other","b", "c"))
customer feature
1 1 other
2 2 a
3 1 b
4 2 c
5 2 other
6 1 b
7 1 c
by using dplyr. However, I want my code to ignore the "other" feature in the data set and choose the first feature that comes after it.
So the following code is not sufficient:
library(dplyr)
new <- mydf %>%
  group_by(customer) %>%
  mutate(firstfeature = first(feature))
How can I ignore "other" so that I get the following ideal output?
customer feature firstfeature
1 1 other b
2 2 a a
3 1 b b
4 2 c a
5 2 other a
6 1 b b
7 1 c b
With dplyr we can group by customer and, for every group, take the first feature that is not "other".
library(dplyr)
mydf %>%
  group_by(customer) %>%
  mutate(firstfeature = feature[feature != "other"][1])
# customer feature firstfeature
# <dbl> <chr> <chr>
#1 1 other b
#2 2 a a
#3 1 b b
#4 2 c a
#5 2 other a
#6 1 b b
#7 1 c b
Similarly, we can do this with base R's ave:
mydf$firstfeature <- ave(mydf$feature, mydf$customer,
                         FUN = function(x) x[x != "other"][1])
Another option is data.table
library(data.table)
setDT(mydf)[, firstfeature := feature[feature != "other"][1], customer]
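All three answers rely on the same subsetting idiom, x[x != "other"][1]. The short sketch below isolates it to show what happens when a group contains nothing but "other"; the helper name first_non_other is hypothetical and not from the answers.

```r
# Hypothetical helper wrapping the subsetting idiom used in the answers above
first_non_other <- function(x) x[x != "other"][1]

# Usual case: the first value that is not "other"
first_non_other(c("other", "b", "c"))

# Edge case: nothing is left after filtering, and indexing an
# empty vector with [1] yields NA rather than an error
first_non_other(c("other", "other"))
```

So a customer whose features are all "other" would get NA in firstfeature.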
I need to re-format a table in R.
I have a table like this.
ID category
1 a
1 b
2 c
3 d
4 a
4 c
5 a
And I want to reshape it as
ID category1 category2
1 a b
2 c null
3 d null
4 a c
5 a null
Is this doable in R?
This is a very straightforward "long to wide" type of reshaping problem, but you need a secondary "id" (or "time") variable.
You can try using getanID from my "splitstackshape" package and use dcast to reshape from long to wide. getanID will create a new column called ".id" that would be used as your "time" variable:
library(splitstackshape)
dcast.data.table(getanID(mydf, "ID"), ID ~ .id, value.var = "category")
# ID 1 2
# 1: 1 a b
# 2: 2 c NA
# 3: 3 d NA
# 4: 4 a c
# 5: 5 a NA
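If adding splitstackshape is not an option, data.table's own rowid() can supply the secondary id. This is a hedged sketch of that alternative, not the answerer's method; the column name cat_idx is made up for illustration, and the data is retyped from the question.

```r
library(data.table)

mydf <- data.table(
  ID       = c(1, 1, 2, 3, 4, 4, 5),
  category = c("a", "b", "c", "d", "a", "c", "a")
)

# rowid() numbers the rows within each ID; `cat_idx` is a hypothetical column
mydf[, cat_idx := paste0("category", rowid(ID))]

# Long to wide; IDs with a single category get NA in category2
wide <- dcast(mydf, ID ~ cat_idx, value.var = "category")
wide
```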
Same as Ananda's, but using dplyr and tidyr:
library(tidyr)
library(dplyr)
mydf %>%
  group_by(ID) %>%
  mutate(cat_row = paste0("category", 1:n())) %>%
  spread(key = cat_row, value = category)
# Source: local data frame [5 x 3]
#
# ID category1 category2
# 1 1 a b
# 2 2 c NA
# 3 3 d NA
# 4 4 a c
# 5 5 a NA
I am working in R and I have a data set that has multiple entries for each subject. I want to create an index variable that numbers the entries within each subject. For example:
Subject Index
1 A 1
2 A 2
3 B 1
4 C 1
5 C 2
6 C 3
7 D 1
8 D 2
9 E 1
The first A entry is indexed as 1, while the second A entry is indexed as 2. The first B entry is indexed as 1, etc.
Any help would be excellent!
Here's a quick data.table approach
library(data.table)
setDT(df)[, Index := seq_len(.N), by = Subject][]
# Subject Index
# 1: A 1
# 2: A 2
# 3: B 1
# 4: C 1
# 5: C 2
# 6: C 3
# 7: D 1
# 8: D 2
# 9: E 1
Or with base R
# seq_along(Subject) works whether Subject is a factor or a character vector
with(df, ave(seq_along(Subject), Subject, FUN = seq_along))
## [1] 1 2 1 1 2 3 1 2 1
Or with dplyr (don't run this on a data.table class)
library(dplyr)
df %>%
  group_by(Subject) %>%
  mutate(Index = row_number())
Using dplyr
library(dplyr)
df %>% group_by(Subject) %>% mutate(Index = 1:n())
You get:
#Source: local data frame [9 x 2]
#Groups: Subject
#
# Subject Index
#1 A 1
#2 A 2
#3 B 1
#4 C 1
#5 C 2
#6 C 3
#7 D 1
#8 D 2
#9 E 1