Creating an index for unique combinations of columns in R - r

I have a data set like this:
df = data.frame(A = c(0.1, 0.3, 0.7, 0.9, 0.5, 0.4, 0.3, 0.3, 0.9, 0.9),
B = c(0.5, 0.4, 0.8, 0.6, 0.8, 0.5, 0.4, 0.5, 0.6, 0.5),
D = c(0.2, 0.1, 0.5, 0.8, 0.6, 0.7, 0.1, 0.3, 0.8, 0.3))
but I need to create an index for every unique combination of A, B and D, like this:
index A B D
1 1 0.1 0.5 0.2
2 2 0.3 0.4 0.1
3 3 0.7 0.8 0.5
4 4 0.9 0.6 0.8
5 5 0.5 0.8 0.6
6 6 0.4 0.5 0.7
7 2 0.3 0.4 0.1
8 7 0.3 0.5 0.3
9 4 0.9 0.6 0.8
10 8 0.9 0.5 0.3
Note that rows 4 and 9 share the same combination of A, B and D, as do rows 2 and 7, so each pair receives the same index value.

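You can use the following code. The index values differ slightly from your expected output, because cur_group_id() numbers the groups in sorted order of A, B and D rather than in order of first appearance, but the logic is the same: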
library(dplyr)
df %>%
group_by(A, B, D) %>%
mutate(index = cur_group_id()) %>%
ungroup() %>%
arrange(index)
# A tibble: 10 x 4
A B D index
<dbl> <dbl> <dbl> <int>
1 0.1 0.5 0.2 1
2 0.3 0.4 0.1 2
3 0.3 0.4 0.1 2
4 0.3 0.5 0.3 3
5 0.4 0.5 0.7 4
6 0.5 0.8 0.6 5
7 0.7 0.8 0.5 6
8 0.9 0.5 0.3 7
9 0.9 0.6 0.8 8
10 0.9 0.6 0.8 8

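We can use match, which numbers the combinations in order of first appearance. A separator in str_c keeps different combinations from pasting to the same string:
library(dplyr)
library(stringr)
df %>%
mutate(index = match(str_c(A, B, D, sep = "_"), unique(str_c(A, B, D, sep = "_")))) %>%
arrange(index)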
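The same match idea can be sketched in base R without any packages. The helper name key and the "\r" separator are illustrative choices, not part of the answer above; the separator only needs to be a string that cannot occur inside the pasted values.

```r
# Example data from the question
df <- data.frame(A = c(0.1, 0.3, 0.7, 0.9, 0.5, 0.4, 0.3, 0.3, 0.9, 0.9),
                 B = c(0.5, 0.4, 0.8, 0.6, 0.8, 0.5, 0.4, 0.5, 0.6, 0.5),
                 D = c(0.2, 0.1, 0.5, 0.8, 0.6, 0.7, 0.1, 0.3, 0.8, 0.3))

# One key string per row; the separator keeps e.g. (1, 11) and (11, 1) apart
key <- do.call(paste, c(df[c("A", "B", "D")], sep = "\r"))

# match() gives the position of each key among the unique keys,
# so the index follows the order of first appearance
df$index <- match(key, unique(key))
df
```

Because the original row order is kept, rows 2 and 7 both get index 2 and rows 4 and 9 both get index 4, as in the desired output.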

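Another dplyr option: number the distinct rows, then join the numbers back onto the original data
library(dplyr)
df %>%
distinct() %>%
mutate(index = row_number()) %>%
left_join(x = df, by = c("A", "B", "D"))
gives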
A B D index
1 0.1 0.5 0.2 1
2 0.3 0.4 0.1 2
3 0.7 0.8 0.5 3
4 0.9 0.6 0.8 4
5 0.5 0.8 0.6 5
6 0.4 0.5 0.7 6
7 0.3 0.4 0.1 2
8 0.3 0.5 0.3 7
9 0.9 0.6 0.8 4
10 0.9 0.5 0.3 8

Related

Group similar strings with numbers and keep order of first appearance

I have a dataframe which looks like this example (just much larger):
var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)
var v1 v2
1 Peter 0.4 0.5
2 Ben 0.6 0.4
3 Mary 0.7 0.2
4 Peter.1 0.3 0.5
5 Ben.1 0.9 0.4
6 Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8 Ben.2 0.6 0.4
9 Mary.2 0.7 0.2
I want to group the strings in 'var' according to the names without the suffixes, and keep the original order of first appearance. Desired output:
var v1 v2
1 Peter 0.4 0.5 # Peter appears first in the original data
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4 # Ben appears second in the original data
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2 # Mary appears third in the original data
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
How can I achieve that?
Thank you!
An option is to create a temporary column that drops the trailing . and digits (\\.\\d+$) with str_remove, then arrange by a factor whose levels are the unique values in order of appearance (match can be used the same way)
library(dplyr)
library(stringr)
df <- df %>%
mutate(var1 = str_remove(var, "\\.\\d+$")) %>%
arrange(factor(var1, levels = unique(var1))) %>%
select(-var1)
Or use fct_inorder from forcats which will convert to factor with levels in the order of first appearance
library(forcats)
df %>%
arrange(fct_inorder(str_remove(var, "\\.\\d+$")))
Output
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
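The match variant mentioned above is not spelled out, so here is a minimal base R sketch of it. The names base_name and out are illustrative. It relies on order() leaving ties in their original order, so it assumes the suffixes already appear in ascending order within each name.

```r
# Example data from the question
df <- data.frame(
  var = c("Peter", "Ben", "Mary", "Peter.1", "Ben.1", "Mary.1",
          "Peter.2", "Ben.2", "Mary.2"),
  v1 = c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7),
  v2 = c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
)

# Strip a trailing ".<digits>" suffix to get the group name
base_name <- sub("\\.\\d+$", "", df$var)

# Rank each group by its first appearance, then sort rows by that rank
out <- df[order(match(base_name, unique(base_name))), ]
rownames(out) <- NULL
out
```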
Compact option with sub and data.table::chgroup
library(data.table)
df[chgroup(sub("\\.\\d+$", "", df$var)),]
var v1 v2
1 Peter 0.4 0.5
4 Peter.1 0.3 0.5
7 Peter.2 0.4 0.1
2 Ben 0.6 0.4
5 Ben.1 0.9 0.4
8 Ben.2 0.6 0.4
3 Mary 0.7 0.2
6 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
chgroup groups duplicated values together while efficiently keeping the groups in order of first appearance.
If you don't mind the values in var being ordered alphabetically, then the simplest solution is this:
df %>%
arrange(var)
var v1 v2
1 Ben 0.6 0.4
2 Ben.1 0.9 0.4
3 Ben.2 0.6 0.4
4 Mary 0.7 0.2
5 Mary.1 0.2 0.2
6 Mary.2 0.7 0.2
7 Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1
Separate the var column into two columns, replace the NAs that this generates with 0, sort, and remove the extra columns.
This sorts on the numeric value of the suffix rather than its character representation, so that, for example, 10 won't come before 2. The match in arrange keeps the groups in order of first occurrence.
library(dplyr)
library(tidyr)
df %>%
separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>%
mutate(no = replace_na(no, 0)) %>%
arrange(match(alpha, alpha), no) %>%
select(-alpha, -no)
giving
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
Update
Have removed what was previously the first solution after reading the update to the question.

How to insert file names into a column in R [duplicate]

This question already has answers here:
Combine a list of data frames into one data frame by row
(10 answers)
Closed 2 years ago.
Suppose I have these data frames:
rock=structure(list(x1 = c(0, 0.8, 0.4, 0.3, 0.5, 1, 0.7, 0.6, 0.4,
0.4, 0.6), x2 = c(0, 1, 0.5, 0.3, 0.5, 0.5, 0.8, 0.3, 0.6, 0.8,
0.7), x3 = c(0, 0.4, 0.8, 0.4, 0.2, 1, 0.5, 0.8, 0.4, 1, 0.3),
x4 = c(0, 0.3, 0.4, 0.4, 0.5, 0.6, 0.8, 0.3, 0.7, 0.6, 0.2
)), class = "data.frame", row.names = c(NA, -11L))
rave=structure(list(x1 = c(0, 0.8, 0.4, 0.3, 0.5, 1), x2 = c(0, 1,
0.5, 0.3, 0.5, 0.5), x3 = c(0, 0.4, 0.8, 0.4, 0.2, 1), x4 = c(0,
0.3, 0.4, 0.4, 0.5, 0.6)), class = "data.frame", row.names = c(NA,
-6L))
classic=structure(list(x1 = c(0, 0.8), x2 = 0:1, x3 = c(0, 0.4), x4 = c(0,
0.3)), class = "data.frame", row.names = c(NA, -2L))
How can I rbind these datasets so that each row carries the name of the dataset it came from? The result I want is shown below. The initial data are CSV files named after each dataset, for example
classic=read.csv(path to classic.csv)
dataset x1 x2 x3 x4
1 classic 0.0 0.0 0.0 0.0
2 classic 0.8 1.0 0.4 0.3
3 Rave 0.0 0.0 0.0 0.0
4 Rave 0.8 1.0 0.4 0.3
5 Rave 0.4 0.5 0.8 0.4
6 Rave 0.3 0.3 0.4 0.4
7 Rave 0.5 0.5 0.2 0.5
8 rock 0.0 0.0 0.0 0.0
9 rock 0.8 1.0 0.4 0.3
10 rock 0.4 0.5 0.8 0.4
11 rock 0.3 0.3 0.4 0.4
12 rock 0.5 0.5 0.2 0.5
13 rock 1.0 0.5 1.0 0.6
14 rock 0.7 0.8 0.5 0.8
15 rock 0.6 0.3 0.8 0.3
16 rock 0.4 0.6 0.4 0.7
17 rock 0.4 0.8 1.0 0.6
18 rock 0.6 0.7 0.3 0.2
Put them in a list and use bind_rows :
library(dplyr)
bind_rows(lst(rock, rave, classic), .id = 'dataset')
# dataset x1 x2 x3 x4
#1 rock 0.0 0.0 0.0 0.0
#2 rock 0.8 1.0 0.4 0.3
#3 rock 0.4 0.5 0.8 0.4
#4 rock 0.3 0.3 0.4 0.4
#5 rock 0.5 0.5 0.2 0.5
#6 rock 1.0 0.5 1.0 0.6
#7 rock 0.7 0.8 0.5 0.8
#8 rock 0.6 0.3 0.8 0.3
#9 rock 0.4 0.6 0.4 0.7
#10 rock 0.4 0.8 1.0 0.6
#11 rock 0.6 0.7 0.3 0.2
#12 rave 0.0 0.0 0.0 0.0
#13 rave 0.8 1.0 0.4 0.3
#14 rave 0.4 0.5 0.8 0.4
#15 rave 0.3 0.3 0.4 0.4
#16 rave 0.5 0.5 0.2 0.5
#17 rave 1.0 0.5 1.0 0.6
#18 classic 0.0 0.0 0.0 0.0
#19 classic 0.8 1.0 0.4 0.3
However, it would be better if you could read the data in a list automatically without reading them individually first.
library(dplyr)
library(purrr)
filenames <- list.files('/path/to/csv', pattern = '\\.csv$', full.names = TRUE)
result <- map_df(filenames,
~read.csv(.x) %>%
mutate(dataset = tools::file_path_sans_ext(basename(.x))))
You can add a column holding the dataset name to each data frame, rbind them, and then move the last column to the first position
classic['dataset'] = 'classic'
rave['dataset'] = 'rave'
rock['dataset'] = 'rock'
df <- rbind(classic, rave, rock)
# parentheses matter: 1:ncol(df)-1 is (1:ncol(df)) - 1, which only works because index 0 is silently dropped
df <- df[,c(ncol(df), 1:(ncol(df)-1))]
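When there are many data frames, tagging each one by hand gets tedious. A minimal base R sketch of the same idea keeps them in a named list and lets the list names supply the label. The two small data frames below are shortened stand-ins for the ones in the question, and dfs and out are illustrative names.

```r
# Shortened stand-ins for the data frames in the question
classic <- data.frame(x1 = c(0, 0.8), x2 = c(0, 1))
rave    <- data.frame(x1 = c(0, 0.8, 0.4), x2 = c(0, 1, 0.5))

# A named list: the names become the values of the dataset column
dfs <- list(classic = classic, rave = rave)

# Prepend the name to each data frame, then stack them
out <- do.call(rbind, Map(function(d, nm) cbind(dataset = nm, d), dfs, names(dfs)))
rownames(out) <- NULL
out
```

Putting dataset first in cbind means no column reordering is needed afterwards.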

Reshape from wide to long in R where id and value of id are in the same row

I am having trouble reshaping my data set into a panel data set. My df looks as follows
id s1 s2 s3 s4 ct1 ct2 ret1 ret2 ret3 ret4
1 a b c d 0.5 0.5 0.6 0.7 0.8 0.5
2 c b a d 0.6 0.6 0.7 0.6 0.5 0.4
3 a c d b 0.7 0.7 0.7 0.8 0.2 0.1
I would like to reshape so it looks as follows
id s ct1 ct2 ret
1 a 0.5 0.5 0.6
1 b 0.5 0.5 0.7
1 c 0.5 0.5 0.8
1 d 0.5 0.5 0.5
2 a 0.6 0.6 0.5
2 b 0.6 0.6 0.6
2 c 0.6 0.6 0.7
2 d 0.6 0.6 0.4
3 a 0.7 0.7 0.7
3 b 0.7 0.7 0.1
3 c 0.7 0.7 0.8
3 d 0.7 0.7 0.2
I regularly reshape from wide to long but somehow my head cannot get around this problem.
1) base R
An option using reshape
out <- reshape(
dat,
idvar = c("id", "ct1", "ct2"),
varying = c(outer(c("s", "ret"), 1:4, paste0)),
sep = "",
direction = "long"
)
Remove rownames and column time
rownames(out) <- out$time <- NULL
Result
out[order(out$id), ]
# id ct1 ct2 s ret
#1 1 0.5 0.5 a 0.6
#4 1 0.5 0.5 b 0.7
#7 1 0.5 0.5 c 0.8
#10 1 0.5 0.5 d 0.5
#2 2 0.6 0.6 c 0.7
#5 2 0.6 0.6 b 0.6
#8 2 0.6 0.6 a 0.5
#11 2 0.6 0.6 d 0.4
#3 3 0.7 0.7 a 0.7
#6 3 0.7 0.7 c 0.8
#9 3 0.7 0.7 d 0.2
#12 3 0.7 0.7 b 0.1
2) data.table
Using melt from data.table
library(data.table)
out <- melt(
setDT(dat),
id.vars = c("id", "ct1", "ct2"),
measure.vars = patterns(c("^s\\d", "^ret\\d")),
value.name = c("s", "ret")
)[, variable := NULL]
out
data
dat <- structure(list(id = 1:3, s1 = structure(c(1L, 2L, 1L), .Label = c("a",
"c"), class = "factor"), s2 = structure(c(1L, 1L, 2L), .Label = c("b",
"c"), class = "factor"), s3 = structure(c(2L, 1L, 3L), .Label = c("a",
"c", "d"), class = "factor"), s4 = structure(c(2L, 2L, 1L), .Label = c("b",
"d"), class = "factor"), ct1 = c(0.5, 0.6, 0.7), ct2 = c(0.5,
0.6, 0.7), ret1 = c(0.6, 0.7, 0.7), ret2 = c(0.7, 0.6, 0.8),
ret3 = c(0.8, 0.5, 0.2), ret4 = c(0.5, 0.4, 0.1)), .Names = c("id",
"s1", "s2", "s3", "s4", "ct1", "ct2", "ret1", "ret2", "ret3",
"ret4"), class = "data.frame", row.names = c(NA, -3L))
You could do it using spread and gather from the tidyr package. You will need to create a temporary id variable in order to be able to pivot the data:
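library(dplyr)
library(tidyr)
library(stringr) # for str_extract
df %>%
gather(key, value , -id, -ct1, -ct2) %>%
mutate(key = str_extract(key, "[:alpha:]+")) %>%
group_by(key) %>%
mutate(tmp_id = row_number()) %>%
ungroup() %>%
spread(key, value, convert = TRUE) %>% # gather stacks s and ret as character; convert makes ret numeric again
select(id, s, ct1, ct2, ret)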
Here is one way that pivot_longer from tidyr can make this a lot easier. When this answer was written, pivot_longer was only available in the development version (install with devtools::install_github("tidyverse/tidyr")); it has been part of the CRAN release since tidyr 1.0.0. We make a spec indicating that the s columns should go into an s variable and similarly for the ret columns. You can remove the final obs column that indicates the number after s or ret if desired.
library(tidyverse)
tbl <- read_table2(
"id s1 s2 s3 s4 ct1 ct2 ret1 ret2 ret3 ret4
1 a b c d 0.5 0.5 0.6 0.7 0.8 0.5
2 c b a d 0.6 0.6 0.7 0.6 0.5 0.4
3 a c d b 0.7 0.7 0.7 0.8 0.2 0.1"
)
spec <- tibble(
`.name` = tbl %>% select(matches("^s|ret")) %>% colnames(),
`.value` = str_remove(`.name`, "\\d$"),
obs = str_extract(`.name`, "\\d")
)
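# with tidyr >= 1.0.0 the spec is passed to pivot_longer_spec(); the 2019 development version accepted pivot_longer(spec = spec)
tbl %>%
pivot_longer_spec(spec)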
#> # A tibble: 12 x 6
#> id ct1 ct2 obs s ret
#> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1 0.5 0.5 1 a 0.6
#> 2 1 0.5 0.5 2 b 0.7
#> 3 1 0.5 0.5 3 c 0.8
#> 4 1 0.5 0.5 4 d 0.5
#> 5 2 0.6 0.6 1 c 0.7
#> 6 2 0.6 0.6 2 b 0.6
#> 7 2 0.6 0.6 3 a 0.5
#> 8 2 0.6 0.6 4 d 0.4
#> 9 3 0.7 0.7 1 a 0.7
#> 10 3 0.7 0.7 2 c 0.8
#> 11 3 0.7 0.7 3 d 0.2
#> 12 3 0.7 0.7 4 b 0.1
Created on 2019-07-23 by the reprex package (v0.3.0)
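With a released tidyr (1.0.0 or later) the same reshape can be sketched without building a spec by hand, using the special ".value" entry in names_to. This is an alternative sketch, not the code from the answer above; dat is rebuilt here with character columns, and long and obs are illustrative names.

```r
library(tidyr)

# Data from the question, with s1..s4 as plain character columns
dat <- data.frame(
  id = 1:3,
  s1 = c("a", "c", "a"), s2 = c("b", "b", "c"),
  s3 = c("c", "a", "d"), s4 = c("d", "d", "b"),
  ct1 = c(0.5, 0.6, 0.7), ct2 = c(0.5, 0.6, 0.7),
  ret1 = c(0.6, 0.7, 0.7), ret2 = c(0.7, 0.6, 0.8),
  ret3 = c(0.8, 0.5, 0.2), ret4 = c(0.5, 0.4, 0.1)
)

# ".value" sends the first regex group ("s" or "ret") to the output column
# names; the second group (the digit) becomes the obs column
long <- pivot_longer(
  dat,
  cols = matches("^(s|ret)\\d$"),
  names_to = c(".value", "obs"),
  names_pattern = "^(s|ret)(\\d)$"
)

# Sort by id and s and keep the columns from the desired output
long <- long[order(long$id, long$s), c("id", "s", "ct1", "ct2", "ret")]
long
```

Pairing s1 with ret1, s2 with ret2 and so on happens through the shared digit, so no temporary id column is needed.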

Lagging variable by group does not work in dplyr

I'm desperately trying to lag a variable by group. I found this post that deals with essentially the same problem I'm facing, but the solution does not work for me, no idea why.
This is my problem:
library(dplyr)
df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df %>%
arrange(monthvec) %>%
mutate(growth=ifelse(monthvec==1, 0.3,
ifelse(monthvec==2, 0.5,
ifelse(monthvec==3, 0.7,
ifelse(monthvec==4, 0.1,
ifelse(monthvec==5, 0.6,NA))))))
df%>%
group_by(monthvec) %>%
mutate(lag.growth = lag(growth, order_by=monthvec))
Source: local data frame [13 x 3]
Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 0.3
3 2 0.5 NA
4 2 0.5 0.5
5 3 0.7 NA
6 3 0.7 0.7
7 3 0.7 0.7
8 4 0.1 NA
9 4 0.1 0.1
10 4 0.1 0.1
11 5 0.6 NA
12 5 0.6 0.6
13 5 0.6 0.6
This is what I'd like it to be in the end:
df$lag.growth <- c(NA, NA, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7,0.7,0.7, 0.1,0.1,0.1)
monthvec growth lag.growth
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
I believe that one problem is that my groups are not of equal length...
Thanks for helping out.
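Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group. Note that this relies on every group except the last having at least as many rows as the first group, so that lagging by that many rows from the start of a group always lands in the previous group.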
library(dplyr)
df %>%
group_by(monthvec) %>%
mutate(cnt = n()) %>%
ungroup() %>%
mutate(lag.growth = lag(growth, first(cnt))) %>%
group_by(monthvec) %>%
mutate(lag.growth = first(lag.growth)) %>%
select(-cnt)
which gives,
# A tibble: 13 x 3
# Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
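A sketch that does not depend on the group sizes is to lag a one-row-per-month table and join it back. It assumes growth is constant within each monthvec, as in the question's data; monthly and out are illustrative names.

```r
library(dplyr)

# Data from the question (growth looked up by month number)
df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3))) %>%
  arrange(monthvec) %>%
  mutate(growth = c(0.3, 0.5, 0.7, 0.1, 0.6)[monthvec])

# One row per month, lagged across months rather than across rows
monthly <- df %>%
  distinct(monthvec, growth) %>%
  arrange(monthvec) %>%
  mutate(lag.growth = lag(growth)) %>%
  select(monthvec, lag.growth)

# Attach the previous month's growth to every row of that month
out <- left_join(df, monthly, by = "monthvec")
out
```

The lag is taken over the sorted months present in the data, so a gap in monthvec would pair a month with the last month before the gap.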
You may join your original data with a dataframe with a shifted "monthvec".
left_join(df, df %>% mutate(monthvec = monthvec + 1) %>% unique(), by = "monthvec")
# monthvec growth.x growth.y
# 1 1 0.3 NA
# 2 1 0.3 NA
# 3 2 0.5 0.3
# 4 2 0.5 0.3
# 5 3 0.7 0.5
# 6 3 0.7 0.5
# 7 3 0.7 0.5
# 8 4 0.1 0.7
# 9 4 0.1 0.7
# 10 4 0.1 0.7
# 11 5 0.6 0.1
# 12 5 0.6 0.1
# 13 5 0.6 0.1

Reshape matrix to data frame

I have an association matrix file that looks like this (4 rows and 3 columns).
test=read.table("test.csv", sep=",", header=T)
head(test)
LosAngeles SanDiego Seattle
1 2 3
A 1 0.1 0.2 0.2
B 2 0.2 0.4 0.2
C 3 0.3 0.5 0.3
D 4 0.2 0.5 0.1
What I want is to reshape this matrix into a data frame. The result should look something like this (12 (= 4 * 3) rows and 3 columns):
RowNum ColumnNum Value
1 1 0.1
2 1 0.2
3 1 0.3
4 1 0.2
1 2 0.2
2 2 0.4
3 2 0.5
4 2 0.5
1 3 0.2
2 3 0.2
3 3 0.3
4 3 0.1
That is, if my matrix file has 100 rows and 90 columns, I want to make a new data frame that contains 9000 (= 100 * 90) rows and 3 columns. I've tried to use the reshape package but I can't seem to get it right. Any suggestions on how to solve this problem?
Use as.data.frame.table. It's the boss:
m <- matrix(data = c(0.1, 0.2, 0.2,
0.2, 0.4, 0.2,
0.3, 0.5, 0.3,
0.2, 0.5, 0.1),
nrow = 4, byrow = TRUE,
dimnames = list(row = 1:4, col = 1:3))
m
# col
# row 1 2 3
# 1 0.1 0.2 0.2
# 2 0.2 0.4 0.2
# 3 0.3 0.5 0.3
# 4 0.2 0.5 0.1
as.data.frame.table(m)
# row col Freq
# 1 1 1 0.1
# 2 2 1 0.2
# 3 3 1 0.3
# 4 4 1 0.2
# 5 1 2 0.2
# 6 2 2 0.4
# 7 3 2 0.5
# 8 4 2 0.5
# 9 1 3 0.2
# 10 2 3 0.2
# 11 3 3 0.3
# 12 4 3 0.1
This should do the trick:
test <- as.matrix(read.table(text="
1 2 3
1 0.1 0.2 0.2
2 0.2 0.4 0.2
3 0.3 0.5 0.3
4 0.2 0.5 0.1", header=TRUE))
data.frame(which(test==test, arr.ind=TRUE),
Value=test[which(test==test)],
row.names=NULL)
# row col Value
#1 1 1 0.1
#2 2 1 0.2
#3 3 1 0.3
#4 4 1 0.2
#5 1 2 0.2
#6 2 2 0.4
#7 3 2 0.5
#8 4 2 0.5
#9 1 3 0.2
#10 2 3 0.2
#11 3 3 0.3
#12 4 3 0.1
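Another base R sketch uses row() and col(), which return the row and column index of every cell. Unlike the test==test comparison, this does not drop cells that are NA. The column names follow the desired output in the question; m and long are illustrative names.

```r
# Matrix from the question
m <- matrix(c(0.1, 0.2, 0.2,
              0.2, 0.4, 0.2,
              0.3, 0.5, 0.3,
              0.2, 0.5, 0.1),
            nrow = 4, byrow = TRUE)

# c() flattens each matrix column by column, so the three vectors line up
long <- data.frame(RowNum = c(row(m)),
                   ColumnNum = c(col(m)),
                   Value = c(m))
long
```

For a 100 x 90 matrix the same three lines give a 9000-row data frame.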
