Transposing a dataset in R

My dataset looks like this:
A B C B E
1 144 119 120 52
2 102 44 97 40
3 128 81 88 39
Now I want to transpose the dataset in the following format:
A Vars Values
1 B 43
2 B 78
3 B 110
1 C 46
2 C 49
3 C 130
1 B 39
2 B 86
3 B 143
1 E 59
2 E 134
3 E 49
But when I use the following code, one of the duplicated variables does not appear in the result:
df_transpose<-reshape2::melt(df,id.vars="A")
A Vars Values
1 B 43
2 B 78
3 B 110
1 C 46
2 C 49
3 C 130
1 E 59
2 E 134
3 E 49
I can't rename the duplicate variable "B" before transposing, as the location of "B" is dynamic: there might be 2, 3, or more variables before it. So finding the location of "B" each time, renaming it, and then transposing is a bit of a hassle.
Can anyone please help me to resolve the problem?
Thanks a lot!

How about something like,
data.frame(df1$A, stack(df1[,-1]))
# df1.A values ind
#1 1 144 B
#2 2 102 B
#3 3 128 B
#4 1 119 C
#5 2 44 C
#6 3 81 C
#7 1 120 B.1
#8 2 97 B.1
#9 3 88 B.1
#10 1 52 E
#11 2 40 E
#12 3 39 E

This is not pretty but it works:
data.frame(A=df$A, Vars=rep(names(df)[-1], each=nrow(df)), Values=c(as.matrix(df[-1])))
or:
data.frame(A=df$A, Vars=rep(names(df)[-1], each=nrow(df)), Values=stack(df[-1])[[1]])
Data used:
df <- read.table(header=TRUE, check.names=FALSE, text=
"A B C B E
1 144 119 120 52
2 102 44 97 40
3 128 81 88 39")
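If you would rather keep a stack()-based approach but still end up with the original label "B" for both columns, one possible sketch is to make the names unique first and strip the suffix afterwards. This assumes that none of the real column names already ends in a ".<number>" suffix, since the sub() call below would remove that too; the object names long and out are only illustrative.

```r
# Same data as above, with the duplicated column name "B" preserved
df <- read.table(header = TRUE, check.names = FALSE, text = "A B C B E
1 144 119 120 52
2 102 44 97 40
3 128 81 88 39")

# Make the names unique so no column is silently dropped:
# "A" "B" "C" "B.1" "E"
names(df) <- make.unique(names(df))

# Stack every column except the id column A
long <- stack(df[-1])

# Strip the ".1", ".2", ... suffixes that make.unique() added
# (assumption: the original names do not end in ".<digits>")
vars <- sub("\\.\\d+$", "", as.character(long$ind))

# A has 3 values and is recycled across the 12 stacked rows
out <- data.frame(A = df$A, Vars = vars, Values = long$values)
out
```

Because the renaming is done by make.unique(), the position of "B" in the data frame does not matter.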


Label columns with an ascending number [duplicate]

This question already has answers here:
Make sequential numeric column names prefixed with a letter
(3 answers)
Closed 2 years ago.
I want to label columns with an ascending number. The reason is that in a bigger dataset I want to be able to sort the columns so they end up in the right order.
How do I code this? Thanks!
set.seed(8)
id <- 1:6
diet <- rep(c("A","B"),3)
period <- rep(c(1,2),3)
score1 <- sample(1:100,6)
score2 <- sample(1:100,6)
score3 <- sample(1:100,6)
df <- data.frame(id, diet, period, score1, score2,score3)
df
id diet period score1 score2 score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
It should look like:
x1id x2diet x3period x4score1 x5score2 x6score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
I was thinking something like this, but something is missing....
colnames(wellbeing) <- paste(1:ncol, colnames(wellbeing))
Another option:
colnames(df) <- paste0('x', 1:dim(df)[2], colnames(df))
or
df %>%
dplyr::rename_all(~ paste0('x', 1:ncol(df), .))
Both methods would yield the same output:
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
You can use :
names(df) <- paste0('x', seq_along(df), names(df))
df
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
Maybe add an underscore?
names(df) <- paste0('x', seq_along(df), "_", names(df))
names(df)
#[1] "x1_id" "x2_diet" "x3_period" "x4_score1" "x5_score2" "x6_score3"
Here is a mapply approach. mapply only returns the new names, so assign the result back to names(df):
names(df) <- mapply(paste0, paste0("x", 1:ncol(df)), names(df))
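Since the stated goal is to sort the columns later, it may be worth zero-padding the number: with more than nine columns, plain prefixes such as "x10" sort before "x2" as text. A minimal sketch using sprintf() follows; the width of two digits and the small example data frame are assumptions made for illustration.

```r
# Small illustrative data frame (not the one from the question)
df <- data.frame(id = 1:3, diet = c("A", "B", "A"), period = c(1, 2, 1))

# "%02d" pads the position to two digits: x01_, x02_, ...
names(df) <- sprintf("x%02d_%s", seq_along(df), names(df))
names(df)

# Why padding matters: compare text sorting of both schemes
plain  <- paste0("x", 1:12)
padded <- sprintf("x%02d", 1:12)
sort(plain)   # "x10" comes before "x2"
sort(padded)  # same order as 1:12
```

Use a wider format such as "%03d" if the data has 100 or more columns.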

How to stack multiple columns using tidyverse

I have a data frame like this in wide format:
set.seed(1)
df = data.frame(item=letters[1:6], field1a=sample(6,6),field1b=sample(60,6),
field1c=sample(200,6),field2a=sample(6,6),field2b=sample(60,6),
field2c=sample(200,6))
What would be the best way to stack all the "a" columns together, all the "b" columns together, and all the "c" columns together, like this?
items fielda fieldb fieldc
a 2 52 121
a 1 44 57
Using base R:
cbind(item=df$item,unstack(transform(stack(df,-1),ind=sub("\\d+","",ind))))
item fielda fieldb fieldc
1 a 2 57 138
2 b 6 39 77
3 c 3 37 153
4 d 4 4 99
5 e 1 12 141
6 f 5 10 194
7 a 3 17 97
8 b 4 23 120
9 c 5 1 98
10 d 1 22 37
11 e 2 49 163
12 f 6 19 131
Or you can use the reshape function in Base R:
reshape(df,varying = split(names(df)[-1],rep(1:3,2)),idvar = "item",direction = "long")
item time field1a field1b field1c
a.1 a 1 2 57 138
b.1 b 1 6 39 77
c.1 c 1 3 37 153
d.1 d 1 4 4 99
e.1 e 1 1 12 141
f.1 f 1 5 10 194
a.2 a 2 3 17 97
b.2 b 2 4 23 120
c.2 c 2 5 1 98
d.2 d 2 1 22 37
e.2 e 2 2 49 163
f.2 f 2 6 19 131
You can also rearrange the column names yourself so that reshape can split them automatically:
names(df)=sub("(\\d)(.)","\\2.\\1",names(df))
reshape(df,varying= -1,idvar = "item",direction = "long")
If we are using tidyverse, we can gather into 'long' format, split the column name into its letter and digit parts, and then spread:
library(tidyverse)
out <- df %>%
gather(key, val, -item) %>%
mutate(key1 = gsub("\\d+", "", key),
key2 = gsub("\\D+", "", key)) %>%
select(-key) %>%
spread(key1, val) %>%
select(-key2)
head(out, 2)
# item fielda fieldb fieldc
#1 a 2 57 138
#2 a 3 17 97
Or a similar option is melt/dcast from data.table, where we melt into 'long' format, strip the digits from 'variable', and then dcast to 'wide' format:
library(data.table)
dcast(melt(setDT(df), id.var = "item")[, variable := sub("\\d+", "", variable)
], item + rowid(variable) ~ variable, value.var = 'value')[
, variable := NULL][]
# item fielda fieldb fieldc
# 1: a 2 57 138
# 2: a 3 17 97
# 3: b 6 39 77
# 4: b 4 23 120
# 5: c 3 37 153
# 6: c 5 1 98
# 7: d 4 4 99
# 8: d 1 22 37
# 9: e 1 12 141
#10: e 2 49 163
#11: f 5 10 194
#12: f 6 19 131
NOTE: This should also work when the groups are unbalanced, i.e. when the cases do not all have the same length.
data
set.seed(1)
df = data.frame(item = letters[1:6],
field1a=sample(6,6),
field1b=sample(60,6),
field1c=sample(200,6),
field2a=sample(6,6),
field2b=sample(60,6),
field2c=sample(200,6))
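In newer versions of tidyr, gather() and spread() have been superseded by pivot_longer() and pivot_wider(). A hedged sketch of the same reshaping with pivot_longer() and its special ".value" name is shown below. It assumes the tidyr package (version 1.0.0 or later) is installed; the column name set and the "_" separator are choices made for this illustration.

```r
library(tidyr)

set.seed(1)
df <- data.frame(item = letters[1:6],
                 field1a = sample(6, 6),   field1b = sample(60, 6),
                 field1c = sample(200, 6), field2a = sample(6, 6),
                 field2b = sample(60, 6),  field2c = sample(200, 6))

# Move the set number to the end: "field1a" -> "fielda_1".
# "item" contains no digit, so it is left unchanged.
names(df) <- sub("(\\d)(.)$", "\\2_\\1", names(df))

# ".value" tells pivot_longer() that the part before "_" names the
# output column, while the part after "_" goes into the column "set".
long <- pivot_longer(df, cols = -item,
                     names_to = c(".value", "set"),
                     names_sep = "_")
long
```

Drop the set column afterwards (for example with long$set <- NULL) if only item, fielda, fieldb and fieldc are needed.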

How to identify data which does not show link between two data sets? [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 4 years ago.
Dataset1:
id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1
Dataset2:
id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f
I have two data sets, linked by id1 and id2. How can I identify the rows in each data set that fail to link to the other?
We can use anti_join from the dplyr package to filter the rows with no match.
library(dplyr)
Dataset1_anti <- Dataset1 %>% anti_join(Dataset2, by = c("id1", "id2"))
Dataset1_anti
# id1 id2 abc n
# 1 2 121 no 1
# 2 3 122 yes 2
Dataset2_anti <- Dataset2 %>% anti_join(Dataset1, by = c("id1", "id2"))
Dataset2_anti
# id1 id2 age gen
# 1 2 1 52 f
# 2 121 122 41 f
# 3 121 122 44 m
# 4 4 221 56 m
DATA
Dataset1 <- read.table(text = "id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1 ",
header = TRUE, stringsAsFactors = FALSE)
Dataset2 <- read.table(text = "id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f ",
header = TRUE, stringsAsFactors = FALSE)
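If adding dplyr is not an option, a similar result can be sketched in base R by building a combined key from id1 and id2 and testing membership with %in%. The "_" separator is an assumption; it only needs to be a string that cannot occur inside the ids. Only the two key columns are reproduced here to keep the example short.

```r
# Key columns only, copied from the data above
Dataset1 <- data.frame(id1 = 1:6,
                       id2 = c(111, 121, 122, 224, 441, 665))
Dataset2 <- data.frame(id1 = c(1, 1, 2, 121, 121, 4, 4, 5, 5, 5, 6),
                       id2 = c(111, 111, 1, 122, 122, 224, 221,
                               441, 441, 441, 665))

# Combine both id columns into a single key per row
key1 <- paste(Dataset1$id1, Dataset1$id2, sep = "_")
key2 <- paste(Dataset2$id1, Dataset2$id2, sep = "_")

# Rows of each data set whose key is absent from the other one
Dataset1_anti <- Dataset1[!key1 %in% key2, ]
Dataset2_anti <- Dataset2[!key2 %in% key1, ]

Dataset1_anti
Dataset2_anti
```

Unlike anti_join, this keeps the original row names, which can help when tracing the unmatched rows back to the source.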

Calculate Ranks for Each Group, but counting ties as 1

Following up from this post:
Calculate ranks for each group
df <- ddply(df, .(type), transform, pos = rank(x, ties.method = "min")-1)
Using the method described in the above post, when you have multiple ties within the same type, the ranking output (Pos) gets a little messy and hard to interpret, though it is technically still accurate.
For example:
library(plyr)
df <- data.frame(type = c(rep("a",11), rep("b",6), rep("c",2), rep("d", 6)),
x = c(50:53, rep(54, 3), 55:56, rep(57, 2), rep(51,3), rep(52,2), 56,
53, 57, rep(52, 2), 54, rep(58, 2), 70))
df<-ddply(df,.(type),transform, pos=rank(x,ties.method="min")-1)
Produces:
Type X Pos
a 50 0
a 51 1
a 52 2
a 53 3
a 54 4
a 54 4
a 54 4
a 55 7
a 56 8
a 57 9
a 57 9
b 51 0
b 51 0
b 51 0
b 52 3
b 52 3
b 56 5
c 53 0
c 57 1
d 52 0
d 52 0
d 54 2
d 58 3
d 58 3
d 70 5
The Pos relative ranking is correct (equal values are ranked the same, lower values ranked lower, and higher values ranked higher), but I have been trying to make the output look prettier. Any thoughts?
I'd like to get the output to look like this:
Type X Pos
a 50 1
a 51 2
a 52 3
a 53 4
a 54 5
a 54 5
a 54 5
a 55 6
a 56 7
a 57 8
a 57 8
b 51 1
b 51 1
b 51 1
b 52 2
b 52 2
b 56 3
c 53 1
c 57 2
d 52 1
d 52 1
d 54 2
d 58 3
d 58 3
d 70 4
This format, of course, assumes that the total number of records for each group doesn't matter. By taking away the "-1", we can remove the 0's, but that only solves one aspect. I've tried playing around with different equations and ties.method values, but to no avail.
Maybe the rank() function isn't what I should be using?
It seems you are looking for a dense rank, which data.table provides through frank():
library(data.table)
as.data.table(df)[, pos := frank(x, ties.method = 'dense'), by = 'type'][]
# type x pos
# 1: a 50 1
# 2: a 51 2
# 3: a 52 3
# 4: a 53 4
# 5: a 54 5
# 6: a 54 5
# 7: a 54 5
# 8: a 55 6
# 9: a 56 7
# 10: a 57 8
# 11: a 57 8
# 12: b 51 1
# 13: b 51 1
# 14: b 51 1
# 15: b 52 2
# 16: b 52 2
# 17: b 56 3
# 18: c 53 1
# 19: c 57 2
# 20: d 52 1
# 21: d 52 1
# 22: d 54 2
# 23: d 58 3
# 24: d 58 3
# 25: d 70 4
# type x pos
dense_rank in dplyr does the same thing:
library(dplyr)
df %>% group_by(type) %>% mutate(pos = dense_rank(x)) %>% ungroup()
# # A tibble: 25 x 3
# type x pos
# <fctr> <dbl> <int>
# 1 a 50 1
# 2 a 51 2
# 3 a 52 3
# 4 a 53 4
# 5 a 54 5
# 6 a 54 5
# 7 a 54 5
# 8 a 55 6
# 9 a 56 7
# 10 a 57 8
# # ... with 15 more rows
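For completeness, a dense rank can also be sketched in base R without extra packages: within each group, match every value against the sorted unique values of that group. The small data frame below is a shortened, made-up example, not the one from the question.

```r
# Shortened illustrative data with ties inside each group
df <- data.frame(type = c("a", "a", "a", "b", "b", "b"),
                 x    = c(50, 54, 54, 51, 51, 56))

# ave() applies the function to x separately for each type;
# match() returns the position of each value among the sorted
# unique values, which is exactly the dense rank.
df$pos <- ave(df$x, df$type,
              FUN = function(v) match(v, sort(unique(v))))
df
```

An equivalent choice inside the function would be as.integer(factor(v)), since factor levels are also the sorted unique values.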

Creating an index for each subject in R

I'm working with some data on repeated measures of subjects over time. The data is in this format:
Subject <- as.factor(c(rep("A", 20), rep("B", 35), rep("C", 13)))
variable.A <- rnorm(n = length(Subject), mean = 300, sd = 50)
dat <- data.frame(Subject, variable.A)
dat
Subject variable.A
1 A 334.6567
2 A 353.0988
3 A 244.0863
4 A 284.8918
5 A 302.6442
6 A 298.3162
7 A 271.4864
8 A 268.6848
9 A 262.3761
10 A 341.4224
11 A 190.4823
12 A 297.1981
13 A 319.8346
14 A 343.9855
15 A 332.5318
16 A 221.9502
17 A 412.9172
18 A 283.4206
19 A 310.9847
20 A 276.5423
21 B 181.5418
22 B 340.5812
23 B 348.5162
24 B 364.6962
25 B 312.2508
26 B 278.9855
27 B 242.8810
28 B 272.9585
29 B 239.2776
30 B 254.9140
31 B 253.8940
32 B 330.1918
33 B 300.7302
34 B 237.6511
35 B 314.4919
36 B 239.6195
37 B 282.7955
38 B 260.0943
39 B 396.5310
40 B 325.5422
41 B 374.8063
42 B 363.1897
43 B 258.0310
44 B 358.8605
45 B 251.8775
46 B 299.6995
47 B 303.4766
48 B 359.8955
49 B 299.7089
50 B 289.3128
51 B 401.7680
52 B 276.8078
53 B 441.4852
54 B 232.6222
55 B 305.1977
56 C 298.4580
57 C 210.5164
58 C 272.0228
59 C 282.0540
60 C 207.8797
61 C 263.3859
62 C 324.4417
63 C 273.5904
64 C 348.4389
65 C 174.2979
66 C 363.4353
67 C 260.8548
68 C 306.1833
I've used the seq_along() function and the dplyr package to create an index of each observation for every subject:
dat <- as.data.frame(dat %>%
group_by(Subject) %>%
mutate(index = seq_along(Subject)))
Subject variable.A index
1 A 334.6567 1
2 A 353.0988 2
3 A 244.0863 3
4 A 284.8918 4
5 A 302.6442 5
6 A 298.3162 6
7 A 271.4864 7
8 A 268.6848 8
9 A 262.3761 9
10 A 341.4224 10
11 A 190.4823 11
12 A 297.1981 12
13 A 319.8346 13
14 A 343.9855 14
15 A 332.5318 15
16 A 221.9502 16
17 A 412.9172 17
18 A 283.4206 18
19 A 310.9847 19
20 A 276.5423 20
21 B 181.5418 1
22 B 340.5812 2
23 B 348.5162 3
24 B 364.6962 4
25 B 312.2508 5
26 B 278.9855 6
27 B 242.8810 7
28 B 272.9585 8
29 B 239.2776 9
30 B 254.9140 10
31 B 253.8940 11
32 B 330.1918 12
33 B 300.7302 13
34 B 237.6511 14
35 B 314.4919 15
36 B 239.6195 16
37 B 282.7955 17
38 B 260.0943 18
39 B 396.5310 19
40 B 325.5422 20
41 B 374.8063 21
42 B 363.1897 22
43 B 258.0310 23
44 B 358.8605 24
45 B 251.8775 25
46 B 299.6995 26
47 B 303.4766 27
48 B 359.8955 28
49 B 299.7089 29
50 B 289.3128 30
51 B 401.7680 31
52 B 276.8078 32
53 B 441.4852 33
54 B 232.6222 34
55 B 305.1977 35
56 C 298.4580 1
57 C 210.5164 2
58 C 272.0228 3
59 C 282.0540 4
60 C 207.8797 5
61 C 263.3859 6
62 C 324.4417 7
63 C 273.5904 8
64 C 348.4389 9
65 C 174.2979 10
66 C 363.4353 11
67 C 260.8548 12
68 C 306.1833 13
What I'm now looking to do is set up an analysis that looks at every 10 observations, so I'd like to create another column that basically gives me a number for every 10 observations. For example, Subject A would have a sequence of ten "1's" followed by a sequence of ten "2's" (i.e., two groupings of 10). I've tried to use the rep() function but the issue I'm running into is that the other subjects don't have a number of observations that is divisible by 10.
Is there a way for the rep() function to just assign the grouping the next number, even if it doesn't have 10 total observations? For example, Subject B would have ten "1's", ten "2's" and then five "3's" (representing that his last group of observations)?
You can use integer division %/% to generate the ids:
dat %>%
group_by(Subject) %>%
mutate(chunk_id = (seq_along(Subject) - 1) %/% 10 + 1) -> dat1
table(dat1$Subject, dat1$chunk_id)
# 1 2 3 4
# A 10 10 0 0
# B 10 10 10 5
# C 10 3 0 0
For a plain vanilla base R solution, you also could try this:
dat$newcol <- 1
dat$index <- ave(dat$newcol, dat$Subject, FUN = cumsum)
dat$chunk_id <- (dat$index - 1) %/% 10 + 1
which, when you run the table command as above gives you
table(dat$Subject, dat$chunk_id)
1 2 3 4
A 10 10 0 0
B 10 10 10 5
C 10 3 0 0
If you don't want the extra 'newcol' column, just use 'NULL' to get rid of it:
dat$newcol <- NULL
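The helper column can also be avoided altogether. The sketch below computes the chunk id in a single ave() call; the chunk size of 10 is kept in a variable named size, which is an illustrative name and not part of the answers above.

```r
# Same group sizes as in the question: 20, 35 and 13 observations
Subject <- c(rep("A", 20), rep("B", 35), rep("C", 13))
size <- 10

# Within each subject, number the rows 1, 2, 3, ... and map
# rows 1-10 to chunk 1, rows 11-20 to chunk 2, and so on.
chunk_id <- ave(seq_along(Subject), Subject,
                FUN = function(i) (seq_along(i) - 1) %/% size + 1)

table(Subject, chunk_id)
```

Subtracting 1 before dividing is what keeps the 10th, 20th, ... observation in the chunk it closes rather than pushing it into the next one.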
