How can I transpose specific columns in a data.frame such as this one:
id<- c(1,2,3)
t0<- c(0,0,0)
bp0<- c(88,95,79)
t1<- c(15,12,12)
bp1<- c(92,110,82)
t2<- c(25,30,20)
bp2<- c(75,99,88)
df1<- data.frame(id, t0, bp0, t1, bp1, t2, bp2)
df1
> df1
id t0 bp0 t1 bp1 t2 bp2
1 1 0 88 15 92 25 75
2 2 0 95 12 110 30 99
3 3 0 79 12 82 20 88
In order to obtain:
> df2
id t bp
1 1 0 88
2 2 0 95
3 3 0 79
4 1 15 92
5 2 12 110
6 3 12 82
7 1 25 75
8 2 30 99
9 3 20 88
That is, df2 should stack t (t0, t1, t2) and bp (bp0, bp1, bp2) into single columns for the corresponding "id".
Using Base R, you can do:
Reprex
Code
df2 <- cbind(df1[1], stack(df1, startsWith(names(df1), "t"))[1], stack(df1,startsWith(names(df1), "bp"))[1])
names(df2)[2:3] <- c("t", "bp")
Output
df2
#> id t bp
#> 1 1 0 88
#> 2 2 0 95
#> 3 3 0 79
#> 4 1 15 92
#> 5 2 12 110
#> 6 3 12 82
#> 7 1 25 75
#> 8 2 30 99
#> 9 3 20 88
Created on 2022-02-14 by the reprex package (v2.0.1)
Here is a solution with pivot_longer using names_pattern:
\\w+ = one or more word characters (letters, digits, or underscore)
\\d+ = one or more digits
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer (
-id,
names_to = c(".value", "name"),
names_pattern = "(\\w+)(\\d+)"
) %>%
select(-name)
id t bp
<dbl> <dbl> <dbl>
1 1 0 88
2 1 15 92
3 1 25 75
4 2 0 95
5 2 12 110
6 2 30 99
7 3 0 79
8 3 12 82
9 3 20 88
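One caveat, offered as a hedged aside: because \\w also matches digits, the pattern "(\\w+)(\\d+)" splits the names as intended only when the numeric suffix is a single digit. The sketch below uses base R's sub() with perl = TRUE to show where the split lands. The column name "bp10" and the stricter pattern "([A-Za-z]+)(\\d+)" are illustrative assumptions, not part of the original answer.

```r
# Hypothetical column names; "bp10" does not occur in the question's data
nms <- c("t0", "bp0", "bp10")

# Greedy \\w+ also consumes digits and leaves only the last digit for \\d+
loose <- sub("(\\w+)(\\d+)", "\\1|\\2", nms, perl = TRUE)

# Restricting the first group to letters keeps the whole number together
strict <- sub("([A-Za-z]+)(\\d+)", "\\1|\\2", nms, perl = TRUE)

print(loose)   # "t|0" "bp|0" "bp1|0"
print(strict)  # "t|0" "bp|0" "bp|10"
```

For the t0..t2 and bp0..bp2 columns in the question, both patterns split the names the same way.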
A base R option using reshape
reshape(
setNames(df1, sub("(\\d+)", ".\\1", names(df1))),
direction = "long",
idvar = "id",
varying = -1
)
gives
id time t bp
1.0 1 0 0 88
2.0 2 0 0 95
3.0 3 0 0 79
1.1 1 1 15 92
2.1 2 1 12 110
3.1 3 1 12 82
1.2 1 2 25 75
2.2 2 2 30 99
3.2 3 2 20 88
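The reshape() result carries an extra time column and compound row names that the requested df2 does not have. A minimal sketch of the cleanup, assuming the same df1 as in the question (the object name long is an assumption made for illustration):

```r
# Same data as df1 in the question
df1 <- data.frame(
  id  = c(1, 2, 3),
  t0  = c(0, 0, 0),    bp0 = c(88, 95, 79),
  t1  = c(15, 12, 12), bp1 = c(92, 110, 82),
  t2  = c(25, 30, 20), bp2 = c(75, 99, 88)
)

# Insert a "." before the digit so reshape() can split names into variable and time
long <- reshape(
  setNames(df1, sub("(\\d+)", ".\\1", names(df1))),
  direction = "long",
  idvar = "id",
  varying = -1
)

# Drop the helper column and reset the row names to 1..n
long$time <- NULL
rownames(long) <- NULL
df2 <- long
print(df2)
```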
Related
I have a large dataframe with 400 columns of baseline and follow-up scores (and 10,000 subjects). Each letter represents a score, and I would like to calculate the difference between the follow-up and baseline for each score in a new column:
subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup
    1              100              150                5                2               80               70
    2              120              142               10                9               79               42
    3              111              146               60               49               89               46
    4              152              148                4                4               69               48
    5              110              123               20               18               60               23
    6              112              120                5                3               12               20
    7              111              145                6                4               11               45
I'd like to calculate the difference between followup and baseline for each score in a new column like this:
df$a_score_difference = df$a_score.followup - df$a_score.baseline
Any ideas on how to do this efficiently? I really appreciate your help.
code to generate sample data:
subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)
df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)
base R
scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13 8 34
# $b_score.followup
# [1] -3 -1 -11 0 -2 -2 -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37 8 34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1 1 100 150 5 2 80 70 50
# 2 2 120 142 10 9 79 42 22
# 3 3 111 146 60 49 89 46 35
# 4 4 152 148 4 4 69 48 -4
# 5 5 110 123 20 18 60 23 13
# 6 6 112 120 5 3 12 20 8
# 7 7 111 145 6 4 11 45 34
# b_score.difference c_score.difference
# 1 -3 -10
# 2 -1 -37
# 3 -11 -43
# 4 0 -21
# 5 -2 -37
# 6 -2 8
# 7 -2 34
When this runs unsupervised, it is possible that not every followup column has a matching baseline column, which could misalign the pairs or cause an error. You might include a test to validate the presence and order:
all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE
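A more compact base R variant is sketched below. It builds the baseline names directly from the followup names, so the pairing does not depend on sort order, and it subtracts the two column sets in one step. This is an alternative sketch, not the answer's method; the variable names fu and bl are assumptions, and only a subset of the sample columns is used to keep it short.

```r
# Subset of the question's sample data
df <- data.frame(
  subid = 1:7,
  a_score.baseline = c(100, 120, 111, 152, 110, 112, 111),
  a_score.followup = c(150, 142, 146, 148, 123, 120, 145),
  b_score.baseline = c(5, 10, 60, 4, 20, 5, 6),
  b_score.followup = c(2, 9, 49, 4, 18, 3, 4)
)

# Followup columns, and the baseline column each one should pair with
fu <- grep("_score\\.followup$", names(df), value = TRUE)
bl <- sub("followup$", "baseline", fu)

# Fail early if any followup column lacks its baseline counterpart
stopifnot(all(bl %in% names(df)))

# Column-wise subtraction of two equally sized data frames
df[sub("followup$", "difference", fu)] <- df[fu] - df[bl]
print(df)
```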
dplyr
You might consider pivoting the data into a longer format. This can be done in base R as well, but it looks a lot simpler when done here:
library(dplyr)
# library(tidyr) # pivot_*
df %>%
tidyr::pivot_longer(
-subid,
names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline)
# # A tibble: 21 x 5
# subid ltr baseline followup difference
# <int> <chr> <dbl> <dbl> <dbl>
# 1 1 a 100 150 50
# 2 1 b 5 2 -3
# 3 1 c 80 70 -10
# 4 2 a 120 142 22
# 5 2 b 10 9 -1
# 6 2 c 79 42 -37
# 7 3 a 111 146 35
# 8 3 b 60 49 -11
# 9 3 c 89 46 -43
# 10 4 a 152 148 -4
# # ... with 11 more rows
Honestly, I tend to prefer a long format most of the time for many reasons. If, however, you want to make it wide again, then
df %>%
tidyr::pivot_longer(
-subid, names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline) %>%
tidyr::pivot_wider(
names_from = "ltr",
values_from = c("baseline", "followup", "difference"),
names_glue = "{ltr}_score.{.value}")
# # A tibble: 7 x 10
# subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 100 5 80 150 2 70 50 -3 -10
# 2 2 120 10 79 142 9 42 22 -1 -37
# 3 3 111 60 89 146 49 46 35 -11 -43
# 4 4 152 4 69 148 4 48 -4 0 -21
# 5 5 110 20 60 123 18 23 13 -2 -37
# 6 6 112 5 12 120 3 20 8 -2 8
# 7 7 111 6 11 145 4 45 34 -2 34
dplyr #2
This is a keep-it-wide approach (no pivoting), which will be more efficient than the pivot-mutate-pivot above if you have no intention of working on the data in a longer format.
df %>%
mutate(across(
ends_with("score.followup"),
~ . - cur_data()[[sub("followup", "baseline", cur_column())]],
.names = "{sub('followup', 'difference', col)}")
)
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1 1 100 150 5 2 80 70 50 -3 -10
# 2 2 120 142 10 9 79 42 22 -1 -37
# 3 3 111 146 60 49 89 46 35 -11 -43
# 4 4 152 148 4 4 69 48 -4 0 -21
# 5 5 110 123 20 18 60 23 13 -2 -37
# 6 6 112 120 5 3 12 20 8 -2 8
# 7 7 111 145 6 4 11 45 34 -2 34
I am looking to change the structure of my dataframe, but I am not really sure how to do it. I am not even sure how to word the question either.
ID <- c(1,8,6,2,4)
a <- c(111,94,85,76,72)
b <- c(75,37,86,55,62)
dataframe <- data.frame(ID,a,b)
ID a b
1 1 111 75
2 8 94 37
3 6 85 86
4 2 76 55
5 4 72 62
Above is the code with its output. However, I want the output to look like the following, and the only way I know how to do this is to type it manually. Is there a way to do this without changing the input by hand? I have quite a large data set that I would like to change, and doing it manually would take forever.
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
We may use pivot_longer
library(dplyr)
library(tidyr)
dataframe %>%
pivot_longer(cols = a:b, names_to = 'letter')
-output
# A tibble: 10 × 3
ID letter value
<dbl> <chr> <dbl>
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
A base R option using reshape:
df <- reshape(dataframe, direction = "long",
v.names = "value",
varying = 2:3,
times = names(dataframe)[2:3],
timevar = "letter",
idvar = "ID")
df <- df[ order(match(df$ID, dataframe$ID)), ]
row.names(df) <- NULL
Output
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
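For comparison, the same long layout can be assembled by hand with rep() and t(), which makes the row ordering explicit. This is only a sketch under the assumption that all value columns share one type (here numeric); value_cols and long are names chosen for illustration.

```r
dataframe <- data.frame(
  ID = c(1, 8, 6, 2, 4),
  a  = c(111, 94, 85, 76, 72),
  b  = c(75, 37, 86, 55, 62)
)

value_cols <- c("a", "b")

long <- data.frame(
  # each ID is repeated once per value column
  ID = rep(dataframe$ID, each = length(value_cols)),
  # the column names cycle a, b, a, b, ...
  letter = rep(value_cols, times = nrow(dataframe)),
  # transposing first makes the values come out row by row: a1, b1, a2, b2, ...
  value = as.vector(t(dataframe[value_cols]))
)
print(long)
```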
I want to label columns with an ascending number. The reason is that in a bigger dataset I want to be able to sort the columns so they end up in the right order.
How do I code this? Thanks!
set.seed(8)
id <- 1:6
diet <- rep(c("A","B"),3)
period <- rep(c(1,2),3)
score1 <- sample(1:100,6)
score2 <- sample(1:100,6)
score3 <- sample(1:100,6)
df <- data.frame(id, diet, period, score1, score2,score3)
df
id diet period score1 score2 score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
It should look like:
x1id x2diet x3period x4score1 x5score2 x6score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
I was thinking something like this, but something is missing....
colnames(wellbeing) <- paste(1:ncol, colnames(wellbeing))
Other options:
colnames(df) <- paste0('x', 1:dim(df)[2], colnames(df))
or
df %>%
dplyr::rename_all(~ paste0('x', 1:ncol(df), .))
Both methods would yield the same output:
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
You can use:
names(df) <- paste0('x', seq_along(df), names(df))
df
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
Maybe add an underscore?
names(df) <- paste0('x', seq_along(df), "_", names(df))
names(df)
#[1] "x1_id" "x2_diet" "x3_period" "x4_score1" "x5_score2" "x6_score3"
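Since the stated goal is to sort columns by these names, it may be worth zero-padding the number: sorted as text, "x10_..." comes before "x2_...". A minimal sketch, assuming a hypothetical 12-column data frame (the data and the names width and idx are made up for illustration):

```r
# Hypothetical data frame with 12 columns
df <- as.data.frame(matrix(0, nrow = 1, ncol = 12))
names(df) <- paste0("score", 1:12)

# Number of digits needed for the largest column index
width <- nchar(ncol(df))

# Zero-padded indices: "01", "02", ..., "12"
idx <- formatC(seq_along(df), width = width, flag = "0")

names(df) <- paste0("x", idx, "_", names(df))
print(names(df))
```

With the padding, sorting the names as text keeps the original column order.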
Here is a mapply approach.
mapply(paste0, paste0("x", 1:ncol(df)), names(df))
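Note that the mapply call on its own only returns the new names as a named character vector; it does not rename anything. A short sketch of assigning the result, using a small stand-in data frame (the values here are illustrative, not the question's sampled data):

```r
# Small stand-in for the question's data frame
df <- data.frame(id = 1:3, diet = c("A", "B", "A"), score1 = c(47, 21, 79))

# USE.NAMES = FALSE drops the names mapply would otherwise attach to the result
new_names <- mapply(paste0, paste0("x", seq_along(df)), names(df), USE.NAMES = FALSE)

names(df) <- new_names
print(names(df))  # "x1id" "x2diet" "x3score1"
```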
I have the following code for a Netflix experiment that reduces the price of Netflix to see whether people watch more or less TV. Each time someone uses Netflix, it records what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)
viewing_data <-
  tibble(unique_person_id = sample(x = c(1:100),
                                   size = sample_size,
                                   replace = TRUE),
         tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                          size = sample_size,
                          replace = TRUE),
  )
I then want to write some code that randomly assigns people to one of two groups, treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to person level in R so that I can assign each person to be either treated or not. A person should not be both treated and not treated, but each person appears many times, once per tv_show viewing. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))
viewing_data %>%
left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
You can do the following, which groups the observations by person id and draws a single "treated"/"control" label per group:
library(dplyr)
viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results, all of the ids have only 1 group of treated / control:
newdata <- viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
tapply(newdata$group,newdata$unique_person_id,n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=sample(100), # in case the ids are not truly random
group=ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=complete_ra(N=100, m=50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows
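The same idea (assign at person level, then join back to the row-level data) can be sketched in base R as well. The data below are a smaller simulated stand-in, and the object names viewing, persons, and assigned are assumptions made for illustration.

```r
set.seed(853)

# Smaller simulated stand-in for viewing_data
viewing <- data.frame(
  unique_person_id = sample(1:100, size = 1000, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Shetland", "The Crown"),
                   size = 1000, replace = TRUE)
)

# One row per person
persons <- data.frame(unique_person_id = unique(viewing$unique_person_id))

# Shuffle an (almost) equal number of labels across persons
persons$group <- sample(rep(c("treated", "control"), length.out = nrow(persons)))

# Attach each person's group to every one of their viewing rows
assigned <- merge(viewing, persons, by = "unique_person_id")

# Every person should end up with exactly one group
groups_per_person <- tapply(assigned$group, assigned$unique_person_id,
                            function(g) length(unique(g)))
stopifnot(all(groups_per_person == 1))

print(table(persons$group))
```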
I have a file which contains data format like this:
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
A_row 17 16 10 12 9 15 10 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 3 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
71 72 73 74 75 76 77 78 80 81 83 84 85 86 87 88 89 90 94 97 103 104
A_row 1 6 0 2 9 5 1 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 2 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
Is there any way to read this format into R? Thanks! :>
library(stringi)
library(dplyr)
library(magrittr)
library(tidyr)
text =
"48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
A_row 17 16 10 12 9 15 10 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 3 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
71 72 73 74 75 76 77 78 80 81 83 84 85 86 87 88 89 90 94 97 103 104
A_row 1 6 0 2 9 5 1 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 2 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1"
df =
text %>%
# split over newlines (could also be accomplished by readLines)
stri_split_fixed(pattern = "\n") %>%
# need to take first list corresponding to text
extract2(1) %>%
# make the text a column in the dataframe
{tibble(values = .)} %>%
# identify rows based on what type of data they contain
# assume a repeating pattern every 3 lines
mutate(variable = c("id", "A_row", "B_row") %>% rep(length.out = n())) %>%
# for each type of data
group_by(variable) %>%
summarize(value =
values %>%
# concatenate all values
paste(collapse = " ") %>%
# remove headers (might need to modify regex)
stri_replace_all_regex("[A-Z]_row ", "") %>%
# split as space separated data
stri_split_regex(pattern = " +")) %>%
# unnest the lists
unnest(value) %>%
# make values numeric
mutate(value = as.numeric(value)) %>%
# for each variable, number 1 through n() to guess new row ID's
group_by(variable) %>%
mutate(n = 1:n()) %>%
# reshape data
spread(variable, value)
As commented above, one approach would be to use read.delim (maybe in chunks using skip & nrows), and then cbind to reassemble them.
Depending on the file (as pasted it looks like it might need additional preprocessing to be used with read.delim), another approach would be to use readLines and strsplit
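The readLines and strsplit idea could look roughly like the sketch below. It assumes the file repeats in blocks of three lines (ids, A_row, B_row) and that each block has as many ids as values; the pasted sample's second block has 22 ids but 23 values, so the real file would need checking first. A shortened inline text stands in for the file, and the names txt, fields, and result are illustrative.

```r
# Shortened stand-in for the file; for a real file use readLines("path/to/file")
txt <- "48 49 50
A_row 17 16 10
B_row 3 5 1
71 72 73
A_row 1 6 0
B_row 2 5 1"
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Split every line on runs of spaces
fields <- strsplit(trimws(lines), " +")

# Assumption: every 3 consecutive lines form one block (ids, A_row, B_row)
block <- rep(seq_len(length(lines) / 3), each = 3)

pieces <- lapply(split(fields, block), function(b) {
  data.frame(
    id    = as.numeric(b[[1]]),
    A_row = as.numeric(b[[2]][-1]),  # drop the "A_row" label
    B_row = as.numeric(b[[3]][-1])   # drop the "B_row" label
  )
})

# Stack the blocks into one table with a row per id
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```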