I have a file which contains data format like this:
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
A_row 17 16 10 12 9 15 10 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 3 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
71 72 73 74 75 76 77 78 80 81 83 84 85 86 87 88 89 90 94 97 103 104
A_row 1 6 0 2 9 5 1 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 2 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
Is there any way to read this format into R? Thanks!
library(stringi)
library(dplyr)
library(magrittr)
library(tidyr)
text =
"48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
A_row 17 16 10 12 9 15 10 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 3 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1
71 72 73 74 75 76 77 78 80 81 83 84 85 86 87 88 89 90 94 97 103 104
A_row 1 6 0 2 9 5 1 19 9 15 7 3 5 12 6 4 6 8 1 7 6 5 4
B_row 2 5 1 5 2 0 3 1 2 2 3 1 3 2 1 2 1 1 1 0 0 1 1"
df =
  text %>%
  # split over newlines (could also be accomplished by readLines)
  stri_split_fixed(pattern = "\n") %>%
  # need to take first list corresponding to text
  extract2(1) %>%
  # make the text a column in the dataframe
  {tibble(values = .)} %>%
  # identify rows based on what type of data they contain
  # assume a repeating pattern every 3 lines
  mutate(variable = c("id", "A_row", "B_row") %>% rep(length.out = n())) %>%
  # for each type of data
  group_by(variable) %>%
  summarize(value =
              values %>%
              # concatenate all values
              paste(collapse = " ") %>%
              # remove headers (might need to modify regex)
              stri_replace_all_regex("[A-Z]_row ", "") %>%
              # split as space separated data
              stri_split_regex(pattern = " +")) %>%
  # unnest the lists
  unnest(value) %>%
  # make values numeric
  mutate(value = as.numeric(value)) %>%
  # for each variable, number 1 through n() to guess new row ID's
  group_by(variable) %>%
  mutate(n = 1:n()) %>%
  # reshape data
  spread(variable, value)
As commented above, one approach would be to use read.delim (maybe in chunks using skip & nrows), and then cbind to reassemble them.
Depending on the file (as pasted, it looks like it might need additional preprocessing before read.delim can be used), another approach would be to use readLines and strsplit.
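The readLines and strsplit route could be sketched roughly as follows. This is only a minimal sketch, assuming the file repeats in blocks of three lines (an id line, then A_row, then B_row) and that every line within a block carries the same number of values. The shortened sample and the names lines, kind, values and result are illustrative, not from the original post.

```r
# Shortened sample in the same layout as the question;
# with a real file this would be: lines <- readLines("your_file.txt")
lines <- c(
  "48 49 50",
  "A_row 17 16 10",
  "B_row 3 5 1",
  "71 72 73",
  "A_row 1 6 0",
  "B_row 2 5 1"
)

# Split every line on runs of spaces
tokens <- strsplit(trimws(lines), " +")

# Assumption: lines cycle through id, A_row, B_row
kind <- rep(c("id", "A_row", "B_row"), length.out = length(tokens))

# Drop the leading "A_row"/"B_row" label and convert the rest to numbers
values <- lapply(tokens, function(x) as.numeric(x[!grepl("_row$", x)]))

# Concatenate the pieces of each kind across blocks
result <- data.frame(
  id    = unlist(values[kind == "id"]),
  A_row = unlist(values[kind == "A_row"]),
  B_row = unlist(values[kind == "B_row"])
)
print(result)
```

If a block has an id line that is shorter than its data lines, data.frame will stop with an error about differing numbers of rows, which is a useful signal that the file needs cleaning first.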
How can I transpose specific columns in a data.frame such as this one:
id<- c(1,2,3)
t0<- c(0,0,0)
bp0<- c(88,95,79)
t1<- c(15,12,12)
bp1<- c(92,110,82)
t2<- c(25,30,20)
bp2<- c(75,99,88)
df1<- data.frame(id, t0, bp0, t1, bp1, t2, bp2)
df1
> df1
id t0 bp0 t1 bp1 t2 bp2
1 1 0 88 15 92 25 75
2 2 0 95 12 110 30 99
3 3 0 79 12 82 20 88
In order to obtain:
> df2
id t bp
1 1 0 88
2 2 0 95
3 3 0 79
4 1 15 92
5 2 12 110
6 3 12 82
7 1 25 75
8 2 30 99
9 3 20 88
That is, df2 should stack t (t0, t1, t2) and bp (bp0, bp1, bp2) into single columns for the corresponding "id".
Using Base R, you can do:
Reprex
Code
df2 <- cbind(df1[1], stack(df1, startsWith(names(df1), "t"))[1], stack(df1,startsWith(names(df1), "bp"))[1])
names(df2)[2:3] <- c("t", "bp")
Output
df2
#> id t bp
#> 1 1 0 88
#> 2 2 0 95
#> 3 3 0 79
#> 4 1 15 92
#> 5 2 12 110
#> 6 3 12 82
#> 7 1 25 75
#> 8 2 30 99
#> 9 3 20 88
Created on 2022-02-14 by the reprex package (v2.0.1)
Here is a solution with pivot_longer using names_pattern:
\\w+ = one or more word characters (letters, digits or underscore)
\\d+ = one or more digits
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(
    -id,
    names_to = c(".value", "name"),
    names_pattern = "(\\w+)(\\d+)"
  ) %>%
  select(-name)
id t bp
<dbl> <dbl> <dbl>
1 1 0 88
2 1 15 92
3 1 25 75
4 2 0 95
5 2 12 110
6 2 30 99
7 3 0 79
8 3 12 82
9 3 20 88
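One caveat on the pattern above: \\w also matches digits, so the greedy \\w+ leaves only the last digit for \\d+. That is harmless for t0 to t2, but a name such as t10 would be split into "t1" and "0". The base R sketch below illustrates this on hypothetical column names (t10 and bp10 are not in the original data) and shows a letters-only alternative. The anchors ^ and $ are added here for clarity; pivot_longer is expected to split names the same way, though that is an assumption rather than something checked here.

```r
# Hypothetical column names, including two-digit suffixes
nms <- c("t0", "bp0", "t10", "bp10")

# Prefix captured by the original pattern: \\w+ swallows leading digits
greedy_prefix <- sub("^(\\w+)(\\d+)$", "\\1", nms, perl = TRUE)

# Prefix captured when the first group is restricted to letters
letters_prefix <- sub("^([A-Za-z]+)(\\d+)$", "\\1", nms, perl = TRUE)

print(greedy_prefix)
print(letters_prefix)
```

With more than ten time points, names_pattern = "([A-Za-z]+)(\\d+)" would therefore be the safer choice.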
A base R option using reshape
reshape(
setNames(df1, sub("(\\d+)", ".\\1", names(df1))),
direction = "long",
idvar = "id",
varying = -1
)
gives
id time t bp
1.0 1 0 0 88
2.0 2 0 0 95
3.0 3 0 0 79
1.1 1 1 15 92
2.1 2 1 12 110
3.1 3 1 12 82
1.2 1 2 25 75
2.2 2 2 30 99
3.2 3 2 20 88
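The reshape result carries an extra time column and compound row names. If the goal is exactly the three-column df2 from the question, a possible follow-up is sketched below; the intermediate names renamed and long are illustrative only.

```r
# Data from the question
df1 <- data.frame(
  id  = c(1, 2, 3),
  t0  = c(0, 0, 0),    bp0 = c(88, 95, 79),
  t1  = c(15, 12, 12), bp1 = c(92, 110, 82),
  t2  = c(25, 30, 20), bp2 = c(75, 99, 88)
)

# Insert a "." before the digit so reshape can split names such as "t.0"
renamed <- setNames(df1, sub("(\\d+)", ".\\1", names(df1)))

long <- reshape(renamed, direction = "long", idvar = "id", varying = -1)

# Keep only the requested columns and reset the row names to 1..n
df2 <- long[, c("id", "t", "bp")]
rownames(df2) <- NULL
print(df2)
```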
I am looking to change the structure of my dataframe, but I am not really sure how to do it. I am not even sure how to word the question either.
ID <- c(1,8,6,2,4)
a <- c(111,94,85,76,72)
b <- c(75,37,86,55,62)
dataframe <- data.frame(ID,a,b)
ID a b
1 1 111 75
2 8 94 37
3 6 85 86
4 2 76 55
5 4 72 62
Above is the code with its output. However, I want the output to look like the following. The only way I know to do this is to type it in manually. Is there another way? I have quite a large data set that I would like to change, and doing it manually would take forever.
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
We may use pivot_longer
library(dplyr)
library(tidyr)
dataframe %>%
pivot_longer(cols = a:b, names_to = 'letter')
Output
# A tibble: 10 × 3
ID letter value
<dbl> <chr> <dbl>
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
A base R option using reshape:
df <- reshape(dataframe, direction = "long",
v.names = "value",
varying = 2:3,
times = names(dataframe)[2:3],
timevar = "letter",
idvar = "ID")
df <- df[ order(match(df$ID, dataframe$ID)), ]
row.names(df) <- NULL
Output
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
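For a layout this simple, the long table can also be assembled by hand with rep and t. This is a minimal sketch, assuming all value columns have the same type; value_cols and long are names introduced here for illustration.

```r
# Data from the question
dataframe <- data.frame(
  ID = c(1, 8, 6, 2, 4),
  a  = c(111, 94, 85, 76, 72),
  b  = c(75, 37, 86, 55, 62)
)

value_cols <- c("a", "b")

long <- data.frame(
  # each ID is repeated once per value column
  ID     = rep(dataframe$ID, each = length(value_cols)),
  # column names cycle a, b, a, b, ...
  letter = rep(value_cols, times = nrow(dataframe)),
  # transposing puts each row's values next to each other before flattening
  value  = as.vector(t(dataframe[value_cols]))
)
print(long)
```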
I would like to apply a function that selects the best transformation of certain variables in a data frame, and then adds new columns to the data frame with the transformed data. I can currently get the transformation to run as follows. However, this rewrites the existing data, instead of adding new, transformed variables. I have seen the other stackoverflow posts about dynamically-added variables but can't quite seem to get it to work. Here is what I have:
df <- data.frame(study_id = c(1:10),
v1 = (sample(1:100, 10)),
v2 = (sample(1:100, 10)),
v3 = (sample(1:100, 10)),
v4 = (sample(1:100, 10)))
library(dplyr)  # provides %>%, mutate() and across()
library(bestNormalize)
transformed <- function(x) {
bn <- bestNormalize(x)
return(bn$x.t)
}
df <- df %>%
mutate(across(c(2,4:5), transformed))
Current output:
study_id v1 v2 v3 v4
1 1 -0.001846842 43 0.6559159 0.37893888
2 2 -2.416625847 81 -1.2998111 -0.64356058
3 3 1.012132345 95 -1.5086228 -0.48845289
4 4 0.798561562 2 0.8301299 0.30168982
5 5 -0.257460026 35 0.1322051 0.78737617
6 6 -0.179681789 42 -1.1352463 -2.42438347
7 7 0.378206706 22 -0.3635088 0.79583687
8 8 0.909304988 70 1.0748401 0.63712357
9 9 0.325879668 32 0.9041796 -0.09711216
10 10 -0.568470765 7 0.7099185 0.75254380
Desired output:
study_id v1 v2 v3 v4 v1_transformed v3_transformed v4_transformed
1 1 72 7 87 100 4 3 2
2 2 57 78 64 69 10 8 6
3 3 35 65 83 96 3 5 4
4 4 24 58 94 53 6 10 10
5 5 100 62 82 63 -1 7 3
6 6 47 55 4 50 8 4 1
7 7 83 97 35 41 7 2 -1
8 8 78 86 22 73 1 -1 9
9 9 11 39 93 68 2 0 7
10 10 36 49 8 72 0 1 0
Many thanks in advance.
Use the .names= argument of across:
df %>%
mutate(across(c(2,4:5), transformed, .names = "{.col}_transformed"))
giving:
study_id v1 v2 v3 v4 v1_transformed v3_transformed v4_transformed
1 1 50 72 12 7 0.3850197 -0.7916019 -1.9775107
2 2 53 82 61 42 0.4425318 0.6132865 0.6790496
3 3 3 12 90 20 -2.3661268 0.9496526 -0.4232995
4 4 20 84 37 21 -0.5190229 0.1809655 -0.3508475
5 5 55 54 4 23 0.4790925 -1.7301008 -0.2157362
6 6 61 96 85 74 0.5812924 0.9002185 1.5209888
7 7 52 94 22 38 0.4237308 -0.2683955 0.5302984
8 8 72 41 57 35 0.7449435 0.5546340 0.4080778
9 9 13 67 6 45 -0.9434502 -1.3866702 0.7815968
10 10 74 48 93 14 0.7719892 0.9780114 -0.9526174
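The same idea, adding suffixed columns instead of overwriting, can be sketched in base R. Note the assumptions: scale() is used here only as a stand-in for the bestNormalize-based function so that the example runs without that package, and set.seed(1) and the name cols are illustrative.

```r
set.seed(1)
df <- data.frame(
  study_id = 1:10,
  v1 = sample(1:100, 10),
  v2 = sample(1:100, 10),
  v3 = sample(1:100, 10),
  v4 = sample(1:100, 10)
)

# Stand-in transformation (assumption): centre and scale each column
transformed <- function(x) as.numeric(scale(x))

# Columns 2, 4 and 5, as in the question
cols <- names(df)[c(2, 4:5)]

# Assigning to new names appends columns and leaves the originals untouched
df[paste0(cols, "_transformed")] <- lapply(df[cols], transformed)
print(names(df))
```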
I want to label columns with an ascending number. The reason is that in a bigger dataset I want to be able to sort the columns so they end up in the right order.
How do I code this? Thanks!
set.seed(8)
id <- 1:6
diet <- rep(c("A","B"),3)
period <- rep(c(1,2),3)
score1 <- sample(1:100,6)
score2 <- sample(1:100,6)
score3 <- sample(1:100,6)
df <- data.frame(id, diet, period, score1, score2,score3)
df
id diet period score1 score2 score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
It should look like:
x1id x2diet x3period x4score1 x5score2 x6score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
I was thinking something like this, but something is missing....
colnames(wellbeing) <- paste(1:ncol, colnames(wellbeing))
Other options:
colnames(df) <- paste0('x', 1:dim(df)[2], colnames(df))
or
df %>%
dplyr::rename_all(~ paste0('x', 1:ncol(df), .))
Both methods would yield the same output:
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
You can use :
names(df) <- paste0('x', seq_along(df), names(df))
df
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
Maybe add an underscore?
names(df) <- paste0('x', seq_along(df), "_", names(df))
names(df)
#[1] "x1_id" "x2_diet" "x3_period" "x4_score1" "x5_score2" "x6_score3"
Here is a mapply approach.
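# build the new names pairwise and assign them back to the data frame
names(df) <- mapply(paste0, paste0("x", 1:ncol(df)), names(df), USE.NAMES = FALSE)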
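Since the stated goal is sorting, note that plain numbers sort as text, so "x10" would come before "x2" once there are ten or more columns. Zero-padding the index avoids that. The sketch below uses a hypothetical 12-column data frame (not the one from the question) and formatC for the padding.

```r
# Hypothetical wide data frame with 12 columns
df <- data.frame(matrix(0, nrow = 1, ncol = 12))
names(df) <- paste0("score", 1:12)

# Pad the index to the number of digits in the column count
width <- nchar(ncol(df))
index <- formatC(seq_along(df), width = width, flag = "0")

names(df) <- paste0("x", index, "_", names(df))
print(names(df))
```

With the padding in place, sorting the names alphabetically keeps the columns in their original order.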
I have the following code for a Netflix experiment that reduces the price of Netflix to see whether people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)
viewing_data <-
  tibble(unique_person_id = sample(x = c(1:100),
                                   size = sample_size,
                                   replace = TRUE),
         tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                          size = sample_size,
                          replace = TRUE)
  )
I then want to write some code that randomly assigns people to one of two groups, treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R so that I can assign each person to be either treated or not. A person should not be both treated and not treated, but each person appears many times, once per tv_show viewing. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))
viewing_data %>%
left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
You can do the below, this groups your observations by person id, assigns a unique "treat/control" per group:
library(dplyr)
viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check our results, all of the ids have only 1 group of treated / control:
newdata <- viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
tapply(newdata$group,newdata$unique_person_id,n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=sample(100), # in case the ids are not truly random
group=ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=complete_ra(N=100, m=50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows
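For reference, the same person-level assignment with balanced groups can be sketched in base R alone. This is a minimal sketch rather than a replacement for the approaches above: the data are regenerated with a plain data.frame, and persons and assigned are names introduced here for illustration.

```r
set.seed(853)
sample_size <- 10000
shows <- c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown")

viewing_data <- data.frame(
  unique_person_id = sample(1:100, size = sample_size, replace = TRUE),
  tv_show = sample(shows, size = sample_size, replace = TRUE)
)

# One row per person
ids <- unique(viewing_data$unique_person_id)

# Build a balanced vector of labels, then shuffle it across persons
persons <- data.frame(
  unique_person_id = ids,
  group = sample(rep(c("treated", "control"), length.out = length(ids)))
)

# Attach each person's group to every one of their viewing rows
assigned <- merge(viewing_data, persons, by = "unique_person_id")
print(table(persons$group))
```

Because the labels are assigned on the one-row-per-person table before merging, no person can end up in both groups.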