I am working with R.
I have a list of datasets, each of which should have 5 rows, one for each month (Jan-May). It should look like this:
data.frame(name = rep("B", 5),
doc_month = c("2022.01", "2022.02", "2022.03", "2022.04", "2022.05"),
i_name = rep("Aa",5),
aggregation = rep("34", 5))
But some of my datasets don't have data for certain months, or are completely empty, and therefore have fewer rows or no rows at all. Like this:
data.frame(name = "A",
doc_month = "2022.01",
i_name = "Aa",
aggregation = "34")
I would like to extend each dataset, even the empty ones, with the missing months, copy all the other information into each new row, and put a 0 for aggregation.
I tried to use expand and complete from tidyr but couldn't make it work.
You can use tidyr's complete, with purrr's reduce to bind the data frames together first.
I also changed aggregation to numeric: rep(34, 5).
library(tidyverse)
df1 <- data.frame(name = rep("B", 5),
doc_month = c("2022.01", "2022.02", "2022.03", "2022.04", "2022.05"),
i_name = rep("Aa",5),
aggregation = rep(34, 5))
df2 <- data.frame(name = "A",
doc_month = "2022.01",
i_name = "Aa",
aggregation = 34)
reduce(list(df1, df2, df1), bind_rows) |>
complete(doc_month, nesting(name, i_name), fill = list(aggregation = 0))
#> # A tibble: 15 × 4
#> doc_month name i_name aggregation
#> <chr> <chr> <chr> <dbl>
#> 1 2022.01 A Aa 34
#> 2 2022.01 B Aa 34
#> 3 2022.01 B Aa 34
#> 4 2022.02 A Aa 0
#> 5 2022.02 B Aa 34
#> 6 2022.02 B Aa 34
#> 7 2022.03 A Aa 0
#> 8 2022.03 B Aa 34
#> 9 2022.03 B Aa 34
#> 10 2022.04 A Aa 0
#> 11 2022.04 B Aa 34
#> 12 2022.04 B Aa 34
#> 13 2022.05 A Aa 0
#> 14 2022.05 B Aa 34
#> 15 2022.05 B Aa 34
Created on 2022-06-10 by the reprex package (v2.0.1)
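Note that complete only fills in months that occur somewhere in the combined data. A minimal sketch, assuming the full set of months is known up front (all_months is a name introduced here for illustration), passes that set to complete explicitly so a single partial dataset can be padded on its own:

```r
library(tidyr)

# Assumption: the five expected months are known in advance
all_months <- sprintf("2022.%02d", 1:5)

df2 <- data.frame(name = "A",
                  doc_month = "2022.01",
                  i_name = "Aa",
                  aggregation = 34)

# Supplying doc_month = all_months expands to every expected month,
# not only the months present in df2
out <- complete(df2, doc_month = all_months, nesting(name, i_name),
                fill = list(aggregation = 0))
out
```

This still relies on df2 having at least one row, because nesting(name, i_name) takes its values from the data.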
You could create a skeleton dataset with the five months and then join it to each of your partial datasets.
library(dplyr)
library(tidyr)
data_A <- data.frame(name = "A",
doc_month = "2022.01",
i_name = "Aa",
aggregation = "34")
reference <- data.frame(doc_month = c("2022.01", "2022.02", "2022.03", "2022.04", "2022.05"))
data_A |>
full_join(reference, by = "doc_month") |>
mutate(aggregation = replace_na(aggregation, "0")) |>
fill(name, i_name)
Output:
#> name doc_month i_name aggregation
#> 1 A 2022.01 Aa 34
#> 2 A 2022.02 Aa 0
#> 3 A 2022.03 Aa 0
#> 4 A 2022.04 Aa 0
#> 5 A 2022.05 Aa 0
Created on 2022-06-10 by the reprex package (v2.0.1)
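The join above copies name and i_name from existing rows, so it has nothing to copy when a dataset is completely empty. A minimal base R sketch for a whole list, assuming each dataset holds a single name/i_name pair and that fallback values for empty datasets come from elsewhere (pad_months and its name and i_name arguments are hypothetical, introduced only for this example):

```r
# Hypothetical helper: pad one dataset to the full set of months.
# `name` / `i_name` are fallbacks used only when `d` has no rows.
pad_months <- function(d, months, name = NA_character_, i_name = NA_character_) {
  if (nrow(d) > 0) {
    name   <- d$name[1]
    i_name <- d$i_name[1]
  }
  idx <- match(months, d$doc_month)        # NA where a month is missing
  agg <- as.numeric(d$aggregation)[idx]    # NA for missing months
  agg[is.na(agg)] <- 0
  data.frame(name = name, doc_month = months, i_name = i_name, aggregation = agg)
}

months <- sprintf("2022.%02d", 1:5)
datasets <- list(
  data.frame(name = "A", doc_month = "2022.01", i_name = "Aa", aggregation = 34),
  data.frame(name = character(), doc_month = character(),
             i_name = character(), aggregation = numeric())
)
padded <- lapply(datasets, pad_months, months = months)
padded
```

For the empty dataset the identifying columns stay NA unless fallbacks are passed, since there is no row to take them from.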
I am looking for an elegant way to build a column whose value in each row is taken from the column indicated by 'relative_column_position'. The desired result is provided below (radio).
My actual data consists of thousands of rows with ~300 different column positions denoted in 'relative_column_position,' so a hand-solution such as this is not in the cards.
gaga <- tibble(relative_column_position = c(rep(1,3), rep(2,6), rep(3,3) ),
col_1 = 1:12,
col_2 = 13:24,
col_3 = 25:36
)
gaga
radio <- tibble( c(gaga$col_1[1:3],
gaga$col_2[4:9],
gaga$col_3[10:12])
)
radio
Base R answer using matrix subsetting -
gaga <- data.frame(gaga)
result <- data.frame(value = gaga[cbind(seq_len(nrow(gaga)),
gaga$relative_column_position + 1)])
result
# value
#1 1
#2 2
#3 3
#4 16
#5 17
#6 18
#7 19
#8 20
#9 21
#10 34
#11 35
#12 36
gaga$relative_column_position + 1 because the subsetting starts from the 2nd column in the dataset. So when gaga$relative_column_position is 1, we actually want to subset data from 2nd column in gaga dataset.
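To make the indexing rule concrete, here is a small sketch of how a two-column index matrix picks one cell per row. The second part is a variant that looks columns up by name instead of adding 1, assuming the target columns are called col_1, col_2, and so on (small and cols are names introduced here):

```r
# Each row of the index matrix is a (row, column) pair
m <- matrix(1:6, nrow = 2)
picked <- m[cbind(c(1, 2), c(3, 1))]   # m[1, 3] and m[2, 1]

# Variant: find the target column by name rather than by a fixed offset
small <- data.frame(relative_column_position = c(1, 2, 3),
                    col_1 = 1:3, col_2 = 4:6, col_3 = 7:9)
cols <- match(paste0("col_", small$relative_column_position), names(small))
vals <- as.matrix(small)[cbind(seq_len(nrow(small)), cols)]
vals
```

The name-based lookup keeps working if extra columns are later inserted before col_1.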
Here is a base R solution in two steps.
library(tibble)
gaga <- tibble(relative_column_position = c(rep(1,3), rep(2,6), rep(3,3) ),
col_1 = 1:12,
col_2 = 13:24,
col_3 = 25:36
)
radio <- tibble(c(gaga$col_1[1:3],
gaga$col_2[4:9],
gaga$col_3[10:12])
)
rcp <- split(seq_along(gaga$relative_column_position), gaga$relative_column_position)
unlist(mapply(\(x, i) x[i], gaga[-1], rcp))
#> col_11 col_12 col_13 col_21 col_22 col_23 col_24 col_25 col_26 col_31 col_32
#> 1 2 3 16 17 18 19 20 21 34 35
#> col_33
#> 36
Created on 2022-05-21 by the reprex package (v2.0.1)
As a tibble:
rcp <- split(seq_along(gaga$relative_column_position), gaga$relative_column_position)
radio <- tibble(rcp = unlist(mapply(\(x, i) x[i], gaga[-1], rcp)))
rm(rcp)
radio
#> # A tibble: 12 × 1
#> rcp
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 16
#> 5 17
#> 6 18
#> 7 19
#> 8 20
#> 9 21
#> 10 34
#> 11 35
#> 12 36
Created on 2022-05-21 by the reprex package (v2.0.1)
Assuming you have a relative column to map over, you can use apply and mutate:
df |>
mutate(rel = apply(df, 1, \(x) x[colnames(df)[x["relative_col"]]] ))
To apply this to your example:
gaga |>
mutate(rel = apply(gaga, 1, \(x) x[colnames(gaga)[x["relative_column_position"] + 1]] ))
I am new to R and have a simple 'how to' question, specifically, what is the best way to calculate Group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
the test data is this:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), row.names = c(NA,
-13L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b, by = "group") %>%
mutate(resp_rate = round(no_resp/all*100))
this gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
which is fine, but I wondered how I could make this simpler? Also, how would I do an overall percentage? E.g. an overall distinct count of resp/distinct count of id, without grouping.
Many thanks
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate overall percentage you can divide the number by the sum of the column.
library(dplyr)
test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE),
all = n_distinct(id),
resp_rate = round(no_resp/all*100)) %>%
mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
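The no_resp_perc column above is each group's share of all responses. If the goal is instead a single overall response rate (total responses over distinct ids, without grouping), a minimal sketch is the same summarise call with the group_by step dropped. The data frame below simply re-enters the question's values:

```r
library(dplyr)

# Same values as the test data in the question
test <- data.frame(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp  = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id    = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf", "gdf",
            "hfg", "hfg", "trw", "jyt", "ghj")
)

# No group_by: every statistic is computed over the whole table
overall <- test %>%
  summarise(no_resp   = sum(resp, na.rm = TRUE),
            all       = n_distinct(id),
            resp_rate = round(no_resp / all * 100))
overall
```

Here ids are counted once across the whole table, so an id that appears in two groups contributes only once to the denominator.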
Using base R we may apply tapply and table functions.
res <- transform(with(test, data.frame(no_resp=tapply(resp, group, sum, na.rm=TRUE),
all=colSums(table(id, group) > 0))),
resp_rate=round(no_resp/all*100),
overall_perc=prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
I want to perform multiple joins to the original data frame, from the same source but with different IDs each time. I only need two joins, but by the second join the columns being joined already exist in the input df. Rather than adding these columns under new names with the .x/.y suffixes, I want to sum the values into the existing columns. See the code below for the desired output.
# Input data:
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
> values
# A tibble: 10 x 3
id variable1 variable2
<chr> <int> <dbl>
1 A 1 10
2 B 2 20
3 C 3 30
4 D 4 40
5 E 5 50
6 F 6 60
7 G 7 70
8 H 8 80
9 I 9 90
10 J 10 100
> df
# A tibble: 5 x 1
twin_id
<chr>
1 A/F
2 B/G
3 C/H
4 D/I
5 E/J
So this is the two joins:
joined_df <- df %>%
tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
left_join(values, by = c("left_id" = "id")) %>%
left_join(values, by = c("right_id" = "id"))
> joined_df
# A tibble: 5 x 7
twin_id left_id right_id variable1.x variable2.x variable1.y variable2.y
<chr> <chr> <chr> <int> <dbl> <int> <dbl>
1 A/F A F 1 10 6 60
2 B/G B G 2 20 7 70
3 C/H C H 3 30 8 80
4 D/I D I 4 40 9 90
5 E/J E J 5 50 10 100
And this is the output I want, using the only way I can see to get it:
output_df_wanted <- joined_df %>%
mutate(
variable1 = variable1.x + variable1.y,
variable2 = variable2.x + variable2.y) %>%
select(twin_id, left_id, right_id, variable1, variable2)
> output_df_wanted
# A tibble: 5 x 5
twin_id left_id right_id variable1 variable2
<chr> <chr> <chr> <int> <dbl>
1 A/F A F 7 70
2 B/G B G 9 90
3 C/H C H 11 110
4 D/I D I 13 130
5 E/J E J 15 150
I can see how to get what I want using a mutate statement, but I will have a much larger number of variables in the actual dataset. I am wondering if this is the best way to do this.
You can try reshaping your data and using dplyr::summarise_at:
library(tidyr)
library(dplyr)
df %>%
separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
pivot_longer(-twin_id) %>%
left_join(values, by = c("value" = "id")) %>%
group_by(twin_id) %>%
summarise_at(vars(starts_with("variable")), sum) %>%
separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE)
## A tibble: 5 x 5
# twin_id left_id right_id variable1 variable2
# <chr> <chr> <chr> <int> <dbl>
#1 A/F A F 7 70
#2 B/G B G 9 90
#3 C/H C H 11 110
#4 D/I D I 13 130
#5 E/J E J 15 150
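In current dplyr, summarise_at is superseded by across. A sketch of the same reshape-and-sum idea written that way, assuming dplyr 1.0 or later (the column names side and id are chosen here for illustration):

```r
library(dplyr)
library(tidyr)

values <- tibble(id = LETTERS[1:10],
                 variable1 = 1:10,
                 variable2 = (1:10) * 10)
df <- tibble(twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J"))

out <- df %>%
  separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
  # one row per twin member, so a single join covers both ids
  pivot_longer(c(left_id, right_id), names_to = "side", values_to = "id") %>%
  left_join(values, by = "id") %>%
  group_by(twin_id) %>%
  summarise(across(starts_with("variable"), sum))
out
```

Because the columns are selected with starts_with, the same code applies however many variable columns the real data has.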
You can use my package safejoin, if using a GitHub package is acceptable to you.
The idea is that you have conflicting columns. dplyr and base R deal with conflicts by renaming the columns, while safejoin is more flexible: you choose the function to apply in case of conflict. Here you want to add them, so we'll use conflict = `+`; for the same effect you could have used conflict = ~ .x + .y or conflict = ~ ..1 + ..2.
# remotes::install_github("moodymudskipper/safejoin")
library(tidyverse)
library(safejoin)
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
joined_df <- df %>%
tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
left_join(values, by = c("left_id" = "id")) %>%
safe_left_join(values, by = c("right_id" = "id"), conflict = `+`)
joined_df
#> # A tibble: 5 x 5
#> twin_id left_id right_id variable1 variable2
#> <chr> <chr> <chr> <int> <dbl>
#> 1 A/F A F 7 70
#> 2 B/G B G 9 90
#> 3 C/H C H 11 110
#> 4 D/I D I 13 130
#> 5 E/J E J 15 150
Created on 2020-04-29 by the reprex package (v0.3.0)
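If installing a GitHub package is not an option, a minimal base R sketch of the same idea is to look up both ids with match and add the two blocks of columns. It assumes every id in twin_id exists in values and that all non-id columns are numeric (ids, value_cols, and summed are names introduced here):

```r
values <- data.frame(id = LETTERS[1:10],
                     variable1 = 1:10,
                     variable2 = (1:10) * 10,
                     stringsAsFactors = FALSE)
df <- data.frame(twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J"),
                 stringsAsFactors = FALSE)

# Two-column character matrix: left id, right id
ids <- do.call(rbind, strsplit(df$twin_id, "/", fixed = TRUE))
value_cols <- setdiff(names(values), "id")

# Rows of `values` for the left and right member of each pair
left  <- values[match(ids[, 1], values$id), value_cols]
right <- values[match(ids[, 2], values$id), value_cols]

# Add the two blocks column by column
summed <- left
summed[] <- Map(`+`, left, right)

output <- cbind(df, left_id = ids[, 1], right_id = ids[, 2], summed)
output
```

No column names are typed out, so this also scales to many variable columns.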
I am trying to find the best way to iterate through each column of a data frame, group by that column, and produce a summary.
Here is my attempt:
library(tidyverse)
data = data.frame(
a = sample(LETTERS[1:3], 100, replace=TRUE),
b = sample(LETTERS[1:8], 100, replace=TRUE),
c = sample(LETTERS[3:15], 100, replace=TRUE),
d = sample(LETTERS[16:26], 100, replace=TRUE),
value = rnorm(100)
)
myfunction <- function(x) {
groupVars <- select_if(x, is.factor) %>% colnames()
results <- list()
for(i in 1:length(groupVars)) {
results[[i]] <- x %>%
group_by_at(.vars = vars(groupVars[i])) %>%
summarise(
n = n()
)
}
return(results)
}
test <- myfunction(data)
The function returns:
[[1]]
# A tibble: 3 x 2
a n
<fct> <int>
1 A 37
2 B 34
3 C 29
...
...
...
My question is, is this the best way to do this? Is there a way to avoid using a for loop? Can I use purrr and map somehow to do this?
Thank you
An option is to use map
library(tidyverse)
map(data[1:4], ~data.frame(x = {{.x}}) %>% count(x))
#$a
## A tibble: 3 x 2
# x n
# <fct> <int>
#1 A 39
#2 B 32
#3 C 29
#
#$b
## A tibble: 8 x 2
# x n
# <fct> <int>
#1 A 14
#2 B 11
#3 C 16
#4 D 10
#5 E 12
#6 F 10
#7 G 13
#8 H 14
#...
The output is a list. Note that I have ignored the last column of data, as it doesn't seem to be relevant here.
If you want columns in the list data.frames to be named according to the columns from your original data, we can use imap
imap(data[1:4], ~tibble(!!.y := {{.x}}) %>% count(!!sym(.y)))
#$a
## A tibble: 3 x 2
# a n
# <fct> <int>
#1 A 23
#2 B 35
#3 C 42
#
#$b
## A tibble: 8 x 2
# b n
# <fct> <int>
#1 A 15
#2 B 10
#3 C 13
#4 D 5
#5 E 19
#6 F 9
#7 G 13
#8 H 16
#...
Or making use of tibble::enframe (thanks @camille)
imap(data[1:4], ~enframe(.x, value = .y) %>% count(!!sym(.y)))
You could reshape the data and group by both the column and the letter. This gives you one dataframe instead of a list of them, but you could get the list if you really want it with split.
set.seed(123)
library(tidyverse)
data = data.frame(
a = sample(LETTERS[1:3], 100, replace=TRUE),
b = sample(LETTERS[1:8], 100, replace=TRUE),
c = sample(LETTERS[3:15], 100, replace=TRUE),
d = sample(LETTERS[16:26], 100, replace=TRUE),
value = rnorm(100)
)
data %>%
pivot_longer(cols = -value, names_to = "column", values_to = "letter") %>%
group_by(column, letter) %>%
summarise(n = n())
#> # A tibble: 35 x 3
#> # Groups: column [4]
#> column letter n
#> <chr> <fct> <int>
#> 1 a A 33
#> 2 a B 32
#> 3 a C 35
#> 4 b A 8
#> 5 b B 11
#> 6 b C 12
#> 7 b D 14
#> 8 b E 8
#> 9 b F 17
#> 10 b G 16
#> # … with 25 more rows
Created on 2019-10-30 by the reprex package (v0.3.0)
You can simply call:
apply(data, 2,table)
You can drop the last list element if you want.
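Since apply first coerces the whole data frame to a character matrix, the numeric value column gets tabulated as well. A base R sketch that instead selects the grouping columns up front, assuming every non-numeric column is a grouping variable (group_cols and counts are names introduced here):

```r
set.seed(1)
data <- data.frame(
  a = sample(LETTERS[1:3], 100, replace = TRUE),
  b = sample(LETTERS[1:8], 100, replace = TRUE),
  value = rnorm(100)
)

# Keep only the non-numeric columns as grouping variables
group_cols <- names(data)[!vapply(data, is.numeric, logical(1))]

# One count table per grouping column, returned as a named list
counts <- lapply(data[group_cols], function(col) {
  as.data.frame(table(col), responseName = "n")
})
counts$a
```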
I am working with genetic data and I need to concatenate pairs of columns. The data I have has the major and minor alleles in separate columns (e.g., allele1a, allele1b, allele2a, allele2b, etc.). I need a way to paste pairs of columns together across the entire data frame. I included a sample below, but my data has 1.7 million pairs (so I have 3.4 million columns right now), so it will not work if I need to name each column. I will change the column names later. Any guidance is appreciated if there is a way to do this in R. I have tried to create a sequence and paste them, something like:
df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
i <- seq.int(1, length(ped), by = 2L)
df <- paste0(df[i], df[i+1])
but that did not work. I want it to go from:
id var1 var2 var3 var4 var5 var6
1 1 A T C C A A
2 2 A T C C A A
3 3 A T C C A A
4 4 A T C C A A
5 5 A T C C A A
6 6 A T C C A A
7 7 A T C C A A
8 8 A T C C A A
9 9 A T C C A A
10 10 A T C C A A
11 11 A A C G A G
12 12 A A C G A G
13 13 A A C G A G
14 14 A A C G A G
15 15 A A C G A G
16 16 A A C G A G
17 17 A A C G A G
18 18 A A C G A G
19 19 A A C G A G
20 20 A A C G A G
to:
id var1 var2 var3
1 1 AT CC AA
2 2 AT CC AA
3 3 AT CC AA
4 4 AT CC AA
5 5 AT CC AA
6 6 AT CC AA
7 7 AT CC AA
8 8 AT CC AA
9 9 AT CC AA
10 10 AT CC AA
11 11 AA CG AG
12 12 AA CG AG
13 13 AA CG AG
14 14 AA CG AG
15 15 AA CG AG
16 16 AA CG AG
17 17 AA CG AG
18 18 AA CG AG
19 19 AA CG AG
20 20 AA CG AG
edit:
Thank you!!! I was able to adapt two of the answers for my data, and @akrun's ran a little faster. I created a subset of my data with 100 rows and 100,000 columns and the results are below:
microbenchmark(
+ {
+ new <- ped %>%
+ gather(key = V, value = value, -id) %>%
+ mutate(V = str_extract(V, "\\d+") %>% as.numeric()) %>%
+ group_by(id) %>%
+ mutate(pair = ceiling(V / 2)) %>%
+ group_by(id, pair) %>%
+ summarise(combined = paste(value, collapse = "")) %>%
+ mutate(V_combo = paste0("V", pair)) %>%
+ select(-pair) %>%
+ spread(key = V_combo, value = combined) %>%
+ select(id, paste0("V", seq(1, ncol(.)-1, 1)))
+ },
+ {
+ out <- ped[1]
+ new_cols <- paste0("V", seq(1, (ncol(ped)-1)/2))
+
+ out[new_cols] <- lapply(seq(2, ncol(ped)-1, 2),
+ function(i) do.call(paste0, ped[i:(i+1)]))
+ },
+ times = 1
+ )
Unit: seconds
expr min lq mean median uq max neval
camille 250.30901 250.30901 250.30901 250.30901 250.30901 250.30901 1
akrun 23.52434 23.52434 23.52434 23.52434 23.52434 23.52434 1
>
> new <- data.frame(new, stringsAsFactors = FALSE)
> identical(new, out)
[1] TRUE
We can loop over the columns, take each column together with its adjacent column, paste them together with do.call, and assign the results as new columns in the new dataset
out <- df[1]
out[paste0("var", 1:3)] <- lapply(seq(2, ncol(df), 2),
function(i) do.call(paste0, df[i:(i+1)]))
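The 1:3 above is specific to the six-column sample. A sketch that derives the number of pairs from the data instead, assuming the first column is the id and the remaining columns come in adjacent pairs (starts is a name introduced here):

```r
df <- data.frame(id = 1:4,
                 var1 = rep("A", 4),
                 var2 = c("T", "T", "A", "A"),
                 var3 = rep("C", 4),
                 var4 = c("C", "C", "G", "G"),
                 stringsAsFactors = FALSE)

# First column of each pair: 2, 4, ... up to the second-to-last column
starts <- seq(2, ncol(df) - 1, by = 2)

out <- df[1]
out[paste0("var", seq_along(starts))] <- lapply(starts, function(i) {
  paste0(df[[i]], df[[i + 1]])   # vectorised paste of the two columns
})
out
```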
Here's a tidyverse way designed to scale fairly well. Instead of hard-coding that you want to pair columns 1 & 2, 3 & 4, and 5 & 6, I'm reshaping to long data to get a variable number, grouping those into pairs by dividing the variable number by 2, collapsing the letters in each pair, and reshaping back to wide. This way, you can do the same procedure on any even number of columns.
library(tidyverse)
...
Filtering for ID 1 to show a glimpse of this:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
filter(id == 1)
#> # A tibble: 6 x 4
#> # Groups: id [1]
#> id var value pair
#> <int> <dbl> <chr> <dbl>
#> 1 1 1 A 1
#> 2 1 2 T 1
#> 3 1 3 C 2
#> 4 1 4 C 2
#> 5 1 5 A 3
#> 6 1 6 A 3
Then collapsing strings as a summarizing value for each combination of ID and pair:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = ""))
#> # A tibble: 60 x 3
#> # Groups: id [?]
#> id pair combined
#> <int> <dbl> <chr>
#> 1 1 1 AT
#> 2 1 2 CC
#> 3 1 3 AA
#> 4 2 1 AT
#> 5 2 2 CC
#> 6 2 3 AA
#> 7 3 1 AT
#> 8 3 2 CC
#> 9 3 3 AA
#> 10 4 1 AT
#> # ... with 50 more rows
And using spread to get back into a wide format.
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = "")) %>%
mutate(var_combo = paste0("var", pair)) %>%
select(-pair) %>%
spread(key = var_combo, value = combined) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: id [6]
#> id var1 var2 var3
#> <int> <chr> <chr> <chr>
#> 1 1 AT CC AA
#> 2 2 AT CC AA
#> 3 3 AT CC AA
#> 4 4 AT CC AA
#> 5 5 AT CC AA
#> 6 6 AT CC AA
Created on 2018-11-07 by the reprex package (v0.2.1)
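gather and spread are superseded in current tidyr. A sketch of the same pipeline with pivot_longer and pivot_wider, assuming tidyr 1.0 or later and dplyr 1.0 or later, and that the column names end in the variable number:

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = 1:2,
                 var1 = c("A", "A"), var2 = c("T", "A"),
                 var3 = c("C", "C"), var4 = c("C", "G"),
                 stringsAsFactors = FALSE)

out <- df %>%
  pivot_longer(-id, names_to = "var", values_to = "value") %>%
  # strip the non-digit prefix, then map variables 1-2 to pair 1, 3-4 to pair 2
  mutate(pair = ceiling(as.numeric(sub("[^0-9]+", "", var)) / 2)) %>%
  group_by(id, pair) %>%
  summarise(combined = paste(value, collapse = ""), .groups = "drop") %>%
  pivot_wider(names_from = pair, values_from = combined, names_prefix = "var")
out
```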
Using tidyverse, you can compose the modifying expressions ahead of time, then pass them all to transmute in bulk. This solution uses column names and is therefore robust to the column ordering: if you shuffle your allele columns, this should still give you the same answer.
library( tidyverse )
# Create expressions of the form allele1 = str_c(allele1a, allele1b)
v <- str_c("allele",1:3) %>% set_names %>%
map( ~glue::glue("str_c({.}a, {.}b)") ) %>% map( rlang::parse_expr )
df %>% transmute( id = id, !!!v )
# # A tibble: 20 x 4
# id allele1 allele2 allele3
# <int> <chr> <chr> <chr>
# 1 1 AT CC AA
# 2 2 AT CC AA
# 3 3 AT CC AA
# 4 4 AT CC AA
# ...
I modified your data to closer match your description:
df <- tibble(id = seq(1,20),
allele1a = rep("A", 20),
allele1b = c(rep("T", 10), rep("A", 10)),
allele2a = rep("C", 20),
allele2b = c(rep("C", 10), rep("G", 10)),
allele3a = rep("A", 20),
allele3b = c(rep("A", 10), rep("G", 10)))
Using base R you could do:
a <- seq(2,ncol(df),2)
b <- paste0(unlist(df[a]),unlist(df[a+1]))
d <- data.frame(matrix(b,nrow(df)))
result <- cbind(df[1],d)
This can also be written in one line:
(dat = data.frame(matrix(paste0(unlist(df[a<-seq(2,ncol(df),2)]),unlist(df[a+1])),nrow(df))))
X1 X2 X3
1 AT CC AA
2 AT CC AA
3 AT CC AA
4 AT CC AA
5 AT CC AA
6 AT CC AA
7 AT CC AA
8 AT CC AA
9 AT CC AA
10 AT CC AA
11 AA CG AG
12 AA CG AG
13 AA CG AG
14 AA CG AG
15 AA CG AG
16 AA CG AG
17 AA CG AG
18 AA CG AG
19 AA CG AG
20 AA CG AG
Then cbind it with the id column:
cbind(df[1],dat)
df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
df2 <- data.frame(id = df[,1], var1 = paste(df[,2], df[,3], sep = ""),
var2 = paste(df[,4], df[,5], sep = ""),
var3 = paste(df[,6], df[,7], sep = ""))