Merging different columns with the same name into single columns - r

I have a data.frame in which some columns have the same name. Now I want to merge/add up these columns into single columns. So, for example, I want to turn....
v1 v1 v1 v2 v2
1 0 2 4 1
3 1 1 1 0
...into...
v1 v2
3 5
5 1
I only found threads dealing with two data.frames that are supposed to be merged into one, but none dealing with this (rather simple?) problem.
The data can be recreated with this:
df <- structure(list(v1 = c(1L, 3L), v1 = 0:1, v1 = c(2L, 1L),
                     v2 = c(4L, 1L), v2 = c(1L, 0L)),
                .Names = c("v1", "v1", "v1", "v2", "v2"),
                class = "data.frame", row.names = c(NA, -2L))

as.data.frame(lapply(split.default(df, names(df)), function(x) Reduce(`+`, x)))
produces:
v1 v2
1 3 5
2 5 1
split.default(...) breaks the data frame up into groups of columns with equal names; Reduce then sums the columns within each group iteratively until only one column is left per group (see ?Reduce for what the function does); finally, as.data.frame converts the result back to a data frame.
We have to use split.default because split (or really split.data.frame, to which it will dispatch) splits on rows, not columns.
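An equivalent way to express the same idea, if you prefer, is to sum each group of columns with rowSums instead of Reduce (just a sketch of the same technique):
data.frame(lapply(split.default(df, names(df)), rowSums))
#   v1 v2
# 1  3  5
# 2  5  1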

You can do this quite easily with melt and dcast from "reshape2". Since there's no "id" variable, I've used melt(as.matrix(df)) instead of melt(df, id.vars="id"). This automatically creates a long version of your data in which "Var1" represents your rownames and "Var2" your colnames. Using that knowledge, you can do:
library(reshape2)
dcast(melt(as.matrix(df)), Var1 ~ Var2,
      value.var = "value", fun.aggregate = sum)
# Var1 v1 v2
# 1 1 3 5
# 2 2 5 1
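Note that dcast keeps Var1 (the original row index) as a column in the result; if you don't want it, you can simply drop it afterwards, e.g.:
res <- dcast(melt(as.matrix(df)), Var1 ~ Var2,
             value.var = "value", fun.aggregate = sum)
res[-1]   # drop the Var1 index column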

Related

How to use a dataset to extract specific columns from another dataset?
Use intersect to find common names between two data sets.
snp.common <- intersect(data1$snp, colnames(data2))  # names present in both data sets
data2.separated <- data2[, snp.common]
It's always better to supply a minimal reproducible example:
df1 <- data.frame(V1 = 1:3,
                  V2 = 4:6,
                  V3 = 7:9)
df2 <- data.frame(snp = c("V2", "V3"),
                  stringsAsFactors = FALSE)
Now we can use a character vector to index the columns we want:
df1[, df2$snp]
Returns:
V2 V3
1 4 7
2 5 8
3 6 9
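If some entries of df2$snp might not exist as columns in df1, indexing with the raw vector will throw an "undefined columns selected" error; a defensive variant (a sketch, reusing the intersect idea from above) is:
snp.common <- intersect(df2$snp, names(df1))   # keep only names that really exist in df1
df1[, snp.common, drop = FALSE]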
Edit:
Would you know how to do this so that it retains the "ï..POP" column from the data being subset?
df1 <- data.frame(ID = letters[1:3],
                  V1 = 1:3,
                  V2 = 4:6,
                  V3 = 7:9)
names(df1)[1] <- "ï..POP"
df2 <- data.frame(snp = c("V2", "V3"),
                  stringsAsFactors = FALSE)
We can use c to combine the names of the columns:
df1[, c("ï..POP", df2$snp)]
ï..POP V2 V3
1 a 4 7
2 b 5 8
3 c 6 9

How can I find the number of a vector's elements in another vector?

I have two vectors. The first is comments$author_id and the second is enrolments$learner_id. I want to add a new column to the enrolments data frame that shows, for each enrolments$learner_id row, how many times that value occurs in the comments$author_id vector.
Example:
if(enrolments$learner_id[1] repeated 5 times in comments$author_id)
enrolments$freqs[1] = 5
Can I do this without using any loops?
The vector samples are as follows:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,4,5,6))
I want to add "counts" column to "df2" that shows counts of repeated v2 element in v1.
"[tabulate] gives me this error: Error in $<-.data.frame(tmp, "comments_count", value = c(0L, 0L, : replacement has 25596 rows, data
has 25597"
That is probably because there are values at the end of df2$v2 which are not part of df1$v1. I added 0 and 7 to your example to show that:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,0,4,5,6,7))
df2$count <- tabulate(factor(df1$v1, df2$v2))
# Error in `$<-.data.frame`(`*tmp*`, count, value = c(7L, 5L, 3L, 0L, 5L, :
# replacement has 7 rows, data has 8
To correct that using tabulate, which might be the fastest solution on larger data:
df2$count <- tabulate(factor(df1$v1, df2$v2), length(df2$v2))
df2
# v2 count
# 1 1 7
# 2 2 5
# 3 3 3
# 4 0 0
# 5 4 5
# 6 5 6
# 7 6 2
# 8 7 0
See ?tabulate for the documentation on that function.
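If you would rather not think about the nbins argument, table on the same factor gives equivalent counts, since it keeps all levels including the empty ones (a sketch; likely a bit slower than tabulate on very large vectors):
df2$count <- as.integer(table(factor(df1$v1, df2$v2)))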
Using your df1 and df2 example, you could do it like this:
# Make data
df1 = data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 = data.frame(v2 = c(1,2,3,4,5,6))
# Add 'counts' variable as requested
df2$counts = sapply(df2$v2, function(x) {
  sum(df1$v1 == x, na.rm = TRUE)  # na.rm = TRUE just in case df1$v1 has missing values
})
df2  # view output
What you are essentially doing is aggregating df1 to get a count and then joining that count back onto df2. This logic translates easily to a number of different methods:
# base R
merge(
  df2,
  aggregate(cbind(df1[0], count = 1), df1["v1"], FUN = sum),
  by.x = "v2", by.y = "v1", all.x = TRUE
)
# data.table
library(data.table)
setDT(df1)
setDT(df2)
df2[df1[, .(count=.N), by=v1], on=c("v2"="v1")]
# dplyr
library(dplyr)
df1 %>%
  group_by(v1) %>%
  count(name = "count") %>%   # name = "count" so the joined column matches the output below
  left_join(df2, ., by = c("v2" = "v1"))
# v2 count
#1 1 7
#2 2 5
#3 3 3
#4 4 5
#5 5 6
#6 6 2
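One caveat with the join-based versions: v2 values that never occur in v1 come back as NA rather than 0 (the tabulate and sapply approaches already give 0). A sketch of filling those in with the base R merge from above (assuming df1/df2 are still plain data frames, i.e. before any setDT() call):
counts <- aggregate(cbind(df1[0], count = 1), df1["v1"], FUN = sum)
res <- merge(df2, counts, by.x = "v2", by.y = "v1", all.x = TRUE)
res$count[is.na(res$count)] <- 0   # turn the NAs from the left join into zeros
res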

Efficient way to cbind list by groups in data.table

I have a data.frame
data
data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
                                   "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"),
                      class = c("cat", "dog")),
                 .Names = c("mystring", "class"),
                 row.names = c(NA, -2L), class = "data.frame")
which looks like
#> data
# mystring class
#1 AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD cat
#2 ASDSDFJSKADDKJSJKDFKSADDLKJFLAK dog
I am searching for the start and end positions of the pattern "ADD" within the first 20 characters of the strings under mystring, treating class as the group.
I am doing this with str_locate_all from the stringr package. Here is my attempt:
setDT(data)[,
            cbind(list(str_locate_all(substr(as.character(mystring), 1, 20), "ADD")[[1]][, 1]),
                  list(str_locate_all(substr(as.character(mystring), 1, 20), "ADD")[[1]][, 2])),
            by = class]
This gives my desired output
# class V1 V2
#1: cat 8 10
#2: cat 16 18
#3: dog 10 12
Question:
I would like to know whether this is a standard approach or whether it can be done more efficiently. str_locate_all gives the start and end positions of the matched pattern in separate columns, and I am putting them into separate lists to cbind them together within the data.table. Also, how can I specify the column names for the cbinded columns here?
I think you should first reduce the work done per group, so I would create the substring for all rows at once.
setDT(data)[, submystring := substr(mystring, 1L, 20L)]
Then, using the stringi package (I don't like wrappers), you could do the following (though I can't currently vouch for its efficiency):
library(stringi)
data[, data.table(matrix(unlist(stri_locate_all_fixed(submystring, "ADD")), ncol = 2)), by = class]
# class V1 V2
# 1: cat 8 10
# 2: cat 16 18
# 3: dog 10 12
Alternatively, you could avoid the matrix and data.table calls per group and instead spread the data after all the locations have been detected:
res <- data[, unlist(stri_locate_all_fixed(submystring, "ADD")), by = class]
res[, `:=`(varnames = rep(c("V1", "V2"), each = .N/2), MatchCount = rep(1:(.N/2), 2L)), by = class]
dcast(res, class + MatchCount ~ varnames, value.var = "V1")
# class MatchCount V1 V2
# 1: cat 1 8 10
# 2: cat 2 16 18
# 3: dog 1 10 12
A third, similar option is to run stri_locate_all_fixed over the whole data set first and only then unlist per group (instead of running both unlist and stri_locate_all_fixed per group):
res <- data[, .(stri_locate_all_fixed(submystring, "ADD"), class = class)]
res[, N := lengths(V1)/2L]
res2 <- res[, unlist(V1), by = "class,N"]
res2[, `:=`(varnames = rep(c("V1", "V2"), each = N[1L]), MatchCount = rep(seq_len(N[1L]), 2L)), by = class]
dcast(res2, class + MatchCount ~ varnames, value.var = "V1")
# class MatchCount V1 V2
# 1: cat 1 8 10
# 2: cat 2 16 18
# 3: dog 1 10 12
We could change the matrix output from str_locate_all to data.frame and use rbindlist to create the columns.
library(stringr)
setDT(data)[, rbindlist(lapply(str_locate_all(substr(mystring, 1, 20), 'ADD'),
                               as.data.frame)), by = class]
# class start end
#1: cat 8 10
#2: cat 16 18
#3: dog 10 12
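To address the column-naming part of the question: the columns here are simply whatever as.data.frame produces from the str_locate_all matrix (start and end). If you want other names, one option (a sketch; ADD_start/ADD_end are just example names) is to rename afterwards with setnames:
out <- setDT(data)[, rbindlist(lapply(str_locate_all(substr(mystring, 1, 20), 'ADD'),
                                      as.data.frame)), by = class]
setnames(out, c("start", "end"), c("ADD_start", "ADD_end"))
out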
Here's how I did it.
library(stringi)
library(dplyr)
library(magrittr)
data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
"ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring",
"class"), row.names = c(NA, -2L), class = "data.frame")
my_function = function(row)
  row$mystring %>%
  stri_sub(to = 20) %>%
  stri_locate_all_fixed(pattern = "ADD") %>%
  extract2(1) %>%
  as_data_frame

test =
  data %>%
  group_by(mystring) %>%
  do(my_function(.)) %>%
  left_join(data)

Append 0 to missing observations in a dataframe.

I have a dataset where I expect a fixed set of observations in a data frame:
A 20
B 10
C 5
However, upon running my analysis this is not always the case; sometimes observations are missing and the resulting data frame looks like this:
A 10
C 5
In this case there are no observations for B. I want to append 0 for the missing observations to the final data frame before plotting, so as to indicate the values of the missing observations.
The final data frame should look like this:
A 10
B 0
C 5
How can I accomplish this in R?
If you define the ID column (with A, B, C) as a factor, which seems appropriate here, you can plot the data and even those factor levels that are not present in the data (but are among the defined levels) will be plotted. Here's a small example:
df <- data.frame(ID = factor(LETTERS[1:3]), x = rnorm(3))  # ID explicitly a factor
df
# ID x
#1 A 1.350458
#2 B 1.340855
#3 C 1.311329
subdf <- df[c(1,3),]
subdf
# ID x
#1 A 1.350458
#3 C 1.311329
with(subdf, plot(x ~ ID))
You'll find that "B" is also present in the plot although it's not in the subsetted data.
Maybe you can do something with melt and dcast from "reshape2".
Here's what I had in mind:
library(reshape2)
out <- dcast(
  melt(                               # makes a data.frame from a list
    mget(ls(pattern = "df\\d")),      # collects the relevant df in a list
    id.vars = "V1"),                  # the variable to melt by
  L1 ~ V1, value.var = "value", fill = 0)   # other options for dcast
out
# L1 A B C
# 1 df1 20 10 5
# 2 df2 10 0 5
From there, you could go back to a long data form.
melt(out, id.vars = "L1")
# L1 variable value
# 1 df1 A 20
# 2 df2 A 10
# 3 df1 B 10
# 4 df2 B 0
# 5 df1 C 5
# 6 df2 C 5
If separate data.frames are required, then you can also look at using split, but if you are just going to be plotting, this format should work just fine.
Sample data
df1 <- structure(list(V1 = c("A", "B", "C"), V2 = c(20L, 10L, 5L)),
                 .Names = c("V1", "V2"), class = "data.frame",
                 row.names = c(NA, -3L))
df2 <- structure(list(V1 = c("A", "C"), V2 = c(10L, 5L)),
                 .Names = c("V1", "V2"), class = "data.frame",
                 row.names = c(NA, -2L))
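If all you need for the original problem is the zero-filled table itself (rather than the wide reshape), a base R sketch against the full set of expected IDs also works:
full <- data.frame(V1 = c("A", "B", "C"))          # every ID you expect to see
res  <- merge(full, df2, by = "V1", all.x = TRUE)  # left join; B comes back as NA
res$V2[is.na(res$V2)] <- 0
res
#   V1 V2
# 1  A 10
# 2  B  0
# 3  C  5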

Merge two rows into one column in R

How can I merge two rows into one row, like
...
type1,-2.427,-32.962,-61.097
type2,0.004276057,0.0015271631,-0.005192355
type1,-2.427,-32.962,-60.783
type2,0.0018325958,0.0033597588,-0.0021380284
...
to
type1,-2.427,-32.962,-61.097,type2,0.004276057,0.0015271631,-0.005192355
type1,-2.427,-32.962,-60.783,type2,0.0018325958,0.0033597588,-0.0021380284
or
-2.427,-32.962,-61.097,0.004276057,0.0015271631,-0.005192355
-2.427,-32.962,-60.783,0.0018325958,0.0033597588,-0.0021380284
in GNU R?
Do they always alternate type1/type2? If so, you can just use basic indexing. Using this sample data.frame
dd <- structure(list(V1 = structure(c(1L, 2L, 1L, 2L),
                                    .Label = c("type1", "type2"), class = "factor"),
                     V2 = c(-2.427, 0.004276057, -2.427, 0.0018325958),
                     V3 = c(-32.962, 0.0015271631, -32.962, 0.0033597588),
                     V4 = c(-61.097, -0.005192355, -60.783, -0.0021380284)),
                .Names = c("V1", "V2", "V3", "V4"),
                row.names = c(NA, 4L), class = "data.frame")
you can do
cbind(dd[seq(1,nrow(dd), by=2),], dd[seq(2,nrow(dd), by=2),])
# V1 V2 V3 V4 V1 V2 V3 V4
# 1 type1 -2.427 -32.962 -61.097 type2 0.004276057 0.001527163 -0.005192355
# 3 type1 -2.427 -32.962 -60.783 type2 0.001832596 0.003359759 -0.002138028
to include the "type" column or you can do
cbind(dd[seq(1,nrow(dd), by=2),-1], dd[seq(2,nrow(dd), by=2),-1])
# V2 V3 V4 V2 V3 V4
# 1 -2.427 -32.962 -61.097 0.004276057 0.001527163 -0.005192355
# 3 -2.427 -32.962 -60.783 0.001832596 0.003359759 -0.002138028
to leave them off.
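Since this relies on the rows strictly alternating type1/type2, a cheap sanity check before using it might be worthwhile (a sketch):
# stop if V1 is not type1, type2, type1, type2, ...
stopifnot(all(as.character(dd$V1) == rep(c("type1", "type2"), length.out = nrow(dd))))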
Here's an alternative using @MrFlick's sample data:
## Use `ave` to create an indicator variable
dd$ind <- with(dd, ave(as.numeric(V1), V1, FUN = seq_along))
## use `reshape` to "merge" your rows by indicator
reshape(dd, direction = "wide", idvar = "ind", timevar = "V1")
# ind V2.type1 V3.type1 V4.type1 V2.type2 V3.type2 V4.type2
# 1 1 -2.427 -32.962 -61.097 0.004276057 0.001527163 -0.005192355
# 3 2 -2.427 -32.962 -60.783 0.001832596 0.003359759 -0.002138028
You could use split to split the data by type, and then use cbind to bring the pieces together. The following method removes the first column (via dd[-1]) before splitting, and also uses MrFlick's data, dd:
> do.call(cbind, split(dd[-1], dd[[1]]))
# type1.V2 type1.V3 type1.V4 type2.V2 type2.V3 type2.V4
# 1 -2.427 -32.962 -61.097 0.004276057 0.001527163 -0.005192355
# 3 -2.427 -32.962 -60.783 0.001832596 0.003359759 -0.002138028
On my machine, this is the fastest of the current three answers.
