What kind of join do I need here in R?

Please also help me come up with a clearer title for this question.
The point is that I don't know the correct R terminology for what I need here. Is "join" the correct word?
set.seed(0)
df <- data.frame(a = sample(c(T,F), 10, replace=TRUE),
                 b = sample(c(T,F), 10, replace=TRUE),
                 c = sample(c(T,F), 10, replace=TRUE),
                 d = sample(c(T,F), 10, replace=TRUE))
a <- addmargins(table(df$a))
b <- addmargins(table(df$b))
c <- addmargins(table(df$c))
d <- addmargins(table(df$d))
This is the data (the tables for a, b, c and d, in order):
FALSE  TRUE   Sum
    7     3    10
FALSE  TRUE   Sum
    4     6    10
FALSE  TRUE   Sum
    4     6    10
FALSE  TRUE   Sum
    5     5    10
And what I want is to make the data look like this
  FALSE TRUE Sum
a     7    3  10
b     4    6  10
c     4    6  10
d     5    5  10
Sounds simple, doesn't it? I was using ddply in the past, but I don't see how to use ddply or anything else here.

Here is a simple one-liner to run the table command on each column and then add the margins:
addmargins(t(sapply(df, table)))
#or this for just the row sums:
addmargins(t(sapply(df, table)), 2)
sapply applies the table function to each column, t transposes the result, and addmargins adds the row/column sums.

This is just stacking rows, so you want rbind (for "binding" rows together; cbind is the equivalent for columns):
rbind(a, b, c, d)
#   FALSE TRUE Sum
# a     7    3  10
# b     4    6  10
# c     4    6  10
# d     5    5  10
A join is typically done when you have some shared columns and some different columns, and you want to combine the data so that the shared columns line up and the differing columns sit side by side. For example, if you had one data frame of people and addresses, and another data frame of people and orders, you would join them to see which address goes with which order. In base R, joins are done with the merge command.
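To make that concrete, here is a minimal sketch of such a join with merge (the data frames, names, and values here are invented for illustration):

```r
# Two made-up data frames that share a "person" column
addresses <- data.frame(person  = c("Ann", "Bob"),
                        address = c("1 Elm St", "2 Oak Ave"))
orders    <- data.frame(person   = c("Bob", "Ann", "Ann"),
                        order_id = c(101, 102, 103))

# merge() lines the rows up on the shared column(s); by default it
# joins on all columns that appear in both data frames
merge(addresses, orders, by = "person")
#   person   address order_id
# 1    Ann  1 Elm St      102
# 2    Ann  1 Elm St      103
# 3    Bob 2 Oak Ave      101
```

By default merge keeps only rows whose key appears in both data frames; the all, all.x, and all.y arguments give the outer-join variants.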


R - aggregate gives different results when adding a new grouping column

I am an R beginner and I am stuck and can't find a solution. Any remarks are highly appreciated. Here is the problem:
I have a dataframe df.
The columns are character (the attributes) and numeric (the amounts).
I want to reduce the dataframe by using the aggregate function (dplyr is not an option).
When I am aggregating using
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1")], sum)
I get correct results. But I want to group by more attributes. When adding more attributes for example
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
then at some point the aggregate result changes: the sum of AMOUNT is no longer equal to the result of the first aggregation (or to the total in the original dataframe).
Does anyone have an idea what causes this behavior?
My best guess is that you have missing values in some of your grouping columns. Demonstrating on the built-in mtcars data, which has no missing values, everything is fine:
sum(mtcars$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am")], sum)$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am", "cyl")], sum)$mpg)
# [1] 642.9
But if we introduce a missing value in a grouping variable, it is not included in the aggregation:
mt = mtcars
mt$cyl[1] = NA
sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 621.9
The easiest fix would be to fill in the missing values with something other than NA, perhaps the string "missing".
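A minimal sketch of that fix, continuing the modified mtcars example above (replacing the NA in the grouping column with a placeholder before aggregating):

```r
mt <- mtcars
mt$cyl[1] <- NA                       # a missing value in a grouping column

# Fill the NA with a placeholder so the row is no longer dropped
mt$cyl[is.na(mt$cyl)] <- "missing"

sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 642.9
```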
I think @Gregor has correctly pointed out that the problem could be a grouping variable containing NA. dplyr handles NA in grouping variables differently than aggregate does.
There is an alternate solution with aggregate. Note that the documentation says:
    by: a list of grouping elements, each as long as the variables in the data
    frame x. The elements are coerced to factors before use.
Here is the clue: you can convert your grouping variables to factors using exclude = "", which ensures that NA becomes part of the factor levels.
set.seed(1)
df <- data.frame(ATTRIBUTE1 = sample(LETTERS[1:3], 10, replace = TRUE),
                 ATTRIBUTE2 = sample(letters[1:3], 10, replace = TRUE),
                 AMOUNT = 1:10)
df$ATTRIBUTE2[5] <- NA
aggregate(df["AMOUNT"], by = list(factor(df$ATTRIBUTE1, exclude = ""),
                                  factor(df$ATTRIBUTE2, exclude = "")), sum)
# Group.1 Group.2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7
# 8 A <NA> 5
For comparison, here is the result when the grouping variables are not explicitly converted to factors to include NA:
aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
# ATTRIBUTE1 ATTRIBUTE2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data frame with n columns belonging to m unequal groups. Here is an example:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns in a group, take all of that group's columns).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)), function(x) {
                 cols <- which(colnames(dframe) == x)
                 if (length(cols) <= nc) dframe[, cols]
                 else dframe[, sample(cols, nc, replace = FALSE)]
               }))
It might look complicated, but it really just takes all the columns of a group if there are at most nc of them, and samples nc random columns otherwise.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+$', '', colnames(res))

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for (i in 1:nrow(a)) {
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
Then, if you want some sample data to try it with:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)

New column with the value from the column whose name is in a third one in R

I've got a problem I can't resolve.
Example dataset:
company <- c("compA","compB","compC")
compA <- c(1,2,3)
compB <- c(2,3,1)
compC <- c(3,1,2)
df <- data.frame(company,compA,compB,compC)
I want to create a new column with the value taken from the column whose name is in the "company" column of the same row. The resulting extraction would be:
df$new <- c(1,3,2)
df
The way you have it set up, there's one row and one column for every company, and the rows and columns are in the same order. If that's your real dataset, then as others have said diag(...) is the solution (and you should select that answer).
If your real dataset has more than one instance of a company (e.g., more than one row per company), then this is more general:
# using your df
sapply(1:nrow(df),function(i)df[i,as.character(df$company[i])])
# [1] 1 3 2
# more complex case
set.seed(1) # for reproducible example
newdf <- data.frame(company = LETTERS[sample(1:3, 10, replace = T)],
                    A = sample(1:3, 10, replace = T),
                    B = sample(1:5, 10, replace = T),
                    C = 1:10)
head(newdf)
# company A B C
# 1 A 1 5 1
# 2 B 1 2 2
# 3 B 3 4 3
# 4 C 2 1 4
# 5 A 3 2 5
# 6 C 2 2 6
sapply(1:nrow(newdf),function(i)newdf[i,as.character(newdf$company[i])])
# [1] 1 2 4 4 3 6 7 2 5 3
EDIT: eddi's answer is probably better. It is more likely that you would have the dataframe to work with rather than the individual row vectors.
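As a further aside, the per-row lookup can also be vectorized with two-column matrix indexing. This is only a sketch, and it assumes (as in the question) that all the lookup columns are numeric; the character "company" column must be dropped before the matrix conversion, because as.matrix would otherwise coerce everything to character:

```r
company <- c("compA", "compB", "compC")
df <- data.frame(company,
                 compA = c(1, 2, 3),
                 compB = c(2, 3, 1),
                 compC = c(3, 1, 2))

num <- df[, -1]                        # only the numeric lookup columns
idx <- cbind(seq_len(nrow(df)),        # one (row, column) pair per row
             match(as.character(df$company), names(num)))
df$new <- as.matrix(num)[idx]
df$new
# [1] 1 3 2
```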
I am not sure I understand your question; it is unclear from your description. But it seems you are asking for the diagonal of the data values, since that is where the "name is in the column 'company' of the same line". The following will do this:
df$new <- diag(matrix(c(compA,compB,compC), nrow = 3, ncol = 3))
The diag function will return the diagonal of the matrix for you. So I first concatenated the three original vectors into one vector and then specified it to be wrapped into a matrix of three rows and three columns. Then I took the diagonal. The whole thing is then added to the dataframe.
Did that answer your question?

R - find all unique values among subsets of a data frame

I have a data frame with two columns. The first column defines subsets of the data. I want to find all values in the second column that only appear in one subset in the first column.
For example, from:
df = data.frame(
  data_subsets = rep(LETTERS[1:2], each = 5),
  data_values = c(1,2,3,4,5,2,3,4,6,7))
data_subsets data_values
A 1
A 2
A 3
A 4
A 5
B 2
B 3
B 4
B 6
B 7
I would want to extract the following data frame.
data_subsets data_values
A 1
A 5
B 6
B 7
I have been playing around with duplicated but I just can't seem to make it work. Any help is appreciated. There are a number of topics tackling similar problems, I hope I didn't overlook the answer in my searches!
EDIT
I modified the approach from #Matthew Lundberg of counting the number of elements and extracting from the data frame. For some reason his approach was not working with the data frame I had, so I came up with this, which is less elegant but gets the job done:
counts=rowSums(do.call("rbind",tapply(df$data_subsets,df$data_values,FUN=table)))
extract=names(counts)[counts==1]
df[match(extract,df$data_values),]
First, find the count of each element in df$data_values:
x <- sapply(df$data_values, function(x) sum(as.numeric(df$data_values == x)))
> x
[1] 1 2 2 2 1 2 2 2 1 1
Now extract the rows:
> df[x==1,]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7
Note that you missed "A 5" above. There is no "B 5".
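The count step can also be done in one pass with table instead of sapply; this is a sketch of an equivalent variant (not from the original answer):

```r
df <- data.frame(data_subsets = rep(LETTERS[1:2], each = 5),
                 data_values  = c(1, 2, 3, 4, 5, 2, 3, 4, 6, 7))

counts <- table(df$data_values)        # occurrences of each value overall
df[df$data_values %in% names(counts)[counts == 1], ]
#    data_subsets data_values
# 1             A           1
# 5             A           5
# 9             B           6
# 10            B           7
```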
You had the right idea with duplicated. The trick is to combine fromLast = TRUE and fromLast = FALSE options to get a full list of non-duplicated rows.
!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE)
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
Indexing your data.frame with this vector gives:
df[!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE),]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7
A variant of P Lapointe's answer would be
df[! df$data_values %in% df[duplicated( unique(df)$data_values ), ]$data_values,]
The unique() deals with the possibility (not present in your test data) that some rows of the data may be fully identical: such rows are kept once, as long as the same data_values does not appear for distinct data_subsets (or other distinct columns).
You can use the 'dplyr' and 'explore' libraries to examine this problem.
library(dplyr)
library(explore)
df = data.frame(
  data_subsets = rep(LETTERS[1:2], each = 5),
  data_values = c(1,2,3,4,5,2,3,4,6,7))
df %>% describe(data_subsets)
######## output ########
#variable = data_subsets
#type = character
#na = 0 of 10 (0%)
#unique = 2
# A = 5 (50%)
# B = 5 (50%)
