I have a data frame similar to the following format:
Doc Category val
A aa 1
B ab 6
C ab 3
D cc 6.....
I am using the following code to compute the sums of all combinations of val, and then extract the rows that add up to a target sum I have already identified.
#all combinations
res <- Map(combn, list(val), seq_along(val), simplify = FALSE)
x=unlist(res, recursive = FALSE)
z=lapply(x, function(x) sum(x))
My issue is determining the best way to preserve the character columns in the data frame, as the code above only gives numerical values. My current approach is a mapping based on val, which normally works fine but runs into problems when there are duplicated values.
For example, if my target sum is 7, I eventually want output that looks like this (there are other ways to get to this value, but for now just returning the first instance works):
Doc Category val
A aa 1
B ab 6
Is there a better way to map to the non-numerical columns to achieve this output?
Would this solution work for you:
df <- data.frame(Doc = LETTERS[1:7],
Category = c("aa","ab","ab","cc","ca","cb","bb"),
val = c(1,6,3, 6, 4, 5, 2),
stringsAsFactors=FALSE)
df
# Doc Category val
# 1 A aa 1
# 2 B ab 6
# 3 C ab 3
# 4 D cc 6
# 5 E ca 4
# 6 F cb 5
# 7 G bb 2
target.sum=7
# create an "id" variable that is equal to the index of all rows
df$id <- seq_along(df$val)
id.res <- Map(combn, list(df$id), seq_along(df$id), simplify = FALSE)
x=unlist(id.res, recursive = FALSE)
#remove all elements in the list where the sum of
# values in column val is not equal to target value
x.list <- lapply(x,FUN=function(x){ if(sum(df$val[x]) == target.sum ) df[x,] else NA})
#remove missing values
x.list <-x.list[!is.na(x.list)]
x.list
# [[1]]
# Doc Category val id
# 1 A aa 1 1
# 2 B ab 6 2
#
# [[2]]
# Doc Category val id
# 1 A aa 1 1
# 4 D cc 6 4
#
# [[3]]
# Doc Category val id
# 3 C ab 3 3
# 5 E ca 4 5
#
# [[4]]
# Doc Category val id
# 6 F cb 5 6
# 7 G bb 2 7
#
# [[5]]
# Doc Category val id
# 1 A aa 1 1
# 5 E ca 4 5
# 7 G bb 2 7
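Since the question only needs the first matching combination, the same index-based idea can stop as soon as a hit is found instead of building every combination up front. The following is a minimal sketch, not a definitive implementation: the function name first_subset_sum and the tolerance argument tol are assumptions introduced here, and "first" means the smallest subset, in the order combn() generates combinations.

```r
# Hypothetical helper: return the rows of the first combination whose
# val column sums to target (smallest subsets are tried first).
first_subset_sum <- function(df, target, tol = 1e-8) {
  n <- nrow(df)
  for (k in seq_len(n)) {
    idx <- combn(n, k)                               # each column holds k row indices
    sums <- colSums(matrix(df$val[idx], nrow = k))   # sum of val for each combination
    hit <- which(abs(sums - target) < tol)           # tolerance guards against floating-point error
    if (length(hit) > 0) {
      return(df[idx[, hit[1]], , drop = FALSE])
    }
  }
  df[0, , drop = FALSE]                              # no combination matches: empty data frame
}

docs <- data.frame(Doc = LETTERS[1:7],
                   Category = c("aa", "ab", "ab", "cc", "ca", "cb", "bb"),
                   val = c(1, 6, 3, 6, 4, 5, 2),
                   stringsAsFactors = FALSE)
first_hit <- first_subset_sum(docs, 7)
first_hit
```

Because the combinations are built from row indices rather than from the values themselves, duplicated values (such as the two 6s here) are not a problem.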
I am looking for a way to find clusters of size 2 (pairs).
Is there a simple way to do that?
Imagine I have some kind of data where I want to match on x and y, like
library(cluster)
set.seed(1)
df = data.frame(id = 1:10, x_coord = sample(10,10), y_coord = sample(10,10))
I want to find the closest pairs based on the distance between x_coord and y_coord:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
I get a dendrogram like the one below. I would like the pairs (9,10), (1,3), (6,7), (4,5) to be grouped together, and cases 8 and 2, which are left unpaired, to be removed.
Maybe there is a more effective alternative for doing this than clustering.
Ultimately, I would like to remove the unmatched ids, keep the pairs, and end up with a dataset like this one:
id x_coord y_coord pair_id
1 9 3 1
3 7 5 1
4 1 8 2
5 2 2 2
6 5 6 3
7 3 10 3
9 6 4 4
10 8 7 4
You could use the element h$merge. Any row of this two-column matrix in which both entries are negative represents a pairing of singletons. Therefore you can do:
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
df
#> id x_coord y_coord pair
#> 1 1 9 3 4
#> 3 3 7 5 4
#> 4 4 1 8 1
#> 5 5 2 2 1
#> 6 6 5 6 2
#> 7 7 3 10 2
#> 9 9 6 4 3
#> 10 10 8 7 3
Note that the pair numbers follow the order in which the pairs were merged, i.e. ascending "height" on the dendrogram. If you want them in ascending order of their appearance in the data frame, you can add the line
df$pair <- as.numeric(factor(df$pair, levels = unique(df$pair)))
Anyway, if we repeat your plotting code on our newly modified df, we can see there are no unpaired singletons left:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
And we can see the method scales nicely:
df = data.frame(id = 1:50, x_coord = sample(50), y_coord = sample(50))
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
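The h$merge approach can also be wrapped in a small function. This is only a sketch under stated assumptions: the name pair_singletons and its cols argument are hypothetical, it indexes by row position rather than by an id column, and it uses drop = FALSE so that it still works when only one pair (or none) is found.

```r
# Hypothetical helper: keep only the rows that hclust() merges as
# singleton-singleton pairs, and label each pair.
pair_singletons <- function(df, cols) {
  h <- hclust(dist(df[, cols, drop = FALSE]))
  # rows of h$merge with two negative entries join two single observations
  is_pair <- h$merge[, 1] < 0 & h$merge[, 2] < 0
  pairs <- -h$merge[is_pair, , drop = FALSE]
  pair_id <- rep(NA_integer_, nrow(df))
  pair_id[pairs[, 1]] <- seq_len(nrow(pairs))
  pair_id[pairs[, 2]] <- seq_len(nrow(pairs))
  df$pair <- pair_id
  df[!is.na(df$pair), , drop = FALSE]                # drop unpaired observations
}

# Small made-up example: rows 1-2 and 3-4 are close, row 5 is isolated
pts <- data.frame(id = 1:5, x = c(0, 1, 10, 12, 50))
paired <- pair_singletons(pts, "x")
paired
```

As in the answer above, the pair labels follow the merge order, so the tightest pair gets label 1.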
I have a data.frame with two columns. I need to convert the psw column to a 5-digit format, padding shorter values with leading zeros. How can I automatically pad, say, a 1-digit value to 5 digits in the psw column? Is there an easy way to do this in R? Thanks
Here is a reproducible data.frame:
mydat <- data.frame(ID=LETTERS[seq( from = 1, to = 6)],
psw=c(10501,3,80310,8930,234,1))
> mydat
ID psw
1 A 10501
2 B 3
3 C 80310
4 D 8930
5 E 234
6 F 1
This is my desired output:
> mydat
ID psw
1 A 10501
2 B 00003
3 C 80310
4 D 08930
5 E 00234
6 F 00001
You can't do that while keeping the psw column numeric, but you can format it to be a certain width as a character vector. Here are two methods for this:
In base R you can use formatC():
mydat <- data.frame(ID=LETTERS[seq( from = 1, to = 6)],
psw=c(10501,3,80310,8930,234,1))
mydat$psw <- formatC(mydat$psw, width = 5, format = "d", flag = "0")
mydat
# ID psw
# 1 A 10501
# 2 B 00003
# 3 C 80310
# 4 D 08930
# 5 E 00234
# 6 F 00001
In stringr, you can use str_pad():
install.packages("stringr")
library(stringr)
mydat <- data.frame(ID=LETTERS[seq( from = 1, to = 6)],
psw=c(10501,3,80310,8930,234,1))
mydat$psw <- str_pad(mydat$psw, width = 5, pad = "0")
mydat
# ID psw
# 1 A 10501
# 2 B 00003
# 3 C 80310
# 4 D 08930
# 5 E 00234
# 6 F 00001
You can also use sprintf() in base R:
mydat$psw <- sprintf("%05d",mydat$psw)
mydat
# ID psw
# 1 A 10501
# 2 B 00003
# 3 C 80310
# 4 D 08930
# 5 E 00234
# 6 F 00001
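The point that the leading zeros only exist in the character representation can be checked directly. The short sketch below reuses the values from the question; the variable names padded and restored are introduced here for illustration only.

```r
psw <- c(10501, 3, 80310, 8930, 234, 1)

# Padding produces a character vector, not a numeric one
padded <- formatC(psw, width = 5, format = "d", flag = "0")
is.character(padded)
nchar(padded)

# Converting back to numeric drops the leading zeros again
restored <- as.numeric(padded)
identical(restored, psw)
```

So the padded column is best treated as an identifier-like string, and converted back with as.numeric() only where arithmetic is needed.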
I would like to create bins for the variable numbers per category name inside a function, but I am having difficulty using the category name passed as a parameter inside the function. Maybe a data.table approach would be better.
set.seed(10)
b<-(rnorm(10, sd=1,mean=10))
y<-runif(3)
pr<-y/sum(y)
names<-unlist(lapply(mapply(rep, LETTERS[1:3], 1:3), function (x) paste0(x, collapse = "") ) )
x <- sample(names, 10, replace=TRUE, prob=pr)
df<-data.frame(name=x,numbers=b)
df
#working without bin limits per category (not desired)
#and using "numbers" in cut (not desired)
library(dplyr)
binfunction1 <- function(df, colgroup1, varcount, binsize) {
  # fixed limits for every group; limits by colgroup not implemented
  breaks <- seq(7, 15, by = binsize)
  new <- df %>%
    group_by(across(all_of(colgroup1))) %>%  # group by the column named in colgroup1
    mutate(bin = cut(numbers, breaks = breaks,
                     labels = 1:(length(breaks) - 1)))
  return(new)
}
binfunction1(df,"name","numbers",0.5)
name numbers bin
<fctr> <dbl> <fctr>
1 BB 10.018746 7
2 A 9.815747 6
3 CCC 8.628669 4
4 CCC 9.400832 5
5 BB 10.294545 7
6 CCC 10.389794 7
7 A 8.791924 4
8 A 9.636324 6
9 A 8.373327 3
10 A 9.743522 6
Not the most elegant solution, but is the outcome something you are after? (I didn't quite understand your question)
binfunction3 <- function(x, colgroup1, varcount, binsize) {
tmp <- split(x, x[[colgroup1]], drop = TRUE)
tp <- lapply(tmp, function(k) {
breaks <- c(seq(min(k[[varcount]])*0.9, max(k[[varcount]])*1.1, by = binsize))
cbind(k, data.frame(bin = cut(k[[varcount]], breaks, labels = 1:(length(breaks)-1))))
})
tp <- do.call(rbind, tp)
rownames(tp) <- gsub("[[:alpha:]]*\\.", "", rownames(tp))
return(tp[rownames(x),])
}
binfunction3(df,"name","numbers",0.5)
# name numbers bin
# 1 A 10.018746 5
# 2 CCC 9.815747 5
# 3 CCC 8.628669 2
# 4 BB 9.400832 2
# 5 A 10.294545 6
# 6 BB 10.389794 4
# 7 A 8.791924 3
# 8 CCC 9.636324 4
# 9 A 8.373327 2
# 10 A 9.743522 5
My answer is based on Mikko's, but allows better control over the min and max limits of the breaks and over the bin size.
binfunctionnew <- function(x, colgroup, varcount, binexp) {
tmp <- split(x, x[colgroup], drop = TRUE)
tp <- lapply(tmp, function(k) {
bin<-cut(k[,varcount],
breaks=c(seq(min(k[,varcount])*(1-10^(-(binexp+1))),
max(k[,varcount])*(1+10^(-(binexp-2))),
by = 10^(-(binexp))) ) , labels=F)
cbind (k, data.frame(bin = bin))
} )
tp <- do.call(rbind, tp)
return(tp)
}
# examples: binexp = 1 (left output) and binexp = 0 (right output)
binfunctionnew(df,"name","numbers",1); binfunctionnew(df,"name","numbers",0)
# name numbers bin name numbers bin
# A.1 A 10.018746 18 A.1 A 10.018746 3
# A.5 A 10.294545 21 A.5 A 10.294545 3
# A.7 A 8.791924 6 A.7 A 8.791924 2
# A.9 A 8.373327 1 A.9 A 8.373327 1
# A.10 A 9.743522 15 A.10 A 9.743522 3
# BB.4 BB 9.400832 1 BB.4 BB 9.400832 1
# BB.6 BB 10.389794 11 BB.6 BB 10.389794 2
# CCC.2 CCC 9.815747 13 CCC.2 CCC 9.815747 3
# CCC.3 CCC 8.628669 1 CCC.3 CCC 8.628669 1
# CCC.8 CCC 9.636324 11 CCC.8 CCC 9.636324 2
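For comparison, per-group binning can also be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. This is a minimal sketch, not the method used in the answers above: the function name bin_by_group is hypothetical, and starting each group's breaks at floor(min) is an assumption made here for illustration.

```r
# Hypothetical helper: bin x separately within each level of group,
# with breaks starting at floor(min) of the group and spaced by binsize.
bin_by_group <- function(x, group, binsize) {
  ave(x, group, FUN = function(v) {
    breaks <- seq(floor(min(v)), max(v) + binsize, by = binsize)
    cut(v, breaks, labels = FALSE, include.lowest = TRUE)
  })
}

# Small made-up example with two groups
vals <- c(1.2, 1.7, 3.1, 10.4, 10.6, 12.0)
grp  <- c("a", "a", "a", "b", "b", "b")
bins <- bin_by_group(vals, grp, binsize = 1)
bins
```

The group column and the value column are passed as vectors, e.g. bin_by_group(df[[varcount]], df[[colgroup]], 0.5), which avoids the problem of referring to a column by a name held in a function argument.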
I want to column-bind (cbind) mydf[,"c"] under the new name newcolumn in one step, so that the result is again the matrix mydf. How do I do it?
mydf
# a b c
# 1 2 6
# 1 3 4
mydf
# a b c newcolumn
# 1 2 6 6
# 1 3 4 4
You can specify the new column name in the call to cbind:
mydf <- cbind(mydf, newcolumn=mydf[,"c"])
mydf
# a b c newcolumn
# [1,] 1 2 6 6
# [2,] 1 3 4 4
Data (constructed with the same approach):
mydf <- cbind(a=c(1, 1), b=c(2, 3), c=c(6, 4))
If you had a data frame instead of a matrix, you could simply do mydf$newcolumn <- mydf$c.
There are many approaches you could take here:
mydf <- data.frame(a=c(1,1),b=c(2,3),c=c(6,4));
mydf;
## a b c
## 1 1 2 6
## 2 1 3 4
data.frame(mydf,newcolumn=mydf$c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
cbind(mydf,newcolumn=mydf$c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
transform(mydf,newcolumn=c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
within(mydf,newcolumn <- c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
mydf$newcolumn <- mydf$c;
mydf;
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
Of the 5 approaches shown above, only the last actually modifies mydf. For the other 4, you have to assign the return value to mydf to replace it with the new data.frame that has the additional column.
library('microbenchmark');
bind.df <- function() mydf <- data.frame(mydf,newcolumn=mydf$c);
bind.cb <- function() mydf <- cbind(mydf,newcolumn=mydf$c);
bind.tr <- function() mydf <- transform(mydf,newcolumn=c);
bind.wi <- function() mydf <- within(mydf,newcolumn <- c);
bind.as1 <- function() mydf$newcolumn <- mydf$c;
bind.as2 <- function() mydf['newcolumn'] <- mydf['c'];
bind.as3 <- function() mydf[,'newcolumn'] <- mydf[,'c'];
bind.as4 <- function() mydf[['newcolumn']] <- mydf[['c']];
N <- 1e5; mydf <- data.frame(a=rep(c(1,1),N),b=rep(c(2,3),N),c=rep(c(6,4),N));
microbenchmark(bind.df(),bind.cb(),bind.tr(),bind.as1(),bind.as2(),bind.as3(),bind.as4(),times=1e4);
## Unit: microseconds
## expr min lq mean median uq max neval
## bind.df() 97.077 112.046 128.66080 121.027 134.711 1690.513 10000
## bind.cb() 86.814 100.927 117.14364 109.907 122.737 1849.172 10000
## bind.tr() 105.203 120.171 138.90802 131.290 145.830 1680.250 10000
## bind.as1() 12.402 20.100 23.35085 22.239 25.660 148.397 10000
## bind.as2() 370.776 412.686 596.47901 425.088 449.036 41799.239 10000
## bind.as3() 347.682 385.743 564.78320 396.435 419.528 42144.355 10000
## bind.as4() 17.534 26.087 30.09639 28.654 32.930 638.915 10000
If you have two separate columns and would like to bind them side by side, convert each to a data frame and use cbind:
dapu <- cbind(data.frame(data_r), data.frame(data_c))
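As a concrete illustration of the line above, here is a small sketch. The contents of data_r and data_c are made up for this example, since the original answer does not show them.

```r
# Hypothetical inputs: two vectors of equal length
data_r <- c(1, 2, 3)
data_c <- c("a", "b", "c")

# Each vector becomes a one-column data frame named after the variable,
# and cbind() places them side by side
dapu <- cbind(data.frame(data_r), data.frame(data_c))
dapu
names(dapu)
```

Binding data frames rather than bare vectors keeps each column's own type; cbind() on the raw vectors would instead produce a character matrix.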
How to ignore case when using subset function in R?
eos91corr.data <- subset(test.data,select=c(c(X,Y,Z,W,T)))
I would like to select the columns named x, y, z, w, t regardless of case. What should I do?
Thanks
If you can live without the subset() function, the tolower() function may work:
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)
dat[,tolower(names(dat)) %in% c("xy","x")]
However, this will return a data.frame with the columns in the order they are in the original dataset dat: both
dat[,tolower(names(dat)) %in% c("xy","x")]
and
dat[,tolower(names(dat)) %in% c("x","xy")]
will yield the same result, although the order of the target names has been reversed.
If you want the columns in the result to be in the order of the target vector, you need to be slightly more fancy. The two following commands both return a data.frame with the columns in the order of the target vector (i.e., the results will be different, with columns switched):
dat[,sapply(c("x","xy"),FUN=function(foo)which(foo==tolower(names(dat))))]
dat[,sapply(c("xy","x"),FUN=function(foo)which(foo==tolower(names(dat))))]
You could use regular expressions with the grep function to ignore case when identifying column names to select. Once you have identified the desired column names, then you can pass these to subset.
If your data are
dat <- data.frame(xy = 1:5, x = 1:5, mm = 1:5, y = 1:5, z = 1:5,
w = 1:5, t = 1:5, r = 1:5)
# xy x mm y z w t r
# 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4
# 5 5 5 5 5 5 5 5 5
Then
(selNames <- grep("^[XYZWT]$", names(dat), ignore.case = TRUE, value = TRUE))
# [1] "x" "y" "z" "w" "t"
subset(dat, select = selNames)
# x y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
EDIT If your column names are longer than one letter, the above approach won't work too well. So assuming you can get your desired column names in a vector, you could use the following:
upperNames <- c("XY", "Y", "Z", "W", "T")
(grepPattern <- paste0("^", upperNames, "$", collapse = "|"))
# [1] "^XY$|^Y$|^Z$|^W$|^T$"
(selNames2 <- grep(grepPattern, names(dat), ignore.case = TRUE, value = TRUE))
# [1] "xy" "y" "z" "w" "t"
subset(dat, select = selNames2)
# xy y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
The 'stringr' package is a very neat wrapper for all of this functionality. It supports case-insensitive matching, for example by wrapping the pattern in regex(..., ignore_case = TRUE).
Also, you may want to consider using match() rather than subset().
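The match-based idea can be sketched as follows. The function name select_ignore_case is hypothetical; it lower-cases both the requested names and the actual column names, and returns the columns in the order they were requested.

```r
# Hypothetical helper: select columns by name, ignoring case,
# in the order given by wanted.
select_ignore_case <- function(dat, wanted) {
  idx <- match(tolower(wanted), tolower(names(dat)))
  if (anyNA(idx)) {
    stop("columns not found: ", paste(wanted[is.na(idx)], collapse = ", "))
  }
  dat[, idx, drop = FALSE]
}

dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
                  y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)
picked <- select_ignore_case(dat, c("X", "xy", "T"))
names(picked)
```

Note that match() returns only the first match, so if two columns differ only by case (e.g. x and X), only the first one is selected.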