I've got a data frame like this one:
1 1 1 K 1 K K
2 1 2 K 1 K K
3 8 3 K 1 K K
4 8 2 K 1 K K
1 1 1 K 1 K K
2 1 2 K 1 K K
I want to remove all the columns that contain the same value in every row, i.e., K, so my result will look like this:
1 1 1 1
2 1 2 1
3 8 3 1
4 8 2 1
1 1 1 1
2 1 2 1
I tried to iterate over the columns with a for loop, but I didn't get anywhere. Any ideas?
To select columns with more than one value regardless of type:
uniquelength <- sapply(d, function(x) length(unique(x)))
d <- subset(d, select = uniquelength > 1)
(Oops, Roman's question is right -- this could knock out your column 5 as well)
Maybe (edit: thanks to comments!)
isfac <- sapply(d, inherits, "factor")
d <- subset(d, select = !isfac | uniquelength > 1)
or
d <- d[, !isfac | uniquelength > 1]
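For completeness, a minimal end-to-end sketch on the example data, assuming the K columns are read in as character vectors (the read.table default since R 4.0; on older versions, keep the inherits(x, "factor") test above):
d <- read.table(text="
1 1 1 K 1 K K
2 1 2 K 1 K K
3 8 3 K 1 K K
4 8 2 K 1 K K
1 1 1 K 1 K K
2 1 2 K 1 K K")
uniquelength <- sapply(d, function(x) length(unique(x)))
ischr <- sapply(d, is.character)
# keep every numeric column, plus any character column that varies;
# this drops V4, V6 and V7 but keeps the constant numeric column V5
d[, !ischr | uniquelength > 1]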
Here's a solution that'll work to remove any replicated columns (including, e.g., pairs of replicated character, numeric, or factor columns). That's how I read the OP's question, and even if it's a misreading, it seems like an interesting question as well.
df <- read.table(text="
1 1 1 K 1 K K
2 1 2 K 1 K K
3 8 3 K 1 K K
4 8 2 K 1 K K
1 1 1 K 1 K K
2 1 2 K 1 K K")
# Need to run duplicated() in 'both directions', since it considers
# the first example to be **not** a duplicate.
repdCols <- as.logical(duplicated(as.list(df), fromLast = FALSE) +
                       duplicated(as.list(df), fromLast = TRUE))
# [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE
df[!repdCols]
# V1 V2 V3 V5
# 1 1 1 1 1
# 2 2 1 2 1
# 3 3 8 3 1
# 4 4 8 2 1
# 5 1 1 1 1
# 6 2 1 2 1
Another way to do this is using the higher-order function Filter. Here is the code:
to_keep <- function(x) any(is.numeric(x), length(unique(x)) > 1)
Filter(to_keep, d)
A one-liner solution:
df2 <- df[sapply(df, function(x) !is.factor(x) | length(unique(x)) > 1)]
This is an example.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
df
item n
1 a 3
2 b 2
3 c 2
4 d 1
5 e 1
The items need to be grouped so that each group has a sample size of at least 4.
This would be one solution if you follow the ordering of df:
item n cluster
1 a 3 1
2 b 2 1
3 c 2 2
4 d 1 2
5 e 1 2
How can I get all possible unique solutions?
Further, the code should not allow any cluster to have a sample size of less than 4.
Below is a brute-force approach using the partitions package. The idea is to find every partition of the rows of df, then sum n within each group and check that the size requirement is met.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
minSize <- 4
funGetClusters <- function(df, minSize) {
  allParts <- partitions::listParts(nrow(df))
  goodInd <- which(sapply(allParts, function(p) {
    all(sapply(p, function(x) sum(df$n[x])) >= minSize)
  }))
  allParts[goodInd]
}
clusterBreakdown <- funGetClusters(df, minSize)
allDfs <- lapply(clusterBreakdown, function(p) {
  copyDf <- df
  copyDf$cluster <- 1L
  clustInd <- 2L
  for (i in p[-1]) {            # the first group keeps cluster 1
    copyDf$cluster[i] <- clustInd
    clustInd <- clustInd + 1L   # advance the label for the next group
  }
  copyDf
})
Here is the output:
allDfs
[[1]]
item n cluster
1 a 3 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 1
[[2]]
item n cluster
1 a 3 1
2 b 2 2
3 c 2 2
4 d 1 1
5 e 1 1
[[3]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 2
5 e 1 1
[[4]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
[[5]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 2
4 d 1 1
5 e 1 1
[[6]]
item n cluster
1 a 3 2
2 b 2 2
3 c 2 1
4 d 1 1
5 e 1 1
It should be noted that there is a combinatorial explosion as the number of rows increases. For example, with just 10 rows we would have to test 115975 different partitions.
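To see the growth directly, the partition counts (the Bell numbers) can be checked with the same listParts() call used above:
length(partitions::listParts(5))    # 52
length(partitions::listParts(10))   # 115975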
As @chinsoon comments, RcppAlgos could be a good choice for an acceptable solution for larger cases. Disclaimer: I am the author. I have answered similar questions with much larger inputs and have had good success.
Allocating tasks to parallel workers so that expected cost is roughly equal
Split a set into n unequal subsets with the key deciding factor being that the elements in the subset aggregate and equal a predetermined amount?
@AllanCameron also has a great answer and a nice methodology for attacking this problem. You should give that a read as well.
Lastly, the following vignette by Robin K. S. Hankin (author of the partitions package) and Luke J. West is not only a great read, but also very applicable to problems like the one presented here.
Set Partitions in R
I'm not a very experienced R user, so I'm seeking advice on how to optimize what I've built and on which direction to move in.
I have one reference data frame; it contains four columns with integer values and one ID column.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
  df[i, 1:4] <- sample(1:5, 4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
The second data frame holds manual user input. I want to turn this into a Shiny app later on, which is one more reason I'm asking for optimization: my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from the reference sequentially, following these rules:
If there is an exact match on a whole row between the reference table and the manual input, extract that row (or those rows) from the reference and show it; if not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A and B).
So with my limited knowledge I've written this:
# subtraction manual from reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check how many columns in each row have zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and finds ID G, but it looks ugly to me. So the first question is: what would be the recommended way to improve this? Are there any functions, packages, or something else I'm missing?
Second question: I want to complicate the condition.
Imagine we have this reference data set:
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
The filtering rules should be as follows:
If there is an exact match on a whole row between the reference table and the manual input, extract that row (or those rows) from the reference and show it; if not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A and B).
From the rows where only two variables match, keep those with a ±1 difference in the columns to the right. So from the example above I should end up with cases G and I from the reference table.
If I keep going the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
It will be about 11 columns at the end, but I assume it doesn't matter so much.
Keeping this objective in mind, how would you suggest I proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df as a matrix and use row names for your letters. Something like:
set.seed(2)
# one possible construction (the values shown below are illustrative):
# df <- matrix(sample(1:5, 40, replace = TRUE), ncol = 4,
#              dimnames = list(LETTERS[1:10], c("A", "B", "C", "D")))
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, since the manual entry is user input, you could use a vector:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and the reference (all 4 columns, all numeric), you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, you could calculate the differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
The rows whose diffs start with 0 and stay at 0 the longest are your best matches (the same as checking from right to left iteratively). The match length of each row is given by the column of its first non-zero element:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E, which is non-zero only in the 4th column (only the last column did not match). You can extract the rows with the maximum value of df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = T)), ]
A B C D
3 1 5 3
Finally, if you want all rows that match on two columns and are off by ±1 in the next, first check the number of matched columns (here, max(df.best) should be 3). Then compare the absolute differences with the vector c(0, 0, 1), which means two exact matches with the 3rd column off by ±1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = T) == 3) {
  vec.alt = c(0, 0, 1)
  df[apply(df.diff[, 1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize for different numbers of columns, @IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)
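Pulling the steps together, a helper along these lines (a sketch, using the same df, vec.man, and df.best conventions as above) could run the whole procedure at once:
best_matches <- function(df, vec.man, min.match = 2) {
  df.diff <- sweep(df, 2, vec.man)                  # column-wise differences
  df.best <- apply(df.diff, 1, function(x) which(x != 0)[1])
  df.best[is.na(df.best)] <- ncol(df) + 1           # rows that match everywhere
  n.cols <- max(df.best)
  if (n.cols <= min.match) return(NULL)             # fewer than two columns matched
  df[df.best == n.cols, , drop = FALSE]
}
best_matches(df, c(3, 1, 5, 2))   # returns row E, as in the walk-through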
I have a really big matrix and I'd like to replace the values in it using a lookup table.
I have a table of values (which looks a bit like this):
Origin Destination Distance Final
1 1 1 A
1 1 2 B
1 1 3 E
1 2 2 F
1 3 1 B
1 3 2 C
2 1 1 B
2 2 1 A
2 3 3 C
3 1 1 A
3 1 2 D
3 2 1 B
3 3 2 A
...
and I have a matrix, which looks something like this:
x 1 1 3 1 2 1 ...
1 1 3 2 1 2 1
1 2 2 1 2 2 1
3 2 1 2 1 1 2
1 3 1 2 1 2 1
2 1 1 3 1 1 1
1 2 2 1 3 1 1
...
I'm trying to match the matrix row names against the Origin column, the matrix column names against the Destination column, and the matrix values against the Distance column, and then replace each matched value with the corresponding Final value.
The matrix is 4000 by 4000; the table is 27 by 4.
So when I'm done it should look like:
x 1 1 3 1 2 1 ...
1 A E C A F A
1 B B B B F A
3 D A A A B A
1 E A C A F A
2 B B C B A B
1 B B B E A A
...
I'm currently using a little loop, which looks like this:
for (i in 1:nrow(CategoryTable)){
  Origin <- CategoryTable[i, "O"]
  Dest <- CategoryTable[i, "D"]
  Distance <- CategoryTable[i, "Dist"]
  Final <- CategoryTable[i, "Final"]
  CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin, colnames(CategoryGrid) %in% Dest]] <- CategoryTable[i, "Final"]
}
Based on this question (Replace all values in a matrix <0.1 with 0), I can replace everything matching a specific value, or everything in a given column or row. But I can't match on all three at once.
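For reference, those single-criterion replacements on some matrix m look something like this (illustrative values only):
m[m == 2] <- 0                    # match by value
m[rownames(m) %in% "1", ] <- 0    # match by row
m[, colnames(m) %in% "3"] <- 0    # match by column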
The active ingredient of the current attempt is:
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
So I was trying to match the rows and columns, pass that as a boolean vector to the value match, and then do the RHS assignment.
However, what I actually get is:
Error in CategoryGrid[row.names(CategoryGrid) %in% Origin, colnames(CategoryGrid) %in% :
incorrect number of dimensions
How would you go about achieving this?
Replacing the active ingredient with the following did the trick.
CategoryGrid[row.names(CategoryGrid) == Origin, colnames(CategoryGrid) == Dest] <-
  apply(CategoryGrid[row.names(CategoryGrid) == Origin, colnames(CategoryGrid) == Dest],
        MARGIN = c(1, 2), function(x) ifelse(x == Distance, Final, x))
Specify the rules for the rows and columns, then do an apply and put the third variable in its own function.
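An alternative sketch that avoids looping over the table at all (not from the original answer; it assumes the same column names O, D, Dist, and Final as the loop above): build one "origin.destination.distance" key per cell and translate the whole grid with a single lookup.
# named vector mapping "O.D.Dist" keys to Final values
lookup <- setNames(as.character(CategoryTable$Final),
                   paste(CategoryTable$O, CategoryTable$D,
                         CategoryTable$Dist, sep = "."))
# the matching key for every cell (matrices are stored column-major)
keys <- paste(rep(rownames(CategoryGrid), times = ncol(CategoryGrid)),
              rep(colnames(CategoryGrid), each = nrow(CategoryGrid)),
              CategoryGrid, sep = ".")
out <- matrix(lookup[keys], nrow = nrow(CategoryGrid),
              dimnames = dimnames(CategoryGrid))  # NA where no rule matches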
I want to subset a data frame by a vector, but replicate the subsetting for each value in the vector:
data = data.frame(A = c(1,2,3,1), B = c(1,2,3,4))
vec = c(1, 1, 1)
subset(data, A %in% vec)
A B
1 1 1
4 1 4
Instead of this result I want this:
A B
1 1 1
4 1 4
1 1 1
4 1 4
1 1 1
4 1 4
If you use the purrr library, you can do
map_df(vec, function(x) subset(data, A == x))
With base R, it would be
do.call("rbind", lapply(vec, function(x) subset(data, A == x)))
You need to expand it, i.e.
df2 <- subset(data, A %in% vec)
df2[rep(rownames(df2), length(vec)),]
# A B
#1 1 1
#4 1 4
#1.1 1 1
#4.1 1 4
#1.2 1 1
#4.2 1 4
One option with data.table:
library(data.table)
setDT(data, key = 'A')[.(vec)]
# A B
#1: 1 1
#2: 1 4
#3: 1 1
#4: 1 4
#5: 1 1
#6: 1 4
Or use merge, which gives the Cartesian product you need when there are duplicated values in the merge-by column:
merge(data, data.frame(A = vec))
# A B
#1: 1 1
#2: 1 1
#3: 1 1
#4: 1 4
#5: 1 4
#6: 1 4
A base R solution along split-apply-combine lines would be
do.call(rbind, lapply(vec, function(i) data[data$A == i, ]))
A B
1 1 1
4 1 4
11 1 1
41 1 4
12 1 1
42 1 4
This could be useful if vec contained an uneven mixture of values. This solution could be expensive if there are many repetitions in vec. In that instance, computation can be reduced by combining it with the rep idea in soto's answer as follows.
# count the number of repetitions of each unique value
uni <- table(vec)
# subset the data once per unique value
temp <- lapply(as.numeric(names(uni)), function(i) data[data$A == i, ])
# combine results, repeating each data.frame according to its count
do.call(rbind, temp[rep(seq_along(uni), times = uni)])
I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings:
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to be able to create a new dataframe where, ideally, the headers would be the string variables from the previous dataframe (a, b, c, d) and the contents of each row would be the number of occurrences of the respective variable in the original dataframe. Using the example from above, this would look like
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset there are hundreds of variables and thousands of samples, so it'd be ideal if I could automatically pull the names out of the original dataframe and alphabetize them as the headers for the new dataframe.
You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df, 1, function(x) table(factor(x, levels = Un1))))
You can stack the columns and then use table:
table(cbind(id = 1:nrow(df),
            stack(lapply(df, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
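If a plain data frame (rather than a table object) is needed downstream, one option, sketched here, is to convert the result with as.data.frame.matrix:
tab <- table(cbind(id = 1:nrow(df),
                   stack(lapply(df, as.character)))[c("id", "values")])
df2 <- as.data.frame.matrix(tab)   # columns a, b, c, d, alphabetized by table()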