I am trying to look up the importantval column for a number that falls within a range. I have no clue how to even start this; does anyone have any ideas?
data<-data.frame(lower=c(1,4,6,7,7),upper=c(3,5,7,8,9),importantval=c(99,98,97,96,95))
vals<-c(1.14,3.5,7.2,19)
> data
  lower upper importantval
1     1     3           99
2     4     5           98
3     6     7           97
4     7     8           96
5     7     9           95
Output goal:
# 1.14 99
# 3.5  NA
# 7.2  96 <- return the smaller interval (7 to 8 is smaller than 7 to 9)
# 19   NA <- doesn't exist, so return NA
A simple sapply would do the trick. Identifying the matching row is relatively easy. The if statement that keeps only the smaller interval when several rows match is a bit harder to follow, but essentially: if there is more than one possibility, I keep the row whose interval width equals the smallest width among the matches.
foo <- function(i) {
  # rows whose interval strictly contains i
  res <- data[data$lower < i & data$upper > i, ]
  # if several rows match, keep only the narrowest interval(s)
  if (nrow(res) > 1) {
    res <- res[which(res$upper - res$lower == min(res$upper - res$lower)), ]
  }
  if (nrow(res) == 0) return(NA)
  return(res$importantval)
}
results <- data.frame(vals, sapply(vals, foo))
This assumes that no two matching intervals have the same length. If that can happen, you could replace the final return with return(min(res$importantval)) to get only the smaller value.
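For instance, a minimal tie-safe sketch of that variant (foo_min is a hypothetical name; data and vals as above), which always returns a single value:

foo_min <- function(i) {
  res <- data[data$lower < i & data$upper > i, ]
  if (nrow(res) == 0) return(NA)
  # keep the narrowest interval(s), then break any remaining tie with min()
  res <- res[res$upper - res$lower == min(res$upper - res$lower), ]
  min(res$importantval)
}
results <- data.frame(vals, sapply(vals, foo_min))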
If you would rather keep both values, collect the results in a list:
results <- lapply(vals, foo)
names(results) <- vals
I have a dataframe as shown below.
Data for replication:
x <- data.frame(cluster=c(1,2,3,4,5),
                groups=c('20000127 20000128',
                         '20000127 20000128 20000134',
                         '20000129 20000130 20000131 20000132',
                         '20000133 20000134 20000135 20000136',
                         '20000128 20000133 20000134 20000135 20000136'),
                chr=c(17,26,35,35,44), stringsAsFactors=FALSE)
I'm trying to come up with a way to analyze the 'groups' column for groups that share elements, and remove the rows with the higher character counts.
For example, element 20000128 is present in rows 1, 2 and 5. Since row 1 has the lowest character count, I want to remove rows 2 and 5. I appreciate any help!
Ideally the end result should contain only clusters 1, 3 and 4 (the clusters with the lowest character counts), and each element should appear only once.
Exploring this problem has been fun. I've learned that it is a variation of the set cover problem, which is NP-complete.
It would help to understand the scope of your problem. If we are talking about tens of clusters, we can use brute force; if it's thousands of clusters, we will have to use an approximation.
I have learned there is an R implementation of the greedy algorithm in the RcppGreedySetCover package.
First we need to convert the data to two-column long form. We can use the tidyverse.
library(tidyverse)
longx <- x %>%
  mutate(splitgroups = strsplit(as.character(groups), " ")) %>%
  unnest(splitgroups) %>%
  select(cluster, splitgroups)
Then we can use greedySetCover to approximate the smallest set that covers all elements.
library(RcppGreedySetCover)
greedySetCover(longx)
#100% covered by 3 sets.
# cluster splitgroups
# 1: 2 20000127
# 2: 3 20000129
# 3: 3 20000130
# 4: 3 20000131
# 5: 3 20000132
# 6: 5 20000128
# 7: 5 20000133
# 8: 5 20000134
# 9: 5 20000135
#10: 5 20000136
This suggests that the set of clusters 2, 3 and 5 covers everything. But this does not fully answer your question because, as you know, there is a set of clusters with a lower total character count.
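A quick sanity check (a sketch, assuming longx from above) confirms that clusters 2, 3 and 5 do cover every element:

covered <- unique(longx$splitgroups[longx$cluster %in% c(2, 3, 5)])
setdiff(unique(longx$splitgroups), covered)
# character(0), i.e. nothing is left uncovered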
However, what we have learned is that the minimum set size is 3 clusters. Now we can test all combinations of 3 clusters.
set.size <- length(unique(greedySetCover(longx)$cluster))
binary.matrix <- table(longx)
combinations <- combn(unique(x$cluster), set.size)
# score each combination: NA if it does not cover all elements,
# otherwise its total number of elements (the character-count proxy)
total.lengths <- apply(combinations, 2, function(x) {
  if (sum(as.logical(colSums(binary.matrix[x, ]))) == ncol(binary.matrix)) {
    sum(rowSums(binary.matrix[x, ]))
  } else {
    NA
  }
})
min.length <- min(total.lengths, na.rm = TRUE)
min.set <- combinations[, which(total.lengths == min.length)]
x[min.set, ]
# cluster groups chr
#1 1 20000127 20000128 17
#3 3 20000129 20000130 20000131 20000132 35
#4 4 20000133 20000134 20000135 20000136 35
I had to use a while loop, maybe there's a less loopy solution...
foo <- function(x) {
  i <- 1
  while(i < nrow(x)) {
    grps <- strsplit(x$groups, " ")
    keep <- unlist(lapply(grps, function(x)
      identical(x, grps[[i]]) |
        !any((length(x) > length(grps[[i]]) & duplicated(c(grps[[i]], x))))))
    x <- x[keep, ]
    i <- i + 1
  }
  x
}
foo(x)
cluster groups chr
1 1 20000127 20000128 17
3 3 20000129 20000130 20000131 20000132 35
4 4 20000133 20000134 20000135 20000136 35
Explanation.
# I created a function to keep things compact and allow it to be used for other datasets.
# The `x` is the argument, assumed to be your data frame.
# 1: foo <- function(x) {
# Start the ball rolling with a counter to use in the while loop.
# 2: i <- 1
# This starts the while loop and will continue until "i" reaches the end of the data.
# But note later that the data may change if there are rows that meet your condition.
# 3: while(i < nrow(x)) {
# Split the groups variable at the " " and store in "grps"
# 4: grps <- strsplit(x$groups, " ")
# This next line does the work.
# It creates a vector of logical indices which are used to remove rows of "x"
# I split this into many lines to explain better.
# 5: keep <- unlist(lapply(grps, function(x) # apply a function to "grps"
# identical(x, grps[[i]]) | # Returns TRUE for each row we are checking
# !any( # Negate the next conditions. They will return rows to remove.
# (length(x) > length(grps[[i]]) & # TRUE (negated=FALSE) if this group is longer than group i
# duplicated(c(grps[[i]], x)))))) # TRUE (negated=FALSE) if it duplicates elements of group i
# Update "x" by keeping only the rows that meet the criteria defined in step 5.
# 6: x <- x[keep,]
# Increase i
# 7: i <- i+1
# 8: } # This ends the while loop
# 9: x # Return the result
# 10: } # End of function
I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
ODD EVEN
1 1 10
2 1 8
3 1 2
4 3 2
5 3 6
6 3 4
7 5 2
8 7 6
9 7 8
10 9 4
11 9 8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain of linked number pairs, with two added conditions: 1) links can only be formed through a shared number, and 2) the final chain must increase numerically.
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain tie for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of a graph data structure, using igraph.
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make a graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This allows building directed edges from odd numbers to even numbers and back; with a graph, directed paths can be constructed fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(
  dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")),
                       setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(
  dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")),
                       setNames(rev(dfTest), c("X1", "X2"))), X1))
Here are the first three elements of all_simple_paths(GPairs, 1) (starting at 1):
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to a list of numeric vectors, 2) keep from each vector only the leading elements that are left->right increasing, and 3) return the maximum-length left->right increasing chain(s).
max_chain_only_increasing <- function(gpath) {
  # convert to list of numeric vectors
  list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v))))
  # subset each vector to only the elements that are left->right increasing
  only_increasing <- lapply(list_vec, function(v)
    v[1:min(which(v >= dplyr::lead(v, default = tail(v, 1))))])
  # return the chain(s) of maximum length
  return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))]))
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now I'll output (the head of) the max chains starting from each unique element in dfPairs, your original data:
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
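If you only want the overall longest chain(s) rather than one per starting value, a short filter works (a sketch, assuming max_chains simplified to a list of numeric vectors as printed above):

lens <- lengths(max_chains)
max_chains[lens == max(lens)]
# $`1`
# [1] 1 2 3 6 7 8 9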
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence, I realized I could turn the right-to-left chain links into left-to-right links by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
I realized that if I split the dataframe output by the unique values in X1, I could join the resulting dataframes sequentially, matching the last column of one dataframe to the first column of the next, which builds up rows of sequentially increasing chains. The only problem left to resolve was NAs in the last column of the merged dataframe, so I ended up splitting the joined dataframe after each merge, shifting the split-off rows to remove the NAs, and rbinding the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1, dtf2) {
  abcd <- full_join(dtf1, dtf2,
                    setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
  abcd[is.na(abcd)] <- 0
  colnames(abcd)[ncol(abcd)] <- "end"
  # print(abcd)
  abcd_na <- filter(abcd, end == 0)
  # print(abcd_na)
  abcd <- filter(abcd, end != 0)
  abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
  # print(abcd_na)
  names(abcd_na) <- names(abcd)
  abcd <- rbind(abcd, abcd_na)
  z <- length(colnames(abcd))
  colnames(abcd) <- c(paste0("X", 1:z))
  # print(abcd)
  return(abcd)
}
Finally, I found that a lot of columns contained only zeros, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it works on anything, including some rather huge and complicated data; it produces ~241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on Stack Overflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function(invec, movecommand) {
  movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]], ",|\\s+"),
                        function(x) x[x != ""])
  movelist <- lapply(movecommand, function(x) {
    Where <- x[which(x %in% c("before", "after", "first", "last")):length(x)]
    ToMove <- setdiff(x, Where)
    list(ToMove, Where)
  })
  myVec <- invec
  for (i in seq_along(movelist)) {
    temp <- setdiff(myVec, movelist[[i]][[1]])
    A <- movelist[[i]][[2]][1]
    if (A %in% c("before", "after")) {
      ba <- movelist[[i]][[2]][2]
      if (A == "before") {
        after <- match(ba, temp) - 1
      } else if (A == "after") {
        after <- match(ba, temp)
      }
    } else if (A == "first") {
      after <- 0
    } else if (A == "last") {
      after <- length(myVec)
    }
    myVec <- append(temp, values = movelist[[i]][[1]], after = after)
  }
  myVec
}
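For reference, a quick usage sketch of moveme on a plain character vector (mtcars is used purely for illustration):

moveme(names(mtcars), "am first")
# [1] "am"   "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "gear" "carb"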
Surely there has to be a function out there in some package for this?
I've searched and I've found this function to calculate the mode:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
But I'd like a function that lets me easily calculate the 2nd/3rd/4th/nth most common value in a column of data.
Ultimately I will apply this function to a large number of dplyr::group_by()s.
Thank you for your help!
Maybe you could try
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])
This gives the unique values sorted by decreasing frequency: the first is the mode, the 2nd is the second most common, and so on.
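A thin wrapper then gives the nth most common value directly (nth_mode is a hypothetical name, assuming f() as defined above):

nth_mode <- function(x, n = 1) f(x)[n]
nth_mode(c(1, 1, 2, 2, 2, 3), 2)
# [1] 1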
Another method is based on table():
g <- function (x) as.numeric(names(sort(table(x), decreasing = TRUE)))
But this is not recommended, as the input vector x is coerced to a factor first, which is very slow for a large vector. Also, on exit, we have to extract the character names of the table and coerce them back to numeric.
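To see the gap yourself, a rough benchmark sketch (exact timings will vary by machine):

x_big <- sample(1e4, 1e6, replace = TRUE)
system.time(f(x_big))  # rle/sort approach
system.time(g(x_big))  # table approach; noticeably slower due to factor coercion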
Example
set.seed(0); x <- rpois(100, 10)
f(x)
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
Let's compare with the contingency table from table:
tab <- sort(table(x), decreasing = TRUE)
# 11 12 7 9 8 13 10 14 5 15 6 2 3 16
# 14 14 11 11 10 10 9 7 5 4 2 1 1 1
as.numeric(names(tab))
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
So the results are the same.
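And since you mention dplyr::group_by(), here is a sketch of applying f() per group (mtcars is used purely for illustration):

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(second_most_common_gear = f(gear)[2])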
Here is an R function that I made (inspired by several other SO posts), which may work for your goal (and I use a local dataset on religious affiliation to illustrate it):
It's simple; only R base functions are involved: length, match, sort, tabulate, table, unique, which, as.character.
Find_Nth_Mode = function(d, N = 2) {
  # helper: find the N-th largest value of x
  maxN = function(x, N) {
    len = length(x)
    if (N > len) {
      warning('N greater than length(x). Setting N=length(x)')
      N = length(x)
    }
    sort(x, partial = len - N + 1)[len - N + 1]
  }
  ux = unique(as.character(d))   # unique values
  a1 = tabulate(match(d, ux))    # frequency of each unique value
  a2 = maxN(a1, N)               # N-th largest frequency
  a3 = which(a1 == a2)           # which value(s) have that frequency
  ux[a3]
}
Sample Output
> table(religion_data$relig11)
0.None 1.Protestant_Conservative 2.Protestant_Liberal 3.Catholic
34486 6134 19678 36880
4.Orthodox 5.Islam_Sunni 6.Islam_Shia 7.Hindu
20702 28170 668 4653
8.Buddhism 9.Jewish 10.Other
9983 381 6851
> Find_Nth_Mode(religion_data$relig11, 1)
[1] "3.Catholic"
> Find_Nth_Mode(religion_data$relig11, 2)
[1] "0.None"
> Find_Nth_Mode(religion_data$relig11, 3)
[1] "5.Islam_Sunni"
Reference:
I want to express my gratitude to these posts, from which I get the two functions and integrate them into one:
the function to find the Nth largest value: Fastest way to find second (third...) highest/lowest value in vector or column
how to find the second largest mode value?
I have a csv file containing 4 columns of data. I need to select the first column from the csv file, which I do like this:
file1 <- read.csv("file1.csv", header=TRUE)
x <- file1[, 1]
The first column (x, here) contains row numbers.
x
5
10
54
177
178
182
183
184
185
203
204
205
206
207
208
Now there is another csv file which contains a single column of 365 rows of data
y<-read.csv("data.csv",header=TRUE)
y
0
2.3
0.5
21
0
.
.
.
9.5 #total 365 numbers
This is what I intend to do:
1) From x, choose the first number (which is 5).
2) In y, select the corresponding 5th data point (which is 0) and the 4 data points prior to it (21, 0.5, 2.3, 0), then test the following conditions in turn:
Condition 1: of the 5 data points, if three out of five are > 0, then print 5 (the result of step 1).
Condition 2: if four of the five are > 0, then print 5 again.
Condition 3: if all five are > 0, then print 5 again.
However, if only the first two conditions are met and the third is not, select the second number from x (10 in this case), again choose the corresponding 10th data point in y and the four data points prior to it (the 6th, 7th, 8th and 9th), and evaluate only the third condition (the first and second conditions were already met by the previous number from x); if all five of those numbers are > 0, save 10 and stop.
This sounds quite complicated for a feeble mind like mine (as seen by my reputation), and I was hoping someone could tell me how to do this in R. Thanks a lot!
Sounds like you need a while loop.
file1 <- data.frame(x=seq(5, 205, by=5))
file1
x <- file1[, 1]
set.seed(123)
file2 <- data.frame(y=rnorm(365))
y <- file2[, 1]
# flags for each condition
cond1 <- FALSE
cond2 <- FALSE
cond3 <- FALSE
k <- 0
while (!cond3) {
  k <- k + 1
  # select the k-th number from x
  num <- x[k]
  # select the five y values up to and including that data point
  all.y <- y[(num - 4):num]
  # number of positive values
  chk.pos <- length(which(all.y > 0))
  # condition 1: check if 3 of 5 are positive
  if (!cond1 & chk.pos >= 3) {
    cond1 <- TRUE
    print(num)
  }
  # condition 2: check if 4 of 5 are positive
  if (!cond2 & chk.pos >= 4) {
    cond2 <- TRUE
    print(num)
  }
  # condition 3: check if 5 of 5 are positive
  if (!cond3 & chk.pos == 5) {
    cond3 <- TRUE
    print(num)
  }
}
For me this returns:
[1] 5
[1] 15
[1] 70
I need to test the value of 'peso' (see replication code below) for each factor. If a factor reaches 50% of the overall sum of 'peso', the values of that factor should be pasted into a new object, 'results'; otherwise, R should evaluate which factor has the lowest aggregated value for 'peso', replace it with the factor in the next column, and aggregate 'peso' again. Basically, this process replaces the lowest-scoring factor with the next factor, and it should repeat until some factor crosses the 50% threshold. So my question is: where do I start?
set.seed(51)
Data <- sapply(1:100, function(x) sample(1:10, size=5))
Data <- data.frame(t(Data))
names(Data) <- letters[1:5]
Data$peso <- sample(0:3.5, 100, rep=TRUE)
It should be like this.
If your first four rows are:
 a b c  d e peso
 8 2 3  7 9    1
 8 3 4  5 7    3
 9 7 4 10 1    2
10 3 4  5 7    3
What would you like for the total?
Totals_08 = 4
Totals_09 = 2
Totals_10 = 3
etc?
So factor 8 got the greatest share, 4/(4+2+3) = 0.4444444, but did not reach the 50% threshold in this round. Therefore I need something more: repeat the aggregation, but now considering factor 7 in column 'b' instead of factor 9 in column 'a', since factor 9 got the lowest aggregated value in the first round.
It's unclear if you have your list of factors already or not. If you do not have it, and are taking it from the data set, you can grab it in a few different ways:
# Get a list of all the factors
myFactors <- levels(Data[[1]]) # If actual factors.
myFactors <- sort(unique(unlist(Data))) # Otherwise use similar to this line
Then to calculate the Totals per factor, you can do the following
Totals <-
  colSums(sapply(myFactors, function(fctr)
    # calculate totals per fctr
    as.integer(Data$peso) * rowSums(fctr == subset(Data, select = -peso))
  ))
names(Totals) <- myFactors
Which gives
Totals
#   1   2   3   4   5   6   7   8   9  10
# 132 153 142 122 103 135 118 144 148 128
Next:
I'm not sure whether, afterwards, you want to compare against the sum of peso or against the sum of the Totals. Here are both options, broken down into steps:
# Calculate the total of all the Totals:
TotalSum <- sum(Totals)
# See percentage for each:
Totals / TotalSum
Totals / sum(as.integer(Data$peso))
# See which, if any, is greater than 50%
Totals / TotalSum > 0.50
Totals / sum(as.integer(Data$peso)) > 0.50
# Using Which to identify the ones you are looking for
which(Totals / TotalSum > 0.50)
which(Totals / sum(as.integer(Data$peso)) > 0.50)
Note on your sampling for peso
You took a sample of 0:3.5; however, the x:y sequence only gives integers (0:3.5 evaluates to 0 1 2 3, so peso never takes fractional values).
If you want fractions, you can either use seq() or take a larger integer sequence and then divide appropriately:
option1 <- (0:7) / 2
option2 <- seq(from=0, to=3.5, by=0.5)
If you want whole integers from 0:3 and also the value 3.5, then use c()
option3 <- c(0:3, 3.5)