Subset dataframe with equal difference for one column in R

I am trying to iterate over the rows of a dataframe (data) to check whether one of the columns (data$ID) has a similar difference (e.g., 3) between consecutive elements. If yes, keep the row; otherwise remove it. The tricky part is that I need to re-compare consecutive elements after a row is removed.
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
data
ID score
1 3.1 70
2 6 80
3 6.9 90
4 9 65
5 10.5 43
6 12 78
7 14.2 44
8 15 92
for (i in (length(data$ID)-1)) {
first <- data$ID[i]
second <- data$ID[i+1]
if ((second-first) == 3){
data <- data[-(i+1),]
}
}
The expected output data should be
ID score
1 3.1 70
2 6 80
3 9 65
4 12 78
5 15 92
The initial rows 3, 5 and 7 are excluded because their differences do not match. But my code failed.
I also tried to use the diff function:
DF <- diff(data)
But it doesn't account for the fact that after a row is removed, the differences change. Should I use the diff function in a loop, even though the dataframe changes dynamically?

Using a recursive function (a function that calls itself)
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
# use recursive function to trim the remainder of the list
trim_ids <- function (ids) {
  # if only one element, return it
  if (length(ids) <= 1) {
    return(ids)
  }
  # if the gap between element 2 and element 1 is small enough
  if ((ids[2] - ids[1]) < 2.9 ) {
    # trim after dropping the second element
    return(trim_ids(ids[-2]))
  } else {
    # keep the first element and trim from the second element
    return(c(ids[1], trim_ids(ids[2:length(ids)] )))
  }
}
# find the ids to keep
keep_ids <- trim_ids(data$ID)
# select the matching rows
data[data$ID %in% keep_ids,]
# ID score
# 1 3.1 70
# 2 6.0 80
# 4 9.0 65
# 6 12.0 78
# 8 15.0 92
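The same trimming can also be written as a plain loop that remembers the last kept ID, which avoids deep recursion on long vectors. This is only a sketch: keep_spaced, min_gap and tol are names made up for illustration, and the small tolerance is an assumption added to guard against floating-point noise in gaps such as 6 - 3.1.

```r
data <- data.frame(ID = c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
                   score = c(70, 80, 90, 65, 43, 78, 44, 92))

# keep_spaced() is a hypothetical helper, not part of any package.
# It returns a logical vector marking the rows to keep.
keep_spaced <- function(ids, min_gap = 2.9, tol = 1e-9) {
  keep <- logical(length(ids))
  if (length(ids) == 0) return(keep)
  keep[1] <- TRUE            # the first row is always kept
  last_kept <- ids[1]
  for (i in seq_along(ids)[-1]) {
    # compare against the last KEPT id, not the previous row
    if (ids[i] - last_kept >= min_gap - tol) {
      keep[i] <- TRUE
      last_kept <- ids[i]
    }
  }
  keep
}

result <- data[keep_spaced(data$ID), ]
result
```

Because each ID is compared with the last kept one, removing a row automatically changes what the next row is compared against, which is the re-comparison the question asks for.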

An option could be achieved using cumsum and diff as:
#data
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
data[c(0, cumsum(diff(round(data$ID))) %% 3 ) == 0,]
# ID score
# 1 3.1 70
# 2 6.0 80
# 4 9.0 65
# 6 12.0 78
# 8 15.0 92

If you define the rule as keeping all rows whose ID, when rounded to 0 digits, is a multiple of 3, you could try:
df1 <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
df1[round(df1$ID) %% 3 == 0,]
ID score
1 3.1 70
2 6.0 80
4 9.0 65
6 12.0 78
8 15.0 92

Related

Ordering rows when there is a tie between several of them

I have a data.frame (corresponding to a leaderboard) like this one:
structure(list(PJ = c(4, 4, 4, 4, 4, 4), V = c(4, 2, 2, 2, 1,
1), E = c(0, 0, 0, 0, 0, 0), D = c(0, 2, 2, 2, 3, 3), GF = c(182,
91, 92, 185, 126, 119), GC = c(84, 143, 144, 115, 141, 168),
Dif = c(98, -52, -52, 70, -15, -49), Pts = c(12, 6, 6, 6,
3, 3)), class = "data.frame", row.names = c("Player1", "Player2",
"Player3", "Player4", "Player5", "Player6"))
I would like to order the rows according to the number of points Pts. This can be done by df[order(df$Pts, decreasing=T),]. The issue appears when several players are tied; in that case I want to order the tied rows according to Dif.
How can this be done?
The order function which you are already using can take multiple arguments, each used sequentially to break ties in the previous one; see ?order
So you simply have to add Dif to your existing call:
df[order(df$Pts, df$Dif, decreasing=T),]
You can add further terms to break any remaining ties, e.g. Player2 and Player3 who have identical Pts and Dif.
If you want to specify the direction (increasing or decreasing) for each argument, you can either pass the decreasing argument as a vector (which requires method = "radix"), as in @r.user.05apr's comment, or use my preferred lazy solution of adding - to any term that should be ordered in a decreasing direction:
df[order(-df$Pts, df$Dif),]
(this will order by Pts decreasing and Dif increasing; it won't work if e.g. one of the ordering columns is character)
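As a rough sketch of the vector form of decreasing mentioned above: in base R a per-column direction is only accepted together with method = "radix". The data frame below is a cut-down copy of the leaderboard (only Pts and Dif), built here just for illustration.

```r
# Simplified leaderboard: only the two ordering columns are kept
df <- data.frame(
  Pts = c(12, 6, 6, 6, 3, 3),
  Dif = c(98, -52, -52, 70, -15, -49),
  row.names = paste0("Player", 1:6)
)

# Both keys decreasing: Pts first, Dif breaks ties
both_desc <- df[order(df$Pts, df$Dif, decreasing = TRUE), ]

# Pts decreasing, Dif increasing; a vector for `decreasing`
# needs method = "radix"
mixed <- df[order(df$Pts, df$Dif,
                  decreasing = c(TRUE, FALSE), method = "radix"), ]

both_desc
mixed
```

Unlike the minus-sign trick, the radix form also works when one of the ordering columns is character.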
You can use sqldf or dplyr library
library (sqldf)
sqldf('select *
from "df"
order by "Pts" desc, "Dif" desc ')
Output
PJ V E D GF GC Dif Pts
1 4 4 0 0 182 84 98 12
2 4 2 0 2 185 115 70 6
3 4 2 0 2 91 143 -52 6
4 4 2 0 2 92 144 -52 6
5 4 1 0 3 126 141 -15 3
6 4 1 0 3 119 168 -49 3

Replicate excel solver in R

I am wondering if there is a simple function to solve the following problem in R:
Suppose I have the following dataframe:
Variable 'A' with values c(10, 35, 90)
Variable 'B' with values c(3, 4, 17, 18, 50, 40, 3)
Now I know that the sum of various values in B equal the values in A, e.g. '3 + 4 + 3 = 10' and '17 + 18 = 35', which always balances out in the complete dataset.
Question
Is there a function that can sum these values in B, through trial and error I suppose, and match the correctly summed values with A? For example, the function tries to sum 3 + 4 + 18, which is 25 and retries this because 25 is not a value in A.
I have tried several solutions myself but one problem that I often encountered was the fact that A always has less observations than B.
I would be very thankful if someone can help me out here! If more info is needed please let me know.
Cheers,
Daan
Edit
This example is with simplified numbers. In reality, it is a large dataset, so I am looking for a scalable solution.
Thanks again!
This is a problem known as the subset sum problem, and there are a ton of examples online of how to solve it using dynamic programming or greedy algorithms.
To give you an answer that just works, the package adagio has an implementation:
library(adagio)
sums = c(10, 35, 90)
values = c(3, 4, 17, 18, 50, 40, 3)
for(i in sums){
#we have to subset the values to be less than the value
#otherwise the function errors:
print(subsetsum(values[values < i], i))
}
The output for each sum is a list, with the val and the indices in the array, so you can tidy up the output depending on what you want from there.
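If every value in B has to be used exactly once, solving each sum independently can pick the same element twice. A minimal base R sketch of a backtracking search that assigns each value to one target is shown below. assign_to_targets is a hypothetical helper written for this example, it assumes non-negative whole numbers (so exact comparison is safe), and its worst case is exponential, so it is not a scalable answer for large inputs.

```r
values  <- c(3, 4, 17, 18, 50, 40, 3)
targets <- c(10, 35, 90)

# Hypothetical helper: returns, for each value, the index of the target
# it was assigned to, or NULL if no exact assignment exists.
assign_to_targets <- function(values, targets) {
  group <- integer(length(values))
  remaining <- targets
  place <- function(k) {
    # all values placed: succeed only if every target is met exactly
    if (k > length(values)) return(all(remaining == 0))
    for (j in seq_along(remaining)) {
      if (values[k] <= remaining[j]) {
        remaining[j] <<- remaining[j] - values[k]   # tentatively assign
        group[k] <<- j
        if (place(k + 1)) return(TRUE)
        remaining[j] <<- remaining[j] + values[k]   # undo and try next target
      }
    }
    FALSE
  }
  if (place(1)) group else NULL
}

group <- assign_to_targets(values, targets)
group
tapply(values, group, sum)   # per-target sums
```

For the example numbers this pairs 3, 4 and 3 with 10, 17 and 18 with 35, and 50 and 40 with 90.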
You can try the following, but I am afraid it is not scalable.
For the case of 3 summands you have
x <- expand.grid(c(3, 4, 17, 18, 50, 40, 3),#building a matrix of the possible combinations of summands
c(3, 4, 17, 18, 50, 40, 3),
c(3, 4, 17, 18, 50, 40, 3))
x$sums <-rowSums(x) #new column with possible sums
idx<- x$sums%in%c(10, 35, 90) #checking the sums are in the required total
x[idx,]
Var1 Var2 Var3 sums
2 4 3 3 10
8 3 4 3 10
14 3 4 3 10
44 4 3 3 10
50 3 3 4 10
56 3 3 4 10
92 3 3 4 10
98 3 3 4 10
296 4 3 3 10
302 3 4 3 10
308 3 4 3 10
338 4 3 3 10
For the case of 2 summands
x <- expand.grid(c(3, 4, 17, 18, 50, 40, 3),
c(3, 4, 17, 18, 50, 40,3))
x$sums <-rowSums(x)
idx<- x$sums%in%c(10, 35, 90)
#Results
x[idx,]
Var1 Var2 sums
18 18 17 35
24 17 18 35
34 40 50 90
40 50 40 90

R function finding dataframe rows with specific attributes

I currently have a large dataset for which I want to find total time spent at altitude and range of temperatures experienced.
An example dataset is provided:
time<-c(1,2,3,4,5,6,7,8,9,10)
height<-c(10,33,41,57,20,27,23,39,40,42)
temp<-c(37,33,14,12,35,34,32,28,26,24)
practicedf<-data.frame(time,height,temp)
I want to calculate the total time spent above 30 m (height) and the range of temperatures experienced at these altitudes. However, in my actual dataset the sampling frequency has resulted in a series of datapoints that skip over 30 m (i.e. going from 28.001 to 32.02 and never actually stopping at 30). Therefore I wanted to write code that records all of the dataframe rows that are below 30 m, and also each place where the gap between consecutive such row indices is greater than one (to account for times when the data goes above 30 m and then returns below 30 m, i.e. 27.24, 32.7, 45.002, 28.54), so I know to discount all points above the altitude I am targeting.
I've created the following function to carry this portion of my analysis out (pinpointing dataframe rows below 30 m).
pracfunction<-function(h){
res<-as.vector(lapply(h,function(x) if (x<=30) {1} else {0}))
res1<-as.vector(which(res == 1))
res_new<-list()
for (item in 1:length(res1)){
ifelse((res1[i+1]-res1[i]>1), append(res_new,i),
append(res_new,"na"))
}
print(which(res_new != "na"))
}
I want the output to look like:
[1] 1 5 6 7
Since in the vector height, indices 1, 5, 6, and 7 have values less than 30.
However each time I run it with height as the input I receive integer(0) as the output. I'm pretty new at writing loops and functions so if anyone could provide input into what I'm doing wrong, or has a better way to approach this problem it would be greatly appreciated! Thank you.
I'd use dplyr to create a new column low indicating whether height < 30.
library(dplyr)
practicedf <- practicedf %>%
mutate(low = ifelse(height < 30, 1, 0))
time height temp low
1 1 10 37 1
2 2 33 33 0
3 3 41 14 0
4 4 57 12 0
5 5 20 35 1
6 6 27 34 1
7 7 23 32 1
8 8 39 28 0
9 9 40 26 0
10 10 42 24 0
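For the original goal, a base R sketch without any loop might look like the following. It assumes each row stands for one time unit, so the number of rows above 30 m is used as the time spent there; the variable names below_idx, n_above and temp_range_above are made up for this example.

```r
time   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
height <- c(10, 33, 41, 57, 20, 27, 23, 39, 40, 42)
temp   <- c(37, 33, 14, 12, 35, 34, 32, 28, 26, 24)
practicedf <- data.frame(time, height, temp)

# row indices at or below 30 m (what pracfunction was meant to print)
below_idx <- which(practicedf$height <= 30)

# rows above 30 m, the number of such samples, and their temperature range
above <- practicedf$height > 30
n_above <- sum(above)
temp_range_above <- range(practicedf$temp[above])

below_idx          # 1 5 6 7
n_above            # 6
temp_range_above   # 12 33
```

If the sampling interval is not constant, summing diff(time) over the rows above 30 m would be a closer estimate than counting rows.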
Not sure whether I understand your intentions correctly but here is what I think you might be looking for. Start with an extended sample data.frame:
pd <- structure(list(time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), height = c(10, 33, 41, 57, 20,
27, 23, 39, 40, 42, 10, 33, 28, 17, 20, 27, 23, 39, 40, 42),
temp = c(37, 33, 14, 12, 35, 34, 32, 28, 26, 24, 37, 33,
14, 12, 35, 34, 32, 28, 26, 24)), .Names = c("time", "height",
"temp"), row.names = c(NA, -20L), class = "data.frame")
This function then returns the indices at which the height crosses the 30 m line, in either direction. I guess that's not exactly what you want, but you can take it from here.
pf <- function( x ) # x is the data.frame
{
  res <- ifelse( x[ , "height" ] <= 30, 1 , 0 ) # simplified version of your attempt
  n <- NULL # initiate the index vector
  for( i in 1:( length( res ) - 1 ) ) # to accommodate room for comparison
  {
    if( res[ i + 1 ] != res[ i ] ) # registers change between 0 and 1
      n <- append( n, i + 1 ) # and writes it into the index vector
  }
  return( n )
}
With this, the call
pf( pd )
returns
[1] 2 5 8 11 12 13 18
indicating the positions on the height vector after the height limit of 30m was crossed, in either direction.
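The loop in pf can probably be replaced by diff, since a crossing is just a change in the below/above indicator between neighbouring rows. A short sketch using only the height column from pd:

```r
height <- c(10, 33, 41, 57, 20, 27, 23, 39, 40, 42,
            10, 33, 28, 17, 20, 27, 23, 39, 40, 42)

# 1 when at or below 30 m, 0 otherwise
below <- as.integer(height <= 30)

# diff() is non-zero exactly where the indicator flips;
# + 1 gives the index of the first row after the crossing
crossings <- which(diff(below) != 0) + 1
crossings
# [1]  2  5  8 11 12 13 18
```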

Data analysis in R: how to describe the distribution of NA positions in a vector?

I would like the positions of NA to be uniformly distributed in the vector (length = 30, fewer than 6 NA).
This one has length 30 and 4 NA. It's easy to see that these NA are not uniformly spread; they are mainly at the left.
vector_x <- c(NA,3, NA, 1, NA, 5, 6, 7, 7, 9, 0, 2, 12, 324, 54,23, 12, 324, 122, 23, 324, 332, 45, 78, 32, 12, 342, 95, 67, NA)
But I have no idea which kind of statistic or test to use to describe this, so that I can screen vectors quantitatively with a cutoff.
Now, I have two preliminary thoughts.
To simplify the problem, treat all NA as 0 and all numbers as 1, and look at the distribution.
Or take the indices of the NA and do a variance analysis on c(1, 3, 5, 30).
Thanks for any suggestions!
You want to perform a Mann-Whitney U test, also called the Wilcoxon rank-sum test (a name that is more descriptive of what it's doing).
This is easy to do with your data:
which(is.na(vector_x))
# [1] 1 3 5 30
which(!is.na(vector_x))
# [1] 2 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
wilcox.test(which(is.na(vector_x)), which(!is.na(vector_x)))
# Wilcoxon rank sum test
# data: which(is.na(vector_x)) and which(!is.na(vector_x))
# W = 29, p-value = 0.1766
# alternative hypothesis: true location shift is not equal to 0
Check that wilcox.test works the way we expect with
wilcox.test(1:5, 6:10) # low p value
wilcox.test(seq(1,10,2), seq(2,10,2)) # high p value
If we need the indices of the NA elements, use is.na to get a logical vector and then which to return the numeric indices where it is TRUE:
which(is.na(vector_x))
#[1] 1 3 5 30
Or convert to a binary vector where 0 represents NA and 1 any other value:
as.integer(!is.na(vector_x))

R Find best set of all possible combinations that includes all values

After quite some googling I hope somebody can help me with a problem that appears quite simple to me, but is maybe more complicated than I thought:
I have a data.frame with three columns. The first two reflecting all possible combinations of five variables (1-5), the last the "strength" of the combination. I look for the five rows, which include all values of Var1 and Var2 (so values 1-5) and have the highest sum in the strength column. In the example beneath, it is the five rows with a strength of 1000, as they have the highest sum and all five values (1-5) are given in the first two columns.
How do I best approach this problem? Is there a package that implements this task? I have now found the constrOptim() function; can I do it with that?
Code to create an example dataframe:
a <-cbind(expand.grid(seq(1,5,1),seq(1,5,1)),
strength = c(-11, 61, 230, 118, 156, 98, 169, 306, 6, -54,
207, -32, 27, 128, 101, 19, -18, 32, 153, 14,
63, 136, 165, 73, 35))
a <- a[order(a$strength, decreasing=T),]
Starting dataset:
Var1 Var2 strength
3 2 306
3 1 230
1 3 207
2 2 169
3 5 165
5 1 156
4 4 153
2 5 136
4 3 128
4 1 118
5 3 101
1 2 98
4 5 73
1 5 63
2 1 61
5 5 35
3 4 32
3 3 27
1 4 19
5 4 14
4 2 6
1 1 -11
2 4 -18
2 3 -32
5 2 -54
Not desired outcome:
Var1 Var2 strength
3 2 306
3 1 230
1 3 207
2 2 169
3 5 165
Desired outcome:
Var1 Var2 strength
3 2 306
1 3 207
5 1 156
4 4 153
2 5 136
I am not sure the presented solution is the most efficient one, but I feel that we must go over the entire dataset to find the unique pairs (for example, change the value of (Var1 = 2, Var2 = 5, strength = 136) to (Var1 = 2, Var2 = 5, strength = 1)). To find the unique pairs I use the apply function. First let's recreate the input:
a <-cbind(expand.grid(seq(1,5,1),seq(1,5,1)),
strength = c(-11, 61, 230, 118, 156, 98, 169, 306, 6, -54,
207, -32, 27, 128, 101, 19, -18, 32, 153, 14,
63, 136, 165, 73, 35))
a <- a[order(a$strength, decreasing=T),]
Now I prepare an empty matrix in which I will keep Var1 in the first column, Var2 in the second and strength in the third column.
V <- matrix(nrow = 5, ncol = 3)
Next I write a function that takes one row from the ordered dataset a, checks whether its Var1 and Var2 values are both still unused and, if so, stores the row in V.
mf <- function(x){
if( !(x[1] %in% V[,1]) & !(x[2] %in% V[,2])) {
i <- x[1]
V[i,1] <<- x[1]
V[i,2] <<- x[2]
V[i,3] <<- x[3]
}
}
Now I apply the function on each row of a:
apply(a, 1, mf)
The needed values are stored in the matrix V:
V
[,1] [,2] [,3]
[1,] 1 3 207
[2,] 2 5 136
[3,] 3 2 306
[4,] 4 4 153
[5,] 5 1 156
Sometimes, though, going over the full dataset is not necessary (as in the example given), and we would like to break out of the loop once the unique pairs have been found. For that we can use a for loop. Here is the code:
a <-cbind(expand.grid(seq(1,5,1),seq(1,5,1)),
strength = c(-11, 61, 230, 118, 156, 98, 169, 306, 6, -54,
207, -32, 27, 128, 101, 19, -18, 32, 153, 14,
63, 136, 165, 73, 35))
a <- a[order(a$strength, decreasing=T),]
V <- matrix(nrow=5,ncol=3)
for (i in 1:nrow(a)) {
  if( sum(is.na(V[,1])) == 0)
    break
  if( !(a[i,1] %in% V[,1]) & !(a[i,2] %in% V[,2])) {
    j <- a[i,1]
    V[j,1] <- a[i,1]
    V[j,2] <- a[i,2]
    V[j,3] <- a[i,3]
  }
}
Hope this helps, or at least will lead to improvements.
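Picking the strongest free pair first is a greedy strategy, and in general a greedy pick is not guaranteed to give the highest total. With only five values it is cheap to check all 5! = 120 one-to-one pairings by brute force. The sketch below does that in base R; perms is a hypothetical helper written for this example, and the approach only makes sense for small n because the number of permutations grows factorially.

```r
a <- cbind(expand.grid(seq(1, 5, 1), seq(1, 5, 1)),
           strength = c(-11, 61, 230, 118, 156, 98, 169, 306, 6, -54,
                        207, -32, 27, 128, 101, 19, -18, 32, 153, 14,
                        63, 136, 165, 73, 35))

# strength as a matrix: rows are Var1, columns are Var2
m <- matrix(NA_real_, nrow = 5, ncol = 5)
m[cbind(a$Var1, a$Var2)] <- a$strength

# hypothetical helper: all permutations of v, one per row
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) cbind(v[i], perms(v[-i]))))
}

p <- perms(1:5)   # p[k, i] is the Var2 paired with Var1 = i in pairing k
totals <- apply(p, 1, function(cols) sum(m[cbind(1:5, cols)]))

best <- p[which.max(totals), ]
data.frame(Var1 = 1:5, Var2 = best, strength = m[cbind(1:5, best)])
max(totals)   # 958, the same total as the greedy result above
```

For this dataset the exhaustive search agrees with the greedy selection, so the matrix V above is indeed the best pairing here.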
Consider a series of aggregations and merges between the Var1 and Var2 columns:
# MERGE MAX AGGREGATES WHERE Var COL ARE EQUAL AND NOT EQUAL
mergedf1 <- merge(aggregate(strength ~ Var1, data=a[a$Var1==a$Var2,], FUN=max),
a, by=c("Var1", "strength"))
mergedf2 <- merge(aggregate(strength ~ Var1, data=a[a$Var1!=a$Var2,], FUN=max),
a, by=c("Var1", "strength"))
# STACK RESULTS
mergedf <- rbind(mergedf1, mergedf2)
# FINAL MAX AGGREGATION AND MERGE
final <- merge(aggregate(strength ~ Var2, data=mergedf, FUN=max),
mergedf, by=c("Var2", "strength"))
final <- final[,c("Var1", "Var2", "strength")] # SORT COLUMNS
final <- final[with(final, order(-strength)),] # SORT ROWS
# REMOVE TEMP OBJECTS
rm(mergedf1, mergedf2, mergedf)

Resources