subset rows + context - r

I haven't been able to figure out an easy way to include some context ( n adjacent rows ) around the rows I want to select.
I am more or less trying to mirror the -C option of grep to select some rows of a data.frame.
Ex:
a= data.frame(seq(1:100))
b = c(50, 60, 61)
Let's say I want a context of 2 lines around the rows indexed in b; the desired output should be the data frame subset of a with the rows 48,49,50,51,52,58,59,60,61,62,63

You can do something like this, but there may be a more elegant way to compute the indices :
a= data.frame(seq(1:100))
b = c(50, 60, 61)
context <- 2
indices <- as.vector(sapply(b, function(v) {return ((v-context):(v+context))}))
a[indices,]
Which gives :
[1] 48 49 50 51 52 58 59 60 61 62 59 60 61 62 63
EDIT : As @flodel points out, if the context windows may overlap (as they do here for 60 and 61) you must add the following line to drop the repeated rows :
indices <- sort(unique(indices))
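If this comes up often, the index computation can be wrapped in a small helper. The sketch below is only one possible way to do it: context_rows is a hypothetical name, and it additionally assumes that indices falling outside 1..nrow(a) (for example when a hit sits in the first or last rows) should simply be dropped. It also uses drop = FALSE so that a one-column data frame stays a data frame.

```r
# Hypothetical helper: row indices of `hits` plus `context` rows on each side,
# de-duplicated, sorted and clamped to the valid range 1..n.
context_rows <- function(hits, context, n) {
  idx <- unlist(lapply(hits, function(v) (v - context):(v + context)))
  idx <- sort(unique(idx))
  idx[idx >= 1 & idx <= n]
}

a <- data.frame(x = 1:100)
b <- c(50, 60, 61)

idx <- context_rows(b, context = 2, n = nrow(a))
res <- a[idx, , drop = FALSE]  # rows 48-52 and 58-63, still a data frame
```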

Related

drawing a value from a vector r

After removing the values from the vector from 1 to 100 I have the following vector:
w
[1] 2 5 13 23 24 39 41 47 48 51 52 58 61 62 70 71 72 90
I am now trying to draw values from this vector with the sample function
for(x in roznica)
{
  if(licznik_2 != licznik_1 )
  {
    roznica_proces_2 <- sample(1:w, roznica)
  } else {
    roznica_proces_2 <- NA
  }
}
I tried various combinations with the sample
If w is the name of the vector, then you would NOT use sample(1:w, ...). For one thing, 1:w doesn't really make sense, since the : operator expects its second argument to be a single number, while w holds 18 values. Depending on what roznica is (hopefully a single integer), you might use:
sample(w, roznica) # returns roznica values drawn from `w` without replacement, in random order
The other problem is that you are currently overwriting any values from prior iterations of the for loop. So you might want to use:
roznica_proces_2[[x]] <- sample(w, roznica)
You would of course need to have initialized roznica_proces_2 before the loop, perhaps with:
roznica_proces_2 <- list()
Regarding your query in the comment :
I am only concerned with the sample function itself. Here is an example: w is [1] 31, and now I want to draw 1 number from it (which should be 31) with proces_nr_2 <- sample(w, 1). What do I get? proces_nr_2 is [1] 26
That happens because, when the vector has length 1, sample draws from 1 to that number. It is explained in the help page, ?sample:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x
So if you have only 1 number to sample just return that number directly instead of passing it in sample.
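If the length of w is not known in advance, one way to avoid the special case is to sample positions rather than values. The following is a minimal sketch; the name resample is only illustrative, and it assumes sampling without replacement as in the question.

```r
# Sample positions 1..length(x) and index into x, so a length-1 vector
# is never reinterpreted as the range 1:x.
resample <- function(x, size) x[sample.int(length(x), size)]

w <- 31
proces_nr_2 <- resample(w, 1)  # always 31, because w has a single element

w_long <- c(2, 5, 13, 23, 24)
draws <- resample(w_long, 2)   # two distinct values taken from w_long
```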

ifelse statement return value from same df row

I'm not sure why this is so difficult, but I simply want to return the data in rowX,colB when rowX,colA is > 70.
df
colA colB
80 6
75 7
60 5
66 4
vector <- ifelse(df$colA > 70, paste0('above 70', df$colB), 'below 70')
The result is a vector with 'below 70' if below 70 and NA if above 70, but why? Even if the ifelse statement doesn't understand the row for df$colB it should never return NA as paste0('above 70', df$colB) is actually a dataframe. Is the ifelse statement blind to the row number it's evaluating outside of the first conditional statement?

cumulative sum until a certain value

I have some data from a school year that I am working with. The variables are SchoolYear, Aug, Sep, Oct, ..., May, June, where each month corresponds to the number of participants for that month. I need to sum the months until there is missing info, in this case identified by a 0. Two sample rows:
1973-74,0,0,4,2,14,26,22,8,0,99,
1974-75,0,0,4,26,10,23,10,14,0,91,
I have tried
yeardf <- within(yeardf, {
  Max_enroll <- cummax(Sep)
  Enroll_To_Date <- cumsum(Sep)
})
while putting a condition of Sep>0 on the Enroll_To_Date line, but have not been successful.
Set up your data as vectors and a matrix (the lead() function used below comes from the dplyr package):
> library(dplyr)
> row1 <- c("1973-74",0,0,4,2,14,26,22,8,0,99)
> row2 <- c("1974-75",0,0,4,26,10,23,10,14,0,91)
> df <- rbind(row1,row2)
Cumulative sums of row1 can be found like this, and it looks like you want to capture the 76 (where it hits a zero):
> (z <- cumsum(row1[2:length(row1)]))
[1] 0 0 4 6 20 46 68 76 76 175
Here's one way to get it. First find the position in the vector that has the value:
> which(duplicated(lead(cumsum(row1[2:length(row1)]))))
[1] 8
And then look up the cumulative sum at that value:
> z[which(duplicated(lead(cumsum(row1[2:length(row1)]))))]
[1] 76
So here's the calculation for your row2:
> z <- cumsum(row2[2:length(row2)])
> z[which(duplicated(lead(cumsum(row2[2:length(row2)]))))]
[1] 87
And if you want to do a lot of them, like in the matrix df, chain them together in a function and use apply over all the rows (1) of df:
> apply(df,1,function(x) cumsum(x[2:length(x)])[which(duplicated(lead(cumsum(x[2:length(x)]))))])
row1 row2
76 87
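The same result can also be sketched in base R without relying on duplicated cumulative sums. This is an illustration under two assumptions that are not stated in the question: leading zeros are months before enrollment starts and should be skipped, and the sum should stop at the first zero after that. The name sum_until_gap is hypothetical.

```r
# Sum the values from the first non-zero month up to (not including)
# the next zero; returns 0 if every month is zero.
sum_until_gap <- function(x) {
  x <- as.numeric(x)
  first <- which(x != 0)[1]
  if (is.na(first)) return(0)
  rest <- x[first:length(x)]
  gap <- which(rest == 0)[1]
  if (is.na(gap)) sum(rest) else sum(rest[seq_len(gap - 1)])
}

months <- rbind(
  "1973-74" = c(0, 0, 4, 2, 14, 26, 22, 8, 0, 99),
  "1974-75" = c(0, 0, 4, 26, 10, 23, 10, 14, 0, 91)
)

totals <- apply(months, 1, sum_until_gap)
# 1973-74 1974-75
#      76      87
```

Keeping the school year as a row name rather than as the first element means the monthly values stay numeric.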

Problems subsetting columns based on values from two separate dataframes

I am using data obtained from a spatially gridded system, for example a city divided up into equally spaced squares (e.g. 250m2 cells). Each cell possesses a unique column and row number with corresponding numerical information about the area contained within this 250m2 square (say temperature for each cell across an entire city). Within the entire gridded section (or the example city), I have various study sites and I know where they are located (i.e. which cell row and column each site is located within). I have a dataframe containing information on all cells within the city, but I want to subset this to only contain information from the cells where my study sites are located. I previously asked a question on this 'Matching information from different dataframes and filtering out redundant columns'. Here is some example code again:
###Dataframe showing cell values for my own study sites
Site <- as.data.frame(c("Site.A","Site.B","Site.C"))
Row <- as.data.frame(c(1,2,3))
Column <- as.data.frame(c(5,4,3))
df1 <- cbind(Site,Row, Column)
colnames(df1) <- c("Site","Row","Column")
###Dataframe showing information from ALL cells
eg1 <- rbind(c(1,2,3,4,5),c(5,4,3,2,1)) ##Cell rows and columns
eg2 <- as.data.frame(matrix(sample(0:50, 15*10, replace=TRUE), ncol=5)) ##Numerical information
df2 <- rbind(eg1,eg2)
rownames(df2)[1:2] <- c("Row","Column")
From this, I used the answer to the previous question, which worked perfectly for the example data.
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,])]
However, I cannot apply this to my own data and cannot figure out why.
EDIT: Initially, I thought there was a problem with naming the columns (i.e. the 'names' function). But it would appear there may be an issue with the 'output' line of code, whereby columns are being included from df2 that shouldn't be (i.e. the output contained columns from df2 which possessed column and row numbers not specified within df1).
I have also tried:
output <- df2[, (df2['Row', ] == df1$Row) & (df2['Column', ] == df1$Column)]
But when using my own (seemingly comparable) data, I don't get information from all cells specified in the 'df1' equivalent (although again works fine in the example data above). I can get my own data to work if I do each study site individually.
SiteA <- df2[, which(df2['Row', ] == 1) & (df2['Column', ] == 5)]
SiteB <- df2[, which(df2['Row', ] == 2) & (df2['Column', ] == 4)]
SiteC <- df2[, which(df2['Row', ] == 3) & (df2['Column', ] == 3)]
But I have 1000s of sites and was hoping for a more succinct way. I am sure that I have maintained the same structure, double checked spellings and variable names. Would anyone be able to shed any light on potential things which I could be doing wrong? Or failing this an alternative method?
Apologies for not providing an example code for the actual problem (I wish I could pinpoint what the specific problem is, but until then the original example is the best I can do)! Thank you.
The only apparent issue I can see is that the mapply call is not wrapped in unlist. mapply can return a list rather than a vector (for instance when some column matches no site, or more than one), which is not what you're after for subsetting purposes. So, try:
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[unlist(mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,]))]
Edit:
If the goal is to grab columns whose first 2 rows match the 2nd and 3rd elements of a given row in df1, you can try the following:
output_df <- Filter(function(x) !all(is.na(x)), data.frame(do.call(cbind, apply(df2, 2, function(x) {
  ## Create a condition vector for an if-statement or for subsetting.
  ## The "_" separator keeps e.g. row 1, column 12 distinct from row 11, column 2.
  condition <- paste0(x[1:2], collapse = "_") == apply(df1[, c('Row','Column')], 1, function(y) {
    paste0(y, collapse = "_")
  })
  ## Return a column if it meets the condition (first 2 rows are matched in df1)
  if (sum(condition) != 0) {
    tempdf <- data.frame(x)
    names(tempdf) <- df1[condition, ]$Site[1]
    tempdf
  } else {
    ## If they are not matched, then return an empty column
    data.frame(rep(NA, nrow(df2)))
  }
}))))
It is quite a condensed piece of code, so I hope the following explanation will help clarify some things:
This basically goes through every column in df2 (with apply(df2, 2, FUN)) and checks if its first 2 rows can be found in the 2nd and 3rd elements of every row in df1. If the condition is met, then it returns that column in a data.frame format with its column name being the value of Site in the matching row in df1; otherwise an empty column (with NA's) is returned. These columns are then bound together with do.call and cbind, and then coerced into a data.frame. Finally, we use the Filter function to remove columns whose values are NA's.
All that should give the following:
Site.A Site.B Site.C
1 2 3
5 4 3
40 42 33
13 47 25
23 0 34
2 41 17
10 29 38
43 27 8
31 1 25
31 40 31
34 12 43
43 30 46
46 49 25
45 7 17
2 13 38
28 12 12
16 19 15
39 28 30
41 24 30
10 20 42
11 4 8
33 40 41
34 26 48
2 29 13
38 0 27
38 34 13
30 29 28
47 2 49
22 10 49
45 37 30
29 31 4
25 24 31
I hope this helps.
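As an aside, the extra columns described in the question's EDIT are what one would expect from testing Row and Column separately with %in%: any cell whose row appears somewhere in df1 and whose column appears somewhere in df1 passes, even if that pair is not a site. The sketch below illustrates this on a small made-up grid (the values, the obs row names and the variable names cell_key, site_key are all assumptions for illustration) and shows a pairwise lookup with match as one possible alternative.

```r
# Made-up example: two sites, five grid cells, three observations per cell.
df1 <- data.frame(Site = c("Site.A", "Site.B"), Row = c(1, 2), Column = c(5, 4),
                  stringsAsFactors = FALSE)
grid <- rbind(Row = c(1, 1, 2, 2, 3), Column = c(4, 5, 4, 5, 3))
vals <- matrix(1:15, ncol = 5, dimnames = list(paste0("obs", 1:3), NULL))
df2 <- as.data.frame(rbind(grid, vals))

# Separate %in% tests accept every Row/Column combination: 4 columns pass.
naive <- unlist(df2["Row", ]) %in% df1$Row &
  unlist(df2["Column", ]) %in% df1$Column

# Pairwise test: build one key per cell and one per site, then look sites up.
cell_key <- paste(unlist(df2["Row", ]), unlist(df2["Column", ]), sep = "_")
site_key <- paste(df1$Row, df1$Column, sep = "_")
idx <- match(site_key, cell_key)   # position of each site in df2, NA if absent
found <- !is.na(idx)

output <- df2[, idx[found], drop = FALSE]
names(output) <- df1$Site[found]   # only the two real sites remain
```

Because match returns one position per site, this also keeps the columns in the order of df1.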

conditional subseting with square brackets or inside square brackets

I have two vectors, p1 and p2. They report the same information, except that p2 is more precise. So I want to compare the two and pick the value from p2, except if the difference between the two vectors is > k. In that case I want the value from p1 to be picked in the final product "pd".
k <- 5
p1 <- c(21,43,62,88,119,156,264)
p2 <- c(19,42,62,84,104,156,262)
pd should look like:
pd <- c(19,42,62,84,119,156,262)
I have seen code that specified the selection condition inside the square brackets, but can't figure out how to duplicate it. Something similar to pd <- p2[p1, p1-p2 >5], but not exactly, because this obviously doesn't evaluate. p2[p1-p2<5] works to select the positive cases, but the 5th case, where the condition evaluates to FALSE, is skipped.
Maybe:
ifelse(abs(p2-p1) <=k, p2, p1)
#[1] 19 42 62 84 119 156 262
Or without using ifelse
indx <- abs(p1-p2) >k
pd <- p2
pd[indx] <- p1[indx]
pd
#[1] 19 42 62 84 119 156 262
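To see why p2[p1-p2<5] skips the 5th case, it may help to compare the lengths involved. A small sketch using the vectors from the question (the names keep and short are only illustrative):

```r
k <- 5
p1 <- c(21, 43, 62, 88, 119, 156, 264)
p2 <- c(19, 42, 62, 84, 104, 156, 262)

keep <- abs(p1 - p2) <= k   # FALSE only at position 5

# Subsetting with a logical vector drops the FALSE positions entirely,
# so the result has 6 elements instead of 7.
short <- p2[keep]

# Assigning into the FALSE positions keeps all 7 elements in place.
pd <- p2
pd[!keep] <- p1[!keep]
```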
