Count next n rows that meet a condition in R

Let's say I have a df that looks like this
ID X_Value
1 40
2 13
3 75
4 83
5 64
6 43
7 74
8 45
9 54
10 84
What I would like to do is apply a rolling function: if, in the actual and last 4 rows, there are 2 or more values higher than X (let's say 70 for this example), return 1, else 0.
So the output would be something like the following:
ID X_Value Next_4_2
1 40 0
2 13 0
3 75 0
4 83 1
5 64 1
6 43 1
7 74 1
8 45 0
9 54 0
10 84 1
I think this would be possible with a rolling function, but I have tried and am not sure how to do it. Thank you in advance.

Given your expected output, I suppose you meant "in the actual and previous 3 rows". Then using some rolling function indeed does the job:
library(zoo)
thr1 <- 70
thr2 <- 2
last <- 3 + 1
df$Next_4_2 <- 1 * (rollsum(df$X_Value > thr1, last, align = "right", fill = 0) >= thr2)
df
# ID X_Value Next_4_2
# 1 1 40 0
# 2 2 13 0
# 3 3 75 0
# 4 4 83 1
# 5 5 64 1
# 6 6 43 1
# 7 7 74 1
# 8 8 45 0
# 9 9 54 0
# 10 10 84 1
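If zoo is not available, the same rolling count can be sketched in base R with cumsum(). This is only an illustration, not part of the answer above: the name width and the rebuilt df are assumptions made for the example. Unlike rollsum(..., fill = 0), it also evaluates the shorter windows at the top of the data, which makes no difference here because fewer than two of the first three values exceed 70.

```r
# Rebuild the example data from the question.
df <- data.frame(ID = 1:10,
                 X_Value = c(40, 13, 75, 83, 64, 43, 74, 45, 54, 84))

thr1  <- 70   # value threshold
thr2  <- 2    # minimum number of hits in the window
width <- 4    # current row plus the 3 previous rows (assumed name)

hit <- as.integer(df$X_Value > thr1)
cs  <- cumsum(hit)

# Hits in rows (i - width + 1):i equal cs[i] - cs[i - width],
# with the cumulative sum before row 1 taken as 0.
lagged <- c(rep(0, width), head(cs, -width))
df$Next_4_2 <- as.integer(cs - lagged >= thr2)

print(df)
```

The cumulative-sum difference avoids an explicit loop, so it scales to long columns without extra packages.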

The indexing using max(1, i-3) is perhaps the only part of the code worth remembering. It might help in later constructions where a for-loop is really needed.
dat$X_Next_4_2 <- integer(nrow(dat))
# Row 1 has no earlier rows, so it can never reach two hits.
dat$X_Next_4_2[1] <- 0L
for (i in 2:nrow(dat)) {
  # Window: the current row and up to 3 previous rows, clipped at row 1.
  window <- dat$X_Value[max(1, i - 3):i]
  dat$X_Next_4_2[i] <- as.integer(sum(window > 70) >= 2)
}
(Not very pretty and clearly inferior to the rollsum answer already posted.)

Related

Why is my R code for filtering data producing different results with "fread()" and "ffdf()"?

I have a huge file with 7 million records and 160 variables. I came to know that fread() and read.csv.ffdf() are two ways to handle such big data. But when I try to use dplyr to filter these two data sets, I get different results. Below is a small subset of my data-
sample_data
AGE AGE_NEONATE AMONTH AWEEKEND
2 18 5 0
3 32 11 0
4 67 7 0
5 37 6 1
6 57 5 0
7 50 6 0
8 59 12 0
9 44 9 0
10 40 9 0
11 27 3 0
12 59 8 0
13 44 7 0
14 81 10 0
15 59 6 1
16 32 10 0
17 90 12 1
18 69 7 0
19 62 11 1
20 85 6 1
21 43 10 0
Code1
sample_data <- fread("/user/sample_data.csv", stringsAsFactors = T)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result1-
AGE AGE_NEONATE AMONTH AWEEKEND
1 67 NA 7 0
2 81 NA 10 0
3 90 NA 12 1
4 69 NA 7 0
5 85 NA 6 1
Code2-
sample_data <- read.csv.ffdf(file="C:/Users/sample_data.csv", header=F ,fill=T)
header.true <- function(df) {
names(df) <- as.character(unlist(df[1,]))
df[-1,]
}
sample_data<-tbl_ffdf(sample_data)
sample_data<-header.true(sample_data)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result2-
AGE AGE_NEONATE AMONTH AWEEKEND
1 81 10 0
2 90 12 1
3 85 6 1
I know that my 1st code is correct and gives me the correct results. What am I doing wrong in the 2nd code?

I haven't really tried running your code, but from what I can see, I suspect the following:
In your 2nd code version, you are reading the headers as part of the data. This leads to all the columns being imported as character rather than numeric.
In addition, most likely you have default.stringsAsFactors() returning TRUE, meaning that the imported character columns are treated as factors.
Now I guess that your between is being applied to the factor's internal codes rather than to the actual numbers. as.numeric(AGE) on a factor returns the integer code of each level, not the number you see when printing. Since you probably don't have data for every year of age, 67 and 69 are likely mapped to codes below 65, so they drop out of the filter.
Try stringsAsFactors = FALSE, or convert explicitly to character before calling as.numeric().
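A small illustration of the effect described above, using a made-up vector of ages (the values are hypothetical, not taken from the asker's file):

```r
# Hypothetical ages stored as a factor, as happens when a numeric column
# is read as text and then converted to a factor.
age <- factor(c("18", "67", "81", "69", "90"))

# as.numeric() on a factor returns the internal level codes ...
codes <- as.numeric(age)
print(codes)    # 1 2 4 3 5

# ... so convert to character first to recover the printed numbers.
values <- as.numeric(as.character(age))
print(values)   # 18 67 81 69 90

# Filtering on the codes finds nothing in the 65-95 range here.
print(sum(codes >= 65 & codes <= 95))     # 0
print(sum(values >= 65 & values <= 95))   # 4
```

With only five levels, none of the codes reaches 65. With around a hundred distinct ages, some codes would fall in the range by accident, which would explain a partial result like Result2.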

How to get value from upcoming row if condition is met?

I searched Google and SO but could not find an answer to my question.
I am trying to get a value from the first upcoming row in which a condition is met.
Example:
Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33
to
Pupil participation bonus bonusAtNoParti sumBonusTillParticipation=0
2 55 6 -94 6+3+9 = 18
2 33 3 -97 3+9 = 12
2 88 9 -91 9
2 0 -100 0 0
2 44 4 -29 4+7=11
2 66 7 -26 7
2 0 -33 0 0
So I need to do this:
Iterate through the dataframe, check the following rows until participation equals 0, take the bonus from that row, add the bonus from the current row, and write the result to bonusAtNoParti.
My problem is the "check next rows till participation equals to 0 and get the bonus from that line" part.
I know how to iterate through the whole list, but not only from the current row onward.
I need to do this for the whole list, where participation values can appear in any order.
Does anyone have an idea how to do this?
Edit: I also added another column ("sumBonusTillParticipation=0"; only the sum value is required), which is even harder to do. R is such a hard language to learn =(

You can use which() to get the row numbers where participation is 0.
df <- read.table(text = 'Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33', header = T)
index <- c(0, which(df$participation == 0))
diffs <- diff(index)
df$tp <- rep(df$bonus[index], times = diffs)
df$bonusAtNoParti <- df$bonus + df$tp
df$bonusAtNoParti[index] <- 0
df$tp <- NULL
Pupil participation bonus bonusAtNoParti
1 2 55 6 -94
2 2 33 3 -97
3 2 88 9 -91
4 2 0 -100 0
5 2 44 4 -29
6 2 66 7 -26
7 2 0 -33 0
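The answer above covers bonusAtNoParti only. For the second requested column, one possible sketch is a reversed cumulative sum within each block of rows that ends at a zero-participation row. The names block, b and sumBonusTillZero are made up for this example, and the approach assumes that rows after the last zero (if any) simply form their own final block.

```r
# Rebuild the example data from the question.
df <- data.frame(Pupil = 2,
                 participation = c(55, 33, 88, 0, 44, 66, 0),
                 bonus = c(6, 3, 9, -100, 4, 7, -33))

is_zero <- df$participation == 0

# Block id: increases by one on the row right after each zero-participation row.
block <- cumsum(c(0, head(is_zero, -1)))

# Only bonuses from rows with participation count toward the sum.
b <- ifelse(is_zero, 0, df$bonus)

# Within each block, sum the bonus from the current row to the end of the block.
df$sumBonusTillZero <- ave(b, block, FUN = function(x) rev(cumsum(rev(x))))

print(df)
```

For the example data this gives 18, 12, 9, 0, 11, 7, 0, matching the sums listed in the question.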

How to reverse the order of two indices of a variable in R

I have a dataset that looks like
A T Value into T A Value
1 1 32 1 1 32
1 2 33 1 2 55
1 3 34 1 3 96
2 1 55 2 1 33
2 2 56 2 2 56
2 3 57 2 3 97
3 1 96 3 1 34
3 2 97 3 2 57
3 3 98 3 3 98
and I want to use reshape (in R) to reshape the object on the left so that the T index comes in the first column and the A index in the second column, giving the object on the right. I don't have the melt or cast functions.

Let df be your data.frame.
df <- df[order(df$T, df$A), c("T", "A", "Value")]
This can be found out easily by googling next time.

Looks like you just want to sort rows and move columns. If this is your sample input
tt<-read.table(text="A T Value
1 1 32
1 2 33
1 3 34
2 1 55
2 2 56
2 3 57
3 1 96
3 2 97
3 3 98", header=T)
you can do
tt[order(tt$T, tt$A), c("T","A","Value")]

summing a range of columns in data frame

I am having trouble summing select columns within a data frame, a basic problem that I've seen numerous similar, but not identical questions/answers for on StackOverflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc.; columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that counts the number of rows summed.
The result would look like this:
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID)
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!

This works ok:
x <- aggregate(my.data[,5:12], by=list(site=my.data$site, moisture=my.data$moisture, shade=my.data$shade), FUN=sum, na.rm=T)
library(dplyr)
my.data %>%
group_by(site) %>%
tally %>%
left_join(x)
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
Or to do it all in dplyr
my.data %>%
group_by(site) %>%
tally %>%
left_join(my.data) %>%
group_by(site,moisture,shade,n) %>%
summarise_each(funs(sum=sum)) %>%
select(-sampleID)
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40

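Try the following using base R:
# Drop the ID column first so that columns 4:11 of my.data are the species counts.
my.data$sampleID <- NULL
outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric())
# One key per unique site/moisture/shade combination.
my.data$basic = with(my.data, paste(site, moisture, shade))
for(b in unique(my.data$basic)) {
  # Start a new output row holding the three grouping values.
  outdf[nrow(outdf)+1,1:3] = as.numeric(unlist(strsplit(b,' ')))
  # Fill in the per-species sums for this group.
  for(i in 4:11)
    outdf[nrow(outdf),i]= sum(my.data[my.data$basic==b,i])
}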
outdf
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
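For comparison, here is a loop-free base R sketch that refers to the species by column range and also produces the requested row count. It is one possible arrangement rather than the method of either answer. The names keys, species, sums, counts, result and NumSamples are chosen for the example, and it assumes sampleID is still in column 4.

```r
# Rebuild the example data from the question.
my.data <- data.frame(
  site     = c(223, 257, 223, 223, 257, 298, 223, 298, 298, 211),
  moisture = c(7, 7, 7, 7, 7, 8, 7, 8, 8, 5),
  shade    = c(83, 18, 83, 83, 18, 76, 83, 76, 76, 51),
  sampleID = c(158, 163, 222, 107, 106, 166, 188, 186, 262, 114),
  bluestm  = c(3, 4, 6, 3, 0, 0, 1, 1, 1, 0),
  foxtail  = c(0, 2, 0, 4, 0, 1, 1, 0, 3, 0),
  crabgr   = c(0, 0, 2, 0, 33, 0, 2, 1, 2, 0),
  johnson  = c(0, 0, 0, 7, 0, 8, 1, 0, 1, 0),
  sedge1   = c(2, 0, 3, 0, 0, 9, 1, 0, 4, 0),
  sedge2   = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 1),
  redoak   = c(9, 1, 0, 5, 0, 4, 0, 0, 5, 0),
  blkoak   = c(0, 22, 0, 23, 0, 23, 22, 17, 0, 0)
)

keys    <- my.data[c("site", "moisture", "shade")]  # grouping columns
species <- 5:12                                     # species columns, by position

# Sum every species column within each site/moisture/shade group.
sums <- aggregate(my.data[, species], by = keys, FUN = sum)

# Count the rows that went into each group.
counts <- aggregate(data.frame(NumSamples = my.data$sampleID),
                    by = keys, FUN = length)

# Combine the two; merge() puts the key columns first and sorts by them.
result <- merge(counts, sums, by = c("site", "moisture", "shade"))
print(result)
```

With 300 species, only the species range needs to change, for example to 5:ncol(my.data).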

Combining 2 columns into 1 column many times in a very large dataset in R

The clumsy solutions I am working on will not be very fast even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n = 10  # number of rows; the hard-coded SNP vectors below have length 10
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                 age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
                 rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n), rs157=c(2,4,2,2,2,4,4,4,2,2),
                 rs157.1=c(4,4,4,2,4,4,4,4,2,2), rs132=c(4,4,4,4,4,4,4,4,2,2),
                 rs132.1=c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.

Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter then
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
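The answer above pastes three SNP pairs by hand. A rough sketch of how all three steps from the question might be generalised follows. It assumes every second-allele column is named like its first-allele column plus ".1", as in the toy data. The helper rare_flag and the names geno and flags are made up for this example, and ties for the least frequent value are broken by taking the first one in table() order.

```r
# SNP columns from the toy example (demographic columns omitted for brevity).
pop <- data.frame(
  rs123 = c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1), rs123.1 = rep(1, 10),
  rs157 = c(2, 4, 2, 2, 2, 4, 4, 4, 2, 2), rs157.1 = c(4, 4, 4, 2, 4, 4, 4, 4, 2, 2),
  rs132 = c(4, 4, 4, 4, 4, 4, 4, 4, 2, 2), rs132.1 = rep(4, 10)
)

# 1) Pair each "<snp>.1" column with its "<snp>" partner and paste them.
second <- grep("\\.1$", names(pop), value = TRUE)
first  <- sub("\\.1$", "", second)
geno <- as.data.frame(
  Map(function(a, b) paste0(pop[[a]], pop[[b]]), first, second),
  stringsAsFactors = FALSE
)

# 2) and 3) Flag the least frequent genotype with 1 and everything else with 0.
rare_flag <- function(x) {
  counts <- table(x)
  rare <- names(counts)[which.min(counts)]  # first minimum wins on ties
  as.integer(x == rare)
}
flags <- as.data.frame(lapply(geno, rare_flag))

print(geno)
print(flags)
```

For a 1500 x 45000 table this column-wise approach may still be slow; it is meant to show the logic, not to be a tuned implementation.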