I have a complex calculation that needs to be done. It is basically at a row level, and i am not sure how to tackle the same.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is how my data looks like
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at Monthly Level.
I would like to capture the below two things.
1. The count of consecutive zeros for each row to-and-fro from lag0(reference)
The highlighted yellow are the cases, that are consecutive with lag0(reference) to a certain point, that it reaches first 1. I want to capture the count of zero's at row level, along with the corresponding Sales value.
Below is the output i am looking for the part1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows(row:1,2 and 3 & similarly row:5,6) where overlap of any lag or lead happens for any 0 within the lag0(reference range), and capture their Sales and Month value.
For example, for row 1,2 and 3, the overlap happens at atleast lag:3,2,1 &
lead: 1,2, this needs to be captured and tagged as case1 (or 1). Similarly, for row 5 and 6 atleast lag1 is overlapping, hence this needs to be captured, and tagged as Case2(or 2), along with Sales and Month value.
Now, row 7 is not overlapping with the previous or later consecutive row,hence it will not be captured.
Below is the result i am looking for part2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, hence i will either incorporate dplyr or loop to get the result. Currently, i am simply looking for the approach.
Not sure how to solve this problem. First time i am looking to capture things at row level in R. I am not looking for any solution. Simply looking for a first step to counter this problem. Would appreciate any leads.
An option using rle for the 1st part of the calculation can be as:
df$count <- apply(df[,-c(1:4)],1,function(x){
first <- rle(x[1:7])
second <- rle(x[9:15])
count <- 0
if(first$values[length(first$values)] == 0){
count = first$lengths[length(first$values)]
}
if(second$values[1] == 0){
count = count+second$lengths[1]
}
count
})
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
I have got the following problem. I have a data.frame with an x and y column representing some points in space:
X<-c(18.25743,18.25783,18.25823,18.25850,18.25863,18.25878,
18.25885,18.25912,18.25943,18.25962,18.25978,18.26000,
18.26022,18.26051,18.26070,18.26095,18.26118,18.26140,
18.26189,18.26250,18.26310,18.26390)
Y<-c(44.69561,44.69564,44.69567,44.69567,44.69586,
44.69600,44.69637,44.69671,44.69691,44.69701,44.69720,
44.69740,44.69763,44.69774,44.69787,44.69790,44.69791,
44.69795,44.69812,44.69802,44.69812,44.69834)
eDF<-data.frame(X,Y)
Now my problem is they are "sorted" wrong for plotting.So what I need is a function to write together the rows of the two points which belong together (in a list of lists):
1 and 12 is ID1
2 and 13 is ID2
3 and 14 is ID3
...
11 and 22 is ID11
Every so created list within the list of lists should have its unique ID (just numerating from 1 to the end). Well because I got this problem in all my data with different length.
It would be great if the starting point of the second consecutive row selecting (the 12) is flexible always taking the first row after half of the data.((rownumber/2)+1) in this example
12.
Well I have tried some things and i think Im on the right way but I cant figure out a solution by myself.
This function is pretty near but i cant manage to make it start at different rows(1 and 12):
lapply(2:nrow(eDF), function(x) eDF[(x-1):x,])
I also tried to figure it out with seq and it would do what i need if i could make a list of lists by connecting both code samples. Well I also need to change the concrete start and end numbers to a dynamic solution.
eDF[(seq(1,to=11,by=1)),] # selecting rows 1 to 11
eDF[(seq(12,to=nrow(eDF),by=1)),] #selecting rows 12 to end
Anyone any ideas?
I don't know if you needed an ID column inside of the new list but another way would be:
#create the IDs
eDF$ID <- rep(1:11,2)
#split the data.frame according to those
mylist <- split(eDF, eDF$ID)
Output:
mylist
$`1`
X Y ID
1 18.25743 44.69561 1
12 18.26000 44.69740 1
$`2`
X Y ID
2 18.25783 44.69564 2
13 18.26022 44.69763 2
$`3`
X Y ID
3 18.25823 44.69567 3
14 18.26051 44.69774 3
$`4`
X Y ID
4 18.2585 44.69567 4
15 18.2607 44.69787 4
#and so on...
You could only do split(eDF, rep(1:11,2) if you don't need the ID column.
We can modify the OP's lapply code
lapply(1:11, function(i) eDF[c(i, i+11),])
I have a column of a data.table:
DT = data.table(R=c(3,8,5,4,6,7))
Further on I have a vector of upper cluster limits for the cluster 1, 2, 3 and 4:
CP=c(2,4,6,8)
Now I want to compare each entry of R with the elements of CP considering the order of CP. The result
DT[,NoC:=c(2,4,3,2,3,4)]
shall be a column NoC in DT, whose entries are just the number of that cluster, which the element of R belongs to.
(I need the cluster number to choose a factor out of another data.table.)
For example take the 1st entry of R: 3 is not smaller than 2 (out of CP), but smaller than 4 (out of CP). So, 3 belongs to cluster 2.
Another exmaple, take the 6th entry of R: 7 is neither smaller than 2, 4 nor 6 (out of CP), but shmaller than 8 (out of CP). So, 7 belongs to cluster 4.
How can I do that without using if-clauses?
You can accomplish this using rolling joins:
data.table(CP, key="CP")[DT, roll=-Inf, which=TRUE]
# [1] 2 4 3 2 3 4
roll=-Inf performs a NOCB rolling join - Next Observation Carried Backward. That is, in the event of value falling in a gap, the next observation will be rolled backward. Ex: 7 falls between 6 and 8. The next value is 8 - will be rolled backward. We simply get the corresponding index of each match using which=TRUE.
You can just add this as a column to DT using := as you've shown.
Note that this will return the indices after ordering CP. In your example, CP is already ordered, so it returns the result as intended. If CP is not already ordered, you'll have to add an additional column and extract that column instead of using which=TRUE. But I'll leave it to you to work it out.
From your description this would seem to be the code to deliver the correct answers, but Arun, a most skillful data.tablist, seems to have come up with a completely different way to fit your expectations, so I think there must be a different way of reading your requirements.
> DT[ , NoC:= findInterval(R, c(0, 2,4,6,8) , rightmost.closed=TRUE)]
> DT
R NoC
1: 3 2
2: 8 4
3: 5 3
4: 4 3
5: 6 4
6: 7 4
I'm also very puzzled that findInterval is assigning the 5th item to the 4th interval since 6 is not greater than the upper boundary of the third interval (6).
i have a smallish (2k) data set that contains questionnaire answers filled out by students there were sampled twice a year. not all the students that were present for the first wave were there for the second wave and vice versa. for each student, a unique id was created that consisted of the school code, the class code, the student number and the wave as a decimal point. for example 100612.1 is a student from school 10, grade 6, 12 on the names list and this was the first wave. the idea behind the decimal point was a way to identify the same student again in the data set (the only value which differs less than abs(1) from a given id is the same student on the other wave).at least that was the idea.
i was thinking of a script that would do the following:
- find the rows who's unique id is less than abs(1) from one another
- for those rows, generate a new row (in a new table) that consists of the student id and the delta of the measured variables( i.e value in the wave 2 - value in wave 1).
i a new to R but i have a tiny bit of background in other OOP. i thought about creating a for loop that runs from 1 to length(df) and just looks for it's "brother". my gut feeling tells me that this not the way things are done in R. any ideas?
all i need is a quick way of sifting through the data looking for the second wave row. i think the rest should be straight forward from there.
thank you for helping
PS. since this is my first post here i apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt #jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before, now instead of data.frame and tapply you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
surveyDT[, delta := diff(answers), by = child]
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example,
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
There are two ways that come to mind. The easiest is to use the function floor(), which returns the integer For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regex expression to get rid of the decimal place too. Then you can use unique() to find the rows that are or are not duplicated entries.
Lets make some fake data so we can see our problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now lets split our ids into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave, and compute differences based on child:
survey[order(survey$child_id, survey$wave_id),]
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA,diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA