I have a dataframe of the leads we generate every day, with columns lead_id, generated_date, handled_date, and termination_date; each row is one lead. If a lead has an empty handled_date, the lead is still waiting to be handled and remains available on later days, up until its termination_date. I want to plot a stacked bar chart showing how many leads are available each day, where each segment of a bar indicates the day those leads were generated. The scenario is:
On 20190413, we generate 100 leads, so 20190413 has one bar with height 100.
On 20190414, we generate 150 new leads, and 50 leads from the previous day (20190413) are handled, leaving 50 (no leads are terminated). In the graph, 20190414 should have two stacked segments: 50 (100 - 50) leads left from 20190413, and 150 newly generated leads.
On 20190415, we generate 100 leads; 30 leads from 20190413 and 50 leads from 20190414 were handled 'yesterday' (20190414), and no leads are terminated. So for 20190415 there should be three stacked segments: 20 (100 - 50 - 30) leads from 20190413, 100 (150 - 50) from 20190414, and 100 newly generated leads.
As long as it is before the termination_date and the lead has an empty value in the handled_date column, the lead is available.
Can anybody tell me how to plot this? Thank you very much.
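A minimal sketch of one way to do this in R with dplyr, tidyr and ggplot2 (the data frame name "leads" is an assumption, and the availability rule is my reading of the walk-through above: a lead counts on day d if it was generated on or before d, its handled_date is empty or later than d, and d is on or before its termination_date):
library(dplyr)
library(tidyr)
library(ggplot2)

days <- seq(min(leads$generated_date), max(leads$termination_date), by = "day")

avail <- expand_grid(leads, day = days) %>%          # one row per lead per day
  filter(generated_date <= day,                      # already generated
         is.na(handled_date) | handled_date > day,   # not yet handled
         day <= termination_date) %>%                # not yet terminated
  count(day, generated_date)                         # available leads per cohort

ggplot(avail, aes(x = day, y = n, fill = factor(generated_date))) +
  geom_col() +                                       # geom_col stacks by default
  labs(y = "available leads", fill = "generated on")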
My dataset contains a Likert item recording how energetic the participants were at that moment, rated from 0-6, where 0 = not energetic at all and 6 = very energetic. I have to investigate whether these scores actually differ from one another based on the data. If 0 and 1 do not differ from each other, I have to combine these two levels into one, and so on. So at the end I might have two or four levels instead of the original seven.
I have tried applying classification algorithms to the data to see if a model classifying '0' would give an error rate when classifying '1'. Unfortunately, this did not work as I wanted. Is this actually possible?
My question is whether someone knows how I can best investigate whether there is indeed a difference between these levels, or whether I can combine some of them based on differences (or the lack thereof) in the data for those levels.
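One concrete, hedged version of the classification idea above (the data frame name dat and the column name energy are assumptions): fit a classifier that predicts the Likert level from the other variables and inspect the confusion matrix; levels the model systematically confuses are candidates for merging, since the data give it no basis to separate them.
library(randomForest)

dat$energy <- factor(dat$energy)   # the Likert item as the classification target
fit <- randomForest(energy ~ ., data = dat)

fit$confusion   # rows = true level, columns = predicted level
# If, say, most true 0s are predicted as 1 and vice versa, the remaining
# variables cannot separate the two levels, which supports merging 0 and 1.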
I have, let's say, 60 empirical realizations of PPR. My goal is to create a PPR vector of average values of the empirical PPR. These average values depend on which upper and lower limits of TTM I take: I can take TTM from 60 to 1, calculate one average, and put that single number in rows 1 to 60 of the PPR vector; or I can calculate the average value of PPR separately for TTM <= 60 and TTM > 30, and for TTM <= 30 and TTM >= 1, and put these two numbers in my vector according to the TTM values. Finally, I want to obtain something like this on a chart (the x-axis is TTM, the green line is my empirical PPR, and the black line is the average based on significant changes across TTM). I want to write an algorithm that will help me find the TTM thresholds that make the black line fit the green line best.
TTM PPR
60 0,20%
59 0,16%
58 0,33%
57 0,58%
56 0,41%
...
10 1,15%
9 0,96%
8 0,88%
7 0,32%
6 0,16%
Can you please help if you know any statistical method that might be applicable in this case, or a basic idea for an algorithm that I could implement in VBA/R?
I have used Solver with GRG Nonlinear** to deal with it, but I believe there is something more appropriate.
** With Solver the problem was that it found an optimal solution, which was fine, but when I re-ran Solver it found a new solution (with slightly different TTM values) whose target-function value was lower than the first time (so was the first solution really optimal?).
I think this is what you want. The next step would be to include a method that can recognize the break points. You would need to define two new parameters: one for the sensitivity, and one for the minimum number of points a sample must contain to be accepted as a section (between two break points, including the start and end points).
You can download the Excel file from here:
http://www.filedropper.com/statisticspatternchange
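For the break-point step, here is a minimal R sketch (an assumption on my side, not what the Excel file above does) using the strucchange package, which fits piecewise-constant means and selects break points by BIC; the h argument plays the role of the minimum-points-per-section parameter mentioned above. It assumes a data frame df with the TTM and PPR columns from the question.
library(strucchange)

df <- df[order(-df$TTM), ]                    # order rows from TTM = 60 down to 1
bp <- breakpoints(PPR ~ 1, data = df, h = 5)  # h = minimum observations per section

summary(bp)                               # compares BIC across numbers of breaks
df$fit <- ave(df$PPR, breakfactor(bp))    # section-wise average PPR (black line)

plot(df$TTM, df$PPR, type = "l", col = "green",
     xlab = "TTM", ylab = "PPR", xlim = rev(range(df$TTM)))
lines(df$TTM, df$fit, col = "black", lwd = 2)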
I'm trying to obtain the proportion of individuals that share certain DNA sequences between two given points, and I want to use a specific sliding window. To illustrate the problem, I created this example. First, I create a data frame with four columns.
x<-c(rep("sc256",times=2000),rep("sc784",times=2000))
pos1<-round(runif(2000,100,5000),digits=0)
pos2<-round(runif(2000,100,5000),digits=0)
y3<-rep(c(2,1),times=2000)
M1<-data.frame(x,pos1,pos2,y3)
colnames(M1)=c("iid","pos1","pos2","chr")
I also create a function to obtain the proportion of individuals that have sequences in a particular interval.
# Proportion of individuals in pop whose sequence on chromosome chr
# lies entirely inside the interval [p1, p2]
roh_island <- function(pop, chr, p1, p2) {
  a <- pop[pop$chr == chr, ]
  island <- subset(a, pos1 >= p1 & pos2 <= p2)
  n <- nrow(island) / nrow(pop)   # use the pop argument, not the global M1
  return(n)
}
roh_island(M1, 1, 345, 700)
Now I want to turn this interval into a sliding window of size 10 that moves between 0 and 7000, so the window takes the positions [0,10), [10,20), …, [6990,7000]. I also need the new sliding-window function to store every window and the proportion of individuals in it in a data frame, so I can plot the result afterwards. I tried some sliding-window solutions I found, but I could not make them work. Thanks.
This code slides p1 from 0 to 6990 and p2 from 10 to 7000, both in steps of 10, and stores each window together with its proportion in a data frame:
starts <- seq(0, 6990, 10)
ends   <- seq(10, 7000, 10)
output <- data.frame(p1 = starts, p2 = ends,
                     prop = mapply(roh_island, p1 = starts, p2 = ends,
                                   MoreArgs = list(pop = M1, chr = 1)))
plot(output$p1, output$prop, col = "blue",
     xlab = "window start", ylab = "proportion of individuals")
grid(5, 5)
I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted <- svrepdesign(data = els, repweights = ~F3F1PNLWT,
                            combined.weights = TRUE)
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted into a contingency table that shows frequencies. I would like to sort people based on four variables (none of which are binary, but all of which are numeric). This is what I would like to get:
[screenshot of the desired contingency table]
The blank boxes are where the cross-tabulation/frequency counts would go. (I only put in 3 columns beneath F1FCOMP for brevity's sake, but it has 9 outcomes, indexed 1-9.)
My current code: svyby(~F1FCOMP, ~F1RTRCC + BYS33C + F1A10 + byurban, els.weighted, svytotal)
This code does sort the data, but by default it produces every single combination. I want the output pared down to only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination. This is a picture of what my current output looks like; there are hundreds more rows, one for each combination, when I keep scrolling down:
[screenshot of the current svyby output]
You can see in that picture that I only get one number for F1FCOMP per row: the number of cases that fit the specified combination, i.e. a specific subpopulation. I want to know more about each subpopulation. That is, F1FCOMP has nine different outcomes (indexed 1-9), and I want to see how many members of each subpopulation fall into each of the 9 outcomes of F1FCOMP.
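A hedged sketch of one way to get there with the survey package (variable names and codes taken from the question): restrict the design to the subpopulation first with subset(), which is the supported way to take subpopulations of a survey design, and then tabulate F1FCOMP within it.
# Restrict the replicate-weight design to the subpopulation of interest,
# then tabulate the nine F1FCOMP outcomes within it
sub <- subset(els.weighted,
              F1RTRCC %in% c(2, 3) & BYS33C == 1 &
              F1A10 %in% c(2, 3) & byurban == 3)

svytable(~F1FCOMP, design = sub)   # weighted count of each of the 9 outcomes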
I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below reproduces the same data:
track_1 <- data.frame(
  LAT   = c(32.34855, 31.54764, 31.38293, 31.21447, 30.76365,
            30.75872, 30.60261, 30.62818, 31.35912, 31.15218),
  LON   = c(-35.49264, -35.58691, -35.25243, -35.25830, -35.38881,
            -35.54733, -35.95472, -36.27024, -35.73573, -36.38027),
  SCORE = c(80.67, 18.14, 46.70, 22.65, 11.93,
            22.97, 35.98, 31.09, 14.97, 37.60))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2   # one cell for every 0.5 degrees of latitude
lonScale <- 2   # likewise for longitude
track_1$cell <- factor(with(track_1,
  paste(floor(LAT * latScale), floor(LON * lonScale), sep = '.')))
library(plyr)
ddply(track_1, .(cell), summarize,
      LAT = mean(LAT), LON = mean(LON), SCORE = sum(SCORE))
If you want, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort into making them nicer (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
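Since the question asks for the center of each grid cell rather than the mean position of the points in it, here is a hedged follow-up sketch using the integer-columns variant: cell index k covers [k/scale, (k+1)/scale), so its center is (k + 0.5)/scale.
track_1$latCell <- floor(track_1$LAT * latScale)
track_1$lonCell <- floor(track_1$LON * lonScale)

ddply(track_1, .(latCell, lonCell), summarize,
      LAT   = (latCell[1] + 0.5) / latScale,   # cell-center latitude
      LON   = (lonCell[1] + 0.5) / lonScale,   # cell-center longitude
      SCORE = sum(SCORE))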