I am trying to calculate a monthly percentage, but it always gives me the wrong answer

This is my data:
Month Key Resource
Jan Yes
Jan Yes
Jan Yes
Jan No
Jan No
I want to calculate the percentage of key resources in the month of Jan. So what I did first is create a calculated column to flag key resources: case when [Key Resource] = "Yes" then 1 else 0 end. Then, to get the monthly key resource percentage, I tried the calculation Sum([Key Resources]) / Count([Key Resource]). But this is not giving me the correct answer. Please help.

If you just insert a calculated column you run into problems: Spotfire will evaluate it record by record.
I created the column [Calc] (via insert calculated column): (case when [Key] = "Yes" then 1 else 0 end)
The calculation Sum([Calc]) / Count([Key]) will return the correct values in most visualizations when you enter it as a custom expression. Just try it on a bar chart with Month on the category axis.
If you want the calculation in a standard table, you need to insert a calculated column with the following syntax: Sum([Calc]) over ([Month]) / Count([Key]) over ([Month]). The over keyword partitions the calculation so it is evaluated per month.
I used the following data, calc and calculation are as mentioned above:
Month Key Calc Calculation
Jan Yes 1 0.6
Jan Yes 1 0.6
Jan Yes 1 0.6
Jan No 0 0.6
Jan No 0 0.6
Feb Yes 1 0.25
Feb No 0 0.25
Feb No 0 0.25
Feb No 0 0.25
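For comparison, the same per-month percentage can be sketched outside Spotfire, e.g. in Python/pandas (the frame below is typed in from the example data; the column names are my own choice): taking the per-month mean of the 0/1 flag is exactly Sum([Calc]) / Count([Key]) over ([Month]).

```python
import pandas as pd

# Example data from the thread, typed in by hand
df = pd.DataFrame({
    "Month": ["Jan"] * 5 + ["Feb"] * 4,
    "Key":   ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# The [Calc] flag: 1 for a key resource, 0 otherwise
df["Calc"] = (df["Key"] == "Yes").astype(int)

# Per-month mean of the flag = Sum([Calc]) / Count([Key]) per month
df["Calculation"] = df.groupby("Month")["Calc"].transform("mean")
```

transform broadcasts the per-month result back to every row, just like the over ([Month]) expression does: Jan rows get 0.6, Feb rows get 0.25.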

Related

divide counts in one column where a condition is met

I am trying to determine the on-time delivery rate of orders:
The column of interest is the on-time delivery flag, which contains 0 (not on time) or 1 (on time). How can I calculate the on-time rate for each person in SQL? Basically, count the 1's and divide by the total count of 0's and 1's for each person, and do the same for the not-on-time rate (count of 0's over the total)?
Here's a data example:
Week Delivery on time Person
1 0 sARAH
1 0 sARAH
1 1 sARAH
2 1 vIC
2 0 Vic
You may aggregate by person, and then take the average of the on-time flag:
SELECT Person, AVG(1.0*DeliveryOnTime) AS OnTime,
AVG(1.0 - DeliveryOnTime) AS NotOnTime
FROM yourTable
GROUP BY Person;
Demo
The demo given is for SQL Server, and the above syntax might have to change slightly depending on your actual database, which you did not reveal to us.
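The AVG trick is easy to check in any language. Here is a rough pandas equivalent (names normalized to one casing: unlike SQL Server's default case-insensitive collation, pandas grouping would treat "sARAH" and "Sarah" as different people):

```python
import pandas as pd

df = pd.DataFrame({
    "Week": [1, 1, 1, 2, 2],
    "DeliveryOnTime": [0, 0, 1, 1, 0],
    "Person": ["Sarah", "Sarah", "Sarah", "Vic", "Vic"],
})

# The mean of a 0/1 flag per group is the on-time rate
rates = df.groupby("Person")["DeliveryOnTime"].agg(OnTime="mean")
rates["NotOnTime"] = 1 - rates["OnTime"]
```

Sarah ends up at 1/3 on time, Vic at 1/2, matching what AVG(1.0*DeliveryOnTime) would return.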

Sum of lag functions

Within one person's data from a behavioral task, I am trying to sum the clock time at which a target appears (data$onset) and the reaction time of the response (data$Latency) to find the clock time at which they entered their response. For future data processing reasons, these calculated values have to be placed in the data$onset column two rows down from when the target appeared on the screen. In the example below:
Item      onset  Latency
Prime      9.97  0
Target    10.70  0.45
Mask      11.02  0
Response     NA  0
Onset is how many seconds into the task the stimulus appeared, and latency is the reaction time to the target. Latency for non-targets will always be 0, as subjects don't respond to them. The "NA" under onset needs to become the sum of the target's onset and the reaction time to the target (10.70 + 0.45). Here is the code I have tried:
data$onset=if_else(is.na(data$onset), sum(lag(data$onset, n = 2)+lag(data$Latency, n = 2)), data$onset)
If any clarification is needed please let me know.
Since you used if_else, I'm adding a dplyr solution:
library(dplyr)
data %>%
mutate(onset=ifelse(is.na(onset),lag(onset,n =2)+lag(Latency,n = 2),onset))
Output:
Item onset Latency
<fct> <dbl> <dbl>
1 Prime 9.97 0
2 Target 10.7 0.45
3 Mask 11.0 0
4 Response 11.1 0
Also note that, if you want to stick to your own syntax, dropping the sum() wrapper is the fix:
data$onset=if_else(is.na(data$onset), lag(data$onset, n = 2)+lag(data$Latency, n = 2), data$onset)
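The same lag-and-fill logic can be sketched in Python/pandas, where shift(2) plays the role of dplyr's lag(..., n = 2) (data typed in from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Prime", "Target", "Mask", "Response"],
    "onset": [9.97, 10.70, 11.02, None],
    "Latency": [0.0, 0.45, 0.0, 0.0],
})

# Fill missing onsets with onset + Latency taken from two rows earlier
df["onset"] = df["onset"].fillna(df["onset"].shift(2) + df["Latency"].shift(2))
```

Only the NA row changes: the Response onset becomes 10.70 + 0.45 = 11.15, the rest keep their original values.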

Sampling not completely at random, with boundary conditions

I have summary level data that tells me how often a group of patients actually went to the doctor until a certain cut-off date. I do not have individual data, I only know that some e.g. went 5 times, and some only once.
I also know that some were already patients at the beginning of the observation interval and would be expected to come more often, whereas some were new patients who entered later. If they only joined a month before the cutoff date, they would be expected to come less often than someone who was in the group from the beginning.
Of course, the patients are not well behaved, so they sometimes miss a visit, or they come more often than expected. I am setting some boundary conditions to define the expectation about minimum and maximum number of doctor visits relative to the month they started appearing at the doctor.
Now, I want to distribute the actual summary level data to individuals, i.e. create a data frame that tells me during which month each individual started appearing at the doctor, and how many times they came for check-up until the cut-off date.
I am assuming this can be done with some type of random sampling, but the result needs to fit both the summary level information I have about the actual subjects as well as the boundary conditions telling how often a subject would be expected to come to the doctor relative to their joining time.
Here is some code that generates the target data frame that contains the month when the observation period starts, the respective number of doctor's visits that is expected (including boundary for minimum and maximum visits), and the associated percentage of subjects who start coming to the doctor during this month:
library(tidyverse)
months <- c("Nov", "Dec", "Jan", "Feb", "Mar", "Apr")
target.visits <- c(6,5,4,3,2,1)
percent <- c(0.8, 0.1, 0.05, 0.03, 0.01, 0.01)
df.target <- data.frame(month = months, target.visits = target.visits,
percent = percent) %>%
mutate(max.visits = c(7,6,5,4,3,2),
min.visits = c(5,4,3,2,1,1))
This is the data frame:
month target.visits percent max.visits min.visits
Nov 6 0.80 7 5
Dec 5 0.10 6 4
Jan 4 0.05 5 3
Feb 3 0.03 4 2
Mar 2 0.01 3 1
Apr 1 0.01 2 1
In addition, I can create the data frame that shows the actual number of subjects for each actual number of visits:
subj.n <- 1000
actual.visits = c(7,6,5,4,3,2,1)
actual.subject.perc = c(0.05,0.6,0.2,0.06,0.035, 0.035,0.02)
df.observed <- data.frame(actual.visits = actual.visits,
actual.subj.perc = actual.subject.perc, actual.subj.n = subj.n * actual.subject.perc)
Here is the data frame with the actual observations:
actual.visits actual.subj.perc actual.subj.n
7 0.050 50
6 0.600 600
5 0.200 200
4 0.060 60
3 0.035 35
2 0.035 35
1 0.020 20
Unfortunately I do not have any idea how to bring these together. I just know that if I have e.g. 60 subjects who come to the doctor 4 times during their observation period, I would like to randomly assign a starting month to each of them. However, based on the boundary conditions min.visits and max.visits, I know that it can only be a month from Dec to Feb.
Any thoughts are much appreciated.
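No complete solution is given in this thread, but one possible approach can be sketched (here in Python, with the numbers above): for each subject, restrict the candidate months to those whose [min.visits, max.visits] interval contains the subject's actual visit count, then sample a starting month from that set weighted by the target percentages. This is only one workable assumption: it respects the boundary conditions for every subject, but it does not force the aggregate monthly percentages to come out exactly.

```python
import random

# Boundary conditions per starting month (from df.target)
months     = ["Nov", "Dec", "Jan", "Feb", "Mar", "Apr"]
percent    = [0.80, 0.10, 0.05, 0.03, 0.01, 0.01]
max_visits = [7, 6, 5, 4, 3, 2]
min_visits = [5, 4, 3, 2, 1, 1]

# Observed subjects per actual visit count (from df.observed)
actual = {7: 50, 6: 600, 5: 200, 4: 60, 3: 35, 2: 35, 1: 20}

random.seed(1)
assignments = []  # (visit count, assigned starting month)
for visits, n in actual.items():
    # Months whose min/max interval allows this visit count
    eligible = [i for i in range(len(months))
                if min_visits[i] <= visits <= max_visits[i]]
    weights = [percent[i] for i in eligible]
    for _ in range(n):
        i = random.choices(eligible, weights=weights)[0]
        assignments.append((visits, months[i]))
```

For the 60 subjects with 4 visits, the eligible set comes out as Dec, Jan, Feb, matching the reasoning in the question; subjects with 7 visits can only start in Nov.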

Finding Correlations between data in dataframe (including binary)

I have a dataset called dolls.csv that I imported using
dolls <- read.csv("dolls.csv")
This is a snippet of the data
Name Review Year Strong Skinny Weak Fat Normal
Bell 3.5 1990 1 1 0 0 0
Jan 7.2 1997 0 0 1 0 1
Tweet 7.6 1987 1 1 0 0 0
Sall 9.5 2005 0 0 0 1 0
I am trying to run some preliminary analysis of this data. Name is the name of the doll, Review is a rating from 1-10, Year is the year made, and all columns after that are binary: 1 if the doll possesses the characteristic, 0 if it doesn't.
I ran
summary(dolls)
and get the headers, means, mins, and maxes of the values.
I am trying to see what the correlations are between the characteristics and Year or Review rating, to check whether there are patterns (for example, whether certain dolls have really high ratings yet unfavorable traits). I'm not sure how to construct charts or which functions to use in this case. I was considering ANOVA tail testing for outliers and means of different values, but I'm not sure how to compare values like this (in Python I'd run an if-then statement, but I don't know how to in R).
This is for a personal study I wanted to conduct and improve my R skills.
Thank you!
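In R, cor() on the numeric columns (e.g. cor(dolls[ , -1])) is the usual starting point. For reference, the same correlation-matrix idea sketched in Python/pandas (data typed in from the snippet above); Pearson correlation between a 0/1 column and a continuous one is the point-biserial correlation, so mixing binary and numeric columns in one matrix is legitimate.

```python
import pandas as pd

dolls = pd.DataFrame({
    "Name":   ["Bell", "Jan", "Tweet", "Sall"],
    "Review": [3.5, 7.2, 7.6, 9.5],
    "Year":   [1990, 1997, 1987, 2005],
    "Strong": [1, 0, 1, 0],
    "Skinny": [1, 0, 1, 0],
    "Weak":   [0, 1, 0, 0],
    "Fat":    [0, 0, 0, 1],
    "Normal": [0, 1, 0, 0],
})

# Pairwise Pearson correlations of all numeric columns
corr = dolls.drop(columns="Name").corr()
```

In this four-row snippet Strong and Skinny happen to be identical, so their correlation is exactly 1; on the full dataset, a heatmap of this matrix is a quick first chart for spotting the rating-vs-trait patterns you describe.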

Creating a Dummy Variable for Observations within a date range

I want to create a new dummy variable that prints 1 if my observation is within a certain set of date ranges, and a 0 if it's not. My dataset is a list of political contributions over a 10-year range, and I want to make a dummy variable to mark whether the donation came during a certain range of dates. I have 10 date ranges I'm looking at.
Does anyone know if the right way to do this is to create a loop? I've been looking at this question, which seems similar, but I think mine would be a bit more complicated: Creating a weekend dummy variable
By way of example, what I have is a variable listing the dates on which contributions were recorded, and I want to create a dummy showing whether each contribution came during a budget crisis. So, if there were a budget crisis from 2010-02-01 until 2010-03-25 and another from 2009-06-05 until 2009-07-30, the variable would ideally look like this:
Contribution Date  Budget Crisis
2009-06-01         0
2009-06-06         1
2009-07-30         1
2009-07-31         0
2010-01-31         0
2010-03-05         1
2010-03-26         0
Thanks yet again for your help!
This looks like a good opportunity to use the %in% syntax of the match(...) function.
dat <- data.frame(ContributionDate = as.Date(c("2009-06-01", "2009-06-06", "2009-07-30", "2009-07-31", "2010-01-31", "2010-03-05", "2010-03-26")), CrisisYes = NA)
crisisDates <- c(seq(as.Date("2010-02-01"), as.Date("2010-03-25"), by = "1 day"),
seq(as.Date("2009-06-05"), as.Date("2009-07-30"), by = "1 day")
)
dat$CrisisYes <- as.numeric(dat$ContributionDate %in% crisisDates)
dat
ContributionDate CrisisYes
1 2009-06-01 0
2 2009-06-06 1
3 2009-07-30 1
4 2009-07-31 0
5 2010-01-31 0
6 2010-03-05 1
7 2010-03-26 0
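Materializing every crisis day works, but with long ranges an interval test avoids building the day-by-day vector. A rough pandas version of the same dummy (dates from the example; both crisis windows are inclusive on both ends, and with 10 ranges you would just extend the crises list):

```python
import pandas as pd

contribs = pd.to_datetime(["2009-06-01", "2009-06-06", "2009-07-30",
                           "2009-07-31", "2010-01-31", "2010-03-05",
                           "2010-03-26"])

# (start, end) of each budget crisis, bounds inclusive
crises = [("2009-06-05", "2009-07-30"), ("2010-02-01", "2010-03-25")]

# 1 if a contribution date falls inside any crisis window, else 0
dummy = [int(any(pd.Timestamp(a) <= d <= pd.Timestamp(b) for a, b in crises))
         for d in contribs]
```

This reproduces the 0/1 column from the example table without enumerating individual days.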
