How do I find combined proportions in an R table? - r

Excuse the horrible title. I don't think I'd be able to summarise this problem in such few words.
So I have a table in R with data whose proportions by row have been calculated using the prop.table function ( prop.table(tab, 1) ). It looks like this:
The row headings (i.e. Q1-00-05, etc.) denote times of the day. The column headings TRUE and FALSE denote whether a particular 999 call was responded to within 10 minutes.
What I need from this table is the proportion of 999 calls responded to efficiently (< 10 mins) between 1800hrs and 0500 hrs.
I tried doing this:
tab2<-table(callouts$daytime=="Q4-18-23"|"Q1-00-05", callouts$tenmins)
but this proved fruitless. I got an error message saying:
operations are possible only for numeric, logical or complex types
I expected the table to come out with TRUE or FALSE as the row headings (for whether the callout time was within this time frame or not) and TRUE or FALSE as the column headings (for whether the response time was sub-10mins)
Any help would be much appreciated. Thanks!

Related

R mean of one column based on another [duplicate]

I have a dataset named bwght which contains the variable cigs (cigarattes smoked per day)
When I calculate the mean of cigs in the dataset bwght using:
mean(bwght$cigs), I get a number 2.08.
Only 212 of the 1388 women in the sample smoke (and 1176 does not smoke):
summary(bwght$cigs>0) gives the result:
Mode FALSE TRUE NA's
logical 1176 212 0
I'm asked to find the average of cigs among the women who smoke (the 212).
I'm having a hard time finding the right syntax for excluding the non smokers = 0
I have tried:
mean(bwght$cigs| bwght$cigs>0)
mean(bwght$cigs>0 | bwght$cigs=TRUE)
if (bwght$cigs > 0){
sum(bwght$cigs)
}
x <-as.numeric(bwght$cigs, rm="0");
mean(x)
But nothing seems to work! Can anyone please help me??
If you want to exclude the non-smokers, you have a few options. The easiest is probably this:
mean(bwght[bwght$cigs>0,"cigs"])
With a data frame, the first variable is the row and the next is the column. So, you can subset using dataframe[1,2] to get the first row, second column. You can also use logic in the row selection. By using bwght$cigs>0 as the first element, you are subsetting to only have the rows where cigs is not zero.
Your other ones didn't work for the following reasons:
mean(bwght$cigs| bwght$cigs>0)
This is effectively a logical comparison. You're asking for the TRUE / FALSE result of bwght$cigs OR bwght$cigs>0, and then taking the mean on it. I'm not totally sure, but I think R can't even take data typed as logical for the mean() function.
mean(bwght$cigs>0 | bwght$cigs=TRUE)
Same problem. You use the | sign, which returns a logical, and R is trying to take the mean of logicals.
if(bwght$cigs > 0){sum(bwght$cigs)}
By any chance, were you a SAS programmer originally? This looks like how I used to type at first. Basically, if() doesn't work the same way in R as it does in SAS. In that example, you are using bwght$cigs > 0 as the if condition, which won't work because R will only look at the first element of the vector resulting from bwght$cigs > 0. R handles looping differently from SAS - check out functions like lapply, tapply, and so on.
x <-as.numeric(bwght$cigs, rm="0")
mean(x)
I honestly don't know what this would do. It might work if rm="0" didn't have quotes...?
mean(bwght[bwght$cigs>0,"cigs"])
I found the statement failed, returning "argument is not numeric or logical: returning NA"
Converting to matrix solved this:
mean(data.matrix(bwght[bwght$cigs>0,"cigs"]))

Vectorizing R custom calculation with dynamic day range

I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building a ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want last 3 days, but only 2 records exist this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id=c(rep(1,5),rep(2,10))
,day=c(1:5,1:10)
,device_repaired=sample(0:1,15,replace=TRUE)
,device_replaced=sample(0:1,15,replace=TRUE))
# Exaxmple: How many times the device 1 was repaired over the last 2 days before day 3
# => getCalculation(3,1,data,"device_repaired",2)
getCalculation <- function(fday,fdeviceid,fdata,fattribute,fpreviousdays){
# Subset dataset
df = subset(fdata,day<fday & day>(fday-fpreviousdays-1) & device_id==fdeviceid)
# Make sure there's enough data; if so, make calculation
if(nrow(df)<fpreviousdays){
calculation = NA
} else {
calculation = sum(df[,fattribute])
}
return(calculation)
}
My problem is that the amount of attributes available (e.g. device_repaired) and the features to calculate (e.g. device_reparations_on_last_3days) has grown exponentially and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach which would also allow me to parallelize its execution, but I don't know if/how it's possible to add these arguments to a lapply function.

How to create contingency table with multiple criteria subpopulation from weighted data using svyby in the survey package?

I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted=svrepdesign(data=els, repweights = ~els$F3F1PNLWT,
combined.weights = TRUE).
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted out into a contingency table that shows frequency. I would like to sort people based on four variables (none of which are binary, but all of which are numeric) This is what I would like to get:
.
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1COMP for brevity's sake, but it has 9 outcomes – indexed 1-9)
My current code: svyby(~F1FCOMP, ~F1RTRCC +BYS33C +F1A10 +byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination, This is a picture of what my current output looks like. There are hundreds more rows, 1 for each combination, when I keep scrolling down.
This is a picture of what my current output looks like. There are hundreds more rows, 1 for each combination, when I keep scrolling down
.
You can see in that picture that I only get one number for F1FCOMP per row – the number of cases who fit the specified combination – a specific subpopulation. I want to know more about that subpopulation. That is, F1COMP has nine different outcomes (indexed 1-9), and I want to see how many of each subpopulation fits into each of the 9 outcomes of F1COMP.

Convert data frame to matrix without looping

The Question:
I have a data frame with a column that shows whether an event occurred, and columns for month, day, and year. These last 3 were converted to a date vector. I want to make a matrix that shows whether or not an event occurred within a given time period. In this matrix, a row represents a site and a column a date. I was able to write a for loop to do it, but it seemed like there might be a better way to do this, either with apply or some other basic operation. How would you do this?
The Code:
#Initialize events matrix
events = matrix(FALSE,nrow(predicted),ncol(predicted))
# Mark the presence of events
for (i in 1:nrow(events)){
if ((days_from_start[i]>-1)&(days_from_start[i]<=ncol(predicted)))
events[i,days_from_start[i]] = !input_data$Event[i]
}
The Background:
The next step is to compare the events matrix against various model outputs with the same shape. There are relatively few events in the data frame compared to the matrix size; the (probably incorrect) assumption is that the data frame completely lists all events and that unlisted matrix cells did not experience an event. I’m very new to R, so I’d be interested in hearing about other approaches to the same problem, if you think I’m going about it the hard way.
The Data:
> input_data$Event[1:5]
[1] FALSE FALSE FALSE FALSE TRUE
> input_data$Year[1:5]
[1] 2010 2010 2011 2010 2010
> days_from_start[1:5]
Time differences in days
[1] 834 1018 1106 847 1055
> dim(predicted)
[1] 649 732
Since events[i,days_from_start[i]] is accessing more or less random locations in the events matrix (since presumably you have no pattern to days_from_start ) , it may be difficult not to use a loop. Possibly something like the following will work. I haven't tested this since you posted no datasets.
foo<- (days_from_start>-1)&(days_from_start<=ncol(predicted) )
index_matrix<-cbind((1:i)[foo],days_from_start[(1:i)[foo]])
events[index_matrix]<-!input_data$Event[index_matrix[,1]]
What the first line does is create a vector of logicals, TRUE where you want to do something
The next line creates a set of index pairs where you're going to insert data into events matrix. The last line does the insertion.

Conditional mean statement

I have a dataset named bwght which contains the variable cigs (cigarattes smoked per day)
When I calculate the mean of cigs in the dataset bwght using:
mean(bwght$cigs), I get a number 2.08.
Only 212 of the 1388 women in the sample smoke (and 1176 does not smoke):
summary(bwght$cigs>0) gives the result:
Mode FALSE TRUE NA's
logical 1176 212 0
I'm asked to find the average of cigs among the women who smoke (the 212).
I'm having a hard time finding the right syntax for excluding the non smokers = 0
I have tried:
mean(bwght$cigs| bwght$cigs>0)
mean(bwght$cigs>0 | bwght$cigs=TRUE)
if (bwght$cigs > 0){
sum(bwght$cigs)
}
x <-as.numeric(bwght$cigs, rm="0");
mean(x)
But nothing seems to work! Can anyone please help me??
If you want to exclude the non-smokers, you have a few options. The easiest is probably this:
mean(bwght[bwght$cigs>0,"cigs"])
With a data frame, the first variable is the row and the next is the column. So, you can subset using dataframe[1,2] to get the first row, second column. You can also use logic in the row selection. By using bwght$cigs>0 as the first element, you are subsetting to only have the rows where cigs is not zero.
Your other ones didn't work for the following reasons:
mean(bwght$cigs| bwght$cigs>0)
This is effectively a logical comparison. You're asking for the TRUE / FALSE result of bwght$cigs OR bwght$cigs>0, and then taking the mean on it. I'm not totally sure, but I think R can't even take data typed as logical for the mean() function.
mean(bwght$cigs>0 | bwght$cigs=TRUE)
Same problem. You use the | sign, which returns a logical, and R is trying to take the mean of logicals.
if(bwght$cigs > 0){sum(bwght$cigs)}
By any chance, were you a SAS programmer originally? This looks like how I used to type at first. Basically, if() doesn't work the same way in R as it does in SAS. In that example, you are using bwght$cigs > 0 as the if condition, which won't work because R will only look at the first element of the vector resulting from bwght$cigs > 0. R handles looping differently from SAS - check out functions like lapply, tapply, and so on.
x <-as.numeric(bwght$cigs, rm="0")
mean(x)
I honestly don't know what this would do. It might work if rm="0" didn't have quotes...?
mean(bwght[bwght$cigs>0,"cigs"])
I found the statement failed, returning "argument is not numeric or logical: returning NA"
Converting to matrix solved this:
mean(data.matrix(bwght[bwght$cigs>0,"cigs"]))

Resources