t.test in R for multiple data sets

I know that R has a built-in function for t-tests, which is basically just t.test(y1, y2), etc.
But I'm having trouble accessing the data I wish to compare.
I basically have two separate data outputs.
The first one is just called 'data', and its output looks like this:
Time Kilometres
0 0
1 0.05
2 0.1
3 0.15
4 0.2
5 0.25
6 0.3
7 0.35
8 0.4
9 0.45
10 0.5
The other output is called 'hunt' and looks like this:
cuts: [20,25)
Time Kilometres
21 20 7.3
22 21 8.4
23 22 9.5
24 23 10.6
25 24 11.7
------------------------------------------------------------
cuts: [25,30)
Time Kilometres
26 25 12.8
27 26 13.9
28 27 15.0
29 28 16.1
30 29 17.2
------------------------------------------------------------
cuts: [30,35)
Time Kilometres
31 30 18.3
32 31 19.4
33 32 20.5
34 33 21.6
35 34 22.7
My question is: would it be possible to do separate t-tests for each cut? That is, get a separate p-value for each cut while comparing each cut with my first data set, 'data', so that
the p-value for cuts [0,5] = ...
cuts [5,10] = ...
etc.
Thanks again

So far it is not clear what you want to test within each group. t.test can be used as either a two-group test or a one-group test, and you have only one group per cut. When used as a one-group test, there needs to be an assumed value for the mean against which the test is run, but it's not clear what sort of reference value would be appropriate. I'm wondering if what you really want to do is a linear regression within each cut category to test for a trend? This would implicitly be testing against a slope of zero. This is lightly tested code:
lapply(split(dat, cut.grp),
       function(dgrp) summary(lm(Kilometres ~ Time, data = dgrp))$coefficients[, "Pr(>|t|)"])
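If, on the other hand, what you want really is a two-sample comparison of each cut's Kilometres against the Kilometres in 'data', a minimal sketch along the same lines (assuming, as above, that cut.grp holds the cut label for each row of hunt):
lapply(split(hunt, cut.grp),
       function(dgrp) t.test(dgrp$Kilometres, data$Kilometres)$p.value)
# returns one p-value per cut, each cut tested against all of 'data'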

Related

Error in computing the confusion matrix of a logistic model

I created a null logistic model in RStudio.
nullModel <- glm(train$bigFire ~ 1, data = train, family = binomial)
Then the model is asked to make predictions on the test set.
nullModel.pred <- predict(nullModel, test, type = "response")
At this point I want to compute the confusion matrix in order to evaluate the performance of the model.
CM <- table(test$bigFire, nullModel.pred>0.5)
The resulting output is the following:
    TRUE
  0   58
  1   46
Even if I change the cutoff value (now set to 0.5), the result is always the same. I don't understand why, since the model should perform differently with different cutoff values.
The dataset is the following:
month day FFMC DMC DC ISI temp RH wind rain zone bigFire
1 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 75 0
2 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 74 0
3 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 74 0
4 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 86 0
5 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 86 0
6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 86 0
It counts 517 rows.
The test and train sets are generated from the previous data frame with a split of 80% for train and 20% for test (104 rows).
The length of the prediction vector is:
> length(nullModel.pred)
[1] 104
and it always contains the same value, 0.542.
This is reasonable, since the null model can only estimate the overall probability of the response being 1.
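Since every prediction is the same constant, the table can only change when the cutoff crosses 0.542: below it everything is labelled TRUE, at or above it everything is labelled FALSE. A quick sketch of that (values taken from the output above):
preds <- rep(0.542, 104)  # what the null model predicts for every test row
table(preds > 0.5)        # all TRUE  -> the matrix shown above
table(preds > 0.6)        # all FALSE -> the only other possible outcome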

Comparing recent averaged values to a current value in R

I am using RStudio (version 0.99.903) on a PC (Windows 8). I have a follow-up question from yesterday, as the problem became more complicated. Here is what the data looks like:
Number Trial ID Open date Enrollment rate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0
What I need to do is compare the enrollment rate of the most current date within a given ID to the average of those values that are up to one year prior to it. For instance, for ID 53, the date of 1/19/2011 has an enrollment rate of 0.2 and I would want to compare this against the average of 8/17/2010 and 5/12/2010 enrollment rates (e.g., 0.15).
If there are no other dates within the ID prior to the current one, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010.
When I say "compare", I am not doing any analysis or visualization. I simply want a new column that holds the average of the enrollment rates up to one year prior to the current one (I will be plotting them and percentile-ranking them later). There are >20,000 data points. Any help would be much appreciated.
Here is a verbose but possibly high-performance way of doing this. There are no giant for loops over all the rows of the data frame; the two sapply loops operate only on a numeric vector, which should be relatively quick regardless of your row count. But I'm sure someone will waltz in with a trivial dplyr solution soon enough.
The approach assumes your data are sorted first by ID and then by Opendate. If they are not sorted, sort them first.
# Convert Opendate from character to Date so the date arithmetic below works
df$Opendate = as.Date(df$Opendate, format = "%m/%d/%Y")
# Find indices where the same ID appears both above and below
A = which(unlist(sapply(X = rle(df$ID)$lengths,
                        FUN = function(x) {
                          if (x == 1) return(F)
                          if (x == 2) return(c(F, F))
                          if (x >= 3) return(c(F, rep(T, x - 2), F))
                        })))
# Store the dates in a plain vector; should speed up the code a tiny bit
V_opendate = df$Opendate
# Further filter A to rows where both neighbouring date differences are < 365 days
B = A[sapply(A, function(x) (abs(V_opendate[x] - V_opendate[x - 1]) < 365) &
                            (abs(V_opendate[x] - V_opendate[x + 1]) < 365))]
# Collect the actual indices of the rows above (x - 1) and below (x + 1)
C = sapply(B, function(x) c(x - 1, x + 1), simplify = F)
# Take the mean of the enrollment rates of those neighbouring rows
D = sapply(C, function(x) mean(df[x, ]$Enrollmentrate))
# Create the new Rate column and fill it with the values of D.
# You can do the comparison from here.
df[B, "Rate"] = D
Number Trial ID Opendate Enrollmentrate Rate
1 420 NCT00091442 9 2005-01-28 0.2 NA
2 1476 NCT00301457 26 2008-02-22 1.0 NA
3 10559 NCT01307397 34 2011-07-28 0.6 NA
4 6794 NCT00948675 53 2010-05-12 0.0 NA
5 6451 NCT00917384 53 2010-08-17 0.3 0.10
6 8754 NCT01168973 53 2011-01-19 0.2 1.35
7 8578 NCT01140347 53 2011-12-30 2.4 0.25
8 11655 NCT01358877 53 2012-04-02 0.3 NA
9 428 NCT00091442 55 2005-09-07 0.1 NA
10 112 NCT00065325 62 2003-10-15 0.2 NA
11 477 NCT00091442 62 2005-11-11 0.1 NA
12 16277 NCT01843374 62 2013-12-16 0.2 NA
13 17386 NCT01905657 62 2014-01-08 0.6 NA
14 411 NCT00091442 66 2005-01-12 0.0 NA
The relevant rows are calculated. You can do your comparison with the newly created Rate column.
You might have to change the code a little, since I removed the spaces from the column names:
df = read.table(text = " Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0", header = T)
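For completeness, here is a dplyr sketch as well. It implements the more literal reading of the question, averaging every rate within the same ID that falls up to one year before the current row (it assumes Opendate has already been converted to Date as in the code above):
library(dplyr)
df %>%
  group_by(ID) %>%
  arrange(Opendate, .by_group = TRUE) %>%
  # for each row, average all strictly earlier rates within the past 365 days
  mutate(Rate = sapply(seq_along(Opendate), function(i) {
    prior <- Enrollmentrate[Opendate < Opendate[i] &
                            Opendate >= Opendate[i] - 365]
    if (length(prior) > 0) mean(prior) else NA_real_
  })) %>%
  ungroup()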

Spline interpolation R with conditions

I have a very large data set, structured as the sample below.
I have been trying to use the na.spline function in order to:
1) identify the "fips" categories with missing Yield;
2) if fewer than 3 Yield values are NA for a fips (here 1-3), have the spline function kick in and fill in the NAs;
3) if 3 or more Yields are NA for a "fips", remove the entire "fips" subset; in this case fips 2 should be removed.
My code so far:
finX <- dataset
finxx <- transform(subset(finX, ave(na.spline(finX$Yield), fips, FUN=sum)<2))
#or
finxx <- transform(subset(finX, ave(is.na(finX$Yield), fips, FUN=sum)<2))
Year fips Max Min Rain Yield
1980 1 24.7 0.0 71 37
1981 1 22.8 0.0 62 40
1982 1 22.6 0.0 47 37
1983 1 24.2 0.0 51 39
1984 1 23.8 0.0 61 47
1985 1 25.1 0.0 67 43
1980 2 24.8 0.0 72 34
1981 2 23.2 0.4 54 **NA**
1982 2 25.3 0.1 83 55
1983 2 23.0 0.0 68 **NA**
1984 2 22.4 0.7 70 **NA**
1985 2 24.6 0.0 47 31
1980 3 25.5 0.0 51 31
1981 3 25.5 0.0 51 31
1982 3 25.5 0.0 51 31
1983 3 25.5 0.0 51 **NA**
1984 3 25.5 0.0 51 31
...
Currently the code above either does not fill in all the NAs in the final product or simply produces no result at all.
Any guidance would be very useful, thank you.
Yield needs to be converted from character to numeric or NA. Then use by to divide finX into separate data frames by fips value. For each data frame with fewer than 3 NAs, do the spline interpolation; those with 3 or more are returned as NULL. Finally, combine the list of returned data frames into a single data frame. The code would look like:
library(zoo)
# convert finX$Yield values from character to either numeric or NA
finX$Yield <- sapply(finX$Yield, function(x) if (x == "**NA**") NA_real_ else as.numeric(x))
# use spline interpolation on fips sets with fewer than 3 NAs
finxx <- by(finX, finX$fips, function(x)
  if (sum(is.na(x$Yield)) < 3) transform(x, Yield = na.spline(object = Yield, x = Year)))
# combine results into a single data frame
finxx <- do.call(rbind, finxx)
Alternatively, after the conversion to numeric values, you could use ave on the Yield column, where spline interpolation returns values for fips sets with fewer than 3 NAs and all NAs for any other set. All rows with any NAs in the final result are then deleted. The code is as follows:
finxx2 <- transform(finX, Yield = ave(Yield, fips, FUN = function(x)
  if (sum(is.na(x)) < 3) na.spline(object = x) else NA))
finxx2 <- na.omit(finxx2)
Both versions give the same result for the sample data, but the first version using by lets you work with the full data frame for each fips set rather than with just Yield. Here that allowed Year to be specified as the x values for the spline interpolation, so a data set with a missing Year would still give the correct interpolation; the ave version would get an incorrect answer there. So the by version seems more robust.
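A small illustration of why passing x matters (a constructed toy example, not the OP's data): with a gap in Year, index-based and year-based interpolation disagree.
library(zoo)
yr <- c(1980, 1981, 1985)   # note the four-year gap
y  <- c(10, NA, 50)
na.spline(y)                # index-based: treats the three points as equally spaced
na.spline(y, x = yr)        # year-based: accounts for the gap, so the fill differs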
There's also the dplyr version, which is very much like the by version above and gives the same answer as the base R versions. If you're OK with working in dplyr, this is probably the most straightforward and robust approach.
library(dplyr)
finxx3 <- finX %>%
  group_by(fips) %>%
  filter(sum(is.na(Yield)) < 3) %>%
  mutate(Yield = na.spline(object = Yield, x = Year))
The first version returns
Year fips Max Min Rain Yield
1.1 1980 1 24.7 0 71 37
1.2 1981 1 22.8 0 62 40
1.3 1982 1 22.6 0 47 37
1.4 1983 1 24.2 0 51 39
1.5 1984 1 23.8 0 61 47
1.6 1985 1 25.1 0 67 43
3.13 1980 3 25.5 0 51 31
3.14 1981 3 25.5 0 51 31
3.15 1982 3 25.5 0 51 31
3.16 1983 3 25.5 0 51 31
3.17 1984 3 25.5 0 51 31

Manipulating Data in R

I have a data frame with the following structure:
transaction | customer | week | amount
12551 | ieeamo | 32 | €23.54
12553 | ieeamo | 33 | €17.00
I would like to get it in the following structure (for all weeks)
week | customer | activity last week | activity 2 weeks ago
32 | ieeamo | €0.00 | €0.00
33 | ieeamo | €23.54 | €0.00
34 | ieeamo | €17.00 | €23.54
35 | ieeamo | €0.00 | €17.00
Essentially, I am trying to convert transactional data to relative data.
My thought is that the best way to do this is to use loops to generate many data frames and then rbind them all at the end. However, this approach does not seem efficient, and I'm not sure it will scale to the data I am using.
Is there a better solution?
Rbinding is a bad idea for this, since each rbind creates a new copy of the data frame in memory. We can get to the answer more quickly with a mostly vectorized approach, using loops only to make code more concise. Props to the OP for recognizing the inefficiency and searching for a solution.
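As a quick illustration of that copying cost (a toy benchmark, not the OP's data), growing a data frame row by row is quadratic, while building a list and binding once is not:
# each rbind copies everything accumulated so far
grow <- function(n) {
  out <- data.frame()
  for (i in 1:n) out <- rbind(out, data.frame(x = i))
  out
}
system.time(grow(2000))
# building a list and rbinding once avoids the repeated copies
system.time(do.call(rbind, lapply(1:2000, function(i) data.frame(x = i))))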
Note: The following solution will work for any number of customers, but would require minor modification to work with more lag columns.
Setup: First we need to generate some data to work with. I'm going to use two different customers with a few weeks of transactional data each, like so:
data <- read.table(text="
transaction customer week amount
12551 cOne 32 1.32
12552 cOne 34 1.34
12553 cTwo 34 2.34
12554 cTwo 35 2.35
12555 cOne 36 1.36
12556 cTwo 37 1.37
", header=TRUE)
Step 1: Calculate some variables and initialize new data frame. To make the programming really easy, we first want to know two things: how many customers and how many weeks? We calculate those answers like so:
customer_list <- unique(data$customer)
# cOne cTwo
week_span <- min(data$week):max(data$week)
# 32 33 34 35 36 37
Next, we need to initialize the new data frame based on the variables we just calculated. In this new data frame, we need an entry for every week, not just the weeks in the data. This is where our 'week_span' variable comes in useful.
new_data <- data.frame(
  week = sort(rep(week_span, length(customer_list))),
  customer = customer_list,
  activity_last_week = NA,
  activity_2_weeks_ago = NA)
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne NA NA
# 4 33 cTwo NA NA
# 5 34 cOne NA NA
# 6 34 cTwo NA NA
# 7 35 cOne NA NA
# 8 35 cTwo NA NA
# 9 36 cOne NA NA
# 10 36 cTwo NA NA
# 11 37 cOne NA NA
# 12 37 cTwo NA NA
You'll notice we repeat the week list once per customer and sort it, so we get a vector resembling 1,1,2,2,3,3,...,n,n, with the number of repetitions equal to the number of customers in the data. This lets us specify the 'customer' column as just the list of customers, since R recycles that list to fill the space. The lag columns are left as NA for now.
Step 2: Fill in the lag values. Now things are pretty simple. We just need to grab the subset of rows for each customer and find out whether there were any transactions in each week. We do this by using the match function to pull out a value for every week. Where data do not exist, we get an NA value and replace it with zero (assuming no activity means a zero transaction). Then, for the lag columns, we just offset the values, padding with NA according to the number of weeks we are lagging.
# Loop through the customers.
for (i in 1:length(customer_list)) {
  # Select the next customer's data.
  subset <- data[data$customer == customer_list[i], ]
  # Extract the amount for each week in the full span.
  subset_amounts <- subset$amount[match(week_span, subset$week)]
  # Replace NA (no transaction that week) with zero.
  subset_amounts <- ifelse(is.na(subset_amounts), 0, subset_amounts)
  # Loop through the lag columns.
  for (lag in 1:2) {
    # Write in the amounts offset by the lag: pad the front with NA
    # and truncate the extra values at the end.
    new_data[new_data$customer == customer_list[i], (2 + lag)] <-
      c(rep(NA, lag), subset_amounts[1:(length(subset_amounts) - lag)])
  }
}
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne 1.32 NA
# 4 33 cTwo 0.00 NA
# 5 34 cOne 0.00 1.32
# 6 34 cTwo 0.00 0.00
# 7 35 cOne 1.34 0.00
# 8 35 cTwo 2.34 0.00
# 9 36 cOne 0.00 1.34
# 10 36 cTwo 2.35 2.34
# 11 37 cOne 1.36 0.00
# 12 37 cTwo 0.00 2.35
In other situations... If you have a series of ordered time data where no rows are missing, this sort of task becomes incredibly simple with the 'embed' function. Let's say we have some data that looks like this:
data <- data.frame(week=1:20, value=1:20+(1:20/100))
# week value
# 1 1 1.01
# 2 2 2.02
# 3 3 3.03
# 4 4 4.04
# 5 5 5.05
# 6 6 6.06
# 7 7 7.07
# 8 8 8.08
# 9 9 9.09
# 10 10 10.10
# 11 11 11.11
# 12 12 12.12
# 13 13 13.13
# 14 14 14.14
# 15 15 15.15
# 16 16 16.16
# 17 17 17.17
# 18 18 18.18
# 19 19 19.19
# 20 20 20.20
We could make a lagged data set in no time, like so:
new_data <- data.frame(week=data$week[3:20], embed(data$value,3))
names(new_data)[2:4] <- c("this_week", "last_week", "2_weeks_ago")
# week this_week last_week 2_weeks_ago
# 1 3 3.03 2.02 1.01
# 2 4 4.04 3.03 2.02
# 3 5 5.05 4.04 3.03
# 4 6 6.06 5.05 4.04
# 5 7 7.07 6.06 5.05
# 6 8 8.08 7.07 6.06
# 7 9 9.09 8.08 7.07
# 8 10 10.10 9.09 8.08
# 9 11 11.11 10.10 9.09
# 10 12 12.12 11.11 10.10
# 11 13 13.13 12.12 11.11
# 12 14 14.14 13.13 12.12
# 13 15 15.15 14.14 13.13
# 14 16 16.16 15.15 14.14
# 15 17 17.17 16.16 15.15
# 16 18 18.18 17.17 16.16
# 17 19 19.19 18.18 17.17
# 18 20 20.20 19.19 18.18
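For comparison, the lag construction in the main answer could also be sketched with dplyr and tidyr (an alternative approach, assuming the same 'data' from the setup): complete the week grid per customer with zero amounts, then lag within each customer.
library(dplyr)
library(tidyr)
new_data2 <- data %>%
  group_by(customer) %>%
  # give every customer a row for every week in the span, amount = 0 where absent
  complete(week = min(data$week):max(data$week), fill = list(amount = 0)) %>%
  arrange(week, .by_group = TRUE) %>%
  # shift the amounts down by one and two weeks within each customer
  mutate(activity_last_week   = lag(amount, 1),
         activity_2_weeks_ago = lag(amount, 2)) %>%
  ungroup()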

Customizing x-axis of graph

I am using scale_x_discrete() to customize the ticks and labels of the x-axis.
However, as the figure shows, the lines cut the right-side y-axis, which doesn't look good to me. Could you please help me fix this? The data (temp) are also shown below.
> a = ggplot(data = temp, aes(b, c, group=a,shape=a,colour=a), ordered=TRUE) + geom_line() + geom_point()
> a
> b = a + scale_x_discrete(breaks = c("2","4","8","16","32","64","128"), labels=c("2","4","8","16","32","64","128"))
> temp
a b c
1 One 2 5.1
2 One 4 6.6
3 One 8 7.7
4 One 16 8.4
5 One 32 16.1
6 One 64 38.0
7 One 128 49.2
8 Two 2 5.9
9 Two 4 7.7
10 Two 8 9.2
11 Two 16 10.3
12 Two 32 16.8
13 Two 64 32.4
14 Two 128 45.7
15 Three 2 4.7
16 Three 4 7.0
17 Three 8 8.5
18 Three 16 9.6
19 Three 32 14.8
20 Three 64 31.0
21 Three 128 34.5
22 Four 2 4.3
23 Four 4 6.9
24 Four 8 8.3
25 Four 16 9.1
26 Four 32 14.0
27 Four 64 23.8
Why are you using a discrete scale for something that appears to be continuous?
If you replace scale_x_discrete with scale_x_continuous, this should work as you wish.
b <- a + scale_x_continuous(breaks = 2^(1:7))
b
You might also be interested in a log-2 transformation of the axis, given that your values of b appear to be integer powers of 2:
a + scale_x_continuous(breaks = 2^(1:7), trans = 'log2')
There is also the "expand" argument (documented on the ggplot2 website); adjust the numbers to whatever look you are trying to achieve:
a + scale_x_discrete(breaks = c("2","4","8","16","32","64","128"),
                     labels = c("2","4","8","16","32","64","128"),
                     expand = c(.1, .1))
