I have a very large data set, structured as the sample below.
I have been trying to use the na.spline function in order to
1) identify the "fips" categories with missing Yield values.
2) if fewer than 3 Yield values are NA per fips (here 1-3), the spline function should kick in and fill in the NAs.
3) If 3 or more Yields are NA for a "fips", the code should remove the entire "fips" subset; in this case fips 2 should be removed.
My code so far:
finX <- dataset
finxx <- transform(subset(finX, ave(na.spline(finX$Yield), fips, FUN=sum)<2))
#or
finxx <- transform(subset(finX, ave(is.na(finX$Yield), fips, FUN=sum)<2))
Year fips Max Min Rain Yield
1980 1 24.7 0.0 71 37
1981 1 22.8 0.0 62 40
1982 1 22.6 0.0 47 37
1983 1 24.2 0.0 51 39
1984 1 23.8 0.0 61 47
1985 1 25.1 0.0 67 43
1980 2 24.8 0.0 72 34
1981 2 23.2 0.4 54 **NA**
1982 2 25.3 0.1 83 55
1983 2 23.0 0.0 68 **NA**
1984 2 22.4 0.7 70 **NA**
1985 2 24.6 0.0 47 31
1980 3 25.5 0.0 51 31
1981 3 25.5 0.0 51 31
1982 3 25.5 0.0 51 31
1983 3 25.5 0.0 51 **NA**
1984 3 25.5 0.0 51 31
...
Currently, the code above either does not fill in all the NAs in the final product, or simply produces no result at all.
Any guidance would be very useful, thank you.
Yield needs to be converted from character to numeric or NA. Then use by to divide finX into separate data frames by fips value. For each data frame with fewer than 3 NAs, do the spline interpolation; those with 3 or more are returned as NULL. Finally, combine the list of returned data frames into a single data frame. The code would look like:
library(zoo)
# convert finX$Yield values from character to either numeric or NA
finX$Yield <- sapply(finX$Yield, function(x) if(x =="**NA**") NA_real_ else as.numeric(x))
# use spline interpolation on fips sets with less than 3 NA's
finxx <- by(finX, finX$fips, function(x) if(sum(is.na(x$Yield)) < 3) transform(x, Yield=na.spline(object=Yield, x=Year)) )
# combine results into a single data frame
finxx <- do.call(rbind, finxx)
Alternatively, after the conversion to numeric values, you could use ave on the Yield column, where spline interpolation returns values on fips sets with fewer than 3 NAs and all NAs on any other sets. All rows with any NAs in the final result are then deleted. The code is as follows:
finxx2 <- transform(finX, Yield=ave(Yield, fips, FUN=function(x) if(sum(is.na(x)) < 3) na.spline(object=x) else NA))
finxx2 <- na.omit(finxx2)
Both versions give the same result for the sample data, but the first version using by allows you to work with a full data frame for each fips set rather than with just Yield. In this case, this allowed Year to be specified for the x values in the spline interpolation, so any data set with a missing Year would still give the correct interpolation; the ave version would get an incorrect answer. So the by version seems more robust. A small illustration of the difference follows.
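To see why the x values matter, consider a hypothetical fips group where the 1982 row is absent entirely and the 1983 Yield is NA (values made up for the example):
library(zoo)
yield <- c(37, 40, NA, 47)
years <- c(1980, 1981, 1983, 1984)   # note the gap: no 1982 row
na.spline(yield)                     # index-based: treats the four points as equally spaced
na.spline(yield, x = years)          # Year-based: accounts for the missing 1982 row
The two calls return different interpolated values for 1983; only the second respects the actual spacing of the years.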
There's also the dplyr version which is very much like the by version above and gives the same answer as the base R versions. If you're OK with working with dplyr, this is probably the most straightforward and robust approach.
library(dplyr)
finxx3 <- finX %>% group_by(fips) %>%
filter(sum(is.na(Yield)) < 3) %>%
mutate(Yield=na.spline(object=Yield, x=Year))
The first version returns
Year fips Max Min Rain Yield
1.1 1980 1 24.7 0 71 37
1.2 1981 1 22.8 0 62 40
1.3 1982 1 22.6 0 47 37
1.4 1983 1 24.2 0 51 39
1.5 1984 1 23.8 0 61 47
1.6 1985 1 25.1 0 67 43
3.13 1980 3 25.5 0 51 31
3.14 1981 3 25.5 0 51 31
3.15 1982 3 25.5 0 51 31
3.16 1983 3 25.5 0 51 31
3.17 1984 3 25.5 0 51 31
Related
I created a null logistic model in RStudio.
nullModel <- glm(train$bigFire ~ 1, data = train, family = binomial)
Then the model is asked to make predictions on the test set.
nullModel.pred <- predict(nullModel, test, type = "response")
At this point I want to compute the confusion matrix in order to evaluate the performance of the model.
CM <- table(test$bigFire, nullModel.pred>0.5)
The resulting output is the following:
TRUE
0 58
1 46
Even if I change the cutoff value (now set to 0.5), the result is always the same. I don't understand why, since the model should behave differently with different cutoff values.
The dataset is the following:
month day FFMC DMC DC ISI temp RH wind rain zone bigFire
1 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 75 0
2 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 74 0
3 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 74 0
4 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 86 0
5 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 86 0
6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 86 0
It has 517 rows.
The test and train sets are generated from the previous data frame with an 80% train / 20% test split (104 test rows).
The length of the prediction vector is:
> length(nullModel.pred)
[1] 104
and it always contains the same value: 0.542.
This is reasonable, since an intercept-only model can only estimate the overall probability that the response is 1. It also explains the confusion matrix: every prediction equals 0.542, so any cutoff below 0.542 classifies all 104 observations as TRUE and any cutoff above it classifies them all as FALSE; the table only changes when the cutoff crosses 0.542.
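A quick check (reusing test and nullModel.pred from the question):
# every prediction is 0.542, so the classification flips only when the
# cutoff crosses that value
table(test$bigFire, nullModel.pred > 0.5)   # cutoff below 0.542: all TRUE
table(test$bigFire, nullModel.pred > 0.6)   # cutoff above 0.542: all FALSE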
I'm trying to plot two boxplots in the same figure in base R, one boxplot for the values of PM2.5 of which RAIN is 0 and hour is 12, the other for values of PM2.5 when RAIN is greater than 0 and hour is 12. This is a small part of my data set, called 'dat':
No year month day hour PM2.5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN wd WSPM station
1 7345 2014 1 1 0 20 90 18 62 NA NA -1.5 1007.3 -12.5 0 SSE 0.6 Aotizhongxin
2 7346 2014 1 1 1 43 348 25 91 1100 1 -2.6 1006.9 -12.1 0 WSW 0.2 Aotizhongxin
3 7347 2014 1 1 2 79 423 41 103 1800 1 -3.0 1006.9 -11.3 0 WSW 0.6 Aotizhongxin
4 7348 2014 1 1 3 82 337 43 101 2100 1 -3.3 1006.4 -11.1 0 SW 0.6 Aotizhongxin
5 7349 2014 1 1 4 124 594 59 130 2400 1 -2.7 1006.1 -10.5 0 ENE 1.8 Aotizhongxin
6 7350 2014 1 1 5 89 307 47 102 2500 1 -3.1 1006.6 -10.4 0 N 1.0 Aotizhongxin
7 7351 2014 1 1 6 59 161 45 91 1900 1 -2.6 1007.2 -10.9 0 S 1.0 Aotizhongxin
8 7352 2014 1 1 7 31 93 24 69 900 4 -2.9 1007.9 -10.2 0 SE 1.1 Aotizhongxin
This is what I've tried so far but I just get one boxplot out from it:
not_rainy <- subset(dat, dat$hour == 12 & dat$RAIN == 0)
rainy <- subset(dat, dat$hour == 12 & dat$RAIN > 0)
vals <- c(rainy$PM2.5,not_rainy$PM2.5)
boxplot(vals)
What should I change with this?
There is no need to concatenate the two subsets; pass the two PM2.5 vectors to boxplot() separately:
not_rainy <- subset(dat, dat$hour == 12 & dat$RAIN == 0)
rainy <- subset(dat, dat$hour == 12 & dat$RAIN > 0)
# vals <- c(rainy$PM2.5,not_rainy$PM2.5)
boxplot(not_rainy$PM2.5, rainy$PM2.5, names = c("not rainy", "rainy"))
ggplot is usually the answer:
library(ggplot2)
dat$rainy <- dat$RAIN > 0
ggplot(dat[dat$hour == 12,], aes(rainy, PM2.5)) + geom_boxplot()
If you want to do it with the base plotting system:
Maybe you could try a multi-panel plot by adjusting the mfrow parameter:
par(mfrow=c(2, 1))
Then you can call boxplot() on the PM2.5 column of each subset:
boxplot(rainy$PM2.5)
boxplot(not_rainy$PM2.5)
You can also do it with ggplot2:
- You can map a categorical logical variable to the "x" argument of aes() and use + geom_boxplot(), as suggested by @dash2.
- Another option: put your data in long format by calling something like:
data.frame(PM2.5 = vals, rainy = c(rep(TRUE, times = nrow(rainy)), rep(FALSE, times = nrow(not_rainy))))
Then use + facet_wrap(~rainy) in your call to ggplot, as sketched below.
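Putting the long-format option together, a minimal sketch (reusing the rainy and not_rainy subsets from the question):
library(ggplot2)
long <- data.frame(PM2.5 = c(rainy$PM2.5, not_rainy$PM2.5),
                   rainy = c(rep(TRUE, nrow(rainy)), rep(FALSE, nrow(not_rainy))))
ggplot(long, aes(y = PM2.5)) + geom_boxplot() + facet_wrap(~rainy)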
Another option is to use the cowplot package:
Create two ggplot objects and save them*:
plot1<-ggplot(rainy, aes(y=PM2.5))+geom_boxplot()
plot2<-ggplot(not_rainy, aes(y=PM2.5))+geom_boxplot()
Then use cowplot::plot_grid():
cowplot::plot_grid(plot1, plot2)
*This can also be used with base plots
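For the base-plot case, a hedged sketch: as far as I know, newer versions of cowplot (1.0+) let plot_grid() accept one-sided formulas wrapping base graphics calls, so something like this should work:
# each formula is converted to a grob and placed in the grid
cowplot::plot_grid(~boxplot(rainy$PM2.5), ~boxplot(not_rainy$PM2.5))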
Let's use the airquality dataset as a base.
library(reshape2)
myaqm <- melt(airquality, id=c("Month", "Day"), na.rm = TRUE)
This gives me a simplification of my real dataset. So far, I've done this:
myaqm_dcast <- dcast(myaqm, Day+variable~Month, value.var = "value", sum, margins=c("Day", "variable", "Month"))
Which gives me this:
> head(myaqm_dcast, n=10)
Day variable 5 6 7 8 9 (all)
1 1 Ozone 41.0 0.0 135.0 39.0 96.0 311.0
2 1 Solar.R 190.0 286.0 269.0 83.0 167.0 995.0
3 1 Wind 7.4 8.6 4.1 6.9 6.9 33.9
4 1 Temp 67.0 78.0 84.0 81.0 91.0 401.0
5 1 (all) 305.4 372.6 492.1 209.9 360.9 1740.9
6 2 Ozone 36.0 0.0 49.0 9.0 78.0 172.0
7 2 Solar.R 118.0 287.0 248.0 24.0 197.0 874.0
8 2 Wind 8.0 9.7 9.2 13.8 5.1 45.8
9 2 Temp 72.0 74.0 85.0 81.0 92.0 404.0
10 2 (all) 234.0 370.7 391.2 127.8 372.1 1495.8
However, I'm trying to create an additional column next to each current numeric column, giving each variable's percentage of that day's subtotal. So my goal is something like:
Day variable 5 5(day %) 6 6(day %) 7 7(day %) 8 8(day %) 9 9(day %) (all) (all)(day %)
1 Ozone 41 13.4% 0 0.0% 135 27.4% 39 18.6% 96 26.6% 311 17.9%
1 Solar.R 190 62.2% 286 76.8% 269 54.7% 83 39.5% 167 46.3% 995 57.2%
1 Wind 7.4 2.4% 8.6 2.3% 4.1 0.8% 6.9 3.3% 6.9 1.9% 33.9 1.9%
1 Temp 67 21.9% 78 20.9% 84 17.1% 81 38.6% 91 25.2% 401 23.0%
1 (all) 305.4 100.0% 372.6 100.0% 492.1 100.0% 209.9 100.0% 360.9 100.0% 1740.9 100.0%
2 Ozone 36 15.4% 0 0.0% 49 12.5% 9 7.0% 78 21.0% 172 11.5%
2 Solar.R 118 50.4% 287 77.4% 248 63.4% 24 18.8% 197 52.9% 874 58.4%
2 Wind 8 3.4% 9.7 2.6% 9.2 2.4% 13.8 10.8% 5.1 1.4% 45.8 3.1%
2 Temp 72 30.8% 74 20.0% 85 21.7% 81 63.4% 92 24.7% 404 27.0%
2 (all) 234 100.0% 370.7 100.0% 391.2 100.0% 127.8 100.0% 372.1 100.0% 1495.8 100.0%
Sorry for the terrible formatting! But as you can hopefully see, the new additional columns give a percentage of each variable for that day and that month.
I've found another Stack Overflow helper suggesting tidyr and dplyr, but I just couldn't adapt their example to my needs. Would someone please show me what to do?
I wrote a percentage function and used that with dplyr. Then I join the columns together.
library(dplyr)
pct <- function(x) {x / sum(x)}
df <- myaqm_dcast %>%
  filter(variable != "(all)") %>%
  group_by(Day) %>%
  mutate_each(funs(pct), 3:8) %>%
  inner_join(myaqm_dcast, by = c("Day", "variable"))
Edit: You can modify the percentage function to format the output however you need (multiply by 100, paste on the % symbol); one possible variant is sketched below.
Edit 2: If you can live without the (all) rows, I've filtered them out. You can always calculate the column sums using the summarise_each() function.
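For that formatting, one possible variant of the percentage function (a sketch; pct_fmt is just an illustrative name):
# returns ready-to-print strings such as "13.4%"
pct_fmt <- function(x) paste0(round(100 * x / sum(x), 1), "%")
Using pct_fmt in place of pct in the mutate_each() call above gives the percentage columns as formatted strings.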
Why would you reshape your data? The data frame myaqm meets the requirements of tidy data (each column is a variable, each row is an observation). You can do your calculations in this format:
library(dplyr)
myaqm %>%
  group_by(Month, Day) %>%
  mutate(all = sum(value),
         perc = paste0(round(100 * value/all, 1), "%"))
Even for creating plots (e.g. with ggplot), this format is better suited than the reshaped one.
If really necessary you can reshape with tidyr/dplyr:
...
  select(-all) %>%
  gather(key, val, -c(Month:variable)) %>%
  unite(temp, Month, key) %>%
  spread(temp, val)
I am using RStudio (version 0.99.903) on a Windows 8 PC. I have a follow-up question from yesterday, as the problem became more complicated. Here is what the data looks like:
Number Trial ID Open date Enrollment rate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0
What I need to do is compare the enrollment rate of the most current date within a given ID to the average of those values that are up to one year prior to it. For instance, for ID 53, the date of 1/19/2011 has an enrollment rate of 0.2 and I would want to compare this against the average of the 8/17/2010 and 5/12/2010 enrollment rates (i.e., 0.15).
If there are no other dates within the ID prior to the current one, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010.
When I say "compare" I am not doing any analysis or visualization. I simply want a new column that takes the average value of those enrollment rates up to one year prior to the current one (I will be plotting them and percentile ranking them later). There are >20,000 data points. Any help would be much appreciated.
Here is a verbose but possibly high-performance way of doing this. There are no giant for loops over all the rows of the data frame; the two sapply loops only operate on a large numeric vector, which should be relatively quick regardless of your row count. But I'm sure someone will waltz in with a trivial dplyr solution soon enough.
The approach assumes that your data is sorted first by ID and then by Opendate. If it is not sorted, sort it first.
# Find indices where the same ID is above and below it
A = which(unlist(sapply(X = rle(df$ID)$lengths,
FUN = function(x) {if(x == 1) return(F)
if(x == 2) return(c(F,F))
if(x >= 3) return(c(F,rep(T, x-2),F))})))
# Store the vector of dates, should speed up the code a tiny bit
V_opendate = df$Opendate
# Further filter on A, where the date difference < 365 days
B = A[sapply(A, function(x) (abs(V_opendate[x]-V_opendate[x-1]) < 365) & (abs(V_opendate[x]-V_opendate[x+1]) < 365))]
# Return actual indices of rows - 1, rows +1
C = sapply(B, function(x) c(x-1, x+1), simplify = F)
# Actually take the mean of these cases
D = sapply(C, function(x) mean(df[x,]$Enrollmentrate))
# Create a new column Rate and fill it in with the values of D. You can do the comparison from here.
df[B,"Rate"] = D
Number Trial ID Opendate Enrollmentrate Rate
1 420 NCT00091442 9 2005-01-28 0.2 NA
2 1476 NCT00301457 26 2008-02-22 1.0 NA
3 10559 NCT01307397 34 2011-07-28 0.6 NA
4 6794 NCT00948675 53 2010-05-12 0.0 NA
5 6451 NCT00917384 53 2010-08-17 0.3 0.10
6 8754 NCT01168973 53 2011-01-19 0.2 1.35
7 8578 NCT01140347 53 2011-12-30 2.4 0.25
8 11655 NCT01358877 53 2012-04-02 0.3 NA
9 428 NCT00091442 55 2005-09-07 0.1 NA
10 112 NCT00065325 62 2003-10-15 0.2 NA
11 477 NCT00091442 62 2005-11-11 0.1 NA
12 16277 NCT01843374 62 2013-12-16 0.2 NA
13 17386 NCT01905657 62 2014-01-08 0.6 NA
14 411 NCT00091442 66 2005-01-12 0.0 NA
The relevant rows are calculated. You can do your comparison with the newly created Rate column.
You might have to change the code a little, since I removed the spaces in the column names:
df = read.table(text = " Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0", header = T)
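One detail worth noting: read.table leaves Opendate as a character/factor column, so the date arithmetic in the B step above (the < 365 day comparisons) needs it converted to Date first, e.g.:
# convert Opendate to Date class so differences are measured in days
df$Opendate = as.Date(df$Opendate, format = "%m/%d/%Y")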
I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use ChickWeight as the data, where "Chick-Diet" is the equivalent of your "Lat-Long", and "Time" is your date:
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need something like Lat + Long ~ Month + Day for your formula.
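Applied to the question's data, a sketch (assuming the data.table is named dt and has the columns shown above):
library(data.table)
# one row per Lat/Long location, one column per Month_Day combination
wide <- dcast.data.table(dt, Lat + Long ~ Month + Day, value.var = "Temperature")
This should give the 1,272-row by 92-day-column layout you describe, with columns named like 10_01 through 12_31.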
In the future, please make your question reproducible as I did here by using a built-in data set.
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[,-c(1:2,6)],date,Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA