R generate bins from a data frame respecting blanks - r

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # is start at minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am going wrong and how to solve it?

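# dat holds the temperatures in column V1, with blanks read in as NA
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))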
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 19.99 and is therefore given the value 18.5, just as 10.88 lies between 10 and 11.99 and is assigned the value 10.5.
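As a variation on the answer above, here is a minimal sketch that closes each bin on the left (so 8 falls in 8-9.99, matching the ranges listed in the question) and extends the breaks far enough to cover the maximum. The vector air_t is illustrative data copied from the question, with blanks assumed to have been read in as NA; the names width, lower and upper are not from the original.

```r
# Illustrative data: the question's temperatures, blanks read in as NA
air_t <- c(8.16, 10.88, 5.28, NA, 19.82, 23.62, 13.14, NA, NA,
           28.84, 32.21, 17.44, 31.21)

width <- 2   # bin width in degrees
lower <- 8   # first bin starts here; smaller values become NA

# Round the upper limit up to a whole number of bins above the maximum
upper <- lower + width * ceiling((max(air_t, na.rm = TRUE) - lower) / width)
breaks <- seq(lower, upper, by = width)

# Same labelling convention as the answer: lower edge + 0.5
labels <- head(breaks, -1) + 0.5

# right = FALSE gives intervals [8, 10), [10, 12), ...
bins <- cut(air_t, breaks = breaks, labels = labels, right = FALSE)

# cut() returns a factor; convert the labels back to numbers
t_bins <- as.numeric(as.character(bins))

# NA inputs stay NA and no rows are dropped
result <- data.frame(air_t, t_bins)
print(result)
```

Because cut() returns one value per input element, the blanks survive as NA rows rather than being collapsed, provided they were read in as NA in the first place.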

Related

Swapping data frame values randomly between different deciles of the data frame

It's a bit complicated to explain, so I hope it is clear enough, but if not I'll try to expand on it.
So I have a data-frame like this:
df <- data.frame(index=sort(runif(300, -10,10)), v1=runif(300, -2,-1), v2=runif(300, 1,2))
It gives us a 3-column 300-row df. The first column ("index") contains sorted values from -10 to 10, and the next two columns ("v1"/"v2") contain random numeric values that are not important for this issue.
Now I classify my df rows into deciles according to the index column, (e.g. decile 1: places 1-30, decile 2: places 31-60) and I want to swap randomly between the rows such that all the 1st decile values are swapped randomly with the 6th decile, all 2nd decile values are swapped randomly with the 7th decile, and so on. When I say swapped I mean that the index value remains in its place but the v1 and v2 values are swapped (still coupled) with the v1 and v2 of a random row in the appropriate decile.
For example, the v1 and v2 of the first row in the df (and thus from the 1st decile), will be swapped with the v1 and v2 of the 160th row in the df (6th decile), the v1 and v2 of the second row in the df (1st decile) will be swapped with the v1 and v2 of the 175th row in the df (also 6th decile), the v1 and v2 of the 31st row in the df (2nd decile) will be swapped with the v1 and v2 of the 186th row in the df (7th decile) and so on so all of the v1+v2 values have changed places randomly to their appropriate new decile.
Hope it's clear. I've been trying to solve it for hours and couldn't figure it out.
Thanks
Using order() to sort by two indices, one being the rearranged deciles, the other one random.
set.seed(123)
dtf <- data.frame(round(cbind(index=sort(runif(20, -10, 10)),
v1=runif(20, 0, 5),
v2=runif(20, 5, 10)), 2))
ea <- nrow(dtf)/10
# Deciles shifted by 5
d <- rep(((1:10 + 4) %% 10) + 1, each=ea)
# Random index within decile
r <- c(replicate(10, sample(ea)))
cbind(dtf, z=dtf[order(d, r), -1])
# index v1 v2 z.v1 z.v2
# 12 -9.16 4.45 5.71 4.51 7.21
# 11 -9.09 3.46 7.07 4.82 5.23
# 14 -7.94 3.20 7.07 3.98 5.61
# 13 -5.08 4.97 6.84 3.45 8.99
# 15 -4.25 3.28 5.76 0.12 7.80
# 16 -3.44 3.54 5.69 2.39 6.03
# 17 -1.82 2.72 6.17 3.79 5.64
# 18 -0.93 2.97 7.33 1.08 8.77
# 19 -0.87 1.45 6.33 1.59 9.48
# 20 0.56 0.74 9.29 1.16 6.87
# 2 1.03 4.82 5.23 3.46 7.07
# 1 1.45 4.51 7.21 4.45 5.71
# 3 3.55 3.45 8.99 3.20 7.07
# 4 5.77 3.98 5.61 4.97 6.84
# 6 7.66 0.12 7.80 3.54 5.69
# 5 7.85 2.39 6.03 3.28 5.76
# 8 8.00 3.79 5.64 2.97 7.33
# 7 8.81 1.08 8.77 2.72 6.17
# 10 9.09 1.59 9.48 0.74 9.29
# 9 9.14 1.16 6.87 1.45 6.33
I think that this is what you need.
swapByBlocks <- function(df, blockSize = 30, nblocks = 10){
if((nrow(df) != blockSize*nblocks) || nblocks %%2) stop("Undefined behaviour")
swappedDF <- df[c((nrow(df)/2 +1):nrow(df), 1:(nrow(df)/2)),]
ndxMat <- sapply(1:(nblocks/2),function(dummy) sample(1:blockSize))
for(i in 1:ncol(ndxMat)) {
swappedDF[(i-1)*blockSize + 1:blockSize, ] <- swappedDF[((i-1)*blockSize + 1:blockSize)[ndxMat[,i]], ]
swappedDF[(i+nblocks/2-1)*blockSize + 1:blockSize, ] <- swappedDF[((i+nblocks/2-1)*blockSize + 1:blockSize)[order(ndxMat[,i])], ]
}
return(swappedDF)
}
A small case where you can check how it works:
res <- swapByBlocks(df[1:18,], blockSize = 3, nblocks = 6)
> df[1:18,]
index v1 v2
1 -9.859624 -1.657779 1.954094
2 -9.774898 -1.015825 1.006341
3 -9.624402 -1.713754 1.527065
4 -9.441129 -1.891834 1.803793
5 -9.424195 -1.125674 1.581199
6 -8.890537 -1.142044 1.219111
7 -8.838012 -1.173445 1.013408
8 -8.296938 -1.780396 1.570550
9 -8.172076 -1.789056 1.178596
10 -7.671897 -1.988539 1.690468
11 -7.655868 -1.095662 1.876414
12 -7.450011 -1.337443 1.632104
13 -7.204528 -1.880350 1.408944
14 -7.085862 -1.232293 1.593247
15 -7.030691 -1.087031 1.924306
16 -6.989892 -1.639967 1.495058
17 -6.978945 -1.395340 1.872944
18 -6.930379 -1.841031 1.061046
> res
index v1 v2
10 -7.450011 -1.337443 1.632104
11 -7.655868 -1.095662 1.876414
12 -7.671897 -1.988539 1.690468
13 -7.030691 -1.087031 1.924306
14 -7.085862 -1.232293 1.593247
15 -7.204528 -1.880350 1.408944
16 -6.989892 -1.639967 1.495058
17 -6.930379 -1.841031 1.061046
18 -6.978945 -1.395340 1.872944
1 -9.624402 -1.713754 1.527065
2 -9.774898 -1.015825 1.006341
3 -9.859624 -1.657779 1.954094
4 -8.890537 -1.142044 1.219111
5 -9.424195 -1.125674 1.581199
6 -9.441129 -1.891834 1.803793
7 -8.838012 -1.173445 1.013408
8 -8.172076 -1.789056 1.178596
9 -8.296938 -1.780396 1.570550
>
Here there are 18 rows with six blocks of three rows each. Rows 1 to 3 get swapped with rows 10 to 12, rows 4 to 6 with rows 13 to 15, and rows 7 to 9 with rows 16 to 18.
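For comparison, the question asks for index to stay in place while only the coupled v1/v2 values move. A minimal sketch of that variant is below; it is not either answerer's method, and the names n_dec, size, src and swapped are introduced here for illustration. It assumes the number of rows is a multiple of 10.

```r
set.seed(1)
df <- data.frame(index = sort(runif(300, -10, 10)),
                 v1 = runif(300, -2, -1),
                 v2 = runif(300, 1, 2))

n_dec <- 10
size <- nrow(df) / n_dec                      # rows per decile
decile <- rep(seq_len(n_dec), each = size)    # decile of each row

# src[i] is the row whose v1/v2 end up in row i
src <- integer(nrow(df))
for (k in seq_len(n_dec / 2)) {
  a <- which(decile == k)                     # rows of decile k
  b <- sample(which(decile == k + n_dec / 2)) # shuffled rows of decile k + 5
  src[a] <- b                                 # row a[i] receives from b[i]
  src[b] <- a                                 # and b[i] receives from a[i]
}

swapped <- df
swapped[, c("v1", "v2")] <- df[src, c("v1", "v2")]  # index column untouched
head(cbind(swapped, from_row = src))
```

Each row is paired with exactly one partner in the opposite decile, so applying the same src mapping twice restores the original data frame.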

na.omit seems to be removing negative values in my data frame?

I am trying to figure out different temperature ranges for specific locations (CB, HK, etc.) in my data frame,
it looks like this:
'head(join)'
OTU_num location date otus Depth DO Temperature pH Secchi.Depth
1 Otu0001 CB 03JUN09 21 0.0 7.60 21.0 3.68 NA
2 Otu0001 CB 03JUN09 21 0.5 8.27 16.4 3.68 NA
3 Otu0001 CB 03JUN09 21 1.0 7.65 14.9 3.68 NA
4 Otu0001 CB 03JUN09 21 1.5 5.26 12.2 3.25 NA
5 Otu0001 CB 03JUN09 21 2.0 4.01 10.1 3.25 NA
I am calculating the range using:
ranges <- join %>%
group_by(location) %>%
na.omit %>%
mutate(min=min(Temperature), max=max(Temperature), subtract=min-max) %>%
arrange(subtract)
Some of the temperature values are "NA", so I used na.omit. However, it appears to be taking out the negative values as well, so the ranges I get are wrong.
location min max subtract
MA 0.1 27.3 -27.2
I double checked using the range function for one of the locations (there are a lot and I did not want to use range for each location)
MA <- subset(join, location=="MA")
range(MA$Temperature, na.rm = TRUE)
[1] -2.2 27.6
Why are the values different? Any help is appreciated!!!
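The negative values are not being singled out: na.omit() drops every row that has an NA in any column, so rows with a valid (including negative) Temperature are lost whenever another column such as Secchi.Depth is NA. I think you should use join %>% filter(!is.na(Temperature)) instead, so that only rows with a missing Temperature are removed.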
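The difference can be seen on a small made-up data frame (the values below are invented for illustration and only mimic the shape of the question's data). This sketch uses base R subsetting, which does the same job as the filter() call.

```r
# Toy data: Temperature is complete for MA, but Secchi.Depth has gaps
join <- data.frame(location     = c("MA", "MA", "MA", "CB"),
                   Temperature  = c(-2.2, 0.1, 27.6, NA),
                   Secchi.Depth = c(NA, 1.5, NA, 2.0))

# na.omit() drops a row if ANY column is NA, so only row 2 survives
omitted <- na.omit(join)
range(omitted$Temperature)

# Dropping rows on Temperature alone keeps the negative value
kept <- join[!is.na(join$Temperature), ]
ma_range <- range(kept$Temperature[kept$location == "MA"])
ma_range
```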

How to apply a function from a package to a dataframe

How can I apply a package function to a data frame ?
I have a data set (df) with two columns (total and n) to which I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package, with x = df$n and pt = df$total, to get a "new" data frame (new_df) with 3 more columns holding the corresponding rounded rates and their lower and upper CI limits.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
total n
1 35725302 24
2 35627717 66
3 34565295 166
4 36170648 461
5 38957933 898
6 36579643 1416
7 29628394 1781
8 18212075 1284
9 9562754 329
In fact, the data frame is much longer.
For example, for the first row the desired results are:
require (epitools)
round (pois.exact (24, pt = 35725302, conf.level = 0.95)* 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new dataframe with the added results by applying the pois.exact function should look like that.
> new_df
total n incidence lower_95IC uppper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
df %>%
cbind( pois.exact(df$n, df$total) ) %>%
dplyr::select( total, n, rate, lower, upper )
# total n rate lower upper
# 1 35725302 24 1488554.25 1488066.17 1489042.45
# 2 35627717 66 539813.89 539636.65 539991.18
# 3 34565295 166 208224.67 208155.26 208294.10
# 4 36170648 461 78461.28 78435.71 78486.85
# 5 38957933 898 43383.00 43369.38 43396.62
# 6 36579643 1416 25833.08 25824.71 25841.45
# 7 29628394 1781 16635.82 16629.83 16641.81
# 8 18212075 1284 14183.86 14177.35 14190.37
# 9 39562754 329 120251.53 120214.06 120289.01
# 10 1265055 12 105421.25 105237.62 105605.12
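To get rates per 100,000 rounded to two decimals, as in the desired output, the count has to be the numerator and the result scaled and rounded. The sketch below does this in base R without epitools, using the standard exact Poisson interval based on gamma quantiles. That this matches what pois.exact computes internally is an assumption and has not been checked here; the names alpha and per are introduced for illustration, and the new columns are named lower_95CI and upper_95CI rather than copying the question's spelling.

```r
df <- data.frame(total = c(35725302, 35627717, 34565295, 36170648, 38957933,
                           36579643, 29628394, 18212075, 39562754, 1265055),
                 n = c(24, 66, 166, 461, 898, 1416, 1781, 1284, 329, 12))

alpha <- 0.05   # 1 - confidence level
per <- 1e5      # report rates per 100,000

# Exact Poisson limits for a count n: qgamma(alpha/2, n) and
# qgamma(1 - alpha/2, n + 1), each divided by the person-time
new_df <- transform(df,
  incidence  = round(n / total * per, 2),
  lower_95CI = round(qgamma(alpha / 2, n) / total * per, 2),
  upper_95CI = round(qgamma(1 - alpha / 2, n + 1) / total * per, 2))

new_df
```

All three columns are computed in a vectorised way, so the same call should work unchanged on a much longer data frame.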

Weird date-time format on x-axis plot

I'm currently plotting several datasets of one day in R. The format of the dates in the datasets is yyyymmddhh. When I plot this, the formatting fails gloriously: on the x-axis, I now have 2016060125, 2016060150, etc. and a very weirdly shaped plot. What do I have to do to create a plot with a more "normal" date notation (e.g. June 1, 12:00 or just 12:00)?
Edit: the dates of these datasets are integers
The dataset looks like this:
> event_1
date P ETpot Q T fXS GRM_SintJorisweg
1 2016060112 0.0 0.151 0.00652 19.6 0.00477 0.39250
2 2016060113 0.0 0.134 0.00673 20.8 0.00492 0.38175
3 2016060114 0.0 0.199 0.00709 22.6 0.00492 0.36375
4 2016060115 0.0 0.201 0.00765 21.2 0.00492 0.36850
5 2016060116 19.4 0.005 0.00786 19.5 0.00492 0.36900
6 2016060117 2.8 0.005 0.00824 18.1 0.00492 0.36625
7 2016060118 2.6 0.017 0.00984 18.0 0.00508 0.35975
8 2016060119 9.7 0.000 0.01333 16.7 0.00555 0.34750
9 2016060120 7.0 0.000 0.01564 16.8 0.00524 0.33550
10 2016060121 4.1 0.000 0.01859 17.1 0.00524 0.32000
11 2016060122 9.5 0.000 0.02239 17.2 0.00539 0.30250
12 2016060123 2.6 0.000 0.03330 17.5 0.00555 0.27050
13 2016060200 11.6 0.000 0.03997 17.4 0.00555 0.23800
14 2016060201 0.9 0.000 0.04928 17.3 0.00555 0.21725
15 2016060202 0.0 0.000 0.05822 17.2 0.00555 0.20350
16 2016060203 2.3 0.002 0.06547 16.4 0.00555 0.18575
17 2016060204 0.0 0.016 0.07047 16.5 0.00555 0.16950
18 2016060205 0.0 0.027 0.07506 16.7 0.00555 0.16475
19 2016060206 0.0 0.070 0.07762 18.0 0.00555 0.16525
20 2016060207 0.0 0.285 0.08006 19.5 0.00555 0.14500
21 2016060208 0.0 0.224 0.08109 20.3 0.00555 0.15875
22 2016060209 0.0 0.362 0.07850 21.3 0.00555 0.17825
23 2016060210 0.0 0.433 0.07441 22.0 0.00524 0.19175
24 2016060211 0.0 0.417 0.07380 23.9 0.00492 0.19050
I want to plot the date on the x-axis and the Q on the y-axis
Create a minimal verifiable example with your data:
date_int <- c(2016060112,2016060113,2016060114,2016060115,2016060116,2016060117,2016060118,2016060119,2016060120,2016060121,2016060122,2016060123,2016060200,2016060201,2016060202,2016060203,2016060204,2016060205,2016060206,2016060207,2016060208,2016060209,2016060210,2016060211)
Q <- c(0.00652,0.00673,0.00709,0.00765,0.00786,0.00824,0.00984,0.01333,0.01564,0.01859,0.02239,0.0333,0.03997,0.04928,0.05822,0.06547,0.07047,0.07506,0.07762,0.08006,0.08109,0.0785,0.07441,0.0738)
df <- data.frame( date_int, Q)
So, now we have a dataframe 'df'
With the dataframe 'df' you can convert your date_int column to a date format with hours and update the dataframe:
date_time <- strptime(df$date_int, format = '%Y%m%d%H', tz= "UTC")
df$date_int <- date_time
Finally,
plot(df)
You will see a nice plot!
Ps.: Please note that you need to use abbreviations specified on "Date and Times in R" (e.g. "%Y%m%d%H" in this case)
Ref.: https://www.stat.berkeley.edu/~s133/dates.html
Here is a lubridate answer:
library(lubridate)
event_1$date <- ymd_h(event_1$date)
or base R:
event_1$date <- as.POSIXct(as.character(event_1$date), format = "%Y%m%d%H", tz = "UTC")
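A short base R sketch of why the conversion matters: treated as plain numbers, consecutive hours across midnight are far apart, which is what distorts the plot. The three values are taken from the question's data; the variable names are illustrative, and the time zone UTC is an assumption.

```r
# Three timestamps from the question, stored as numbers
date_int <- c(2016060112, 2016060123, 2016060200)

# As numbers, 23:00 and 00:00 the next day differ by 77 units
diff(date_int)

# Convert via character so the format string is actually applied
date_time <- as.POSIXct(as.character(date_int),
                        format = "%Y%m%d%H", tz = "UTC")

# As date-times, the same two points are one hour apart
gap_hours <- as.numeric(difftime(date_time[3], date_time[2], units = "hours"))
gap_hours

# Label format can be chosen freely once the class is POSIXct
labels <- format(date_time, "%Y-%m-%d %H:%M")
labels
```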
What is happening is the dates are getting interpreted as numeric classes. As indicated, you need to convert. To get the formatting correct, you need to do a little more:
set.seed(123)
library(lubridate)
## date
x <- ymd_h(2016060112)
y <- ymd_h(2016060223)
dfx <- data.frame(
date = as.numeric(format(seq(x, y, 3600), "%Y%m%d%H")),
yvar = rnorm(36))
dfx$date_x <- ymd_h(dfx$date)
# plot 1
plot(dfx$date, dfx$yvar)
Now using date_x which is POSIXct:
#plot 2
# converted to POSIXct
class(dfx$date_x)
## [1] "POSIXct" "POSIXt"
plot(dfx$date_x, dfx$yvar)
You will need to fix your date axis to get the format you desire:
#plot 3
# using axis.POSIXct to help things
with(dfx, plot(date_x, yvar, xaxt="n"))
r <- round(range(dfx$date_x), "hours")
axis.POSIXct(1, at = seq(r[1], r[2], by = "hour"), format = "%b-%d %H:%M")

How to reshape a matrix

I have a matrix (d) that looks like:
d <-
as.matrix(read.table(text = "
month Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13
X10 10 7.04 8.07 9.4 8.17 9.39 8.13 9.43 9.06 8.59 9.37 9.79 8.47 8.86
X11 11 12.10 11.50 12.6 13.70 11.90 11.50 13.10 17.20 19.00 14.60 13.70 13.20 16.10
X12 12 24.00 22.00 22.2 20.50 21.60 22.50 23.10 23.30 30.50 34.10 36.10 37.40 28.90
X1 1 18.30 16.30 16.2 14.80 16.60 15.40 15.20 14.80 16.70 14.90 15.00 13.80 15.90
X2 2 16.70 14.40 15.3 14.10 15.50 16.70 15.20 16.10 18.00 26.30 28.00 31.10 34.20",
header=TRUE))
going from Q1 to Q31 (it's the days in each month). What I would like to get is:
month day Q
10 1 7.04
10 2 8.07
and so on for the 31 days and the 12 months.
I have tried using the following code:
reshape(d, direction="long", varying = list(colnames(d)[2:32]), v.names="Q", idvar="month", timevar="day")
but I get the error :
Error in d[, timevar] <- times[1L] : subscript out of bounds
Can anyone tell me what is wrong with the code? I don't really understand the help file on "reshape", it's a bit confusing... Thanks!
Almost there - you're just missing as.data.frame(d) to make your matrix into a data frame. Also you don't need the list in varying - just a vector, so
reshape(as.data.frame(d), varying=colnames(d)[2:32], v.names="Q",
direction="long", idvar="month", timevar="day")
The help file is confusing as heck, not least because (as I've learned) the necessary information almost always actually is in there --- somewhere.
As a prime example, midway through the help file, there is this bit:
The function will
attempt to guess the ‘v.names’ and ‘times’ from these names [i.e. the ones in the 'varying' argument]. The
default is variable names like ‘x.1’, ‘x.2’, where ‘sep = "."’
specifies to split at the dot and drop it from the name. To have
alphabetic followed by numeric times use ‘sep = ""’.
That last sentence is the one you need here: "Q1", "Q2", etc. are indeed "alphabetic followed by numeric", so you need to set sep = "" argument if reshape() is to know how to split apart those column names.
Try this:
res <- reshape(as.data.frame(d), idvar="month", timevar="day",
varying = -1, direction = "long", sep = "")
head(res[with(res, order(month,day)),])
# month day Q
# 1.1 1 1 18.3
# 1.2 1 2 16.3
# 1.3 1 3 16.2
# 1.4 1 4 14.8
# 1.5 1 5 16.6
# 1.6 1 6 15.4
The help file on reshape is not a bit confusing. It's a LOT confusing. Assuming your matrix has 12 rows (one for each month) and 31 day columns after the month column (I'm guessing you have NA values for months with fewer than 31 days), you could easily construct this by hand.
d <- data.frame(month = rep(d[,1], 31), day = rep(1:31, each = 12), Q = as.vector(d[,2:32]))
Now, back to your reshape... I'm guessing it's not parsing the names of your columns correctly. It might work better with Q.1, Q.2, etc. BTW, my reshaping above really depends on what you presented actually being a matrix and not a data.frame.
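The by-hand construction can be checked on a cut-down version of the matrix. This is a sketch with only two months and three day columns, taken from the question's data; q_cols and long are names introduced here. It derives the repeat counts from the matrix itself instead of hard-coding 12 and 31.

```r
# Cut-down version of the question's matrix: 2 months, 3 days
d <- as.matrix(read.table(text = "
    month    Q1    Q2    Q3
X10    10  7.04  8.07  9.40
X11    11 12.10 11.50 12.60", header = TRUE))

q_cols <- colnames(d)[-1]   # every column except month

# as.vector() unrolls the matrix column by column: all months for Q1,
# then all months for Q2, and so on, so month cycles fastest
long <- data.frame(
  month = rep(unname(d[, "month"]), times = length(q_cols)),
  day   = rep(seq_along(q_cols), each = nrow(d)),
  Q     = as.vector(d[, q_cols]))

# Sort so that each month's days are listed together
long <- long[order(long$month, long$day), ]
long
```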

Resources