I am trying to figure out different temperature ranges for specific locations (CB, HK, etc.) in my data frame,
it looks like this:
head(join)
OTU_num location date otus Depth DO Temperature pH Secchi.Depth
1 Otu0001 CB 03JUN09 21 0.0 7.60 21.0 3.68 NA
2 Otu0001 CB 03JUN09 21 0.5 8.27 16.4 3.68 NA
3 Otu0001 CB 03JUN09 21 1.0 7.65 14.9 3.68 NA
4 Otu0001 CB 03JUN09 21 1.5 5.26 12.2 3.25 NA
5 Otu0001 CB 03JUN09 21 2.0 4.01 10.1 3.25 NA
I am calculating the range using:
ranges <- join %>%
  group_by(location) %>%
  na.omit %>%
  mutate(min = min(Temperature), max = max(Temperature), subtract = min - max) %>%
  arrange(subtract)
Some of the temperature values are NA, so I used na.omit; however, it also seems to be removing rows with negative temperatures, so the ranges I get are wrong.
location min max subtract
MA 0.1 27.3 -27.2
I double checked using the range function for one of the locations (there are a lot and I did not want to use range for each location)
MA <- subset(join, location=="MA")
range(MA$Temperature, na.rm = TRUE)
[1] -2.2 27.6
Why are the values different? Any help is appreciated!!!
I think you should use join %>% filter(!is.na(Temperature)) instead of na.omit, so that only rows with an NA Temperature are removed. na.omit drops every row that has an NA in any column (note that Secchi.Depth is NA in all the rows shown), so rows where the temperature is fine but some other column is missing get discarded too, and that is skewing your min and max.
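A minimal sketch of the corrected pipeline, assuming dplyr is loaded and your data frame is join as above; summarise() is used instead of mutate() so you get one row per location:
library(dplyr)

ranges <- join %>%
  filter(!is.na(Temperature)) %>%   # drop only rows where Temperature is NA
  group_by(location) %>%
  summarise(min = min(Temperature),
            max = max(Temperature),
            range = max(Temperature) - min(Temperature)) %>%  # positive range
  arrange(desc(range))
For MA this gives min = -2.2 and max = 27.6, matching your range() check.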
I am trying to learn R and I have a data frame containing 68 continuous and categorical variables. There are two variables, x and lnx, on which I need help. Corresponding to the large number of 0's and NA's in x, lnx shows NA. Now I want to write code that takes log(x + 1) in order to replace those NA's in lnx with 0 wherever the corresponding x is 0 (if x == 0, I want lnx == 0; if x is NA, I want lnx to stay NA). The data frame looks something like this -
a b c d e f x lnx
AB1001 1.00 3.00 67.00 13.90 2.63 1776.7 7.48
AB1002 0.00 2.00 72.00 38.70 3.66 0.00 NA
AB1003 1.00 3.00 48.00 4.15 1.42 1917 7.56
AB1004 0.00 1.00 70.00 34.80 3.55 NA NA
AB1005 1.00 1.00 34.00 3.45 1.24 3165.45 8.06
AB1006 1.00 1.00 14.00 7.30 1.99 NA NA
AB1007 0.00 3.00 53.00 11.20 2.42 0.00 NA
I tried writing the following code -
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x +1)
but I get the following warning message, and the output is wrong:
number of items to replace is not a multiple of replacement length
Can someone guide me, please?
Thanks.
In R you can select rows using conditionals and assign values directly. In your example you could do this:
df[is.na(df$lnx) & df$x == 0,'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking whether, for each row, x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE.
We then use the bracket notation to select the lnx column of those rows where both conditions are TRUE in df and then insert the value 0 into those cells using <-
The specific warning you're getting arises because log(df$x + 1) and df$lnx[is.na(df$lnx)] have different lengths: log(df$x + 1) produces a vector whose length is the number of rows of your data frame, while the length of df$lnx[is.na(df$lnx)] is only the number of rows that have NA in lnx.
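If you want to keep the original log(x + 1) idea, one length-safe variant (a sketch, assuming your data frame is called df as above) is to build the logical index once and use it on both sides of the assignment:
# rows where lnx is missing but x is known; rows where x is NA stay NA
idx <- is.na(df$lnx) & !is.na(df$x)

# log(0 + 1) is 0, so rows with x == 0 end up with lnx == 0 as required
df$lnx[idx] <- log(df$x[idx] + 1)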
Using a dplyr solution:
library(dplyr)
df %>%
  mutate(lnx = case_when(
    x == 0 ~ 0,
    is.na(x) ~ NA_real_,
    TRUE ~ lnx))
The TRUE ~ lnx branch keeps the existing lnx values wherever x is neither 0 nor NA. This yields for your example:
# A tibble: 7 x 8
  a          b     c     d     e     f     x   lnx
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001    1.    3.   67. 13.9   2.63 1777.  7.48
2 AB1002    0.    2.   72. 38.7   3.66    0.  0.
3 AB1003    1.    3.   48.  4.15  1.42 1917.  7.56
4 AB1004    0.    1.   70. 34.8   3.55   NA  NA
5 AB1005    1.    1.   34.  3.45  1.24 3165.  8.06
6 AB1006    1.    1.   14.  7.30  1.99   NA  NA
7 AB1007    0.    3.   53. 11.2   2.42    0.  0.
I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = TRUE), digits = 0) # start at the minimum value; na.rm = TRUE, or min() returns NA because of the blank cells
Tmax <- round(max(TAir, na.rm = TRUE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all the values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am failing and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 19.99 and is thus given the value 18.5, just as 10.88 lies between 10 and 11.99 and is assigned the value 10.5.
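Note that 32.21 comes out as <NA> above because the break sequence stops at 32. If you want the top value binned as well, extend the breaks one interval past the maximum; a sketch reusing the same dat and V1 names:
v <- ceiling(max(dat$V1, na.rm = TRUE)) + 2   # go one bin past the maximum
breaks <- seq(8, v, by = 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
dat$Tbins <- cut(dat$V1, breaks = breaks, labels = labels)
With the sample data this puts 32.21 in the 32-33.99 bin with label 32.5.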
First off, I am learning to use dplyr after having used base R for most of my career (not really a data analyst, but trying to learn). I don't know if dplyr is the best option for this, or if I should use something else.
I have a data file generated by a piece of equipment that is very messy. There are header/tombstone data embedded within the data (time/date/location/sensor data for a specific location between rows of data for that location). The files are relatively large (150,000 observations x 14 variables), and I have successfully used dplyr to separate the actual data from the tombstone data (tombstone data has 6 rows of information spread over the 14 columns).
I am trying to create a single row of the tombstone information to append to the actual measurements so that it can be easily readable in R for analysis without relying on a "blackbox" solution from the manufacturer.
A sample of the data file and my script are provided below:
# Read csv file of data into R
data <- read_csv("data.csv", col_names = FALSE)
data
# A tibble: 155,538 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 80.00 19.00 0.00 37.0 1.0 0.0 3.00 NA NA NA NA NA NA
2 1.4e+01 8.00 6.00 13.00 43.0 9.0 33.0 50.00 1.00 -1.60 -2.00 50.10 14.88 NA
3 5.9e-01 5.15 2.02 -0.57 0.0 0.0 0.0 0.00 24.58 28.02 25.64 25.37 NA NA
4 0.0e+00 0.00 0.00 0.00 0.0 NA NA NA NA NA NA NA NA NA
5 3.0e+04 30000.00 -32768.00 -32768.00 0.0 NA NA NA NA NA NA NA NA NA
6 0.0e+00 0.00 0.00 0.00 0.0 0.0 0.0 0.25 20.30 NA NA NA NA NA
7 3.7e+01 cm BT counts 1.0 0.1 NA NA NA NA NA NA NA NA
8 NA 0.25 13.30 145.46 7.5 -11.0 2.1 0.80 157.00 149.00 158.00 143.00 100.00 2147483647
9 NA 0.35 13.37 144.54 7.8 -10.9 2.4 -0.40 153.00 150.00 148.00 146.00 100.00 2147483647
10 NA 0.45 14.49 144.65 8.4 -11.8 1.8 -0.90 139.00 156.00 151.00 152.00 100.00 2147483647
# ... with 155,528 more rows
# Get the header information from the file and create an index (ens) for each 6-row header block, to later append the header data to each line of measured data
header <- data %>%
  filter(!is.na(X1)) %>%
  mutate_all(as.character) %>%
  mutate(ens = rep(1:(n() / 6), each = 6)) %>%  # n() avoids referencing header before it exists
  group_by(ens)
n.head <- bind_cols(header[header$ens == 1,][1,], header[header$ens == 1,][2,],
                    header[header$ens == 1,][3,], header[header$ens == 1,][4,],
                    header[header$ens == 1,][5,], header[header$ens == 1,][6,])
Rows 2:7 have the information I am trying to work with. I know that creating a row of 90+ variables is not ideal, but this is a first step in cleaning this data up so that I can then work with it.
The last line, building n.head, produces what I am hoping to end up with, but I would rather not write a loop that runs ~20,000 times... Any help would be appreciated; thank you in advance!
The trick here is to use tidyr::spread() and tibble::enframe() to get the header columns spread out into a single-row data frame.
library(tidyverse)

header <- data[2:7, ] %>%
  # transpose the block and flatten it into a vector (row by row)
  t %>%
  as.vector %>%
  # then change it into a two-column (name, value) data frame in long format
  enframe %>%
  # then push that back into a wide format, i.e. 1 row and a bajillion columns
  spread(name, value)

# the measured data start after the 6-row header block
actualdata <- data[8:nrow(data), ]

# replicate the header row as many times as you have data rows
header <- header[rep(1, nrow(actualdata)), ]

# use bind_cols() to glue the header columns onto each row of the actual data
actualdata <- bind_cols(actualdata, header)
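To do this for every header block rather than just the first, one possible approach (a sketch, assuming dplyr >= 0.8 for group_split() and reusing the 6-row ens index from the question; flatten_block is a hypothetical helper) is:
library(tidyverse)

# hypothetical helper: flatten one 6-row header block into a single wide row
flatten_block <- function(block) {
  block %>%
    select(-ens) %>%   # drop the block index before flattening
    t() %>%
    as.vector() %>%
    enframe() %>%
    spread(name, value)
}

# split the header rows into their 6-row blocks and flatten each one
headers <- header %>%
  ungroup() %>%
  group_split(ens) %>%
  map_dfr(flatten_block)
Each row of headers then lines up with one block, ready to be joined back onto the corresponding measurements.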
I'm currently plotting several datasets of one day in R. The format of the dates in the datasets is yyyymmddhh. When I plot this, the formatting fails gloriously: on the x-axis, I now have 2016060125, 2016060150, etc., and a very weirdly shaped plot. What do I have to do to create a plot with a more "normal" date notation (e.g. June 1, 12:00 or just 12:00)?
Edit: the dates of these datasets are integers
The dataset looks like this:
> event_1
date P ETpot Q T fXS GRM_SintJorisweg
1 2016060112 0.0 0.151 0.00652 19.6 0.00477 0.39250
2 2016060113 0.0 0.134 0.00673 20.8 0.00492 0.38175
3 2016060114 0.0 0.199 0.00709 22.6 0.00492 0.36375
4 2016060115 0.0 0.201 0.00765 21.2 0.00492 0.36850
5 2016060116 19.4 0.005 0.00786 19.5 0.00492 0.36900
6 2016060117 2.8 0.005 0.00824 18.1 0.00492 0.36625
7 2016060118 2.6 0.017 0.00984 18.0 0.00508 0.35975
8 2016060119 9.7 0.000 0.01333 16.7 0.00555 0.34750
9 2016060120 7.0 0.000 0.01564 16.8 0.00524 0.33550
10 2016060121 4.1 0.000 0.01859 17.1 0.00524 0.32000
11 2016060122 9.5 0.000 0.02239 17.2 0.00539 0.30250
12 2016060123 2.6 0.000 0.03330 17.5 0.00555 0.27050
13 2016060200 11.6 0.000 0.03997 17.4 0.00555 0.23800
14 2016060201 0.9 0.000 0.04928 17.3 0.00555 0.21725
15 2016060202 0.0 0.000 0.05822 17.2 0.00555 0.20350
16 2016060203 2.3 0.002 0.06547 16.4 0.00555 0.18575
17 2016060204 0.0 0.016 0.07047 16.5 0.00555 0.16950
18 2016060205 0.0 0.027 0.07506 16.7 0.00555 0.16475
19 2016060206 0.0 0.070 0.07762 18.0 0.00555 0.16525
20 2016060207 0.0 0.285 0.08006 19.5 0.00555 0.14500
21 2016060208 0.0 0.224 0.08109 20.3 0.00555 0.15875
22 2016060209 0.0 0.362 0.07850 21.3 0.00555 0.17825
23 2016060210 0.0 0.433 0.07441 22.0 0.00524 0.19175
24 2016060211 0.0 0.417 0.07380 23.9 0.00492 0.19050
I want to plot the date on the x-axis and the Q on the y-axis
Create a minimal verifiable example with your data:
date_int <- c(2016060112,2016060113,2016060114,2016060115,2016060116,2016060117,2016060118,2016060119,2016060120,2016060121,2016060122,2016060123,2016060200,2016060201,2016060202,2016060203,2016060204,2016060205,2016060206,2016060207,2016060208,2016060209,2016060210,2016060211)
Q <- c(0.00652,0.00673,0.00709,0.00765,0.00786,0.00824,0.00984,0.01333,0.01564,0.01859,0.02239,0.0333,0.03997,0.04928,0.05822,0.06547,0.07047,0.07506,0.07762,0.08006,0.08109,0.0785,0.07441,0.0738)
df <- data.frame( date_int, Q)
So now we have a data frame df. You can convert its date_int column to a date format with hours and update the data frame:
date_time <- strptime(df$date_int, format = '%Y%m%d%H', tz= "UTC")
df$date_int <- date_time
Finally,
plot(df)
You will see a nice plot, with properly formatted dates on the x-axis.
P.S.: Please note that you need to use the format abbreviations specified in "Dates and Times in R" (e.g. "%Y%m%d%H" in this case).
Ref.: https://www.stat.berkeley.edu/~s133/dates.html
Here a lubridate answer:
library(lubridate)
event_1$date <- ymd_h(event_1$date)
or base R:
event_1$date <- as.POSIXct(as.character(event_1$date), format = "%Y%m%d%H")
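Either way, date is then a proper date-time, so plotting Q against it gives sensible axis labels; a minimal sketch using the question's event_1:
plot(event_1$date, event_1$Q, type = "l", xlab = "date", ylab = "Q")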
What is happening is that the dates are getting interpreted as numeric values. As indicated, you need to convert. To get the formatting correct, you need to do a little more:
set.seed(123)
library(lubridate)
## date
x <- ymd_h(2016060112)
y <- ymd_h(2016060223)
dfx <- data.frame(
date = as.numeric(format(seq(x, y, 3600), "%Y%m%d%H")),
yvar = rnorm(36))
dfx$date_x <- ymd_h(dfx$date)
# plot 1
plot(dfx$date, dfx$yvar)
Now using date_x which is POSIXct:
#plot 2
# converted to POSIXct
class(dfx$date_x)
## [1] "POSIXct" "POSIXt"
plot(dfx$date_x, dfx$yvar)
You will need to fix your date axis to get the format you desire:
#plot 3
# using axis.POSIXct to help things
with(dfx, plot(date_x, yvar, xaxt="n"))
r <- round(range(dfx$date_x), "hours")
axis.POSIXct(1, at = seq(r[1], r[2], by = "hour"), format = "%b-%d %H:%M")
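If you prefer ggplot2, the same conversion does the work and scale_x_datetime() takes care of the labels; a sketch reusing dfx from above (date_labels/date_breaks assume ggplot2 >= 2.0):
library(ggplot2)

ggplot(dfx, aes(date_x, yvar)) +
  geom_point() +
  scale_x_datetime(date_labels = "%b-%d %H:%M", date_breaks = "4 hours")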
I have a matrix (d) that looks like:
d <-
as.matrix(read.table(text = "
month Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13
X10 10 7.04 8.07 9.4 8.17 9.39 8.13 9.43 9.06 8.59 9.37 9.79 8.47 8.86
X11 11 12.10 11.50 12.6 13.70 11.90 11.50 13.10 17.20 19.00 14.60 13.70 13.20 16.10
X12 12 24.00 22.00 22.2 20.50 21.60 22.50 23.10 23.30 30.50 34.10 36.10 37.40 28.90
X1 1 18.30 16.30 16.2 14.80 16.60 15.40 15.20 14.80 16.70 14.90 15.00 13.80 15.90
X2 2 16.70 14.40 15.3 14.10 15.50 16.70 15.20 16.10 18.00 26.30 28.00 31.10 34.20",
header=TRUE))
It actually goes from Q1 to Q31 (they are the days in each month). What I would like to get is:
month day Q
10 1 7.04
10 2 8.07
and so on for the 31 days and the 12 months.
I have tried using the following code:
reshape(d, direction="long", varying = list(colnames(d)[2:32]), v.names="Q", idvar="month", timevar="day")
but I get the error :
Error in d[, timevar] <- times[1L] : subscript out of bounds
Can anyone tell me what is wrong with the code? I don't really understand the help file on "reshape", it's a bit confusing... Thanks!
Almost there - you're just missing as.data.frame(d) to make your matrix into a data frame. Also you don't need the list in varying - just a vector, so
reshape(as.data.frame(d), varying=colnames(d)[2:32], v.names="Q",
direction="long", idvar="month", timevar="day")
The help file is confusing as heck, not least because (as I've learned) the necessary information almost always actually is in there --- somewhere.
As a prime example, midway through the help file, there is this bit:
The function will attempt to guess the 'v.names' and 'times' from these names [i.e. the ones in the 'varying' argument]. The default is variable names like 'x.1', 'x.2', where 'sep = "."' specifies to split at the dot and drop it from the name. To have alphabetic followed by numeric times use 'sep = ""'.
That last sentence is the one you need here: "Q1", "Q2", etc. are indeed "alphabetic followed by numeric", so you need to set the sep = "" argument if reshape() is to know how to split apart those column names.
Try this:
res <- reshape(as.data.frame(d), idvar="month", timevar="day",
varying = -1, direction = "long", sep = "")
head(res[with(res, order(month,day)),])
# month day Q
# 1.1 1 1 18.3
# 1.2 1 2 16.3
# 1.3 1 3 16.2
# 1.4 1 4 14.8
# 1.5 1 5 16.6
# 1.6 1 6 15.4
The help file on reshape is not a bit confusing. It's a LOT confusing. Assuming your matrix has 12 rows (1 for each month) and 31 Q columns (I'm guessing you have NA values for months with fewer than 31 days), you could easily construct this by hand.
d <- data.frame(month = rep(d[,1], 31), day = rep(1:31, each = 12), Q = as.vector(d[,2:32]))
Now, back to your reshape... I'm guessing it's not parsing the names of your columns correctly. It might work better with Q.1, Q.2, etc. BTW, my reshaping above really depends on what you presented actually being a matrix and not a data.frame.
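For what it's worth, here is a quick sketch of that hand-built construction run against the 5-row, 13-column sample d shown above (so day runs 1:13 and each = nrow(d) rather than 12):
# as.vector() flattens a matrix column by column, so the month column repeats
# once per Q column, and each day index repeats once per row
long <- data.frame(month = rep(d[, 1], 13),
                   day   = rep(1:13, each = nrow(d)),
                   Q     = as.vector(d[, 2:14]))
head(long[order(long$month, long$day), ])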