Replace values in a table in R - r

I have this dataset
Longitude Latitude Radius Site_Type
<dbl> <dbl> <dbl> <chr>
1 -102. 1.5 5 OBS
2 -80.0 27.1 5 OBS
3 -158. 21.5 1 FEE;OBS
4 -81.6 3.98 1 FEE;OBS;NA
5 -87.0 5.50 1 OBS
6 -90.7 -0.55 1 FEE;OBS
7 -110. 24.7 1 FEE;OBS;NA
8 -89.5 28.4 1 OBS
9 -91.8 1.38 1 FEE;OBS
I want to replace NA with OBS. I tried using replace(), but nothing changed.

NA is part of a character string here, not a missing value, so str_replace() might work for you:
library(tidyverse)
df1 %>%
mutate(Site_Type = str_replace(Site_Type, "NA", "OBS"))
# Longitude Latitude Radius Site_Type
# 1 -102.0 1.50 5 OBS
# 2 -80.0 27.10 5 OBS
# 3 -158.0 21.50 1 FEE;OBS
# 4 -81.6 3.98 1 FEE;OBS;OBS
# 5 -87.0 5.50 1 OBS
# 6 -90.7 -0.55 1 FEE;OBS
# 7 -110.0 24.70 1 FEE;OBS;OBS
# 8 -89.5 28.40 1 OBS
# 9 -91.8 1.38 1 FEE;OBS

We can use sub in base R
df1$Site_Type <- sub("NA", "OBS", df1$Site_Type)
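Both str_replace() and sub() replace only the first match in each string, and the bare pattern "NA" also matches inside longer codes. A minimal base R sketch that replaces every standalone NA token, assuming made-up sample strings (the "NAT" code is hypothetical and only there to show the word boundary):

```r
# Illustrative values only; "NAT" is a hypothetical code, not from the question
site_type <- c("OBS", "FEE;OBS;NA", "NA;FEE;NA", "NAT")

# gsub() replaces all matches; \\b restricts the match to a whole token
fixed <- gsub("\\bNA\\b", "OBS", site_type)
print(fixed)
```

If the column can only ever contain NA as a separate token, the plain "NA" pattern from the answers above is enough.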

Related

how to copy part of rows based on group by 'id' in R?

I have a data frame such as below:
id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54
I need to fill the missing values in columns 'PP' through 'nh' using the values from other rows with the same 'id'. My target data frame is as below:
id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA 1 0 0 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
I would appreciate it if anybody could share their comments with me.
Best regards
Another option using na.locf:
df <- read.table(text="id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54", header=TRUE)
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(across(everything(), ~na.locf(., na.rm = FALSE)))
#> # A tibble: 5 × 10
#> # Groups: id [2]
#> id Date Age Sex PP Duration cd nh W_B R_B
#> <int> <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
#> 1 583 99/07/19 51 2 NA 1 0 0 6.2 4.26
#> 2 583 99/07/23 51 2 NA 1 0 0 7 4.35
#> 3 3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
#> 4 3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
#> 5 3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
Created on 2022-07-02 by the reprex package (v2.0.1)
Alternatively, using fill() from tidyr:
library(tidyverse)
df <- read_table("id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54")
df %>%
group_by(id) %>%
fill(PP:nh, .direction = 'updown')
#> # A tibble: 5 × 10
#> # Groups: id [2]
#> id Date Age Sex PP Duration cd nh W_B R_B
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 583 99/07/19 51 2 NA 1 0 0 6.2 4.26
#> 2 583 99/07/23 51 2 NA 1 0 0 7 4.35
#> 3 3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
#> 4 3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
#> 5 3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
Created on 2022-07-02 by the reprex package (v2.0.1)
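For readers without zoo or tidyr, the same "carry the last value forward within each id" idea can be sketched in base R. The helper name locf is my own, not part of any package, and the vectors simply copy the id and nh columns from the question:

```r
# Hypothetical helper: carry the last non-missing value forward in one vector
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # position 1 (NA) is used until a value appears
}

id <- c(583, 583, 3024, 3024, 3024)
nh <- c(0, NA, 1, NA, NA)

# ave() applies the helper separately within each id
nh_filled <- ave(nh, id, FUN = locf)
print(nh_filled)
```

Unlike fill(.direction = 'updown'), this sketch only fills downwards, so leading NAs in a group stay NA.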

Loop to sum weekly rolling average

I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous 6 days as well as the current day (rolling weekly average), and print it to an array for the designated water year. I have already created an aggregate function to separate yearly average daily means into their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  POSIDATE <- as.POSIXlt(dates)
  # Year offset: POSIXlt months run 0-11, so compare against start_month - 1
  offset <- ifelse(POSIDATE$mon >= start_month - 1, 1, 0)
  # Water year
  POSIDATE$year + 1900 + offset
}
# NEW_DATE is assumed to hold the DATE column already converted to dates
adj.year <- wtr_yr(NEW_DATE)
# Aggregating by water year to take the mean
mean.FLOW <- aggregate(data_set$FLOW, list(adj.year), mean)
It seems that it can be done much more easily.
But first I need to prepare a bit more data.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
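Note that week() counts weeks from January 1st, so week 1 always starts on that date. To number the weeks according to the ISO 8601 standard, use isoweek() instead; epiweek() follows the US CDC convention, in which weeks start on Sunday. Dates in early January can fall in ISO week 52 or 53 of the previous year, so isoyear() is the matching year function for isoweek().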
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
To see how these three functions differ, compare their results side by side:
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
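The code above averages by calendar week. The rolling window described in the question (the current day plus the previous 6) is a different calculation; a minimal base R sketch, using the 15 FLOW values from the question's sample, could look like this:

```r
# The 15 daily flows shown in the question
flow <- c(88.2, 77.6, 68.4, 61.5, 55.3, 52.5, 49.7, 46.7,
          43.3, 41.3, 39.3, 37.7, 35.8, 34.1, 39.8)

# Equal weights of 1/7 give a 7-day mean; sides = 1 looks backwards only
roll7 <- as.numeric(stats::filter(flow, rep(1 / 7, 7), sides = 1))

# The first 6 entries are NA because a full 7-day window is not yet available
print(roll7)
```

Using rep(1, 7) as the weights instead would give the rolling 7-day sum rather than the mean.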

Sort a dataframe according to characters in R [duplicate]

This question already has answers here:
R Sort strings according to substring
(2 answers)
Closed 2 years ago.
I have the dataframe (code) and I want to sort it by combName in numerical order.
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14.0
# ... with 1,098 more rows
Afterwards the left column should look like:
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-2 ...
3 20-3 ...
4 20-4 ...
5 20-5 ...
...
10 20-10 14.7
# ... with 1,098 more rows
I do not know how to get to this format.
Does this work:
library(dplyr)
library(tidyr)
df
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-102 25.4
2 20-100 21.2
3 20-101 17.6
4 20-105 24.3
5 20-10 14.7
6 20-103 46.3
7 20-104 68.7
8 20-1 8.05
9 20-106 46.3
10 20-107 14
df %>%
  separate(combName, into = c('1', '2'), sep = '-', remove = FALSE) %>%
  type.convert(as.is = TRUE) %>%
  arrange(`1`, `2`) %>%
  select(-c(`1`, `2`))
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14
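The same idea (split on the hyphen, compare the pieces as numbers) can be sketched in base R without helper columns. The vector below is a made-up sample in the same format as combName, not the asker's data:

```r
# Made-up sample labels in the "prefix-number" format
combName <- c("20-102", "20-10", "20-1", "20-2", "19-7")

# Split each label into its two parts and compare them numerically
parts <- do.call(rbind, strsplit(combName, "-", fixed = TRUE))
ord <- order(as.numeric(parts[, 1]), as.numeric(parts[, 2]))

sorted <- combName[ord]
print(sorted)
```

For a data frame, the same index can be used to reorder the rows, for example code[ord, ].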

Computing lags but grouping by two categories with dplyr

I want to create var3 using a lag (dplyr package), but it should be consistent with the year and the ID; that is, the lagged value should belong to the same ID. The dataset is an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I have replaced your summarise() with mutate() for now.)
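The answer above relies on the rows already being sorted by YEAR within each ID. If that is not guaranteed, dplyr's lag() accepts an order_by argument. A small sketch with a made-up, deliberately unsorted panel (the data frame name and values are illustrative only):

```r
library(dplyr)

# Made-up panel, deliberately not sorted by YEAR
panel <- data.frame(
  YEAR = c(2012, 2010, 2011, 2011, 2010),
  ID   = c(1, 1, 1, 2, 2),
  var1 = c(30, 10, 20, 50, 40),
  var2 = c(3, 1, 2, 5, 4)
)

# Lag var2 within each ID, ordered by YEAR rather than by row position
result <- panel %>%
  group_by(ID) %>%
  mutate(var3 = var1 - lag(var2, order_by = YEAR)) %>%
  ungroup()

print(result)
```

The first year of each ID has no previous value, so var3 is NA there, just as in the output above.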

Using "first" in mutate

My dataframe looks something like the first four columns of the following:
ID Obs Seconds Mean Ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 1815522 1 1 NA 1/10.6
2 1815522 2 26 NA 26/10.6
3 1815522 3 4.68 10.6 4.68/10.6
4 1815522 4 0 10.2 0/10.6
5 1815522 5 1.5 2.06 1.5/10.6
6 1815522 6 2.22 1.24 2.22/10.6
7 1815676 1 12 NA 12/9.67
8 1815676 2 6 NA 6/9.67
9 1815676 3 11 9.67 11/9.67
10 1815676 4 1 6 1/9.67
11 1815676 5 30 14 30/9.67
12 1815676 6 29 20 29/9.67
13 1815676 7 23 27.3 23/9.67
14 1815676 8 51 34.3 51/9.67
I am trying to add a fifth column "Ratio", containing the ratio of each row's value for Seconds, and the ID-group's first not-NA value of Mean. How do I do that?
I've tried several things:
temp %>%
group_by(ID) %>%
mutate(Ratio = case_when(all(is.na(Mean)) ~ NA_real_,
!all(is.na(Mean)) ~ Seconds/(first(Mean[!is.na(Mean)]))))
This gives me the following error:
Error in mutate_impl(.data, dots) :
Column `Ratio` must be length 2 (the group size) or one, not 0
I also tried
temp %>%
group_by(ID) %>%
mutate(Ratio = ifelse(!all(is.na(Mean)), Seconds/(first(Mean[!is.na(Mean)])), NA_real_))
But in this case, it will create a column that looks like this:
Ratio
<dbl>
1 0.0947
2 0.0947
3 0.0947
4 0.0947
5 0.0947
6 0.0947
7 1.24
8 1.24
9 1.24
10 1.24
11 1.24
12 1.24
13 1.24
14 1.24
I really don't know what else to try. Please help! :)
One idea is to use fill() with .direction = 'up': because you want the first non-NA value, filling upwards puts it in the first row of each group, and you can then simply divide by first(Mean). There is no need for case_when() to handle groups where Mean is all NA, since the division returns NA for those by default, i.e.
library(tidyverse)
df %>%
group_by(ID) %>%
fill(Mean, .direction = 'up') %>%
mutate(ratio = Seconds / first(Mean))
which gives,
# A tibble: 14 x 5
# Groups: ID [2]
ID Obs Seconds Mean ratio
<int> <int> <dbl> <dbl> <dbl>
1 1815522 1 1 10.6 0.0943
2 1815522 2 26 10.6 2.45
3 1815522 3 4.68 10.6 0.442
4 1815522 4 0 10.2 0
5 1815522 5 1.5 2.06 0.142
6 1815522 6 2.22 1.24 0.209
7 1815676 1 12 9.67 1.24
8 1815676 2 6 9.67 0.620
9 1815676 3 11 9.67 1.14
10 1815676 4 1 6 0.103
11 1815676 5 30 14 3.10
12 1815676 6 29 20 3.00
13 1815676 7 23 27.3 2.38
14 1815676 8 51 34.3 5.27
Try this:
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(
isNA = mean(is.na(Mean)),
Ratio = if_else(isNA == 1, NA_real_, Seconds / first(Mean[!is.na(Mean)]))
)
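As for why the ifelse() attempt in the question repeats one value per group: ifelse() returns a result the same length as its test, and !all(is.na(Mean)) is a single TRUE or FALSE, so only the first ratio survives and mutate() recycles it across the group. A base R sketch using the first three rows of the question's first ID:

```r
seconds  <- c(1, 26, 4.68)
mean_col <- c(NA, NA, 10.6)
first_mean <- mean_col[!is.na(mean_col)][1]

# Scalar test: ifelse() keeps only the first element of the 'yes' vector
collapsed <- ifelse(!all(is.na(mean_col)), seconds / first_mean, NA_real_)

# A plain if/else on the scalar condition keeps the whole vector
full <- if (all(is.na(mean_col))) rep(NA_real_, length(seconds)) else seconds / first_mean

print(length(collapsed))
print(length(full))
```

This is why the if_else() version above works: isNA has one value per row, so the test has the same length as the group.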
