Data
| x | y |
| --------| --------|
| 26.88 | 3.16 |
| 28.57 | 4.21 |
| 30.94 | 2.97 |
| 33.90 | 3.06 |
| 37.24 | 2.87 |
| 39.76 | 2.95 |
| 41.89 | 2.70 |
| 44.37 | 1.25 |
| 27.20 | 5.04 |
| 26.54 | 6.69 |
| 29.21 | 4.42 |
| 33.26 | 3.15 |
| 34.80 | 3.20 |
| 37.87 | 3.11 |
| 41.88 | 2.95 |
| 44.13 | 2.26 |
| 26.42 | 7.07 |
| 24.02 | 8.72 |
| 29.73 | 6.38 |
| 31.10 | 3.85 |
| 33.16 | 3.00 |
| 36.76 | 3.28 |
| 43.26 | 3.18 |
| 42.06 | 2.73 |
| 26.73 | 9.44 |
| 23.03 | 9.72 |
| 27.07 | 6.98 |
| 29.04 | 4.67 |
| 31.83 | 3.55 |
| 36.29 | 3.89 |
| 39.45 | 3.55 |
| 42.17 | 3.37 |
| 23.51 | 10.44 |
| 21.98 | 10.90 |
| 27.21 | 8.13 |
| 28.63 | 5.76 |
| 30.92 | 3.96 |
| 35.57 | 3.94 |
| 38.33 | 3.88 |
| 40.91 | 3.58 |
| 25.15 | 13.05 |
| 19.44 | 15.91 |
| 25.94 | 10.37 |
| 28.03 | 5.17 |
| 31.25 | 4.04 |
| 35.31 | 4.24 |
| 37.02 | 4.31 |
| 38.89 | 3.99 |
| 25.12 | 15.66 |
| 18.36 | 19.86 |
| 25.05 | 12.82 |
| 27.58 | 6.07 |
| 28.83 | 4.11 |
| 33.76 | 4.17 |
| 34.48 | 4.30 |
| 37.32 | 3.97 |
| 21.27 | 20.49 |
| 16.61 | 25.53 |
| 22.68 | 16.58 |
| 25.63 | 6.34 |
| 28.15 | 4.40 |
| 32.80 | 3.99 |
| 35.27 | 4.59 |
| 36.75 | 4.35 |
Code
library(data.table)
library(readxl)
library(dplyr)
library(ggplot2)
library(patchwork)
library(ggpubr)
library(ggpmisc)
setwd("E:/")
Data_2 <- read_excel("Data_2.xlsx")
model.0 <- lm(log(Strength) ~ Theoritical, data= Data_2)
alpha.0 <- exp(coef(model.0)[1])
beta.0 <- coef(model.0)[2]
# Starting parameters
start <- list(alpha = alpha.0, beta = beta.0)
start
model <- nls(Strength ~ alpha * exp((1/beta) * Theoritical) , data = Data_2, start = start)
summary(model)
# Plot fitted curve
plot(Data_2$Theoritical, Data_2$Strength)
lines(sort(Data_2$Theoritical), predict(model, newdata = data.frame(Theoritical = sort(Data_2$Theoritical))), col = 'skyblue')
When I draw my plot I get the following image.
I need this kind of equation for my data:
y = a*e^(-x/b)
I also could not get the R^2 value shown in the picture.
Please correct my code, and kindly help me with a good code for a best-fit graph for that equation. I am new to R programming.
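A minimal sketch of one way to do this, assuming the spreadsheet columns are named Theoritical and Strength as in the code above: for y = a*e^(-x/b) the slope of the log-linear fit is -1/b (so the starting value should be beta = -1/slope), and nls() does not report an R^2, so a pseudo R-squared has to be computed from the residuals.
# Sketch: fit y = a * exp(-x/b); assumes columns Theoritical and Strength
model.0 <- lm(log(Strength) ~ Theoritical, data = Data_2)
start <- list(alpha = exp(coef(model.0)[1]), # a = exp(intercept)
              beta = -1 / coef(model.0)[2])  # log-fit slope = -1/b, so b = -1/slope
model <- nls(Strength ~ alpha * exp(-Theoritical / beta), data = Data_2, start = start)
# nls() has no built-in R^2; a common pseudo R-squared is 1 - SSres/SStot
r2 <- 1 - sum(residuals(model)^2) / sum((Data_2$Strength - mean(Data_2$Strength))^2)
# Plot the points with a smooth fitted curve and the R^2 in the title
xs <- seq(min(Data_2$Theoritical), max(Data_2$Theoritical), length.out = 200)
plot(Data_2$Theoritical, Data_2$Strength, xlab = "x", ylab = "y")
lines(xs, predict(model, newdata = data.frame(Theoritical = xs)), col = "skyblue")
title(main = sprintf("y = a*exp(-x/b), pseudo R^2 = %.3f", r2))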
I have an R dataframe in the following format
+--------+---------------+--------------------+--------+
| time | Stress_ratio | shear_displacement | CX |
+--------+---------------+--------------------+--------+
| <dbl> | <dbl> | <dbl> | <dbl> |
| 50.1 | -0.224 | 4.9 | 0 |
| 50.2 | -0.219 | 4.98 | 0.0100 |
| . | . | . | . |
| . | . | . | . |
| 249.3 | -0.217 | 4.97 | 0.0200 |
| 250.4 | -0.214 | 4.96 | 0.0300 |
| 251.1 | -0.222 | 4.91 | 0.06 |
| 252.1 | -0.222 | 4.91 | 0.06 |
| 253.3 | -0.222 | 4.91 | 0.06 |
| 254.5 | -0.222 | 4.91 | 0.06 |
| 256.8 | -0.222 | 4.91 | 0.06 |
| . | . | . | . |
| . | . | . | . |
| 500.1 | -0.22 | 4.91 | 0.6 |
| 501.4 | -0.22 | 4.91 | 0.6 |
| 503.1 | -0.22 | 4.91 | 0.6 |
+--------+---------------+--------------------+--------+
and I want a new column which has repetitive values based on 250-wide ranges of the column time. For example, new_column should be 1 for every row whose time falls in the first range of 250, should change to 2 when the next chunk of 250 starts, and so on. So the new dataframe should be like
+--------+---------------+--------------------+--------+------------+
| time | Stress_ratio | shear_displacement | CX | new_column |
+--------+---------------+--------------------+--------+------------+
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 50.1 | -0.224 | 4.9 | 0 | 1 |
| 50.2 | -0.219 | 4.98 | 0.0100 | 1 |
| . | . | . | . | 1 |
| . | . | . | . | 1 |
| 249.3 | -0.217 | 4.97 | 0.0200 | 1 |
| 250.4 | -0.214 | 4.96 | 0.0300 | 2 |
| 251.1 | -0.222 | 4.91 | 0.06 | 2 |
| 252.1 | -0.222 | 4.91 | 0.06 | 2 |
| 253.3 | -0.222 | 4.91 | 0.06 | 2 |
| 254.5 | -0.222 | 4.91 | 0.06 | 2 |
| 256.8 | -0.222 | 4.91 | 0.06 | 2 |
| . | . | . | . | . |
| . | . | . | . | . |
| 499.1 | -0.22 | 4.91 | 0.6 | 2 |
| 501.4 | -0.22 | 4.91 | 0.6 | 3 |
| 503.1 | -0.22 | 4.91 | 0.6 | 3 |
+--------+---------------+--------------------+--------+------------+
If I understand what you're trying to do, a base R solution could be:
df$new_column <- df$time %/% 250 + 1
The %/% operator is integer division (sort of the complement of the modulus operator) and tells you how many copies of 250 would fit into your number; we add 1 to get the value you want.
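A quick check of that one-liner on a toy data frame (hypothetical times, just to illustrate the binning):
# Toy illustration of the %/% 250 binning (made-up values)
df <- data.frame(time = c(50.1, 249.3, 250.4, 499.1, 501.4))
df$new_column <- df$time %/% 250 + 1
df
#    time new_column
# 1  50.1          1
# 2 249.3          1
# 3 250.4          2
# 4 499.1          2
# 5 501.4          3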
The tidyverse version:
df <- df %>%
mutate(new_column = time %/% 250 + 1)
Or with data.table (note that rleid() numbers consecutive runs, so this matches the %/% version only when time is sorted):
library(data.table)
setDT(df)[, new_column := rleid(time %/% 250)][]
Using this data set with a multiple dichotomy set and a group:
set.seed(14)
checkall <- data.frame(ID=1:200,
group=sample(c("A", "B", "C"), size=200, replace=TRUE),
q1a=sample(c(0,1), size=200, replace=TRUE),
q1b=sample(c(0,1), size=200, replace=TRUE),
q1c=sample(c(0,1), size=200, replace=TRUE),
q1d=sample(c(0,1), size=200, replace=TRUE),
q1e=sample(c(0,1), size=200, replace=TRUE),
q1f=sample(c(0,1), size=200, replace=TRUE),
q1g=sample(c(0,1), size=200, replace=TRUE),
q1h=sample(c(0,1), size=200, replace=TRUE))
#Doctor some to be related to group
checkall$q1c[checkall$group=="A"] <- sample(c(0,1,1,1), size=sum(checkall$group=="A"), replace=TRUE)
checkall$q1e[checkall$group=="A"] <- sample(c(0,0,0,1), size=sum(checkall$group=="A"), replace=TRUE)
I would like to make a table that shows frequencies and column percents like this:
library(dplyr)
if( !require(expss) ){ install.packages("expss", dependencies=TRUE); library(expss) }
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cases(label = "freq") %>%
tab_stat_cpct(label = "col %") %>%
tab_pivot(stat_position = "inside_columns")
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ----- | ----- | ----- | ---- | ----- | ---- | ----- |
| q1a | 101 | 50.8 | 33 | 47.8 | 36 | 51.4 | 32 | 53.3 |
| q1b | 92 | 46.2 | 34 | 49.3 | 29 | 41.4 | 29 | 48.3 |
| q1c | 111 | 55.8 | 53 | 76.8 | 30 | 42.9 | 28 | 46.7 |
| q1d | 89 | 44.7 | 35 | 50.7 | 30 | 42.9 | 24 | 40.0 |
| q1e | 100 | 50.3 | 19 | 27.5 | 43 | 61.4 | 38 | 63.3 |
| q1f | 89 | 44.7 | 34 | 49.3 | 36 | 51.4 | 19 | 31.7 |
| q1g | 97 | 48.7 | 29 | 42.0 | 33 | 47.1 | 35 | 58.3 |
| q1h | 113 | 56.8 | 40 | 58.0 | 36 | 51.4 | 37 | 61.7 |
| #Total cases | 199 | 199.0 | 69 | 69.0 | 70 | 70.0 | 60 | 60.0 |
But I would like to add the notations that compare the cpct values to that in the first column. I can get that on a table with just cpct values like this:
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cpct(label = "col %")%>%
tab_pivot(stat_position = "inside_columns")%>%
significance_cpct(compare_type = "first_column")
| | #Total | group | | |
| | col % | A | B | C |
| | | col % | col % | col % |
| ------------ | ------ | ------ | ----- | ----- |
| q1a | 50.8 | 47.8 | 51.4 | 53.3 |
| q1b | 46.2 | 49.3 | 41.4 | 48.3 |
| q1c | 55.8 | 76.8 + | 42.9 | 46.7 |
| q1d | 44.7 | 50.7 | 42.9 | 40.0 |
| q1e | 50.3 | 27.5 - | 61.4 | 63.3 |
| q1f | 44.7 | 49.3 | 51.4 | 31.7 |
| q1g | 48.7 | 42.0 | 47.1 | 58.3 |
| q1h | 56.8 | 58.0 | 51.4 | 61.7 |
| #Total cases | 199 | 69 | 70 | 60 |
Is there a way to get the + and - notations onto the first table in just the col % columns? If I try to mix tab_stat_cases(label = "freq") with significance_cpct(compare_type = "first_column"), I get a weird table that tries to compare both the freq and col % columns to the first column:
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cases(label = "freq") %>%
tab_stat_cpct(label = "col %") %>%
tab_pivot(stat_position = "inside_columns") %>%
significance_cpct(compare_type = "first_column")
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| q1a | 101.0 | 50.8 - | 33.0 - | 47.8 - | 36.0 - | 51.4 - | 32.0 - | 53.3 - |
| q1b | 92.0 | 46.2 - | 34.0 - | 49.3 - | 29.0 - | 41.4 - | 29.0 - | 48.3 - |
| q1c | 111.0 | 55.8 - | 53.0 - | 76.8 | 30.0 - | 42.9 - | 28.0 - | 46.7 - |
| q1d | 89.0 | 44.7 - | 35.0 - | 50.7 - | 30.0 - | 42.9 - | 24.0 - | 40.0 - |
| q1e | 100.0 | 50.3 - | 19.0 - | 27.5 - | 43.0 - | 61.4 - | 38.0 - | 63.3 - |
| q1f | 89.0 | 44.7 - | 34.0 - | 49.3 - | 36.0 - | 51.4 - | 19.0 - | 31.7 - |
| q1g | 97.0 | 48.7 - | 29.0 - | 42.0 - | 33.0 - | 47.1 - | 35.0 - | 58.3 - |
| q1h | 113.0 | 56.8 - | 40.0 - | 58.0 - | 36.0 - | 51.4 - | 37.0 - | 61.7 |
| #Total cases | 199 | 199 | 69 | 69 | 70 | 70 | 60 | 60 |
I'm looking for the top table with the + and - notation as below:
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ----- | ----- | ----- | ---- | ----- | ---- | ----- |
| q1a | 101 | 50.8 | 33 | 47.8 | 36 | 51.4 | 32 | 53.3 |
| q1b | 92 | 46.2 | 34 | 49.3 | 29 | 41.4 | 29 | 48.3 |
| q1c | 111 | 55.8 | 53 | 76.8 +| 30 | 42.9 | 28 | 46.7 |
| q1d | 89 | 44.7 | 35 | 50.7 | 30 | 42.9 | 24 | 40.0 |
| q1e | 100 | 50.3 | 19 | 27.5 -| 43 | 61.4 | 38 | 63.3 |
| q1f | 89 | 44.7 | 34 | 49.3 | 36 | 51.4 | 19 | 31.7 |
| q1g | 97 | 48.7 | 29 | 42.0 | 33 | 47.1 | 35 | 58.3 |
| q1h | 113 | 56.8 | 40 | 58.0 | 36 | 51.4 | 37 | 61.7 |
| #Total cases | 199 | 199.0 | 69 | 69.0 | 70 | 70.0 | 60 | 60.0 |
There is a special function for such a case, tab_last_sig_cpct, which is applied only to the last calculation:
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cases(label = "freq") %>%
tab_stat_cpct(label = "col %") %>%
tab_last_sig_cpct(compare_type = "first_column") %>%
tab_pivot(stat_position = "inside_columns")
I need to run the Mann-Kendall test (package trend in R, https://cran.r-project.org/web/packages/trend/index.html) on varying-length time series data. Currently the time series analysis runs from a start year that I specify manually, but that may not be the actual start date: a lot of my sites have differing start years, and some have different ending years. I condensed my data into the table below. This is water quality data, so it has missing values and varying start/end dates.
I also deal with NAs in the middle of the time series and at the beginning. I would like to smooth over the missing values when they fall in the middle of a time series; if the NAs are at the beginning, I would like to start the time series with the first actual value.
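For context, a minimal sketch of the trend-package calls involved, run on a short made-up yearly series (values loosely based on site 5678 below):
# Minimal trend usage on a complete yearly series (hypothetical values)
library(trend)
x <- ts(c(4.1, 2.6, 3.6, -5.4, 4.7, 6.6, 2.4), start = 1983, frequency = 1)
mk.test(x, continuity = TRUE)      # Mann-Kendall trend test
sens.slope(x, conf.level = 0.95)   # Sen's slope estimate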
+---------+------------+------+--------------+-------------+-------------+---------------+--------------+
| SITE_ID | PROGRAM_ID | YEAR | ANC_UEQ_L | NO3_UEQ_L | SO4_UEQ_L | SBC_ALL_UEQ_L | SBC_NA_UEQ_L |
+---------+------------+------+--------------+-------------+-------------+---------------+--------------+
| 1234 | Alpha | 1992 | 36.12 | 0.8786 | 91.90628571 | 185.5595714 | 156.2281429 |
| 1234 | Alpha | 1993 | 22.30416667 | 2.671258333 | 86.85733333 | 180.5109167 | 154.1934167 |
| 1234 | Alpha | 1994 | 25.25166667 | 3.296475 | 92.00533333 | 184.3589167 | 157.3889167 |
| 1234 | Alpha | 1995 | 23.39166667 | 1.753436364 | 97.58981818 | 184.5251818 | 160.2047273 |
| 5678 | Beta | 1983 | 4.133333333 | 20 | 134.4333333 | 182.1 | 157.4 |
| 5678 | Beta | 1984 | 2.6 | 21.85 | 137.78 | 170.67 | 150.64 |
| 5678 | Beta | 1985 | 3.58 | 20.85555556 | 133.7444444 | 168.82 | 150.09 |
| 5678 | Beta | 1986 | -5.428571429 | 40.27142857 | 124.9 | 152.4 | 136.2142857 |
| 5678 | Beta | 1987 | NA | 13.75 | 122.75 | 137.4 | 126.3 |
| 5678 | Beta | 1988 | 4.666666667 | 26.13333333 | 123.7666667 | 174.9166667 | 155.4166667 |
| 5678 | Beta | 1989 | 6.58 | 31.91 | 124.63 | 167.39 | 148.68 |
| 5678 | Beta | 1990 | 2.354545455 | 39.49090909 | 121.6363636 | 161.6454545 | 144.5545455 |
| 5678 | Beta | 1991 | 5.973846154 | 30.54307692 | 119.8138462 | 165.4661185 | 147.0807338 |
| 5678 | Beta | 1992 | 4.174359 | 16.99051285 | 124.1753846 | 148.5505115 | 131.8894862 |
| 5678 | Beta | 1993 | 6.05 | 19.76125 | 117.3525 | 148.3025 | 131.3275 |
| 5678 | Beta | 1994 | -2.51666 | 17.47167 | 117.93266 | 129.64167 | 114.64501 |
| 5678 | Beta | 1995 | 8.00936875 | 22.66188125 | 112.3575 | 166.1220813 | 148.7095813 |
| 9101 | Victor | 1980 | NA | NA | 94.075 | NA | NA |
| 9101 | Victor | 1981 | NA | NA | 124.7 | NA | NA |
| 9101 | Victor | 1982 | 33.26666667 | NA | 73.53333333 | 142.75 | 117.15 |
| 9101 | Victor | 1983 | 26.02 | NA | 94.9 | 147.96 | 120.44 |
| 9101 | Victor | 1984 | 20.96 | NA | 82.98 | 137.4 | 110.46 |
| 9101 | Victor | 1985 | 29.325 | 0.157843137 | 84.975 | 144.45 | 118.45 |
| 9101 | Victor | 1986 | 28.6 | 0.88504902 | 81.675 | 139.7 | 114.45 |
| 9101 | Victor | 1987 | 25.925 | 1.065441176 | 74.15 | 131.875 | 108.7 |
| 9101 | Victor | 1988 | 29.4 | 1.048529412 | 80.625 | 148.15 | 122.5 |
| 9101 | Victor | 1989 | 27.7 | 0.907598039 | 81.025 | 143.1 | 119.275 |
| 9101 | Victor | 1990 | 27.4 | 0.642647059 | 77.65 | 126.825 | 104.775 |
| 9101 | Victor | 1991 | 24.95 | 1.228921569 | 74.1 | 138.55 | 115.7 |
| 9101 | Victor | 1992 | 29.425 | 0.591911765 | 73.85 | 130.675 | 106.65 |
| 9101 | Victor | 1993 | 22.53333333 | 0.308169935 | 64.93333333 | 117.3666667 | 96.2 |
| 9101 | Victor | 1994 | 29.93333333 | 0.428431373 | 67.23333333 | 124.0666667 | 101.2333333 |
| 9101 | Victor | 1995 | 39.33333333 | 0.57875817 | 65.36666667 | 128.8333333 | 105.0666667 |
| 1121 | Charlie | 1987 | 12.39 | 0.65 | 99.48 | 136.37 | 107.75 |
| 1121 | Charlie | 1988 | 10.87333333 | 0.69 | 104.6133333 | 131.9 | 105.2 |
| 1121 | Charlie | 1989 | 5.57 | 1.09 | 105.46 | 136.125 | 109.5225 |
| 1121 | Charlie | 1990 | 13.4725 | 0.8975 | 99.905 | 134.45 | 108.9875 |
| 1121 | Charlie | 1991 | 11.3 | 0.805 | 100.605 | 134.3775 | 108.9725 |
| 1121 | Charlie | 1992 | 9.0025 | 7.145 | 99.915 | 136.8625 | 111.945 |
| 1121 | Charlie | 1993 | 7.7925 | 6.6 | 95.865 | 133.0975 | 107.4625 |
| 1121 | Charlie | 1994 | 7.59 | 3.7625 | 97.3575 | 129.635 | 104.465 |
| 1121 | Charlie | 1995 | 7.7925 | 1.21 | 100.93 | 133.9875 | 109.5025 |
| 3812 | Charlie | 1988 | 18.84390244 | 17.21142857 | 228.8684211 | 282.6540541 | 260.5648649 |
| 3812 | Charlie | 1989 | 11.7248 | 21.21363636 | 216.5973451 | 261.3711712 | 237.4929204 |
| 3812 | Charlie | 1990 | 2.368571429 | 35.23448276 | 216.7827586 | 286.0034483 | 264.3137931 |
| 3812 | Charlie | 1991 | 33.695 | 40.733 | 231.92 | 350.91075 | 328.443 |
| 3812 | Charlie | 1992 | 18.49111111 | 26.14818889 | 219.1488 | 301.3785889 | 281.8809222 |
| 3812 | Charlie | 1993 | 17.28181818 | 27.65394545 | 210.6605091 | 290.064 | 271.9205455 |
+---------+------------+------+--------------+-------------+-------------+---------------+--------------+
Here is my current code. It runs the time series analysis on my actual data if I change the start year to skip the NAs in the earlier data. It works great for sites that have values for the entire period, but gives me odd results when different start/end years come into play.
Mann_Kendall_Values_Trimmed <- filter(LTM_Data_StackOverflow_9_22_2020, YEAR >1984) %>% #I manually trimmed the data here to prevent some errors
group_by(SITE_ID) %>%
filter(n() > 2) %>% #filter sites with more than 2 years of data
gather(parameter, value, SO4_UEQ_L, ANC_UEQ_L, NO3_UEQ_L, SBC_ALL_UEQ_L, SBC_NA_UEQ_L ) %>%
#, DOC_MG_L)
group_by(parameter, SITE_ID, PROGRAM_ID) %>% nest() %>%
mutate(ts_out = map(data, ~ts(.x$value, start=c(1985, 1), end=c(1995, 1), frequency=1))) %>%
#this is where I would like to specify the first year in the actual time series with data. End year would also be tied to the last year of data.
mutate(mk_res = map(ts_out, ~mk.test(.x, alternative = c("two.sided", "greater", "less"),continuity = TRUE)),
sens = map(ts_out, ~sens.slope(.x, conf.level = 0.95))) %>%
#run the Mann-Kendall Test
mutate(mk_stat = map_dbl(mk_res, ~.x$statistic),
p_val = map_dbl(mk_res, ~.x$p.value)
, sens_slope = map_dbl(sens, ~.x$estimates)
) %>%
#Pull the parameters we need
select(SITE_ID, PROGRAM_ID, parameter, sens_slope, p_val, mk_stat) %>%
mutate(output = case_when(
sens_slope == 0 ~ "NC",
sens_slope > 0 & p_val < 0.05 ~ "INS",
sens_slope > 0 & p_val > 0.05 ~ "INNS",
sens_slope < 0 & p_val < 0.05 ~ "DES",
sens_slope < 0 & p_val > 0.05 ~ "DENS"))
How do I handle the NAs in the middle of the data?
How do I get the time series to automatically start and end on the dates with actual data? For reference, each of the SITE_IDs has the following date ranges (not including NAs):
+-----------+-----------+-------------------+-----------+-----------+
| 1234 | 5678 | 9101 | 1121 | 3812 |
+-----------+-----------+-------------------+-----------+-----------+
| 1992-1995 | 1983-1995 | 1982 OR 1985-1995 | 1987-1995 | 1988-1993 |
+-----------+-----------+-------------------+-----------+-----------+
To make the data more consistent, I decided to organize the data as individual time-series (grouping by parameter, year, site_id, program) in Oracle before importing into R.
+---------+------------+------+--------------+-----------+
| SITE_ID | PROGRAM_ID | YEAR | Value | Parameter |
+---------+------------+------+--------------+-----------+
| 1234 | Alpha | 1992 | 36.12 | ANC |
| 1234 | Alpha | 1993 | 22.30416667 | ANC |
| 1234 | Alpha | 1994 | 25.25166667 | ANC |
| 1234 | Alpha | 1995 | 23.39166667 | ANC |
| 5678 | Beta | 1990 | 2.354545455 | ANC |
| 5678 | Beta | 1991 | 5.973846154 | ANC |
| 5678 | Beta | 1992 | 4.174359 | ANC |
| 5678 | Beta | 1993 | 6.05 | ANC |
| 5678 | Beta | 1994 | -2.51666 | ANC |
| 5678 | Beta | 1995 | 8.00936875 | ANC |
| 9101 | Victor | 1990 | 27.4 | ANC |
| 9101 | Victor | 1991 | 24.95 | ANC |
| 9101 | Victor | 1992 | 29.425 | ANC |
| 9101 | Victor | 1993 | 22.53333333 | ANC |
| 9101 | Victor | 1994 | 29.93333333 | ANC |
| 9101 | Victor | 1995 | 39.33333333 | ANC |
| 1121 | Charlie | 1990 | 13.4725 | ANC |
| 1121 | Charlie | 1991 | 11.3 | ANC |
| 1121 | Charlie | 1992 | 9.0025 | ANC |
| 1121 | Charlie | 1993 | 7.7925 | ANC |
| 1121 | Charlie | 1994 | 7.59 | ANC |
| 1121 | Charlie | 1995 | 7.7925 | ANC |
| 3812 | Charlie | 1990 | 2.368571429 | ANC |
| 3812 | Charlie | 1991 | 33.695 | ANC |
| 3812 | Charlie | 1992 | 18.49111111 | ANC |
| 3812 | Charlie | 1993 | 17.28181818 | ANC |
| 1234 | Alpha | 1992 | 0.8786 | NO3 |
| 1234 | Alpha | 1993 | 2.671258333 | NO3 |
| 1234 | Alpha | 1994 | 3.296475 | NO3 |
| 1234 | Alpha | 1995 | 1.753436364 | NO3 |
| 5678 | Beta | 1990 | 39.49090909 | NO3 |
| 5678 | Beta | 1991 | 30.54307692 | NO3 |
| 5678 | Beta | 1992 | 16.99051285 | NO3 |
| 5678 | Beta | 1993 | 19.76125 | NO3 |
| 5678 | Beta | 1994 | 17.47167 | NO3 |
| 5678 | Beta | 1995 | 22.66188125 | NO3 |
| 9101 | Victor | 1990 | 0.642647059 | NO3 |
| 9101 | Victor | 1991 | 1.228921569 | NO3 |
| 9101 | Victor | 1992 | 0.591911765 | NO3 |
| 9101 | Victor | 1993 | 0.308169935 | NO3 |
| 9101 | Victor | 1994 | 0.428431373 | NO3 |
| 9101 | Victor | 1995 | 0.57875817 | NO3 |
| 1121 | Charlie | 1990 | 0.8975 | NO3 |
| 1121 | Charlie | 1991 | 0.805 | NO3 |
| 1121 | Charlie | 1992 | 7.145 | NO3 |
| 1121 | Charlie | 1993 | 6.6 | NO3 |
| 1121 | Charlie | 1994 | 3.7625 | NO3 |
| 1121 | Charlie | 1995 | 1.21 | NO3 |
| 3812 | Charlie | 1990 | 35.23448276 | NO3 |
| 3812 | Charlie | 1991 | 40.733 | NO3 |
| 3812 | Charlie | 1992 | 26.14818889 | NO3 |
| 3812 | Charlie | 1993 | 27.65394545 | NO3 |
+---------+------------+------+--------------+-----------+
Once in R, I was able to edit the beginning of the code to the following; the remaining code was the same.
Mann_Kendall_Values_Trimmed <- filter(LTM_Data_StackOverflow_9_22_2020, YEAR >1989, PARAMETER != 'doc') %>%
#filter data to start in 1990 as this removes nulls from pre-1990 sampling
group_by(SITE_ID) %>%
filter(n() > 10) %>% #filter sites with more than 10 years of data
#gather(SITE_ID, PARAMETER, VALUE) #I believe this is now redundant %>%
group_by(PARAMETER, SITE_ID, PROGRAM_ID) %>% nest() %>%
mutate(ts_out = map(data, ~ts(.x$VALUE, start = c(min(.x$YEAR), 1), end = c(max(.x$YEAR), 1), frequency = 1)))
This achieved the result I needed for all time series with sufficient length (more than 2 points, I believe) to run the Mann-Kendall test. The parameter that still had issues will be dealt with in separate R code.
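For the NAs-in-the-middle question, one possible approach (an assumption on my part, using the zoo package) is to trim leading/trailing NAs and linearly interpolate interior ones per series before calling mk.test():
# Sketch: per-series NA handling, assuming the zoo package and one row per year
# zoo::na.trim() drops leading/trailing NAs; zoo::na.approx() fills interior
# NAs by linear interpolation over the year index
library(zoo)
library(trend)
clean_ts <- function(years, values) {
  z <- zoo(values, order.by = years)
  z <- na.trim(z)    # start/end the series on real values
  z <- na.approx(z)  # interpolate interior gaps
  ts(coredata(z), start = start(z), frequency = 1)
}
# Hypothetical series with a leading NA (trimmed) and an interior NA (filled)
x <- clean_ts(1982:1988, c(NA, 4.13, 2.60, NA, 3.58, -5.43, 4.67))
mk.test(x, continuity = TRUE)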
Consider the following table:
julia> using RDatasets, DataFrames
julia> anscombe = dataset("datasets","anscombe")
11x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1 | Y2 | Y3 | Y4 |
|-----|----|----|----|----|-------|------|-------|------|
| 1 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 2 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 3 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 4 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 5 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 6 | 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 7 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 8 | 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 9 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 10 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 11 | 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
I have defined a function as follows:
julia> f1(df, matchval, matchfield, qfields...) = isempty(qfields)
WARNING: Method definition f1(Any, Any, Any, Any...) in module Main at REPL[314]:1 overwritten at REPL[317]:1.
f1 (generic function with 3 methods)
Now, here is the problem:
julia> f1(anscombe, 11, "X1")
ERROR: KeyError: key :field not found
in getindex at ./dict.jl:697 [inlined]
in getindex(::DataFrames.Index, ::Symbol) at /home/arghya/.julia/v0.5/DataFrames/src/other/index.jl:114
in getindex at /home/arghya/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:228 [inlined]
in f1(::DataFrames.DataFrame, ::Int64, ::String) at ./REPL[249]:2
Where am I going wrong? FYI, I'm using Julia version 0.5.2. How do I overcome this problem? Thanks in advance!
There is nothing wrong with your code - try running just what you've posted in a fresh session. You have probably defined another f1 method before. If you come from R, you may assume that it was overwritten by f1(df, matchval, matchfield, qfields...) = isempty(qfields), while in fact you have just defined a new method for the f1 function (note the warning above: f1 is a generic function with 3 methods). The error is probably thrown by a 3-argument method you defined earlier. See https://docs.julialang.org/en/stable/manual/methods/
I'm trying to get a two-way table in R similar to this one from Stata. I was trying to use CrossTable from the gmodels package, but the table is not the same. Do you know how this can be done in R?
I hope at least to get the frequencies for
cursmoke1 == "Yes" & cursmoke2 == "No" and the reverse.
In R I'm only getting totals for Yes, No, and NA.
Here is the output:
Stata
. tabulate cursmoke1 cursmoke2, cell column miss row
+-------------------+
| Key |
|-------------------|
| frequency |
| row percentage |
| column percentage |
| cell percentage |
+-------------------+
Current |
smoker, | Current smoker, exam 2
exam 1 | No Yes . | Total
-----------+---------------------------------+----------
No | 1,898 131 224 | 2,253
| 84.24 5.81 9.94 | 100.00
| 86.16 7.59 44.44 | 50.81
| 42.81 2.95 5.05 | 50.81
-----------+---------------------------------+----------
Yes | 305 1,596 280 | 2,181
| 13.98 73.18 12.84 | 100.00
| 13.84 92.41 55.56 | 49.19
| 6.88 35.99 6.31 | 49.19
-----------+---------------------------------+----------
Total | 2,203 1,727 504 | 4,434
| 49.68 38.95 11.37 | 100.00
| 100.00 100.00 100.00 | 100.00
| 49.68 38.95 11.37 | 100.00
R
> CrossTable(cursmoke2, cursmoke1, missing.include = T, format="SAS")
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke1
cursmoke2 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 2203 | 0 | 0 | 2203 |
| 1122.544 | 858.047 | 250.409 | |
| 1.000 | 0.000 | 0.000 | 0.497 |
| 1.000 | 0.000 | 0.000 | |
| 0.497 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 0 | 1727 | 0 | 1727 |
| 858.047 | 1652.650 | 196.303 | |
| 0.000 | 1.000 | 0.000 | 0.389 |
| 0.000 | 1.000 | 0.000 | |
| 0.000 | 0.389 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
NA | 0 | 0 | 504 | 504 |
| 250.409 | 196.303 | 3483.288 | |
| 0.000 | 0.000 | 1.000 | 0.114 |
| 0.000 | 0.000 | 1.000 | |
| 0.000 | 0.000 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Maybe I'm missing something here. The default settings for CrossTable seem to provide essentially what you are looking for.
Here is CrossTable with minimal arguments. (I've loaded the dataset as "temp".) Note that the results are the same as what you posted from the Stata output (you just need to multiply by 100 if you want the result as a percentage).
library(gmodels)
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE))
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 0.842 | 0.058 | 0.099 | 0.508 |
| 0.862 | 0.076 | 0.444 | |
| 0.428 | 0.030 | 0.051 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 0.140 | 0.732 | 0.128 | 0.492 |
| 0.138 | 0.924 | 0.556 | |
| 0.069 | 0.360 | 0.063 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Alternatively, you can use format="SPSS" if you want the numbers displayed as percentages.
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE, format="SPSS"))
Cell Contents
|-------------------------|
| Count |
| Chi-square contribution |
| Row Percent |
| Column Percent |
| Total Percent |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 84.243% | 5.814% | 9.942% | 50.812% |
| 86.155% | 7.585% | 44.444% | |
| 42.806% | 2.954% | 5.052% | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 13.984% | 73.177% | 12.838% | 49.188% |
| 13.845% | 92.415% | 55.556% | |
| 6.879% | 35.995% | 6.315% | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 49.684% | 38.949% | 11.367% | |
-------------|-----------|-----------|-----------|-----------|
Update: prop.table()
Just FYI (to save you the tedious work of building your own data.frame as you did), you may also be interested in the prop.table() function.
Again, using the data you linked to and assuming it is named "temp", the following gives you the underlying data from which you can construct your data.frame. You may also be interested in looking into the functions margin.table() or addmargins():
## Your basic table
CurSmoke <- with(temp, table(cursmoke1, cursmoke2, useNA = "ifany"))
CurSmoke
# cursmoke2
# cursmoke1 No Yes <NA>
# No 1898 131 224
# Yes 305 1596 280
## Row proportions
prop.table(CurSmoke, 1) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.84243231 0.05814470 0.09942299
# Yes 0.13984411 0.73177442 0.12838148
## Column proportions
prop.table(CurSmoke, 2) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.86155243 0.07585408 0.44444444
# Yes 0.13844757 0.92414592 0.55555556
## Cell proportions
prop.table(CurSmoke) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.42805593 0.02954443 0.05051872
# Yes 0.06878665 0.35994587 0.06314840
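Following up on the margin.table()/addmargins() pointer above, a short sketch on the same CurSmoke table:
## Marginal sums appended to the table
addmargins(CurSmoke)
#           cursmoke2
# cursmoke1    No  Yes <NA>  Sum
#   No       1898  131  224 2253
#   Yes       305 1596  280 2181
#   Sum      2203 1727  504 4434
## Just the row totals
margin.table(CurSmoke, 1)
# cursmoke1
#   No  Yes
# 2253 2181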