I have a large dataset ("bsa", drawn from a 23-year period) which includes a variable ("leftrigh") for "left-right" political orientation. I'd like to summarise how the cohorts change over time. For example, in 1994 the average value of this scale for people aged 45 was (say) 2.6; in 1995 the average value for people aged 46 was (say) 2.7, and so on. I've created a year-of-birth variable ("yrbrn") to facilitate this.
I've successfully created the means:
# mean left-right score within each birth-year x survey-year cell
bsa <- bsa %>% group_by(yrbrn, syear) %>% mutate(meanlr = mean(leftrigh))
Where I'm struggling is in summarising the means by year (of the survey) and age (at the time of the survey). If I could create an array (containing these means) organised as age x survey-year, I could see the change over time by inspecting the diagonals. But I have no clue how to do this -- my skills are very limited...
A tibble: 66,744 x 10
Groups: yrbrn [104]
Rsex Rage leftrigh OldWt syear yrbrn coh per agecat meanlr
1 1 [Male] 40 1 [left] 1.12 2017 1977 17 2017 [37,47) 2.61
2 2 [Female] 79 1.8 0.562 2017 1938 9 2017 [77,87) 2.50
3 2 [Female] 50 1.5 1.69 2017 1967 15 2017 [47,57) 2.59
4 1 [Male] 73 2 0.562 2017 1944 10 2017 [67,77) 2.57
5 2 [Female] 31 3 0.562 2017 1986 19 2017 [27,37) 2.56
6 1 [Male] 74 2.2 0.562 2017 1943 10 2017 [67,77) 2.50
7 2 [Female] 58 2 0.562 2017 1959 13 2017 [57,67) 2.56
8 1 [Male] 59 1.2 0.562 2017 1958 13 2017 [57,67) 2.53
9 2 [Female] 19 4 1.69 2017 1998 21 2017 [17,27) 2.46
Possible format for presenting this information to see change over time:
1994 1995 1996 1997 1998 1999 2000
18
19
20
21
22
23
24
25
etc.
You can group_by both variables (year of birth and survey year) at the same time:
# Setup (& make reproducible data...)
library(dplyr)
library(tidyr)

n <- 10000
df1 <- data.frame(
  yrbrn    = sample(1920:1995, size = n, replace = TRUE),
  Syear    = sample(2005:2015, size = n, replace = TRUE),
  leftrigh = sample(seq(0, 5, 0.1), size = n, replace = TRUE))

# Solution
df1 %>%
  group_by(yrbrn, Syear) %>%
  summarise(meanLR = mean(leftrigh)) %>%
  spread(Syear, meanLR)
Produces the following:
# A tibble: 76 x 12
# Groups: yrbrn [76]
yrbrn `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1920 3.41 1.68 2.26 2.66 3.21 2.59 2.24 2.39 2.41 2.55 3.28
2 1921 2.43 2.71 2.74 2.32 2.24 1.89 2.85 3.27 2.53 1.82 2.65
3 1922 2.28 3.02 1.39 2.33 3.25 2.09 2.35 1.83 2.09 2.57 1.95
4 1923 3.53 3.72 2.87 2.05 2.94 1.99 2.8 2.88 2.62 3.14 2.28
5 1924 1.77 2.17 2.71 2.18 2.71 2.34 2.29 1.94 2.7 2.1 1.87
6 1925 1.83 3.01 2.48 2.54 2.74 2.11 2.35 2.65 2.57 1.82 2.39
7 1926 2.43 3.2 2.53 2.64 2.12 2.71 1.49 2.28 2.4 2.73 2.18
8 1927 1.33 2.83 2.26 2.82 2.34 2.09 2.3 2.66 3.09 2.2 2.27
9 1928 2.34 2.02 2.1 2.88 2.14 2.44 2.58 1.67 2.57 3.11 2.93
10 1929 2.31 2.29 2.93 2.08 2.11 2.47 2.39 1.76 3.09 3 2.9
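If you specifically want age at the time of the survey down the rows (as in the skeleton table in the question), the same idea works once age is derived. A minimal sketch, assuming age can be computed as Syear - yrbrn and using pivot_wider() (the newer replacement for spread()):

df1 %>%
  mutate(age = Syear - yrbrn) %>%                        # age at the time of the survey
  group_by(age, Syear) %>%
  summarise(meanLR = mean(leftrigh), .groups = "drop") %>%
  pivot_wider(names_from = Syear, values_from = meanLR) %>%
  arrange(age)

Each cohort then runs along a diagonal of this age x survey-year table.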
I am looking to calculate the TDO3 value at every date during the year 2020. I have interpolated data sets of both temperature and dissolved oxygen in 0.25 meter increments from 1m - 22m below the surface between the dates of Jan-1-2020 and Dec-31-2020.
TDO3 is the temperature when dissolved oxygen is 3 mg/L. Below are snips of the merged data set.
> print(do_temp, n=85)
# A tibble: 31,110 x 4
date depth mean_temp mean_do
<date> <dbl> <dbl> <dbl>
1 2020-01-01 1 2.12 11.6
2 2020-01-01 1.25 2.19 11.5
3 2020-01-01 1.5 2.27 11.4
4 2020-01-01 1.75 2.34 11.3
5 2020-01-01 2 2.42 11.2
6 2020-01-01 2.25 2.40 11.2
7 2020-01-01 2.5 2.39 11.1
8 2020-01-01 2.75 2.38 11.1
9 2020-01-01 3 2.37 11.0
10 2020-01-01 3.25 2.41 11.0
11 2020-01-01 3.5 2.46 11.0
12 2020-01-01 3.75 2.50 10.9
13 2020-01-01 4 2.55 10.9
14 2020-01-01 4.25 2.54 10.9
15 2020-01-01 4.5 2.53 10.9
16 2020-01-01 4.75 2.52 11.0
17 2020-01-01 5 2.51 11.0
18 2020-01-01 5.25 2.50 11.0
19 2020-01-01 5.5 2.49 11.0
20 2020-01-01 5.75 2.49 11.1
21 2020-01-01 6 2.48 11.1
22 2020-01-01 6.25 2.49 10.9
23 2020-01-01 6.5 2.51 10.8
24 2020-01-01 6.75 2.52 10.7
25 2020-01-01 7 2.54 10.5
26 2020-01-01 7.25 2.55 10.4
27 2020-01-01 7.5 2.57 10.2
28 2020-01-01 7.75 2.58 10.1
29 2020-01-01 8 2.60 9.95
30 2020-01-01 8.25 2.63 10.1
31 2020-01-01 8.5 2.65 10.2
32 2020-01-01 8.75 2.68 10.3
33 2020-01-01 9 2.71 10.5
34 2020-01-01 9.25 2.69 10.6
35 2020-01-01 9.5 2.67 10.7
36 2020-01-01 9.75 2.65 10.9
37 2020-01-01 10 2.63 11.0
38 2020-01-01 10.2 2.65 10.8
39 2020-01-01 10.5 2.67 10.6
40 2020-01-01 10.8 2.69 10.3
41 2020-01-01 11 2.72 10.1
42 2020-01-01 11.2 2.75 9.89
43 2020-01-01 11.5 2.78 9.67
44 2020-01-01 11.8 2.81 9.44
45 2020-01-01 12 2.84 9.22
46 2020-01-01 12.2 2.83 9.39
47 2020-01-01 12.5 2.81 9.56
48 2020-01-01 12.8 2.80 9.74
49 2020-01-01 13 2.79 9.91
50 2020-01-01 13.2 2.80 10.1
51 2020-01-01 13.5 2.81 10.3
52 2020-01-01 13.8 2.82 10.4
53 2020-01-01 14 2.83 10.6
54 2020-01-01 14.2 2.86 10.5
55 2020-01-01 14.5 2.88 10.4
56 2020-01-01 14.8 2.91 10.2
57 2020-01-01 15 2.94 10.1
58 2020-01-01 15.2 2.95 10.0
59 2020-01-01 15.5 2.96 9.88
60 2020-01-01 15.8 2.97 9.76
61 2020-01-01 16 2.98 9.65
62 2020-01-01 16.2 2.99 9.53
63 2020-01-01 16.5 3.00 9.41
64 2020-01-01 16.8 3.01 9.30
65 2020-01-01 17 3.03 9.18
66 2020-01-01 17.2 3.05 9.06
67 2020-01-01 17.5 3.07 8.95
68 2020-01-01 17.8 3.09 8.83
69 2020-01-01 18 3.11 8.71
70 2020-01-01 18.2 3.13 8.47
71 2020-01-01 18.5 3.14 8.23
72 2020-01-01 18.8 3.16 7.98
73 2020-01-01 19 3.18 7.74
74 2020-01-01 19.2 3.18 7.50
75 2020-01-01 19.5 3.18 7.25
76 2020-01-01 19.8 3.18 7.01
77 2020-01-01 20 3.18 6.77
78 2020-01-01 20.2 3.18 5.94
79 2020-01-01 20.5 3.18 5.10
80 2020-01-01 20.8 3.18 4.27
81 2020-01-01 21 3.18 3.43
82 2020-01-01 21.2 3.22 2.60
83 2020-01-01 21.5 3.25 1.77
84 2020-01-01 21.8 3.29 0.934
85 2020-01-01 22 3.32 0.100
# ... with 31,025 more rows
https://github.com/TRobin82/WaterQuality
The above link will get you to the raw data.
What I am looking for is a data frame that looks like this, but with 366 rows, one for each date during the year.
> TDO3
dates tdo3
1 2020-1-1 3.183500
2 2020-2-1 3.341188
3 2020-3-1 3.338625
4 2020-4-1 3.437000
5 2020-5-1 4.453310
6 2020-6-1 5.887560
7 2020-7-1 6.673700
8 2020-8-1 7.825672
9 2020-9-1 8.861190
10 2020-10-1 11.007972
11 2020-11-1 7.136880
12 2020-12-1 2.752500
However, a DO value of exactly 3 mg/L is not found in the interpolated DO data, so I would need the function to find the closest value to 3 without going below it, then match the depth of that value against the temperature data to assign the proper temperature at that depth.
I am assuming the best route is a for-loop, but I'm not sold on the proper way to go about it.
Here's one way of doing it with tidyverse-style functions. Note that this code is reproducible: anyone can run it and should get the same answer. It's great that you showed us your data, but it's even better to post the output of dput(), because then people can load the data and start helping you immediately.
This code does the following:
Loads the data from the link you provided (there were several data files, so I had to guess which one you meant).
Groups the observations by date.
Puts the observations in increasing order of mean_do.
Removes rows where mean_do is strictly less than 3.
Takes the first ordered observation for each date (this will be the one with the lowest value of mean_do that is greater than or equal to 3).
Renames the column mean_temp to tdo3, since it's the temperature for that date at which the dissolved oxygen level was closest to 3 mg/L.
library(tidyverse)

do_temp <- read_csv("https://raw.githubusercontent.com/TRobin82/WaterQuality/main/DateDepthTempDo.csv") %>%
  select(-X1)

do_temp %>%
  group_by(date) %>%
  arrange(mean_do) %>%
  filter(mean_do >= 3) %>%   # drop values strictly below 3 mg/L
  slice_head(n = 1) %>%      # lowest remaining DO value per date
  rename(tdo3 = mean_temp) %>%
  select(date, tdo3)
Here are the results. They're a bit different from the ones you posted, so I'm not sure if I've misunderstood you or if those were just illustrative and not real results.
# A tibble: 366 x 2
# Groups: date [366]
date tdo3
<date> <dbl>
1 2020-01-01 3.18
2 2020-01-02 3.18
3 2020-01-03 3.19
4 2020-01-04 3.21
5 2020-01-05 3.21
6 2020-01-06 3.21
7 2020-01-07 3.24
8 2020-01-08 3.28
9 2020-01-09 3.27
10 2020-01-10 3.28
# ... with 356 more rows
Let me know if you were looking for something else.
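For what it's worth, a slightly more compact variant of the same idea uses slice_min(), which avoids the explicit arrange(). This is just a sketch (untested, assuming the same do_temp data frame):

do_temp %>%
  group_by(date) %>%
  filter(mean_do >= 3) %>%                           # keep only DO values at or above 3 mg/L
  slice_min(mean_do, n = 1, with_ties = FALSE) %>%   # lowest of those per date
  select(date, tdo3 = mean_temp)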
I've got a tibble that I'm struggling to turn into a tsibble.
# A tibble: 13 x 8
year `Administration, E~ `All Staff` `Ambulance staff` `Healthcare Assi~ `Medical and De~ `Nursing, Midwife~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2009 3.97 5.08 7.16 6.94 1.36 6.19
2 2010 4.12 5.07 6.89 7.02 1.41 6.02
3 2011 4.06 5.03 6.69 7.06 1.36 6.02
4 2012 4.40 5.40 7.79 7.48 1.52 6.44
5 2013 4.28 5.35 8.19 7.46 1.48 6.44
6 2014 4.45 5.56 8.87 7.82 1.53 6.67
7 2015 4.30 5.29 6.86 7.54 1.44 6.30
8 2016 4.21 5.15 7.56 7.15 1.66 6.17
9 2017 4.33 5.13 7.32 7.20 1.69 6.04
10 2018 4.58 5.30 7.96 7.00 1.73 6.38
11 2019 4.71 5.52 7.66 7.96 1.94 6.65
12 2020 4.69 5.98 7.49 8.37 2.11 7.56
13 2021 4.19 5.72 9.62 8.47 1.71 7.29
# ... with 1 more variable: Scientific, Therapeutic and Technical staff <dbl>
How would I turn this into a tsibble so that I can plot graphs with ggplot2?
When trying as_tsibble()
absence_ts <- as_tsibble(absence, key = absence$All Staff, index = absence$year)
it comes up with the following error:
Error: Must subset columns with a valid subscript vector. x Can't convert from <double> to <integer> due to loss of precision.
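In case it helps, here is a rough sketch of one likely fix (untested against your data, and assuming absence is the tibble shown above). The key and index arguments expect column names rather than vectors pulled out with $, year needs converting from character to integer, and for ggplot2 it is usually easier to pivot the staff-group columns into long form first so each group becomes a key:

library(dplyr)
library(tidyr)
library(tsibble)
library(ggplot2)

absence_ts <- absence %>%
  mutate(year = as.integer(year)) %>%                               # year is currently <chr>
  pivot_longer(-year, names_to = "staff_group", values_to = "rate") %>%
  as_tsibble(key = staff_group, index = year)

ggplot(absence_ts, aes(year, rate, colour = staff_group)) +
  geom_line()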
I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NA's because the company does not exist (until later dates), or the company folded.
What I want to do is select the columns that have no NAs. I use data.table because it is faster. Here is my working code:
library(data.table)
library(magrittr)  # for %>% and not()

example <- fread(file = "example.csv", key = "date")
example_select <- example[,
  lapply(.SD, function(x) not(sum(is.na(x)) > 0))
] %>%
  as.logical(.)
example[, ..example_select]
Is there a better (shorter) way to do the same thing? Thank you!
Try:
example[,lapply(.SD, function(x) {if(anyNA(x)) {NULL} else {x}} )]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column)))
    return(NULL)
  else
    return(column)
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter:
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15
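Another compact option, if you prefer to stay in the tidyverse, is select() with a where() predicate. A minimal sketch, assuming the file has already been read into example as above:

library(dplyr)

example %>%
  select(where(~ !any(is.na(.x))))   # keep only columns with no NAs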
I am struggling with some data munging. To get to the table below, I used group_by and summarise_at to find the means of Q1-Q10 by cid and time (I started with multiple values for each cid at each time point), then filtered down to just the cids that appear at both time 1 and time 2. Using this (or going back to my raw data if there is a cleaner way), I want to count, for each cid, how many of the means of Q1-Q10 increased at time 2, and then, for each GROUP, find the mean number of increases.
GROUP cid time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
A 169 1 4.45 4.09 3.91 3.73 3.82 4.27 3.55 4 4.55 3.91
A 169 2 4.56 4.15 4.06 3.94 4.09 4.53 3.91 3.97 4.12 4.21
A 184 1 4.64 4.18 3.45 3.64 3.82 4.55 3.91 4.27 4 3.55
A 184 2 3.9 3.6 3 3.6 3.4 3.9 3 3.5 3.2 3.1
B 277 1 4.43 4.21 3.64 4.36 4.36 4.57 4.36 4.29 4.07 4.07
B 277 2 4.11 4 3.56 3.44 3.67 4 3.89 3.78 3.44 3.89
...
I have seen examples using spread on the iris data, but those were for the difference in a single variable. Any help appreciated.
Try this. It gives you, for each GROUP and Q, the share of cids whose mean increased at time 2:
df <- read.table(text = "GROUP cid time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
A 169 1 4.45 4.09 3.91 3.73 3.82 4.27 3.55 4 4.55 3.91
A 169 2 4.56 4.15 4.06 3.94 4.09 4.53 3.91 3.97 4.12 4.21
A 184 1 4.64 4.18 3.45 3.64 3.82 4.55 3.91 4.27 4 3.55
A 184 2 3.9 3.6 3 3.6 3.4 3.9 3 3.5 3.2 3.1
B 277 1 4.43 4.21 3.64 4.36 4.36 4.57 4.36 4.29 4.07 4.07
B 277 2 4.11 4 3.56 3.44 3.67 4 3.89 3.78 3.44 3.89", header = TRUE)
library(dplyr)
library(tidyr)
df %>%
  # Convert to long format
  pivot_longer(-c(GROUP, cid, time), names_to = "Q") %>%
  # Group by GROUP, cid, Q
  group_by(GROUP, cid, Q) %>%
  # Just in case: sort by time
  arrange(time) %>%
  # Increased at time 2, using lag
  mutate(is_increase = value > lag(value)) %>%
  # Mean increase by GROUP and Q
  group_by(GROUP, Q) %>%
  summarise(mean_inc = mean(is_increase, na.rm = TRUE))
#> # A tibble: 20 x 3
#> # Groups: GROUP [2]
#> GROUP Q mean_inc
#> <fct> <chr> <dbl>
#> 1 A Q1 0.5
#> 2 A Q10 0.5
#> 3 A Q2 0.5
#> 4 A Q3 0.5
#> 5 A Q4 0.5
#> 6 A Q5 0.5
#> 7 A Q6 0.5
#> 8 A Q7 0.5
#> 9 A Q8 0
#> 10 A Q9 0
#> 11 B Q1 0
#> 12 B Q10 0
#> 13 B Q2 0
#> 14 B Q3 0
#> 15 B Q4 0
#> 16 B Q5 0
#> 17 B Q6 0
#> 18 B Q7 0
#> 19 B Q8 0
#> 20 B Q9 0
Created on 2020-04-12 by the reprex package (v0.3.0)
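If you also want what the question literally asks for (the number of Qs that increased per cid, and then the mean of those counts per GROUP), a small extension of the same pipeline might look like the sketch below, reusing df from above:

df %>%
  pivot_longer(-c(GROUP, cid, time), names_to = "Q") %>%
  group_by(GROUP, cid, Q) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(is_increase = value > lag(value)) %>%
  # Count how many Qs increased at time 2, per cid
  group_by(GROUP, cid) %>%
  summarise(n_increases = sum(is_increase, na.rm = TRUE), .groups = "drop") %>%
  # Average number of increases per GROUP
  group_by(GROUP) %>%
  summarise(mean_n_increases = mean(n_increases))

For the sample rows shown, this should work out to an average of 4 increases for GROUP A (cid 169 increased on 8 Qs, cid 184 on none) and 0 for GROUP B.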
Return row values when a certain number of columns reach a certain value, from the following table:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As output, I need a data frame containing, for each row, the percentage of columns V1-V10 that reach the value of interest (5 in this example):
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(1:dim(mat)[1], rowSums(mat > 5) / dim(mat)[2] * 100)
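If you want the result as a data frame with the column names from your desired output, a small wrapper around the same expression might look like this (a sketch, assuming the table above is a numeric matrix or all-numeric data frame named mat):

res <- data.frame(Rownum  = seq_len(nrow(mat)),
                  Percent = rowSums(mat > 5) / ncol(mat) * 100)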
If your data only ever contains 0s and 1s across ten columns, I would simply multiply the whole dataset by 10 (which gives percentage values in this case). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
Ok, so now I believe you want to get the percentage of values in each row that meet some threshold criteria. You give the example > 5. One solution of many is using apply:
apply( df , 1 , function(x) sum( x > 5 )/length(x)*100 )
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
@Thomas' solution will be faster for large data.frames because it converts to a matrix first, and these are faster to operate on.