My data looks like this:
set.seed(1234)
library(tidyverse)
df <- data.frame(Time = c(1,1,2,2,3,3),
Region = c("A", "B", "A", "B", "A", "B"),
Age_1 = round(rnorm(6, mean = 10),0),
Age_2 = round(rnorm(6, mean = 10),0),
Age_3 = round(rnorm(6, mean = 10),0),
Age_4 = round(rnorm(6, mean = 10),0),
Age_5 = round(rnorm(6, mean = 10),0))
I need to generate ratios of population change for each region and point in time. For instance, Ratio_2 for Time == 2 would be Age_2 (at Time == 2) / Age_1 (at Time == 1), grouped by Region. I could do this manually by typing:
df %>%
group_by(Region) %>%
mutate(Ratio_2 = Age_2 / dplyr::lag(Age_1, order_by = Time),
Ratio_3 = Age_3 / dplyr::lag(Age_2, order_by = Time),
Ratio_4 = Age_4 / dplyr::lag(Age_3, order_by = Time),
Ratio_5 = Age_5 / dplyr::lag(Age_4, order_by = Time))
# A tibble: 6 x 11
# Groups: Region [2]
Time Region Age_1 Age_2 Age_3 Age_4 Age_5 Ratio_2 Ratio_3 Ratio_4 Ratio_5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 11 8 9 9 10 NA NA NA NA
2 1 B 10 9 10 10 11 NA NA NA NA
3 2 A 9 10 9 8 12 0.909 1.12 0.889 1.33
4 2 B 9 10 9 9 9 1 1 0.9 0.9
5 3 A 8 11 9 9 12 1.22 0.9 1 1.5
6 3 B 9 9 9 9 9 1 0.9 1 1
Since my original data has lots of age groups, this procedure involves lots of manual coding. A programmatic solution in my mind could look something like this:
df %>%
group_by(Region) %>%
mutate(across(4:7, ~ . / dplyr::lag(.[?], order_by = Time), .names="Ratio_{.col}"))
The part containing dplyr::lag(.[?]) needs to reference the previous column in the data frame relative to . but I haven't found a method for doing so.
Note: This question is related to a post from yesterday, in which I was trying to solve the problem at hand with the data being in long format. Doing it in wide format is a different question though, which is why I opened this question.
Here is one option with across
library(dplyr)
library(stringr)
df %>%
group_by(Region) %>%
mutate(across(matches('^Age_[2-5]$'),
~ ./lag(get(str_replace(cur_column(), '\\d+',
as.character(readr::parse_number(cur_column())-1))), order_by = Time ),
.names = "Ratio_{.col}" )) %>%
ungroup
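Or it can be done in a simplified way with map2 on the column pairs
library(purrr)
# lag within each Region; assumes rows are sorted by Time within each Region
df[str_c('Ratio_', 2:5)] <- map2(df[4:7], df[3:6],
~ .x / ave(.y, df$Region, FUN = dplyr::lag))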
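For comparison, the same "divide each column by the lagged previous column, within Region" idea can be sketched in base R. This is only an illustration under stated assumptions: the values are copied from the table printed in the question rather than regenerated with rnorm, the helper name lag_by_region is made up for this example, and the rows are assumed to be sorted by Time within each Region.

```r
# Small fixed example (values copied from the printed table in the question)
df <- data.frame(Time   = c(1, 1, 2, 2, 3, 3),
                 Region = c("A", "B", "A", "B", "A", "B"),
                 Age_1  = c(11, 10, 9, 9, 8, 9),
                 Age_2  = c(8, 9, 10, 10, 11, 9),
                 Age_3  = c(9, 10, 9, 9, 9, 9))

# Hypothetical helper: shift a column down by one row within each Region.
# Assumes rows are already sorted by Time within each Region.
lag_by_region <- function(x) {
  ave(x, df$Region, FUN = function(v) c(NA, head(v, -1)))
}

age_cols <- grep("^Age_", names(df), value = TRUE)

# For every age column except the first, divide by the lagged previous column
for (i in seq_along(age_cols)[-1]) {
  ratio_name <- sub("^Age_", "Ratio_", age_cols[i])
  df[[ratio_name]] <- df[[age_cols[i]]] / lag_by_region(df[[age_cols[i - 1]]])
}

df
```

Because the loop walks over column names by position, it relies on the Age_ columns appearing in increasing order in the data frame.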
Related
I have a dataset that looks like this:
location = rep(c("A", "B", "C", "D"),
times = c(4, 6, 3, 7))
ID = (1:20)
Var1 = rep(c(0,2,1,1,0), times = 4)
Var2 = rep(c(2,1,1,0,2), times = 4)
Var3 = rep(c(1,1,0,2,0), times = 4)
df=as.data.frame(cbind(location, ID, Var1, Var2, Var3))
There are different locations where we evaluated variables with three levels each (score 0, 1, 2). Now I would like to get a result that contains the proportions of each score by location. The number of individuals examined (ID) is not the same at each location.
So what I did was making functions to use with lapply:
score0 = function(a){sum(a==0)}
score1 = function(a){sum(a==1)}
score2 = function(a){sum(a==2)}
And I tried this, as well as many other things:
df %>%
group_by(location) %>%
lapply(FUN = score0)
But it doesn't work. Again, what I would like to get is a data frame with the proportions of each score or level (0, 1, 2) per location. Or at least the number of occurrences of each score, so I can divide it by the number of individuals per location.
I hope this makes sense.
I also checked this question Calculate proportions of categories within groups but cannot apply the solution to my data with multiple variables.
Thanks for your help!
Something like this?
df %>%
pivot_longer(starts_with("Var"), values_to = "score") %>%
type_convert() %>%
group_by(location) %>%
count(score) %>%
mutate(frac = n / sum(n))
resulting in
# A tibble: 12 × 4
# Groups: location [4]
location score n frac
<chr> <dbl> <int> <dbl>
1 A 0 3 0.25
2 A 1 6 0.5
3 A 2 3 0.25
4 B 0 7 0.389
Thank you, @danloo, this was almost it. What I wanted was this:
df %>%
pivot_longer(starts_with("Var"), values_to = "score") %>%
type_convert() %>%
group_by(location, name) %>%
count(score) %>%
mutate(frac = n / sum(n))
with the following result:
# A tibble: 36 × 5
# Groups: location, name [12]
location name score n frac
<chr> <chr> <dbl> <int> <dbl>
1 A Var1 0 2 0.4
2 A Var1 1 2 0.4
3 A Var1 2 1 0.2
4 A Var2 0 1 0.2
5 A Var2 1 2 0.4
6 A Var2 2 2 0.4
7 A Var3 0 2 0.4
8 A Var3 1 2 0.4
9 A Var3 2 1 0.2
10 B Var1 0 2 0.4
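As a cross-check on the `frac = n / sum(n)` step, the per-location proportions for a single variable can also be sketched in base R with table and prop.table. This is a minimal illustration, assuming the location and Var1 vectors exactly as defined in the question; it covers only Var1 and is not a replacement for the pivot_longer approach.

```r
# Vectors as defined in the question
location <- rep(c("A", "B", "C", "D"), times = c(4, 6, 3, 7))
Var1 <- rep(c(0, 2, 1, 1, 0), times = 4)

# Counts of each score per location
counts <- table(location, Var1)

# Row-wise proportions: each count divided by its location's total
props <- prop.table(counts, margin = 1)

props
```

Each row of props sums to 1, which is the same normalisation that `n / sum(n)` performs inside each group.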
I have data set like this:
df <- data.frame( ID = c("A","A","A","B","B","B","C","C","C"),
levels = c( "Y", "R", "O","Y", "R", "O","Y", "R", "O" ),
Counts=c(5,1,5,10,2,1,3,5,8))
ID levels Counts
A Y 5
A R 1
A O 5
B Y 10
B R 2
B O 1
C Y 3
C R 5
C O 8
I want to create another column that has a percentage of the second column(levels) like this formula
freq=(Y+O/Y+O+R)*100
So now the data frame should look like this :
ID freq
A 0.1
B 0.2
C 0.3
I tried a couple of solutions but it did not work can you please help me?
Using pivot_wider
df %>%
pivot_wider(id_cols = ID, values_from = Counts, names_from = levels) %>%
mutate(freq = (Y+O/Y+O+R)*100,
freq. = (Y+O)/(Y+O+R)*100) # %>% select(-Y, -R, -O)
ID Y R O freq freq.
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 5 1 5 1200 90.9
2 B 10 2 1 1310 84.6
3 C 3 5 8 1867. 68.8
I'm not sure what your formula is meant to compute, so both readings are shown above.
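The difference between the two columns comes down to operator precedence: division binds more tightly than addition, so without parentheses only O/Y is divided. A small sketch using the counts for ID A from the question (Y = 5, O = 5, R = 1); the variable names as_typed and grouped are just labels for this example:

```r
# Counts for ID "A" from the question
Y <- 5; O <- 5; R <- 1

# As typed in the question: evaluated as Y + (O / Y) + O + R
as_typed <- (Y + O / Y + O + R) * 100   # 1200

# With the numerator and denominator grouped explicitly
grouped <- (Y + O) / (Y + O + R) * 100  # about 90.9

c(as_typed = as_typed, grouped = grouped)
```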
You may try using match -
library(dplyr)
df %>%
group_by(ID) %>%
summarise(freq = (Counts[match('Y', levels)] + Counts[match('O', levels)])/sum(Counts))
# ID freq
# <chr> <dbl>
#1 A 0.909
#2 B 0.846
#3 C 0.688
I have a data frame:
df = tibble(a=c(7,6,10,12,12), b=c(3,5,8,8,7), c=c(4,4,12,15,20), week=c(1,2,3,4,5))
# A tibble: 5 x 4
a b c week
<dbl> <dbl> <dbl> <dbl>
1 7 3 4 1
2 6 5 4 2
3 10 8 12 3
4 12 8 15 4
5 12 7 20 5
and I want, for each of the columns a, b and c, the first week in which the observation equals or exceeds 10.
I.e. for column a it would be week 3, for column b it would be NA, and for column c it would be week 3 as well.
A desired outcome could look like this:
tibble(abc=c("a", "b", "c"), value=c(10, NA, 12), week=c(3, NA, 3))
# A tibble: 3 x 3
abc value week
<chr> <dbl> <dbl>
1 a 10 3
2 b NA NA
3 c 12 3
One way would be to get the data in long format and, for each column name, select the first value that is greater than or equal to 10. We fill the missing combinations with complete.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -week, names_to = 'abc') %>%
group_by(abc) %>%
slice(which(value >= 10)[1]) %>%
ungroup %>%
complete(abc = names(df)[-4])
# A tibble: 3 x 3
# abc week value
# <chr> <dbl> <dbl>
#1 a 3 10
#2 b NA NA
#3 c 3 12
Another way is to first calculate what we want and then transform the dataset into long format.
df %>%
summarise(across(a:c, list(week = ~week[which(. >= 10)[1]],
value = ~.[. >= 10][1]))) %>%
pivot_longer(cols = everything(),
names_to = c('abc', '.value'),
names_sep = "_")
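The core step in both approaches is `which(x >= 10)[1]`, which returns the position of the first match or NA when there is none. A base R sketch of the same logic, assuming the data from the question stored as a plain data frame; the names first_hit and res are made up for this example:

```r
df <- data.frame(a = c(7, 6, 10, 12, 12),
                 b = c(3, 5, 8, 8, 7),
                 c = c(4, 4, 12, 15, 20),
                 week = c(1, 2, 3, 4, 5))

value_cols <- c("a", "b", "c")

# Row position of the first observation >= 10 in each column (NA if none)
first_hit <- sapply(df[value_cols], function(x) which(x >= 10)[1])

# Indexing with NA yields NA, so column b falls through to NA automatically
res <- data.frame(
  abc   = value_cols,
  value = unname(mapply(function(col, i) df[[col]][i], value_cols, first_hit)),
  week  = df$week[first_hit]
)

res
```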
How do I convert the dataframe?
Before:
set.seed(1)
df <- data.frame( n = rpois(16, 2),
year = rep(2011, 16),
month = rep(seq(1,4,1), times = rep(4,4)))
After:
df1 <- data.frame( n = c(8,11,4,9),
year = rep(2011, 4),
month = rep(seq(1,4,1)))
I think that what you want is this, using dplyr:
library(dplyr)
df %>%
group_by(year, month) %>%
summarise(n = sum(n))
# A tibble: 4 x 3
# Groups: year [1]
year month n
<dbl> <dbl> <int>
1 2011 1 8
2 2011 2 11
3 2011 3 4
4 2011 4 9
Using base R with aggregate
aggregate(n ~ ., df, sum)
# year month n
#1 2011 1 8
#2 2011 2 11
#3 2011 3 4
#4 2011 4 9
I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual), but contain different data. I want to use the identifier to move the rows that share an identifier into columns. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R, and have tried the reshape (cast) package, but I can't really follow it, and am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
1 3 1 5 1 6 1 7 1
2 1 2 12 2 5 2
3 3 1
UPDATE 2:
So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast (x, ID ~ time, value.var = "value")
and here is the output using that code:
ID A B C D E F G H (for clarity I have labelled these)
1 3 5 6 7 1 1 1 1
2 1 12 5 2 2 2
3 3 1
So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example column D has the value '7' for person 1 (age4), but '2' for person 2 (Sex). I can see that my code is not instructing where the numerical values should be cast to, but I do not know how to code that part. Any ideas?
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Age2 Age3 Age4 Sex1 Sex2 Sex3 Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 5 6 7 1 1 1 1
#> 2 2 1 12 5 2 2 2
#> 3 3 3 1
If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I recently came across a similar issue in my data and wanted to provide an update using the tidyr 1.0 functions, as gather and spread have been retired. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr. I hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA
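The key to both answers is the within-ID counter (`grp = 1:n()`), which tells the reshaping step which repeat each row belongs to. The same idea can be sketched in base R with ave and reshape. This is an illustration only, assuming the example data from the question stored as a plain data frame; the name wide is made up for this example.

```r
x <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 2, 3),
                Age = c(3, 5, 6, 7, 1, 12, 5, 3),
                Sex = c(1, 1, 1, 1, 2, 2, 2, 1))

# Number the repeated rows within each ID (1, 2, 3, ... per individual)
x$grp <- ave(x$ID, x$ID, FUN = seq_along)

# One row per ID; Age and Sex are spread into Age1, Sex1, Age2, Sex2, ...
wide <- reshape(x, idvar = "ID", timevar = "grp",
                direction = "wide", sep = "")

wide
```

Computing the counter before reshaping, per ID rather than per ID and variable after melting, is what keeps the Age and Sex values from landing in the same column.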