Reorganizing a dataframe so it can be graphed (Change column into a row 'category'?) - r

I have a dataframe like
year lwg_pred bon_pred lwg_prey bon_prey
1990       10        5       20       30
1991        1        2        3        4
1992        5        6        7        8
1993        9       10       11       12
1994       13       14       15       16
This is quite difficult to barplot as is, so I was trying to reorganize it so it would be like
year location type number
1990 lwg      pred     10
1990 bon      pred      5
1990 lwg      prey     20
1990 bon      prey     30
The end goal is to have a barplot that has x-axis as Location, y-axis with the number, fill with the type and faceted by the year.
I took a look at a previous post and tried
barplot_df3 <- barplot_df3 |>
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("location", ".value"))
but I couldn't figure out how to keep year, nor how to make it 'categorize' pred/prey.
I got this instead:
location pred prey
lwg        10   20
...
I also took a look at transpose, but I don't think it will be of much help.
I would appreciate suggestions on how to reorganize the data so it can be graphed, or on a way to graph it in its original form that I may have missed. Thank you for your help in advance!

You don't need the magic ".value"; you can just use the new column names you want.
library(tidyr)
barplot_df3 %>%
  pivot_longer(-year, names_to = c("location", "type"), names_sep = "_")
will give
year location type value
<int> <chr> <chr> <int>
1 1990 lwg pred 10
2 1990 bon pred 5
3 1990 lwg prey 20
4 1990 bon prey 30
5 1991 lwg pred 1
6 1991 bon pred 2
7 1991 lwg prey 3
8 1991 bon prey 4
9 1992 lwg pred 5
10 1992 bon pred 6
11 1992 lwg prey 7
12 1992 bon prey 8
13 1993 lwg pred 9
14 1993 bon pred 10
15 1993 lwg prey 11
16 1993 bon prey 12
17 1994 lwg pred 13
18 1994 bon pred 14
19 1994 lwg prey 15
20 1994 bon prey 16
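With the data in this long format, the faceted barplot described in the question can be drawn with ggplot2 along these lines (a sketch; long_df is just a placeholder name for the pivoted result):
library(tidyr)
library(ggplot2)

long_df <- barplot_df3 |>
  pivot_longer(-year, names_to = c("location", "type"), names_sep = "_")

ggplot(long_df, aes(x = location, y = value, fill = type)) +
  geom_col(position = "dodge") +  # side-by-side bars for pred/prey
  facet_wrap(~ year)
If you want the value column to be called number as in the question, add values_to = "number" to pivot_longer() and use y = number instead.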

Related

Is it possible to make groups based on an ID of a person in R?

I have this data:
data <- data.frame(
  id_pers = c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102, 56102,
              73102, 74103, 103104, 117103, 117104, 117105),
  birthyear = c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000, 1994,
                1999, 1978, 1986, 1998, 1999)
)
I want to group the persons into families in a new column, so that persons 27101 and 27102 (siblings) are group/family 1, 42101 and 42102 are group 2, 117103, 117104 and 117105 are group 3, and so on.
Person 4102 has no siblings and should get NA in the new column.
Two or more persons are always siblings if their IDs are no more than 6 apart.
My actual dataset has over 3000 rows, so what is the most efficient way to do this?
You can use round with digits = -1 (or -2 if a family can have more than 10 members). If you want the group id to be an integer starting from 1, you can use cur_group_id():
library(dplyr)
data %>%
  group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
  mutate(fam_gp = cur_group_id())
output
# A tibble: 15 × 3
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
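To see what the rounding trick is doing, here is the key it produces for a few neighbouring IDs (digits = -1 rounds to the nearest 10; subtracting 5 first keeps member codes like xxx01...xxx06 on the same multiple of 10):
round(c(27101, 27102, 28101) - 5, digits = -1)
# [1] 27100 27100 28100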
It looks like we can use the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
  mutate(
    famgroup = trunc(id_pers / 1000),
    famgroup = match(famgroup, unique(famgroup))
  )
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
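Neither output sets persons without siblings (such as 4102) to NA, as the question asks. A possible extra step, sketched here on top of the trunc() grouping (the intermediate column name fam is just illustrative):
library(dplyr)

data %>%
  mutate(fam = trunc(id_pers / 1000),
         famgroup = match(fam, unique(fam))) %>%
  group_by(fam) %>%
  mutate(famgroup = if (n() > 1) famgroup else NA_integer_) %>%  # singletons get NA
  ungroup() %>%
  select(-fam)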

R - manipulating time series data

I have a time-series dataset in long format with yearly values over 30 years for >200,000 study units. All units start as 'healthy==1' and can transition to 3 other classes: 'exposed==2', 'infected==3' and 'recover==4'; some units also remain 'healthy' throughout the time series.
I would like to collapse the dataset, while keeping all 30 years for each unit, to just 'healthy==1' and 'infected==3'. That is, I would classify 'exposed==2' as 'healthy==1', and the first time a 'healthy' unit becomes 'infected==3' it should remain infected for the rest of the time series, even if it later changes state again (e.g. it recovers ('recover==4') and gets infected again).
Healthy units that never transition to another class will remain classified as healthy throughout the time series.
I am a bit stumped on how to code this in R; any ideas would be greatly appreciated.
Below is an example of the dataset for two units: one remains healthy throughout the time series and the other has multiple transitions.
UID annual_change_val year
1 control1 1 1990
4 control1 1 1991
5 control1 1 1992
7 control1 1 1993
9 control1 1 1994
12 control1 1 1995
13 control1 1 1996
16 control1 1 1997
18 control1 1 1998
20 control1 1 1999
22 control1 1 2000
24 control1 1 2001
26 control1 1 2002
28 control1 1 2003
30 control1 1 2004
31 control1 1 2005
33 control1 1 2006
35 control1 1 2007
38 control1 1 2008
40 control1 1 2009
42 control1 1 2010
44 control1 1 2011
46 control1 1 2012
48 control1 1 2013
50 control1 1 2014
52 control1 1 2015
53 control1 1 2016
55 control1 1 2017
57 control1 1 2018
59 control1 1 2019
61 control1 1 2020
2 control64167 1 1990
3 control64167 1 1991
6 control64167 1 1992
8 control64167 2 1993
10 control64167 2 1994
11 control64167 2 1995
14 control64167 2 1996
15 control64167 2 1997
17 control64167 3 1998
19 control64167 3 1999
21 control64167 4 2000
23 control64167 4 2001
25 control64167 4 2002
27 control64167 4 2003
29 control64167 3 2004
32 control64167 4 2005
34 control64167 4 2006
36 control64167 4 2007
37 control64167 4 2008
39 control64167 4 2009
41 control64167 4 2010
43 control64167 4 2011
45 control64167 4 2012
47 control64167 4 2013
49 control64167 4 2014
51 control64167 4 2015
54 control64167 4 2016
56 control64167 4 2017
58 control64167 4 2018
60 control64167 4 2019
62 control64167 4 2020
If for some reason you only want to use base R,
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
Update, based on comment/clarification.
Here, I replace the values as before, and then create a temporary variable first_inf that holds the first year in which the UID was "infected" (status == 3). I then set the status to 3 for every year from that year onwards (within UID); units that are never infected are left unchanged.
library(dplyr)

d %>%
  mutate(status = if_else(annual_change_val %in% c(1, 2), 1, 3)) %>%
  group_by(UID) %>%
  mutate(first_inf = suppressWarnings(min(year[status == 3])),  # Inf if never infected
         status = if_else(is.finite(first_inf) & year >= first_inf, 3, status)) %>%
  ungroup() %>%
  select(!first_inf)
You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then
library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))
Or with data.table:
library(data.table)
setDT(d)[, status := fifelse(annual_change_val %in% c(1, 2), 1, 3)]
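To also enforce the "once infected, stays infected" rule in the same style, one option (a sketch, assuming rows are ordered by year within each UID) is to carry the 3 forward with cummax(), since the recoded status only takes the values 1 and 3:
library(dplyr)

d %>%
  mutate(status = if_else(annual_change_val %in% c(1, 2), 1, 3)) %>%
  arrange(UID, year) %>%
  group_by(UID) %>%
  mutate(status = cummax(status)) %>%  # once a 3 appears, all later years stay 3
  ungroup()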

Average of a variable by collapsing two columns in r [duplicate]

I would like to find the average temperature per season for each year. Each year is observed 4 times; there are two seasons, each of which appears twice per year, as shown below:
year <- rep(1990:1992, each = 4)
season <- c("W", "D", "W", "D", "W", "W", "D", "D", "D", "W", "W", "D")
temp <- c(28, 25, 26, 21, 28, 25, 20, 20, 20, 35, 28, 21)
df <- data.frame(year, season, temp)
which gives
year season temp
1 1990 W 28
2 1990 D 25
3 1990 W 26
4 1990 D 21
5 1991 W 28
6 1991 W 25
7 1991 D 20
8 1991 D 20
9 1992 D 20
10 1992 W 35
11 1992 W 28
12 1992 D 21
I want to collapse this data to get the average of each season for each year, as below:
year season avgtemp
1 1990 D 23.0
2 1990 W 27.0
3 1991 D 20.0
4 1991 W 26.5
5 1992 D 20.5
6 1992 W 31.5
How can I obtain this?
Try the following with base R's aggregate:
aggregate(df[, 3], df[, 1:2], mean)
Or with the tidyverse:
library(tidyverse)
df %>%
  group_by(year, season) %>%
  summarise(avgtemp = mean(temp))
# A tibble: 6 x 3
# Groups: year [?]
year season avgtemp
<int> <fct> <dbl>
1 1990 D 23
2 1990 W 27
3 1991 D 20
4 1991 W 26.5
5 1992 D 20.5
6 1992 W 31.5
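If you prefer base R but want a formula-style call, aggregate() also has a formula interface; note the averaged column keeps the name temp, which you can rename afterwards:
aggregate(temp ~ year + season, data = df, FUN = mean)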

unlist and merge into a single dataframe in r

I have a list of dataframes that I need to be combined into a single one.
year <- 1990:2000
v1 <- 1:11
v2 <- 20:30
df1 <- data.frame(year, v1)
df2 <- data.frame(year, v2)
ldf <- list(df1, df2)
I now want to combine this list into a single dataframe and get
> head(df)
year v1 v2
1 1990 1 20
2 1991 2 21
3 1992 3 22
4 1993 4 23
Note that my question is different from a similar question, whose solution was df <- ldply(ldf, data.frame).
What I am essentially looking for is a more automatic way of doing this: df <- merge(df1, df2, by = "year")
With a larger number of list elements, a convenient option is reduce with one of the join functions:
library(tidyverse)
ldf %>%
  reduce(inner_join, by = "year")
# year v1 v2
#1 1990 1 20
#2 1991 2 21
#3 1992 3 22
#4 1993 4 23
#5 1994 5 24
#6 1995 6 25
#7 1996 7 26
#8 1997 8 27
#9 1998 9 28
#10 1999 10 29
#11 2000 11 30
Is there anything wrong with:
df <- merge(ldf[[1]], ldf[[2]], by="year")
Or for a long list:
df1 <- ldf[[1]]
for (x in 2:length(ldf)) {
df1 <- merge(df1, ldf[[x]])
}
# year v1 v2
# 1 1990 1 20
# 2 1991 2 21
# 3 1992 3 22
# 4 1993 4 23
# 5 1994 5 24
# 6 1995 6 25
# 7 1996 7 26
# 8 1997 8 27
# 9 1998 9 28
# 10 1999 10 29
# 11 2000 11 30
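The explicit loop can also be written with base R's Reduce(), which is essentially the base equivalent of the purrr reduce() call above:
# merge every data frame in the list pairwise on "year"
df <- Reduce(function(x, y) merge(x, y, by = "year"), ldf)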

filter a df with NA to get only individuals that appear more than one time in r

I am using a national survey to run a regression: the survey is conducted every two years, and some individuals are interviewed repeatedly while others only once.
Now I want to turn the df into a panel (keep only the individuals that appear more than once). The df looks like this:
year nquest nord nordp sex age
2000 10 1 1 F 40
2000 10 2 2 M 43
2000 30 1 1 M 30
2002 10 1 1 F 42
2002 10 2 2 M 45
2002 10 3 NA F 15
2002 30 1 1 M 32
2004 10 1 1 F 44
2004 10 2 2 M 47
2004 10 3 3 F 17
2004 50 1 NA M 66
where nquest is the code number of the family, nord is the code number of the individual and nordp is the code number that the individual had in the previous survey; when a new individual is interviewed, the value in nordp is missing (R automatically inserts NA). For example, individual 3 of family 10 has nordp=NA in 2002 because it is the first time she is interviewed, while in 2004 her nordp is 3 (because 3 was her nord in 2002).
I can't use nord to filter the df because the composition of the family may change. For example, in 2002 in family x the mother has nordp=2 (meaning that in 2000 her nord was 2) and nord=2, but in the next survey her nord could be 1 (for example if she gets divorced) while her nordp is still 2.
I tried to filter using this command:
df <- df %>%
  group_by(nquest, nordp) %>%
  filter(n() > 1)
but I don't get the right df, because if the same family has more than one newly inserted individual (NA), they are treated as the same person, since nordp is NA the first time each of them appears.
How can I also keep the individuals that appear for the first time in a certain year (nordp=NA)? I tried to create a condition using age (the age at time t should equal the age at t-2 plus 2; for example, if in 2000 the age is 20, in 2002 it is 22), but it didn't work.
Consider that the df has thousands of observations, so I can't check manually.
The final df should be:
year nquest nordp sex age
2000 10 1 F 40
2000 10 2 M 43
2000 30 1 M 30
2002 10 1 F 42
2002 10 2 M 45
2002 10 3 F 15
2002 30 1 M 32
2004 10 1 F 44
2004 10 2 M 47
2004 10 3 F 17
As you can see, only the individuals that appear more than once are kept, including the one with nquest=10 and nordp=3; with my command she gets lost, because in the first year her nordp was NA.
We wish to assign unique IDs to individuals, then filter by the count of unique IDs. The main idea is to chain together the nordp and nord values within each family over years. Here's an idea inspired by Identify groups of linked episodes which chain together. First, load the igraph package, via library(igraph). Then the following function assigns IDs for a given family.
assignID <- function(d) {
  fields <- names(d)  # store original column names
  # give newly appearing individuals (nordp == NA) temporary codes that cannot
  # clash with real nordp values
  d$nordp[is.na(d$nordp)] <- seq_len(sum(is.na(d$nordp))) + 100
  # encode (previous year, nordp) and (current year, nord) as single vertex ids
  d$nordp_x <- (d$year - 2) * 1000 + d$nordp
  d$nord_x <- d$year * 1000 + d$nord
  # each row links a person's code in the previous survey to their current code;
  # connected components of this graph correspond to individuals
  dd <- d[, c("nordp_x", "nord_x")]
  gr.test <- graph.data.frame(dd)
  links <- data.frame(org_id = unique(unlist(dd)),
                      id = clusters(gr.test)$membership)
  d <- merge(d, links, by.x = "nord_x", by.y = "org_id", all.x = TRUE)
  d$uid <- d$nquest * 100 + d$id  # make the id unique across families
  d[, c(fields, "uid")]
}
The function can "tell", for example, that
year nordp nord
2000 1 1
2002 1 2
2004 2 3
is the same individual, by chaining together the nordp and nord over the years, and assigns the same unique ID to all 3 rows. So, for example,
assignID(subset(df, nquest == 10))
# year nquest nord nordp sex age dob uid
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
gives us an additional column with the uid for each individual.
The remaining steps are straightforward. We split the dataframe by nquest, apply assignID to each subset, and rbind the output:
dd <- do.call(rbind, by(df, df$nquest, assignID))
Then we can just group by uid and filter by count:
library(dplyr)
dd %>% group_by(uid) %>% filter(n() > 1)
# Source: local data frame [10 x 8]
# Groups: uid [4]
# year nquest nord nordp sex age dob uid
# <int> <int> <int> <dbl> <fctr> <int> <int> <dbl>
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
# 9 2000 30 1 1 M 30 1970 3001
# 10 2002 30 1 1 M 32 1970 3001
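If you'd rather not load dplyr for the last step, the same "keep uids that occur more than once" filter can be written in base R (a small sketch, assuming dd from above):
dd[ave(dd$uid, dd$uid, FUN = length) > 1, ]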
