I am struggling to collapse my data.
My data consists of multiple indicators, with multiple observations of each indicator for each year. I want to reduce this to one observation per indicator per country.
I have a rank column which specifies the order in which the observations should be chosen.
For each year, the observation with the lowest rank (1 rather than 2) should be chosen, provided its value for that year is not NA; otherwise the next rank is used.
An additional question: the years in my dataset vary, so is there a way to make the code dynamic, so that it is applied to every column named 1990 to 2025 that exists?
df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1),
id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")),
`1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22),
`2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2),
`2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
rank = c(1, 2 , 3 , 4 , 1, 2, 3, 1, 3, 2, 4, 5))
The result should be the following dataset:
result <- data.frame(country.code = c(1, 1, 1),
id = as.factor(c("GDP", "CA", "GR")),
`1999`= c(1000, 100, 22),
`2000`= c(1, 1, 12),
`2001`= c(3, 100, 1))
I attempted the following solution, but it does not work because of the NAs in the data, and I would have to specify each column by hand:
test <- df %>% group_by(country.code, id) %>%
summarise(test1999 = `1999`[which.min(rank)])
I don't see how I can tell R to skip the observations where column 1999 is NA.
We can subset each column using the minimum rank among its non-NA values, e.g. x[rank == min(rank[!is.na(x)])].
An additional question: The years in my dataset vary over time,....
With summarise_at, vars and matches can be used to select every column whose name contains 4 digits (which covers 1990-2025) via the regular expression [0-9]{4} (a digit 0-9 repeated exactly 4 times), and funs applies the above procedure to each selected column.
library(dplyr)
df %>% group_by(country.code,id) %>%
summarise(`1999` = `1999`[rank==ifelse(all(is.na(`1999`)),1, min(rank[!is.na(`1999`)]))])
df %>% group_by(country.code,id) %>%
summarise_at(vars(matches("[0-9]{4}")),funs(.[rank==ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
# A tibble: 3 x 5
# Groups: country.code [?]
country.code id `1999` `2000` `2001`
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 CA 100 1 100
2 1 GDP 1000 1 3
3 1 GR 22 12 1
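In current dplyr, funs() is deprecated and summarise_at() is superseded by across(). The following is a minimal sketch of the same idea under two assumptions: the data is built with check.names = FALSE so the year columns really are named 1999, 2000, etc., and first_by_rank is a hypothetical helper introduced here for illustration. any_of(as.character(1990:2025)) picks up only those year columns that actually exist.

```r
library(dplyr)

# Example data from the question; check.names = FALSE keeps the literal
# year names (an assumption about how the real data is named).
df <- data.frame(
  country.code = rep(1, 12),
  id = factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA",
                "GR", "GR", "GR", "GR", "GR")),
  `1999` = c(NA, NA, NA, 1000, NA, NA, 100, NA, NA, NA, NA, 22),
  `2000` = c(NA, NA, 1, 2, NA, 1, 2, NA, 1000, 12, 13, 2),
  `2001` = c(3, 100, 1, 3, 100, 20, 1, 1, 44, 65, NA, NA),
  rank = c(1, 2, 3, 4, 1, 2, 3, 1, 3, 2, 4, 5),
  check.names = FALSE
)

# Hypothetical helper: the value at the lowest rank among non-NA entries,
# or a single NA when every entry in the group is NA.
first_by_rank <- function(x, rank) {
  ok <- !is.na(x)
  if (!any(ok)) return(x[NA_integer_])
  x[ok][which.min(rank[ok])]
}

result <- df %>%
  group_by(country.code, id) %>%
  summarise(
    # any_of() silently ignores years that are not present as columns
    across(any_of(as.character(1990:2025)), ~ first_by_rank(.x, rank)),
    .groups = "drop"
  )
```

Unlike the regular expression, any_of() restricts the selection to the 1990-2025 range instead of matching any column name that contains four digits.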
Here is one option that uses tidyr::fill to replace the NAs by the first non-NA value after we arranged the data by id and rank. It might not be the most efficient approach because we first gather and then spread the data again.
library(tidyverse)
df %>%
arrange(id, rank) %>%
gather(key, value, X1999:X2001) %>%
tidyr::fill(value, .direction = "up") %>%
spread(key, value) %>%
group_by(id) %>%
slice(1) %>%
ungroup()
# A tibble: 3 x 6
# country.code id rank X1999 X2000 X2001
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 CA 1 100 1 100
#2 1 GDP 1 1000 1 3
#3 1 GR 1 22 12 1
NOTE: the column names here are X1999, X2000, etc. rather than 1999, 2000, as they probably are in your data, but the code is easily adapted.
You can change the data frame to long form, remove the NAs, select the value corresponding to the minimum rank, and spread back to wide form:
library(dplyr)
library(tidyr)
test <- df %>%
gather("Year", "Value", X1999:X2001) %>%
filter(!is.na(Value))%>%
group_by(country.code, id, Year) %>%
arrange(rank)%>%
summarise(first(Value)) %>%
spread(Year, `first(Value)`)
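gather() and spread() still work but are superseded in tidyr by pivot_longer() and pivot_wider(). A rough equivalent of the long-form approach might look like this, assuming the default data.frame() naming (X1999, X2000, ...) and using slice_min() to pick the lowest-ranked non-NA row per year:

```r
library(dplyr)
library(tidyr)

# Example data from the question; default check.names turns `1999` into X1999.
df <- data.frame(
  country.code = rep(1, 12),
  id = factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA",
                "GR", "GR", "GR", "GR", "GR")),
  `1999` = c(NA, NA, NA, 1000, NA, NA, 100, NA, NA, NA, NA, 22),
  `2000` = c(NA, NA, 1, 2, NA, 1, 2, NA, 1000, 12, 13, 2),
  `2001` = c(3, 100, 1, 3, 100, 20, 1, 1, 44, 65, NA, NA),
  rank = c(1, 2, 3, 4, 1, 2, 3, 1, 3, 2, 4, 5)
)

result <- df %>%
  # Long form, dropping NA values as we go
  pivot_longer(matches("^X[0-9]{4}$"), names_to = "Year",
               values_to = "Value", values_drop_na = TRUE) %>%
  group_by(country.code, id, Year) %>%
  # Keep the row with the lowest rank in each indicator/year
  slice_min(rank, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-rank) %>%
  # Back to one row per country and indicator
  pivot_wider(names_from = Year, values_from = Value)
```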
Related
I'm trying to compute the mean and standard deviation for a dataset. I have a list of organizations, but one organization has just a single row for the column "cpue". When I try to compute the grouped mean for each organization and another variable (scientific name), this organization is removed and yields an NA. I would like to retain the single-row group's value, however, and have it in the "mean" column so that I can plot it (without sd). Is there a way to tell dplyr to retain groups with a single row when calculating the mean? Data below:
l<- df<- data.frame(organization = c("A","B", "B", "A","B", "A", "C"),
species= c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
cpue= c(1, 2, 1, 5, 6, 1, 3))
l2<- l %>%
group_by( organization, species)%>%
summarize(mean= mean(cpue),
sd=sd(cpue))
Any help would be much appreciated!
We can create an if/else condition in sd to check for the number of rows i.e. if n() ==1 then return the 'cpue' or else compute the sd of 'cpue'
library(dplyr)
l1 <- l %>%
group_by( organization, species)%>%
summarize(mean= mean(cpue),
sd= if(n() == 1) cpue else sd(cpue), .groups = 'drop')
-output
l1
# A tibble: 6 x 4
# organization species mean sd
#* <chr> <chr> <dbl> <dbl>
#1 A bird 5 5
#2 A shark 1 1
#3 A turtle 1 1
#4 B shark 2 2
#5 B turtle 3.5 3.54
#6 C bird 3 3
If the condition is based on the value of grouping variable 'organization', then create the condition in if/else by extracting the grouping variable with cur_group()
l %>%
group_by(organization, species) %>%
summarise(mean = mean(cpue),
sd = if(cur_group()$organization == 'A') cpue else sd(cpue),
.groups = 'drop')
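For what it's worth, mean() of a single value is that value and sd() of a single value is NA, so summarise() keeps single-row groups either way; only the sd column is affected. If the goal is to plot the mean without an error bar for those groups, one alternative sketch is to leave sd as an explicit NA and record the group size (n_obs is a name chosen here for illustration):

```r
library(dplyr)

l <- data.frame(
  organization = c("A", "B", "B", "A", "B", "A", "C"),
  species = c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
  cpue = c(1, 2, 1, 5, 6, 1, 3)
)

l3 <- l %>%
  group_by(organization, species) %>%
  summarise(
    n_obs = n(),                                  # rows in the group
    mean  = mean(cpue),                           # defined even for one row
    sd    = if (n() > 1) sd(cpue) else NA_real_,  # no spread for one row
    .groups = "drop"
  )
```

Plotting code can then check is.na(sd) or n_obs == 1 to skip the error bar.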
I'm trying to calculate the number of days that a patient spent during a given state in R.
The image of an example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking if I am able to create a date column in column 4 which is the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE) but the problem is, it groups the second set of 1's as part of the first set of 1's, so does the 2's which is not what I want.
Use mdy_hm to convert OBS_DTM to POSIXct, then group_by MRN and the rleid of STATE so that the first run of 1's is handled separately from the second. Use difftime to compute, in days, the difference between each OBS_DTM and the minimum OBS_DTM in its group.
If your data is called data :
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
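Since the original data was only posted as an image, here is a self-contained sketch on made-up data (the values below are invented for illustration). It builds the run identifier with cumsum() on state changes, which plays the same role as data.table::rleid() without the extra dependency:

```r
library(dplyr)

# Hypothetical data: state 1, then state 2, then state 1 again.
data <- data.frame(
  MRN = rep(1, 6),
  OBS_DTM = as.POSIXct(
    c("2020-01-01 00:00", "2020-01-02 12:00", "2020-01-03 00:00",
      "2020-01-05 00:00", "2020-01-06 00:00", "2020-01-06 06:00"),
    tz = "UTC"
  ),
  STATE = c(1, 1, 2, 2, 1, 1)
)

result <- data %>%
  group_by(MRN) %>%
  # A new run starts whenever STATE differs from the previous row
  mutate(grp = cumsum(STATE != lag(STATE, default = first(STATE)))) %>%
  group_by(MRN, grp) %>%
  # Days elapsed since the first observation of the current run
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = "days"))) %>%
  ungroup() %>%
  select(-grp)
```

The second run of state 1 restarts at 0 instead of being merged with the first run.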
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
Output of the pipeline above:
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
I have a dataframe that looks like this (but with lots more columns, and no helpful "KEEP" column):
df <- tribble( ~Lots.of.cols, ~analyte, ~meta, ~value, ~KEEP,
1, "A", "analyte", NA, FALSE,
1, "A", "unit", "m", FALSE,
1, "A", "method", NA, FALSE,
1, "B", "analyte", "4", TRUE,
1, "B", "unit", "kg", TRUE,
1, "B", "method", "xxx", TRUE)
What I want to do is filter out all the rows of a particular analyte if, in the row where meta is "analyte", the value column is NA. So in the df above, the first three rows should be filtered out because row one has meta = "analyte" and value = NA. The final three rows (analyte = "B") should be kept because the fourth row (meta = "analyte") has !is.na(value).
I've tried two approaches. The first is to group_by(analyte) and then filter; the second is an anti_join:
df %>%
anti_join(.[is.na(.$value) & .$meta == "analyte", ],
by = c("Lots.of.cols", "analyte", "meta")) -> df
With both approaches I can remove the individual row where meta = "analyte" & is.na(value) but not the other rows in the group.
The issue is that your table is not in tidy format, i.e. 1 observation = 1 row.
To get tidy data, you need to pivot wider. This is why I pivoted, filtered, then pivoted back.
Also, it's confusing that you have two different things named "analyte", which is why I renamed one of them.
library(tidyverse)
df %>%
mutate(meta = str_replace(meta, "analyte", "analyte_value")) %>%
pivot_wider(names_from = meta, values_from = value) %>%
filter(!is.na(analyte_value)) %>%
pivot_longer(cols = analyte_value:method)
#> # A tibble: 3 x 4
#> Lots.of.cols analyte name value
#> <dbl> <chr> <chr> <chr>
#> 1 1 B analyte_value 4
#> 2 1 B unit kg
#> 3 1 B method xxx
Your anti_join was almost right; just leave the "meta" variable out of by = c(...), like this:
df %>%
anti_join(.[is.na(.$value) & .$meta == "analyte", ],
by = c("Lots.of.cols", "analyte")) -> df
Result :
# A tibble: 3 x 5
Lots.of.cols analyte meta value KEEP
<dbl> <chr> <chr> <chr> <lgl>
1 1 B analyte 4 TRUE
2 1 B unit kg TRUE
3 1 B method xxx TRUE
I would first fix your KEEP column, and then filter the data by it. First I group your data by analyte using group_by() from dplyr, then I apply a logical test to each row to check whether it has meta = analyte and value = NA, and then I use the any() function to find out whether any of these test results is TRUE within each group. After that, I just use filter() to select the desired rows.
library(tidyverse)
df <- df %>%
group_by(analyte) %>%
mutate(KEEP = any(meta == "analyte" & is.na(value))) %>%
filter(KEEP == FALSE)
Here is the result:
# A tibble: 3 x 5
# Groups: analyte [1]
Lots.of.cols analyte meta value KEEP
<dbl> <chr> <chr> <chr> <lgl>
1 1 B analyte 4 FALSE
2 1 B unit kg FALSE
3 1 B method xxx FALSE
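The same test can also go straight into filter() on the grouped data, without an intermediate KEEP column. A minimal sketch, rebuilding the example with data.frame() so that only dplyr is needed:

```r
library(dplyr)

df <- data.frame(
  Lots.of.cols = 1,
  analyte = rep(c("A", "B"), each = 3),
  meta = rep(c("analyte", "unit", "method"), times = 2),
  value = c(NA, "m", NA, "4", "kg", "xxx"),
  stringsAsFactors = FALSE
)

kept <- df %>%
  group_by(analyte) %>%
  # Drop the whole group if its "analyte" row has a missing value
  filter(!any(meta == "analyte" & is.na(value))) %>%
  ungroup()
```

Because any() returns a single TRUE or FALSE per group, filter() keeps or drops all rows of that group together.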
I have a flight database with 4 columns like shown below.
Original:
I want an output with one row for each unique combination of three columns (Origin/Destination/Airline), which sums the number of passengers and counts the number of rows for each combination. The result would be something like this.
Output:
I am able to do one part of it using the group_by function:
df %>% group_by(Origin, destination, carrier) %>% summarise(count = n())
How do I include the sum of passengers?
We can use dplyr
library(dplyr)
df1 %>%
group_by(Origin, Destination, Airline) %>%
dplyr::summarise(count = n(), TotalPassengers = sum(Passengers))
# Groups: Origin, Destination [2]
# Origin Destination Airline count TotalPassengers
# <chr> <chr> <chr> <int> <dbl>
#1 ABE ATL 9A 2 3
#2 ABE ATL DL 1 5
#3 NYC SFA AA 3 21
#4 NYC SFA DL 1 5
data
df1 <- data.frame(Origin = rep(c("ABE", "NYC"), c(3, 4)),
Destination = rep(c("ATL", "SFA"), c(3, 4)),
Airline = c("9A", "9A", "DL", "AA", "AA", "AA", "DL"),
Passengers = c(2, 1, 5, 4, 10, 7, 5))
I have a data.frame such as
df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
cost = c(100, 10, 120, 102, 102))
I know that I can use
df1.a <- group_by(df1, id) %>%
summarise(no.c = n(),
m.costs = mean(cost))
to calculate the number of observations and the mean by id. How could I instead calculate the number of observations and the mean over all rows whose id is NOT equal to the given id? For example, it should give me 3 as the number of observations that are not A and 2 for the observations that are not B.
I would like to use the dplyr package and the group_by function, since I have to do this for a lot of huge data frames.
You can use the . to refer to the whole data.frame, which lets you calculate the differences between the group and the whole:
df1 %>% group_by(id) %>%
summarise(n = n(),
n_other = nrow(.) - n,
mean_cost = mean(cost),
mean_other = (sum(.$cost) - sum(cost)) / n_other)
## # A tibble: 2 × 5
## id n n_other mean_cost mean_other
## <fctr> <int> <int> <dbl> <dbl>
## 1 A 2 3 55 108
## 2 B 3 2 108 55
As you can see from the results, with two groups you could just use rev, but this approach will scale to more groups or calculations easily.
Looking for something like this? This first calculates the total cost and total number of rows, then subtracts each group's cost and row count from those totals and takes the average of the remaining cost:
sumCost = sum(df1$cost)
totRows = nrow(df1)
df1 %>%
group_by(id) %>%
summarise(no.c = totRows - n(),
m.costs = (sumCost - sum(cost))/no.c)
# A tibble: 2 x 3
# id no.c m.costs
# <fctr> <int> <dbl>
#1 A 3 108
#2 B 2 55
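Both answers rely on something outside the grouped data (the . placeholder or global variables). A variant that keeps everything inside one pipeline is to attach the totals as columns before grouping; this is only a sketch, and total_n and total_cost are helper column names made up here:

```r
library(dplyr)

df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
                  cost = c(100, 10, 120, 102, 102))

others <- df1 %>%
  # Totals over the whole (still ungrouped) data frame
  mutate(total_n = n(), total_cost = sum(cost)) %>%
  group_by(id) %>%
  summarise(
    no.c    = first(total_n) - n(),                    # rows NOT in this group
    m.costs = (first(total_cost) - sum(cost)) / no.c   # mean cost of those rows
  )
```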