R: Aggregate over multiple columns

I'm currently working with a large dataframe of 75 columns and roughly 9,500 rows. This dataframe contains observations for every day from 1995-2019 for several observation points.
Edit: the output of dput(head(df)):
> dput(head(df))
structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
A sample of the dataframe looks like this:
date x1 x2 x3 x4 x5 xn year month day
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995-01-01 50.8 62.2 90.2 60 NA 53.2 1995 1 1
2 1999-08-02 62.6 58.7 NA 72 NA 61.1 1999 8 2
3 2001-09-03 57.2 49.9 70.1 68.4 NA 56.6 2001 9 3
4 2008-05-04 56.6 56.4 75.8 65.5 NA 58.6 2008 5 4
5 2012-04-05 36.8 43.2 83.3 63.2 NA 36.2 2012 4 5
6 2019-12-31 39.1 41.6 98.5 55.9 NA 44.4 2019 12 31
str(df)
tibble [9,131 x 75] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:9131], format: "1995-01-01" "1995-01-02" ...
$ x1 : num [1:9131] 50.8 62.6 57.2 56.6 36.8 ...
$ x2 : num [1:9131] 62.2 58.7 49.9 56.4 43.2 ...
...
$ xn : num [1:9131] 53.2 61.1 56.6 58.6 36.2 ...
$ year : num [1:9131] 1995 1995 1995 1995 1995 ...
$ month : num [1:9131] 1 1 1 1 1 1 1 1 1 1 ...
$ day : num [1:9131] 1 2 3 4 5 6 7 8 9 10 ...
My goal is to get, for every observation point xn, the count of all observations per year that cross a certain limit.
So far I have tried to achieve this with the aggregate() function.
To get the mean of every year I used the following command:
aggregate(list(df), by=list(year=df$year), mean, na.rm=TRUE)
This works perfectly; I get the mean for every year for every observation point.
To get this count for one station I used the following code:
aggregate(list(x1=df$x1), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
which results in this output:
year x1
1 1995 52
2 1996 43
3 1997 44
4 1998 42
5 1999 38
6 2000 76
7 2001 52
8 2002 58
9 2003 110
10 2004 34
11 2005 64
12 2006 46
13 2007 46
14 2008 17
15 2009 41
16 2010 30
17 2011 40
18 2012 47
19 2013 40
20 2014 21
21 2015 56
22 2016 27
23 2017 45
24 2018 22
25 2019 45
So far, so good. I know I could expand the code by adding (.., x2=data$x2, x3=data$x3, .., xn) to the list argument above; I tried this and it works.
But how do I get them all at once?
I tried the following:
aggregate(.~(date, year, month, day), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error: unexpected ',' in "aggregate(.~(date,"
aggregate(.~date+year+month+day, by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class ‘"function"’ to a data.frame
aggregate(. ~ date + year + month + day, data = df,by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) :
arguments must have the same length
But unfortunately none of them works. Could someone please give me a hint where my mistake is?

Here is an answer that uses base R. Since none of the values in the example data are above 120, we use a criterion of above 70 instead.
data <- structure(
list(
date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"),
x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125),
x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625),
x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375),
year = c(1995, 1995, 1995, 1995,
1995, 1995),
month = c(1, 1, 1, 1, 1, 1),
day = c(1, 2, 3,
4, 5, 6)
),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl",
"data.frame"
))
First, we create a subset of the data that contains all columns containing x, and set them to TRUE or FALSE based on whether the value is greater than 70.
theCols <- data[,colnames(data)[grepl("x",colnames(data))]]
Second, we cbind() the year onto the matrix of logical values.
x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))
Finally, we use aggregate() on all columns other than year and sum the logical values.
aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)
...and the output:
Group.1 x1 x2 x3 x4 x5 xn
1 1995 0 0 5 1 0 0
Note that by using colnames() to extract the columns that contain x and ncol() in the aggregate() call, we make this a general solution that will handle a varying number of x locations.
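As a sketch of that generalization, the same steps can be wrapped in a small helper (count_exceedances and threshold are illustrative names, not part of the answer above):
count_exceedances <- function(data, threshold) {
  # logical matrix: TRUE where a measurement exceeds the threshold
  flags <- as.data.frame(data[, grepl("x", colnames(data))] > threshold)
  # yearly counts per observation point
  aggregate(flags, by = list(year = data$year), sum, na.rm = TRUE)
}
count_exceedances(data, 70)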
Two tidyverse solutions
A tidyverse solution to the same problem is as follows. It includes the following steps.
Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0, which is currently in development but due for production release the week of May 25th.
Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.
Use pivot_wider() to convert the data back to one column for each x measurement.
...and the code is:
devtools::install_github("tidyverse/dplyr") # needed for across()
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
  mutate(across(starts_with("x"), ~ if_else(. > 70, TRUE, FALSE))) %>%
  select(-year, -month, -day) %>%
  group_by(date) %>%
  pivot_longer(starts_with("x"), names_to = "measure", values_to = "value") %>%
  mutate(year = year(date)) %>%
  group_by(year, measure) %>%
  select(-date) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  pivot_wider(id_cols = year, names_from = "measure", values_from = value)
...and the output, which matches the Base R solution that I originally posted:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: year [1]
year x1 x2 x3 x4 x5 xn
<dbl> <int> <int> <int> <int> <int> <int>
1 1995 0 0 5 1 0 0
...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.
df_example %>%
  pivot_longer(x1:x5) %>%
  mutate(greater_120 = value > 120) %>%
  group_by(year, name) %>%
  summarise(sum_120 = sum(greater_120, na.rm = TRUE)) %>%
  pivot_wider(id_cols = year, names_from = "name", values_from = sum_120)
...and the output:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups: year [1]
year x1 x2 x3 x4 x5
<dbl> <int> <int> <int> <int> <int>
1 1995 0 0 0 0 0
Conclusions
As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.
That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.
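For instance, with dplyr 1.0.0 the original counting problem can be solved directly on the wide data, without any pivoting; a minimal sketch, assuming the same 70 threshold used above:
library(dplyr)
data %>%
  group_by(year) %>%
  # count, per year and per x column, the values above the threshold
  summarise(across(starts_with("x"), ~ sum(.x > 70, na.rm = TRUE)))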

This should solve your problem
library(tidyverse)
library(lubridate)
df_example <- structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
df_example %>%
  pivot_longer(x1:x5) %>%
  mutate(greater_120 = value > 120) %>%
  group_by(year(date)) %>%
  summarise(sum_120 = sum(greater_120, na.rm = TRUE))

Related

for loop in R to compute yearly evolution of a variable

I repost here what I posted on Stats Exchange, having been told it was better suited for Stack Overflow. Here is the structure of my dataset, for reproducibility:
structure(list(numero = c("133", "62", "75", "76", "86", "281"
), tranche_age = c("20-30", "20-30", "20-30", "20-30", "20-30",
"20-30"), tranche_anciennete = c("5 ans et moins", "5 à 10 ans",
"5 ans et moins", "5 ans et moins", "5 à 10 ans", "5 à 10 ans"
), code_statut = c("C", "E", "E", "E", "E", "E"), code_contrat = c("A",
"A", "A", "A", "A", "A"), taux_demploi_mois = c(100, 100, 100,
100, 100, 100), echelon = c("E1", NA, NA, NA, NA, NA), niveau = c("N7",
NA, NA, NA, NA, NA), brut_mensuel = c(NA, 786.13, 1156.95, 1156.95,
904.79, 904.79), estimation_annuelle = c(NA, 10219.69, 15040.35,
15040.35, 11762.27, 11762.27), annee = c(2017, 2017, 2017, 2017,
2017, 2017), primes_en_montant = c(0, 0, 0, 0, 0, 0), primes_en_pourcentage =
c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), brut_mensuel_ETP = c(NA,
786.13, 1156.95, 1156.95, 904.79, 904.79)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Each worker is identified by a number ("numero"), which doesn't change from year to year. I would like to compute a new variable, added to this dataframe, representing the evolution of "estimation_annuelle" (the yearly wage) of each worker from year to year (2017 to 2021), and then the average annual growth rate over the 5 years. Then I would like to identify those who had less than a 2% raise in a given year (2017-2018, for example) and see whether this was caught up in the following years or not (that is, if someone's wage increased by less than 2% between 2017 and 2018, whether the raise they got between 2018 and 2019 compensated for the insufficient raise of the previous period, and by how much).
I have tried the following code to compute the year-to-year evolution, which doesn't work:
test <- liste_complete %>%
group_by(annee, numero) %>%
select(numero, annee, estimation_annuelle)%>%
data.frame()
for(i in 1:length(test$estimation_annuelle)) {
print((test[i+1,] - test[i,])/test[i,])
}
And I have not found anything to compute the average annual growth rate (here is the formula: https://investinganswers.com/dictionary/a/average-annual-growth-rate-aagr), nor have I worked out whether the insufficient increase for those concerned has been made up for in the following years.
Could anyone help?
We can use a summarise() followed by a match().
df$annee <- c(2017, 2017, 2018,2018, 2019,2019)
df$brut_mensuel[1] <- 11000
# first, summarise
summary <- df %>% select(numero, annee, estimation_annuelle, brut_mensuel) %>%
group_by(annee) %>% summarise(estimation_annuelle=mean(brut_mensuel)) %>% arrange(annee) %>%
mutate(salaire_annee_prec = lag(estimation_annuelle),
variation_annee_precedente=(estimation_annuelle-salaire_annee_prec)/salaire_annee_prec)
# matching
df$variation_annee_prec <- summary$variation_annee_precedente[match(df$annee,summary$annee)]
df
# A tibble: 6 x 15
numero tranche_age tranche_anciennete code_statut code_contrat taux_demploi_mois echelon niveau brut_mensuel estimation_annuelle annee primes_en_montant
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 133 20-30 5 ans et moins C A 100 E1 N7 11000 NA 2017 0
2 62 20-30 5 à 10 ans E A 100 NA NA 786. 10220. 2017 0
3 75 20-30 5 ans et moins E A 100 NA NA 1157. 15040. 2018 0
4 76 20-30 5 ans et moins E A 100 NA NA 1157. 15040. 2018 0
5 86 20-30 5 à 10 ans E A 100 NA NA 905. 11762. 2019 0
6 281 20-30 5 à 10 ans E A 100 NA NA 905. 11762. 2019 0
primes_en_pourcentage brut_mensuel_ETP variation_annee_prec
<dbl> <dbl> <dbl>
1 NA NA NA
2 NA 786. NA
3 NA 1157. -0.804
4 NA 1157. -0.804
5 NA 905. -0.218
6 NA 905. -0.218
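Note that the code above summarises the wage per year across all workers. For the per-worker evolution and the average annual growth rate the question actually asks about, a grouped lag() is one option. A sketch, assuming liste_complete holds one row per worker and year, with consecutive years (in the six-row dput sample every worker has a single year, so every growth value would be NA):
library(dplyr)
evolution <- liste_complete %>%
  select(numero, annee, estimation_annuelle) %>%
  arrange(numero, annee) %>%
  group_by(numero) %>%
  mutate(variation = (estimation_annuelle - lag(estimation_annuelle)) /
           lag(estimation_annuelle),
         # AAGR: mean of the yearly growth rates (see the formula linked above)
         aagr = mean(variation, na.rm = TRUE)) %>%
  ungroup()

# workers with less than a 2% raise between 2017 and 2018
evolution %>% filter(annee == 2018, variation < 0.02)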

Find percentage of NA values per month and year in a list of data frames

I have a list of 83 csv files with three variables.
I have created new date columns, including month and year.
One of my dataframes from the list looks like this:
> head(estaciones$AeropuertodeBocas_93002)
Date Tx2m Tn2m Pr year month day
1 1988-01-01 27.4 23.1 41.3 1988 1 1
2 1988-01-02 29.8 24.0 0.3 1988 1 2
3 1988-01-03 30.4 24.0 0.4 1988 1 3
4 1988-01-04 30.0 24.2 2.4 1988 1 4
5 1988-01-05 29.6 23.2 9.1 1988 1 5
6 1988-01-06 30.0 23.1 5.2 1988 1 6
I would like to create a new file with the percentage of NA values per variable, per month and year. For example, June 1988: 2% missing values for variable "Pr" in dataframe "x".
I have tried using:
na_by_month <- map(estaciones, ~ .x %>%
mutate(Month=month(Date), Mis = rowSums(is.na(.))) %>%
group_by(Month) %>%
summarise(Sum=sum(Mis), Percentage=mean(Mis)))
This only calculates the percentage of missing values per month across the whole series, not per month within each year.
Data (one of several dfs):
df <- structure(list(Date = structure(c(6574,
6575, 6576, 6577, 6578, 6579), class = "Date"),
Tx2m = c(27.4, 29.8, 30.4, 30, 29.6, 30),
Tn2m = c(23.1, 24, 24, 24.2, 23.2, 23.1),
Pr = c(41.3, 0.3, 0.4, 2.4, 9.1, 5.2),
year = c(1988, 1988, 1988, 1988, 1988, 1988 ),
month = c(1, 1, 1, 1, 1, 1), day = 1:6),
row.names = c(NA, 6L), class = "data.frame")
How can I create a new file containing percentage of missing values for each of my data frames inside the list, per month and per year? Thank You
If you're trying to calculate the percentage of missing values by month/year and just by year you could write a function that you can then map to your list of dataframes:
library(dplyr)
library(purrr)
library(openxlsx)
library(rlang)
ldf <- list(df, df, df)
f <- function(data, ...){
v <- enquos(...)
data %>%
group_by(!!! v) %>%
summarize(across(Tx2m:Pr,
list(missing = ~ mean(is.na(.))),
.names = paste0("{.col}_{.fn}_", quo_name(v[[1]]))),
.groups = "drop")
}
miss <- imap(ldf, ~ left_join(f(.x, month, year), f(.x, year), by = "year"))
write.xlsx(miss, "output.xlsx")
How it works
You provide the function f your dataframe and the variables you want to group by and it will calculate the percentage of missing values for those group by variables. For example, f(df, month, year) will group your data by month and year and calculate the percentage of missing values for each variable in the range Tx2m:Pr.
f(df, month, year)
month year Tx2m_missing_month Tn2m_missing_month Pr_missing_month
<int> <int> <dbl> <dbl> <dbl>
1 1 1988 0 0 0
f(df, year)
year Tx2m_missing_year Tn2m_missing_year Pr_missing_year
<int> <dbl> <dbl> <dbl>
1 1988 0 0 0
Note: the order of your grouping variables matters here. The first group by variable is used to construct the output variable names (eg Tn2m_missing_month).
If you want the number of missing by month/year and by year for each element of your list, then we can apply this function using imap and merge the results by year.
left_join(f(df, month, year), f(df, year), by = "year")
month year Tx2m_missing_month Tn2m_missing_month Pr_missing_month
<int> <int> <dbl> <dbl> <dbl>
1 1 1988 0 0 0
# ... with 3 more variables: Tx2m_missing_year <dbl>,
# Tn2m_missing_year <dbl>, Pr_missing_year <dbl>
Note: The missing by year will be repeated for each month within the year.
Lastly, write.xlsx will write a list of dataframes to an Excel workbook, where each sheet will be an element of your list.
If I've misunderstood your post and you only want the percentage missing by month within year then you can simplify this to:
miss <- imap(ldf, ~ f(.x, month, year))
Plot
To plot you could do something like this:
library(ggplot2)
library(tidyr)
library(scales)
library(lubridate)
plots <- imap(miss, ~ .x %>%
select(ends_with("year")) %>%
distinct() %>%
pivot_longer(cols = -year,
names_pattern = "(.*?)_(.*)",
names_to = c("var", NA)) %>%
mutate(date = ymd(year, truncated = 2L)) %>%
ggplot(aes(x = date, y = value, color = var, group = var)) +
geom_point() +
geom_line() +
scale_y_continuous(labels = percent_format()) +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y")
)
plots[[1]]
where each variable is a line, its y-axis value is the percent missing, and the x-axis is the year.
Note: with the given data in the example, the graphic is not that interesting and gives a warning about there being only one point. Additionally, all the points are overlapping on the same (x,y) coordinate with the given data.
df <- structure(list(Date = structure(c(6574, 6575, 6576, 6577, 6578, 6579), class = "Date"),
Tx2m = c(27.4, 29.8, 30.4, 30, 29.6, 30), Tn2m = c(23.1, 24, 24, 24.2, 23.2, 23.1),
Pr = c(41.3, 0.3, 0.4, 2.4, 9.1, 5.2),
year = c(1988, 1988, 1988, 1988, 1988, 1988 ),
month = c(1, 1, 1, 1, 1, 1), day = 1:6),
row.names = c(NA, 6L), class = "data.frame")
nongroup_vars <- setdiff(colnames(df),c('year','month'))
nongroup_vars_mr <- paste0(nongroup_vars,'_missing_ratio')
df %>%
group_by(month,year) %>%
summarise_all(function(x) mean(is.na(x))) %>%
ungroup %>%
rename_with(~nongroup_vars_mr,all_of(nongroup_vars))
This gives the missing-value ratio for each group.
Output:
# A tibble: 1 × 7
month year Date_missing_ratio Tx2m_missing_ratio Tn2m_missing_ratio Pr_missing_ratio day_missing_ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1988 0 0 0 0 0
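To produce a file per dataframe in the list (as the question asks), the same pipeline can be mapped over the question's estaciones list and written out; a sketch with illustrative file names, dropping the column renaming for brevity:
library(dplyr)
library(purrr)

na_ratios <- map(estaciones, ~ .x %>%
  group_by(month, year) %>%
  summarise_all(function(x) mean(is.na(x))) %>%
  ungroup())

# one CSV per station, named after its list element
iwalk(na_ratios, ~ write.csv(.x, paste0(.y, "_na_ratios.csv"), row.names = FALSE))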

How to change an ID's NA to character value, based on other ID's cell values/characteristics in R?

I have a problem with missing values in my dataset. For some reason, several IDs are missing a value in the 'Names' column. This is strange, because other IDs with the same CODE (the full dataset has >10K codes) and the same year (6 possible years) do have a value in that column.
Can somebody help me figure out the code, so that IDs with missing values in the 'Names' column get the same character value in that column when other IDs with the same code and year have one?
For example: the NA at row 4 should change to 'Hospital', based on the same code and year of another ID. (In my original dataframe there is an ID with year 2013 and code 01 with name 'Hospital'; if there were not, it should stay NA.)
Sidenote: it is panel data, so each ID can be in the dataset for multiple years (and rows; each year is one row) and not everybody is in for every year. There are also more variables in my dataframe.
> dput(Dataframe[1:7, ])
structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
A tibble: 7 x 7
ID Gender CODE Year Quarter Size Names
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 2 01 2018 3 24.5 Hospital
2 2 1 01 2014 4 23.2 Hospital
3 2 1 01 2018 4 24.5 Hospital
4 2 1 01 2013 4 30 NA
5 2 1 01 2013 3 30 Hospital
6 2 1 01 2015 4 19.2 NA
7 2 1 01 2015 3 19.2 Hospital
Selecting and checking individual rows is too much work; I have over 1.1 million rows.
Edit: it would also be fine to convert the 'Names' column to 1 if it has a (character) value, and 0 if NA.
Thank you!
I'm not exactly sure, because in your example all the names are the same, but I think this might do what you are looking for.
I changed the example below so that the last Names value is "Not Hospital".
df <- structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Not Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame") )
Original
# A tibble: 7 x 6
ID CODE Year Quarter Size Names
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 01 2018 3 24.5 Hospital
2 2 01 2014 4 23.2 Hospital
3 2 01 2018 4 24.5 Hospital
4 2 01 2013 4 30 NA
5 2 01 2013 3 30 Hospital
6 2 01 2015 4 19.2 NA
7 2 01 2015 3 19.2 Not Hospital
Here's the code to update the names.
df %>%
filter(!is.na(Names)) %>%
select(CODE, Year, Names) %>%
group_by_all() %>%
summarise() %>%
right_join(df, by = c("CODE", "Year")) %>%
rename(Names = Names.x) %>%
select(-Names.y)
Output:
# A tibble: 7 x 6
# Groups: CODE, Year [4]
CODE Year Names ID Quarter Size
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 01 2018 Hospital 1 3 24.5
2 01 2014 Hospital 2 4 23.2
3 01 2018 Hospital 2 4 24.5
4 01 2013 Hospital 2 4 30
5 01 2013 Hospital 2 3 30
6 01 2015 Not Hospital 2 4 19.2
7 01 2015 Not Hospital 2 3 19.2
There are several ways to approach this problem; I prefer the following solution.
The first step is to split the data frame in two: one data frame contains only rows without NAs in the Names column, the other only rows with NAs there. Then we simply search the former for CODE/Year combinations and return the name of the corresponding row.
# Your data frame
df <- ... # the data frame from the question

# Split df
df.with.nas <- df[is.na(df$Names), ]
df.without.nas <- df[!is.na(df$Names), ]

# Define a function to separate the lookup logic
get.name <- function(row) {
  # row is an atomic vector, hence we have to use row["<SELECTOR>"]
  result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
  return(result$Names[1]) # NA when there is no match
}

# Finally, search and return.
row.axis <- 1
df.with.nas$Names <- apply(df.with.nas, row.axis, get.name)

# Combine the dfs
df <- rbind(df.with.nas, df.without.nas)
This solution has one shortcoming: what should happen when we find duplicates?
I hope this is useful!
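For completeness, here is a grouped dplyr sketch that sidesteps the duplicate issue by always taking the first non-NA name within each CODE/Year group (it assumes the name, where present, is consistent within a group):
library(dplyr)
df %>%
  group_by(CODE, Year) %>%
  # fill each NA with the group's first non-NA name; stays NA if none exists
  mutate(Names = ifelse(is.na(Names), first(na.omit(Names)), Names)) %>%
  ungroup()
The 0/1 indicator mentioned in the question's edit is then simply as.integer(!is.na(df$Names)).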

R: Add interpolated values in between columns of dataframe?

I have a data frame that looks like this
Region 2000 2001 2002 2003 2004 2005
Australia 15.6 18.4 19.2 20.2 39.1 50.2
Norway 19.05 20.2 15.3 10 10.1 5.6
and basically I need a quick way to add extra columns in-between the currently existing columns that contain interpolated values of the surrounding columns.
Think of it like this: say you don't want columns for every year, but rather columns for every quarter. Then, for every pair of years (like 2000 and 2001), we would need to add 3 extra columns in-between these years.
The values of these columns will just be interpolated values. So, for Australia, the value in 2000 is 15.6 and in 2001 it is 18.4. So we calculate (18.4 - 15.6)/4 = 0.7, and then the values should now be 15.6, 16.3, 17, 17.7, and finally 18.4.
I have a working solution that builds up the new dataframe from scratch using a for loop. It is extremely slow. How can I speed it up?
This is how I did it when I had a similar problem. Not the most sophisticated solution but it works.
Australia=c( 15.6, 18.4, 19.2, 20.2, 39.1, 50.2)
library(zoo)
midpoints=rollmean(Australia, 2)
biyearly=c(rbind(Australia,midpoints))
midpoints=rollmean(biyearly, 2)
quarterly=c(rbind(biyearly,midpoints))
quarterly
#[1] 15.600 16.300 17.000 17.700 18.400 18.600 18.800 19.000 19.200 19.450 19.700
#[12] 19.950 20.200 24.925 29.650 34.375 39.100 41.875 44.650 47.425 50.200 33.600
#[23] 17.000 16.300
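One caveat: rbind() recycles the shorter midpoints vector, so the values after position 21 in the output above are recycling artifacts. With n yearly values, only the first (n - 1) * 4 + 1 entries are meaningful and the rest can be trimmed:
# keep only the valid quarterly values ((6 - 1) * 4 + 1 = 21 of them)
quarterly <- head(quarterly, (length(Australia) - 1) * 4 + 1)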
Here is one way with tidyverse:
library(tidyverse)
df %>%
#get data in long format
pivot_longer(cols = -Region) %>%
#group by Region
group_by(Region) %>%
#Create 4 number sequence between every 2 value
summarise(temp = list(unlist(map2(value[-n()], value[-1], seq, length.out = 4)))) %>%
#Get data in long format
unnest(temp) %>%
group_by(Region) %>%
#Create column name
mutate(col = paste0(rep(names(df)[-c(1, ncol(df))], each = 4), "Q", 1:4)) %>%
#Spread data in wide format
pivot_wider(names_from = col, values_from = temp)
# A tibble: 2 x 21
# Groups: Region [2]
# Region `2000Q1` `2000Q2` `2000Q3` `2000Q4` `2001Q1` `2001Q2` `2001Q3` `2001Q4` `2002Q1`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Austr… 15.6 16.5 17.5 18.4 18.4 18.7 18.9 19.2 19.2
#2 Norway 19.0 19.4 19.8 20.2 20.2 18.6 16.9 15.3 15.3
# … with 11 more variables: `2002Q2` <dbl>, `2002Q3` <dbl>, `2002Q4` <dbl>,
# `2003Q1` <dbl>, `2003Q2` <dbl>, `2003Q3` <dbl>, `2003Q4` <dbl>, `2004Q1` <dbl>,
# `2004Q2` <dbl>, `2004Q3` <dbl>, `2004Q4` <dbl>
data
df <- structure(list(Region = structure(1:2, .Label = c("Australia",
"Norway"), class = "factor"), `2000` = c(15.6, 19.05), `2001` = c(18.4,
20.2), `2002` = c(19.2, 15.3), `2003` = c(20.2, 10), `2004` = c(39.1,
10.1), `2005` = c(50.2, 5.6)), class = "data.frame", row.names = c(NA, -2L))
Here is a solution using dplyr. Should be more consistent and much faster than a loop:
# dummy data
df <- tibble(Region = LETTERS[1:5],
`2000` = 1:5,
`2001` = 3:7,
`2002` = 10:14)
# function to calculate quarterly values
into_quarter <- function(x) x / 4
df %>%
# create new variables that contain quarterly values
mutate_at(vars(starts_with("200")),
.funs = list("Q1" = into_quarter,
"Q2" = into_quarter,
"Q3" = into_quarter,
"Q4" = into_quarter)) %>%
# sort them appropriately;
# can also be done with base R and order(names), depending on the desired result
select(Region,
starts_with("2000"),
starts_with("2001"),
starts_with("2002"),
# in case there are also other variables, so as not to lose any information
everything())
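For the linear interpolation described in the question, base R's approx() offers a vectorized alternative to a row-by-row loop. A sketch over the wide year columns (the column handling below is illustrative, assuming the first column is Region and the rest are years):
# numeric years from the column names, and the quarterly grid to fill
years <- as.numeric(names(df)[-1])
quarters <- seq(min(years), max(years), by = 0.25)

# linearly interpolate each region's row onto the quarterly grid
interp <- t(apply(df[-1], 1, function(v) approx(years, v, xout = quarters)$y))
result <- data.frame(Region = df$Region, interp, check.names = FALSE)
colnames(result)[-1] <- quarters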

difference between first non-NA and last non-NA in each row

I have a data frame with up to 5 measurements (x) and their corresponding time:
df = structure(list(x1 = c(92.9595722286402, 54.2085219673818,
46.3227062573019,
NA, 65.1501442134141, 49.736451235317), time1 = c(43.2715277777778,
336.625, 483.975694444444, NA, 988.10625, 510.072916666667),
x2 = c(82.8368681534474, 53.7981639701784, 12.9993531230419,
NA, 64.5678816290574, 55.331442940348), time2 = c(47.8166666666667,
732, 506.747222222222, NA, 1455.25486111111, 958.976388888889
), x3 = c(83.5433119686794, 65.723072881366, 19.0147593408309,
NA, 65.1989838202356, 36.7000828457705), time3 = c(86.5888888888889,
1069.02083333333, 510.275, NA, 1644.21527777778, 1154.95694444444
), x4 = c(NA, 66.008102917677, 40.6243513885846, NA, 62.1694420909955,
29.0078249523063), time4 = c(NA, 1379.22986111111, 520.726388888889,
NA, 2057.20833333333, 1179.86805555556), x5 = c(NA, 61.0047472617535,
45.324715258421, NA, 59.862110645527, 45.883161439362), time5 = c(NA,
1825.33055555556, 523.163888888889, NA, 3352.26944444444,
1364.99513888889)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
"NA" means that the person (row) didn't have a measurement.
I would like to calculate the difference between the last existing measurement and the first one.
So for the first row it would be x3 minus x1 (-9.4), for the second x5 minus x1 (6.8), and so on.
I tried something like this, which didn't work:
df$diff = apply(df %>% select(., contains("x")), 1, function(x) head(x,
na.rm = T) - tail(x, na.rm=T))
Any suggestions? Also, is apply/rowwise the most efficient way, or is there a vectorized function to do that?
A vectorized way would be using max.col, where we get the "first" and "last" non-NA positions via the ties.method parameter.
#Get the x columns and the column number of the first and last non-NA value
x_cols <- grep("^x", names(df))
first_col <- max.col(!is.na(df[x_cols]), ties.method = "first")
last_col <- max.col(!is.na(df[x_cols]), ties.method = "last")
#subset the dataframe to include only `"x"` cols
new_df <- as.data.frame(df[grep("^x", names(df))])
#Subtract last non-NA value with the first one
df$new_calc <- new_df[cbind(1:nrow(df), last_col)] -
new_df[cbind(1:nrow(df), first_col)]
Using apply you could do
x_cols <- grep("^x", names(df))
df$new_calc <- apply(df[x_cols], 1, function(x) {
new_x <- x[!is.na(x)]
if (length(new_x) > 0)
new_x[length(new_x)] - new_x[1L]
else NA
})
We can use tidyverse methods on the tbl_df: create a row-names column (rownames_to_column), gather the 'x' columns into 'long' format while removing the NA elements (na.rm = TRUE), group by row name, take the difference between the first and last 'val'ues, and bind the extracted column to the original dataset 'df'.
library(tidyverse)
rownames_to_column(df, 'rn') %>%
select(rn, starts_with('x')) %>%
gather(key, val, -rn, na.rm = TRUE) %>%
group_by(rn) %>%
summarise(Diff = diff(c(first(val), last(val)))) %>%
mutate(rn = as.numeric(rn)) %>%
complete(rn = min(rn):max(rn)) %>%
pull(Diff) %>%
bind_cols(df, new_col = .)
# A tibble: 6 x 11
# x1 time1 x2 time2 x3 time3 x4 time4 x5 time5 new_col
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 93.0 43.3 82.8 47.8 83.5 86.6 NA NA NA NA -9.42
#2 54.2 337. 53.8 732 65.7 1069. 66.0 1379. 61.0 1825. 6.80
#3 46.3 484. 13.0 507. 19.0 510. 40.6 521. 45.3 523. -0.998
#4 NA NA NA NA NA NA NA NA NA NA NA
#5 65.2 988. 64.6 1455. 65.2 1644. 62.2 2057. 59.9 3352. -5.29
#6 49.7 510. 55.3 959. 36.7 1155. 29.0 1180. 45.9 1365. -3.85
