I have a data frame containing distances from each unit's centroid to different points. The points are identified by numbers, and I am trying to obtain a new column with the distance to the closest object.
So the data frame looks like this:
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA , NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
sample2
FID year centroid_distance_1 centroid_distance_2 centroid_distance_3 centroid_distance_4
1 12 1990 220.3 520.0 NA 725.3
2 12 1994 220.3 520.0 12.8 725.3
3 14 1983 515.6 24.3 124.2 44.2
4 15 1953 NA NA NA NA
5 17 1957 200.2 NA NA 62.9
6 18 2000 22.0 51.8 18.8 217.9
FID identifies each unit and year is a year indicator. Each row is a FID*year pair. centroid_distance_x is the distance between the row's centroid and object x. This is a small sample of the data frame, which contains many more columns and rows.
What I am looking for is something like this:
short_distance <- c(220.3, 12.8, 24.3, NA, 62.9,18.8)
unit <- c(1, 3, 2, NA, 4, 3)
ideal.df <- data.frame(FID, year, short_distance, unit)
ideal.df
FID year short_distance unit
1 12 1990 220.3 1
2 12 1994 12.8 3
3 14 1983 24.3 2
4 15 1953 NA NA
5 17 1957 62.9 4
6 18 2000 18.8 3
Basically, I add one column named short_distance, which holds the lowest value each row takes across all the centroid_distance_* columns above, and one named unit, which identifies the object to which each row has the smallest distance (so if a row has its smallest value in centroid_distance_1, unit takes the value 1).
I have tried a bunch of things with dplyr and pivot and re-pivoting the dataframe but I'm really not getting there.
Thanks a lot for the help!
Another solution based on the tidyverse, using pivot_longer, could look as follows.
library(dplyr)
library(tidyr)
library(stringr)
sample2 %>%
pivot_longer(-c(FID, year)) %>%
group_by(year, FID) %>%
slice_min(value, n = 1, with_ties = FALSE) %>%
mutate(unit = str_remove(name, "^centroid_distance_")) %>% # also works for object numbers with more than one digit
select(-name, short_distance = value)
# Groups: year, FID [6]
# FID year short_distance unit
# <dbl> <dbl> <dbl> <chr>
# 1 15 1953 NA 1
# 2 17 1957 62.9 4
# 3 14 1983 24.3 2
# 4 12 1990 220. 1
# 5 12 1994 12.8 3
# 6 18 2000 18.8 3
My first couple of attempts at this weren't working like I imagined, either - couldn't always get the NA behavior you want - but here's one that works:
library(dplyr)
library(reshape2) # Or use tidyr if you prefer
sample2 %>%
# Melt/unpivot to one value per row
melt(id.vars = c("FID", "year")) %>%
# Extract the unit number
mutate(
unit = sub(x = variable,
pattern = "^centroid_distance_",
replacement = "")
) %>%
group_by(FID, year) %>% # Group by FID and year to get one row of output for each
arrange(value) %>% # Put smallest distance at the top of each group
slice_head(n = 1) # Take one row from the top of each group
Base R solution
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA , NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
Apply function min for each row and add it to the data frame as column short_distance. Ignore the warning and handle it in the next operation.
sample2$short_distance <- apply(sample2[,3:6], 1, min, na.rm = TRUE)
#> Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
sample2$short_distance[is.infinite(sample2$short_distance)] <- NA # Change `Inf` created by the `min` function to `NA`
Get units with which.min. ifelse is required because which.min returns an empty vector for rows that are all NA.
sample2$unit <- apply(sample2[,3:6], 1, function(x) ifelse(length(which.min(x)) == 0, NA, which.min(x)))
Keep only relevant columns
sample2 <- sample2[, c(1,2,7,8)]
sample2
#> FID year short_distance unit
#> 1 12 1990 220.3 1
#> 2 12 1994 12.8 3
#> 3 14 1983 24.3 2
#> 4 15 1953 NA NA
#> 5 17 1957 62.9 4
#> 6 18 2000 18.8 3
Created on 2021-01-18 by the reprex package (v0.3.0)
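A warning-free variant of the base R approach above, as a rough sketch rather than a definitive implementation. It relies on pmin(..., na.rm = TRUE) returning NA only when every input is NA, and it reads the object number from the column name instead of the column position, so it should also cope with object numbers above 9. The names dist_cols, idx and result are just illustrative.

```r
sample2 <- data.frame(
  FID = c(12, 12, 14, 15, 17, 18),
  year = c(1990, 1994, 1983, 1953, 1957, 2000),
  centroid_distance_1 = c(220.3, 220.3, 515.6, NA, 200.2, 22),
  centroid_distance_2 = c(520, 520, 24.3, NA, NA, 51.8),
  centroid_distance_3 = c(NA, 12.8, 124.2, NA, NA, 18.8),
  centroid_distance_4 = c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
)

# Select the distance columns by name rather than by position
dist_cols <- grep("^centroid_distance_", names(sample2), value = TRUE)

# Row-wise minimum; rows that are all NA stay NA, without a warning
short_distance <- do.call(pmin, c(as.list(sample2[dist_cols]), na.rm = TRUE))

# Position of the smallest distance per row, NA when the row is all NA
idx <- unname(apply(as.matrix(sample2[dist_cols]), 1, function(r) {
  i <- which.min(r)
  if (length(i) == 0) NA_integer_ else i
}))

# Translate the position into the object number taken from the column name
unit <- as.integer(sub("^centroid_distance_", "", dist_cols[idx]))

result <- data.frame(sample2[c("FID", "year")], short_distance, unit)
result
```

Selecting columns with grep means the code does not need to be adjusted when the real data has more distance columns than this sample.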
Here is a solution using dplyr & stringr packages (but you can just import tidyverse):
library(tidyverse)
df <- sample2 %>%
gather('centroid', 'dist', 3:length(.)) %>%
group_by(year) %>% # years happen to be unique in this sample; use group_by(FID, year) if they are not
slice(if(all(is.na(dist))) 1L else which.min(dist)) %>%
mutate(centroid = str_replace(centroid, "centroid_distance_", ""))
df
Returns:
# A tibble: 6 x 4
# Groups: year [6]
FID year centroid dist
<dbl> <dbl> <chr> <dbl>
1 15 1953 1 NA
2 17 1957 4 62.9
3 14 1983 2 24.3
4 12 1990 1 220.
5 12 1994 3 12.8
6 18 2000 3 18.8
A data.table solution
library(data.table)
setDT(sample2)
s <- melt(sample2, id = 1:2, variable.name = "object", value.name = "distance") ## pivot
s[, obj := as.numeric(object) ## transform factor into numeric
][, .(shortest = min(distance, na.rm=TRUE), unit= which.min(distance)), by = .(FID, year) ## calculate the shortest and which
][is.infinite(shortest), shortest:= NA # transform Inf into NA
][] ## report
I have a dataframe that looks like
country
sector
data1
data2
France
1
7
.
France
2
10
.
belgium
1
12
7
belgium
2
14
8
I want to drop any column that is missing for a country in all of its sectors. In this example, I would like to drop/exclude data2 because it is missing for sectors 1 and 2 in France. To be clear, I would also be throwing out the values of data2 for belgium in this example.
My expected output would look like
country sector data1
France 1 7
France 2 10
belgium 1 12
belgium 2 14
data2 is now excluded because all of its values are missing across every sector in France.
We can group by country, create logical columns flagging whether the number of NA elements equals the group size, ungroup, replace the flagged columns with NA, and then remove those columns in select.
library(dplyr)
library(stringr)
df1 %>%
group_by(country) %>%
mutate(across(everything(), ~ sum(is.na(.x)) == n(),
.names = "{.col}_lgl")) %>%
ungroup %>%
mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(),
"_lgl")) )) NA else .x)) %>%
select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
Output:
# A tibble: 4 × 3
country sector data1
<chr> <int> <int>
1 France 1 7
2 France 2 10
3 belgium 1 12
4 belgium 2 14
If we don't use group_by, the steps can be simplified as shown in Maël's post, i.e. do the grouping with a base R function inside select; either tapply or ave works.
df1 %>%
select(where(~ !any(tapply(is.na(.x), df1[["country"]],
FUN = all))))
data
df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, 12L, 14L), data2 = c(NA,
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")
In base R:
df1 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 2
belgium 1 12 7
belgium 2 14 8")
df2 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 NA
belgium 1 12 7
belgium 2 14 8")
df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]
# country sector data1 data2
# 1 France 1 7 NA
# 2 France 2 10 2
# 3 belgium 1 12 7
# 4 belgium 2 14 8
df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
# country sector data1
# 1 France 1 7
# 2 France 2 10
# 3 belgium 1 12
# 4 belgium 2 14
Note: \(x) is shorthand for function(x), available since R 4.1.
For a base R solution, you can use sapply over the column names and keep only the columns that contain no NA at all:
keep_remove <- sapply(names(data), \(x) all(!is.na(data[[x]])))
data <- data[, keep_remove, drop = FALSE]
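Note that the last snippet drops every column containing at least one NA, which is stricter than what the question asks for. Here is a minimal sketch of the per-country check using tapply, assuming the grouping column is called country as in the example. The helper name all_na_in_some_group is made up for illustration.

```r
df1 <- data.frame(
  country = c("France", "France", "belgium", "belgium"),
  sector = c(1L, 2L, 1L, 2L),
  data1 = c(7L, 10L, 12L, 14L),
  data2 = c(NA, NA, 7L, 8L)
)

# TRUE if, for at least one group in g, every value of x is NA
all_na_in_some_group <- function(x, g) {
  any(tapply(is.na(x), g, all))
}

# Check every column against the country grouping
drop_col <- vapply(df1, all_na_in_some_group, logical(1), g = df1$country)

# Keep the remaining columns; drop = FALSE keeps a data frame even for one column
result <- df1[, !drop_col, drop = FALSE]
result
```

A column with only some NA values inside a country is kept, because all() is only TRUE when the whole group is missing.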
My dataframe consists of monthly weather data as follows for a given location
set.seed(123)
dat <-
data.frame(Year = rep(1980:1985, each = 12),
Month = rep(1:12, times = 6),
value = runif(12*6))
I have split the year into seasons as shown below.
s1 <- c(11, 12, 1, 2) # season 1 consists of month 11, 12, 1 and 2 i.e. cuts across years
s2 <- c(3, 4, 5) # season 2 consists of month 3, 4, 5
s3 <- c(6, 7, 8, 9, 10) # season 3 consists of month 6, 7, 8, 9, 10
Taking example for 1980 -
season 1 is Nov-Dec from 1979 and Jan-Feb from 1980
season 2 is from March - May of 1980
season 3 is June - Oct of 1980
However, for year 1980, season 1 is incomplete since it only has months 1 and 2 and is missing months 11 and 12 from 1979.
In contrast, for year 1985, seasons 1 to 3 are complete, and hence I do not need months 11 and 12 from 1985 since they contribute to season 1 of 1986.
With this background, I want to sum the monthly values of each season by year so that the data frame is in year x season format instead of year-month format. In doing so, there will be no values for 1980 season 1 since it has missing months.
For cases where months cut across years, I don't know how to sum the individual months.
library(dplyr)
season_list <- list(s1, s2, s3)
temp_list <- list()
for(s in seq_along(season_list)){
season_ref <- unlist(season_list[s])
if(sum(diff(season_ref) < 0) != 0){ # check if season cuts across years
dat %>%
dplyr::filter(Month %in% season_ref) %>%
# how do I sum across years for this exception
} else {
# if season does not cut across years, simply filter the months in each year and add
temp_list[[s]] <-
dat %>%
dplyr::filter(Month %in% season_ref) %>%
dplyr::group_by(Year) %>%
dplyr::summarise(season_value = sum(value)) %>%
dplyr::mutate(season = s)
}
}
Assuming that you want to sum the values for each season, calculate the Season and endYear (the year in which the season ends) and then sum by those.
dat %>%
group_by(endYear = Year + (Month %in% 11:12),
Season = 1 * (Month %in% s1) +
2 * (Month %in% s2) +
3 * (Month %in% s3)) %>%
summarize(value = sum(value), .groups = "drop")
giving:
# A tibble: 19 x 3
endYear Season value
<int> <dbl> <dbl>
1 1980 1 1.08
2 1980 2 2.23
3 1980 3 2.47
4 1981 1 2.66
5 1981 2 1.25
6 1981 3 2.91
7 1982 1 3.00
8 1982 2 1.43
9 1982 3 3.50
10 1983 1 1.48
11 1983 2 0.693
12 1983 3 1.49
13 1984 1 1.82
14 1984 2 1.29
15 1984 3 1.77
16 1985 1 2.03
17 1985 2 1.47
18 1985 3 3.31
19 1986 1 1.38
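The result above still contains sums for the incomplete seasons (1980 season 1 and 1986 season 1). If those should be blank, as the question asks, one possible sketch in base R is to count the months that went into each sum and set the value to NA when the count falls short of the season's length. As in the answer above, this assumes months 11 and 12 are the only ones that belong to the following year's season. The names seasons, season_of_month and n_months are illustrative.

```r
set.seed(123)
dat <- data.frame(Year = rep(1980:1985, each = 12),
                  Month = rep(1:12, times = 6),
                  value = runif(12 * 6))

seasons <- list(c(11, 12, 1, 2), c(3, 4, 5), c(6, 7, 8, 9, 10))

# Lookup vector: position = month, value = season number
season_of_month <- integer(12)
for (s in seq_along(seasons)) season_of_month[seasons[[s]]] <- s

dat$Season <- season_of_month[dat$Month]
# Assumption: Nov and Dec count towards the season ending in the next year
dat$endYear <- dat$Year + (dat$Month %in% c(11, 12))

# Sum and number of available months per endYear x Season
sums <- aggregate(value ~ endYear + Season, data = dat, FUN = sum)
counts <- aggregate(value ~ endYear + Season, data = dat, FUN = length)
names(counts)[3] <- "n_months"

out <- merge(sums, counts)
# Blank out seasons that do not have all of their months
out$value[out$n_months < lengths(seasons)[out$Season]] <- NA
out <- out[order(out$endYear, out$Season), ]
out
```

Dropping those rows instead of setting them to NA would work the same way, using the same comparison as a row filter.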
my data is here:
x <- data.frame("Year" = c(1945,1945,1945,1946,1946,1946, 1947,1947,1947), "Age" = c(1,2,3,1,2,3,1,2,3), "Value" = c(4,5,6,7,8,9,10,11,12))
I would like to assign the value from "year+1 and age +1" to a new variable. Ex. For the case with year =1945 and age=1, I would like to assign the value = 8 (from year = 1946, age =2 ) to the new variable.
My ideal result will be like this:
x <- data.frame("Year" = c(1945,1945,1945,1946,1946,1946, 1947,1947,1947), "Age" = c(1,2,3,1,2,3,1,2,3), "Value" = c(4,5,6,7,8,9,10,11,12),"Year1moereandAge1more"= c(8,9,NA, 11, 12, NA, NA, NA,NA))
Thank you for helping a beginner.
Using a modified self-join:
library(dplyr)
x %>%
transmute(Year = Year - 1, Age = Age - 1, Year1moereandAge1more = Value) %>%
right_join(x) %>%
arrange(Year, Age)
# Joining, by = c("Year", "Age")
# Year Age Year1moereandAge1more Value
# 1 1945 1 8 4
# 2 1945 2 9 5
# 3 1945 3 NA 6
# 4 1946 1 11 7
# 5 1946 2 12 8
# 6 1946 3 NA 9
# 7 1947 1 NA 10
# 8 1947 2 NA 11
# 9 1947 3 NA 12
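The same lookup can also be sketched without a join, assuming each Year/Age combination appears only once: build a key for every row, build the key of the row one year and one age later, and use match to find it. The helper variables key and next_key are illustrative.

```r
x <- data.frame(Year = c(1945, 1945, 1945, 1946, 1946, 1946, 1947, 1947, 1947),
                Age = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                Value = c(4, 5, 6, 7, 8, 9, 10, 11, 12))

# Key of each row, and key of the row we want to look up
key <- paste(x$Year, x$Age)
next_key <- paste(x$Year + 1, x$Age + 1)

# match() returns NA where no such row exists, so Value[NA] gives NA
x$Year1moereandAge1more <- x$Value[match(next_key, key)]
x
```

Unlike the join, this keeps the original row order and column order, so no arrange step is needed.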
I need to use na.locf from the zoo package to replace NA values with the last observed value. However, I need to do this only for specific country & variable pairs. These pairs are specified logically in a separate data frame, an example of which is shown below.
Country <- c("FRA", "DEU", "CHE")
acctm <- c(0, 0, 1)
acctf <- c(1, 1, 0)
df1 <- data.frame(Country, acctm, acctf)
Country acctm acctf
1 FRA 0 1
2 DEU 0 1
3 CHE 1 0
A 1 means na.locf should be used for that pair. An example of the dataset where replacement would be needed is shown below.
Country <- c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE")
Year <- c(2010, 2020, 2010, 2020, 2010, 2020)
acctm <- c(20, 30, 10, NA, 20, NA)
acctf <- c(20, NA, 15, NA, 40, NA)
df2 <- data.frame(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 NA
3 DEU 2010 10 15
4 DEU 2020 NA NA
5 CHE 2010 20 40
6 CHE 2020 NA NA
Given both of the example datasets, the result of the function executing na.locf on df2 for country/variable pairs indicated by df1 should look like this:
acctm <- c(20, 30, 10, NA, 20, 20)
acctf <- c(20, 20, 15, 15, 40, NA)
df3 <- data.frame(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 20
3 DEU 2010 10 15
4 DEU 2020 NA 15
5 CHE 2010 20 40
6 CHE 2020 20 NA
The real application is a much larger dataset, so "calls" should be generalized. Thanks.
One option is a data.table join on the 'Country' column, then use Map to apply na.locf0 to the columns of the second dataset ('nm1') depending on the value of the corresponding columns of the first dataset, and assign (:=) the output back to those columns.
library(zoo)
library(data.table)
nm1 <- c('acctm', 'acctf')
nm2 <- paste0("i.", nm1)
setDT(df2)[df1, (nm1) := Map(function(x, y) if(y == 1) na.locf0(x)
else x, mget(nm1), mget(nm2)), on = .(Country), by = .EACHI]
df2
# Country Year acctm acctf
#1: FRA 2010 20 20
#2: FRA 2020 30 20
#3: DEU 2010 10 15
#4: DEU 2020 NA 15
#5: CHE 2010 20 40
#6: CHE 2020 20 NA
One dplyr and tidyr option could be:
library(dplyr)
library(tidyr)
library(zoo)
df2 %>%
pivot_longer(-c(Country, Year)) %>%
left_join(df1 %>%
pivot_longer(names_to = "cond_names",
values_to = "cond_values", -Country),
by = c("Country" = "Country",
"name" = "cond_names")) %>%
group_by(Country, name) %>%
mutate(value = if_else(cond_values == 1, na.locf(value, na.rm = FALSE), value)) %>% # na.rm = FALSE keeps leading NAs so the length is unchanged
select(-cond_values) %>%
pivot_wider()
Country Year acctm acctf
<fct> <dbl> <dbl> <dbl>
1 FRA 2010 20 20
2 FRA 2020 30 20
3 DEU 2010 10 15
4 DEU 2020 NA 15
5 CHE 2010 20 40
6 CHE 2020 20 NA
Left join df2 with df1 on Country and then, grouping by Country, generate the appropriate value for each numeric column. Note that we use na.locf0, which ensures that the result has the same length as the input. Finally, select the appropriate columns.
library(dplyr)
library(zoo)
df2 %>%
left_join(df1, by = "Country") %>%
group_by(Country) %>%
mutate(acctm = if (first(acctm.y)) na.locf0(acctm.x) else acctm.x,
acctf = if (first(acctf.y)) na.locf0(acctf.x) else acctf.x) %>%
ungroup %>%
select(names(df2))
giving:
# A tibble: 6 x 4
Country Year acctm acctf
<fct> <dbl> <dbl> <dbl>
1 FRA 2010 20 20
2 FRA 2020 30 20
3 DEU 2010 10 15
4 DEU 2020 NA 15
5 CHE 2010 20 40
6 CHE 2020 20 NA
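Since the real dataset has many more variables, here is a rough sketch of how the flag table could drive the fill for every flagged column without naming each one. To keep it self-contained it uses a small base R stand-in called locf (an assumption, not the zoo function) that carries the last non-NA value forward and keeps the length, much like na.locf0. It also assumes the rows of each country are already sorted by Year.

```r
df1 <- data.frame(Country = c("FRA", "DEU", "CHE"),
                  acctm = c(0, 0, 1),
                  acctf = c(1, 1, 0))
df2 <- data.frame(Country = c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE"),
                  Year = c(2010, 2020, 2010, 2020, 2010, 2020),
                  acctm = c(20, 30, 10, NA, 20, NA),
                  acctf = c(20, NA, 15, NA, 40, NA))

# Stand-in for zoo::na.locf0: last observation carried forward, same length
locf <- function(v) {
  idx <- cumsum(!is.na(v))        # index of the most recent non-NA value
  c(NA, v[!is.na(v)])[idx + 1]    # leading NAs stay NA
}

df3 <- df2
# Every column of df1 except Country is treated as a flag column
for (v in setdiff(names(df1), "Country")) {
  for (cty in df1$Country[df1[[v]] == 1]) {
    rows <- which(df3$Country == cty)
    df3[[v]][rows] <- locf(df3[[v]][rows])
  }
}
df3
```

Swapping locf for zoo's na.locf0 inside the loop should behave the same way if zoo is already loaded.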
I've got a rather ugly bit of data to tidy up and need help! What my data look like now:
countries <- c("Austria", "Belgium", "Croatia")
df <- tibble("age" = c(28,42,19, 67),
"1_recreate_1"=c(NA,15,NA,NA),
"1_recreate_2"=c(NA,10,NA,NA),
"1_recreate_3"=c(NA,8,NA,NA),
"1_recreate_4"=c(NA,4,NA,NA),
"1_fairness" = c(NA, 7, NA, NA),
"1_confidence" = c(NA, 5, NA, NA),
"2_recreate_1"=c(29,NA,NA,30),
"2_recreate_2"=c(20,NA,NA,24),
"2_recreate_3"=c(15,NA,NA,15),
"2_recreate_4"=c(11,NA,NA,9),
"2_fairness" = c(4, NA, NA, 1),
"2_confidence" = c(5, NA, NA, 4),
"3_recreate_1"=c(NA,NA,50,NA),
"3_recreate_2"=c(NA,NA,40,NA),
"3_recreate_3"=c(NA,NA,30,NA),
"3_recreate_4"=c(NA,NA,20,NA),
"3_fairness" = c(NA, NA, 2, NA),
"3_confidence" = c(NA, NA, 2, NA),
"overall" = c(3,3,2,5))
What I need them to look like at the end (hard-coding it):
df <- tibble(age = rep(c(28,42,19,67), each=4),
country = rep(c("Belgium", "Austria", "Croatia", "Belgium"), each=4),
recreate = rep(1:4, times=4),
fairness = rep(c(4,7,2,1), each=4),
confidence = rep(c(5,5,2,4), each=4),
allocation = c(29, 20, 15, 11,
15, 10, 8, 4,
50, 40, 30, 20,
30, 24, 15, 9),
overall = rep(c(3,3,2,5), each=4))
Steps to get there (I think!):
1. Replace the starting numbers for those columns using my list of countries.
The number that starts the string is the index in countries. In other words, 16_recreate_1 would correspond with the 16th country in the vector countries. I think the following code works (though am not sure it's exactly right):
for(i in length(countries):1){
colnames(df) <- str_replace(colnames(df), paste0(i,"_"), paste0(countries[i],"_"))
}
2. Create a new variable called "country" by getting the name of the column(s) that is NOT NA for each row.
I tried a BUNCH of experimentation with which.max and names, but couldn't get it fully functional.
3. Create new variables (recreate_1...recreate_4) that grab the [country_name]_recreate_1...[country_name]_recreate_4 value for each row, whatever country is non-NA for that person.
Maybe rowSums is the way to do this?
4. Make the data long instead of wide
I think this is going to require gather, but I'm not sure how to gather from only the variables country and recreate_1...recreate_4.
I'm so sorry this is so complex. Tidyverse solutions are preferred but any help is greatly appreciated!
A somewhat different tidyverse possibility could be:
df %>%
gather(variable, allocation, na.rm = TRUE) %>%
separate(variable, c("ID", "variable", "recreate"), convert = TRUE) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
select(-variable, -ID)
recreate allocation country
<int> <dbl> <fct>
1 1 15 Austria
2 2 10 Austria
3 3 8 Austria
4 4 4 Austria
5 1 29 Belgium
6 1 30 Belgium
7 2 20 Belgium
8 2 24 Belgium
9 3 15 Belgium
10 3 15 Belgium
11 4 11 Belgium
12 4 9 Belgium
13 1 50 Croatia
14 2 40 Croatia
15 3 30 Croatia
16 4 20 Croatia
First, this transforms the data from wide to long format, removing the rows with NA. Second, it separates the variable names into three columns. Third, it turns the vector of countries into a data frame and assigns each country a unique ID. Finally, it joins the two and removes the redundant variables.
A solution to the edited question:
df %>%
select(matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, allocation, -rowid, na.rm = TRUE) %>%
separate(var, c("ID", "var", "recreate"), convert = TRUE) %>%
select(-var) %>%
left_join(data.frame(countries) %>%
mutate(country = countries,
ID = seq_along(countries)) %>%
select(-countries), by = c("ID" = "ID")) %>%
left_join(df %>%
select(-matches("(recreate)")) %>%
rowid_to_column() %>%
gather(var, val, -rowid, na.rm = TRUE) %>%
mutate(var = gsub("[^[:alpha:]]", "", var)) %>%
spread(var, val), by = c("rowid" = "rowid")) %>%
select(-rowid, -ID)
recreate allocation country age confidence fairness overall
<int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 15 Austria 42 5 7 3
2 2 10 Austria 42 5 7 3
3 3 8 Austria 42 5 7 3
4 4 4 Austria 42 5 7 3
5 1 29 Belgium 28 5 4 3
6 1 30 Belgium 67 4 1 5
7 2 20 Belgium 28 5 4 3
8 2 24 Belgium 67 4 1 5
9 3 15 Belgium 28 5 4 3
10 3 15 Belgium 67 4 1 5
11 4 11 Belgium 28 5 4 3
12 4 9 Belgium 67 4 1 5
13 1 50 Croatia 19 2 2 2
14 2 40 Croatia 19 2 2 2
15 3 30 Croatia 19 2 2 2
16 4 20 Croatia 19 2 2 2
First, this selects the columns that contain recreate and adds a column with the row ID. Second, it follows the steps from the original solution. Third, it selects the columns that do not contain recreate, performs a wide-to-long transformation, removes the number from the column names and transforms the data back to the original wide format. Finally, it joins the two on row ID and removes the redundant variables.
library(dplyr)
library(tidyr)
df %>% mutate(rid=row_number()) %>%
gather(key,val,-c(age,overall,rid, matches('recreate'))) %>% mutate(country=sub('(^\\d)_.*','\\1',key),country=countries[as.numeric(country)]) %>%
filter(!is.na(val)) %>% mutate(key=sub('(^\\d\\_)(.*)','\\2',key)) %>%
spread(key,val) %>% gather(key = recreate,value = allocation,-c(rid,age,overall,country,confidence,fairness)) %>%
filter(!is.na(allocation)) %>% mutate(recreate=sub('.*_(\\d$)','\\1',recreate))
Here (^\\d)_.* captures the leading digit, while .*_(\\d$) captures the trailing digit. Note that \\d matches a single digit, so use \\d+ if there are ten or more countries.
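To illustrate the patterns, here is a small sketch on made-up column names (cols is hypothetical, including a two-digit country index like the 16_recreate_1 mentioned in the question). It uses \\d+ and anchors so that multi-digit indices are captured whole.

```r
cols <- c("1_recreate_1", "16_recreate_4", "2_fairness")

# Leading number: the index into the countries vector
country_idx <- as.integer(sub("^(\\d+)_.*$", "\\1", cols))

# Variable name with the leading number and underscore removed
var_name <- sub("^\\d+_", "", cols)

# Trailing number of the recreate columns only
recreate_no <- as.integer(sub("^.*_(\\d+)$", "\\1", cols[1:2]))

country_idx
var_name
recreate_no
```

Anchoring with ^ also avoids replacing a digit in the middle of a name, which the unanchored loop in the question could do once there are ten or more countries.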