I have a long dataset with student grades and courses across many semesters. It has many NAs and many rows for each student. I want one row per student, filling in those NAs while keeping the same column names.
Here's a sample:
library(tidyverse)
sample <- tibble(student = c("Corey", "Corey", "Sibley", "Sibley"),
fall_course_1 = c("Math", NA, "Science", NA),
fall_course_2 = c(NA, "English", NA, NA),
fall_grade_1 = c(90, NA, 98, NA),
fall_grade_2 = c(NA, 60, NA, NA))
And here's what I'd like it to look like:
library(tidyverse)
answer <- tibble(student = c("Corey", "Sibley"),
fall_course_1 = c("Math", "Science"),
fall_course_2 = c("English", NA),
fall_grade_1 = c(90, 98),
fall_grade_2 = c(60, NA))
Some semesters, some students take many classes and some just one. I've tried using coalesce(), but I can't figure it out. Any help would be appreciated!
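For context, coalesce() on its own works element-wise across vectors, taking the first non-NA value at each position:
library(dplyr)
coalesce(c("Math", NA), c(NA, "English"))
# [1] "Math"    "English"
I just can't see how to apply that per student across rows.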
This should do it: pivot the data long, drop the NAs, then pivot back to wide.
You need to convert the numeric values to character temporarily so they can share a column with the course labels; type_convert() is a lazy way to turn them back afterwards.
library(dplyr)
library(tidyr)
library(readr)
reshaped <- sample %>%
  mutate_if(is.numeric, as.character) %>%  # grades become character so they can share a column with courses
  pivot_longer(-student) %>%
  drop_na() %>%
  pivot_wider(id_cols = student, names_from = name, values_from = value) %>%
  type_convert()  # grades back to numeric
You could get the first non-NA value in each column for each student.
library(dplyr)
sample %>% group_by(student) %>% summarise_all(~ na.omit(.)[1])
# A tibble: 2 x 5
# student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
# <chr> <chr> <chr> <dbl> <dbl>
#1 Corey Math English 90 60
#2 Sibley Science NA 98 NA
This approach returns NA if all the values in a group are NA.
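A quick way to see that edge case:
na.omit(c(NA, NA))[1]
# [1] NA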
Using a custom coalesce function and dplyr:
coalesce_all_columns <- function(df) {
  # splice the values into coalesce() to return the first non-NA one
  return(coalesce(!!!as.list(df)))
}
library(dplyr)
sample %>%
group_by(student) %>%
summarise_all(coalesce_all_columns)
# A tibble: 2 x 5
student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
<chr> <chr> <chr> <dbl> <dbl>
1 Corey Math English 90 60
2 Sibley Science NA 98 NA
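The !!! just splices the list elements in as separate arguments to coalesce(), so on a plain vector the helper returns the first non-NA element:
coalesce_all_columns(c(NA, "Math", NA))
# [1] "Math"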
You could also use the data.table package as follows:
library(data.table)
setDT(sample)[, lapply(.SD, na.omit), by = student]
#    student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
# 1:   Corey          Math       English           90           60
# 2:  Sibley       Science          <NA>           98           NA
Related
I have a data frame of multiple variables for different years, that looks kind of like this:
df <- data.frame(name=c("name1", "name2", "name3", "name4"),
X1990=c(1,6,8,NA),
X1990.1=c(10,20,NA,2),
X1990.2=c(2,4,6,8),
X1990.3=c(1,NA,3,6),
X1990.4=c(8,7,5,4),
X1991=c(2,6,3,5),
X1991.1=c(NA,20,NA,2),
X1991.2=c(NA,NA,NA,NA),
X1991.3=c(1,NA,3,5),
X1991.4=c(8,9,6,3))
I made this example with only 5 variables per year and only 2 years, but in reality it is a much larger df, with tens of variables for the years 1990 to 2020.
I want to create a new data frame with the sums of all the columns for the same year, so that the new data frame looks like this:
df_sum <- data.frame(name=c("name1", "name2", "name3", "name4"),
X1990=c(22, 37, 22, 20),
X1991=c(11,35,12,15))
I was thinking of some loop over rowSums(across(matches('pattern')), na.rm = TRUE) that I found in other questions, but so far I have not been able to implement it.
Thanks!
We can reshape to 'long' format with pivot_longer, and get the sum while reshaping back to 'wide'
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = starts_with("X"), names_to = "name1") %>%
mutate(name1 = str_remove(name1, "\\.\\d+$")) %>%
pivot_wider(names_from = name1, values_from = value,
values_fn = ~ sum(.x, na.rm = TRUE))
-output
# A tibble: 4 × 3
name X1990 X1991
<chr> <dbl> <dbl>
1 name1 22 11
2 name2 37 35
3 name3 22 12
4 name4 20 15
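Closer to the rowSums() idea mentioned in the question, here is a rough sketch that sums every column sharing a year prefix (the prefixes are recovered from the column names; this assumes the XYYYY / XYYYY.n naming of df above):
library(dplyr)
library(purrr)
# year prefixes found in the column names, e.g. "X1990", "X1991"
years <- unique(sub("\\..*", "", grep("^X[0-9]{4}", names(df), value = TRUE)))
df_sum <- bind_cols(
  df["name"],
  map_dfc(set_names(years),
          ~ rowSums(df[startsWith(names(df), .x)], na.rm = TRUE))
)
# df_sum now holds the same values as the desired df_sum above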
Or in base R, use split.default to split the data into a list of datasets based on the column name pattern, get the rowSums and cbind with the first column
cbind(df[1], sapply(split.default(df[-1],
trimws(names(df)[-1], whitespace = "\\.\\d+")), rowSums, na.rm = TRUE))
name X1990 X1991
1 name1 22 11
2 name2 37 35
3 name3 22 12
4 name4 20 15
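The trimws() call above just strips the trailing .n suffix from each column name, e.g.:
trimws("X1990.3", whitespace = "\\.\\d+")
# [1] "X1990"
(Note that the whitespace argument is treated as a regular expression, not literal whitespace.)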
I have a question concerning merging rows in a data frame.
I have seen a couple of questions regarding merging rows; however, I have a hard time understanding them and applying them to my situation.
I have a dataframe with a structure like this:
person_id test_date serial_number freezer_number test_1 test_2 test_3 test_4
x 01/01/2010 c d positive NA NA NA
x 05/01/2010 a b NA positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
I want to merge the rows so that the data in the other columns remains intact (mainly the test date), and I want the rows for the tests and the person_id to merge so that the same individual is in one row with multiple tests.
This would be the ideal output:
person_id test_date serial_number freezer_number test_date2 test_1 test_2 test_3 test_4
x 01/01/2010 c d 05/01/2010 positive positive NA NA
y 02/02/2020 e f NA positive NA NA NA
......................................
How do I go about this? I have tried the aggregate() function before, but it is very unclear to me.
Any help is appreciated, I can give more information to clarify my current code and data frame!
You could use summarize_all, grouped by person_id. For each variable, this keeps the first non-NA value per person_id.
I added a pivot_wider to preserve the different test_dates (as pointed out by #Andrea M).
library(dplyr)
library(tidyr)
df1 <- df %>%
  group_by(person_id) %>%
  mutate(id = seq_along(person_id)) %>%
  pivot_wider(names_from = id,
              values_from = test_date,
              names_prefix = "test_date") %>%
  summarize_all(list(~ .[!is.na(.)][1]))
Output
> df1
# A tibble: 2 x 9
person_id serial_number freezer_number test_1 test_2 test_3 test_4 test_date1 test_date2
<chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
1 x c d positive positive NA NA 01/01/2010 05/01/2010
2 y e f positive NA NA NA 02/02/2020 NA
What you're trying to do is reshaping the data from long format (one row per test) to wide format (one row per person, tests are in separate columns). This can be done in many ways, for example with tidyr::pivot_wider().
However, there's a complicating factor: your dataset is not quite in long format, because there are already multiple columns per test result. So you first need to fix that.
# Load libraries
library(tidyr)
library(dplyr)
library(stringr)
# Create dataset
df <- tribble(~person_id, ~test_date, ~serial_number, ~freezer_number, ~test_1, ~test_2, ~test_3, ~test_4,
"x", "01/01/2010", "c", "d", "positive", NA, NA, NA,
"x", "05/01/2010", "a", "b", NA, "positive", NA, NA,
"y", "02/02/2020", "e", "f", "positive", NA, NA, NA)
df2 <- df %>%
# Add a column indicating test number
group_by(person_id) %>%
mutate(test_number = row_number(),
# Gather the test results into a single column
test_result = paste0(test_1, test_2, test_3, test_4) %>%
str_remove_all("NA")) %>%
select(-(test_1:test_4)) %>%
# Reshape from long to wide
pivot_wider(names_from = test_number,
values_from = c(test_date, serial_number,
freezer_number, test_result)) %>%
# Reorder the columns
relocate(ends_with("1"), .before = ends_with("2"))
df2
I have a dataset with many columns with similar names. Some of the columns have values in cents and others in dollars, e.g.:
library(tidyverse)
data <- tribble(
  ~col1_cents, ~col1, ~col2_cents, ~col2,
  1000,        NA,    3000,        NA,
  NA,          20,    NA,          25.2,
  2000,        NA,    2030,        NA
)
For one variable, it's easy to divide the value by 100, assign it to the dollar variable, and delete the cents variable, e.g.:
data %>%
  mutate(col1 = if_else(is.na(col1),
                        col1_cents / 100,
                        col1)) %>%
  select(-col1_cents)
Is there a generalisable way to do this for all variables in the dataset that end in _cents? I tried this with mutate_at and ends_with but could not get it to rename to the original variable without _cents...
Thanks!
You can use mutate_at
library(dplyr)
data %>% mutate_at(vars(ends_with("cents")), ~./100)
# A tibble: 3 x 4
# col1_cents col1 col2_cents col2
# <dbl> <dbl> <dbl> <dbl>
#1 10 NA 30 NA
#2 NA 20 NA 25.2
#3 20 NA 20.3 NA
If you then want to combine each pair of columns, you can use split.default to split the columns based on the similarity of their names, then imap_dfc from purrr along with coalesce to combine them together.
df1 <- data %>% mutate_at(vars(ends_with("cents")), ~./100)
purrr::imap_dfc(split.default(df1, sub("_.*", "", names(df1))),
                ~ .x %>% mutate(!!.y := coalesce(.x[[2]], .x[[1]])) %>% select(all_of(.y)))
# col1 col2
# <dbl> <dbl>
#1 10 30
#2 20 25.2
#3 20 20.3
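With current dplyr you could also sketch this with across() and cur_column(), assuming every *_cents column has a matching column without the suffix (as in data above):
library(dplyr)
# dollar columns are the *_cents names with the suffix stripped
dollar_cols <- sub("_cents$", "", grep("_cents$", names(data), value = TRUE))
data %>%
  mutate(across(all_of(dollar_cols),
                ~ coalesce(.x, .data[[paste0(cur_column(), "_cents")]] / 100))) %>%
  select(-ends_with("_cents"))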
I am struggling with collapsing my data.
Basically my data consists of multiple indicators with multiple observations for each year. I want to convert this to one observation for each indicator for each country.
I have a rank indicator which specifies the sequence in which the observations have to be chosen.
Basically the observation with the first rank (thus 1 instead of 2) has to be chosen, as long as for that rank the value is not NA.
An additional question: the years in my dataset vary over time, so is there a way to make the code dynamic, in the sense that it applies to all column names from 1990 to 2025 when they exist?
df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1),
id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")),
`1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22),
`2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2),
`2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
rank = c(1, 2 , 3 , 4 , 1, 2, 3, 1, 3, 2, 4, 5))
The result should be the following dataset:
result <- data.frame(country.code = c(1, 1, 1),
id = as.factor(c("GDP", "CA", "GR")),
`1999`= c(1000, 100, 22),
`2000`= c(1, 1, 12),
`2001`= c(3, 100, 1))
I attempted the following solution (but this does not work given the NAs in the data, and I would have to specify each column):
test <- df %>%
  group_by(Country.Code, Indicator.Code) %>%
  summarise(test1999 = `1999`[which.min(rank)])
I don't see how I can tell R to omit the cases of the 1999 column that are NA.
We can subset using the minimum rank of the non-NA values for a column, e.g. x[rank == min(rank[!is.na(x)])].
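On a single vector, that rule looks like this (x and rank here are just made up for illustration):
x    <- c(NA, NA, 1, 1000)
rank <- c(1, 2, 3, 4)
x[rank == min(rank[!is.na(x)])]
# [1] 1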
An additional question: The years in my dataset vary over time,....
Using summarise_at, vars and matches can be used to select any column name containing four digits, i.e. 1990-2025, via the regular expression [0-9]{4} (a digit 0-9 repeated exactly four times), and the above procedure can be applied to those columns with funs.
library(dplyr)
df %>% group_by(country.code,id) %>%
summarise(`1999` = `1999`[rank==ifelse(all(is.na(`1999`)),1, min(rank[!is.na(`1999`)]))])
df %>% group_by(country.code,id) %>%
summarise_at(vars(matches("[0-9]{4}")),funs(.[rank==ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
# A tibble: 3 x 5
# Groups: country.code [?]
country.code id `1999` `2000` `2001`
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 CA 100 1 100
2 1 GDP 1000 1 3
3 1 GR 22 12 1
Here is one option that uses tidyr::fill to replace the NAs by the first non-NA value after we arranged the data by id and rank. It might not be the most efficient approach because we first gather and then spread the data again.
library(tidyverse)
df %>%
arrange(id, rank) %>%
gather(key, value, X1999:X2001) %>%
tidyr::fill(value, .direction = "up") %>%
spread(key, value) %>%
group_by(id) %>%
slice(1) %>%
ungroup()
# A tibble: 3 x 6
# country.code id rank X1999 X2000 X2001
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 CA 1 100 1 100
#2 1 GDP 1 1000 1 3
#3 1 GR 1 22 12 1
NOTE: the column names here are X1999, X2000 etc., not 1999, 2000 etc. as they probably are in your data. But that is easily adaptable.
You can change the data frame to long form, remove NAs, select the values corresponding to the minimum rank, and spread back to wide form:
library(dplyr)
library(tidyr)
test <- df %>%
  gather("Year", "Value", X1999:X2001) %>%
  filter(!is.na(Value)) %>%
  group_by(country.code, id, Year) %>%
  arrange(rank) %>%
  summarise(first(Value)) %>%
  spread(Year, `first(Value)`)
Hoping that someone can help me with a trick. I've found similar questions online, but none of the examples I've seen do exactly what I'm looking for or work on my data structure.
I need to remove NAs from a data frame within data subsets and compress the remaining non-NA values into one row for each data subset.
Example:
#create example data
a <- c(1, 1, 1, 2, 2, 2) #this is the subsetting variable in the example
b <- c(NA, NA, "B", NA, NA, "C") #max 1 non-NA value for each subset
c <- c("A", NA, NA, "A", NA, NA)
d <- c(NA, NA, 1, NA, NA, NA) #some subsets for some columns have all NA values
dat <- as.data.frame(cbind(a, b, c, d))
Desired output:
a b c d
1 B A 1
2 C A <NA>
Rules of thumb:
1) Need to remove NA values from each column
2) Loop along data subsets (column "a" in example above)
3) All columns, for each subset, have a max of 1 non-NA value, but some columns may have all NA values
Ideas:
lapply or dplyr is probably helpful to loop along all columns
na.omit is likely helpful if the subsetting column (which has entries for all rows) can be ignored, something like as.data.frame(lapply(dat, na.omit)); the issue is getting the lapply output back into a data frame when some subsets don't return any non-NA values
x[which.min(is.na(x))] effectively accomplishes this if laboriously applied to each individual column (see the short demo below)
Any help is appreciated to put the final pieces together! Thank you!
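To make that which.min(is.na(x)) idea concrete on a single vector (including the all-NA case):
x <- c(NA, NA, "B")
x[which.min(is.na(x))]
# [1] "B"
y <- c(NA, NA)
y[which.min(is.na(y))]
# [1] NA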
One solution can be achieved using dplyr::summarise_all. The data needs to be grouped by a.
library(dplyr)
dat %>%
group_by(a) %>%
summarise_all(funs(.[which.min(is.na(.))]))
# # A tibble: 2 x 4
# a b c d
# <fctr> <fctr> <fctr> <fctr>
# 1 1 B A 1
# 2 2 C A <NA>
Solution with data.table and na.omit
library(data.table)
merge(setDT(dat)[, a[1], keyby = a], setDT(dat)[, na.omit(.SD), keyby = a], all.x = TRUE)
I think the merge statement can be improved
Not really sure if this is what you're looking for, but this might work for you. It at least replicates the small sample output you're looking for:
library(dplyr)
library(tidyr)
dat %>%
filter_at(vars(b:c), any_vars(!is.na(.))) %>%
group_by(a) %>%
fill(b) %>%
fill(c) %>%
filter_at(vars(b:c), all_vars(!is.na(.)))
# A tibble: 2 x 4
# Groups: a [2]
a b c d
<fctr> <fctr> <fctr> <fctr>
1 1 B A 1
2 2 C A NA
You could also use just dplyr:
dat %>%
group_by(a) %>%
summarise_each(funs(first(.[!is.na(.)])))
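Note that summarise_each() is deprecated in current dplyr; an equivalent sketch with across() would be:
dat %>%
  group_by(a) %>%
  summarise(across(everything(), ~ first(.x[!is.na(.x)])))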