I have a dataset with three columns, like this:
Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F
How can I clean this data so that I end up with a dataset like this?
Id Date Gender
1 03-11-1977 F
2 04-17-2005 M
3 06-04-1999 F
Thanks.
Fill the Gender values within each Id, then drop the rows where Date is still NA.
library(dplyr)
df %>%
group_by(Id) %>%
tidyr::fill(Gender, .direction = "updown") %>%
filter(!is.na(Date))
# Id Date Gender
# <int> <chr> <chr>
#1 1 03-11-1977 F
#2 2 04-17-2005 M
#3 3 06-04-1999 F
You can use na.omit inside by(): for each Id, keep the first Id value and drop the NAs from the remaining columns.
dat <- do.call(rbind, by(dat, dat$Id, function(x) cbind(x[1, 1, drop = FALSE], lapply(x[-1], na.omit))))
dat
# Id Date Gender
# 1 1 03-11-1977 F
# 2 2 04-17-2005 M
# 3 3 06-04-1999 F
Data:
dat <- read.table(header=T,text=' Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F')
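As a variation on the two answers above, each Id can also be collapsed directly to its first non-missing value per column. This is only a sketch: it assumes every Id has at most one distinct non-NA Date and Gender, and `first_non_na` is a helper name introduced here for illustration, not part of any package.

```r
library(dplyr)

# Same data as above; colClasses keeps Date and Gender as character
dat <- read.table(header = TRUE,
                  colClasses = c("integer", "character", "character"),
                  text = "Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F")

# Hypothetical helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(v) v[!is.na(v)][1]

# One row per Id, each column reduced to its first non-NA entry
res <- dat %>%
  group_by(Id) %>%
  summarise(across(c(Date, Gender), first_non_na))
res
```

Unlike the fill-and-filter approach, this does not depend on which row of a group happens to hold the Date.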
I have two almost identical dataframes containing the same people (df_A and df_B). I would now like to check for each person how many values in df_A and df_B match (e.g., Person 1 has 3 identical values in df_A and df_B, whereas Person 4 has 2 identical values).
I would like to create new variables that contain the information about the number of matching values.
df_A and df_B could look like this:
df_A <- read.table(text=
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 NA 1 1 NA NA 1
4 1 1 NA NA 1 NA
5 NA NA NA 1 1 1", header=TRUE)
df_B <- read.table(text=
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 1 NA 1 1 NA NA
4 1 1 1 NA NA NA
5 1 1 1 NA NA NA", header=TRUE)
Ideally, the end result would look like this:
df_C <- read.table(text=
"ID Matches
1 3
2 3
3 1
4 2
5 0", header=TRUE)
Do you have any ideas on how to achieve this most efficiently using R?
I'm relatively new to R and would like to learn how to solve such problems without lengthy code. Thanks for your hints!
Here's an idea.
library(dplyr)
library(tidyr)
left_join(df_A, df_B, by = 'ID') %>%
pivot_longer(-ID, names_pattern = '(.*)\\.[xy]') %>% # strip the .x/.y suffixes added by the join
group_by(ID, name) %>%
summarise(matches = !any(is.na(value)) & n_distinct(value) == 1) %>% # both values present and equal
summarise(matches = sum(matches))
#> # A tibble: 5 × 2
#> ID matches
#> <int> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 1
#> 4 4 2
#> 5 5 0
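If both data frames list the same IDs in the same order and have identically ordered columns (an assumption, checked by the `stopifnot()` call below), the comparison can also be sketched in base R without reshaping:

```r
df_A <- read.table(text =
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 NA 1 1 NA NA 1
4 1 1 NA NA 1 NA
5 NA NA NA 1 1 1", header = TRUE)

df_B <- read.table(text =
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 1 NA 1 1 NA NA
4 1 1 1 NA NA NA
5 1 1 1 NA NA NA", header = TRUE)

# The row-by-row comparison below is only valid if the layouts agree
stopifnot(identical(df_A$ID, df_B$ID), identical(names(df_A), names(df_B)))

# Element-wise comparison; any comparison involving NA yields NA,
# and na.rm = TRUE leaves those cells out of the count
same <- df_A[-1] == df_B[-1]
df_C <- data.frame(ID = df_A$ID, Matches = unname(rowSums(same, na.rm = TRUE)))
df_C
```

Because a comparison with NA is never TRUE, two NAs in the same cell are not counted as a match, which is what the desired output asks for.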
I have a dataset with an identifier variable and some numeric variables. For each row, I want to calculate the mean of the numeric columns that hold values for that row's identifier. Here is a simple example:
From this:
id v1 v2 v3 v4
d 1 2 NA NA
e NA NA 3 3
e NA NA 2 4
d 3 5 NA NA
I want to get to this:
id v1 v2 v3 v4 mean
d 1 2 NA NA 1.5
e NA NA 3 3 3
e NA NA 2 4 3
d 3 5 NA NA 4
I would like to use an if else statement like:
ifelse(id == "d", colMeans(v1:v2), colMeans(v3:v4))
Thank you in advance!
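You can compute the mean row by row with rowwise() and c_across(), ignoring the NAs:
library(dplyr)
df %>%
rowwise() %>%
mutate(mean = mean(c_across(v1:v4), na.rm = TRUE))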
# A tibble: 4 x 6
# Rowwise:
id v1 v2 v3 v4 mean
<chr> <int> <int> <int> <int> <dbl>
1 d 1 2 NA NA 1.5
2 e NA NA 3 3 3
3 e NA NA 2 4 3
4 d 3 5 NA NA 4
Or, to average over every numeric column without naming them:
df %>%
rowwise() %>%
mutate(mean = mean(c_across(where(is.numeric)), na.rm = TRUE))
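The same row-wise mean can be sketched in base R with `rowMeans()`. The data frame below is reconstructed from the table in the question, and the sketch assumes that every column except `id` is numeric.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "id v1 v2 v3 v4
d 1 2 NA NA
e NA NA 3 3
e NA NA 2 4
d 3 5 NA NA")

# Mean across all columns except id, skipping NAs within each row
df$mean <- rowMeans(df[-1], na.rm = TRUE)
df
```

No if/else on `id` is needed here, because the columns that do not apply to a row are NA and are dropped by `na.rm = TRUE`.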
I would like to conditionally subset a dataframe in R, using dplyr::select_if(). More specifically, I have a dataframe that is made up of a grouping variable and numerous other variables that contain a bunch of NAs:
data <- tibble(group = sort(rep(letters[1:5],3)),
var_1 = c(1,1,1,1,rep(NA,11)),
var_2 = c(1,1,1,1,1,1,rep(NA,9)),
var_3 = 1,
var_4 = c(1,1,rep(NA,10),1,1,1),
var_5 = c(1,1,1,1,1,1,NA,NA,NA,NA,NA,NA,1,1,1))
# A tibble: 15 x 6
group var_1 var_2 var_3 var_4 var_5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 1 1 1 1
2 a 1 1 1 1 1
3 a 1 1 1 NA 1
4 b 1 1 1 NA 1
5 b NA 1 1 NA 1
6 b NA 1 1 NA 1
7 c NA NA 1 NA NA
8 c NA NA 1 NA NA
9 c NA NA 1 NA NA
10 d NA NA 1 NA NA
11 d NA NA 1 NA NA
12 d NA NA 1 NA NA
13 e NA NA 1 1 1
14 e NA NA 1 1 1
15 e NA NA 1 1 1
In this dataframe, I need to identify and remove columns like var_4 in this case that only occur in one group (but irrespective of whether or not they show up in the last group: "e"). Importantly, everything else has to remain untouched (i.e. I want to keep variables that look like var_1,var_2,var_3, and var_5). This is what I tried:
library(dplyr)
data %>%
filter(group!="e") %>% # Ignore last group.
select_if(~ function(col)) %>% # Write function to look for cols that only have values for one group of the total four groups remaining (a-d).
names() -> cols_to_drop # Save col names.
data %>% select(-cols_to_drop) -> new_data # Subset by saved col names.
Unfortunately, I can't figure out how to write that function inside select_if() to specify that grouping variable condition.
A second thing that I have been wondering about is whether I can use select_if() to remove cols based on the percentage of NAs it contains. Is there a way?
I am not sure select_if can do this kind of grouped column selection.
Here is one way to do it by reshaping the data to long format first:
library(dplyr)
cols <- data %>%
filter(group != "e") %>%
tidyr::pivot_longer(cols = starts_with('var')) %>%
group_by(name, group) %>%
summarise(value = any(!is.na(value))) %>%
summarise(value = sum(value)) %>%
filter(value > 1) %>%
pull(name)
# Select the grouping column plus the columns that have values in more than one group
data %>% select(group, all_of(cols))
# group var_1 var_2 var_3 var_5
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1
# 2 a 1 1 1 1
# 3 a 1 1 1 1
# 4 b 1 1 1 1
# 5 b NA 1 1 1
# 6 b NA 1 1 1
# 7 c NA NA 1 NA
# 8 c NA NA 1 NA
# 9 c NA NA 1 NA
#10 d NA NA 1 NA
#11 d NA NA 1 NA
#12 d NA NA 1 NA
#13 e NA NA 1 1
#14 e NA NA 1 1
#15 e NA NA 1 1
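For the second part of the question (dropping columns by their share of NAs), a minimal sketch with `select()` and `where()`, which supersede `select_if()` in dplyr 1.0.0 and later. The 50% cutoff and the name `max_na` are assumptions chosen for illustration.

```r
library(dplyr)

data <- tibble(group = sort(rep(letters[1:5], 3)),
               var_1 = c(1, 1, 1, 1, rep(NA, 11)),
               var_2 = c(1, 1, 1, 1, 1, 1, rep(NA, 9)),
               var_3 = 1,
               var_4 = c(1, 1, rep(NA, 10), 1, 1, 1),
               var_5 = c(1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1))

max_na <- 0.5  # assumed threshold: keep columns with at most 50% NAs

# mean(is.na(.x)) is the fraction of missing values in a column
new_data <- data %>%
  select(where(~ mean(is.na(.x)) <= max_na))

names(new_data)
```

With this threshold only `group`, `var_3` and `var_5` survive; `var_1`, `var_2` and `var_4` are each more than half NA.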
The sample data are as follows:
x <- read.table(header=T, text="
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")
I want the values in the second and fourth columns (CostType1 and CostType2) to become the names of new columns, each filled with the corresponding cost for that cost type. If there is no match, the cell should be NA. The ideal format would be the following:
a b c d
1 10 NA 1 NA
2 NA 2 20 NA
3 1 50 NA NA
4 40 NA 1 NA
5 NA 30 2 NA
6 60 NA 3 NA
7 NA NA 10 1
8 20 NA NA 2
A solution using tidyverse. We first work out how many CostType/Cost groups there are (two in this example). We then spread each group to wide format, bind the results together, and summarise each ID by taking the first non-NA value in every column.
library(tidyverse)
# Get the group numbers
g <- (ncol(x) - 1)/2
x2 <- map_dfr(1:g, function(i){
# Transform the data frame one group at a time
x <- x %>%
select(ID, ends_with(as.character(i))) %>%
spread(paste0("CostType", i), paste0("Cost", i))
return(x)
}) %>%
group_by(ID) %>%
# Select the first non-NA value if there are multiple values
summarise_all(~ first(.[!is.na(.)]))
x2
# # A tibble: 8 x 5
# ID a b c d
# <int> <int> <int> <int> <int>
# 1 1 10 NA 1 NA
# 2 2 NA 2 20 NA
# 3 3 1 50 NA NA
# 4 4 40 NA 1 NA
# 5 5 NA 30 2 NA
# 6 6 60 NA 3 NA
# 7 7 NA NA 10 1
# 8 8 20 NA NA 2
A base solution using reshape (the resulting cost columns are named Cost.a, Cost.b, and so on):
x1 <- setNames(x[,c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[,c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
reshape(data=rbind(x1, x2), idvar="ID", timevar="CostType", v.names="Cost", direction="wide")
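Another base R option, sketched here under the assumption that each ID has at most one cost per cost type, is to fill an empty matrix by (row, column) index pairs. The names `types`, `rows` and `wide` are introduced only for this illustration.

```r
x <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")

# All cost types that occur in either type column, in alphabetical order
types <- sort(unique(c(x$CostType1, x$CostType2)))

# Start from an all-NA matrix: one row per ID, one column per cost type
wide <- matrix(NA_integer_, nrow = nrow(x), ncol = length(types),
               dimnames = list(as.character(x$ID), types))

# Indexing with a two-column matrix assigns one cell per (row, column) pair
rows <- seq_len(nrow(x))
wide[cbind(rows, match(x$CostType1, types))] <- x$Cost1
wide[cbind(rows, match(x$CostType2, types))] <- x$Cost2
wide
```

If an ID repeated a cost type across the two pairs, the second assignment would silently overwrite the first, so this sketch is only appropriate when that cannot happen.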
I was looking at the example code in this question:
r element frequency and column name
and was wondering whether there is a way to show the index of each element within each column, in addition to its rank and frequency, in R. For example, the input would be
df <- read.table(header=T, text='A B C D
a a b c
b c x e
c d y a
d NA NA z
e NA NA NA
f NA NA NA',stringsAsFactors=F)
and the desired output would be
element frequency columns ranking A B C D
1 a 3 A,B,D 1 1 1 na 3
3 c 3 A,B,D 1 3 2 na 1
2 b 2 A,C 2 2 na 1 na
4 d 2 A,B 2 4 3 na na
5 e 2 A,D 2 5 na na 2
6 f 1 A 3 6 na na na
8 x 1 C 3 na na 2 na
9 y 1 C 3 na na 3 na
10 z 1 D 3 na na na 4
Thank you.
Perhaps there is a way to do this in one step, but it's not coming to mind at the moment. So, continuing with my previous answer:
library(dplyr)
library(tidyr)
step1 <- df %>%
gather(var, val, everything()) %>% ## Make a long dataset
na.omit %>% ## We don't need the NA values
group_by(val) %>% ## All calculations grouped by val
summarise(column = toString(var), ## This collapses
freq = n()) %>% ## This counts
mutate(ranking = dense_rank(desc(freq))) ## This ranks
step2 <- df %>%
mutate(ind = 1:nrow(df)) %>% ## Add an indicator column
gather(var, val, -ind) %>% ## Go long
na.omit %>% ## Remove NA
spread(var, ind) ## Go wide
inner_join(step1, step2)
# Joining by: "val"
# Source: local data frame [9 x 8]
#
# val column freq ranking A B C D
# 1 a A, B, D 3 1 1 1 NA 3
# 2 b A, C 2 2 2 NA 1 NA
# 3 c A, B, D 3 1 3 2 NA 1
# 4 d A, B 2 2 4 3 NA NA
# 5 e A, D 2 2 5 NA NA 2
# 6 f A 1 3 6 NA NA NA
# 7 x C 1 3 NA NA 2 NA
# 8 y C 1 3 NA NA 3 NA
# 9 z D 1 3 NA NA NA 4
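The index columns on their own can also be sketched in base R with `match()`. Note the assumption: `match()` returns only the first position, so this is equivalent to the output above only when an element appears at most once per column, as in the sample data. The names `elements` and `idx` are introduced here for illustration.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "A B C D
a a b c
b c x e
c d y a
d NA NA z
e NA NA NA
f NA NA NA")

# Every distinct non-NA element across all columns
elements <- sort(unique(na.omit(unlist(df))))

# For each column, the row position of each element (NA if it is absent)
idx <- sapply(df, function(col) match(elements, col))
rownames(idx) <- elements
idx
```

The resulting matrix has one row per element and one column per original column, and could be bound to the frequency and ranking table from `step1` with `cbind()` once the rows are put in the same order.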