Delete rows that have incomplete values in another column in R

I want to delete the rows whose id does not appear in EVERY month in another column.
The dataset is as follows:
id month
1 01
2 01
3 01
1 02
2 02
1 03
2 03
I want to delete id = 3 from the dataset, since it doesn't appear in month = 02
I'm using R.
Thank you for helping.

You can split the dataset and use Reduce, i.e.
remove <- Reduce(setdiff, split(df$id, df$month))
df[!df$id %in% remove,]
id month
1 1 1
2 2 1
4 1 2
5 2 2
6 1 3
7 2 3
As @jay.sf mentioned, you need to assign it back to your data frame:
df <- df[!df$id %in% remove,]
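To see why this works, here is a quick sketch of the intermediate steps, using the example df from the question. Note that Reduce(setdiff, ...) folds left to right, so it only catches ids that are present in the first month but missing from a later one, which is enough for this data:
# split(df$id, df$month) gives one vector of ids per month:
# month 1: 1 2 3, month 2: 1 2, month 3: 1 2
# Reduce then folds setdiff over that list:
# setdiff(setdiff(c(1, 2, 3), c(1, 2)), c(1, 2)) -> 3
remove <- Reduce(setdiff, split(df$id, df$month))
remove
# [1] 3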

Using dplyr
library(dplyr)
df %>%
group_by(id) %>%
filter(n_distinct(month) == n_distinct(df$month)) %>%
ungroup
Output:
# A tibble: 6 × 2
id month
<int> <int>
1 1 1
2 2 1
3 1 2
4 2 2
5 1 3
6 2 3
Or using data.table
library(data.table)
data_hh[, if(uniqueN(month) == uniqueN(data_hh$month)) .SD, .(id)]
data
data_hh <- structure(list(id = c(18354L, 18815L, 19014L, 63960L, 72996L,
73930L), month = c(1, 1, 1, 1, 1, 1), value = c(113.33, 251.19,
160.15, 278.8, 254.39, 733.22), x1 = c(96.75, 186.78, 106.02,
195.23, 184.57, 473.92), x2 = c(1799.1, 5399.1, 1799.1, 1349.1,
2924.1, 2024.1), x3 = c(85.37, 74.36, 66.2, 70.02, 72.55, 64.63
), x4 = c(6.29, 4.65, 8.9, 20.66, 8.69, 36.22)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
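For completeness, a base R alternative (not from the original answers, using the example df from the question) that keeps only ids appearing in every month, via a contingency table:
# count appearances of each id in each month
tab <- table(df$id, df$month)
# an id is kept only if it has a non-zero count in every month column
keep <- rownames(tab)[rowSums(tab > 0) == ncol(tab)]
df[df$id %in% keep, ]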


rowMeans() grouping by variable [duplicate]

This is probably trivial, but my data looks like this:
t <- structure(list(var = 1:5, ID = c(1, 2, 1, 1, 3)), class = "data.frame", row.names = c(NA,
-5L))
> t
var ID
1 1 1
2 2 2
3 3 1
4 4 1
5 5 3
I would like to get a mean value for each ID, so my idea was to transform them into this (variable names are not important):
f <- structure(list(ID = 1:3, var.1 = c(1, 2, 5), var.2 = c(2, NA,
NA), var.3 = c(3, NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
> f
ID var.1 var.2 var.3
1 1 1 2 3
2 2 2 NA NA
3 3 5 NA NA
so that I could then calculate the mean for each var.x.
I know it's possible with tidyr (possibly pivot_wider?), but I can't figure out how to group it. How do I get a mean value for each ID?
Thank you in advance
You could use ave to get the mean of var for each ID:
t$mean = ave(t$var, t$ID, FUN = mean)
Result:
var ID mean
1 1 1 2.666667
2 2 2 2.000000
3 3 1 2.666667
4 4 1 2.666667
5 5 3 5.000000
If you want a simple table with the means, you could use aggregate:
aggregate(formula = var~ID, data = t, FUN = mean)
ID var
1 1 2.666667
2 2 2.000000
3 3 5.000000
If you want to use rowMeans on your t data frame, we can first use pivot_wider and then take the mean of each row.
library(tidyverse)
t %>%
group_by(ID) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = row, values_from = var, names_prefix = "var.") %>%
mutate(mean = rowMeans(select(., starts_with("var")), na.rm = TRUE))
# ID var.1 var.2 var.3 mean
# <dbl> <int> <int> <int> <dbl>
# 1 1 1 3 4 2.67
# 2 2 2 NA NA 2
# 3 3 5 NA NA 5
Or, since t is in long form, we can just group by ID and take the mean of all values in each group.
t %>%
group_by(ID) %>%
summarise(mean = mean(var))
# ID mean
# <dbl> <dbl>
#1 1 2.67
#2 2 2
#3 3 5
Or, for f, we can use rowMeans on each row, across every column that starts with var.
f %>%
mutate(mean = rowMeans(select(., starts_with("var")), na.rm = TRUE))
# ID var.1 var.2 var.3 mean
#1 1 1 2 3 2
#2 2 2 NA NA 2
#3 3 5 NA NA 5
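For reference, a data.table sketch of the same group mean (not among the original answers):
library(data.table)
# one row per ID with the mean of var
as.data.table(t)[, .(mean = mean(var)), by = ID]
#    ID     mean
# 1:  1 2.666667
# 2:  2 2.000000
# 3:  3 5.000000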

Fill missing values with previous values by row using dplyr

I am working with a data frame in R which has some missing values across rows. The data frame is as follows (dput added at the end):
df
id V1 V2 V3 V4
1 01 1 1 1 NA
2 02 2 1 NA NA
3 03 3 1 NA NA
4 04 4 1 2 NA
Each row is a different id. As you can see, the rows have missing values. How can I get a data frame completed in this style, without reshaping to long format or pivoting, since my real data is very large:
df
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
I was trying to use fill from tidyr, but at row level I am having issues. I have seen some posts where it is used along with the dplyr function across, but I cannot get it to work. I have also tried group_by(id) and rowwise without success. Note that only the variables/columns starting with V should be filled with previous values.
Data is next:
#Data
df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3,
4), V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2), V4 = c(NA, NA,
NA, NA)), class = "data.frame", row.names = c(NA, -4L))
Many thanks for your time.
One solution could be the na.locf function from package zoo, combined with purrr::pmap in a row-wise operation. na.locf takes the most recent non-NA value and replaces all subsequent NA values with it. As a reminder, c(...) in both solutions captures all values of V1:V4 in each row on every iteration; I excluded the id column in both, as it is not involved in the calculations.
library(zoo)
library(purrr)
df %>%
mutate(pmap_df(., ~ na.locf(c(...)[-1])))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or we can use the coalesce function from dplyr. We replace every NA value in each row with the last non-NA value, as we did earlier with na.locf. However, this solution is a bit verbose:
df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])}))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or you could also use this:
library(purrr)
df %>%
mutate(across(!id, ~ replace(., is.na(.), invoke(coalesce, rev(df[-1])))))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
The warning message can be ignored. It is produced because coalesce over the reversed columns returns one value per row (4 elements here), while a column such as V3 has only 2 NA slots to fill, so replace warns that the replacement length does not match the number of items being replaced.
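To see what is being recycled, this is what the inner expression evaluates to on the example df (a quick check, not part of the original answer):
# coalesce over the reversed columns picks, per row, the last
# non-NA value when scanning from V4 back to V1
invoke(coalesce, rev(df[-1]))
# [1] 1 1 1 2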
One option using dplyr could be:
df %>%
mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .)))
id V1 V2 V3 V4
1 1 1 1 1 1
2 2 2 1 1 1
3 3 3 1 1 1
4 4 4 1 2 2
A dplyr approach
df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3,
4), V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2), V4 = c(NA, NA,
NA, NA)), class = "data.frame", row.names = c(NA, -4L))
library(dplyr, warn.conflicts = F)
df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1))))
#> id V1 V2 V3 V4
#> 1 01 1 1 1 2
#> 2 02 2 1 2 2
#> 3 03 3 1 2 2
#> 4 04 4 1 2 2
If you group_by the id column, you won't have to use [-1] on cur_data():
df %>% group_by(id) %>%
mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[!is.na(cur_data())],1))))
A data.table option with nafill
> setDT(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id]
id V1 V2 V3 V4
1: 01 1 1 1 1
2: 02 2 1 1 1
3: 03 3 1 1 1
4: 04 4 1 2 2
If the reason you want to avoid reshaping is to save runtime, that idea is actually mistaken, assuming the benchmark below continues to hold at scale. Note that f, which transposes, runs na.locf and then transposes back, is the fastest.
library(microbenchmark)
library(data.table)
library(dplyr)
library(purrr)
library(zoo)
microbenchmark(times = 10,
a = df %>% mutate(pmap_df(., ~ na.locf(c(...)[-1]))),
b = df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])})),
c = df %>%
mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .))),
d = df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1)))),
e = as.data.table(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id],
f = data.frame(id = df$id, t(na.locf(t(df[-1])))))
giving:
Unit: milliseconds
expr min lq mean median uq max neval
a 11.343302 12.934702 15.032001 13.115151 14.799400 30.135901 10
b 11.641301 13.116401 14.030551 14.426751 15.012701 15.517501 10
c 28.201501 30.470801 33.375761 32.672950 36.671101 40.448701 10
d 25.394901 26.648801 30.044331 27.971251 32.433801 39.570600 10
e 3.750801 4.023700 8.771401 4.150701 4.367502 50.636700 10
f 2.454701 2.458201 3.009181 2.603951 2.952302 6.126101 10
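As a standalone snippet (with zoo loaded as above), the winning approach f is:
# t() turns each row into a column, so the column-wise na.locf
# fills left to right within each original row; then transpose back
filled <- data.frame(id = df$id, t(na.locf(t(df[-1]))))
filled
#   id V1 V2 V3 V4
# 1 01  1  1  1  1
# 2 02  2  1  1  1
# 3 03  3  1  1  1
# 4 04  4  1  2  2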

Increase the value in the next row by 1 if two values are equal

I got data like this
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
If, for the same id, the value in a row is equal to the value in the previous row, I want to increase the value of the duplicate by 1, to get this:
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be reassigned back into time.)
This deals with 2 or more duplicates; for instance, if we append a copy of the 6th row:
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
You could use accumulate
library(tidyverse)
df %>%
group_by(id) %>%
mutate(time2 = accumulate(time, ~if(.x>=.y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if the group is repeated more than twice.
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
It finds the rows where both id and time duplicate an earlier value, and adds 1 to time in those rows.
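Note, however, that a quick check (not from the original answer, starting again from the original df) suggests this only handles a single duplicate per value; with a third repeat of time = 6, one duplicate survives:
df2 <- rbind(df, df[6, ])
idx <- duplicated(df2$id) & duplicated(df2$time)
df2$time[idx] <- df2$time[idx] + 1
df2$time
# [1] 1 2 3 5 6 7 7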
You can use dplyr's mutate with cumsum and duplicated:
data %>%
group_by(id) %>%
mutate(time = time + cumsum(duplicated(time))) %>%
ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
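A data.table sketch of the same cumsum(duplicated()) idea (not among the original answers, starting from the original df):
library(data.table)
# within each id, every repeated time value seen so far adds 1 to the offset
setDT(df)[, time2 := time + cumsum(duplicated(time)), by = id][]
#    id time time2
# 1:  1    1     1
# 2:  1    2     2
# 3:  1    2     3
# 4:  2    5     5
# 5:  2    6     6
# 6:  2    6     7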

How to separate multiple numbers within a column in R

In R, I want to separate numbers that are in the same column. My data appear like this:
id time
1 1,2
2 3,4
3 4,5,6
I want it to appear like this:
1 1
1 2
2 3
2 4
3 4
3 5
3 6
Though not shown, there are different iterations of time that vary depending on the id. For example:
4 1,6,7
5 1,3,6
6 1,4,5
7 1,3,5
8 2,3,4
There are 100 ids, and the time column has different numbers that vary in order, as shown above.
Does anyone have advice to do this?
An option is separate_rows; since time in the data below is stored as a bare number without separators, the lookaround pattern splits between every pair of adjacent characters:
library(dplyr)
library(tidyr)
df %>%
separate_rows(time, sep = "(?<=.)(?=.)", convert = TRUE)
# A tibble: 4 x 2
# id time
# <dbl> <int>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
data
df <- structure(list(id = c(1, 2), time = c(12, 34)), class = "data.frame",
row.names = c(NA,
-2L))
Using the tidyverse, you could try the following. Make sure time is of character type, and use strsplit to split it at the commas.
library(tidyverse)
df %>%
mutate(time = strsplit(as.character(time), ",")) %>%
unnest(cols = time)
Or you can just use separate_rows and indicate comma as separator:
df %>%
separate_rows(time, sep = ',')
Or in base R you could try this:
s <- strsplit(df$time, ',', fixed = TRUE)
data.frame(id = rep(df$id, lengths(s)), time = unlist(s))
Output
# A tibble: 10 x 2
id time
<int> <chr>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 4
6 3 5
7 3 6
8 4 1
9 4 6
10 4 7
Data
df <- structure(list(id = 1:4, time = c("1,2", "3,4", "4,5,6", "1,6,7"
)), class = "data.frame", row.names = c(NA, -4L))
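A data.table sketch of the same split (not among the original answers), using the df from the Data block above:
library(data.table)
# strsplit each comma-separated string; one output row per piece
setDT(df)[, .(time = unlist(strsplit(time, ","))), by = id]
#     id time
#  1:  1    1
#  2:  1    2
#  3:  2    3
#  4:  2    4
# ... (10 rows in total)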

Monthly time trend from dataframe of dates

I have a dataset that looks like this:
group id date1 date2 date3 date4
1 1 1 1991-10-14 1992-05-20 1992-12-09 1993-06-30
2 1 2 <NA> 1992-05-21 1992-12-10 1993-06-29
3 1 3 <NA> <NA> 1992-12-08 1993-06-29
4 1 4 1991-10-14 1992-05-19 <NA> <NA>
5 1 5 1991-10-15 1992-05-21 <NA> 1993-06-30
6 1 6 1991-10-15 <NA> <NA> 1993-06-30
Here is the data in R format:
structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L),
id = 1:6,
date1 = structure(c(7956, NA, NA, 7956, 7957, 7957), class = "Date"),
date2 = structure(c(8175, 8176, NA, 8174, 8176, NA), class = "Date"),
date3 = structure(c(8378, 8379, 8377, NA, NA, NA), class = "Date"),
date4 = structure(c(8581, 8580, 8580, NA, 8581, 8581), class = "Date")),
.Names = c("group", "id", "date1", "date2", "date3", "date4"),
row.names = c(NA, 6L), class = "data.frame")
That is, we have a grouping variable, several individuals and four possible dates of interest.
Now I want to construct a linear monthly time trend for each individual from this. In other words, I try to construct a trend with value 1 at the first non-NA date. After that, the trend for the remaining non-NA periods is the number of months passed since the first non-NA date.
My goal is this structure (individual 1, group 1):
group id period trend
1 1 1 1 1
2 1 1 2 8
3 1 1 3 15
4 1 1 4 21
That is, a molten data.frame with the months passed since t = 1.
I've played around with the ideas from this thread: Number of months between two dates. However, I can't find a solution that does not involve a for-loop and an excruciating number of if-statements.
Any help appreciated!
Here is one potential solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
gather(period, date, -group, -id) %>%
arrange(group, id, period) %>%
mutate(date = as.Date(date)) %>%
group_by(group, id) %>%
filter(!all(is.na(date))) %>%
mutate(
trend = as.integer(
floor(difftime(date, date[which.max(!is.na(date))], units = 'days') / 30)
) + 1,
period = str_replace(period, 'date', '')
) %>%
select(-date)
Output is as follows:
# A tibble: 24 x 4
# Groups: group, id [6]
group id period trend
<int> <int> <chr> <dbl>
1 1 1 1 1
2 1 1 2 8
3 1 1 3 15
4 1 1 4 21
5 1 2 1 NA
6 1 2 2 1
7 1 2 3 7
8 1 2 4 14
9 1 3 1 NA
10 1 3 2 NA
# ... with 14 more rows
NOTE: Edited to add a filter to drop cases where ALL dates are NA for a given group/id. Otherwise, which.max will fail.
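For reference, which.max(!is.na(date)) returns the index of the first non-NA date, because TRUE counts as 1 and which.max picks the first maximum (a quick illustration, not from the original answer):
x <- as.Date(c(NA, "1992-05-21", "1992-12-10"))
which.max(!is.na(x))
# [1] 2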
data.table approach
I leave the rounding and/or adding +1 to you... this is always tricky with months. I personally try to avoid it and calculate with days or weeks (or just about anything BUT months)...
library( data.table)
dt <- melt ( as.data.table( df ), id.vars = c("group", "id"), variable.name = "date_id", value.name = "date" )
setkey(dt, id, group, date_id)
dt[, diff := lubridate::interval( date[which.min( date ) ], date ) / months(1) , by = c("group", "id")]
head(dt)
# group id date_id date diff
# 1: 1 1 date1 1991-10-14 0.000000
# 2: 1 1 date2 1992-05-20 7.193548
# 3: 1 1 date3 1992-12-09 13.833333
# 4: 1 1 date4 1993-06-30 20.533333
# 5: 1 2 date1 <NA> NA
# 6: 1 2 date2 1992-05-21 0.000000
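If you do want an integer trend starting at 1, one possible rounding is sketched below. Per the caveat above, calendar months will not exactly reproduce the days/30 numbers from the dplyr answer: period 3 of id 1 becomes 14 rather than 15.
# id 1: diff 0, 7.19, 13.83, 20.53 -> trend 1, 8, 14, 21
dt[, trend := floor(diff) + 1]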
