In R, how can I calculate a cumulative sum over a defined time period prior to the row being calculated? I would prefer dplyr if possible.
For example, if the period was 10 days, then the function would achieve cum_rolling10:
date value cumsum cum_rolling10
1/01/2000 9 9 9
2/01/2000 1 10 10
5/01/2000 9 19 19
6/01/2000 3 22 22
7/01/2000 4 26 26
8/01/2000 3 29 29
13/01/2000 10 39 29
14/01/2000 9 48 38
18/01/2000 2 50 21
19/01/2000 9 59 30
21/01/2000 8 67 38
25/01/2000 5 72 24
26/01/2000 1 73 25
30/01/2000 6 79 20
31/01/2000 6 85 18
A solution using dplyr, tidyr, lubridate, and zoo.
library(dplyr)
library(tidyr)
library(lubridate)
library(zoo)
dt2 <- dt %>%
mutate(date = dmy(date)) %>%
mutate(cumsum = cumsum(value)) %>%
complete(date = full_seq(date, period = 1), fill = list(value = 0)) %>%
mutate(cum_rolling10 = rollapplyr(value, width = 10, FUN = sum, partial = TRUE)) %>%
drop_na(cumsum)
dt2
# A tibble: 15 x 4
date value cumsum cum_rolling10
<date> <dbl> <int> <dbl>
1 2000-01-01 9 9 9
2 2000-01-02 1 10 10
3 2000-01-05 9 19 19
4 2000-01-06 3 22 22
5 2000-01-07 4 26 26
6 2000-01-08 3 29 29
7 2000-01-13 10 39 29
8 2000-01-14 9 48 38
9 2000-01-18 2 50 21
10 2000-01-19 9 59 30
11 2000-01-21 8 67 38
12 2000-01-25 5 72 24
13 2000-01-26 1 73 25
14 2000-01-30 6 79 20
15 2000-01-31 6 85 18
DATA
dt <- structure(list(date = c("1/01/2000", "2/01/2000", "5/01/2000",
"6/01/2000", "7/01/2000", "8/01/2000", "13/01/2000", "14/01/2000",
"18/01/2000", "19/01/2000", "21/01/2000", "25/01/2000", "26/01/2000",
"30/01/2000", "31/01/2000"), value = c(9L, 1L, 9L, 3L, 4L, 3L,
10L, 9L, 2L, 9L, 8L, 5L, 1L, 6L, 6L)), .Names = c("date", "value"
), row.names = c(NA, -15L), class = "data.frame")
I recommend the runner package, which is designed to calculate functions on rolling/running windows. You can achieve this with sum_run(), as a one-liner:
library(runner)
library(dplyr)
# assign the result back, otherwise df is printed unchanged
df <- df %>%
  mutate(
    cum_rolling_10 = sum_run(
      x = value,
      k = 10,
      idx = as.Date(date, format = "%d/%m/%Y"))
  )
df
# date value cum_rolling_10
# 1 1/01/2000 9 9
# 2 2/01/2000 1 10
# 3 5/01/2000 9 19
# 4 6/01/2000 3 22
# 5 7/01/2000 4 26
# 6 8/01/2000 3 29
# 7 13/01/2000 10 29
# 8 14/01/2000 9 38
# 9 18/01/2000 2 21
# 10 19/01/2000 9 30
# 11 21/01/2000 8 38
# 12 25/01/2000 5 24
# 13 26/01/2000 1 25
# 14 30/01/2000 6 20
# 15 31/01/2000 6 18
Enjoy!
This solution avoids memory overhead, and migrating it to sparklyr is easy.
lag = 7
dt %>%
mutate(date = dmy(date)) %>%
mutate(order = datediff(date, min(date))) %>%
arrange(desc(order)) %>%
mutate(n_order = lag(order + lag,1L,default = 0)) %>%
mutate(b_order = ifelse(order - n_order >= 0,order,-1)) %>%
mutate(m_order = cummax(b_order)) %>%
group_by(m_order) %>%
mutate(rolling_value = cumsum(value))
Use slide_index_sum() from slider, which is designed to have the same API as purrr.
library(slider)
library(dplyr)
df <- tibble(
date = c(
"1/01/2000", "2/01/2000", "5/01/2000", "6/01/2000", "7/01/2000",
"8/01/2000", "13/01/2000", "14/01/2000", "18/01/2000", "19/01/2000",
"21/01/2000", "25/01/2000", "26/01/2000", "30/01/2000", "31/01/2000"
),
value = c(9L, 1L, 9L, 3L, 4L, 3L, 10L, 9L, 2L, 9L, 8L, 5L, 1L, 6L, 6L)
)
df <- mutate(df, date = as.Date(date, format = "%d/%m/%Y"))
df %>%
mutate(
cumsum = cumsum(value),
cum_rolling10 = slide_index_sum(value, date, before = 9L)
)
#> # A tibble: 15 × 4
#> date value cumsum cum_rolling10
#> <date> <int> <int> <dbl>
#> 1 2000-01-01 9 9 9
#> 2 2000-01-02 1 10 10
#> 3 2000-01-05 9 19 19
#> 4 2000-01-06 3 22 22
#> 5 2000-01-07 4 26 26
#> 6 2000-01-08 3 29 29
#> 7 2000-01-13 10 39 29
#> 8 2000-01-14 9 48 38
#> 9 2000-01-18 2 50 21
#> 10 2000-01-19 9 59 30
#> 11 2000-01-21 8 67 38
#> 12 2000-01-25 5 72 24
#> 13 2000-01-26 1 73 25
#> 14 2000-01-30 6 79 20
#> 15 2000-01-31 6 85 18
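If you want to sanity-check the window definition without any extra packages, here is a minimal base R sketch. It assumes that "10 days" means the current date plus the 9 days before it, which is what before = 9L expresses above. The names dates, window_days and cum_rolling10 are only illustrative, and the loop is quadratic, so treat it as a check rather than something for large data.

```r
dates <- as.Date(
  c("1/01/2000", "2/01/2000", "5/01/2000", "6/01/2000", "7/01/2000",
    "8/01/2000", "13/01/2000", "14/01/2000", "18/01/2000", "19/01/2000",
    "21/01/2000", "25/01/2000", "26/01/2000", "30/01/2000", "31/01/2000"),
  format = "%d/%m/%Y"
)
value <- c(9, 1, 9, 3, 4, 3, 10, 9, 2, 9, 8, 5, 1, 6, 6)

window_days <- 10  # assumed: current day plus the 9 days before it

# For each row, sum the values whose date falls inside the trailing window
cum_rolling10 <- vapply(seq_along(dates), function(i) {
  in_window <- dates > dates[i] - window_days & dates <= dates[i]
  sum(value[in_window])
}, numeric(1))

cum_rolling10
```

The result should line up with the cum_rolling10 column in the question.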
I want to fill in missing values for a data.frame based on a period of time within groups of ID.
For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.
Sample version of my data:
ID registration_dat value1 value2
1 2020-03-04 NA NA
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 NA NA
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 NA NA
8 2019-10-31 NA 43
8 2019-08-12 32 46
Desired output:
ID registration_dat value1 value2
1 2020-03-04 33 25
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 32 83
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 32 43
8 2019-10-31 NA 43
8 2019-08-12 32 46
I am later filtering the data so that I get one row per unique ID, based on the latest registration date, and I want this row to have as little missing data as possible. Hence I want to do this for all columns in the data frame. However, I do not want NA values to be filled in by values from previous dates if they are more than 1 year apart from the latest registration date. My data frame has 14 columns and 3 million+ rows, so I need this to work on a much bigger data.frame than the one shown as an example.
I'd appreciate any ideas!
You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.
library(dplyr)
library(lubridate)
df %>%
group_by(ID) %>%
arrange(desc(registration_dat), .by_group = TRUE) %>%
mutate(across(starts_with("value"),
~ if_else(row_number() == 1 & is.na(.x) & registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
.x[which.max(!is.na(.x))], .x))) %>%
ungroup()
# # A tibble: 12 x 4
# ID registration_dat value1 value2
# <int> <date> <int> <int>
# 1 1 2020-03-04 33 25
# 2 1 2019-05-06 33 25
# 3 1 2019-01-02 32 21
# 4 3 2021-10-31 NA NA
# 5 3 2018-10-12 33 NA
# 6 3 2018-10-10 25 35
# 7 4 2020-01-02 32 83
# 8 4 2019-10-31 32 83
# 9 4 2019-09-20 33 56
# 10 8 2019-12-12 32 43
# 11 8 2019-10-31 NA 43
# 12 8 2019-08-12 32 46
Data
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))
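To make the leap-year point concrete, here is a small base R sketch comparing a fixed 365-day offset with a calendar-year offset. It uses seq() with by = "-1 year" as a stand-in for date - years(1) from lubridate, which is an assumption on my part for illustration; the variable names are made up.

```r
latest <- as.Date("2020-03-04")

# Fixed 365-day offset: 2020-02-29 lies in the span, so this lands a day late
cutoff_days <- latest - 365

# Calendar-year offset in base R, standing in for latest - lubridate::years(1)
cutoff_year <- seq(latest, by = "-1 year", length.out = 2)[2]

format(cutoff_days)  # "2019-03-05"
format(cutoff_year)  # "2019-03-04"

# A registration exactly one calendar year earlier
earlier <- as.Date("2019-03-04")
earlier >= cutoff_days  # FALSE: the 365-day rule drops it
earlier >= cutoff_year  # TRUE: the calendar-year rule keeps it
```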
You could make a small function (f, below) to handle each value column.
Make a grouped ID, and generate a rowid (this is only to retain your original order)
dat <- dat %>%
mutate(rowid = row_number()) %>%
arrange(registration_dat) %>%
group_by(ID)
Make a function that takes a df and a val column, and returns an updated df with val fixed. It re-sorts by date first, because bind_rows() changes the row order between calls.
f <- function(df, val) {
  # restore date order so lag() and n() refer to the right rows
  df <- df %>% arrange(registration_dat)
  bind_rows(
    df %>% filter(is.na({{val}}) & row_number() != n()),
    df %>% filter(!is.na({{val}}) | row_number() == n()) %>%
      mutate({{val}} := if_else(is.na({{val}}) & registration_dat - lag(registration_dat) < 365, lag({{val}}), {{val}}))
  )
}
Apply the function to the columns of interest
dat = f(dat,value1)
dat = f(dat,value2)
If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)
Output:
ID registration_dat value1 value2
<int> <date> <int> <int>
1 1 2020-03-04 33 25
2 1 2019-05-06 33 25
3 1 2019-01-02 32 21
4 3 2021-10-31 NA NA
5 3 2018-10-12 33 NA
6 3 2018-10-10 25 35
7 4 2020-01-02 32 83
8 4 2019-10-31 32 83
9 4 2019-09-20 33 56
10 8 2019-12-12 32 43
11 8 2019-10-31 NA 43
12 8 2019-08-12 32 46
Update:
The OP wants the final row (i.e the last registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:
library(data.table)
f <- function(df) {
df = df[df[1,registration_dat]-registration_dat<=365]
df[1,value:=df[2:.N][!is.na(value)][1,value]][1]
}
dcast(
melt(setDT(dat), id=c("ID", "registration_dat"))[order(-registration_dat),f(.SD), by=.(ID,variable)],
ID+registration_dat~variable, value.var="value"
)
Output:
ID registration_dat value1 value2
<int> <Date> <int> <int>
1: 1 2020-03-04 33 25
2: 3 2021-10-31 NA NA
3: 4 2020-01-02 32 83
4: 8 2019-12-12 32 43
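For anyone who wants to see the fill rule in isolation, here is a base R sketch for a single ID and a single value column. It is not the data.table code above, only my reading of the rule: if the latest row is NA, take the most recent non-NA value that is at most max_days older. The function name fill_latest and its arguments are hypothetical.

```r
fill_latest <- function(dates, x, max_days = 365) {
  ord <- order(dates, decreasing = TRUE)
  latest <- ord[1]
  # nothing to do when the latest row already has a value
  if (!is.na(x[latest])) return(x)
  older <- ord[-1]
  age <- as.numeric(dates[latest] - dates[older])
  candidates <- older[!is.na(x[older]) & age <= max_days]
  # candidates are ordered newest first, so take the first one
  if (length(candidates) > 0) x[latest] <- x[candidates[1]]
  x
}

# ID 8, value1: filled from 2019-08-12, which is within a year
id8 <- fill_latest(as.Date(c("2019-12-12", "2019-10-31", "2019-08-12")),
                   c(NA, NA, 32))

# ID 3, value1: the only values are about three years old, so it stays NA
id3 <- fill_latest(as.Date(c("2021-10-31", "2018-10-12", "2018-10-10")),
                   c(NA, 33, 25))
id8
id3
```

You would still need to apply it per ID and per column (for example with by = in data.table), so this only illustrates the logic, not the performance side.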
I have a dataset like this
Company Year Value
A 22 15
A 23 17
A 24 13
A 25 20
B 22 187
B 23 153
B 24 135
C...etc.
I need to make all values, for each company, equal to the value of 2022. Like this:
Company Year Value
A 22 15
A 23 15
A 24 15
A 25 15
B 22 187
B 23 187
B 24 187
C...etc.
And then multiply each value by a given rate (e.g. 2% for each value) that compounds. Like this:
Company Year Value
A 22 15
A 23 15x1,02
A 24 15x1,02^2
A 25 15x1,02^3
B 22 187
B 23 187x1,02
B 24 187x1,02^2
C...etc.
Can someone help me please?
df <- data.frame(
stringsAsFactors = FALSE,
Company = c("A", "A", "A", "A", "B", "B", "B"),
Year = c(22L, 23L, 24L, 25L, 22L, 23L, 24L),
Value = c(15L, 17L, 13L, 20L, 187L, 153L, 135L)
)
library(tidyverse)
df %>%
group_by(Company) %>%
mutate(Value = first(Value) * 1.02 ^ (row_number() - 1)) %>%
ungroup()
#> # A tibble: 7 x 3
#> Company Year Value
#> <chr> <int> <dbl>
#> 1 A 22 15
#> 2 A 23 15.3
#> 3 A 24 15.6
#> 4 A 25 15.9
#> 5 B 22 187
#> 6 B 23 191.
#> 7 B 24 195.
Created on 2022-03-18 by the reprex package (v2.0.1)
You can do:
library(tidyverse)
df %>%
group_by(Company) %>%
mutate(Value = Value[Year == 22] * 1.02^(0:(n() - 1))) # Year is an integer column; assumes rows are sorted by Year
Company Year Value
1 A 22 15.00000
2 A 23 15.30000
3 A 24 15.60600
4 A 25 15.91812
5 B 22 187.00000
6 B 23 190.74000
7 B 24 194.55480
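Both answers above rely on the rows being in year order with no gaps. As a hedged alternative, here is a base R sketch that takes the exponent from the Year column itself, so a missing or shuffled year would still get the right power. The names rate, base_year, base_value and compounded are my own and not from the question.

```r
df <- data.frame(
  stringsAsFactors = FALSE,
  Company = c("A", "A", "A", "A", "B", "B", "B"),
  Year = c(22L, 23L, 24L, 25L, 22L, 23L, 24L),
  Value = c(15L, 17L, 13L, 20L, 187L, 153L, 135L)
)

rate <- 0.02      # assumed annual growth rate
base_year <- 22L  # the year whose value is carried forward

# Named lookup: base-year value for each company
base_rows <- df[df$Year == base_year, ]
base_value <- setNames(base_rows$Value, base_rows$Company)

# Base value times (1 + rate) raised to the number of years elapsed
compounded <- unname(base_value[df$Company]) * (1 + rate)^(df$Year - base_year)
df$Value <- compounded
df
```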
I have this dataframe:
id a1 a2 b1 b2 c1 c2
<int> <int> <int> <int> <int> <int> <int>
1 1 83 33 55 33 85 86
2 2 37 0 60 98 51 0
3 3 97 71 85 8 44 40
4 4 51 6 43 15 55 57
5 5 28 53 62 73 70 9
df <- structure(list(id = 1:5, a1 = c(83L, 37L, 97L, 51L, 28L), a2 = c(33L,
0L, 71L, 6L, 53L), b1 = c(55L, 60L, 85L, 43L, 62L), b2 = c(33L,
98L, 8L, 15L, 73L), c1 = c(85L, 51L, 44L, 55L, 70L), c2 = c(86L,
0L, 40L, 57L, 9L)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
I want to:
Combine columns with same starting character to one column by shifting each row of the second column by 1 down and naming the new column with the character of the two columns.
My desired output:
id a b c
<dbl> <dbl> <dbl> <dbl>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
I have tried using the lag function, but I don't know how to combine and shift columns at the same time!
You can use the following solution. I have also modified your data set and added an id column:
library(tidyr)
df %>%
pivot_longer(!id, names_to = c(".value", NA), names_pattern = "([[:alpha:]])(\\d)")
# A tibble: 10 x 4
id a b c
<int> <int> <int> <int>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
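In the call above, ".value" in names_to tells pivot_longer() to use the first capture group (the letter) as the output column name, and the NA drops the digit. If it helps to see what that amounts to, here is a base R sketch of the same interleaving done by hand with c(rbind(...)). The names prefixes and long are only illustrative, and it assumes exactly two columns per prefix, ending in 1 and 2.

```r
df <- data.frame(
  id = 1:5,
  a1 = c(83L, 37L, 97L, 51L, 28L), a2 = c(33L, 0L, 71L, 6L, 53L),
  b1 = c(55L, 60L, 85L, 43L, 62L), b2 = c(33L, 98L, 8L, 15L, 73L),
  c1 = c(85L, 51L, 44L, 55L, 70L), c2 = c(86L, 0L, 40L, 57L, 9L)
)

prefixes <- c("a", "b", "c")

# Each id appears twice: once for the "1" column, once for the "2" column
long <- data.frame(id = rep(df$id, each = 2))

for (p in prefixes) {
  # rbind() stacks the two columns as rows; c() reads them column by column,
  # which interleaves them as x1[1], x2[1], x1[2], x2[2], ...
  long[[p]] <- c(rbind(df[[paste0(p, 1)]], df[[paste0(p, 2)]]))
}
long
```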
We can pivot_longer, remove the digits from name, then pivot_wider and unnest
library(stringr)
library(dplyr)
library(tidyr)
df %>% pivot_longer(cols = -id)%>%
mutate(name=str_remove(name, '[0-9]'))%>%
pivot_wider(names_from = name)%>%
unnest(everything())
# A tibble: 10 x 4
id a b c
<int> <int> <int> <int>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
Doing it as a pivot_longer(), then pivot_wider() is easier to read, but @Anoushiravan R's answer is more direct.
library(tidyverse)
df %>% # df already has an id column, so there is no need to add one
pivot_longer(-id) %>% # Make long
mutate(order = str_sub(name, -1), name = str_sub(name, 1, 1)) %>% # Breakout the name column
pivot_wider(names_from = name) %>% # Make wide again
select(-order) # Drop the ordering column
I think Anoushiravan's solution is the tidiest way to do it. We could also use {dplyover} (disclaimer) for this:
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
group_by(id) %>%
summarise(across2(ends_with("1"),
ends_with("2"),
~ c(.x,.y),
.names = "{pre}"),
)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 10 x 4
#> # Groups: id [5]
#> id a b c
#> <int> <int> <int> <int>
#> 1 1 83 55 85
#> 2 1 33 33 86
#> 3 2 37 60 51
#> 4 2 0 98 0
#> 5 3 97 85 44
#> 6 3 71 8 40
#> 7 4 51 43 55
#> 8 4 6 15 57
#> 9 5 28 62 70
#> 10 5 53 73 9
Created on 2021-07-28 by the reprex package (v0.3.0)
Say we have a data frame that looks like this:
a b c d
22 18 25 9
12 24 6 18
37 8 22 25
24 19 12 27
I would like to create two new columns out of these ones:
a) One column storing the name of the column in which each row gets its highest value.
b) Another one storing its highest value.
In other words, my desired output would look as follows:
a b c d max_col max_val
22 18 25 9 c 25
12 24 6 18 b 24
37 8 22 25 a 37
24 19 12 27 d 27
How can I get this result?
Does this work:
library(dplyr)
df %>% rowwise() %>% mutate(max_col = names(df)[which.max(c_across(a:d))], max_val = max(c_across(a:d)))
# A tibble: 4 x 6
# Rowwise:
a b c d max_col max_val
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 22 18 25 9 c 25
2 12 24 6 18 b 24
3 37 8 22 25 a 37
4 24 19 12 27 d 27
It can also be done by reshaping the data and joining:
library(tidyverse)
#Code
newdf <- df %>% mutate(id=row_number()) %>%
left_join(
df %>% mutate(id=row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>% filter(value==max(value)[1]) %>%
rename(max_col=name,max_val=value)
) %>% select(-id)
Output:
a b c d max_col max_val
1 22 18 25 9 c 25
2 12 24 6 18 b 24
3 37 8 22 25 a 37
4 24 19 12 27 d 27
Some data used:
#Data
df <- structure(list(a = c(22L, 12L, 37L, 24L), b = c(18L, 24L, 8L,
19L), c = c(25L, 6L, 22L, 12L), d = c(9L, 18L, 25L, 27L)), class = "data.frame", row.names = c(NA,
-4L))
We can do this in a vectorized efficient way with max.col from base R - gets the position index of the max value in a row, which is used to extract the corresponding column name with [ , and pmax to return the max value per row
mcol <- names(df)[max.col(df, 'first')]
mval <- do.call(pmax, df)
df[c('max_col', 'max_val')] <- list(mcol, mval)
-output
df
# a b c d max_col max_val
#1 22 18 25 9 c 25
#2 12 24 6 18 b 24
#3 37 8 22 25 a 37
#4 24 19 12 27 d 27
Or using tidyverse, we can use the same max.col and pmax to get the column names of the max value per row and the max value of the row
library(dplyr)
library(purrr)
df %>%
mutate(max_col = names(cur_data())[max.col(cur_data(), 'first')],
max_val = invoke(pmax, select(cur_data(), where(is.numeric))))
-output
# a b c d max_col max_val
# 1 22 18 25 9 c 25
# 2 12 24 6 18 b 24
# 3 37 8 22 25 a 37
# 4 24 19 12 27 d 27
data
df <- structure(list(a = c(22L, 12L, 37L, 24L), b = c(18L, 24L, 8L,
19L), c = c(25L, 6L, 22L, 12L), d = c(9L, 18L, 25L, 27L)),
class = "data.frame", row.names = c(NA,
-4L))
Here is another base R option
inds <- max.col(df)
df <- cbind(df,
max_col = names(df)[inds],
max_val = df[cbind(seq_along(inds), inds)]
)
which gives
a b c d max_col max_val
1 22 18 25 9 c 25
2 12 24 6 18 b 24
3 37 8 22 25 a 37
4 24 19 12 27 d 27
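One thing to be aware of with max.col(): its default ties.method is "random", so if a row has the same maximum in two columns, the reported column can change between runs. Here is a small sketch with a made-up data frame (df_ties is not from the question) showing how to make the choice deterministic.

```r
# Hypothetical data: row 1 has a tie between columns b and c
df_ties <- data.frame(a = c(22, 12), b = c(25, 24), c = c(25, 6))
m <- as.matrix(df_ties)

first_max <- max.col(m, ties.method = "first")  # leftmost column wins a tie
last_max <- max.col(m, ties.method = "last")    # rightmost column wins a tie

names(df_ties)[first_max]  # "b" "b"
names(df_ties)[last_max]   # "c" "b"

# The maximum value itself is unaffected by how ties are broken
max_val <- do.call(pmax, df_ties)
max_val
```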
Until now I've been working with a medium-sized dataset for an Occupation Survey (around 200 MB in total). Here's the data if you want to review it: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
I have the following code:
hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)
library(dplyr)
base<-tibble(ID_VIV=datos$ID_VIV, ID_PERSONA=datos$ID_PERSONA, EDAD=datos$EDAD, CONACT=datos$CONACT)
base$maxage <- ave(base$EDAD, base$ID_VIV, FUN=max)
base$Condición_I<-case_when(base$CONACT==32 & base$EDAD>=60 ~ 1,
base$CONACT>=10 & base$EDAD>=60 & base$CONACT<=16 ~ 2,
base$CONACT==20 & base$EDAD>=60 | base$CONACT==31 & base$EDAD>=60 | (base$CONACT>=33 & base$CONACT<=35 & base$EDAD>=60) ~ 3)
base <- subset(base, maxage >= 60)
base<- base %>% group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else Condición_I)
base$ID_VIV<-as.character(base$ID_VIV)
base$ID_PERSONA<-as.character(base$ID_PERSONA)
base
And ended up with:
# A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 4
3 10010000008 1001000000802 47 33 83 NA 4
4 10010000008 1001000000801 47 10 83 NA 4
5 10010000012 1001000001204 4 NA 60 NA 4
6 10010000012 1001000001203 2 NA 60 NA 4
7 10010000012 1001000001201 60 10 60 2 4
8 10010000012 1001000001202 21 10 60 NA 4
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
The Condición_I column is a code for the labour condition of each individual (row). Some of these individuals share a house (that's why they share ID_VIV). I only care about the individuals who are 60 or older; all the NAs are individuals who live with someone aged 60+ whose situation I do not care about (but I need to keep them). I need the column Condición_V to display another value following these conditions:
Condición_I == 1 ~ 1
Condición_I == 2 ~ 2
Condición_I == 3 ~ 3
Any combination of Condición_I ~ 4
This means that if all the individuals aged 60+ in a house have Condición_I == 1, then Condición_V will be 1, and the same holds for codes 2 and 3. When there is, e.g., one person with Condición_I == 1 and another with Condición_I == 3 in the same house, then Condición_V will be 4.
And I'm hoping to get this kind of result:
# A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 3
3 10010000008 1001000000802 47 33 83 NA 3
4 10010000008 1001000000801 47 10 83 NA 3
5 10010000012 1001000001204 4 NA 60 NA 2
6 10010000012 1001000001203 2 NA 60 NA 2
7 10010000012 1001000001201 60 10 60 2 2
8 10010000012 1001000001202 21 10 60 NA 2
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
I know my error is in:
base <- base %>% group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else Condición_I)
Is there a way to use that line of code while ignoring the NA values, or is my best option to do it another way? I do not have to do it the way I'm trying, and any help will be much appreciated!
We can wrap with na.omit on the Condición_I column, check the number of distinct elements with n_distinct and if it is greater than 1, return 4 or else return the na.omit of the column
library(dplyr)
base %>%
group_by(ID_VIV) %>%
mutate(Condición_V = if(n_distinct(na.omit(Condición_I)) > 1)
4 else na.omit(Condición_I)[1])
# A tibble: 10 x 7
# Groups: ID_VIV [4]
# ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
# <chr> <chr> <int> <int> <int> <int> <dbl>
# 1 10010000007 1001000000701 69 32 69 1 1
# 2 10010000008 1001000000803 83 33 83 3 3
# 3 10010000008 1001000000802 47 33 83 NA 3
# 4 10010000008 1001000000801 47 10 83 NA 3
# 5 10010000012 1001000001204 4 NA 60 NA 2
# 6 10010000012 1001000001203 2 NA 60 NA 2
# 7 10010000012 1001000001201 60 10 60 2 2
# 8 10010000012 1001000001202 21 10 60 NA 2
# 9 10010000014 1001000001401 67 32 67 1 4
#10 10010000014 1001000001402 64 33 67 3 4
data
base <- structure(list(ID_VIV = c("10010000007", "10010000008", "10010000008",
"10010000008", "10010000012", "10010000012", "10010000012", "10010000012",
"10010000014", "10010000014"), ID_PERSONA = c("1001000000701",
"1001000000803", "1001000000802", "1001000000801", "1001000001204",
"1001000001203", "1001000001201", "1001000001202", "1001000001401",
"1001000001402"), EDAD = c(69L, 83L, 47L, 47L, 4L, 2L, 60L, 21L,
67L, 64L), CONACT = c(32L, 33L, 33L, 10L, NA, NA, 10L, 10L, 32L,
33L), maxage = c(69L, 83L, 83L, 83L, 60L, 60L, 60L, 60L, 67L,
67L), Condición_I = c(1L, 3L, NA, NA, NA, NA, 2L, NA, 1L, 3L
)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10"), class = "data.frame")
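The same rule can be written without dplyr, which may help to see what the grouped if/else is doing. This is only a sketch on the ten example rows: household_code is a hypothetical helper, and returning NA for a house with no coded person is my assumption, since the question does not cover that case.

```r
id_viv <- c("10010000007", "10010000008", "10010000008", "10010000008",
            "10010000012", "10010000012", "10010000012", "10010000012",
            "10010000014", "10010000014")
cond_i <- c(1, 3, NA, NA, NA, NA, 2, NA, 1, 3)

# One code per household: the shared code, 4 for a mix, NA when nobody is coded
household_code <- function(x) {
  codes <- unique(x[!is.na(x)])
  if (length(codes) == 0) NA_real_ else if (length(codes) > 1) 4 else codes
}

# ave() applies the helper per household and recycles the result to every row
cond_v <- ave(cond_i, id_viv, FUN = household_code)
cond_v
```

Within dplyr, n_distinct() also accepts na.rm = TRUE, which could replace the na.omit() call inside the condition.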