R: Recursive Function to Move Through Dataset

Much of the following problem arises from the sheer size of the dataframe (198240 observations). I'll try to break it down as best as I can.
The Goal
I want to create a variable DURATION which is how long a house was sick.
The Known
Household ID and Week (There are 1120 houses and 177 weeks)
HDINC (Currently Sick variable )
HDINC_1 (Sick Week Prior variable )
The Problem
I don't understand how to get the function/loop to traverse the dataframe by household and by time at once.
I know it will be a function or loop that goes something like the following (not R code, just the logic):
IF (hdinc > 0) {                 # the house is sick on this date
    Duration = 1, then look at hdinc_1
    IF (hdinc_1 == 0) {          # the house was well last week
        Duration = Duration + 0
        go on to the next date for that house
    }
    IF (hdinc_1 > 0) {           # the house was sick last week
        Duration = Duration + 1
        go to the same house, Week - 1, and look at hdinc_1 to see whether it was sick the week before
    }
}
I am having trouble with the following:
Getting it to start on a particular observation based on household/date
Moving the function backwards or forwards while maintaining the household
Eventually getting the function to restart using a different household
I know this is really convoluted but I can't even get the loop to start to provide y'all sample code.
Sample Data:
dat <- structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L), survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L), .Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"), hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)), .Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L))
Sample Output:

Using only base R :
# create sample data
sampleData <-
structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L),
survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L),
.Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"),
hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)),
.Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L))
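# the rows must already be ordered by house and date; if they are not, sort them first.
# survdate is a factor, so convert it to Date before sorting (sorting the factor itself is alphabetical):
#sampleData <- sampleData[order(sampleData$id_casa, as.Date(as.character(sampleData$survdate), format = "%m/%d/%Y")),]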
sampleData$Duration <-
unlist(
by(sampleData,
INDICES=sampleData$id_casa,
FUN=function(house){
tail(Reduce(f=function(prv,nxt){if(nxt == 0) 0 else (prv+nxt)},
x=as.integer(house$hdinc > 0),init=0,accumulate=TRUE),-1)
}))
> sampleData
id_casa survdate hdinc hdinc_1 Duration
1 802 1/19/2005 125.0000 0.0000 1
2 802 12/28/2005 142.8571 125.0000 2
3 802 1/4/2006 0.0000 142.8571 0
4 802 1/11/2006 0.0000 0.0000 0
5 802 1/18/2006 0.0000 0.0000 0
6 802 1/25/2006 142.8571 0.0000 1
7 802 2/1/2006 0.0000 142.8571 0
8 955 10/13/2004 50.0000 0.0000 1
9 955 10/20/2004 32.0000 50.0000 2
10 955 10/27/2004 159.0000 32.0000 3
11 955 11/3/2004 2.5000 159.0000 4
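The Reduce call in the answer above can be traced on a single house. This is a minimal sketch on a made-up vector (sick and step are illustrative names, not part of the original data); it only shows how the running count resets to 0 on a well week.

```r
# 1 = sick week, 0 = well week, for one hypothetical house in date order
sick <- c(1L, 1L, 0L, 0L, 1L, 1L, 1L)

# reset to 0 on a well week, otherwise extend the current run by one
step <- function(prv, nxt) if (nxt == 0) 0 else prv + nxt

# accumulate = TRUE keeps every intermediate value; init = 0 is prepended
running <- Reduce(f = step, x = sick, init = 0, accumulate = TRUE)

# drop the prepended init value, as tail(..., -1) does in the answer
duration <- tail(running, -1)
duration
```

Because the count only looks at the previous element, the rows of each house have to be in date order before this is applied.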

We can use the function rle in combination with dplyr to find runs, and then set the duration to 0 for the runs where the house was well:
library(dplyr)
dat %>% group_by(id_casa) %>%
mutate(duration = unlist(lapply(rle(hdinc > 0)[["lengths"]], seq, from = 1))) %>%
mutate(duration = ifelse(hdinc > 0, as.numeric(duration), 0))
Source: local data frame [11 x 5]
Groups: id_casa [2]
id_casa survdate hdinc hdinc_1 duration
(int) (fctr) (dbl) (dbl) (dbl)
1 802 1/19/2005 125.0000 0.0000 1
2 802 12/28/2005 142.8571 125.0000 2
3 802 1/4/2006 0.0000 142.8571 0
4 802 1/11/2006 0.0000 0.0000 0
5 802 1/18/2006 0.0000 0.0000 0
6 802 1/25/2006 142.8571 0.0000 1
7 802 2/1/2006 0.0000 142.8571 0
8 955 10/13/2004 50.0000 0.0000 1
9 955 10/20/2004 32.0000 50.0000 2
10 955 10/27/2004 159.0000 32.0000 3
11 955 11/3/2004 2.5000 159.0000 4
How it works: first we find all the runs using rle:
rle(dat$hdinc>0)
Run Length Encoding
lengths: int [1:5] 2 3 1 1 4
values : logi [1:5] TRUE FALSE TRUE FALSE TRUE
We then make a seq from 1 to each of the lengths from the rle using lapply:
z <- unlist(lapply(rle(dat$hdinc > 0)[["lengths"]], seq, from = 1))
z
[1] 1 2 1 2 3 1 1 1 2 3 4
Then we filter that by whether it was sickness or wellness:
ifelse(dat$hdinc > 0, z, 0)
[1] 1 2 0 0 0 1 0 1 2 3 4
Using dplyr group_by we make sure we are running it on each id_casa by itself.
EDIT: In base:
dat$duration2 <- ifelse(dat$hdinc > 0,
unlist(by(dat, dat$id_casa, FUN = function(x) unlist(lapply(rle(x$hdinc > 0)[["lengths"]], seq, from = 1)))),
0)
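As a possible shorthand for the lapply/seq step, base R's sequence() builds the same within-run counters, and ave() applies it per house. The sketch below uses a small made-up data frame (toy and run_pos are illustrative names) and assumes the rows of each house are already in date order.

```r
# hypothetical data: two houses, rows already in date order within each house
toy <- data.frame(
  id_casa = c(802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L),
  hdinc   = c(125, 142, 0, 0, 9, 50, 32, 0)
)

# position of each row inside its run of equal values
run_pos <- function(sick) sequence(rle(sick)$lengths)

# ave() splits by house, applies run_pos, and puts the results back in row order
pos <- ave(as.integer(toy$hdinc > 0), toy$id_casa, FUN = run_pos)

# keep the counter for sick weeks, 0 for well weeks
toy$duration3 <- ifelse(toy$hdinc > 0, pos, 0)
toy
```

Unlike unlist(by(...)), ave() returns values in the original row order, so the houses do not need to be stored contiguously.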

Related

Computing the outstanding balance of a loan in r by referencing the balance in a previous row

I have a data set which contains multiple customers each having multiple loans. The loans have monthly repayments. See example of the data frame called data1 (see a sample for data1 below).
The formula for Closing Balance for Fore_Cast_Horizon = 1 is:
Account_Balance + (Account_Balance*Interest_Rate) - Instalment
The formula for Closing Balance where Fore_Cast_Horizon > 1 uses the previous row's closing balance, written here as Account_Balance[i-1]:
Account_Balance[i-1] + (Account_Balance[i-1]*Interest_Rate[i]) - Instalment[i]
This needs to reset for each new ACCOUNT_ID in the data.
I need help programming this in R.
Customer_ID ACCOUNT_ID Account_Balance Instalment Interest_Rate Fore_Cast_Horizon Closing Balance
100 2 500 50 0.02 1 460.0000
100 2 500 50 0.02 2 419.2000
010 2 500 50 0.02 3 377.5840
100 4 800 80 0.03 1 744.0000
100 4 800 80 0.03 2 686.3200
100 4 800 80 0.03 3 626.9096
I think the following solves your problem:
library(tidyverse)
df <- data.frame(
Customer_ID = c(100L, 100L, 10L, 100L, 100L, 100L),
ACCOUNT_ID = c(2L, 2L, 2L, 4L, 4L, 4L),
Account_Balance = c(500L, 500L, 500L, 800L, 800L, 800L),
Instalment = c(50L, 50L, 50L, 80L, 80L, 80L),
Interest_Rate = c(0.02, 0.02, 0.02, 0.03, 0.03, 0.03),
Fore_Cast_Horizon = c(1L, 2L, 3L, 1L, 2L, 3L),
Closing.Balance = c(460, 419.2, 377.584, 744, 686.32, 626.9096)
)
df %>%
group_by(ACCOUNT_ID) %>%
mutate(
idg = row_number(ACCOUNT_ID),
raux = ifelse(idg == 1, 0, Interest_Rate),
CB2 = Account_Balance * cumprod(1+Interest_Rate) -
cumsum(Instalment * cumprod(1+raux)), raux = NULL, idg = NULL) %>%
ungroup() %>%
as.data.frame()
#> Customer_ID ACCOUNT_ID Account_Balance Instalment Interest_Rate
#> 1 100 2 500 50 0.02
#> 2 100 2 500 50 0.02
#> 3 10 2 500 50 0.02
#> 4 100 4 800 80 0.03
#> 5 100 4 800 80 0.03
#> 6 100 4 800 80 0.03
#> Fore_Cast_Horizon Closing.Balance CB2
#> 1 1 460.0000 460.0000
#> 2 2 419.2000 419.2000
#> 3 3 377.5840 377.5840
#> 4 1 744.0000 744.0000
#> 5 2 686.3200 686.3200
#> 6 3 626.9096 626.9096
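The closed form above relies on the rate and instalment being constant within an account. A row-by-row alternative, sketched here in base R with Reduce, carries the previous closing balance forward explicitly. The names loans, closing_balance and CB are illustrative, and the data are the question's sample without the columns that are not needed.

```r
loans <- data.frame(
  ACCOUNT_ID      = c(2L, 2L, 2L, 4L, 4L, 4L),
  Account_Balance = c(500, 500, 500, 800, 800, 800),
  Instalment      = c(50, 50, 50, 80, 80, 80),
  Interest_Rate   = c(0.02, 0.02, 0.02, 0.03, 0.03, 0.03)
)

# closing balances for one account, rows assumed to be in horizon order
closing_balance <- function(acc) {
  bal <- Reduce(
    f = function(prev, i) prev * (1 + acc$Interest_Rate[i]) - acc$Instalment[i],
    x = seq_len(nrow(acc)),
    init = acc$Account_Balance[1],   # opening balance of the account
    accumulate = TRUE
  )
  bal[-1]                            # drop the opening balance itself
}

# restart the recursion for each ACCOUNT_ID
loans$CB <- NA_real_
for (idx in split(seq_len(nrow(loans)), loans$ACCOUNT_ID)) {
  loans$CB[idx] <- closing_balance(loans[idx, ])
}
loans
```

This version also works when Interest_Rate or Instalment changes from row to row, at the cost of an explicit loop over accounts.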

Count numbers and percentage of negative, 0 and positive values for each column in R

I have a toy dataset as follows:
df <- structure(list(id = 1:11, price = c(40.59, 70.42, 1.8, 1.98,
65.02, 2.23, 54.79, 54.7, 3.32, 1.77, 3.5), month_pct = structure(c(11L,
10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 1L, 2L), .Label = c("-19.91%",
"-8.55%", "1.22%", "1.39%", "1.41%", "1.83%", "2.02%", "2.59%",
"2.86%", "6.58%", "8.53%"), class = "factor"), year_pct = structure(c(4L,
9L, 5L, 3L, 10L, 1L, 11L, 8L, 6L, 7L, 2L), .Label = c("-10.44%",
"-19.91%", "-2.46%", "-35.26%", "-4.26%", "-5.95%", "-6.35%",
"-6.91%", "-7.95%", "1.51%", "1.54%"), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
Out:
id price month_pct year_pct
0 1 40.59 8.53% -35.26%
1 2 70.42 6.58% -7.95%
2 3 1.80 2.86% -4.26%
3 4 1.98 2.59% -2.46%
4 5 65.02 2.02% 1.51%
5 6 2.23 1.83% -10.44%
6 7 54.79 1.41% 1.54%
7 8 54.70 1.39% -6.91%
8 9 3.32 1.22% -5.95%
9 10 1.77 -19.91% -6.35%
10 11 3.50 -8.55% -19.91%
Now I want to count the number and percentage of positive, 0 and negative for columns month_pct and year_pct, how could I do that in R?
Thanks.
One dplyr and tidyr possibility could be:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-c(1:2)) %>%
group_by(name,
value_sign = factor(sign(as.numeric(sub("%", "", value))),
levels = -1:1,
labels = c("negative", "zero", "positive")),
.drop = FALSE) %>%
count() %>%
group_by(name) %>%
mutate(prop = n/sum(n)*100)
name value_sign n prop
<chr> <fct> <int> <dbl>
1 month_pct negative 2 18.2
2 month_pct zero 0 0
3 month_pct positive 9 81.8
4 year_pct negative 9 81.8
5 year_pct zero 0 0
6 year_pct positive 2 18.2
Here's a base R approach using regex:
sts <- data.frame(
sign = c("positive", "zero", "negative"),
month_number = c(length(which(grepl("^\\d", df$month_pct))),
length(which(df$month_pct==0)),
length(which(grepl("^-", df$month_pct)))),
month_percent = c(length(which(grepl("^\\d", df$month_pct)))/length(df$month_pct)*100,
length(which(df$month_pct==0))/length(df$month_pct)*100,
length(which(grepl("^-", df$month_pct)))/length(df$month_pct)*100),
year_number = c(length(which(grepl("^\\d", df$year_pct))),
length(which(df$year_pct==0)),
length(which(grepl("^-", df$year_pct)))),
year_percent = c(length(which(grepl("^\\d", df$year_pct)))/length(df$year_pct)*100,
length(which(df$year_pct==0))/length(df$year_pct)*100,
length(which(grepl("^-", df$year_pct)))/length(df$year_pct)*100)
)
Result:
sts
sign month_number month_percent year_number year_percent
1 positive 9 81.81818 2 18.18182
2 zero 0 0.00000 0 0.00000
3 negative 2 18.18182 9 81.81818
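One caveat with the regex version: a value such as "0.00%" starts with a digit, so it would be counted as positive, and comparing the percent strings with 0 never matches. A hedged base R sketch that converts to numeric first is shown below; pct is a small made-up data frame (including one zero) and sign_counts is an illustrative helper, not part of the answer above.

```r
# hypothetical input in the same shape as the question's columns
pct <- data.frame(
  month_pct = c("8.53%", "6.58%", "-19.91%", "0.00%"),
  year_pct  = c("-35.26%", "1.51%", "-6.35%", "-2.46%"),
  stringsAsFactors = TRUE
)

# strip the percent sign, convert to numeric, then tabulate the sign
sign_counts <- function(x) {
  num <- as.numeric(sub("%", "", as.character(x), fixed = TRUE))
  table(factor(sign(num), levels = c(-1, 0, 1),
               labels = c("negative", "zero", "positive")))
}

counts  <- sapply(pct, sign_counts)                 # rows: sign, columns: variable
percent <- 100 * prop.table(counts, margin = 2)     # column-wise percentages
counts
percent
```

Fixing the factor levels keeps a row for every sign even when its count is 0.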
Using dplyr 1.0.0 here is one way :
library(dplyr)
df %>%
summarise(across(c(month_pct, year_pct),
~table(factor(sign(readr::parse_number(as.character(.))),
levels = -1:1)))) %>%
mutate(sign = c('negative', 'zero', 'positive'), .before = month_pct) %>%
rename_at(-1, ~sub('pct', 'n', .)) %>%
mutate(across(-1, list(pct = ~./sum(.) * 100)))
# sign month_n year_n month_n_pct year_n_pct
#1 negative 2 9 18.2 81.8
#2 zero 0 0 0.0 0.0
#3 positive 9 2 81.8 18.2

Time lag between sequential observations giving negative values

I am trying to calculate the time between sequential observations. I have attached a sample of my data here.
A subset of my data looks like:
head(d1) #visualize the first few lines of the data
date time year km sps pp datetime next timedif seque
<fct> <fct> <int> <dbl> <fct> <dbl> <chr> <dbl> <dbl> <fct>
2012/06/21 23:23 2012 80 MUXX 1 2012-06-21 23:23 0 4144 10
2012/07/15 11:38 2012 80 MAMO 0 2012-07-15 11:38 1 33855 01
2012/07/20 22:19 2012 80 MICRO 0 2012-07-20 22:19 0 7841 00
2012/07/29 23:03 2012 80 MICRO 0 2012-07-29 23:03 0 13004 00
2012/10/18 2:54 2012 80 MICRO 0 2012-10-18 02:54 0 -971 00
2012/10/23 2:49 2012 80 MICRO 0 2012-10-23 02:49 0 -1094 00
Where:
pp: which species (sps) are predators (coded as 1) and which are prey (coded as 0)
next: very next pp after the current observation
timedif: time difference between the current observation and the next one
seque: this should be the sequence order: where the first number is the current pp and the second number is the next pp
To generate the datetime column, I did this:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
To make the other columns I used the following code:
d1 = d1 %>%
ungroup() %>%
group_by(km, year) %>%
mutate(`next` = dplyr::lag(pp)) %>% # next is a reserved word in R, so it needs backticks
mutate(timedif = as.numeric(as.POSIXct(datetime) - lag(as.POSIXct(datetime))))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,`next`)))
I have two questions:
My lag function appears to be recording the previous pp event, not the next pp event. How do I fix this?
My timedif calculation is giving me negative values, which shouldn't be possible. Why is that happening?
Just in case, here is the output for str(d1):
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 96 obs. of 10 variables:
$ date : Factor w/ 1093 levels "2012/05/30","2012/05/31",..: 23 47 52 61 71 76 76 88 90 98 ...
$ time : Factor w/ 1439 levels "0:00","0:01",..: 983 219 919 963 1016 5 47 52 923 1058 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ km : num 80 80 80 80 80 80 80 80 80 80 ...
$ sps : Factor w/ 17 levels "CACA","ERDO",..: 11 7 9 9 9 9 9 4 9 11 ...
$ pp : num 1 0 0 0 0 0 0 0 0 1 ...
$ datetime: chr "2012-06-21 23:23" "2012-07-15 11:38" "2012-07-20 22:19" "2012-07-29 23:03" ...
$ next : num 0 1 0 0 0 0 0 0 0 0 ...
$ timedif : num 4144 33855 7841 13004 14453 ...
$ seque : Factor w/ 4 levels "00","01","10",..: 3 2 1 1 1 1 1 1 1 3 ...
And also:
dput(d1[1:10,])
structure(list(
date = structure(c(23L, 47L, 52L, 61L, 71L, 76L, 76L, 88L, 90L, 98L),
.Label = c("2012/05/30", "2012/05/31", "2012/06/01", "2012/06/02", "2012/06/03", "2012/06/04", "2012/06/05", "2013/06/18", "2013/06/19", "2013/06/20", "2013/06/21", "2013/06/22", "2014/07/19", "2014/07/20", "2014/07/21", "2014/07/22", "2014/07/23", "2015/08/06", "2015/08/07", "2015/08/08", "2015/08/09", "2015/08/10"),
class = "factor"),
time = structure(c(983L, 219L, 919L, 963L, 1016L, 5L, 47L, 52L, 923L, 1058L),
.Label = c("0:00", "0:01", "0:02", "0:03", "0:04", "0:05", "0:06", "0:07", "0:33","0:34", "0:35", "0:36", "0:37","10:06", "10:07", "10:08", "10:09", "10:10", "10:11", "10:12", "10:13", "2:05", "2:06", "2:07", "2:08", "2:09", "2:10", "2:11", "9:54", "9:55", "9:56", "9:57", "9:58", "9:59"),
class = "factor"),
year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L),
km = c(80, 80, 80, 80, 80, 80, 80, 80, 80, 80),
sps = structure(c(11L, 7L, 9L, 9L, 9L, 9L, 9L, 4L, 9L, 11L),
.Label = c("CACA", "ERDO", "FEDO", "LEAM", "LOCA", "MAAM", "MAMO", "MEME", "MICRO", "MUVI", "MUXX", "ONZI", "PRLO", "TAHU", "TAST", "URAM", "VUVU"),
class = "factor"),
pp = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
datetime = c("2012-06-21 23:23", "2012-07-15 11:38", "2012-07-20 22:19", "2012-07-29 23:03", "2012-08-08 23:56", "2012-08-13 00:04", "2012-08-13 00:46", "2012-08-25 00:51", "2012-08-27 22:23", "2012-09-04 03:38"),
prev = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
timedif = c(4144, 33855, 7841, 13004, 14453, 5768, 42, 17285, 4172, 10395),
seque = structure(c(3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c("00", "01", "10", "11"),
class = "factor")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -10L))
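On the first question, lag() returns the previous row's value; the next row's value needs a shift in the other direction (dplyr::lead() in dplyr). On the second, one possible cause of negative differences is rows that are not in chronological order within a group, although that cannot be confirmed from the sample shown. The base R sketch below, on four made-up rows deliberately stored out of order, sorts within group, takes the next pp, and computes minutes to the next observation. The names obs, when, next_pp and shift_up are illustrative.

```r
# hypothetical rows, not in chronological order
obs <- data.frame(
  km       = c(80, 80, 80, 80),
  year     = c(2012L, 2012L, 2012L, 2012L),
  pp       = c(0, 1, 0, 0),
  datetime = c("2012-07-15 11:38", "2012-06-21 23:23",
               "2012-07-29 23:03", "2012-07-20 22:19"),
  stringsAsFactors = FALSE
)

# parse with an explicit format and time zone, then sort within km/year
obs$when <- as.POSIXct(obs$datetime, format = "%Y-%m-%d %H:%M", tz = "UTC")
obs <- obs[order(obs$km, obs$year, obs$when), ]
grp <- interaction(obs$km, obs$year, drop = TRUE)

# value of the following row in the group; the last row has no successor
shift_up <- function(v) c(v[-1], NA)

obs$next_pp <- ave(obs$pp, grp, FUN = shift_up)
obs$timedif <- ave(as.numeric(obs$when), grp,
                   FUN = function(s) (shift_up(s) - s) / 60)  # minutes
obs
```

Working in seconds and dividing by 60 fixes the unit, whereas subtracting POSIXct values directly lets R choose the unit.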

dplyr mutate add new column value only once a different column == specific value

I have a dataframe with the following structure (summary example, not actual)
dput(df1)
structure(list(MedID = c(111, 111, 111, 111, 111, 111, 222, 222,
222, 222, 222), Service = structure(c(1L, 1L, 2L, 1L, 1L, 3L,
3L, 2L, 1L, 1L, 3L), .Label = c("Acute care", "Ext care",
"Outpt care"), class = "factor"), AdmitDate = structure(c(16832, 16861,
16892, 16922, 16953, 16983, 17181, 17212, 17240, 17271, 17301),
class = "Date"), Flag = c(0, 0, 99, 0, 0, 0, 0, 99, 0, 0, 0)),
.Names = c("MedID", "Service", "AdmitDate", "Flag"), row.names = c(NA, -11L),
class = "data.frame")
> df1
MedID Service AdmitDate Flag
1 111 Acute care 2016-02-01 0
2 111 Acute care 2016-03-01 0
3 111 Ext care 2016-04-01 99
4 111 Acute care 2016-05-01 0
5 111 Acute care 2016-06-01 0
6 111 Outpt care 2016-07-01 0
7 222 Outpt care 2017-01-15 0
8 222 Ext care 2017-02-15 99
9 222 Acute care 2017-03-15 0
10 222 Acute care 2017-04-15 0
11 222 Outpt care 2017-05-15 0
I wish to use dplyr, group_by(MedID) and mutate to add a column in a new dataframe (let's call it Flag2 in df2) such that within each patient (MedID) the df2$Flag2 column == 1 for every subsequent row within that unique MedID but only after the df1$Flag column == 99, otherwise the df2$Flag2 column gets a 0. I can code this as desired if df1$Flag == 99 in the first row of a MedID, but otherwise my code either produces 1 in df2$Flag2 only in the row where df1$Flag == 99, or it produces a 1 for all rows in a given MedID where df1$Flag == 99. The desired output is:
dput(df2)
structure(list(MedID = c(111, 111, 111, 111, 111, 111, 222, 222,
222, 222, 222), Service = structure(c(1L, 1L, 2L, 1L, 1L, 3L,
3L, 2L, 1L, 1L, 3L), .Label = c("Acute care", "Ext care",
"Outpt care"), class = "factor"), AdmitDate = structure(c(16832, 16861,
16892, 16922, 16953, 16983, 17181, 17212, 17240, 17271, 17301),
class = "Date"), Flag = c(0, 0, 99, 0, 0, 0, 0, 99, 0, 0, 0),
Flag2 = c(0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1)), .Names = c("MedID", "Service",
"AdmitDate", "Flag", "Flag2"), row.names = c(NA, -11L),
class = "data.frame")
> df2
MedID Service AdmitDate Flag Flag2
1 111 Acute care 2016-02-01 0 0
2 111 Acute care 2016-03-01 0 0
3 111 Ext care 2016-04-01 99 1
4 111 Acute care 2016-05-01 0 1
5 111 Acute care 2016-06-01 0 1
6 111 Outpt care 2016-07-01 0 1
7 222 Outpt care 2017-01-15 0 0
8 222 Ext care 2017-02-15 99 1
9 222 Acute care 2017-03-15 0 1
10 222 Acute care 2017-04-15 0 1
11 222 Outpt care 2017-05-15 0 1
Here is a snippet of the code. It is not complete, since it does not execute properly. Do I need to nest the mutate within a for loop? That seems like mangled R coding. :( Note: df1$Flag can only == 99 once per MedID, which I think should make it easier.
df2 <- df1 %>%
group_by(MedID) %>%
mutate(Flag2 = ifelse(df1$Flag == 99, 1, 0))
One solution could be to use fill from tidyr. The approach is to first add Flag2, set to 1 for rows where Flag == 99 and to NA otherwise.
Then fill the Flag2 column downwards, and finally replace every remaining NA with 0.
library(tidyverse)
df1 %>%
group_by(MedID) %>%
mutate(Flag2 = ifelse(Flag == 99, 1L, NA)) %>%
fill(Flag2) %>%
mutate(Flag2 = ifelse(is.na(Flag2), 0L, Flag2))
The OP has not mentioned it, but if AdmitDate is meant to decide which rows come after the row with Flag == 99, then arrange should be added to the code above:
df1 %>%
group_by(MedID) %>%
mutate(Flag2 = ifelse(Flag == 99, 1L, NA)) %>%
arrange(AdmitDate) %>%
fill(Flag2) %>%
mutate(Flag2 = ifelse(is.na(Flag2), 0L, Flag2))
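A different technique from fill, sketched here in base R, is a running maximum of the indicator Flag == 99 within each MedID: it stays 0 until the flag is seen and 1 from that row on. The data frame below is a shortened, made-up version of df1 (toy is an illustrative name), and the rows are assumed to be in admission order within each patient.

```r
# hypothetical, shortened version of df1
toy <- data.frame(
  MedID = c(111, 111, 111, 111, 222, 222, 222),
  Flag  = c(0, 99, 0, 0, 0, 0, 99)
)

# 1 on the flagged row, 0 elsewhere; cummax keeps the 1 for all later rows
toy$Flag2 <- ave(as.integer(toy$Flag == 99), toy$MedID, FUN = cummax)
toy
```

The same idea carries over to dplyr as a grouped mutate with cummax(Flag == 99), if staying in the pipeline is preferred.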

Conditionally convert numbers in an R data frame

I am trying to convert data so that each column is represented by 0's, 1's, and 2's. I have a data frame with 5 populations and 6 variables (there are actually 100+ populations and 5,000+ variables in the real data frame):
pop Var1 Var2 Var3 Var4 Var5 Var6
1 Crater 11 11 22 44 11 22
2 Teton 14 44 12 34 33 22
3 Vipond Park 44 11 22 44 33 NA
4 Little Joe 11 44 NA 44 13 44
5 Rainier 14 11 11 NA 11 44
In each column, I have the following combinations of numbers:
1 and 3,
2 and 4,
2 and 3,
1 and 4,
3 and 4,
1 and 2
For each column, I need to convert one of the "doubled numbers" to a 0, the OTHER of the doubled numbers to a 2, and then those variables that are a combination of two numbers to a 1 (the intermediate value). (So, 13, 24, 23, 14, 34, and 12 should become 1.)
For example, for Var1 in the data frame above, 11 should be 0, 14 should be 1, and 44 should be 2. Some columns have only one of the doubled numbers, and then the combination of the numbers as well. There is also missing data. For example, I am trying to convert the above data frame to:
pop Var1 Var2 Var3 Var4 Var5 Var6
1 Crater 0 0 0 0 0 0
2 Teton 1 2 1 1 2 0
3 Vipond Park 2 0 0 0 2 NA
4 Little Joe 0 2 NA 0 1 2
5 Rainier 1 0 2 NA 0 2
Let u be the unique non-NA elements in x. is.twice is a logical vector which is TRUE for the double digits in u and FALSE for the non-double digits in u. uu is the unique double digits and other is the remaining number or it may be zero length if there is no other number. Finally compute the labels associated with c(uu, other) and perform the translation of x:
f <- function(x) {
u <- unique(na.omit(x))
# separate u into uu (double digits) and other
is.twice <- u %% 10 == u %/% 10 # true if double digit
uu <- u[is.twice]
other <- u[!is.twice]
# compute labels associated with c(uu, other)
labels <- c(0, 2)[seq_along(uu)]
if (length(other) > 0) labels <- c(labels, 1)
# translate x to appropriate labels
labels[match(x, c(uu, other))]
}
replace(DF, -1, lapply(DF[-1], f))
which for the sample data gives:
pop Var1 Var2 Var3 Var4 Var5 Var6
1 Crater 0 0 0 0 0 0
2 Teton 1 2 1 1 2 0
3 Vipond Park 2 0 0 0 2 NA
4 Little Joe 0 2 NA 0 1 2
5 Rainier 1 0 2 NA 0 2
Note: The above used this input:
DF <-
structure(list(pop = structure(c(1L, 4L, 5L, 2L, 3L), .Label = c("Crater",
"Little Joe", "Rainier", "Teton", "Vipond Park"), class = "factor"),
Var1 = c(11L, 14L, 44L, 11L, 14L), Var2 = c(11L, 44L, 11L,
44L, 11L), Var3 = c(22L, 12L, 22L, NA, 11L), Var4 = c(44L,
34L, 44L, 44L, NA), Var5 = c(11L, 33L, 33L, 13L, 11L), Var6 = c(22L,
22L, NA, 44L, 44L)), .Names = c("pop", "Var1", "Var2", "Var3",
"Var4", "Var5", "Var6"), class = "data.frame", row.names = c(NA,
-5L))
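To make the steps inside f easier to follow, here is the same logic unrolled on a single made-up column (the values mirror Var1, with one NA added for illustration). The comments show what each intermediate object holds.

```r
x <- c(11L, 14L, 44L, 11L, NA, 14L)

u <- unique(na.omit(x))              # 11 14 44
is.twice <- u %% 10 == u %/% 10      # TRUE FALSE TRUE
uu <- u[is.twice]                    # 11 44  (doubled numbers)
other <- u[!is.twice]                # 14     (the mixed number)

labels <- c(0, 2)[seq_along(uu)]     # 0 2
if (length(other) > 0) labels <- c(labels, 1)   # 0 2 1

# look up each element of x in c(11, 44, 14) and take its label
result <- labels[match(x, c(uu, other))]
result
```

Note that the label 0 goes to whichever doubled number appears first in the column, and NA stays NA because match returns NA for it.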
Update: Fixed.
