Convert Yes/No/Absent data into Binary Matrix [duplicate] - r

This question already has answers here:
Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
(10 answers)
Closed 6 years ago.
I have a voting data set in which each person voted yes or no on a number of policies, or was absent at the time of the vote on that policy.
There are 23 policies in total, but I have no idea how to convert the data into binary.
The data are coded as "n" = no, "y" = yes and "a" = absent.
If anyone could lend me a hand converting the data to a binary matrix in R, I would appreciate it!

This may be done using model.matrix. Note, this is done automatically for you in many cases in R, e.g. regression analysis.
> set.seed(1)
> (df <- data.frame(id=1:10,vote=sample(c("yes","no","absent"),10,replace=TRUE)))
id vote
1 1 yes
2 2 no
3 3 no
4 4 absent
5 5 yes
6 6 absent
7 7 absent
8 8 no
9 9 no
10 10 yes
> model.matrix(~.-1,df)
id voteabsent voteno voteyes
1 1 0 0 1
2 2 0 1 0
3 3 0 1 0
4 4 1 0 0
5 5 0 0 1
6 6 1 0 0
7 7 1 0 0
8 8 0 1 0
9 9 0 1 0
10 10 0 0 1
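The question has one column per policy rather than a single vote column. A minimal sketch of how the same model.matrix idea might extend to that layout follows, assuming each policy column is a factor with levels "a", "n" and "y". The data frame votes and the column names policy1 and policy2 are hypothetical. Passing identity contrasts through contrasts.arg asks for an indicator column for every level of every factor, not just the first one.

```r
# Hypothetical voting data: one factor column per policy
lv <- c("a", "n", "y")
votes <- data.frame(
  policy1 = factor(c("y", "n", "a", "y"), levels = lv),
  policy2 = factor(c("n", "n", "y", "a"), levels = lv)
)

# Identity (non-reduced) contrasts for every factor column
full <- lapply(votes, contrasts, contrasts = FALSE)

# One 0/1 indicator column per policy/level combination
mm <- model.matrix(~ . - 1, data = votes, contrasts.arg = full)
mm
```

Each row of mm then contains exactly one 1 per policy.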

For example:
m <- cbind(c('y','y','y'),c('n','n','n'),c('a','a','a'))
m[m == 'y'] <- 1
m[m == 'n'] <- 0
m[m == 'a'] <- NA
# m is still a character matrix at this point; convert it to numeric
mode(m) <- "numeric"

Related

Data Frame- Add number of occurrences with a condition in R

I'm struggling to figure out how to do the following. I want to count how many consecutive days of high sales I had immediately before a price change. For example, I have a price change on day 10, and the high sales indicator flags any day with sales greater than or equal to 10. I need my algorithm to count the number of consecutive high-sales days leading up to the change.
In this case it should return 5 (days 5 to 9).
For example purposes, the data frame is called df. Code:
#trying to create a while loop that will check if lag(high_sales) is 1, if yes it will count until
#there's a lag(high_sales) ==0
#loop is just my dummy variable that will take me out of the while loop
count_sales<-0
loop<-0
df<- df %>% mutate(consec_high_days= ifelse(price_change > 0, while(loop==0){
if(lag(High_sales_ind)==1){
count_sales<-count_sales +1}
else{loop<-0}
count_sales},0))
day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0
This is my error message:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.
This is a very inelegant brute-force answer, though hopefully someone better than me can provide a more elegant answer - but to get the desired dataset, you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
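library(dplyr) # provides %>%, mutate() and lag()
# position of each row within its run of high-sales (or non-high-sales) days
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# run length reached on the day before each row
df <- df %>% mutate(lseq = lag(seq))
# run length just before the price change, and the last row of that run
# (assumes a single price change in the data)
keepz <- df[df$price_change != 0, "lseq"]
end <- as.numeric(rownames(df[df$price_change != 0,]))-1
# the run starts keepz - 1 rows before it ends; drop the helper columns
df_want <- df[(end - keepz + 1):end,-c(6:7)]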
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1
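If only the count is needed, as the question asks, rather than the rows themselves, one possible base R sketch is shown here. It assumes High_sales_ind is already 0/1 and reuses the column name consec_high_days from the question. The helper names streak and prev_streak are made up for illustration.

```r
df <- data.frame(
  day            = 1:11,
  price_change   = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0),
  High_sales_ind = c(1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0)
)

# Running length of the current high-sales streak (0 on non-high days)
runs   <- rle(df$High_sales_ind)
streak <- sequence(runs$lengths) * df$High_sales_ind

# Streak length as of the previous day
prev_streak <- c(0, head(streak, -1))

# Report the streak only on rows where the price changed
df$consec_high_days <- ifelse(df$price_change != 0, prev_streak, 0)
df
```

For the sample data, consec_high_days is 5 on day 10 and 0 elsewhere.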

Update value in a subset dataframe using index of for loop [duplicate]

This question already has an answer here:
Bulk update in subset obtained from dataframe filtering [duplicate]
(1 answer)
Closed 3 years ago.
My use case requires filtering a data frame on some condition. Once I have the subset, I want to traverse it one row at a time, check a certain condition, and update a value in that particular row.
Here is my implementation:
> sales_data[sales_data$month == 1 & sales_data$dept_name == 1,]
emp_name month dept_name revenue status n_points x_partition y_partition x y
1 Sam 1 1 100 Low 9 3 3 0 0
7 Kenneth 1 1 500 Very High 9 3 3 0 0
11 Jonathan 1 1 500 Low 9 3 3 0 0
12 Sam 1 1 100 Low 9 3 3 0 0
18 Kenneth 1 1 500 Very High 9 3 3 0 0
22 Jonathan 1 1 500 Low 9 3 3 0 0
23 Sam 1 1 100 Low 9 3 3 0 0
29 Kenneth 1 1 500 Very High 9 3 3 0 0
33 Jonathan 1 1 500 Low 9 3 3 0 0
Now, my subset dataframe has 9 rows. So, a for loop:
for(i in 1:nrow(sales_data[sales_data$month == 1 & sales_data$dept_name == 1, ] )) {
#Here I want to update the value of column named x with i
sales_data[sales_data$month == month_item & sales_data$dept_name == dept_item, ][i]$x <- x_vector_data[i] ##NOT CORRECT APPROACH
}
Why loop? You can assign to the filtered rows directly, provided x_vector_data has one value per matching row:
sales_data[sales_data$month == 1 & sales_data$dept_name == 1, "x"] <- x_vector_data
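A small self-contained sketch of this subset assignment is below. The toy values in sales_data and x_vector_data are invented for illustration, and the length check is an added safeguard rather than part of the original answer.

```r
# Toy data: two rows match month == 1 and dept_name == 1
sales_data <- data.frame(
  month     = c(1, 1, 2, 1),
  dept_name = c(1, 2, 1, 1),
  x         = 0
)
x_vector_data <- c(10, 20)

# Logical index of the rows to update
sel <- sales_data$month == 1 & sales_data$dept_name == 1

# Guard against silent recycling when the lengths differ
stopifnot(sum(sel) == length(x_vector_data))

# Values are written into the matching rows in row order
sales_data[sel, "x"] <- x_vector_data
sales_data
```

Storing the logical index once in sel also avoids repeating the filter expression on both sides of the assignment.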

How to create a complex running calculation on an R data table

I want to create a running calculation that includes logic to restart the running sum when the value goes negative. Initially I have a data table or data frame like the one below:
df <- data.frame(value1 = c(0,0,10,0,1,0,2,0)
, value2 = c(5,1,2,6,8,3,7,2))
value1 value2
0 5
0 1
10 2
0 6
1 8
0 3
2 7
0 2
I would like to keep a running total that adds value2 and subtracts value1 at each row. However, if the new total is less than 0, the running calculation should start over from that row's value2.
i.e. end up with
value1 value2 newvalue
0 5 5
0 1 6
10 2 2
0 6 8
1 8 15
0 3 18
2 7 23
0 2 25
I tried multiple attempts with data.table and dplyr packages with no luck.
EDIT: Updated df to match the actual table shown.
I am sure there are other simpler ways to do this by tweaking cumsum or other such functions, but I came up with this basic loop to produce the desired output. Hope it helps !!
> df
GroupID value1 value2
1 1 0 5
2 1 0 1
3 1 10 2
4 2 0 6
5 2 1 8
6 3 0 3
7 3 2 7
8 3 0 2
for(i in 1:nrow(df)) {
if(i == 1) {
df$newvalue[i] <- df$value2[i]
} else {
df$newvalue[i] <- (df$newvalue[i-1] + df$value2[i]) - df$value1[i]
if(df$newvalue[i] < 0 | df$GroupID[i] != df$GroupID[i-1]) {
df$newvalue[i] <- df$value2[i]
}
}
}
> df
GroupID value1 value2 newvalue
1 1 0 5 5
2 1 0 1 6
3 1 10 2 2
4 2 0 6 6
5 2 1 8 13
6 3 0 3 3
7 3 2 7 8
8 3 0 2 10
I believe that explicitly looping through the data frame is the most straightforward way to calculate this type of conditional cumulative sum. Sagar's solution was very helpful to me (I up-voted but do not have enough reputation points for it to count).
In my experience, newvalue needs to be initialized before the loop starts in order for it to work properly. Below is how I would approach this:
df$newvalue <- df$value2
for(i in 2:nrow(df)) {
if(df$GroupID[i] == df$GroupID[i-1]) {
df$newvalue[i] <- max(df$newvalue[i-1] + df$value2[i] - df$value1[i], df$value2[i])
}
}
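For comparison, the same kind of resetting running total can also be written without an explicit for loop by using base R's Reduce with accumulate = TRUE. This is only a sketch on the question's original data (no GroupID column). The helper name step is hypothetical, and the reset rule is the one stated in the question: restart from value2 when the new total would be negative.

```r
df <- data.frame(value1 = c(0, 0, 10, 0, 1, 0, 2, 0),
                 value2 = c(5, 1, 2, 6, 8, 3, 7, 2))

# One update of the running total for row i, given the previous total acc
step <- function(acc, i) {
  candidate <- acc + df$value2[i] - df$value1[i]
  if (candidate < 0) df$value2[i] else candidate
}

# Start from value2[1], then fold over rows 2..n, keeping every intermediate total
df$newvalue <- Reduce(step, seq_len(nrow(df))[-1],
                      accumulate = TRUE, df$value2[1])
df
```

On this data the result matches the newvalue column requested in the question (5, 6, 2, 8, 15, 18, 23, 25).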

How to find and remove columns containing more than k consecutive zeros in R data.frame?

I have a huge data.frame with around 200 variables, each represented by a column. Unfortunately, the data is sourced from a poorly formatted data dump (and hence can't be modified) which represents both missing values and zeroes as 0.
The data has been observed every 5 minutes for a month, and a day-long period of only 0s can be reasonably thought of as a day where the counter was not functioning, thereby leading to the conclusion that those 0s are actually NAs.
I want to find (and remove) columns that have at least 288 consecutive 0s at any point. Or, more generally, how can we remove columns from a data.frame containing >=k consecutive 0s?
I'm relatively new to R, and any help would be greatly appreciated. Thanks!
EDIT: Here is a reproducible example. Considering k=4, I would like to remove columns A and B (but not C, since the 0s are not consecutive).
df<-data.frame(A=c(4,5,8,2,0,0,0,0,6,3), B=c(3,0,0,0,0,6,8,2,1,0), C=c(4,5,6,0,3,0,2,1,0,0), D=c(1:10))
df
A B C D
1 4 3 4 1
2 5 0 5 2
3 8 0 6 3
4 2 0 0 4
5 0 0 3 5
6 0 6 0 6
7 0 8 2 7
8 0 2 1 8
9 6 1 0 9
10 3 0 0 10
You can use this function on your data:
cons.Zeros <- function (x, n)
{
x <- x[!is.na(x)] == 0
r <- rle(x)
any(r$lengths[r$values] >= n)
}
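This function returns TRUE for the columns that need to be dropped. n is the minimum number of consecutive zeros for which a column is dropped. Note that NA values are removed before the runs are counted, so zeros separated only by NAs count as consecutive.
For your sample dataset let's use n = 3 (n = 4, as in the question, gives the same result here):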
df.dropped <- df[, !sapply(df, cons.Zeros, n=3)]
#output:
# > df.dropped
# C D
# 1 4 1
# 2 5 2
# 3 6 3
# 4 0 4
# 5 3 5
# 6 0 6
# 7 2 7
# 8 1 8
# 9 0 9
# 10 0 10
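A variation on the same rle idea, sketched with the question's k = 4, is shown below. Two choices here are assumptions rather than part of the answer above: NAs are treated as breaking a run of zeros instead of being removed, and drop = FALSE is added so the result stays a data frame even if only one column survives. The function name has_zero_run is hypothetical.

```r
df <- data.frame(A = c(4, 5, 8, 2, 0, 0, 0, 0, 6, 3),
                 B = c(3, 0, 0, 0, 0, 6, 8, 2, 1, 0),
                 C = c(4, 5, 6, 0, 3, 0, 2, 1, 0, 0),
                 D = 1:10)

# TRUE if x contains at least k consecutive zeros (an NA ends a run)
has_zero_run <- function(x, k) {
  r <- rle(!is.na(x) & x == 0)
  any(r$lengths[r$values] >= k)
}

# Flag columns to drop, then keep the rest as a data frame
to_drop <- vapply(df, has_zero_run, logical(1), k = 4)
kept <- df[, !to_drop, drop = FALSE]
names(kept)
```

For the real data, k = 288 would correspond to one day of 5-minute observations.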

Data Cleaning for Survival Analysis

I’m in the process of cleaning some data for a survival analysis, and I am trying to make it so that each individual has only a single, sustained transition from symptom present (ss=1) to symptom remitted (ss=0). A remission counts only if it is complete and sustained. Statistical issues aside, I’m wondering how to address the problems detailed below.
I’ve been trying to break the problem into smaller, more manageable operations and objects. However, every solution I come up with forces me to fill in a missing value conditionally, based on the rows immediately above and below it, and quite frankly I’m at a loss as to how to do this. I would appreciate some guidance if you know of a good technique I can experiment with, or good search terms for looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0
ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0
ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
I don't think you have considered all the edge cases: what should happen with two NAs in a row at the end of a period, or with four or five NAs in a row? That said, the na.locf function from zoo reproduces the requested result for IDs 1 to 3 of your tiny test case. For ID 4 it fills the NAs with 0, giving 1,1,0,0,0,0,0 rather than the 1,1,1,1,1,0,0 you asked for:
require(zoo)
fillNA <- function(vec) {
  # leave the series untouched when it ends in NA; otherwise carry the last observation forward
  if (is.na(tail(vec, 1))) vec else na.locf(vec)
}
> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
id time ss locf
1 1 0 1 1
2 1 1 1 1
3 1 2 1 1
4 1 3 1 1
5 1 4 NA 1
6 1 5 0 0
7 1 6 0 0
8 2 0 1 1
9 2 1 1 1
10 2 2 0 0
11 2 3 NA 0
12 2 4 0 0
13 2 5 0 0
14 2 6 0 0
15 3 0 1 1
16 3 1 1 1
17 3 2 1 1
18 3 3 1 1
19 3 4 1 1
20 3 5 1 1
21 3 6 NA NA
22 4 0 1 1
23 4 1 1 1
24 4 2 0 0
25 4 3 NA 0
26 4 4 NA 0
27 4 5 0 0
28 4 6 0 0
