Create new variable based on the value of several other variables - r

I have a data set with multiple variables that I want to use to create a new variable. I have seen other questions like this that use the ifelse statement, but that would be extremely inefficient here, since the new variable is based on 32 other variables. The variables are coded with values of 1, 2, 3, or NA, and I want the new variable to be coded as 1 if 2 or more of the 32 variables take on a value of 1, and 2 otherwise. Here is a small example of what I have been trying to do.
df <- data.frame(id = 1:10, v1 = c(1,2,2,2,3,NA,2,2,2,2), v2 = c(2,2,2,2,2,1,2,1,2,2),
v3 = c(1,2,2,2,2,3,2,2,2,2), v4 = c(2,2,2,2,2,1,2,2,2,3))
and the result I am looking for is this:
id v1 v2 v3 v4 new
1 1 1 2 1 2 1
2 2 2 2 2 2 2
3 3 2 2 2 2 2
4 4 2 2 2 2 2
5 5 3 2 2 2 2
6 6 NA 1 3 1 1
7 7 2 2 2 2 2
8 8 2 1 2 2 2
9 9 2 2 2 2 2
10 10 2 2 2 3 2
I have also tried using rowSums within the ifelse statement, but because of the missing values this doesn't work for all observations unless I recode the NAs to another value, which I want to avoid. Besides that, I feel like there should be a much more efficient way of doing this.
I feel like it is likely that this question has been answered before, but I couldn't find anything on it. So help or direction to a previous answer would be appreciated.

It looks like you were very close to getting your desired output, but you were probably missing the na.rm = TRUE argument as part of your rowSums() call. This will remove any NAs before rowSums does its calculations.
Anyway, using your data frame from above, I created a new variable that counts the number of times 1 appears across the variables, while ignoring NA values. Note that I've subsetted the data to exclude the id column:
df$count <- rowSums(df[-1] == 1, na.rm = TRUE)
Then I created another variable using an ifelse statement that returns a 1 if the count is 2 or more or a 2 otherwise.
df$var <- ifelse(df$count >= 2, 1, 2)
The returned output:
id v1 v2 v3 v4 count var
1 1 1 2 1 2 2 1
2 2 2 2 2 2 0 2
3 3 2 2 2 2 0 2
4 4 2 2 2 2 0 2
5 5 3 2 2 2 0 2
6 6 NA 1 3 1 2 1
7 7 2 2 2 2 0 2
8 8 2 1 2 2 1 2
9 9 2 2 2 2 0 2
10 10 2 2 2 3 0 2
UPDATE / EDIT: As mentioned by Gregor in the comments, you can also just wrap the rowSums function in the ifelse statement for one line of code.
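A minimal sketch of that one-line version, using the example data from the question. The column name new and the grep() pattern for picking the variables by name are assumptions for illustration; adjust them to however your 32 columns are actually named.

```r
df <- data.frame(id = 1:10, v1 = c(1,2,2,2,3,NA,2,2,2,2), v2 = c(2,2,2,2,2,1,2,1,2,2),
                 v3 = c(1,2,2,2,2,3,2,2,2,2), v4 = c(2,2,2,2,2,1,2,2,2,3))

# Assumption: the variables of interest are named v1, v2, ... (v1 to v32 in the real data)
vars <- grep("^v[0-9]+$", names(df), value = TRUE)

# Count the 1s per row (ignoring NAs) and recode in a single step
df$new <- ifelse(rowSums(df[vars] == 1, na.rm = TRUE) >= 2, 1, 2)
df$new
```

Selecting the columns by name rather than with df[-1] is only a precaution in case the data frame later gains other columns that should not be counted.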

Related

How to use an if statement to fill two columns related to number of occurrences of interested values [duplicate]

I have a data set which looks something like
data<-c(0,1,2,3,4,2,3,1,4,3,2,4,0,1,2,0,2,1,2,0,4)
frame<-as.data.frame(data)
I now want to create a new variable within this data frame. If the column "data" reports a number of 2 or more, I want it to have "2" in that row, and if there is a 1 or 0 (e.g. the first two observations), I want the new variable to have a "1" for that observation.
I am trying to do this using the following code:
frame$twohouses<- if (any(frame$data>=2)) {frame$twohouses=2} else {frame$twohouses=1}
However, if I run these 3 lines of script, every observation in the column "twohouses" is coded with a 2, although a number of them should be coded with a 1.
So my question: what am I doing wrong with my if else line or script? Or is there an alternative way to do this?
My question is similar to this one:
Using ifelse on factor in R
but no one has answered that question.
Use ifelse:
frame$twohouses <- ifelse(frame$data>=2, 2, 1)
frame
data twohouses
1 0 1
2 1 1
3 2 2
4 3 2
5 4 2
...
16 0 1
17 2 2
18 1 1
19 2 2
20 0 1
21 4 2
The difference between if and ifelse:
if is a control flow statement, taking a single logical value as an argument
ifelse is a vectorised function, taking vectors as all its arguments.
The help page for if, accessible via ?"if" will also point you to ?ifelse
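To make the difference concrete, here is a small sketch on a made-up vector (x, scalar and vec are names chosen only for this illustration). It shows why the if version in the question fills the whole column with a single value.

```r
x <- c(0, 1, 2, 3)

# any() collapses the whole comparison to one logical value,
# so if() is evaluated once and yields a single result
scalar <- if (any(x >= 2)) 2 else 1

# ifelse() evaluates the condition element by element
vec <- ifelse(x >= 2, 2, 1)

scalar
vec
```

Assigning scalar to a data frame column recycles that one value down every row, which is the behaviour described in the question.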
Try this
frame$twohouses <- ifelse(frame$data>1, 2, 1)
frame
data twohouses
1 0 1
2 1 1
3 2 2
4 3 2
5 4 2
6 2 2
7 3 2
8 1 1
9 4 2
10 3 2
11 2 2
12 4 2
13 0 1
14 1 1
15 2 2
16 0 1
17 2 2
18 1 1
19 2 2
20 0 1
21 4 2

Retain Max Value of Vector until vector Catches up

I have some cumulative count data. Because of reporting inaccuracies, sometimes the cumulative sum decreases, such as 0 1 2 2 3 3 2 4 5.
I would like to create a new vector that retains the largest value reported and carries it forward until the cumulative count data catches up. So the corrected version of the above would be 0 1 2 2 3 3 3 4 5.
I tried the following
mydf <- data.frame(ts1 = c(0,1,1,1,2,3,2,2,3,4,4,5))
mydf$lag1 <- dplyr::lag(mydf[,1])  # assumes dplyr's lag(); stats::lag() does not shift a plain vector
mydf$corrected <- ifelse(is.na(mydf[,2]),mydf[,1],
ifelse(mydf[,2] > mydf[,1], mydf[,2], mydf[,1]))
which returns:
ts1 lag1 corrected
1 0 NA 0
2 1 0 1
3 1 1 1
4 1 1 1
5 2 1 2
6 3 2 3
7 2 3 3
8 2 2 2
9 3 2 3
10 4 3 4
11 4 4 4
12 5 4 5
This works the first time a value is smaller than the previous one (row 7), but it fails the second time (row 8), because the lagged column only looks one step back.
I think there must be a better way of doing this: a new vector that equals the input vector unless the value decreases, in which case it retains the prior maximum until the input vector exceeds that retained value.
You are looking for cummax:
cummax(mydf$ts1)
#[1] 0 1 1 1 2 3 3 3 3 4 4 5
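As a quick check, here is a sketch that applies cummax to the shorter example from the question text and compares it with an explicit running maximum built with Reduce(). The variable names are chosen for illustration only.

```r
ts1 <- c(0, 1, 2, 2, 3, 3, 2, 4, 5)

# Running maximum: each element is the largest value seen so far
corrected <- cummax(ts1)

# The same idea spelled out: accumulate max() from left to right
running_max <- Reduce(max, ts1, accumulate = TRUE)

corrected
```

One thing to keep in mind: if the input contains an NA, cummax returns NA from that position onwards, so missing values would need handling beforehand.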

R - Count duplicates values for each row

I'm working on a data frame for which I need to calculate Fleiss' kappa for inter-rater agreement. I'm using the 'irr' package for that.
Besides that, I need to count, for each observation, how many raters are in agreement.
My data looks like this:
a b c
1 1 1 1
2 1 2 2
3 2 3 2
4 3 3 1
5 4 2 1
I'm expecting something like this, where count stands for the number of raters in agreement:
a b c count
1 1 1 1 3
2 1 2 2 2
3 2 3 2 2
4 3 3 1 2
5 4 2 1 0
Thanks a lot.
Alternative solution if your data is in a data frame called abc:
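as.numeric(apply(abc, 1, function(x) {
  ux <- unique(x)                 # distinct ratings in this row
  tab <- tabulate(match(x, ux))   # how often each distinct rating occurs
  mode <- ux[tab == max(tab)]     # most frequent rating(s)
  # a unique mode gives its frequency; no unique mode (e.g. all ratings differ) gives NA
  ifelse(length(mode) == 1, length(which(x == mode)), NA_character_)
}))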
When you run it gives:
[1] 3 2 2 2 NA
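If you would rather get 0 than NA when no two raters agree, as in the expected output of the question, one possible sketch is to take the largest frequency from table() in each row. The helper name count_agree is hypothetical, and abc is rebuilt here from the question's example.

```r
abc <- data.frame(a = c(1, 1, 2, 3, 4),
                  b = c(1, 2, 3, 3, 2),
                  c = c(1, 2, 2, 1, 1))

# Hypothetical helper: size of the largest group of identical ratings,
# or 0 when every rating in the row is different
count_agree <- function(x) {
  top <- max(table(x))
  if (top > 1) top else 0
}

abc$count <- apply(abc, 1, count_agree)
abc
```

Note that this differs from the code above for ties: a row such as 1 1 2 2 would give 2 here, whereas the mode-based version returns NA because there is no unique mode.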

paste values within categories defined by multiple columns

I want to pivot the result column in df horizontally, creating a data set with a separate row for each region, state, county combination, where the columns are ordered by year then city.
I also want to identify each row in the new data set by region, state and county and remove the white space between the four results columns. The code below does all of that, but I suspect it is not very efficient.
Is there a way to do this with reshape2 without creating a unique identifier for each group and numbering observations within each group? Is there a way to use apply in place of the for-loop to remove white space from a matrix? (Matrix here being used in a different manner than a mathematical or programming construct.) I realize those are two separate questions and maybe I should post each question separately.
Given that I can achieve the desired result and am only looking to improve the code I do not know whether I should even post this, but I am hoping to learn. Thanks for any advice.
df <- read.table(text= "
region state county city year result
1 1 1 1 1 1
1 1 1 2 1 2
1 1 1 1 2 3
1 1 1 2 2 4
1 1 2 3 1 4
1 1 2 4 1 3
1 1 2 3 2 2
1 1 2 4 2 1
1 2 1 1 1 0
1 2 1 2 1 NA
1 2 1 1 2 0
1 2 1 2 2 0
1 2 2 3 1 2
1 2 2 4 1 2
1 2 2 3 2 2
1 2 2 4 2 2
2 1 1 1 1 9
2 1 1 2 1 9
2 1 1 1 2 8
2 1 1 2 2 8
2 1 2 3 1 1
2 1 2 4 1 0
2 1 2 3 2 1
2 1 2 4 2 0
2 2 1 1 1 2
2 2 1 2 1 4
2 2 1 1 2 6
2 2 1 2 2 8
2 2 2 3 1 3
2 2 2 4 1 3
2 2 2 3 2 2
2 2 2 4 2 2
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
region state county results
1 1 1 1234
1 1 2 4321
1 2 1 0.00
1 2 2 2222
2 1 1 9988
2 1 2 1010
2 2 1 2468
2 2 2 3322
", header=TRUE, colClasses=c('numeric','numeric','numeric','character'))
# redefine variables for package reshape2 creating a unique id for each
# region, state, county combination and then number observations in
# each of those combinations
library(reshape2)
id.var <- df$region*100000 + df$state*1000 + df$county
obsnum <- sequence(rle(id.var)$lengths)
df2 <- dcast(df, region + state + county ~ obsnum, value.var = "result")
# remove spaces between columns of results matrix
# with a for-loop. How can I use apply to do this?
x <- df2[,4:(4+max(obsnum)-1)]
# use a dot to represent a missing observation
x[is.na(x)] = '.'
x.cat = character(nrow(x))
for(i in 1:nrow(x)) {
x.cat[i] = paste(x[i,], collapse="")
}
df3 <- cbind(df2[,1:3],x.cat)
colnames(df3) <- c("region", "state", "county", "results")
df3
df3 == desired.result
EDIT:
Matthew Lundberg's answer below is excellent. Afterwards I realized I also needed to create an output data set in which the four result columns above contain numeric, rational numbers and are separated by a space. So, I have posted an apparent way to do that below that modifies Matthew's answer. I do not know whether this is accepted protocol, but the new scenario seems so immediately related to the original post that I did not think I should post a new question.
I think this does what you want:
df$result <- as.character(df$result)
df$result[is.na(df$result)] <- '.'
aggregate(result ~ county+state+region, data=df, paste0, collapse='')
county state region result
1 1 1 1 1234
2 2 1 1 4321
3 1 2 1 0.00
4 2 2 1 2222
5 1 1 2 9988
6 2 1 2 1010
7 1 2 2 2468
8 2 2 2 3322
This relies on your data frame being sorted in the proper order (as yours is).
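If the rows are not already in that order, one option is to sort with order() before aggregating. A small sketch on a deliberately shuffled subset of the data; df_small, df_sorted and out are names chosen for illustration.

```r
# Four rows from region 1, state 1, year 1, in shuffled order
df_small <- data.frame(region = 1, state = 1,
                       county = c(2, 1, 1, 2),
                       city   = c(4, 2, 1, 3),
                       year   = 1,
                       result = c("3", "2", "1", "4"))

# Sort so that, within each county, rows run by year and then city
df_sorted <- df_small[order(df_small$region, df_small$state, df_small$county,
                            df_small$year, df_small$city), ]

out <- aggregate(result ~ county + state + region, data = df_sorted,
                 paste0, collapse = "")
out
```

Without the sort, county 2 would be pasted in the order the rows happen to appear ("34" instead of "43" in this subset).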
Matthew Lundberg's answer is excellent. Afterwards I realized I also needed to create an output data set in which the four result columns above contain numeric, rational numbers and are separated by a space. So, here I provide an apparent way to do that using a modification of Matthew's answer. I do not know whether this is accepted protocol, but the new scenario seems so immediately related to the original post that I did not think I should post a new question.
The first two lines are modifications of Matthew's answer.
df$result[is.na(df$result)] <- 'NA'
df2 <- aggregate(result ~ county+state+region, data=df, paste)
Then I specify that NA represents missing observations and use apply to obtain the numeric output.
df2$result[df2$result=='NA'] = NA
new.df <- data.frame(df2[,1:3], apply(df2$result,2,as.numeric))
The output is below except note that I added 0.5 to each value in df shown in the original post.
county state region X1 X2 X3 X4
1 1 1 1.5 2.5 3.5 4.5
2 1 1 4.5 3.5 2.5 1.5
1 2 1 0.5 NA 0.5 0.5
2 2 1 2.5 2.5 2.5 2.5
1 1 2 9.5 9.5 8.5 8.5
2 1 2 1.5 0.5 1.5 0.5
1 2 2 2.5 4.5 6.5 8.5
2 2 2 3.5 3.5 2.5 2.5
In my original post I asked how to remove spaces between columns in a data set using apply. That did not prove necessary thanks to Matthew Lundberg's answer to my larger question. Nevertheless, removing spaces between columns of a data set is something I frequently have to do. For completeness, here I post a way to do that using paste0 and apply that arose, in part, from Matthew's answer.
To remove all spaces from the data set x:
x <- read.table(text= "
A B C D
1 1 1 1
1 1 2 2
1 NA 1 3
1 1 2 4
1 2 1 5
1 2 NA 6
1 2 1 7
1 2 2 8
", header=TRUE, na.strings=NA)
# use a dot to represent a missing observation
x[is.na(x)] = '.'
y <- as.data.frame(apply(x, 1, function(i) paste0(i, collapse='')))
colnames(y) <- 'result'
y
Gives:
result
1 1111
2 1122
3 1.13
4 1124
5 1215
6 12.6
7 1217
8 1228
The following code removes the spaces between just the second and third columns:
z <- as.data.frame(apply(x[,2:3], 1, function(i) paste0(i, collapse='')))
y <- data.frame(x[,1], z, x[,4])
colnames(y) <- c('A','BC','D')
y
Giving:
A BC D
1 1 11 1
2 1 12 2
3 1 .1 3
4 1 12 4
5 1 21 5
6 1 2. 6
7 1 21 7
8 1 22 8
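As a possible alternative to the row-wise apply, paste0 is already vectorised over columns, so the same results can be sketched with do.call(). This is only an illustration on the first few rows of the example; the names result and BC are chosen here.

```r
x <- data.frame(A = c(1, 1, 1), B = c(1, NA, 2), C = c(1, 1, NA), D = c(1, 3, 6))

# use a dot to represent a missing observation
x[is.na(x)] <- "."

# Paste all columns together, element by element
result <- do.call(paste0, x)

# Paste only the second and third columns
BC <- paste0(x$B, x$C)

result
BC
```

do.call(paste0, x) passes each column of x as a separate argument to paste0, which avoids converting the data frame to a matrix first.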


Resources