Related
I am trying to combine several binary variables into one categorical variable. I have ten categorial variables, each describing tasks of a job.
Data looks something like this:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
# etc.
My goal is to combine them into one variable, where the value 1 (=Yes) of each binary variable will be a seperate level of the categorical variable.
To illustrate what I imagine (wrong code obviously):
If Personal_Help = 1 -> Jobcontent = 1
If PR = 1 -> Jobcontent = 2
If Fundraising = 1 -> Jobcontent = 3
etc.
Thank you very much in advance!
Edit:
Thanks for your Answers and apologies for my late answer. I think more context from my side is needed. The goal of combining the binary variables into a categorical variable is to print them into one graphic (using ggplot). The graphic should display how many respondants report the above mentioned tasks as part of their work.
if you're interested only in the first occurrence of 1 among your variables:
df <- data.frame(t(data.frame(Personal_Help, PR,Fundraising)))
result <- sapply(df, function(x) which(x==1)[1])
X1 X2 X3 X4 X5 X6
1 1 2 1 2 1
Of course, this will depend on what you want to do when multiple values are 1 as asked in comments.
Since there are three different variables, and each variable can take either of 2 values, there are 2^3 = 8 possible unique combinations of the three variables, each of which should have a unique number associated.
One way to do this is to imagine each column as being a digit in a three digit binary number. If we subtract 1 from each column, we get a 1 for "no" and a 0 for "yes". This means that our eight possible unique values, and the binary numbers associated with each would be:
binary decimal
0 0 0 = 0
0 0 1 = 1
0 1 0 = 2
0 1 1 = 3
1 0 0 = 4
1 0 1 = 5
1 1 0 = 6
1 1 1 = 7
This system will work for any number of columns, and can be achieved as follows:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
df <- data.frame(Personal_Help, PR, Fundraising)
New_var <- 0
for(i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)
df$New_var <- New_var
The end result would then be:
df
#> Personal_Help PR Fundraising New_var
#> 1 1 2 1 2
#> 2 1 1 2 4
#> 3 2 1 1 1
#> 4 1 2 2 6
#> 5 2 1 2 5
#> 6 1 2 1 2
In your actual data, there will be 1024 possible combinations of tasks, so this will generate numbers for New_var between 0 and 1023. Because of how it is generated, you can actually use this single number to reverse engineer the entire row as long as you know the original column order.
As #ulfelder commented, you need to clarify how you want to handle cases where more than one column is 1.
Assuming you want to use the first column equal to 1, you can use which.min(), applied by row:
data <- data.frame(Personal_Help, PR, Fundraising)
data$Jobcontent <- apply(data, MARGIN = 1, which.min)
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 1
2 1 1 2 1
3 2 1 1 2
4 1 2 2 1
5 2 1 2 2
6 1 2 1 1
If you’d like Jobcontent to include the name of each job, you can index into names(data):
data$Jobcontent <- names(data)[apply(data, MARGIN = 1, which.min)]
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 Personal_Help
2 1 1 2 Personal_Help
3 2 1 1 PR
4 1 2 2 Personal_Help
5 2 1 2 PR
6 1 2 1 Personal_Help
max.col may help here:
Jobcontent <- max.col(-data.frame(Personal_Help, PR, Fundraising), "first")
Jobcontent
#> [1] 1 1 2 1 2 1
This question already has answers here:
Count the number of non-zero elements of each column
(4 answers)
Closed 3 years ago.
I have a dataframe like so:
df
ID Black red blue green
1 0 1 1 1
2 0 0 2 1
3 0 5 1 2
4 0 0 1 2
5 0 0 1 2
with this data, how can I count the columns that have any number regardless of the actual number? Is there any way to do this?
I tried rowSums(df) but result is not what I want.
My desired output is:
ID Black red blue green count
1 0 1 1 1 3
2 0 0 2 1 2
3 0 5 1 2 3
4 0 0 1 2 2
5 0 0 1 2 2
thanks in advance!
I think this should work for you
data = data.frame(Black = rep(0, 5),
Red = c(1,0,5,0,0),
Blue = c(1,2,1,1,1),
Green = c(1,1,2,2,2))
data$count = apply(data, 1, FUN = function(x) length(x[x>0]))
Another solution:
library(dplyr)
data <-
data.frame(Black = rep(0, 5),
Red = c(1,0,5,0,0),
Blue = c(1,2,1,1,1),
Green = c(1,1,2,2,2)
)
data$count <-
data %>%
apply(2, as.logical) %>%
rowSums
this produces the same result as the answer by ccharles, but utilizes a trick that I think is broadly useful. By coercing all values to the logical class (i.e. TRUE or FALSE), you can collapse all zeros into FALSE, and any other number into TRUE. Numeric functions treat logical values as TRUE = 1 and FALSE = 0, so you can then use rowSums() to get a count for each row.
I need to find and add a column with value '1':
For each group value of variable z_axis, the row with the minimum x_axis (if there is a tie, the one with minimum y_axis)
For each group value of variable z_axis, the row with the minimum y_axis (if there is a tie, the one with minimum x_axis)
So for the following DT,
set.seed(2)
library(data.table)
test_surfaces<-data.table(
x_axis=c(sample(1:10,20,replace = T)),
y_axis=c(sample(1:10,20,replace = T)),
z_axis=c(sample(1:3,20,replace=T)),
avail=c(sample(0:1,20,replace=T))
)
I tried
test_surfaces[,.SD[which(y_axis==min(y_axis))][1,candidate:=1],by=z_axis]
test_surfaces[,.SD[which(x_axis==min(x_axis))][1,candidate:=1],by=z_axis]
which gives output
z_axis x_axis y_axis avail cand
1: 3 2 2 0 1
2: 3 6 2 0 NA
3: 3 1 2 1 NA
4: 1 6 1 0 1
5: 2 9 4 0 1
which is ok for finding the minimum by group. But
It does not compare x_axises and select the one with minimum y_axis (i.e. it should select row 3, instead of row 1 for z_axis=3
Data table is not modified permanently (i.e. column 'cand' is not permanenet). For this, i know I can use join(), but it's not elegant.
You could try to sort by x_axis and y_axis and then simply assign the first value to be 1 for each group. This will place the lowest x_axis/y_axis combination first for each group.
test_surfaces[order(z, x_axis, y_axis),
candidate:=rep(c(1,0), times = c(1, .N - 1)), by = z]
test_surfaces[order(z, y_axis, x_axis), candidate:=c(1, candidate[-1]), by = z]
head(test_surfaces[order(z, x_axis, y_axis)])
# x_axis y_axis z avail candidate
#1: 2 2 1 0 1
#2: 3 2 1 0 0
#3: 5 6 1 0 0
#4: 6 1 1 0 1
#5: 6 9 1 1 0
#6: 8 4 1 1 0
I'm trying to assess the performance of a simple prediction model using R, by discretizing the prediction results by binning them into defined intervals and then compare them with the corresponding actual values(binned).
I have two vectors actual and predicted as shown:
> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
I need to perform binning here. First off, the values of 'actual' are factorized/discretized into different levels, say:
0-5: Level 1 ** 6-10: Level 2 ** ... ** 41-45: Level 9
Now, I've to bin the values of 'predicted' also into the above mentioned buckets.
I tried to achieve this using the cut() function in R:
binCount <- 5
binActual <- cut(actual,labels=1:binCount,breaks=binCount)
binPred <- cut(predicted,labels=1:binCount,breaks=binCount)
However, if you see the second element in predicted (98.01) is labelled as 5, but it doesn't actually fall in the desired interval.
I feel that using a different binCount for predicted will not help.Can anyone please suggest a solution for this ?
I'm not 100% sure of what you want to do.
However from what I understand you want to return for each element of each vector the class it would be in. Given a set of class that takes into account any value from any of the two vectors actual and predicted.
If it is what you want to do, then your script (as you say) creates classes for values between 0 and 45. With this cut you class your first vector.
Then you create a new set of classes for your vector predicted.
The classification is not the same anymore.
Assuming that I understood what you want to do, I'd rather write :
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
temporary = c(actual, predicted)
maxi <- max(temporary)
mini <- min(temporary)
binCount <- 5
s <- seq(maxi, mini, length.out = binCount)
s = sort(s)
binActual <- cut(actual,breaks=s, include.lowest = T, labels = 1:(length(s)-1))
binPred <- cut(predicted,breaks=s, include.lowest = T, labels = 1:(length(s)-1))
It gives :
> binActual
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
> binPred
[1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
I'm not sure it is what you're looking for, so let me know, I might be able to help you.
Best wishes.
Is this what you want?
intervals <- cbind(seq(0, 40, length = 9), seq(5, 45, length = 9))
cutFixed <- function(x, intervals) {
sapply(x, function(x) ifelse(x < min(intervals) | x >= max(intervals), NA, which(x >= intervals[,1] & x < intervals[,2])))
}
This gives the following result
> cutFixed(actual, intervals)
[1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
[1] 1 NA 1 1 7 1 1 1 1 1 1 3 1 2 1 1 1 2 1
How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't get to 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions and half of them do nothing, the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set)
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This excel function is displayed without the 5-absence lock for easy readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use the previous element's value, you might want to check out "Reduce", with the accumulate parameter set to True
Here's an example:
##define your function that takes two parameters
##these are the 'previous' and the 'current' elements
runSum <- function(sum, x){
res = 0
if (x == 1){
res = sum + 1
}
else if (x == 0 & sum < 5){
res = 0
}
else{
res = sum
}
res
}
#lets look at the absent values from subject B
x = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x=x, f=runSum, accumulate=T)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) rle(x)$value[rle(x)$length >=5] )
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) 1 %in% rle(x)$value[rle(x)$length >=5] )
A B
FALSE TRUE
I also struggled through to a Reduce solution (but am second in priority to #kithpradhan):
ave(dfrm$Absent, dfrm$Student,
FUN= function(XX)
Reduce(function(x,y) if( x[1] >= 5){ y+x[1]
} else{ x[1]*y+y } , #Resets to 0 if y=0
XX, accumulate=TRUE)
)
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce = function(f, x, maxiter = 50){
i = 1
while(!identical(f(x), x) & i <= maxiter) {x = f(x); i = i+1}; x
}