Dummy variables to factor [duplicate]

This question already has answers here:
For each row return the column name of the largest value
(10 answers)
Closed 2 years ago.
Hello, I am trying to create a new variable in my data set that combines the "education" dummies into a single variable with their respective character strings, so I can use education as a factor in a regression model.
I am not certain how to create a new variable "edu" that takes the value "edu4" in the first and second row, and so on.
Help is much appreciated!

Since you did not provide your dataset via dput(), I built a small example myself.
dput(df)
structure(list(id = 1:10, edu1 = c(1, 0, 0, 0, 0, 0, 0, 0, 1,
0), edu2 = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0), edu3 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), edu4 = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
edu5 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA,
-10L))
Solution
df$edu = factor(apply(df[,paste0("edu", 1:5)], 1, which.max))
Result
> df
id edu1 edu2 edu3 edu4 edu5 edu
1 1 1 0 0 0 0 1
2 2 0 0 0 1 0 4
3 3 0 0 0 1 0 4
4 4 0 0 0 0 1 5
5 5 0 0 0 1 0 4
6 6 0 1 0 0 0 2
7 7 0 0 0 0 1 5
8 8 0 1 0 0 0 2
9 9 1 0 0 0 0 1
10 10 0 0 0 0 1 5
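If you would rather have the levels labelled edu1 ... edu5 (closer to what the question asks for) instead of 1 ... 5, a small variation of the same idea, my own addition using the example df above, is:
df$edu <- factor(apply(df[, paste0("edu", 1:5)], 1, which.max),
                 levels = 1:5,
                 labels = paste0("edu", 1:5))
# note: which.max() returns 1 for an all-zero row, so such rows would be labelled edu1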

Try this: df is your data frame, and the code assumes your edu dummies sit in columns 8 to 12. If all the edu dummies in a row are 0, the matrix product is 0, so the level edu1 will be assigned. Note the as.matrix(): %*% does not accept a data frame directly.
factor_variable <- factor(as.matrix(df[, 8:12]) %*% (1:ncol(df[, 8:12])) + 1,
                          labels = c("edu1", colnames(df[, 8:12])))
Let me know if this worked.
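For reference, applied to the small example data above (where the dummies sit in columns 2 to 6), the same trick would look roughly like this; the helper names m and edu_lab are mine, and the explicit levels are my addition to keep the labels valid even when some codes never occur in the data:
m <- as.matrix(df[, 2:6])                      # the edu1..edu5 dummies
df$edu_lab <- factor(m %*% seq_len(ncol(m)) + 1,
                     levels = seq_len(ncol(m) + 1),
                     labels = c("none", colnames(m)))  # "none" = no dummy set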

Related

Replace values in a column unless there's already a "1" there

I have data like this:
df<-structure(list(a = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), b = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0)), row.names = c(NA, -19L), class = c("tbl_df",
"tbl", "data.frame"))
I would like to replace values in column A based on column B. If column B has a "1" in it, I want to replace the row in column A with a 1.
I know this can do that:
df <- df %>% mutate(a = ifelse(str_detect(b, "1"), 1, 0))
The problem is that this replaces everything in column A based on that rule, overwriting what was already there. I only want to change A where it didn't already have a 1. So my expected output would keep the existing 1 in row 1 of column A and add a 1 in row 16, where column B is 1.
We can just use | on the two binary columns, which is TRUE wherever either a or b is 1; the unary + converts the result back to 0/1:
library(dplyr)
df %>%
  mutate(a = +(a | b))
Output:
# A tibble: 19 × 2
a b
<int> <dbl>
1 1 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 0 0
11 0 0
12 0 0
13 0 0
14 0 0
15 0 0
16 1 1
17 0 0
18 0 0
19 0 0
Or in base R
df$a[df$b == 1] <- 1
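Another base R equivalent, my own addition rather than part of the original answers, is pmax(), which keeps any 1 already in a and picks up the 1s from b:
df$a <- pmax(df$a, df$b)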

Issues with plotting network in igraph

I am having some issues plotting a bipartite network in R with the igraph library. Here is my script:
library(igraph)
library(reshape2)
setwd("....")
getwd()
library(readxl)
network=read_excel("network1.xlsx")
print(network)
subjects=as.character(unlist(network[,1]))
agents=colnames(network[-1])
print(network)
network = network[,-1]
g=graph.incidence(network, weighted = T)
V(g)$type
V(g)$name=c(subjects,agents)
V(g)$color = V(g)$type
V(g)$color=gsub("FALSE","red",V(g)$color)
V(g)$color=gsub("TRUE","lightblue",V(g)$color)
plot(g, edge.arrow.width = 0.3,
     vertex.size = 5,
     edge.arrow.size = 0.5,
     vertex.size2 = 5,
     vertex.label.cex = 1,
     vertex.label.color = "black",
     asp = 0.35,
     margin = 0,
     edge.color = "grey",
     edge.width = E(g)$weight,
     layout = layout_as_bipartite)
The network plots properly (plot not shown here), but I have two issues:
(1) I don't understand the order in which the vertices are shown in the plot. They are not in the same order as in the Excel file, nor in alphabetical or numerical order; they seem to be placed at random. How can I choose the order in which the vertices are placed?
(2) I don't understand why some vertices are closer together and others are farther apart. I would like all vertices to be at the same distance. How can I do that?
Thank you a lot for your invaluable help.
Since you do not provide your data, I will illustrate with a made-up example.
Sample graph data
library(igraph)
set.seed(123)
EL = matrix(c(sample(8,18, replace=T),
sample(LETTERS[1:6], 18, replace=T)), ncol=2)
g = simplify(graph_from_edgelist(EL))
V(g)$type = bipartite_mapping(g)$type
VCol = c("#FF000066", "#0000FF66")[as.numeric(V(g)$type)+1]
plot(g, layout=layout_as_bipartite(g), vertex.color=VCol)
As with your graph, this has two problems. The nodes are ordered arbitrarily
and the lower row is oddly spaced. Let's address those problems one at a time.
To do so, we will need to take control of the layout instead of using any of
the automated layout functions. A layout is simply a vcount(g) * 2 matrix
giving the x-y coordinates of the vertices for plotting. Here, I will put one
type of nodes in the top row by specifying the y coordinate as 1 and the other
nodes in a lower row by specifying y=0. We want to specify the order horizontally
by rank (alphabetically) within each group. So
LO = matrix(0, nrow=vcount(g), ncol=2)
LO[!V(g)$type, 2] = 1
LO[V(g)$type, 1] = rank(V(g)$name[V(g)$type])
LO[!V(g)$type, 1] = rank(V(g)$name[!V(g)$type])
plot(g, layout=LO, vertex.color=VCol)
Now both rows are ordered and evenly spaced, but because there are fewer
vertices in the bottom row, there is an unattractive, unbalanced look. We
can fix that by stretching the bottom row. I find it easier to make the right
scale factor if the coordinates go from 0 to (number of nodes) - 1 rather than
1 to (number of nodes) as above. Doing this, we get
LO[V(g)$type, 1] = rank(V(g)$name[V(g)$type]) - 1
LO[!V(g)$type, 1] = (rank(V(g)$name[!V(g)$type]) - 1) *
(sum(V(g)$type) - 1) / (sum(!V(g)$type) - 1)
plot(g, layout=LO, vertex.color=VCol)
Thank you a lot. I followed your very helpful example, and with step one it worked properly on my data, keeping the different edge thicknesses and everything else as in my plot, but with the proper order. This is very important, thank you. However, I have some trouble understanding how to properly rescale the top and bottom rows with my data, because they always seem too close together; probably I did not completely understand which coordinates I have to work on. Here are my data.
> network = read_excel("network1.xlsx", 2)
> dput(network)
structure(list(`NA` = c(2333, 2439, 2450, 2451, 2452, 2453, 2454,
2455, 2456, 2457, 2458, 2459, 2460, 2461, 2480, 2490, 2491, 2492,
2493, 2494, 2495), A = c(12, 2, 2, 5, 2, 0, 5, 3, 0, 0, 7, 0,
0, 0, 6, 2, 10, 7, 1, 2, 5), B = c(0, 1, 0, 1, 0, 0, 2, 0, 0,
0, 0, 0, 1, 0, 5, 0, 2, 0, 0, 0, 0), C = c(0, 0, 0, 0, 1, 0,
4, 0, 0, 0, 0, 1, 0, 0, 2, 0, 4, 4, 2, 1, 0), D = c(2, 0, 0,
0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 1, 4, 0), E = c(11,
2, 3, 3, 3, 8, 3, 6, 4, 1, 1, 0, 12, 0, 5, 0, 4, 6, 4, 8, 9),
F = c(2, 0, 0, 3, 1, 0, 10, 1, 0, 0, 0, 1, 0, 0, 9, 0, 0,
1, 1, 3, 3), G = c(0, 3, 1, 1, 0, 0, 0, 0, 0, 3, 2, 0, 0,
0, 1, 0, 0, 2, 0, 1, 0), H = c(0, 0, 2, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1), I = c(0, 0, 0, 0, 0,
0, 3, 0, 6, 3, 0, 0, 1, 0, 7, 0, 0, 4, 1, 2, 0), J = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-21L), .Names = c(NA, "A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"))
> print(network)
NA A B C D E F G H I J
1 2333 12 0 0 2 11 2 0 0 0 0
2 2439 2 1 0 0 2 0 3 0 0 0
3 2450 2 0 0 0 3 0 1 2 0 0
4 2451 5 1 0 0 3 3 1 0 0 0
5 2452 2 0 1 0 3 1 0 0 0 0
6 2453 0 0 0 0 8 0 0 0 0 1
7 2454 5 2 4 2 3 10 0 1 3 0
8 2455 3 0 0 0 6 1 0 0 0 0
9 2456 0 0 0 0 4 0 0 0 6 0
10 2457 0 0 0 0 1 0 3 0 3 0
11 2458 7 0 0 0 1 0 2 0 0 0
12 2459 0 0 1 0 0 1 0 0 0 0
13 2460 0 1 0 0 12 0 0 0 1 0
14 2461 0 0 0 0 0 0 0 0 0 0
15 2480 6 5 2 7 5 9 1 2 7 1
16 2490 2 0 0 0 0 0 0 0 0 0
17 2491 10 2 4 4 4 0 0 0 0 0
18 2492 7 0 4 0 6 1 2 0 4 0
19 2493 1 0 2 1 4 1 0 0 1 0
20 2494 2 0 1 4 8 3 1 0 2 0
21 2495 5 0 0 0 9 3 0 1 0 0
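The thread leaves this follow-up unanswered, so here is a minimal sketch of applying the same rescaling to the posted data (21 subjects in the rows, 10 agents in the columns); it reuses the incidence-matrix setup from the question, and the exact spacing factor is an assumption on my part:
library(igraph)
# 'network' is the tibble from the dput above
subjects <- as.character(unlist(network[, 1]))
agents <- colnames(network)[-1]
g <- graph.incidence(as.matrix(network[, -1]), weighted = TRUE)
V(g)$name <- c(subjects, agents)
LO <- matrix(0, nrow = vcount(g), ncol = 2)
LO[!V(g)$type, 2] <- 1                                  # subjects on the top row
LO[!V(g)$type, 1] <- rank(V(g)$name[!V(g)$type]) - 1    # positions 0 .. 20
LO[V(g)$type, 1] <- (rank(V(g)$name[V(g)$type]) - 1) *
  (sum(!V(g)$type) - 1) / (sum(V(g)$type) - 1)          # stretch the 10 agents over 0 .. 20
plot(g, layout = LO, vertex.label.color = "black",
     edge.color = "grey", edge.width = E(g)$weight)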

Create new columns with mutate_if [duplicate]

This question already has an answer here:
Create new variables with mutate_at while keeping the original ones
(1 answer)
Closed 4 years ago.
Let's assume that I have data like below:
structure(list(A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8), B = c(0, 1, 1, 0, 0, 1, 4, 9.2, 9, 0, 0, 1), C = c(2, 9, 0, 0, 0, 9, 0, 0, 0, 0, 0, 8)), .Names = c("A", "B", "C"), row.names = c(NA, -12L), class = "data.frame")
Now I would like to create dummy variables for those columns in which the proportion of 0's is greater than 0.5. These dummy variables should be 0 where the original column is 0, and 1 otherwise. How can I accomplish that with dplyr? I was thinking of data %>% mutate_if(~ mean(. == 0) > .5, ~ ifelse(. == 0, 0, 1)), but this operates in place, and I need to create new variables named e.g. A01 and C01 while preserving the old ones A and C.
We can wrap the function in funs() and give it a name, which will be appended as a suffix:
library(dplyr)
library(stringr)
df1 %>%
  mutate_if(~ mean(. == 0) > .5, funs(`01` = ifelse(. == 0, 0, 1))) %>%
  rename_all(str_remove, "_")
# A B C A01 C01
#1 0 0.0 2 0 1
#2 0 1.0 9 0 1
#3 0 1.0 0 0 0
#4 0 0.0 0 0 0
#5 0 0.0 0 0 0
#6 0 1.0 9 0 1
#7 0 4.0 0 0 0
#8 0 9.2 0 0 0
#9 0 9.0 0 0 0
#10 0 0.0 0 0 0
#11 1 0.0 0 1 0
#12 8 1.0 8 1 1
In newer versions of dplyr, we can use mutate() with across():
df1 %>%
  mutate(across(where(~ mean(. == 0) > .5),
                ~ as.integer(. != 0), .names = '{.col}01'))
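As a quick check, my own addition, you can inspect which columns satisfy the selection rule before mutating:
sapply(df1, function(x) mean(x == 0) > 0.5)
#     A     B     C
#  TRUE FALSE  TRUE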

cumsum according to certain restrictions in r

I have a large dataset of car accidents; a sample is provided below.
accident is a binary variable indicating whether an accident happened or not.
shift_number is the number of the shift; 0 means the driver is taking a rest and is not on a shift.
time_diff is the amount of time at each observation.
df <- data.frame(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17
)
My question: for each accident, I want to measure the total working time since the driver started the current shift. The desired output would be:
wanted <- data.frame(
  accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
  shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
  time_diff = 3:17,
  cum_time = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 0, 0, 75)
)
Does anyone have ideas on how to solve this in R? A data.table or vectorised solution would be preferable because I have a huge amount of data to deal with.
df$cum_time <- 0
accident <- which(df$accident == 1)
df$cum_time[accident] <- sapply(accident, function(x) {
  sum(df$time_diff[(which.max(cumsum(df$shift_number[1:x] == 0)) + 1):x])
})
df
# accident shift_number time_diff cum_time
#1 0 1 3 0
#2 0 1 4 0
#3 0 0 5 0
#4 0 0 6 0
#5 0 0 7 0
#6 0 2 8 0
#7 0 2 9 0
#8 0 2 10 0
#9 0 0 11 0
#10 0 0 12 0
#11 0 3 13 0
#12 1 3 14 27
#13 0 3 15 0
#14 0 3 16 0
#15 1 3 17 75
We first set all values of cum_time to 0. We then find the indices where an accident occurred. For each such index x, we locate the last 0 in shift_number before x (the end of the previous rest), sum the time_diff values from the row after that 0 up to x, and assign the result to the corresponding index.
Use the ave function to compute the cumulative sum of time_diff by shift_number:
cumsum_by_shift <- ave(df$time_diff, df$shift_number, FUN=cumsum)
#[1] 3 7 5 11 18 8 17 27 29 41 13 27 42 58 75
Pick out elements of cumsum_by_shift where accidents occur:
cum_time <- ifelse(df$accident == 1, cumsum_by_shift, 0)
#[1] 0 0 0 0 0 0 0 0 0 0 0 27 0 0 75
Note the use of the vectorized ifelse function.
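Since the question explicitly asks for a data.table or vectorised option, here is a possible sketch, my own addition, assuming each shift is a single consecutive block of rows as in the example data:
library(data.table)
setDT(df)
df[, cum_time := cumsum(time_diff) * accident, by = rleid(shift_number)]
# accident is 0/1, so the within-shift cumulative time is kept only on accident rows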

How to convert predicted values into binary variables and save them to a CSV

I have built a decision tree model and then used it to predict values in a test dataset.
dtpredict<-predict(ct1, testdat, type="class")
The output looks like:
1 2 3 4 5 6
Class_2 Class_2 Class_6 Class_2 Class_8 Class_2
I want to write a csv to look like:
id, Class_1, Class_2, Class_3, Class_4, Class_5, Class_6, Class_7, Class_8, Class_9
1, 0, 1, 0, 0, 0, 0, 0, 0, 0
2, 0, 1, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 1, 0, 0, 0
4, 0, 1, 0, 0, 0, 0, 0, 0, 0
5, 0, 0, 0, 0, 0, 0, 0, 1, 0
6, 0, 1, 0, 0, 0, 0, 0, 0, 0
There's a package called dummies that does that well...
install.packages("dummies")
library(dummies)
x <- factor(c("Class_2", "Class_2", "Class_6", "Class_2", "Class_8", "Class_2"),
levels = paste("Class", 1:9, sep="_"))
dummy(x, drop = FALSE)
xClass_1 xClass_2 xClass_3 xClass_4 xClass_5 xClass_6 xClass_7 xClass_8 xClass_9
[1,] 0 1 0 0 0 0 0 0 0
[2,] 0 1 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 1 0 0 0
[4,] 0 1 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 1 0
[6,] 0 1 0 0 0 0 0 0 0
All that remains is to get rid of the "x" prefix, which should not be too hard with something like this:
d <- dummy(x,drop = FALSE)
colnames(d) <- sub("x", "", colnames(d))
and then to save to disk:
write.csv(d, "somefile.csv", row.names = FALSE)
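If you prefer not to depend on an extra package, a base R sketch, my own addition reusing the factor x from above, produces the same 0/1 layout and keeps empty levels such as Class_1 as all-zero columns (d2 is just a helper name):
d2 <- as.data.frame.matrix(table(seq_along(x), x))  # one 0/1 column per factor level
write.csv(cbind(id = seq_along(x), d2), "somefile.csv", row.names = FALSE)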
What are the 0/1 values supposed to be, logicals? If so, they don't make much sense in your example (they don't correspond to your example dtpredict). If they are logicals...
# if dtpredict is a factor vector, where the values are the classes
# and the names are the row ids:
ids <- names(dtpredict)
classes <- as.character(dtpredict)
x <- data.frame(id = ids)
for (class in sort(unique(classes))) {
  x[, class] <- as.numeric(classes == class)
}
write.csv(x, "blah.csv", row.names = FALSE)
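For example, with the predictions shown in the question and hypothetical ids 1 to 6 as names:
dtpredict <- factor(c("Class_2", "Class_2", "Class_6", "Class_2", "Class_8", "Class_2"),
                    levels = paste("Class", 1:9, sep = "_"))
names(dtpredict) <- 1:6
# running the loop above then yields one 0/1 column per class that actually occurs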
