Analyzing covariation of variables over time in R - r

I am using Rstudio to analyze data. And - why else would I post this - I'm getting stuck!
The data is a time-series of the evolution of a community and their characteristics over time. For example, the n_members (the size) can increase over time.
Each combination of the variables 'randomseed' and 'lhsExperimentNumber' is 1 observation. In total, there are 80,000 observations. Each observation has a time-series going from 1 (month) to 204 (months). This time is characterized in the variable X.run.number
What I want:
See covariation of dependent and independent variables over time.
For example: I would like to see how the n_members (size) evolves over time, under different circumstances (exit_window, for example). I normally use dplyr and ggplot2, and I'd like to make some captivating images in ggplot.
My problem:
- I'm just a beginner in R so it's a little tricky for me to see the correct strategy here
- It would be sort of easier if I would just have 1 observation that evolves over time, but now I have 80,000 observations, which each independently evolve over time. So I don't really know how to work the data to make sensible graphs.
How would you guys tackle this? Maybe you can help me out with some tips (that would be so great!)
I used dput to reproduce the first 10 observations of my dataset:
library(ggplot2)
library(dplyr)
coop_dat <- structure(list(X.run.number. = c(2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 3L, 3L), rewire_prop = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
0.1, 0.1, 0.1), prior_trust_std = c(0.1, 0.1, 0.1, 0.1, 0.1,
0.1, 0.1, 0.1, 0.1, 0.1), max_trust = c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), maxCR_trust = c(0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
0.2, 0.2, 0.2, 0.2), Ostrom. = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("false", "true"), class = "factor"),
social_network = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "\"small-world\"", class = "factor"),
external_ROI_influence = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("false", "true"), class = "factor"),
randomseed = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), lhsExperimentNumber = c(1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 2L, 2L), X.step. = c(0L, 0L,
0L, 0L, 1L, 1L, 1L, 1L, 2L, 3L), size_start_cooperative = c(12L,
4L, 20L, 17L, 12L, 4L, 20L, 17L, 4L, 4L), degree_prior = c(7L,
21L, 40L, 15L, 7L, 21L, 40L, 15L, 21L, 21L), degree_coop = c(17L,
47L, 20L, 31L, 17L, 47L, 20L, 31L, 47L, 47L), prior_trust_average = c(0.40574414878618,
0.631650829650462, 0.369752783104777, 0.585182260912843,
0.40574414878618, 0.631650829650462, 0.369752783104777, 0.585182260912843,
0.631650829650462, 0.631650829650462), number_of_meetings = c(10L,
5L, 2L, 2L, 10L, 5L, 2L, 2L, 5L, 5L), n_interactions = c(20L,
3L, 9L, 19L, 20L, 3L, 9L, 19L, 3L, 3L), exit_window = c(5L,
1L, 3L, 4L, 5L, 1L, 3L, 4L, 1L, 1L), info_peer_behavior = c(0.396114811487496,
0.627111892865505, 0.148686024846975, 0.281594149058219,
0.396114811487496, 0.627111892865505, 0.148686024846975,
0.281594149058219, 0.627111892865505, 0.627111892865505),
memory = c(8L, 11L, 20L, 1L, 8L, 11L, 20L, 1L, 11L, 11L),
n_shares_total_sum = c(0L, 0L, 0L, 0L, 21L, 13L, 32L, 31L,
13L, 13L), n_shares_total_mean = c(0, 0, 0, 0, 1.75, 3.25,
1.6, 1.82352941176471, 3.25, 3.25), members_at_meeting = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L), trust_coop_total = c(0,
0, 0, 0, 0.0457418505048186, 0.0563109517340759, 0.0349050771003757,
0.0587220121136852, 0.0563109517340759, 0.0563109517340759
), rep_total = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ROI = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), current_strategy = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), price = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), REfocus = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), endpoint = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), HG_membertypes = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
n_members = c(0L, 0L, 0L, 0L, 12L, 4L, 20L, 17L, 4L, 4L),
n_ids = c(0L, 0L, 0L, 0L, 12L, 4L, 20L, 17L, 4L, 4L), n_prods = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), n_cons = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), n_members_exit = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("X.run.number.",
"rewire_prop", "prior_trust_std", "max_trust", "maxCR_trust",
"Ostrom.", "social_network", "external_ROI_influence", "randomseed",
"lhsExperimentNumber", "X.step.", "size_start_cooperative", "degree_prior",
"degree_coop", "prior_trust_average", "number_of_meetings", "n_interactions",
"exit_window", "info_peer_behavior", "memory", "n_shares_total_sum",
"n_shares_total_mean", "members_at_meeting", "trust_coop_total",
"rep_total", "ROI", "current_strategy", "price", "REfocus", "endpoint",
"HG_membertypes", "n_members", "n_ids", "n_prods", "n_cons",
"n_members_exit"), row.names = c(NA, 10L), class = "data.frame")
EDIT:
Right now I have this simple ggplot code:
ggplot(dat) +
geom_line(aes(x=month, y=trust_coop_total))
This produces this graph:
You can see that this is not really a line graph. It's more a graph with a bunch of vertical lines. What I would like is to decipher trends for each individual sample. Each sample is, as mentioned above, a combination of lhsExperimentnumber and X.run.number. There are so many samples that a line graph would be overwhelming, but it might be nice in combination with the alpha=1/10 command of geom_jitter. However, geom_jitter ignores the trend in the sample itself.
What should I do to get a good trend graph?

Related

Calculating optimal number of clusters with Nbclust()

I would like to calculate the optimal number of clusters for a large dataset: 17 columns and >80.000 rows.
This is my code:
1. Definition of the path
setwd("C:/Users/A/Documents/Master BWL/Masterarbeit")
2. Loading the required packages
library(factoextra); library(cluster); library(skmeans); library(mclust);
library(fpc); library(psda); library(simEd); library (ggpubr);
library(dbscan); library(clustertend); library(MASS); library(devtools);
library(ggbiplot);library(NbClust)
3. Import csv file
WKA_ohneJB <- read.csv("WKA_ohneJB_PCA.csv", header=TRUE, sep = ";", stringsAsFactors = FALSE)
WKA_ohneJB_scaled <- scale(WKA_ohneJB)
# NbClust ()
nb <- NbClust(WKA_ohneJB_scaled , distance = "manhattan", min.nc = 2, max.nc = 7, method = "kmeans")
dput(rbind(head(WKA_ohneJB, 10), tail(WKA_ohneJB, 10)))
structure(list(X = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
821039L, 821040L, 821041L, 821042L, 821043L, 821044L, 821045L,
821046L, 821047L, 821048L), BASKETS_NZ = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
LOGONS = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), PIS = c(71L, 39L, 50L, 4L,
13L, 4L, 30L, 65L, 13L, 31L, 111L, 33L, 3L, 46L, 11L, 8L,
17L, 68L, 65L, 15L), PIS_AP = c(14L, 2L, 4L, 0L, 0L, 0L,
1L, 0L, 2L, 1L, 13L, 0L, 0L, 2L, 1L, 0L, 3L, 8L, 0L, 1L),
PIS_DV = c(3L, 19L, 4L, 1L, 0L, 0L, 6L, 2L, 2L, 3L, 38L,
8L, 0L, 5L, 2L, 0L, 1L, 0L, 3L, 2L), PIS_PL = c(0L, 5L, 8L,
2L, 0L, 0L, 0L, 24L, 0L, 6L, 32L, 8L, 0L, 0L, 4L, 0L, 0L,
0L, 0L, 0L), PIS_SDV = c(18L, 0L, 11L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 6L, 0L, 0L, 13L, 0L, 0L, 1L, 15L, 1L, 0L), PIS_SHOPS = c(3L,
24L, 13L, 3L, 0L, 0L, 6L, 28L, 2L, 11L, 71L, 16L, 2L, 5L,
6L, 0L, 1L, 0L, 3L, 2L), PIS_SR = c(19L, 0L, 14L, 0L, 0L,
0L, 2L, 23L, 0L, 3L, 6L, 0L, 0L, 20L, 0L, 0L, 3L, 32L, 1L,
0L), QUANTITY = c(13L, 2L, 18L, 1L, 14L, 1L, 4L, 2L, 5L,
1L, 5L, 2L, 2L, 4L, 1L, 3L, 2L, 8L, 17L, 8L), WKA = c(1L,
1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 0L, 1L, 1L), NEW_CUST = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), EXIST_CUST = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), WEB_CUST = c(1L, 0L, 0L, 0L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), MOBILE_CUST = c(0L,
1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 1L, 0L), TABLET_CUST = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L),
LOGON_CUST_STEP2 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 821039L, 821040L, 821041L,
821042L, 821043L, 821044L, 821045L, 821046L, 821047L, 821048L
), class = "data.frame")
Error: Error in na.omit(jeu1) : object 'polygons' not found
Simple means of determining number of clusters is to examine the elbow in the plot of within groups sum of squares and/or average width of the silhouette, the code produces simple plots to examine these...
In order to perform clustering, you need to solve the problem of NaNs after scaling...
WKA_ohneJB_scaled <- as.matrix(scale(data[, c(-1, -2, -18)]))
plot_scree_clusters <- function(x) {
wss <- 0
max_i <- 10 # max clusters
for (i in 1:max_i) {
km.model <- kmeans(x, centers = i, nstart = 20)
wss[i] <- km.model$tot.withinss
}
plot(1:max_i, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
}
plot_scree_clusters(WKA_ohneJB_scaled)
plot_sil_width <- function(x) {
sw <- 0
max_i <- 10 # max clusters
for (i in 2:max_i) {
km.model <- cluster::pam(x = pc_comp$x, k = i)
sw[i] <- km.model$silinfo$avg.width
}
sw <- sw[-1]
plot(2:max_i, sw, type = "b",
xlab = "Number of Clusters",
ylab = "Average silhouette width")
}
plot_sil_width(WKA_ohneJB_scaled)
Use the Elbow Method, as alluded to by knytt. Here are a couple references that describe the technique.
https://www.r-bloggers.com/finding-optimal-number-of-clusters/
https://uc-r.github.io/kmeans_clustering#elbow
Also, consider using the Affinity Propogation library. The AP library will automatically determine the optimal number of clusters for you. Check out the siple example below.
install.packages("apcluster")
library("apcluster")
c1 <- cbind(rnorm(30,.3,.5),rnorm(30.7,.4))
c2 <- cbind(rnorm(30,.7,.4),rnorm(30.4,.5))
x1 <- rbind(c1,c2)
plot(x1, xlab="", ylab="", pch=19, cex=.8)
apresia <- apcluster(negDistMat(r=2),x1)
s1 <- negDistMat(x1,r=2)
apres1b <- apcluster(s1)
apresia
plot(apresia, x1)
Resource:
https://cran.r-project.org/web/packages/apcluster/vignettes/apcluster.pdf

How to mutate a column using dplyr with a value when any of the columns contain a 1 otherwise 0

events <- structure(list(ID = c(3049951, 3085397, 3204081, 3262134,
3467254), TVTProcedureStartDate = structure(c(16210, 16238, 16322,
16420, 16546), class = "Date"), DCDate = structure(c(16213, 16250,
16326, 16426, 16560), class = "Date"), CE_EventOccurred = c(0L,
0L, 0L, 0L, 0L), CE_EventDate = c(0L, 0L, 0L, 0L, 0L), `Annular Dissection (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Aortic Dissection (In Hospital)` = c(0L, 0L,
0L, 1L, 0L), `Atrial Fibrillation (In Hospital)` = c(0L, 1L,
0L, 0L, 1L), `Bleeding at Access Site (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Cardiac Arrest (In Hospital)` = c(1L, 0L, 0L,
0L, 0L), `Conduction/Native Pacer Disturbance Req ICD (In Hospital)` = c(0L,
0L, 1L, 0L, 0L), `Conduction/Native Pacer Disturbance Req Pacer (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Endocarditis (In Hospital)` = c(0L, 0L, 0L,
0L, 0L), `GI Bleed (In Hospital)` = c(0L, 0L, 0L, 0L, 0L), `Hematoma at Access Site (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Ischemic Stroke (In Hospital)` = c(0L, 0L,
0L, 0L, 0L), `Major Vascular Complications (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Minor Vascular Complication (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Mitral Leaflet Injury - detected during surgery (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Mitral Subvalvular Injury -detected during surgery (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `New Requirement for Dialysis (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Other Bleed (In Hospital)` = c(0L, 0L, 0L,
0L, 0L), `Perforation with or w/o Tamponade (In Hospital)` = c(1L,
0L, 0L, 0L, 0L), `Retroperitoneal Bleeding (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Single Leaflet Device Attachment (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Unplanned Other Cardiac Surgery or Intervention (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Unplanned Vascular Surgery or Intervention (In Hospital)` = c(0L,
0L, 0L, 1L, 0L)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), vars = "NCDRPatientID", labels = structure(list(
NCDRPatientID = c(3049951, 3085397, 3204081, 3262134, 3467254
)), class = "data.frame", row.names = c(NA, -5L), vars = "NCDRPatientID", labels = structure(list(
NCDRPatientID = c(3049951, 3085397, 3204081, 3262134, 3467254,
3467324, 3510387, 3586037, 3661089, 3668621, 3679485, 3737916,
3738064, 3960141, 4006862, 4018241, 4019056, 4025174, 4027490,
4050900, 4051101, 4096816, 4097119, 4097146, 4097180, 4098426,
4106410, 4109968, 4147466, 4198427, 4198450, 4198458, 4204554,
4208053, 4213116, 4218802, 4218854, 4223378, 4223415, 4243959,
4316979, 4341660, 4348676, 4413567, 4419513, 4421948, 4422768,
4426483, 4430159, 4431211, 4433156, 4433406, 4433988)), class = "data.frame", row.names = c(NA,
-53L), vars = "NCDRPatientID", labels = structure(list(NCDRPatientID = c(3049951,
3085397, 3204081, 3262134, 3467254, 3467324, 3510387, 3586037,
3661089, 3668621, 3679485, 3737916, 3738064, 3960141, 4006862,
4018241, 4019056, 4025174, 4027490, 4050900, 4051101, 4096816,
4097119, 4097146, 4097180, 4098426, 4106410, 4109968, 4147466,
4198427, 4198450, 4198458, 4204554, 4208053, 4213116, 4218802,
4218854, 4223378, 4223415, 4243959, 4316979, 4341660, 4348676,
4413567, 4419513, 4421948, 4422768, 4426483, 4430159, 4431211,
4433156, 4433406, 4433988)), class = "data.frame", row.names = c(NA,
-53L), vars = "NCDRPatientID", drop = TRUE), indices = list(0L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10:12, 13L, 14L, 15L,
16:17, 18L, 19:21, 22L, 23L, 24L, 25:26, 27L, 28L, 29:30,
31L, 32:33, 34L, 35:38, 39L, 40:41, 42L, 43L, 44L, 45L, 46L,
47L, 48:50, 51:53, 54L, 55L, 56L, 57L, 58L, 59:60, 61L, 62L,
63:64, 65:66, 67:68, 69L, 70L, 71:72, 73L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 4L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 2L, 1L), biggest_group_size = 4L), indices = list(0L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L,
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L,
51L, 52L), drop = TRUE, group_sizes = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), biggest_group_size = 1L), indices = list(0L, 1L, 2L, 3L, 4L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L), biggest_group_size = 1L)
From this data, I need to create a column that has value 1 if any of the columns which ends in (in-hospital) contains 1 else 0.
I tried multiple things but either doesn't work or displays error
Error in mutate_impl(.data, dots) : Evaluation error: NA/NaN argument.
event %>% mutate(TR = rowSums(select_(.,6:n)))
Error in mutate_impl(.data, dots) : Column `TR` must be length 1 (the group size), not 53
event %>% mutate(TR = rowSums(.[6:ncol(.)]))
And some other variations of it to see if I can understand or make some sense, but it keeps running into the similar errors and problems
Another thing i tried was the following which seems to do the row sums, but it also adds the ID even when I'm doing the following:
event %>% select(6:27) %>% rowSums()
but it added the ID with the 1s and 0s from columns 6 to 27 for each row. Not sure why it's doing this.
I want the results as a data frame with the same data, but also a column with 1s if any of the columns from 6 to 27 contains 1 otherwise 0
Before I developed my solution, I ran the following code to ungroup your data.
library(dplyr)
events <- events %>% ungroup()
Solution 1: rowSums with selected columns
The idea of this solution is to use rowSums to add all the numbers from the selected columns, determine if the sum is larger than 0, and then convert the logical vector to an integer vector (with 1 or 0).
There are many ways to select the columns. We can select based on column numbers.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., 6:27)) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use ends_with.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., ends_with("(In Hospital)"))) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use matches. The regular expression \\(In Hospital\\)$ indicates the string at the end.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., matches("\\(In Hospital\\)$"))) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use contains, but notice that the target string does not need to be in the end of the column names.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., contains("(In Hospital)"))) > 0))
events2$Col
# [1] 1 1 1 1 1
Solution 2: apply with max
Since the numbers from the target columns are all 1 or 0, we can use apply with max to get the maximum, which will be 1 if there ara any 1, or 0. All the ways to use the select function as was shown above will also work here. Below I presented one way to do this.
events2 <- events %>% mutate(Col = apply(select(., ends_with("(In Hospital)")), 1, max))
events2$Col
# [1] 1 1 1 1 1
It is not a dplyr way, but it also works:
events$new_col <- 0
events$new_col[rowSums(events[, grep("In Hospital", colnames(events))]) >= 1] <- 1
A solution from base R using apply()
cols <- grep("in hospital", colnames(events), ignore.case = T)
apply(events[, cols], 1, function(x) ifelse(any(x == 1), 1, 0))
# [1] 1 1 1 1 1

How to create a column in a data frame that bins n rows of the data frame consecutively

I have a data frame
structure(list(id = c(31068L, 4741L, 31068L, 48783L, 48783L,
4741L, 40866L, 602L, 43498L, 602L, 47365L, 27643L, 30657L, 47365L,
2456L, 30657L, 48783L, 40866L, 31068L, 31068L), chrom = structure(c(10L,
2L, 10L, 18L, 18L, 2L, 1L, 12L, 15L, 12L, 3L, 14L, 13L, 3L, 5L,
13L, 18L, 1L, 10L, 10L), .Label = c("chr1", "chr10", "chr11",
"chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18",
"chr19", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8",
"chr9", "chrM", "chrUn", "chrX", "chrY"), class = "factor"),
strand = structure(c(2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("-",
"+"), class = "factor"), position = c(32233920L, 63320175L,
32233915L, 105252695L, 105252690L, 63320180L, 160222550L,
60750690L, 111386910L, 60750685L, 89019895L, 155820080L,
146439265L, 89019890L, 56738465L, 146439260L, 105252700L,
160222545L, 32233910L, 32233925L), state = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), probability = c(0.0609068958401659, 0.0652616906258002,
0.0692068771991058, 0.0692812165331229, 0.0701354955671266,
0.0703768347393804, 0.0711103024384338, 0.0741845396230543,
0.0743938664894217, 0.0749688062517259, 0.079612884618025,
0.0813520378059492, 0.0826542716746912, 0.0846949902404464,
0.0861396527789957, 0.0864410597153605, 0.0871058913414357,
0.087430325878555, 0.0878512688974869, 0.0880022007028806
), differential = c(-0.23435362433765, -0.23435362433765,
-0.171785943061141, 0.196483908128415, 0.251738193935959,
-0.23435362433765, -0.23435362433765, -0.278594311118605,
0.242648431724363, -0.278594311118605, -0.278594311118605,
0.209140348612477, -0.28152682524624, -0.234047089198976,
0.290374904597332, -0.242344380735883, 0.196483908128415,
-0.23435362433765, -0.171785943061141, -0.155502764975432
), tag1 = c(0L, 0L, 0L, 15L, 15L, 0L, 0L, 5L, 30L, 5L, 5L,
10L, 2L, 3L, 15L, 7L, 15L, 0L, 0L, 0L), mut1 = c(0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), tag2 = c(15L, 15L, 11L, 5L, 2L, 15L, 15L, 25L,
15L, 25L, 25L, 0L, 21L, 19L, 0L, 25L, 5L, 15L, 11L, 10L),
mut2 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), group = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("id",
"chrom", "strand", "position", "state", "probability", "differential",
"tag1", "mut1", "tag2", "mut2", "group"), class = c("data.table",
"data.frame"), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x102819778>)
and I would like to create a new column df$bin such that the bin column creates bins of length n of consecutive values.
For example, if n = 2, I would want there to be 2 bins where the bin column looks like bin = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
such that the values in for df$bin would be 1:n and each value would be repeated nrow(df)/n times.
You can use the base R function cut for this:
df$bin = cut(1:nrow(df), breaks = 2, labels = FALSE)

R: Recoding multiple dummy variables into a single variable and replacing the corresponding dummy value with the variable name

I have a dataset with 14 mutually exclusive categories of call type all coded as dummy variables. Here is a small sample:
dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("MON1_12",
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS",
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "data.frame", row.names = c(NA,
-10L))
I want to combine each of the dummy variables into a single new variable called "QUEUE" that replaces the value of "1" with the name of the dummy variable its corresponding dummy variable. Here is an example of what this would look like:
dput(df2)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), QUEUE = structure(c(1L, 4L, 2L, 4L, 1L, 3L,
3L, 5L, 5L, 4L), .Label = c("CLAIMS", "CONTENT", "CREDIT_CARD",
"DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12",
"WEEK1_53", "AGENT_ID", "CallsHandled", "QUEUE"), class = "data.frame", row.names = c(NA,
-10L))
Edit in response to having question marked down: This is what I had tried this afternoon on recommendation with a slightly different sample dataframe:
df$Queue <- as.factor(df$CONTENT + df$CLAIMS*2 + df$CREDIT_CARD*3 + df$DEDUCT_BILL*4 + df$HCREFORM*5)
levels(df$Queue) <- c("CONTENT", "CLAIMS", "CREDIT_CARD","DEDUCT_BILL","HCREFORM")
View(df)
But I received a column of NA's in the Queue column. So, I recreated another sample dataset here. This dataframe is adequately representative of what I'll receive in reality, except I'll have about 40 variables and 2 million rows. When I run what I tried above on "df" above I get the following incorrect result:
dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Queue = structure(c(2L,
1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("CONTENT",
"CLAIMS", "CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12",
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS",
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM", "Queue"), row.names = c(NA,
-10L), class = "data.frame")
I also tried:
df3 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))
but received the following error: "Error in data.frame("CLAIMS", character(0), character(0), "DEDUCT_BILL", :
arguments imply differing number of rows: 1, 0:
You could use max.col to get the column index that have a value of '1' in each row for columns 5 to 9. (The 'df' example is not correct as most of the rows were all 0s. The corrected one is below).
df$QUEUE <- names(df)[-c(1:4)][max.col(df[-c(1:4)])]
Or you can do
df$QUEUE <- names(df)[-(1:4)][(as.matrix(df[-(1:4)]) %*%
seq_along(df[-(1:4)]))[,1]]
Update
Based on the edit dataset 'df', some rows are all '0's for the columns 5:9, and in the expected result, it is showed that 'QUEUE' as 'CONTENT'. In that case, we can first modify the 'CONTENT' column to change the values where rows are all 0's and then apply either of the code above
df$CONTENT[!rowSums(df[5:9])] <- 1
df$QUEUE1 <- names(df)[5:9][max.col(df[5:9])]
df$QUEUE1
#[1] "CLAIMS" "CONTENT" "CONTENT" "DEDUCT_BILL" "CONTENT"
#[6] "CONTENT" "CONTENT" "CONTENT" "CONTENT" "CONTENT"
data
df <- structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), CONTENT = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0), CLAIMS = c(1,
0, 0, 0, 1, 0, 0, 0, 0, 0), CREDIT_CARD = c(0, 0, 0, 0, 0, 1,
1, 0, 0, 0), DEDUCT_BILL = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 1),
HCREFORM = c(0,
0, 0, 0, 0, 0, 0, 1, 1, 0)), .Names = c("MON1_12", "WEEK1_53",
"AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", "CREDIT_CARD",
"DEDUCT_BILL", "HCREFORM"), row.names = c(NA, -10L), class = "data.frame")
This should produce the desired result:
df2 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))
provided that only one and exactly one of the dummy variables is 1 in any of the rows (which is not true in your original sample of df).
Explanation: df[1:4] selects the columns one through four to be preserved in the output. It is then column bound to QUEUE using cbind function. QUEUE is obtained by iterating through the dummy variables (columns five through nine), row-wise over the data set df and selecting the column-name that contains the value one.

R create variable IF ELSE leads to wrong values

I have a dataframe with:
"serial" the number of households, each one with a variable number of components "head, spouse, parent and child or grandchild" and total number of children in the house "nchild"
I want to create a new variable (in the dput I added an example for clarity: withCM 'living with male child' and withCF). I have tried various combinations but I cannot discriminate on the sex of the child within the same "serial", so that for withCM=1 only when relate=="child"&sex==1, but the 1 would appear on a different row (that of the head, spouse or parent)
mydata$withCM<- ifelse(mydata$nchild>0&mydata$relate!="child",1,0)
mydata <- structure(list(serial = c(12345L, 12345L, 12345L, 12345L, 12346L,
12346L, 12347L, 12347L, 12347L, 12348L, 12348L, 12348L, 12348L,
12348L, 12348L, 12348L, 12349L, 12350L, 12350L, 12351L, 12351L,
12351L, 12352L, 12352L, 12352L, 12352L, 12352L, 12353L, 12354L,
12354L), age = c(45L, 44L, 13L, 11L, 29L, 28L, 65L, 61L, 35L,
68L, 61L, 35L, 34L, 6L, 2L, 1L, 62L, 54L, 52L, 67L, 67L, 12L,
49L, 50L, 28L, 21L, 22L, 70L, 89L, 55L), sex = c(1L, 2L, 2L,
1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L,
1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L), relate = structure(c(4L,
7L, 1L, 1L, 4L, 7L, 6L, 6L, 4L, 4L, 7L, 1L, 2L, 3L, 3L, 3L, 4L,
4L, 7L, 4L, 7L, 3L, 4L, 7L, 1L, 5L, 5L, 4L, 6L, 4L), .Label = c("child",
"childinlaw", "grandchild", "head", "nonrelative", "parent",
"spouse"), class = "factor"), nchild = c(2L, 2L, 0L, 0L, 0L,
0L, 1L, 1L, 0L, 1L, 1L, 3L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L), conhija = c(1L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L), conhijo = c(1L,
1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("serial",
"age", "sex", "relate", "nchild", "conhija", "conhijo"), class = "data.frame", row.names = c(NA,
-30L))
You can tabulate the gender, family, and role-within-family as:
xtab <- table(mydata$serial, mydata$sex, mydata$relate)
And then choose the heads of the families (or, in the commented line, anyone who has the specific relationship), and alter their tallies as follows:
mydata$sex1 <- 0
mydata$sex2 <- 0
ind <- mydata$relate=="head"
#ind <- mydata$relate %in% c("head","spouse","parent")
mydata$sex1[ind] <- xtab[as.character(mydata$serial[ind]), "1", "child"]
mydata$sex2[ind] <- xtab[as.character(mydata$serial[ind]), "2", "child"]
Use lapply to split into families, then test if they are an adult, and there is at least one male child in the unit.
lives_with_boy <- function(serial)
{
unit <- mydata[mydata$serial==serial,]
as.character(unit$relate) %in% c("head","spouse","parent") & any(unit$relate == "child" & unit$sex==1)
}
mydata$withCM <- unlist(lapply(unique(mydata$serial),lives_with_boy ))

Resources