How to create a sequence column based on sequences' starts and ends - r

I've got a two columns that contain information about sequences' starts and ends. I want to create a sequence column from that, i.e. each sequence starts when a seq_start is 1 and ends in first row appearing after seq_start = 1 in which seq_end = 1. How can I do it with tidyverse? The data is shown below, where seq is expected output. Please note that when seq_end = 1 and seq_start = 1 within the same rows this produces the sequence of length one.
structure(list(seq_start = c(NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA, NA, 1, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA,
NA, 1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1,
NA), seq_end = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
1L, 1L, 1L, NA, NA, 1L, 1L, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L,
1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA, NA, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, NA, 1L, 1L,
1L), seq = c(NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
NA, 3L, NA, NA, NA, NA, NA, NA, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
7L, 7L, 7L, 8L, NA, NA, NA, 9L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 10L, 10L, NA, NA, NA, NA, NA, NA, NA, 11L,
NA)), .Names = c("seq_start", "seq_end", "seq"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -60L))

Here's a solution that makes heavy use of dplyr package's lag() function, along with cumsum() from the base package, to produce the expected result. It's probably not the most succinct solution out there, but I do think it's reasonably intuitive to understand:
d <- d %>%
# new.seq.starts starts from 0, and increments by 1 every time seq_starts takes on
# the value 1, like this: 0, 0, 0, 1, 1, 1, 1, 2, 2, ...
# Rows with the same new.seq.starts value are thus part of the same "run".
mutate(new.seq.starts = cumsum(!is.na(seq_start))) %>%
# group by each "run"
group_by(new.seq.starts) %>%
# any.ending.so.far counts whether there has been ANY seq_end == 1 within the run yet.
# first.ending is TRUE only if it's the first row (within the run) to have an ending.
mutate(any.ending.so.far = cumsum(!is.na(seq_end)),
first.ending = any.ending.so.far == 1 &
(is.na(lag(any.ending.so.far)) | lag(any.ending.so.far) < 1)) %>%
ungroup() %>%
# result keeps the new.seq.starts values only if there's no ending yet (i.e.
# any.ending.so.far == 0), or only just ended (first.ending == TRUE). Otherwise,
# it takes on the value NA.
mutate(result = ifelse(new.seq.starts > 0 &
(any.ending.so.far == 0 | first.ending),
new.seq.starts, NA)) %>%
# Remove helper variables as they are no longer needed.
select(-c(new.seq.starts, any.ending.so.far, first.ending))
> all.equal(d$seq, d$result)
[1] TRUE

Related

Filling a dataframe with a dummy value data based on specific col in R [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 4 years ago.
Improve this question
I have a data frame like this:
df <- data.frame(stringsAsFactors=FALSE,
member = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 3L, 5L),
q_c3_1 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
q_c4_1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
q_c5_1 = c(1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L,
1900L),
q_c6_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c7_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c3_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)
base on member variable, I need to fill corresponding variables with dummy data.For example if member = 2 then q_c3_2:q_c7_2 should have dummy values --> q_c3 = some character like "Arne", q_c4 with 1 and q_c5 with 1900 and q_c6 and q_c7 with 0 , if member == 3 then q_c3_2:q_c7_2 and q_c3_3:q_c7_3 should have dummy values (same as dummy values as above) and so on. How may i do this and efficiently with tidyverse? Thanks
My desire output shall be like this data frame
df2 <- data.frame(stringsAsFactors=FALSE,
member = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 3L, 5L),
q_c3_1 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
q_c4_1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
q_c5_1 = c(1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L,
1900L),
q_c6_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c7_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c3_2 = c(NA, NA, "Arne", NA, NA, NA, NA, "Arne", "Arne", "Arne"),
q_c4_2 = c(NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L),
q_c5_2 = c(NA, NA, 1900L, NA, NA, NA, NA, 1900L, 1900L, 1900L),
q_c6_2 = c(NA, NA, 0L, NA, NA, NA, NA, 0L, 0L, 0L),
q_c7_2 = c(NA, NA, 0L, NA, NA, NA, NA, 0L, 0L, 0L),
q_c3_3 = c(NA, NA, NA, NA, NA, NA, NA, "Arne", "Arne", "Arne"),
q_c4_3 = c(NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L),
q_c5_3 = c(NA, NA, NA, NA, NA, NA, NA, 1900L, 1900L, 1900L),
q_c6_3 = c(NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 0L),
q_c7_3 = c(NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 0L),
q_c3_4 = c(NA, NA, NA, NA, NA, NA, NA, "Arne", NA, "Arne"),
q_c4_4 = c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L),
q_c5_4 = c(NA, NA, NA, NA, NA, NA, NA, 1900L, NA, 1900L),
q_c6_4 = c(NA, NA, NA, NA, NA, NA, NA, 0L, NA, 0L),
q_c7_4 = c(NA, NA, NA, NA, NA, NA, NA, 0L, NA, 0L),
q_c3_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, "Arne"),
q_c4_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L),
q_c5_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1900L),
q_c6_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L),
q_c7_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L)
)
With the assumption that it does not matter what the dummy variables are and using dplyr:
library(dplyr)
temp <- df %>%
melt(id.vars = "member") %>%
mutate(compare = as.numeric(gsub("q_c\\d_(\\d)", "\\1", variable))) %>%
filter(compare <= member) %>%
mutate(value = "dummy",
compare = NULL) %>%
unique() %>%
spread(variable, value)
df <- df %>%
select(member) %>%
left_join(., temp, by = "member")
Edit: With dummy variables as requested.
library(dplyr)
temp <- df %>%
melt(id.vars = "member") %>%
mutate(compare = as.numeric(gsub("q_c\\d_(\\d)", "\\1", variable)),
dummy_match = as.numeric(gsub("q_c(\\d)_\\d", "\\1", variable))) %>%
filter(compare <= member) %>%
mutate(value = case_when(dummy_match == 4 ~ 1,
dummy_match == 5 ~ 1900,
dummy_match >= 6 ~ 0,
T ~ 9999),
compare = NULL,
dummy_match = NULL) %>%
unique() %>%
spread(variable, value)
df <- df %>%
select(member) %>%
left_join(., temp, by = "member")
df[df == 9999] <- "Arne"

For loop to extract data scattered across multiple columns in another R dataframe

I have a survey question in which respondents could select multiple answers (for 16 possible combinations, e.g. "Which color do you like?" can result in responses "red, blue, green, yellow" or "red, blue, green, black" etc.
These 16 possible combinations are contained in a spreadsheet:
Image 1: First two rows of the spreadsheet (full spreadsheet has 16 rows)
Example 1:
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("red", "ruby"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L), .Label = c("blue", "violet"), class = "factor"),
V3 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L), .Label = c("green", "turqoise"), class = "factor"),
V4 = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L), .Label = c("black", "yellow"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -16L
))
The dataframe with responses has sixteen columns for this question (one column per every simple combination of colors). If respondent 1 selected the first combination, only the first column contains data; similarly, if respondent 2 selected the second combination, the second column contains data. The other are empty:
Image 2: The first two columns of the dataframe
Example 2:
structure(list(respondentID = 1:16, v1 = c(1L, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v2 = c(NA, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v3 = c(NA,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA),
v4 = c(NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
NA, NA, NA), v5 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), v6 = c(NA, 1L, NA, NA, NA, NA, NA,
NA, NA, 1L, NA, NA, NA, NA, NA, NA), v7 = c(NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v8 = c(NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), v9 = c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
NA, NA, NA, NA), v10 = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), v11 = c(NA, NA, NA, NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA), v12 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA
), v13 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 1L, NA, NA), v14 = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), v15 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v16 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L
)), .Names = c("respondentID", "v1", "v2", "v3", "v4", "v5",
"v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
"v16"), class = "data.frame", row.names = c(NA, -16L))
(Of course, in practice respondent 1 didn't necessarily choose combination 1).
All the information in the dataframe is the number "1", which corresponds to appropriate combination in the spreadsheet.
In order to analyze responses to the question, I need to extract the combination from the spreadsheet and import it into the dataframe with responses, so that I get four new columns in the dataframe with the combination of colors chosen by a respondent (e.g. red, blue, green, yellow for respondent 1).
I don't think there's any way to do this using apply, so I guess I need to write a for loop to extract and import the data. Any advice on how to do this?
If you put the second data frame into a long shape, you can filter for just the combinations each person chose, and then join the second data frame with the first. The two data frames have combination labels that can be reconciled between the two to join on.
Note that I changed the column names in the first data frame, df1_with_id, to be color1, etc, only because otherwise you would have v1, v2, ... in one data frame, and V1, V2, ... representing something different in the other. Not a necessary change, but it's good to keep from confusing what different variables mean.
library(tidyverse)
df1 <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("red", "ruby"), class = "factor"),V2 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,1L, 1L, 2L, 2L, 2L, 2L), .Label = c("blue", "violet"), class = "factor"),V3 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,2L, 2L, 1L, 1L, 2L, 2L), .Label = c("green", "turqoise"), class = "factor"),V4 = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 1L, 2L, 1L), .Label = c("black", "yellow"), class = "factor")), .Names = c("V1","V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -16L))
df2 <- structure(list(respondentID = 1:16, v1 = c(1L, NA, NA, NA, NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v2 = c(NA, 1L, NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v3 = c(NA,NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA),v4 = c(NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,NA, NA, NA), v5 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, NA, NA, NA), v6 = c(NA, 1L, NA, NA, NA, NA, NA,NA, NA, 1L, NA, NA, NA, NA, NA, NA), v7 = c(NA, NA, NA, NA,1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v8 = c(NA,NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v9 = c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA,NA, NA, NA, NA), v10 = c(NA, NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, NA, NA, NA, NA, NA), v11 = c(NA, NA, NA, NA,NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA), v12 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA), v13 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,NA, 1L, NA, NA), v14 = c(NA, NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, NA, NA, NA, NA, NA), v15 = c(NA, NA, NA, NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v16 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L)), .Names = c("respondentID", "v1", "v2", "v3", "v4", "v5","v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15","v16"), class = "data.frame", row.names = c(NA, -16L))
df1_with_id <- df1 %>%
setNames(paste0("color", 1:4)) %>%
mutate(combo = paste0("v", row_number()))
head(df1_with_id)
#> color1 color2 color3 color4 combo
#> 1 red blue green yellow v1
#> 2 red blue green black v2
#> 3 red blue turqoise yellow v3
#> 4 red blue turqoise black v4
#> 5 red violet green yellow v5
#> 6 red violet green black v6
df2 %>%
gather(key = combo, value = val, -respondentID) %>%
filter(!is.na(val)) %>%
left_join(df1_with_id, by = "combo")
#> respondentID combo val color1 color2 color3 color4
#> 1 1 v1 1 red blue green yellow
#> 2 2 v2 1 red blue green black
#> 3 7 v3 1 red blue turqoise yellow
#> 4 4 v4 1 red blue turqoise black
#> 5 11 v4 1 red blue turqoise black
#> 6 12 v4 1 red blue turqoise black
#> 7 3 v5 1 red violet green yellow
#> 8 2 v6 1 red violet green black
#> 9 10 v6 1 red violet green black
#> 10 5 v7 1 red violet turqoise yellow
#> 11 6 v8 1 red violet turqoise black
#> 12 8 v9 1 ruby blue green yellow
#> 13 9 v11 1 ruby blue turqoise yellow
#> 14 13 v12 1 ruby blue turqoise black
#> 15 14 v13 1 ruby violet green yellow
#> 16 16 v16 1 ruby violet turqoise black
Created on 2018-05-08 by the reprex package (v0.2.0).
I'm not sure to understand what you want to do. Using the tidyverse packages and the melt() function of the reshape2 package, you might try
df_respondent_combination <-
df_respondent %>%
melt(measure.vars = c(2:ncol(.)), na.rm = T) %>%
cbind(df_combination) %>%
select(-variable, -value) %>%
arrange(respondentID)
With df_respondent_combination the new dataframe expected, df_respondent your example 2 and df_combination your example 1.

How to create a function for multiple columns of a Data Frame in R

I am trying to create a function for making tables using columns of data frame:
Freq_table=function(x){
x<-factor(x)
T<-table(STI_IPD$Q19_1,x,exclude = NULL)
T<- data.frame(T)
library(reshape2)
T_x<-dcast(T, Var1~Var2)
T_x<-T_x%>%select(-starts_with("NA"),-ends_with("NA"))
}
here STI_IPD is my Dataframe, and x should be any column which I'm using to create tables with another column Q19_1
This is throwing error:
Error in FUN(X[[i]], ...) : object 'Var2' not found
Data.frame(T) output is:
Var1 Var2 Freq
1 Consumer Goods 1 1
2 Life Sciences 1 0
3 Chemicals 1 0
4 Other Manufacturing 1 0
5 High Tech 1 0
6 Energy 1 0
7 Mining & Metals 1 0
8 Retail & Wholesale 1 0
9 Banking/Financial Services 1 0
10 Insurance/Reinsurance 1 0
11 Services (Non-Financial) 1 0
12 Logistics 1 0
13 Other Non-Manufacturing 1 0
14 Consumer Goods <NA> 1
15 Life Sciences <NA> 1
16 Chemicals <NA> 1
17 Other Manufacturing <NA> 4
18 High Tech <NA> 1
19 Energy <NA> 5
20 Mining & Metals <NA> 0
21 Retail & Wholesale <NA> 1
22 Banking/Financial Services <NA> 5
23 Insurance/Reinsurance <NA> 3
24 Services (Non-Financial) <NA> 5
25 Logistics <NA> 2
26 Other Non-Manufacturing <NA> 3
output of dput(head(STI_IPD, 30)) is below:
structure(list(Q18_1 = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), Q19_1 = structure(c(9L, 13L, 1L, 9L,
2L, 6L, 4L, 13L, 9L, 11L, 12L, 4L, 10L, 10L, 11L, 1L, 13L, 11L,
3L, 6L, 5L, 6L, 6L, 8L, 11L, 12L, 4L, 11L, 4L, 10L), .Label = c("Consumer Goods",
"Life Sciences", "Chemicals", "Other Manufacturing", "High Tech",
"Energy", "Mining & Metals", "Retail & Wholesale", "Banking/Financial Services",
"Insurance/Reinsurance", "Services (Non-Financial)", "Logistics",
"Other Non-Manufacturing"), class = "factor"), Q46_21_4 = c(NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA, NA, NA), Q46_21_5 = c(NA,
NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, 1L,
NA, NA, NA, NA, NA, 1L, 1L, NA, 1L, NA, NA, NA, 1L), Q46_21_6 = c(NA,
NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_21_7 = c(NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_22_4 = c(NA,
NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, NA, 1L, NA, NA, NA), Q46_22_5 = c(1L,
1L, 1L, NA, 1L, NA, 1L, NA, NA, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_22_6 = c(NA,
NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_22_7 = c(NA,
NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_23_4 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA), Q46_23_5 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L), Q46_23_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L), Q46_23_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L), Q46_24_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_24_5 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
1L, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_24_6 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
1L, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_24_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
1L, NA, NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_25_4 = c(1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 1L, NA,
NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA, NA, NA), Q46_25_5 = c(1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA, NA, 1L), Q46_25_6 = c(1L,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA, NA, 1L), Q46_25_7 = c(1L,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA, 1L), Q46_26_4 = c(1L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, 1L,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_26_5 = c(1L,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_26_6 = c(1L,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_26_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA, 1L,
1L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_27_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L), Q46_27_5 = c(NA,
1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_27_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_27_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_28_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), Q46_28_5 = c(NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, 1L), Q46_28_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, 1L), Q46_28_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, 1L), Q46_29_4 = c(NA,
NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), Q46_29_5 = c(NA,
1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_29_6 = c(NA,
NA, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_29_7 = c(NA,
NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_30_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), Q46_30_5 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, 1L, NA,
1L, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_30_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, 1L, NA,
1L, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_30_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA,
NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_31_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), Q46_31_5 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, NA,
NA, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_31_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_31_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_32_4 = c(NA,
1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_32_5 = c(NA,
1L, 1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_32_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_32_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_33_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_33_5 = c(NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, NA), Q46_33_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, NA), Q46_33_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA), Q46_34_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_34_5 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA), Q46_34_6 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA), Q46_34_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Q18_1",
"Q19_1", "Q46_21_4", "Q46_21_5", "Q46_21_6", "Q46_21_7", "Q46_22_4",
"Q46_22_5", "Q46_22_6", "Q46_22_7", "Q46_23_4", "Q46_23_5", "Q46_23_6",
"Q46_23_7", "Q46_24_4", "Q46_24_5", "Q46_24_6", "Q46_24_7", "Q46_25_4",
"Q46_25_5", "Q46_25_6", "Q46_25_7", "Q46_26_4", "Q46_26_5", "Q46_26_6",
"Q46_26_7", "Q46_27_4", "Q46_27_5", "Q46_27_6", "Q46_27_7", "Q46_28_4",
"Q46_28_5", "Q46_28_6", "Q46_28_7", "Q46_29_4", "Q46_29_5", "Q46_29_6",
"Q46_29_7", "Q46_30_4", "Q46_30_5", "Q46_30_6", "Q46_30_7", "Q46_31_4",
"Q46_31_5", "Q46_31_6", "Q46_31_7", "Q46_32_4", "Q46_32_5", "Q46_32_6",
"Q46_32_7", "Q46_33_4", "Q46_33_5", "Q46_33_6", "Q46_33_7", "Q46_34_4",
"Q46_34_5", "Q46_34_6", "Q46_34_7"), class = c("data.table",
"data.frame"), row.names = c(NA, -30L), .internal.selfref = <pointer: 0x0000000000090788>)
Maybe something like the following.
library(reshape2)
library(tidyverse)
Freq_table <- function(x){
dat <- data.frame(Q19_1 = STI_IPD$Q19_1, STI_IPD[[x]])
names(dat)[2] <- x
m <- melt(dat, id.vars = "Q19_1")
result <- tryCatch(dcast(m, Q19_1 ~ variable), error = function(e) message(e))
result <- result %>% select(-starts_with("NA"),-ends_with("NA"))
result
}
Freq_table("Q46_22_5")
Freq_table("Q46_34_4")
Note that you pass to the function the names of the columns you want, not the columns themselves.
EDIT.
To answer to a request of the OP in a comment, the following code will apply the function above to all but the two first columns of the input dataframe STI_IPD and then merge all the results into one df. The Reduce/mergecode is the answer by Hong Ooi to this question.
lst <- lapply(names(STI_IPD[-(1:2)]), Freq_table)
lst <- lst[!sapply(lst, is.null)]
merge.all <- function(x, y) {
merge(x, y, all = TRUE, by = "Q19_1")
}
output <- Reduce(merge.all, lst)

Split dataframes according to a vector of positions

I want to split a dataframe into a list of 6 different-sized dataframes. The positions where to split I specify by a vector, posns below. I've tried using split but instead of the desired output I get 6 equally-sized dataframes.
How can I do this?
posns = c(4,50,68,81,90)
df1 = structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), snp_pos = c(948921L,
949608L, 949654L, 1227249L, 1254841L, 1262966L, 1263144L, 1263362L,
1288583L, 1455652L, 1571066L, 1571464L, 1571470L, 1571802L, 1599812L,
1599888L, 1630271L, 1647814L, 1647814L, 1647871L, 1647871L, 1650787L,
1650787L, 1650797L, 1650797L, 1650801L, 1650801L, 1650807L, 1650807L,
1650845L, 1650845L, 1670432L, 1670432L, 1670432L, 1671087L, 1671087L,
1671087L, 1683565L, 1683565L, 1683565L, 1684169L, 1684169L, 1684169L,
1684472L, 1684472L, 1684472L, 1686040L, 1686040L, 1686040L, 1718435L,
1718435L, 2125172L, 2441358L, 2488153L, 2490942L, 2494330L, 2494785L,
3545250L, 3551792L, 6694574L, 6694927L, 6695331L, 7841330L, 8022824L,
8412935L, 8412989L, 8413839L, 8425900L, 9811541L, 10218439L,
10240094L, 10473196L, 10473200L, 10479791L, 10708142L, 11082919L,
11114822L, 11114940L, 11132217L, 11736131L, 11810354L, 11847759L,
11983206L, 11985396L, 12009956L, 12012753L, 12024235L, 12025648L,
12071680L, 16890415L, 16890421L, 16890428L, 16890441L, 16890558L,
16890559L, 16891333L, 16891340L, 16891365L, 16893721L, 16893736L
), Q.x = c(0.741961301980865, 1, 0.720109026807207, 0.000379926095791477,
1, 0.569157762597131, 0.0448134555282655, 0.263705838768648,
1, 3.9401608189424e-08, NA, NA, NA, NA, 0.141036658207429, 4.84068069656854e-08,
4.43661413003932e-11, 0.916059828440023, 0.916059828440023, 0.659922962581594,
0.659922962581594, 0.413553370535633, 0.413553370535633, 0.714246817533455,
0.714246817533455, 0.721981775878533, 0.721981775878533, 1, 1,
0.0014954358811119, 0.0014954358811119, 8.83093446255536e-14,
8.83093446255536e-14, 8.83093446255536e-14, 0.281581364975761,
0.281581364975761, 0.281581364975761, 1, 1, 1, 1, 1, 1, 0.0415833199080577,
0.0415833199080577, 0.0415833199080577, 0.0446393461337085, 0.0446393461337085,
0.0446393461337085, NA, NA, 0.0955715926532034, 0.538378452872325,
0.0534014601577661, 0.335721613890647, 0.10791993889237, 0.856046745017246,
0.0630351159601902, 0.00172714428632725, 0.440712852235607, 0.00599466402196809,
0.0572560467887719, NA, NA, 4.15876549078e-05, NA, 0.0198308292795067,
0.201292584136377, NA, 1, 0.227189739568257, 0.00172103054903301,
0.0031569678468897, 0.112209415561467, 0.214802908052941, 5.08875303388692e-05,
NA, NA, NA, NA, NA, 0.0165387785489721, 0.0124037431571059, 7.3978214204246e-34,
0.326191223745559, NA, 0.0701742102840443, NA, 0.351069598560997,
2.3479965234952e-12, 3.46177972593879e-06, 1, 0.0974541314547016,
1, 7.50982175368481e-08, 0.000151416356355741, 1, 1.76165018835578e-17,
3.10961711424869e-22, 8.29863562348751e-23), Q.y = c(NA, NA,
NA, NA, NA, NA, NA, NA, 1, NA, 0.192489461231087, 0.00296682751485515,
0.000175053346844423, 0.0013758526261836, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.160250237971167,
0.817597720785312, NA, NA, NA, 0.926435352180301, NA, NA, 0.95226758057333,
NA, NA, NA, NA, 0.646154538622465, 0.747932105441424, 0.539645992048171,
1, 1, NA, NA, NA, NA, 1, 0.387507157909907, 0.827583128653863,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q = c(NA, 1,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1,
1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.988419768874236,
NA, 0.05888784043377, 0.65213668882967, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("chrom",
"snp_pos", "Q.x", "Q.y", "Q"), row.names = c(NA, 100L), class = "data.frame")
Or a possible vectorized approach utilizing the findInterval function
res <- split(df1, findInterval(1:nrow(df1), posns + 1))
Validating results
lapply(res, dim)
# $`0`
# [1] 4 5
#
# $`1`
# [1] 46 5
#
# $`2`
# [1] 18 5
#
# $`3`
# [1] 13 5
#
# $`4`
# [1] 9 5
#
# $`5`
# [1] 10 5
Try this:
#positions
posns <- c(4,50,68,81,90)
#add last position
posns <- c(posns, nrow(df1))
#make start end positions
x <- cbind(c(0,head(posns,-1))+1, posns)
x
# posns
# [1,] 1 4
# [2,] 5 50
# [3,] 51 68
# [4,] 69 81
# [5,] 82 90
# [6,] 91 100
res <- lapply(1:nrow(x),
function(i) df1[ x[i, 1]:x[i, 2], ]
)
#check result
lapply(res, dim)
# [[1]]
# [1] 4 5
#
# [[2]]
# [1] 46 5
#
# [[3]]
# [1] 18 5
#
# [[4]]
# [1] 13 5
#
# [[5]]
# [1] 9 5
#
# [[6]]
# [1] 10 5

Is it possible to get a p-value for nodes in a categorical tree analysis with R?

Is it possible to get a p-value for nodes in a categorical tree analysis with R? I am using rpart and can't locate a p-value for each node. Maybe this is only possible with a regression and not categories.
structure(list(subj = c(702L, 702L, 702L, 702L, 702L, 702L, 702L,
702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L,
702L, 702L, 702L, 702L, 702L, 702L), visit = c(4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L), run = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("A", "B", "C", "D", "E", "xdur", "xend60", "xpre"
), class = "factor"), ho = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), hph = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), longexer = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("10min", "60min"), class = "factor"),
esq_sick = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA), esq_sick2 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), ll_sick = c(NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA,
0L, NA, NA, NA, NA, NA), ll_sick2 = c(NA, NA, 0L, NA, NA,
NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA,
NA, NA, NA, NA), esq_01 = c(NA, NA, 2L, NA, NA, NA, NA, NA,
NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
NA), esq_02 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 2L,
NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_03 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), esq_04 = c(NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA), esq_05 = c(NA, NA, 0L, NA, NA, NA, NA,
NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA,
NA, NA), esq_06 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA),
esq_07 = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_08 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), esq_09 = c(NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA), esq_10 = c(NA, NA, 0L, NA, NA, NA, NA,
NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA,
NA, NA)), .Names = c("subj", "visit", "run", "ho", "hph",
"longexer", "esq_sick", "esq_sick2", "ll_sick", "ll_sick2", "esq_01",
"esq_02", "esq_03", "esq_04", "esq_05", "esq_06", "esq_07", "esq_08",
"esq_09", "esq_10"), row.names = 7:30, class = "data.frame")
alldata = read.table('symptomology CSV2.csv',header=TRUE,sep=",")
library(rpart)
fit <- rpart(esq_sick2~esq_01_bin + esq_02_bin + esq_03_bin + esq_04_bin + esq_05_bin + esq_06_bin + esq_07_bin + esq_08_bin + esq_09_bin + esq_10_bin + esq_11_bin + esq_12_bin + esq_13_bin + esq_14_bin + esq_15_bin + esq_16_bin + esq_17_bin + esq_18_bin + esq_19_bin + esq_20_bin, method="class", data=alldata)
plot(fit, uniform = FALSE, branch = 1, compress = FALSE, nspace, margin = 0.1, minbranch = 0.3)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
Here's an example that might help you. I'm using the built-in airquality data set and the example provided in the help for ctree:
library(partykit)
# For the sctest function to extract p-values (see help for ctree and sctest)
library(strucchange)
# Data we'll use
airq <- subset(airquality, !is.na(Ozone))
# Build the tree
airct <- ctree(Ozone ~ ., data = airq)
Look at the tree:
airct
Model formula:
Ozone ~ Solar.R + Wind + Temp + Month + Day
Fitted party:
[1] root
| [2] Temp <= 82
| | [3] Wind <= 6.9: 55.600 (n = 10, err = 21946.4)
| | [4] Wind > 6.9
| | | [5] Temp <= 77: 18.479 (n = 48, err = 3956.0)
| | | [6] Temp > 77: 31.143 (n = 21, err = 4620.6)
| [7] Temp > 82
| | [8] Wind <= 10.3: 81.633 (n = 30, err = 15119.0)
| | [9] Wind > 10.3: 48.714 (n = 7, err = 1183.4)
Extract the p-values:
sctest(airct)
$`1`
Solar.R Wind Temp Month Day
statistic 13.34761286 4.161370e+01 5.608632e+01 3.1126596 0.02011554
p.value 0.00129309 5.560572e-10 3.468337e-13 0.3325881 0.99998175
$`2`
Solar.R Wind Temp Month Day
statistic 5.4095322 12.968549828 11.298951405 0.2148961 2.970294
p.value 0.0962041 0.001582833 0.003871534 0.9941976 0.357956
$`3`
NULL
$`4`
Solar.R Wind Temp Month Day
statistic 9.547191843 2.307676 11.598966936 0.06604893 0.2513143
p.value 0.009972755 0.497949 0.003295072 0.99965679 0.9916670
$`5`
Solar.R Wind Temp Month Day
statistic 6.14094026 1.3865355 1.9986304 0.8268341 1.3580462
p.value 0.06432172 0.7447599 0.5753799 0.8952749 0.7528481
$`6`
Solar.R Wind Temp Month Day
statistic 5.1824354 0.02060939 0.9270013 0.165171 4.6220522
p.value 0.1089932 0.99998062 0.8705785 0.996871 0.1481643
$`7`
Solar.R Wind Temp Month Day
statistic 0.8083249 11.711564549 6.77148538 0.1307643 0.03992875
p.value 0.8996614 0.003101788 0.04546281 0.9982052 0.99990034
$`8`
Solar.R Wind Temp Month Day
statistic 0.9056479 3.1585094 2.9285252 0.008106707 0.008686293
p.value 0.8759687 0.3247585 0.3657072 0.999998099 0.999997742
$`9`
NULL

Resources