Arranging and grouping data in R [duplicate]

I have the following sample data, which I have provided here using dput(). It includes values for countries over several years:
structure(list(ï..Country = c("Kyrgyzstan", "North Vietnam",
"Slovakia", "Belgian Congo", "Barbados", "Netherlands Antilles",
"Bosnia and Herzegovina", "Federated States of Micronesia", "Kuwait",
"Russian Federation"), X1949 = c(NA, NA, 1L, 1L, NA, NA, NA,
NA, NA, NA), X1950 = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X1951 = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), X1952 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), X1953 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), X1954 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA), X1955 = c(NA, NA, 1L, NA,
NA, 1L, NA, NA, NA, NA), X1956 = c(NA, NA, 4L, NA, NA, NA, NA,
NA, NA, NA), X1957 = c(NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA
), X1958 = c(NA, NA, 3L, NA, NA, NA, NA, NA, NA, NA), X1959 = c(NA,
NA, NA, NA, 1L, 1L, NA, NA, NA, NA), X1960 = c(NA, NA, 3L, NA,
NA, 1L, NA, NA, NA, NA), X1961 = c(NA, NA, 1L, NA, NA, 2L, NA,
NA, NA, NA), X1962 = c(NA, NA, 1L, NA, NA, 1L, NA, NA, 1L, NA
), X1963 = c(NA, NA, 1L, NA, NA, 3L, NA, NA, NA, NA), X1964 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), X1965 = c(NA,
1L, 1L, NA, NA, 2L, NA, NA, NA, NA), X1966 = c(NA, 1L, NA, NA,
NA, 1L, NA, NA, NA, NA), X1967 = c(NA, NA, NA, NA, 9L, NA, NA,
NA, NA, NA), X1968 = c(NA, NA, 1L, NA, 1L, 1L, NA, NA, NA, NA
), X1969 = c(NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA), X1970 = c(NA,
1L, NA, NA, 8L, 2L, NA, NA, NA, NA), X1971 = c(NA, NA, NA, NA,
6L, 2L, NA, NA, NA, NA), X1972 = c(NA, 1L, NA, NA, 3L, 2L, NA,
NA, NA, NA), X1973 = c(NA, NA, NA, NA, 14L, 1L, NA, 1L, NA, NA
), X1974 = c(NA, 1L, NA, NA, 8L, 1L, NA, NA, NA, NA), X1975 = c(NA,
NA, NA, NA, 6L, 1L, NA, NA, NA, NA), X1976 = c(NA, NA, NA, NA,
7L, 1L, NA, NA, NA, NA), X1977 = c(NA, NA, NA, NA, 27L, 1L, NA,
NA, 2L, NA), X1978 = c(NA, NA, NA, NA, 31L, 2L, NA, NA, 1L, NA
), X1979 = c(NA, 1L, NA, NA, 12L, NA, NA, NA, NA, NA), X1980 = c(NA,
NA, NA, NA, 12L, 1L, NA, NA, 2L, NA), X1981 = c(NA, 3L, NA, 1L,
15L, NA, NA, NA, 1L, NA), X1982 = c(NA, NA, NA, 1L, 21L, 1L,
NA, NA, 2L, NA), X1983 = c(NA, NA, NA, NA, 19L, NA, NA, NA, NA,
NA), X1984 = c(NA, NA, NA, NA, 14L, NA, NA, NA, 1L, NA), X1985 = c(NA,
NA, NA, NA, 19L, NA, NA, NA, NA, NA), X1986 = c(NA, NA, NA, NA,
10L, 1L, NA, NA, 1L, NA), X1987 = c(NA, NA, NA, NA, 8L, 2L, NA,
NA, NA, NA), X1988 = c(NA, NA, NA, NA, 7L, 1L, NA, NA, NA, NA
), X1989 = c(NA, NA, NA, NA, 3L, 2L, NA, NA, 1L, NA), X1990 = c(NA,
NA, NA, 1L, 1L, 1L, NA, NA, 1L, NA), X1991 = c(NA, NA, NA, 1L,
2L, NA, NA, NA, 1L, NA), X1992 = c(NA, NA, NA, NA, 4L, NA, NA,
NA, 2L, NA), X1993 = c(NA, 1L, NA, 2L, 1L, NA, NA, 1L, 8L, NA
), X1994 = c(NA, NA, NA, NA, 2L, NA, NA, NA, 9L, NA), X1995 = c(NA,
10L, NA, NA, NA, NA, NA, NA, 7L, NA), X1996 = c(NA, 14L, NA,
1L, 2L, NA, NA, NA, 10L, NA), X1997 = c(NA, 6L, NA, 1L, 1L, 3L,
NA, NA, 28L, NA), X1998 = c(NA, 3L, 1L, NA, 1L, NA, NA, NA, 44L,
NA), X1999 = c(2L, 16L, 2L, 3L, 1L, NA, NA, NA, 102L, NA), X2000 = c(NA,
23L, 2L, NA, NA, 1L, NA, NA, 55L, NA), X2001 = c(NA, 31L, NA,
1L, NA, NA, NA, NA, 92L, NA), X2002 = c(2L, 5L, 1L, NA, NA, NA,
NA, NA, 63L, NA), X2003 = c(1L, NA, 9L, NA, 1L, NA, NA, NA, 48L,
NA), X2004 = c(7L, NA, 25L, 1L, 1L, 1L, NA, NA, 69L, NA), X2005 = c(7L,
NA, 16L, NA, 4L, 1L, NA, NA, 57L, NA), X2006 = c(3L, NA, 12L,
1L, 1L, NA, NA, NA, 74L, NA), X2007 = c(3L, NA, 17L, NA, 1L,
4L, NA, NA, 51L, NA), X2008 = c(NA, NA, 5L, NA, NA, NA, NA, NA,
21L, NA), X2009 = c(1L, NA, 3L, NA, 2L, 4L, NA, NA, 17L, NA),
X2010 = c(NA, NA, 3L, NA, 1L, NA, NA, NA, 22L, NA), X2011 = c(6L,
NA, 8L, NA, 2L, NA, NA, NA, 20L, NA), X2012 = c(1L, NA, 5L,
NA, 3L, 1L, NA, NA, 22L, NA), X2013 = c(2L, NA, 9L, NA, NA,
1L, NA, NA, 18L, NA), X2014 = c(4L, NA, 14L, NA, 5L, NA,
3L, NA, 11L, 9L), X2015 = c(2L, NA, 17L, NA, 2L, NA, 3L,
NA, 16L, 10L), X2016 = c(4L, NA, 19L, NA, 1L, 5L, 2L, 2L,
18L, 29L), X2017 = c(4L, NA, 12L, 1L, 6L, NA, 5L, NA, 28L,
27L), X2018 = c(1L, NA, 16L, 1L, 2L, NA, 1L, NA, 27L, 34L
), X2019 = c(7L, NA, 14L, NA, 4L, NA, 2L, NA, 28L, 36L),
X2020 = c(2L, NA, 8L, NA, 2L, NA, 4L, NA, 14L, 43L)), row.names = c(155L,
204L, 261L, 27L, 22L, 190L, 36L, 94L, 153L, 244L), class = "data.frame")
The current format, as you can see in the data above, looks like this:
           1949  1950  1951
Country A     1     2     0
Country B     0     1     3
Country C     1     0     2
and so on.
I want the data in the following format:
Year  Country  value
1949  A        1
1950  A        2
1951  A        0
Do I need to arrange and then group by year? Any help is appreciated.

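You can pivot to long format with tidyr::pivot_longer():
library(tidyverse)
# Pivot every column except the first (the country name) into Year/value pairs,
# then strip the leading "X" from the year labels and convert them to numbers.
df2 <- pivot_longer(df, -1, names_to = 'Year') %>%
  rename(Country = ï..Country) %>%
  mutate(Year = as.numeric(substr(Year, 2, 5)))
df2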
#> # A tibble: 720 x 3
#> Country Year value
#> <chr> <dbl> <int>
#> 1 Kyrgyzstan 1949 NA
#> 2 Kyrgyzstan 1950 NA
#> 3 Kyrgyzstan 1951 NA
#> 4 Kyrgyzstan 1952 NA
#> 5 Kyrgyzstan 1953 NA
#> 6 Kyrgyzstan 1954 NA
#> 7 Kyrgyzstan 1955 NA
#> 8 Kyrgyzstan 1956 NA
#> 9 Kyrgyzstan 1957 NA
#> 10 Kyrgyzstan 1958 NA
#> # ... with 710 more rows
#> # i Use `print(n = ...)` to see more rows
ggplot(df2, aes(Year, value, color = Country)) + geom_line()
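As a variation on the answer above, the following sketch strips the X prefix and converts the year inside pivot_longer() itself, using its names_prefix and names_transform arguments (names_transform needs tidyr 1.1.0 or later). The three-country data frame wide is made up for illustration and is not the data from the question; values_drop_na = TRUE is an optional extra that drops the NA cells.

```r
library(tidyr)

# Hypothetical miniature of the wide data: one row per country, one column per year.
wide <- data.frame(
  Country = c("A", "B", "C"),
  X1949 = c(1L, NA, 1L),
  X1950 = c(2L, 1L, NA),
  X1951 = c(NA, 3L, 2L)
)

long <- pivot_longer(
  wide,
  cols = -Country,                            # pivot everything except the id column
  names_to = "Year",
  names_prefix = "X",                         # drop the "X" that R prepends to numeric names
  names_transform = list(Year = as.integer),  # "1949" -> 1949L
  values_to = "value",
  values_drop_na = TRUE                       # keep only observed country/year pairs
)

long
```

If the ï..Country name comes from a byte-order mark in a CSV file (an assumption, since the question does not say how the data was read), passing fileEncoding = "UTF-8-BOM" to read.csv() should avoid it at import time.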


Filling a data frame with dummy values based on a specific column in R [closed]

I have a data frame like this:
df <- data.frame(stringsAsFactors=FALSE,
member = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 3L, 5L),
q_c3_1 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
q_c4_1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
q_c5_1 = c(1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L,
1900L),
q_c6_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c7_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c3_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)
Based on the member variable, I need to fill the corresponding variables with dummy data. For example, if member == 2, then q_c3_2:q_c7_2 should get dummy values: q_c3 gets some character string such as "Arne", q_c4 gets 1, q_c5 gets 1900, and q_c6 and q_c7 get 0. If member == 3, then both q_c3_2:q_c7_2 and q_c3_3:q_c7_3 should get the same dummy values, and so on. How can I do this efficiently with the tidyverse? Thanks.
My desired output looks like this data frame:
df2 <- data.frame(stringsAsFactors=FALSE,
member = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 3L, 5L),
q_c3_1 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
q_c4_1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
q_c5_1 = c(1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L,
1900L),
q_c6_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c7_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c3_2 = c(NA, NA, "Arne", NA, NA, NA, NA, "Arne", "Arne", "Arne"),
q_c4_2 = c(NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L),
q_c5_2 = c(NA, NA, 1900L, NA, NA, NA, NA, 1900L, 1900L, 1900L),
q_c6_2 = c(NA, NA, 0L, NA, NA, NA, NA, 0L, 0L, 0L),
q_c7_2 = c(NA, NA, 0L, NA, NA, NA, NA, 0L, 0L, 0L),
q_c3_3 = c(NA, NA, NA, NA, NA, NA, NA, "Arne", "Arne", "Arne"),
q_c4_3 = c(NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L),
q_c5_3 = c(NA, NA, NA, NA, NA, NA, NA, 1900L, 1900L, 1900L),
q_c6_3 = c(NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 0L),
q_c7_3 = c(NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 0L),
q_c3_4 = c(NA, NA, NA, NA, NA, NA, NA, "Arne", NA, "Arne"),
q_c4_4 = c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L),
q_c5_4 = c(NA, NA, NA, NA, NA, NA, NA, 1900L, NA, 1900L),
q_c6_4 = c(NA, NA, NA, NA, NA, NA, NA, 0L, NA, 0L),
q_c7_4 = c(NA, NA, NA, NA, NA, NA, NA, 0L, NA, 0L),
q_c3_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, "Arne"),
q_c4_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L),
q_c5_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1900L),
q_c6_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L),
q_c7_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L)
)
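Assuming that it does not matter what the dummy values are, and using dplyr (with reshape2 for melt() and tidyr for spread()):
library(dplyr)
library(reshape2)  # melt()
library(tidyr)     # spread()
temp <- df %>%
  melt(id.vars = "member") %>%
  # the trailing digit of each column name is the slot that column belongs to
  mutate(compare = as.numeric(gsub("q_c\\d_(\\d)", "\\1", variable))) %>%
  # keep only the slots that should be filled for this member count
  filter(compare <= member) %>%
  mutate(value = "dummy",
         compare = NULL) %>%
  unique() %>%
  spread(variable, value)
df <- df %>%
  select(member) %>%
  left_join(., temp, by = "member")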
Edit: With dummy variables as requested.
library(dplyr)
library(reshape2)  # melt()
library(tidyr)     # spread()
temp <- df %>%
  melt(id.vars = "member") %>%
  # compare: slot number (trailing digit); dummy_match: variable number (digit after "q_c")
  mutate(compare = as.numeric(gsub("q_c\\d_(\\d)", "\\1", variable)),
         dummy_match = as.numeric(gsub("q_c(\\d)_\\d", "\\1", variable))) %>%
  filter(compare <= member) %>%
  mutate(value = case_when(dummy_match == 4 ~ 1,
                           dummy_match == 5 ~ 1900,
                           dummy_match >= 6 ~ 0,
                           TRUE ~ 9999),  # placeholder for q_c3, replaced below
         compare = NULL,
         dummy_match = NULL) %>%
  unique() %>%
  spread(variable, value)
df <- df %>%
  select(member) %>%
  left_join(., temp, by = "member")
df[df == 9999] <- "Arne"
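Note that both pipelines above rebuild every q_c column from member alone, so the original answers in the _1 columns are replaced as well. If those need to survive, one alternative is to fill only the NA cells in place. The following is a minimal base-R sketch of that idea, not the answerer's method; the three-row data frame and the dummy lookup list are made up for illustration.

```r
# Hypothetical miniature of the question's data: two variables, three slots.
df <- data.frame(
  member = c(1L, 2L, 3L),
  q_c3_1 = c("A", "B", "C"),
  q_c4_1 = c(1L, 1L, 1L),
  q_c3_2 = rep(NA_character_, 3),
  q_c4_2 = rep(NA_integer_, 3),
  q_c3_3 = rep(NA_character_, 3),
  q_c4_3 = rep(NA_integer_, 3),
  stringsAsFactors = FALSE
)

# Assumed lookup from variable stem to its dummy value.
dummy <- list(q_c3 = "Arne", q_c4 = 1L)

filled <- df
for (slot in 2:3) {
  for (stem in names(dummy)) {
    col <- paste0(stem, "_", slot)
    # Fill a cell only if the row has at least `slot` members and the cell is still NA.
    needs_fill <- filled$member >= slot & is.na(filled[[col]])
    filled[[col]][needs_fill] <- dummy[[stem]]
  }
}

filled
```

Because each column is assigned a value of its own type, the integer columns stay integer and no placeholder such as 9999 is needed.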

How to create a sequence column based on sequences' starts and ends

I have two columns that contain information about where sequences start and end. I want to create a sequence column from them: each sequence starts in a row where seq_start is 1 and ends in the first row, at or after that start, in which seq_end is 1. How can I do this with the tidyverse? The data is shown below, where seq is the expected output. Note that when seq_end = 1 and seq_start = 1 occur in the same row, this produces a sequence of length one.
structure(list(seq_start = c(NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA, NA, 1, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA,
NA, 1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1,
NA), seq_end = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
1L, 1L, 1L, NA, NA, 1L, 1L, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L,
1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA, NA, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, NA, 1L, 1L,
1L), seq = c(NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
NA, 3L, NA, NA, NA, NA, NA, NA, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
7L, 7L, 7L, 8L, NA, NA, NA, 9L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 10L, 10L, NA, NA, NA, NA, NA, NA, NA, 11L,
NA)), .Names = c("seq_start", "seq_end", "seq"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -60L))
Here's a solution that makes heavy use of the dplyr package's lag() function, along with cumsum() from base R, to produce the expected result. It's probably not the most succinct solution out there, but I think it's reasonably easy to follow (d is the data frame from the question):
library(dplyr)
d <- d %>%
  # new.seq.starts starts from 0 and increments by 1 every time seq_start takes on
  # the value 1, like this: 0, 0, 0, 1, 1, 1, 1, 2, 2, ...
  # Rows with the same new.seq.starts value are thus part of the same "run".
  mutate(new.seq.starts = cumsum(!is.na(seq_start))) %>%
  # group by each "run"
  group_by(new.seq.starts) %>%
  # any.ending.so.far counts how many seq_end == 1 rows have occurred within the run so far.
  # first.ending is TRUE only for the first row (within the run) that has an ending.
  mutate(any.ending.so.far = cumsum(!is.na(seq_end)),
         first.ending = any.ending.so.far == 1 &
           (is.na(lag(any.ending.so.far)) | lag(any.ending.so.far) < 1)) %>%
  ungroup() %>%
  # result keeps the new.seq.starts value only if the run has not ended yet (i.e.
  # any.ending.so.far == 0) or has only just ended (first.ending == TRUE). Otherwise,
  # it takes on the value NA.
  mutate(result = ifelse(new.seq.starts > 0 &
                           (any.ending.so.far == 0 | first.ending),
                         new.seq.starts, NA)) %>%
  # Remove helper variables as they are no longer needed.
  select(-c(new.seq.starts, any.ending.so.far, first.ending))
> all.equal(d$seq, d$result)
[1] TRUE
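To see the run-numbering idea from the answer in isolation, here is a small base-R sketch on a made-up six-row input (not the question's data). It uses ave() in place of group_by() and mutate(), and the variable names are illustrative only.

```r
# Made-up input: a start in row 2, and a start that is also an end in row 5.
seq_start <- c(NA, 1, NA, NA, 1, NA)
seq_end   <- c(NA, NA, 1, 1, 1, NA)

# Each non-NA start opens a new run; rows before the first start get run 0.
run_id <- cumsum(!is.na(seq_start))

# Number of endings seen so far within each run.
ends_so_far <- ave(as.integer(!is.na(seq_end)), run_id, FUN = cumsum)

# The first ending of a run is a row that is itself an ending and brings the count to 1.
first_end <- ends_so_far == 1 & !is.na(seq_end)

# Keep the run number up to and including its first ending; everything else is NA.
result <- ifelse(run_id > 0 & (ends_so_far == 0 | first_end), run_id, NA)

result
```

Here run_id is 0 1 1 1 2 2 and result is NA 1 1 NA 2 NA: the first sequence covers rows 2 and 3, and the second has length one because its start and end share row 5.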

"Error: Duplicate identifiers for rows" in a dataframe and some of its subsets, but not other subsets

I've encountered a rather puzzling tidyr::spread() error. When I ran my code (example below) on the full data frame, I got the "Duplicate identifiers for rows" error.
I subset the (very large) data frame to investigate and re-ran the code. This time it worked (see the subset1df dput below). Then I tried again with a different subset (subset2df) and got the error message again. I honestly have no idea how to make sense of this and would greatly appreciate any help.
Reproducible code below:
subset1df:
structure(list(v1 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, NA, NA), .Label = "2", class = "factor"),
v2 = structure(c(1L, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA), .Label = "2", class = "factor"),
v3 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v4 = structure(c(NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v5 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v6 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 1L), .Label = "2", class = "factor"),
v7 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v8 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v9 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA), .Label = "1", class = "factor"),
v10 = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v11 = structure(c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v12 = structure(c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v13 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, NA), .Label = "1", class = "factor"),
v14 = structure(c(NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA,
NA, NA, 1L, NA, NA, NA), .Label = c("1", "2"), class = "factor"),
v15 = structure(c(NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v16 = structure(c(NA, NA, 2L, NA, NA, 2L, NA, NA, NA, NA,
NA, NA, NA, NA, 1L, NA), .Label = c("1", "2"), class = "factor"),
respondentID = structure(c(7L, 7L, 7L, 5L, 6L, 6L, 4L, 4L,
4L, 3L, 3L, 3L, 2L, 2L, 2L, 1L), .Label = c("EO15", "EO17",
"EO19", "EO21", "Eo23", "EO23", "EO24"), class = "factor")), .Names = c("v1",
"v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10", "v11",
"v12", "v13", "v14", "v15", "v16", "respondentID"), row.names = c(NA,
-16L), class = "data.frame")
subset2df:
structure(list(v2 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v4 = structure(c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v5 = structure(c(NA, NA, NA, 2L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), .Label = c("1",
"2"), class = "factor"), v6 = structure(c(NA, 1L, NA, NA,
2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), .Label = c("1", "2"), class = "factor"),
v9 = structure(c(NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v11 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "1", class = "factor"),
v12 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, NA), .Label = c("1",
"2"), class = "factor"), v13 = structure(c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA,
NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 2L, NA, NA, NA), .Label = c("1", "2"), class = "factor"),
v14 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v15 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v16 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
respondentID = structure(c(21L, 20L, 20L, 19L, 18L, 18L,
1L, 1L, 16L, 16L, 16L, 10L, 10L, 17L, 15L, 15L, 15L, 14L,
14L, 14L, 13L, 12L, 12L, 11L, 11L, 11L, 8L, 9L, 9L, 6L, 7L,
7L, 3L, 2L, 3L, 4L, 4L, 4L, 5L), .Label = c("EO11", "Eo14",
"EO14", "EO16", "EO18", "EO26", "EO27", "Eo28", "EO28", "EO3",
"Eo30", "EO32", "EO331", "EO35", "EO37", "EO4", "EO41", "EO6",
"EO6 ", "EO7", "EO7 "), class = "factor")), .Names = c("v2",
"v4", "v5", "v6", "v9", "v11", "v12", "v13", "v14", "v15", "v16",
"respondentID"), row.names = c(NA, -39L), class = "data.frame")
combodf_id (needed to execute code):
structure(list(color1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("red", "ruby"
), class = "factor"), color2 = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("blue",
"violet"), class = "factor"), color3 = structure(c(1L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("green",
"turqoise"), class = "factor"), color4 = structure(c(2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("black",
"yellow"), class = "factor"), combo = c("v1", "v2", "v3", "v4",
"v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14",
"v15", "v16")), class = "data.frame", .Names = c("color1", "color2",
"color3", "color4", "combo"), row.names = c(NA, -16L))
code:
result_df <- subset1df %>%
  gather(key = combo, value = val, -respondentID) %>%
  filter(!is.na(val)) %>%
  left_join(combodf_id, by = "combo") %>%
  arrange(respondentID) %>%
  rename(rose_color1 = color1, rose_color2 = color2,
         tulip_color1 = color3, tulip_color2 = color4) %>%
  gather(color, value, rose_color1:tulip_color2) %>%
  separate(color, into = c('flower', 'color')) %>%
  spread(color, value) %>%
  mutate(val = if_else(val == 1, 'rose', 'tulip')) %>%
  mutate(val = if_else(val == flower, 1, 0)) %>%
  select(respondentID, flower, color1, color2, choice = val)
The solution by @Tung below is very similar to the one in "Spread with duplicate identifiers (using tidyverse and %>%)", but neither of them quite solves the problem.
In your subset2df:
You have removed some of the response columns (e.g. v1, v3, v7), which is probably why you have a number of rows/respondents without any response (all NA).
Your respondentID column has levels that differ only by whitespace (e.g. "EO7" and "EO7 "), which will probably cause trouble later on.
There are duplicated rows, e.g. subset2df[c(15, 17), ].
All of your columns are factors. That is particularly odd for the response columns, which hold integer values. The tidyr functions gather() and spread() take the levels of a factor into account, which can give surprising results when you have subset a factor variable and not all of its levels are represented in the data.
You should probably fix those problems first, because they are likely what leads to your problems later. However, the reason you get the error "Duplicate identifiers for rows" is that the data frame you pass to spread(color, value) has duplicate rows. You can force this to work by adding distinct() %>% one line before it, but be aware that you only have to do that because of the problems above.
subset2df %>%
  as_tibble() %>%
  # go through character so the factor labels, not the internal level codes, become the integers
  mutate_at(vars(v2:v16), ~ as.integer(as.character(.))) %>%
  gather(key = combo, value = val, -respondentID, na.rm = TRUE) %>%
  filter(!is.na(val)) %>%
  left_join(combodf_id, by = "combo") %>%
  arrange(respondentID) %>%
  rename(rose_color1 = color1, rose_color2 = color2,
         tulip_color1 = color3, tulip_color2 = color4) %>%
  gather(color, value, rose_color1:tulip_color2) %>%
  separate(color, into = c('flower', 'color')) %>%
  distinct() %>%  # drop the duplicate rows that make spread() fail
  spread(color, value) %>%
  mutate(val = if_else(val == 1, 'rose', 'tulip')) %>%
  mutate(val = if_else(val == flower, 1, 0)) %>%
  select(respondentID, flower, color1, color2, choice = val)
I would strongly suggest fixing all of the above problems first, though, like so (note that you won't need distinct() further down the chain, because you will already have applied it to the original data):
subset2df %>%
  as_tibble() %>% # tibble has better printing methods
  # convert responses to integer via character, so labels (not level codes) are used
  mutate_at(vars(-respondentID), ~ as.integer(as.character(.))) %>%
  mutate(respondentID = as.character(respondentID)) %>% # convert to char
  mutate(respondentID = trimws(respondentID)) %>% # remove whitespace
  distinct() %>% # remove duplicate rows
  gather(key = combo, value = val, -respondentID, na.rm = TRUE) %>%
  left_join(combodf_id, by = "combo") %>%
  mutate_at(vars(color1:color4), as.character) %>% # convert colors to char
  rename(rose_color1 = color1, rose_color2 = color2,
         tulip_color1 = color3, tulip_color2 = color4) %>%
  gather(color, value, rose_color1:tulip_color2) %>%
  separate(color, into = c('flower', 'color')) %>%
  spread(color, value) %>%
  mutate(val = if_else(val == 1, 'rose', 'tulip')) %>%
  mutate(val = if_else(val == flower, 1L, 0L)) %>%
  select(respondentID, flower, color1, color2, choice = val)
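The failure mode described above can be reproduced on a tiny made-up table (not the question's data). This sketch shows spread() refusing a key combination that occurs twice, how count() locates the offending combination, and how distinct() resolves it when the repeated rows are exact duplicates. The wording of the error message differs between tidyr versions, so the sketch only checks that an error is raised.

```r
library(dplyr)
library(tidyr)

# Made-up long table: the (id, key) pair ("a", "x") appears twice.
long <- tibble(
  id    = c("a", "a", "a", "b"),
  key   = c("x", "x", "y", "x"),
  value = c("red", "red", "blue", "green")
)

# spread() needs each id/key combination to identify exactly one value.
spread_failed <- inherits(try(spread(long, key, value), silent = TRUE), "try-error")

# Locate the combinations that occur more than once.
dupes <- long %>% count(id, key) %>% filter(n > 1)

# Exact duplicates can simply be dropped before spreading.
wide <- long %>% distinct() %>% spread(key, value)

wide
```

If the repeated rows carried different values instead, distinct() would not help, and the duplicates would have to be resolved in the source data first.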

How to create a function for multiple columns of a Data Frame in R

I am trying to create a function that builds tables from the columns of a data frame:
Freq_table=function(x){
x<-factor(x)
T<-table(STI_IPD$Q19_1,x,exclude = NULL)
T<- data.frame(T)
library(reshape2)
T_x<-dcast(T, Var1~Var2)
T_x<-T_x%>%select(-starts_with("NA"),-ends_with("NA"))
}
Here STI_IPD is my data frame, and x can be any column that I want to cross-tabulate with the column Q19_1.
This throws an error:
Error in FUN(X[[i]], ...) : object 'Var2' not found
The output of data.frame(T) is:
Var1 Var2 Freq
1 Consumer Goods 1 1
2 Life Sciences 1 0
3 Chemicals 1 0
4 Other Manufacturing 1 0
5 High Tech 1 0
6 Energy 1 0
7 Mining & Metals 1 0
8 Retail & Wholesale 1 0
9 Banking/Financial Services 1 0
10 Insurance/Reinsurance 1 0
11 Services (Non-Financial) 1 0
12 Logistics 1 0
13 Other Non-Manufacturing 1 0
14 Consumer Goods <NA> 1
15 Life Sciences <NA> 1
16 Chemicals <NA> 1
17 Other Manufacturing <NA> 4
18 High Tech <NA> 1
19 Energy <NA> 5
20 Mining & Metals <NA> 0
21 Retail & Wholesale <NA> 1
22 Banking/Financial Services <NA> 5
23 Insurance/Reinsurance <NA> 3
24 Services (Non-Financial) <NA> 5
25 Logistics <NA> 2
26 Other Non-Manufacturing <NA> 3
The output of dput(head(STI_IPD, 30)) is below:
structure(list(Q18_1 = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), Q19_1 = structure(c(9L, 13L, 1L, 9L,
2L, 6L, 4L, 13L, 9L, 11L, 12L, 4L, 10L, 10L, 11L, 1L, 13L, 11L,
3L, 6L, 5L, 6L, 6L, 8L, 11L, 12L, 4L, 11L, 4L, 10L), .Label = c("Consumer Goods",
"Life Sciences", "Chemicals", "Other Manufacturing", "High Tech",
"Energy", "Mining & Metals", "Retail & Wholesale", "Banking/Financial Services",
"Insurance/Reinsurance", "Services (Non-Financial)", "Logistics",
"Other Non-Manufacturing"), class = "factor"), Q46_21_4 = c(NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA, NA, NA), Q46_21_5 = c(NA,
NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, 1L,
NA, NA, NA, NA, NA, 1L, 1L, NA, 1L, NA, NA, NA, 1L), Q46_21_6 = c(NA,
NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_21_7 = c(NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_22_4 = c(NA,
NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, NA, 1L, NA, NA, NA), Q46_22_5 = c(1L,
1L, 1L, NA, 1L, NA, 1L, NA, NA, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_22_6 = c(NA,
NA, 1L, NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_22_7 = c(NA,
NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_23_4 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA), Q46_23_5 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L), Q46_23_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, 1L, 1L, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L), Q46_23_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L), Q46_24_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_24_5 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
1L, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_24_6 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
1L, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_24_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
1L, NA, NA, 1L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_25_4 = c(1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 1L, NA,
NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA, NA, NA), Q46_25_5 = c(1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA, NA, 1L), Q46_25_6 = c(1L,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA, NA, 1L), Q46_25_7 = c(1L,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA, 1L), Q46_26_4 = c(1L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, 1L,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_26_5 = c(1L,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_26_6 = c(1L,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_26_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA, 1L,
1L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_27_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L), Q46_27_5 = c(NA,
1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_27_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_27_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_28_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), Q46_28_5 = c(NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, 1L), Q46_28_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, 1L), Q46_28_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA,
NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, 1L), Q46_29_4 = c(NA,
NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), Q46_29_5 = c(NA,
1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_29_6 = c(NA,
NA, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_29_7 = c(NA,
NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_30_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), Q46_30_5 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, 1L, NA,
1L, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_30_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, 1L, NA,
1L, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_30_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA,
NA, 1L, NA, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_31_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), Q46_31_5 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, NA,
NA, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_31_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_31_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, 1L, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_32_4 = c(NA,
1L, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_32_5 = c(NA,
1L, 1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_32_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, 1L), Q46_32_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 1L), Q46_33_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_33_5 = c(NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, NA), Q46_33_6 = c(NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, 1L, NA,
1L, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, NA), Q46_33_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA), Q46_34_4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Q46_34_5 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA), Q46_34_6 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA), Q46_34_7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Q18_1",
"Q19_1", "Q46_21_4", "Q46_21_5", "Q46_21_6", "Q46_21_7", "Q46_22_4",
"Q46_22_5", "Q46_22_6", "Q46_22_7", "Q46_23_4", "Q46_23_5", "Q46_23_6",
"Q46_23_7", "Q46_24_4", "Q46_24_5", "Q46_24_6", "Q46_24_7", "Q46_25_4",
"Q46_25_5", "Q46_25_6", "Q46_25_7", "Q46_26_4", "Q46_26_5", "Q46_26_6",
"Q46_26_7", "Q46_27_4", "Q46_27_5", "Q46_27_6", "Q46_27_7", "Q46_28_4",
"Q46_28_5", "Q46_28_6", "Q46_28_7", "Q46_29_4", "Q46_29_5", "Q46_29_6",
"Q46_29_7", "Q46_30_4", "Q46_30_5", "Q46_30_6", "Q46_30_7", "Q46_31_4",
"Q46_31_5", "Q46_31_6", "Q46_31_7", "Q46_32_4", "Q46_32_5", "Q46_32_6",
"Q46_32_7", "Q46_33_4", "Q46_33_5", "Q46_33_6", "Q46_33_7", "Q46_34_4",
"Q46_34_5", "Q46_34_6", "Q46_34_7"), class = c("data.table",
"data.frame"), row.names = c(NA, -30L), .internal.selfref = <pointer: 0x0000000000090788>)
Maybe something like the following.
library(reshape2)
library(tidyverse)
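Freq_table <- function(x){
  # x is a column name (a character string), not the column itself
  dat <- data.frame(Q19_1 = STI_IPD$Q19_1, STI_IPD[[x]])
  names(dat)[2] <- x
  m <- melt(dat, id.vars = "Q19_1")
  # message() returns NULL, so a failed dcast() leaves result as NULL
  result <- tryCatch(dcast(m, Q19_1 ~ variable), error = function(e) message(e))
  # return NULL early, because select() cannot be applied to NULL
  if (is.null(result)) return(NULL)
  result %>% select(-starts_with("NA"), -ends_with("NA"))
}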
Freq_table("Q46_22_5")
Freq_table("Q46_34_4")
Note that you pass the function the names of the columns you want (as character strings), not the columns themselves.
EDIT.
In response to a request from the OP in a comment, the following code applies the function above to every column of the input data frame STI_IPD except the first two, then merges all the results into one data frame. The Reduce/merge code is from the answer by Hong Ooi to this question.
lst <- lapply(names(STI_IPD[-(1:2)]), Freq_table)
lst <- lst[!sapply(lst, is.null)]
merge.all <- function(x, y) {
merge(x, y, all = TRUE, by = "Q19_1")
}
output <- Reduce(merge.all, lst)
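To see what the Reduce/merge step does in isolation, here is a minimal sketch on three small, made-up data frames (the values and the names a, b, d and merge_all are hypothetical, chosen only for illustration). Each has the key column Q19_1 plus one other column, which is the shape Freq_table is assumed to return.

```r
# Three toy frequency tables sharing the key column Q19_1 (made-up values)
a <- data.frame(Q19_1 = c(1, 2), Q46_22_5 = c(3, 1))
b <- data.frame(Q19_1 = c(2, 3), Q46_23_5 = c(4, 2))
d <- data.frame(Q19_1 = c(1, 3), Q46_24_5 = c(5, 6))

# Full outer join on Q19_1, so keys missing from one table get NA
merge_all <- function(x, y) merge(x, y, all = TRUE, by = "Q19_1")

# Reduce folds the list pairwise: merge(merge(a, b), d)
combined <- Reduce(merge_all, list(a, b, d))
print(combined)
```

Because all = TRUE is used, every value of Q19_1 that appears in any table is kept, and the cells a table does not cover are filled with NA.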

Is it possible to get a p-value for nodes in a categorical tree analysis with R?

Is it possible to get a p-value for nodes in a categorical tree analysis with R? I am using rpart and can't locate a p-value for each node. Maybe this is only possible with a regression and not categories.
structure(list(subj = c(702L, 702L, 702L, 702L, 702L, 702L, 702L,
702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L,
702L, 702L, 702L, 702L, 702L, 702L), visit = c(4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L), run = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("A", "B", "C", "D", "E", "xdur", "xend60", "xpre"
), class = "factor"), ho = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), hph = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), longexer = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("10min", "60min"), class = "factor"),
esq_sick = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA), esq_sick2 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), ll_sick = c(NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA,
0L, NA, NA, NA, NA, NA), ll_sick2 = c(NA, NA, 0L, NA, NA,
NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA,
NA, NA, NA, NA), esq_01 = c(NA, NA, 2L, NA, NA, NA, NA, NA,
NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
NA), esq_02 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 2L,
NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_03 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), esq_04 = c(NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA), esq_05 = c(NA, NA, 0L, NA, NA, NA, NA,
NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA,
NA, NA), esq_06 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA),
esq_07 = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_08 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), esq_09 = c(NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA), esq_10 = c(NA, NA, 0L, NA, NA, NA, NA,
NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA,
NA, NA)), .Names = c("subj", "visit", "run", "ho", "hph",
"longexer", "esq_sick", "esq_sick2", "ll_sick", "ll_sick2", "esq_01",
"esq_02", "esq_03", "esq_04", "esq_05", "esq_06", "esq_07", "esq_08",
"esq_09", "esq_10"), row.names = 7:30, class = "data.frame")
alldata = read.table('symptomology CSV2.csv',header=TRUE,sep=",")
library(rpart)
fit <- rpart(esq_sick2~esq_01_bin + esq_02_bin + esq_03_bin + esq_04_bin + esq_05_bin + esq_06_bin + esq_07_bin + esq_08_bin + esq_09_bin + esq_10_bin + esq_11_bin + esq_12_bin + esq_13_bin + esq_14_bin + esq_15_bin + esq_16_bin + esq_17_bin + esq_18_bin + esq_19_bin + esq_20_bin, method="class", data=alldata)
plot(fit, uniform = FALSE, branch = 1, compress = FALSE, nspace, margin = 0.1, minbranch = 0.3)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
Here's an example that might help you. I'm using the built-in airquality data set and the example provided in the help for ctree:
library(partykit)
# For the sctest function to extract p-values (see help for ctree and sctest)
library(strucchange)
# Data we'll use
airq <- subset(airquality, !is.na(Ozone))
# Build the tree
airct <- ctree(Ozone ~ ., data = airq)
Look at the tree:
airct
Model formula:
Ozone ~ Solar.R + Wind + Temp + Month + Day
Fitted party:
[1] root
| [2] Temp <= 82
| | [3] Wind <= 6.9: 55.600 (n = 10, err = 21946.4)
| | [4] Wind > 6.9
| | | [5] Temp <= 77: 18.479 (n = 48, err = 3956.0)
| | | [6] Temp > 77: 31.143 (n = 21, err = 4620.6)
| [7] Temp > 82
| | [8] Wind <= 10.3: 81.633 (n = 30, err = 15119.0)
| | [9] Wind > 10.3: 48.714 (n = 7, err = 1183.4)
Extract the p-values:
sctest(airct)
$`1`
              Solar.R         Wind         Temp     Month        Day
statistic 13.34761286 4.161370e+01 5.608632e+01 3.1126596 0.02011554
p.value    0.00129309 5.560572e-10 3.468337e-13 0.3325881 0.99998175

$`2`
            Solar.R         Wind         Temp     Month      Day
statistic 5.4095322 12.968549828 11.298951405 0.2148961 2.970294
p.value   0.0962041  0.001582833  0.003871534 0.9941976 0.357956

$`3`
NULL

$`4`
              Solar.R     Wind         Temp      Month       Day
statistic 9.547191843 2.307676 11.598966936 0.06604893 0.2513143
p.value   0.009972755 0.497949  0.003295072 0.99965679 0.9916670

$`5`
             Solar.R      Wind      Temp     Month       Day
statistic 6.14094026 1.3865355 1.9986304 0.8268341 1.3580462
p.value   0.06432172 0.7447599 0.5753799 0.8952749 0.7528481

$`6`
            Solar.R       Wind      Temp    Month       Day
statistic 5.1824354 0.02060939 0.9270013 0.165171 4.6220522
p.value   0.1089932 0.99998062 0.8705785 0.996871 0.1481643

$`7`
            Solar.R         Wind       Temp     Month        Day
statistic 0.8083249 11.711564549 6.77148538 0.1307643 0.03992875
p.value   0.8996614  0.003101788 0.04546281 0.9982052 0.99990034

$`8`
            Solar.R      Wind      Temp       Month         Day
statistic 0.9056479 3.1585094 2.9285252 0.008106707 0.008686293
p.value   0.8759687 0.3247585 0.3657072 0.999998099 0.999997742

$`9`
NULL
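
The list returned by sctest has one entry per node: a matrix with a statistic row and a p.value row, or NULL where no test was reported. As a rough sketch of how to reduce that to one line per node, the code below hard-codes the entries for nodes 1, 2 and 3 from the output above (so that it runs without partykit installed) and picks the variable with the smallest p-value in each tested node. The names tests, tested and node_summary are made up for this illustration. With the real object you would start from tests <- sctest(airct) instead.

```r
# Entries for nodes 1-3, copied by hand from the sctest(airct) output above
tests <- list(
  `1` = rbind(
    statistic = c(Solar.R = 13.34761286, Wind = 41.61370, Temp = 56.08632,
                  Month = 3.1126596, Day = 0.02011554),
    p.value   = c(0.00129309, 5.560572e-10, 3.468337e-13, 0.3325881, 0.99998175)
  ),
  `2` = rbind(
    statistic = c(Solar.R = 5.4095322, Wind = 12.968549828, Temp = 11.298951405,
                  Month = 0.2148961, Day = 2.970294),
    p.value   = c(0.0962041, 0.001582833, 0.003871534, 0.9941976, 0.357956)
  ),
  `3` = NULL
)

# Drop nodes for which no test results were reported
tested <- Filter(Negate(is.null), tests)

# For each remaining node, report the variable with the smallest p-value
node_summary <- data.frame(
  node     = names(tested),
  variable = sapply(tested, function(m) colnames(m)[which.min(m["p.value", ])]),
  p.value  = sapply(tested, function(m) min(m["p.value", ])),
  row.names = NULL
)
print(node_summary)
```

For these two nodes the variable with the smallest p-value (Temp in node 1, Wind in node 2) is the one the printed tree splits on.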
