plm regression with fixed effects on two variables in R
I have the following simplified df:
problem <- data.frame(
stringsAsFactors = FALSE,
fkeycompany = c("0000001961",
"0000003570","0000003570","0000003570",
"0000003570","0000003570","0000003570",
"0000003570","0000004187","0000004187","0000004187",
"0000004187","0000016058","0000022872",
"0000022872","0000022872","0000022872","0000024071",
"0000050471","0000052971","0000052971",
"0000056679","0000058592","0000058592","0000058592",
"0000063330","0000099047","0000099047",
"0000099047","0000316206","0000316537",
"0000319697","0000351917","0000351917","0000351917",
"0000356037","0000356037","0000356037",
"0000700815","0000700815","0000700815","0000700815",
"0000704415","0000704415","0000704415",
"0000705003","0000720154","0000720154","0000720154",
"0000720154"),
fiscalyear = c(2018,2002,
2002,2004,2006,2007,2007,2014,2005,2005,
2009,2017,2003,2002,2004,2004,2010,2002,
2016,2008,2008,2002,2005,2005,2010,2014,
2000,2005,2005,2002,2002,2001,2005,2005,
2006,2007,2012,2015,2006,2006,2007,2008,
2003,2014,2014,2000,2004,2006,2008,2013),
zmijewskiscore = c(-0.295998372490631,-3.0604522838509,-3.0604522838509,
-9.70437199970406,-0.836774487816746,
0.500903351523752,0.500903351523752,-1.29210741224579,
-1.96529713996165,-1.96529713996165,
-1.60831783946871,-2.12343231229296,-3.99767761748961,
0.561261861396196,4.13793269655047,4.13793269655047,
5.61803398400963,-0.000195582736436772,
-3.93766039340527,-0.540037039625719,
-0.540037039625719,-1.93767533120689,-4.54446419505987,
-4.54446419505987,1.94389244672183,
0.941272649148121,-3.88427264672157,-0.342812414189714,
-0.342812414189714,-1.35074505582686,
-4.52746658422071,-0.130671284507204,-0.223517713694019,
-0.223517713694019,0.0149617517859735,
-2.95100357094774,-2.55146691134187,-1.86846592111008,
2.92283100206773,2.92283100206773,
4.65325023636937,6.1585365469118,-4.54449586848866,
-1.49969162335521,-1.49969162335521,-3.34071706450412,
-1.72382101559976,-1.53076052307727,
-1.77582320023177,-1.57280701642882),
lloss = c(0,1,1,1,1,
1,1,1,0,0,0,1,0,0,1,1,1,1,0,1,1,
1,0,0,1,0,0,1,1,0,0,1,1,1,1,0,0,
1,1,1,1,1,0,1,1,0,1,1,1,0),
GCO_prev = c(1,1,1,0,0,
0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,0),
GCO = c(0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,
0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,
0,0,0,1,1,0,0,0,0,0,0,0,0),
industry = c(9,5,5,5,5,
5,5,5,6,6,6,6,9,9,9,9,9,6,9,6,6,
9,8,8,8,8,9,9,9,9,8,9,5,5,5,9,9,
9,6,6,6,6,9,9,9,9,9,9,9,9))
I would like to run a plm regression on this with fixed effects on year and industry.
library(plm)
summary(plm(GCO ~ GCO_prev + lloss + zmijewskiscore, index=c("fiscalyear", "industry"), data=problem, model="within" ))
However, I get this error while running:
Error in pdim.default(index[[1L]], index[[2L]]) :
duplicate couples (id-time)
In addition: Warning message:
In pdata.frame(data, index) :
duplicate couples (id-time) in resulting pdata.frame
to find out which, use, e.g., table(index(your_pdataframe), useNA = "ifany")
I do not quite know how to fix this. If I understand correctly, the problem is that several companies (fkeycompany) can share the same industry and fiscal year. For example, for industry = 9 and fiscalyear = 2003 there are two rows, from companies 0000016058 and 0000704415. I believe the same happens for more industries and years in my main dataset. Is that the cause, and how do I fix this error message?
Also, apart from this issue, is the code I am running correct? Am I really regressing with year and industry fixed effects?
Given your data, the observational unit of the panel is the firm (fkeycompany). You might want to add industry as another fixed effect, but it is certainly not the time index (the time index goes into the second slot of the index argument, and I will assume it is fiscalyear). There are plenty of questions with answers on this topic. Also, do read the package's first vignette, where the data specification for the index argument is explained.
I suggest converting to a pdata.frame first.
However, there are duplicate combinations of fkeycompany and fiscalyear; see the code below, where every value > 1 in the output of table points to such a combination.
library(plm)
pdat.problem <- pdata.frame(problem, index = c("fkeycompany", "fiscalyear"))
#> Warning in pdata.frame(problem, index = c("fkeycompany", "fiscalyear")): duplicate couples (id-time) in resulting pdata.frame
#> to find out which, use, e.g., table(index(your_pdataframe), useNA = "ifany")
table(index(pdat.problem), useNA = "ifany")
#> fiscalyear
#> fkeycompany 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2012 2013
#> 0000001961 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000003570 0 0 2 0 1 0 1 2 0 0 0 0 0
#> 0000004187 0 0 0 0 0 2 0 0 0 1 0 0 0
#> 0000016058 0 0 0 1 0 0 0 0 0 0 0 0 0
#> 0000022872 0 0 1 0 2 0 0 0 0 0 1 0 0
#> 0000024071 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000050471 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000052971 0 0 0 0 0 0 0 0 2 0 0 0 0
#> 0000056679 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000058592 0 0 0 0 0 2 0 0 0 0 1 0 0
#> 0000063330 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000099047 1 0 0 0 0 2 0 0 0 0 0 0 0
#> 0000316206 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000316537 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000319697 0 1 0 0 0 0 0 0 0 0 0 0 0
#> 0000351917 0 0 0 0 0 2 1 0 0 0 0 0 0
#> 0000356037 0 0 0 0 0 0 0 1 0 0 0 1 0
#> 0000700815 0 0 0 0 0 0 2 1 1 0 0 0 0
#> 0000704415 0 0 0 1 0 0 0 0 0 0 0 0 0
#> 0000705003 1 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000720154 0 0 0 0 1 0 1 0 1 0 0 0 1
#> fiscalyear
#> fkeycompany 2014 2015 2016 2017 2018
#> 0000001961 0 0 0 0 1
#> 0000003570 1 0 0 0 0
#> 0000004187 0 0 0 1 0
#> 0000016058 0 0 0 0 0
#> 0000022872 0 0 0 0 0
#> 0000024071 0 0 0 0 0
#> 0000050471 0 0 1 0 0
#> 0000052971 0 0 0 0 0
#> 0000056679 0 0 0 0 0
#> 0000058592 0 0 0 0 0
#> 0000063330 1 0 0 0 0
#> 0000099047 0 0 0 0 0
#> 0000316206 0 0 0 0 0
#> 0000316537 0 0 0 0 0
#> 0000319697 0 0 0 0 0
#> 0000351917 0 0 0 0 0
#> 0000356037 0 1 0 0 0
#> 0000700815 0 0 0 0 0
#> 0000704415 2 0 0 0 0
#> 0000705003 0 0 0 0 0
#> 0000720154 0 0 0 0 0
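The duplicated firm-year rows can also be listed directly instead of being read off the table. A minimal base R sketch, assuming (the question does not confirm this) that the repeated rows are exact copies that can safely be dropped; the small data frame toy is a made-up stand-in for problem:

```r
# Hypothetical stand-in for `problem`: firm "A" appears twice in 2002.
toy <- data.frame(
  fkeycompany = c("A", "A", "A", "B"),
  fiscalyear  = c(2002, 2002, 2004, 2002),
  GCO         = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)

keys <- toy[c("fkeycompany", "fiscalyear")]

# Every row whose (firm, year) pair occurs more than once, all copies included
dup_rows <- toy[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]

# Keep only the first row of each (firm, year) pair
toy_unique <- toy[!duplicated(keys), ]
```

If the duplicated rows differ in other columns, which one to keep (or how to aggregate them) is a modelling decision rather than a cleanup step.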
Once fixed, you will be able to run a model along these lines.
For a time-fixed-effects model:
model <- plm(GCO ~ GCO_prev + lloss + zmijewskiscore, data = pdat.problem, model="within", effect = "time")
Or time-fixed-effects with industry as an additional fixed effect:
model2 <- plm(GCO ~ GCO_prev + lloss + zmijewskiscore + factor(industry), data = pdat.problem, model="within", effect = "time")
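If the goal is simply year and industry fixed effects, as asked in the question, the same idea can be sketched without any panel index by entering both as dummy variables in lm (least-squares dummy variables). The simulated data and all its values are invented for illustration only; x and y are placeholder names, not columns from the question.

```r
set.seed(42)  # simulated data, for illustration only
n <- 200
sim <- data.frame(
  fiscalyear = sample(2000:2004, n, replace = TRUE),
  industry   = sample(c(5, 6, 8, 9), n, replace = TRUE),
  x          = rnorm(n)
)
# Outcome with a true slope of 2 on x plus year and industry shifts
sim$y <- 2 * sim$x + 0.5 * (sim$fiscalyear - 2000) + 0.3 * sim$industry + rnorm(n)

# Year and industry enter as dummy variables, i.e. as fixed effects
fit <- lm(y ~ x + factor(fiscalyear) + factor(industry), data = sim)
coef(fit)["x"]  # slope on x, net of year and industry effects
```

This does not replace the firm-level panel setup above; it only shows what "fixed effects on year and industry" amounts to in a plain regression.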
Related
Multiplying multiple columns with each other into a new dataframe in R
I want to multiply many of my binary variables into new columns, so called interactive variables. My dataset is structured like this:
YearCountry <- data.frame(
  Time = c("2000","2001", "2002", "2003", "2000","2001", "2002", "2003", "2000","2001", "2002", "2003"),
  AL = c(1,1,1,1,0,0,0,0,0,0,0,0),
  FR = c(0,0,0,0,1,1,1,1,0,0,0,0),
  UK = c(0,0,0,0,0,0,0,0,1,1,1,1),
  Y2000d = c(1,0,0,0,1,0,0,0,1,0,0,0),
  Y2001d = c(0,1,0,0,0,1,0,0,0,1,0,0),
  Y2002d = c(0,0,1,0,0,0,1,0,0,0,1,0),
  Y2003d = c(0,0,0,1,0,0,0,1,0,0,0,1))
YearCountry
   Time AL FR UK Y2000d Y2001d Y2002d Y2003d
1  2000  1  0  0      1      0      0      0
2  2001  1  0  0      0      1      0      0
3  2002  1  0  0      0      0      1      0
4  2003  1  0  0      0      0      0      1
5  2000  0  1  0      1      0      0      0
6  2001  0  1  0      0      1      0      0
7  2002  0  1  0      0      0      1      0
8  2003  0  1  0      0      0      0      1
9  2000  0  0  1      1      0      0      0
10 2001  0  0  1      0      1      0      0
11 2002  0  0  1      0      0      1      0
12 2003  0  0  1      0      0      0      1
I need to multiply the binary variable for each of the countries (AL,FR,UK) with each of the binary variables for a given year so that I get #country x #year new variables. In this case I have three countries and four years which gives 12 new variables. My full data contains 105 countries/regions and stretches over twenty years. I therefore need a general formula.
I want data that looks like this
Interact <- data.frame(Time = c("2000","2001", "2002", "2003", "2000","2001", "2002", "2003", "2000","2001", "2002", "2003"),
  Y2000xAL = c(1,0,0,0,0,0,0,0,0,0,0,0),
  Y2001xAL = c(0,1,0,0,0,0,0,0,0,0,0,0),
  Y2002xAL = c(0,0,1,0,0,0,0,0,0,0,0,0),
  Y2003xAL = c(0,0,0,1,0,0,0,0,0,0,0,0),
  Y2000xFR = c(0,0,0,0,1,0,0,0,0,0,0,0),
  Y2001xFR = c(0,0,0,0,0,1,0,0,0,0,0,0),
  Y2002xFR = c(0,0,0,0,0,0,1,0,0,0,0,0),
  Y2003xFR = c(0,0,0,0,0,0,0,1,0,0,0,0),
  Y2000xUk = c(0,0,0,0,0,0,0,0,1,0,0,0),
  Y2001xUK = c(0,0,0,0,0,0,0,0,0,1,0,0),
  Y2002xUK = c(0,0,0,0,0,0,0,0,0,0,1,0),
  Y2003xUK = c(0,0,0,0,0,0,0,0,0,0,0,1))
Interact
   Time Y2000xAL Y2001xAL Y2002xAL Y2003xAL Y2000xFR Y2001xFR Y2002xFR Y2003xFR Y2000xUk Y2001xUK Y2002xUK Y2003xUK
1  2000        1        0        0        0        0        0        0        0        0        0        0        0
2  2001        0        1        0        0        0        0        0        0        0        0        0        0
3  2002        0        0        1        0        0        0        0        0        0        0        0        0
4  2003        0        0        0        1        0        0        0        0        0        0        0        0
5  2000        0        0        0        0        1        0        0        0        0        0        0        0
6  2001        0        0        0        0        0        1        0        0        0        0        0        0
7  2002        0        0        0        0        0        0        1        0        0        0        0        0
8  2003        0        0        0        0        0        0        0        1        0        0        0        0
9  2000        0        0        0        0        0        0        0        0        1        0        0        0
10 2001        0        0        0        0        0        0        0        0        0        1        0        0
11 2002        0        0        0        0        0        0        0        0        0        0        1        0
12 2003        0        0        0        0        0        0        0        0        0        0        0        1
Here's an approach with dplyr::across. We can make the final result into a plain data.frame with purrr::invoke as demonstrated in this answer. (str_replace comes from stringr, so that package is loaded as well.)
library(dplyr)
library(purrr)
library(stringr)
YearCountry %>%
  mutate(across(AL:UK, ~ . * select(cur_data(), Y2000d:Y2003d))) %>%
  select(-(Y2000d:Y2003d)) %>%
  invoke(.f = data.frame) %>%
  rename_with(~str_replace(.,"\\.",""))
   Time ALY2000d ALY2001d ALY2002d ALY2003d FRY2000d FRY2001d FRY2002d FRY2003d UKY2000d UKY2001d UKY2002d UKY2003d
1  2000        1        0        0        0        0        0        0        0        0        0        0        0
2  2001        0        1        0        0        0        0        0        0        0        0        0        0
3  2002        0        0        1        0        0        0        0        0        0        0        0        0
4  2003        0        0        0        1        0        0        0        0        0        0        0        0
5  2000        0        0        0        0        1        0        0        0        0        0        0        0
6  2001        0        0        0        0        0        1        0        0        0        0        0        0
7  2002        0        0        0        0        0        0        1        0        0        0        0        0
8  2003        0        0        0        0        0        0        0        1        0        0        0        0
9  2000        0        0        0        0        0        0        0        0        1        0        0        0
10 2001        0        0        0        0        0        0        0        0        0        1        0        0
11 2002        0        0        0        0        0        0        0        0        0        0        1        0
12 2003        0        0        0        0        0        0        0        0        0        0        0        1
1) model.matrix
We split the names by the number of characters in them (the countries have 2 characters in their names and the years have 6) and paste pluses in each. (Alternately use Plus(grep("^..$", nms, value = TRUE)) to get the country names and use that in place of spl["2"] and similarly Plus(grep("^Y....d$", nms, value = TRUE)) in place of spl["6"].)
c(`2` = "AL+FR+UK", `6` = "Y2000d+Y2001d+Y2002d+Y2003d")
and from that the formula:
~(AL + FR + UK):(Y2000d + Y2001d + Y2002d + Y2003d) + 0
and then compute its model matrix. The formula could also be expanded to one accepted by lm by modifying the sprintf format so we might not even need to create the model matrix. For example, if we had a response vector R then we could write:
s <- sprintf("R ~ (%s)*(%s)", spl["2"], spl["6"]); fo <- formula(s); lm(fo, YearCountry)
to include all variables and the interactions of countries and year as well as an intercept.
Plus <- function(x) paste(x, collapse = "+")
nms <- names(YearCountry)[-1]
spl <- sapply(split(nms, nchar(nms)), Plus)
s <- sprintf("~ (%s):(%s)+0", spl["2"], spl["6"])
fo <- formula(s)
model.matrix(fo, YearCountry)
giving this matrix:
   AL:Y2000d AL:Y2001d AL:Y2002d AL:Y2003d FR:Y2000d FR:Y2001d FR:Y2002d FR:Y2003d UK:Y2000d UK:Y2001d UK:Y2002d UK:Y2003d
1          1         0         0         0         0         0         0         0         0         0         0         0
2          0         1         0         0         0         0         0         0         0         0         0         0
3          0         0         1         0         0         0         0         0         0         0         0         0
4          0         0         0         1         0         0         0         0         0         0         0         0
5          0         0         0         0         1         0         0         0         0         0         0         0
6          0         0         0         0         0         1         0         0         0         0         0         0
7          0         0         0         0         0         0         1         0         0         0         0         0
8          0         0         0         0         0         0         0         1         0         0         0         0
9          0         0         0         0         0         0         0         0         1         0         0         0
10         0         0         0         0         0         0         0         0         0         1         0         0
11         0         0         0         0         0         0         0         0         0         0         1         0
12         0         0         0         0         0         0         0         0         0         0         0         1
attr(,"assign")
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Alternately we can write it compactly like this:
Plus <- function(x) paste(x, collapse = "+")
nms <- names(YearCountry)
s <- sprintf("~ (%s):(%s)+0", Plus(nms[2:4]), Plus(nms[5:8]))
fo <- formula(s)
model.matrix(fo, YearCountry)
2) eList
Another approach is to use list comprehensions. With the eList package we can do this:
library(eList)
DF(for(i in YearCountry[2:4]) for(j in YearCountry[5:8]) i*j)
giving this data frame. Use as.matrix(...) on it if you want a matrix.
   AL.Y2000d AL.Y2001d AL.Y2002d AL.Y2003d FR.Y2000d FR.Y2001d FR.Y2002d FR.Y2003d UK.Y2000d UK.Y2001d UK.Y2002d UK.Y2003d
1          1         0         0         0         0         0         0         0         0         0         0         0
2          0         1         0         0         0         0         0         0         0         0         0         0
3          0         0         1         0         0         0         0         0         0         0         0         0
4          0         0         0         1         0         0         0         0         0         0         0         0
5          0         0         0         0         1         0         0         0         0         0         0         0
6          0         0         0         0         0         1         0         0         0         0         0         0
7          0         0         0         0         0         0         1         0         0         0         0         0
8          0         0         0         0         0         0         0         1         0         0         0         0
9          0         0         0         0         0         0         0         0         1         0         0         0
10         0         0         0         0         0         0         0         0         0         1         0         0
11         0         0         0         0         0         0         0         0         0         0         1         0
12         0         0         0         0         0         0         0         0         0         0         0         1
3) listcompr
listcompr is another list comprehension package. Note that the development version of this package is needed in order to use bycol=. Replace gen.named.matrix with gen.named.data.frame if you want a data frame.
# devtools::install_github("patrickroocks/listcompr")
library(listcompr)
nms <- names(YearCountry)
gen.named.matrix("{nms[i]}.{nms[j]}", YearCountry[[i]] * YearCountry[[j]], i = 2:4, j = 5:8, bycol = TRUE)
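For comparison, the same products can be built in base R without extra packages. This is only a sketch: it picks the year columns by the pattern Y<year>d and names the results as in the desired output; the helper names grid and cols are mine, not from the question.

```r
YearCountry <- data.frame(
  Time   = rep(c("2000", "2001", "2002", "2003"), times = 3),
  AL     = rep(c(1, 0, 0), each = 4),
  FR     = rep(c(0, 1, 0), each = 4),
  UK     = rep(c(0, 0, 1), each = 4),
  Y2000d = rep(c(1, 0, 0, 0), times = 3),
  Y2001d = rep(c(0, 1, 0, 0), times = 3),
  Y2002d = rep(c(0, 0, 1, 0), times = 3),
  Y2003d = rep(c(0, 0, 0, 1), times = 3)
)

countries <- c("AL", "FR", "UK")
years <- grep("^Y[0-9]{4}d$", names(YearCountry), value = TRUE)

# One row per (year, country) pair; the year varies fastest
grid <- expand.grid(year = years, country = countries, stringsAsFactors = FALSE)

# Multiply each year dummy with each country dummy
cols <- Map(function(y, cn) YearCountry[[y]] * YearCountry[[cn]],
            grid$year, grid$country)
names(cols) <- paste0(sub("d$", "", grid$year), "x", grid$country)

Interact <- cbind(YearCountry["Time"], as.data.frame(cols))
```

With 105 countries, countries would be replaced by the vector of actual country column names; nothing else in the sketch depends on the number of countries or years.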
How to write nested for loop so that the inner loop does not overwrite the first loop values
New poster on Stackoverflow but long time viewer. I could not find any previous posts that get at my specific question. Basically, I am struggling with how to make use of a nested for loop for my problem. The issue is that the number of variables and outcomes will change with the use case, so I want a solution that is flexible for various permutations. I am not sure that apply would help me because I don't know in advance how many variables and outcomes will exist in any given use case. The goal is to classify whether the outcome is correctly predicted by the variable (tp = true positive, etc). The problem is that the inner loop causes the outer loop values to be overwritten, but what I want is for each outcome to be evaluated over each variable once independently. Not sure what the best way to do this is and any advice appreciated.
#Repex code
#Generate variable
variable <- c(1,2,3)
df <- as.data.frame(matrix(0, ncol = 0, nrow = 30))
for(i in 1:length(variable)){
  df[,c(paste0("variable",variable[i]))]<-as.vector(sample(c(0,1), replace=TRUE, size=30))
}
df
#Generate outcome
outcome <- c(1,2,3)
df2 <- as.data.frame(matrix(0, ncol = 0, nrow = 30))
for(i in 1:length(outcome)){
  df2[,c(paste0("outcome",outcome[i]))]<-as.vector(sample(c(0,1), replace=TRUE, size=30))
}
df2
#Generate performance metrics of outcome and predictor
for (i in variable){
  for(j in 1:length(df2)){
    df[, c(paste0("tp.",variable[i]))] <- as.vector(ifelse(df[, c(paste0("variable",variable[i]))]==1 & df2[j]==1,1,0))
    df[, c(paste0("tn.",variable[i]))] <- as.vector(ifelse(df[, c(paste0("variable",variable[i]))]==0 & df2[j]==0,1,0))
    df[, c(paste0("fp.",variable[i]))] <- as.vector(ifelse(df[, c(paste0("variable",variable[i]))]==1 & df2[j]==0,1,0))
    df[, c(paste0("fn.",variable[i]))] <- as.vector(ifelse(df[, c(paste0("variable",variable[i]))]==0 & df2[j]==1,1,0))
  }
}
df
#bind the data for comparison and spot checking
df3 <- cbind(df2,df)
#here we see that only the final inner loop data are correct
df3
The problem is that you have 3 different variables that you want to compare to 3 different outcomes, so you are making 9 comparisons. However, since you are labelling your columns only according to the variable, you only have three unique numeric suffixes (one for each value of i) pasted on to each statistic (tp, tn, fp and fn). You therefore only have 12 distinct column names. At no point are you labelling the columns according to both the variable and the outcome. That means that every time your inner loop increments to the next outcome variable, you are over-writing the column in df that you wrote in the previous iteration of the loop.
In any case, how would you intend to keep track of which comparison you are making unless you use both the variable number and the outcome number to label your columns?
So you could do it this way:
for (i in variable) {
  V <- c(paste0("variable", i))
  for(j in seq_along(df2)) {
    comp <- paste0(i, ".vs.", j)
    df[paste0("tp.", comp)] <- as.numeric(df[V] == 1 & df2[j] == 1)
    df[paste0("tn.", comp)] <- as.numeric(df[V] == 0 & df2[j] == 0)
    df[paste0("fp.", comp)] <- as.numeric(df[V] == 1 & df2[j] == 0)
    df[paste0("fn.", comp)] <- as.numeric(df[V] == 0 & df2[j] == 1)
  }
}
df3 <- cbind(df2, df)
Which will give you the structure you were looking for. It's a large data frame, so we'll just peek at it with str:
str(df3)
#> 'data.frame': 30 obs. of 42 variables:
#> $ outcome1 : num 0 1 1 1 1 0 1 0 1 1 ...
#> $ outcome2 : num 0 0 0 0 1 0 0 1 1 1 ...
#> $ outcome3 : num 1 1 0 0 0 0 0 0 1 0 ...
#> $ variable1: num 0 1 0 0 1 1 0 1 1 1 ...
#> $ variable2: num 1 1 0 0 0 0 0 0 0 0 ...
#> $ variable3: num 1 0 0 0 1 0 0 1 0 0 ...
#> $ tp.1.vs.1: num 0 1 0 0 1 0 0 0 1 1 ...
#> $ tn.1.vs.1: num 1 0 0 0 0 0 0 0 0 0 ...
#> $ fp.1.vs.1: num 0 0 0 0 0 1 0 1 0 0 ...
#> $ fn.1.vs.1: num 0 0 1 1 0 0 1 0 0 0 ...
#> $ tp.1.vs.2: num 0 0 0 0 1 0 0 1 1 1 ...
#> $ tn.1.vs.2: num 1 0 1 1 0 0 1 0 0 0 ...
#> $ fp.1.vs.2: num 0 1 0 0 0 1 0 0 0 0 ...
#> $ fn.1.vs.2: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ tp.1.vs.3: num 0 1 0 0 0 0 0 0 1 0 ...
#> $ tn.1.vs.3: num 0 0 1 1 0 0 1 0 0 0 ...
#> $ fp.1.vs.3: num 0 0 0 0 1 1 0 1 0 1 ...
#> $ fn.1.vs.3: num 1 0 0 0 0 0 0 0 0 0 ...
#> $ tp.2.vs.1: num 0 1 0 0 0 0 0 0 0 0 ...
#> $ tn.2.vs.1: num 0 0 0 0 0 1 0 1 0 0 ...
#> $ fp.2.vs.1: num 1 0 0 0 0 0 0 0 0 0 ...
#> $ fn.2.vs.1: num 0 0 1 1 1 0 1 0 1 1 ...
#> $ tp.2.vs.2: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ tn.2.vs.2: num 0 0 1 1 0 1 1 0 0 0 ...
#> $ fp.2.vs.2: num 1 1 0 0 0 0 0 0 0 0 ...
#> $ fn.2.vs.2: num 0 0 0 0 1 0 0 1 1 1 ...
#> $ tp.2.vs.3: num 1 1 0 0 0 0 0 0 0 0 ...
#> $ tn.2.vs.3: num 0 0 1 1 1 1 1 1 0 1 ...
#> $ fp.2.vs.3: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ fn.2.vs.3: num 0 0 0 0 0 0 0 0 1 0 ...
#> $ tp.3.vs.1: num 0 0 0 0 1 0 0 0 0 0 ...
#> $ tn.3.vs.1: num 0 0 0 0 0 1 0 0 0 0 ...
#> $ fp.3.vs.1: num 1 0 0 0 0 0 0 1 0 0 ...
#> $ fn.3.vs.1: num 0 1 1 1 0 0 1 0 1 1 ...
#> $ tp.3.vs.2: num 0 0 0 0 1 0 0 1 0 0 ...
#> $ tn.3.vs.2: num 0 1 1 1 0 1 1 0 0 0 ...
#> $ fp.3.vs.2: num 1 0 0 0 0 0 0 0 0 0 ...
#> $ fn.3.vs.2: num 0 0 0 0 0 0 0 0 1 1 ...
#> $ tp.3.vs.3: num 1 0 0 0 0 0 0 0 0 0 ...
#> $ tn.3.vs.3: num 0 0 1 1 0 1 1 0 0 1 ...
#> $ fp.3.vs.3: num 0 0 0 0 1 0 0 1 0 0 ...
#> $ fn.3.vs.3: num 0 1 0 0 0 0 0 0 1 0 ...
The other (and perhaps more sensible) way to do it is to have 3 data frames, one for each variable, and each with twelve columns (three sets of tp, tn, fp, fn).
You can do this easily using lapply:
df_list <- lapply(df, function(x) {
  dfs <- list()
  for(j in seq_along(df2)) {
    dfs[[j]] <- data.frame(ifelse(x == 1 & df2[j] == 1, 1, 0),
                           ifelse(x == 0 & df2[j] == 0, 1, 0),
                           ifelse(x == 1 & df2[j] == 0, 1, 0),
                           ifelse(x == 0 & df2[j] == 1, 1, 0))
  }
  setNames(do.call("cbind", dfs),
           paste0(c("tp.", "tn.", "fp.", "fn."), rep(seq_along(df2), each = 4)))
})
Which gives you:
df_list
#> $variable1
#>    tp.1 tn.1 fp.1 fn.1 tp.2 tn.2 fp.2 fn.2 tp.3 tn.3 fp.3 fn.3
#> 1     0    1    0    0    0    1    0    0    0    0    0    1
#> 2     1    0    0    0    0    0    1    0    1    0    0    0
#> 3     0    0    0    1    0    1    0    0    0    1    0    0
#> 4     0    0    0    1    0    1    0    0    0    1    0    0
#> 5     1    0    0    0    1    0    0    0    0    0    1    0
#> 6     0    0    1    0    0    0    1    0    0    0    1    0
#> 7     0    0    0    1    0    1    0    0    0    1    0    0
#> 8     0    0    1    0    1    0    0    0    0    0    1    0
#> 9     1    0    0    0    1    0    0    0    1    0    0    0
#> 10    1    0    0    0    1    0    0    0    0    0    1    0
#> 11    0    1    0    0    0    0    0    1    0    0    0    1
#> 12    0    0    1    0    1    0    0    0    1    0    0    0
#> 13    0    0    0    1    0    1    0    0    0    1    0    0
#> 14    0    1    0    0    0    0    0    1    0    0    0    1
#> 15    0    0    0    1    0    1    0    0    0    0    0    1
#> 16    0    1    0    0    0    1    0    0    0    0    0    1
#> 17    0    0    1    0    1    0    0    0    1    0    0    0
#> 18    0    0    1    0    0    0    1    0    0    0    1    0
#> 19    0    0    1    0    0    0    1    0    0    0    1    0
#> 20    1    0    0    0    0    0    1    0    1    0    0    0
#> 21    0    1    0    0    0    1    0    0    0    1    0    0
#> 22    1    0    0    0    0    0    1    0    1    0    0    0
#> 23    0    0    0    1    0    1    0    0    0    0    0    1
#> 24    0    0    0    1    0    1    0    0    0    0    0    1
#> 25    0    0    1    0    0    0    1    0    0    0    1    0
#> 26    0    0    1    0    0    0    1    0    0    0    1    0
#> 27    1    0    0    0    1    0    0    0    0    0    1    0
#> 28    0    0    1    0    0    0    1    0    1    0    0    0
#> 29    0    0    0    1    0    1    0    0    0    1    0    0
#> 30    0    0    1    0    0    0    1    0    0    0    1    0
#>
#> $variable2
#>    tp.1 tn.1 fp.1 fn.1 tp.2 tn.2 fp.2 fn.2 tp.3 tn.3 fp.3 fn.3
#> 1     0    0    1    0    0    0    1    0    1    0    0    0
#> 2     1    0    0    0    0    0    1    0    1    0    0    0
#> 3     0    0    0    1    0    1    0    0    0    1    0    0
#> 4     0    0    0    1    0    1    0    0    0    1    0    0
#> 5     0    0    0    1    0    0    0    1    0    1    0    0
#> 6     0    1    0    0    0    1    0    0    0    1    0    0
#> 7     0    0    0    1    0    1    0    0    0    1    0    0
#> 8     0    1    0    0    0    0    0    1    0    1    0    0
#> 9     0    0    0    1    0    0    0    1    0    0    0    1
#> 10    0    0    0    1    0    0    0    1    0    1    0    0
#> 11    0    1    0    0    0    0    0    1    0    0    0    1
#> 12    0    0    1    0    1    0    0    0    1    0    0    0
#> 13    0    0    0    1    0    1    0    0    0    1    0    0
#> 14    0    0    1    0    1    0    0    0    1    0    0    0
#> 15    1    0    0    0    0    0    1    0    1    0    0    0
#> 16    0    0    1    0    0    0    1    0    1    0    0    0
#> 17    0    1    0    0    0    0    0    1    0    0    0    1
#> 18    0    0    1    0    0    0    1    0    0    0    1    0
#> 19    0    0    1    0    0    0    1    0    0    0    1    0
#> 20    1    0    0    0    0    0    1    0    1    0    0    0
#> 21    0    1    0    0    0    1    0    0    0    1    0    0
#> 22    0    0    0    1    0    1    0    0    0    0    0    1
#> 23    0    0    0    1    0    1    0    0    0    0    0    1
#> 24    1    0    0    0    0    0    1    0    1    0    0    0
#> 25    0    0    1    0    0    0    1    0    0    0    1    0
#> 26    0    0    1    0    0    0    1    0    0    0    1    0
#> 27    1    0    0    0    1    0    0    0    0    0    1    0
#> 28    0    0    1    0    0    0    1    0    1    0    0    0
#> 29    0    0    0    1    0    1    0    0    0    1    0    0
#> 30    0    0    1    0    0    0    1    0    0    0    1    0
#>
#> $variable3
#>    tp.1 tn.1 fp.1 fn.1 tp.2 tn.2 fp.2 fn.2 tp.3 tn.3 fp.3 fn.3
#> 1     0    0    1    0    0    0    1    0    1    0    0    0
#> 2     0    0    0    1    0    1    0    0    0    0    0    1
#> 3     0    0    0    1    0    1    0    0    0    1    0    0
#> 4     0    0    0    1    0    1    0    0    0    1    0    0
#> 5     1    0    0    0    1    0    0    0    0    0    1    0
#> 6     0    1    0    0    0    1    0    0    0    1    0    0
#> 7     0    0    0    1    0    1    0    0    0    1    0    0
#> 8     0    0    1    0    1    0    0    0    0    0    1    0
#> 9     0    0    0    1    0    0    0    1    0    0    0    1
#> 10    0    0    0    1    0    0    0    1    0    1    0    0
#> 11    0    1    0    0    0    0    0    1    0    0    0    1
#> 12    0    1    0    0    0    0    0    1    0    0    0    1
#> 13    1    0    0    0    0    0    1    0    0    0    1    0
#> 14    0    1    0    0    0    0    0    1    0    0    0    1
#> 15    0    0    0    1    0    1    0    0    0    0    0    1
#> 16    0    0    1    0    0    0    1    0    1    0    0    0
#> 17    0    0    1    0    1    0    0    0    1    0    0    0
#> 18    0    1    0    0    0    1    0    0    0    1    0    0
#> 19    0    1    0    0    0    1    0    0    0    1    0    0
#> 20    1    0    0    0    0    0    1    0    1    0    0    0
#> 21    0    0    1    0    0    0    1    0    0    0    1    0
#> 22    0    0    0    1    0    1    0    0    0    0    0    1
#> 23    1    0    0    0    0    0    1    0    1    0    0    0
#> 24    0    0    0    1    0    1    0    0    0    0    0    1
#> 25    0    1    0    0    0    1    0    0    0    1    0    0
#> 26    0    1    0    0    0    1    0    0    0    1    0    0
#> 27    1    0    0    0    1    0    0    0    0    0    1    0
#> 28    0    0    1    0    0    0    1    0    1    0    0    0
#> 29    1    0    0    0    0    0    1    0    0    0    1    0
#> 30    0    0    1    0    0    0    1    0    0    0    1    0
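If only the totals per comparison are needed rather than row-level indicator columns, a long table with one row per variable/outcome pair may be easier to extend to any number of variables and outcomes. A sketch with simulated data; pred, obs and combos are illustrative names, not from the question.

```r
set.seed(1)  # simulated 0/1 data, for illustration only
n <- 30
pred <- as.data.frame(replicate(3, sample(0:1, n, replace = TRUE)))
names(pred) <- paste0("variable", 1:3)
obs <- as.data.frame(replicate(2, sample(0:1, n, replace = TRUE)))
names(obs) <- paste0("outcome", 1:2)

# Every variable/outcome combination, one per row
combos <- expand.grid(variable = names(pred), outcome = names(obs),
                      stringsAsFactors = FALSE)

# Confusion counts for each combination
counts <- t(vapply(seq_len(nrow(combos)), function(k) {
  v <- pred[[combos$variable[k]]]
  o <- obs[[combos$outcome[k]]]
  c(tp = sum(v == 1 & o == 1), tn = sum(v == 0 & o == 0),
    fp = sum(v == 1 & o == 0), fn = sum(v == 0 & o == 1))
}, numeric(4)))

result <- cbind(combos, counts)
```

Because the pair is stored in two columns instead of being encoded in column names, nothing can be overwritten when the number of variables or outcomes changes.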
How do I change the color of interactions based on interaction value using an ifelse() statement in a bipartite plotweb?
Hi, I am having trouble getting ifelse statements to work in the plotweb function (from bipartite) to color interactions based on the total quantity of interaction in each cell of the matrix. I had the same problem with the high bar colors, but since there were only a few values and one vector, it was easy to code manually. Here is the code I am using. I want to color interactions greater than 15 dark turquoise and keep the rest the default grey (grey80). I have tried many different statements, but I can't figure out what to put in the [,] to make the function go through every individual cell and apply the statement instead of summing them; elem,elem doesn't seem to work either. Attached is a picture of the function's current output.
plotweb(LadyNet, abuns.type='additional',
        arrow="up.center",
        text.rot=90,
        col.low=c("olivedrab3"),
        col.interaction =(ifelse(LadyNet[,] < 15,'grey80','darkturquoise')),
        col.high = c("grey10","#FF0000","grey10","#FF0000","grey10","#FF0000","grey10","grey10","grey10"),
        high.lab.dis = 0,
        ybig=1.2,
        y.width.high = .06,
        high.spacing = 0.011,
        y.lim = c(-1,2))
                            COCCAL COCSEP CYCPOL CYCSAN EXOFAS HIPCON PSYVIG   SCY1 SCYMAR
Acmispon glaber                  0      1      0      1      0      0      0      0      0
Ambrosia psilostachya            1     36      0     24      0      6      0      0      0
Artemisia douglasiana            0      0      0      1      0      1      0      0      0
Asclepias fascicularis           0      5      0      4      0      2      0      0      0
Avena fatua                      6     10      0      0      0      4      0      0      0
Baccharis pilularis              9     76      0     38      0     27      0      1      0
Baccharis salicifolia            0      2      0      0      0      0      0      0      0
Bromus diandrus                  1      8      0      0      0      4      0      0      0
Capsicum annuum                  0      0      0      0      0      0      0      0      1
Chenopodium murale               0      1      0      0      0      0      0      0      0
Croton californicus              3     20      0     13      0     54      4      0      0
DEAD WOOD                        0      1      0      0      0      0      0      0      0
Distichilis spicata              0      1      0      0      0      0      0      0      0
Echium candicans                 0      1      0      3      0      0      0      0      0
Eleocharis acicularis            0      1      0      0      0      0      0      0      0
Encelia californica              1      1      0      3      0      2      0      0      0
Epilobium canum                  0      0      0      1      0      0      0      0      0
Erigeron bonariensis             0      4      0      0      0      0      0      0      0
Erigeron canadensis              0     17      0     10      0      2      0      0      0
Erigeron sumatrensis             0     13      0      0      0      1      0      0      0
Eriophyllum confertiflorum       1     10      0      0      0      1      0      0      0
Fence                            0      0      0      1      0      0      0      0      0
Festuca perennis                 0      1      0      0      0      2      0      0      0
Gambelium speciosa               0      0      0      0      0      1      0      0      0
Geranium dissectum               0      0      0      3      0      0      0      0      0
GROUND                           0      1      0      1      0      0      0      0      0
Helminthotheca echioides         0      1      2     17      0      1      0      0      0
Heterotheca grandiflora          2     92      0     12      0      7      1      0      0
Hirschfieldia incana             0      3      0      0      0      1      0      0      0
Juncus patens                    0      1      0      0      0      0      0      0      0
Laennecia coulteri               1     65      0      2      0      3      0      0      0
Lobularia maritima               1      1      0      0      0      0      0      0      0
Morus sp.                        0      0      0      1      0      0      0      0      0
NoPicture                        4      3      0      3      3      2      3      0      0
Oxalis pes-caprae                4      6      0      0      0      2      0      0      0
Pennisetum clandestinum          1      5      0      0      0      0      0      0      0
Polygonum arenastrum             0      1      0      0      0      0      0      0      0
Raphanus sativus                 0      1      0      0      0      0      0      0      0
ROCK                             0      0      0      1      0      0      0      0      0
Rumex crispus                    0      1      0      0      0      0      0      0      0
Rumex salicifolius               0      0      0      3      0      0      0      0      0
Salsola tragus                   1      6      0      1      0      1      0      0      0
Salvia leucophylla               0      1      0      0      0      1      0      0      0
Schenoplectus americanus         0      1      0      0      0      0      0      0      0
Solanum nigrum                   0      0      0      0      0      1      0      0      0
Sonchus arvensis                 0      1      0      0      0      0      0      0      0
Spinacia oleracea                0      0      0      0      0      0      1      0      0
Stipa pulchra                    0      1      0      0      0      0      0      0      0
Symphiotrichum subulatum         0     88      0      7      0      3      0      0      0
THATCH                           1      3      0      0      0      4      0      0      0
Verbena lasiostachys             1      9      0      0      0      2      0      0      0
For reference, I have gotten the ifelse statement to work properly in the plotweb function when there was only one species in the lower level. Attached is an example along with the code:
plotweb(rnet, abuns.type='additional',
        arrow="down.center",
        text.rot=90,
        col.low=c("olivedrab3"),
        col.interaction =(ifelse(rnet[1,] < 12,'grey80','darkturquoise')),
        col.high = (ifelse(rnet[1,] < 12,'grey10','darkturquoise')),
        high.lab.dis = 0,
        ybig=1.2,
        y.width.high = .06,
        high.spacing = 0.011)
One thing to note is that the col.interaction color matrix should be transposed. Here is an example that I trust you will find useful:
library(bipartite)
library(grDevices)
plotweb(df, abuns.type='additional',
        arrow="up.center",
        text.rot=90,
        col.low=c("olivedrab3"),
        col.interaction = t(ifelse(df[,] < 15,
                                   adjustcolor('grey80', alpha.f = 0.5), #add alpha to colors
                                   adjustcolor('darkturquoise', alpha.f = 0.5))),
        col.high = c("grey10", "#FF0000", "grey10", "#FF0000", "grey10", "#FF0000", "grey10", "grey10", "grey10"),
        bor.col.interaction = NA, #remove the black border color
        high.lab.dis = 0,
        ybig=1.2,
        y.width.high = .06,
        high.spacing = 0.011,
        y.lim = c(-1,2))
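The reason ifelse works on the whole web is that it is vectorised over every cell and keeps the matrix dimensions, so no special [,] indexing or loop is needed. A small base R sketch with a made-up 2 x 3 web (the values and names are invented; the transposition simply mirrors what the answer passes to col.interaction):

```r
# Hypothetical interaction matrix: 2 lower-level rows, 3 higher-level columns
web <- matrix(c(1, 36, 24,
                9, 76,  0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("plantA", "plantB"), c("sp1", "sp2", "sp3")))

# One colour per cell; the result keeps the dim and dimnames of `web`
cols <- ifelse(web < 15, "grey80", "darkturquoise")

# Transposed copy, in the layout the answer hands to col.interaction
cols_t <- t(cols)
```

Note that with the condition < 15 a cell equal to exactly 15 is coloured dark turquoise; use <= 15 if only values strictly greater than 15 should be highlighted.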
Filling a table with additional columns if they don't exist
I have the following difficult problem. Here is a short example of my data. Assume that I have two data sets (my real example has about 20). The data frames come as a list computed by a self-written function with lapply, so I put the data frames of my example in a list, too. Then I "rbind" them to compute a frequency table.
df1 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df1) <- c("k", "a")
df2 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df2) <- c("k", "a")
list_df <- list(df1,df2)
df_combine<- plyr::ldply(list_df, rbind)
freq_foo <- table(df_combine$k,df_combine$a)
I get a frequency table of the following form.
   a=0 a=11 a=12 a=2 a=5 a=6 a=7 a=8 a=3 a=9
1    1    0    0   0   0   0   0   1   0   0
2    1    0    0   0   0   0   0   0   0   1
3    1    0    0   0   0   1   0   0   0   0
4    0    0    0   1   0   1   0   0   0   0
5    0    0    0   1   1   0   0   0   0   0
6    0    0    0   0   0   0   1   0   0   1
7    0    1    1   0   0   0   0   0   0   0
8    1    0    0   0   0   1   0   0   0   0
9    0    0    0   0   0   0   2   0   0   0
10   0    0    1   0   1   0   0   0   0   0
11   1    1    0   0   0   0   0   0   0   0
12   0    0    0   0   0   0   1   0   1   0
13   1    0    1   0   0   0   0   0   0   0
I want to extend and manipulate my table in the following way. First, the table should cover the range a=0 to a=15, so if a column is missing, it should be added. Second, I want to order the columns from 0 to 15. For the first problem I tried
if(freq_foo$paste0("a=",0:15) == F){freq_foo$paste("a=",0:15) <- 0}
but this should work only for data frames and not for tables. Also, I have no idea how to sort the columns in ascending order. The data type isn't important to me because I just want to use the output for further calculations, so it can also be a data frame instead of a table.
#convert freq_foo table to dataframe
df <- as.data.frame.matrix(freq_foo)
#add all zeros column for missing column name in 0:15 series
df[, paste0("a=", c(0:15)[!(c(0:15) %in% as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))])] <- 0
#order columns from 0 to 15
df <- df[, order(as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))]
Output is:
   a=0 a=1 a=2 a=3 a=4 a=5 a=6 a=7 a=8 a=9 a=10 a=11 a=12 a=13 a=14 a=15
1    0   0   0   0   0   0   0   0   0   1    1    0    0    0    0    0
2    0   0   0   1   0   0   0   0   0   0    1    0    0    0    0    0
3    0   0   0   0   1   0   0   1   0   0    0    0    0    0    0    0
4    0   0   0   0   0   0   0   0   2   0    0    0    0    0    0    0
5    0   1   0   0   0   0   1   0   0   0    0    0    0    0    0    0
6    1   0   0   0   0   0   0   0   0   1    0    0    0    0    0    0
7    0   1   0   0   0   0   0   0   0   0    0    1    0    0    0    0
8    0   0   0   0   0   0   0   1   1   0    0    0    0    0    0    0
9    0   0   0   0   0   0   0   1   1   0    0    0    0    0    0    0
10   1   0   0   0   0   0   0   0   0   1    0    0    0    0    0    0
11   1   0   0   0   0   0   0   0   0   0    1    0    0    0    0    0
12   0   0   0   0   0   0   0   0   1   0    0    0    1    0    0    0
13   0   1   0   0   1   0   0   0   0   0    0    0    0    0    0    0
(Edit: Updated code after getting a requirement clarification from OP)
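An alternative that avoids patching columns afterwards is to fix the factor levels before tabulating: table() keeps a column for every level, including unused ones, in level order. A sketch with a small made-up sample (the vectors k and a stand in for df_combine$k and df_combine$a):

```r
# Hypothetical sample standing in for df_combine$k and df_combine$a
k <- c(1, 1, 2, 3, 3)
a <- c("a=0", "a=11", "a=2", "a=11", "a=5")

# Declare the full range of categories up front
a_levels <- paste0("a=", 0:15)
freq <- table(k, factor(a, levels = a_levels))

# Optional: plain data frame for further calculations
freq_df <- as.data.frame.matrix(freq)
```

Columns for categories that never occur are filled with zeros, and the column order follows a_levels, so both requirements are handled in one step.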
Losing observations when I use reshape in R
I have this data set:
> head(pain_subset2, n= 50)
   PatientID RSE SE SECODE
1    1001-01   0  0      0
2    1001-01   0  0      0
3    1001-02   0  0      0
4    1001-02   0  0      0
5    1002-01   0  0      0
6    1002-01   1 2a      1
7    1002-02   0  0      0
8    1002-02   0  0      0
9    1002-02   0  0      0
10   1002-03   0  0      0
11   1002-03   0  0      0
12   1002-03   1  1      1
> dim(pain_subset2)
[1] 817 4
> table(pain_subset2$RSE)
  0   1
788  29
> table(pain_subset2$SE)
  0   1  2a  2b   3   4   5
788   7   5   1   6   4   6
> table(pain_subset2$SECODE)
  0   1
788  29
I want to create an n x 6 matrix (n: number of PatientIDs; columns: the 6 levels of SE). When I use reshape, I lose many observations:
> dim(p)
[1] 246 9
My code:
p <- reshape(pain_subset2, timevar = "SE", idvar = c("PatientID","RSE"),v.names = "SECODE", direction = "wide")
p[is.na(p)] <- 0
> table(p$RSE)
  0   1
226  20
Compared with the table of RSE above, I lost 9 patients having RSE = 1. This is the output I get:
   PatientID RSE SECODE.0 SECODE.2a SECODE.1 SECODE.5 SECODE.3 SECODE.2b SECODE.4
1    1001-01   0        0         0        0        0        0         0        0
3    1001-02   0        0         0        0        0        0         0        0
5    1002-01   0        0         0        0        0        0         0        0
6    1002-01   1        0         1        0        0        0         0        0
7    1002-02   0        0         0        0        0        0         0        0
10   1002-03   0        0         0        0        0        0         0        0
12   1002-03   1        0         0        1        0        0         0        0
13   1002-04   0        0         0        0        0        0         0        0
15   1003-01   0        0         0        0        0        0         0        0
18   1003-02   0        0         0        0        0        0         0        0
21   1003-03   0        0         0        0        0        0         0        0
24   1003-04   0        0         0        0        0        0         0        0
27   1003-05   0        0         0        0        0        0         0        0
30   1003-06   0        0         0        0        0        0         0        0
32   1003-07   0        0         0        0        0        0         0        0
35   1004-01   0        0         0        0        0        0         0        0
36   1004-01   1        0         0        0        1        0        0         0
40  1004-02a   0        0         0        0        0        0         0        0
Does anyone know what is happening? I would really appreciate any help. Thanks.
Try:
library(dplyr)
library(tidyr)
pain_subset2 %>% spread(SE, SECODE)
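A note on why rows disappear: reshape(direction = "wide") returns one row per combination of the idvar columns, so repeated rows for the same PatientID and RSE collapse into one. If the target is a patient-by-SE indicator matrix, cross-tabulating may be enough. A base R sketch on a few rows shaped like the sample above; treating "0" as "no side effect" and leaving it out of the columns is my assumption, not something the question states.

```r
# A few rows shaped like head(pain_subset2); values copied from the sample
pain <- data.frame(
  PatientID = c("1001-01", "1001-01", "1002-01", "1002-01",
                "1002-03", "1002-03", "1002-03"),
  SE        = c("0", "0", "0", "2a", "0", "0", "1"),
  stringsAsFactors = FALSE
)
se_levels <- c("1", "2a", "2b", "3", "4", "5")

# Values not listed in se_levels (here "0") become NA and are not counted
counts <- table(pain$PatientID, factor(pain$SE, levels = se_levels))

# 1 if the patient ever had that side effect, 0 otherwise
ind <- (unclass(counts) > 0) * 1L
```

The result has one row per patient regardless of how many records each patient has, which is the n x 6 shape described in the question.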