str(tidy_factors)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 70650 obs. of 4 variables:
$ date : Date, format: "1992-06-01" "1992-06-02" ...
$ Factor : Factor w/ 5 levels "CMA","HML","MKT",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Variable: Factor w/ 2 levels "Centrality","Return": 2 2 2 2 2 2 2 2 2 2 ...
$ Value : num -0.0012 -0.0022 -0.0012 -0.0029 0.0003 -0.0043 -0.0037 -0.0038 0.0026 -0.0024 ...
I would like to understand the pattern in the Value that the Factor takes over time (date).
library(tidyverse)
tidy_factors %>% filter(Variable=="Centrality")%>%
group_by(date) %>%
ggplot(aes(x=date,y=Factor, fill=Value))+
geom_bar(stat="identity")
The plot looks fine, but the dates on the x-axis are indistinguishable. When I add scale_x_date() to label the axis by period, I get the following error:
tidy_factors %>% filter(Variable=="Centrality")%>%
group_by(date) %>%
ggplot(aes(x=date,y=Factor, fill=Value))+
geom_bar(stat="identity")+
scale_x_date(date_breaks = "1 year", date_labels="%Y")
Error in charToDate(x) :
character string is not in a standard unambiguous format
I also tried "1 years", "1 month", etc.
The dates are already unique for each Factor level. Can you tell me what is the problem?
I have a large data set with 2000 variables, including factors and continuous variables.
For example:
library(finalfit)
library(dplyr)
data(colon_s)
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"
I use the following function to compare the mean of each continuous variable across the levels of the categorical dependent variable (ANOVA), or the percentage of each categorical variable across the levels of the categorical dependent variable (chi-square):
summary_factorlist(colon_s, dependent = "perfor.factor", explanatory = explanatory,
add_dependent_label = TRUE, p = TRUE, p_cat = "fisher", p_cont_para = "aov", fit_id = TRUE)
But as soon as running the above code, I got the following error:
Error in dplyr::summarise():
! Problem while computing ..1 = ...$p.value.
Caused by error in fisher.test():
! 'x' and 'y' must have at least 2 levels
In the data set, some variables do not have at least two levels, or only one of their levels has a non-zero frequency. I was wondering whether there is a loop or function that removes a variable if any of these conditions holds:
The variable has just one level.
The variable has more than one level, but only one level has a non-zero frequency.
All values of the variable are missing.
Update (partial answer):
With this code we can remove factors with only one level and keep other non factor variables:
x <- colon_s[, (sapply(colon_s, nlevels)>1) | (sapply(colon_s, is.factor)==FALSE)]
The OP's code does work with the data provided
library(dplyr)
library(finalfit)
summary_factorlist(colon_s, dependent ="perfor.factor",
explanatory =explanatory ,
add_dependent_label=TRUE, p=TRUE,p_cat="fisher", p_cont_para = "aov", fit_id = TRUE)
Dependent: Perforation No Yes p fit_id index
Age (years) Mean (SD) 59.8 (11.9) 58.4 (13.3) 0.542 age 1
Age <40 years 68 (7.5) 2 (7.4) 1.000 age.factor<40 years 2
40-59 years 334 (37.0) 10 (37.0) age.factor40-59 years 3
60+ years 500 (55.4) 15 (55.6) age.factor60+ years 4
Sex Female 432 (47.9) 13 (48.1) 1.000 sex.factorFemale 5
Male 470 (52.1) 14 (51.9) sex.factorMale 6
Obstruction No 715 (81.2) 17 (63.0) 0.026 obstruct.factorNo 7
Yes 166 (18.8) 10 (37.0) obstruct.factorYes 8
The structure of the data shows that the factor variables have more than one level:
> str(colon_s[c(explanatory, dependent)])
'data.frame': 929 obs. of 5 variables:
$ age : num 43 63 71 66 69 57 77 54 46 68 ...
..- attr(*, "label")= chr "Age (years)"
$ age.factor : Factor w/ 3 levels "<40 years","40-59 years",..: 2 3 3 3 3 2 3 2 2 3 ...
..- attr(*, "label")= chr "Age"
$ sex.factor : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 2 2 2 1 ...
..- attr(*, "label")= chr "Sex"
$ obstruct.factor: Factor w/ 2 levels "No","Yes": NA 1 1 2 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Obstruction"
$ perfor.factor : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Perforation"
Regarding selection of factor variables with the condition mentioned, we could use
library(dplyr)
colon_s_sub <- colon_s %>%
select(where(~ is.factor(.x) && nlevels(.x) > 1 && all(table(.x) > 0) &&
sum(complete.cases(.x)) > 0))
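If the non-factor columns should be kept as well, a base R predicate can cover the three conditions from the question. This is only a minimal sketch on made-up data: keep_column() and the toy data frame are hypothetical names for illustration, not part of finalfit or dplyr.

```r
# Toy data standing in for the real data set (illustrative only).
toy <- data.frame(
  age      = c(43, 63, 71, 66),
  one_lvl  = factor(rep("a", 4)),                           # a single level
  one_used = factor(rep("x", 4), levels = c("x", "y")),     # two levels, one used
  all_na   = factor(rep(NA_character_, 4), levels = c("p", "q")),
  ok       = factor(c("No", "Yes", "No", "Yes"))
)

# Hypothetical helper: TRUE if the column should be kept.
keep_column <- function(x) {
  if (all(is.na(x))) return(FALSE)   # every value is missing
  if (!is.factor(x)) return(TRUE)    # keep non-factor columns
  sum(table(x) > 0) >= 2             # at least two levels actually occur
}

toy_sub <- toy[, vapply(toy, keep_column, logical(1)), drop = FALSE]
names(toy_sub)
```

Counting the levels that actually occur (rather than requiring every level to occur) keeps factors that have an unused level but still at least two observed ones.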
folks...
I am having trouble with date/time showing up properly in lubridate.
Here's my code:
Temp.dat <- read_excel("Temperature Data.xlsx", sheet = "Sheet1", na="NA") %>%
mutate(Treatment = as.factor(Treatment),
TempC=as.factor(TempC),
TempF=as.factor(TempF),
Month=as.factor(Month),
Day=as.factor(Day),
Year=as.factor(Year),
Time=as.factor(Time))%>%
select(TempC, Treatment, Month, Day, Year, Time)%>%
mutate(Measurement=make_datetime(Month, Day, Year, Time))
Here's what it spits out:
tibble [44 x 7] (S3: tbl_df/tbl/data.frame)
$ TempC : Factor w/ 38 levels "15.5555555555556",..: 31 32 29 20 17 28 27 26 23 24 ...
$ Treatment : Factor w/ 2 levels "Grass","Soil": 1 1 1 1 2 2 2 2 2 2 ...
$ Month : Factor w/ 1 level "6": 1 1 1 1 1 1 1 1 1 1 ...
$ Day : Factor w/ 2 levels "15","16": 1 1 1 1 1 1 1 1 1 1 ...
$ Year : Factor w/ 1 level "2022": 1 1 1 1 1 1 1 1 1 1 ...
$ Time : Factor w/ 3 levels "700","1200","1600": 3 3 3 3 3 3 3 3 3 3 ...
**$ Measurement: POSIXct[1:44], format: "0001-01-01 03:00:00" "0001-01-01 03:00:00" "0001-01-01 03:00:00" "0001-01-01 03:00:00" ...**
I've put asterisks by the problem result. It should show June 16th at 0700 or something like that, but instead it defaults to January 1, 1 AD for some reason. I've tried adding colons to the date in Excel, but that switches to a 12-hour clock and I'd like to keep this at 24 hours.
What's going on here?
This will work as long as the Time column in the Excel file is formatted as a time, so that it imports as a date-time object that lubridate can interpret.
library(readxl)
library(dplyr)
library(lubridate)
Temp.dat <- read_excel("t.xlsx", sheet = "Sheet1", na="NA") %>%
mutate(Treatment = as.factor(Treatment),
TempC = as.numeric(TempC),
TempF = as.numeric(TempF),
Month = as.numeric(Month),
Day = as.numeric(Day),
Year = as.numeric(Year),
Hour = hour(Time),
Minute = minute(Time)) %>%
select(TempC, Treatment, Month, Day, Year, Hour, Minute) %>%
mutate(Measurement = make_datetime(year = Year,
month = Month,
day = Day,
hour = Hour,
min = Minute))
Note that the arguments passed to make_datetime() are numeric, which is what the function expects. If you pass factors, the function gives you the weird dates you were seeing.
No need to convert Time to string and extract hours and minutes, as I suggested in the comments, since you can use lubridate's minute() and hour() functions.
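One way to read the "0001-01-01 03:00:00" result: make_datetime() takes its arguments in the order year, month, day, hour, so the call in the question passes Month as the year and Time as the hour, and a factor turned into a number yields its level code rather than its label. That the function coerces factors this way is an assumption based on the output shown. The base R sketch below only illustrates the level codes, using values modelled on the str() output in the question.

```r
# Columns shaped like the str() output in the question (values are illustrative).
Month <- factor(c("6", "6"))
Day   <- factor(c("15", "16"))
Year  <- factor(c("2022", "2022"))
Time  <- factor(c("1600", "700"), levels = c("700", "1200", "1600"))

# Coercing a factor directly gives its internal level codes ...
as.integer(Month)  # 1 1
as.integer(Time)   # 3 1

# ... whereas going through character recovers the values shown in the labels.
as.numeric(as.character(Time))  # 1600 700
as.numeric(as.character(Year))  # 2022 2022
```

With codes 1, 1, 1 and 3 landing in the year, month, day and hour slots, year 1, January 1st at 03:00 is what one would expect to see.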
EDIT
To use lubridate's functions, Time needs to be a date-time object. You can check that it is by looking at what read_excel() produces:
> str(read_excel("t.xlsx", sheet = "Sheet1", na="NA"))
tibble [2 × 7] (S3: tbl_df/tbl/data.frame)
$ Treatment: chr [1:2] "s" "c"
$ TempC : num [1:2] 34 23
$ TempF : num [1:2] 99 60
$ Month : num [1:2] 5 4
$ Day : num [1:2] 1 15
$ Year : num [1:2] 2020 2021
$ Time : POSIXct[1:2], format: "1899-12-31 04:33:23" "1899-12-31 03:20:23"
See that Time is type POSIXct, a date-time object. If it is not, then you need to convert it into one if you want to use lubridate's minute() and hour() functions. If it cannot be converted, there are other solutions, but they depend on what you have.
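For instance, if Time arrives as a plain number such as 700 or 1600 (as the factor levels in the question suggest), one option is to split it arithmetically. This is a sketch under the assumption that the number encodes HHMM on a 24-hour clock; the values are made up, and base R's ISOdatetime() is used here in place of make_datetime() so the example has no package dependencies.

```r
# Illustrative values only; the real columns come from the spreadsheet.
Year  <- c(2022, 2022)
Month <- c(6, 6)
Day   <- c(15, 16)
Time  <- c(1600, 700)    # assumed to encode HHMM on a 24-hour clock

Hour   <- Time %/% 100   # integer division: 16, 7
Minute <- Time %% 100    # remainder: 0, 0

# Arguments are year, month, day, hour, min, sec.
Measurement <- ISOdatetime(Year, Month, Day, Hour, Minute, 0, tz = "UTC")
format(Measurement, "%Y-%m-%d %H:%M")
```

The same Hour and Minute vectors could be passed to make_datetime() through its hour and min arguments.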
I'm trying to remove special characters, e.g. "-", "/", ")", "(", etc., entirely from my data frame. However, my data frame only contains one observation, as it's feeding into a model that will be used in production. I've defined the factor levels explicitly for the data frame.
I've tried the following:
sanitize_string <- function(string){
gsub('\\s+', "_", string) %>%
gsub("[(]", "_", .) %>%
gsub("[)]", "_", .) %>%
gsub("[/]", "_", .) %>%
gsub("[-]", "_", .)}
and then:
df <- as.data.frame(lapply(df, function(dataframe) sapply(dataframe, sanitize_string)), stringsAsFactors=FALSE)
But when I do this, I lose my factor levels: every factor ends up with just one level, which causes problems later when I try to get predictions from my model, because sparse.model.matrix needs 2 or more levels for each factor, while in production it will only ever be sent one observation.
Thanks.
Here is my dataframe:
$ children_under16 : Factor w/ 2 levels "No","Yes": 1
$ ft_employment_status : Factor w/ 5 levels "Employed","Full-Time Education(Student)",..: 1
$ fuel_type : Factor w/ 2 levels "D","P": 2
$ homeowner : Factor w/ 2 levels "FALSE","TRUE": 2
$ marital_status : Factor w/ 6 levels "Married","Separated",..: 1
$ overnight_loc : Factor w/ 7 levels "In a private Driveway",..: NA
$ usage_type : Factor w/ 3 levels "CLASS_1","SDPC",..: 1
$ licence_type : Factor w/ 3 levels "UK","European",..: 1
$ yad_relationship_to_policyholder: Factor w/ 8 levels "Spouse","No_YAD",..: 1
$ A : Factor w/ 7 levels "1","2","5","3",..: 1
$ B : Factor w/ 19 levels "C","E","Q","D",..: 1
$ C : Factor w/ 63 levels "11","19","58",..: 1
$ region : Factor w/ 12 levels "Yorkshire and The Humber",..: 1
$ D : Factor w/ 28 levels "Semi-Detached Suburbia",..: 27
$ E : Factor w/ 77 levels "Families in Terraces and Flats",..: 77
$ F : Factor w/ 9 levels "Suburbanites",..: 1
$ industry_band : Factor w/ 18 levels "13","14","15",..: 14
$ occ_band_goco : Factor w/ 17 levels "0","1","2","3",..: 2
$ transmission : Factor w/ 2 levels "A","M": 2
$ vehicle_make : Factor w/ 19 levels "OTHER","AUDI",..: 1
$ vehicle_type : Factor w/ 17 levels "Mid Exec Saloon/Estate/Coupe",..: 1
$ rural_urban : Factor w/ 19 levels "Urban major conurbation",..: 2
$ water_company : Factor w/ 23 levels "Affinity Water",..: 23
$ seats : Factor w/ 6 levels "-99","2","4",..:
You can sanitize the levels of the factor, rather than the column. This preserves the order of the levels, though if your sanitization maps two different levels to the same string, those levels will be merged into one. I would just use a for loop:
for (i in seq_along(df)) {
if(is.factor(df[[i]])) {
levels(df[[i]]) = sanitize_string(levels(df[[i]]))
}
}
I can't test this on the structure you've posted, but if you have problems please share some data with dput() so I can copy/paste it (e.g., dput(df[1:10, ]), or some other small subset that illustrates the problem) and I'll be happy to test and refine.
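In the meantime, here is a small self-contained sketch of the same idea on a made-up one-row data frame. The column values are illustrative, sanitize_string() is rewritten as a single gsub() so the example needs no packages, and the duplicate check is an extra that is not in the loop above.

```r
# Same substitutions as the question's helper, collapsed into one pattern:
# runs of whitespace, or any one of - / ( ), become "_".
sanitize_string <- function(string) {
  gsub("\\s+|[-/()]", "_", string)
}

# One observation, but with the full set of levels declared (illustrative).
df <- data.frame(
  ft_employment_status = factor(
    "Employed",
    levels = c("Employed", "Full-Time Education(Student)")
  ),
  fuel_type = factor("P", levels = c("D", "P"))
)

for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    new_levels <- sanitize_string(levels(df[[i]]))
    # Flag columns where two distinct levels would collapse into one.
    if (anyDuplicated(new_levels)) {
      warning("sanitizing merges levels in column ", names(df)[i])
    }
    levels(df[[i]]) <- new_levels
  }
}

levels(df$ft_employment_status)
```

Because only the level labels change, the single observation keeps its value and every factor keeps its declared number of levels.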
I am working on a two-way mixed ANOVA using the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the normality of the residuals of the dependent variable, I found that they are not normally distributed, but I was still able to perform the two-way ANOVA. However, when I apply a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on StackOverflow to this error have mentioned that one or more factors in the data set, used for the ANOVA, are of less than two levels. But as seen above they are not.
Another explanation I have read is that missing values (NAs) may be the issue. There are some:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows including NA's as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I could run the same script without the log transformation and it would work, but with it I get the same error. The factors are the same either way, and the missing values do not make a difference. Either I am making a crucial mistake or the issue is in the log transformation lines below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not enough that the factors have 2 levels; those levels must also actually be present in the data. For example, below f has 2 levels but only 1 is actually present.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
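To check for this in your own data, you can compare the number of declared levels with the number of levels that actually occur in the rows that remain after NA removal. This is a sketch on made-up data; whether it is what happens in your data set is an assumption you would need to verify.

```r
# Toy data: every NA in y falls into one group (illustrative only).
d <- data.frame(
  y = c(0.5, 0.2, NA, NA),
  g = factor(c("W", "W", "L", "L"), levels = c("W", "L"))
)
d <- na.omit(d)   # drops the two rows where g == "L"

factors  <- Filter(is.factor, d)
declared <- vapply(factors, nlevels, integer(1))
present  <- vapply(factors, function(x) nlevels(droplevels(x)), integer(1))

rbind(declared, present)
```

Any column where present is below 2 would trigger the contrasts error in a model fitted to those rows.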
I'm trying to convert a piped operator sequence into an R function or similar (ideally using Tidyverse).
The input is a tidy dataframe with the responses to 15 questions as variables, and each of the 10 observations is one of 4 standard responses (e.g. agree, disagree, etc.).
The output should be a summary of responses with both a count and percentage of the distribution of responses/observations for each question/variable.
To avoid copying and pasting and to improve the code, I would like to wrap the count and percentage calculation in a function and use a loop, a purrr map or similar to iterate over the 15 questions.
Thank you for your suggestions.
The code below works as expected and returns a table of Question, Count and Percentage with values for Agree, etc. This is ultimately what I am trying to achieve at scale and elegantly.
DF %>%
select(question) %>%
group_by(question) %>%
summarise(Count = n()) %>%
mutate(Percentage = round(100 * Count / sum(Count), 0))
Background
I start with a tidy dataframe:
'data.frame': 10 obs. of 15 variables:
$ Question1 : Factor w/ 4 levels "Agree","Neither agree nor disagree",..: 1 1 1 1 1 1 1 1 1 1 ...
The following gets me close, without the percentages:
DF_as_list <- DF %>%
map(summary)
by creating
List of 15
$ Question1 : Named int [1:4] 10 0 0 0
..- attr(*, "names")= chr [1:4] "Agree" "Neither agree nor disagree" "Disagree" "Don't know"
And the less helpful
DF_from_list <- data.frame(matrix(unlist(DF_as_list),
nrow = length(DF_as_list), byrow = TRUE))
creates:
'data.frame': 15 obs. of 4 variables:
$ X1: int 10 10 ...
$ X2: int 0 0 ...
$ X3: int 0 0 ...
$ X4: int 0 0 ...
Finally,
DF_as_tibble <- as_tibble(DF_as_list)
produces a helpful summary tibble
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 15 variables:
$ Question1 : int 10 0 0 0
$ Question2 : int 10 0 0 0
and
DF_as_tibble %>%
map(summary)
produces useful summary statistics (min, median, mean, max, 1st and 3rd Qu) but not the percentage distribution of responses.
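For reference, the kind of output being asked for can be sketched as one small function applied to every column and then stacked. This version is written in base R to stay dependency-free rather than in the tidyverse; summarise_question() and the two-question toy DF are hypothetical stand-ins for the real 15-question data.

```r
responses <- c("Agree", "Neither agree nor disagree", "Disagree", "Don't know")

# Toy stand-in for the survey data (illustrative values only).
DF <- data.frame(
  Question1 = factor(c("Agree", "Agree", "Disagree", "Agree"),
                     levels = responses),
  Question2 = factor(c("Disagree", "Don't know", "Agree", "Agree"),
                     levels = responses)
)

# Hypothetical helper: counts and rounded percentages for one question.
summarise_question <- function(name, data) {
  counts <- table(data[[name]])   # one entry per level, including zero counts
  data.frame(
    Question   = name,
    Response   = names(counts),
    Count      = as.integer(counts),
    Percentage = round(100 * as.integer(counts) / sum(counts), 0),
    row.names  = NULL
  )
}

# Apply to every question and stack the pieces into one long table.
result <- do.call(rbind, lapply(names(DF), summarise_question, data = DF))
result
```

Unlike the group_by()/n() pipeline, table() also reports responses that nobody chose, with a count of 0, which may or may not be what is wanted.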