How do I dichotomise efficiently
How to one hot encode several categorical variables in R
I am working on a project that requires me to one-hot code a single variable and I cannot seem to do it correctly.
I simply want to one-hot code the variable data$Ratings so that the values 1, 2 and 3 are separated into their own columns in the data frame, each equal to either 0 or 1. E.g., if data$Ratings = 3 then the dummy for 3 would = 1. All the other columns are not to change.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )
Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
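If an extra package is not wanted, the same idea can be sketched in base R. This is only a minimal illustration, not part of the answer above: the column names Ratings_1 to Ratings_3 and the helper variable rating_levels are made up for the example, and ID is a shortened stand-in for the real column.

```r
# Ratings column copied from the question's data; ID is a shortened stand-in
data <- data.frame(ID = 1:10, Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2, 3))

# Name the levels explicitly so a column exists for rating 1,
# even though that rating never occurs in these ten rows
rating_levels <- c(1, 2, 3)
dummies <- sapply(rating_levels, function(k) as.integer(data$Ratings == k))
colnames(dummies) <- paste0("Ratings_", rating_levels)

# Append the 0/1 columns; all existing columns are left untouched
data <- cbind(data, dummies)
head(data)
```

Listing the levels by hand is a design choice: it keeps the set of dummy columns stable even when a rating is absent from the current sample.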
calculate the mean for each column of a matrix in R
I have a data frame and I want to calculate the mean of all columns and save it into a new data frame. I found the solution in "calculate the mean for each column of a matrix in R"; however, that is only for a matrix and not for a data frame.
structure(list(TotFlArea = c(1232, 596, 708, 1052, 716), logg_weighted_assess = c(13.7765298160156,
13.1822275291412, 13.328376420438, 13.3076293132057, 13.5164823091252
), TypeDwel1.2.Duplex = c(0, 0, 0, 0, 0), TypeDwelApartment.Condo = c(0,
1, 1, 1, 1), TypeDwelTownhouse = c(1, 0, 0, 0, 0), Age_new.70 = c(0,
0, 0, 0, 0), Age_new0.1 = c(0, 0, 0, 0, 0), Age_new16.40 = c(1,
1, 0, 1, 0), Age_new2.5 = c(0, 0, 0, 0, 0), Age_new41.70 = c(0,
0, 0, 0, 0), Age_new6.15 = c(0, 0, 1, 0, 1), LandFreehold = c(1,
1, 1, 0, 1), LandLeasehold.prepaid = c(0, 0, 0, 1, 0), LandOthers = c(0,
0, 0, 0, 0), cluster_K_mean.1 = c(0, 0, 0, 0, 0)), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
Can you please advise how I can do this?
Note: my data frame can have NA values which should be excluded from mean calculation
As @akrun pointed out. Another alternative:
apply(df, 2, mean)
where 2 means by column and 1 is by row.
apply is flexible (e.g. you can swap mean for mode, or apply it to selected columns only: apply(df[, c('a', 'b')], 2, mean)). Its disadvantage is speed, as the benchmark below shows.
library(data.table)
library(microbenchmark)

# dummy data: 1e7 rows, columns a to j
x <- 1e7
df <- data.table(a = 1:x)
y <- letters[2:10]
df[, (y) := lapply(2:10, \(i) a + i)]

# benchmark colMeans against apply
z <- microbenchmark(
  colMeans = {colMeans(df)},
  apply = {apply(df, 2, mean)},
  times = 30
)
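The question also asks for NA values to be excluded and for the result to end up in a new data frame. A minimal sketch of that, using a small made-up data frame (the names means and means_df are only for illustration):

```r
# Toy data frame with missing values (made-up numbers)
df <- data.frame(a = c(1, 2, NA), b = c(4, NA, 6), c = c(7, 8, 9))

# na.rm = TRUE drops the NAs column by column before averaging
means <- colMeans(df, na.rm = TRUE)

# apply() accepts the same argument and passes it on to mean()
means_apply <- apply(df, 2, mean, na.rm = TRUE)

# Store the result as a one-row data frame with the original column names
means_df <- as.data.frame(t(means))
means_df
```

Both calls return a named numeric vector; transposing it before as.data.frame() gives one row with one column per original variable.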
I'm trying to use ggplot, and am hoping to create a boxplot that has four categories on the x axis for suspension data (low, lowish, highish, high) and farms on the y-axis.
I think I have broken the suspension column into four groups. But ggplot is upset with me. Here is the error:
Error in if (is.double(data$x) && !has_groups(data) && any(data$x != data$x[1L])) { : missing value where TRUE/FALSE needed
Here is my code:
# To break suspension_rate_total_pct data into groups for clearer visualization, I found the min, and max
merged_data$suspension_rate_total_pct <-
max(merged_data$suspension_rate_total_pct, na.rm=TRUE)
min(merged_data$suspension_rate_total_pct, na.rm=TRUE)
low_suspension <- merged_data$suspension_rate_total_pct > 0 & merged_data$suspension_rate_total_pct < 0.5
low_ish_suspension <- merged_data$suspension_rate_total_pct > 0.5 & merged_data$suspension_rate_total_pct < 1
high_ish_suspension <- merged_data$suspension_rate_total_pct > 1 & merged_data$suspension_rate_total_pct < 1.5
high_suspension <- merged_data$suspension_rate_total_pct > 1.5 & merged_data$suspension_rate_total_pct < 2
ggplot(merged_data, aes(x = suspension_rate_total_pct , y = farms_pct)) +
Here is the Data:
merged_data <- structure(list(schid = c("1030642", "1030766", "1030774", "1030840",
"1130103", "1230150"), enrollment = c(159, 333, 352, 430, 102,
193), farms = c(132, 116, 348, 406, 68, 130), foster = c(2, 0,
1, 8, 1, 4), homeless = c(14, 0, 8, 4, 1, 4), migrant = c(0,
0, 0, 0, 0, 0), ell = c(18, 12, 114, 45, 7, 4), suspension_rate_total = c(NA,
20, 0, 0, 95, 5), suspension_violent = c(NA, 9, 0, 0, 20, 2),
suspension_violent_no_injury = c(NA, 6, 0, 0, 47, 1), suspension_weapon = c(NA,
0, 0, 0, 8, 0), suspension_drug = c(NA, 0, 0, 0, 9, 1), suspension_defiance = c(NA,
1, 0, 0, 9, 1), suspension_other = c(NA, 4, 0, 0, 2, 0),
farms_pct = c(0.830188679245283, 0.348348348348348, 0.988636363636364,
0.944186046511628, 0.666666666666667, 0.673575129533679),
foster_pct = c(0.0125786163522013, 0, 0.00284090909090909,
0.0186046511627907, 0.00980392156862745, 0.0207253886010363
), migrant_pct = c(0, 0, 0, 0, 0, 0), ell_pct = c(0.113207547169811,
0.036036036036036, 0.323863636363636, 0.104651162790698,
0.0686274509803922, 0.0207253886010363), homeless_pct = c(0.0880503144654088,
0, 0.0227272727272727, 0.00930232558139535, 0.00980392156862745,
0.0207253886010363), suspension_rate_total_pct = c(NA, 2,
1, 1, 2, 2)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
If you can, please help me appease ggplot so that it will give me a beautiful visualization. Currently, this feels like a one-sided, emotional rollercoaster of a relationship.
Just a short answer; I am sure you can figure out the rest by yourself (otherwise post a follow-up question).
Since the data you provided has NAs in the first row in several columns, I can only demonstrate the principle of how to get your desired result, using the merged_data$homeless value as the grouping input for the boxplots; the data (y-value) will still be farms.
# first we create our groups of low, middle & high amount of homeless
merged_data2<- merged_data %>% mutate(homelessgroup= ifelse(homeless < 4, "low",
ifelse(homeless <= 8, "middle",
ifelse(homeless > 8, "high",NA ))))
## then we plot the data using ggplot
I think you can just use cut() with your data to partition into 4 groups. Then you can use that variable with the plot
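merged_data <- transform(merged_data,
  group = cut(
    suspension_rate_total_pct,  # the variable to bin
    breaks = c(0, .5, 1, 1.5, 2),
    include.lowest = TRUE,
    labels = c("low", "lowish", "highish", "high")))
ggplot(merged_data, aes(x = group, y = farms_pct)) +
  geom_boxplot()  # boxplot layer, as the question asks for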
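To see how cut() assigns values to the four labels, here is a small stand-alone sketch on a made-up vector (rates is a hypothetical stand-in for the suspension column):

```r
# Made-up rates, including both boundary values and a missing value
rates <- c(0, 0.2, 0.5, 0.9, 1.2, 2, NA)

# Intervals are closed on the right: [0, 0.5], (0.5, 1], (1, 1.5], (1.5, 2].
# include.lowest = TRUE makes the first interval include 0 itself.
group <- cut(
  rates,
  breaks = c(0, .5, 1, 1.5, 2),
  include.lowest = TRUE,
  labels = c("low", "lowish", "highish", "high")
)

# Count how many values landed in each group, keeping NA visible
table(group, useNA = "ifany")
```

Missing values stay NA in the resulting factor, so those rows would appear as their own NA category in a plot unless they are filtered out first.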
For an assignment, I am applying mixture modeling with the mixtools package in R. When I try to figure out the optimal number of components with the bootstrap, I get the following error:
Error in boot.comp(y, x, N = NULL, max.comp = 2, B = 5, sig = 0.05, arbmean = TRUE, :
Number of trials must be specified!
I found out that I have to supply N, which the documentation describes as: "An n-vector of number of trials for the logistic regression type logisregmix. If NULL, then N is an n-vector of 1s for binary logistic regression."
But I don't know how to work out what N actually is in order to make my bootstrap work.
My code:
data <- read.csv("Desktop/WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = FALSE,
na.strings = c("NA", "N/A", "Unknown*", "NULL", ".P"))
data <- droplevels(na.omit(data))
data <- data[c(1:5032),]
testdf <- data[c(5033:7032),]
data <- subset(data, select = -customerID)
data$Churn <- revalue(data$Churn, c("Yes"=1, "No"=0))
y <- as.numeric(data$Churn)
x <- model.matrix(Churn ~ . , data = data)
x <- x[, -1] #remove intercept
x <-x[,-c(7, 11, 13, 15, 17, 19, 21)] #multicollinearity
a <- boot.comp(y, x, N = NULL, max.comp = 2, B = 100,
sig = 0.05, arbmean = TRUE, arbvar = TRUE,
mix.type = "logisregmix", hist = TRUE)
Below there is more information about my predictors:
structure(c(0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
34, 2, 45, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 1, 1, 0, 29.85, 56.95, 53.85, 42.3, 29.85, 1889.5, 108.15,
1840.75), .Dim = c(4L, 23L), .Dimnames = list(c("1", "2", "3",
"4"), c("genderMale", "SeniorCitizen", "PartnerYes", "DependentsYes",
"tenure", "PhoneServiceYes", "MultipleLinesYes", "InternetServiceFiber optic",
"InternetServiceNo", "OnlineSecurityYes", "OnlineBackupYes",
"DeviceProtectionYes", "TechSupportYes", "StreamingTVYes", "StreamingMoviesYes",
"ContractOne year", "ContractTwo year", "PaperlessBillingYes",
"PaymentMethodCredit card (automatic)", "PaymentMethodElectronic check",
"PaymentMethodMailed check", "MonthlyCharges", "TotalCharges")))
My response variable is binary
I hope you guys can help me out!
Looking in the source code of mixtools::boot.comp, which is scary as it is over 800 lines long and in serious need of refactoring, the offending lines are:
if (mix.type == "logisregmix") {
if (is.null(N))
stop("Number of trials must be specified!")
Despite what the documentation says, N must be specified.
Try to set it to a vector of 1s: N = rep(1, length(y)) or N = rep(1, nrow(x))
In fact, if you look in mixtools::logisregmixEM, the internal function called by boot.comp, you'll see how N is set if NULL:
n <- length(y)
if (is.null(N)) {
N = rep(1, n)
Too bad this is never reached if N is NULL since it stops with an error before. This is a bug.
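As a minimal sketch of the workaround, with made-up stand-ins for the question's y and x (the values below are hypothetical; only the construction of N matters), the vector can be built and checked before it is passed on:

```r
# Toy stand-ins for the question's y and x (made-up values, for illustration only)
y <- c(0, 1, 1, 0, 1, 0)
x <- cbind(tenure = c(1, 34, 2, 45, 8, 22),
           MonthlyCharges = c(29.85, 56.95, 53.85, 42.30, 99.65, 89.10))

# One trial per observation, i.e. ordinary binary logistic regression
N <- rep(1, length(y))

# N needs one entry per row of the design matrix
stopifnot(length(N) == nrow(x))

# With mixtools installed, the question's call would then become:
# a <- mixtools::boot.comp(y, x, N = N, max.comp = 2, B = 100,
#                          sig = 0.05, arbmean = TRUE, arbvar = TRUE,
#                          mix.type = "logisregmix", hist = TRUE)
```

The boot.comp call is left commented out so the sketch runs without mixtools; its arguments are copied from the question, with only N changed.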
I would like to convert wide data to long data in R. My data set is for cross-classified models, exploring participants’ responses to each target item, where the items have different characteristics.
condition is one of the two conditions that participants were assigned to.
The participants were tested twice: t1 and t2.
The item-level predictor variables are coded as x1 and x2.
The response codes whether a participant’s answer to the item was right or wrong.
Two test formats were administered: test1 and test2.
Although there are many tutorials on wide-to-long conversion, I could not find one that specifically explains the conversion for cross-classified models.
I would like to use tidyverse if possible for the sake of consistency.
My sample data is the following:
structure(list(item_name = c("x1", "x2", "participant_id", "1",
"2", "3", "4", "5", "6", "7"), participant_variable_1 = c(NA,
NA, NA, 20, 23, 21, 20, 19, 22, 30), condition = c(NA, NA, NA,
"A", "B", "A", "B", "A", "B", "A"), t1.item1.test1 = c(1, 3,
NA, 0, 1, 0, 1, 0, 0, 1), t1.item2.test1 = c(2, 2, NA, 0, 0,
0, 1, 1, 0, 1), t1.item3.test1 = c(1, 3, NA, 0, 0, 0, 1, 0, 0,
0), t1.item4.test1 = c(3, 1, NA, 1, 0, 0, 0, 1, 1, 0), t2.item1.test1 = c(1,
3, NA, 0, 1, 1, 0, 1, 1, 1), t2.item2.test1 = c(2, 2, NA, 1,
0, 1, 0, 1, 0, 1), t2.item3.test1 = c(1, 3, NA, 0, 0, 0, 1, 0,
0, 0), t2.item4.test1 = c(3, 1, NA, 1, 1, 0, 1, 1, 1, 0), t1.item1.test2 = c(1,
3, NA, 0, 1, 0, 1, 0, 0, 1), t1.item2.test2 = c(2, 2, NA, 0,
0, 0, 1, 1, 0, 1), t1.item3.test2 = c(1, 3, NA, 0, 0, 0, 1, 0,
0, 0), t1.item4.test2 = c(3, 1, NA, 1, 0, 0, 0, 1, 1, 0), t2.item1.test2 = c(1,
3, NA, 0, 1, 1, 0, 1, 1, 1), t2.item2.test2 = c(2, 2, NA, 1,
0, 1, 0, 1, 0, 1), t2.item3.test2 = c(1, 3, NA, 0, 0, 0, 1, 0,
0, 0), t2.item4.test2 = c(3, 1, NA, 1, 1, 0, 1, 1, 1, 0)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I would like to have long data, which looks like the following:
Please and thank you for your guidance!
This answer requires heavy use of the new pivot_ functions in the dev version of tidyr. You can install that with devtools::install_github("tidyverse/tidyr") if you're willing to run the dev version.
First we split the data into item and participant info - you're not really getting any benefit from storing both in the same table:
item_info = dat[1:2, ]
participant_info = dat[4:nrow(dat), ] %>%
rename(participant_id = item_name)
Then it's time for a lot of pivoting:
# I have the dev version of tidyr so that is being loaded
library(tidyr)
library(dplyr)

# item-level predictors: one row per time/item with columns x1 and x2
item_long = item_info %>%
  select(-participant_variable_1, -condition) %>%
  pivot_longer(
    cols = t1.item1:t2.item4,
    names_to = c("time", "item"),
    names_pattern = "t(\\d)\\.(item\\d)"
  ) %>%
  pivot_wider(names_from = item_name, values_from = value)

# participant responses: one row per participant/time/item
participant_long = participant_info %>%
  pivot_longer(
    cols = t1.item1:t2.item4,
    names_to = c("time", "item"),
    names_pattern = "t(\\d)\\.(item\\d)",
    values_to = "response"
  )

# attach the item-level predictors to each response
combined = participant_long %>%
  left_join(item_long, by = c("item", "time"))
> combined
# A tibble: 56 x 8
participant_id participant_variable_1 condition time item response x1 x2
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 20 A 1 item1 0 1 3
2 1 20 A 1 item2 0 2 2
3 1 20 A 1 item3 0 1 3
4 1 20 A 1 item4 1 3 1
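The column names in the question's sample data carry a third part (.test1 / .test2), which the two-group pattern above does not capture. A possible extension, sketched on a small made-up table with the same naming scheme (the tibble wide and its values are hypothetical), adds a third capture group:

```r
library(dplyr)
library(tidyr)

# Toy wide data using the question's time.item.test naming scheme; values are made up
wide <- tibble(
  participant_id = c("1", "2"),
  condition = c("A", "B"),
  t1.item1.test1 = c(0, 1),
  t2.item1.test1 = c(0, 1),
  t1.item1.test2 = c(0, 1),
  t2.item1.test2 = c(1, 1)
)

# One capture group per name component: time, item and test format
long <- wide %>%
  pivot_longer(
    cols = matches("^t\\d\\.item\\d\\.test\\d$"),
    names_to = c("time", "item", "test"),
    names_pattern = "^t(\\d)\\.(item\\d)\\.(test\\d)$",
    values_to = "response"
  )

long
```

Anchoring the pattern with ^ and $ is a deliberate choice here, so that the full column name has to match. With the question's real data, the join to the item-level table would presumably then need "test" as an additional key if x1 and x2 differ by test format.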