How to create a new variable using the Mutate function? - r

I need to add a new variable to my data, and I would like to do it using the mutate function. How can I do it? I am using the ISLR library.
Create a new variable called "HighVol" that has the classes "yes" and "no"
to indicate whether the location sold 10,000 units or more in the past year.
How many stores produced a high volume?
Example below.
carseats.df$HighVol <- factor(carseats.df$HighVol,
levels = c(0,1),
labels = c("No", "Yes"))

If you use mutate, you work with the entire data frame. You'll want the whole data frame if the assignment of yes or no is conditional on sales.
library(tidyverse)
# create carseats.df
set.seed(39582) # make it repeatable
carseats.df <- data.frame(sales = rnorm(100, 10000, 505))
# now create conditional variable
carseats.df <- carseats.df %>%
  mutate(HighVol = ifelse(sales >= 10000, # true or false; ">=" matches "10,000 units or more"
                          "yes", # result if true
                          "no") %>% # result if false
           as.factor()) # convert the result to a factor
head(carseats.df)
#       sales HighVol
# 1  9992.190      no
# 2 10077.482     yes
# 3  9507.145      no
# 4 10780.788     yes
# 5 10433.133     yes
# 6 10907.665     yes
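The question also asks how many stores produced a high volume, which the code above does not count. Here is a minimal base R sketch of that step, assuming a small made-up `stores` data frame (the sales figures are hypothetical, not taken from ISLR). The same `ifelse()` expression can be used inside `mutate()`.

```r
# Hypothetical sales figures (units sold in the past year); not from ISLR.
stores <- data.frame(sales = c(9992, 10077, 9507, 10780, 10000, 8450))

# ">=" matches "10,000 units or more" in the assignment text.
stores$HighVol <- factor(ifelse(stores$sales >= 10000, "yes", "no"),
                         levels = c("no", "yes"))

# Count the stores in each class.
high_vol_counts <- table(stores$HighVol)
print(high_vol_counts)

# Number of high-volume stores only.
n_high <- sum(stores$HighVol == "yes")
print(n_high)
```

Setting `levels` explicitly keeps both classes in the table even if one of them happens to be empty.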
It looks like you're fairly new to SO; welcome to the community! If you want great answers quickly, it's best to make your question reproducible. This includes sample data, like the output from dput(head(dataObject)), and any libraries you are using. Check it out: making R reproducible questions.
The reason you haven't seen any help is most likely the lack of meaningful tags. You only have the tag tree, which isn't meaningful. At a minimum, you would want to include a tag for the programming language: r. You could also add things like mutate or the library it's derived from, dplyr.

Related

Extracting from the data frame produced using GageRR/GageRRDesign in R

How do I extract the "VarCompContrib" column in the data frame produced using the gageRR function in R?
This is for a GageRR analysis of a measurement system. I'm trying to make a very user friendly program where other people can just enter the information required, like number of operators, parts, and measurements, as well as the measurements themselves, and output the correct analysis. I'm gonna use an if-statement later on to do the "analysis" portion, but I am having trouble actually managing the data frame produced with gageRR.
library(MASS)
library(Rsolnp)
library(qualityTools)
design = gageRRDesign(Operators=3, Parts=10, Measurements=2, randomize=FALSE)
response(design) = c(23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,
23,24,23,24,24,22,22,22,24,23,22,24,20,20,25,24,22,24,21,20,21,22,21,22,21,
21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23)
gdo=gageRR(design)
plot(gdo)
I am looking to get a 7 number column vector under VarCompContrib
For starters, you can look at the structure of gdo with str(gdo). From there, we see that Varcomp is a slot, so we can access it with gdo@Varcomp and just convert it to a data.frame:
library(qualityTools)
design <- gageRRDesign(Operators = 3, Parts = 10, Measurements = 2, randomize = FALSE)
response(design) <- c(
23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,23,24,23,24,24,22,22,22,24,23,22,24,
20,20,25,24,22,24,21,20,21,22,21,22,21,21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23
)
gdo <- gageRR(design)
data.frame(gdo@Varcomp)
# totalRR repeatability reproducibility a a_b bTob totalVar
# 1 1.66441 1.209028 0.4553819 0.4553819 0 1.781211 3.445621
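To illustrate slot access without needing qualityTools installed, here is a rough sketch using a toy S4 class. The class name `ToyGage` is made up, the slot is assumed to hold a list, and the numbers are copied from the output above. The last step assumes that "VarCompContrib" means each variance component divided by the total variance, so check that against the table printed by gageRR before relying on it.

```r
library(methods)

# Toy S4 class standing in for the gageRR result (hypothetical name).
setClass("ToyGage", slots = c(Varcomp = "list"))

toy <- new("ToyGage", Varcomp = list(
  totalRR = 1.66441, repeatability = 1.209028, reproducibility = 0.4553819,
  a = 0.4553819, a_b = 0, bTob = 1.781211, totalVar = 3.445621
))

# Slots are read with `@` or, equivalently, with slot().
varcomp <- unlist(toy@Varcomp)
same_values <- identical(varcomp, unlist(slot(toy, "Varcomp")))

# Assumed definition: contribution = component / total variance.
var_comp_contrib <- varcomp / varcomp[["totalVar"]]
print(round(var_comp_contrib, 4))
```

With the real object, the first step would be `unlist(gdo@Varcomp)` instead of the toy data.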

Custom Descriptive Statistics Table Export

I have spent a few weeks looking for solutions to my problem of not only finding but creating a descriptive statistics summary table on my data that is exportable to xlsx (ideally). I have only found partial answers and my knowledge in R and R packages is still basic enough to limit my progress. My data set is time series data with 6 columns that have 50,000+ rows.
My DF information:
DateTime: POSIXct format "YYYY-MM-DD HH:MM:SS"
Var1: num
Var2: num
Var3: num
Var4: factor w/ 2 levels "A","B"
Var5: factor w/ 4 levels "S1","S2","S3","S4"
My objectives are as follows:
Manipulate my data frame using tidyverse to subset my data
Take the subsetted data to create 1 summary table( i.e. in a tibble or data.frame format) with 2 sub factors (Var4 and/or Var5) for Var1, Var2, and Var3. Below is a simplified, visual example of the table I am aiming for:
Export the summary table (or summary tables if one table is not possible) to xlsx (ideally), .CSV, or .TXT to be used in Excel for stylistic table edits. At the moment, the "writexl" package works very well for me as I have problems with the "xlsx" and "openxlsx" packages. Here is the code needed to export to xlsx using the writexl package: write_xlsx(dataframe, path = "C:/Users/user/Desktop"). Note for MacOS users, path = /Users/admin/yoursubfolder/yoursubfolder.... (fill in "yoursubfolder" with the actual folder name on your computer)
What I have done:
Used dplyr and the %>% function to manipulate the data without and with factor Var4 or Var5
Tried to create a summary table with Var4 as a factor for Var1,Var2, and Var3 (partial success; style is not what I want or it is not exportable to excel)
Looked in multiple StackOverflow questions and Google searches with no success to find code that works for my particular case. I've tried qwraps2 to create one and looked into the following packages for something pre-made: psych, stargazer, and Hmisc. I do not like their table styles and they do not all have the option to just show N, mean, StDev, SEM, Min, and Max.
I know SEM is not a standard function in most packages; thus, I borrowed this function from an answer on Stack Overflow because I do not know how to create functions. Here is the borrowed code: SEM <- function(x) sd(x)/sqrt(length(x))
Since I cannot attach sample data and my coding is very basic, here is what I could come up with:
Example data:
Unfortunately, I cannot attach sample data for testing. Also due to my limited knowledge of R, I cannot make a perfect data frame. Below is a sample data frame, but I cannot get the factor to be evenly distributed in their respective columns (Sorry). Here is my code:
df <- data.frame(
"DateTime" = seq(c(ISOdate(2018,03,01)), by = "day", length.out = 100),
"Var1" = rnorm(1:100),
"Var2" = rnorm(1:100),
"Var3" = rnorm(1:100),
"Var4" = c("A", "B"),
"Var5" = c("S1","S2", "S3", "S4"))
I was trying this:
"S1"[(1:25)],
"S2"[(26:50)],
"S3"[(51:75)],
"S4"[(76:100)] # and
"A"[(1:50], "B"[(51:100)] #but that didn't work, so sorry again.
Despite my lack of proper coding, any guidance, tips, and suggestions from anyone with more experience in R would be greatly appreciated. I do like R and all the capabilities of the software, but I find it very inconvenient that there is no simple, straightforward way to export tables from the console into useful forms like Excel spreadsheets or Word documents, instead of the standard export in LaTeX format (which I do not understand at all, btw). I know this topic has been discussed in different forums and others share my sentiment on how terrible it is, especially for people who need it for data processing instead of document creation like R Markdown.
An example with your df:
library(dplyr)
library(tidyr)
SEM_function <- function(x){sd(x)/sqrt(length(x))}
df %>% as_tibble() %>%
gather("Var_num", "value",Var1:Var3) %>%
group_by(Var_num, Var4,Var5) %>%
summarise("N" = n(),
"mean" = mean(value),
"StDev" = sd(value),
"SEM" = SEM_function(value) ,
"min" = min(value),
"max" = max(value))
Hope this helps
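Two pieces of the question are not covered above: getting the factors evenly distributed in the sample data, and exporting the result. A minimal base R sketch of both, assuming a single numeric column for brevity and a temporary output file (the file name is made up):

```r
SEM <- function(x) sd(x) / sqrt(length(x))

set.seed(1)  # arbitrary seed, only to make the sketch repeatable
df <- data.frame(
  Var1 = rnorm(100),
  # rep(..., each = n) gives the evenly distributed blocks from the question
  Var4 = rep(c("A", "B"), each = 50),
  Var5 = rep(c("S1", "S2", "S3", "S4"), each = 25)
)

# One summary row per (Var4, Var5) combination that occurs in the data.
groups <- split(df, list(df$Var4, df$Var5), drop = TRUE)
summary_tbl <- do.call(rbind, lapply(groups, function(g) {
  data.frame(Var4 = g$Var4[1], Var5 = g$Var5[1],
             N = nrow(g), mean = mean(g$Var1), StDev = sd(g$Var1),
             SEM = SEM(g$Var1), min = min(g$Var1), max = max(g$Var1))
}))

# Write a CSV that Excel can open. The path must name a file, not a folder.
out_file <- file.path(tempdir(), "summary_table.csv")
write.csv(summary_tbl, out_file, row.names = FALSE)
```

If writexl is installed, `writexl::write_xlsx(summary_tbl, path = "summary_table.xlsx")` should produce an xlsx file in the same way. There too, the path has to end in a file name.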

Using value-labels in R with sjlabelled package

Recently I have switched from Stata to R.
In Stata, you have something called a value label. Using the command encode, for example, allows you to turn a string variable into a numeric one, with a string label attached to each number. Since string variables contain names (which repeat themselves most of the time), using value labels allows you to save a lot of space when dealing with large datasets.
Unfortunately, I did not manage to find a similar command in R. The only package I have found that could attach labels to my values vector is sjlabelled. It does the attachment, but when I'm trying to merge the labelled numeric vector into another data frame, the labels seem to "fall off".
Example: Start with a string variable.
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled")
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentrnces_df <- as.data.frame(sentences)
sentences <- unique(sentrnces_df$sentences)
group_sentences <- c(1:length(sentences))
sentences <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences,group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = (z$sentences))
sentrnces_df <- merge(sentrnces_df, z, by = c('sentences'))
get_labels(z$group_sentences) # the labels I was attaching using set labels
get_labels(sentrnces_df$group_sentences) # the output is just “NULL”
Thanks!
P.S. Sorry about the inelegant code, as I said before, I'm pretty new in R.
source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
...
Around June of 2007, R introduced hashing of CHARSXP elements in the
underlying C code thanks to Seth Falcon. What this meant was that
effectively, character strings were hashed to an integer
representation and stored in a global table in R. Anytime a given
string was needed in R, it could be referenced by its underlying
integer. This effectively put in place, globally, the factor encoding
behavior of strings from before. Once this was implemented, there was
little to be gained from an efficiency standpoint by encoding
character variables as factor. Of course, you still needed to use
‘factors’ for the modeling functions.
...
I adjusted your initial test data a little bit. I was confused by so many strings and am unsure whether they are necessary for this issue. Let me know, if I missed a point. Here is my adjustment and the answer:
#####################################
# initial problem rephrased
#####################################
library(sjlabelled)
# create test data
id = seq(1:20)
variable1 = sample(30:35, 20, replace=TRUE)
variable2 = sample(36:40, 20, replace=TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)
# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))
# show labels in this frame
get_labels(df1)
# include associated values
get_labels(df1, values = "as.prefix")
# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))
# labels lost after merge
get_labels(df_merge, values = "as.prefix")
#####################################
# solution with dplyr
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")
Solution attributed to:
Merging and keeping variable labels in R
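For comparison, base R factors behave much like Stata's encoded variables: integer codes with a string label per code. As far as I know, they also keep their labels through merge(). A small sketch with made-up data (the names `df1`, `df2` and `city` are only for illustration):

```r
# A string variable with repeated values.
words <- c("Milan", "Paris", "Milan", "New York", "Paris")

# factor() stores integer codes plus one label (level) per distinct string.
city <- factor(words)
codes <- as.integer(city)   # the underlying integer codes
labels <- levels(city)      # the label attached to each code

df1 <- data.frame(id = 1:5, city = city)
df2 <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

# The factor column keeps its levels after the merge.
merged <- merge(df1, df2, by = "id")
print(levels(merged$city))
```

This does not replace sjlabelled's labelled vectors, but it may be enough when the goal is only a compact numeric encoding with names attached.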

Visualizing hierarchical data with circle packing in ggplot2?

I have some hierarchical data, e.g.,
> library(dplyr)
> df <- data_frame(id = 1:6, parent_id = c(NA, 1, 1, 2, 2, 5))
> df
Source: local data frame [6 x 2]
id parent_id
(int) (dbl)
1 1 NA
2 2 1
3 3 1
4 4 2
5 5 2
6 6 5
I would like to plot the tree in a "top down" view through a circle packing plot:
http://bl.ocks.org/mbostock/4063530
The above link is for a d3 library. Is there an equivalent that allows me to make such a plot in ggplot2?
(I want this plot in a shiny app, which does support d3, but I haven't used d3 before and am unsure about the learning curve. If d3 is the obvious choice, I will try to get that working instead. Thanks.)
There were two steps: (1) aggregate the data, then (2) convert to json. After that, all the javascript has been written in that example page, so you can just plug in the resulting json data.
Since the aggregated data should have a similar structure to a treemap, we can use the treemap package to do the aggregation (could also use a loop with successive aggregation). Then, d3treeR (from github) is used to convert the treemap data to a nested list, and jsonlite to convert the list to json.
I'm using some example data GNI2010, found in the d3treeR package. You can see all of the source files on plunker.
library(treemap)
library(d3treeR) # devtools::install_github("timelyportfolio/d3treeR")
library(data.tree)
library(jsonlite)
## Get treemap data using package treemap
## Using example data GNI2010 from d3treeR package
data(GNI2010)
## aggregate by these: continent, iso3,
## size by population, and color by GNI
indexList <- c('continent', 'iso3')
treedat <- treemap(GNI2010, index=indexList, vSize='population', vColor='GNI',
type="value", fun.aggregate = "sum",
palette = 'RdYlBu')
treedat <- treedat$tm # pull out the data
## Use d3treeR to convert to nested list structure
## Call the root node 'flare' so we can just plug it into the example
res <- d3treeR:::convert_treemap(treedat, rootname="flare")
## Convert to JSON using jsonlite::toJSON
json <- toJSON(res, auto_unbox = TRUE)
## Save the json to a directory with the example index.html
writeLines(json, "d3circle/flare.json")
I also replaced the source line in the example index.html with
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>
Then fire up the index.html and you should see the circle-packing plot rendered from your data.
Creating the shiny bindings should be doable using htmlwidgets and following some examples (the d3treeR source has some). Note that certain things aren't working, like the coloring. The json that gets stored here actually contains a lot of information about the nodes (all the data aggregated using the treemap) that you could leverage in the figure.
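If your data is already in the id/parent_id form from the question, the nested structure can also be built directly, without the treemap step. Below is a rough base R sketch. The helper `build_node` is made up for illustration, and the field names `name`, `children` and `size` are assumed to be what the d3 example's flare.json expects.

```r
# The hierarchy from the question: each row points at its parent.
df <- data.frame(id = 1:6, parent_id = c(NA, 1, 1, 2, 2, 5))

# Recursively build a nested list for one node and its descendants.
build_node <- function(node_id, edges) {
  child_ids <- edges$id[!is.na(edges$parent_id) & edges$parent_id == node_id]
  node <- list(name = as.character(node_id))
  if (length(child_ids) > 0) {
    node$children <- lapply(child_ids, build_node, edges = edges)
  } else {
    node$size <- 1  # placeholder leaf size; replace with a real measure
  }
  node
}

# The root is the row without a parent.
root_id <- df$id[is.na(df$parent_id)]
tree <- build_node(root_id, df)
str(tree, max.level = 2)
```

The resulting list can then be passed to jsonlite::toJSON(tree, auto_unbox = TRUE), as in the code above.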

Data cleaning in Excel sheets using R

I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values; for example, the Branch name is specified as (Computer Science and Engineering, C.S.E, C.S, Computer Science). So how can I bring all of them into a single notation?
The car package has a recode function. See its help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist) {
  # levslist has the form: list(animal = c("cow", "pig"),
  #                             bird = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  # levels not named in levslist are grouped as "all_others"
  othrlevs <- inlevs[!inlevs %in% unlist(levslist)]
  # wrap in list() so several leftover levels merge into one "all_others" level
  levels(nfac) <- c(levslist, list(all_others = othrlevs))
  nfac
}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest OpenRefine. It is a great tool for exactly this job. It is not hosted; you can safely download it and run it against your dataset (xls/xlsx work fine). You then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There are no one-size-fits-all solutions for these types of problems. From what I understand, you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using the built-in letters vector so you can see what I mean
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters, 2000, replace = TRUE)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to rename a through g as just A
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
And you'd repeat the process above, eliminating or grouping the unique names as appropriate.
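Applied to the Branch names from the question, the same idea can be written as a lookup table. This is a minimal sketch, assuming "C.S.E" is the notation you want to keep. The `branch` vector, including the extra "Mechanical" entry, is made up for illustration.

```r
# Inconsistent spellings as they might appear in the sheet (hypothetical data).
branch <- c("Computer Science and Engineering", "C.S.E", "C.S",
            "Computer Science", "Mechanical")

# Named vector: names are the variants, values are the canonical notation.
lookup <- c("Computer Science and Engineering" = "C.S.E",
            "C.S.E" = "C.S.E",
            "C.S" = "C.S.E",
            "Computer Science" = "C.S.E")

# Recode the known variants and leave everything else untouched.
known <- branch %in% names(lookup)
cleaned <- branch
cleaned[known] <- unname(lookup[branch[known]])
print(cleaned)
```

Running `sort(unique(cleaned))` afterwards is a quick way to spot variants that still need an entry in the lookup.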
