Creating dummy variables for a dataset [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am new to R and have a data set containing a column with 3 states (1, 2, 3). The problem is that I don't know how to split the data set into the respective dummy variables so that I can create box plots and, ultimately, a linear model.
Please help!! :'(

I think you can specify which feature is categorical. For example:
data <- read.csv(filename)
data$feature <- factor(data$feature)
where feature is the column you want to convert to categorical data.
Is that what you are looking for?

If I understand your problem, you have 2 columns: one with factor levels (1, 2, 3 in your example) and another with a response variable. Is that it? (An example with part of your data would be very helpful.) In any case, if your data has this structure you don't need to split it. For a boxplot just run
boxplot(data$variable~data$factor)
You can use the same approach for a linear model:
lm(data$variable~data$factor)
If your data has a different structure, you will need to explain it before someone can give further help.
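
To make the two answers above concrete, here is a minimal sketch on made-up data. The column names state and response and the simulated values are assumptions for illustration only; substitute your own columns. It shows that once the column is a factor, boxplot() and lm() handle the grouping themselves, and that model.matrix() exposes the dummy columns if you want to look at them.

```r
# Hypothetical example data: 'state' takes the values 1, 2, 3 and
# 'response' is a simulated numeric outcome (not real data).
set.seed(1)
data <- data.frame(
  state    = rep(1:3, each = 10),
  response = rnorm(30, mean = rep(c(5, 7, 9), each = 10))
)

# Mark the column as categorical.
data$state <- factor(data$state)

# One box per level; no manual splitting required.
boxplot(response ~ state, data = data)

# lm() expands the factor into dummy (treatment-contrast) columns itself.
fit <- lm(response ~ state, data = data)

# The dummy columns lm() builds internally, shown explicitly.
dummies <- model.matrix(~ state, data = data)
head(dummies)
```

With the default treatment contrasts, level 1 is the reference level, so the coefficients for the other two levels are differences from it.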

Related

can anyone explain to me the tsSmooth function in R? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
Can anyone explain the tsSmooth function in R to me?
I would like to use it to obtain a univariate time series with a linear trend.

Please note that in your code:
x<-rt(n=30,df=3, ncp=10)
y<-rt(n=20,df=3,ncp=20)
myseries<-c(x,y)
tsSmooth<-c(x,y)
newseries<-tsSmooth
you didn't apply the tsSmooth() function to your data. You simply created a vector named tsSmooth and another vector named newseries.
The tsSmooth() function expects a fitted time series model (currently an object of class "StructTS", as returned by StructTS()) rather than a raw vector, and its documentation doesn't provide much explanation.
This discussion might help: https://stats.stackexchange.com/questions/125946/generate-a-time-series-comprising-seasonal-trend-and-remainder-components-in-r
In addition, you could generate a simple trend using a moving average, but I am not sure it has all the statistical features you are looking for.
library("TTR")
plot.ts(myseries)
trendSMA <- SMA(myseries)
plot.ts(trendSMA)
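
For comparison, here is a minimal sketch of what actually calling tsSmooth() could look like. It assumes a simulated series (linear trend plus Gaussian noise, not the data from the question) and a local-level structural model; both are choices made for illustration, not a recommendation for your data.

```r
# Simulated series (an assumption): linear trend plus Gaussian noise.
set.seed(1)
myseries <- ts(0.5 * (1:50) + rnorm(50))

# tsSmooth() needs a fitted model, so fit a structural model first.
fit <- StructTS(myseries, type = "level")

# Fixed-interval smoothing of the fitted state-space model.
smoothed <- tsSmooth(fit)

# Flatten to a plain numeric vector for plotting.
trend <- as.numeric(smoothed)

plot(myseries)
lines(as.numeric(time(myseries)), trend, col = "red")
```

StructTS() also accepts type = "trend", which fits a local linear trend model instead; whether that suits your series is something to check against ?StructTS.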

Use data.frame to analyse data [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I need to analyse the Investment data as a data.frame using the lattice package.
The requirement is to use a data.frame. Instead of plotting only the variable Investment, I need to plot it against the variables that influence it, in several different graphs. I tried to draw one, but it's not what I need, because my code doesn't use a data.frame at all.
library(lattice)
xyplot(Investment~GNP,data=Investment)
is.data.frame(Investment)
How can I change my code to meet these requirements? Thank you in advance.

data(Investment, package="sandwich")
Investment <- as.data.frame(Investment)
The second line converts the data into a data frame with as.data.frame().
To access individual columns of a data frame, use the $ operator, for example:
Investment$column_name
The plot function also accepts columns selected with the $ operator:
plot(Investment$column1, Investment$column2)
as.data.frame() converts other data formats into a data frame, while data.frame() creates a new data frame from scratch.
Hope it helps.
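
Since the question asks for lattice specifically, here is a minimal sketch of plotting one response against several explanatory variables from a data frame. The data frame inv and its values are simulated stand-ins, and the column names GNP, Interest, and Investment are assumed to match the real data; swap in the converted Investment data frame from above.

```r
library(lattice)

# Simulated stand-in for the Investment data (assumption: the real
# data frame has columns with these names).
set.seed(1)
inv <- data.frame(GNP = seq(600, 1500, length.out = 20))
inv$Interest   <- runif(20, 4, 10)
inv$Investment <- 0.15 * inv$GNP - 5 * inv$Interest + rnorm(20, sd = 10)

# Confirm that the object passed to xyplot() really is a data frame.
stopifnot(is.data.frame(inv))

# One panel per explanatory variable, each with its own x-axis scale.
p <- xyplot(Investment ~ GNP + Interest, data = inv, outer = TRUE,
            scales = list(x = list(relation = "free")),
            xlab = "Explanatory variable")
print(p)
```

With outer = TRUE the terms on the right-hand side of the formula are drawn in separate panels rather than overlaid in one.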

Statistical Programme R - using one category of continuous data? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm using R, trying to use data I've been given to plot a boxplot. One of the categories is "sex", with (obviously) either "M" or "F". How do I exclude females, so I can use the male data to create a boxplot? I'm new to R, any help would be appreciated!
EDIT: Okay, I managed to make a vector(??) only including male data using newdata<-subset(olympic, sex=="M"). Now how do I do the same but with two subsets of a different category of continuous data? Is it similar notation? (E.g. say the category is "hair" with different colours, and I want "blonde" and "ginger".)

Use the subset argument: boxplot(, subset = )
I don't have your data, so I can only point you to the subset argument.
As a general guideline, check the documentation page of an R function before asking on SO. R functions, especially those producing summary plots, are very powerful because they often have many arguments offering flexible options. See ?boxplot in this case.

Try running the boxplot on male_data:
male_data <- data[data$sex == "M", ]
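
For the follow-up in the EDIT (keeping two values of another category), here is a minimal sketch using %in%. The olympic data frame below is made up, and the column names sex, hair, and height are assumptions for illustration.

```r
# Made-up stand-in for the 'olympic' data (not real measurements).
olympic <- data.frame(
  sex    = c("M", "F", "M", "M", "F", "M"),
  hair   = c("blonde", "ginger", "brown", "ginger", "blonde", "blonde"),
  height = c(180, 165, 175, 190, 170, 185)
)

# Keep only males, as in the question.
males <- subset(olympic, sex == "M")

# Keep two hair colours at once with %in%.
two_colours <- subset(olympic, hair %in% c("blonde", "ginger"))

# The same conditions can go straight into boxplot()'s subset argument.
boxplot(height ~ hair, data = olympic,
        subset = sex == "M" & hair %in% c("blonde", "ginger"))
```

Using %in% avoids chaining several == comparisons with |, and it extends naturally to more than two values.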

Machine learning - Calculating the importance of a "value" in a variable [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I’m analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (type of drug in this case) in a variable for boosting? I need to know if ‘drug A’ is better for prediction than ‘drug B’, both in a variable called ‘medicine’.
The logistic regression model is able to give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well. I’m currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards Peter

See varImp() in the caret package, which supports the ML algorithms you mentioned.
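
As a point of reference, here is a minimal base R sketch of the logistic regression baseline mentioned in the question, where each level of a factor gets its own coefficient and p-value. This is not the varImp() approach; the data are simulated, and the names medicine and hospitalized, along with the three drug levels, are assumptions for illustration.

```r
# Simulated data (an assumption): three drugs, with drug "B" given a
# higher hospitalization probability than the others.
set.seed(42)
n <- 300
medicine <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
p <- ifelse(medicine == "B", 0.6, 0.3)
hospitalized <- rbinom(n, 1, p)

# glm() expands the factor into one dummy column per non-reference level.
fit <- glm(hospitalized ~ medicine, family = binomial)

# One row per level (relative to the reference level "A"), with
# estimate, standard error, z value, and p-value columns.
coefs <- summary(fit)$coefficients
print(coefs)
```

Each row other than the intercept compares one drug with the reference level, so the per-drug information comes from the factor coding rather than from separate hand-built binary variables.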

R normalization with all samples, or just the part that i need? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I am using the edgeR and Limma packages to analyse an RNA-seq count data table.
I only need a subset of the data file, so my question is: do I need to normalize my data across all the samples, or is it better to subset my data first and then normalize?
Thank you.
Regards Lisanne

I think it depends on what you want to prove or show. If you also want to take your "dark counts" into account, then you should normalize first, so that you also account for the fraction of cases in which your experiment fails. Here the total number of experiments (good and bad results) sums to 1.
If you want to find the distribution of your "good events", then you should first produce your subset of good samples and normalize afterwards. In this case the number of good events sums to 1.
So once again, it depends on what you want to prove. As a physicist I would prefer the first method, since we do not remove bad data points.
Cheers TL
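
The difference between the two orders can be seen in a small arithmetic sketch. The counts below are made up, and the category names are hypothetical; this only illustrates the "sums to 1" reasoning above and is not an edgeR or limma normalization.

```r
# Hypothetical event counts: two kinds of good events plus failures.
counts <- c(good_a = 40, good_b = 20, failed = 40)

# Option 1: normalize over everything, so good and bad sum to 1.
frac_all <- counts / sum(counts)

# Option 2: subset first, then normalize, so only good events sum to 1.
good <- counts[c("good_a", "good_b")]
frac_good <- good / sum(good)

print(frac_all)
print(frac_good)
```

The same event gets a different fraction under each option, which is why the choice depends on what you want to show.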
