df = data.frame(table(train$department, train$outcome))
Here department and outcome are both factors (is_outcome is binary), so this gives me a data frame that looks like the output below,
containing only 2 variables (fields), while I want the department column to be part of the data frame too, i.e. a data frame of 3 variables:
                      0     1
Analytics          4840   512
Finance            2330   206
HR                 2282   136
Legal               986    53
Operations        10325  1023
Procurement        6450   688
R&D                 930    69
Sales & Marketing 15627  1213
Technology         6370   768
One way I learnt was...
df = data.frame(table(train$department, train$is_outcome))
write.csv(df,"df.csv")
rm(df)
df = read.csv("df.csv")
colnames(df) = c("department", "outcome_0","outcome_1")
but I cannot save a file every time in my program.
Is there any way to do it directly?
When you create a table from a matrix in R, you end up with an object like trial.table below. The object trial.table looks exactly the same as the matrix trial it was built from, but it really isn't. The difference becomes clear when you transform these objects to a data frame. Take a look at the outcome of this code:
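For reference, here is a minimal construction of trial and trial.table that is consistent with the output shown below (an assumption, since the original objects aren't defined in this excerpt):
trial <- matrix(c(34, 11, 9, 32), ncol = 2,
                dimnames = list(c("risk", "no_risk"), c("sick", "healthy")))
trial.table <- as.table(trial)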
> trial.df <- as.data.frame(trial)
> str(trial.df)
'data.frame': 2 obs. of 2 variables:
$ sick : num 34 11
$ healthy: num 9 32
Here you get a data frame with two variables (sick and healthy), each with two observations. On the other hand, if you convert the table to a data frame, you get the following result:
> trial.table.df <- as.data.frame(trial.table)
> str(trial.table.df)
'data.frame': 4 obs. of 3 variables:
$ Var1: Factor w/ 2 levels "risk","no_risk": 1 2 1 2
$ Var2: Factor w/ 2 levels "sick","healthy": 1 1 2 2
$ Freq: num 34 11 9 32
Now you get a data frame with three variables. The first two, Var1 and Var2, are factor variables for which the levels are the values of the rows and the columns of the table, respectively. The third variable, Freq, contains the frequencies for every combination of the levels in the first two variables.
The as.data.frame() function converts a table to a data frame in a format that you need for regression analysis on count data. If you need to summarize the counts first, you use table() to create the desired table.
In fact, you also can create tables in more than two dimensions by adding more variables as arguments, or by transforming a multidimensional array to a table using as.table(). You can access the numbers the same way you do for multidimensional arrays, and the as.data.frame() function creates as many factor variables as there are dimensions.
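In line with that, a quick sketch of a three-dimensional example (the vectors here are made up for illustration and are not part of the original trial data):
risk   <- c("risk", "no_risk", "risk", "no_risk")
status <- c("sick", "sick", "healthy", "healthy")
sex    <- c("m", "f", "m", "f")
t3 <- table(risk, status, sex)
t3["risk", "sick", "m"]    # indexed like a multidimensional array
str(as.data.frame(t3))     # three factor columns (risk, status, sex) plus Freq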
I have a dataset that looks like the following, from a study of the species richness of spiders in different habitats of a garden.
'data.frame': 6 obs. of 5 variables:
$ ID : int 1 2 3 4 5 6
$ species_count: num 10 13 15 17 22 9
$ habitat_type : Factor w/ 2 levels "wall","tree": 1 2 1 2 1 2
$ wall_height : num 153 NA 160 NA 170 NA
$ tree_diameter: num NA 48 NA 52 NA 71
I want to create an lm with species_count as the dependent variable and habitat_type, wall_height and tree_diameter as the independent variables; however, the NAs are tricky.
lm.1 <- lm(species_count ~ habitat_type + wall_height + tree_diameter,
data = DF, na.action = na.exclude)
throws up the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
because na.exclude and na.omit delete the entire rows (and every row has an NA in either wall_height or tree_diameter, so nothing is left to fit).
Using:
DF$wall_height <- na.exclude(DF$wall_height)
and
DF$tree_diameter <- na.exclude(DF$tree_diameter)
just recycles the shorter vectors, giving tree_diameter values to wall rows and vice versa, like so:
DF[1,]
ID species_count habitat_type wall_height tree_diameter
1 1 10 wall 153 48
Is there a way to omit NA values only whilst retaining the rest of the information within the row, or will I have to use separate linear models?
Thanks in advance for any help and hope that I've been clear enough in explaining the issue.
The fundamental problem is that wall_height doesn't apply to the tree observations and vice versa.
So there is nothing to be gained by trying to analyze the data from wall and tree habitats together. In principle, you can compare the two habitats, and then evaluate how habitat-specific characteristics are associated with species numbers within a habitat.
In practice, you face a problem of very few observations. Usually you want about 10 cases per predictor that you are using in your model. You might be able to do an adequate comparison of the 2 habitats, but any results within a habitat, with only 3 observations each, will be highly suspect.
A couple of other thoughts. First, count data are often better analyzed with a different type of model, a Poisson generalized linear model. Second, the numbers of species are presumably represented by different numbers of individuals of each. There is probably some information to be gleaned from that, which should be explained in the ecology literature on species diversity.
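A minimal sketch of that per-habitat approach, combined with the Poisson suggestion (the exact formulas are my assumption rather than code from this answer, and with only 3 observations per habitat the fits are illustrative at best):
wall_df <- subset(DF, habitat_type == "wall")
tree_df <- subset(DF, habitat_type == "tree")
fit_wall <- glm(species_count ~ wall_height,   family = poisson, data = wall_df)
fit_tree <- glm(species_count ~ tree_diameter, family = poisson, data = tree_df)
summary(fit_wall)
summary(fit_tree)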
I have a dataframe that looks like this:
Sensor NewValue NewDate
1 iphone/NuhKZFrx/noise 1.00000 2015-10-20 23:26:14
2 iphone/NuhKZFrx/noiseS 58.63411 2015-10-20 23:26:14
3 iphone/wlhAlrPQ/noise 0.00000 2015-10-21 08:03:28
4 iphone/wlhAlrPQ/noiseS 65.26167 2015-10-21 08:03:28
[...]
with the following datatypes:
'data.frame': 405 obs. of 3 variables:
$ Sensor : Factor w/ 28 levels "iphone/5mZU0HWz/noise",..: 11 12 23 24 9 10 23 24 21 22 ...
$ NewValue: num 1 58.6 0 65.3 3 ...
$ NewDate : POSIXct, format: "2015-10-20 23:26:13" "2015-10-20 23:26:14" "2015-10-21 08:03:28" "2015-10-21 08:03:28" ...
The Sensor field is set up like this: <model>/<uniqueID>/<type>. And I want to find out if there is a correlation between noise and noiseS for each uniqueID at a given time.
For a single uniqueID it works fine since there are only two factors. I tried to use xtabs(NewValue~NewDate+Sensor, data=dataNoises) but that gives me zeros since there aren't values for every ID at any time ...
What could I do to somehow compose the factors so that I only have one factor for noise and one for noiseS? Or is there an easier way to solve this problem?
What I want to do is the following:
Date noise noiseS
2015-10-20 23:26:14 1 58.63
2015-10-20 23:29:10 4 78.33
And then compute the pearson correlation coefficient between noise and noiseS.
If I understand your question correctly, you just want a 2-level factor that distinguishes between noise and noiseS?
That can be easily achieved by defining a new column in the dataframe and populating it with the output of grepl(). A MWE:
a <- "blahblahblahblahnoise"
aa <- "blahblahblahblahnoiseS"
b <- "noiseS"
type <- vector()
type[1] <- grepl(b, a)
type[2] <- grepl(b, aa)
type <- as.factor(type)
This two-level factor would let you build a simple model of the means for noise (type[i]==FALSE) and noiseS (type[i]==TRUE), but would not let you evaluate the CORRELATION between the types for a given UniqueID and time.
One way to do this would be to create separate columns for data with type==FALSE and type==TRUE, where rows correspond to a specific UniqueID+time combination. In this case, you would need to think carefully about what you want to learn and when you assume data to be independent. For example, if you want to learn whether noise and noiseS are correlated across time for a given uniqueID, then you would need to make a separate factor for uniqueID and include it in your model as an effect (possibly a random effect, depending on your purposes and your data).
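A rough sketch of that reshaping step, assuming the data frame is called dataNoises as in the question (the helper columns id and type are names I made up here):
# split "<model>/<uniqueID>/<type>" into its parts
parts <- do.call(rbind, strsplit(as.character(dataNoises$Sensor), "/", fixed = TRUE))
dataNoises$id   <- parts[, 2]
dataNoises$type <- parts[, 3]
# one row per uniqueID + time, with noise and noiseS as separate columns
wide <- reshape(dataNoises[, c("id", "NewDate", "type", "NewValue")],
                idvar = c("id", "NewDate"), timevar = "type",
                direction = "wide")
# Pearson correlation, skipping combinations where one of the two values is missing
cor(wide$NewValue.noise, wide$NewValue.noiseS, use = "complete.obs")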
I have a data frame where each column is of type factor and has over 3000 levels.
Is there a way to replace each level with a numeric value?
Consider the inbuilt data frame InsectSprays
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
The replacement should be as follows:
A=1,B=2,C=3,D=4,E=5,F=6.
If there are 3000 levels:
"USA"=1, "UK"=2, ..., "France"=3000.
The solution should automatically detect the number of levels (e.g. 3000), then replace each level with a number from 1 to 3000.
For the InsectSprays example, you can use:
levels(InsectSprays$spray) <- 1:6
Should generalize to your problem.
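A sketch of that generalization, letting seq_along() count the levels instead of hard-coding 1:6 (the data frame df below is a stand-in for the questioner's unnamed data, where every column is a factor):
# single column
levels(InsectSprays$spray) <- seq_along(levels(InsectSprays$spray))
# every factor column of a data frame df
df[] <- lapply(df, function(f) { levels(f) <- seq_along(levels(f)); f })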
Factor variables already have underlying numeric values corresponding to each factor level. You can see this as follows:
as.numeric(InsectSprays$spray)
or
x = factor(c("A","D","B","G"))
as.numeric(x)  # 1 3 2 4 (levels are sorted alphabetically: A, B, D, G)
If you want to add specific numeric values corresponding to each level, you can, for example, merge in those values from a lookup table:
# Create a lookup table with the numeric values you want to correspond to each level of spray
lookup = data.frame(spray=levels(InsectSprays$spray), sprayNumeric=c(5,4,1,2,3,6))
# Merge lookup values into your data frame
InsectSprays = merge(InsectSprays, lookup, by="spray")
Based on this tutorial (https://statisticsglobe.com/how-to-convert-a-factor-to-numeric-in-r/), I have used the following code to convert factor levels into specific numbers:
levels(InsectSprays$spray) # to check the order levels are stored
levels(InsectSprays$spray) <- c(0, 1, 2, 3, 4, 5) # assign the number I want to each level
InsectSprays$spray <- as.numeric(as.character(InsectSprays$spray)) # to change from factor to numeric
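A quick sanity check of the result (the printed values follow from the relabelling above, where level "A" becomes 0):
head(InsectSprays$spray)  # 0 0 0 0 0 0
str(InsectSprays$spray)   # num [1:72] 0 0 0 0 0 0 ...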
I am trying to use predict to apply my model to data from one time period to see what might be the values for another time period. I did this
successfully for one dataset, and then tried on another with identical code
and got the following error:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
The only difference between the two datasets was that my predictor model for the first dataset had two predictor variables and my model for the second dataset had only one. Why would this make a difference?
My dougfir.csv contains just two columns with thirty numbers in each,
labeled height and dryshoot.
my linear model is:
fitdougfir <- lm(dryshoot~height,data=dougfir)
It gets a little complicated (and messy, sorry! I am new to R) because I
then made a second .csv - the one I used to make my model contained values
from just June. My new .csv (called alldatadougfir.csv) includes values
from October as well, and also contains a date column that labels the
values either "june" or "october".
I did the following to separate the height data by date:
alldatadougfir[alldatadougfir$date=="june",c("height")]->junedatadougfir
alldatadougfir[alldatadougfir$date=="october",c("height")]->octoberdatadougfir
I then want to use my June model to predict my October dryshoots using
height as my variable and I did the following:
predict(fitdougfir, newdata=junedatadougfir)
predict(fitdougfir, newdata=octoberdatadougfir)
Again, I did this with an identical dataset successfully - the only
difference was that my model in the successful dataset had two predictor
variables instead of the one variable (height) I have in this dataset.
This is essentially a variation on "R: numeric 'envir' arg not of length one in predict()", but it might not be obvious why. What's happening is that by selecting a single column of your data frame, you are triggering R's (often annoying/unwanted) default behaviour of collapsing the data frame to a numeric vector. This triggers issue #2 from the linked answer:
The predictor variable needs to be passed in as a named column in a data frame, so that predict() knows what the numbers [it's] been handed represent ... [emphasis added]
Watch this:
dd <- data.frame(x=1:20,y=1:20)
str(dd[dd$x<10,"y"]) ## select some rows and a single column
## int [1:9] 1 2 3 4 5 6 7 8 9
You could specify drop=FALSE, which gives you a data frame with a single column rather than just the column itself:
str(dd[dd$x<10,"y",drop=FALSE])
## 'data.frame': 9 obs. of 1 variable:
## $ y: int 1 2 3 4 5 6 7 8 9
Alternately, you don't have to leave out the predictor variable when selecting new data -- R will just ignore it.
str(dd[dd$x<10,])
## 'data.frame': 9 obs. of 2 variables:
## $ x: int 1 2 3 4 5 6 7 8 9
## $ y: int 1 2 3 4 5 6 7 8 9
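Applied to the question's own subsetting, a sketch of the drop=FALSE fix (keeping height as a named column so predict() knows what it is):
junedatadougfir    <- alldatadougfir[alldatadougfir$date == "june",    "height", drop = FALSE]
octoberdatadougfir <- alldatadougfir[alldatadougfir$date == "october", "height", drop = FALSE]
predict(fitdougfir, newdata = junedatadougfir)
predict(fitdougfir, newdata = octoberdatadougfir)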
I have a data file that represents a contingency table that I need to work with. The problem is I can't figure out how to load it properly.
Data structure:
Rows: individual churches
1st Column: Name of the church
2nd-12th columns: Mean age of followers
Every cell: Number of people who follow the corresponding church and are of the corresponding age.
(In the original data set only the age range was available, e.g. 60-69, so to enable computation with it I decided to replace each range with its mean age, e.g. 64.5 instead of 60-69.)
Data sample:
name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000
...
I tried to simply load the data and make it a 'table' so I could expand it, but it didn't work (it only produced something really weird).
dataset <- read.table("C:/.../dataset.csv", sep=";", quote="\"")
dataset_table <- as.table(as.matrix(dataset))
When I tried to use the data as they were to produce a simple graph, it didn't work either.
barplot(dataset[2,2:4])
Error in barplot.default(dataset[2,2:4]) : 'height' must be a vector or a matrix
Checking the class of dataset[2,2:4] showed me that it is a 'list', which I don't understand (I guess it is because dataset is a data.frame and not a table).
If someone could point me into the right direction how to properly load the data as a table and then work with them, I'd be forever grateful :).
If your file is already a contingency table, don't use as.table().
df <- read.table(header=T,sep=";",text="name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000")
colnames(df)[-1] <- substring(colnames(df)[-1],2)
barplot(as.matrix(df[2,2:4]), col="lightblue")
The transformation of colnames(...) is because R doesn't like column names that start with a number, so it prepends X. This code just gets rid of that.
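As an aside (my addition, not part of the original answer), read.table() also accepts check.names=FALSE, which keeps the numeric column names as they are and makes the renaming step unnecessary:
df <- read.table(header=TRUE, sep=";", check.names=FALSE, text="name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000")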
EDIT (Response to OP's comment)
If you want to convert the df defined above to a table suitable for use with expand.table(...) you have to set dimnames(...) and names(dimnames(...)) as described in the documentation for expand.table(...).
tab <- as.matrix(df[-1])                          # drop the name column, keep the counts
dimnames(tab) <- list(df$name, colnames(df)[-1])  # row and column labels
names(dimnames(tab)) <- c("name","age")           # dimension names expand.table() needs
library(epitools)
x.tab <- expand.table(tab)
str(x.tab)
# 'data.frame': 80000 obs. of 2 variables:
# $ name: Factor w/ 2 levels "catholic","hinduism": 1 1 1 1 1 1 1 1 1 1 ...
# $ age : Factor w/ 3 levels "7","15","25": 1 1 1 1 1 1 1 1 1 1 ...
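Since the question mentions wanting to compute with the mean ages, one extra step (my sketch, not from the original answer) converts the age factor back to numbers:
x.tab$age <- as.numeric(as.character(x.tab$age))
mean(x.tab$age[x.tab$name == "catholic"])  # weighted mean age of catholic followers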