ID pounds Drug
1 1 46.4 B
2 2 40.4 A
3 3 27.6 B
4 4 93.2 B
5 5 28.8 A
6 6 36.0 A
7 7 81.2 B
8 8 14.4 B
9 9 64.0 A
10 10 29.6 A
My code is
test <-permtest(data1$pounds[Drug=='A'],data1$pounds[Drug=='B'])
But I get an error saying object 'Drug' not found.
Help!
The column has to be extracted from the data frame with $ or [[. In your call, R searches for an object named 'Drug' in the global environment, but no such object exists there; 'Drug' exists only as a column of 'data1'. So either use $ (or [[):
permtest(data1$pounds[data1$Drug=='A'],data1$pounds[data1$Drug=='B'])
Or use with
with(data1, permtest(pounds[Drug == 'A'], pounds[Drug == 'B']))
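As a quick check of the two forms above, here is a minimal sketch that rebuilds data1 from the table in the question and compares the subsets. permtest itself is not called, because the question does not say which package it comes from; only the subsetting is shown.

```r
# Data frame rebuilt from the table in the question
data1 <- data.frame(
  ID     = 1:10,
  pounds = c(46.4, 40.4, 27.6, 93.2, 28.8, 36.0, 81.2, 14.4, 64.0, 29.6),
  Drug   = c("B", "A", "B", "B", "A", "A", "B", "B", "A", "A")
)

# Explicit extraction: both the values and the condition refer to data1
a_dollar <- data1$pounds[data1$Drug == "A"]

# Same subset, with the column names evaluated inside data1
a_with <- with(data1, pounds[Drug == "A"])

identical(a_dollar, a_with)
```

Either vector can then be passed as an argument to the test function.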
I wrote this code to run an ANOVA on a simple data frame, and I want to draw a boxplot from it:
DF <- read.table('chromium.txt',header=TRUE)
Chromium.aov <- aov(Concentration ~ Lab,data=DF)
print(summary(Chromium.aov))
with(DF,boxplot(Concentration,Lab))
here is the text file
Lab Concentration
1 26.1
1 21.5
1 22.0
1 22.6
1 24.9
1 22.6
1 23.8
1 23.2
2 18.3
2 19.7
2 18.0
2 17.4
2 22.6
2 11.6
2 11.0
2 15.7
3 19.1
3 13.9
3 15.7
3 18.6
3 19.1
3 16.8
3 25.5
3 19.7
4 30.7
However, R only shows 2 box plots, for labs 1 and 2, not 3 and 4. How can I fix this?
boxplot(DF$Concentration ~ DF$Lab)
The syntax you used makes one box from all the values of 'Concentration' and another from the values of 'Lab'.
When you do with(DF,boxplot(Concentration,Lab)), you are providing two sets of values to be plotted: Concentration and Lab. You want to split Concentration by the unique values of Lab and then create the boxplot.
boxplot(split(DF$Concentration, DF$Lab))
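The difference between the two calls can be seen without drawing anything. The sketch below uses only a few rows copied from the posted data (not the full file) and relies on plot = FALSE, which makes boxplot() return the box statistics instead of plotting.

```r
# A few rows taken from the posted chromium data (not the full file)
DF <- data.frame(
  Lab           = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4),
  Concentration = c(26.1, 21.5, 22.0, 18.3, 19.7, 18.0, 19.1, 13.9, 15.7, 30.7)
)

# plot = FALSE returns the box statistics without drawing anything
two_vectors <- boxplot(DF$Concentration, DF$Lab, plot = FALSE)
by_lab      <- boxplot(Concentration ~ Lab, data = DF, plot = FALSE)

length(two_vectors$n)  # one box per vector passed in
length(by_lab$n)       # one box per level of Lab
by_lab$names
```

The first call produces a box for each vector it was given, which is why only two boxes appeared; the formula version produces one box per lab.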
I have a large data set consisting of factor variables, numeric variables, and a target column. I'm trying to feed it into xgboost properly, with the objective of making an xgb.DMatrix and training a model.
I'm confused about the proper processing to get my dataframe into an xgb.DMatrix object. Specifically, I have NAs in both factor and numeric variables and I want to make a sparse.model.matrix from my dataframe before creating the xgb.DMatrix. The proper handling of the NAs is really messing me up.
I have the following sample dataframe df consisting of one binary categorical variable, two continuous variables, and a target. The categorical variable and one continuous variable have NAs.
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 2 levels "0","1": 1 2 2 1 NA 2 1 1 NA 2
$ v2 : num 3.2 5.4 8.3 NA 7.1 8.2 9.4 NA 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 NA 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 NA 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
sparse.model.matrix from the Matrix package won't accept NAs. It eliminates the rows (which I don't want). So I'll need to change the NAs to a numeric replacement like -999.
If I use the simple command:
df[is.na(df)] = -999
it only replaces the NAs in the numeric columns:
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 -999.0 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 -999.0 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
So I first (think I) need to change the factor variables to numeric and then do
the substitution. Doing that I get:
v1 v2 v3 target
1 1 3.2 22.1 0
2 2 5.4 44.1 0
3 2 8.3 57.0 1
4 1 -999.0 64.2 1
5 -999 7.1 33.1 0
6 2 8.2 56.9 0
7 1 9.4 71.2 0
8 1 -999.0 33.9 1
9 -999 9.9 89.3 0
10 2 4.2 97.2 0
but converting the factor variable back to a factor (I think this is necessary
so xgboost will later know it's a factor) I get three levels:
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 3 levels "-999","1","2": 2 3 3 2 1 3 2 2 1 3
$ v2 : num 3.2 5.4 8.3 -999 7.1 8.2 9.4 -999 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
I'm no longer sure that making the sparse.model.matrix and ultimately
the xgb.DMatrix object will be meaningful, because v1 appears messed up.
To make matters more confusing, xgb.DMatrix() has an argument, missing,
that I can use to identify numeric values (-999) that represent NAs. But this
can only be used for a dense matrix. If I submitted the dense matrix I'd
just have the NAs and wouldn't need that. However, in the sparse matrix,
where I have -999s, I can't use it.
I hope I'm not overlooking something easy. Been through xgboost.pdf extensively and looked on Google.
Please help. Thanks in advance.
options(na.action='na.pass'), as mentioned by @mtoto, is the best way to deal with this problem. It makes sure that you don't lose any data while building the model matrix.
Specifically, the XGBoost implementation handles NAs by checking which direction gives the higher gain when making splits while growing a tree. For example, suppose the split chosen without considering NAs is at the value 0.5 of a variable var1 (range [0,1]). XGBoost then calculates the gain with var1's NAs sent to the < 0.5 side and with them sent to the > 0.5 side, and assigns the NAs to whichever direction gives more gain. So the NAs now effectively have a range, [0,0.5] or [0.5,1], but no actual value is assigned to them (i.e. they are not imputed). See the original author tqchen's comment of Aug 12, 2014.
If you impute -99xxx there, you limit the algorithm's ability to learn the proper range for the NAs (conditional on the labels).
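To illustrate what na.pass changes, here is a minimal sketch on the sample data frame from the question. It deliberately swaps in base R's model.frame() and model.matrix() instead of sparse.model.matrix(), so it only shows the row-dropping behaviour of the two NA actions and is not a full xgboost pipeline.

```r
# Sample data frame from the question
df <- data.frame(
  v1     = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57.0, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

# na.omit (the usual default) drops every row containing an NA
mf_omit <- model.frame(~ v1 + v2 + v3, data = df, na.action = na.omit)
mm_omit <- model.matrix(~ v1 + v2 + v3, data = mf_omit)

# na.pass keeps all rows; the NAs are carried into the design matrix
mf_pass <- model.frame(~ v1 + v2 + v3, data = df, na.action = na.pass)
mm_pass <- model.matrix(~ v1 + v2 + v3, data = mf_pass)

nrow(mm_omit)
nrow(mm_pass)
```

With na.pass the factor v1 keeps its two original levels, so no artificial "-999" level is created.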
I have a dataset with columns ID, Score and Average age. Using ddply(), I get the following table:
ddply(data, .(id, score), summarize, group_mean = round(mean(avg_age), 1))
id score group_mean
1 101 0 61.8
2 101 5 70.3
3 101 10 62.2
4 2 0 41.0
5 2 5 40.4
6 2 10 44.5
7 23 0 52.0
8 23 5 52.6
9 25 0 74.5
10 25 5 55.2
11 25 10 48.0
12 28 0 53.4
13 28 5 49.5
14 3 0 41.3
15 3 5 47.8
16 3 10 46.6
17 4 0 53.3
18 4 5 54.2
19 4 10 55.3
20 X 0 72.0
21 X 5 57.1
22 X 10 53.4
What should I do if I want the table to look like a pivot table, with id as the rows and score as the columns? Namely,
0 5 10
101 61.8 70.3 62.2
2 41.0 40.4 44.5
...
We can use spread from tidyr, where out is the result of the ddply/summarize call above:
library(tidyr)
spread(out, score, group_mean)
Or acast from reshape2
library(reshape2)
acast(out, id~score, value.var='group_mean')
Or using base R
xtabs(group_mean~id+score, out)
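One detail of the base R option is worth checking on a small case. The sketch below uses only the first rows of the table from the question and shows that xtabs() fills a missing id/score combination with 0, whereas tapply() (added here as an alternative, not part of the original answer) leaves it as NA.

```r
# First rows of the summarised table from the question
out <- data.frame(
  id         = c("101", "101", "101", "2", "2", "2", "23", "23"),
  score      = c(0, 5, 10, 0, 5, 10, 0, 5),
  group_mean = c(61.8, 70.3, 62.2, 41.0, 40.4, 44.5, 52.0, 52.6)
)

# xtabs sums group_mean within each id/score cell
wide_x <- xtabs(group_mean ~ id + score, data = out)
wide_x["23", "10"]   # missing combination is filled with 0

# tapply applies the function per cell and leaves empty cells as NA
wide_t <- tapply(out$group_mean, list(out$id, out$score), sum)
wide_t["23", "10"]   # missing combination stays NA
```

If a 0 could be mistaken for a real mean, the NA version is probably the safer one to report.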
I am looking to perform a row subtraction, where I have a group of individuals and I want to subtract the more recent row from the row above it (like a rolling row subtraction). Does anyone know a simple way to do this?
The data would look something like this:
Name Day variable.1
1 Bob 1 43.4
2 Bob 2 32.0
3 Bob 3 18.1
4 Bob 4 41.2
5 Bob 5 85.2
6 Jeff 1 17.4
7 Jeff 2 55.6
8 Jeff 3 58.7
9 Jeff 4 40.6
10 Jeff 5 77.3
11 Carl 1 52.9
12 Carl 2 71.7
13 Carl 3 84.3
14 Carl 4 54.8
15 Carl 5 69.7
For example, for Bob, I would like it to come out as:
Name Day variable.1
1 Bob 1 NA
2 Bob 2 -11.4
3 Bob 3 -13.9
4 Bob 4 23.1
5 Bob 5 44
And then it would go to the next name and perform the same task.
You could try
library(data.table)#v1.9.5+
setDT(df1)[, variable.1 := c(NA, diff(variable.1)), by = Name]
Or using shift from the devel version of data.table (as suggested by @Jan Gorecki):
setDT(df1)[, variable.1 := variable.1- shift(variable.1), Name]
You can use the base ave() function. For example, if your data is in a data.frame named dd,
dd$newcol <-ave(dd$variable.1, dd$Name, FUN=function(x) c(NA,diff(x)))
You can also try:
library(dplyr)
df %>% group_by(Name) %>% mutate(diff = variable.1-lag(variable.1))
Source: local data frame [15 x 4]
Groups: Name
Name Day variable.1 diff
1 Bob 1 43.4 NA
2 Bob 2 32.0 -11.4
3 Bob 3 18.1 -13.9
4 Bob 4 41.2 23.1
5 Bob 5 85.2 44.0
6 Jeff 1 17.4 NA
7 Jeff 2 55.6 38.2
8 Jeff 3 58.7 3.1
9 Jeff 4 40.6 -18.1
10 Jeff 5 77.3 36.7
11 Carl 1 52.9 NA
12 Carl 2 71.7 18.8
13 Carl 3 84.3 12.6
14 Carl 4 54.8 -29.5
15 Carl 5 69.7 14.9
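The grouped difference can be checked with base R alone. This is a small sketch using the ave() approach on the Bob and Jeff rows from the question; the column name diff is chosen here only for illustration.

```r
# Bob and Jeff rows from the question
dd <- data.frame(
  Name       = rep(c("Bob", "Jeff"), each = 5),
  Day        = rep(1:5, times = 2),
  variable.1 = c(43.4, 32.0, 18.1, 41.2, 85.2, 17.4, 55.6, 58.7, 40.6, 77.3)
)

# Within each Name, subtract the previous row; the first row has no predecessor
dd$diff <- ave(dd$variable.1, dd$Name, FUN = function(x) c(NA, diff(x)))
dd
```

The first row of each group is NA, so differences never leak across names.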
I'm using a data frame similar to this one:
df<-data.frame(student=c(rep(1,5),rep(2,5)), month=c(1:5,1:5),
quiz1p1=seq(20,20.9,0.1),quiz1p2=seq(30,30.9,0.1),
quiz2p1=seq(80,80.9,0.1),quiz2p2=seq(90,90.9,0.1))
print(df)
student month quiz1p1 quiz1p2 quiz2p1 quiz2p2
1 1 1 20.0 30.0 80.0 90.0
2 1 2 20.1 30.1 80.1 90.1
3 1 3 20.2 30.2 80.2 90.2
4 1 4 20.3 30.3 80.3 90.3
5 1 5 20.4 30.4 80.4 90.4
6 2 1 20.5 30.5 80.5 90.5
7 2 2 20.6 30.6 80.6 90.6
8 2 3 20.7 30.7 80.7 90.7
9 2 4 20.8 30.8 80.8 90.8
10 2 5 20.9 30.9 80.9 90.9
It describes grades received by students over five months, in two quizzes divided into two parts each.
I need to get the two quizzes into separate rows, so that each student in each month has two rows, one for each quiz, and two columns, one for each part of the quiz.
When I melt the table:
dfL <- melt.data.frame(df, c("student", "month"))
I get the two parts of the quiz in separate lines too.
dcast(dfL,student+month~variable)
of course gets me right back where I started, and I can't find a way to cast the table back into the required form.
Is there a way to make the melt command function something like:
melt.data.frame(df, measure.var1=c("quiz1p1","quiz2p1"),
measure.var2=c("quiz1p2","quiz2p2"))
Here's how you could do this with reshape(), from base R:
df2 <- reshape(df, direction="long",
               idvar = 1:2, varying = list(c(3,5), c(4,6)),
               v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
## Checking the output
rbind(head(df2, 3), tail(df2, 3))
# student month time p1 p2
# 1.1.quiz1 1 1 quiz1 20.0 30.0
# 1.2.quiz1 1 2 quiz1 20.1 30.1
# 1.3.quiz1 1 3 quiz1 20.2 30.2
# 2.3.quiz2 2 3 quiz2 80.7 90.7
# 2.4.quiz2 2 4 quiz2 80.8 90.8
# 2.5.quiz2 2 5 quiz2 80.9 90.9
You can also use column names (instead of column numbers) for idvar and varying. It's more verbose, but seems like better practice to me:
## The same operation as above, using just column *names*
df2 <- reshape(df, direction="long", idvar=c("student", "month"),
               varying = list(c("quiz1p1", "quiz2p1"),
                              c("quiz1p2", "quiz2p2")),
               v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
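For a self-contained run of the reshape() approach, the sketch below rebuilds df from the question and adds two optional touches that are not in the answer: timevar = "quiz" to name the new column, and a sort so each student/month pair shows its two quiz rows together.

```r
df <- data.frame(student = c(rep(1, 5), rep(2, 5)), month = c(1:5, 1:5),
                 quiz1p1 = seq(20, 20.9, 0.1), quiz1p2 = seq(30, 30.9, 0.1),
                 quiz2p1 = seq(80, 80.9, 0.1), quiz2p2 = seq(90, 90.9, 0.1))

df2 <- reshape(df, direction = "long",
               idvar   = c("student", "month"),
               varying = list(c("quiz1p1", "quiz2p1"), c("quiz1p2", "quiz2p2")),
               v.names = c("p1", "p2"),
               times   = c("quiz1", "quiz2"),
               timevar = "quiz")   # assumption: call the new column "quiz"

# Put the two quiz rows of each student/month next to each other
df2 <- df2[order(df2$student, df2$month, df2$quiz), ]
rownames(df2) <- NULL
head(df2, 4)
```

Each element of varying lists the wide columns that collapse into one long column, in the same order as times.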
I think this does what you want:
#Break variable into two columns, one for the quiz and one for the part of the quiz
dfL <- transform(dfL, quiz = substr(variable, 1,5),
part = substr(variable, 6,7))
#Adjust your dcast call:
dcast(dfL, student + month + quiz ~ part)
#-----
student month quiz p1 p2
1 1 1 quiz1 20.0 30.0
2 1 1 quiz2 80.0 90.0
3 1 2 quiz1 20.1 30.1
...
18 2 4 quiz2 80.8 90.8
19 2 5 quiz1 20.9 30.9
20 2 5 quiz2 80.9 90.9
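The substr() step above depends on fixed character positions, which a tiny sketch on the four variable names makes visible. It assumes every name follows the quizNpM pattern with single digits; a name such as quiz10p1 would need a different split.

```r
# The four measure-variable names produced by melting df
variable <- c("quiz1p1", "quiz1p2", "quiz2p1", "quiz2p2")

quiz <- substr(variable, 1, 5)  # characters 1-5: "quiz1" or "quiz2"
part <- substr(variable, 6, 7)  # characters 6-7: "p1" or "p2"

data.frame(variable, quiz, part)
```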
There was a very similar question asked about half a year ago, for which I wrote the following function:
melt.wide = function(data, id.vars, new.names) {
  require(reshape2)
  require(stringr)
  # Melt to long form, keeping id.vars as identifier columns
  data.melt = melt(data, id.vars=id.vars)
  # Pull every run of digits out of each variable name, one row per name
  new.vars = data.frame(do.call(
    rbind, str_extract_all(data.melt$variable, "[0-9]+")))
  names(new.vars) = new.names
  cbind(data.melt, new.vars)
}
You can use the function to "melt" your data as follows:
dfL <-melt.wide(df, id.vars=1:2, new.names=c("Quiz", "Part"))
head(dfL)
# student month variable value Quiz Part
# 1 1 1 quiz1p1 20.0 1 1
# 2 1 2 quiz1p1 20.1 1 1
# 3 1 3 quiz1p1 20.2 1 1
# 4 1 4 quiz1p1 20.3 1 1
# 5 1 5 quiz1p1 20.4 1 1
# 6 2 1 quiz1p1 20.5 1 1
tail(dfL)
# student month variable value Quiz Part
# 35 1 5 quiz2p2 90.4 2 2
# 36 2 1 quiz2p2 90.5 2 2
# 37 2 2 quiz2p2 90.6 2 2
# 38 2 3 quiz2p2 90.7 2 2
# 39 2 4 quiz2p2 90.8 2 2
# 40 2 5 quiz2p2 90.9 2 2
Once the data are in this form, you can much more easily use dcast() to get whatever form you desire. For example
head(dcast(dfL, student + month + Quiz ~ Part))
# student month Quiz 1 2
# 1 1 1 1 20.0 30.0
# 2 1 1 2 80.0 90.0
# 3 1 2 1 20.1 30.1
# 4 1 2 2 80.1 90.1
# 5 1 3 1 20.2 30.2
# 6 1 3 2 80.2 90.2
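The digit-extraction step inside melt.wide can also be written without stringr. The following is a base R sketch of that one step only, using regmatches() and gregexpr() in place of str_extract_all(); it is an alternative for illustration, not the function from the answer.

```r
# The four measure-variable names produced by melting df
variable <- c("quiz1p1", "quiz1p2", "quiz2p1", "quiz2p2")

# One character vector of digit runs per name, e.g. c("2", "1") for "quiz2p1"
digits <- regmatches(variable, gregexpr("[0-9]+", variable))

# Stack them into one row per name and label the columns
new.vars <- data.frame(do.call(rbind, digits))
names(new.vars) <- c("Quiz", "Part")
new.vars
```

Like the original, this assumes every name contains the same number of digit runs, otherwise rbind would recycle values.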