How to read factors' levels right in R?

I have a big csv file that has 51993 rows and 18 columns. Here is part of the table:
head(ddd)
country.of.birth age sex X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
Afghanistan 0 men 0 0 1 2 2 0 1 1
Afghanistan 0 women 1 1 0 0 1 0 0 0
Afghanistan 1 men 0 2 5 2 3 4 1 1
Afghanistan 1 women 4 1 4 2 3 2 3 2
Afghanistan 2 men 5 0 8 7 7 3 5 3
Afghanistan 2 women 4 8 3 9 4 4 4 3
In the main csv file, the columns are: Country of Birth, age, sex, and then the years from 2000 to 2014. My first question is: why does R put an X before each year number?
When I used the str() function, I got:
> str(ddd)
'data.frame': 15264 obs. of 18 variables:
$ country.of.birth: Factor w/ 261 levels "0","1","10","103",..: 51 51 51 51 51 51 51 51 51 51 ...
$ age : Factor w/ 38 levels "","0 ","1 ","10 ",..: 2 2 3 3 14 14 17 17 20 20 ...
$ sex : Factor w/ 39 levels "","0 ","1 ","10 ",..: 38 39 38 39 38 39 38 39 38 39 ...
$ X2000 : Factor w/ 786 levels "","0","1","10",..: 2 3 2 478 555 478 92 4 205 716 ...
$ X2001 : int 0 1 2 1 0 8 11 8 26 19 ...
$ X2002 : int 1 0 5 4 8 3 13 18 22 15 ...
$ X2003 : int 2 0 2 2 7 9 15 13 23 33 ...
$ X2004 : int 2 1 3 3 7 4 11 15 21 22 ...
$ X2005 : int 0 0 4 2 3 4 10 6 13 16 ...
$ X2006 : int 1 0 1 3 5 4 8 13 20 10 ...
$ X2007 : int 1 0 1 2 3 3 6 7 9 17 ...
$ X2008 : int 0 0 2 0 4 5 4 6 8 9 ...
$ X2009 : int 0 1 1 4 7 3 9 10 11 12 ...
$ X2010 : int 1 1 6 4 8 10 17 10 21 16 ...
$ X2011 : int 0 5 9 6 21 18 16 27 34 24 ...
$ X2012 : int 3 5 5 16 30 22 44 48 46 49 ...
$ X2013 : int 3 0 12 19 24 34 54 46 76 71 ...
$ X2014 : int 2 3 15 3 21 29 37 48 64 62 ...
As you can see, sex is a factor with 39 levels even though it has only two values (men and women). Also, year 2000 (X2000 in the table) is a factor with 786 levels when it should have been read as an "int". Why did R read the variable "sex" with this many levels, and why did it read year 2000 as a factor while it read the other years as int (as expected)?
Edit:
The age column has values of the form 20-24, 25-30, ... up to 85-90, plus one more category, 90+.

An X is put in front of the column names because R doesn't allow the first character of a column name to be a number (try data.frame(a = 1:10, "3" = runif(10))).
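The renaming can be reproduced with make.names(), the base R function used to adjust names when check.names is TRUE. This is only a small sketch; the column names below are made up for illustration.

```r
# make.names() turns arbitrary strings into syntactically valid R names:
# a leading digit gets an "X" prefix, and spaces become dots.
fixed <- make.names(c("2000", "country of birth", "age"))
print(fixed)

# data.frame() applies the same adjustment by default (check.names = TRUE).
df <- data.frame(a = 1:3, "2000" = 4:6)
print(names(df))
```

This would also explain why "Country of Birth" in the csv header shows up with dots in place of the spaces.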
Age is a factor because you have bins, so what you observe is expected behavior. R doesn't treat intervals as numeric, but as a factor.
The sex variable is weird and, given the data currently available, I would say the variable represents something other than sex in at least part of the dataset. Has the dataset been stitched together? Perhaps there was a copy/paste mistake. See levels(ddd$sex) to list all the levels that were read.

The default behaviour of read.table and its related functions is to make all column names syntactically valid. This means that they can be used without quoting after the $ operator. However, this behaviour can be changed using the check.names = FALSE parameter. This will mean you end up with columns called 2000 etc. To then use those columns with $ they will need to be backquoted, e.g.
ddd$`2000`
The same will be true if you want to use these columns with non-standard evaluation, e.g.
ggplot(ddd, aes(x = sex, y = `2000`)) + geom_boxplot()
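As a quick illustration of the check.names = FALSE behaviour, here is a sketch that reads a tiny inline csv (the rows are invented, not taken from the real file) and accesses the year columns by their unmodified names.

```r
# A two-row stand-in for the real file, with year columns named by number.
csv_text <- "country,sex,2000,2001\nAfghanistan,men,0,0\nAfghanistan,women,1,1"

# check.names = FALSE keeps the header exactly as written in the file.
ddd_raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(ddd_raw))

# Non-syntactic names need backquotes after $, or can be used with [[ ]].
print(ddd_raw$`2000`)
print(ddd_raw[["2001"]])
```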
For the sex column, there must be entries further down the column that contain numbers. Check your original data.
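One way to track such entries down is to list the levels and then pull out the rows whose value is not one of the expected ones. The sketch below uses an invented three-row csv with one deliberately bad row; the expected values "men" and "women" come from the question.

```r
# Toy data: the third row has a number where sex should be (hypothetical).
csv_text <- paste(
  "country,age,sex,count",
  "Afghanistan,0,men,1",
  "Afghanistan,0,women,2",
  "Albania,5,17,3",
  sep = "\n"
)
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Every distinct value becomes a level, so stray values inflate the count.
print(levels(toy$sex))

# Rows whose sex is neither of the two expected values.
bad_rows <- which(!toy$sex %in% c("men", "women"))
print(toy[bad_rows, ])
```

Looking at the offending rows in the original file usually shows whether columns were shifted or a second table was appended.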
For age, you have trailing spaces in your age column. Either remove these outside R or, if the values are plain numbers (binned values such as 20-24 would become NA), you could do something like this:
ddd$age <- as.numeric(sub(" +$", "", as.character(ddd$age)))
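The same clean-up can be sketched with trimws() from base R. The vectors below are invented; the second part assumes the column holds bins like those mentioned in the question's edit, in which case it makes more sense to keep a factor with clean labels than to convert to numbers.

```r
# Plain numbers stored as a factor with trailing spaces.
age_raw <- factor(c("0 ", "1 ", "10 ", "1 "))

# Go through character first; as.numeric() on a factor returns level codes.
age_num <- as.numeric(trimws(as.character(age_raw)))
print(age_num)

# Binned ages: strip the spaces but keep them as a factor.
age_bins <- factor(trimws(c("20-24 ", "25-30 ", "90+ ")))
print(levels(age_bins))
```

When reading with an explicit sep, read.table() and read.csv() also accept strip.white = TRUE, which trims unquoted character fields at import time.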
For the 2000 column, it's not clear from your str output why it's been read as a factor. By default, empty strings should be regarded as NA and so shouldn't affect the class. You could try (assuming you're now using check.names = FALSE):
as.character(ddd$`2000`)[is.na(as.numeric(as.character(ddd$`2000`))) & ddd$`2000` != ""]
This should print out any elements of the column which are non-blank and non-numeric. It may again be a trailing space issue.
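To see what that check does, here is the same idea on a small made-up character vector. suppressWarnings() is added only to silence the expected coercion warning.

```r
# Hypothetical column content: mostly numbers, two bad entries, one blank.
x <- c("0", "12", "4a", "", "n/a", "7")

# Entries that cannot be parsed as numbers become NA.
as_num <- suppressWarnings(as.numeric(x))

# Keep the entries that failed to parse and are not simply empty.
offenders <- x[is.na(as_num) & x != ""]
print(offenders)
```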

Related

How to combine training and testing dataset in same format

I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income
I loaded training & testing data.
# Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
# Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
I need to combine the two into one, but there is a problem: the structure of the two data frames is not the same.
Display structure of the training data
> str (training)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Display structure of the testing data
> str (testing)
'data.frame': 16282 obs. of 15 variables:
$ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
$ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
$ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
$ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
$ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
$ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
$ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
$ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
$ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
$ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
$ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
$ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
$ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
$ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
Problem 1:
age has become a factor in testing, and every other factor in testing has one more level than the same factor in training. This is because the first row of the testing file is an unnecessary row:
|1x3 Cross validator
I tried to get rid of this by re-assigning testing:
testing = testing[-1,]
but, after running str() command again, I don't see any change.
Problem 2:
As I said earlier, I need to combine the two data frames into one, so I ran this:
combined <- rbind(training , testing)
Besides problem 1, I can see a new problem after running str():
> str(combined)
'data.frame': 48842 obs. of 15 variables:
$ age : chr "39" "50" "38" "53" ...
$ workclass : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
The target variable (incomelevel) has 5 factor levels in the combined data frame, whereas it has 2 (which is correct) in the training data frame and 3 (one extra because of problem 1) in the testing data frame. This is because there is a . (dot) after each value of incomelevel in the testing data frame (<=50K., <=50K., >50K., ...). So I need to remove that dot, but I have no idea how. Is there a function for this?
I am very new to data analysis and R, which is why I am running into this kind of basic issue. Can you please help me solve it?
I think you can ignore the first line of the test file; this will solve the issue of age being a factor, because that line looks like a header:
head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
Reusing your code, we can use read.csv with skip = 1 for the test file:
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.csv (trainFile, header = FALSE, col.names = colNames,stringsAsFactors = TRUE,na.strings = "?",strip.white = TRUE)
testing = read.csv (testFile, header = FALSE, col.names = colNames,na.strings = "?",stringsAsFactors = TRUE,skip=1,strip.white = TRUE)
Now for the income level: unfortunately we have to correct it manually. It's a good thing you checked:
testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
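A slightly more conservative variant, shown here on invented values, removes only a dot at the very end of each string by anchoring the pattern with $ and using sub() instead of gsub(). For the income labels in this dataset both approaches should give the same result.

```r
# Toy version of the test-set labels, each with a trailing dot.
income_test <- factor(c("<=50K.", ">50K.", "<=50K."))

# "\\.$" matches a literal dot only when it is the last character.
income_clean <- factor(sub("\\.$", "", as.character(income_test)))
print(levels(income_clean))
```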
We check the levels; the only difference is native country:
all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"
I don't think there's much you can do about it; you may have to remove that level before or after joining:
setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"
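Two related points can be sketched on toy data (all values below are invented). First, dropping a row from a data frame keeps the now-unused factor level until droplevels() is called, which is likely why str() looked unchanged after testing[-1,]. Second, rbind() on data frames merges the factor levels of both inputs, so a level present in only one of them is not a problem for combining.

```r
train_toy <- data.frame(country = factor(c("Canada", "Holand-Netherlands")),
                        y = 1:2)
test_toy  <- data.frame(country = factor(c("junk", "Canada", "Cuba")),
                        y = 3:5)

# Removing the junk row does not remove its level...
test_toy <- test_toy[-1, ]
n_before <- nlevels(test_toy$country)
# ...until unused levels are dropped explicitly.
test_toy <- droplevels(test_toy)
n_after <- nlevels(test_toy$country)
print(c(n_before, n_after))

# rbind() keeps all rows and takes the union of the factor levels.
combined_toy <- rbind(train_toy, test_toy)
print(levels(combined_toy$country))
```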

'Object not found' error even though table() verifies the object is in the data set

I've read through others who have had a similar issue, but my situation doesn't seem to be the same as the fixes that have been proposed for those other issues. I'm trying to recode a variable using a conditional statement. I want to take a character string & turn it into a numeric so I can subset those observations out into a new data frame. Here's what I have, so far:
blad_mor <- read.csv("blad_mor.csv", header = T)
str(blad_mor)
blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
I get this output for the str() command:
> str(blad_mor)
'data.frame': 127073 obs. of 12 variables:
$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
$ sex : Factor w/ 4 levels "1","2","F","M": 1 1 1 2 1 2 2 2 2 2 ...
$ race : Factor w/ 17 levels "America","Asian &",..: 4 4 4 4 4 4 4 4 4 4 ...
$ county : Factor w/ 79 levels "COUNTY1","COUNTY2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ cod : Factor w/ 327 levels "C001","C005",..: 89 108 108 294 63 42 172 74 85 269 ...
$ fips : int 1 1 1 1 1 1 1 1 1 1 ...
$ state : int 5 5 5 5 5 5 5 5 5 5 ...
$ race_code : int 2 2 2 2 2 2 2 2 2 2 ...
$ ethnicity : Factor w/ 4 levels "","Hispanic",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ethnicity_code: int NA NA NA NA NA NA NA NA NA NA ...
But when I try the blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod) code I get this error:
> blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
Error in gsub(C670:C679, 29010, blad_mor$cod) : object 'C670' not found
So, I verify that there actually is that object by table(blad_mor$cod) with this being some of the output:
C578 C579 C58 C60 C601 C609 C61 C629 C631 C639 C64 C65 C66 C670 C672 C674 C675 C676
2 43 4 1 1 53 6162 62 1 14 2911 30 47 1 4 1 1 2
C677 C678 C679 C680 C689 C690 C692 C693 C694 C695 C696 C699 C700 C701 C709 C71 C710 C711
1 4 2776 35 77 1 4 5 1 1 8 45 7 3 11 1 29 34
The object 'C670' has one instance as per this output, yet R is telling me it is not there & doesn't run the command. What am I missing here? Should I change the class type from factor to something else? I'm quite confused.
Edit: I have tried quotes around the character strings (e.g. blad_mor_recode <- gsub('C670:C679', '29010', blad_mor$cod)) as well as ifelse(). I still get the same error message.
If you want to change all strings from C70 to C79, you have to use a regex. Something like the following would work:
blad_mor_recode <- gsub("C7[0-9]", "29010", blad_mor$cod)
A simple example:
gsub("C7[0-9]","",c("C60","C70","C78"))
[1] "C60" "" ""
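For the codes named in the question (C670 through C679), the pattern needs to include the 6 and is safer when anchored, so that longer codes are not partially rewritten. The sketch below uses a made-up vector of codes. It also hints at the cause of the original error: unquoted, C670:C679 is parsed as the : operator applied to two variables named C670 and C679, which do not exist as objects even though the strings exist as values.

```r
# Invented sample of cause-of-death codes stored as a factor.
cod <- factor(c("C66", "C670", "C679", "C680", "C700"))
cod_chr <- as.character(cod)

# ^ and $ anchor the match to the whole string: C670 ... C679 only.
in_range <- grepl("^C67[0-9]$", cod_chr)

# Replace matching codes and leave everything else untouched.
cod_recode <- ifelse(in_range, "29010", cod_chr)
print(cod_recode)

# The logical vector can also be used directly for subsetting.
print(cod_chr[in_range])
```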

Change type of variables in multiple data frames

I have a list of data frames:
str(df.list)
List of 34
$ :'data.frame': 506 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:506] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALAT": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:506] 23 23 23 24 25 24 20 34 28 17 ...
..$ Index : Factor w/ 502 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ...
$ :'data.frame': 505 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:505] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALB": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:505] 45 46 47 47 49 47 46 46 44 43 ...
..$ Index : Factor w/ 501 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ..
The list contains 34 data frames with equal variable names. The variables Time and ResultAssay are of the wrong type: I would like to have Time as factor and ResultAssay as numerical.
I am trying to write a function to use together with lapply to convert the variable types in this list of 34 data frames in one go, but so far I have been unsuccessful.
I have tried things along the lines of:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
}
lapply(df.list, ChangeType)
What you have tried is nearly correct, but you also need to return the new data.frame and store it back in your existing variable, like so:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
DF # return the modified data.frame
}
# store the returned value to df.list,
# thus updating your existing data.frame
df.list <- lapply(df.list, ChangeType)
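A self-contained version of the same pattern, using a small invented list instead of the real 34 data frames, might look like this. The helper make_df and the function name change_type are hypothetical, and the columns are addressed by name rather than by position.

```r
# Hypothetical helper that builds a tiny stand-in for one assay table.
make_df <- function(assay) {
  data.frame(Time = c(0, 0.5, 24),
             ResultAssay = c(23L, 24L, 20L),
             Assay = assay)
}
df_list_toy <- list(make_df("ALAT"), make_df("ALB"))

change_type <- function(df) {
  df$Time <- as.factor(df$Time)                 # numeric -> factor
  df$ResultAssay <- as.numeric(df$ResultAssay)  # integer -> double
  df                                            # return the modified copy
}

# lapply() returns a new list; assign it back to keep the changes.
df_list_toy <- lapply(df_list_toy, change_type)
print(sapply(df_list_toy, function(df) class(df$Time)))
```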

How can I strip dollar signs ($) from a data frame in R?

I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.
I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).
I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.
My data frame looks like this:
> str(data)
'data.frame': 50 obs. of 17 variables:
$ Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Prog.Cost : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
$ Total.Benefits : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
$ Net.Cash.Flow : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
$ Participant : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
$ Taxpayer : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
$ Others : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
$ Indirect : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
$ Crime : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
$ Child.Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Education : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Health.Care : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
$ Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Earnings : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
$ State.Benefits : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
$ Local.Benefits : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
$ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...
If you only need to remove the $ and do not want to change the class of the columns:
indx <- sapply(data, is.factor)
data[indx] <- lapply(data[indx], function(x)
as.factor(gsub("\\$", "", x)))
If you need numeric columns, you can strip out the , as well (contributed by @David Arenburg) and convert to numeric with as.numeric:
data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))
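The effect of that line on a single column can be checked on a few invented values. The sketch also shows why the cleaned strings, not the factor itself, should be passed to as.numeric(): applied to a factor directly, it returns the internal level codes rather than the amounts.

```r
# Made-up currency strings stored as a factor.
cost <- factor(c("$1,026", "-$9", "$0"))

# Level codes, not dollar amounts.
codes <- as.numeric(cost)
print(codes)

# Strip "$" and "," from the labels, then convert.
values <- as.numeric(gsub("[,$]", "", as.character(cost)))
print(values)
```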
You can wrap this in a function
f1 <- function(dat, pat="[$]", Class="factor"){
indx <- sapply(dat, is.factor)
if(Class=="factor"){
dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
}
else {
dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
}
dat
}
f1(data)
f1(data, pat="[,$]", "numeric")
data
set.seed(24)
data <- data.frame(Year=1:6, Prog.Cost= sample(c("-$3,3333", "$0"),
6, replace=TRUE), Total.Benefits= sample(c("$2,155","$2,418",
"$2,312"), 6, replace=TRUE))
If you have to read a lot of csv files with data like this, perhaps you should consider creating your own as method to use with the colClasses argument, like this:
setClass("dollar")
setAs("character", "dollar",
function(from)
as.numeric(gsub("[,$]", "", from, fixed = FALSE)))
Before demonstrating how to use it, let's write @akrun's sample data to a csv file named "A". This would not be necessary in your actual use case, where you would be reading the file directly.
## write @akrun's sample data to a csv file named "A"
set.seed(24)
data <- data.frame(
Year=1:6,
Prog.Cost= sample(c("-$3,3333", "$0"), 6, replace = TRUE),
Total.Benefits = sample(c("$2,155","$2,418","$2,312"), 6, replace=TRUE))
A <- tempfile()
write.csv(data, A, row.names = FALSE)
Now, you have a new option for colClasses that can be used with read.csv :-)
read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
# Year Prog.Cost Total.Benefits
# 1 1 -33333 2155
# 2 2 -33333 2312
# 3 3 0 2312
# 4 4 0 2155
# 5 5 0 2418
# 6 6 0 2418
It would probably be more beneficial to just read it again, this time with readLines. I wrote akrun's data to the file "data.txt" and fixed the strings before reading the table. Not sure if the comma was a decimal point or an annoying comma, so I chose decimal point.
r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
# Year Prog.Cost Total.Benefits
# 1 1 -3.3333 2.155
# 2 2 -3.3333 2.312
# 3 3 0.0000 2.312
# 4 4 0.0000 2.155
# 5 5 0.0000 2.418
# 6 6 0.0000 2.418

Order data by variable in R, what am I missing?

I would like to order my data in a specific way. I introduced a variable "Index" in data.a and merged it with data.b. Afterwards the merged data is not in the right order, so I would like to order it again by the Index.
My merged Data looks like:
> str(aksamp.mer)
'data.frame': 11355 obs. of 6 variables:
$ V : Factor w/ 78 levels "","V1-18","V1-2",..: 3 23 49 49 17 41 10 10 40 39 ...
$ J : Factor w/ 7 levels "","J1","J2","J3",..: 5 5 5 5 5 5 7 7 6 7 ...
$ D : Factor w/ 28 levels "","D1-1","D1-14",..: 3 23 7 7 22 22 18 18 8 9 ...
$ Class: Factor w/ 1 level "IgG": 1 1 1 1 1 1 1 1 1 1 ...
$ Count: int 63 59 1 58 52 50 49 7 43 41 ...
$ Index: int 1051 10318 3218 3218 9887 9929 7503 7503 2438 3767 ...
I am trying to reorder the data.frame again by the column "Index":
> aksamp.mer2<-aksamp.mer[order(Index),]
which gives me the error: "object 'Index' not found". What am I doing wrong?
It is complaining that there is no Index object in your environment. The right way to access it is to use aksamp.mer$Index. So you need to do:
aksamp.mer2 <-aksamp.mer[order(aksamp.mer$Index), ]
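A toy reproduction of the fix, with invented rows, is below. The second form uses with(), which evaluates order(Index) inside the data frame so the column can be named without the $ prefix.

```r
# Small stand-in for the merged data.
toy <- data.frame(V = c("V1-2", "V3-7", "V1-18"),
                  Index = c(1051L, 10318L, 3218L))

# Refer to the column through the data frame...
sorted <- toy[order(toy$Index), ]

# ...or let with() look the name up inside the data frame.
sorted_with <- toy[with(toy, order(Index)), ]

print(sorted)
```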
