Order data by variable in R, what am I missing? - r

I would like to order my data in a specific way. I introduced a variable "Index" in data.a and merged it with data.b. Afterwards the merged data is not in the right order, so I would like to order it again by the Index.
My merged Data looks like:
> str(aksamp.mer)
'data.frame': 11355 obs. of 6 variables:
$ V : Factor w/ 78 levels "","V1-18","V1-2",..: 3 23 49 49 17 41 10 10 40 39 ...
$ J : Factor w/ 7 levels "","J1","J2","J3",..: 5 5 5 5 5 5 7 7 6 7 ...
$ D : Factor w/ 28 levels "","D1-1","D1-14",..: 3 23 7 7 22 22 18 18 8 9 ...
$ Class: Factor w/ 1 level "IgG": 1 1 1 1 1 1 1 1 1 1 ...
$ Count: int 63 59 1 58 52 50 49 7 43 41 ...
$ Index: int 1051 10318 3218 3218 9887 9929 7503 7503 2438 3767 ...
I am trying to reorder the data.frame again by the column "Index":
> aksamp.mer2<-aksamp.mer[order(Index),]
which gives me the error "object 'Index' not found". What am I doing wrong?

R is complaining that there is no object called Index in your environment: order() does not look inside the data frame for it. Refer to the column explicitly as aksamp.mer$Index:
aksamp.mer2 <- aksamp.mer[order(aksamp.mer$Index), ]
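As a minimal sketch of the same idea, the toy data frame below (made-up values, only two of the original columns) is sorted once with the explicit `$` reference and once with `with()`, which evaluates `Index` inside the data frame. Both forms should give the same result.

```r
# Toy stand-in for the merged data (values are illustrative only)
aksamp.mer <- data.frame(Count = c(63, 59, 1),
                         Index = c(1051, 10318, 3218))

# Explicit column reference
sorted1 <- aksamp.mer[order(aksamp.mer$Index), ]

# with() looks up Index inside the data frame
sorted2 <- aksamp.mer[with(aksamp.mer, order(Index)), ]

print(sorted1)
print(identical(sorted1, sorted2))
```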

Related

How to combine training and testing dataset in same format

I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income
I loaded training & testing data.
# Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
# Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
I need to combine the two into one, but there is a problem: the structure of the two data frames is not the same.
Display structure of the training data
> str (training)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Display structure of the testing data
> str (testing)
'data.frame': 16282 obs. of 15 variables:
$ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
$ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
$ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
$ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
$ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
$ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
$ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
$ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
$ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
$ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
$ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
$ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
$ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
$ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
Problem 1:
age has become a factor in testing, and every other factor in testing has one more level than the same factor in training. This is because the first row of testing is an unnecessary line:
|1x3 Cross validator
I tried to get rid of it by re-assigning testing:
testing = testing[-1,]
but after running str() again, I don't see any change.
Problem 2:
As I said earlier, I need to combine the two data frames into one, so I ran this:
combined <- rbind(training , testing)
Besides problem 1, I can see a new problem after running str():
> str(combined)
'data.frame': 48842 obs. of 15 variables:
$ age : chr "39" "50" "38" "53" ...
$ workclass : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
The target variable (incomelevel) has 5 factor levels in the combined data frame, whereas it has 2 (which is correct) in the training data frame and 3 (one extra because of problem 1) in the testing data frame. This is because there is a . (dot) after each value of incomelevel in the testing data frame (<=50K., <=50K., >50K., ...). So I need to remove that dot, but I don't know how. Is there a function for this?
I am very new to data analysis and R, which is why I am running into this kind of basic issue. Can you please help me solve it?
I think you can skip the first line of the test file, because it looks like a header. This also solves the issue of age being a factor:
head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
Re-running your code, we can use read.csv with skip = 1 for the test file:
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.csv (trainFile, header = FALSE, col.names = colNames,stringsAsFactors = TRUE,na.strings = "?",strip.white = TRUE)
testing = read.csv (testFile, header = FALSE, col.names = colNames,na.strings = "?",stringsAsFactors = TRUE,skip=1,strip.white = TRUE)
The income level, unfortunately, has to be corrected manually. It's a good thing you checked:
testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
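To illustrate the dot removal on its own, here is a small sketch on a made-up factor. It uses `sub()` with the pattern anchored to the end of the string (`\\.$`), a slightly stricter variant of the `gsub()` call above that only strips a trailing dot.

```r
# Made-up values mimicking incomelevel in the test file
incomelevel <- factor(c("<=50K.", ">50K.", "<=50K."))

# Convert to character, strip a trailing dot, rebuild the factor
cleaned <- factor(sub("\\.$", "", as.character(incomelevel)))

print(levels(cleaned))
```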
Checking the levels, the only difference is in nativecountry:
all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"
I don't think there is much you can do about that; you may have to remove that country before or after joining:
setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"
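As a side note, `rbind()` on data frames takes the union of the factor levels, so a level that exists in only one of the two sets does not by itself prevent combining them. The sketch below uses two tiny made-up data frames rather than the real census files.

```r
# Tiny stand-ins: one level appears only in training, one only in testing
training <- data.frame(nativecountry = factor(c("Canada", "Holand-Netherlands")))
testing  <- data.frame(nativecountry = factor(c("Canada", "Cuba")))

combined <- rbind(training, testing)

# The combined factor carries the union of both level sets
print(levels(combined$nativecountry))
print(nrow(combined))
```

Whether the extra level matters afterwards depends on the model being fitted, which is why removing it can still be a reasonable choice.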

Why is a column changing to a data.frame after using match in R?

One column changes to a data.frame after I use match in R.
Can anyone tell me why the column turns into a data.frame after the update?
a<-data.frame(columnB=sample(1:20,20,replace = F),
columnC=sample(4:80,20,replace = F))
d<-data.frame(columnE=letters[1:20],
columnF=sample(1:20,20,replace = F))
a$columnB<-d[match(a$columnB,d$columnF),]
str(a)
Output :
> str(a)
'data.frame': 20 obs. of 2 variables:
$ columnB:'data.frame': 20 obs. of 2 variables:
..$ columnE: Factor w/ 20 levels "a","b","c","d",..: 18 8 1 19 16 3 4 15 17 5 ...
..$ columnF: int 6 20 12 11 13 1 7 19 14 8 ...
$ columnC: int 69 6 37 80 55 49 4 5 44 76 ...
1. Please clarify how to turn the data frame column back into a normal column.
2. Is there an easy way to match and update a column in one table based on table d?
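A possible explanation, offered as a sketch rather than a definitive answer: `d[match(...), ]` selects whole rows of `d`, so the value assigned to `a$columnB` is itself a data frame with both of `d`'s columns. Picking a single column of `d` with `$` before indexing gives an ordinary vector. The new column name `columnE` in `a` is an assumption chosen for illustration.

```r
set.seed(1)  # only to make the sketch reproducible
a <- data.frame(columnB = sample(1:20, 20, replace = FALSE),
                columnC = sample(4:80, 20, replace = FALSE))
d <- data.frame(columnE = letters[1:20],
                columnF = sample(1:20, 20, replace = FALSE))

# Look up a single column of d, not whole rows
a$columnE <- d$columnE[match(a$columnB, d$columnF)]

str(a)
```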

Compare one csv file to multiple csv files and write new csv files R

I am pretty new to loops in R, so I apologise if this question has been asked elsewhere.
Read in all 30 CSV files -> Compare File A species to the other 30 CSV files by species -> Write a new CSV file for each of the 30 files with just the matching species
File A has one column with the names of 190 species ($name). The 30 other CSV files each have a species column ($SBSname) with a differing number of species, ranging from 100 to 500 and including replicates (so a CSV file can be larger than 190 rows). However I don't know how to write the code that ...
This is all I have at the moment ...
I have read in all the CSV files with a loop:
csv.files = list.files(pattern="*.csv")
for (i in 1:length(csv.files)) assign(csv.files[i], read.csv(csv.files[i]))
I have code for just comparing one CSV file (branching.csv) against File A:
> str(FileA)
'data.frame': 190 obs. of 1 variable:
$ name: Factor w/ 190 levels "Acaena novae zelandiae",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(branching.csv)
'data.frame': 4055 obs. of 7 variables:
$ SBSname : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 794 2075 1049 162 132 333 541 1840 272 1553 ...
$ SBS.number : int 16443 26711 40171 40398 40867 41151 37871 42412 35847 36245 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 3 1 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 1 1 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 9 9 20 3 3 3 3 3 33 33 ...
Species<-branching.csv[(branching.csv$SBSname %in% FileA$name),]
write.csv(Species, file = "Branching.csv")
> str(Species)
'data.frame': 298 obs. of 7 variables:
$ name : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 1049 162 1548 47 57 1647 1060 2788 2094 1976 ...
$ SBS.number : int 40171 40398 36280 40532 41629 42495 40103 32792 32892 30583 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 2 2 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 2 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 3 3 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 20 3 33 33 33 33 33 44 44 44 ...
Any help or suggestions would be great. Doesn't have to be a loop!
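How about this simple loop?
library(dplyr)
# list the CSV files to process (an R variable name cannot start with a digit)
csv.files = list.files(pattern="\\.csv$")
for(i in seq_along(csv.files))
{
# keep only the rows whose species also appears in FileA
csv.matching = read.csv(csv.files[i]) %>% inner_join(FileA, by=c("SBSname"="name"))
write.csv(csv.matching, file=gsub("\\.csv", "_matching.csv", csv.files[i]), na="")
}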
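For comparison, a base R sketch of the same loop that reuses the `%in%` filter from the question. Everything here is made up for illustration: the temporary directory, the toy `FileA`, and the single toy `branching.csv` stand in for the real files.

```r
# Hypothetical setup: a temporary folder with one small CSV file
tmp <- tempfile("species")
dir.create(tmp)
FileA <- data.frame(name = c("Abies alba", "Acaena novae zelandiae"))
write.csv(data.frame(SBSname   = c("Abies alba", "Abies alba", "Zea mays"),
                     branching = c("yes", "no", "no")),
          file.path(tmp, "branching.csv"), row.names = FALSE)

# List the input files once, before any output is written
csv.files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

for (f in csv.files) {
  dat <- read.csv(f)
  # keep only rows whose species is listed in FileA
  matched <- dat[dat$SBSname %in% FileA$name, ]
  write.csv(matched, sub("\\.csv$", "_matching.csv", f), row.names = FALSE)
}

result <- read.csv(file.path(tmp, "branching_matching.csv"))
print(result)
```

Unlike `inner_join`, the `%in%` filter never adds columns or duplicates rows, which may be preferable when only the subset of each file is wanted.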

Change type of variables in multiple data frames

I have a list of data frames:
str(df.list)
List of 34
$ :'data.frame': 506 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:506] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALAT": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:506] 23 23 23 24 25 24 20 34 28 17 ...
..$ Index : Factor w/ 502 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ...
$ :'data.frame': 505 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:505] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALB": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:505] 45 46 47 47 49 47 46 46 44 43 ...
..$ Index : Factor w/ 501 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ..
The list contains 34 data frames with equal variable names. The variables Time and ResultAssay are of the wrong type: I would like to have Time as factor and ResultAssay as numerical.
I am trying to write a function to use together with lapply to convert the variable types in this list of 34 data frames in one go, but so far I have been unsuccessful.
I have tried things along the lines of:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
}
lapply(df.list, ChangeType)
What you have tried is nearly correct, but the function also needs to return the modified data.frame, and the result of lapply has to be stored back in your existing variable, like so:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
DF #return the data.frame
}
# store the returned value to df.list,
# thus updating your existing data.frame
df.list <- lapply(df.list, ChangeType)
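A self-contained sketch of the pattern, using a made-up list of two small data frames with only the two columns of interest. It refers to the columns by name instead of by position, which is an assumption about what is most robust here.

```r
# Toy list standing in for the 34 data frames
df.list <- list(
  data.frame(Time = c(0, 2, 0.5), ResultAssay = c(23L, 23L, 24L)),
  data.frame(Time = c(0, 2, 3),   ResultAssay = c(45L, 46L, 47L))
)

ChangeType <- function(DF) {
  DF$Time        <- as.factor(DF$Time)
  DF$ResultAssay <- as.numeric(DF$ResultAssay)
  DF  # return the modified data frame
}

# Overwrite the list with the converted data frames
df.list <- lapply(df.list, ChangeType)

print(sapply(df.list, function(DF) class(DF$Time)))
print(sapply(df.list, function(DF) class(DF$ResultAssay)))
```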

How to read factors' levels right in R?

I have a big csv file that has 51993 rows and 18 columns. Here is part of the table:
head(ddd)
country.of.birth age sex X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
Afghanistan 0 men 0 0 1 2 2 0 1 1
Afghanistan 0 women 1 1 0 0 1 0 0 0
Afghanistan 1 men 0 2 5 2 3 4 1 1
Afghanistan 1 women 4 1 4 2 3 2 3 2
Afghanistan 2 men 5 0 8 7 7 3 5 3
Afghanistan 2 women 4 8 3 9 4 4 4 3
In the main csv file, the columns are: Country of Birth, age, sex, and then years from 2000 to 2014. My question is: why does R put X before each year number?
When I used the str() function, I got:
> str(ddd)
'data.frame': 15264 obs. of 18 variables:
$ country.of.birth: Factor w/ 261 levels "0","1","10","103",..: 51 51 51 51 51 51 51 51 51 51 ...
$ age : Factor w/ 38 levels "","0 ","1 ","10 ",..: 2 2 3 3 14 14 17 17 20 20 ...
$ sex : Factor w/ 39 levels "","0 ","1 ","10 ",..: 38 39 38 39 38 39 38 39 38 39 ...
$ X2000 : Factor w/ 786 levels "","0","1","10",..: 2 3 2 478 555 478 92 4 205 716 ...
$ X2001 : int 0 1 2 1 0 8 11 8 26 19 ...
$ X2002 : int 1 0 5 4 8 3 13 18 22 15 ...
$ X2003 : int 2 0 2 2 7 9 15 13 23 33 ...
$ X2004 : int 2 1 3 3 7 4 11 15 21 22 ...
$ X2005 : int 0 0 4 2 3 4 10 6 13 16 ...
$ X2006 : int 1 0 1 3 5 4 8 13 20 10 ...
$ X2007 : int 1 0 1 2 3 3 6 7 9 17 ...
$ X2008 : int 0 0 2 0 4 5 4 6 8 9 ...
$ X2009 : int 0 1 1 4 7 3 9 10 11 12 ...
$ X2010 : int 1 1 6 4 8 10 17 10 21 16 ...
$ X2011 : int 0 5 9 6 21 18 16 27 34 24 ...
$ X2012 : int 3 5 5 16 30 22 44 48 46 49 ...
$ X2013 : int 3 0 12 19 24 34 54 46 76 71 ...
$ X2014 : int 2 3 15 3 21 29 37 48 64 62 ...
As you can see, sex is a factor with 39 levels, whereas it has only two values (men and women). Also, year 2000 (X2000 in the table) is a factor with 786 levels, although it should have been read as an int. Why did R read the variable "sex" with this large number of levels, and why did it read year 2000 as a factor while it read the other years as int (as expected)?
Edit:
The age column has values of the form 20-24, 25-30, ... up to 85-90, plus another category, 90+.
X is put in front of the column names because R doesn't allow the first character of a syntactically valid column name to be a number (try data.frame(a = 1:10, "3" = runif(10))).
Age is a factor because its values are bins, so what you observe is expected behaviour. R cannot read a value such as 20-24 or 90+ as a number, so the column is read as text and (with stringsAsFactors = TRUE) stored as a factor.
The sex variable is odd. Given the data currently available, I would say the variable represents something other than sex in at least part of the dataset. Has the dataset been stitched together? Perhaps there was a mistake in copy/pasting. See levels(ddd$sex) to disentangle all possible levels.
The default behaviour of read.table and its related functions is to make all column names syntactically valid. This means that they can be used without quoting after the $ operator. However, this behaviour can be changed using the check.names = FALSE parameter. This will mean you end up with columns called 2000 etc. To then use those columns with $ they will need to be backquoted, e.g.
ddd$`2000`
The same will be true if you want to use these columns with non-standard evaluation, e.g.
ggplot(ddd, aes(x = sex, y = `2000`)) + geom_boxplot()
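The effect of check.names can be seen on a small inline CSV. The sketch below is illustrative only: the two-row text is made up, and it is read with `read.csv(text = ...)` so that no file is needed.

```r
# Made-up CSV content with numeric column headers
csv.text <- "country,sex,2000,2001\nAfghanistan,men,0,0\nAfghanistan,women,1,1"

# Default: names are made syntactically valid, so an X is prepended
with.x <- read.csv(text = csv.text)
print(names(with.x))

# check.names = FALSE keeps the headers exactly as they are in the file
no.x <- read.csv(text = csv.text, check.names = FALSE)
print(names(no.x))

# Such columns then need backticks when used with $
print(no.x$`2000`)
```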
For the sex column, there must be entries further down the column that contain numbers. Check your original data.
For age, you have trailing spaces in your age column. Either remove these outside R, or, if the values are plain numbers, you could do something like this (binned values such as 20-24 would become NA):
ddd$age <- as.numeric(sub(" +$", "", as.character(ddd$age)))
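If the column mixes plain ages with bins such as 20-24 or 90+, one option is to strip the whitespace but keep the column as a factor. This is a sketch on made-up values; `trimws()` is base R, and ordering the levels by first appearance is an assumption about what is wanted.

```r
# Made-up age values with trailing spaces, including bins
age <- factor(c("0 ", "1 ", "20-24 ", "90+ ", "1 "))

# Remove surrounding whitespace, then rebuild the factor
age.trimmed <- trimws(as.character(age))
age.clean   <- factor(age.trimmed, levels = unique(age.trimmed))

print(levels(age.clean))
```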
For the 2000 column, it's not clear from your str output why it's been read as a factor. By default, empty strings should be regarded as NA and so shouldn't affect the class. You could try (assuming you're now using check.names = FALSE):
as.character(ddd$`2000`)[is.na(as.numeric(as.character(ddd$`2000`))) & ddd$`2000` != ""]
This should print out any elements of the column which are non-blank and non-numeric. It may again be a trailing space issue.
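The same check on a small made-up factor shows what it returns. The values `"n/a"` and `"1,5"` are hypothetical examples of entries that would force a column to be read as a factor; `suppressWarnings()` only hides the coercion warning that `as.numeric()` emits for them.

```r
# Made-up column: mostly numbers, one blank, two non-numeric entries
x2000 <- factor(c("0", "1", "", "n/a", "7", "1,5"))

as.chr <- as.character(x2000)
as.num <- suppressWarnings(as.numeric(as.chr))

# Entries that are non-blank but cannot be converted to a number
bad <- as.chr[is.na(as.num) & as.chr != ""]
print(bad)
```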
