Find correlation for CSV file dataset in R


After reading a CSV file in R, I cannot compute the correlation; cor() fails with the error "'x' must be numeric".
s = read.csv(file.choose(), header=T)
cor(s)
Error in cor(s) : 'x' must be numeric
The data set:
https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/MASS/UScereal.csv

Your data is not entirely numeric, so you should apply the correlation function only to the numeric columns.
library(MASS)
data("UScereal")
# type of variables
str(UScereal)
# 'data.frame': 65 obs. of 11 variables:
# $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
# $ calories : num 212 212 100 147 110 ...
# $ protein : num 12.12 12.12 8 2.67 2 ...
# $ fat : num 3.03 3.03 0 2.67 0 ...
# $ sodium : num 394 788 280 240 125 ...
# $ fibre : num 30.3 27.3 28 2 1 ...
# $ carbo : num 15.2 21.2 16 14 11 ...
# $ sugars : num 18.2 15.2 0 13.3 14 ...
# $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
# $ potassium: num 848.5 969.7 660 93.3 30 ...
# $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
The correlation should then be computed only on numeric columns, which here means columns 2 to 8 and column 10:
# correlation matrix
cor(UScereal[c(2:8,10)])
# calories protein fat sodium fibre carbo sugars potassium
# calories 1.0000000 0.7060105 0.5901757 0.5286552 0.3882179 0.78872268 0.49529421 0.4765955
# protein 0.7060105 1.0000000 0.4112661 0.5727222 0.8096397 0.54709029 0.18484845 0.8417540
# fat 0.5901757 0.4112661 1.0000000 0.2595606 0.2260715 0.18285220 0.41567397 0.3232754
# sodium 0.5286552 0.5727222 0.2595606 1.0000000 0.4954831 0.42356172 0.21124365 0.5566426
# fibre 0.3882179 0.8096397 0.2260715 0.4954831 1.0000000 0.20307489 0.14891577 0.9638662
# carbo 0.7887227 0.5470903 0.1828522 0.4235617 0.2030749 1.00000000 -0.04082599 0.2420485
# sugars 0.4952942 0.1848484 0.4156740 0.2112437 0.1489158 -0.04082599 1.00000000 0.2718335
# potassium 0.4765955 0.8417540 0.3232754 0.5566426 0.9638662 0.24204848 0.27183347 1.0000000
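
Picking columns by position works for this file, but the positions have to be read off str() by hand. As a minimal sketch, the numeric columns can also be selected by type. The small data frame here is a made-up stand-in for the result of read.csv(), and its column names are only illustrative. Note that is.numeric() also keeps integer columns such as shelf, which the answer above left out.

```r
# Toy stand-in for a data frame returned by read.csv(); values are made up.
s <- data.frame(
  mfr      = c("G", "K", "N", "P", "K"),
  calories = c(212, 212, 100, 147, 110),
  protein  = c(12.1, 12.1, 8, 2.7, 2),
  shelf    = c(3L, 3L, 3L, 1L, 2L)
)

# TRUE for double and integer columns, FALSE for character/factor columns
is_num <- vapply(s, is.numeric, logical(1))

# Keep only the numeric columns and compute the correlation matrix
num_cols <- s[, is_num, drop = FALSE]
cm <- cor(num_cols)
print(round(cm, 3))
```

If some columns contain missing values, cor(num_cols, use = "complete.obs") restricts the computation to complete rows.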

Related

Classify factor output with factors with >60 levels and numeric inputs

I'm a newbie working on a classification problem to find the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases), and the inputs are numeric and factor variables. Some of those factors have as many as 94 levels (for example, the coral species), and I can't split or collapse them because I want to be as precise as possible: one species may be less resistant than another. The numeric variables include the population in the area, fishing trips, etc.
First problem: I tried genetic algorithms, random forests, etc. to select the most important variables, but the runs get aborted, so the variables I eliminated were chosen only from correlograms. I want something stronger for deciding which variables to select.
Second problem: I've tried everything I know and searched Google extensively for something that runs and produces a classification, but nothing works. I tried SVM, random forests, CART, GBM, bagging and boosting, but none of them can handle this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcome!

R Dataframe issue preventing normality test

I've read my .CSV and then converted the file to a data frame using several methods including:
df<-read.csv('cdSH2015Fall.csv', dec = ".", na.strings = c("na"), header=TRUE,
             row.names=NULL, stringsAsFactors=F)
df<-as.data.frame(lapply(df, unlist)) # convert the result to a data.frame
str(df) # provides the structure of df.
'data.frame': 72 obs. of 16 variables:
$ trtGroup : Factor w/ 68 levels "AANN","AAPN",..: 5 7 14 18 20 23 27 33 37 48 ...
$ cd : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ PreviousExp : Factor w/ 2 levels "Empty","Enriched": 2 1 2 2 2 2 1 1 1 1 ...
$ treatment : Factor w/ 2 levels "NN","PN": 1 1 1 1 1 1 1 1 1 1 ...
$ total.Area.DarkBlue.: num 827 1037 663 389 983 ...
$ numberOfGroups : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalGroupArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ averageGrpArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ proximityToPlants : num 5.65 16.05 2.58 9.65 4.74 ...
$ latFeed : num 2 0.5 0 1 0 0 1 0.5 2 1 ...
$ latBalloon : num 6 2 2 NA 0 0.1 3 0.5 1 0.7 ...
$ countChases : int 5 8 16 4 16 21 18 11 14 28 ...
$ chases : int 95 87 67 923 636 96 1210 571 775 816 ...
$ grpDiameter : num 16.8 23.3 19.5 11.2 29.9 ...
$ grpActiv : num 4908 5164 4197 5263 5377 ...
$ NND : num 0 11.88 8.98 3.6 9.8 ...
I then run my test in two ways.
First option:
fit = t.test(df$proximityToPlants[which(df$cd==1 & df$treatment == 'PN')],
             df$proximityToPlants[which(df$cd==0 & df$treatment == 'PN')])
Second option, trying to ensure I have a proper data frame: subset the data and then create a matrix.
cdProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'PN')]
H2ProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'PN')]
cdProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'NN')]
H2ProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'NN')]
Creating a matrix:
df<-cbind(cdProximityToPlantsPN,H2ProximityToPlantsPN,
          cdProximityToPlantsNN,H2ProximityToPlantsNN)
mat <- sapply(df,unlist)
fit=t.test(mat[,1],mat[,2], paired = F, var.equal = T)
Yet I still get errors when assessing outliers with the following:
outlierTest(fit) # Bonferroni p-value for most extreme obs
Error in UseMethod("outlierTest") :
no applicable method for 'outlierTest' applied to an object of class "htest"
qqPlot(fit, main="QQ Plot") # QQ plot for studentized residuals
Error in order(x[good]) : unimplemented type 'list' in 'orderVector1'
leveragePlots(fit) # leverage plots
Error in formula.default(model) : invalid formula
I know the issue must be with my data structure. Any ideas on how to fix it?
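
The error messages point at the class of fit rather than at the data frame: t.test() returns an "htest" object, whereas outlierTest(), qqPlot() and leveragePlots() (from the car package) are written for fitted models such as those returned by lm(). One possible direction, sketched here with made-up data standing in for the PN subset, is to fit the equal-variance two-group comparison as a linear model and run the diagnostics on that object. The data frame pn and its values are assumptions for illustration only.

```r
set.seed(1)

# Made-up stand-in for the 'PN' rows of the data in the question
pn <- data.frame(
  cd = factor(rep(c("0", "1"), each = 10)),
  proximityToPlants = c(rnorm(10, mean = 8, sd = 3),
                        rnorm(10, mean = 5, sd = 3))
)

# An equal-variance two-sample t-test is the same model as lm() with a
# two-level factor as the only predictor
fit_lm <- lm(proximityToPlants ~ cd, data = pn)
fit_t  <- t.test(proximityToPlants ~ cd, data = pn, var.equal = TRUE)

# Both give the same p-value for the group difference
p_lm <- summary(fit_lm)$coefficients["cd1", "Pr(>|t|)"]
print(c(lm = p_lm, t.test = fit_t$p.value))

# Diagnostics that need a fitted model now have one to work with
res <- rstudent(fit_lm)               # studentized residuals
sw  <- shapiro.test(residuals(fit_lm)) # normality test on the residuals
print(sw)
```

With a model object in hand, qqnorm(res) followed by qqline(res) draws a base R normal QQ plot of the studentized residuals, and the car diagnostics can be called on fit_lm instead of on the t-test result.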

chart.Correlation with continuous and categorical variables

I want to see whether my variables are correlated. This is the structure of the dataset:
'data.frame': 189 obs. of 20 variables:
$ age : num 24 31 32 35 36 26 31 24 35 36 ...
$ diplM2 : Factor w/ 3 levels "0","1","2": 3 2 1 3 2 2 3 2 2 1 ...
$ TimeDelcat : Factor w/ 4 levels "0","1","2","3": 1 1 3 3 3 4 2 1 4 4 ...
$ SeasonDel : Factor w/ 4 levels "1","2","3","4": 1 2 4 3 4 3 4 3 2 3 ...
$ BMIM2 : num 23.4 25.7 17 26.6 24.6 21.6 21 22.3 20.8 20.7 ...
$ WgtB2 : int 3740 3615 3705 3485 3420 2775 3365 3770 3075 3000 ...
$ sex : Factor w/ 2 levels "1","2": 2 2 1 2 2 2 1 1 1 1 ...
$ smoke : Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 1 1 3 ...
$ nRBC : num 0.1621 0.0604 0.1935 0.0527 0.1118 ...
$ CD4T : num 0.1427 0.2143 0.1432 0.0686 0.0979 ...
$ CD8T : num 0.1574 0.1549 0.1243 0.0804 0.0782 ...
$ NK : num 0.02817 0 0.04368 0.00641 0.02398 ...
$ Bcell : num 0.1033 0.1124 0.1468 0.0551 0.0696 ...
$ Mono : num 0.0633 0.0641 0.0773 0.0531 0.0656 ...
$ Gran : num 0.428 0.442 0.329 0.716 0.6 ...
$ chip : Factor w/ 92 levels "200251580021",..: 12 24 23 2 27 22 6 22 17 22 ...
$ pos : Factor w/ 12 levels "R01C01","R01C02",..: 11 12 1 6 9 2 12 1 7 11 ...
$ trim1PM25ifdmv4: num 9.45 13.81 15.59 7.13 15.43 ...
$ trim2PM25ifdmv4: num 13.27 15.53 10.69 13.56 9.27 ...
$ trim3PM25ifdmv4: num 16.72 16.21 12.17 6.47 10.66 ...
As you can see, there are both continuous and categorical variables.
When I run chart.Correlation(variables, histogram=T, method = c("pearson"))
I get this error:
Error in pairs.default(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, :
non-numeric argument to 'pairs'
How can I fix this?
Thank you.

I believe you want the correlation only between the numeric variables. The code below computes it and outputs each pair of variables only once.
library(reshape2)
data <- data.frame(x1=rnorm(10),
x2=rnorm(10),
x3=rnorm(10),
x4=c("a","b","c","d","e","f","g","h","i","j"),
x5=c("ab","sp","sp","dd","hg","hj","qw","dh","ko","jk"))
data
x1 x2 x3 x4 x5
1 -1.2169793 0.5397598 0.4981513 a ab
2 -0.7032631 -2.1262837 -1.0377371 b sp
3 0.8766831 -0.2326975 -0.1219613 c sp
4 0.3405332 2.4766225 -1.1960618 d dd
5 0.1889945 0.3444534 1.9659062 e hg
6 0.8086956 0.4654644 -1.2526696 f hj
7 -0.6850181 -1.7657241 0.5156620 g qw
8 0.8518034 0.9484547 1.4784063 h dh
9 0.5191793 1.2246566 1.3867829 i ko
10 0.4568953 -0.6881464 0.3548839 j jk
# correlation of all numeric columns
corr <- cor(data[sapply(data, is.numeric)])
# convert the correlation matrix to long format
res <- melt(corr)
# keep each unordered pair of variables only once
res$type <- apply(res, 1, function(x)
  paste(sort(c(as.character(x[1]), as.character(x[2]))), collapse = "*"))
res <- unique(res[, c("type", "value")])
res
type value
x1*x1 1.00000000
x1*x2 0.44024939
x1*x3 0.04936654
x2*x2 1.00000000
x2*x3 0.08859169
x3*x3 1.00000000
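
The same list of unique pairs can be produced without reshape2. This is a base R sketch under the same assumptions as the answer (a data frame mixing numeric and character columns, here generated at random), using upper.tri() to pick each pair once. Unlike the output above, it drops the diagonal, so the trivial x1*x1 = 1 rows do not appear.

```r
set.seed(42)

# Illustrative data: three numeric columns and one character column
data <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
                   x4 = letters[1:10])

# Correlation matrix of the numeric columns only
corr <- cor(data[sapply(data, is.numeric)])

# Row/column indices of the cells strictly above the diagonal
idx <- which(upper.tri(corr), arr.ind = TRUE)

# One row per unordered pair of variables
res <- data.frame(
  var1  = rownames(corr)[idx[, "row"]],
  var2  = colnames(corr)[idx[, "col"]],
  value = corr[idx]
)
print(res)
```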

R write.table read.table change the format of some columns in dataframes

I am experiencing a problem when saving data with write.table and reading it back with read.table.
I wrote some code that collects data from thousands of files, does some calculations, and creates a data frame. This data frame has 8 columns and more than 11000 rows. The columns hold the 8 variables, 3 of which are ordered factors; the other variables are numeric.
When I look at the structure of my data before calling write.table, I get exactly what I expect:
str(data)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : Ord.factor w/ 6 levels "160"<"800"<"1.4"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ EnergyUnit: Ord.factor w/ 3 levels "MeV"<"GeV"<"TeV": 1 1 1 1 1 1 1 1 1 1 ...
$ Location : Ord.factor w/ 7 levels "BeamImpact"<"WithinBulky"<..: 5 5 5 5 5 5 5 5 5 5 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : num 30 28 26 24 22 20 30 18 28 16 ...
After that I use the usual write.table command to save my file:
write.table(data, file = "filename.txt")
Now, when I read this file back into R and look at the structure, I get this:
mydata <- read.table("filename.txt", header=TRUE)
str(mydata)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : num 160 160 160 160 160 160 160 160 160 160 ...
$ EnergyUnit: Factor w/ 3 levels "GeV","MeV","TeV": 2 2 2 2 2 2 2 2 2 2 ...
$ Location : Factor w/ 7 levels "10cmTarget","AdjBulky",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : int 30 28 26 24 22 20 30 18 28 16 ...
Do you know how to solve this problem? This also bothers me because I am creating a Shiny app and the changed classes don't fit my purpose.
Thanks!
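
The two str() outputs differ because write.table() stores plain text: the level order of a factor is not written to the file, and read.table() guesses each column's type again from the text, so "160" becomes a number and the factors come back unordered with alphabetical levels. Two possible ways around this are sketched here on a made-up three-row data frame. The column names follow the question, but the values and the temporary file names are assumptions for illustration.

```r
# Toy stand-in for the data frame in the question (values are made up)
data <- data.frame(
  SF = c(5.89, 4.60, 3.58),
  Energy = factor(c("160", "800", "1.4"),
                  levels = c("160", "800", "1.4"), ordered = TRUE),
  EnergyUnit = factor(c("MeV", "MeV", "GeV"),
                      levels = c("MeV", "GeV", "TeV"), ordered = TRUE)
)

# Option 1: saveRDS()/readRDS() store the R object itself, classes included
rds_file <- tempfile(fileext = ".rds")
saveRDS(data, rds_file)
restored <- readRDS(rds_file)
str(restored)

# Option 2: keep the text file, read the factor columns as character,
# then rebuild the ordered factors with explicit levels
txt_file <- tempfile(fileext = ".txt")
write.table(data, file = txt_file)
mydata <- read.table(txt_file, header = TRUE,
                     colClasses = c(Energy = "character",
                                    EnergyUnit = "character"))
mydata$Energy <- factor(mydata$Energy,
                        levels = c("160", "800", "1.4"), ordered = TRUE)
mydata$EnergyUnit <- factor(mydata$EnergyUnit,
                            levels = c("MeV", "GeV", "TeV"), ordered = TRUE)
str(mydata)
```

Option 1 is the shorter route when the file is only ever read by R; option 2 keeps a human-readable file at the cost of restating the level order when reading.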

Carc data from rda file to numeric matrix

I am trying to run KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows the error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but neither helps.
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help me resolve this problem and run this analysis?
Added: the car data set (Chambers, Cleveland, Kleiner & Tukey, 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe

Here is a small data set with characteristics similar to what you describe, built to reproduce this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
                   type2=rep(c('5','6'), each=5),
                   x = rnorm(10,1,2)/10, y = rnorm(10),
                   stringsAsFactors = TRUE)
This should be similar to your data.frame:
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
One fix is to convert the factor columns to numeric with set() from data.table:
require(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
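
The same conversion can be sketched in base R, without data.table. The data frame here is a fixed, made-up version of the example above. The detour through as.character() matters: calling as.numeric() directly on a factor returns the internal level codes (1, 2, ...) rather than the numbers printed as labels.

```r
# Made-up data frame with two factor columns whose labels are numbers
carc <- data.frame(
  type1 = factor(rep(c("1", "2"), each = 5)),
  type2 = factor(rep(c("5", "6"), each = 5)),
  x = seq(0.1, 1, by = 0.1),
  y = c(2, 4, 1, 3, 5, 9, 7, 8, 6, 10)
)

# as.numeric() on a factor gives level codes, not the labels
codes <- as.numeric(carc$type2)

# Convert via character so that "5" becomes 5 and "6" becomes 6
cols_fix <- c("type1", "type2")
carc[cols_fix] <- lapply(carc[cols_fix],
                         function(f) as.numeric(as.character(f)))

# All columns are numeric now, so scale() no longer fails
scaled <- scale(carc)
str(carc)
```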

The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them, either by keeping columns 3 to 14:
X<-XX[,3:14]
or by dropping the first two columns:
X<-XX[,-(1:2)]
