I have a data set for which I'm calculating the distance matrix. Below is the data, which has 251 observations.
> str(mydata)
'data.frame': 251 obs. of 7 variables:
$ BodyFat: num 12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
$ Weight : num 154 173 154 185 184 ...
$ Chest : num 93.1 93.6 95.8 101.8 97.3 ...
$ Abdomen: num 85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
$ Hip : num 94.5 98.7 99.2 101.2 101.9 ...
$ Thigh : num 59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
$ Biceps : num 32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
I normalize the data.
means = apply(mydata,2,mean)
sds = apply(mydata,2,sd)
nor = scale(mydata,center=means,scale=sds)
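As a side note, scale() centers on the column means and scales by the column standard deviations by default, so the same normalization can be written more compactly (equivalent to the lines above):
nor = scale(mydata)   # center = column means, scale = column standard deviations by default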
When I calculate the distance matrix, I see a lot of empty values; moreover, distances appear to be shown for only 4 observations.
distance =dist(nor)
> str(distance)
'dist' num [1:31375] 1.33 2.09 1.9 3.08 3.99 ...
- attr(*, "Size")= int 251
- attr(*, "Labels")= chr [1:251] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distance # output abridged in this post; it covers all 251 observations
1 2 3 4 5 6 7
2 1.3346445
3 2.0854437 2.5474796
4 1.8993458 1.4908813 2.5840752
5 3.0790252 3.4485667 2.2165366 2.7021809
8 9 10 11 12 13 14
2
3
4
5
15 16 17 18 19 20 21
The output continues like this, empty, for the remaining 247 comparisons.
Now I reduce the data set to 20 observations, and here I get a proper distance matrix:
distancetiny=dist(nor)
> str(distancetiny)
'dist' num [1:1176] 1.14 1.8 1.61 2.62 3.39 ...
- attr(*, "Size")= int 49
- attr(*, "Labels")= chr [1:49] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distancetiny
1 2 3 4 5 6 7
2 1.1380433
3 1.7990293 2.2088928
4 1.6064118 1.2871522 2.2483586
5 2.6235853 2.9669283 1.9132224 2.3256624
6 3.3898119 3.3730508 3.3718447 2.2615557 2.0094434
7 1.8947704 2.0065514 1.7685604 1.1065940 1.7387938 2.2321156
8 1.1732465 1.0663217 1.6733689 0.8873140 2.1959298 2.7939555 1.1448269
9 2.2721969 2.0545882 3.4263262 1.4058375 3.1811955 2.4011074 2.3078714
10 2.3753110 2.2424464 3.0289947 1.2808398 2.3230202 1.4242653 1.8571654
11 1.5620472 1.1878554 2.5750350 0.5718248 2.7714795 2.6314286 1.5132365
12 3.5088571 3.2484020 4.1164488 2.2723772 3.1377318 1.4795230 2.8274818
13 2.1448841 2.2679705 1.8726670 1.3494988 1.2176727 1.5544030 1.0725518
14 3.6679035 3.7459402 3.6869023 2.6677308 2.1318420 0.7347359 2.5729973
15 2.9908457 3.3312661 3.1289870 2.4340473 1.8027070 1.3626019 2.3795360
16 1.6117570 2.0283356 1.2011116 1.5961064 1.3196981 2.4456436 1.2569683
17 3.2991393 3.5991747 3.0438049 2.6066933 1.4742664 1.0945621 2.2214101
18 3.9409008 4.0726826 4.0113908 2.9250144 2.5228901 0.9087254 2.8158563
19 2.7468511 2.9495031 3.2439229 1.8312508 2.4122436 1.3932604 1.9640170
20 3.7515064 3.7021743 3.9404231 2.5813440 2.5390519 0.8352961 2.6530503
21 2.3102053 2.3878491 2.0836800 1.4328028 1.2991221 1.5287862 1.1769205
There are no empty values in the output when there are only 21 observations.
Why is this? Does dist() stop working when the observation count goes beyond some threshold?
I'm unable to figure it out. Please help.
This seems to be a display issue. When the data set contains more than about 60-80 observations, the distance matrix is no longer printed in a readable way (even for the initial rows). The values are all present in the object; we just cannot see them all in the console output.
Further operations on the distance matrix (such as hierarchical agglomerative clustering) confirmed that there is nothing to worry about beyond the odd-looking display.
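One quick way to convince yourself that all the values really are there (my own check, not part of the original post) is to coerce the dist object to a full matrix and inspect a corner of it:
m <- as.matrix(distance)   # full 251 x 251 symmetric matrix
dim(m)                     # 251 251
m[1:5, 1:5]                # look at a small corner of the matrix
anyNA(m)                   # FALSE if no distances are actually missing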
I am trying to write a function that will do scatter plotting for me.
The data structure I am working with looks as follows:
'data.frame': 129 obs. of 15 variables:
$ Player : Factor w/ 129 levels "Abbrederis, Jared",..: 1 2 3 4 5 6 7 8 9 10 ...
$ College : Factor w/ 79 levels "Alabama","Arizona",..: 78 20 65 77 27 48 67 31 31 19 ...
$ Position : Factor w/ 7 levels "DB","LB","OL",..: 7 7 6 4 4 4 2 2 7 7 ...
$ OverallGrade: num 5.2 5.96 5.4 5.16 5.45 5.1 6.6 5.37 5.9 6.4 ...
$ Height : int 73 73 77 70 68 73 77 73 71 77 ...
$ ArmLength : num 31.4 32.6 34 31.2 31 ...
$ Weight : int 195 212 265 225 173 218 255 237 198 240 ...
$ HandLength : num 9.62 9 9 9.5 8.88 ...
$ Dash40 : num 4.5 4.56 4.74 4.82 4.26 4.48 4.66 4.64 4.43 4.61 ...
$ BenchPress : int 4 14 28 20 20 19 15 22 7 13 ...
$ VerticalJump: num 30.5 39.5 33 29.5 38 38 34.5 35 38.5 32.5 ...
$ BroadJump : int 117 123 118 106 122 121 119 123 122 119 ...
$ Cone3Drill : num 6.8 6.82 7.42 7.24 6.86 7.07 6.82 7.24 6.69 7.33 ...
$ Shuttle20 : num 4.08 4.3 4.3 4.49 4.06 4.46 4.19 4.35 3.94 4.39 ...
$ Position1 : Factor w/ 7 levels "WO","DB","S",..: 1 1 6 4 4 4 5 5 1 1 ...
..- attr(*, "scores")= num [1:7(1d)] 4.54 4.75 5.22 4.59 4.58 ...
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr "DB" "LB" "OL" "RB" ...
I managed to do the plotting without writing a function and the code works:
with(nfl, plot(Dash40, BenchPress,
     pch = c(1,3,4,2,0,8,5)[Position],   # one plotting symbol per Position level (matches the legend below)
     col = c("black","red","blue","darkgreen","purple","orange","gray")[Position],
     xlab = "40-yard dash time in seconds",
     ylab = "Bench Press weight",
     panel.first = grid()))
legend("bottomright", legend=levels(nfl$Position),
pch=c(1,3,4,2,0,8,5),
cex=0.5,
col=c("black","red","blue","darkgreen","purple","orange","gray"))
a<-paste(nfl$Player,nfl$BenchPress)
text(nfl$Dash40,nfl$BenchPress,label=as.character(a),cex=0.5)
So basically I want to see the relationship between different numeric variables, and I thought that if the code above works, the following function should work as well:
myplot<-function(xvar,yvar,xlab,ylab){
b<-paste("xlab","vs","ylab")
xvar<-nfl$"xvar"
yvar<-nfl$"yvar"
with(nfl,plot(yvar,xvar),
pch=c(1,3,4,2,0,8,5),
col=c("black","red","blue","darkgreen","purple","orange","gray"),
xlab="xlab",ylab="ylab",
main="b")
}
myplot(Dash40,BenchPress,dash,bench)
I used Dash40 and BenchPress to test the function but it turns out the function doesn't work:
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Essentially I am trying to do the same job with two different pieces of code, so why doesn't the second one work? Could someone please give me some hints on how to solve the problem?
I am very confused by your function. Why does it have arguments xvar,yvar that it overwrites without use?
What is meant by nfl$"xvar" and nfl$"yvar"?
Why is this done at all since you then use with(nfl,...?
If nfl doesn't have columns called xvar or yvar then they will cause an error or be interpreted as NULL; if you plot NULL then you need to specify the limits of the plot since this cannot be found from the data.
The function fails because you are plotting NULL
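You can reproduce the same error with plot(NULL) on its own; supplying explicit limits makes it go away, because R then knows how large the plotting region should be:
plot(NULL)                                   # Error in plot.window(...) : need finite 'xlim' values
plot(NULL, xlim = c(0, 1), ylim = c(0, 1))   # an empty but valid plotting region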
You also need to pass the data from nfl to the function, as there is a danger that Dash40 and BenchPress will not be found, since they only exist within nfl. To do this, use the $ symbol: instead of BenchPress you should pass nfl$BenchPress to the function.
The below should work.
myplot <- function(xvar, yvar, xlab, ylab){
  b <- paste(xlab, "vs", ylab)
  plot(xvar, yvar,
       pch = c(1,3,4,2,0,8,5),
       col = c("black","red","blue","darkgreen","purple","orange","gray"),
       xlab = xlab, ylab = ylab,
       main = b)
}
myplot(nfl$Dash40,nfl$BenchPress,"dash","bench")
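An alternative sketch (my own variation, not part of the original answer) is to pass the data frame plus the column names as strings and index with [[, so the axis labels come straight from the names:
myplot2 <- function(data, xvar, yvar) {
  plot(data[[xvar]], data[[yvar]],
       xlab = xvar, ylab = yvar,
       main = paste(xvar, "vs", yvar))
}
myplot2(nfl, "Dash40", "BenchPress")
This avoids repeating nfl$ at the call site and keeps the labels in sync with the plotted columns.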
I have a dataframe that looks like this:
12/04/2017 00:00:02.30,-2.31,-2.97,-0.3,-1.4
12/04/2017 00:00:02.40,-1.89,-2.94,-1.15,-1.4
12/04/2017 00:00:02.50,-1.66,-3.14,-0.06,-1.39
12/04/2017 00:00:02.60,-1.84,-3.16,0.18,-1.37
12/04/2017 00:00:02.70,-2.12/04/2017 00:00:02.80,-2,-2.56,0.17,-1.41
12/04/2017 00:00:02.90,-2.18,-2.31,0.11,-1.45
12/04/2017 00:00:03,-2.14,-2.21,-0.05,-1.45
The logger the data comes from sometimes writes one timestamp into the row of another line (5th row in the example). I need to delete these lines in R, but I don't really have a clue how to find and delete them in the data frame.
My first idea was to count the number of forward slashes in each row, but I could not find a way to do that.
Another way might be to compute the mean length of all rows, flag lines that are longer than the mean, and delete those. But same problem here: I can't find a way to take the mean over all the characters in a row (strings and numbers).
edit: The output from str(df):
str(df)
'data.frame': 856645 obs. of 6 variables:
$ station: chr "Arof" "Arof" "Arof" "Arof" ...
$ date : Factor w/ 863989 levels "12/04/2017 00:00:01.10",..: 1 2 3 4 5 6 7 8 9 10 ...
$ u : Factor w/ 1327 levels "","0","-0.01",..: 132 84 146 136 112 120 126 33 281 240 ...
$ v : num -0.62 -0.41 -1.58 -1.65 -1.25 -1.8 -1.86 -2.46 -2.59 -2.87 ...
$ w : num 0.89 1.09 0.63 0.53 0.84 0.58 0.46 0.48 -0.16 -0.01 ...
$ temp : num -1.36 -1.41 -1.41 -1.41 -1.41 -1.41 -1.5 -1.48 -1.51 -1.46 ...
- attr(*, "na.action")=Class 'omit' Named int [1:7344] 18 113 246 378 513 643 646 778 909 1042 ...
.. ..- attr(*, "names")= chr [1:7344] "18" "113" "246" "378" ...
Using grepl we can search for a . followed by two digits followed by a /:
grepl("\\.\\d{2}\\/",data$date)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
To check every column of each row instead, apply the same pattern row-wise:
apply(data, 1, function(x) sum(grepl("\\.\\d{2}\\/", x)))
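From there, one way to drop the malformed rows (a sketch, assuming the fused timestamp always shows up in the date column) is to keep only the rows that do not match the pattern:
bad <- grepl("\\.\\d{2}\\/", data$date)   # TRUE for rows with a second date fused in
data_clean <- data[!bad, ]                # keep only the well-formed rows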
I created a data frame from another data set with 332 IDs. I split the data frame by ID and would like to count the rows of each ID and then run a correlation. Can someone tell me how to count the rows of each ID in order to compute a correlation for these individual groups?
jlhoward, your suggestion to add the table(dat1$ID) command worked. My other problem is that the function will not stop running:
corr<-function(directory,threshold=)
####### file location path#####
for(i in 1:332){dat<-rbind(dat,read.csv(specdata1[i]))
dat1<-dat[complete.cases(dat),]
dat2<-(split(dat1,dat1$ID))
list(dat2)
dat3<-table(dat1$ID)
for (i in dat1>=threshold){
x<-dat1$sulfate
y<-dat1$nitrate
correlations<-cor(x,y,use="pairwise.complete.obs",method="pearson")
corrs_output<-c(corrs_output,correlations)
}
I'm trying to correlate the "sulfate" and "nitrate" of each ID monitor that meets a threshold. I created a list that has all the complete cases per ID monitor. I need the function to compute a correlation between "sulfate" and "nitrate" for every per-ID set whose row count is >= the threshold argument of the function. Below are the head and tail of the structure of the data.frame/list for each data set within the main data set "specdata1".
Head of the entire data.frame/list of specdata1 complete cases for correlation:
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
Tail of the entire data.frame/list for all complete cases of specdata1:
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
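For what it's worth, here is a minimal sketch (my own, assuming dat2 is the split list shown above and threshold is the minimum number of complete cases per monitor) of the count-then-correlate step without an explicit inner loop:
counts <- sapply(dat2, nrow)               # number of complete cases per monitor ID
keep <- dat2[counts >= threshold]          # only the monitors that meet the threshold
corrs_output <- sapply(keep, function(g)
  cor(g$sulfate, g$nitrate, use = "pairwise.complete.obs", method = "pearson"))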
I'd like to split a data frame into 4 equal parts, because I'd like to use the 4 cores of my computer.
I did this :
df2 <- split(df, 1:4)
unsplit(df2, f=1:4)
and this:
df2 <- split(df, 1:4)
unsplit(df2, f=c('1','2','3','4'))
But the unsplit function did not work; I get these warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Do you have an idea of the reason?
How many rows in df? You will get that warning if the number of rows in your table is not divisible by 4. I think you are using the split factor f incorrectly, unless what you want to do is put each subsequent row into a different split data.frame.
If you really want to split your data into 4 data frames, one row after the other, then make your splitting factor the same length as the number of rows in your data frame using rep_len, like this:
## Split like this:
split(df , f = rep_len(1:4, nrow(df) ) )
## Unsplit like this:
unsplit( split(df , f = rep_len(1:4, nrow(df) ) ) , f = rep_len(1:4,nrow(df) ) )
Hopefully this example illustrates why the error occurs and how to avoid it (i.e. use a proper splitting factor!).
## Want to split our data.frame into two halves, but rows not divisible by 2
df <- data.frame( x = runif(5) )
df
## Splitting still works but...
## We get a warning because the number of rows is not a multiple of the length of the split factor 'f'
split( df , f = 1:2 )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
## Instead let's use the same split levels (1:2)...
## but make it equal to the length of the rows in the table:
splt <- rep_len( 1:2 , nrow(df) )
splt
#[1] 1 2 1 2 1
## Split works, and f is not recycled because there are
## the same number of values in 'f' as rows in the table
split( df , f = splt )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
## And unsplitting then works as expected and reconstructs our original data.frame
unsplit( split( df , f = splt ) , f = splt )
# x
#1 0.6970968
#2 0.6206521
#3 0.5614762
#4 0.1798006
#5 0.5910995
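Since the original goal was to use the 4 cores, here is a minimal sketch of what you might do with the four chunks afterwards (my own addition; it assumes the parallel package, and mclapply forking is not available on Windows):
library(parallel)
chunks  <- split(df, rep_len(1:4, nrow(df)))           # 4 roughly equal chunks
results <- mclapply(chunks, function(chunk) {
  colMeans(chunk)                                      # placeholder work; replace with your own function
}, mc.cores = 4)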
In the R language 'split' example . . .
aq <- airquality
g <- aq$Month
l <- split(aq,g)
After the 'scale' function is executed
l <- lapply(l, transform, Ozone = scale(Ozone))
I am guessing that at one time in R's history the 'scale' function did not add extra attributes to the column it modifies, namely:
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
As seen here:
> str(l)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 0.782 0.557 -0.523 -0.253 NA ...
.. ..- attr(*, "scaled:center")= num 23.6
.. ..- attr(*, "scaled:scale")= num 22.2
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] NA NA NA NA NA ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 2.399 -0.32 -0.857 NA 0.154 ...
.. ..- attr(*, "scaled:center")= num 59.1
.. ..- attr(*, "scaled:scale")= num 31.6
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] -0.528 -1.284 -1.108 0.455 -0.629 ...
.. ..- attr(*, "scaled:center")= num 60
.. ..- attr(*, "scaled:scale")= num 39.7
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] 2.674 1.928 1.721 2.467 0.644 ...
.. ..- attr(*, "scaled:center")= num 31.4
.. ..- attr(*, "scaled:scale")= num 24.1
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
But now it does add those attributes
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
and the very simple 'unsplit' function is not programmed to handle those attributes.
> unsplit(l,g)
Error in xj[i, , drop = FALSE] : (subscript) logical subscript too long
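A quick diagnostic of my own (not part of the original answer) shows exactly what scale() left on the column:
attributes(l[[1]]$Ozone)   # dim, scaled:center and scaled:scale --
                           # the Ozone column is now a one-column matrix, not a plain vector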
The (direct and simple) solution is to get rid of those attributes.
attributes(l[[1]]$Ozone) <- NULL
attributes(l[[2]]$Ozone) <- NULL
attributes(l[[3]]$Ozone) <- NULL
attributes(l[[4]]$Ozone) <- NULL
attributes(l[[5]]$Ozone) <- NULL
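A more compact alternative (my own suggestion, not in the original answer) is to strip the attributes from every list element at once; as.vector() drops dim and the scaled:* attributes:
l <- lapply(l, function(d) { d$Ozone <- as.vector(d$Ozone); d })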
Then try to unsplit again.
> str( unsplit(l,g) )
'data.frame': 153 obs. of 6 variables:
$ Ozone : num 0.782 0.557 -0.523 -0.253 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
So, now it works.
Andre Mikulec