Merge fairly similar datasets for later comparison with wilcox.test{stats} - r

I have loaded two datasets as data.frames, named DF1 and DF2. Both have the columns time and area. DF1 though has more rows than DF2, i.e. more time points (or data points). The merge function would allow me to combine the area columns of the two datasets by="time", but the time points are dissimilar. round isn't useful here (too coarse and duplicates).
What I actually want to do is to run a two-sample wilcox.test (i.e. they don't follow a normal distribution), which doesn't allow for vectors of different length (afaik).
> head(DF1)
timesteps area time
1 0 1030 40.00
2 100 1031 40.11
3 200 1039 40.22
4 300 1046 40.32
5 400 1053 40.43
6 500 1061 40.54
> head(DF2)
time area
1 33.83506 952.7843
2 43.31922 935.7430
3 47.95656 1528.4501
4 52.78808 2400.7030
5 67.29044 5699.4736
6 72.12320 8277.1240

Why not just use
wilcox.test(DF1$time, DF2$time)
or area if that is the desired test.
The following works:
wilcox.test(rnorm(50), (rnorm(100)+2))

Related

Matching two datasets using different IDs

I have two datasets, one is longitudinal (following individuals over multiple years) and one is cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable which does not allow to track someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal,
The combination of ID-year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in cross-sectional so that one cannot track individuals) I cannot match them based on this.
I guess I would need to find a set of variables that uniquely identify each observation, excluding ID, and match based on those. How would I go about ding that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables. So I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset and then match based on these to the cross-sectional dataset.
Thanks in advance!

Add Elements of Data Frame to Another Data Frame Based on Condition R

I have two data frames that showcase results of an analysis from one month and then the subsequent month.
Here is a smaller version of the data:
Jan19=data.frame(Group=c(589,630,523,581,689),Count=c(191,84,77,73,57))
Dec18=data.frame(Group=c(589,630,523,478,602),Count=c(100,90,50,6,0))
Jan19
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
Dec18
Group Count
1 589 100
2 630 90
3 523 50
4 478 6
5 602 0
Jan19 only has counts >0. Dec18 is the dataset with results from the previous month. Dec18 has counts >=0 for each group. I have been referencing the full Dec18 dataset for counts =0 and manually entering them in to the full Jan18 dataset. I want to rid myself of the manual part of this exercise and just be able to append the groups with counts = 0 to the end of the Jan19 dataset.
That lead me to the following code to perform what I described above:
GData=rbind(Jan19,Dec18)
GData=GData[!duplicated(GData$Group),]
While this code resulted in the correction dimensions, it does not choose the correct duplicate to remove. Among the appended dataset, it treats the Jan19 results>0 as the duplicate and removes that. This is the result:
Gdata
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
9 478 6
10 602 0
Essentially, I wanted that 6 to show up as a 0. So, that lead me to the following line of code where I wanted to set a condition, if the new appended data (Dec18) has a duplicate Group to the newer data (Jan19), then that corresponding Count should=0. Otherwise, the value of count from the Jan19 dataset should hold.
Gdata=ifelse(Dec18$Group %in% Jan19$Group==FALSE, Gdata$Count==0,Jan19$Count)
This is resulting in errors and I'm not sure how to modify it to achieve my desired result. Any help would be appreciated!
Your rbind/deduplication approach is a good one, you just need the Dec18 data you rbind on to have have the Count column as 0:
Gdata = rbind(Jan19, transform(Dec18, Count = 0))
Gdata[!duplicated(Gdata$Group), ]
# Group Count
# 1 589 191
# 2 630 84
# 3 523 77
# 4 581 73
# 5 689 57
# 9 478 0
# 10 602 0
While this code resulted in the correction dimensions, it does not choose the correct duplicate to remove. Among the appended dataset, it treats the Jan19 results>0 as the duplicate and removes that. This is the result:
This is incorrect. !duplicated() will keep the first occurrence and remove later occurrences. None of the Jan19 data is removed---we can see that the first 5 rows of Gdata are exactly the 5 rows of Jan19. The only issue was that the non-duplicated rows from Dec18 were not all 0 counts. We fix this with the transform().
There are plenty of other ways to do this, with a join using the merge function, we could only rbind on the non-duplicated groups as d.b suggests, rbind(Jan19, transform(Dec18, Count = 0)[!Dec18$Group %in% Jan19$Group,]), and there are others too. We could make your ifelse approach work like this:
Gdata = rbind(Jan19, Dec18)
Gdata$Count = ifelse(!Dec18$Group %in% Jan19$Group, 0, Gdata$Count)
# an alternative to ifelse, a little cleaner
Gdata = rbind(Jan19, Dec18)
Gdata$Count[!Gdata$Group %in% Jan19$Group] = 0
Use whatever makes the most sense to you.

rbind matrices with different dimension and some columns in common in R

I need your help.
I'm working with genotypes with 3 files of different sizes (in separate files).
I need to join the genotypes one underneath the other by the common columns. NA genotypes may not appear in the file.
The file would have to look like below.
Column 1 is the identification of the animal, column 2 is the chip and the other columns are the genotypes coded as 0, 1, 2 and 5.
Could anyone help?
Thank you.
1353 1 2110101100201201101101011011111121111121
8014 1 2111010151110112022111011151111210111221
516 2 2110510120
517 2 2210510220
181 3 11101111122011205502
182 3 11201111122012205502

Detection of time-series outliers

I'm working on a university project forecasting. I have a huge database with demand between two cities. However, I know that this dataset is contaminated. However, I do not know which data points are obscured. The dataset is a panel data set that follows demand between city pairs on a monthly basis. Below is a part of the data that I am working with.
CAI.JED CAI.RUH ADD.DXB CAI.IST ALG.IST
2013-01-01 19196 14777 16 1413 12
2013-02-01 19913 8 18203 1026 5
2013-03-01 34242 11751 17836 985 1
2013-04-01 23481 12000 13479 948 27
2013-05-01 24428 16046 16391 954 9
2013-06-01 31791 23479 16571 1 4
2013-07-01 33716 20090 11323 0 5724
2013-08-01 35553 2 11121 0 0
2013-09-01 18746 13423 12119 0 26
2013-10-01 10 12223 10239 0 0
2013-11-01 19 20234 14231 5 2
2013-12-01 15198 1 12132 10 5
The dataset is a combination from two datasets. The persons that provided me the data told me that in some months, only one of the two dataset is working. However, it is not known for which months, which specific dataset is available.
Now comes my question: for the next part of the project, I need to get annual demand numbers. However, as I know that the figures are obscured, I would like to remove outliers. What techniques are available in R to do this?
As the data is in time-series format, I tried to use the tsoutliers package (see http://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf). However, I could not get this working. Also, I tried the suggestions from https://stats.stackexchange.com/questions/104882/detecting-outliers-in-time-series-ls-ao-tc-using-tsoutliers-package-in-r-how/104946#104946 , but it didn't work.
After knowing what the outliers are, I would like to either replace them (e.g. with the mean for that route), or if too many points are missing, I would like to reject the entire route from the dataset.
I prefer density based clustering algorithm such as DBSCAN.
If you modify the epsilon and num-samples, you can filter outliers very specifically
using a plot to visualize the result (label -1 are the outliers)

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407. It was derived from the x and y coordinates for two datasets of points, one with 41 and another with 110,357,407 and the values of the rows represent the distances between these two sets of points (the distance of each point on list 1 to every single point on list 2). The first column is a list of points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns, the first column would be similar to the first column of the matrix with the 110,357,407 rows, the second would be the 41 data points (each matched up with a distance each of the first points to all of the others) and the third would be the distance between those points. So it would look something like this
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between the back and all of the first value of pres are complete, pres will change to 2 and will eventually work its way up to 41)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.

Resources