Select missing rows in different dataframes - r

I have two dataframes: list1 and list2
>head(list1)
RS_ID CHROM POS REF_ALLELE ALT_ALLELE AF_REF_allsamples
1 rs77599058 1 195680131 C T 0.9996
2 rs73056353 1 195680971 A G 0.9999
3 rs12130880 1 195681419 A T 0.5475
4 rs76457267 1 195681460 A C 0.9993
5 rs10921893 1 195681616 T C 0.5060
6 rs75239769 1 195682022 G A 0.9999
AF_ALT_allsamples AF_REF_onlycontrol AF_ALT_onlycontrol pvalues
1 0.0004 0.9996 0.0004 0.7830
2 0.0001 0.9998 0.0002 0.3740
3 0.4525 0.5442 0.4558 0.0597
4 0.0007 0.9992 0.0008 0.3590
5 0.4940 0.5099 0.4901 0.0302
6 0.0001 1.0000 0.0000 0.5500
>head(list2)
RS_ID CHROM POS REF_ALLELE ALT_ALLELE AF_REF_allsamples
1 rs77599058 1 195680131 C T 0.9996
2 rs73056353 1 195680971 A G 0.9999
3 rs12130880 1 195681419 A T 0.5475
4 rs76457267 1 195681460 A C 0.9993
5 rs10921893 1 195681616 T C 0.5060
6 rs75239769 1 195682022 G A 0.9999
AF_ALT_allsamples AF_REF_onlycontrol AF_ALT_onlycontrol pvalues
1 0.0004 0.9996 0.0004 0.7830
2 0.0001 0.9998 0.0002 0.3740
3 0.4525 0.5442 0.4558 0.0597
4 0.0007 0.9992 0.0008 0.3590
5 0.4940 0.5099 0.4901 0.0302
6 0.0001 1.0000 0.0000 0.5500
> dim(list1)
[1] 235111 10
> dim(list2)
[1] 234520 10
As you can see from dim(), they differ by 591 rows. I now want to get a new dataframe with all rows from list1 that are not in list2 (those 591).
I tried
> match_diff=list1[!(list1 %in% list2)]
> dim(match_diff)
[1] 235111 10
but as you can see, it claims that all rows from list1 differ from list2.
I checked with str() whether there's an underlying cause, but both structures are identical (they originate from the same raw data).
I can't check by a single column; I must compare each row as a whole.

This is a database join operation. If you search for joins you will find more information on the different kinds out there. Your attempt fails because %in% on data frames compares whole columns (a data frame is a list of columns), so list1 %in% list2 returns one logical per column, not per row. As @starja said, you want anti_join from dplyr:
Install dplyr if you don't have it already with install.packages('dplyr')
R> list1 <- data.frame(a=0:5, b=10:15)
R> list2 <- data.frame(a=(0:5)+3, b=(10:15)+3)
R> list1
a b
1 0 10
2 1 11
3 2 12
4 3 13
5 4 14
6 5 15
R> list2
a b
1 3 13
2 4 14
3 5 15
4 6 16
5 7 17
6 8 18
R> list3 <- dplyr::anti_join(list1, list2)
Joining, by = c("a", "b")
R> list3
a b
1 0 10
2 1 11
3 2 12
R>
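A base R alternative, as a minimal sketch (assuming a row is identified by the combination of all its columns): build a key string per row and keep the rows of list1 whose key never appears in list2.
key1 <- do.call(paste, c(list1, sep = "\r"))  # one key string per row
key2 <- do.call(paste, c(list2, sep = "\r"))
match_diff <- list1[!(key1 %in% key2), ]      # rows of list1 absent from list2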

Related

Simulate unbalanced clustered data

I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. However, I would like to first create each cluster with 10% more observations than specified (i.e., 33 rather than 30), and then randomly exclude the appropriate number of observations (i.e., 60) to arrive at the specified average of 30 observations per cluster. The probability of excluding an observation within each cluster is not uniform (i.e., some clusters have no cases removed and others have more excluded), so in the end I still have 600 observations in total. Does anyone know how to achieve that in R? Here is a smaller example dataset. The number of observations per cluster doesn't follow the condition specified above; I just used it to convey my idea.
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some data, the unbalanced data look like this:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
EDIT
This part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but rather on some randomly selected clusters (this expression can be used: df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]). In the end, I would like to get confidence intervals from those 1000 regressions. How should I proceed?
As per Ben Bolker's request, I am posting my solution, but see jay.sf's answer for a more generalizable approach.
#First create an oversampled dataset:
library(dplyr)  # for slice_sample() and arrange()
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each=6)
df <- data.frame(id=z, cluster=w, x=x, y=y)
#Then just slice_sample() down to the desired sample size
df %>% slice_sample(n = 20) %>%
  arrange(cluster)
#Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
Let ncl be the desired number of clusters. We may generate a sampling space S, which is a sequence within tolerance tol around the mean observations per cluster mnobs. From that we repeatedly draw a random sample of size 1 to obtain a list of clusters CL. If the sum of the cluster lengths meets ncl*mnobs we break the loop, add random data to the clusters, and rbind the result.
FUN <- function(ncl=20, mnobs=30, tol=.1) {
  ## sampling space: integer cluster sizes within +/- tol around mnobs
  S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
  ## redraw cluster sizes until the total is exactly ncl*mnobs
  repeat({
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=TRUE)))
    if (sum(lengths(CL)) == ncl*mnobs) break
  })
  ## fill each cluster with random x and y columns
  L <- lapply(seq.int(CL), function(i) {
    id <- seq.int(CL[[i]])
    cbind(id, cluster=i,
          matrix(rnorm(max(id)*2), , 2, dimnames=list(NULL, c("x", "y"))))
  })
  do.call(rbind.data.frame, L)
}
Usage
set.seed(42)
res <- FUN()  ## using the default arguments
dim(res)
# [1] 600 4
(res.tab <- table(res$cluster))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30
table(res.tab)
# 27 28 29 30 31 32 33
# 2 3 2 4 5 3 1
sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
# mean sd
# 30.000000 1.747178
Displayable example
set.seed(42)
FUN(4, 5, tol=.3) ## tol needs to be adjusted for smaller samples
# id cluster x y
# 1 1 1 1.51152200 -0.0627141
# 2 2 1 -0.09465904 1.3048697
# 3 3 1 2.01842371 2.2866454
# 4 1 2 -1.38886070 -2.4404669
# 5 2 2 -0.27878877 1.3201133
# 6 3 2 -0.13332134 -0.3066386
# 7 4 2 0.63595040 -1.7813084
# 8 5 2 -0.28425292 -0.1719174
# 9 6 2 -2.65645542 1.2146747
# 10 1 3 1.89519346 -0.6399949
# 11 2 3 -0.43046913 0.4554501
# 12 3 3 -0.25726938 0.7048373
# 13 4 3 -1.76316309 1.0351035
# 14 5 3 0.46009735 -0.6089264
# 15 1 4 0.50495512 0.2059986
# 16 2 4 -1.71700868 -0.3610573
# 17 3 4 -0.78445901 0.7581632
# 18 4 4 -0.85090759 -0.7267048
# 19 5 4 -2.41420765 -1.3682810
# 20 6 4 0.03612261 0.4328180
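For the EDIT (bootstrapping clusters and collecting confidence intervals), a minimal sketch building on FUN() above and the resampling expression from the question; the names boot_coef and idx are illustrative, and y ~ x is assumed to be the regression of interest:
set.seed(42)
boot_coef <- replicate(1000, {
  df <- FUN()                                    ## a fresh simulated dataset each time
  cluster <- split(seq_len(nrow(df)), df$cluster)
  k <- length(cluster)
  ## resample whole clusters with replacement, as in the question
  idx <- unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = FALSE)
  coef(lm(y ~ x, data = df[idx, ]))[["x"]]       ## keep the slope estimate
})
quantile(boot_coef, c(0.025, 0.975))             ## percentile bootstrap CI for the slope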

Why am I getting "'train' and 'class' have different lengths"?

Why am I getting
'train' and 'class' have different lengths
in spite of both of them having the same length?
y_pred=knn(train=training_set[,1:2],
test=Test_set[,-3],
cl=training_set[,3],
k=5)
Their dimensions are given below:
> dim(training_set[,-3])
[1] 300 2
> dim(training_set[,3])
[1] 300 1
> head(training_set)
# A tibble: 6 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -1.77 -1.47 0
2 -1.10 -0.788 0
3 -1.00 -0.360 0
4 -1.00 0.382 0
5 -0.523 2.27 1
6 -0.236 -0.160 0
> Test_set
# A tibble: 100 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -0.304 -1.51 0
2 -1.06 -0.325 0
3 -1.82 0.286 0
4 -1.25 -1.10 0
5 -1.15 -0.485 0
6 0.641 -1.32 1
7 0.735 -1.26 1
8 0.924 -1.22 1
9 0.829 -0.582 1
10 -0.871 -0.774 0
It's because knn() expects cl to be a vector and you are giving it a one-column table. The test knn() performs is whether nrow(train) == length(cl), and if cl is a table rather than a vector, length() does not give the answer you are expecting. Compare:
> length(data.frame(a=c(1,2,3)))
[1] 1
> length(c(1,2,3))
[1] 3
If you use cl=training_set$Purchased, which extracts the vector from the table, that should fix it.
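For completeness, a sketch of the corrected call (assuming knn() comes from the class package and the objects from the question):
library(class)
y_pred <- knn(train = training_set[, 1:2],
              test = Test_set[, -3],
              cl = training_set$Purchased,  # a factor vector, not a one-column table
              k = 5)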
This is a specific gotcha if you are moving from data.frame to data.table, because the default drop behaviour is different:
> library(data.table)
> dt <- data.table(a=1:3, b=4:6)
> dt[,2]
b
1: 4
2: 5
3: 6
> df <- data.frame(a=1:3, b=4:6)
> df[,2]
[1] 4 5 6
> df[,2, drop=FALSE]
b
1 4
2 5
3 6
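Tibbles, which the question's data actually uses (note the # A tibble header), behave the same way as data.table here; a quick sketch:
library(tibble)
tb <- tibble(a = 1:3, b = 4:6)
tb[, 2]    # still a one-column tibble (no drop)
tb[[2]]    # extracts the vector: 4 5 6
tb$b       # same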

Replacing conditional values with previous values in r

I have some data on organism survival as a function of time. The data are constructed from the averages of many replicates at each time point, which can yield a forward time step with an increase in survival. Occasionally this results in a survivorship greater than 1, which is impossible. How can I conditionally change values greater than 1 to the value preceding them in the same column?
Here's what the data looks like:
>df
Generation Treatment time lx
1 0 1 0 1
2 0 1 2 1
3 0 1 4 0.970
4 0 1 6 0.952
5 0 1 8 0.924
6 0 1 10 0.913
7 0 1 12 0.895
8 0 1 14 0.729
9 0 2 0 1
10 0 2 2 1
I've tried mutating the column of interest as follows, which still yields values above 1:
library(dplyr)
df1 <- df %>%
  group_by(Generation, Treatment) %>%
  mutate(lx_diag = as.numeric(lx/lag(lx, default = first(lx)))) %>% #calculate running survival
  mutate(lx_diag = if_else(lx_diag > 1, lag(lx_diag), lx_diag)) #substitute values >1 with the previous value
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 1.01
10 12 1 19 0.76 1.04
I expect the results to look something like:
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 0.814
10 12 1 19 0.76 0.814
I know you can conditionally change the values to a specific value (i.e. ifelse with no else), but I haven't found any solutions that can conditionally change a value in a column to the value in the previous row. Any help is appreciated.
EDIT: I realized that mutate and if_else are quite efficient when it comes to converting values. Instead of replacing values in sequence from first to last, as I would have expected, the commands replace all values at the same time, so in a run of consecutive values >1 some will be left behind. Thus, if you just run the command
SurvTot1$lx_diag <- if_else(SurvTot1$lx_diag > 1, lag(SurvTot1$lx_diag), SurvTot1$lx_diag)
over and over again, you can get rid of the values >1. Not the most elegant solution, but it works.
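As a sketch, the same idea can be wrapped in a loop so it stops by itself once nothing is left above 1 (assumes dplyr is loaded for if_else() and lag()):
library(dplyr)
while (any(SurvTot1$lx_diag > 1, na.rm = TRUE)) {
  SurvTot1$lx_diag <- if_else(SurvTot1$lx_diag > 1,
                              lag(SurvTot1$lx_diag),
                              SurvTot1$lx_diag)
}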
This looks like a very ugly solution to me, but I couldn't think of anything else:
library(dplyr)
df = data.frame(
  "Generation" = rep(12, 10),
  "Treatment" = rep(1, 10),
  "Time" = c(seq(0, 14, by = 2), 15, 19),
  "lx_diag" = c(1, 1, 1, 0.996, 0.992, 0.968, 0.925, 0.814, 1.04, 1.04)
)
## bump the global lag counter k as a side effect
update_lag = function(x){
  k <<- k + 1
  x
}
k = 1
df %>%
  mutate(
    lx_diag2 = ifelse(lx_diag <= 1, update_lag(lx_diag), lag(lx_diag, n = k))
  )
Using the data from @Fino, here is my vectorized solution using base R:
vals.to.replace <- which(df$lx_diag > 1)
vals.to.substitute <- sapply(vals.to.replace, function(x)
  tail(df$lx_diag[which(df$lx_diag[1:x] <= 1)], 1))
df$lx_diag[vals.to.replace] <- vals.to.substitute
df
Generation Treatment Time lx_diag
1 12 1 0 1.000
2 12 1 2 1.000
3 12 1 4 1.000
4 12 1 6 0.996
5 12 1 8 0.992
6 12 1 10 0.968
7 12 1 12 0.925
8 12 1 14 0.814
9 12 1 15 0.814
10 12 1 19 0.814
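Another base R option, as a sketch using the last-observation-carried-forward idiom (starting again from @Fino's df, and assuming the first value is valid, i.e. <= 1):
lx <- df$lx_diag
lx[lx > 1] <- NA                            # blank out the impossible values
idx <- cummax(seq_along(lx) * !is.na(lx))   # index of the last valid value so far
df$lx_diag <- lx[idx]                       # carry it forward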

R extract or split an interval into vectors

What is this operation called and how do I achieve this? (I can't find an example.)
Given
temp1
Var1 Freq
1 (0,0.78] 0
2 (0.78,0.99] 0
3 (0.99,1.07] 0
4 (1.07,1.201] 1
5 (1.201,1.211] 0
6 (1.211,1.77] 2
How do I split the intervals in Var1 into two vectors for start and end?
Like this
df2
start end Freq
1 0.000 0.780 0
2 0.780 0.990 0
3 0.990 1.070 0
4 1.070 1.201 1
5 1.201 1.211 0
6 1.211 1.770 2
This is an XY problem. You shouldn't need to have that format to fix in the first place: if you created the intervals with cut(), you already have the break points.
E.g.:
x <- 1:10
brks <- c(0,5,10)
data.frame(table(cut(x,brks)))
# Var1 Freq
#1 (0,5] 5
#2 (5,10] 5
data.frame(start=head(brks,-1), end=tail(brks,-1), Freq=tabulate(cut(x,brks)))
# start end Freq
#1 0 5 5
#2 5 10 5
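That said, if you are stuck with only the "(a,b]" labels and no longer have the breaks, they can be parsed back out; a base R sketch (assuming non-negative numeric end points, as in the example):
lab <- as.character(temp1$Var1)
m <- regmatches(lab, gregexpr("[0-9.]+", lab))   # the two numbers in each label
df2 <- data.frame(start = as.numeric(sapply(m, `[`, 1)),
                  end = as.numeric(sapply(m, `[`, 2)),
                  Freq = temp1$Freq)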

How to get correlations between two variables with lags

I would like to check whether there is a correlation between "birds" and "wolfs" at different lags. Getting the correlation value is easy, but how can I address the lag issue (I need to check the correlation value for lags 1:4)? The output that I am looking for is a data table that contains the lag value and the related correlation value.
df <- read.table(text = " day birds wolfs
0 2 21
1 8 4
2 2 5
3 2 4
4 3 6
5 1 12
6 7 10
7 1 9
8 2 12", header = TRUE)
Output (not real results):
Lag CorValue
0 0.9
1 0.8
2 0.7
3 0.9
If you do this:
corLag<-ccf(df$birds,df$wolfs,lag.max=max(df$day))
it will return this:
Autocorrelations of series ‘X’, by lag
-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8
-0.028 0.123 -0.045 -0.019 0.145 -0.176 -0.082 -0.126 -0.296 0.757 -0.134 -0.180 0.070 -0.272 0.549 -0.170 -0.117
The first row is the lag, the second is the correlation value. You can check that cor(df$birds, df$wolfs) is indeed equal to -0.296.
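To get exactly the table asked for (lag and correlation value), the ccf object can be turned into a data frame; a sketch, with plot = FALSE to suppress the plot that ccf() draws by default:
corLag <- ccf(df$birds, df$wolfs, lag.max = 4, plot = FALSE)
data.frame(Lag = as.vector(corLag$lag), CorValue = as.vector(corLag$acf))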
