subset rows and columns in a dataframe based on boundary conditions - r

I have some problems expressing myself; that is probably why I haven't found anything that helps me yet. The example should make clear what I want.
Suppose I have an m x m matrix structure of coordinates, say ranging from A1 to E5, and I want to subset the rows/columns that are k lines away from the outer coordinates.
In my example k is 2. So I want to select all records in the data frame which have the coordinates B2, B3, B4, C2, C4, D2, D3, D4. Manually, I would do the following:
cc <- data.frame(x=(LETTERS[1:5]), y=c(rep(1,5),rep(2,5),rep(3,5), rep(4,5), rep(5,5)) , z=rnorm(25))
slct <- with(cc, which( (x=="B" | x=="C" | x=="D" ) & (y==2 | y==3 | y==4) & !(x=="C" & y==3) ))
cc[slct,] # result data frame
But if the matrix dimensions increase, this approach will not scale well. Any better ideas?

Rather hard to read but it does the trick.
m <- 5 # Matrix dimensions
k <- 2 # The index of the inner square that you want to extract
cc[(cc$x %in% LETTERS[c(k,m-k+1)] & !cc$y %in% c(1:(k-1), m:(m-k+2))) |
(cc$y %in% c(k, m-k+1) & !cc$x %in% LETTERS[c(1:(k-1), m:(m-k+2))]),]
The first line of comparisons extracts the k-th column from the left and right edges of the matrix, excluding the parts that are closer than k to the upper and lower edges. The second line does the same for rows.
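The same logic could be wrapped in a small helper for arbitrary m and k (a sketch; ring_subset and its argument names are made up, and it assumes coordinates are stored as in cc, x as letters and y as numbers):
# Sketch: select the ring of cells k steps in from the border of an m x m grid
ring_subset <- function(d, m, k) {
  edge_x <- LETTERS[c(k, m - k + 1)]              # k-th column from left and right
  edge_y <- c(k, m - k + 1)                       # k-th row from top and bottom
  trim_y <- c(1:(k - 1), m:(m - k + 2))           # rows closer than k to the border
  trim_x <- LETTERS[c(1:(k - 1), m:(m - k + 2))]  # columns closer than k to the border
  d[(d$x %in% edge_x & !d$y %in% trim_y) |
      (d$y %in% edge_y & !d$x %in% trim_x), ]
}
ring_subset(cc, m = 5, k = 2)  # same rows as the manual selection above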

cc$xy <- paste0(cc$x,cc$y)
coords <- c("B2","B3","B4", "C2", "C4", "D2", "D3", "D4")
cc[cc$xy %in% coords,]
# x y z xy
#7 B 2 -0.9031472 B2
#8 C 2 -0.1405147 C2
#9 D 2 1.6017619 D2
#12 B 3 1.7713041 B3
#14 D 3 -0.2005749 D3
#17 B 4 1.8671238 B4
#18 C 4 0.3428815 C4
#19 D 4 0.1470436 D4
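If typing out coords by hand gets tedious for larger matrices, the same vector can be built programmatically (a sketch for general m and k; it reproduces the B2...D4 ring above and reuses the xy column created here):
m <- 5; k <- 2
inner <- k:(m - k + 1)                               # indices of the inner square
grid  <- expand.grid(x = LETTERS[inner], y = inner)  # all cells of the inner square
ring  <- grid[grid$x %in% LETTERS[c(k, m - k + 1)] |
                grid$y %in% c(k, m - k + 1), ]       # keep only its border
coords <- paste0(ring$x, ring$y)
cc[cc$xy %in% coords, ]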

Related

Replacing Column Name with Loop in R

I have three data frames P1, P2, P3, each with three columns. I want to rename the second column of each data frame to D1, D2, D3 with a loop, but nothing is working. What am I missing?
C1 <- c(12,34,22)
C2 <- c(43,86,82)
C3 <- c(98,76,25)
C4 <- c(12,34,22)
C5 <- c(43,86,82)
C6 <- c(98,76,25)
C7 <- c(12,34,22)
C8 <- c(43,86,82)
C9 <- c(98,76,25)
P1 <- data.frame(C1,C2,C3)
P2 <- data.frame(C4,C5,C6)
P3 <- data.frame(C7,C8,C9)
x <- c("P1", "P2", "P3")
b <- c("D1","D2","D3")
for (V in b){
  names(x)[2] <- "V"
}
The output I would expect is:
P1 <- data.frame(C1,D1,C3)
P2 <- data.frame(C4,D2,C6)
P3 <- data.frame(C7,D3,C9)
We can use mget to get the values of the string vector as a list, use Map to rename the second column of each list element with the corresponding 'b' value, then use list2env to update those objects in the global environment.
list2env(Map(function(x, y) {names(x)[2] <- y; x}, mget(x), b), .GlobalEnv)
Output:
P1
# C1 D1 C3
#1 12 43 98
#2 34 86 76
#3 22 82 25
P2
# C4 D2 C6
#1 12 43 98
#2 34 86 76
#3 22 82 25
P3
# C7 D3 C9
#1 12 43 98
#2 34 86 76
#3 22 82 25
To understand the code, the first step is mget on the vector of strings:
mget(x)
which returns a list of data.frames.
Then we pass this list as an argument to Map along with the corresponding 'b' vector; each element of the list is paired with the corresponding element of the vector:
Map(function(x, y) x, mget(x), b)
The function(x, y) is an anonymous (lambda) function; inside it we set the name of the second column of 'x' to the corresponding value of 'b', which is passed in as 'y'. Finally we wrap everything in list2env: the result is a named list, so list2env looks up those names in the global environment and updates the corresponding objects.
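Breaking the one-liner into two steps may make the pipeline easier to follow (same logic, just with the intermediate named list stored in a variable):
# A named list of data frames with the second column renamed
renamed <- Map(function(x, y) {names(x)[2] <- y; x}, mget(x), b)
names(renamed)
# [1] "P1" "P2" "P3"
# list2env() writes each element back to the global environment under its name
list2env(renamed, .GlobalEnv)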
It is usually best to work with lists for this type of thing. That can make life easier and is generally a good workflow to learn.
list1 <- list(P1 = P1, P2 = P2, P3 = P3)
# here is your loop
for (i in seq_along(list1)) {
  names(list1[[i]])[2] <- b[i]
}
# you can use Map as well for iteration, similar to #akrun's solution
# this is really identical at this point, except you've created the list differently
Map(function(df, b) {names(df)[2] <- b; df}, list1, b)
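If you do need the renamed data frames as separate objects again, the list can be written back with list2env, as in the previous answer (a small optional step; list1 here is the list built above):
list1 <- Map(function(df, b) {names(df)[2] <- b; df}, list1, b)
# optionally recreate P1, P2, P3 in the global environment from the renamed list
list2env(list1, .GlobalEnv)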

iterating a loop to give a new value if the previous value already exists in a data frame in R

I have a data frame with this information here:
df <- data.frame("string1" = c("ABECDE","ABECDE","ABECDE"),
"string2" = c("ABCD","ABCD","ABCD"),
"site1" = NA, "site2" = NA, "combine" = NA, "filtered" = NA)
I would like to write code that picks the E and D sites in the strings and adds them to the data frame.
If the combination has already been created, I'd like it to go back and choose a new combination and check again until it gets one that has not been picked.
The code I have done so far gives this output:
string1 string2 site1 site2 combine filtered
1 ABECDE ABCD E3 D4 E3D4 E3D4
2 ABECDE ABCD E3 D4 E3D4 <NA>
3 ABECDE ABCD E3 D4 E3D4 <NA>
Here, E3D4 is the value you get when it first goes through the function.
I would now like it to go back and pick the next possible combinations, E6D4 and D5D4, for the next two lines, but I have no idea how to properly structure the iteration.
Here is the code I have so far (there is probably a less redundant way to write it, but I am a beginner, so apologies if it is overly long):
#make the columns of string1 and string2 into vectors
string1 <- df$string1
string2 <- df$string2
#for each string in the vector check to see first if it has an E, if not, then a D
#get the output as a letter and its position (eg E3)
for (i in 1:nrow(df)){
  if (grepl("E", string1[i])){
    sites1 <- gregexpr('E', string1[i])
    df$site1 <- paste0(substring(string1[i], sites1[[1]][1], sites1[[1]][1]), sites1[[1]][1])
  } else if (grepl("D", string1[i])){
    sites1 <- gregexpr('D', string1[i])
    df$site1 <- paste0(substring(string1[i], sites1[[1]][1], sites1[[1]][1]), sites1[[1]][1])
  }
}
#do the same for the second vector
for (i in 1:nrow(df)){
  if (grepl("E", string2[i])){
    sites2 <- gregexpr('E', string2[i])
    df$site2 <- paste0(substring(string2[i], sites2[[1]][1], sites2[[1]][1]), sites2[[1]][1])
  } else if (grepl("D", string2[i])){
    sites2 <- gregexpr('D', string2[i])
    df$site2 <- paste0(substring(string2[i], sites2[[1]][1], sites2[[1]][1]), sites2[[1]][1])
  }
}
#combine the sites
df$combine <- paste0(df$site1, df$site2)
#for each row of combined sites, check to see if the value is already created
for (i in 1:nrow(df)){
  if(!df$combine[i] %in% df$filtered){
    df$filtered[i] <- df$combine[i]
  } else if(df$combine[i] %in% df$filtered){
    #go back to the for loop and look for another E in the string
    #if there is none, go to the next condition (looking for a D)
    #pick the next possible values, put them together and check again
    #do this continuously until you get a unique combine
    #do this for string1 and then string2 (or alternating both, whichever is easier)
  }
}
Perhaps you could simplify and try the following.
Create a custom function that detects all positions of "D" and "E" in your strings. Then use expand.grid to get all combinations of these positions. In your example data, this means combining positions 3, 5, 6 with position 4 (in the end 3 combinations: (3, 4), (5, 4), and (6, 4)).
Then you can go through each of these combinations and create the desired strings by combining, with paste, the letter at each position with the position number. A list holds these results and is assembled at the end with rbind.
A few questions remain, including what should happen when no "D" or "E" letters are found.
my_fun <- function(x) {
  p1 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string1"]])))
  p2 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string2"]])))
  cbn <- expand.grid(p1, p2)
  lst <- list()
  for (i in seq_len(nrow(cbn))) {
    site1 <- paste0(substr(x[["string1"]], cbn[i, "Var1"], cbn[i, "Var1"]), cbn[i, "Var1"])
    site2 <- paste0(substr(x[["string2"]], cbn[i, "Var2"], cbn[i, "Var2"]), cbn[i, "Var2"])
    lst[[i]] <- c(string1 = x[["string1"]], string2 = x[["string2"]],
                  site1 = site1, site2 = site2, combine = paste0(site1, site2))
  }
  return(as.data.frame(do.call("rbind", lst)))
}
do.call(rbind, apply(df, 1, my_fun))
I created example data to test this out:
string1 string2 site1 site2 combine filtered
1 ABECDE ABCD NA NA NA NA
2 AABCDE ABCE NA NA NA NA
3 ABCDDE ACDD NA NA NA NA
Which would give the following output:
string1 string2 site1 site2 combine
1 ABECDE ABCD E3 D4 E3D4
2 ABECDE ABCD D5 D4 D5D4
3 ABECDE ABCD E6 D4 E6D4
4 AABCDE ABCE D5 E4 D5E4
5 AABCDE ABCE E6 E4 E6E4
6 ABCDDE ACDD D4 D3 D4D3
7 ABCDDE ACDD D5 D3 D5D3
8 ABCDDE ACDD E6 D3 E6D3
9 ABCDDE ACDD D4 D4 D4D4
10 ABCDDE ACDD D5 D4 D5D4
11 ABCDDE ACDD E6 D4 E6D4
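On the open question about strings that contain neither "D" nor "E": gregexpr() returns -1 in that case, which would produce nonsense positions. A minimal sketch of a guarded variant (my_fun_safe is just a made-up name wrapping the same logic and skipping such rows):
my_fun_safe <- function(x) {
  p1 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string1"]])))
  p2 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string2"]])))
  # gregexpr() returns -1 when the pattern is not found; skip such rows
  if (any(p1 == -1) || any(p2 == -1)) return(NULL)
  cbn <- expand.grid(p1, p2)
  lst <- list()
  for (i in seq_len(nrow(cbn))) {
    site1 <- paste0(substr(x[["string1"]], cbn[i, "Var1"], cbn[i, "Var1"]), cbn[i, "Var1"])
    site2 <- paste0(substr(x[["string2"]], cbn[i, "Var2"], cbn[i, "Var2"]), cbn[i, "Var2"])
    lst[[i]] <- c(string1 = x[["string1"]], string2 = x[["string2"]],
                  site1 = site1, site2 = site2, combine = paste0(site1, site2))
  }
  as.data.frame(do.call("rbind", lst))
}
do.call(rbind, apply(df, 1, my_fun_safe))  # NULL results are simply dropped by rbind()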

connecting groups of duplicates

I have some data which has lots of duplication. For example, this data frame shows IDs in the data set that are known to be identical (e.g. row 1 indicates a = b, and together the rows indicate that a = b = c and d = e = f):
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
duplicates <- cbind(a,b)
Is there an easy way to split these into groups of true IDs (e.g. here a, b and c are all the same, and d, e and f are also all the same)? So for my sample data:
a <- c('a','b','c','d','e','f')
b <- c('c1','c1','c1','c2','c2','c2')
new_id <- cbind(a,b)
The actual data has thousands of rows and is not fully connected (i.e. within a cluster of duplicates you can have a = b and a = c without a row stating b = c), due to some errors in duplicate detection.
Sounds like you are looking at network analysis. There are a few packages that deal with this, so you might want to use the one you are most familiar with (network, tidygraph, igraph, DiagrammeR). I use igraph, because I know that one a bit better than the others.
Steps:
First, create a graph from the dup data.frame. Next, use the clusters function (or one of the other clustering options) to find the connected clusters. The last step is to transform the clusters into a data.frame. Additionally, you could plot the graph (depending on how much data you have).
library(igraph)
g <- graph_from_data_frame(dup, directed = FALSE)
clust <- clusters(g)
clusters <- data.frame(name = names(clust$membership),
                       cluster = clust$membership,
                       row.names = NULL,
                       stringsAsFactors = FALSE)
clusters
name cluster
1 a 1
2 b 1
3 c 1
4 d 2
5 e 2
6 f 2
# plot graph if needed
plot(g)
data:
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
dup <- data.frame(a,b, stringsAsFactors = FALSE)
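If you want labels like c1/c2, as in the desired new_id output, you can paste a prefix onto the membership vector (a small extra step, not strictly necessary):
clusters$cluster <- paste0("c", clust$membership)
clusters
#   name cluster
# 1    a      c1
# 2    b      c1
# 3    c      c1
# 4    d      c2
# 5    e      c2
# 6    f      c2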
You could work with factors.
df.1$id <- with(df.1, ifelse(as.numeric(a) %in% 1:3, "c1", "c2"))
new_id <- unique(df.1[, -2])
rownames(new_id) <- NULL # just in case
Yielding
> new_id
a id
1 a c1
2 b c1
3 c c1
4 d c2
5 e c2
6 f c2
Data
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
df.1 <- data.frame(a, b)

R Compare non side-by-side duplicates in 2 columns

There are many similar questions, but I'd like to compare 2 columns and delete all the duplicates in both columns, so that only the unique observations in each column are left. Note: the duplicates are not side by side. If possible, I would also like a list of the duplicates (not just TRUE/FALSE). Thanks!
C1 C2
1 a z
2 c d
3 f a
4 e c
would become
C1 C2
1 f z
2 e d
with duplicate list
duplicates: a, c
Here is another answer
where_dupe <- which(apply(df, 2, duplicated), arr.ind = T)
Gives you the location of the duplicated elements within your original data frame.
col_unique <- setdiff(1:ncol(df), where_dupe)
Gives you which columns had no duplicates
You can find out the values by indexing.
df[,col_unique]
Here is a base R method using duplicated and lapply.
temp <- unlist(df)
# get duplicated elements
myDupeVec <- unique(temp[duplicated(temp)])
# get list without duplicates
noDupesList <- lapply(df, function(i) i[!(i %in% myDupeVec)])
noDupesList
$C1
[1] "f" "e"
$C2
[1] "z" "d"
data
df <- read.table(header=T, text=" C1 C2
1 a z
2 c d
3 f a
4 e c ", as.is=TRUE)
Note that this returns a list. This is a much more flexible structure, since in general a value may be repeated more than once in a particular variable, so the columns could end up with different lengths. If that is not the case, you can use do.call and data.frame to put the result into a rectangular structure.
do.call(data.frame, noDupesList)
C1 C2
1 f z
2 e d
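The list of duplicate values that the question asks for is already available in myDupeVec:
myDupeVec
# [1] "a" "c"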

recursive data subset based on column attributes in R

I have a data frame with 10K rows and 6 columns. The first two columns are factors.
A B C D E F
A1 B1 0.1 0.2 0.3 0.4
A2 B2 .........................
A1 B3 .........................
A1 B1 0.3 ...................
Now I want to generate models (using my function F) based on different subsets of the data (different rows), that is, different combinations of the levels of A and B.
In my example above, I would call my function F 6 times, once for each combination in the Cartesian product of A and B:
(A1, A2) x (B1, B2, B3). How can I do this efficiently in R without an explicit loop?
To avoid confusion: e.g. applying F to the (A1, B1) combination means, in this case, rows 1 and 4, columns 3 to 6; the other combinations are handled similarly.
Try:
lapply(seq_len(nlevels(df$A) * nlevels(df$B)) - 1, function(x)
  myFunction(df[df$A == paste0("A", 1 + floor(x / nlevels(df$B))) &
                  df$B == paste0("B", 1 + (x %% nlevels(df$B))), ]))
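A more readable alternative (a sketch, assuming myFunction accepts a data frame and that only columns 3 to 6 should be passed, as described in the question) is to let split() form the subsets for every A x B combination instead of building the level names by hand:
# Split rows by every A x B combination (element names like "A1.B1"), then model each subset
subsets <- split(df, list(df$A, df$B))   # add drop = TRUE to skip combinations with no rows
models  <- lapply(subsets, function(d) myFunction(d[, 3:6]))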
