R: Select either or, but not both

I am absolutely new to coding so please forgive me if this should be very easy to solve or to find - maybe it's so simple that nobody has bothered explaining so far or I just haven't been searching with the right keywords.
I have a column in my dataset that contains the letters f, n, i in all possible combinations. Now I want to find only those rows that contain either f or n, but not both of them. So that could be f, or fi, or n, or ni.
Then I want to compare those two sets of rows to each other in a boxplot. So ideally I would have two boxes: one with all the data points belonging to group f, including fi, and one with all the data points belonging to group n, including ni.
Example of my dataset:
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"), y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4))
D y
1 f 1.0
2 f 0.8
3 fi 1.1
4 n 2.1
5 ni 0.9
6 ni 8.8
7 fn 1.7
8 fn 5.4
Now what I want to get is this subset:
D y
1 f 1.0
2 f 0.8
3 fi 1.1
4 n 2.1
5 ni 0.9
6 ni 8.8
and then somehow have 1,2,3 and 4,5,6 in a group each, to plot in a boxplot.
So far I have only succeeded in getting a subset that has only entries with either f or n, but not fi, ni etc, which is not what I want, with this code:
df2<-df[df$D==c("f","n"),]
and in creating a subset that has all different groups with f and n:
df2 <- df[grepl("f", df$D) | grepl("n", df$D),]
I read about the "exclusive or" operator xor but when I try to use that like this:
df2 <- df[xor(match("n", df$D), match("f", df$D)),]
it just gives me a dataframe full of NAs. But even if that did work, I guess I would only be able to make a boxplot with four groups, f, n, fi and ni, where I want only two groups. So how can I get that code to work, and how do I go on from there?
I hope this is not too terrible for a first question! I am kind of bleary eyed after spending far too much time on this. Any help, about my problem, on where to look for the answer or on how to improve the question is very much appreciated!

I think your last example is pretty close. xor only works on logical values (TRUE and FALSE), but match returns integer positions instead. So just use grepl with xor:
xor(grepl("f", df$D), grepl("n", df$D))
Or you could get fancy:
# Reduce() is part of base R, so no extra package is needed
Reduce(xor, lapply(c("f", "n"), grepl, df$D))
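Putting this together with the example data from the question, the whole workflow could be sketched as follows. The names keep and group are only illustrative, and the grouping step assumes that every kept row contains exactly one of f or n, which the xor filter guarantees.

```r
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"),
                 y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4),
                 stringsAsFactors = FALSE)

# TRUE where D contains f or n, but not both
keep <- xor(grepl("f", df$D), grepl("n", df$D))
df2 <- df[keep, ]

# Collapse f/fi into "f" and n/ni into "n" (illustrative grouping column)
df2$group <- ifelse(grepl("f", df2$D), "f", "n")

# One box per group
boxplot(y ~ group, data = df2)
```

Because the grouping column has only two values, the boxplot shows two boxes rather than one per letter combination.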

We all cut our teeth on R at some point, so I'll try to construct an example for you that fits the question. How about:
# simulate a data.frame with "all possible combinations" of singles and pairs
df <- data.frame(txt = as.character(outer(c("i", "f", "n"), c("", "i", "f", "n"), paste0)),
stringsAsFactors = FALSE)
# create an empty factor variable to contain the result
df$has_only <- factor(rep(NA, nrow(df)), levels = 1:2, labels = c("f", "n"))
# replace with codes if contains either f or n, not both(f, n)
df$has_only[which(grepl("f", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "f"
df$has_only[which(grepl("n", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "n"
df
## txt has_only
## 1 i <NA>
## 2 f f
## 3 n n
## 4 ii <NA>
## 5 fi f
## 6 ni n
## 7 if f
## 8 ff f
## 9 nf <NA>
## 10 in n
## 11 fn <NA>
## 12 nn n
plot(df$has_only)
Note that this is a bar plot, not a box plot, since a box plot summarizes the distribution of a continuous variable, and you have not specified what the continuous values are or what they would look like. But if you did have such a variable, say df$myvalue, then you could produce a box plot with:
# simulate some continuous data
set.seed(50)
df$myvalue <- runif(nrow(df))
boxplot(myvalue ~ has_only, data = df)

Related

connecting groups of duplicates

I have some data which has lots of duplication. For example, this data frame shows IDs in the data set that are known to be identical (e.g. row1 indicates a =b, therefore the rest of the data indicate that a=b=c and d=e=f):
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
duplicates <- cbind(a,b)
Is there any easy way to split these into two groups that are true IDs (e.g. here a,b & c are all the same and d,e & f are also all the same). So for my sample data:
a <- c('a','b','c','d','e','f')
b <- c('c1','c1','c1','c2','c2','c2')
new_id <- cbind(a,b)
The actual data has thousands of rows and is not fully connected (i.e. in a cluster of duplicates this could occur: a=b, a=c,b=/=c), due to some errors in duplicate detection.
Sounds like you are looking at network analysis. There are a few packages that deal with this, so you might want to use the one you are most familiar with (network, tidygraph, igraph, DiagrammeR). I use igraph because I know it a bit better than the others.
Steps:
First, create a graph from the data using the dup data.frame. Next, use the clusters function (named components in current igraph releases, or one of the other clustering options) to group the nodes. The last step is to transform the clusters into a data.frame. Additionally, you could plot the graph (depending on how much data you have).
library(igraph)
g <- graph_from_data_frame(dup, directed = FALSE)
clust <- clusters(g)
clusters <- data.frame(name = names(clust$membership),
cluster = clust$membership,
row.names = NULL,
stringsAsFactors = FALSE)
clusters
name cluster
1 a 1
2 b 1
3 c 1
4 d 2
5 e 2
6 f 2
# plot graph if needed
plot(g)
data:
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
dup <- data.frame(a,b, stringsAsFactors = FALSE)
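If installing a graph package is not an option, the same grouping can be approximated in base R. The following is only a sketch of a simple label-propagation approach, not what igraph does internally; label_components is a hypothetical helper name. Like the igraph answer, it merges everything that is transitively linked into one cluster.

```r
# Edge list of known duplicate pairs (same data as above)
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')

# Hypothetical helper: label connected components by repeatedly
# pushing the smallest label across every edge until nothing changes
label_components <- function(from, to) {
  nodes <- sort(unique(c(from, to)))
  label <- setNames(seq_along(nodes), nodes)
  repeat {
    old <- label
    for (k in seq_along(from)) {
      m <- min(label[[from[k]]], label[[to[k]]])
      label[[from[k]]] <- m
      label[[to[k]]] <- m
    }
    if (identical(old, label)) break
  }
  # Renumber the surviving labels as 1, 2, ...
  data.frame(name = nodes,
             cluster = match(label, unique(label)),
             stringsAsFactors = FALSE)
}

new_id <- label_components(a, b)
new_id
```

For thousands of rows this loop is slow compared with igraph, so it is mainly useful for checking small cases.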
You could work with factors.
df.1$id <- with(df.1, ifelse(as.numeric(a) %in% 1:3, "c1", "c2"))
new_id <- unique(df.1[, -2])
rownames(new_id) <- NULL # just in case
Yielding
> new_id
a id
1 a c1
2 b c1
3 c c1
4 d c2
5 e c2
6 f c2
Data
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
df.1 <- data.frame(a, b, stringsAsFactors = TRUE) # factors are needed for as.numeric(a) above

permute dataframe but must have unique rows

Say I have a dataframe like this:
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d
time side id
1 1 L 1
2 3 R 2
3 5 R 1
4 6 L 2
5 11 L 4
6 15 L 3
7 15 L 4
8 18 R 2
9 18 R 1
10 20 R 1
I wish to permute the id variable and keep the other two constant. However, importantly, in my final permutations I do not want to have the same id on the same side at the same time. For instance, there are two times/sides where this might occur. In the original data at time 15 and 18 there are two unique ids at the same side (left for time 15 and right for time 18). If I permute using sample there is a chance that the same id shows up at the same time/side combination.
For example,
set.seed(11)
data.frame(time=d$time, side=d$side, id=sample(d$id))
time side id
1 1 L 1
2 3 R 1
3 5 R 4
4 6 L 1
5 11 L 4
6 15 L 2
7 15 L 3
8 18 R 2
9 18 R 2
10 20 R 1
Here, id=2 appears on two rows at time 18 on side "R". This is not allowed in the permutation I need.
One solution would be to brute force this, e.g. say I needed 100 permutations, I could generate 500 and discard those that fail the criteria. However, in my real data I have hundreds of rows and just using sample almost always leads to a failure. I wonder if there is a better algorithm for doing this? Perhaps a birth-death algorithm?
Setup:
library(tidyverse)
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d <- rownames_to_column(d)
I want the rownames to put it back in order at the end.
You need a function that takes a vector (like your id vector) and returns a sample of size n with the constraint that the values have to be different, as in the following (which assumes the sampling you want can actually take place, i.e. you haven't run out of items to sample). For convenience this also returns the "leftovers" that weren't sampled:
samp_uniq_n <- function(vec, n) {
  x <- vec
  out <- rep(NA, n)
  for(i in 1:n) {
    # Here would be a good place to make sure sampling is even possible.
    probs <- prop.table(table(x))
    out[i] <- sample(unique(x), 1, prob=probs)
    x <- x[x != out[i]]
    vec <- vec[-min(which(vec == out[i]))]
  }
  return(list(out=out, vec=vec))
}
Now, we need to split the data into a list of rows that have the same time and side and start the sampling with the largest such:
id <- d$id
d_split <- d %>% select(-id) %>% split(., list(d$time, d$side), drop = TRUE)
d_split_desc <- d_split[order(-sapply(d_split, nrow))]
Then we can do the sampling itself:
for(i in seq_along(d_split_desc)) {
  samp <- samp_uniq_n(id, nrow(d_split_desc[[i]]))
  this_id <- samp$out
  d_split_desc[[i]]$id <- this_id
  id <- samp$vec
}
Finally, some cleanup:
d_permute <- do.call(rbind, d_split_desc) %>%
arrange(as.numeric(rowname)) %>%
select(-rowname)
Putting all this in a big function is an annoyance I'll leave to anyone who is interested.
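Whichever sampling approach is used, it is worth checking the result against the constraint. A minimal base-R sketch of such a check is below; is_valid_permutation is a hypothetical helper name, and bad_id is copied from the failing sample() output shown in the question.

```r
d <- data.frame(time = c(1, 3, 5, 6, 11, 15, 15, 18, 18, 20),
                side = c("L", "R", "R", "L", "L", "L", "L", "R", "R", "R"),
                id = c(1, 2, 1, 2, 4, 3, 4, 2, 1, 1),
                stringsAsFactors = FALSE)

# Hypothetical helper: TRUE if no id repeats within a time/side combination
is_valid_permutation <- function(time, side, id) {
  !any(duplicated(data.frame(time, side, id)))
}

# The original data satisfies the constraint
ok_original <- is_valid_permutation(d$time, d$side, d$id)

# ids from the failing permutation in the question (id 2 twice at time 18, side R)
bad_id <- c(1, 1, 4, 1, 4, 2, 3, 2, 2, 1)
ok_bad <- is_valid_permutation(d$time, d$side, bad_id)
```

The same function could serve as the filter in the brute-force approach, or as a sanity check on d_permute.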

Using a custom summary function for factors within multiple columns

I conducted a survey with a large number of items, each of which has distinct categorical response options stored as factors. I need to summarize these columns in an efficient manner, preferably with functionality like that provided by forcats::fct_count(). I also need to know how many non-NA responses were provided for each variable, since different items were shown to different respondents. I wrote a function to make a tidy little summary data frame, but am struggling to efficiently run this function along each column and then combine the results into a single object (à la ddply).
I've tried sapply(), gather()-ing the data to long format and then running ddply(), but the problem of the distinct levels for each variable seems to keep getting in the way. See below for a reproducible example of the data set and my summarizing function. I could run the function for each variable (as shown below), but I know there's gotta be a more efficient way to do this that doesn't involve creating a ton of individual summary data-frame objects. Thanks for any help you can provide.
data <- data.frame(
ID = c(1:50),
X = as.factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
Y = as.factor(sample(c("a", "b", "c", NA), 50, replace = TRUE)),
Z = as.factor(sample(c("d", "e", "f", "g", "h", NA), 50, replace = TRUE))
)
library(tidyverse)
library(forcats)
factorsummaries.f <- function(x) {
x <- na.omit(x)
counts <- fct_count(fct_drop(x), sort = T)
counts$f <- as.character(counts$f)
total <- data.frame(f = "sum", n = as.numeric(sum(counts$n)))
return(bind_rows(counts, total))
}
factorsummaries.f(data$X)
factorsummaries.f(data$Y)
Perhaps you are looking for purrr::map_dfr
map_dfr(data[,2:ncol(data)], factorsummaries.f, .id = "colname")
#output
colname f n
<chr> <chr> <dbl>
1 X no 18
2 X yes 17
3 X sum 35
4 Y a 14
5 Y c 13
6 Y b 12
7 Y sum 39
8 Z g 10
9 Z d 9
10 Z h 8
11 Z f 6
12 Z e 5
13 Z sum 38
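For comparison, the same apply-per-column-then-stack pattern can be sketched in base R. This is only an illustration: summarise_factor is a hypothetical stand-in for factorsummaries.f, and the small survey data frame uses made-up values so the result is reproducible.

```r
# Small deterministic stand-in for the survey data (illustrative values)
survey <- data.frame(
  ID = 1:6,
  X = factor(c("yes", "no", NA, "yes", "yes", NA)),
  Y = factor(c("a", "b", "c", NA, "a", "a"))
)

# Hypothetical base-R counterpart of factorsummaries.f():
# counts per observed level, sorted descending, plus a "sum" row
summarise_factor <- function(x) {
  counts <- sort(table(droplevels(x[!is.na(x)])), decreasing = TRUE)
  data.frame(f = c(names(counts), "sum"),
             n = c(as.numeric(counts), sum(counts)),
             stringsAsFactors = FALSE)
}

# Apply to every column except ID, tagging each piece with its column name
pieces <- lapply(names(survey)[-1], function(nm) {
  cbind(colname = nm, summarise_factor(survey[[nm]]), stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
result
```

Converting the level names to character before stacking is what avoids the clash between the different factor levels of each column.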

Plotting data using vectors of different length in R

I want to plot several files in the same figure; each file has two-column data.
The problem is that each file has a different number of rows (529,567,660, etc)
For data with same number of rows I did the following:
data1 <- read.table(file="ro0.2/T0.1/sq_Ave.dat")
x1 <- data1[1]
y1 <- data1[2]
data2 <- read.table(file="ro0.4/T0.1/sq_Ave.dat")
x2 <- data2[1]
y2 <- data2[2]
max_valuex = max(x1,x2,x3,x4,x5)
max_valuey = max(y1,y2,y3,y4,y5)
matplot(x1,cbind(y1,y2,y3,y4,y5),type="l",
col=c("black","red","green","blue","orange"),
lwd = 2,xlab = expression(q*sigma), ylab="S(q)", col.lab="black",
cex.lab=1.5,font.lab=4, xaxt = "n", yaxt = "n", xlim = c(0,max_valuex),
ylim = c(0,max_valuey), xaxs = "i", yaxs = "i")
However, this does not work for files with different number of rows.
R complains with:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 529, 567, 661
Calls: matplot -> ncol -> as.matrix -> cbind -> cbind -> data.frame
Any idea or suggestion would be greatly appreciated!
Thanks a lot in advance
S H-V
I guess you could enlarge your vectors with NAs. I believe this won't cause problems when you handle the data later on. E.g.:
a= 1:10
b=1:5
d=1:7
data.frame(a,b,d) #different length
#Error in data.frame(a, b, d) :
#arguments imply differing number of rows: 10, 5, 7
length(b) = length(d) = length(a)
data.frame(a,b,d) # no error now
a b d
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 NA 6
7 7 NA 7
8 8 NA NA
9 9 NA NA
10 10 NA NA
Even if you manage to load the y-axis data into a list object, which would be the natural data type in R for storing vectors of variable length, in the next step you will get something like:
> matplot(matrix(1:100, nrow=10, ncol=10)[, 1], matrix(1:100, nrow=12, ncol=10))
Error in matplot(matrix(1:100, nrow = 10, ncol = 10)[, 1], matrix(1:100, :
'x' and 'y' must have same number of rows
A scatterplot function like plot or matplot needs complete x,y tuples, but in your code you are only using x1 as the x-values. Your code example is not complete. You are also loading x2, x3, ... but only use them to calculate xlim. Why would you calculate xlim including the maxima of all x if you do not intend to plot them?
It is therefore not clear to me what the final plot should look like, and whether or not a scatterplot is the correct visualization of the data. You might want to give more details about what your data in fact consists of and how incomplete data should be handled.
Could it be that you want to plot several line plots into a single figure, using add=TRUE? Each call then gets its own x and y columns:
matplot(data1[,1], data1[,2], type = "l", xlim = c(0,max_valuex), ylim = c(0,max_valuey))
matplot(data2[,1], data2[,2], type = "l", add=TRUE)
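The same idea can be written as a loop over any number of series, each with its own x and y. The sketch below uses base plot() and lines() with simulated data standing in for the files; the row counts 529 and 567 come from the question, while the values themselves are made up.

```r
# Two illustrative series with different numbers of rows
# (stand-ins for the files read with read.table())
data1 <- data.frame(x = seq(0, 10, length.out = 529))
data1$y <- sin(data1$x)^2
data2 <- data.frame(x = seq(0, 12, length.out = 567))
data2$y <- 1.5 * cos(data2$x)^2
series <- list(data1, data2)

# Axis limits that cover every series
max_valuex <- max(sapply(series, function(s) max(s[[1]])))
max_valuey <- max(sapply(series, function(s) max(s[[2]])))

cols <- c("black", "red")

# Draw an empty frame first, then add one line per series
plot(0, 0, type = "n", xlim = c(0, max_valuex), ylim = c(0, max_valuey),
     xlab = expression(q * sigma), ylab = "S(q)", xaxs = "i", yaxs = "i")
for (k in seq_along(series)) {
  lines(series[[k]][[1]], series[[k]][[2]], col = cols[k], lwd = 2)
}
```

Since every series carries its own x values, the differing lengths never have to be combined into one matrix.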

How to ddply() without sorting?

I use the following code to summarize my data, grouped by Compound, Replicate and Mass.
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
.fun = calculate_T60_Over_T0_Ratio)
An unfortunate side effect is that the resulting data frame is sorted by those fields. I would like to do this and keep Compound, Replicate and Mass in the same order as in the original data frame. Any ideas? I tried adding a "Sorting" column of sequential integers to the original data, but of course I can't include that in the .variables since I don't want to 'group by' that, and so it is not returned in the summaryDataFrame.
Thanks for the help.
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
col <- ".sortColumn"
data[,col] <- 1:nrow(data)
out <- fn(data, ...)
if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
out <- out[order(out[,col]),]
out[,col] <- NULL
out
}
#Some sample data
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")
#This one resorts
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d
#This one does not
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d
Please do read the thread for Hadley's notes about why this functionality may not be general enough to roll into ddply, particularly as it probably applies in your case as you are likely returning fewer rows with each piece.
Edited to include a strategy for more general cases
If ddply is outputting something that is sorted in an order you do not like you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.
For instance, consider the following data:
d <- data.frame(x1 = rep(letters[1:3],each = 5),
x2 = rep(letters[4:6],5),
x3 = 1:15,stringsAsFactors = FALSE)
Here we use strings, for now. ddply will sort the output, which in this case means the default lexical ordering:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:
d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)
Now when we use ddply, the resulting sort will be as we intend:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 b d 17
2 b f 15
3 b e 8
4 a d 5
5 a f 3
6 a e 7
7 c d 13
8 c f 27
9 c e 25
The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
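If the order you want is simply the order in which the groups first appear in the data, the levels can be taken from unique(). The sketch below illustrates this with base R's aggregate() as a stand-in for ddply, on made-up data; it assumes ddply follows factor levels in the same way, as the output above suggests.

```r
d <- data.frame(x1 = c("b", "b", "a", "c", "a", "c"),
                x3 = 1:6,
                stringsAsFactors = FALSE)

# Levels in order of first appearance instead of alphabetical order
d$x1 <- factor(d$x1, levels = unique(d$x1))

# Groups come out in level order: b, a, c
res <- aggregate(x3 ~ x1, data = d, FUN = sum)
res
```

This avoids typing the level order by hand when the original row order already encodes it.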
I eventually ended up adding an 'indexing' column to the original data frame. It consisted of two columns pasted with sep="_". Then I made another data frame made of only unique members of the 'indexing' column and a counter 1:length(df). I did my ddply() on the data which returned a sorted data frame. Then to get things back in the original order I did merge() the results data frame and the index data frame (making sure the columns are named the same thing makes this easier). Finally, I did order and removed the extraneous columns.
Not an elegant solution, but one that works.
Thanks for the assist. It got me thinking in the right direction.
