Subsetting with variable selection range - r

I have to make a set of selections that vary by day on this dataset (dat), which is composed of species (sp), day (day, in POSIXct) and area (ar):
sp day ar
A 1-Jan-00 2
B 1-Jan-00 6
C 2-Jan-00 2
A 2-Jan-00 1
D 2-Jan-00 4
E 2-Jan-00 12
F 3-Jan-00 8
A 4-Jan-00 3
G 4-Jan-00 2
B 4-Jan-00 1
I need to subset where species "A" occurs. However, the areas to be selected will vary by day, given by this matrix (dat.ar):
day ar.select
1-Jan-00 (1,6)
2-Jan-00 (1,12)
3-Jan-00 (4,8)
4-Jan-00 (3,12)
More specifically, for areas where species "A" occurs, on 1-jan-00, I need only areas 1 and 6. For 2-jan-00, areas 1 and 12, and so on.
The desired output for this example is given below:
sp day ar
A 2-Jan-00 1
A 4-Jan-00 3
I haven't had much success writing a for loop, as I am still trying to learn the semantics of R. In summary, I have a rough idea of what must be done, but I am still struggling with the language. Here is a sketch of where I think this should go:
dat1 = with(dat,sapply(day[sp=="A" & dat.ar$day.s[i] ],
function(x) ar == (ar[sp=="A" & day == x]==dat.ar$ar.select[j])
final=dat[rowSums(dat1) > 0, ]
I believe I have to write a for loop that goes through dat.ar, specifying the areas to be selected in dat. But despite my efforts to get the for loop working, I haven't gotten anywhere near a solution. I am not even sure if combining sapply and a for loop is the right way to go about this.
In case someone wishes to reproduce the problem:
sp=c("A","B","C","A","D","E","F","A","G","B")
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "3-Jan-00", "4-Jan-00", "4-Jan-00", "4-Jan-00")
day=as.POSIXct(day, format="%d-%b-%y")
ar=c(2,6,2,1,4,12,8,3,2,1)
dat= as.data.frame(cbind(sp, day, ar))
day.s=c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s=as.POSIXct(day.s, format="%d-%b-%y")
a.s=c(1,1,4,3)
a.e=c(6,12,8,12)
ar.select=paste(a.s, a.e, sep=",")
dat.ar=cbind(day.s, ar.select)
Any help is much appreciated.

You could merge your table of conditions to the original dataset and filter them conditionally. Consider a1 and a2 like your sp and day values, and obs to be like your ar value.
library(data.table)
dataset <- data.table(
a1 = c("A","B","C","B","A","A","A","A"),
a2 = c("P","Q","Q","Q","R","R","P","Q"),
obs = c(3,2,3,4,2,4,8,0)
)
constraints <- data.table(
a1 = c("A","B","C","A","B","C","A","B","C"),
a2 = c("P","P","P","Q","Q","Q","R","R","R"),
lower = c(1,2,3,4,3,2,3,2,5),
upper = c(6,4,5,7,5,6,5,3,7)
)
checkingdataset <- merge(dataset,constraints, by = c("a1","a2"), all.x = TRUE)
checkingdataset[obs <= upper & obs >= lower, obs.keep := TRUE]
# a1 a2 obs lower upper obs.keep
#1: A P 3 1 6 TRUE
#2: A P 8 1 6 NA
#3: A Q 0 4 7 NA
#4: A R 2 3 5 NA
#5: A R 4 3 5 TRUE
#6: B Q 2 3 5 NA
#7: B Q 4 3 5 TRUE
#8: C Q 3 2 6 TRUE
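For readers who are not using data.table, the same merge-then-filter idea can be sketched in base R. This reuses the toy data from the answer above; the names `checked` and `kept` are hypothetical, and the filter is a lower/upper range check, which is an assumption about how the constraints should be read.

```r
# Same toy data as above, as plain data.frames
dataset <- data.frame(
  a1  = c("A","B","C","B","A","A","A","A"),
  a2  = c("P","Q","Q","Q","R","R","P","Q"),
  obs = c(3,2,3,4,2,4,8,0)
)
constraints <- data.frame(
  a1    = c("A","B","C","A","B","C","A","B","C"),
  a2    = c("P","P","P","Q","Q","Q","R","R","R"),
  lower = c(1,2,3,4,3,2,3,2,5),
  upper = c(6,4,5,7,5,6,5,3,7)
)

# Attach the bounds to every observation (left join on a1 and a2)
checked <- merge(dataset, constraints, by = c("a1", "a2"), all.x = TRUE)

# Keep only rows whose obs falls inside [lower, upper];
# which() drops rows where the comparison is NA
kept <- checked[which(checked$obs >= checked$lower &
                      checked$obs <= checked$upper), ]
kept
```

Note that a range check is not the same as the set membership described in the question (areas 1 and 6 only, not 1 through 6), so this form fits only when the selection really is an interval.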

First, I would not use as.data.frame(cbind(...)) to make your data.frames. Second, I would create dat.ar in much the same structure that you have created dat. Third, I would then just use merge to get the result you are looking for.
dat <- data.frame(sp=c("A","B","C","A","D","E","F","A","G","B"),
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "2-Jan-00", "3-Jan-00", "4-Jan-00",
"4-Jan-00", "4-Jan-00"),
ar=c(2,6,2,1,4,12,8,3,2,1))
dat$day <- as.POSIXct(dat$day, format="%d-%b-%y")
day.s <- c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s <- as.POSIXct(day.s, format="%d-%b-%y")
a.s <- c(1,1,4,3)
a.e <- c(6,12,8,12)
ar.select <- paste(a.s, a.e, sep=",")
dat.ar <- data.frame(sp = "A", day = day.s, ar = ar.select)
dat.ar <- cbind(dat.ar[-3],
read.csv(text = as.character(dat.ar$ar), header = FALSE))
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
dat.ar
# sp day variable ar
# 1 A 2000-01-01 V1 1
# 2 A 2000-01-02 V1 1
# 3 A 2000-01-03 V1 4
# 4 A 2000-01-04 V1 3
# 5 A 2000-01-01 V2 6
# 6 A 2000-01-02 V2 12
# 7 A 2000-01-03 V2 8
# 8 A 2000-01-04 V2 12
merge(dat, dat.ar)
# sp day ar variable
# 1 A 2000-01-02 1 V1
# 2 A 2000-01-04 3 V1
Of course, I would just suggest that you make your dat.ar object in a more friendly manner to begin with. Why paste values together if you are going to separate them out later anyway? ;)
dat.ar <- data.frame(sp = "A",
day = c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00"),
a.s = c(1,1,4,3), a.e = c(6,12,8,12))
dat.ar$day <- as.POSIXct(dat.ar$day, format="%d-%b-%y")
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
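If you would rather not depend on reshape2, the wide-to-long step can be sketched in base R by stacking the two area columns with rbind before merging. This is only a sketch under a few assumptions: the dates are built with as.Date instead of parsing "1-Jan-00" (to sidestep locale-dependent month names), and `wide` and `final` are hypothetical names.

```r
dat <- data.frame(
  sp  = c("A","B","C","A","D","E","F","A","G","B"),
  day = as.Date("2000-01-01") + c(0,0,1,1,1,1,2,3,3,3),
  ar  = c(2,6,2,1,4,12,8,3,2,1)
)

# One row per day, with the two allowed areas in separate columns
wide <- data.frame(
  sp  = "A",
  day = as.Date("2000-01-01") + 0:3,
  a.s = c(1,1,4,3),
  a.e = c(6,12,8,12)
)

# Stack a.s and a.e into a single "ar" column (long format)
dat.ar <- rbind(
  data.frame(wide[c("sp", "day")], ar = wide$a.s),
  data.frame(wide[c("sp", "day")], ar = wide$a.e)
)

# Inner join on the shared columns sp, day and ar
final <- merge(dat, dat.ar)
final
```

Because the join is on all three shared columns, only rows of dat whose species, day and area all appear in dat.ar survive.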

Related

Sort a dataframe based on a character column containing letters followed by numbers in R [duplicate]

This question already has an answer here:
Order a "mixed" vector (numbers with letters)
I have a dataframe like this
Day <- c("Day1","Day20","Day5","Day10")
A <- c (5,7,2,0)
B <- c(15,12,16,30)
df <- data.frame(Day,A,B)
df$Day <- as.character(df$Day)
The first column is a character vector, so I used the solution below to sort the data frame. It does not quite work, since it only sorts the first column and leaves columns 2 and 3 unchanged.
df$Day <- df$Day[order(nchar(df$Day), df$Day)]
My desired output is
Day A B
Day1 5 15
Day5 2 16
Day10 0 30
Day20 7 12
What am I missing here? Kindly provide some inputs.
You can try using something like this that does numeric day sorting:
Day <- c("Day1","Day20","Day5","Day10")
A <- c (5,7,2,0)
B <- c(15,12,16,30)
df <- data.frame(Day,A,B, stringsAsFactors = FALSE)
df$DayNum <- as.numeric(gsub('Day', '', df$Day))
df <- df[order(df$DayNum), ]
Output as follows:
df
Day A B DayNum
1 Day1 5 15 1
3 Day5 2 16 5
4 Day10 0 30 10
2 Day20 7 12 20
You can avoid creating a new column by doing the following (was trying to show full detail of what was going on):
df <- df[order(as.numeric(substr(df$Day, 4, nchar(df$Day)))), ]
Output will be same as above.
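As a side note on why the attempt in the question did not work: assigning the reordered vector back to df$Day rearranges that one column only. A minimal sketch of the same nchar-based ordering applied to whole rows is below; `ord` and `df_sorted` are hypothetical names, and the approach assumes every label shares the same prefix ("Day") so that string length tracks the number of digits.

```r
Day <- c("Day1","Day20","Day5","Day10")
A <- c(5,7,2,0)
B <- c(15,12,16,30)
df <- data.frame(Day, A, B, stringsAsFactors = FALSE)

# Compute the row order once: shorter labels first, then alphabetical
ord <- order(nchar(df$Day), df$Day)

# Index the rows of the data frame, so all columns move together
df_sorted <- df[ord, ]
df_sorted
```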
This could be done with mixedorder from library(gtools)
library(gtools)
df[mixedorder(df$Day),]
# Day A B
#1 Day1 5 15
#3 Day5 2 16
#4 Day10 0 30
#2 Day20 7 12
Day <- c("Day1","Day20","Day5","Day10")
A <- c (5,7,2,0)
B <- c(15,12,16,30)
df <- data.frame(Day,A,B, stringsAsFactors = FALSE)
# add leading zero(s) to digits in values of Day column,
# e.g., "Day5" --> "Day05"
# then return the indices of the sorted vector
indices_to_sort_by <- sort(
sub(
pattern = "([a-z]{1})([1-9]{1}$)",
replacement = "\\10\\2",
x = df$Day
),
index.return = TRUE)$ix
df[indices_to_sort_by, ]
# Day A B
# 1 Day1 5 15
# 3 Day5 2 16
# 4 Day10 0 30
# 2 Day20 7 12

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df<-data.frame()
for(i in 1:dim(a)[1]){
s<-seq(a[i,1],a[i,2])
df<-rbind(df,data.frame(s,rep(a[i,3],length(s))))
}
colnames(df)<-c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
1. Use mapply to create a list of your ranges from "start" to "end".
2. Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
If you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
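Another base R option, offered here only as a sketch, avoids the intermediate list entirely by using rep together with sequence. It assumes integer coordinates with start <= end on every row; `n` and `res` are hypothetical names.

```r
a <- data.frame(start = 1:4, end = 3:6, group = c("A","B","C","D"))

# Number of positions covered by each row
n <- a$end - a$start + 1L

res <- data.frame(
  # Repeat each start n times, then add the offsets 0, 1, ..., n - 1
  V1 = rep(a$start, n) + sequence(n) - 1L,
  # Repeat each label to match
  V2 = rep(a$group, n)
)
res
```

Like the mapply version, this does not require one row per group.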

Match a Variable to Another Dataset Based on Multiple Overlapping Variables

I have two data sets with some overlapping variables. One dataset is basically a subset of the other but needs an additional variable added based on some of the overlapping variables. For example
varA <- c(rep(c("a","b"), each=5))
blah <- c(11:20)
varB <- c(1:10)
speed <- rnorm(10)
dataset1 <- data.frame(varA,blah,varB,speed)
varA.2 <- c("a","a","b","b")
varB.2 <- c(2,10,11,7)
speed.2 <- rep(NA, 4)
dataset2 <- data.frame(varA.2, varB.2, speed.2)
dataset2
I would like the "speed.2" variable to contain the speed values for the lines where varA and varB are matching between the two sets.
I've tried something with "merge" but am having issues.
Thank you!
Maybe:
colnames(dataset2) <- gsub("\\..*","", colnames(dataset2))
library(dplyr)
left_join(dataset2[,-3],dataset1[,-2])
# Joining by: c("varA", "varB")
# varA varB speed
#1 a 2 -1.3243815
#2 a 10 NA
#3 b 11 NA
#4 b 7 -0.6026936
Or without changing the column names.
merge(dataset1[,-2],dataset2[,-3], by.x=c("varA","varB"), by.y=c("varA.2", "varB.2"), all.y=TRUE)
# varA varB speed
# 1 a 2 -0.6797753
# 2 a 10 NA
# 3 b 7 -2.1838454
# 4 b 11 NA
The speed values differ between the two outputs because the example was generated without set.seed().
You can use the match function to find the rows "where varA and varB are matching":
dataset2$speed.2 = dataset1[match(paste(dataset2$varA.2,dataset2$varB.2),
paste(dataset1$varA, dataset1$varB)),]$speed
dataset2
varA.2 varB.2 speed.2
1 a 2 0.3917783
2 a 10 NA
3 b 11 NA
4 b 7 1.3265439
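A reproducible variant of the match approach is sketched below. The seed value, the "|" separator and the names `key1` and `key2` are assumptions made for illustration; the separator only needs to be a string that cannot occur inside the values being pasted.

```r
set.seed(42)  # arbitrary seed so the speed values are repeatable
dataset1 <- data.frame(varA = rep(c("a","b"), each = 5),
                       blah = 11:20,
                       varB = 1:10,
                       speed = rnorm(10))
dataset2 <- data.frame(varA.2 = c("a","a","b","b"),
                       varB.2 = c(2,10,11,7))

# Build one composite key per row in each data set
key1 <- paste(dataset1$varA,   dataset1$varB,   sep = "|")
key2 <- paste(dataset2$varA.2, dataset2$varB.2, sep = "|")

# match() gives, for each row of dataset2, the matching row of dataset1 (or NA)
dataset2$speed.2 <- dataset1$speed[match(key2, key1)]
dataset2
```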

Remove rows of a data set belonging to a factor of specified length

I have a data.frame similar to the following:
df <- data.frame(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
Haplotype1 = rep(1:4,2),
Haplotype2 = rep(5:8,2))
> df
population individual Haplotype1 Haplotype2
1 AA A1 1 5
2 AA A2 2 6
3 AA A3 3 7
4 BB B1 4 8
5 BB B2 1 5
6 CC C1 2 6
7 CC C2 3 7
8 CC C3 4 8
I want to create a new dataset where any population consisting of fewer than a
specified number of individuals is omitted from the dataset. For example, I
want to reanalyze the data for only populations with three or more
individuals. The following is the dataset I want:
> df <- df[!df$population=="BB",]
> df
population individual Haplotype1 Haplotype2
1 AA A1 1 5
2 AA A2 2 6
3 AA A3 3 7
6 CC C1 2 6
7 CC C2 3 7
8 CC C3 4 8
However, I have 400 populations ranging in size from 5 to 155 individuals, and
manually picking populations out by name is not feasible. I want to write a
function where I say in essence "give me a dataset with all populations
consisting of X number of individuals or more and delete those with less than
X." Any help or feedback is appreciated.
This should do the trick:
tab <- table(df$population) > 2
df[df$population %in% names(tab)[tab], ]
# population individual Haplotype1 Haplotype2
# 1 AA A1 1 5
# 2 AA A2 2 6
# 3 AA A3 3 7
# 6 CC C1 2 6
# 7 CC C2 3 7
# 8 CC C3 4 8
The most direct approach I can think of is to use data.table() from the "data.table" package:
library(data.table)
DT <- data.table(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
Haplotype1 = rep(1:4,2), Haplotype2 = rep(5:8,2),
key = "population")
## Or, convert your existing data.frame "df" to data.table:
## DT <- data.table(df, key = "population")
DT[, .SD[length(unique(individual)) >= 3], by = key(DT)]
# population individual Haplotype1 Haplotype2
# 1: AA A1 1 5
# 2: AA A2 2 6
# 3: AA A3 3 7
# 4: CC C1 2 6
# 5: CC C2 3 7
# 6: CC C3 4 8
Update
I'm not sure if this is important to you or not, but note that with Tyler's and Sven's current solutions, although the output is correct according to the data in the question you've posted, there is actually some potentially flawed thinking going on.
I write "potentially" because you mention that you're looking for groups (from df$population) where there are three or more individuals (from df$individual). However, both of their solutions currently only look at the lengths of population, while by your actual question I would have assumed that you would want the number of unique individuals mentioned by population.
Here's a simple example. Using your original "df", change the individual in row 3 to "A2" (df[3, 2] <- "A2"). Now, according to your criteria in your question, only rows with population == "CC" should be returned.
If your data already only has unique individuals, then no problem--but I thought I would mention it ;)
A base R solution that takes this logic into account is:
uniqueIndividuals <- ave(as.character(df$individual),
df$population, FUN = function(x) length(unique(x)))
df[which(as.numeric(uniqueIndividuals) >= 3), ]
This would work as well:
lens <- tapply(df$population , df$population, length)
df[df$population %in% names(lens)[lens > 2], ]
EDIT: Per mrdwab's sharp reading I have edited my answer. I must admit I looked at the input and output only:
lens <- tapply(df$individual, df$population, function(x) length(unique(x)))
df[df$population %in% names(lens)[lens > 2], ]
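Since the question asks for a function with the threshold as a parameter, the tapply approach can be wrapped as in the sketch below. The function name `keep_large_groups` and its argument names are hypothetical, and it counts unique individuals per group, as discussed above.

```r
# Keep only rows whose group has at least min_n unique ids
keep_large_groups <- function(data, group, id, min_n) {
  n_unique <- tapply(data[[id]], data[[group]],
                     function(x) length(unique(x)))
  big <- names(n_unique)[n_unique >= min_n]
  data[data[[group]] %in% big, , drop = FALSE]
}

df <- data.frame(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                 individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
                 Haplotype1 = rep(1:4, 2),
                 Haplotype2 = rep(5:8, 2))

res <- keep_large_groups(df, "population", "individual", 3)
res
```

With the sample data this drops population "BB"; raising min_n to 4 would return an empty data frame.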

Alternatives to stats::reshape

The melt/cast functions in the reshape package are great, but I'm not sure if there is a simple way to apply them when measured variables are of different types. For example, here is a snippet from data where each MD provides the gender and weight of three patients:
ID PT1 WT1 PT2 WT2 PT3 WT3
1 "M" 170 "M" 175 "F" 145
...
where the objective is to reshape so each row is a patient:
ID PTNUM GENDER WEIGHT
1 1 "M" 170
1 2 "M" 175
1 3 "F" 145
...
Using the reshape function in the stats package is one option of which I'm aware, but I'm posting here in the hopes that R users more experienced than me will post other, hopefully better methods. Many thanks!
--
@Vincent Zoonekynd:
I liked your example a lot, so I generalized it to multiple variables.
# Sample data
n <- 5
d <- data.frame(
id = 1:n,
p1 = sample(c("M","F"),n,replace=TRUE),
q1 = sample(c("Alpha","Beta"),n,replace=TRUE),
w1 = round(runif(n,100,200)),
y1 = round(runif(n,100,200)),
p2 = sample(c("M","F"),n,replace=TRUE),
q2 = sample(c("Alpha","Beta"),n,replace=TRUE),
w2 = round(runif(n,100,200)),
y2 = round(runif(n,100,200)),
p3 = sample(c("M","F"),n,replace=TRUE),
q3 = sample(c("Alpha","Beta"),n,replace=TRUE),
w3 = round(runif(n,100,200)),
y3 = round(runif(n,100,200))
)
# Reshape the data.frame, one variable at a time
library(reshape)
d1 <- melt(d, id.vars="id", measure.vars=c("p1","p2","p3","q1","q2","q3"))
d2 <- melt(d, id.vars="id", measure.vars=c("w1","w2","w3","y1","y2","y3"))
d1 = cbind(d1,colsplit(d1$variable,names=c("var","ptnum")))
d2 = cbind(d2,colsplit(d2$variable,names=c("var","ptnum")))
d1$variable = NULL
d2$variable = NULL
d1c = cast(d1,...~var)
d2c = cast(d2,...~var)
# Join the two data.frames
d3 = merge(d1c, d2c, by=c("id","ptnum"), all=TRUE)
--
Final thoughts: my motivation for this question was to learn about alternatives to the reshape package other than the stats::reshape function. For the moment, I've reached the following conclusions:
Stick to stats::reshape when you can. As long as you remember to use a list rather than a simple vector for the "varying" argument, you'll stay out of trouble. For smaller data sets (a few thousand patient cases with fewer than 200 variables in total is what I was dealing with this time), the lower speed of this function is worth the simplicity of the code.
To use the cast/melt approach in Hadley Wickham's reshape (or reshape2) package, you have to split your variables into two sets, one consisting of numeric variables and another of character variables. When your data set is large enough that you find stats::reshape unbearable, I imagine the extra step of dividing your variables into two sets won't seem so bad.
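To illustrate the first conclusion, here is a minimal sketch of stats::reshape with a list passed to "varying", using made-up values in the layout from the top of the question. The second row of data and the name `long` are assumptions added for illustration.

```r
d <- data.frame(
  ID  = 1:2,
  PT1 = c("M","F"), WT1 = c(170,130),
  PT2 = c("M","F"), WT2 = c(175,140),
  PT3 = c("F","M"), WT3 = c(145,180),
  stringsAsFactors = FALSE
)

long <- reshape(d,
  direction = "long",
  idvar     = "ID",
  # One list element per output variable, each naming its wide columns
  varying   = list(c("PT1","PT2","PT3"), c("WT1","WT2","WT3")),
  v.names   = c("GENDER","WEIGHT"),
  timevar   = "PTNUM",
  times     = 1:3
)

# reshape stacks by patient number; sort so each MD's patients are together
long <- long[order(long$ID, long$PTNUM), ]
long
```

Because the character and numeric columns are listed separately, they keep their types and no second melt/merge pass is needed.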
You can process each variable separately,
and join the resulting two data.frames.
# Sample data
n <- 5
d <- data.frame(
id = 1:n,
pt1 = sample(c("M","F"),n,replace=TRUE),
wt1 = round(runif(n,100,200)),
pt2 = sample(c("M","F"),n,replace=TRUE),
wt2 = round(runif(n,100,200)),
pt3 = sample(c("M","F"),n,replace=TRUE),
wt3 = round(runif(n,100,200))
)
# Reshape the data.frame, one variable at a time
library(reshape2)
d1 <- melt(d,
id.vars="id", measure.vars=c("pt1","pt2","pt3"),
variable.name="patient", value.name="gender"
)
d2 <- melt(d,
id.vars="id", measure.vars=c("wt1","wt2","wt3"),
variable.name="patient", value.name="weight"
)
d1$patient <- as.numeric(gsub("pt", "", d1$patient))
d2$patient <- as.numeric(gsub("wt", "", d2$patient))
# Join the two data.frames
merge(d1, d2, by=c("id","patient"), all=TRUE)
I think the reshape function in the stats package is simplest. Here is a simple example; does this do what you want?
> tmp
id val val2 cat
1 1 1 14 a
2 1 2 13 b
3 2 3 12 b
4 2 4 11 a
> tmp2 <- tmp
> tmp2$t <- ave(tmp2$val, tmp2$id, FUN=seq_along)
> tmp2
id val val2 cat t
1 1 1 14 a 1
2 1 2 13 b 2
3 2 3 12 b 1
4 2 4 11 a 2
> reshape(tmp2, idvar='id', timevar='t', direction='wide')
id val.1 val2.1 cat.1 val.2 val2.2 cat.2
1 1 1 14 a 2 13 b
3 2 3 12 b 4 11 a
Hopefully your patients' sex is not changing at each appointment, but there could be other categorical variables that change between visits.
