I have a dataframe, with a column that increases with every row, and periodically (though not regularly) resets back to 1.
I'd like to track/ count these resets in separate column. This for-loop example does exactly what I want, but is incredibly slow when applied to large datasets. Is there a better/ quicker/ more R way to do this same operation:
ColA<-seq(1,20)
ColB<-rep(seq(1,5),4)
DF<-data.frame(ColA, ColB)
DF$ColC<-NA
DF[1,'ColC']<-1
#Removing line 15 and changing line 5 to 1.1 per comments in answer
DF<-DF[-15,]
DF[5,2]<-0.1
for(i in seq(1,nrow(DF)-1)){
print(i)
MyRow<-DF[i+1,]
if(MyRow$ColB < DF[i,'ColB']){
DF[i+1,"ColC"]<-DF[i,"ColC"] +1
}else{
DF[i+1,"ColC"]<-DF[i,"ColC"]
}
}
No need for a loop here. We can just use the vectorized cumsum. This ought to be faster:
DF$ColC<-cumsum(DF$ColB==1)
DF
To keep using varying variable reset values that are always lower then the previous value, use cumsum(ColB < lag(ColB)):
DF %>% mutate(ColC = cumsum(ColB < lag(ColB, default = Inf)))
ColA ColB ColC
1 1 1.0 1
2 2 2.0 1
3 3 3.0 1
4 4 4.0 1
5 5 0.1 2
6 6 1.0 2
7 7 2.0 2
8 8 3.0 2
9 9 4.0 2
10 10 5.0 2
11 11 1.0 3
12 12 2.0 3
13 13 3.0 3
14 14 4.0 3
16 16 1.0 4
17 17 2.0 4
18 18 3.0 4
19 19 4.0 4
20 20 5.0 4
Related
This question already has answers here:
Summarizing by subgroup percentage in R
(2 answers)
Closed 9 months ago.
I am wrangling with a huge dataset and my R skills are very new. I am really trying to understand the terminology and processes but finding it a struggle as the R-documentation often makes no sense to me. So apologies if this is a dumb question.
I have data for plant species at different sites with different percentages of ground-cover. I want to create a new column PROP-COVER which gives the proportion of each species' cover as a percentage of the total cover of all species in a particular site. This is slightly different to calculating percentage cover by site area as it is disregards bare ground with no vegetation. This is an easy calculation with just one site, but I have over a hundred sites and need to perform the calculation on species ground-cover grouped by site. The desired column output is PROP-COVER.
SPECIES SITE COVER PROP-COVER(%)
1 1 10 7.7
2 1 20 15.4
3 1 10 7.7
4 1 20 15.4
5 1 30 23.1
6 1 40 30.8
2 2 20 22.2
3 2 50
5 2 10
6 2 10
1 3 5
2 3 25
3 3 40
5 3 10
I have looked at for loops and repeat but I can't see where the arguments should go. Every attempt I make returns a NULL.
Below is an example of something I tried which I am sure is totally wide of the mark, but I just can't work out where to begin with or know if it is even possible.
a<- for (i in data1$COVER) {
sum(data1$COVER[data1$SITE=="i"],na.rm = TRUE)
}
a
NULL
I have a major brain-blockage when it comes to how 'for' loops etc work, no amount of reading about it seems to help, but perhaps what I am trying to do isn't possible? :(
Many thanks for looking.
In Base R:
merge(df, prop.table(xtabs(COVER~SPECIES+SITE, df), 2)*100)
SPECIES SITE COVER Freq
1 1 1 10 7.692308
2 1 3 5 6.250000
3 2 1 20 15.384615
4 2 2 20 22.222222
5 2 3 25 31.250000
6 3 1 10 7.692308
7 3 2 50 55.555556
8 3 3 40 50.000000
9 4 1 20 15.384615
10 5 1 30 23.076923
11 5 2 10 11.111111
12 5 3 10 12.500000
13 6 1 40 30.769231
14 6 2 10 11.111111
In tidyverse you can do:
df %>%
group_by(SITE) %>%
mutate(n = proportions(COVER) * 100)
# A tibble: 14 x 4
# Groups: SITE [3]
SPECIES SITE COVER n
<int> <int> <int> <dbl>
1 1 1 10 7.69
2 2 1 20 15.4
3 3 1 10 7.69
4 4 1 20 15.4
5 5 1 30 23.1
6 6 1 40 30.8
7 2 2 20 22.2
8 3 2 50 55.6
9 5 2 10 11.1
10 6 2 10 11.1
11 1 3 5 6.25
12 2 3 25 31.2
13 3 3 40 50
14 5 3 10 12.5
The code could also be written as n = COVER/sum(COVER) or even n = prop.table(COVER)
I want to replicate observations based on the values of the variable iptw to create pseudo-populations for further analysis.
For example, if iptw=4.5, then weight=5 should be created, and the observation/row multiplied by 5. Likewise, if iptw=2.3, then weight=2, and that row is multiplied by 2, which is equivalent to adding the corresponding observation twice to the data frame.
Here is my dataset:
dtNEW <- data.table(id = 1:4, x1 = 10:13, x2=21:24, iptw=c(2.3,0.6,4.5,0.1))
There is a similar question here but the solutions there do not answer my question.
Assuming you want to replicate the ith row round(iptw[i]) times:
dtNEW[rep(1:.N, round(iptw)), ]
giving:
id x1 x2 iptw
1: 1 10 21 2.3
2: 1 10 21 2.3
3: 2 11 22 0.6
4: 3 12 23 4.5
5: 3 12 23 4.5
6: 3 12 23 4.5
7: 3 12 23 4.5
Another option is uncount from tidyr
library(tidyr)
uncount(dtNEW, round(iptw))
# id x1 x2 iptw
#1: 1 10 21 2.3
#2: 1 10 21 2.3
#3: 2 11 22 0.6
#4: 3 12 23 4.5
#5: 3 12 23 4.5
#6: 3 12 23 4.5
#7: 3 12 23 4.5
I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that missing data gets imputed based on the surrounding values within a given subject. I'd like to use the mean of the closest previous and closest subsequent values for the participant. If there is no subsequent value present, then I'd like to use the previous value carried forward until a subsequent value is present.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects, however, the solutions I keep coming to force me to use conditional formatting based on rows immediately above and below the a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you think you know of a good technique I can use, experiment with, or if you know of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(2,2,4,3,NA,0,0,1,4,0,NA,0,0,0,4,2,1,3,3,2,NA,3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 2,2,4,3,1.5,0,0
ID# 2 (variable ss) to look like this: 1,4,0,0,0,0,0
ID #3 (variable ss) to look like this: 4,2,1,3,3,2,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 3,4,3,3,1.5,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
If processing speed is not the issue (I guess "ID #4" makes it hard to vectorize imputations), then maybe try:
f <- function(x) {
idx <- which(is.na(x))
for (id in idx) {
sel <- x[id+c(-1,1)]
if (id < length(x))
sel <- sel[!is.na(sel)]
x[id] <- mean(sel)
}
return(x)
}
cbind(mydat, ss_imp=ave(mydat$ss, mydat$id, FUN=f))
# id time ss ss_imp
# 11 1 0 2 2.0
# 12 1 1 2 2.0
# 13 1 2 4 4.0
# 14 1 3 3 3.0
# 15 1 4 NA 1.5
# 16 1 5 0 0.0
# 17 1 6 0 0.0
# 21 2 0 1 1.0
# 22 2 1 4 4.0
# 23 2 2 0 0.0
# 24 2 3 NA 0.0
# 25 2 4 0 0.0
# 26 2 5 0 0.0
# 27 2 6 0 0.0
# 31 3 0 4 4.0
# 32 3 1 2 2.0
# 33 3 2 1 1.0
# 34 3 3 3 3.0
# 35 3 4 3 3.0
# 36 3 5 2 2.0
# 37 3 6 NA NA
# 41 4 0 3 3.0
# 42 4 1 4 4.0
# 43 4 2 3 3.0
# 44 4 3 NA 3.0
# 45 4 4 NA 1.5
# 46 4 5 0 0.0
# 47 4 6 0 0.0
I want to make new column in my data set with the values determined by values in another data set, but it's not as simple as the values in one column being a function of the values in the other. Here's an example:
>df1
chromosome position
1 1 1
2 1 2
3 1 4
4 1 5
5 1 7
6 1 12
7 1 13
8 1 15
9 1 21
10 1 23
11 1 24
12 2 1
13 2 5
14 2 7
15 2 8
16 2 12
17 2 15
18 2 18
19 2 21
20 2 22
and
>df2
chromosome segment_start segment_end segment.number
1 1 1 5 1.1
2 1 6 20 1.2
3 1 21 25 1.3
4 2 1 7 2.1
5 2 8 16 2.2
6 2 18 22 2.3
I want to make a new column in df1 called 'segment', and the value in segment is to be determined by which segment (as determined by 'segment_start', 'segment_end', and 'chromosome' from df2) the value in 'position' belongs to. For example, in df1, row 7, position=13, and chromosome=1. Because 13 is between 6 and 20, the entry in my hypothetical 'segment' column would be 1.2, from row 2 of df2, because 13 falls between segment_start and segment_end from that row (6 and 20, respectively), and the 'chromosome' value from df1 row 7 is 1, just as 'chromosome' in df2 row 2 is 1.
Each row in df1 belongs to one of the segments described in df2; that is, it lies on the same chromosome as one of the segments, and its 'position' is >=segment_start and <=segment_end. And I want to get that information into df1, so it says what segment each position belongs to.
I was thinking of using an if function, and started with:
if(df1$position>=df2$segment_start & df1$position<=df2$segment_end & df1$chromosome==df2$chromosome) df1$segment<-df2$segment.number
But am not sure that way will be feasible. If nothing else maybe the code can help illustrate what it is I'm trying to do. Basically, I want match each row by its position and chromosome to a segment in df2. Thanks.
This appears to be a rolling join. You can use data.table for this
require(data.table)
DT1 <- data.table(df1, key = c('chromosome','position'))
DT2 <- data.table(df2, key = c('chromosome','section_start'))
# this will perform the join you want (but retain all the
# columns with names names of DT2)
# DT2[DT1, roll=TRUE]
# which is why I have renamed and subset here)
DT2[DT1, roll=TRUE][ ,list(chromosome,position = segment_start,segment.number)]
# chromosome position segment.number
# 1: 1 1 1.1
# 2: 1 2 1.1
# 3: 1 4 1.1
# 4: 1 5 1.1
# 5: 1 7 1.2
# 6: 1 12 1.2
# 7: 1 13 1.2
# 8: 1 15 1.2
# 9: 1 21 1.3
# 10: 1 23 1.3
# 11: 1 24 1.3
# 12: 2 1 2.1
# 13: 2 5 2.1
# 14: 2 7 2.1
# 15: 2 8 2.2
# 16: 2 12 2.2
# 17: 2 15 2.2
# 18: 2 18 2.3
# 19: 2 21 2.3
# 20: 2 22 2.3
You really need to check out the GenomicRanges package from Bioconductor. It provides the data structures that are appropriate for your use case.
First, we create the GRanges objects:
gr1 <- with(df1, GRanges(chromosome, IRanges(position, width=1L)))
gr2 <- with(df2, GRanges(chromosome, IRanges(segment_start, segment_end),
segment.number=segment.number))
Then we find the overlaps and do the merge:
hits <- findOverlaps(gr1, gr2)
gr1$segment[queryHits(hits)] <- gr2$segment.number[subjectHits(hits)]
I'm going to assume that the regions in df2 are non-overlapping, continuous and complete (not missing any positions from df1). I seem to do this differently every time I try, so here's my latest idea.
First, make sure chromosome is a factor in both data sets
df1$chromosome<-factor(df1$chromosome)
df2$chromosome<-factor(df2$chromosome)
Now I want to unwrap, chr/pos into one over all generic position, i'll do that with
ends<-with(df2, tapply(segment_end, chromosome, max))
offset<-head(c(0,cumsum(ends)),-1)
names(offset)<-names(ends)
This assigns unique position values to all positions across all chromosomes and it tracks the offset to the beginning of each chromosome in this new system. Now we will build a translation function from the data in df2
seglookup <- approxfun(with(df2, offset[chromosome]+segment_start), 1:nrow(df2),
method="constant", rule=2)
We use approxfun to find the right interval in the genetic position space for each segment. Now we use this function on df1
segid <- with(df1, seglookup(offset[chromosome]+position))
Now we have the correct ID for each position. We can verify this by merging the data and looking at the results
cbind(df1, df2[segid,-1])
chromosome position segment_start segment_end segment.number
1 1 1 1 5 1.1
2 1 2 1 5 1.1
3 1 4 1 5 1.1
4 1 5 1 5 1.1
5 1 7 6 20 1.2
6 1 12 6 20 1.2
7 1 13 6 20 1.2
8 1 15 6 20 1.2
9 1 21 21 25 1.3
10 1 23 21 25 1.3
11 1 24 21 25 1.3
12 2 1 1 7 2.1
13 2 5 1 7 2.1
14 2 7 1 7 2.1
15 2 8 8 16 2.2
16 2 12 8 16 2.2
17 2 15 8 16 2.2
18 2 18 18 22 2.3
19 2 21 18 22 2.3
20 2 22 18 22 2.3
So it looks like we did alright.
I have two data sets, one is the subset of another but the subset has additional column, with lesser observations.
Basically, I have a unique ID assigned to each participants, and then a HHID, the house id from which they were recruited (eg 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of data with only one participant per household, chosen who spent longer hours watching television. In this subset data, I have computed socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to unique participants of bigger data (Healthdata) such that, participants from the same house gets the same score.
I can't merge this simply, because the data sets have different number of observations, 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the required output as something like below, ie ID (participants) from same HHID (house) should have same SSE score. The following output is just meant for an example of what I need, the above seed will not give the same output.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge , By default it will merge by columns intersections.
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose by which column you merge :
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing because the ids from the supergroup are from a different set than from the subgroup. I.e. ID.x == 4 != ID.y == 4 (in fact, in this case they are in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only the ID.x one,
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")