Sequential subtraction in R

I would highly appreciate it if somebody could help me out with this. It looks simple, but I have no clue how to go about it.
I am trying to work out the percentage change in one row with respect to the previous one. For example: my data frame looks like this:
day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8
. .
. .
. .
365 27.2
What I am trying to do is to calculate the percentage change in each row with respect to previous row. For example:
day value
1 21
2 ((day2-day1)/day1)*100
3 ((day3-day2)/day2)*100
4 ((day4-day3)/day3)*100
5 ((day5-day4)/day4)*100
6 ((day6-day5)/day5)*100
7 ((day7-day6)/day6)*100
8 ((day8-day7)/day7)*100
. .
. .
. .
365 ((day365-day364)/day364)*100
and then print out only those days where there was a percentage increase of >50% from the previous row.
Many thanks

You are looking for diff(). See its help page by typing ?diff. Here are the indices of days that fulfill your criterion:
> value <- c(21,23.4,10.7,5.6,3.2,35.2,12.9,67.8)
> which(diff(value)/head(value,-1)>0.5)+1
[1] 6 8
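The same idea can be kept inside the data frame, which makes it easy to print the matching rows rather than just their indices. A minimal sketch, assuming the column names day and value from the question; pct_change and big_jumps are names made up for illustration:

```r
# First eight rows of the question's data
df <- data.frame(day = 1:8,
                 value = c(21, 23.4, 10.7, 5.6, 3.2, 35.2, 12.9, 67.8))

# Percentage change relative to the previous row.
# Day 1 has no previous row, so it gets NA.
df$pct_change <- c(NA, 100 * diff(df$value) / head(df$value, -1))

# which() silently drops the NA comparison for day 1
big_jumps <- df[which(df$pct_change > 50), ]
big_jumps$day
```

Using NA for the first day (instead of the raw value) avoids the first row being picked up by accident when the filter is applied.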

Use diff and divide by the previous value:
pct <- 100*diff(value)/head(value, -1)

Here's one way:
dat <- data.frame(day = 1:10, value = 1:10)
dat2 <- transform(dat, value2 = c(value[1], diff(value) / head(value, -1) * 100))
day value value2
1 1 1 1.00000
2 2 2 100.00000
3 3 3 50.00000
4 4 4 33.33333
5 5 5 25.00000
6 6 6 20.00000
7 7 7 16.66667
8 8 8 14.28571
9 9 9 12.50000
10 10 10 11.11111
dat2[dat2$value2 > 50, ]
day value value2
2 2 2 100

You're looking for the diff function:
x<-c(3,1,4,1,5)
diff(x)
[1] -2 3 -3 4

Here is another way:
#dummy data
df <- read.table(text="day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8", header=TRUE)
#get index for 50% change
x <- sapply(2:nrow(df),function(i)((df$value[i]-df$value[i-1])/df$value[i-1])>0.5)
#output
df[c(FALSE,x),]
# day value
#6 6 35.2
#8 8 67.8


Creating a percentage column based on the sums of a column grouped by a different column? [duplicate]

I am wrangling with a huge dataset and my R skills are very new. I am really trying to understand the terminology and processes but finding it a struggle as the R-documentation often makes no sense to me. So apologies if this is a dumb question.
I have data for plant species at different sites with different percentages of ground-cover. I want to create a new column PROP-COVER which gives each species' cover as a percentage of the total cover of all species in a particular site. This is slightly different to calculating percentage cover by site area, as it disregards bare ground with no vegetation. This is an easy calculation with just one site, but I have over a hundred sites and need to perform the calculation on species ground-cover grouped by site. The desired column output is PROP-COVER.
SPECIES SITE COVER PROP-COVER(%)
1 1 10 7.7
2 1 20 15.4
3 1 10 7.7
4 1 20 15.4
5 1 30 23.1
6 1 40 30.8
2 2 20 22.2
3 2 50
5 2 10
6 2 10
1 3 5
2 3 25
3 3 40
5 3 10
I have looked at for loops and repeat but I can't see where the arguments should go. Every attempt I make returns a NULL.
Below is an example of something I tried which I am sure is totally wide of the mark, but I just can't work out where to begin with or know if it is even possible.
a<- for (i in data1$COVER) {
sum(data1$COVER[data1$SITE=="i"],na.rm = TRUE)
}
a
NULL
I have a major brain-blockage when it comes to how 'for' loops etc work, no amount of reading about it seems to help, but perhaps what I am trying to do isn't possible? :(
Many thanks for looking.
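On the loop itself: a for loop in R always evaluates to NULL, so assigning it to a gives NULL no matter what happens inside, and "i" in quotes compares SITE with the literal letter i rather than the loop variable. A working loop might look like the sketch below, which assumes the column names SITE and COVER from the question; sites and totals are names made up for illustration:

```r
# Question's example data (SITE and COVER columns only)
data1 <- data.frame(
  SITE  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  COVER = c(10, 20, 10, 20, 30, 40, 20, 50, 10, 10, 5, 25, 40, 10)
)

sites  <- unique(data1$SITE)          # loop over sites, not over COVER values
totals <- numeric(length(sites))      # storage for the results
names(totals) <- sites

for (i in seq_along(sites)) {
  # store each result explicitly; the loop itself returns nothing useful
  totals[i] <- sum(data1$COVER[data1$SITE == sites[i]], na.rm = TRUE)
}
totals
```

This only gives the per-site totals; the grouped functions in the answers do the same thing without an explicit loop.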
In Base R:
merge(df, prop.table(xtabs(COVER~SPECIES+SITE, df), 2)*100)
SPECIES SITE COVER Freq
1 1 1 10 7.692308
2 1 3 5 6.250000
3 2 1 20 15.384615
4 2 2 20 22.222222
5 2 3 25 31.250000
6 3 1 10 7.692308
7 3 2 50 55.555556
8 3 3 40 50.000000
9 4 1 20 15.384615
10 5 1 30 23.076923
11 5 2 10 11.111111
12 5 3 10 12.500000
13 6 1 40 30.769231
14 6 2 10 11.111111
In tidyverse you can do:
df %>%
group_by(SITE) %>%
mutate(n = proportions(COVER) * 100)
# A tibble: 14 x 4
# Groups: SITE [3]
SPECIES SITE COVER n
<int> <int> <int> <dbl>
1 1 1 10 7.69
2 2 1 20 15.4
3 3 1 10 7.69
4 4 1 20 15.4
5 5 1 30 23.1
6 6 1 40 30.8
7 2 2 20 22.2
8 3 2 50 55.6
9 5 2 10 11.1
10 6 2 10 11.1
11 1 3 5 6.25
12 2 3 25 31.2
13 3 3 40 50
14 5 3 10 12.5
The code could also be written as n = COVER/sum(COVER) or even n = prop.table(COVER)
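For completeness, the grouped division can also be done in base R with ave(), which returns the group total repeated for every row of the group. A small sketch using the question's data; the column name PROP_COVER is an assumption (an underscore instead of the hyphen, so it is a valid R name):

```r
df <- data.frame(
  SPECIES = c(1, 2, 3, 4, 5, 6, 2, 3, 5, 6, 1, 2, 3, 5),
  SITE    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  COVER   = c(10, 20, 10, 20, 30, 40, 20, 50, 10, 10, 5, 25, 40, 10)
)

# ave() computes sum(COVER) within each SITE and recycles it to row length,
# so the division is row-by-row against the right site total
site_total    <- ave(df$COVER, df$SITE, FUN = sum)
df$PROP_COVER <- 100 * df$COVER / site_total

round(df$PROP_COVER, 1)
```

Unlike the merge/xtabs version, this keeps the original row order.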

lag and summarize time series data

I have spent a significant amount of time searching for an answer with little luck. I have some time series data and need to collapse it into a rolling mean of every nth row. It looks like this is possible in zoo, maybe Hmisc, and I am sure other packages. I need to average rows 1,2,3, then 3,4,5, then 5,6,7, and so on. My data look like this and have thousands of observations:
id time x.1 x.2 y.1 y.2
10 1 22 19 0 -.5
10 2 27 44 -1 0
10 3 19 13 0 -1.5
10 4 7 22 .5 1
10 5 -15 5 .33 2
10 6 3 17 1 .33
10 7 6 -2 0 0
10 8 44 25 0 0
10 9 27 12 1 -.5
10 10 2 11 2 1
I would like it to look like this when complete:
id time x.1 x.2 y.1 y.2
10 1 22.66 25.33 -.33 -.66
10 2 3.66 13.33 .27 .50
Time 1 would actually be times 1,2,3 averaged and time 2 would be times 3,4,5 averaged, but at this point the time variable is not important to keep. I need to group by id, as it does change eventually. The only way I could figure out to do this successfully was to use Lag() to make one new column lagged by 1 and another lagged by 2, then take the average across columns. After that you have to delete every other row:
1 NA NA
2 1 NA
3 2 1
4 3 2
5 4 3
That is, use the 1,2,3 and 3,4,5 rows and remove 2,3,4, and so on. Doing this for each variable would be outrageous, especially as I gather new data.
Any ideas? Help would be much appreciated.
something like this maybe?
# sample data
id <- c(10,10,10,10,10,10)
time <- c(1,2,3,4,5,6)
x1 <- c(22,27,19,7,-15,3)
x2 <- c(19,44,13,22,5,17)
df <- data.frame(id,time,x1,x2)
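library(zoo)  # provides rollmean()
# rolling mean over windows of 3 rows, for every column except time
means <- data.frame(rollmean(df[, c(1, 3:NCOL(df))], 3))
# keep every other window: rows 1-3, 3-5, ...
means <- means[c(TRUE, FALSE), ]
means$time <- seq_len(NROW(means))
row.names(means) <- seq_len(NROW(means))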
> means
id x1 x2 time
1 10 22.666667 25.33333 1
2 10 3.666667 13.33333 2
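Since the question also asks for grouping by id, here is a rough base R sketch that does the windowing per id without any packages. The helper window_means and its width and step arguments are made up for illustration, and it assumes every id has at least width rows:

```r
df <- data.frame(id   = rep(10, 6),
                 time = 1:6,
                 x1   = c(22, 27, 19, 7, -15, 3),
                 x2   = c(19, 44, 13, 22, 5, 17))

# Mean of rows s..(s + width - 1) for s = 1, 1 + step, 1 + 2 * step, ...
# Assumes nrow(d) >= width.
window_means <- function(d, cols = c("x1", "x2"), width = 3, step = 2) {
  starts <- seq(1, nrow(d) - width + 1, by = step)
  out <- t(sapply(starts, function(s) colMeans(d[s:(s + width - 1), cols])))
  data.frame(id = d$id[1], time = seq_along(starts), out)
}

# Apply the helper to each id separately and stack the pieces
result <- do.call(rbind, lapply(split(df, df$id), window_means))
result
```

With step = 2 and width = 3 the windows overlap by one row (1-3, 3-5, ...), which is the pattern described in the question.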

Grouped ranking in R

I have data with a primary key and ratio values like the following:
2.243164164
1.429242413
2.119270714
3.013427143
1.208634972
1.208634972
1.23657632
2.212136028
2.168583297
2.151961216
1.159886063
1.234106444
1.694206176
1.401425329
5.210125578
1.215267806
1.089189869
I want to add a rank column which groups these ratios in say 3 bins. Functionality similar to the sas code:
PROC RANK DATA = TAB1 GROUPS = &NUM_BINS
I did the following:
Convert your vector to data frame.
Create variable Rank:
test2$rank<-rank(test2$test)
> test2
test rank
1 2.243164 15.0
2 1.429242 9.0
3 2.119271 11.0
4 3.013427 16.0
5 1.208635 3.5
6 1.208635 3.5
7 1.236576 7.0
8 2.212136 14.0
9 2.168583 13.0
10 2.151961 12.0
11 1.159886 2.0
12 1.234106 6.0
13 1.694206 10.0
14 1.401425 8.0
15 5.210126 17.0
16 1.215268 5.0
17 1.089190 1.0
Define function to convert to percentile ranks and then define pr as that percentile.
percent.rank<-function(x) trunc(rank(x)/length(x)*100)
test3<-within(test2,pr<-percent.rank(rank))
Then I created bins on the fact you wanted 3 of them.
test3$bins <- cut(test3$pr, breaks=c(0,33,66,100), labels=c("0-33","34-66","66-100"))
test x rank pr bins
1 2.243164 15.0 15.0 88 66-100
2 1.429242 9.0 9.0 52 34-66
3 2.119271 11.0 11.0 64 34-66
4 3.013427 16.0 16.0 94 66-100
5 1.208635 3.5 3.5 20 0-33
6 1.208635 3.5 3.5 20 0-33
7 1.236576 7.0 7.0 41 34-66
8 2.212136 14.0 14.0 82 66-100
9 2.168583 13.0 13.0 76 66-100
10 2.151961 12.0 12.0 70 66-100
11 1.159886 2.0 2.0 11 0-33
12 1.234106 6.0 6.0 35 34-66
13 1.694206 10.0 10.0 58 34-66
14 1.401425 8.0 8.0 47 34-66
15 5.210126 17.0 17.0 100 66-100
16 1.215268 5.0 5.0 29 0-33
17 1.089190 1.0 1.0 5 0-33
That work for you?
Almost late but given your data, we can use ntile from dplyr package to get equal sized groups:
df <- data.frame(values = c(2.243164164,
1.429242413,
2.119270714,
3.013427143,
1.208634972,
1.208634972,
1.23657632,
2.212136028,
2.168583297,
2.151961216,
1.159886063,
1.234106444,
1.694206176,
1.401425329,
5.210125578,
1.215267806,
1.089189869))
library(dplyr)
df <- df %>%
arrange(values) %>%
mutate(rank = ntile(values, 3))
values rank
1 1.089190 1
2 1.159886 1
3 1.208635 1
4 1.208635 1
5 1.215268 1
6 1.234106 1
7 1.236576 2
8 1.401425 2
9 1.429242 2
10 1.694206 2
11 2.119271 2
12 2.151961 2
13 2.168583 3
14 2.212136 3
15 2.243164 3
16 3.013427 3
17 5.210126 3
Or see cut_number from ggplot2 package:
library(ggplot2)
df$rank2 <- cut_number(df$values, 3, labels = c(1:3))
values rank rank2
1 1.089190 1 1
2 1.159886 1 1
3 1.208635 1 1
4 1.208635 1 1
5 1.215268 1 1
6 1.234106 1 1
7 1.236576 2 2
8 1.401425 2 2
9 1.429242 2 2
10 1.694206 2 2
11 2.119271 2 2
12 2.151961 2 3
13 2.168583 3 3
14 2.212136 3 3
15 2.243164 3 3
16 3.013427 3 3
17 5.210126 3 3
Because your sample consists of 17 numbers, one bin consists of 5 numbers while the others consist of 6 numbers. There are differences for row 12: ntile assigns 6 numbers to the first and second group, whereas cut_number assigns them to the first and third group.
> table(df$rank)
1 2 3
6 6 5
> table(df$rank2)
1 2 3
6 5 6
See also here: Splitting a continuous variable into equal sized groups
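If the goal is to mimic PROC RANK with GROUPS= more closely, the group formula usually quoted for it is floor(rank * k / (n + 1)), which numbers the groups from 0. Treat that formula as an assumption to check against the SAS documentation; the sketch below just applies it in base R, with num_bins and bin as illustrative names:

```r
values <- c(2.243164164, 1.429242413, 2.119270714, 3.013427143,
            1.208634972, 1.208634972, 1.23657632, 2.212136028,
            2.168583297, 2.151961216, 1.159886063, 1.234106444,
            1.694206176, 1.401425329, 5.210125578, 1.215267806,
            1.089189869)

num_bins <- 3
r   <- rank(values)                                  # ties share the average rank
bin <- floor(r * num_bins / (length(values) + 1))    # groups numbered 0, 1, 2

table(bin)
```

Tied values always land in the same group here, because they share one rank.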

R ignores 0 values

I am trying to create some descriptive statistics and histograms out of ordered variables (range 0 to 10). I used the following commands:
class(data$var1)
describe(as.numeric(data$var1))
But R starts from 1 and counts the "refusal" values as a further numeric value.
How can I let R start from 0 and ignore the "refusal" values?
Thank you.
Edit: I was able to let R ignore "refusal" value using the following command:
is.na (data$var1[data$var1=="Refusal"]) <- TRUE
But when I search for a possible solution to the 0 values, I only find suggestions on how to ignore or remove 0 values...
Edit2: This is a sample of my data,
[1] 5 8 8 8 Refusal 10 8 Refusal 7
[10] 7 8 7 8 8 8 8 8 8
[19] 8 0 9 Refusal 6 10 7 7 9
as you can see the range is from 0 to 10 but using the R library "psych" and the command "describe" the output range is always 1 to 11 and this invalidates the whole statistics.
> class(data$var1)
[1] "factor"
> describe(as.numeric(data$var1), na.rm=TRUE)
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 1115 8.38 1.94 9 8.57 1.48 1 11 10 -1.06 1.42 0.06
Sorry for the ongoing editing, but I am new to stackoverflow.com.
Have a look at how factors work, with ?factor, or by looking at the example question here. In essence, each level is given an integer code starting at 1, hence ending at 11 if you have 11 unique values. Converting a factor to numeric returns these codes rather than the underlying numbers they relate to. To get the underlying numbers, first convert to character, then to numeric. See the difference between these code snippets:
#create data
library(Hmisc) # describe() below is the Hmisc version, not psych's
set.seed(0)
a <- factor(sample(c(0:10,"refusal"),50,TRUE)) #Some dummy data
class(a)
# [1] "factor"
snippet 1 - how you're doing it
describe(as.numeric(a),na.rm=TRUE)
#as.numeric(a)
#n missing unique Mean .05 .10 .25 .50 .75 .90 .95
#50 0 11 6.28 2.00 2.00 4.00 6.00 8.75 10.00 11.00
#
#1 2 3 4 5 6 7 8 9 10 11
#Frequency 2 5 5 4 2 8 6 5 3 6 4
#% 4 10 10 8 4 16 12 10 6 12 8
snippet 2 - correct way
describe(as.numeric(as.character(a)),na.rm=TRUE)
#as.numeric(as.character(a))
#n missing unique Mean .05 .10 .25 .50 .75 .90 .95
#46 4 10 5.304 1.0 1.0 3.0 5.0 8.0 9.5 10.0
#
#0 1 2 3 4 5 7 8 9 10
#Frequency 2 5 4 2 8 6 5 3 6 5
#% 4 11 9 4 17 13 11 7 13 11
#Warning message:
# In describe(as.numeric(as.character(a)), na.rm = TRUE) :
# NAs introduced by coercion
Note the difference in range (even if my describe function isn't the same as yours). The warning refers to the "refusal" values, which are converted to NAs as they don't represent a number.
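Putting the two steps together (marking refusals as missing first, then converting through character) avoids the coercion warning. A small sketch on a toy factor; var1 and scores are illustrative names, not the asker's real data:

```r
# Toy factor shaped like the question's variable
var1 <- factor(c("5", "8", "Refusal", "10", "0", "7"))

# as.numeric() on the factor returns the internal level codes, not the scores
codes <- as.numeric(var1)

# Mark refusals as missing, then go through character to get the real numbers
is.na(var1) <- var1 == "Refusal"
scores <- as.numeric(as.character(var1))

range(scores, na.rm = TRUE)   # the 0 is preserved
```

The codes vector is only there to show the problem; it depends on how the levels happen to sort and should not be used for statistics.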

Merging data sets with unequal observations

I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participants, and then a HHID, the house id from which they were recruited (eg 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of data with only one participant per household, chosen who spent longer hours watching television. In this subset data, I have computed socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to unique participants of bigger data (Healthdata) such that, participants from the same house gets the same score.
I can't merge this simply, because the data sets have different number of observations, 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the required output as something like below, ie ID (participants) from same HHID (house) should have same SSE score. The following output is just meant for an example of what I need, the above seed will not give the same output.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it merges on the columns the two data frames have in common (here both ID and HHID), so only rows that agree on both get an SSE:
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose which column to merge by:
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing because the ids from the supergroup are from a different set than from the subgroup. I.e. ID.x == 4 != ID.y == 4 (in fact, in this case they are in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only the ID.x one,
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
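If only the ID, HHID and SSE columns are wanted, a lookup with match() on HHID gives the same per-house scores without creating ID.x and ID.y. A sketch under the assumption that the scores sit in a table with one row per house; house_scores is a made-up name, and its SSE values are copied from the example output in the question:

```r
Healthdata <- data.frame(ID   = 1:15,
                         HHID = c(1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 8, 9, 10, 11))

# One row per household (values taken from the question's desired output)
house_scores <- data.frame(
  HHID = 1:11,
  SSE  = c(-6.5, -5.5, 3.3, 3.0, 2.58, -3.05, -1.2, 2.5, 1.89, 1.88, -3.02)
)

# match() gives, for each participant, the row of their house in house_scores
Healthdata$SSE <- house_scores$SSE[match(Healthdata$HHID, house_scores$HHID)]
Healthdata
```

Participants whose HHID is missing from the score table would get NA rather than being dropped.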
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
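# Left Join
# I believe this is what you are after: join on HHID (not ID) so that every
# participant in a house picks up that house's SSE
scores <- Healthdata.1[, c("HHID", "SSE")]
scores$HHID <- as.numeric(as.character(scores$HHID)) # HHID is a factor in Healthdata.1
join(Healthdata, scores, type = "left", by = "HHID")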
