I have a data frame and I want to compare the values in row 3 using an if statement, treating two values as equal when they differ by no more than a given tolerance.
Let's say I want values in the third row to count as equal when they are within 0.2 of each other.
> data
NAME A B C D
first 3 2 4 5
second 1 2 3 4
third 7 7.1 7.5 6.9
four 2 1 0 5
Here is a program that compares the exact values:
for (i in 1:3) {
d <- i+1
for (j in d:4) {
if(data [3,i] == data [3,j] ){
print(paste("The columns", colnames(data)[i], "and", colnames(data)[j], "are equal"))
}
}
}
This prints nothing because the program compares exact values, whereas I want to treat values as equal when they are within 0.2 of each other.
The result I want is:
the column A and B are equal
the column A and D are equal
This is because A (= 7) is within 0.2 of B (= 7.1),
and likewise for D:
A (= 7) is within 0.2 of D (= 6.9).
Thank you
Take the combination of columns then compare with tolerance:
df1 <- read.table(text ="
NAME A B C D
first 3 2 4 5
second 1 2 3 4
third 7 7.1 7.5 6.9
four 2 1 0 5", header = TRUE)
tolerance = 0.2
cbind(df1,
combn(colnames(df1[, 2:5]), 2, FUN = function(x){
paste0(x[ 1 ],
ifelse(abs(df1[, x[ 1 ] ] - df1[, x[ 2 ] ]) <= tolerance, "=", "!="),
x[ 2 ])
}))
# NAME A B C D 1 2 3 4 5 6
# 1 first 3 2.0 4.0 5.0 A!=B A!=C A!=D B!=C B!=D C!=D
# 2 second 1 2.0 3.0 4.0 A!=B A!=C A!=D B!=C B!=D C!=D
# 3 third 7 7.1 7.5 6.9 A=B A!=C A=D B!=C B=D C!=D
# 4 four 2 1.0 0.0 5.0 A!=B A!=C A!=D B!=C B!=D C!=D
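If only the messages for a single row are wanted, as in the question, the same tolerance test can be applied to that row alone. The following is a minimal sketch, assuming the df1 defined above; the names row, vals, pairs, close and msgs are introduced here for illustration only.

```r
df1 <- read.table(text = "
NAME A B C D
first 3 2 4 5
second 1 2 3 4
third 7 7.1 7.5 6.9
four 2 1 0 5", header = TRUE)

tolerance <- 0.2
row <- 3                               # the row to inspect (hypothetical name)

vals  <- unlist(df1[row, 2:5])         # named numeric vector: A, B, C, D
pairs <- combn(names(vals), 2)         # 2 x 6 matrix of column-name pairs

# TRUE where the two values of a pair differ by at most the tolerance
close <- abs(vals[pairs[1, ]] - vals[pairs[2, ]]) <= tolerance

msgs <- paste("The columns", pairs[1, close], "and", pairs[2, close], "are equal")
print(msgs)
```

Pairs whose difference lies exactly on the tolerance, such as B and D here (7.1 versus 6.9), depend on floating-point rounding, so it may be worth adding a small slack to the tolerance if such boundary cases matter.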
I am facing a problem with the amount of time needed to run my code. Basically, I have several columns plus a key value in the last column (identified as the mean in the reproducible example). I want each value to become 1 when it is below the key value and 2 when it is above.
Is there an easier way to do this?
a <- c(1,3,5,6,4)
b <- c(10,4,24,5,3)
df <- data.frame (a,b)
df$mean <- rowMeans (df)
for (i in 1:5){
df[i,1:2] [df[i,1:2]<df$mean[i]] <- 1
df[i,1:2] [df[i,1:2]>df$mean[i]] <- 2
}
Thank you in advance
You can simply do,
df[1:2] <- (df[1:2] > df$mean) + 1  # as.integer() removed, per @akrun's comment
Which gives,
a b mean
1 1 2 5.5
2 1 2 3.5
3 1 2 14.5
4 2 1 5.5
5 2 1 3.5
In R, prefer vectorized operations like this over explicit loops whenever possible; they are usually both shorter and faster.
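To see why the one-liner works, it can help to split it into steps. This is a minimal sketch of the same idea, assuming the key column is called mean as in the question; cols and above are names introduced here for illustration.

```r
a <- c(1, 3, 5, 6, 4)
b <- c(10, 4, 24, 5, 3)
df <- data.frame(a, b)
df$mean <- rowMeans(df)

# Every column except the key column (hypothetical helper name)
cols <- setdiff(names(df), "mean")

# Logical matrix: the key vector has one entry per row, so each value
# is compared against the key of its own row
above <- df[cols] > df$mean

# FALSE becomes 1 and TRUE becomes 2
df[cols] <- above + 1
print(df)
```

Note that a value exactly equal to the key is not "above" it and therefore ends up as 1; the question does not say how ties should be handled, so that choice is an assumption.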
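Alternative solution using mutate_each from dplyr (mutate_each and funs are deprecated in recent dplyr releases, where across() is the suggested replacement):
library(dplyr)
df %>% mutate_each(funs(ifelse(mean>.,1,2)), 1:2)
This also gives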
a b mean
1 1 2 5.5
2 1 2 3.5
3 1 2 14.5
4 2 1 5.5
5 2 1 3.5
I have a dataset that looks like this:
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
head(sample.data)
groups A B position
1 1 1 3 2
2 2 3 2 1
3 3 2 4 2
4 4 4 1 1
5 5 2 5 2
6 6 5 2 1
The "position" column always alternates between 2 and 1. I want to do this calculation in R: starting from the first row, if it's in position 1, ignore it. If it starts at 2 (as in this example), then calculate as follows:
Take the first 2 values of column A that are at position 2, average them, then subtract the first value that is at position 1 (in this example: (1+2)/2 - 3 = -1.5). Then repeat the calculation for the next set of values, using the last position 2 value as the starting point, i.e. the next calculation would be (2+2)/2 - 4 = -2.
So basically, in this example, the calculations are done for the values of these sets of groups: 1-2-3, 3-4-5, 5-6-7, etc. (the last value of the previous is the first value of the next set of calculation)
Repeat the calculation until the end. Also do the same for column B.
Since I need the original data frame intact, put the newly calculated values in a new data frame, with columns dA and dB holding the values calculated from columns A and B, respectively (if that is not possible, they can be created as separate data frames and I will combine them afterwards).
Desired output (from the example):
dA dB
1 -1.5 1.5
2 -2 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
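# First row at position 2; every second row from there is also at position 2
start <- match(2, sample.data$position)
twos <- seq(from = start, to = nrow(sample.data), by = 2)
# For each column, average the position-2 values at rows i and i+2,
# then subtract the position-1 value at row i+1.
# A trailing i with no row i+2 yields NA.
df <-
  sapply(c("A", "B"), function(l) {
    sapply(twos, function(i) {
      mean(sample.data[c(i, i+2), l]) - sample.data[i+1, l]
    })
  })
df <- setNames(as.data.frame(df), c('dA', 'dB'))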
As your values in position always alternate between 1 and 2, you can define an index of odd rows i1 and an index of even rows i2, and do your calculations:
library(dplyr)  # provides lead()
## If the first row has position == 1, shift both indexes by one
inc <- 0
if (sample.data$position[1] == 1) {
  inc <- 1
}
i1 <- seq(1 + inc, nrow(sample.data), by = 2)
i2 <- seq(2 + inc, nrow(sample.data), by = 2)
res <- data.frame(dA = (lead(sample.data$A[i1]) + sample.data$A[i1]) / 2 - sample.data$A[i2],
                  dB = (lead(sample.data$B[i1]) + sample.data$B[i1]) / 2 - sample.data$B[i2])
This returns:
dA dB
1 -1.5 1.5
2 -2.0 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4.0
7 -3.5 2.5
8 -3.0 3.0
9 -3.0 4.5
10 NA NA
The last row is NA because the final position-2 value has no following position-2 value to average with; you can remove it if needed:
res=na.omit(res)
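The same calculation can also be written in base R without lead(), by indexing only the rows that start a complete triplet so that no NA row is produced. This is a sketch under the question's assumption that position strictly alternates; the helper name delta is introduced here for illustration.

```r
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- rep(c(2, 1), times = 10)
sample.data <- data.frame(groups = 1:20, A, B, position)

# Rows at position 2 that are followed by a full triplet (i, i+1, i+2)
i <- which(sample.data$position == 2)
i <- i[i + 2 <= nrow(sample.data)]

# Mean of the two position-2 values minus the position-1 value between them
# (hypothetical helper; relies on the strict alternation of position)
delta <- function(v) (v[i] + v[i + 2]) / 2 - v[i + 1]

res <- data.frame(dA = delta(sample.data$A), dB = delta(sample.data$B))
print(res)
```

Because incomplete triplets are dropped up front, the original sample.data is left untouched and no na.omit step is required.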
If I have the following data frame G:
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
I am trying to get:
z type x y
3 a 6 3
2 a 5 2
1 a 4 1
4 b 1 2
5 b 0.9 1
6 c 4 1
I.e. I want to sort the whole data frame within the levels of the factor type, based on vector x, get the length of each level (a = 3, b = 2, c = 1), and then number the rows in decreasing order in a new vector y.
My current starting point is sort():
tapply(y, x, sort)
Would it be best to use sapply to split everything first?
There are many ways to skin this cat. Here is one solution using base R and vectorized code in two steps (without any apply):
1. Sort the data using order and xtfrm.
2. Use rle and sequence to generate the sequence.
Replicate your data:
dat <- read.table(text="
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header=TRUE, stringsAsFactors=FALSE)
Two lines of code:
r <- dat[order(dat$type, -xtfrm(dat$x)), ]
r$y <- sequence(rle(r$type)$lengths)
Results in:
r
z type x y
3 3 a 6.0 1
2 2 a 5.0 2
1 1 a 4.0 3
4 4 b 1.0 1
5 5 b 0.9 2
6 6 c 4.0 1
The call to order is slightly complicated. Since you are sorting one column in ascending order and a second in descending order, use the helper function xtfrm. See ?xtfrm for details, but it is also described in ?order.
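The result above numbers each group upwards from 1, whereas the question's desired output counts down within each type (3, 2, 1 for a). One way to get the decreasing numbering is sketched below with ave(); this is an illustrative variation on the two-line solution, not part of the original answer.

```r
dat <- read.table(text = "
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header = TRUE, stringsAsFactors = FALSE)

# Sort by type ascending, then by x descending within each type
r <- dat[order(dat$type, -xtfrm(dat$x)), ]

# Within each type, count down from the group size to 1
r$y <- ave(seq_along(r$type), r$type, FUN = function(i) rev(seq_along(i)))
print(r)
```

ave() applies the function to each group and puts the results back in the original row order, so no separate split and recombine step is needed.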
I like Andrie's better:
dat <- read.table(text="z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4", header=TRUE)
Three lines of code:
dat <- dat[order(dat$type, -dat$x), ]  # x descending within type, to match the desired output
x <- by(dat, dat$type, nrow)
dat$y <- unlist(sapply(x, function(z) z:1))
I edited my response to address the comments Andrie made. This works, but Andrie's approach is the better route.
I have a data frame as follows,
> mydata
date station treatment subject par
A a 0 R1 1.3
A a 0 R1 1.4
A a 1 R2 1.4
A a 1 R2 1.1
A b 0 R1 1.5
A b 0 R1 1.8
A b 1 R2 2.5
A b 1 R2 9.5
B a 0 R1 0.3
B a 0 R1 8.2
B a 1 R2 7.3
B a 1 R2 0.2
B b 0 R1 9.4
B b 0 R1 3.2
B b 1 R2 3.5
B b 1 R2 2.4
....
where:
date is a factor with 2 levels A/B;
station is a factor with 2 levels a/b;
treatment is a factor with 2 levels 0/1;
subject are the replicates R1 to R20 assigned to treatment (10 to treatment 0 and 10 to treatment 1);
and
par is my parameter, which is a repeated measurement of particle size for each subject at each date and station.
What I need to do is:
divide par into 10 equal bins and count the number of values in each bin. This has to be done in subsets of mydata defined by a combination of date, station and subject. The final outcome has to be a data frame myres as follows:
> myres
date station treatment bin.centre freq
A a 0 1.2 4
A a 0 1.3 3
A a 0 1.4 2
A a 0 1.5 1
A a 1 1.2 4
A a 1 1.3 3
A a 1 1.4 2
A a 1 1.5 1
B b 0 2.3 5
B b 0 2.4 4
B b 0 2.5 3
B b 0 2.6 2
B b 1 2.3 5
B b 1 2.4 4
B b 1 2.5 3
B b 1 2.6 2
....
This is what I've done so far:
# par is assumed to hold the particle sizes (the par column of mydata)
# define the number of bins
num.bins <- 10
# define the width of each bin
bin.width <- (max(par) - min(par)) / num.bins
# define the lower and upper boundaries of each bin
bins <- seq(from = min(par), to = max(par), by = bin.width)
# define the centre of each bin
bin.centre <- c(seq(min(bins) + bin.width/2, max(bins) - bin.width/2, by = bin.width))
# create a vector to store the frequency in each bin
freq <- numeric(length(bins) - 1)
# this loop counts the particles between the lower and upper boundaries
# of each bin and stores the result in freq
for (i in 1:num.bins) {
  freq[i] <- length(which(par >= bins[i] &
                          par < bins[i+1]))
}
# create the data frame with the results
res <- data.frame(bin.centre, freq)
My first approach was to subset mydata manually, using subset(), for each combination of subject, station and date, apply the above sequence of commands to each subset, and then build the final data frame by combining each single res with rbind(). This procedure was very convoluted and prone to errors.
What I would like to do is automate the above procedure so that it calculates the binned frequency distribution for each subject. My intuition is that the best way to do this is to create a function that estimates this particle distribution and then apply it to each subject via a for loop. However, I am not sure how to do this. Any suggestions would be really appreciated.
Thanks,
Matteo.
You can do this in a few steps using the functionality in the plyr package. This allows you to split your data into the desired chunks, apply a statistic to each chunk, and combine the results.
First I set up some dummy data:
set.seed(1)
n <- 100
dat <- data.frame(
date=sample(LETTERS[1:2], n, replace=TRUE),
station=sample(letters[1:2], n, replace=TRUE),
treatment=sample(0:1, n, replace=TRUE),
subject=paste("R", sample(1:2, n, replace=TRUE), sep=""),
par=runif(n, 0, 5)
)
head(dat)
date station treatment subject par
1 A b 0 R2 3.2943880
2 A a 0 R1 0.9253498
3 B a 1 R1 4.7718907
4 B b 0 R1 4.4892425
5 A b 0 R1 4.7184853
6 B a 1 R2 3.6184538
Now I use the base function cut to divide your par into 10 equal-width bins:
dat$bin <- cut(dat$par, breaks=10)
Now for the fun bit. Load package plyr and use the function ddply to split, apply and combine. Because you want a frequency count, we can use the function length to count the number of observations that fall in each bin for every combination of date, station and treatment:
library(plyr)
res <- ddply(dat, .(date, station, treatment, bin),
summarise, freq=length(treatment))
head(res)
date station treatment bin freq
1 A a 0 (0.00422,0.501] 1
2 A a 0 (0.501,0.998] 2
3 A a 0 (1.5,1.99] 4
4 A a 0 (1.99,2.49] 2
5 A a 0 (2.49,2.99] 2
6 A a 0 (2.99,3.48] 1
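The question asks for a bin.centre column rather than interval labels. A minimal base R sketch of one way to get that is shown below; it swaps ddply for table() and computes the breaks explicitly, so the bins differ slightly from those cut(breaks=10) picks on its own. The names breaks and bin.centre, and the use of table(), are choices made here for illustration, not part of the original answer.

```r
set.seed(1)
n <- 100
dat <- data.frame(
  date = sample(LETTERS[1:2], n, replace = TRUE),
  station = sample(letters[1:2], n, replace = TRUE),
  treatment = sample(0:1, n, replace = TRUE),
  par = runif(n, 0, 5)
)

# Ten equal-width bins spanning the observed range of par
breaks <- seq(min(dat$par), max(dat$par), length.out = 11)
bin.centre <- head(breaks, -1) + diff(breaks) / 2

# labels = FALSE returns the bin number (1 to 10) instead of an interval label
dat$bin <- cut(dat$par, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Count observations per date/station/treatment/bin combination
res <- as.data.frame(
  table(date = dat$date, station = dat$station,
        treatment = dat$treatment, bin = dat$bin),
  responseName = "freq", stringsAsFactors = FALSE
)
res <- res[res$freq > 0, ]                        # drop empty combinations
res$bin.centre <- bin.centre[as.integer(res$bin)] # map bin number to its centre
head(res)
```

Adding subject to the table() call would give the per-subject counts described in the question; it is left out here only to mirror the grouping used in the ddply call.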