I'm new to R and I'm trying to sum two columns of a dataframe, but only in the rows where both elements to be summed satisfy a given condition. To make things clear, what I want to do is:
> t.d<-as.data.frame(matrix(1:9,ncol=3))
> t.d
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
> t.d$V4<-rep(0,nrow(t.d))
> for (i in 1:nrow(t.d)){
+ if (t.d$V1[i]>1 && t.d$V3[i]<9){
+ t.d$V4[i]<-t.d$V1[i]+t.d$V3[i]}
+ }
> t.d
V1 V2 V3 V4
1 1 4 7 0
2 2 5 8 10
3 3 6 9 0
I need efficient code, as my real dataframe has about 150000 rows and 200 columns. This attempt gives an error:
t.d$V4<-t.d$V1[t.d$V1>1]+ t.d$V3[t.d$V3>9]
Is "apply" an option? I tried this:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d$V4<-rep(0,nrow(t.d))
my.fun<-function(x,y){
if(x>1 && y<9){
x+y}
}
t.d$V4<-apply(X=t.d,MAR=1,FUN=my.fun,x=t.d$V1,y=t.d$V3)
but it gives an error as well.
Thanks very much for your help.
This operation doesn't require loops, apply statements or if statements. Vectorised operations and subsetting are all you need:
t.d <- within(t.d, V4 <- V1 + V3)
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0
t.d
V1 V2 V3 V4
1 1 4 7 0
2 2 5 8 10
3 3 6 9 0
Why does this work?
In the first step I create a new column that is the straight sum of columns V1 and V3. I use within as a convenient way of referring to the columns of t.d without having to write t.d$ all the time.
In the second step I subset all of the rows that don't fulfill your conditions and set V4 for these to 0.
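As a quick sanity check (a sketch on the toy data from the question, not a benchmark), the two vectorised steps can be compared against the original loop. The names loop.d and vec.d are just placeholders I made up:

```r
# Toy data from the question
t.d <- as.data.frame(matrix(1:9, ncol = 3))

# Reference result: the original row-by-row loop
loop.d <- t.d
loop.d$V4 <- rep(0, nrow(loop.d))
for (i in seq_len(nrow(loop.d))) {
  if (loop.d$V1[i] > 1 && loop.d$V3[i] < 9) {
    loop.d$V4[i] <- loop.d$V1[i] + loop.d$V3[i]
  }
}

# Vectorised version: sum every row, then zero out the rows that fail the test
vec.d <- within(t.d, V4 <- V1 + V3)
vec.d[!(vec.d$V1 > 1 & vec.d$V3 < 9), "V4"] <- 0

# Both approaches should give the same V4 column
all.equal(loop.d$V4, vec.d$V4)
```

Note the single & in the vectorised test: it compares element by element, whereas && is meant for a single TRUE/FALSE as in the loop.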
ifelse is your friend here:
t.d$V4<-ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0)
I'll chip in and provide yet another version. Since you want zero if the condition doesn't match, and TRUE/FALSE are glorified versions of 1/0, simply multiplying by the condition also works:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d <- within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9))
...and it happens to be faster than the other solutions ;-)
t.d <- data.frame(V1=runif(2e7, 1, 2), V2=1:2e7, V3=runif(2e7, 5, 10))
system.time( within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9)) ) # 3.06 seconds
system.time( ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0) ) # 5.08 seconds
system.time( { t.d <- within(t.d, V4 <- V1 + V3);
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0 } ) # 4.50 seconds
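The multiplication trick relies on R coercing logicals to numbers in arithmetic. Here is a minimal sketch of that coercion on the toy data, plus one caveat I'm aware of: a non-finite or missing sum multiplied by FALSE does not give 0. The names keep, by.mult and by.ifelse are mine.

```r
# TRUE/FALSE become 1/0 when used in arithmetic
c(TRUE, FALSE, TRUE) * 5

t.d <- as.data.frame(matrix(1:9, ncol = 3))
keep <- t.d$V1 > 1 & t.d$V3 < 9   # only the second row passes

by.mult   <- (t.d$V1 + t.d$V3) * keep
by.ifelse <- ifelse(keep, t.d$V1 + t.d$V3, 0)
all.equal(by.mult, by.ifelse)

# Caveat: Inf or NA sums are not zeroed by multiplying with FALSE
Inf * FALSE        # NaN
NA_real_ * FALSE   # NA
```

So if your real columns can contain NA or Inf, the ifelse or subsetting versions are probably the safer choice, even if they are a bit slower.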
I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between adjacent columns, so that the dataframe is transformed to:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the resulting values by taking the difference between adjacent columns. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
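An alternative sketch that avoids apply and the transpose: convert the cumulative columns to a matrix once and subtract each column from its right-hand neighbour. The variable names cols and m are mine, and this assumes all the selected columns are numeric.

```r
x <- read.table(header = TRUE, text = "
id v1 v2 v3
1 4 5 9
2 1 1 4")

cols <- c("v1", "v2", "v3")   # the cumulative columns, in order
m <- as.matrix(x[, cols])

# Column j minus column j-1 for j = 2..ncol(m); the first column stays as is
x[, cols[-1]] <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]
x
```

With around 150 columns you would build cols programmatically (for example from names(x)) rather than typing it out.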
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for(i in 3:ncol(df)){
df2[,i] <- df[,i] - df[,i-1]
}
I want to pad an existing vector with NA so that it has length n. I know I can pad at the end of the vector like so:
v1 <- 1:10
v2 <- diff(v1)
length(v2) <- length(v1)
v2
# 1 1 1 1 1 1 1 1 1 NA
But I want to put the NA at the beginning instead, in a generic way. I mean, for this particular example I can just do
v2 <- c(NA, diff(v1))
# NA 1 1 1 1 1 1 1 1 1
But I was hoping that there exist some base R function or library that provides something like v2 <- pad(v2, n=length(v1), value=NA)
Is there anything like that I can use off the shelf, or do I need to define my own function:
pad <- function(x, n) { # ugly function that doesn't keep the attributes of x
len.diff <- n - length(x)
c(rep(NA, len.diff), x)
}
pad(1:10, 12) # NA NA 1 2 3 4 5 6 7 8 9 10
Assuming v1 has the desired length and v2 is shorter (or the same length) these left pad v2 with NA values to the length of v1. The first four assume numeric vectors although they can be modified to also work more generally by replacing NA*v1 in the code with rep(NA, length(v1)).
replace(NA * v1, seq(to = length(v1), length = length(v2)), v2)
rev(replace(NA * v1, seq_along(v2), rev(v2)))
replace(NA * v1, seq_along(v2) + length(v1) - length(v2), v2)
tail(c(NA * v1, v2), length(v1))
c(rep(NA, length(v1) - length(v2)), v2)
The fourth is the shortest. The first two and fourth do not involve any explicit arithmetic calculations other than multiplying v1 with NA values. The second is likely slow since it involves two applications of rev.
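If you'd rather have the pad(x, n, value) interface from the question, the fifth expression wraps up easily. This is only a sketch: pad_left is a made-up name, not a base R function, and like the original it does not try to preserve the attributes of x.

```r
# Hypothetical helper: left-pad x with `value` up to length n
pad_left <- function(x, n, value = NA) {
  len.diff <- n - length(x)
  if (len.diff < 0) stop("x is already longer than n")
  c(rep(value, len.diff), x)
}

v1 <- 1:10
pad_left(diff(v1), length(v1))   # NA 1 1 1 1 1 1 1 1 1
pad_left(1:10, 12)               # NA NA 1 2 3 4 5 6 7 8 9 10
```

The explicit length check is my addition; without it rep() would fail with a less helpful message when x is longer than n.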
One option is diff from zoo, which also has the na.pad argument:
library(zoo)
as.vector(diff(zoo(v1), na.pad=TRUE))
#[1] NA 1 1 1 1 1 1 1 1 1
Defining nrValues as the number of NA elements you want at the start of v2, you could use:
n <- length(v1)
v2 <- c(rep(NA,nrValues),v1[nrValues:n])
I'm not familiar with a function that does this, so if you intend to do it multiple times I would create your own function.
I am analyzing SNP data for a fungus, and I am trying to impute the missing data by changing the Ns to the genotype of the more frequent allele (see below).
newdata is a matrix of my SNPs (rows) and fungal isolates (columns). The genotypes for each SNP are in the 0, 1, and N format, and that is why I am trying to impute the missing genotypes.
newdata_imputed=newdata
for (k in 1:nrow(newdata)){
u=newdata[k,]
x<-sum(u==0)
y<-sum(u==1)
all_freq=y/(x+y)
if (all_freq<0.5){
newdata_imputed[k,]=gsub("N",0,u)
} else{newdata_imputed[k,]=gsub("N",1,u)}
print(k)
}
However, I keep getting this error:
[1] 295
[1] 296
Error in if (all_freq < 0.5) { : missing value where TRUE/FALSE needed
It is obvious that the code runs but stops after encountering a problem. Please, can someone tell me what I am doing wrong? I am a newbie to R, and any advice would be greatly appreciated.
@akrun, the reason I used a for loop is that it is nested in another for loop. After using your code:
newdata=as.data.frame(newdata)
u=newdata
all_freq <- rowSums(u==1)/rowSums((u==1)|(u==0))
indx <- all_freq < 0.5
indx1 <- indx & !is.na(indx)
indx2 <- !indx & !is.na(indx)
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
newdata[] <- lapply(newdata, as.numeric)
I got weird values:
newdata[1:10,1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 3 3 3 3 3 3 3 3 3
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 1 1 1 1 1 1 1 1 1 1
Where is the "3" coming from? I should only have 0 or 1.
We could do this using rowSums. As @bergant and @MatthewLundberg mentioned in the comments, if there are rows with no 0 or 1 elements, we get NaN based on the calculation. One way would be to modify the logical condition by including !is.na, i.e. elements that are not NA along with the previous condition.
#using `rowSums` to create the all_freq vector
all_freq <- rowSums(newdata==1)/rowSums((newdata==1)|(newdata==0))
#Create a logical index based on elements that are less than 0.5
indx <- all_freq < 0.5
#The NA elements can be changed to FALSE by adding another condition
indx1 <- indx & !is.na(indx)
#similarly for elements that are >= 0.5
indx2 <- !indx & !is.na(indx)
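To see where the original error comes from, here is a tiny sketch with a made-up row that contains only N values (the vector u is hypothetical, not taken from the real data):

```r
u <- c("N", "N", "N")      # a SNP with no called genotypes
x <- sum(u == 0)           # 0
y <- sum(u == 1)           # 0
all_freq <- y / (x + y)    # 0/0 gives NaN
all_freq < 0.5             # NA, which if() cannot handle

# Guarding the comparison, in the same spirit as !is.na(indx)
isTRUE(all_freq < 0.5)     # FALSE, so such a row can simply be skipped
```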
Now, we subset the rows of the 'newdata' with 'indx1', loop through the columns (lapply) and use gsub with pattern and replacement arguments and assign the output back to the subset of 'newdata'.
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
Similarly, we can do the replacement for the rows where 'all_freq' is greater than or equal to 0.5
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
The gsub output columns are character class, which can be converted back to numeric (if needed).
newdata[] <- lapply(newdata, as.numeric)
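One thing to watch, and my guess at where unexpected values such as 3 can come from: if the columns are factors rather than character (as far as I know, as.data.frame on a character matrix produced factors by default before R 4.0.0), as.numeric returns the internal level codes, not the labels. A small sketch with a made-up factor f:

```r
f <- factor(c("0", "1", "N", "1"))

as.numeric(f)                                  # level codes: 1 2 3 2
suppressWarnings(as.numeric(as.character(f)))  # labels: 0 1 NA 1
```

Creating the data frame with stringsAsFactors=FALSE, or converting with as.character first, should avoid this.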
data
set.seed(24)
newdata <- as.data.frame(matrix(sample(c(0:1, "N"), 10*4, replace=TRUE),
ncol=4), stringsAsFactors=FALSE)
newdata[7,] <- 2
There is probably a basic fix for this, but being new to R, I have been unsuccessful in finding it.
I have two variables, V1 (POSIXct) and V2 (numeric). I would like to add (10-V2) seconds to V1 if V2!=0
df <- data.frame(V1=c(970068340, 970068350, 970068366, 970068376, 970068380,
970068394), V2= c(0,0,6,6,0,4))
I've attempted the following loop, but with more than 2 million observations, it takes much too long to execute. Is there an efficient solution to this problem?
for(i in 1:length(df$V2)) {
if (df$V2[i] != 0){
df$V1[i] = df$V1[i] + (10-df$V2[i])
}
}
For clarification, the data look like this:
V1 V2
970068340 0
970068350 0
970068356 6
970068366 6
970068370 0
970068384 4
and I would like to transform it to the following:
V1 V2
970068340 0
970068350 0
970068360 6
970068370 6
970068370 0
970068390 4
I'd use [ to subset and [<- to replace. You can do this with entirely vectorised operations (even though it looks a little untidy). Without using data.table I would reckon this would be the fastest way in base R...
rows <- df$V2 != 0
df[ rows , "V1" ] <- df[ rows , "V1" ] + 10 - df[ rows , "V2" ]
# V1 V2
#1 970068340 0
#2 970068350 0
#3 970068370 6
#4 970068380 6
#5 970068380 0
#6 970068400 4
Another option is:
transform(df,V1=V1+(10-V2)*as.logical(V2))
V1 V2
1 970068340 0
2 970068350 0
3 970068370 6
4 970068380 6
5 970068380 0
6 970068400 4
df$V1 = with(df, {V1 + ifelse(V2!=0,10-V2,0)})
library(data.table)
dt = data.table(df)
dt[V2 != 0, V1 := V1 + 10 - V2]
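For what it's worth, here is a sketch checking that the base R variants agree on the sample df, and that the subsetting approach also works when V1 really is POSIXct (adding a number to a POSIXct adds seconds). The UTC time zone and the names a, b, d and tt are my own choices for the illustration.

```r
df <- data.frame(V1 = c(970068340, 970068350, 970068366, 970068376, 970068380,
                        970068394), V2 = c(0, 0, 6, 6, 0, 4))

# Three base R variants on plain numbers
rows <- df$V2 != 0
a <- df$V1
a[rows] <- a[rows] + 10 - df$V2[rows]
b <- df$V1 + (10 - df$V2) * as.logical(df$V2)
d <- df$V1 + ifelse(df$V2 != 0, 10 - df$V2, 0)
all.equal(a, b)
all.equal(a, d)

# Same idea with V1 stored as POSIXct
tt <- as.POSIXct(df$V1, origin = "1970-01-01", tz = "UTC")
tt[rows] <- tt[rows] + (10 - df$V2[rows])
all.equal(as.numeric(tt), a)
```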
Suppose I have a dataframe that looks like this:
v1 v2 v3 v4 v5 v6
r1 1 2 2 4 5 9
r2 1 2 2 4 5 10
r3 1 2 2 4 5 7
r4 1 2 2 4 5 12
r5 2 2 2 4 5 9
r6 2 2 2 4 5 10
I would like to get the row with the highest value in v6 that has the value 1 in v1.
I know how to get all rows where v1 = 1 and select the first row of that, thanks to this answer to a previous question:
ddply( df , .variables = "v1" , .fun = function(x) x[1,] )
How can I change the function so that I get the row with the highest value in v6?
From the previous results, I'd use [ to subset on your first condition using logical comparators and then do a second subset on your second condition because, as @sgibb points out in the comments, the max value of v6 might not be in a row where v1 == 1.
# Subset to those rows where v1 == 1
tmp <- df[ df$v1 == 1 , ]
# Then select those rows where the max value of v6 appears
tmp[ tmp$v6 == max( tmp$v6 ) , ]
If you want only the first occurrence, use which.max().
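A sketch of the which.max() variant on the example data (I rebuilt df by hand from the table in the question):

```r
df <- data.frame(v1 = c(1, 1, 1, 1, 2, 2), v2 = 2, v3 = 2, v4 = 4, v5 = 5,
                 v6 = c(9, 10, 7, 12, 9, 10),
                 row.names = paste0("r", 1:6))

tmp <- df[df$v1 == 1, ]            # rows where v1 == 1
res <- tmp[which.max(tmp$v6), ]    # first row with the largest v6
res                                # row r4, where v6 is 12
```

Unlike the == max() version, this always returns exactly one row, even when the maximum is tied.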
We could also use the subset function, like
x_sub= subset(x, state == "C" & chainlength == 5 & segment == "C2C_REG")
where x is the data frame and the second argument is a logical expression selecting the rows to keep.