I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to take the difference between adjacent columns, so that the dataframe is transformed into:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking the differences. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
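To make the two points about apply concrete, here is a minimal sketch. The data frame d is made up for illustration (three rows, so the transposition is visible in the dimensions); it is not the asker's data.

```r
# Hypothetical data: one character column plus three cumulative columns
d <- data.frame(id = c("a", "b", "c"),
                v1 = c(4, 1, 2), v2 = c(5, 1, 3), v3 = c(9, 4, 7))

# Passing every column: apply() converts d to a character matrix first
m_all <- apply(d, 1, function(r) r)
class(m_all[1, 1])   # "character"

# Passing only the numeric columns keeps the values numeric
m <- apply(d[, c("v1", "v2", "v3")], 1, diff)
dim(m)               # 2 3: one column per input row, hence the t()
t(m)                 # back to one row per input row
```

With three input rows and two differences per row, the result has 2 rows and 3 columns, which is why it needs transposing before it can be assigned back.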
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
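df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1), v3 = c(9,4))
df2 <- df
# from the third column on, subtract the column to its left (read from the untouched df)
for(i in 3:ncol(df)){
  df2[,i] <- df[,i] - df[,i-1]
}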
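The same subtraction can probably be done without a loop by subtracting two shifted blocks of columns. This is only a sketch: it assumes every column except id is cumulative and that the columns are already in order. The name cols is introduced here for illustration.

```r
x <- data.frame(id = c(1, 2), v1 = c(4, 1), v2 = c(5, 1), v3 = c(9, 4))

# Assumption: all columns other than "id" are cumulative, in left-to-right order
cols <- setdiff(names(x), "id")

# Each column minus its left neighbour; the first cumulative column stays as is
x[cols[-1]] <- x[cols[-1]] - x[cols[-length(cols)]]
x
#   id v1 v2 v3
# 1  1  4  1  4
# 2  2  1  0  3
```

The right-hand side is evaluated in full before the assignment, so every difference is taken against the original cumulative values.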
Suppose that one has the following dataframe:
x=data.frame(c(1,1,2,2,2,3),c("A","A","B","B","B","B"))
names(x)=c("v1","v2")
x
v1 v2
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 3 B
In this dataframe, I want each value in v1 to correspond to a single label in v2. However, as this example shows, B has more than one corresponding value.
Is there an elegant and fast way to find which labels in v2 correspond to more than one value in v1?
Ideally, the result would show the values, which in our example should be c(2,3), as well as the row numbers, which in our example should be r=c(5,6).
Assuming that we want the index of the unique elements in 'v1' grouped by 'v2', keeping only groups that have more than one unique element, we create a logical index with ave and use that to subset the rows of 'x'.
i1 <- with(x, ave(v1, v2, FUN = function(x)
length(unique(x))>1 & !duplicated(x, fromLast=TRUE)))!=0
x[i1,]
# v1 v2
#5 2 B
#6 3 B
A faster option is data.table:
library(data.table)
i1 <- setDT(x)[, .I[uniqueN(v1)>1 & !duplicated(v1, fromLast=TRUE)], v2]$V1
x[i1, 'v1', with = FALSE][, rn := i1][]
# v1 rn
#1: 2 5
#2: 3 6
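If only the offending labels are needed, rather than the rows, a base R sketch with tapply may be enough. The names n_unique and ambiguous are introduced here for illustration.

```r
x <- data.frame(v1 = c(1, 1, 2, 2, 2, 3),
                v2 = c("A", "A", "B", "B", "B", "B"))

# Number of distinct v1 values within each v2 label
n_unique <- tapply(x$v1, x$v2, function(v) length(unique(v)))

# Labels that map to more than one v1 value
ambiguous <- names(n_unique)[n_unique > 1]
ambiguous
# [1] "B"
```

This only reports the label; the ave or data.table code above is still needed to get the row numbers.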
I want to pad an existing vector to size n using NA. I know I can pad at the end of the vector like so:
v1 <- 1:10
v2 <- diff(v1)
length(v2) <- length(v1)
v2
# 1 1 1 1 1 1 1 1 1 NA
But I want to fill the NA at the beginning instead, in a generic way. For this particular example I can just do
v2 <- c(NA, diff(v1))
# NA 1 1 1 1 1 1 1 1 1
But I was hoping that there exists some base R function or library that provides something like v2 <- pad(v2, n=length(v1), value=NA)
Is there anything like that I can use off the shelf, or do I need to define my own function:
pad <- function(x, n) { # ugly function that doesn't keep the attributes of x
len.diff <- n - length(x)
c(rep(NA, len.diff), x)
}
pad(1:10, 12) # NA NA 1 2 3 4 5 6 7 8 9 10
Assuming v1 has the desired length and v2 is shorter (or the same length), each of these left-pads v2 with NA values to the length of v1. The first four assume numeric vectors, although they can be modified to work more generally by replacing NA*v1 in the code with rep(NA, length(v1)).
replace(NA * v1, seq(to = length(v1), length = length(v2)), v2)
rev(replace(NA * v1, seq_along(v2), rev(v2)))
replace(NA * v1, seq_along(v2) + length(v1) - length(v2), v2)
tail(c(NA * v1, v2), length(v1))
c(rep(NA, length(v1) - length(v2)), v2)
The fourth is the shortest. The first two and fourth do not involve any explicit arithmetic calculations other than multiplying v1 with NA values. The second is likely slow since it involves two applications of rev.
One option is diff from zoo, which also has the na.pad argument:
library(zoo)
as.vector(diff(zoo(v1), na.pad=TRUE))
#[1] NA 1 1 1 1 1 1 1 1 1
Defining nrValues as the number of NA elements you want at the start of v2, you could use:
n <- length(v1)
v2 <- c(rep(NA,nrValues),v1[(nrValues+1):n])
I'm not familiar with a function that does this, so if you intend to do it multiple times I would create your own function.
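If you do end up rolling your own, a minimal sketch could look like the following. pad_left is a hypothetical name, not an existing base R function; the value argument mirrors the pad(v2, n, value) call wished for in the question.

```r
# Hypothetical helper: left-pad x with `value` until it has length n
pad_left <- function(x, n, value = NA) {
  stopifnot(n >= length(x))          # refuse to truncate
  c(rep(value, n - length(x)), x)
}

v1 <- 1:10
pad_left(diff(v1), length(v1))
#  [1] NA  1  1  1  1  1  1  1  1  1
pad_left(1:10, 12)
#  [1] NA NA  1  2  3  4  5  6  7  8  9 10
```

Like the version in the question, this goes through c() and therefore drops attributes of x other than names.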
I am analyzing SNP data for a fungus, and I am trying to impute the missing data by changing the Ns to the genotype of the more frequent allele (see below).
newdata is a matrix of my SNPs (rows) and fungal isolates (columns). The genotypes for each SNP are in the 0, 1, and N format, and that is why I am trying to impute the missing genotypes.
newdata_imputed=newdata
for (k in 1:nrow(newdata)){
u=newdata[k,]
x<-sum(u==0)
y<-sum(u==1)
all_freq=y/(x+y)
if (all_freq<0.5){
newdata_imputed[k,]=gsub("N",0,u)
} else{newdata_imputed[k,]=gsub("N",1,u)}
print(k)
}
However, I keep getting this error:
[1] 295
[1] 296
Error in if (all_freq < 0.5) { : missing value where TRUE/FALSE needed
It is obvious that the code runs but stops after encountering a problem. Please, can someone tell me what I am doing wrong? I am a newbie to R, and any advice would be greatly appreciated.
@akrun, the reason I used a for loop is that it is nested in another for loop. After using your code:
newdata=as.data.frame(newdata)
u=newdata
all_freq <- rowSums(u==1)/rowSums((u==1)|(u==0))
indx <- all_freq < 0.5
indx1 <- indx & !is.na(indx)
indx2 <- !indx & !is.na(indx)
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
newdata[] <- lapply(newdata, as.numeric)
I got weird values
newdata[1:10,1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 3 3 3 3 3 3 3 3 3
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 1 1 1 1 1 1 1 1 1 1
Where is the "3" coming from? I should only have 0 or 1.
We could do this using rowSums. As @bergant and @MatthewLundberg mentioned in the comments, if there are rows with no 0 or 1 elements, we get NaN from the calculation. One way around this is to modify the logical condition by including !is.na, i.e. requiring that the element is not NA in addition to the previous condition.
#using `rowSums` to create the all_freq vector
all_freq <- rowSums(newdata==1)/rowSums((newdata==1)|(newdata==0))
#Create a logical index based on elements that are less than 0.5
indx <- all_freq < 0.5
#The NA elements can be changed to FALSE by adding another condition
indx1 <- indx & !is.na(indx)
#similarly for elements that are >= 0.5
indx2 <- !indx & !is.na(indx)
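To see why the extra !is.na condition matters, here is a small sketch with a made-up row that contains only "N". The vector u is hypothetical and stands in for one row of newdata.

```r
# Hypothetical row with no 0 or 1 genotypes at all
u <- c("N", "N", "N")

all_freq <- sum(u == 1) / (sum(u == 0) + sum(u == 1))
all_freq          # NaN, because 0 / 0
all_freq < 0.5    # NA, which if() cannot use as a condition
```

This matches the error in the question: if() stops with "missing value where TRUE/FALSE needed" as soon as the comparison yields NA.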
Now, we subset the rows of the 'newdata' with 'indx1', loop through the columns (lapply) and use gsub with pattern and replacement arguments and assign the output back to the subset of 'newdata'.
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
Similarly, we can do the replacement for the rows where 'all_freq' is greater than or equal to 0.5
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
The gsub output columns are character class, which can be converted back to numeric (if needed).
newdata[] <- lapply(newdata, as.numeric)
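One caveat, which would also explain the unexpected 3 in the follow-up: in R versions before 4.0.0, as.data.frame() turned character columns into factors by default, and as.numeric() on a factor returns the internal level codes rather than the labels. This is a sketch with a made-up factor f, not the asker's data.

```r
# Hypothetical factor column, as as.data.frame() would have produced before R 4.0.0
f <- factor(c("1", "0", "N", "0"))
levels(f)                     # "0" "1" "N"

as.numeric(f)                 # level codes, not the genotypes
# [1] 2 1 3 1

# Going through character first recovers the labels; a leftover "N" becomes NA
suppressWarnings(as.numeric(as.character(f)))
# [1]  1  0 NA  0
```

That is why the example data below is built with stringsAsFactors=FALSE.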
data
set.seed(24)
newdata <- as.data.frame(matrix(sample(c(0:1, "N"), 10*4, replace=TRUE),
ncol=4), stringsAsFactors=FALSE)
newdata[7,] <- 2
I have several vectors that look like this:
v1 <- c(1,2,4)
v2 <- c(3,5,8)
v3 <- c(4)
This is just a small sample of them. I'm trying to figure out a way to add values to each of them to make them all consecutive vectors. So that at the end, they look like this:
v1 <- c(1,2,3,4)
v2 <- c(1,2,3,4,5,6,7,8)
v3 <- c(1,2,3,4)
So "3" is added to the first vector, "1","2","4","6","7" are added to the second, and so forth. I have several hundred vectors that look like this, so I'm trying to figure out a solution that would scale/be automated.
You can use seq and max
seq(max(v1))
For multiple vectors, we can loop
lapply(mget(paste0('v',1:3)), function(x) seq(max(x)))
#$v1
#[1] 1 2 3 4
#$v2
#[1] 1 2 3 4 5 6 7 8
#$v3
#[1] 1 2 3 4
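With several hundred vectors it may be easier to keep them in a named list from the start instead of collecting them with mget. A sketch, assuming the vectors contain positive integers; the names vecs, filled and added are introduced here for illustration.

```r
# Hypothetical: the vectors kept in a named list rather than as separate objects
vecs <- list(v1 = c(1, 2, 4), v2 = c(3, 5, 8), v3 = c(4))

# Fill each one up to its maximum
filled <- lapply(vecs, function(x) seq_len(max(x)))

# The values that had to be added to each vector
added <- Map(setdiff, filled, vecs)
added$v2
# [1] 1 2 4 6 7
```

seq_len is used instead of seq only so that the intent (count from 1 up to the maximum) is explicit.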
I'm trying to resample the elements of a data frame. I'm open to using other data structures if recommended, but my understanding is that a DF would be better for combining strings, numbers, etc.
Let's say my input is this data frame:
16 x y z 2
11 a b c 1
.........
And I'd like to build as output another data structure (I take it, another df) like this:
16 x y z
16 x y z
11 a b c
.........
I guess my main issue is the way to append the content, which is on columns df[,1:4].
Thanks in advance, p.
It's unclear from your description, but your desired output implies that you want to duplicate columns 1:4 according to the count in column 5. If so, this should do the job:
df[rep(seq_len(nrow(df)), df[, 5]), -5]
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
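The trick is in the row index. A short sketch that builds the example data by hand (column names V1 to V5 are assumed, as in the printed output) and shows the intermediate index:

```r
# Example data reconstructed from the question; column names are assumed
df <- data.frame(V1 = c(16, 11), V2 = c("x", "a"), V3 = c("y", "b"),
                 V4 = c("z", "c"), V5 = c(2, 1))

# Row i is repeated V5[i] times
idx <- rep(seq_len(nrow(df)), df$V5)
idx
# [1] 1 1 2

out <- df[idx, -5]
rownames(out) <- NULL   # optional: drop the 1, 1.1, 2 row names
out
```

Resetting the row names is optional; it only replaces the 1.1-style names that R generates for repeated rows.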
Assuming you're starting with something like:
mydf
# V1 V2 V3 V4 V5
# 1 16 x y z 2
# 2 11 a b c 1
Then, you can just use expandRows from my "splitstackshape" package, like this:
library(splitstackshape)
expandRows(mydf, count = "V5")
# V1 V2 V3 V4
# 1 16 x y z
# 1.1 16 x y z
# 2 11 a b c
By default, the function assumes that you are expanding your dataset based on an existing column, but you can just as easily add a numeric vector as the count argument, and set count.is.col = FALSE.
If you want to sample n rows with replacement from the data frame df:
df[sample(nrow(df), n, replace=TRUE), ]
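A small runnable sketch of that call. The data frame, the seed and n = 5 are arbitrary choices for illustration.

```r
set.seed(1)   # arbitrary seed, only so the draw is reproducible
df <- data.frame(V1 = c(16, 11), V2 = c("x", "a"))
n <- 5

# Draw n row indices with replacement, then index the data frame with them
resampled <- df[sample(nrow(df), n, replace = TRUE), ]
nrow(resampled)
# [1] 5
```

Because the rows are drawn with replacement, n can be larger than nrow(df).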