I have a data frame column
df$col1=(1,2,3,4,5,6,7,8,9,...,500000)
and a vector
vc<-c(1,2,4,5,7,8,10,...,499999)
If I compare the two vectors, the second one has some missing values. How can I insert 0s in the places of the missing values? E.g., I want the second vector to be
vc<-c(1,2,0,4,5,0,7,8,9,10,...,499999,0)
You could use match and replace (thanks to @RonakShah)
Input
vc <- c(1,2,4,5,7,8,10)
x <- 1:15
Result
out <- replace(tmp <- vc[match(x, vc)], is.na(tmp), 0L)
out
# [1] 1 2 0 4 5 0 7 8 0 10 0 0 0 0 0
You could use the larger vector, which contains all values, as a template, and then assign zero to any value that does not appear in the smaller second vector:
v_out <- df$col1
v_out[!(v_out %in% vc)] <- 0
v_out
[1] 1 2 0 4 5 0 7 8 0 10
Data
df$col1 <- c(1:10)
vc <- c(1,2,4,5,7,8,10)
A more cryptic, but maybe faster, one-liner alternative (using Tim's data):
`[<-`(numeric(max(df$col1)),vc,vc)
#[1] 1 2 0 4 5 0 7 8 0 10
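For readers puzzled by the backtick notation: `[<-` is the replacement form of subsetting called as an ordinary function, so the one-liner is equivalent to this more conventional sketch (using the same data):

```r
out <- numeric(max(df$col1))  # a vector of zeros, length 10 here
out[vc] <- vc                 # write each value of vc at its own index
out
# [1] 1 2 0 4 5 0 7 8 0 10
```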
Related
I have a really large boolean vector (i.e. T or F) and I simply want to count how many "blocks" of consecutive T elements my vector contains between the F elements.
A simple example of a vector with 3 of these consecutive "blocks" of T elements:
x <- c(T,T,T,T,F,F,F,F,T,T,T,T,F,T,T)
Output:
1,1,1,1,0,0,0,0,2,2,2,2,0,3,3
You can do:
rle_a <- rle(x)  # avoid naming the object `rle`, which would shadow the function
rle_a$values <- with(rle_a, cumsum(values) * values)
inverse.rle(rle_a)
[1] 1 1 1 1 0 0 0 0 2 2 2 2 0 3 3
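To see why cumsum(values) * values labels the blocks, it helps to look at the run-length encoding itself (an illustrative trace):

```r
x <- c(T,T,T,T,F,F,F,F,T,T,T,T,F,T,T)
r <- rle(x)
r$lengths                    # 4 4 4 1 2
r$values                     # TRUE FALSE TRUE FALSE TRUE
cumsum(r$values) * r$values  # 1 0 2 0 3: block numbers for TRUE runs, 0 for FALSE runs
```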
And a simplified, more elegant version of the basic idea (proposed by @Lyngbakr):
with(rle(x), rep(cumsum(values) * values, lengths))
Another solution with rle/inverse.rle:
x <- c(T,T,T,T,F,F,F,F,T,T,T,T,F,T,T)
rle_x <- rle(x)
rle_x$values[rle_x$values] <- seq_len(sum(rle_x$values))
inverse.rle(rle_x)
# [1] 1 1 1 1 0 0 0 0 2 2 2 2 0 3 3
I have a list of three data frames that are similar (same number of columns but different numbers of rows); they were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns, and the first 6 columns are not numerical).
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above lines will run, but neither actually removes the zero-sum columns.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution through an exchange of comments with me, but I want to add the following. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things. The first was to subset each data frame in my.list. The second was to return each data frame unchanged: the final expression, x, simply returns each data frame as it is, so that is what R showed. I think the OP expected each data frame to be updated by the subsetting, but on the surface nothing appeared to change. Following that approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object in the anonymous function and return that object. Alternatively, do the following.
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, this runs a logical check on each column: if colSums(x) != 0 is TRUE, the column is kept; otherwise it is dropped. I hope this helps future readers. The result:
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
I have a data frame, which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
It is basically 0, but when there are 5 data points in a row that are all positive OR all negative, it should be 1. The run has to be unbroken (e.g. when the run is positive and a negative number appears, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried converting the whole data frame to a list (and looping over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df = data.frame(Value = sample(-9:9,1000,replace=T))
sign = sign(df$Value)
library(zoo)
rolling = rollmean(sign,k=5,fill=0,align="right")
df$B = as.numeric(abs(rolling) == 1)
- I generated 1000 values with positive and negative runs.
- Extract the sign of the values: this is -1 for negative, 1 for positive, and 0 for 0.
- Calculate the right-aligned rolling mean of 5 values (it averages x[1:5], x[2:6], ...). This will be 1 or -1 only if all 5 values in a window are positive or negative (respectively).
- Take the absolute value and store the comparison against 1. This is a logical vector that turns into the 0s and 1s your conditions require.
Note: there is no need for loops. This can all be vectorised (once the rolling mean is calculated).
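If you would rather avoid the zoo dependency, the same right-aligned rolling check can be sketched in base R with stats::filter (an illustrative alternative, not part of the original answer; it assumes the same df as above):

```r
sgn <- sign(df$Value)
# right-aligned rolling sum of the last 5 signs (sides = 1 uses only current and past values)
roll <- stats::filter(sgn, rep(1, 5), sides = 1)
# all 5 signs agree and are non-zero exactly when the rolling sum is +-5
df$B <- as.integer(!is.na(roll) & abs(roll) == 5)
```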
This will work. It is not the most efficient way to do it, but the logic is transparent: just check whether there is only one unique sign (i.e. +, -, or 0) within each window of five adjacent rows:
dat <- data.frame(Value = c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)) {
  if (length(unique(sign(dat$Value[(x-4):x]))) == 1) {
    dat$new_col[x] <- 1
  } else {
    dat$new_col[x] <- 0
  }
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)),
            FUN = function(x) as.integer(seq_along(x) > 4))
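To see what the grouping variable in this idiom looks like on its own (an illustrative trace using the same example data):

```r
d <- data.frame(Value = c(1, 2, 1, 2, 2, 3, 4, -1, 3))
# a new group starts whenever the sign changes between consecutive values
grp <- cumsum(c(0, diff(sign(d$Value)) != 0))
grp
# [1] 0 0 0 0 0 0 0 1 2
```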
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
I would like to fill a data frame ("DF") with 0s or 1s depending on whether values in a vector ("Date") match date values in a second data frame ("df$Date").
If they match, the output value has to be 1, otherwise 0.
I tried to adapt this code written by a friend of mine, but it doesn't work:
for (j in 1:length(Date)) {  # Date is a vector with all dates from 1967 to 2006
  # Start count
  count <- 0
  # Check all Dates between 1967-2006
  if (any(Date[j] == df$Date)) {  # df$Date contains specific dates of interest
    count <- count + 1
  }
  # If there is a match between Date and df$Date, the output is 1, else 0.
  DF[j, i] <- count
}
The main data frame "DF" has 190 columns, which have to be filled, and of course a number of rows equal to the length of the Date vector.
Extra info:
1) Each column is different from the others, so the observations in a row cannot all be equal (i.e. in a single row, I should have a mixture of 0s and 1s).
2) The column names in "DF" are also present in "df" as df$Code.
We can vectorize this operation with %in% and as.integer(), leveraging the fact that coercing logical to integer returns 0 for FALSE and 1 for TRUE:
Mat[,i] <- as.integer(Date%in%df$Date);
If you want to fill every single column of Mat with the exact same result vector:
Mat[] <- as.integer(Date%in%df$Date);
My above code exactly reproduces the logic of the code in your (original) question.
From your edit, I'm not 100% sure I understand the requirement, but my best guess is this:
set.seed(4L);
LV <- 10L; ND <- 10L;
Date <- sample(seq_len(ND),LV,T);
df <- data.frame(Date=sample(seq_len(ND),3L),Code=c('V1','V2','V3'));
DF <- data.frame(V1=rep(NA,LV),V2=rep(NA,LV),V3=rep(NA,LV));
Date;
## [1] 6 1 3 3 9 3 8 10 10 1
df;
## Date Code
## 1 8 V1
## 2 3 V2
## 3 1 V3
for (cn in colnames(DF)) DF[,cn] <- as.integer(Date%in%df$Date[df$Code==cn]);
DF;
## V1 V2 V3
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 1 0
## 5 0 0 0
## 6 0 1 0
## 7 1 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 1
I am searching for a way to create a mask with which I can remove some data (e.g. rows in a data.frame) depending on some criteria, e.g.:
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
mask <- a == 0
mask
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
In my actual problem this cut is too harsh; I would like a smoother transition. The idea: I want to include some zeros before the non-zeros, and also keep some zeros after the non-zeros. Simple approach: given this vector, I would like to switch every TRUE adjacent to a FALSE into FALSE, which adds an overlapping tolerance region to the data. So instead of
a[!mask]
[1] 3 5 6 3 4 5 8 5
I would rather have something like
a[!mask]
[1] 0 3 5 6 3 0 0 4 5 8 5 0
or (increasing the size of the tolerance window)
a[!mask]
[1] 0 0 3 5 6 3 0 0 0 4 5 8 5 0 0
In the last case the three zeros in the middle arise because the tolerance regions from the left and from the right start overlapping. My question: does anyone have a good approach to writing a function that creates such a mask with overlapping tolerance?
[EDIT] It took me some time to realise the error in my initial question (thanks @tospig): in my initial post I got the number of zeros in the middle part completely wrong. For clarification: in the case of a tolerance window of 1, there really should be two zeros in the middle, one from the right bunch of valid data and one from the left bunch. Sorry for the confusion!
So, despite the really cool approach from @tospig (which I have to keep in mind), the solution from @agenis solves my problem perfectly!
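The TRUE-flipping idea from the question can also be sketched directly in base R, without any packages (an illustrative sketch, not one of the posted answers; n is the tolerance window):

```r
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
mask <- a == 0
n <- 1  # tolerance window
# a TRUE survives only if every position within distance n of it is also TRUE
keep <- mask
for (k in seq_len(n)) {
  keep <- keep & c(rep(TRUE, k), head(mask, -k)) & c(tail(mask, -k), rep(TRUE, k))
}
a[!keep]
# [1] 0 3 5 6 3 0 0 4 5 8 5 0
```

This keeps one zero on each side of every non-zero run, so increasing n widens the tolerance window.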
I think I would go with a classic moving average of order 3, which simply expands the non-zeros by one to the left and one to the right. As simple as that. You just have to decide what to do with the first and last points of your vector, which are turned into NA (in my example I make them zeros).
And you have your desired result (for a bigger mask, take order 5 instead of 3):
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
library(forecast)
a.ma <- ma(a, 3)
a.ma[is.na(a.ma)] <- 0
mask <- a.ma == 0
a[!mask]
#### [1] 0 3 5 6 3 0 0 4 5 8 5 0
Then you can easily transform this piece of code into a function.
[EDIT] This method does not ensure conservation of the total number of zeros (see the additional comments clarifying the OP's initial question).
We can try
library(data.table)
lst1 <- split(a[!mask],rleid(mask)[!mask])
c(0,unlist(Map(`c`, lst1, 0), use.names=FALSE))
#[1] 0 3 5 6 3 0 4 5 8 5 0
Or another option is
n <- 1
i1 <- !inverse.rle(within.list(rle(mask), {
        lengths[values] <- lengths[values] - n
        lengths[!values] <- lengths[!values] + n}))
c(a[i1], 0)
#[1] 0 3 5 6 3 0 4 5 8 5 0
Here's a solution that allows you to specify the tolerance. At the moment it doesn't 'overlap' zeros.
We can use a data.table structure (or a data.frame, but I like using data.table) and control how many zeros we want to keep between the set of positive numbers. We can specify any tolerance value, but if it's greater than a sequence of zeros, only the maximum number of consecutive zeroes will be returned.
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
library(data.table)
tolerance <- 1
dt <- data.table(id = seq(1, length(a), by = 1),
                 a = a)
## subset all the 0s, with their 'ids' for joining back on
dt_zero <- dt[a == 0]
## get the positions where the difference between values is greater than one,
## and create groups based on their length
changed <- which(c(TRUE, diff(dt_zero$id) > 1))
dt_zero$grps <- rep(changed, diff(c(changed, nrow(dt_zero) + 1)))
## we only need the 'tolerance' number of zeros
## if 'tolerance' is greater than number of entries in a group,
## it will return 'na'
dt_zero <- dt_zero[ dt_zero[ order(id) , .I[c(1:tolerance)], by=grps ]$V1, ]
## join back onto original data.table,
## and subset only relevant results
dt_zero <- dt_zero[, .(id, a)][ dt , on = "id"][(is.na(a) & i.a > 0) | a == 0]
res <- dt_zero$i.a
res
# [1] 0 3 5 6 3 0 4 5 8 5 0
## try different tolerances
tolerance <- 2
...
# 0 0 3 5 6 3 0 0 4 5 8 5 0 0
tolerance <- 6
...
# 0 0 0 3 5 6 3 0 0 0 4 5 8 5 0 0 0 0 0