Can you explain why, when I fill a vector in R using a sequence, I get this result?
sekv <- seq(from = 1, to = 20, by = 2)
test <- c()
for (j in sekv) {
  test[j] = j
}
test
[1] 1 NA 3 NA 5 NA 7 NA 9 NA 11 NA 13 NA 15 NA 17 NA 19
I want to build a vector that I can fill from a sequence and use in a loop, but containing only the values, not the NA entries. Can somebody help me?
The actual issue is that whenever you assign a value to test[j], R uses the value of j as the index and grows the vector to that length. The last value of j is 19, hence the length of test becomes 19. But you have assigned values only to a few indexes, i.e. 1, 3, 5, etc. The remaining positions are filled with NA.
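A minimal sketch of that growth behaviour, using a throwaway vector x (a name chosen here only for illustration): assigning past the current end extends the vector and pads the gap with NA.

```r
# x starts out empty (c() is NULL)
x <- c()

# assigning to position 3 grows x to length 3;
# positions 1 and 2 were never set, so they become NA
x[3] <- 10
x
# [1] NA NA 10
```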
You can solve it by running the for loop over the positions of sekv rather than over its values.
sekv <- seq(from = 1, to = 20, by = 2)
test <- c()
for (j in 1:length(sekv)) {
  test[j] = sekv[j]
}
print(test)
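sekv holds the values 1, 3, 5, ..., so using them as indexes skips positions 2, 4, 6, .... Use seq_along to index by position instead:
sekv <- seq(from = 1, to = 20, by = 2)
test <- c()
for (j in seq_along(sekv)) {
  # j is the position (1, 2, 3, ...), sekv[j] is the value stored there
  test[j] = sekv[j]
}
test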
The reason you have NAs in the test vector is that you assign only the first, third, fifth, seventh, etc. elements of test (to 1, 3, 5, 7, etc.), so the positions in between are never set.
I am still not sure what you are trying to do, but one solution is to remove all the NAs after the for loop.
sekv <- seq(from = 1, to = 20, by = 2)
test <- c()
for (j in sekv) {
  test[j] = j
}
test
# [1] 1 NA 3 NA 5 NA 7 NA 9 NA 11 NA 13 NA 15 NA 17 NA 19
# logical indexing also behaves correctly when test contains no NA at all
test <- test[!is.na(test)]
test
# [1] 1 3 5 7 9 11 13 15 17 19
As @PoGibas suggested, this also works if you want to remove NAs:
test <- na.omit(test)
test
# [1] 1 3 5 7 9 11 13 15 17 19
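If the goal is simply to avoid the NAs in the first place, one option is to allocate the result at its final length and fill it by position. This is only a sketch under the assumption that the loop body really needs to run element by element; for this particular example test <- sekv would give the same result.

```r
sekv <- seq(from = 1, to = 20, by = 2)

# allocate the result once, at its final length
test <- numeric(length(sekv))

# fill by position, so no gaps can appear
for (j in seq_along(sekv)) {
  test[j] <- sekv[j]
}
test
# [1]  1  3  5  7  9 11 13 15 17 19
```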
I want to add empty rows at specific positions of a dataframe. Let's say we have this dataframe:
df <- data.frame(var1 = c(1,2,3,4,5,6,7,8,9),
var2 = c(9,8,7,6,5,4,3,2,1))
In which I want to add an empty row after rows 1, 3 and 5 (I know that this is not best practice in most cases, ultimately I want to create a table using flextable here). These row numbers are saved in a vector:
rows <- c(1,3,5)
Now I want to use a for loop that loops through the rows vector to add an empty row after each row using add_row():
for (i in rows) {
  df <- add_row(df, .after = i)
}
The problem is that while the first iteration works flawlessly, the other empty rows get misplaced, since the dataframe obviously gets longer. To fix this I tried adding 1 to the vector after each iteration:
for (i in rows) {
  df <- add_row(df, .after = i)
  rows <- rows + 1
}
This does not work. I assume the rows vector is only evaluated once. Does anyone have any ideas?
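That assumption can be checked directly. In the sketch below, visited is a helper vector introduced only for illustration; it records the values the loop variable actually takes while rows is being modified inside the loop.

```r
rows <- c(1, 3, 5)
visited <- c()

for (i in rows) {
  visited <- c(visited, i)  # record what the loop variable was
  rows <- rows + 1          # modify the vector being looped over
}

visited  # the loop still iterated over the original values
# [1] 1 3 5
rows     # the variable itself did change
# [1] 4 6 8
```

So the sequence in for (i in rows) is fixed when the loop starts, and later changes to rows do not affect i.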
Do it all at once, no need for looping. Make a sequence of row numbers, add the new rows in, sort, then replace the duplicated row numbers with NA:
s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s,]
out[duplicated(s),] <- NA
# var1 var2
#1 1 9
#1.1 NA NA
#2 2 8
#3 3 7
#3.1 NA NA
#4 4 6
#5 5 5
#5.1 NA NA
#6 6 4
#7 7 3
#8 8 2
#9 9 1
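The same idea can be wrapped in a small helper. The function name insert_blank_rows and the choice to reset the row names are assumptions made here for illustration, not part of the original answer.

```r
# Hypothetical helper: insert an all-NA row after each position in `after`
insert_blank_rows <- function(df, after) {
  # every original row index once, plus a second copy for each `after` position
  s <- sort(c(seq_len(nrow(df)), after))
  out <- df[s, , drop = FALSE]
  # the second copy of each duplicated index becomes the blank row
  out[duplicated(s), ] <- NA
  rownames(out) <- NULL
  out
}

df <- data.frame(var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))
out <- insert_blank_rows(df, c(1, 3, 5))
which(is.na(out$var1))
# [1] 2 5 8
```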
This will be much more efficient than looping or loop-like code, for even moderately sized data:
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
  s <- sort(c(seq_len(nrow(df)), rows))
  out <- df[s,]
  out[duplicated(s),] <- NA
})
# user system elapsed
# 0.01 0.00 0.02
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
  Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df)
})
# user system elapsed
# 26.03 0.00 26.03
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
  for (i in rev(rows)) {
    df <- tibble::add_row(df, .after = i)
  }
})
# user system elapsed
# 25.05 0.00 25.04
You could achieve your result by looping in the reverse direction:
df <- data.frame(
var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1)
)
rows <- c(1, 3, 5)
for (i in rev(rows)) {
  df <- tibble::add_row(df, .after = i)
}
df
#> var1 var2
#> 1 1 9
#> 2 NA NA
#> 3 2 8
#> 4 3 7
#> 5 NA NA
#> 6 4 6
#> 7 5 5
#> 8 NA NA
#> 9 6 4
#> 10 7 3
#> 11 8 2
#> 12 9 1
I have a simple vector of 20 elements in which all values are NA.
a = rep(NA, 20)
At some specific intervals, I have assigned a few NA values to 2
a[c(1, 8, 15, 20)] = 2
Now, I want to assign 3 to every element that directly follows a 2. The following statement works, but it also adds a new element to the vector at index 21, and I do not want to increase the size of the vector.
a[which(a == 2) + 1] <- 3
Is there a way to check the limits of this vector and assign value 3 only within vector boundaries?
Another way is to leave the nth (last) element out of the comparison, so that only the first n-1 elements are compared against 2.
a[which(head(a, -1L) == 2) + 1] <- 3
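A short side-by-side sketch of the difference; the names grows and keeps are used here only to hold the two results.

```r
a <- rep(NA, 20)
a[c(1, 8, 15, 20)] <- 2

# original statement: the 2 at position 20 produces index 21
grows <- a
grows[which(grows == 2) + 1] <- 3
length(grows)
# [1] 21

# dropping the last element from the comparison never yields index 21
keeps <- a
keeps[which(head(keeps, -1L) == 2) + 1] <- 3
length(keeps)
# [1] 20
```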
Following is the simplest one I can figure out.
library(zoo)
a = rep(NA, 20)
a[c(1, 8, 15, 20)] = 2
a[which(a == 2 & index(a) < length(a)) + 1] <- 3
You can remove those indexes that are greater than the length of a.
a <- rep(NA, 20)
ind <- c(1, 8, 15, 20)
a[ind] = 2
new_ind <- ind + 1
# use <= so that an index equal to length(a) is still kept
new_ind <- new_ind[new_ind > 0 & new_ind <= length(a)]
a[new_ind] <- 3
a
#[1] 2 3 NA NA NA NA NA 2 3 NA NA NA NA NA 2 3 NA NA NA 2
Another way would be to initially store the length of your vector and then select the values as per your initial dimension.
a = rep(NA, 20)
dim_a <- length(a)
a[c(1, 8, 15, 20)] = 2
a[which(a == 2) + 1] <- 3
a[seq_len(dim_a)] #select all elements up until length of initial a
#2 3 NA NA NA NA NA 2 3 NA NA NA NA NA 2 3 NA NA NA 2
Returning values after last NA in a vector
I can remove all NA values from a vector
v1 <- c(1,2,3,NA,5,6,NA,7,8,9,10,11,12)
v2 <- na.omit(v1)
v2
but how do I return a vector with only the values after the last NA?
c( 7,8,9,10,11,12)
Thank you for your help.
You could detect the last NA with which and add 1 to get the index past the last NA and index until the length(v1):
v1[(max(which(is.na(v1)))+1):length(v1)]
[1] 7 8 9 10 11 12
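One caveat worth checking with this approach: if the last element is itself NA, the start of the range is larger than its end, and the : operator then counts downward instead of returning an empty range. The sketch below appends a trailing NA to the example vector to show this; idx is just an illustrative name.

```r
# same data as above, with an extra NA at the end
v1 <- c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12, NA)

idx <- (max(which(is.na(v1))) + 1):length(v1)
idx      # a descending range, not an empty one
# [1] 15 14
v1[idx]  # so the result is two NAs rather than an empty vector
# [1] NA NA
```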
Here’s an alternative solution that does not use indices and only vectorised operations:
after_last_na = as.logical(rev(cumprod(rev(! is.na(v1)))))
v1[after_last_na]
The idea is to use cumprod to fill the non-NA fields from the last to the end. It’s not a terribly useful solution in its own right (I urge you to use the more obvious, index range based solution from other answers) but it shows some interesting techniques.
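To make the cumprod idea easier to follow, here is the same computation split into its intermediate steps. The variable names are introduced only for this walkthrough.

```r
v1 <- c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12)

not_na   <- !is.na(v1)        # TRUE wherever a value is present
from_end <- rev(not_na)       # look at the vector back to front
running  <- cumprod(from_end) # stays 1 until the first NA from the end, then 0
running
# [1] 1 1 1 1 1 1 0 0 0 0 0 0 0

after_last_na <- as.logical(rev(running))  # flip back to original order
v1[after_last_na]
# [1]  7  8  9 10 11 12
```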
You could detect the last NA with which
v1[(tail(which(is.na(v1)), 1) + 1):length(v1)]
# [1] 7 8 9 10 11 12
However, the most general - as @MrFlick pointed out - seems to be this:
tail(v1, -tail(which(is.na(v1)), 1))
# [1] 7 8 9 10 11 12
which also handles the following case correctly:
v1[13] <- NA
tail(v1, -tail(which(is.na(v1)), 1))
# numeric(0)
To handle the case where the vector contains no NA at all, too,
v1 <- 1:13
we can do
if (any(is.na(v1))) tail(v1, -tail(which(is.na(v1)), 1)) else v1
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
Data
v1 <- c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12)
v1 <- c(1,2,3,NA,5,6,NA,7,8,9,10,11,12)
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#[1] 7 8 9 10 11 12
v1 = 1:5
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#[1] 1 2 3 4 5
v1 = c(1:5, NA)
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#integer(0)
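For reuse, the expression can be wrapped in a function. The name values_after_last_na is an assumption made here; the body is the same expression as above, relying on max(0, integer(0)) being 0 when there is no NA.

```r
# Hypothetical wrapper around the expression shown above
values_after_last_na <- function(x) {
  # position of the last NA, or 0 if there is none
  last_na <- max(0, tail(which(is.na(x)), 1))
  x[seq_along(x) > last_na]
}

values_after_last_na(c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12))
# [1]  7  8  9 10 11 12
values_after_last_na(1:5)         # no NA: everything is returned
# [1] 1 2 3 4 5
values_after_last_na(c(1:5, NA))  # trailing NA: nothing is returned
# integer(0)
```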
The following will do what you want.
i <- which(is.na(v1))
if (i[length(i)] < length(v1)) {
  v1[(i[length(i)] + 1):length(v1)]
} else {
  NULL
}
#[1] 7 8 9 10 11 12
Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how I would use these two variables to generate a third variable c which like a has 20 observations but for which observations 7, 11 and 15 are missing, and for which the other observations are identical to b but in the order of a?
Or to put it somewhat differently: how could I squeeze in these missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is), but I have been failing to get this to work for a bit too long now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# lets work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
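The loop above reads B[j] even after every element of B has been matched, which would fail if the dropped observations were at the very end of A. A sketch of the same greedy idea as a function with a bounds check follows; the name restore_gaps is an assumption, and the vector A is written out literally so the example does not depend on the random seed.

```r
# Hypothetical function form of the greedy loop, with a bounds check on j
restore_gaps <- function(A, B) {
  C <- A
  j <- 1
  for (i in seq_along(A)) {
    if (j <= length(B) && A[i] == B[j]) {
      j <- j + 1    # A[i] is the next element of B: keep it
    } else {
      C[i] <- NA    # A[i] is missing from B at this point
    }
  }
  C
}

A <- c(2, 7, 4, 8, 9, 0, 5, 8, 5, 4, 9, 4, 6, 5, 1, 8, 2, 0, 3, 9)
B <- A[-c(7, 11, 15)]
which(is.na(restore_gaps(A, B)))
# [1]  7 11 15

# dropped element at the end of A
restore_gaps(c(1, 2, 3), c(1, 2))
# [1]  1  2 NA
```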
2) Reduce Here is a loop-free version:
f <- function(j, a) j + (a == B[j])
# start the position counter at 1; r[i + 1] is the counter after seeing A[i]
r <- Reduce(f, A, 1, accumulate = TRUE)
ifelse(duplicated(r)[-1], NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
The data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in seq_along(a[, 1])) {
  if (a[i, 1] == b[j, 1]) {
    j <- j + 1
  } else {
    c[i, 1] <- NA
  }
}
I have two datasets
datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
#-----------
name number1
1 regular 1
2 kklmin 8
3 notSo 9
4 Jijoh 2
5 Kish 18
6 Lissp 25
7 Kcn 33
8 CCCa 8
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
"KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
#-------------
name number2
1 reGulr 2
2 ntSo 8
3 Jijoh 12
4 sean 13
5 LiSsp 20
6 KcN 18
7 CaPN 13
I want to merge them by the name column, but with partial matches allowed (so that spelling errors in a large data set do not prevent merging, and even to detect such spelling errors). For example:
(1) If four consecutive letters (or all of them, if there are fewer than 4 letters) match at any position, that is fine
ABBCD = BBCDK = aBBCD = ramABBBCD = ABB
(2) Case sensitivity is off in the match e.g ABBCD = aBbCd
(3) The new dataset will preserve both names (from datf1 and datf2), so that later we can detect whether the match is perfect (maybe with a separate column giving how many letters match)
Is such a merge possible?
Edits:
datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean",
"LiSsp", "KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular <NA> 1 NA 0
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> reGulr NA 2 0
10 <NA> ntSo NA 8 0
11 <NA> sean NA 13 0
12 <NA> CaPN NA 13 0
Maybe there is a simple solution but I can't find any.
IMHO you have to implement this kind of merging on your own.
Please find an ugly example below (there is a lot of room for improvement):
uglyMerge <- function(df1, df2) {
  ## lower all strings to allow case-insensitive comparison
  lowerNames1 <- tolower(df1[, 1])
  lowerNames2 <- tolower(df2[, 1])
  ## split strings into single characters
  names1 <- strsplit(lowerNames1, "")
  names2 <- strsplit(lowerNames2, "")
  ## create the final dataframe
  mergedDf <- data.frame(name1 = as.character(df1[, 1]), name2 = NA,
                         number1 = df1[, 2], number2 = NA, matches = 0,
                         stringsAsFactors = FALSE)
  ## store names of dataframe2 (to remember which strings have no match)
  toMerge <- as.character(df2[, 1])
  for (i in seq_along(names1)) {
    for (j in seq_along(names2)) {
      ## set minimal match to 4 or to string length
      minMatch <- min(4, length(names2[[j]]))
      ## find single matches
      matches <- names1[[i]] %in% names2[[j]]
      ## look for consecutive matches
      r <- rle(matches)
      ## any matches found?
      if (any(r$values)) {
        ## find the longest run of consecutive matches
        possibleMatch <- which(r$values)
        maxPos <- possibleMatch[which.max(r$lengths[possibleMatch])]
        ## store max consecutive match length
        maxMatch <- r$lengths[maxPos]
        ## to remove FALSE-POSITIVES (e.g. CCC and kcn) find
        ## largest substring
        start <- sum(r$lengths[seq_len(maxPos - 1)]) + 1
        stop <- start + maxMatch - 1
        maxSubStr <- substr(lowerNames1[i], start, stop)
        ## all matching criteria fulfilled
        isConsecutiveMatch <- maxMatch >= minMatch &&
          grepl(pattern = maxSubStr, x = lowerNames2[j], fixed = TRUE) &&
          nchar(maxSubStr) > 0
        if (isConsecutiveMatch) {
          ## merging
          mergedDf[i, "matches"] <- maxMatch
          mergedDf[i, "name2"] <- as.character(df2[j, 1])
          mergedDf[i, "number2"] <- df2[j, 2]
          ## don't append this row to mergedDf because already merged
          toMerge[j] <- NA
          ## stop inner for loop here to avoid possible second match
          break
        }
      }
    }
  }
  ## append not matched rows to mergedDf
  toMerge <- which(!is.na(toMerge))
  df2 <- data.frame(name1 = NA, name2 = as.character(df2[toMerge, 1]),
                    number1 = NA, number2 = df2[toMerge, 2], matches = 0,
                    stringsAsFactors = FALSE)
  mergedDf <- rbind(mergedDf, df2)
  return(mergedDf)
}
Output:
> uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular reGulr 1 2 5
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> ntSo NA 8 0
10 <NA> sean NA 13 0
11 <NA> CaPN NA 13 0
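For criterion (1) on its own, a more direct check is to compute the length of the longest run of consecutive letters that two names share, ignoring case. The function below is only a brute-force sketch, and its name longest_common_run is an assumption; it is not how uglyMerge works internally, which first tests character membership with %in% and then confirms with grepl.

```r
# Hypothetical helper: length of the longest substring of `a` found in `b`,
# compared case-insensitively
longest_common_run <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  best <- 0
  n <- nchar(a)
  for (first in seq_len(n)) {
    for (last in first:n) {
      len <- last - first + 1
      # only test candidates that would improve on the best run so far
      if (len > best && grepl(substr(a, first, last), b, fixed = TRUE)) {
        best <- len
      }
    }
  }
  best
}

longest_common_run("xxregular", "reGulr")  # shares "regul"
# [1] 5
longest_common_run("notSo", "ntSo")        # shares "tso", below the 4-letter rule
# [1] 3
longest_common_run("CCCa", "KcN")          # shares only "c"
# [1] 1
```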
agrep will get you started.
Something like:
lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name)))
Then you can adjust the max.distance parameter until you get the appropriate amount of matching, and merge however you like.
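A sketch of what that could look like on the example data. The value max.distance = 0.2 is an arbitrary starting point chosen here, not a recommendation, and hits is just an illustrative name; which indexes come back for the misspelled names depends on the distance you settle on, so no output is shown.

```r
datf1 <- data.frame(name = c("regular", "kklmin", "notSo", "Jijoh",
                             "Kish", "Lissp", "Kcn", "CCCa"),
                    number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
datf2 <- data.frame(name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
                             "KcN", "CaPN"),
                    number2 = c(2, 8, 12, 13, 20, 18, 13))

# for each name in datf1, the row indexes of approximately matching names in datf2
hits <- lapply(tolower(datf1$name), function(x) {
  agrep(x, tolower(datf2$name), max.distance = 0.2)
})
names(hits) <- datf1$name
hits
```

Each list element holds candidate row numbers in datf2, which can then be inspected or used to build the merged table.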