Data set "dat" looks like this:
**V1 V2**
1 2
2 2
3 5
9 8
9 9
a 2
Want to create dummy variable V3:
if V1=V2, 0
otherwise, within a range 1-8
Where 8+ is involved, or any symbol or letter, the variable should read NA. In the above example, the
V3 = {0,1,0,NA,NA,NA}
This would be one of the many ways it can be done. There might be some more efficient ways:
# Create the original dataset
data <- data.frame(V1 = c(1,2,3,9,9,"a"), V2 = c(2,2,5,8,9,2))
# Check if V1 == V2 and write the result to V3 for ALL observations
data$V3 <- data$V1 == data$V2
# Where V1 or V2 are not in the range [1,8], overwrite V3 with NA
data$V3[!(grepl("\\b[12345678]\\b", data$V2) &
grepl("\\b[12345678]\\b", data$V1))] <- NA
Where the "\\b[12345678]{1,1}\\b" can be decomposed as follows:
1) the [12345678] part check, if the string contains some number from the range 1:8.
2) the \bb ... \bb part gives you the word boundary - thus number 2 will be matched, but number 28 will not.
If you wanted to match a range 0:13, you would adjust the regular expression like this:
data$V3[!(grepl("\\b([0-9]|1[0-3])\\b", data$V2) &
grepl("\\b([0-9]|1[0-3])\\b", data$V1))] <- NA
Where the \\b([0-9]|1[0-3])\\b can be translated as follows:
1) [0-9] matches numbers 0:9
2) 1[0-3] matches numbers 10:13
3) [0-9]|1[0-3] tells you that numbers 0:9 or 10:13 should be matched
4) \b...\b gives you the word boundaries
5) (...) tells you that the word boundaries should be evaluated after the expression within brackets. Without the brackets, this would be equivalent operation: \\b[0-9]\\b|\\b1[0-3]\\b
For more detailed introduction into matching numeric ranges with regular expression see this link: http://www.regular-expressions.info/numericranges.html
There are many ways to do this. This one has a loop which checks each row and based on a set of rules, returns whatever you want. This is easily extendable for more complex rules. Warnings can be ignored as they are produced when "a" is being coerced to numeric.
x <- read.table(text = "1 2
2 2
3 5
9 8
9 9
a 2", header = FALSE)
x$V3 <- apply(x, MARGIN = 1, FUN = function(m) {
xm <- as.numeric(as.character(m))
if (!any(is.na(xm))) {
if (any(xm > 8)) {
return(NA)
}
if(xm[1] == xm[2]) {
return(1)
} else {
return(0)
}
} else {
return(NA)
}
})
V1 V2 V3
1 1 2 0
2 2 2 1
3 3 5 0
4 9 8 NA
5 9 9 NA
6 a 2 NA
Related
I'm trying to use a new R package called waldo (see at the tidyverse blog too) that is designed to compare data objects to find differences. The waldo::compare() function returns an object that is, according to the documentation:
a character vector with class "waldo_compare"
The main purpose of this function is to be used within the console, leveraging coloring features to highlight outstanding values that are not equal between data objects. However, while just examining in console is useful, I do want to take those values and act on them (filter them out from the data, etc.). Therefore, I want to programmatically extract the outstanding values. I don't know how.
Example
Generate a vector of length 10:
set.seed(2020)
vec_a <- sample(0:20, size = 10)
## [1] 3 15 13 0 16 11 10 12 6 18
Create a duplicate vector, and add additional value (4) into an 11th vector element.
vec_b <- vec_a
vec_b[11] <- 4
vec_b <- as.integer(vec_b)
## [1] 3 15 13 0 16 11 10 12 6 18 4
Use waldo::compare() to test the differences between the two vectors
waldo::compare(vec_a, vec_b)
## `old[8:10]`: 12 6 18
## `new[8:11]`: 12 6 18 4
The beauty is that it's highlighted in the console:
But now, how do I extract the different value?
I can try to assign waldo::compare() to an object:
waldo_diff <- waldo::compare(vec_a, vec_b)
and then what? when I try to do waldo_diff[[1]] I get:
[1] "`old[8:10]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \n`new[8:11]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \033[34m4\033[39m"
and for waldo_diff[[2]] it's even worse:
Error in waldo_diff[3] : subscript out of bounds
Any idea how I could programmatically extract the outstanding values that appear in the "new" vector but not in the "old"?
As a disclaimer, I didn't know anything about this package until you posted so this is far from an authoritative answer, but you can't easily extract the different values using the compare() function as it returns an ANSI formatted string ready for pretty printing. Instead the workhorses for vectors seem to be the internal functions ses() and ses_context() which return the indices of the differences between the two objects. The difference seems to be that ses_context() splits the result into a list of non-contiguous differences.
waldo:::ses(vec_a, vec_b)
# A tibble: 1 x 5
x1 x2 t y1 y2
<int> <int> <chr> <int> <int>
1 10 10 a 11 11
The results show that there is an addition in the new vector beginning and ending at position 11.
The following simple function is very limited in scope and assumes that only additions in the new vector are of interest:
new_diff_additions <- function(x, y) {
res <- waldo:::ses(x, y)
res <- res[res$t == "a",] # keep only additions
if (nrow(res) == 0) {
return(NULL)
} else {
Map(function(start, end) {
d <- y[start:end]
`attributes<-`(d, list(start = start, end = end))
},
res[["y1"]], res[["y2"]])
}
}
new_diff_additions(vec_a, vec_b)
[[1]]
[1] 4
attr(,"start")
[1] 11
attr(,"end")
[1] 11
At least for the simple case of comparing two vectors, you’ll be better off
using diffobj::ses_dat() (which is from the package that waldo uses
under the hood) directly:
waldo::compare(1:3, 2:4)
#> `old`: 1 2 3
#> `new`: 2 3 4
diffobj::ses_dat(1:3, 2:4)
#> op val id.a id.b
#> 1 Delete 1 1 NA
#> 2 Match 2 2 NA
#> 3 Match 3 3 NA
#> 4 Insert 4 NA 3
For completeness, to extract additions you could do e.g.:
extract_additions <- function(x, y) {
ses <- diffobj::ses_dat(x, y)
y[ses$id.b[ses$op == "Insert"]]
}
old <- 1:3
new <- 2:4
extract_additions(old, new)
#> [1] 4
I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
ODD EVEN
1 1 10
2 1 8
3 1 2
4 3 2
5 3 6
6 3 4
7 5 2
8 7 6
9 7 8
10 9 4
11 9 8
Each row of this dataframe is a pair of numbers, and I would like to a find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain link of number pairs; with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically. Visually, the program I am looking for will accomplish this:
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain tie for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of graph data structure using igraph
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This will allow for building directional edges from odd numbers to even numbers. Graphs can be used to construct directional paths fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
Here's the first three elements of all_simple_paths(GPairs, 1) (starting at 1)
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to list of numeric vectors, 2) filter each numeric vector for only elements that satisfy left->right increasing, and 3) return the maximum chain of left->right increasing numeric vector
max_chain_only_increasing <- function(gpath) {
list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v)))) # convert to list of numeric vector
only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default=tail(v, 1))))]) # subset vector for only elements that are left->right increasing
return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])) # return maximum chain length
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now, I'll output (header) of max chains starting with each unique element in dfPairs, your original data
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence I realize I could make the right to left chain links left to right by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2")))`, X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the issues of NAs in last column of the mered dataframe. So ended up splitting the joined dataframe after each merge, and then shifted the dataframe to remove the NAs, and rbinded the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1,dtf2){
abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
abcd[is.na(abcd)]<-0
colnames(abcd)[ncol(abcd)] <- "end"
# print(abcd)
abcd_na <- filter(abcd, end==0)
# print(abcd_na)
abcd <- filter(abcd, end != 0)
abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
# print(abcd_na)
names(abcd_na) <- names(abcd)
abcd<- rbind(abcd, abcd_na)
z <- length(colnames(abcd))
colnames(abcd)<- c(paste0("X", 1:z))
# print(abcd)
return(abcd)
}
Finally, I found there were a lot of columns that had only zeros in it, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall Im happy with this. I imagine it could be a little more elegant, but it works on anything, and it works on some rather huge, and complicated data. This will produce ~ 241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on stackoverflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function (invec, movecommand) {
movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]],
",|\\s+"), function(x) x[x != ""])
movelist <- lapply(movecommand, function(x) {
Where <- x[which(x %in% c("before", "after", "first",
"last")):length(x)]
ToMove <- setdiff(x, Where)
list(ToMove, Where)
})
myVec <- invec
for (i in seq_along(movelist)) {
temp <- setdiff(myVec, movelist[[i]][[1]])
A <- movelist[[i]][[2]][1]
if (A %in% c("before", "after")) {
ba <- movelist[[i]][[2]][2]
if (A == "before") {
after <- match(ba, temp) - 1
}
else if (A == "after") {
after <- match(ba, temp)
}
}
else if (A == "first") {
after <- 0
}
else if (A == "last") {
after <- length(myVec)
}
myVec <- append(temp, values = movelist[[i]][[1]], after = after)
}
myVec
}
Good day
1) Is there a R function similar to Excel's match function?
2) I've made my own as below(lengthy..TT)
Could anybody suggest things need to be improved? Or other way?
fmatch2<-function(ss1, ss2) { #ss1 correspond the first argument of Excel match function. ss2 for the second.
fmatch<-function(ii,ss) { # return location in ss where ii match.
if (length(which(ss==ii))>0 ) {
rr<- min(which(ss==ii))
} else {
if (length(which(ss>ii))>0)
{rr<-min(which(ss>ii))-1 }
}
return(rr)
}
rr<-list()
n<-1
for ( x in ss1 ) { # apply fmatch to each member in ss1
nn<-fmatch(x,ss2[1:n])
rr<-rbind(rr,nn)
n<-n+1
}
as.vector(unlist(rr[,1]))
}
Usages of the function fmatch2 as below.
Mimicking Excel "=MATCH(H1,$I$1:1,1)". Element name of the list below "ch, ci" correspond to column H, Column I. The result is the list named cn.
x<-data.frame(cf=c(0,1,2,3,4,5),ch=c(0,0,3,6,6,6),ci=c(0,0,3,7,11,13))
y<-data.frame(cf=c(0,1,2,3,4,5),ch=c(0,0,3,6,6,6),ci=c(0,0,3,7,11,13),cn=fmatch2(x[[2]],x[[3]]))
Ofcourse i am not entirely sure what you're trying to do, as i'd expect your fmatch2 function to return NA for ch==6 (because 6 is not present in ci), but i love doing things using dplyr:
library(dplyr)
result <- x %>% # "%>%" means "and then"
mutate(chInCi = match(ch, x$ci)) #adds a column named "chInCi" with the position in ci of the first match of the value in ch
result
cf ch ci chInCi
1 0 0 0 1
2 1 0 0 1
3 2 3 3 3
4 3 6 7 NA
5 4 6 11 NA
6 5 6 13 NA
To start off, and example Dataset :
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"))
I've got a large dataset with a multitude of numeric and character variables (survey data). These responses vary greatly in content and length; the order these variables are in matter, as well. I'm trying to find a way to select all of the character variables in my dataset, and then set any responses to the letter "N"/"Another item" (while leaving the NA values intact).
With the help of other users in the community, I'm able to fill all of these character variables with NA or "N", etc. :
x[,sapply(x, is.character)] <- "N"
But, I would really like to be able to retain those NA values present within the data - Something like this (I'm not very proficient with the apply functions just yet) :
x[ #Contains ANY Text# ,sapply(x, is.character)] <- "NA"
I haven't found anything that will allow me find any and all text within a row/column? It appears something like GREP only works with specific character strings to my knowledge. I'm also unsure of my formatting with the aforementioned function is correct, so please let me know if I'm making an error in placing my #Contains ANY text# argument.
Thanks in advance All!
A data.frame is a list so its columns can be changed using lapply.
Here we can subset x to the character columns, and then lapply over them replacing non-NA values with whatever we want.
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"),
stringsAsFactors = FALSE) # your original data.frame had factors
x
# v1 v2 v3 v4 v5
# 1 1 1 1 Bob Hello
# 2 2 2 2 Green This question is awful, Mad
# 3 3 3 3 Curley <NA>
# 4 4 4 4 Banana Help
# 5 5 5 5 No Me
is_char_col <- sapply(x, is.character)
is_char_col
# v1 v2 v3 v4 v5
# FALSE FALSE FALSE TRUE TRUE
Use replace:
x[is_char_col] <- lapply(x[is_char_col], function(k) replace(k, !is.na(k), "N"))
x
# v1 v2 v3 v4 v5
# 1 1 1 1 N N
# 2 2 2 2 N N
# 3 3 3 3 N <NA>
# 4 4 4 4 N N
# 5 5 5 5 N N
If the replacement logic is actually more complicated, you could modify the anonymous function inside lapply.
Here is a method using a generic function as mentioned by #effel.
x <- data.frame(v1=1:5,v2=1:5,v3=1:5,
v4=c("Bob","Green","Curley","Banana","No"),
v5=c("Hello","This question is awful, Mad",NA,"Help","Me"),
stringsAsFactors = FALSE)
x <- data.frame(lapply(x, function(i) if(is.character(i)) ifelse(!is.na(i), "N", i) else i))
I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various packages like dylpr, sample, order, sort and nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
Take the vector of valences and samples from it.
If sequences are found that are larger than the desired length, then, (for each such sequence), take the last value of that sequence at places it "somewhere else".
Check if the problem is solved. If so, return the reordered vector. If not, then go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
# take an initial sample
out <- sample(val)
# check if the sample is "bad" (containing sequences longer than n)
bad.seq <- any(rle(out)$lengths > n)
# length of the whole sample
l0 <- length(out)
while(bad.seq){
# get lengths of all subsequences
l1 <- rle(out)$lengths
# find the bad ones
ind <- l1 > n
# take the last value of each bad sequence, and...
for(i in cumsum(l1)[ind]){
# take it out of the original sample
tmp <- out[-i]
# pick new position at random
pos <- sample(2:(l0-2),1)
# put the value back into the sample at the new position
out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
}
# check if bad sequences (still) exist
# if TRUE, then 'while' continues; if FALSE, then it doesn't
bad.seq <- any(rle(out)$lengths > n)
}
# return the reordered sequence
out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the Valence Category's appropriately. I've called the valence categories treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] & s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.