Extract data elements found in a single column - r

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string values at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.

This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of the solution simply splits each element of the interest_string factor in the data object dat, using \\{ as the split pattern. The { has to be escaped, and in an R string that requires two backslashes. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
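As a self-contained sketch of the two steps just described, the snippet below rebuilds the example data and runs both lines. The construction of dat is an assumption (the question does not show how it was created), and fixed = TRUE is used so that no escaping is needed.

```r
# Assumed reconstruction of the example data; the question does not show how dat was built
dat <- data.frame(id = 1:4,
                  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
                  stringsAsFactors = FALSE)

# Step 1: split each string on a literal "{" (a trailing "{" yields no empty piece)
pieces <- strsplit(as.character(dat$interest_string), "{", fixed = TRUE)

# Step 2: repeat each id once per piece and flatten the list into one column
out <- data.frame(id = rep(dat$id, times = lengths(pieces)),
                  interest = unlist(pieces, use.names = FALSE),
                  stringsAsFactors = FALSE)
out
```

A missing interest_string stays as a single NA row, because strsplit returns NA for an NA input.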

A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT

Related

r - find maximum length "chain" of numerically increasing pairs of numbers

I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
ODD EVEN
1 1 10
2 1 8
3 1 2
4 3 2
5 3 6
6 3 4
7 5 2
8 7 6
9 7 8
10 9 4
11 9 8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain link of number pairs, with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically. Visually, the program I am looking for will accomplish this:
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain tie for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of graph data structure using igraph
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This will allow for building directional edges from odd numbers to even numbers. Graphs can be used to construct directional paths fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
Here's the first three elements of all_simple_paths(GPairs, 1) (starting at 1)
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to list of numeric vectors, 2) filter each numeric vector for only elements that satisfy left->right increasing, and 3) return the maximum chain of left->right increasing numeric vector
max_chain_only_increasing <- function(gpath) {
list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v)))) # convert to list of numeric vector
only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default=tail(v, 1))))]) # subset vector for only elements that are left->right increasing
return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])) # return the longest chain(s)
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now, I'll output (header) of max chains starting with each unique element in dfPairs, your original data
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
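For comparison, here is a minimal base R sketch of the same idea without igraph. It is not the method used above: it orients every pair from its smaller to its larger value and then follows the links recursively. The helper name longest_from is hypothetical, and the recursion is not memoised, so it may be slow on large inputs.

```r
ODD  <- c(1, 1, 1, 3, 3, 3, 5, 7, 7, 9, 9)
EVEN <- c(10, 8, 2, 2, 6, 4, 2, 6, 8, 4, 8)
dfPairs <- data.frame(ODD, EVEN)

# Orient every pair from its smaller value (lo) to its larger value (hi)
lo <- pmin(dfPairs$ODD, dfPairs$EVEN)
hi <- pmax(dfPairs$ODD, dfPairs$EVEN)

# longest_from(): hypothetical helper returning a list of all longest
# strictly increasing chains that start at value v
longest_from <- function(v) {
  nxt <- hi[lo == v]                      # values reachable in one link
  if (length(nxt) == 0) return(list(v))   # dead end: the chain is just v
  tails <- unlist(lapply(nxt, longest_from), recursive = FALSE)
  best <- tails[lengths(tails) == max(lengths(tails))]
  lapply(best, function(p) c(v, p))
}

# Try every possible starting value and keep the overall longest chain(s)
all_chains <- unlist(lapply(sort(unique(lo)), longest_from), recursive = FALSE)
longest <- all_chains[lengths(all_chains) == max(lengths(all_chains))]
longest
```

The result is a list, so ties for the longest chain are all returned. It holds chain values rather than rows of dfPairs; mapping consecutive values back to rows would be a further step.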
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence I realize I could make the right to left chain links left to right by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
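The same filtering can be sketched in base R without the helper column. This is only an illustration of the step above; the names both and increasing are made up for the example.

```r
ODD  <- c(1, 1, 1, 3, 3, 3, 5, 7, 7, 9, 9)
EVEN <- c(10, 8, 2, 2, 6, 4, 2, 6, 8, 4, 8)
dfPairs <- data.frame(ODD, EVEN)

# Stack the pairs in both orientations under common column names
both <- rbind(setNames(dfPairs, c("X1", "X2")),
              setNames(dfPairs[, 2:1], c("X1", "X2")))

# Keep only links that increase from X1 to X2, ordered by their starting value
increasing <- both[both$X2 > both$X1, ]
increasing <- increasing[order(increasing$X1), ]
nrow(increasing)
```

Because no pair has two equal values, exactly one orientation of each pair survives the filter.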
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the NAs in the last column of the merged dataframe. So I ended up splitting the joined dataframe after each merge, shifting the dataframe to remove the NAs, and rbinding the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1,dtf2){
abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
abcd[is.na(abcd)]<-0
colnames(abcd)[ncol(abcd)] <- "end"
# print(abcd)
abcd_na <- filter(abcd, end==0)
# print(abcd_na)
abcd <- filter(abcd, end != 0)
abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
# print(abcd_na)
names(abcd_na) <- names(abcd)
abcd<- rbind(abcd, abcd_na)
z <- length(colnames(abcd))
colnames(abcd)<- c(paste0("X", 1:z))
# print(abcd)
return(abcd)
}
Finally, I found there were a lot of columns that had only zeros in it, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it works on anything, including some rather large and complicated data. This will produce ~ 241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on Stack Overflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function (invec, movecommand) {
movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]],
",|\\s+"), function(x) x[x != ""])
movelist <- lapply(movecommand, function(x) {
Where <- x[which(x %in% c("before", "after", "first",
"last")):length(x)]
ToMove <- setdiff(x, Where)
list(ToMove, Where)
})
myVec <- invec
for (i in seq_along(movelist)) {
temp <- setdiff(myVec, movelist[[i]][[1]])
A <- movelist[[i]][[2]][1]
if (A %in% c("before", "after")) {
ba <- movelist[[i]][[2]][2]
if (A == "before") {
after <- match(ba, temp) - 1
}
else if (A == "after") {
after <- match(ba, temp)
}
}
else if (A == "first") {
after <- 0
}
else if (A == "last") {
after <- length(myVec)
}
myVec <- append(temp, values = movelist[[i]][[1]], after = after)
}
myVec
}
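For the single "end first" case that join_shift relies on, the column reordering can also be sketched in plain base R. The small data frame below is made up for illustration and is not taken from the real data.

```r
# Made-up one-row example with the same column layout as abcd_na
abcd_na <- data.frame(X1 = 1, X2 = 2, end = 0)

# Put the chosen column first and keep the others in their existing order
first <- "end"
reordered <- abcd_na[c(first, setdiff(names(abcd_na), first))]
names(reordered)
```

This covers only the "first" command; moveme also handles "before", "after" and "last".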

'Random' Sorting with a condition in R for Psychology Research

I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various packages and functions like dplyr, sample, order and sort, and nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and sample from it.
2. If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
  # take an initial sample of the input vector
  out <- sample(x)
  # check if the sample is "bad" (containing sequences longer than n)
  bad.seq <- any(rle(out)$lengths > n)
  # length of the whole sample
  l0 <- length(out)
  while(bad.seq){
    # get lengths of all subsequences
    l1 <- rle(out)$lengths
    # find the bad ones
    ind <- l1 > n
    # take the last value of each bad sequence, and...
    for(i in cumsum(l1)[ind]){
      # take it out of the original sample
      tmp <- out[-i]
      # pick new position at random
      pos <- sample(2:(l0-2),1)
      # put the value back into the sample at the new position
      out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
    }
    # check if bad sequences (still) exist
    # if TRUE, then 'while' continues; if FALSE, then it doesn't
    bad.seq <- any(rle(out)$lengths > n)
  }
  # return the reordered sequence
  out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
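To confirm that any ordering respects the constraint, the same rle() test that drives the loop can be wrapped in a small check. The helper name has_long_run is hypothetical.

```r
# has_long_run(): hypothetical helper; TRUE if any value occurs
# more than n times in a row in x
has_long_run <- function(x, n) any(rle(x)$lengths > n)

has_long_run(c(2, 2, 2, 1), 2)  # TRUE: three 2s in a row
has_long_run(c(2, 2, 1, 2), 2)  # FALSE: no run longer than 2
```

Running it on the valence column of the reordered data frame is a quick sanity check before using the order in an experiment.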
I believe this loop will set the Valence Categories appropriately. I've called the valence categories treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] & s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
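One possible tidier variant of the same idea is sketched below: instead of drawing and then redrawing, each draw excludes the repeated value up front whenever the preceding values are all equal. The function name draw_sequence and its arguments are assumptions made for illustration. Like the loop above, it generates new category labels rather than reordering an existing set of stimuli, so the number of items per category is not fixed.

```r
# draw_sequence(): hypothetical helper that draws `len` category labels so
# that no label occurs more than `max_run` times in a row
draw_sequence <- function(len, levels = 1:3, max_run = 2) {
  out <- integer(len)
  for (i in seq_len(len)) {
    allowed <- levels
    # if the previous max_run draws are all the same, exclude that value
    if (i > max_run && length(unique(out[(i - max_run):(i - 1)])) == 1) {
      allowed <- setdiff(levels, out[i - 1])
    }
    # index into `allowed` so a length-one vector is handled safely
    out[i] <- allowed[sample.int(length(allowed), 1)]
  }
  out
}

set.seed(42)
x <- draw_sequence(1000)
max(rle(x)$lengths)  # never exceeds max_run
```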

Add index numbers when converting sorted table to dataframe

I have a vector of strings that I'm trying to convert into a data frame with a frequency column. So far so good, but when I dim my data frame, I get only one column instead of two. I guess R is using the words as the index values.
Anyway here is how it starts. My list:
a<-c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay", "whatsyourexcuse", "yay")
Then, I tried to sort the frequency values in decreasing order and store as data frame using:
df <- as.data.frame(sort(table(a), decreasing=TRUE))
Problem is when I dim(df) I get [1] 5 1 instead of [1] 5 2. Here is what df looks like:
sort(table(a), decreasing = TRUE)
whatsyourexcuse 2
yay 2
welcoming 1
whiteway 1
zero 1
instead of:
a Freq
[1] whatsyourexcuse 2
[2] yay 2
[3] welcoming 1
[4] whiteway 1
[5] zero 1
Any pointers please? Thanks.
Try:
library(plyr)
a1 <- count(a)
a1[order(-a1$freq),]
# x freq
# 2 whatsyourexcuse 2
# 4 yay 2
# 1 welcoming 1
# 3 whiteway 1
# 5 zero 1
dim(a1)
#[1] 5 2
Or
a2 <- stack(sort(table(a),decreasing=TRUE))[,2:1]
dim(a2)
#[1] 5 2
When you convert to a data.frame using as.data.frame(sort(table(a), decreasing=TRUE)), the names of the elements become the row names of the data frame, so you get only one column instead of two. After sort, the result is no longer a table object. For comparison, check str(table(a)) and str(sort(table(a), decreasing=TRUE)).
You can also create the data.frame by
tbl <- sort(table(a), decreasing=TRUE)
data.frame(col1= names(tbl), Values= as.vector(tbl))
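Another way to sidestep the problem, sketched here in base R, is to convert the unsorted table first and order the rows afterwards. The name freq_df is made up for the example.

```r
a <- c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay",
       "whatsyourexcuse", "yay")

# Convert the table itself (not the sorted result) so both columns are kept
freq_df <- as.data.frame(table(a), stringsAsFactors = FALSE)

# Order by decreasing frequency and renumber the rows
freq_df <- freq_df[order(-freq_df$Freq), ]
rownames(freq_df) <- NULL
dim(freq_df)
```

The columns are named a and Freq, matching the layout asked for in the question.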

r create a column that contains the objects names inside a lapply function

I would like to create a column that contains the object's name inside a lapply function. As a proxy I call it name.of.x.as.strig.function(). Unfortunately I am not sure how to do it; maybe a combination of assign, do.call and paste. But so far using these functions only led me into deeper trouble, and I am quite sure there is a more R-like solution.
# generates a list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
# subsets the second column into the object data.anova
data.anova <- lapply(data, function(x){x <- x[[2]];
return(matrix(x))})
This should allow me to create a column inside the dataframe that contains its name, for all matrices inside the list
data.anova <- lapply(data, function(x){
x$id <- name.of.x.as.strig.function(x)
return(x)})
I would like to retrieve:
3 one
3 one
3 two
3 two
...
Any input is highly appreciated.
Search history: function to retrieve object name as string, R get name of an object inside lapply...
Can it be that you are just looking for stack?
stack(lapply(data, `[[`, 2))
# values ind
# 1 3 one
# 2 3 one
# 3 3 two
# 4 3 two
# 5 3 tree
# 6 3 tree
# 7 3 four
# 8 3 four
(Or, using your original approach: stack(lapply(data, function(x) {x <- x[[2]]; x})))
If this is the case, melt from "reshape2" would also work.
Loop through the indices of data.anova, and use that to fetch both the data and the names:
data.anova <- lapply(seq_along(data.anova), function(i){
x <- as.data.frame(data.anova[[i]])
x$id <- names(data.anova)[i]
return(x)})
This produces:
# [[1]]
# V1 id
# 1 3 one
# 2 3 one
# [[2]]
# V1 id
# 1 3 two
# 2 3 two
# [[3]]
# V1 id
# 1 3 tree
# 2 3 tree
# [[4]]
# V1 id
# 1 3 four
# 2 3 four
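If a single combined data frame is wanted rather than a list, one option is to pair each element with its name using Map and then bind the pieces. This is a sketch on a shortened two-element version of the list; the column names value and id are chosen for illustration.

```r
# Shortened version of the example list, with names set at creation
data <- list(one = data.frame(a = c(1, 2), b = c(3, 3)),
             two = data.frame(a = c(1, 2), b = c(3, 3)))

# Map passes each data frame together with its name to the function
with_id <- Map(function(x, nm) data.frame(value = x[[2]], id = nm),
               data, names(data))

# Stack the per-element data frames into one
combined <- do.call(rbind, with_id)
rownames(combined) <- NULL
combined
```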

Create Data Frame and Populate It R

How do I create a fixed-size data frame of size [40 2], fill the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't want a row of headers.
(Someone please give me some pointers. I haven't programmed in R for a while, and my R skills are terrible to begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
Second Approach:
# random strings
rstr <- function(n,k){
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n,k) produces a vector of length n, with each element being a string of k random lowercase letters. rstr(...) does not guarantee that all strings are unique, but the chance that any given string collides with another is on the order of n/26^k, which is tiny for n = 40 and k = 5.
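If uniqueness must be guaranteed rather than merely very likely, one simple approach is to redraw until there are no duplicates. The helper unique_ids below is hypothetical and assumes n is far smaller than 26^k; otherwise the loop may never finish.

```r
# unique_ids(): hypothetical helper; redraws n random k-letter strings
# until all of them are distinct (assumes n is much smaller than 26^k)
unique_ids <- function(n, k) {
  repeat {
    ids <- replicate(n, paste0(sample(letters, k, replace = TRUE), collapse = ""))
    if (anyDuplicated(ids) == 0) return(ids)
  }
}

set.seed(1)
df.3 <- data.frame(id = unique_ids(40, 5), value = NA, stringsAsFactors = FALSE)
dim(df.3)
```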
Create the data.frame and define its columns with the values.
The recycling rule repeats the strings to match the 40 rows defined by the second column.
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
I found this way of creating data frames in R extremely productive and easy:
Create a raw vector of values, convert it into a matrix of the required dimensions, and finally name the columns and rows (wrap the result in as.data.frame() if you need an actual data frame rather than a matrix).
dataframe.values = c(value1, value2,.......)
dataframe = matrix(dataframe.values,nrow=number of rows ,byrow = T)
colnames(dataframe) = c("column1","column2",........)
row.names(dataframe) = c("row1", "row2",............)
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5
