I have a list of vectors, j, that looks like this:
>j
[[1]]
[1] "a" "b" "c"
[[2]]
[1] "c" "c"
[[3]]
[1] "d" "d" "d" "a" "a"
.
.
.
I would like to transform this into a data frame with a single column in which each vector's contents are concatenated together. So the column would look like:
Column_Name
1 a b c
2 c c
3 d d d a a
I have tried the Replace() function, as well as a loop followed by a conversion to a data frame:
for (x in 1:length(j)){
j[x] = paste(j[x], collapse = " ")
}
j <- data.frame(matrix(unlist(j), nrow=length(j), byrow=TRUE))
Any guidance would be greatly appreciated.
Thank you.
You were close: sapply together with the collapse argument of paste does the job, and the result can be wrapped directly in a data.frame:
# Toy data
set.seed(1)
j <- replicate(5, rep(sample(letters, 1), sample(1:10,1)))
print(j)
#[[1]]
#[1] "g" "g" "g" "g"
#
#[[2]]
# [1] "o" "o" "o" "o" "o" "o" "o" "o" "o" "o"
#
#[[3]]
#[1] "f" "f" "f" "f" "f" "f" "f" "f" "f"
#
#[[4]]
#[1] "y" "y" "y" "y" "y" "y" "y"
#
#[[5]]
#[1] "q"
# Collapse each element and wrap into a data.frame
res <- data.frame("Column_name" = sapply(j, paste, collapse = " "))
print(res)
# Column_name
#1 g g g g
#2 o o o o o o o o o o
#3 f f f f f f f f f
#4 y y y y y y y
#5 q
sapply applies paste to each element of the list, producing a character vector of the concatenated list elements. The data.frame constructor then turns that vector into the desired one-column output.
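As a small variation on the above, vapply can stand in for sapply when you want R to check that every element really collapses to a single string. This is only a sketch using the list from the question; the name `collapsed` is illustrative and not part of the original answer.

```r
# The list from the question
j <- list(c("a", "b", "c"), c("c", "c"), c("d", "d", "d", "a", "a"))

# vapply() works like sapply(), but errors if any result is not one string
collapsed <- vapply(j, paste, character(1), collapse = " ")

# Wrap the character vector in a one-column data frame
res <- data.frame(Column_Name = collapsed, stringsAsFactors = FALSE)
print(res)
```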
First give the list names, then use stack to convert the list into a data.frame. Finally, the dplyr package is used to collapse the values belonging to each list element into one string, separated by a space.
Sample data is taken from @AndersEllernBilgrau's answer.
set.seed(1)
j <- replicate(5, rep(sample(letters, 1), sample(1:10,1)))
names(j) <- seq_along(j)
library(dplyr)
stack(j) %>% group_by(ind) %>%
summarise(Column_Name = paste0(values, collapse = " ")) %>%
ungroup() %>% select(-ind)
# # A tibble: 5 x 1
# Column_Name
# <chr>
# 1 g g g g
# 2 o o o o o o o o o o
# 3 f f f f f f f f f
# 4 y y y y y y y
# 5 q
#
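To make the stack step less opaque, here is a small base R sketch of the intermediate long-format table it produces, using a two-element list in the style of the question. The names `long` and `collapsed` are illustrative, and tapply is swapped in for the dplyr pipeline purely to keep the example dependency-free.

```r
j <- list(c("a", "b", "c"), c("c", "c"))
names(j) <- seq_along(j)   # stack() needs a named list

# One row per value; `ind` records which list element it came from
long <- stack(j)
print(long)

# Collapse the values within each list element
collapsed <- tapply(long$values, long$ind, paste, collapse = " ")
print(collapsed)
```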
I am looking to randomize the order of the sublists, but retaining the structure. To illustrate, I can do this with a data frame:
df1 <- data.frame("X1" = LETTERS[1:5], "X2" = letters[1:5])
df1
df1R <- df1[sample(nrow(df1)),]
df1R
> df1
X1 X2
1 A a
2 B b
3 C c
4 D d
5 E e
>
> df1R <- df1[sample(nrow(df1)),]
> df1R
X1 X2
2 B b
5 E e
1 A a
3 C c
4 D d
You can see here that the overall order is randomised, but the rows remain together. This is what I mean by retaining the structure: A stays with a, B stays with b...
I'd like to implement this for a list:
m1 <- list(LETTERS[1:5], letters[1:5])
But I'm stuck on how to do it. I've had a good look around but have not found a solution. Any advice?
The result would look like:
> m1R
[[1]]
[1] "B" "C" "E" "A" "D"
[[2]]
[1] "b" "c" "e" "a" "d"
You could do this to reorder all elements:
neworder <- sample.int(5)
lapply(m1, function(x) x[neworder])
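The same idea can be sketched without hard-coding the length, assuming every element of the list has the same length. The name `m1R` follows the question; the final check is only there to show that the pairs stay aligned.

```r
m1 <- list(LETTERS[1:5], letters[1:5])

# One shared permutation of positions, sized from the data
neworder <- sample.int(length(m1[[1]]))

# Apply the same permutation to every element so pairs stay aligned
m1R <- lapply(m1, function(x) x[neworder])

# Check: each upper-case letter still lines up with its lower-case partner
stopifnot(identical(tolower(m1R[[1]]), m1R[[2]]))
```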
I have a large data.frame and I need a row-wise conversion. My goal is to set every value in a row to NA after the first occurrence of a specific character.
For example, here is a small sample from my real data set:
sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
In sample_df, for example, I want to turn every value after the first "I" in each row into NA; result_df shows the expected output.
I tried base, dplyr and purrr but could not come up with an algorithm.
Thanks for your help.
Try this:
Find "I" values
I_true<-sample_df=="I"
I_true
a b c d
[1,] FALSE TRUE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE
Find positions from the first "I" seen
out<-t(apply(t(I_true),2,cumsum))
out
a b c d
[1,] 0 1 1 1
[2,] 1 1 1 1
[3,] 0 0 1 2
[4,] 0 0 0 0
Replace needed values
output<-out
output[out>=1]<-NA
output[output==0]<-"V"
output[I_true]<-"I"
output[out>=2]<-NA
Your output
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "V" "V" "I" NA
[4,] "V" "V" "V" "V"
Example 2:
sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "I" NA NA NA
[4,] "V" "V" "V" "V"
Here is a brute-force approach. It is probably the easiest to come up with, but also the least preferred:
df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"), stringsAsFactors=FALSE)
rowlength <- ncol(df)
for (i in seq_len(nrow(df))) {
  is_I <- as.character(df[i, ]) == "I"
  if (any(is_I)) {
    first <- which(is_I)[1] + 1  # first column after the first "I"
    # Guard: if the first "I" is in the last column, there is nothing to blank
    if (first <= rowlength) {
      df[i, first:rowlength] <- NA
    }
  }
}
Here's a possible answer using ddply from the plyr package:
library(plyr)
ddply(sample_df,.(a,b,c,d), function(x){
idx<-which(x=='I')[1]+1 #ID after first 'I'
if(!is.na(idx)){ #Check if found
if(idx<=ncol(x)){ # Prevent out of bounds
x[,idx:ncol(x)]<-NA
}
}
x
})
The plyr approach:
plyr::adply(sample_df, 1L, function(x) {
if (all(x != "I"))
return(x)
x[1L:min(which(x == "I"))]
})
You need the if because, for rows without at least one "I", which(x == "I") returns integer(0); min() of that is Inf (with a warning), so the index 1L:min(which(x == "I")) fails.
My Solution:
Following @Julien Navarre's recommendation, I first created a toNA() function:
toNA <- function(x) {
temp <- grep("INVALID", unlist(x)) # which can be generalized for any string
lt <- length(x)
loc <- min(temp,100)+1 #100 is arbitrary number bigger than actual column count
#print(lt) #Debug purposes
if( (loc < lt+1) ) {
x[ (loc):(lt)] <-NA
}
x
}
First, I tried plyr::adply() and purrrlyr::by_row() to apply my toNA() function to my data.frame, which has over 3 million rows.
Both are very slow (for 1000 rows they take 9 and 6 seconds respectively). They are also slow with a trivial function(x) x, so I am not sure where the overhead comes from.
So I tried base::apply() instead (result is my data set):
tibble::as_tibble(t(apply(result, 1, toNA)))
It takes only 0.2 seconds for 1000 rows.
I am not sure about the programming style, but for now this solution works for me.
Thanks for all your recommendations.
A pure base solution: build a logical matrix marking where the value is "I", then apply a double cumsum by row to find where the NAs must be placed:
result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I",1,function(x) cumsum(cumsum(x)))) >1
result_df
# a b c d
# 1 V I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V V I <NA>
# 4 V V V V
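To see why the double cumsum works, it may help to trace a single row by hand. The sketch below uses row 3 of sample_df (V V I I); the names `first` and `second` are illustrative only.

```r
x <- c(FALSE, FALSE, TRUE, TRUE)   # row 3 of sample_df == "I"

first  <- cumsum(x)       # 0 0 1 2: number of "I"s seen so far
second <- cumsum(first)   # 0 0 1 3: equals 1 only at the first "I"

second > 1                # FALSE FALSE FALSE TRUE: positions to set to NA
```

Before the first "I" the running total is 0, at the first "I" it is exactly 1, and at every later position it exceeds 1, whether or not that position itself holds an "I".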
Given a vector of N elements:
LETTERS[1:10]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
How can one get a data.table/frame (df) as follows?
>df
one two
A B
C D
E F
G H
I J
EDIT
Generalizing, I would like to know how to split a given vector as follows:
[A B C],[D E],[F G H I J]
and obtaining:
V1 V2 V3 V4 V5
A B C NA NA
D E NA NA NA
F G H I J
One option is the matrix way
as.data.frame(matrix(LETTERS[1:10], ncol=2,byrow=TRUE,
dimnames = list(NULL, c('one', 'two'))), stringsAsFactors=FALSE)
# one two
#1 A B
#2 C D
#3 E F
#4 G H
#5 I J
If we need to create an index, we can use gl to split the vector and rbind
do.call(rbind, split(v1, as.integer(gl(length(v1), 2, length(v1)))))
where
v1 <- LETTERS[1:10]
Update
Based on the update in OP's post
lst <- split(v1, rep(1:3, c(3, 2, 5)))
do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
# [,1] [,2] [,3] [,4] [,5]
#1 "A" "B" "C" NA NA
#2 "D" "E" NA NA NA
#3 "F" "G" "H" "I" "J"
Or otherwise
library(stringi)
stri_list2matrix(lst, byrow = TRUE)
Update2
If we are using a 'splitVec'
lst <- split(v1, cumsum(seq_along(v1) %in% splitVec))
and then proceed as above
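The answer does not define splitVec, so the following is only a sketch under one assumed meaning: splitVec holds the positions at which the second and later groups begin. With that assumption, c(4, 6) reproduces the [A B C], [D E], [F G H I J] split from the question.

```r
v1 <- LETTERS[1:10]

# Assumption: positions where the 2nd, 3rd, ... group starts
splitVec <- c(4, 6)

# cumsum() bumps the group id at each start position: 0 0 0 1 1 2 2 2 2 2
lst <- split(v1, cumsum(seq_along(v1) %in% splitVec))

# Pad each group with NA up to the longest length, then bind as rows
res <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
print(res)
```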
I have 2 tables as below:
a = read.table(text=' a b
1 c
1 d
2 c
2 a
2 b
3 a
', head=T)
b = read.table(text=' a c
1 x i
2 y j
3 z k
', head=T)
And I want result to be like this:
1 x i c d
2 y j c a b
3 z k a
Originally I thought to use tapply to transform them to lists (eg. aa = tapply(a[,2], a[,1], function(x) paste(x,collapse=","))), then append it back to table b, but I got stuck...
Any suggestions on how to do this?
Thanks a million.
One way to do it:
mapply(FUN = c,
lapply(split(b, row.names(b)), function(x) as.character(unlist(x, use.names = FALSE))),
split(as.character(a$b), a$a),
SIMPLIFY = FALSE)
# $`1`
# [1] "x" "i" "c" "d"
#
# $`2`
# [1] "y" "j" "c" "a" "b"
#
# $`3`
# [1] "z" "k" "a"
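The tapply idea from the question can also be carried through, if a single collapsed column appended to table b is acceptable instead of a list. This is a sketch under that assumption; the column name `extra` is made up for illustration.

```r
a <- read.table(text = "a b
1 c
1 d
2 c
2 a
2 b
3 a", header = TRUE, stringsAsFactors = FALSE)

b <- read.table(text = "a c
1 x i
2 y j
3 z k", header = TRUE, stringsAsFactors = FALSE)

# Collapse column b of table a within each key in column a
aa <- tapply(a$b, a$a, paste, collapse = " ")

# Look the keys up by name against b's row names; as.vector() drops the array attributes
b$extra <- as.vector(aa[rownames(b)])
print(b)
```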