cbind binds integer numbers rather than the content - r

cbind binds integer numbers rather than the content of the column. When I use the paste function instead, I can see the text. I'm not sure why cbind binds the integers rather than the content of the column.
This does not work:
data<-read.csv("NFL.CSV",head=T)
output <- cbind( data$content, cl$cluster)
With paste I can see the content:
output <- paste( data$content, cl$cluster)
Sample data: there are two columns, one is content and the other is id.
content ,id
NFL flexes Dallas Cowboys-Washington Redskins game , cbbbcf9395705611c3eeeffaa610a602
#special_event32 redskins still suck ,9b50b8be10460eab6c0f6f3590067bd7
RG3 leads Redskins over Eagles 27-20 (The Associated Press) PHILADELPHIA (AP) -- With one ,77e1a37031884642b8d1bccad99516c6

Since you didn't give any example data, I have to guess, but I strongly suspect that your columns content and/or cluster are factor columns, in which case cbind will convert them to their integer codes:
> cbind(as.factor(c("a", "b")), as.factor(c("a", "c")))
[,1] [,2]
[1,] 1 1
[2,] 2 2
What you can do is put as.character around your vectors:
> cbind(as.character(as.factor(c("a", "b"))),
+ as.character(as.factor(c("a", "b"))))
[,1] [,2]
[1,] "a" "a"
[2,] "b" "b"
or in your example:
output <- cbind(as.character(data$content),
as.character(cl$cluster))
Another solution is to use cbind.data.frame
> cbind.data.frame(as.factor(c("a", "b")), as.factor(c("a", "b")))
as.factor(c("a", "b")) as.factor(c("a", "b"))
1 a a
2 b b
or just data.frame
output <- data.frame(content = data$content,
cluster = cl$cluster)
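To see the three behaviours side by side, here is a minimal sketch. The vectors content and cluster are made-up stand-ins for data$content and cl$cluster, and content is built with factor() explicitly so the example does not depend on how read.csv imported the column.

```r
# Hypothetical stand-ins for data$content and cl$cluster
content <- factor(c("NFL flexes game", "redskins still suck", "RG3 leads Redskins"))
cluster <- c(2L, 1L, 2L)

# cbind on a factor keeps only the integer codes, so the text is lost
codes <- cbind(content, cluster)

# converting to character first keeps the text (the whole matrix becomes character)
texts <- cbind(as.character(content), cluster)

# a data frame keeps each column's own type
df <- data.frame(content = as.character(content),
                 cluster = cluster,
                 stringsAsFactors = FALSE)
```

If the cluster numbers should stay numeric, the data frame version is probably the one to prefer, since a matrix can hold only one type.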

Related

Transforming a list with differing number of elements to a data frame

I am trying to create a data frame by combining a data frame with a list. The problem is that each element of the list is not the same length. I read this link: http://www.r-bloggers.com/converting-a-list-to-a-data-frame/ and example four is exactly what I need, but I want to automate the process of naming the rows. I have over 100 rows I need to name and I don't want to name them all explicitly.
example <- list("a", c("b", "c"), c("d", "e", "f"))
should look like this:
row1 | a <NA> <NA>
row2 | b c <NA>
row3 | d e f
We can use stri_list2matrix to convert the list to a matrix. It pads with NA the list elements that are shorter than the longest element in the list.
library(stringi)
stri_list2matrix(example, byrow=TRUE)
# [,1] [,2] [,3]
#[1,] "a" NA NA
#[2,] "b" "c" NA
#[3,] "d" "e" "f"
Another option is base R: we set each element's length to the maximum length, which pads the shorter list elements with NA. We use sapply so that, because the padded elements all have equal length, the result simplifies to a matrix.
t(sapply(example, `length<-`, max(lengths(example))))
# [,1] [,2] [,3]
#[1,] "a" NA NA
#[2,] "b" "c" NA
#[3,] "d" "e" "f"
NOTE: No packages are used here ... If you need a data.frame, wrap the output with as.data.frame.
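Since the question also asked for automatic row names, the base R padding can be combined with paste0 as in this sketch. The names n_max, padded and DF are only illustrative.

```r
example <- list("a", c("b", "c"), c("d", "e", "f"))

# pad every element with NA up to the longest length, one row per list element
n_max <- max(lengths(example))
padded <- t(sapply(example, `length<-`, n_max))

# generate row1, row2, ... instead of naming each row by hand
rownames(padded) <- paste0("row", seq_along(example))

DF <- as.data.frame(padded, stringsAsFactors = FALSE)
```

The same two lines for naming and converting should work on the stri_list2matrix result as well, since it is also a character matrix.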
Convert each list component to a "ts" class object, cbind them to an "mts" class object (this will pad with NAs), and transpose giving a character matrix, mat. Set the row names. Finally convert that to a data frame. No packages are used.
mat <- t(do.call(cbind, Map(ts, example)))
rownames(mat) <- paste0("row", seq_along(example)) ##
DF <- as.data.frame(mat, stringsAsFactors = FALSE) ###
giving:
> DF
V1 V2 V3
row1 a <NA> <NA>
row2 b c <NA>
row3 d e f
Note: The question asked for a data frame and for row names; however, if row names are not needed omit the line marked ## and if a character matrix is sufficient then omit the line marked ###.
Update Fixed spelling of stringsAsFactors. Also minor simplification, add Note.

Split data.table into roughly equal parts

To parallelize a task, I need to split a big data.table to roughly equal parts,
keeping together groups defined by a column, id. Suppose:
N is the length of the data
k is the number of distinct values of id
M is the number of desired parts
The idea is that M << k << N, so splitting by id is no good.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16 # in application N is very large
k <- 6 # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
arrange(id)
t(dt$id)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a" "b" "b" "b" "b" "c" "c" "c" "d" "d" "d" "e" "e" "f" "f" "f"
in this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}}
and for M=4 is {{a,b}, {c}, {d,e}, {f}}
More generally, if id were numeric, the cutoff points should be
quantile(id, probs=seq(0, 1, length.out = M+1), type=1) or some similar split to roughly-equal parts.
What is an efficient way to do this?
Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.
I don't know how familiar you are with data.table, but you may have overlooked its by argument. Quoting @eddi's comment from below:
Instead of literally splitting up the data - create a new "parallel.id" column, and then call
dt[, parallel_operation(.SD), by = parallel.id]
Answer, assuming you don't want to use by
Sort the IDs by size:
ids <- names(sort(table(dt$id)))
n <- length(ids)
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):
gs <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],]
Check that the sizes aren't too bad:
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9 for M = 2
# [1] 5 5 6 for M = 3
# [1] 1 6 3 6 for M = 4
# [1] 1 4 2 3 6 for M = 5
Although I emphasized data.table at the top, this should work fine with a data.frame, too.
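To make the interleaving step easier to follow, here is a base R sketch on a plain vector of ids. The group sizes (a=1, b=4, c=3, d=3, e=2, f=3) are read off the id row printed in the question; the variable rows_per_part is just an illustrative name.

```r
# ids with the same group sizes as the question's example
id <- rep(letters[1:6], times = c(1, 4, 3, 3, 2, 3))
M <- 3

ids <- names(sort(table(id)))   # ids ordered from smallest to largest group
n <- length(ids)

# interleave the ascending and descending orders: small, big, small, big, ...
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]

# cut the interleaved ids into M consecutive chunks
gs <- split(alt_ids, ceiling(seq(n) / (n / M)))

# number of rows that would land in each part
rows_per_part <- sapply(gs, function(g) sum(id %in% g))
```

Pairing each small id with a large one is what keeps the parts close in size even though each chunk holds the same number of ids.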
If the distribution of the ids is not pathologically skewed, the simplest approach is something like this:
split(dt, as.numeric(as.factor(dt$id)) %% M)
It assigns each id to a bucket using the factor's integer code modulo the number of buckets.
For most applications this is good enough to get a relatively balanced distribution of data. Be careful with input like time series, though; in such a case you can enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach but most likely less practical.
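A small sketch of the modulo idea on a plain data.frame, using the group sizes from the question's example. The names df, bucket and parts are illustrative only.

```r
# ids with the same group sizes as the question's example
df <- data.frame(id = rep(letters[1:6], times = c(1, 4, 3, 3, 2, 3)),
                 value = seq_len(16),
                 stringsAsFactors = FALSE)
M <- 3

# integer code of each id (a = 1, ..., f = 6) modulo the number of buckets
bucket <- as.integer(factor(df$id)) %% M

# one data frame per bucket; all rows of an id share a bucket
parts <- split(df, bucket)
part_sizes <- sapply(parts, nrow)
```

Here the buckets end up with 6, 4 and 6 rows. To randomise which ids share a bucket, one option is to build the factor with shuffled levels, for example factor(df$id, levels = sample(unique(df$id))).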
If k is big enough, you can use this idea to split data into groups:
First, let's find the size of each id:
group_sizes <- dt[, .N, by = id]
Then create two lists of length M, one to track the size of each group and one to track which ids it contains:
grps_vals <- list()
grps_vals[1 : M] <- c(0)
grps_nms <- list()
grps_nms[1 : M] <- c(0)
(Here I deliberately added zero values so that the lists have length M.)
Then loop over the ids and, on every iteration, add the next id to the currently smallest group. This makes the groups roughly equal:
for (i in 1:nrow(group_sizes)) {
  sums <- sapply(grps_vals, sum)        # current size of each group
  idx <- which(sums == min(sums))[1]    # first of the smallest groups
  grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
  grps_nms[[idx]] <- c(grps_nms[[idx]], group_sizes$id[i])
}
Finally, delete the leading zero element from each list of names :)
grps_nms <- lapply(grps_nms, function(x){x[-1]})
> grps_nms
[[1]]
[1] "a" "d" "f"
[[2]]
[1] "b"
[[3]]
[1] "c" "e"
Just an alternative approach using dplyr. Run the chained script step by step to visualise how the dataset changes through each step. It is a simple process.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16 # in application N is very large
k <- 6 # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
arrange(id)
dt %>%
select(id) %>%
distinct() %>% # select distinct id values
mutate(group = ntile(id,3)) %>% # create grouping
inner_join(dt, by="id") # join back initial information
PS: I've learnt lots of useful stuff based on previous answers.

Matching Multiple Rows To Find A Value - R

I think this is similar to, but not the same as, a previous question that I asked here: Pull specific rows
Here is the code that I am now working with:
City <- c("x","x","y","y","z","z")
Type <- c("a","b","a","b","a","b")
Value <- c(1,3,2,5,6,10)
cbind.data.frame(City,Type,Value)
Which produces:
City Type Value
1 x a 1
2 x b 3
3 y a 2
4 y b 5
5 z a 6
6 z b 10
I want to do something similar to before, but now two different conditions must be met to pull a specific number. Let's say we have a matrix,
testmat <- matrix(c("x","x","y","a","b","b"),ncol=2)
Which looks like this:
[,1] [,2]
[1,] "x" "a"
[2,] "x" "b"
[3,] "y" "b"
The desired outcome is
[,1] [,2] [,3]
[1,] "x" "a" 1
[2,] "x" "b" 3
[3,] "y" "b" 5
Another Question PLEASE ANSWER THIS PART
library(dplyr)
City <- c("x","x","x","x","y","y","x","z")
Type <- c("a","a","a","a","a","b","a","b")
Value <- c(1,3,2,5,6,10,11,15)
mat <- cbind.data.frame(City,Type,Value)
mat
testmat <- matrix(c("y","x","b","a"),ncol=2)
testmat <- data.frame(testmat)
testmat
test <- inner_join(mat,testmat,by = c("City"="X1", "Type"="X2"))
Why does inner_join give me a warning message when I run this? Here is the warning I get:
In inner_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
The desired output is:
City Type Value
1 y b 10
2 x a 1
3 x a 3
4 x a 2
5 x a 5
6 x a 11
but it is producing...
City Type Value
1 x a 1
2 x a 3
3 x a 2
4 x a 5
5 y b 10
6 x a 11
I want inner_join to return the rows in the order in which they first appear in testmat, as shown above. Since City "y" of Type "b" comes first in testmat, I want it to come first in test.
The solution is to just switch the order of testmat and mat, like so:
test <- inner_join(testmat,mat,by = c("X1"="City", "X2"="Type"))
I find it interesting that the by parameter needs to follow the same order as the data frames passed to inner_join.
The warning appears because data.frame() converts character vectors to factors by default (this was the default before R 4.0.0). You can change this behaviour by running the following code at the start of your script:
options(stringsAsFactors = FALSE)
Answer to second part:
The warning states that you are trying to join on two factors with different levels. Therefore, the variables are coerced to character before joining; there's no problem with that. As Mostafa Rezaei mentioned in his answer, R converts character vectors to factors when creating a data frame. Usually it's best to keep them as characters:
mat <- data.frame(City,Type,Value, stringsAsFactors=F)
testmat <- data.frame(testmat, stringsAsFactors=F)
Concerning your real question:
The order of the result of a join is not defined. If order is crucial to you, you can use an additional sorting variable:
mat %>%
mutate(rn = row_number()) %>%
semi_join(testmat, by = c("City"="X1", "Type"="X2")) %>%
arrange(rn)
By the way, I think you're looking for a semi_join rather than an inner_join; read the help file for the differences.
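If the rows should follow the order of testmat (the order the question asked for), the same sorting-variable idea can be sketched in base R with merge. The helper columns rn and mat_rn are names I made up for this example, and merge is swapped in here for the dplyr joins used above.

```r
mat <- data.frame(City  = c("x", "x", "x", "x", "y", "y", "x", "z"),
                  Type  = c("a", "a", "a", "a", "a", "b", "a", "b"),
                  Value = c(1, 3, 2, 5, 6, 10, 11, 15),
                  stringsAsFactors = FALSE)
testmat <- data.frame(X1 = c("y", "x"), X2 = c("b", "a"),
                      stringsAsFactors = FALSE)

# remember the original row positions on both sides
testmat$rn <- seq_len(nrow(testmat))
mat$mat_rn <- seq_len(nrow(mat))

# merge does not promise any particular row order, so sort explicitly afterwards
test <- merge(testmat, mat, by.x = c("X1", "X2"), by.y = c("City", "Type"))
test <- test[order(test$rn, test$mat_rn), c("X1", "X2", "Value")]
```

Sorting first by rn and then by mat_rn puts the "y"/"b" row first and keeps the "x"/"a" rows in the order they had in mat.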

R: are there built-in functions to sort lists?

in R I have produced the following list L:
>L
[[1]]
[1] "A" "B" "C"
[[2]]
[1] "D"
[[3]]
[1] NULL
I would like to manipulate the list L to arrive at a data frame df like
>df
df[,1] df[,2]
"A" 1
"B" 1
"C" 1
"D" 2
where the 2nd column gives the position in the list L of the corresponding element in column 1.
My question is: are there built-in R functions that can do this manipulation quickly? I can do it using "brute force", but my solution does not scale well when I consider much bigger lists.
I thank you all!
You'll get a warning because of your NULL value, but you can use stack if you give your list items names:
L <- list(c("A", "B", "C"), "D", NULL)
stack(setNames(L, seq_along(L)))
# values ind
# 1 A 1
# 2 B 1
# 3 C 1
# 4 D 2
# Warning message:
# In stack.default(setNames(L, seq_along(L))) :
# non-vector elements will be ignored
If the warning displeases you, you can, of course, run stack on the non-NULL elements, but do it after you name your list elements so that the "ind" column reflects the correct value.
I'll show in 2 steps just for clarity:
names(L) <- seq_along(L)
stack(L[!sapply(L, is.null)])
Similarly, if you've gotten rid of the NULL list elements, you can use melt from "reshape2". You don't gain anything in brevity, and I'm not sure that you gain anything in efficiency either, but I thought I'd share it as an option.
library(reshape2)
names(L) <- seq_along(L)
melt(L[!sapply(L, is.null)])
Ananda's answer is seemingly better than this, but I'll put it up anyway:
> cbind(unlist(L), rep(1:length(L), sapply(L, length)))
[,1] [,2]
[1,] "A" "1"
[2,] "B" "1"
[3,] "C" "1"
[4,] "D" "2"
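A variant of the same idea, sketched with data.frame and lengths so that the position column stays an integer instead of being coerced to character by cbind. The column names value and position are illustrative.

```r
L <- list(c("A", "B", "C"), "D", NULL)

# unlist drops the NULL element; lengths() gives 0 for it, so rep() skips it too
df <- data.frame(value = unlist(L),
                 position = rep(seq_along(L), lengths(L)),
                 stringsAsFactors = FALSE)
```

Because the NULL element contributes nothing to either column, no warning is raised and no filtering step is needed.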

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice - therefore, in the set of shuffled vectors returned, they all should each have "a" twice in the set.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or, probably a better approach to removing duplicates:
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat),]
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
  if(n==1){
    return(matrix(1))
  } else {
    sp <- permutations(n-1)
    p <- nrow(sp)
    A <- matrix(nrow=n*p,ncol=n)
    for(i in 1:n){
      A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
    }
    return(A)
  }
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
  for(i in 1:length(pattern))
    x <- gsub(pattern[i], replacement[i], x, ...)
  x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x, old, new){
  return(gsub2(pattern = old,
               replacement = new,
               fixed = TRUE,
               x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
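As a possible shortcut, the remapping can also be sketched with plain indexing instead of gsub, since each row of the permutation matrix is already a set of positions into my_vec. This is an alternative sketch, not the method above; perm_index is a renamed copy of the recursive function so the block is self-contained, and unique() is added to drop the rows that coincide because "a" appears twice.

```r
# all permutations of 1..n, one per row (same recursion as above)
perm_index <- function(n) {
  if (n == 1) return(matrix(1L, 1, 1))
  sp <- perm_index(n - 1)
  p <- nrow(sp)
  A <- matrix(0L, nrow = n * p, ncol = n)
  for (i in seq_len(n)) {
    # put i first, then shift the smaller permutation to skip over i
    A[(i - 1) * p + seq_len(p), ] <- cbind(i, sp + (sp >= i))
  }
  A
}

my_vec <- c("a", "b", "a", "c", "d")
idx <- perm_index(length(my_vec))

# use the index matrix to look up the strings, keeping the 120 x 5 shape
all_perms <- matrix(my_vec[idx], nrow = nrow(idx))

# rows that differ only by swapping the two "a"s are identical; keep one of each
distinct_perms <- unique(all_perms)
```

With five items of which two are equal, this leaves 60 distinct shuffles out of the 120 index permutations.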
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...
