When I print model$data, I get the following:
model$data
[[1]]
Category1 Category2 Category3 Category4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
I am trying to use this expression to sample one of the column names:
sample(colnames(model$data), 1)
But I receive the following error message:
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Is there a way to avoid that error?
Notice this?
model$data
[[1]]
The [[1]] means that model$data is a list, whose first component is a data frame. To do anything with it, you need to pass model$data[[1]] to your code, not model$data.
sample(colnames(model$data[[1]]), 1)
This seems to be a near-duplicate of Random rows in dataframes in R and should probably be closed as duplicate. But for completeness, adapting that answer to sampling column-indices is trivial:
you don't need to generate a vector of column-names, only their indices. Keep it simple.
sample your col-indices from 1:ncol(df) instead of 1:nrow(df)
then put those column-indices on the RHS of the comma in df[, ...]
df[, sample(ncol(df), 1)]
the 1 is because you apparently want to take a sample of size 1.
one minor complication is that your dataframe is model$data[[1]]: your model$data looks like a list whose single element is a dataframe, rather than a plain dataframe. So first, assign df <- model$data[[1]]
finally, if you really really want the sampled column-name(s) as well as their indices:
samp_col_idxs <- sample(ncol(df), 1)
samp_col_names <- colnames(df)[samp_col_idxs]
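Putting the pieces together on the data from the question (a quick sketch; df is just the extracted model$data[[1]]):
df <- model$data[[1]]                          # pull the actual data frame out of the list
samp_col_idxs  <- sample(ncol(df), 1)          # one random column index
samp_col_names <- colnames(df)[samp_col_idxs]  # and its name
df[, samp_col_idxs]                            # the sampled column itself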
From a given dataframe:
# Create dataframe with 4 variables and 10 obs
set.seed(1)
df<-data.frame(replicate(4,sample(0:1,10,rep=TRUE)))
I would like to compute the subtraction between all pairwise combinations of columns, but keep only one direction per pair, i.e. column A - column B but not column B - column A, and so on.
What I have so far is very manual, and it does not scale well when there are many variables.
# Result
df_result <- as.data.frame(list(df$X1 - df$X2,
                                df$X1 - df$X3,
                                df$X1 - df$X4,
                                df$X2 - df$X3,
                                df$X2 - df$X4,
                                df$X3 - df$X4))
Also, the column name of each new feature should describe the operation, e.g. x1_x2 meaning x1 - x2.
You can use combn:
# all pairs of column names (one direction per pair)
COMBI <- combn(colnames(df), 2)
# subtract the second column of each pair from the first
res <- data.frame(apply(COMBI, 2, function(i) df[, i[1]] - df[, i[2]]))
# name each result column after the pair it came from
colnames(res) <- apply(COMBI, 2, paste0, collapse = "minus")
head(res)
X1minusX2 X1minusX3 X1minusX4 X2minusX3 X2minusX4 X3minusX4
1 0 0 -1 0 -1 -1
2 1 1 0 0 -1 -1
3 0 0 0 0 0 0
4 0 0 -1 0 -1 -1
5 1 1 1 0 0 0
6 -1 0 0 1 1 0
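For what it's worth, combn() can also apply the difference directly through its FUN argument; a minimal sketch of that variant, using the same df as above:
res2 <- as.data.frame(combn(colnames(df), 2, function(i) df[, i[1]] - df[, i[2]]))
colnames(res2) <- combn(colnames(df), 2, paste, collapse = "minus")
head(res2)   # same values and column names as res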
Suppose I draw a sample of size 12 from a binomial distribution with p = 0.2. I split this sample into 4 chunks (groups), each of size 3. Then I remove any chunk whose sum is equal to 0 and combine the remaining chunks into a new dataset. Here is some of my code:
set.seed(123)
sample1=rbinom(12,1,0.2)
chuck2=function(x,n)split(x,cut(seq_along(x),n,labels=FALSE))
chunk=chuck2(sample1,4)
newvector=c()
for (i in 1:4){
  aa = chunk[[i]]
  if (sum(aa) != 0){
    a.no0 = aa
    newvector = c(newvector, a.no0)
  }
}
print(newvector)
and this is the result I got:
[1] 1 1 0 0 1 0 0 1 0
What I'm trying to do is randomly regroup this data, for example:
[1] 0 1 0 0 1 1 1 0 0
or
[1] 1 0 1 0 1 0 1 0 0
......
I tried to use 'regroup' from the packages 'LearnBayes' and 'caroline', but it didn't work. Any hints, please?
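Assuming "randomly regroup" just means randomly permuting the retained values (rather than redrawing new chunks), a minimal sketch is sample() with no size argument, which shuffles the whole vector:
set.seed(456)                  # any seed, shown only for reproducibility
shuffled <- sample(newvector)  # random permutation of the combined vector
print(shuffled)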
I have this document-term matrix from the R package {tm}, which I have coerced with as.matrix(). MWE here:
> inspect(dtm[1:ncorpus, intersect(colnames(dtm), thai_list)])
<<DocumentTermMatrix (documents: 15, terms: 4)>>
Non-/sparse entries: 17/43
Sparsity : 72%
Maximal term length: 12
Weighting : term frequency (tf)
Terms
Docs toyota_suv gmotors_suv ford_suv nissan_suv
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 2 0 0
5 0 4 0 0
6 1 1 0 0
7 1 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
I need to subset this as.matrix(dtm) so that I get only documents (rows) which refer to toyota_suv and no other vehicle. I can get a subset for one term (toyota_suv) using dmat <- as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota_suv")]), which works well. How do I set up a query for documents where toyota_suv is non-zero but the values of all non-toyota_suv columns are zero? I could have specified each column as == 0, but this matrix is dynamically generated: in some markets there may be four cars, in others ten, so I cannot specify colnames beforehand. How do I (dynamically) constrain all the non-toyota_suv columns to be zero, like all_others == 0?
Any help will be much appreciated.
You can accomplish this by getting the index position of toyota_suv, then subsetting dtm to rows where that column is non-zero and, using negative indexing on the same index variable, all other columns are zero.
Here I modified your dtm slightly so that the two cases where toyota_suv is non-zero meet the criteria you are looking for (since none in your example met them):
dtm <- read.table(textConnection("
toyota_suv gmotors_suv ford_suv nissan_suv
0 1 0 0
0 1 0 0
0 1 0 0
0 2 0 0
0 4 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 1 0 0
0 1 0 0"), header = TRUE)
Then it works:
# get the index of the toyota_suv column
index_toyota_suv <- which(colnames(dtm) == "toyota_suv")
# select only cases where toyota_suv is non-zero and others are zero
dtm[dtm[, "toyota_suv"] > 0 & !rowSums(dtm[, -index_toyota_suv]), ]
## toyota_suv gmotors_suv ford_suv nissan_suv
## 6 1 0 0 0
## 7 1 0 0 0
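An equivalent sketch keyed on column names instead of a positional index, which may be convenient when the set of vehicle columns changes from market to market (the names here are just the ones from the example):
others <- setdiff(colnames(dtm), "toyota_suv")   # every non-toyota column, whatever it is called
dtm[dtm[, "toyota_suv"] > 0 & rowSums(dtm[, others, drop = FALSE]) == 0, ]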
Note: This is not really a text analysis question at all, but rather one for how to subset matrix objects.
It would be helpful if you provided the exact code you are running and a sample data set to work with so that we can replicate your work and provide a working example.
Given that, if I understand your question correctly, you are looking for a way to set all non-toyota columns to zero. You could try:
df[colnames(df) != "toyota"] <- 0
How to convert this
1,2,5,6,9
1,2
3,11
into this:
1,1,0,0,1,1,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0
0,0,1,0,0,0,0,0,0,0,1
I thought I could read my data by inserting NA wherever an index does not exist, then replace each NA with zero and every non-NA with one. But I don't know how, and I searched for similar code without finding anything.
You can do:
lapply(z, tabulate, nbins = max(unlist(z)))
[[1]]
[1] 1 1 0 0 1 1 0 0 1 0 0
[[2]]
[1] 1 1 0 0 0 0 0 0 0 0 0
[[3]]
[1] 0 0 1 0 0 0 0 0 0 0 1
where z is a list of vectors:
z <- list(c(1,2,5,6,9),c(1,2),c(3,11))
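If you want the result as a single 0/1 matrix (one row per input vector, as in the desired output), you could rbind the list elements afterwards:
do.call(rbind, lapply(z, tabulate, nbins = max(unlist(z))))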
I'm not sure what your original numbers are stored as, but here's a solution assuming it's a list of vectors:
nums <- list(
  c(1, 2, 5, 6, 9),
  c(1, 2),
  c(3, 11)
)
maxn <- max(unlist(nums))   # length needed: the largest index anywhere
lapply(nums, function(x) {
  binary <- numeric(maxn)   # start with all zeros
  binary[x] <- 1            # flip the listed positions to 1
  binary
})
So I have a list that contains certain characters as shown below
list <- c("MY","GM+" ,"TY","RS","LG")
And I have a variable named "CODE" in the data frame as follows
code <- c("MY GM+","","LGTY", "RS","TY")
df <- data.frame(1:5,code)
df
code
1 MY GM+
2
3 LGTY
4 RS
5 TY
Now I want to create 5 new variables named "MY", "GM+", "TY", "RS", "LG",
each taking a binary value: 1 if there is a match in the code variable, 0 otherwise.
df
code MY GM+ TY RS LG
1 MY GM+ 1 1 0 0 0
2 0 0 0 0 0
3 LGTY 0 0 1 0 1
4 RS 0 0 0 1 0
5 TY 0 0 1 0 0
Really appreciate your help. Thank you.
Since you know how many values will be returned (5), and what you want their types to be (integer), you could use vapply() with grepl(). We can turn the resulting logical matrix into integer values by using integer() in vapply()'s FUN.VALUE argument.
cbind(df, vapply(List, grepl, integer(nrow(df)), df$code, fixed = TRUE))
# code MY GM+ TY RS LG
# 1 MY GM+ 1 1 0 0 0
# 2 0 0 0 0 0
# 3 LGTY 0 0 1 0 1
# 4 RS 0 0 0 1 0
# 5 TY 0 0 1 0 0
I think your original data has a couple of typos, so here's what I used:
List <- c("MY", "GM+" , "TY", "RS", "LG")
df <- data.frame(code = c("MY GM+", "", "LGTY", "RS", "TY"))
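A close alternative sketch with sapply(), using unary + to turn the logical matrix into 0/1; it assumes the same List and df as above:
cbind(df, +sapply(List, grepl, x = df$code, fixed = TRUE))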