I have been using the wfm function in the "qdap" package to transpose text row values into columns, and ran into a problem when the data contains numbers along with text. For example, if the row value is "abcdef" the transpose works fine, but if the value is "ab1000" the numbers are truncated. Can anyone suggest how to work around this?
Approach tried so far:
input <- read.table(header=F, text="101 ab0003
101 pp6500
102 sm2456")
colnames(input) <- c("id","channel")
library(qdap)
output <- t(with(input, wfm(channel, id)))
output <- as.data.frame(output)
expected_output<- read.table(header=F,text="1 1 0
0 0 1")
colnames(expected_output) <- c("ab0003","pp6500", "sm2456")
I think wfm may not be the right tool for this job. You don't really have sentences that you want to split into words, so you're using a function with a lot of overhead unnecessarily. What you really want is to tabulate the values you have by another grouping variable.
Here are two approaches. One using qdapTools's mtabulate, another using base R's table:
library(qdapTools)
mtabulate(with(input, split(channel, id)))
##     ab0003 pp6500 sm2456
## 101      1      1      0
## 102      0      0      1
t(with(input, table(channel, id)))
##      channel
## id    ab0003 pp6500 sm2456
##   101      1      1      0
##   102      0      0      1
It's possible your MWE doesn't reflect the complexity of your data; if so, that brings us back to the original problem. wfm uses the tm package as a backend for some of the manipulations, so we need to supply something to the ldots (...). I re-read the documentation and it is a bit confusing (I have added this info in the dev version), but we want to pass removeNumbers=FALSE to TermDocumentMatrix, as seen here:
output <- t(with(input, wfm(channel, id, removeNumbers=FALSE)))
as.data.frame(output)
##     ab0003 pp6500 sm2456
## 101      1      1      0
## 102      0      0      1
I have certain data in a list extracted from a bayesian processing from certain electrodes and I want to populate a dataframe out of a loop. First I have a list of 729 processing outcomes and an object elecs which is basically a list of 729 pairs of electrodes (27*27) as you can see.
> head(elecs)
X Elec1 Elec2
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
The thing is I would like to fill dataf1 with the outcome of this loop which happens to be a dataframe of 4000 rows.
dataf1 <- data.frame('Elec1'=rep(NA,4000*729),'Elec2'=rep(NA,4000*729),'int'=rep(NA,4000*729))
for (i in nrow(elecs)){
Elec1=as.data.frame(rep(elecs[i,]$Elec1,4000))
Elec2=as.data.frame(rep(elecs[i,]$Elec2,4000))
post <- posterior_samples(bayeslist[[i]])
int <- as.data.frame(post$b_Intercept)
df <- cbind(Elec1,Elec2,int)
colnames(df) <- c('Elec1','Elec2','int')
dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999),c('Elec1','Elec2','int')] <- df
}
Everything works perfectly fine until the last line in the loop:
dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999),c('Elec1','Elec2','int')] <- df
And I don't know why exactly this is not working as expected and populating the dataf1 preinitialised dataframe.
Any insight, as always, will be highly appreciated.
I realised I was missing the 1: in the for statement, so it's a newbie typo. As written, the loop runs only once, with i equal to nrow(elecs). Apart from this, the code works, in case anyone is wondering. The original line
for (i in nrow(elecs)){
should be
for (i in 1:nrow(elecs)){
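To make the difference concrete, here is a minimal sketch using a toy three-row elecs (a stand-in, not the real 729-row object). It records which values of i each loop visits. seq_len() is used in the corrected loop as a common alternative to 1:nrow(), since it also behaves sensibly when a data frame has zero rows.

```r
# Toy stand-in for the real elecs object (assumption: only the row count matters here)
elecs <- data.frame(Elec1 = c(1, 1, 1), Elec2 = c(1, 2, 3))

# Buggy form: nrow(elecs) is a single number, so the body runs once with i = 3
visited_wrong <- integer(0)
for (i in nrow(elecs)) visited_wrong <- c(visited_wrong, i)

# Corrected form: iterate over every row index
visited_right <- integer(0)
for (i in seq_len(nrow(elecs))) visited_right <- c(visited_right, i)

print(visited_wrong)
print(visited_right)
```

With the buggy form only the last block of 4000 rows in dataf1 would ever be filled, which matches the symptom described in the question.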
I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal off of and each value in the vectors can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns then see what each of the composite signals tell me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in the vectors = 0. I could generate a new column that reads TRUE when that case is true and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and on and on, which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this to dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
sig2 = c(1,1,0,0,0,1,1),
sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
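If you would rather not depend on an extra package for the dummy step, the same idea can be sketched in base R. This is only an illustration on the same sample data; the sig_ column prefix and the names combos, flags and result are my own choices, not something from the answer above.

```r
# Same sample data as above
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))

# Paste the signal columns together to get one key per row
data$sig_tot <- paste0(data$sig1, data$sig2, data$sig3)

# One logical column per combination that actually occurs in the data
combos <- sort(unique(data$sig_tot))
flags <- sapply(combos, function(k) data$sig_tot == k)
colnames(flags) <- paste0("sig_", combos)

result <- cbind(data, flags)
print(result)
```

Note that this only creates columns for combinations present in the data; combinations that never occur would need to be added explicitly if you want them as all-FALSE columns.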
I believe in meaningful variable names. Unfortunately this often means that there are huge white gaps when I look at a data.frame in the R console:
Is there a way to tell R to print the column names vertically, like this:
It doesn't need to be in the console, maybe it is possible to plot a table to PDF that way?
Executable code, provided by Ben Bolker:
sample.table <- data.frame(a.first.long.variable.name=rep(1,7),
another.long.variable.name=rep(1,7),
this.variable.name.is.even.longer.maybe=rep(1,7)
)
As described in the comments, you may apply rotation via CSS:
library(DT)
df <- mtcars
names(df) <- sprintf('<div style="transform:rotate(-90deg);margin-top:30px;">%s</div>', names(df))
dt <- datatable(df, escape = FALSE)
tf <- tempfile(fileext = ".html")
htmlwidgets::saveWidget(dt, tf)
browseURL(tf)  # portable; shell.exec(tf) works on Windows only
This does not work in the RStudio Viewer, however it does work in the browser:
Not without using a graphics device.
A better and simpler workaround which works in a plain old console is:
Print the transpose of the table, now column-names become row-names:
> t(sample.table)
                                        1 2 3 4 5 6 7
a.first.long.variable.name              1 1 1 1 1 1 1
another.long.variable.name              1 1 1 1 1 1 1
this.variable.name.is.even.longer.maybe 1 1 1 1 1 1 1
(To suppress the useless column-names you get by default, include sample.table <- data.frame(row.names=1:7, ... )
I do this all the time. Heatmaps, dendrograms, auto-named regression variables from expanding categoricals...
I have a data frame, df, with many columns (cola, colb, etc.), each consisting of a sequence of integers 0 or 1
df$cola
[1] 1 0 0 1 0 1 0 0 etc.
I am using the subSeq function in the doBy package to obtain some sequences
and want to apply this to all columns
I have tried putting the columns into a vector
cols <- colnames(df) # "cola" "colb" etc.
and have tried without success this approach
subSeq(get(paste0("df$",cols[1]))) # error object 'df$cola' not found
Could not easily find an equivalent on site via search
I think you are looking for df[[cols[1]]].
Note that df[["foo"]] is the same as df$foo.
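As a small sketch of how this extends to every column: the example below uses toy data and base R's rle() as a stand-in for subSeq, so that it runs without doBy installed. Swapping in subSeq inside the lapply() call is the assumed adaptation, not something I have run here.

```r
# Toy data standing in for the real df
df <- data.frame(cola = c(1, 0, 0, 1, 0, 1, 0, 0),
                 colb = c(0, 0, 1, 1, 1, 0, 1, 0))
cols <- colnames(df)

# Select one column by a name held in a variable
first <- df[[cols[1]]]

# Apply a function to every column; rle() is a stand-in for subSeq here
runs <- lapply(df[cols], function(x) rle(x)$lengths)
print(runs)
```

The get(paste0("df$", ...)) attempt fails because get() looks up an object literally named "df$cola"; it does not parse the $ operator.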
I have a data.frame called series_to_plot.df which I created by combining a number of other data.frames together (shown below). I now want to pull out just the .mm column from each of these, so I can plot them. So I want to pull out the 3rd column of each data.frame (e.g. p3c3.mm, p3c4.mm etc...), but I can't see how to do this for all data.frames in the object without looping through the name. Is this possible?
I can pull out just one set: e.g. series_to_plot.df[[3]] and another by
series_to_plot.df[[10]] (so it is just a list of vectors..) and I can reference directly with series_to_plot.df$p3c3.mm, but is there a command to get a vector containing all mm's from each data.frame? I was expecting an index something like this to work: series_to_plot.df[,3[3]] but it returns Error in [.data.frame(series_to_plot.df, , 3[3]) : undefined columns selected
series_to_plot.df
p3c3.rd p3c3.day p3c3.mm p3c3.sd p3c3.n p3c3.noo p3c3.no_NAs
1 2010-01-04 0 0.1702531 0.04003364 7 1 0
2 2010-01-06 2 0.1790594 0.04696674 7 1 0
3 2010-01-09 5 0.1720404 0.03801756 8 0 0
p3c4.rd p3c4.day p3c4.mm p3c4.sd p3c4.n p3c4.noo p3c4.no_NAs
1 2010-01-04 0 0.1076581 0.006542157 6 2 0
2 2010-01-06 2 0.1393447 0.066758781 7 1 0
3 2010-01-09 5 0.2056846 0.047722862 7 1 0
p3c5.rd p3c5.day p3c5.mm p3c5.sd p3c5.n p3c5.noo p3c5.no_NAs
1 2010-01-04 0 0.07987147 0.006508766 7 1 0
2 2010-01-06 2 0.11496167 0.046478767 8 0 0
3 2010-01-09 5 0.40326471 0.210217097 7 1 0
To get all columns with specified name you could do:
names_with_mm <- grep("mm$", names(series_to_plot.df), value=TRUE)
series_to_plot.df[, names_with_mm]
But if your base data.frames all have the same structure then you can rbind them, something like:
series_to_plot.df <- rbind(
cbind(name="p3c3", p3c3),
cbind(name="p3c4", p3c4),
cbind(name="p3c5", p3c5)
)
Then the mm values are in one column and it's easier to plot.
To add to the other answers, I don't think it is a good idea to have useful information encoded in variable names. Much better to rearrange your data so that all useful information is in the value of some variable. I don't know enough about your data set to suggest the right format, but it might be something like
p c rd day date mm sd ...
3 3 2010-10-04 ...
Once you have done this the answer to your question becomes the simple df$mm.
If you are getting the data in a less useful form from an external source, you can rearrange it in a more useful form like the above within R using the reshape function or functions from the reshape package.
The R Language Definition has some good info on indexing (sec 3.4.1), which is pretty helpful.
You can then pull the names matching a pattern with the grep() command. Then string it all together like this:
dataWithMM <- series_to_plot.df[, grep("mm$", names(series_to_plot.df))]
To deconstruct it a little, this gets the numbers of the columns whose names match the "mm" pattern:
namesThatMatch <- grep("mm$", names(series_to_plot.df))
Then we use that list to call the columns we want:
dataWithMM <- series_to_plot.df[, namesThatMatch ]
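Here is a self-contained sketch of the same selection on a cut-down toy data frame (the values are made up for illustration; only the column naming scheme follows the question). It anchors the pattern to a literal ".mm" at the end of the name, which is a slightly stricter variant.

```r
# Toy data frame following the question's naming scheme (values are illustrative)
toy <- data.frame(p3c3.rd = 1:3, p3c3.mm = c(0.17, 0.18, 0.17), p3c3.n = c(7, 7, 8),
                  p3c4.rd = 1:3, p3c4.mm = c(0.11, 0.14, 0.21), p3c4.n = c(6, 7, 7))

# Names ending in ".mm"; value = TRUE returns the names rather than positions
mm_cols <- grep("\\.mm$", names(toy), value = TRUE)

# drop = FALSE keeps a data frame even if only one column matches
mm_only <- toy[, mm_cols, drop = FALSE]
print(mm_only)
```

A pattern written as "[mm]" is a character class, so it matches any name containing the letter m anywhere, which is why anchoring with $ is the safer choice.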