R data frames: delete columns whose names all contain the same string

This may very well be a dupe, but I can't figure out the terminology to google it.
I know how to delete a column from a data frame normally. But now I'm importing Qualtrics data, where I systematically assigned variable names like timer1_1, timer2_1, timer3_1, timer1_2, timer2_2, timer3_2 and so on.
Basically, in this example I want to delete every column whose name contains "timer".
Is there a way to do this? I have 56 variables named timer*, and I want them gone (along with other variables whose names follow the same kind of pattern).
The similar question I saw was about the values in a column, so maybe some kind of grep() voodoo will work here as well.

You can do:
df <- df[grep("timer", names(df), value = TRUE, invert = TRUE)]
This will work with your typical case as well as any of these corner cases:
df <- data.frame(x = 1:2, y = 1:2)
df <- data.frame(x = 1:2, timer1 = 1:2)
df <- data.frame(timer1 = 1:2)
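A variant of the same idea, as a minimal sketch: grepl() returns a logical vector, so negating it keeps the columns whose names do not match. The column names below are made up for illustration, and the "^timer" pattern assumes the names start with "timer"; drop the "^" anchor to match the string anywhere in the name.

```r
# Hypothetical data mimicking the Qualtrics naming scheme
df <- data.frame(id = 1:2, timer1_1 = 3:4, timer2_1 = 5:6, score = 7:8)

# Keep only the columns whose names do not start with "timer"
keep <- !grepl("^timer", names(df))
df_clean <- df[keep]

# Several prefixes can be dropped at once with alternation
df_clean2 <- df[!grepl("^(timer|score)", names(df))]
```

Indexing with df[keep] (no comma) always returns a data frame, even when only one column or no column is left.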

Related

Changing the type of Nth column elements from List

This is what I came up with. I assumed list[[ ]] would refer to all the data.frames in the list, but it did not work.
list <- as.Date(list[[]][,2])
That did not work, so I tried something like this:
list <- lapply(list[,2], as.Date)
Again, an error.
So, given the above, how can I tell R that I want to apply a function to all the elements of a column, for all the data.frames in the list, in the best way?
I don't know how to do what you want with any of the apply functions, but a simple loop works:
data1 = data.frame(y = seq(1,10),x=seq(from = 1789,to = 2789,length.out=10))
data2 = data.frame(y = seq(11,20),x=seq(from = 1789,to = 2789,length.out=10))
test = list(data1,data2)
for(i in 1:length(test)){
test[[i]][,2] = as.Date(test[[i]][,2],origin = "1899-12-30")
}
In this case I want to convert numeric Excel dates into R dates.
I tried modifying your second option to:
lapply(test,function(x) as.Date(x[,2],origin = "1899-12-30"))
but that returns only the converted date vectors, not the full data frames.
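One way to get whole data frames back from lapply(), sketched under the assumption that the second column holds numeric Excel serial dates as in the loop above: modify the column inside the anonymous function and return the data frame itself. The sample values are made up.

```r
# Hypothetical list of data frames with Excel serial dates in column 2
test <- list(
  data.frame(y = 1:3, x = c(43831, 43832, 43833)),
  data.frame(y = 4:6, x = c(43834, 43835, 43836))
)

# Convert column 2 in every data frame and return the full data frame
converted <- lapply(test, function(d) {
  d[[2]] <- as.Date(d[[2]], origin = "1899-12-30")
  d  # returning d (not just the dates) keeps the other columns
})
```

The last expression of the function is what lapply() collects, so ending with d is what keeps the result a list of data frames.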

How to assign a subset from a data frame `a' to a subset of data frame `b'

It might be a trivial question (I am new to R), but I could not find an answer, either here on SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of the df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag from the subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, namely, key and value. The subsets are defined by id = n, according to n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for(i in unique(df$id)){
indexer = df$id == i
# here is how I tried to update the data frame:
df[indexer,]$tag <- aux[match(df[indexer,]$tag, aux$key),]$value
}
The expected result was df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NAs. I got no errors, but the following warning message:
In '[<-.factor'('tmp', df$id == i, value = c(NA, :
invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tag values made match() produce misplaced updates in a number of rows. I also simulated the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (what the final dataset should look like):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
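The "invalid factor level" warning in the question suggests that tag is stored as a factor, so assigning strings that are not existing levels produces NA. A minimal sketch of the lookup with character columns, rebuilding the question's example data and assuming stringsAsFactors = FALSE is acceptable for your data:

```r
# Example data from the question, stored as character rather than factor
df <- data.frame(
  id = rep(1:4, 3),
  tag = rep(c("aaa", "bbb", "rrr", "fff"), 3),
  stringsAsFactors = FALSE
)
aux <- data.frame(
  key = c("aaa", "bbb", "rrr", "fff"),
  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
  stringsAsFactors = FALSE
)

# match() gives, for each tag, the row of aux holding its key;
# indexing aux$value by that vector performs the lookup
df$tag <- aux$value[match(df$tag, aux$key)]
```

If the columns are already factors, wrapping them in as.character() before the lookup should have the same effect.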
It turned out that my data was breaking all the available built-in functions, leaving me with a wrong dataset in the end. My solution (at least a preliminary one) was the following:
process each subset individually;
add each resulting data frame to a list;
use rbindlist(a.list, use.names = TRUE) from the data.table package to get a complete data frame with the results.
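The three steps above can be sketched in base R. Note that this swaps in do.call(rbind, ...) for rbindlist() so that the example needs no extra packages; the data is made up, and the per-subset processing is assumed to be the key/value lookup from the question.

```r
# Hypothetical data: two id groups and a key/value lookup table
df <- data.frame(
  id = rep(1:2, each = 2),
  tag = c("aaa", "bbb", "aaa", "bbb"),
  stringsAsFactors = FALSE
)
aux <- data.frame(
  key = c("aaa", "bbb"),
  value = c("valueAA", "valueBB"),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: process each id subset on its own, collecting a list
pieces <- lapply(split(df, df$id), function(part) {
  part$tag <- aux$value[match(part$tag, aux$key)]
  part
})

# Step 3: stack the pieces back into one data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

With many or large subsets, rbindlist() is likely to be faster than do.call(rbind, ...).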

R: Help using dummyVars and adding back into data.frame

I have a data.frame of 373127 obs. of 193 variables. Some variables are factors, and I want to use dummyVars() to split each factor level into its own column. I then want to merge the dummy variable columns back into my original data.frame. I thought I could do the whole thing with apply, but something is not working and I can't figure out what it is.
Sample:
dat_final <- apply(dummies.var1, 1, function(x) {
dummies.var1 <- dummyVars(~ dat1$factor.var1 -1, data = dat1)
})
Thanks!
You can do the following, which creates a new df, trsf; you can always assign it back to the original df:
library(caret)
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
The real answer is .... Don't do that. It's almost never necessary.
You could do something like this:
# Example data
df = data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
df = cbind(df, model.matrix(~df$x - 1))
However, as pointed out by @user30257, it is hard to see why you would want to do this. In general, modeling tools in R don't need dummy variables; they deal with factors directly.
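As a small illustration of the model.matrix() approach, here is a sketch that passes the data frame through the data argument instead of writing df$x in the formula, which gives shorter column names. The data is made up.

```r
# Hypothetical data with one factor column
df <- data.frame(
  x = factor(c("a", "b", "a", "c")),
  y = c(1.5, 2.0, 0.5, 3.0)
)

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ x - 1, data = df)

# Bind the indicator columns back onto the original data frame
df_wide <- cbind(df, dummies)
```

Each row of dummies contains exactly one 1, in the column for that row's level of x.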
Creating dummy variables can be very important in feature selection, which it sounds like the original poster was doing.
For instance, suppose you have a feature that contains duplicated information (i.e., one of its levels corresponds to something measured elsewhere). You can determine this is the case very simply by comparing the dummy variables for these features using a variety of dissimilarity measures.
My preference is to use sparse.model.matrix and cBind.

Get column header stored as a variable

I have a column header stored in a variable as follows:
a <- get("colA")# this variable changes and was obtained using regexp
The value of a is actually a column header called Nimu.
I also have a data frame (BigData) that has Nimu as a column header along with other columns. How can I use cbind/data.frame to select only a few columns, including Nimu, into a new data frame?
I have tried:
data <- cbind(BigData$Miu,BigData$sil,BigData$a)
But this did not work. R did not like BigData$a. Any suggestions? Thanks.
Something like this should work:
a <- get("colA")
b <- get("colB")
c <- get("colC")
cols = c(a, b, c)
df_subset = df[cols]
I do think your solution using get is probably sub-optimal and not needed, but without more context it is hard to say.
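The reason BigData$a fails is that $ takes the name after it literally and does not evaluate a as a variable. A minimal sketch with made-up data, assuming a holds the column name as a character string:

```r
# Hypothetical stand-in for BigData
BigData <- data.frame(Miu = 1:3, sil = 4:6, Nimu = 7:9, other = 10:12)

# The column name held in a variable
a <- "Nimu"

# Bracket indexing evaluates the variable, so this selects Miu, sil and Nimu
subset_df <- BigData[, c("Miu", "sil", a)]

# [[ ]] also evaluates the variable and returns the single column as a vector
one_col <- BigData[[a]]

# $ looks for a column literally called "a" and returns NULL here
literal <- BigData$a
```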

Importing values and labels from SPSS with memisc

I want to import both values and labels from a dataset, but I don't understand how to do it with this package (the documentation is not clear). I know it is possible because Rz (a GUI for R) uses memisc to do this. I would prefer, though, not to depend on too many packages.
Here is the only piece of code I have:
dataset <- spss.system.file("file.sav")
See the example in ?importer() which covers spss.system.file().
spss.system.file creates an 'importer' object that can show you variable names.
To actually use the data, you need to either do:
## To get the whole file
dataset2 <- as.data.set(dataset)
## To get selected variables
dataset2 <- subset(dataset, select = c(var1, var2))  # var1, var2: placeholders for your variable names
You end up with a data.set object, which is quite complex but does have what you want. For analysis, you usually need to call as.data.frame() on dataset2.
I figured out a solution to this that I like:
library(foreign)  # provides read.spss()
df <- suppressWarnings(read.spss("C:/Users/yada/yada/yada/ - SPSS_File.sav", to.data.frame = TRUE, use.value.labels = TRUE))
var_labels <- attr(df, "variable.labels")
# lookup table: column position, original name, SPSS variable label
names <- data.frame(column = 1:ncol(df), names(df), labels = var_labels, row.names=NULL)
names(df) <- names$labels
# turn the labels into syntactically valid column names
names(df) <- make.names(names(df))
