set removes variable in all data frames in workspace - r

I have a simple question to which I have not been able to find a solution here:
When I use set() to keep only a selection of variables in a data frame, the dropped variables are also removed from every copy of that data frame in my workspace.
Is there a way to remove them from a single data frame only?
A reproducible example (I want to remove the variable only from df, not from df2):
require(data.table)
df <- structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), x = c(0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L),
time = c(1636L, 1637L, 1638L, 1639L, 1640L, 1641L, 1642L,
1683L, 1684L, 1685L, 1686L, 1687L, 1688L, 1689L, 1690L, 1691L,
1638L, 1639L, 1640L)), .Names = c("group", "x", "time"), class = "data.frame", row.names = c(NA,
-19L))
df2 <- df
varstokeep <- c("group","x")
vartodrop <- which(!names(df)%in%varstokeep)
set(df, i=NULL, j=vartodrop, value=NULL)
The reason is that I have a large file, which I use as the basis for multiple (more aggregated) files. Having to load the basic file 6 times would take a lot more time.
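A possible way around this, as a minimal sketch: set() modifies its argument by reference, and df2 <- df does not create an independent copy (R would only copy on an ordinary modification, which set() bypasses). data.table's copy() makes a deep copy up front. The three-row data frame below is a made-up stand-in for the data in the question.

```r
library(data.table)

# Small stand-in for the question's data (values are illustrative only)
df <- data.frame(group = c(1L, 1L, 2L),
                 x     = c(0L, 1L, 1L),
                 time  = c(1636L, 1637L, 1683L))

# copy() makes a deep copy, so df2 no longer shares memory with df
df2 <- copy(df)

varstokeep <- c("group", "x")
vartodrop  <- which(!names(df) %in% varstokeep)

# Drops the column from df by reference; df2 is left untouched
set(df, j = vartodrop, value = NULL)

names(df)
names(df2)
```

If modifying by reference is not required, plain subsetting such as df <- df[, varstokeep] also leaves df2 unchanged, at the cost of copying the kept columns.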

Related

Subset using 'IF' and 'BY' in R

For a sample dataframe:
df <- structure(list(id = 1:19, region.1 = structure(c(1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L
), .Label = c("AT1", "AT2", "AT3", "AT4", "AT5"), class = "factor"),
PoorHealth = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L)), .Names = c("id", "region.1",
"PoorHealth"), class = "data.frame", row.names = c(NA, -19L))
I want to subset using the BY command, and hoped somebody may be able to help me.
I want to INCLUDE regions (region.1) in df that satisfy this condition:
Less than (or equal to) 3 occurrences of '1' in the variable 'PoorHealth'
OR this condition:
Where N (i.e. the respondents in each region) is less than or equal to 6.
If anyone has any ideas to help me, I should be very grateful.
This should work. I don't know if there is a cleaner way:
library(data.table)
setDT(df)
qualified_regions = df[,which((sum(PoorHealth==1) <=3 | .N <= 6)),region.1][,region.1]
df[region.1 %in% qualified_regions,]
Edit: I removed the ! because the OP changed "EXCLUDE" to "INCLUDE" in the original question.
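A possibly cleaner variant, offered as a sketch rather than the answerer's method: data.table lets you return .SD from a group only when a condition holds, which filters whole groups in one step. The nine-row table below is made up so that one region fails both conditions.

```r
library(data.table)

# Illustrative data: AT1 has 2 rows, AT2 has 7 rows with four 1s
dt <- data.table(
  region.1   = rep(c("AT1", "AT2"), times = c(2, 7)),
  PoorHealth = c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)
)

# Keep a region if it has at most 3 ones OR at most 6 respondents;
# groups where the condition is FALSE return NULL and are dropped
kept <- dt[, if (sum(PoorHealth == 1) <= 3 || .N <= 6) .SD, by = region.1]

unique(kept$region.1)
```

Here AT2 is dropped because it has four 1s and seven respondents, so neither condition is met.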

identifying rows in data frame that exhibit patterns

Below is a data frame with three columns: a group field, an open/closed field for the store, and the rolling 3-month sum of opens for the store. The desired output is included as a fourth column.
My dataset can be thought of as an employee's availability. You can assume each row is a different time period (hour, day, month, year, whatever). The open/closed column records whether or not the employee was present. The 3-month rolling column is a sum over the previous rows.
What I want to identify are the non-zero values in this rolling sum column that follow a gap of at least 3 zero rows within a group. Although not present in this dataset, you can assume there might be more than one such gap of zeros.
df1 <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), X0_closed_1_open = c(0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), X3month_roll_open = c(0L,
0L, 1L, 2L, 2L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), desired_solution = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("no", "yes"), class ="factor")), .Names = c("Group", "X0_closed_1_open", "X3month_roll_open", "desired_solution"), class = "data.frame", row.names = c(NA,
-26L))
One option is:
res <- unsplit(
  lapply(split(df1, df1$Group), function(x) {
    rl <- with(x, rle(X3month_roll_open == 0))
    indx <- cumsum(c(0, diff(inverse.rle(within.list(rl,
      values[values] <- lengths[values] >= 3))) < 0))
    x$Flag <- indx != 0 & x[, 3] != 0
    x
  }),
  df1$Group)
NOTE: Instead of 'yes'/'no', it may be better to use TRUE/FALSE, which makes subsetting easier.
identical(c('no', 'yes')[res$Flag+1L], as.character(res$desired_solution))
#[1] TRUE
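To make the run-length logic above easier to follow, here is a step-by-step sketch on a single made-up vector (no grouping). It uses the same base R functions, rle() and inverse.rle(), but with intermediate variables whose names are my own.

```r
# Illustrative rolling-sum values for one group
x <- c(0L, 1L, 0L, 0L, 0L, 2L, 1L)

# Run-length encode "is zero": runs of lengths 1, 1, 3, 2
r <- rle(x == 0)

# Keep only zero runs that are at least 3 long
r$values <- r$values & r$lengths >= 3

# Expand back to one value per row: TRUE inside a long zero gap
in_gap <- inverse.rle(r)

# A drop from TRUE to FALSE marks the end of a gap; cumsum carries it forward
after_gap <- cumsum(c(0, diff(in_gap) < 0)) > 0

# Non-zero values that come after at least one long gap
flag <- after_gap & x != 0
flag
```

Only the last two elements are flagged: the leading 1 is non-zero but is not preceded by a gap of three zeros.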

chi squared and basic statistics on multiple columns of a data frame

I would like to run a chi-squared test for each column in a data frame against the grouping variable Project.
Basically, I would like to compute a two-by-two table for each column and then store the resulting value in a new table.
Here is an example of my data frame:
structure(list(Project = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("discovery", "validation"), class = "factor"), MLL = c(1L, 1L, 1L, 1L, 1L, 1L), CREB = c(0L, 1L, 1L, 1L, 1L, 0L), TNR = c(1L, 1L, 0L, 0L, 1L, 1L)), .Names = c("Project", "MLL", "CREB", "TNR"), row.names = c(1L, 2L, 3L, 300L, 301L, 302L), class = "data.frame")
After the comment of Jaap I have tried:
pvalue <- data.frame(apply(cast_subset[-1] , 2 , function(i) chisq.test(table(cast_subset$Project , i ))$p.value))
colnames(pvalue) <- "p.value"
but I cannot access the column with the gene names for merging with another data set.
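One possible explanation, as a sketch: in the attempt above the gene names end up as the row names of pvalue, not as a column. Collecting the p-values into a named vector first and then building the data frame gives an explicit column to merge on. The column name gene is my own choice, and suppressWarnings() only hides the small-sample warning that chisq.test() raises on data this small.

```r
# Same six rows as in the question
cast_subset <- data.frame(
  Project = factor(rep(c("discovery", "validation"), each = 3)),
  MLL     = c(1L, 1L, 1L, 1L, 1L, 1L),
  CREB    = c(0L, 1L, 1L, 1L, 1L, 0L),
  TNR     = c(1L, 1L, 0L, 0L, 1L, 1L)
)

# One p-value per gene column; sapply keeps the column names
p <- sapply(cast_subset[-1], function(col) {
  suppressWarnings(chisq.test(table(cast_subset$Project, col))$p.value)
})

# Gene names become a regular column that can be used as a merge key
pvalue <- data.frame(gene = names(p), p.value = unname(p))
pvalue
```

With the original pvalue object, pvalue$gene <- rownames(pvalue) should achieve the same thing. Note that a constant column such as MLL yields a one-column table, so chisq.test() treats it as a goodness-of-fit test and not as a two-by-two test.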

Error using predict with klaR package, NaiveBayes

I'm using the klaR package's predict method as mentioned in the post Naive bayes in R:
nb_testpred <- predict(mynb, newdata=testdata)
mynb is my Naive Bayes model, developed on traindata; testdata is the remaining data.
However, I get this error:
Error in FUN(1:10[[4L]], ...) : subscript out of bounds
I'm not sure what's going on - testdata has fewer rows than traindata, and the same number of columns.
For reference, my code looks like this:
ind <- sample(2, nrow(mydata), replace=TRUE, prob=c(0.9,0.1))
traindata <- mydata[ind==1,]
testdata <- mydata[ind==2,]
myformula <- as.factor(dep) ~ X1 + as.factor(X2) + as.factor(X3) + as.factor(X4) + X5 + as.factor(X6) + as.factor(date) + as.factor(hour)
mynb <- NaiveBayes(myformula, data=traindata)
nb_testpred <- predict(mynb, newdata=testdata) #where I'm getting an error...
A sample of the data is here (the original file has 100,000+ rows):
sampledata <- structure(list(dep = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), X1 = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), X2 = c(200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L,
200L, 200L), X3 = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c(".", "1400000", "2400000", "900000"), class = "factor"), X4 = c(0L, 0L, 0L, 3L, 4L, 5L, 5L, 5L, 5L, 0L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 0L), X5 = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), X6 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L), .Label = c("9/23/2012",
"9/24/2012"), class = "factor"), hour = c(18L, 17L, 23L, 8L, 1L, 19L, 19L, 16L, 22L, 2L, 12L, 16L, 15L, 9L, 1L, 9L,
13L, 19L)), .Names = c("dep", "X1", "X2", "X3", "X4", "X5", "X6", "date", "hour"), class = "data.frame", row.names = c(NA, -18L))
Any help would be greatly appreciated!
You can proceed as follows:
traindata$dep=factor(traindata$dep)
mynb <- NaiveBayes(dep~.,traindata)
Then it works; however, you should clean your data to avoid constant columns.
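As a sketch of how constant columns might be found and dropped before fitting (base R only; the four-row data frame is made up, with X2 constant as in the sample data):

```r
# Illustrative training data: X2 has the same value in every row
traindata <- data.frame(
  dep = c(1L, 0L, 0L, 1L),
  X2  = c(200L, 200L, 200L, 200L),
  X4  = c(0L, 3L, 4L, 0L)
)

# TRUE for columns with at most one distinct value
is_constant <- vapply(traindata,
                      function(col) length(unique(col)) <= 1L,
                      logical(1))

names(traindata)[is_constant]

# Drop the constant columns, keeping a data frame even if one column remains
traindata_clean <- traindata[, !is_constant, drop = FALSE]
```

This only identifies the columns; whether they are the cause of the original error is not confirmed here.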

Removing Survey non-response in R

So, I have a data frame with several continuous variables and several dummy variables. The survey that this data frame comes from uses 6,7,8 and 9 to denote different types of non-response. So, I would like to replace 6,7,8 and 9 with NA whenever they show up in a dummy variable column but leave them be in the continuous variable column.
Is there a concise way to go about doing this?
Here's my data:
> dput(head(sfsuse[c(4:16)]))
structure(list(famsize = c(3L, 1L, 2L, 5L, 3L, 5L), famtype = c(2L,
1L, 2L, 3L, 2L, 3L), cc = c(1L, 1L, 1L, 1L, 1L, 1L), nocc = c(1L,
1L, 1L, 3L, 1L, 1L), pdloan = c(2L, 2L, 2L, 2L, 2L, 2L), help = c(2L,
2L, 2L, 2L, 2L, 2L), budget = c(1L, 1L, 1L, 1L, 2L, 2L), income = c(340000L,
20500L, 0L, 165000L, 95000L, -320000L), govtrans = c(7500L, 15500L,
22000L, 350L, 0L, 9250L), childexp = c(0L, 0L, 0L, 0L, 0L, 0L
), homeown = c(1L, 1L, 1L, 1L, 1L, 2L), bank = c(2000L, 80000L,
25000L, 20000L, 57500L, 120000L), vehval = c(33000L, 7500L, 5250L,
48000L, 8500L, 50000L)), .Names = c("famsize", "famtype", "cc",
"nocc", "pdloan", "help", "budget", "income", "govtrans", "childexp",
"homeown", "bank", "vehval"), row.names = c(NA, 6L), class = "data.frame")
I'm trying to substitute NA for 6, 7, 8 and 9 in columns 3:7 and column 11. I know how to do this one column at a time by column name:
df$name[df$name %in% 6:9]<-NA
but I would have to repeat this for each column by name. Is there a concise way to do it by column index?
Thanks
This function should work:
f <- function(data,k) {
  data[data[,k] %in% 6:9,k] <- NA
  data
}
Now at the console:
> for (k in c(3:7,11)) { data <- f(data,k) }
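An alternative without an explicit loop, as a minimal sketch: apply the replacement to the selected columns with lapply() and assign the result back by column index. The three-column data frame and the index vector cols are made up for illustration; for the data in the question the indices would be c(3:7, 11).

```r
# Illustrative data: cc and budget are dummy variables, income is continuous
sfs <- data.frame(
  cc     = c(1L, 6L, 2L),
  budget = c(9L, 1L, 2L),
  income = c(6L, 20500L, 9L)
)

# Columns in which 6:9 mean non-response
cols <- c(1, 2)

# Replace 6:9 with NA in the selected columns only
sfs[cols] <- lapply(sfs[cols], function(col) replace(col, col %in% 6:9, NA))
sfs
```

The income column keeps its 6 and 9 because it is not listed in cols.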
