I have a data set as below:
age sex Cond label
range1 M 1 0
range2 M 2 1
range3 F 4 1
with more rows. All data columns are discrete.
I intend to use hc, gs, bn, and tan from the bnlearn package in R. What data transformation should I use? How should I convert the data to factors?
Regarding the second question, converting to factors is straightforward: loop through the columns of interest with lapply, apply factor to each, and assign the output back to the original dataset.
df1[] <- lapply(df1, factor)
If we only want a subset of columns, say 'age' and 'sex', subset the dataset and loop through those:
df1[c('age', 'sex')] <- lapply(df1[c('age', 'sex')], factor)
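As a quick check, here is a minimal sketch built from the three sample rows in the question (the name df1 follows the answer above; a real data set would have more rows). It converts every column and confirms the result, since bnlearn's discrete learners expect factor columns.

```r
# Sample rows copied from the question; values are illustrative only.
df1 <- data.frame(
  age   = c("range1", "range2", "range3"),
  sex   = c("M", "M", "F"),
  Cond  = c(1, 2, 4),
  label = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# Convert every column to a factor; df1[] keeps the data.frame structure.
df1[] <- lapply(df1, factor)

# Every column should now report TRUE.
all_factors <- sapply(df1, is.factor)
print(all_factors)

# Numeric codes such as Cond become factor levels stored as strings.
cond_levels <- levels(df1$Cond)
print(cond_levels)
```

Note that numeric columns like Cond and label lose their numeric meaning once converted; their values become unordered category labels.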
Related
Within a data frame, I have a variable (var1) that has 3 levels (X, Y, Z), and I would like to print out all the observations at level Z. How could I do this? I've tried using the with() function, but haven't had any luck.
table(var1)
var1
X Y Z
18 36 1
Is your variable a column in your data frame?
If it is, you could just filter with dplyr. The comparison is case-sensitive, so the level has to be written as "Z":
library(dplyr)
df %>%
  filter(var1 == "Z")
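The same filter can also be written in base R without loading a package. This is a small sketch on a toy data frame; the names df, var1 and value are illustrative and not taken from the asker's data.

```r
# Toy data frame; 'value' is a made-up column for illustration.
df <- data.frame(
  var1  = factor(c("X", "Y", "Z", "Y", "X")),
  value = c(10, 20, 30, 40, 50)
)

# Logical indexing: keep rows where var1 equals "Z" (case-sensitive).
z_rows <- df[df$var1 == "Z", ]
print(z_rows)

# subset() selects the same rows with a slightly shorter syntax.
z_rows2 <- subset(df, var1 == "Z")
print(z_rows2)
```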
I would like to know how to subset in R based on a condition. I have a large object with 10 columns, 8 of which are logical. I want to extract all rows that are TRUE in any 4 of those 8 columns.
See below. I create a vector that includes the names of the true/false variables. R will interpret TRUE as 1 and FALSE as 0; consequently, when summing across rows we want to keep rows that have a sum of 4 or greater. rowSums(df[,tf_vars]) >= 4 creates a TRUE/FALSE vector that indicates where the row has 4 or more trues. (Note that df[,tf_vars] will subset the columns of the dataframe, only keeping the variables in tf_vars). I then use that vector to subset the dataframe.
# Create dummy dataframe with 8 random TRUE/FALSE columns
df <- data.frame(matrix(nrow=100, ncol=0))
for(i in 1:8){
  df[[paste0("TFvar",i)]] <- sample(c(TRUE,FALSE), size=100, replace=TRUE, prob=c(.5,.5))
}
# Subset dataframe where at least 4 of the columns are true
tf_vars <- c("TFvar1", "TFvar2", "TFvar3", "TFvar4", "TFvar5", "TFvar6", "TFvar7", "TFvar8")
# (or you could use this to grab the names of the columns that are TRUE/FALSE variables.)
tf_vars <- names(df)[sapply(df, is.logical)]
df_subset <- df[rowSums(df[,tf_vars]) >= 4,]
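Because the dummy data above is random, the effect is easier to follow on a small hand-made example. The sketch below uses made-up columns id and v1 to v5 (not from the question) and the same rowSums idea.

```r
# Small hand-made data frame; id and v1..v5 are illustrative names.
df <- data.frame(
  id = 1:3,
  v1 = c(TRUE,  TRUE,  FALSE),
  v2 = c(TRUE,  FALSE, FALSE),
  v3 = c(TRUE,  TRUE,  TRUE),
  v4 = c(TRUE,  TRUE,  FALSE),
  v5 = c(FALSE, TRUE,  FALSE)
)

# Pick out the logical columns by type, so 'id' is ignored.
tf_vars <- names(df)[sapply(df, is.logical)]

# Count TRUEs per row: TRUE counts as 1, FALSE as 0.
true_counts <- rowSums(df[, tf_vars])
print(true_counts)

# Keep rows with at least 4 TRUE values.
df_subset <- df[true_counts >= 4, ]
print(df_subset)
```

Rows 1 and 2 each have four TRUE values and are kept; row 3 has only one and is dropped.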
My data frame has a first column of factors, and all the other columns are numeric.
Origin spectrum_740.0 spectrum_741.0 spectrum_742.0 etc....
1 Warthog 0.6516295 0.6520196 0.6523843
2 Tiger 0.4184067 0.4183569 0.4183805
3 Sperm whale 0.9028763 0.9031688 0.9034069
I would like to convert the data frame into two variables, a vector (the first column) and a matrix (all the numeric columns), so that I can do calculations on the matrix, such as applying msc from the pls package. Basically, I want the data frame to be like the gasoline data set from pls, which has one variable as a numeric vector, and a second variable called NIR as a matrix with 401 columns.
Alternatively, if you have any suggestions for applying calculations to the numeric data while keeping the Origin column connected, that would work too, but all the examples I have seen use gasoline or similarly formatted data frames to do the calculations on the NIR matrix.
Thank you!
M = as.matrix(df[,-1])
row.names(M) = df[,1]
M
spectrum_740.0 spectrum_741.0 spectrum_742.0
Warthog 0.6516295 0.6520196 0.6523843
Tiger 0.4184067 0.4183569 0.4183805
Sperm_whale 0.9028763 0.9031688 0.9034069
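If the goal is a data frame shaped like gasoline, one option is to store the matrix as a single column using I(). This is a sketch on the three rows shown in the question; the object name spectra is an assumption, and NIR is borrowed from the gasoline naming.

```r
# Toy version of the question's data; values copied from the rows shown.
df <- data.frame(
  Origin = factor(c("Warthog", "Tiger", "Sperm whale")),
  spectrum_740.0 = c(0.6516295, 0.4184067, 0.9028763),
  spectrum_741.0 = c(0.6520196, 0.4183569, 0.9031688),
  spectrum_742.0 = c(0.6523843, 0.4183805, 0.9034069)
)

# Build a two-variable data frame: Origin plus one matrix-valued column.
# I() marks the matrix "as is" so it is kept as a single variable.
spectra <- data.frame(Origin = df$Origin)
spectra$NIR <- I(as.matrix(df[, -1]))

print(dim(spectra))       # rows x variables
print(dim(spectra$NIR))   # rows x spectral columns
```

Origin stays aligned with the rows of spectra$NIR, so matrix functions can be applied to spectra$NIR while the labels remain in the same data frame.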
I have a dataframe that I need to split into smaller dataframes by groups of factors so that I can paginate tables and figures.
For example, say I wanted to split the diamonds dataset into mini dataframes with 2 cut levels per dataframe. That would mean a list of 2 dataframes with 2 levels each and 1 dataframe with 1 level.
levels(diamonds$cut)
# "Fair" "Good" "Very Good" "Premium" "Ideal"
I'm trying to use split() to accomplish this. split(diamonds, diamonds$cut) splits the set into dataframes by factor, but how would you split it up by groups of 2, 3, or n levels? Something like split(data,rep(1:round(nrow(data)/10),each=10)) works when each factor level only has one row, but I'm working with a "long" dataframe, so the levels are spread out along the length of the dataframe.
This question comes close, but uses a numeric variable that I don't have.
We split the levels of the 'cut' variable using a grouping variable created with gl, and then subset 'diamonds' for each list element using %in%.
v1 <- levels(diamonds$cut)
n <- 2
lapply(split(v1, as.numeric(gl(length(v1), n, length(v1)))),
function(x) diamonds[diamonds$cut %in% x,])
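To see what the gl call produces, the grouping can be inspected on its own. This sketch hard-codes the level names from the question's levels(diamonds$cut) output, so it runs without loading ggplot2; grp and level_groups are illustrative names.

```r
# Level names taken from the question's output of levels(diamonds$cut).
v1 <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
n <- 2  # number of levels per group

# gl() yields group ids in runs of n, truncated to length(v1).
grp <- as.numeric(gl(length(v1), n, length(v1)))
print(grp)

# Splitting the level names by that id gives one set of levels per page.
level_groups <- split(v1, grp)
print(level_groups)
```

Changing n to 3 would give groups of three levels, with the remainder in the last group.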
By using:
diamonds$splt <- c("B","A")[diamonds$cut %in% c("Very Good","Premium","Ideal") + 1L]
you create a new variable on which you can split the dataset in two with:
split(diamonds, diamonds$splt)
A simple solution:
df_splt<-split(diamonds,ceiling(as.numeric(diamonds$cut)/2))
Note, though, that each resulting data.frame still carries the empty levels.
> table(df_splt[[1]]$cut)
     Fair      Good Very Good   Premium     Ideal
     1610      4906         0         0         0
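If the empty levels get in the way (for example in table or plot legends), droplevels can be applied to each piece. A minimal sketch on a toy stand-in for diamonds follows; toy and its values are made up for illustration.

```r
# Toy stand-in for diamonds; names and values are illustrative only.
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
toy <- data.frame(
  cut   = factor(c("Fair", "Good", "Ideal", "Premium", "Good"),
                 levels = cut_levels),
  price = c(300, 400, 900, 800, 450)
)

# Same grouping as above: level codes 1-2, 3-4 and 5 map to groups 1, 2, 3.
toy_splt <- split(toy, ceiling(as.numeric(toy$cut) / 2))

# droplevels() removes factor levels that no longer occur in each piece.
toy_splt <- lapply(toy_splt, droplevels)

first_levels <- levels(toy_splt[[1]]$cut)
print(first_levels)
```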
I have a dataset of numeric and categorical variables with ~200,000 rows, but many variables are constant (both numeric and categorical). I am trying to create a new dataset in which the variables with length(unique(data.frame$factor)) <= 1 are dropped.
Example data set and attempts so far:
Temp=c(26:30)
Feels=c("cold","cold","cold","hot","hot")
Time=c("night","night","night","night","night")
Year=c(2015,2015,2015,2015,2015)
DF=data.frame(Temp,Feels,Time,Year)
I would think a loop would work, but something isn't working in my two attempts below. I've tried:
for (i in unique(colnames(DF))){
Reduced_DF <- DF[,(length(unique(DF$i)))>1]
}
But I really need a vector of the colnames where length(unique(DF$columns))>1, so I tried the below instead, to no avail.
for (i in unique(DF)){
if (length(unique(DF$i)) >1)
{keepvars <- c(DF$i)}
Reduced_DF <- DF[keepvars]
}
Does anyone out there have experience with this type of subsetting/dropping of columns with less than a certain level count?
You can find out how many unique values are in each column with:
sapply(DF, function(col) length(unique(col)))
# Temp Feels Time Year
# 5 2 1 1
You can use this to subset the columns:
DF[, sapply(DF, function(col) length(unique(col))) > 1]
# Temp Feels
# 1 26 cold
# 2 27 cold
# 3 28 cold
# 4 29 hot
# 5 30 hot
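One caveat with this kind of column subsetting: if only one column survives, `[` simplifies the result to a plain vector unless drop = FALSE is given. The sketch below shows the difference on a made-up data frame DF2 (not from the question).

```r
# Example where only one column varies; 'DF2' is an illustrative name.
DF2 <- data.frame(
  Temp = 26:30,
  Time = rep("night", 5),
  Year = rep(2015, 5)
)

# TRUE for columns with more than one distinct value.
keep <- sapply(DF2, function(col) length(unique(col))) > 1

# Without drop = FALSE a single surviving column collapses to a vector.
as_vector <- DF2[, keep]

# With drop = FALSE the result stays a data frame.
as_frame <- DF2[, keep, drop = FALSE]
print(as_frame)
```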
Another way with data.table
#Convert object to data.table object
library(data.table)
setDT(DF)
#Drop columns
todrop <- names(DF)[which(sapply(DF,uniqueN)<2)]
DF[, (todrop) := NULL]
One advantage of this method is that it modifies DF by reference and does not make a copy, which can be useful with a data set as large as yours.
If you are using data.table 1.9.4, you would change to the following:
#Drop columns
todrop <- names(DF)[which(sapply(DF, function(x) length(unique(x)) < 2))]
DF[, (todrop) := NULL]
Another possible solution, this time in Python with pandas rather than R, drops the categorical columns in two lines: the first line collects the categorical column names and the second drops them. Here df is our DataFrame.
df with categorical column:
cat_cols = pd.DataFrame(df.categorical).columns
df = df.drop(cat_cols, axis=1)
df after running the code: