Is there a way to randomly pick a column in a data frame and then avoid randomly picking it again? This picks a random column:
random_data_vector = data[, sample(ncol(data), 1)]
but I'm not sure how to avoid picking the same column again. I thought about removing the column completely, but there might be a better approach.
You can first sample the columns with
random_cols <- sample(ncol(data))
and then select the random vectors like this
random_data_vector1 <- data[, random_cols[1]]
random_data_vector2 <- data[, random_cols[2]]
The default setting of sample is replace = FALSE, so the random_cols vector is a permutation of the column indices: it contains no duplicated numbers, and you won't select the same column twice.
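As a minimal sketch of the idea above, assuming a small made-up data frame (the column names a to d and the seed are only for illustration), the shuffled index vector can be consumed one position at a time:

```r
# Toy data frame; the contents are hypothetical and only illustrate the idea.
data <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

set.seed(42)  # only to make the sketch reproducible

# sample() with a single integer n returns a random permutation of 1:n,
# because replace = FALSE is the default.
random_cols <- sample(ncol(data))

# Each position of random_cols refers to a different column.
first_pick  <- data[, random_cols[1]]
second_pick <- data[, random_cols[2]]
```

Since every column index appears exactly once in random_cols, walking through it from start to end visits each column once without having to delete anything from the data frame.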
In R, I want to select all the variables from a dataset for which no single value accounts for 40% or more of the entries in that column.
I am applying sapply, but not getting the correct output.
Note: All the columns values are numeric.
train = train[, sapply(train, function(col) length(unique(col))) < 0.4*nrow(train)]
Please suggest how to proceed.
By playing around with a toy dataset, I found this code that works
train[, sapply(train, function(x) {(sort(table(x), decreasing = TRUE)/nrow(train))[[1]] < 0.4})]
Basically, I create the table of relative frequencies (sorted in decreasing order) for each numeric column in train, and then I check whether the most frequent value in each column occurs less than 40% of the time. If so, the column is selected; otherwise it is discarded.
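To see the rule at work, here is a small sketch on a made-up data frame (the column names and values are hypothetical). It uses max(table(x)) as a shortcut for taking the first element of the sorted table, which should give the same count:

```r
# Hypothetical toy data: 5 rows, three numeric columns.
train <- data.frame(
  constantish = c(7, 7, 7, 7, 1),  # most frequent value covers 80% of rows
  varied      = c(1, 2, 3, 4, 5),  # most frequent value covers 20% of rows
  mixed       = c(2, 2, 2, 5, 6)   # most frequent value covers 60% of rows
)

# For each column: share of rows taken by the most frequent value, then
# TRUE when that share is below 40%.
keep <- sapply(train, function(x) max(table(x)) / length(x) < 0.4)

# drop = FALSE keeps the result a data frame even if one column survives.
selected <- train[, keep, drop = FALSE]
```

Only the column varied passes the check here. Note that table() ignores NA by default, so columns with many missing values may need separate handling.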
I'm trying to do a basic operation (let's say, subtract 1) on a column that is referenced by the value of another column (let's say the first column). For example: for a given row, the value in the first column is 5. Then I want to subtract 1 from column 5 of that same row.
Data looks like
tmp <- data.frame(replicate(5,sample(2:5,5,rep=TRUE)))
Of course, I can achieve this with several lines of code, each time selecting the subset of rows that satisfies a certain condition, but I'm sure this operation could be performed more cleanly and dynamically.
Additional question: is there an easy way to reference columns in the same manner by name rather than by index, for example the column name "S5"? The easiest way I can think of is to match the name against names(tmp) and use the resulting index. Or can you think of an easier way?
Any suggestions?
So the column index of the element is in the first column of the row:
for (i in 1:nrow(tmp)) tmp[i, tmp[i,1]] <- tmp[i, tmp[i,1]] - 1
It also works if the first column holds the names of the columns as character (not as factor!):
tmp <- data.frame(cInd=c("A", "B", "C"), A=1:3, B=11:13, C=21:23,
stringsAsFactors = FALSE)
for (i in 1:nrow(tmp)) tmp[i, tmp[i,1]] <- tmp[i, tmp[i,1]] - 1
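The loop can also be avoided with matrix indexing. This is a different technique from the answer above, sketched here under the assumption that all columns are numeric, so the data frame converts cleanly to a matrix. The data and the names m, idx and result are made up for the illustration:

```r
# Hypothetical toy data: the first column holds the index of the column
# to modify in that row.
tmp <- data.frame(X1 = c(2, 3, 5), X2 = c(4, 4, 4), X3 = c(5, 5, 5),
                  X4 = c(2, 2, 2), X5 = c(3, 3, 3))

m <- as.matrix(tmp)

# Two-column index matrix: one (row, column) pair per row of the data.
idx <- cbind(seq_len(nrow(m)), m[, 1])

# Subtract 1 from exactly those cells.
m[idx] <- m[idx] - 1

result <- as.data.frame(m)
```

With character column names in the first column, match(tmp[[1]], names(tmp)) would be one way to turn the names into the numeric indices this sketch needs.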
I want to find entries in an R dataframe based on their value in order to be able to replace them by the number of the column each of these entries is located in. Well, it's easy to modify particular entries based on their location or based on their value. Let's say this would replace all zeros in the data frame with 1:
df[df==0]<-1
But how do you replace all zeros in your df by the number of the column they're in?
df[df==0] <- which(df==0, arr.ind = TRUE)[,2]
df[]<-lapply(1:ncol(df),function(i){
ifelse(df[,i]!=0,df[,i],i)
})
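As a quick check, both snippets above can be run side by side on a small made-up data frame (the contents and the names by_which and by_lapply are hypothetical). The lapply version is written with seq_along() and [[ ]], which should behave the same as the original:

```r
# Hypothetical toy data with a few zeros.
df <- data.frame(a = c(0, 7, 0), b = c(5, 0, 9), c = c(0, 0, 3))

# Approach 1: which(..., arr.ind = TRUE) lists (row, col) positions of the
# zeros in column-major order, the same order the assignment fills them in.
by_which <- df
by_which[by_which == 0] <- which(by_which == 0, arr.ind = TRUE)[, 2]

# Approach 2: rebuild each column, keeping non-zero values and writing the
# column index i where a zero was.
by_lapply <- df
by_lapply[] <- lapply(seq_along(df), function(i) {
  ifelse(df[[i]] != 0, df[[i]], i)
})
```

Here both results have the zeros in column a replaced by 1, those in b by 2, and those in c by 3.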
I have a data frame that looks like this
df <- data.frame(cbind(1:10, sample(c(1:5), 10, replace=TRUE)))
# in real case the columns could be more than two
# and the column name could be anything.
What I want to do is to remove all rows where the value of all its columns
is smaller than 5.
What's the way to do it?
df[!apply(df,1,function(x)all(x<5)),]
First of all, please stop using cbind to create data.frames. cbind builds a matrix first, so columns of mixed types get coerced to a single type. You will be sorry if you continue. R will punish you.
df[rowSums(df < 5) != length(df), ]
(The length() function returns the number of columns in a dataframe.)
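For illustration, here is a sketch on a made-up data frame (the values and the names keep_apply and keep_rowsums are hypothetical) that runs both the apply and the rowSums approach from the answers above and keeps every row that has at least one value of 5 or more:

```r
# Hypothetical toy data: rows 1 and 3 have all values below 5.
df <- data.frame(x = c(1, 6, 3, 9), y = c(2, 1, 4, 7))

# apply version: test each row, drop it when every value is below 5.
keep_apply <- df[!apply(df, 1, function(x) all(x < 5)), ]

# rowSums version: count the values below 5 per row and drop the row when
# that count equals the number of columns.
keep_rowsums <- df[rowSums(df < 5) != ncol(df), ]
```

Both versions keep rows 2 and 4 here. The rowSums form avoids a per-row function call, which tends to matter on larger data frames.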
Two questions about R:
1) If I have a data set with multiple columns and one of the columns is 'test_score', how can I exclude the rows with blank values (and/or non-numeric values) for that column when using pie(), hist(), or cor()?
2) If the dataset has a column named 'Teachers', how might I graph the column 'test_score' only for the rows where Teachers = Jones?
Creating separate vectors without the missing data:
dat.nomissing <- tenthgrade[!is.nan(Score),]
seems problematic as the two columns must remain paired.
I was thinking something such as:
hist(!is.nan(tenthgrade$Score)[tenthgrade$Teacher=='Jones'])
However, is.nan returns a vector of TRUE/FALSE values (as it should), so hist() receives the logical values rather than the scores.
Use subscripting. For example:
dat[!is.na(dat$test_score),]
hist(dat$test_score[dat$Teachers=='Jones'])
And a more complete example with artificial data:
# Create artificial dataset
dat <- data.frame('test_score'=rnorm(500), 'Teachers'=sample(c('Jones', 'Smith', 'Clark'), 500, replace=TRUE))
# Introduce some random missingness
dat$test_score[sample(1:500, 50)] <- NA
# Keep if test_score is valid
dat.nomissing <- dat[!is.na(dat$test_score),]
# Plot subset of data (Jones only, using the rows without missing scores)
hist(dat.nomissing$test_score[dat.nomissing$Teachers=='Jones'])