Getting the minimum of the rows in a data frame - r

I am working with a dataframe that has 65 variables in it. The first variable catalogs a person, and the next 64 variables indicate the geographic distance that person is from each of 64 locations. Using R, I would like to create a new variable that catalogs the shortest distance for each person to one of those 64 locations.
For example: if person X is 35, 50, 79, 100, 450...miles away from the locations, I would like the new variable to automatically assign them a 35, because this is the shortest distance.
Any help with this would be much appreciated. Thanks.

Or, using Justin's example:
df$shortest <- do.call(pmin,df[-1])
See also ?pmin and ?do.call, and note that you can drop the first variable in your data frame by using list indexing (so without using any comma at all; see also ?Extract).
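A small self-contained illustration (hypothetical data, not from the question) showing why this works: do.call() passes every distance column as a separate argument to pmin(), which returns the row-wise minimum.
dists <- data.frame(person = c("X", "Y"),
                    loc1 = c(35, 12), loc2 = c(50, 9), loc3 = c(79, 40))
# dists[-1] drops the person column; pmin() then takes the element-wise
# minimum across the remaining distance columns
dists$shortest <- do.call(pmin, dists[-1])
dists
#   person loc1 loc2 loc3 shortest
# 1      X   35   50   79       35
# 2      Y   12    9   40        9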

df <- data.frame(let=letters[1:25], d1=sample(1:25,25), d2=sample(1:25,25), d3=sample(1:25,25))
df$shortest <- apply(df[,2:4],1,min)
The second line applies the function min to each row and assigns it to the new column in my data.frame df. See ?apply for more explanation of what the second line is doing. Be careful to skip the first column, or any columns that aren't distances:
apply(df, 1, min) gives completely different answers, since it's finding the "min" of strings.
> min(2:10)
[1] 2
> min(as.character(2:10))
[1] "10"

I'd approach this with apply, but transform or another approach could work.
#fake data set
DF <- as.data.frame(matrix(sample(1:100, 100, replace = TRUE), 5, 20))
DF <- data.frame(ID=LETTERS[1:5], DF)
#solution
DF$newvar <- apply(DF[,-1], 1, min)

Related

Trouble taking a subset of one data frame inside of another data frame

I’m trying to do two things in the code below.
The first: I’m trying to create subsets of the “regular_season” dataset that are split by season, category, and question. I think that part is correctly done below. Within each subset, I would like to do the second thing.
The second: I have two datasets: one is called “regular_season” and the other is called “championships,” and both have a column called “correct.” In “regular_season,” the “correct” column is 600 entries long and in “championships,” the column is 24 entries long. Within the subsets of the “regular_season” we created above, I am trying to replace the “correct” column in “championships” with a random subset of the “correct” column in “regular_season” that is also 24 entries long. To do all of this, I have tried the code:
#First half of the question
regular_season$flag <- with(regular_season, season %in%
c('76', '77', '78', '79', '80', '81', '82', '83'))
rs_scq_user_ <- split(regular_season, regular_season[c('category', 'question',
'flag')], drop = TRUE)
champ_correct_user_ <- subset(championships, round < 3)
champ_correct_user_ <- subset(champ_correct_user_, correct %in% champ_correct_user_)
champ_correct_user_ <- subset(champ_correct_user_, id %in% champ_correct_user_)
output_champ_correct_user_ <- vector("list", length(champ_correct_user_))
for(i in champ_correct_user_) {
champ_correct_user_[i] <- subset([i], id %in% [i])
output_champ_correct_user_[[i]] <- champ_correct_user_[i]
}
#Second half of the question
regular_season$flag <- with(regular_season, season %in%
c('76', '77', '78', '79', '80', '81', '82', '83'))
rs_scq_user_ <- split(regular_season, regular_season[c('category', 'question',
'flag')], drop = TRUE)
rs_scq_correct_sample_user_ <- regular_season[sample(1:nrow(rs_scq_user_$correct), 24, replace = FALSE),]
output_rs_scq_correct_sample_user_ <- vector("list", length(rs_scq_correct_sample_user_))
for (i in rs_scq_user) {
rs_scq_correct_sample_user_[i] <- regular_season[sample(1:nrow([i]$correct), 24, replace = FALSE),]
print(rs_scq_correct_sample_user_[i])
output_rs_scq_correct_sample_user_[[i]] <- rs_scq_correct_sample_user_[i]
}
But this is not working. For each user, I would want the code to replace the “correct” column in “championships” with a random subset of the “correct” column in “regular_season” that is 24 entries long. I’m not sure how to do that last part (my code is only to take a random sample of the regular_season’s “correct” column, and I’m not even sure it does that). If anyone can see a way to get the code to work, can you please point out the solution below?
Also, correct me if I'm wrong, but putting "i" in double brackets in a for loop (like "output_rs_scq_correct_sample_user_[[i]] <- rs_scq_correct_sample_user_[i]") saves the result of each iteration, right?
EDIT:
For example, if "regular_season$correct" contained
12,
34,
3,
56,
32,
...
(595 more entries)
and "championships$correct" contained
1,
0,
0,
1,
0,
...
(19 more entries)
I'm trying to take a random sample of the entries in "regular_season$correct" and use them to replace the entries in "championships$correct." I'm hoping the output would look something like:
"championships$correct"
32,
79,
56,
98,
8,
...
(19 more entries)
I hope this helps! I'm sorry if it is still a bit unclear; I'm not great at explaining things and I find problems like this difficult to describe.
If we are trying to replace the 'correct' column of 'championships' with a sample from 'regular_season' 'correct' of the same length as the number of rows of 'championships', create an index of rows with sample:
i1 <- sample(nrow(regular_season), size = nrow(championships))
Extract the elements of 'correct' based on that index from 'regular_season' data
v1 <- regular_season$correct[i1]
and assign it to the 'championships' data 'correct'
championships$correct <- v1
This can be written in a single line as well
championships$correct <- with(regular_season,
    correct[sample(length(correct), size = nrow(championships))])
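If the replacement has to happen within each of the split subsets (the rs_scq_user_ list built above), the same idea can be wrapped in lapply. This is only a sketch under the assumption that every subset has at least nrow(championships) rows; otherwise add replace = TRUE to sample():
# one modified copy of 'championships' per subset of 'regular_season'
champ_by_subset <- lapply(rs_scq_user_, function(sub) {
  champ <- championships
  champ$correct <- sub$correct[sample(nrow(sub), size = nrow(championships))]
  champ
})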

Creating a multidimensional list to replace subsetting - Is it worth it?

Basic idea:
As said before: is it a good idea to substitute subsetting a data frame with a multidimensional list?
I have a function that needs to generate a subset from a quite big data frame close to 30 thousand times. Thus, creating a 4-dimensional list would give me instant access to each subset, without losing time generating it.
However, I don't know how R treats these objects, so I would like your opinion on it.
A more concrete example, if needed:
What I was trying to do is to use the imputation method of KNN. Basically, the algorithm says that the values flagged as outliers have to be replaced using the K (K is a number; it could be 1, 2, 3...) closest neighbors. The neighbors in this example are the rows with the same attributes in the first 4 columns, and the closest neighbors are the ones with the smallest difference in the fifth column. If it is not clear what I said, please still consider reading the code, because I found it hard to describe in words.
These are the objects:
#create a vector with random values
values <- floor(runif(5e7, 0, 50))
possible.outliers <- floor(runif(5e7, 0, 10000))
#use these values, mixed together, to create a data frame
df <- data.frame(sample(values), sample(values), sample(values),
                 sample(values), sample(values), sample(possible.outliers))
#all the values greater than 800 will be marked as outliers
df$isOutlier = df[,6] > 800
This is the function which will be used to replace the outliers
#with the generated data frame, do this function
#Parameter:
# *df: The entire data frame from the above
# *vector.row: The row that was marked as containing an outlier. The outlier will be replaced with the return value of this function
# *numberK: The number of neighbors to take into count.
# !Very Important: for the last column, the higher the
#                  difference between values, the less attractive
#                  the rows are for imputation.
foo <- function(df, vector.row, numberK){
#find the neighbors
subset = df[ vector.row[1] == df[,1] & vector.row[2] == df[,2] &
vector.row[3] == df[,3] & vector.row[4] == df[,4] , ]
#take the "distance" of each row, so we can find which are the
# closest neighbors
subset$distance = subset[,5] - vector.row[5]
#no need to implement this part:
"function that finds the closest neighbors from the distance on subset"
return (mean(ClosestNeighbors))
}
So, the function's runtime is quite long. For this reason, I am searching for alternatives, and I thought that, maybe, I could replace the subsetting with something like this:
list[[" Levels COl1 "]][[" Levels COl2 "]]
[[" Levels COl3 "]][[" Levels COl4 "]]
What this should give is instant access to the subset, instead of generating it every time inside the function.
Is this a reasonable idea? I'm a noob in R.
If you did not understand what I wrote, or would like something to be explained in more detail or in other words, please tell me, because I know it is not the most direct question.
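Not an answer on whether it pays off, but a minimal sketch of the precomputed-lookup idea described above: instead of a 4-level nested list, a flat list keyed on the pasted values of the first four columns gives the same one-step access. Whether this is worth it for 5e7 rows is exactly the memory/speed trade-off that would have to be tested; the names here are assumptions based on the df and foo() defined above.
# build every subset once, keyed by the combination of the first 4 columns
key <- paste(df[, 1], df[, 2], df[, 3], df[, 4], sep = "_")
subset.list <- split(df, key)

# inside foo(), the subset then becomes a single lookup instead of a scan:
# subset <- subset.list[[paste(vector.row[1:4], collapse = "_")]]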

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id  = c(1,2,3,4,5),
                       key = c(1,2,3,4,5),
                       num = c(1,1,1,1,1),
                       v4  = c(1,5,5,5,7),
                       v5  = c(1,5,5,5,7))
My real dataset is bigger and is a mix of mostly numerical variables with some character variables, and I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within the duplicates command, called check.all.
So far, my thoughts have been to use grepl and determine where "anything" is present
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resultant data frame, I ask for rowSums and cbind it to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure how to use it to handle the duplicates.
Simply put: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
#Order by the degree of completeness
Original <- Original[order(CompleteNess), ]
#Starting from the bottom, select the non-duplicated rows
#based on the first 3 columns
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
This does rearrange your original data frame so beware if there is additional processing later on.
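If you need to keep the original row order, one variation (an alternative to the two steps above, starting from the unmodified Original) is to work on an index vector instead of reordering the data frame itself:
o <- order(CompleteNess)                                    # ascending completeness
keep <- o[!duplicated(Original[o, 1:3], fromLast = TRUE)]   # original row numbers to keep
Original[sort(keep), ]                                      # kept rows, back in original order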
You can aggregate your data and select the row with max score:
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])

cbind a vector of different length to a dataframe

I have a dataframe consisting of two samples. Only one sample has answered a questionnaire about state anxiety.
For this case, I have calculated a vector for somatic state anxiety with the following function "rowSums":
som_lp <- rowSums(sample1[,c(1, 7, 8, 10 )+108], na.rm = TRUE)
Now I would like to add this vector to my existing dataframe "data", but the function "cbind" doesn't work here because of the different lengths (the data frame has 88 rows, som_lp has 59 entries).
data <- cbind(data, som_lp)
Can anyone help me and is there another option to calculate "som_lp" to avoid the different lengths?
We can use cbind.fill from rowr
library(rowr)
cbind.fill(data, som_lp, fill = NA)
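If you prefer to stay in base R, a sketch of the same idea is to pad som_lp with NAs up to the number of rows of data before binding. Note this only makes sense if the 59 rows of sample1 really correspond to the first 59 rows of data; otherwise merging on an ID column is safer.
length(som_lp) <- nrow(data)   # extends the vector to 88 entries, filling with NA
data <- cbind(data, som_lp)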

Combine several columns under same name

I am trying to get the mvr function in the R package pls to work. When having a look at the example dataset yarn, I realized that all 268 NIR columns are in fact treated as one column:
library(pls)
data(yarn)
head(yarn)
colnames(yarn)
I would need that to use the function with my data (so that a multivariate dataset is treated as one entity), but I have no idea how to achieve that. I tried
TT<-matrix(NA, 2, 3)
colnames(TT)<-rep("NIR", ncol(TT))
TT
colnames(TT)
You will notice that while all columns have the same heading, colnames(TT) shows a vector of length three, because each column is treated separately. What I would need is what can be found in yarn: the colname "NIR" occurs only once and applies to columns 1-268 alike.
Does anybody know how to do that?
You can just assign the matrix to a column of a data.frame
TT <- matrix(1:6, 2, 3 )
# assign to an existing dataframe
out <- data.frame(density = 1:nrow(TT))
out$NIR <- TT
str(out)
# assign to empty dataframe
out <- data.frame(matrix(integer(0), nrow = nrow(TT)))
out$NIR <- TT
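Once the matrix sits in a single column, a formula can refer to all of its columns by one name, which is how the yarn example is used. A hedged usage sketch with the first out object above (the one with the density column); the toy data are far too small to actually fit a model, so the mvr() call is only shown as the pattern:
str(out$NIR)    # one data-frame column holding a 2 x 3 matrix
# fit <- mvr(density ~ NIR, ncomp = 1, data = out)   # same pattern as mvr(density ~ NIR, data = yarn)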
